Jointly benchmarking small and structural variant calls with vcfdist


Joint evaluations allow variant matches across size categories and increase measured performance

To understand the impact of jointly benchmarking small and structural variants on measured accuracy, we evaluated three whole genome sequencing (WGS) datasets with vcfdist using several different variant subsets from the truth and query VCFs. More information on these WGS datasets can be found in the “Methods” section and Additional File 1: Table S1. Figure 2 shows that compared to existing methodologies, which evaluate small variants (in purple) and structural variants (in red) separately, jointly evaluating all variants (in green) leads to lower measured error rates for each variant category.

In Fig. 2, the hifiasm-dipcall dataset uses alignment parameters identical to those used for the draft benchmark Q100-dipcall VCF. As a result, it sees the smallest improvements from a joint evaluation of small and structural variants: a 4.6% reduction in SNP errors, a 1.2% reduction in INDEL errors, and a 24.9% reduction in SV errors. The hifiasm-GIAB-TR VCF uses the same assembly as hifiasm-dipcall with very different alignment parameters and therefore sees the largest benefits from a joint evaluation: a 62.5% reduction in SNP errors, a 34.8% reduction in INDEL errors, and a 75.3% reduction in SV errors. The Q100-PAV VCF lies between these two extremes, with a 19.5% reduction in SNP errors, a 21.7% reduction in INDEL errors, and a 57.0% reduction in SV errors. A visualization of the alignment parameters used for each dataset is included in Additional File 1: Fig. S1.

These performance improvements originate from cases where multiple smaller variants are found to be nearly or exactly equivalent to one or several larger variants. Figure 3 shows an example where this occurs and a joint evaluation of small and structural variants improves measured performance. A similar example containing sequence information in VCF format is provided in Additional File 1: Fig. S2.
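To make this equivalence concrete, the following minimal Python sketch (our own illustration; the tandem-repeat reference, coordinates, and apply_deletions helper are hypothetical and not part of vcfdist) shows how deletions of different sizes can remove different reference bases yet yield identical haplotypes, a miniature analogue of the 145-base versus 47-, 1-, and 97-base deletions in Fig. 3:

```python
def apply_deletions(ref, deletions):
    """Apply non-overlapping deletions, given as (start, length) pairs in
    0-based reference coordinates, and return the resulting haplotype."""
    parts, prev = [], 0
    for start, length in sorted(deletions):
        parts.append(ref[prev:start])
        prev = start + length
    parts.append(ref[prev:])
    return "".join(parts)

# In a tandem repeat, one 4-base deletion (truth) and two 2-base
# deletions (query) delete different reference bases but produce the
# same haplotype, so only a joint evaluation can match them.
REF = "GCATATATATG"
truth_hap = apply_deletions(REF, [(2, 4)])            # one 4-base deletion
query_hap = apply_deletions(REF, [(4, 2), (8, 2)])    # two 2-base deletions
assert truth_hap == query_hap == "GCATATG"
```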

Fig. 3

An example from the Q100-PAV HG002 variant callset (chr1:3,287,250-3,287,700 on GRCh38) where using vcfdist to jointly evaluate small and structural variants improves measured performance. Single-nucleotide polymorphisms (SNPs) are marked with black crosses, and deletions are represented as red rectangles. A joint evaluation of all variants discovers that the truth and query haplotypes are identical, despite variant representation differences. The truth VCF contains a 145-base deletion, whereas the query VCF contains a 47-base deletion, a 1-base deletion, and a 97-base deletion. When variants are categorized into small and structural variants prior to evaluation, this equivalence cannot be determined, and the variants are classified as false positive (FP) and false negative (FN) variant calls instead of true positives (TP). Only by jointly evaluating the small and large deletions can vcfdist detect that such complex variants are in fact equivalent

Sophisticated variant comparison techniques result in better phasing evaluations

To understand joint phasing evaluation accuracy, we compare vcfdist to WhatsHap, a current standard for phasing evaluation [32, 33]. WhatsHap’s compare module performs one-to-one variant comparisons between truth and query VCFs to evaluate phasing correctness. For each heterozygous query variant, WhatsHap searches for an identical truth variant and notes whether that truth variant has the same or opposite phasing as the corresponding query variant. Within each phase block, WhatsHap then uses a simple dynamic programming algorithm to minimize the total number of flip errors (in which the phase of a single variant is mismatched) and switch errors (in which the phases of all following variants are mismatched) [32]. Although WhatsHap’s approach seems intuitively correct, it breaks down in repetitive regions of the genome, where differing variant representations can result in falsely reported flip errors. Table 1 shows that WhatsHap reports far more switch and flip errors than vcfdist on the exact same variant calls, particularly for the Q100-PAV and hifiasm-GIAB-TR datasets.
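The dynamic program described above can be sketched in a few lines of Python. This is a simplified model under our own assumptions (one binary observation per heterozygous variant, unit cost per switch or flip), not WhatsHap’s actual implementation:

```python
def min_phasing_errors(obs):
    """Minimize switch + flip errors within one phase block.

    obs[i] is 0 if query variant i matches the truth phasing and 1 if it
    is inverted. We choose a latent orientation s_i for each variant:
    s_i != s_{i-1} costs one switch error, obs[i] != s_i costs one flip.
    """
    dp = [0, 0]  # dp[s] = best cost so far, ending in orientation s
    for o in obs:
        dp = [min(dp[0], dp[1] + 1) + (o != 0),   # stay at / switch to s=0
              min(dp[1], dp[0] + 1) + (o != 1)]   # stay at / switch to s=1
    return min(dp)

# One isolated inverted variant is cheapest as a single flip error...
assert min_phasing_errors([0, 0, 1, 0, 0]) == 1
# ...while a run of inverted variants is cheapest as a single switch.
assert min_phasing_errors([0, 0, 1, 1, 1]) == 1
```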

Table 1 Comparison of WhatsHap compare and vcfdist phasing evaluations relative to the Q100-dipcall truth VCF on the whole-genome GIAB-Q100 BED

In contrast to WhatsHap, vcfdist performs a full alignment of all nearby truth and query variants (a “supercluster”) and is able to discover equivalencies in variant representations. As a result, vcfdist reports far fewer phasing errors. The Q100-PAV VCF contains the fewest switch and flip errors, likely because it was produced using the same verkko assembly as the draft benchmark Q100-dipcall VCF. For the hifiasm-dipcall and hifiasm-GIAB-TR VCFs, vcfdist reports nearly identical switch and flip error rates; we believe this is because both were produced using the same hifiasm scaffold [34]. In comparison, WhatsHap reports a much higher combined switch and flip error rate for the hifiasm-GIAB-TR VCF than for the hifiasm-dipcall VCF. We expect this is because the variant representation used by the hifiasm-GIAB-TR callset differs significantly from that used by the draft benchmark Q100-dipcall VCF, whereas the alignment parameters used by the hifiasm-dipcall VCF are identical to those of the benchmark (Additional File 1: Fig. S1).

In Additional File 1: Fig. S3, we present an extensive comparison of the switch and flip errors reported by WhatsHap and vcfdist. We find that 42.6% of the flip errors reported by WhatsHap are false positives, since the truth and query sequences match exactly when all neighboring variants are considered. An example is shown in Fig. 4, where WhatsHap reports a flip error within a complex variant even though both truth and query haplotypes match exactly. A further 49.6% of the flip errors reported by WhatsHap were not classified as flip errors by vcfdist due to insufficient evidence. Because these instances have no ground truth, we manually examined 16 random cases from each dataset (48 in total). Relative to this manual examination, classifying flip errors with WhatsHap resulted in 43 false positives, 4 true negatives, and 1 true positive, whereas classifying them with vcfdist resulted in 40 true negatives, 7 false positives, and 1 false negative. As can be seen in Table 1, these excess false positive flip error calls by WhatsHap artificially inflate the reported switch error rate as well, which can significantly impact tertiary analyses.

Fig. 4

a The variant call file (VCF) for an example WhatsHap false positive flip error call. Each VCF record shows the variant chromosome (CONTIG) and position (POS) in addition to the reference (REF) and alternate (ALT) alleles and their genotypes (GT), the benchmarking decision (BD), and benchmarking credit (BC). In isolation, the two-base deletions at position 32,653,659 (the red and blue variants) appear to be the same variant (because POS, REF, and ALT match) phased differently between the truth and query VCFs (i.e., a flip error). b The resulting haplotype sequences. When this supposed flip error is considered in the context of the surrounding variants, vcfdist is able to determine that the two sets of truth and query variant calls are equivalent because both truth and query haplotype sequences are exactly the same. Both truth haplotypes contain an extra T at position 32,653,666 and one fewer T at positions 32,653,658-9. As a result, it is clear that no flip error has occurred and the differences between the truth and query VCFs are due solely to differing variant representations

vcfdist enables highly accurate comparisons with reasonable runtime

Next, we compare vcfdist to the previously published tools vcfeval [15] and Truvari bench [16], designed for evaluating small and structural variants, respectively. We also benchmark the performance of Truvari’s refine module, a recently developed extension that realigns truth and query variants to one another using MAFFT [29], wavefront alignment (WFA) [28], or partial-order alignment (POA) [30] for more accurate benchmarking. Truvari refine achieves similar accuracy to vcfdist but changes the total counts of truth and query variants, making comparisons across different evaluation tools and pipelines difficult (see Table 3 for an example). No current version of Truvari evaluates SNP accuracy, since Truvari was designed for SV evaluation.

At the other end of the spectrum, vcfeval only evaluates variants smaller than 1000 bases. For this reason, we restrict the maximum variant size to 1 kb in Fig. 5. As variant length increases in Fig. 5, vcfeval reports an increasingly high error rate compared to vcfdist (90.6% higher for SNPs, 128% higher for INDELs, and 321% higher for SVs). This is because vcfeval requires truth and query variants to match exactly (which is less likely for larger variants), whereas vcfdist and Truvari do not. A more lenient matching heuristic leads to strictly fewer false positives and false negatives in Fig. 5. To avoid falsely inflating vcfdist’s performance, we set vcfdist’s credit threshold to 70% in order to match Truvari’s sequence similarity threshold of 70% as closely as possible. We additionally standardize the method of variant counting across all tools in Fig. 5 to be consistent with vcfdist, because otherwise differences in counting credit for partial allele matches would dominate the results (see Table 4).

Fig. 5

Comparison of vcfdist with prior works vcfeval and Truvari bench (and its unpublished refine variants) in terms of measured a false negative rate (FNR) and b false discovery rate (FDR) on the GIAB-Q100 BED, which contains benchmarking regions covering 90.3% of the human genome. Note that Truvari does not evaluate SNP accuracy, and that all results are plotted on a logarithmic scale

Figure 5 shows that vcfdist measures lower false negative and false discovery rates for all variant sizes across all three datasets when compared to the previously published tools Truvari bench and vcfeval. It is able to accomplish this by allowing inexact matches, evaluating groups of variants simultaneously, and allowing variant matches to occur across size categories. In comparison to Truvari refine, vcfdist achieves a similar improvement in benchmarking accuracy, but without modifying the variant representations during benchmarking and while evaluating SNPs in addition to INDELs and SVs. Despite vcfdist’s advantages, we find that Truvari refine currently scales better with variant size, since it uses more memory-efficient alignment algorithms. We plan to incorporate wavefront alignment into the next release of vcfdist, but for now the maximum recommended variant length is 10 kb. For more details on the advantages of WFA, see [28, 35]. Table 2 shows the runtimes of vcfdist, vcfeval, Truvari bench, and Truvari refine on our server; configuration details are provided in the “Methods” section.

Table 2 Runtime results for vcfdist, vcfeval, Truvari bench, and Truvari refine in (h:)mm:ss format

Both Truvari and vcfdist perform alignment-based evaluation, which allows detection of variant calls that are mostly but not exactly correct. By default, Truvari considers a variant to be a true positive when its sequence similarity (the number of matching bases divided by the total sequence length) exceeds 70%. We set vcfdist’s default credit threshold (the minimum percent reduction in edit distance when the reported variant is present) to 70% to match this; in practice, this is only slightly more stringent than Truvari’s criterion (see “Different thresholds” in Table 4). Such partial matching is crucial for identifying large structural variants, which are less likely to be called perfectly. In contrast, vcfeval finds matching subsets of truth and query variants that result in the exact same haplotype sequence. This computation would be less expensive if not for the fact that vcfeval does not assume the input VCFs are phased. Because there are \(2^{n}\) possible phasings for n heterozygous variants, vcfeval’s runtime depends more closely on the number and representation of variants than either Truvari’s or vcfdist’s. As a result, vcfeval has a wide range of runtimes. It is also important to note that when the number of nearby heterozygous variants is too large, vcfeval fails to compare these variants. This happened for 9712 variants (0.20%) on the hifiasm-dipcall VCF, for 21,886 variants (0.45%) on the Q100-PAV VCF, and for 136,073 variants (2.50%) on the hifiasm-GIAB-TR VCF.
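As a rough illustration of the two matching criteria, the sketch below derives a Truvari-style sequence similarity and a vcfdist-style credit score from plain edit distances. These are approximations of the definitions quoted above, applied to whole haplotype windows; the normalizations used internally by each tool may differ:

```python
def edit_distance(a, b):
    """Levenshtein distance via the standard dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def sequence_similarity(truth_hap, query_hap):
    """Truvari-style: matching bases over total sequence length."""
    total = max(len(truth_hap), len(query_hap))
    return (total - edit_distance(truth_hap, query_hap)) / total

def credit(ref, truth_hap, query_hap):
    """vcfdist-style: percent reduction in edit distance to the truth
    haplotype when the query variants are applied (assumes the truth
    haplotype differs from the reference)."""
    ed_without = edit_distance(ref, truth_hap)      # query variants absent
    ed_with = edit_distance(query_hap, truth_hap)   # query variants present
    return (ed_without - ed_with) / ed_without

# A 5-base deletion called as a 4-base deletion passes both thresholds.
REF, TRUTH, QUERY = "GGGAAAAACCC", "GGGCCC", "GGGACCC"
assert sequence_similarity(TRUTH, QUERY) > 0.7   # ~0.86
assert credit(REF, TRUTH, QUERY) >= 0.7          # 0.80
```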

The runtimes of vcfdist and Truvari, on the other hand, depend closely on the size of the sequences to be aligned. Truvari reduces total runtime by evaluating variants in two stages; only complex regions are passed on to Truvari’s refine module for more sophisticated evaluation. vcfdist segments contigs into independent superclusters using heuristics or a bidirectional wavefront algorithm (biWFA) [36], as shown in Additional File 1: Table S2.

An example of benchmarking interpretability for the HLA-DQB1 gene

The results of variant call benchmarking tools such as vcfeval, Truvari, and vcfdist are frequently used in downstream tertiary analyses. To compare the interpretability of these tools, we evaluated the chr6:32,664,600-32,664,899 region of the HLA-DQB1 gene, a region where differing variant representations are known to cause difficulties during analysis. The HLA-DQB1 gene is part of a family of genes called the human leukocyte antigen (HLA) complex and plays an essential role in the human immune system. Deleterious mutations in HLA-DQB1 are highly associated with common autoimmune diseases such as celiac disease [37] and multiple sclerosis (MS) [38].

We found through manual examination that all three query VCFs called both haplotype sequences in this region exactly correctly. A summary of these results is shown in Table 3, and the resulting sequences are included in Additional File 1: Fig. S4. Although all sequences were the same, there were significant differences in how this genetic variation was represented. Compared to the first Q100-dipcall haplotype, the Q100-PAV and hifiasm-GIAB-TR VCFs represent a 169-base insertion and a 168-base deletion as a 1-base insertion and 34 SNPs. Relative to the second Q100-dipcall haplotype, the hifiasm-GIAB-TR VCF represents a 1-base insertion and a 1-base deletion as 5 SNPs. These differences account for the wide range of query true positive variant counts in Table 3.
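A miniature analogue of this equivalence (using a hypothetical dinucleotide repeat rather than the actual HLA-DQB1 sequence) shows how a 1-base insertion plus a 1-base deletion shifts the intervening repeat bases by one position, which is exactly equivalent to substituting each of those bases:

```python
REF = "GACACACG"  # hypothetical repeat, 0-based coordinates

# Representation 1: insert C after position 0 and delete the base at
# position 5, shifting the intervening repeat bases right by one.
indel_hap = REF[:1] + "C" + REF[1:5] + REF[6:]

# Representation 2: the same haplotype written as 5 SNPs.
snps = {1: "C", 2: "A", 3: "C", 4: "A", 5: "C"}
snp_hap = "".join(snps.get(i, base) for i, base in enumerate(REF))

assert indel_hap == snp_hap == "GCACACCG"
```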

Table 3 Comparison of tools evaluating the chr6:32,664,600-32,664,899 region of the HLA-DQB1 gene

Note that because both truth sequences were called exactly correctly, there should be no false negative (FN) or false positive (FP) variant calls counted, and the number of truth VCF true positives (TP) should be consistent across all three datasets. vcfdist correctly counts all variants, as expected. vcfeval correctly counts the SVs but fails to evaluate a large number of the SNP and INDEL variants because there are too many heterozygous variants in close proximity; it excludes these variants from the analysis and proceeds with a warning that the region is too complex to be evaluated. Truvari bench does not evaluate the SNPs and fails to identify several of the INDELs and SVs. This failure occurs because it discards the SNPs prior to evaluation and therefore does not discover the two cases where numerous SNPs are equivalent to an insertion and a deletion. Truvari refine also does not evaluate SNPs. It is able to detect that all variant calls are correct, though it converts both the query and truth SVs to INDELs for the Q100-PAV and hifiasm-GIAB-TR datasets.

Validation of vcfdist

Lastly, we compare vcfdist to existing variant calling evaluation tools in order to verify its correctness. Following variant normalization (described in the “Methods” section), we organize all variants evaluated by vcfdist, vcfeval, and Truvari in Table 4. A total of 57,865 SNPs in this region are excluded from Table 4 because they were not evaluated by Truvari; we include these results in Additional File 1: Fig. S5a. An additional 723 INDELs and SVs occurring at the border of the GIAB-TR BED regions are excluded because they were only analyzed by some of the VCF comparison tools. Variants are then categorized based on the apparent reason for differences in evaluated correctness between the three tools, and counted. An example from each of the seven discovered categories is shown in Additional File 1: Fig. S6 and described in further detail below.

Table 4 A comparison of INDEL and SV variant calling evaluation by vcfdist, vcfeval, and Truvari, restricting to chr20 of the GIAB-TR tandem repeats BED

Firstly, all three tools handle allele matches differently, which accounts for the majority of the differences in Table 4. Truvari will match a query homozygous variant to a truth heterozygous variant and consider both to be true positives. vcfeval will perform the same match but consider the variants to be a false positive and a false negative. vcfdist will match the truth heterozygous variant to one haplotype of the query homozygous variant, consider both to be true positives, and then consider the second haplotype of the query homozygous variant to be a false positive. None of these methods is best; rather, each has strengths for certain applications. We encourage users to consider these counting differences in light of their downstream goals.
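The sketch below tallies how each tool would count a single truth heterozygous variant (genotype 0|1) matched by a query homozygous variant (1|1), following the descriptions above. These are illustrative tallies, not output from the tools:

```python
# TP/FP/FN counts (truth and query records combined) for one
# het-truth / hom-query allele match under each tool's policy.
ALLELE_MATCH_COUNTS = {
    "Truvari": {"TP": 2, "FP": 0, "FN": 0},  # both records are TPs
    "vcfeval": {"TP": 0, "FP": 1, "FN": 1},  # genotypes must match exactly
    "vcfdist": {"TP": 2, "FP": 1, "FN": 0},  # per haplotype: one query
                                             # allele matches, the extra
                                             # allele is a false positive
}
```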

The second most common area of disagreement between tools stems from the fact that they have different thresholds for considering variant calls to be true positives. For example, vcfeval requires variants to match exactly, whereas vcfdist requires variants to have a partial credit score above 0.7 and Truvari requires a sequence similarity above 0.7. For certain edge cases, such as when a three-base deletion is called as a four-base deletion, even Truvari and vcfdist may differ.

The next most common differences are intentional implementation differences between the tools. In particular, vcfdist refuses to split a complex variant into multiple variants and consider only a subset of those to be correct. Truvari, by default, only allows a variant to participate in one match (with the --pick single parameter), regardless of allele count. vcfeval is the only tool that does not enforce local phasing and allows flip errors of nearby variants to occur.

Lastly, there are rare cases where unintentional implementation differences lead to slightly differing results. Not all backtracking algorithms behave identically, leading to cases in which the tools disagree on which variant of a pair is the false positive. There are also differences in the handling of directly adjacent or overlapping variants: for example, only vcfeval allows two separate insertions at the exact same location on the same haplotype, and only Truvari evaluates spanning deletions (vcfeval and vcfdist do not).


