Evaluating the accuracy of variant calling methods using the frequency of parent‐offspring genotype mismatch

Abstract The use of next‐generation sequencing (NGS) data sets has increased dramatically over the last decade, but there have been few systematic analyses quantifying the accuracy of the commonly used variant caller programs. Here we used a familial design consisting of diploid tissue from a single lodgepole pine (Pinus contorta) parent and the maternally derived haploid tissue from 106 full‐sibling offspring, where mismatches could only arise due to mutation or bioinformatic error. Given the rarity of mutation, we used the rate of mismatches between parent and offspring genotype calls to infer the single nucleotide polymorphism (SNP) genotyping error rates of FreeBayes, HaplotypeCaller, SAMtools, UnifiedGenotyper, and VarScan. With baseline filtering, HaplotypeCaller and UnifiedGenotyper yielded more SNPs and error rates higher by one to two orders of magnitude, whereas FreeBayes, SAMtools and VarScan yielded lower numbers of SNPs and more modest error rates. To facilitate comparison between variant callers, we standardized each SNP set to the same number of SNPs using additional filtering; under this standardization UnifiedGenotyper consistently produced the smallest proportion of genotype errors, followed by HaplotypeCaller, VarScan, SAMtools, and FreeBayes. Additionally, we found that error rates were minimized for SNPs called by more than one variant caller. Finally, we evaluated the performance of various commonly used filtering metrics on SNP calling. Our analysis provides a quantitative assessment of the accuracy of five widely used variant calling programs and offers valuable insights into both the choice of variant caller program and the choice of filtering metrics, especially for researchers using non‐model study systems.


| INTRODUCTION
The decreasing cost and ease of producing short-read next-generation sequencing (NGS) data sets has transformed our understanding of organismal diversity across the tree of life. Next-generation sequencing offers new opportunities to empirically test both basic and applied hypotheses relating to the molecular ecology and evolutionary genetics within and among populations. However, transforming abundant raw sequence data into biologically meaningful genetic data is nontrivial. Genotyping errors can be introduced at several steps of NGS data analysis, which by extension may induce biases in subsequent inference. This becomes particularly problematic in non-model organisms with large and complex genomes and fragmented genome assemblies. While SNP errors may be less of an issue for window-based approaches like genome scans, where errors add noise to estimates of statistics of interest (e.g., mean FST), they are probably more problematic for analyses that depend on estimating the site frequency spectrum (SFS; e.g., Lapierre et al., 2017) or the number of singleton alleles (Field et al., 2016). Whatever the application, it is clear that understanding the performance and sources of biases and errors from such tools is critical for the success of any NGS-based project.
Various open-source programs have been developed to identify genomic variants, such as single nucleotide polymorphisms (SNPs) or insertions and deletions (indels), from short read data.
However, it is often difficult to know a priori how well a given tool will perform given the genomic resources available for any particular study organism, or how the study design will interact with the underlying assumptions of such tools, which are often benchmarked with model organism data.
These studies often conclude that there are substantial differences in precision and sensitivity across tools, which depend in large part on aspects of the data such as sample size and coverage, as well as the genomic resources available (Cornish & Guda, 2015; Sandmann et al., 2019; but see Bian et al., 2018). As genome-scale data sets are now common for non-model organisms with limited or fragmented reference genomes, it is necessary to expand upon these studies to understand and establish best practices for systems where genomic resources are limited.
Conifers are non-model organisms with genomes recalcitrant to chromosome-scale assembly, especially under budgetary constraints.
For instance, conifers often have exceptionally large genomes (20-40 Gbp; Neale et al., 2017) with histories of whole-genome duplication (Zheng et al., 2015), gene family expansion (Scott et al., 2020; De La Torre et al., 2014), transposable element dynamics (Scott et al., 2020; Wang et al., 2020; Yi et al., 2018), and extensive repeat regions (Wegrzyn et al., 2014). These complexities present a major challenge for NGS data analysis and downstream hypothesis testing in conifers (Lind et al., 2022; Shu & Moran, 2020). Such challenges can be alleviated by quantifying the accuracy of SNP calling when using the above-mentioned variant calling tools. High-quality databases exist for model organisms to calibrate existing programs and account for biases in the data, yet such resources remain elusive for many non-model organisms. A unique biological attribute of conifer reproduction, however, offers us the opportunity to assess the genotyping accuracy of these variant calling tools and to do so in a non-model system. Fertilized conifer seeds contain a megagametophyte, a cluster of haploid tissue surrounding the embryo and supplying it with nutrients. This megagametophyte haploid tissue contains only the maternal contribution of the offspring DNA and has been used to advantage in linkage mapping (Bernhardsson et al., 2019; Neves et al., 2014; Pavy et al., 2017), for example, as the phase is already resolved. By leveraging this information for a set of related individuals (i.e., full-sibling megagametophyte tissues and the maternal tree), we can gain insight into the accuracy of genotype calls by quantifying concordance between the haploid and parental genotypes.
Given that the offspring haploid genotype at any given locus could only have been inherited from a single source, if we assume no de novo mutation, then any discordance between the haploid offspring genotype calls and the maternal genotype call must have come as a result of genotyping error.
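This inference logic can be sketched as a small helper (a hypothetical illustration, not the pipeline code used in the study): any haploid offspring call that matches neither of the maternal alleles is scored as a genotyping error, since under Mendelian inheritance with no de novo mutation it could only have come from the parent.

```python
def genotype_mismatches(parent_gt, offspring_alleles):
    """Count haploid offspring calls incompatible with the diploid parent.

    parent_gt: the parent's two alleles at a site, e.g. ("A", "G").
    offspring_alleles: haploid calls at the same site (None = missing).
    Returns (number of mismatching calls, number of non-missing calls).
    """
    called = [a for a in offspring_alleles if a is not None]
    # An offspring allele absent from the parental genotype implies an
    # error somewhere: in the parent call or in the offspring call.
    mismatches = sum(1 for a in called if a not in parent_gt)
    return mismatches, len(called)

# Toy example: heterozygous parent A/G and eight haploid offspring calls,
# one of which ("T") is incompatible with the parent.
mm, n = genotype_mismatches(("A", "G"), ["A", "G", "A", "T", None, "G", "A", "A"])
rate = mm / n
```

The by-genotype mismatch rate for a whole SNP set is then just these counts summed over all sites before dividing.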
Here, we used a familial design consisting of diploid tissue from a single lodgepole pine (Pinus contorta) parent and the maternally derived megagametophyte haploid tissue from 106 full-sibling offspring to evaluate similarities and differences among SNP sets generated from FreeBayes, HaplotypeCaller, SAMtools, UnifiedGenotyper, and VarScan. We used the rate of mismatches between parent and offspring genotype calls to infer the genotype error rate of each variant caller, given the rarity of mutation within one generation.

| Sample preparation and DNA extraction
Our sample data came from a Pinus contorta linkage mapping population consisting of two parental trees located in Bulkley Valley, British Columbia, Canada and the 106 F1 seeds from a single outcross, provided to us by the University of British Columbia (https://coadaptree.forestry.ubc.ca). Needles were collected from the parental trees and stored at -20°C prior to DNA isolation. The F1 seeds were hydrated at room temperature and the tissue layers separated under a microscope using sterile technique. After the seed coat was completely removed, the megagametophyte tissue was carefully separated from the embryo using a surgical blade and tweezers. DNA was then extracted from the needles of the parental trees or from the megagametophyte tissue of the F1 seeds using the NucleoSpin Plant II Mini kit (Macherey-Nagel GmbH & Co.), following modifications recommended in García and Escribano-Ávila (2016).

| Probe design and sequence capture
Due to the substantial size and complexity of conifer genomes, exome sequence capture is the preferred genotyping method for the taxon (Lind et al., 2022). We designed our capture probes based on previous Pinus contorta exome capture probes (Suren et al., 2016) and removed all probes which failed to effectively capture the target sequences. In addition, we included probes corresponding to previously described pathogen response genes, filtering out those which targeted exon sequences of less than 100 base pairs (bp). We then submitted our probe sequences to Roche NimbleGen for Custom SeqCapEZ probe design. We achieved a final capture space of ~44 Mbp.

| Sequence read processing, mapping, and SNP calling
We used fastp v0.19.5 to trim sample reads, removing adapters and polyG tails, and applying a 5 bp sliding window that removed windows with mean quality below 30.
We then used the following variant caller programs: FreeBayes, HaplotypeCaller, SAMtools, UnifiedGenotyper, and VarScan, to call SNPs using the same BAM (or mpileup) files as input. After calling SNPs we performed an initial baseline level of filtering, first using criteria specific to each caller based on commonly used practices, and second filtering each data set by a common set of filtering criteria (Table 1). The common filtering thresholds we used for each caller required that sites were called in both the parent and the F1 sample, did not have greater than 50% missingness, and were not multiallelic. We used both VCFtools v0.1.14 (Danecek et al., 2011) and R v4.0.5 (R Core Team, 2021) to achieve the above-described filtering.
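The common filters can be sketched per site as follows (a hypothetical illustration of the criteria in Table 1; the actual filtering was performed with VCFtools and R, and the data layout here is invented for clarity):

```python
def passes_common_filters(parent_call, offspring_calls, alleles):
    """Apply the common baseline filters to a single site.

    parent_call: the parent's genotype call (None = missing).
    offspring_calls: list of F1 genotype calls (None = missing).
    alleles: set of alleles observed at the site.
    """
    if parent_call is None:
        return False                              # must be called in the parent
    if all(c is None for c in offspring_calls):
        return False                              # and in the F1 sample
    missing = sum(c is None for c in offspring_calls)
    if missing / len(offspring_calls) > 0.5:
        return False                              # no more than 50% missingness
    if len(alleles) > 2:
        return False                              # drop multiallelic sites
    return True

# A biallelic site called in the parent with 25% offspring missingness passes.
keep = passes_common_filters(("A", "G"), ["A", None, "G", "A"], {"A", "G"})
```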
TABLE 1 Baseline filtering criteria. Set of filtering criteria unique to each variant caller program and set of common filtering criteria used across all programs. Criteria describe the sites removed.

FreeBayes
- Sites with quality (QUAL) < 30
- Genotype calls with depth (DP) < 5
- Genotype calls with genotype quality (GQ) < 20

HaplotypeCaller
- Sites with Fisher strand (FS) > 60
- Sites with mapping quality (MQ) < 40
- Sites with mapping quality rank sum test (MQRankSum) < −12.5
- Sites with quality (QUAL) < 30
- Sites with quality by depth (QD) < 2.0
- Sites with read position rank sum test (ReadPosRankSum) < −8.0
- Sites with strand odds ratio (SOR) > 3.0
- Genotype calls with genotype quality (GQ) < 20

SAMtools
- Sites with quality (QUAL) < 20
- Genotype calls with depth (DP) < 5

UnifiedGenotyper
- Sites with Fisher strand (FS) > 60
- Sites with mapping quality (MQ) < 40
- Sites with mapping quality rank sum test (MQRankSum) < −12.5
- Sites with quality (QUAL) < 30
- Sites with quality by depth (QD) < 2.0
- Sites with read position rank sum test (ReadPosRankSum) < −8.0
- Sites with strand odds ratio (SOR) > 3.0

VarScan
- Genotype calls with depth (DP) < 10
- Genotype calls with genotype quality (GQ) < 20
- Heterozygote genotype calls

Common filters
- Sites not called in both parent and offspring
- Sites with > 50% missingness
- Multiallelic sites

| Program-specific details for SNP calling and filtering

| FreeBayes
We used FreeBayes v1.3.1-17-gaa2ace8 to call SNPs separately in both the offspring samples and the parental sample to account for the ploidy difference. The resulting VCF files were merged using VCFtools. Sites with quality score below 30, and genotype calls with genotype quality below 20 and/or depth below 5, were filtered out before applying the set of common filters described above (Table 1).

| HaplotypeCaller
We used GATK v4.1 HaplotypeCaller in -ERC mode to generate 107 individual sample g.vcf files for both parent and offspring. Individual g.vcf files were combined in batches of 20 using CombineGVCFs into a single multiple-sample g.vcf comprising the full mixed-ploidy cohort. GenotypeGVCFs was used to call SNPs on the full cohort. Both the combining and genotyping commands are able to handle samples of mixed ploidy. We used the standard GATK hard-filter expression (https://gatk.broadinstitute.org/hc/en-us/articles/360035532412?id=11097) as our initial filtering step, as P. contorta lacks a benchmark set of high-quality SNPs for calibration and therefore we could not use variant quality score recalibration (VQSR). After the hard-filter expression we filtered out sites with quality score below 30 and genotype calls with genotype quality below 20, and then applied the common set of filters (Table 1).
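The batching step can be sketched as follows (a toy illustration; the file names and helper are hypothetical, and the actual combining of each batch is done by CombineGVCFs):

```python
def batches(paths, size=20):
    """Split a list of per-sample g.vcf paths into fixed-size batches,
    with the final batch holding whatever remains."""
    return [paths[i:i + size] for i in range(0, len(paths), size)]

# 107 samples (parent + 106 offspring) split into batches of 20.
gvcfs = [f"sample_{i:03d}.g.vcf" for i in range(107)]
groups = batches(gvcfs)
```

With 107 files this yields five full batches of 20 and a final batch of 7, matching the cohort size described above.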

| SAMtools
We used the bcftools v1.9 utility from SAMtools to call SNPs on the offspring and parent samples as a cohort. Filtering was performed by removing sites with quality score below 20 and genotype calls with depth below 5, and then applying the set of common filters (Table 1). The -G flag was set to the default (i.e., all samples were treated as one population).

| UnifiedGenotyper
We used GATK v3.8 UnifiedGenotyper to call SNPs separately in both the offspring samples and the parental sample. As with HaplotypeCaller, because VQSR was not applicable, we filtered using the hard-filter expression (https://gatk.broadinstitute.org/hc/en-us/articles/360035532412?id=11097), further filtered sites with quality score below 30, and then filtered using the common set of filters (Table 1).

| VarScan
We used VarScan v2.4.4 to call SNPs on the offspring and parental data separately, using a p-value of 0.05. We then merged the resulting VCF files using VCFtools and filtered out all genotype calls with genotype quality below 20 and/or depth below 10. Additionally, as VarScan does not have a ploidy setting (i.e., all genotype calls are diploid) and our offspring sample data were haploid, we removed all heterozygote genotype calls. Finally, we applied the common set of filters (Table 1).
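The heterozygote-removal step can be sketched as follows (hypothetical data layout; each genotype is represented as a 2-tuple of alleles, with None for missing). Because the megagametophyte tissue is haploid, any "heterozygous" diploid call from VarScan is necessarily an artefact and is set to missing:

```python
def drop_offspring_hets(calls):
    """Set heterozygous diploid calls from haploid tissue to missing.
    calls: list of 2-tuples of alleles, or None for already-missing calls."""
    return [None if c is not None and c[0] != c[1] else c for c in calls]

# The A/G call cannot be real in haploid tissue and is dropped.
cleaned = drop_offspring_hets([("A", "A"), ("A", "G"), None, ("T", "T")])
```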

| Comparing variant caller programs
Variant callers differed considerably in the number of sites they called using the baseline filtering (e.g., UnifiedGenotyper yielded ~3 M SNPs vs. ~100 k for VarScan; Table 2). We found that the mismatch rates between parent and offspring genotype calls roughly scaled with the number of SNPs called (Table 2; Figure S1), and in order to compare mismatch rates between different callers we standardized their output by adjusting the filtering criteria so that each caller yielded a similar number of SNPs (Tables S1-S5). This additional filtering step was conducted in an iterative manner to assess the effect of varying stringency on mismatch rates, using the QUAL, DP, and GQ metrics (but only DP and GQ for VarScan) and varying each metric according to its empirical distribution. For example, to get approximately x SNPs with a given caller, we would increase the QUAL, DP, and GQ filtering criteria from the yth to the zth percentile of each of their respective distributions. We calculated two different mismatch rates: a by-genotype rate, the proportion of mismatched parent-offspring genotype calls out of the total number of genotypes called, and a by-site rate, the proportion of sites with at least one mismatched genotype call out of the total number of sites.
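The percentile-based tightening can be sketched as follows (a toy illustration; the site dictionaries and the nearest-rank percentile helper are assumptions, not the study's actual code). Raising the percentile raises every metric's cutoff simultaneously, shrinking the SNP set toward a target size:

```python
def nearest_rank_percentile(values, p):
    """Value at the p-th percentile (0-100) of a list, nearest-rank style."""
    s = sorted(values)
    return s[round(p / 100 * (len(s) - 1))]

def tighten_filters(sites, p, metrics=("QUAL", "DP", "GQ")):
    """Keep sites whose QUAL, DP, and GQ all meet the p-th percentile of
    their respective empirical distributions; each site is a dict."""
    cut = {m: nearest_rank_percentile([s[m] for s in sites], p) for m in metrics}
    return [s for s in sites if all(s[m] >= cut[m] for m in metrics)]

# Ten toy sites with metric values 1..10: higher p keeps fewer sites.
toy = [{"QUAL": q, "DP": q, "GQ": q} for q in range(1, 11)]
```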

| Base filter
After applying the base level filters to each caller's SNP set, the two GATK callers, UnifiedGenotyper and HaplotypeCaller, resulted in the greatest number of SNPs called and the highest mismatch rates by site and by genotype (Table 2). While SAMtools, FreeBayes, and VarScan called an order of magnitude fewer SNPs than the GATK callers, they also resulted in mismatch rates 1-2 orders of magnitude lower (Table 2). The strong correlation between the number of SNPs a program called after base filtering and its mismatch rate (R² = 99.4%; Figure S1) led us to apply our additional incremental filtering method to better facilitate the comparison among variant callers.
Despite the different variant callers generating SNP sets orders of magnitude different in size, the distributions of parent-offspring genotype mismatches across the sites called were very similar among programs (i.e., heavily right-skewed; Figure S2). UnifiedGenotyper and VarScan, however, produced an excess of sites with a 50% genotype mismatch rate, suggesting a higher rate of genotyping error in the parent than seen with the other variant callers (Figure S2).

| Sites called
After applying the additional incremental filtering and reducing each SNP set down to approximately 4 × 10^5 and 1 × 10^5 sites called, we found substantial overlap among programs in both the total sites called (Figures S3 and S5) and the sites shared with zero genotype mismatches (Figure 1a, S7A). Notably, however, the specific sites each program called that had genotyping mismatches were largely unique to each caller (Figure 1b, S7B).

| Mismatch rates
With the additional incremental filtering, the majority of callers improved dramatically in both by-site and by-genotype mismatch rates (Figure 2; see also Figure S8; Table S6).
UnifiedGenotyper, HaplotypeCaller, and SAMtools all showed striking improvements in mismatch rates with additional incremental filtering until about the 10^5.5 SNP mark, where mismatch rates for all three callers approached relative plateaus. FreeBayes, on the other hand, showed continued improvements in mismatch rates with increased filtering and did not reach a point of diminishing returns.
Finally, while VarScan did see some limited improvement in mismatch rates with increased filtering, this change was less dramatic than in the other callers. However, note that because VarScan called relatively few SNPs, minimal additional filtering could be applied.
After additional incremental filtering to a specific number of sites called, UnifiedGenotyper consistently had the lowest genotype mismatch rate, followed by HaplotypeCaller, VarScan, SAMtools, and FreeBayes (Figure 2a). The number of SNPs called (i.e., the degree of filtering applied) did not appreciably change how the five callers ranked in either mismatch rate metric (Figure 2).

| Comparing within callers
To assess the effects of individual quality metrics on mismatch rates, we explored the effect of varying each metric on the number of genotypes called and the by-genotype mismatch rates for the baseline filter SNP set from each variant caller program (Table 1).
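A threshold sweep of this kind can be sketched as follows (hypothetical data layout; each genotype carries its metric value and a flag marking whether it mismatched the parent). For each candidate threshold, we record how many genotypes survive and the by-genotype mismatch rate among the survivors:

```python
def sweep_metric(genotypes, metric, thresholds):
    """For each threshold, return (threshold, genotypes retained,
    by-genotype mismatch rate among the retained genotypes)."""
    results = []
    for t in thresholds:
        kept = [g for g in genotypes if g[metric] >= t]
        rate = (sum(g["mismatch"] for g in kept) / len(kept)) if kept else 0.0
        results.append((t, len(kept), rate))
    return results

# Toy set where the low-GQ genotypes are the mismatching ones, so
# raising the GQ cutoff trades genotypes called for a lower error rate.
toy = [
    {"GQ": 10, "mismatch": True},
    {"GQ": 15, "mismatch": True},
    {"GQ": 30, "mismatch": False},
    {"GQ": 50, "mismatch": False},
]
curve = sweep_metric(toy, "GQ", [0, 20, 40])
```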

| FreeBayes
Increasing stringency in either DP or GQ resulted in monotonically decreasing genotype mismatch rates in the FreeBayes SNP set (Figures 3a, S9-S10). Similarly, increasing QUAL over lower values decreased the mismatch rate; however, mismatch rates did plateau over very high QUAL values (Figure S11). All three filtering metrics performed well on our data set and would be useful metrics for filtering SNPs with increased stringency.

| HaplotypeCaller
Filtering by any of QUAL, DP, GQ, or MQ produced monotonic decreases in genotype mismatch rates with the HaplotypeCaller SNP set (Figure 3b, S12-S14, S16). These filtering metrics would be useful for researchers unable to use VQSR and requiring further increased filtering stringency. The metrics FS, MQRankSum, QD, ReadPosRankSum, and SOR did not appreciably improve mismatch rates with further filtering, and in the case of QD slightly increased mismatch rates in our data (Figures S15, S17-S20).

| SAMtools
Filtering by either QUAL or GQ resulted in monotonic decreases in genotype mismatch rates in the SAMtools SNP set (Figures S21-S22).
Increasing DP did improve mismatch rates over lower DP thresholds, although mismatch rates plateaued over higher values (Figure 3c, S23). As with the other callers, these three filtering metrics performed well and would be useful for filtering SNPs with further stringency; however, there may be diminishing returns when filtering by DP with SAMtools.

| UnifiedGenotyper
Filtering by any of QUAL, DP, or GQ resulted in decreased genotype mismatch rates with increasing stringency in the UnifiedGenotyper SNP set (Figure 3d, S24-S26).

| VarScan
Filtering by either DP or GQ produced appreciable improvements in genotype mismatch rates in the VarScan SNP set (Figures 3e and S33-S34). While increasing stringency in DP resulted in quite a sharp decrease in mismatch rates over lower DP values, mismatch rates did plateau or increase over higher values (Figures 3e and S33), somewhat similar to SAMtools (Figures 3c and S23). Rather than suggesting that increasing DP increases error rates in VarScan, however, it is very likely that our mismatch rates are biased over higher values of DP as a consequence of a few erroneously called sites in the parent. For example, at a filtering DP of 80, two sites are responsible for the entirety of the genotype mismatches remaining, one with a parent-offspring mismatch rate of 50% and one with a mismatch rate of 100% (i.e., probably one base miscalled in the parent and both bases miscalled in the parent, respectively). Both DP and GQ appear to be useful for increased filtering with VarScan; however, there could be diminishing returns when filtering by DP.

| DISCUSSION
The reliability of SNP genotypes identified from high-throughput sequencing and the choice of variant caller has been a topic of debate for over a decade (Hwang et al., 2015; Poland & Rife, 2012). Here, we leveraged diploid and haploid sequence data from a single P. contorta parent and 106 full-sibling offspring to compare SNP genotyping across five popular variant calling tools: FreeBayes, HaplotypeCaller, SAMtools, UnifiedGenotyper, and VarScan. We used the proportion of mismatches between parent and offspring genotype calls to infer the genotype error rates of each variant caller, given the rarity of mutation within one generation. Our comparison found large differences in the SNPs called, and we evaluated the impact of various filtering metrics on SNP quality and quantity.
After applying an initial, base level of filtering (Table 1) with each program, we found a large disparity in the number of SNPs called and the error rates between callers (Table 2). As might be expected, we found a strong correlation between the number of SNPs called and the by-genotype error rates (R² = 99.4%; Figure S1), which led us to apply our additional incremental filtering to compare callers.
For our specific data set, we show that UnifiedGenotyper consistently had the lowest error rates at all degrees of additional filtering stringency (Figure 2). Not only did UnifiedGenotyper have the lowest overall error rates after additional filtering, but it also resulted in the most SNPs called in total, offering ample opportunity to prioritize either the number of SNPs called or the error rates. However, note that UnifiedGenotyper is no longer supported by GATK. After UnifiedGenotyper, both HaplotypeCaller and VarScan performed well in terms of error rates (Figure 2). VarScan, however, produced quite a low number of SNPs despite its initial filtering being relatively lenient (Table 1). After additional filtering, FreeBayes and SAMtools resulted in the highest by-genotype error rate and by-site error rate, respectively (Figure 2).

FIGURE 2 Comparison of mismatch rates by genotype (a) and by site (b) between variant callers at variable numbers of sites called.
Mismatch rates were calculated as the proportion of mismatched parent-offspring genotype calls out of the total number of genotypes called (a) and as the proportion of sites with at least one mismatched parent-offspring genotype call out of the total number of sites (b). Inset plots show a magnification of the mismatch rates over 50 to 100 k sites called. Variation in the number of sites called for a particular variant caller was generated by additional incremental filtering to different degrees with depth, genotype quality, and quality score where applicable.

HaplotypeCaller and UnifiedGenotyper also required extensive additional filtering (Figure S8) beyond baseline to achieve acceptable error rates (Figure 2). The huge number of SNPs these two callers produce, the extent of the filtering required, and the large number of different filter parameters involved may call for a more experienced user and probably more tinkering to curate a suitable SNP set, especially in non-model organisms where VQSR is not an option.
Other variant callers, such as FreeBayes, SAMtools, and VarScan, achieve decidedly lower error rates when comparing baseline filter SNP sets (Table 2), and probably require much less tinkering and effort to produce an acceptable SNP set. However, these three variant callers may lack the customization and flexibility of the GATK callers, and because they call many fewer SNPs may be at risk of missing some important sites.
When we compared the specific sites produced with each variant caller, we found a large degree of overlap among most programs (3/5) in both the total sites a program called (Figures S3 and S5), as well as in the error-free sites a program called (Figures 1a and S7A). The sites where each caller made genotyping errors, however, were largely specific to the caller used (Figures 1b and S7B), suggesting that the processes by which each caller makes errors differ mechanistically. Taken together, the high degree of overlap among callers in correctly called sites and the low degree of overlap in erroneously called sites (Figures 1 and S7) suggests that the practice of calling sites with multiple different variant caller tools and using only the SNPs common to all tools may be a highly effective method to improve accuracy. However, while this may reduce error rates, taking the intersection across multiple tools will result in a smaller number of SNPs called.
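Taking the intersection across callers can be sketched as follows (a toy illustration; the caller names are real but the coordinates are invented). Sites are keyed by (contig, position), and only coordinates reported by every caller are retained:

```python
def consensus_sites(callsets):
    """Intersect the site coordinates called by several variant callers.
    callsets: dict mapping caller name -> set of (contig, position) tuples."""
    return set.intersection(*callsets.values())

calls = {
    "FreeBayes":       {("scaf1", 101), ("scaf1", 250), ("scaf2", 77)},
    "SAMtools":        {("scaf1", 101), ("scaf2", 77), ("scaf2", 90)},
    "HaplotypeCaller": {("scaf1", 101), ("scaf1", 250), ("scaf2", 77), ("scaf3", 5)},
}
shared = consensus_sites(calls)
```

The trade-off noted above is visible even in this toy example: each caller reports three or four sites, but only two survive the three-way intersection.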
Across all callers, we found that filtering by QUAL, DP, or GQ gave excellent results in terms of reducing by-genotype error rates in our data set (Figure 3; see also the Supporting Information). However, several caveats apply to our analysis. First, our data came from a linkage mapping family rather than a natural population, and as such our results may differ from similar analyses performed on natural populations. For example, a variant caller that is predicated on sample allele frequencies being in Hardy-Weinberg equilibrium will have biased error rates when used on a linkage mapping population where allele frequencies are expected to be 0, 0.5, or 1.
As such, it would be valuable to repeat our comparison of variant callers on natural populations in the future. Second, because a P. contorta reference genome does not yet exist, we aligned our reads against the best available, but highly fragmented, congeneric loblolly pine (P. taeda) reference genome. Any differentiation or lack of synteny between P. contorta and P. taeda, as well as any assembly errors in the loblolly pine reference genome, may have influenced our results. The option to use a reference genome from the correct study species, the quality of that genome, and whether or not the study species is a model organism will all probably have important effects on results. Finally, we chose to include VarScan in our analysis as it is currently a popular variant calling tool; however, VarScan can only call diploid genotypes. Despite these ploidy limitations, our results show that VarScan is still very accurate given haploid input data, with the lowest error rates after baseline filtering out of the variant callers we tested, and comparable error rates after additional incremental filtering. However, because these are diploid genotype calls from haploid input data, they still need to be interpreted with some caution. It could prove insightful to repeat a similar analysis in the future, taking into account the genotypes of the pollen donor and diploid offspring to assess variant caller performance given diploid input data.

ACKNOWLEDGEMENTS
We thank Sally Aitken's laboratory at UBC for providing the P.