ABSTRACT
Variant calling is a critical step in the analysis of whole genome sequencing (WGS) data, yet it is prone to errors. We investigated the association of incorrect SNP calls with variant quality metrics and nucleotide context. Incorrect SNPs were defined in twenty Holstein-Friesian cows by comparing their SNP genotypes called from WGS data generated on the Illumina NovaSeq 6000 with their genotypes from the EuroGMD50K genotyping microarray. The data set was divided into a set of correct SNPs (666,333 SNPs) and a set of incorrect SNPs (4,557 SNPs). The training data set consisted of correct SNPs only, while the test data set contained a balanced mix of all the incorrectly called SNPs and correctly called SNPs. An autoencoder was constructed to identify systematically incorrect SNPs, which were marked as outliers by one-class support vector machine and isolation forest algorithms. The results showed that 59.53% (±0.39%) of the incorrect SNPs had systematic patterns, with the remainder being random errors. The frequent occurrence of the CGC trimer was due to mislabeling of a call for C. An incorrect call of T instead of A was associated with the presence of T at the neighboring downstream position. These errors may arise from the fluorescence patterns of nucleotide labelling.
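The train/test design described above (training on correct SNPs only, testing on a balanced mix of all incorrect SNPs and correct SNPs) can be illustrated with a minimal sketch. The function name `split_snps`, the SNP identifiers, and the fixed seed are hypothetical. The sketch also assumes that the correct SNPs placed in the test set are withheld from training, which the abstract does not state.

```python
import random


def split_snps(correct, incorrect, seed=0):
    """Split SNPs into a correct-only training set and a balanced test set.

    Hypothetical helper: returns (train, test), where train is a list of
    correct SNPs and test is a list of (snp, label) pairs with label 1 for
    an incorrect call and 0 for a correct call.
    """
    if len(correct) < len(incorrect):
        raise ValueError("need at least as many correct as incorrect SNPs")
    rng = random.Random(seed)
    # Hold out as many correct SNPs as there are incorrect ones,
    # so the test set is balanced between the two classes.
    held_out = set(rng.sample(range(len(correct)), len(incorrect)))
    # Assumption: held-out correct SNPs are excluded from training.
    train = [snp for i, snp in enumerate(correct) if i not in held_out]
    test = [(correct[i], 0) for i in sorted(held_out)]
    test += [(snp, 1) for snp in incorrect]
    rng.shuffle(test)
    return train, test


# Toy example with made-up identifiers (not the study's data).
correct = [f"correct_{i}" for i in range(1000)]
incorrect = [f"incorrect_{i}" for i in range(10)]
train, test = split_snps(correct, incorrect)
```

Because the training set contains only correct SNPs, any outlier detector fitted on it learns what a correct call looks like, and the balanced test set lets detection rates for the two classes be compared directly.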
Competing Interest Statement
The authors have declared no competing interest.