Scientific Papers

Genomic prediction based on selective linkage disequilibrium pruning of low-coverage whole-genome sequence variants in a pure Duroc population | Genetics Selection Evolution


Phenotypic distribution and population structure

Table 1 and Additional file 1: Fig. S1 present the basic statistical characteristics of the discovery, training, and validation populations. Probably due to artificial selection, compared to the other populations, the validation population displayed phenotypic progress. AGE and BF conformed to a normal distribution, as confirmed by the Kolmogorov–Smirnov test (P = 0.326 for AGE and P = 0.235 for BF), but not TTN (P < 2.20 × 10–16), since it showed a discontinuous distribution. We conducted a principal component analysis (PCA) using the LCS_LD0.90 SNP set to determine the population structure of all the populations and samples. Additional file 2: Fig. S2 shows the three populations with the mixed state in principal components 1 to 3 with no sub-group stratification. This revealed the effectiveness of using marker pre-selection and prior GWAS information that was identified in the discovery population.

Genetic parameters and architecture

Table 2 shows the results of the variance component estimates obtained through GBLUP and BayesR for each marker panel. Although there were large differences in the number of markers between the different panels, no significant difference was observed in the heritability estimated with either GBLUP or BayesR. However, the estimated heritability using the BayesR model was significantly higher than that using the GBLUP model for AGE and slightly higher for BF. For example, the estimated heritability using the CC 80k panel in the BayesR model was 0.17 for AGE and 0.36 for BF, higher than the respective estimates of 0.13 and 0.33 based on the GBLUP model. This may be attributed to the greater ability of the Bayes model than GBLUP to capture the QTL effect since only BF and AGE were subjected to artificial selection in this population.

Table 2 Variance components for different traits estimated on four marker panels, based on GBLUP and BayesR

Table 3 lists the number of SNPs that were selected under varying P-value thresholds, ranging from 0.0001 to 0.01. Additional file 3: Fig. S3 presents the Manhattan plot for AGE, BF, and TTN in the discovery population, non-discovery population, and all samples. Under a strict threshold condition (P < 0.0001), a significantly larger number of SNPs was obtained for BF and TTN than for AGE, with almost no significant SNPs detected for AGE. Although numerous significant SNPs were identified for TTN, most of them (64%) were concentrated on the known Sus scrofa 11.1 chromosome (SSC)7 QTL region. We further analyzed the genomic regions covered by significant (P < 0.0001) SNPs for each trait. Genomic regions of 50 kb on either side of each associated SNP were considered as QTL-covered regions; the 343 SNPs for AGE spanned a 7.10-Mb genomic region, the 12,624 SNPs for BF spanned a 21.30-Mb genomic region, and the SNP number for TTN i.e., 12,343 SNPs similar to that for BF in TTN spanned only a 13.87-Mb genomic region. These results suggest that the QTL distribution varies between traits.

Table 3 SNPs detected using different P-value thresholds in the discovery population, the non-discovery population, and in all 3549 samples

Subsequently, we estimated the proportion of SNPs in each distribution using the BayesR model. Overwhelmingly, most (more than 97%) SNPs had a variance of 0 for all traits and were estimated to be in the first distribution. TTN had a higher percentage of SNPs with a variance of 0 (98.17%) and a larger proportion of SNPs in the fourth distribution (0.01 * σ2g) than AGE and BF. Conversely, BF had more SNPs with a variance different from 0 and the highest proportion of SNPs in the second and third distributions yet the smallest number of SNPs in the fourth distribution (Table 4). These results confirmed that fewer QTL control TTN, while BF and AGE are influenced by a larger number of QTL with a small effect. Moreover, the detection of any definite QTL associated with AGE in this study proved challenging.

Table 4 Proportion of the number of SNPs and genetic variance explained in the four normal distributions modeled in BayesR

Genomic prediction based on different marker density panels using real data

We used two SNP chips and two sequencing-based methods (GBS and LCS) for GP in order to investigate the effects of marker density and SNP genotyping on prediction accuracy and bias. Marker density exhibited a several 100-fold increase from the SNP chip to LCS data (Table 2). However, an increase in marker density did not correspondingly improve prediction accuracy; in some cases, the accuracy even decreased to a certain extent when using the GBLUP model (Table 5). Apart from the LCS_LD1 set, the CC 80k panel was particularly effective for GP of AGE, demonstrating a 2.60% higher accuracy than the LCS panel. GBS outperformed the other panels in terms of accuracy for BF and TTN. Table 6 provides the prediction accuracy and bias for the three traits based on the BayesR model; the LCS panel was not included in this analysis due to the extensive computational resources necessary for BayesR with over 10 million SNPs. The LCS_LD1 set showed a significant advantage for AGE and BF in both GBLUP and BayesR models. In addition, the accuracy of BayesR was superior to that of GBLUP for all three traits across all marker panels, exhibiting a 5.4 to 7.3% increase for AGE, a 0.2 to 2.6% increase for BF, and a 1.7 to 2.5% increase for TTN. The bias due to using the LCS_LD1, CC 80k, and LCS panels in the GBLUP model, and that due to using the CC 80k, LCS_LD1, and CC 50k panels in the BayesR model, were relatively lower than those due to using other SNP panels for the three real traits. Compared to the two SNP chips, the use of unfiltered WGS data did not improve the prediction accuracy in the GBLUP model. However, prediction accuracy increased when SNPs in perfect LD were removed from the LCS panel for AGE and BF. This result indicated that WGS data does not offer accuracy and bias advantages over the SNP array or GBS when all markers are used.

Table 5 Prediction accuracy and bias in the validation population for three traits based on GBLUP
Table 6 Prediction accuracy and bias in the validation population for three traits based on BayesR

Prediction performance of the top 10% and bottom 10% SNPs in GWAS

SNPs in the LCS panel were ranked according to their GWAS P-values for the three real traits in the discovery population (from small to large). The top 10% and bottom 10% SNPs were selected for accuracy assessment in the validation population to explore the performance of phenotypically related and unrelated SNPs during GP. The prediction accuracy of the top 10% SNPs for each trait was significantly higher than that of the bottom 10% SNPs in both the GBLUP and BayesR models (Fig. 2). This indicates that the contribution of markers to the prediction differed throughout the genome and suggested that markers exhibiting a higher correlation with QTL had higher prediction power than the others. Both informative and noisy markers exist in WGS data, which may explain why the WGS data did not have a significant advantage in GP compared with SNP chips. This also suggested that pre-selecting SNPs from high-density WGS panels may improve the prediction accuracy.

Fig. 2
figure 2

Comparison of genomic prediction accuracy. Comparison of genomic prediction accuracy of the top 10%, LCS or LCS_LD1, and bottom 10% SNPs in the GBLUP and BayesR models in the validation population. AGE age to 100 kg live weight, BF back fat thickness, TTN total teat number, Top_10% the top 10% significant SNPs in the discovery GWAS, LCS low-coverage sequencing, Bottom_10% the bottom 10% significant SNPs in the discovery GWAS

Prediction performance of LDP and SLDP

In this study, the LCS panel was filtered using a sliding-window LDP approach based on r2 threshold gradients (see “Methods”); a lower r2 indicated that fewer markers remained. Additional file 4: Table S1 shows the remaining number of SNPs after filtering with different thresholds. The number of LCS SNPs decreased rapidly after LD filtering. For example, when r2 was > 0.99, the number of LCS SNPs decreased from 10.1 M to 361 K. The accuracy of the GBLUP and BayesR models for predicting the three real traits after LDP exhibited different patterns, as shown in Fig. 3. In the GBLUP model, with a decrease in r2, the accuracy of GP for AGE and BF initially increased and then decreased; by contrast, the accuracy of GP for TTN decreased with a lower r2 setting. In the BayesR model, the general trend was the same as that found in GBLUP. In contrast, for TTN, the accuracy of the BayesR model declined faster than that of GBLUP, with a decrease in r2.

Fig. 3
figure 3

Genomic prediction accuracy of the LCS panel after LDP. Genomic prediction accuracy of the LCS panel after LDP according to the r2 gradient (from 1 to 0.05) in the GBLUP and BayesR models in the training population. AGE age to 100 kg live weight, BF back fat thickness, TTN total teat number. The accuracy was obtained by five-fold cross-validation in the training population and defined as the Pearson correlation between GEBV and phenotypes

To consider GWAS prior information in GP, we used an improved LDP method termed SLDP. Figure 4 shows the prediction accuracy of the 180 combinations of parameters for the three real traits in the GBLUP and BayesR models used on the training population determined by five-fold cross-validation. The number of SNPs used in each scenario for AGE, BF, and TTN is shown in Additional files 5, 6, 7: Tables S2, S3, S4, respectively. For the GBLUP model, the accuracy of prediction for AGE and BF was more affected by LDP, whereas the accuracy for TN was significantly affected by the P-value. For GBLUP, the optimal parameter combinations for the GP of AGE was a P-value of 0.0001 and r2 of 0.30, of BF was a P-value of 0.0001 and r2 of 0.40, and of TN was a P-value of 0.01 and r2 of 0.30 (Fig. 4a, c, and e). A similar pattern was obtained with the BayesR model, and the optimal combination of the two models had the same P-value. For BayesR, the optimal parameter combination for the GP of AGE was a P-value of 0.0001 and r2 of 0.40, of BF was a P-value of 0.0001 and r2 of 0.50, and of TN was a P-value of 0.01 and r2 od 0.45 (Fig. 4b, d, and f). Notably, the optimal combination for BayesR tended to have a higher r2 threshold, suggesting that BayesR is more suitable for high-density marker panels than GBLUP. The parameter-optimized marker sets by LDP and SLDP (i.e., those with the highest prediction accuracies for each trait) were further evaluated in the validation population. The distribution of MAF changed after LDP, with the proportion of the lower and higher MAF SNPs increasing, especially when using a lower r2 threshold (see Additional file 8: Fig. S4), which is a concern. Based on LDP, SLDP increases the number of SNPs in some MAF intervals since it tends to select clusters of SNPs in the genome.

Fig. 4
figure 4

Heat map of genomic prediction accuracy for three real traits using SNPs after SLDP. Heat map of genomic prediction accuracy for three real traits using SNPs after SLDP according to the r2 (from 1 to 0.05) and P-value (from 0.0001 to 0.01) gradient in the training population. AGE, age to 100 kg live weight; BF, back fat thickness; TTN, total teat number. Panels a, c, and e show the results of the GBLUP model; panels b, d, and f show the results for the BayesR model. Each square represents a parameter combination and the accuracies are indicated with deeper colors corresponding to high (red) or low (blue) accuracy. Black dots mark the optimal combination of parameters for accuracy. The accuracy was obtained by five-fold cross-validation in the training population and defined as the Pearson correlation between GEBV and phenotypes

Figure 5 summarizes the accuracy of each SNP panel in the validation population. LDP had an advantage over the two SNP chips for AGE and BF, but the accuracy of LDP was relatively lower for TTN. In the GBLUP model, the accuracy of prediction based on SLDP for the three traits (AGE, BF, and TTN) increased by 2.23, 1.87, and 3.22% compared to the CC 50k chip, and increased by 1.19, 3.01, and 2.94% compared to the CC 80k chip, respectively. We also observed a similar improved accuracy for SLDP in the Bayesian model. Compared to LDP, SLDP showed an improvement of 1.04 and 3.07% in GBLUP and of 0.84 and 1.40% in BayesR for BF and TTN, respectively, but exhibited a slightly reduced accuracy for AGE. However, the optimal marker panel for bias was not robust, and SLDP did not have a significant advantage over the other panels. In summary, after marker selection, LCS data had the highest accuracy (compared to SNP chips and GBS) in most cases, indicating that using sequencing data has potential in GP.

Fig. 5
figure 5

Prediction accuracy and bias of SNP sets for real traits in GBLUP and BayesR models in the validation population. a and b Results of the GBLUP model. c and d results of the BayesR model. CC_50k commercial chip 50K, CC_80k commercial chip 80K, GBS genotyping-by-sequencing, LCS low-coverage sequencing, LCS_LD1 low-coverage sequencing set after removing completely linked disequilibrium SNPs, LDP linkage disequilibrium pruning and parameter (r2) optimized in the training population, SLDP selective linkage disequilibrium pruning and parameters (P-value and r2) optimized in the training population. The accuracy was defined as the Pearson correlation between GEBV and phenotypes

Prediction performance of SLDP with simulation data

To assess the performance of SLDP for evaluating different traits with a known genetic structure, we performed prediction using simulated data (see Methods). Figure 6 shows the prediction accuracy for SLDP with 180 combinations of parameters in the training population by GBLUP; the result using the BayesR model is shown in Additional file 9: Fig. S5. The traits that were controlled by fewer QTN tended to gain more accuracy improvement from GWAS prior information (with a lower LD r2). Figure 7 shows the prediction accuracy and bias for different SNP sets in the GBLUP and BayesR models under six simulation scenarios in the validation population. In general, the prediction accuracy of each marker panel was significantly improved with increasing simulated heritability. In the three simulation scenarios with fewer QTN, the accuracy of SLDP_QTN, where the real QTN location was used, was significantly higher than that of the other marker panels; for example, the accuracy of SLDP_QTN increased by 17.15, 13.89, and 15.92% relative to those of LCS in the GBLUP model and increased by 13.01, 6.28, and 3.39% relative to those of LCS_LD1 in the BayesR model at heritability levels of 0.15, 0.30, and 0.60, respectively. The improvement in accuracy with SLDP was clearly greater for the GBLUP model than for the BayesR model. For the scenarios using the infinitesimal model, SLDP_QTN showed only a 0 to 0.4% increase in accuracy relative to LCS in GBLUP and an increase of 0 to 0.8% in accuracy relative to LCS_LD1 in BayesR when the simulated number of QTN was 10,000.

Fig. 6
figure 6

Heat map of genomic prediction accuracy for simulated traits using SNPs after SLDP. Heat map of genomic prediction accuracy for simulated traits using SNPs after SLDP according to the r2 (from 1 to 0.05) and P-value (from 0.0001 to 0.01) gradient in the training population. The results of the GBLUP model are shown; each square represents a parameter combination and the accuracies are indicated with deeper colors corresponding to high (red) or low (blue) accuracy. Black dots mark the optimal combination of parameters for accuracy. The accuracy was obtained by five-fold cross-validation in the training population and defined as the Pearson correlation between GEBV and TBV. a QTN_100_h2_0.15 simulated traits with 100 QTN and a heritability of 0.15, b QTN_10000_h2_0.15 simulated traits with 10,000 QTN and a heritability of 0.15, c QTN_100_h2_0.30 simulated traits with 100 QTN and a heritability of .030, d QTN_10000_h2_0.30 simulated traits with 10,000 QTN and a heritability of 0.30, e QTN_100_h2_0.60 simulated traits with 100 QTN and a heritability of 0.60, f QTN_10000_h2_0.60 simulated traits with 10,000 QTN and a heritability of 0.60

Fig. 7
figure 7

Prediction accuracy and bias of SNP sets based on simulation data with the GBLUP and BayesR models. a and b Results of the GBLUP model. c and d Results of the BayesR model. CC_50k commercial chip 50 K, CC_80k commercial chip 80 K, GBS genotyping-by-sequencing, LCS low-coverage sequencing, LCS_LD1 low-coverage sequencing set after removing completely linked disequilibrium SNPs, SLDP_GWAS selective linkage disequilibrium pruning based on discovery GWAS prior and parameters (P-value and r2) optimized in the training population, SLDP_QTN selective linkage disequilibrium pruning based on real QTN. The accuracy was defined as the Pearson correlation between GEBV and TBV

In practice, the information on the genomic localization of all the real QTN cannot be accurately obtained; therefore, we performed SLDP based on the GWAS results where the parameters were optimized in the training population to match the real traits. For SLDP_GWAS, the accuracy also increased compared with that obtained with the other marker panels for the three traits controlled by fewer QTN. For example, the accuracy of SLDP_GWAS increased by 6.93, 7.70, and 11.47% relative to LCS in the GBLUP model and increased by 5.30, 2.45, and 1.23% relative to the LCS_LD1 set in the BayesR model at heritability levels of 0.15, 0.30, and 0.60, respectively. However, the accuracy of SLDP_GWAS was significantly lower than that of SLDP_QTN due to the presence of false positives and negatives obtained in the GWAS; that is, some noisy SNPs might have been detected, while some true QTN might have been missed. Regarding bias, both SLDP_GWAS and SLDP_QTN had no stable advantage for all traits.

The results of the simulated GWAS based on six simulated traits in the discovery population are shown in Fig. 8 and Additional file 10: Table S5. They show that only a small proportion of QTN could be detected by GWAS, especially in the low-heritability scenarios when only 1000 samples were used. For example, only 22 of the 100 QTN could be detected at a P-value threshold of 0.05 in scenario “QTN_100_h2_0.15” (see Additional file 10: Table S5), with a 98.10% estimated false-positive rate. This indicates that the detection power of GWAS is limited. In addition, the improvement in accuracy obtained with SLDP was lower for real traits than for the simulated traits, possibly due to their simpler genetic architecture because only additive effects are considered in the simulated data and the reliable prior information obtained by GWAS for real traits is more limited. This suggests that future studies in GP should integrate more clues, in addition to those obtained from GWAS.

Fig. 8
figure 8

Manhattan plot of genome-wide association analysis for six simulated traits in the discovery population. Simulated QTN and other SNPs are colored by red and gray points, respectively. The blue dotted line represents the P-value threshold 5 × 10–8, 10–4, 0.001, 0.01 from top to bottom. The number of QTN detected and the estimated false-positive rate for each P-value threshold are shown in Additional file 10: Table S5. a QTN_100_h2_0.15 simulated traits with 100 QTN and a heritability of 0.15, b QTN_10000_h2_0.15 simulated traits with 10,000 QTN and a heritability of 0.15, c QTN_100_h2_0.30 simulated traits with 100 QTN and a heritability of 0.30, d QTN_10000_h2_0.30 simulated traits with 10,000 QTN and a heritability of 0.30, e QTN_100_h2_0.60 simulated traits with 100 QTN and a heritability of 0.60, f QTN_10000_h2_0.60 simulated traits with 10,000 QTN and a heritability of 0.60. The result shows the difficulty in detecting most QTN by GWAS, especially for the low-heritability trait



Source link