
PAGER: A novel genotype encoding strategy for modeling deviations from additivity in complex trait association studies | BioData Mining


PAGER achieves competitive performance and scalable efficiency in simulations

PAGER encodes genotypes up to 55 times faster than EDGE when leveraging GPU integration on a sample size of 50,000 (Fig. 1; File S2). Substantial speed increases are also observed at lower sample sizes and when utilizing a CPU. These gains indicate that genotype encoding by PAGER will not meaningfully burden large-scale single-locus analyses. Additionally, PAGER accurately describes the eight theoretical inheritance models used for SNP simulations (Fig. 2; File S1), underscoring its efficiency and versatility. Comparisons of heterozygote encodings reveal that both PAGER and EDGE generate heterozygote values close to theoretical levels for both phenotypes, even at a low MAF and high noise, the conditions that produce the most challenging SNPs (File S2). However, EDGE has difficulty describing heterotic inheritance models (heterosis, underdominant, and overdominant). This arises because EDGE uses two anchors (AA and aa) while PAGER uses only one (AA), which makes PAGER more flexible and allows it to describe any inheritance model accurately. EDGE’s lack of flexibility in this regard can make post-analysis interpretations of SNP inheritance models challenging. Despite this, EDGE’s power to detect significance in SNPs simulated with these heterotic models is highly comparable to that of PAGER (Fig. 3; Files S1 and S2). This suggests that, from a modeling perspective, the heterozygote values derived by EDGE for heterotic SNPs are still informative. Consequently, EDGE should still be effective in detecting significant associations in SNPs following heterotic inheritance patterns.

Although PAGER and EDGE do not achieve the highest levels of power, they compensate with their flexibility. Since both methods derive encoding values from a training set, applying these encodings to an external validation set results in some power loss. Nevertheless, EDGE and PAGER significantly outperform the additive encoding in SNPs simulated from non-additive inheritance models, with the greatest performance differences observed in recessive and heterotic models and at the highest noise level (PEN_DIFF = 0.1). This finding suggests that the additive model can underperform in real-world situations, potentially leading to type II errors when SNPs follow alternative inheritance models.

While the respective inherent models outperform EDGE and PAGER, especially in scenarios with lower sample sizes and MAFs, employing multiple inherent models in a GWAS or QTL analysis will significantly increase the burden of multiple testing (as observed in our real-world experiment). Therefore, it is more prudent to use EDGE or PAGER to flexibly and dynamically assign each SNP an encoding that closely matches the actual inheritance model at that locus. This approach reduces the burden of multiple testing while maintaining reasonable power. Since we observe that EDGE and PAGER power losses diminish as sample sizes increase, we recommend using training and validation splits in large-scale studies. However, for studies with inherently lower power due to sample size limitations, applying a more stringent significance threshold statistically acknowledges this constraint while still penalizing EDGE and PAGER. While we acknowledge that selecting between penalization strategies is not ideal, our suggestion offers a pragmatic approach for leveraging EDGE and PAGER in practical applications and for balancing available power against the potential for overfitting.
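The training and validation workflow recommended above can be sketched as follows. This is a minimal illustration on simulated data, not the study's pipeline: the simulated recessive SNP, the 50/50 split, the helper names `derive_encoding` and `pearson`, and the simplified class-mean encoding are all assumptions made for the example.

```python
import math
import random
import statistics

random.seed(0)
CLASSES = ("AA", "Aa", "aa")

# Hypothetical simulated SNP with a recessive effect: only aa shifts the phenotype.
samples = []
for _ in range(600):
    g = random.choice(CLASSES)
    samples.append((g, random.gauss(1.0 if g == "aa" else 0.0, 1.0)))

random.shuffle(samples)
train, valid = samples[:300], samples[300:]

def derive_encoding(data):
    """Class means relative to AA, min-max scaled (an assumed, simplified scheme)."""
    means = {g: statistics.mean([y for gg, y in data if gg == g]) for g in CLASSES}
    diffs = {g: means[g] - means["AA"] for g in CLASSES}
    lo, hi = min(diffs.values()), max(diffs.values())
    return {g: (d - lo) / (hi - lo) for g, d in diffs.items()}

def pearson(x, y):
    """Plain Pearson correlation between two equal-length lists."""
    mx, my = statistics.mean(x), statistics.mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# The encoding is learned from the training half only, then frozen.
encoding = derive_encoding(train)

# The validation half is encoded with the frozen values and tested independently.
x_valid = [encoding[g] for g, _ in valid]
y_valid = [y for _, y in valid]
r_valid = pearson(x_valid, y_valid)
```

Because the validation phenotypes never inform the encoding, the association measured on `valid` is not inflated by the encoding step, at the cost of using only part of the sample for testing.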

PAGER’s performance on these benchmarking metrics shows that the approach does not incur the significant costs or penalties, such as time loss, often associated with other genotype encodings. PAGER distinguishes itself through its streamlined mathematical approach and flexibility, which facilitate its application to a broad range of systems and phenotypes. Most encoding strategies are tailored to specific species, models, and/or phenotype categories (such as binary or continuous traits). Additionally, it is common for studies to implement a limited selection of inherent inheritance models tailored to particular phenotypes of interest [9, 10, 13]. In contrast, PAGER’s architecture supports its use in any genetic framework (including polyploid systems) and with any phenotype (binary or continuous), and it can be implemented for both univariate analyses and investigations of epistasis by simple extension of the algorithm. Moreover, PAGER’s ability to incorporate and describe any theoretical inheritance model (Fig. 2; File S1) significantly simplifies the analytical process by removing the need to apply multiple encodings and, thus, the associated multiple testing burden. Finally, PAGER’s enhanced processing speed significantly diminishes the computational costs associated with genotype encoding. Through these advantages, PAGER emerges as a highly versatile and efficient tool for simplifying and expediting the encoding process for a diverse spectrum of genetic investigations.

PAGER and EDGE reveal a biologically relevant and novel putative QTL

In their 2019 review on the benefits and limitations of GWAS, Tam et al. use an iceberg metaphor to contrast current knowledge with potential future discoveries by GWAS [16]. The tip of the iceberg, visible above water, symbolizes our existing understanding of GWAS, including the reliance on the additive inheritance model. The larger submerged portion represents the untapped future potential of GWAS, including the implementation of alternative inheritance models. This exploratory study builds on that premise and demonstrates that three alternative encoding strategies – recessive, EDGE, and PAGER – identify a novel putative QTL (QTL 3) that additive did not in two phenotypes (EDGE in only one phenotype – BMI_TAIL).

LD intervals around putative SNPs for QTL 3 do not overlap with any QTL found in the original GWAS study [30, 31]. Thus, QTL 3 is novel for this population of rats and these phenotypes. Gene models in the LD interval of QTL 3 (chr18.26640423 – chr18.27355039) show enrichment in 150 specific GO terms (five MF, 143 BP, and two CC) and 16 KEGG pathways (File S3). KEGG pathways are related to Wnt signaling and certain diseases including cancers (including colorectal cancer and gastric cancer), Alzheimer’s disease, and Cushing syndrome (File S3). GO terms are associated with primarily two genes: Wnt8a and Apc.

Wnt8a participates in the Wnt signaling pathway and is thus likely involved in roles including cell fate determination, cell migration, cell polarity, neuron differentiation, and organogenesis [46]. Indeed, many of the enriched biological functions for QTL 3 are associated with these roles (File S3). Apc, just upstream of Wnt8a, is an APC (adenomatous polyposis coli) regulator of the Wnt signaling pathway. In addition to being involved in growth, development, and cell differentiation, Apc is a tumor suppressor gene linked to certain cancers, including colorectal and brain cancers [47, 48]. In humans and mice, obesity and obesity-driven inflammation, precursors to colorectal cancers, are linked to overactivation of the Wnt signaling pathway [49, 50]. In turn, Apc negatively regulates increased Wnt signaling [49, 50, 51]. Our results could indicate that mutations in Wnt8a and Apc are associated with some of the variation in BMI and body weight we observe in this population of rats. Additional studies are required to validate this claim. Nevertheless, EDGE and PAGER effectively highlight a novel genetic locus, and areas for scientific investigation, that were not identified in the initial GWAS.

Encoding methods differ in peak QTL marker positions and significance

Variation in the base pair positions of peak markers of the same QTL is observed depending on the encoding method employed (Table 2; File S3). Because of this, some encoder-specific LD intervals do not share the same start and end points, yet they overlap at least at one end. This likely occurs because each encoding method implicates (or prefers) peak markers that conform better to the inheritance pattern(s) it models. These preferences result in fluctuations in the signal of each encoding’s peak marker and explain the varying significance levels observed across encoding methods, including dominant and recessive (Table 2; File S3).

PAGER can model each SNP to describe the inheritance pattern observed from average phenotypes more accurately than EDGE (primarily in heterotic models) and other inheritance encodings (Table 1; File S2), making it more sensitive to variation. Thus, PAGER can accurately subsume all theoretical inheritance models. In theory, it follows that PAGER should achieve the highest significance level for every QTL. However, it is likely that additive encoding inflates the signal of some SNPs that do not conform to strict additivity [16, 52], even those exhibiting moderate to high deviations from additivity. This is likely true for dominant and recessive encodings as well. Indeed, according to PAGER heterozygote values, no SNP at a peak marker is observed to follow a purely additive model of inheritance in which the heterozygote is completely intermediate (i.e., 0.5; File S3). Despite this, the additive encoding achieves higher significance than any other model in three QTL.

An explanation for why significance values differ and reach higher levels in different models may involve how EDGE and PAGER dynamically derive encoding values. Modeling SNPs closer to their true inheritance pattern likely results in QTL signals closer to their ‘true’ significance level and chromosomal position as EDGE and PAGER account for and quantify deviations from additivity on an SNP-by-SNP basis. Although it could be argued that this is a type of overfitting, higher significance values observed in other models point to an alternative explanation. It is also possible that additive, recessive, and dominant models, which are applied uniformly to all SNPs, may introduce some error, either increasing or decreasing main effect signals and potentially resulting in type I and type II errors in marginally significant SNPs [16, 52]. Another way to frame this is that uniform inheritance models enforce their own inherent biases across the entire dataset, introducing error.

An alternative explanation for the varying levels of significance across models is that PAGER and/or EDGE interact with LD structures differently compared to uniform models. This interaction may increase or decrease the significance of nearby loci by effectively shortening or lengthening the virtual LD windows. While our steps for controlling proximal contamination and ensuring QTL independence address some of these issues, there could be additional aspects to explore. Yet another alternative, exclusively regarding PAGER, is the potential violation of the assumption of normality in phenotypic distributions for each genotypic class. However, this is unlikely, as MAFs are greater than 0.20 in every QTL where an encoding other than PAGER achieves a higher signal (Table 2; File S3). Although this does not guarantee normality, it implies that adequate sample sizes exist for each genotypic class in these QTL and hence do not limit PAGER’s efficacy. Despite these differences in significance levels, PAGER and additive encodings share the same peak marker in 80% (12/15) of instances where both encodings identify the same QTL (Table 2; File S3). In contrast, additive and EDGE share peak markers in only 53.3% (8/15) of instances, and PAGER and EDGE in only 56.25% (9/16) (Table 2; File S3). Thus, in most circumstances, additive and PAGER encodings highlight the same peak marker underlying these complex phenotypes and, in that way, are more comparable. This implies that PAGER may generalize better than EDGE to complex traits across phenotypes and systems. However, additional experimentation and biological validation are required to bolster this claim.

It is unclear why additive detects some SNPs with notable deviations from additivity while not detecting QTL 3 on chromosome 18 (Table 2; File S3). For example, additive encoding detects QTL 8 and 13, both of which have large deviations from additivity, according to PAGER values and respective detections from the dominant encoding (QTL 8; File S3). It may be that these SNPs still contribute significantly to the additive genetic variation of the traits (BW and RetroFat), which allows the additive model to detect them [14]. In addition to EDGE and PAGER, QTL 3 is also identified by the recessive encoding GWAS (Fig. 5; File S3), reinforcing its potential as a true positive, as the PAGER values for this locus point to a recessive model of inheritance. However, this QTL is not detected by the additive encoding. Interestingly, our simulation experiments reveal that the additive model’s power to detect recessive-simulated SNPs is significantly lower than its power to detect dominant-simulated SNPs (Files S1 and S2). The representations of the recessive and dominant inheritance models in File S1 provide insight into this observation. The slopes and linear relationships of the additive and dominant models are better aligned than those of the additive and recessive models. This suggests that the dominant model (and the dominant encoding) more closely resembles the additive model than the recessive model does. Indeed, both QTL 8 and 13 follow highly dominant inheritance patterns, according to PAGER encoding values. This discrepancy between the additive and recessive models makes detecting recessive SNPs more challenging for the additive encoding and explains why additive fails to identify QTL 3 in the GWAS. Additional validation experiments are required to elucidate QTL 3’s role, if any, in obesity and metabolism in this system. However, it is important to note that we provide evidence that sole use of the additive model in GWAS significantly reduces the power to detect putative QTL following a recessive model of inheritance.
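One way to see why the additive coding tracks the dominant coding more closely than the recessive one is to compute the correlations between the codings directly. The sketch below is illustrative only and is not taken from File S1: it assumes Hardy-Weinberg genotype frequencies, that `a` is the minor allele with frequency `q`, and the conventional codings shown; all function names are hypothetical.

```python
import math

def genotype_freqs(q):
    """Hardy-Weinberg frequencies for AA, Aa, aa given minor allele frequency q."""
    p = 1.0 - q
    return [p * p, 2.0 * p * q, q * q]

def weighted_corr(x, y, w):
    """Pearson correlation of two codings, weighted by genotype frequency."""
    mx = sum(wi * xi for wi, xi in zip(w, x))
    my = sum(wi * yi for wi, yi in zip(w, y))
    cov = sum(wi * (xi - mx) * (yi - my) for wi, xi, yi in zip(w, x, y))
    vx = sum(wi * (xi - mx) ** 2 for wi, xi in zip(w, x))
    vy = sum(wi * (yi - my) ** 2 for wi, yi in zip(w, y))
    return cov / math.sqrt(vx * vy)

# Codings for AA, Aa, aa (assumed conventions for this example).
ADDITIVE = [0.0, 0.5, 1.0]
DOMINANT = [0.0, 1.0, 1.0]
RECESSIVE = [0.0, 0.0, 1.0]

def coding_correlations(q):
    """Return (additive vs dominant, additive vs recessive) correlations."""
    w = genotype_freqs(q)
    return (weighted_corr(ADDITIVE, DOMINANT, w),
            weighted_corr(ADDITIVE, RECESSIVE, w))

for q in (0.1, 0.2, 0.3, 0.5):
    r_dom, r_rec = coding_correlations(q)
    print(f"q={q:.1f}  additive~dominant r={r_dom:.3f}  additive~recessive r={r_rec:.3f}")
```

Under these assumptions, the additive coding correlates more strongly with the dominant coding than with the recessive coding whenever `q` is below 0.5, and the two correlations coincide at `q = 0.5`. Real data can depart from Hardy-Weinberg proportions, so this is a rough intuition rather than a result from the study.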

Notably, while the dominant model did not identify any unique QTLs beyond those detected by the additive model, it yielded more significant p-values for QTL 1 across all phenotypes (Table 2; File S3). This is likely due to the peak markers at this locus exhibiting strong deviations from additivity, according to PAGER values, that align more with subadditive models of inheritance (Aa range = 0.191–0.218; File S3). While QTL 1 exerts substantial main effects on the phenotypes tested in this rat population, the increased significance with the dominant model suggests it could detect this locus where the additive model might fail if the effects were more marginal. However, it must be noted that the dominant model highlights different peak markers than the additive model, which show greater deviations from additivity (File S3). This supports the notion that different encoding strategies may favor variants aligning more closely with their model assumptions. Interestingly, both dominant and additive encodings identify the same peak marker for QTL 7, yet the dominant model yields a more significant p-value despite this SNP exhibiting an inheritance pattern that is nearly additive (Aa = 0.445), according to PAGER values (File S3). The reason for this is unclear but highlights that uniform model assumptions may lead to the introduction of noise and error in single-locus analyses.

The inability of the additive encoding to detect QTL 3 in BMI_TAIL and BW raises concerns about its effectiveness in identifying QTLs following heterotic models of inheritance. Indeed, additive power values in SNPs simulated from heterotic models are low (Files S1 and S2). In this study, we employ three uniform encoding strategies in the tri-encoding GWAS, additive, dominant, and recessive, which successfully identify concordant QTLs with EDGE and PAGER. However, in other systems and phenotypes, SNPs exhibiting strong effects can follow heterotic patterns (e.g., heterosis, underdominance, and overdominance). This is especially true in economically significant domesticated plants [5, 7] and animals [53, 54]. It remains uncertain whether the additive encoding, or dominant and recessive, can adequately capture all, most, or any of these variants. Had such variants been present in our study, additional uniform encoding models might have been necessary for their detection, further increasing the multiple testing burden of the tri-encoding GWAS. This underscores the potential of using dynamic tools like PAGER and EDGE, which are designed to detect any significant SNP association, regardless of the inheritance model. Further experimentation using real-world data is needed to evaluate the effectiveness of the additive encoding in capturing significant heterotic signals and to further explore PAGER’s capabilities in this context.

Limitations of PAGER

Every model has assumptions, and therefore limitations. Indeed, the assumption underlying the additive inheritance model (and all uniform models) could be considered the most unrealistic when compared to EDGE and PAGER. It assumes that for every SNP, across all systems and phenotypes, heterozygotes exhibit traits that are precisely intermediate between those of homozygotes. PAGER, on the other hand, uses the relative differences between mean phenotype values of each genotypic class. This dynamic nature makes PAGER a powerful tool, but it can lose power to accurately describe inheritance models when phenotype distributions deviate from normality. This is an issue only in continuous phenotypes, as the mean phenotype per genotypic class is the proportion of cases in case/control studies. However, skewed phenotypic distributions can affect any encoding strategy [27, 55]. We expect that with large sample sizes and typical MAFs, significant deviations from normality will not be common. If this is not the case, we suggest transforming phenotypes or editing the PAGER formulae by replacing the mean phenotype value per genotypic class with the median and comparing performance between the two approaches. In some highly skewed distributions, the median may be more descriptive than the mean [56] and better capture the central tendency of the data. It is also important to prune SNPs with low MAFs from the data to further reduce instances of skewed phenotype distributions. Future work will focus on the impact of skewness on PAGER accuracy. Small sample sizes and low allele frequencies can also lower PAGER’s ability to detect true significance. Yet, as stated above concerning skewed phenotype distributions, these issues negatively impact all encoding strategies [27, 57]. Indeed, we observe how low sample sizes and MAFs negatively affect power in our simulated data experiments for all methods, including additive and inherent (Fig. 3; Files S1 and S2).
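The idea of deriving an encoding from per-class phenotype centers, and of swapping the mean for the median, can be sketched as follows. This is a minimal illustration rather than the published PAGER formulae, which this section does not restate: it assumes each genotype class is encoded as the difference between its central phenotype value and that of the AA anchor, followed by min-max scaling. The function name `class_center_encoding` and its `center` parameter are hypothetical.

```python
import statistics

def class_center_encoding(genotypes, phenotypes, center=statistics.mean):
    """Encode genotype classes from per-class phenotype centers (hypothetical helper).

    genotypes: iterable of "AA", "Aa", "aa"; phenotypes: matching numeric values.
    Assumption: differences from the AA anchor are min-max scaled to [0, 1].
    """
    by_class = {"AA": [], "Aa": [], "aa": []}
    for g, y in zip(genotypes, phenotypes):
        by_class[g].append(y)
    if any(len(values) == 0 for values in by_class.values()):
        raise ValueError("every genotype class needs at least one sample")

    # Central phenotype value per class, expressed relative to the AA anchor.
    centers = {g: center(values) for g, values in by_class.items()}
    diffs = {g: centers[g] - centers["AA"] for g in centers}

    lo, hi = min(diffs.values()), max(diffs.values())
    if hi == lo:
        # No phenotypic difference between classes: nothing to encode.
        return {g: 0.0 for g in diffs}
    return {g: (d - lo) / (hi - lo) for g, d in diffs.items()}

# Toy data (made up): the heterozygote sits close to AA, as in a recessive pattern.
genotypes = ["AA", "AA", "Aa", "Aa", "aa", "aa"]
phenotypes = [1.0, 1.2, 1.1, 1.3, 2.0, 2.2]

encoding = class_center_encoding(genotypes, phenotypes)
median_encoding = class_center_encoding(genotypes, phenotypes, center=statistics.median)
```

Comparing `encoding` with `median_encoding` on the same SNP is one simple way to check whether skewness in a genotypic class is moving the derived heterozygote value.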

Although PAGER does not generate statistical models and perform hypothesis tests like EDGE, it does use the phenotype to derive SNP encodings. Often termed ‘double-dipping,’ this can lead to significant overfitting [58]. Even though PAGER does achieve higher significance in most QTL compared to other encodings, these signals are not largely inflated, nor is the trend universal (i.e., other encodings are observed to achieve higher signal for some QTL). Despite these observations, PAGER, along with EDGE, should be penalized due to the costs associated with more accurately modeling each SNP’s inheritance pattern using the phenotype. For our application of PAGER to real-world data, we adjusted the significance threshold by halving the Bonferroni cutoff to make it comparable to the tri-encoding GWAS, which utilizes the same population of rats and divides the Bonferroni cutoff by three. As we have touched upon previously and demonstrated in our simulation experiments, in non-exploratory studies with large sample sizes, using training and validation splits can provide a viable and potentially more robust alternative. Regardless of the penalty selected, its implementation can prevent the detection of some variants with substantial main effects. Despite this, we demonstrate that PAGER not only captures the same genetic associations as multiple uniform inheritance models, including a novel putative QTL, but also achieves greater efficiency, as fewer tests and corresponding corrections are required than when multiple models are applied. While no approach to genotype encoding is flawless, PAGER is expected to perform efficiently in the vast majority of cases, especially within robust experimental designs featuring large sample sizes. When selecting phenotypes for PAGER encoding, researchers should choose traits that are directly relevant to their research questions and for which they have high-quality phenotype data. Ensuring adequate sample size and appropriate phenotype distribution (e.g., avoiding highly skewed data) will enhance the accuracy of the genotype encodings and the reliability of results.
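The threshold penalties described above amount to simple arithmetic on the Bonferroni cutoff. As a rough sketch, assuming a nominal alpha of 0.05 and a purely hypothetical count of 100,000 tested SNPs (neither figure is taken from the study), and using the hypothetical helper `penalized_threshold`:

```python
def penalized_threshold(alpha, n_tests, penalty):
    """Bonferroni cutoff (alpha / n_tests) divided by an extra penalty factor."""
    return alpha / (n_tests * penalty)

alpha = 0.05        # assumed nominal significance level
n_snps = 100_000    # hypothetical number of SNPs tested

bonferroni = penalized_threshold(alpha, n_snps, 1)    # single encoding, no penalty
pager_cutoff = penalized_threshold(alpha, n_snps, 2)  # halved cutoff, as applied to PAGER
tri_cutoff = penalized_threshold(alpha, n_snps, 3)    # tri-encoding GWAS divides by three

print(f"Bonferroni:   {bonferroni:.2e}")
print(f"PAGER:        {pager_cutoff:.2e}")
print(f"Tri-encoding: {tri_cutoff:.2e}")
```

With these assumed inputs, the halved cutoff is less stringent than the tri-encoding cutoff, which is the sense in which a single dynamic encoding requires fewer corrections than three uniform encodings run side by side.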


