Scientific Papers

Next-Gen GWAS: full 2D epistatic interaction maps retrieve part of missing heritability and improve phenotypic prediction | Genome Biology

The NGG model and its fast resolution

To attempt to make full epistatic maps a reality, we decided to use a different mathematical formalism combined with solving systems meant to take advantage of Graphic Processing Units (GPUs) being increasingly popular thanks to the rise of gaming and deep learning [17]. Our solution, named NGG for Next-Generation GWAS, is based on the massive use of modern acceleration architecture (GPU, see Fig. 1 and Additional file 1: Text 1, Material for details) enabled by the use of recent mathematics techniques (Compressed Sensing) for regularized least square estimation in a sparse linear model paradigm, that we rewrote in a new way, called “model compression”. This achieves linear scaling with the number of SNP and interactions, instead of an exponential complexity (Fig. 1). The outcome is a sparse estimate collecting the effects of each variant and each SNP interaction, instead of retrieving p-values as regular GWAS. As such, the NGG algorithm can be seen as a sparse signal detection analysis and does not use multiple statistical testing which precludes the use of FDR correction. This classical correction is replaced here by a drastic procedure for variable selection and extensive simulation and testing.

Hereby, we established the NGG model that first states and defines heritability in this framework as done before by Zuk et al. [18] and others: We define $$X$$ the matrix with n rows and p columns containing the genetic information (Fig. 1). Each column displays the coded genetic variants (here SNP) for the n individuals. We also define $$Y$$ a vector containing the phenotype. The broad-sense heritability $$H$$ may be defined via the following nonparametric “random signal plus noise” model: $$Y=f(X)+\varepsilon$$ (NP: stands for non-parametric.) where the function $$f$$ is unknown and general and $$\varepsilon$$ is a random noise, independent from $$X$$ that collects all other effects (other than genetic) on the phenotype $$Y$$, such as environmental effects for instance. Thus, the broad-sense heritability is expressed as $$H=var(f(X))/var(Y)$$. The narrow-sense heritability $$h$$ also sometimes called additive heritability accounts for part of the variance explained by genetics in the linear model $$Y=X{\theta }+\varepsilon$$ (L: stands for linear). The definition is $$h=var(X{\theta })/var(Y)$$. We note that, of course, $$model(L)\subset model(NP)$$. Notice for further use that since the slope parameter $${\theta }$$ is unknown, the narrow sense heritability cannot be computed but only estimated (for example, via a plug-in estimator $$\widehat{h}=var(X\widehat{\theta })/\widehat{var}(Y))$$. At last when the estimation method is Ordinary Least Squares (OLS) or some of its (regularized/penalized) variants, the definition above matches the classical $${R}^{2}$$ and adjusted $${R}^{2}$$. Below the adjusted $${R}^{2}$$ is preferred for reasons related to both the dimensionality of the data (usually p is much larger than n) and the well-known inflation of $${R}^{2}$$.

We further consider two models:

Where $$Z=X\star X$$ is the partial face-splitting (or transposed Khatri Rao product) of matrices [19]. For self-containedness notice that when:

$$X=\left[\begin{array}{ccc}{a}_{1}& {a}_{2}& {a}_{3}\\ {b}_{1}& {b}_{2}& {b}_{3}\end{array}\right] then \; X*X=\left[\begin{array}{ccc}{a}_{1}{a}_{2}& {a}_{1}{a}_{3}& {a}_{2}{a}_{3}\\ {b}_{1}{b}_{2}& {b}_{1}{b}_{3}& {b}_{2}{b}_{3}\end{array}\right]$$

Matrix $$Z$$ contains all the pairwise Kronecker products of columns of $$X$$, excluding the products of a column with itself. This matrix $$Z$$ will be referred to as the matrix of interactions or shortly 2D matrix as performed before by others [20]. When $$X$$ has p columns, $$Z$$ has p(p-1)/2 columns. The matrix Z captures all interactions between the SNP’s. Although Model 2 remains linear, it is not additive anymore and bridges between Model L (or Model 1) and Model NP.

Algorithmic validation of NGG on simulated data

Now that the model has been established, we need to evaluate its performance in retrieving epistatic signals. For this, we first worked on simulated data (see methods and repository for details). The first simulation has been performed in two steps.

First, we simulated $$X$$ and $$Y$$ (Fig. 2A and C); second, we simulated $$Y$$ for real $$X$$ (SNP matrix) retrieved from the Arabidopsis related 1001 genome project [21] (Fig. 2B and D; Additional file 1: Fig. S1). These simulations are built to control narrow sense heritability (h2) of the trait (Fig. 2). Using simulations, we show that the NGG formalism is able to capture simulated epistatic events for a wide range of model parametric values (Fig. 2E, F and Additional file 1: Fig. S1). We found that NGG is quite resilient to noise but sensitive to the number of individuals used for the analysis (as discussed further, see remarks on Very High Dimension), as it radically improves for larger numbers of individuals (Fig. 2E, F; Additional file 1: Fig. S1). For instance, for a 500k SNP epistatic landscape (1000 × 1000), for which we implemented 10 non-null epistatic signals, 50% of these are found in the top 10 NGG predicted signals (Fig. 2E) for a h2 = 0.2 when using 10,000 individuals. This number is maintained to 27% when the number of individuals is reduced to only 1000.

We also measure the NGG ability to detect epistatic signals when the interacting SNPs are not randomly selected. Indeed, we simulated a scenario whereby a SNP can have a simple (1D) effect combined with interaction effects (2D). Again, NGG is able to retrieve the simulated epistatic interactions in an even more complex mixture of simple and combinatorial effects (Additional file 1: Fig. S2).

In this first pass of validation procedure on simulated data, the epistatic effects were computed to reflect the Arabidopsis genome structure that contains a very high homozygosity (simulation code is available in the GitHub repository). However, epistatic interactions, in particular in heterozygous organisms, can be of different kinds as described earlier by Marchini et al. (2005) [22] (Additional file 1: Fig. S3A). Thus, to analyze further the potential of the NGG algorithm to retrieve a certain diversity of interactions, a second simulation was performed including now heterozygosity and three sorts of epistatic interactions [22] (simulation code is available in Git repository as well). We measured NGG capacity to detect 3 different types of epistatic interactions fully described by Marchini et al. (2005) [22], namely Type 1: multiplicative within and between loci, Type 2: Two-locus interaction multiplicative effect, Type 3: Two-locus interaction threshold effect. We show that the sparsity of the signal is important (although not crucial) for NGG to detect epistatic signals (Additional file 1: Fig. S3). This can be explained by the mathematical construction of the compress sensing problem. We also demonstrate that Type 1 and Type 2 interactions are easier to discover than Type 3 interactions and that having different types of interactions in the same simulation run does not affect NGG detection capacities (Additional file 1: Fig. S3).

Having defined the potential of NGG to discover epistatic signals on simulated data we then moved to compare NGG with previously benchmarked results of regular 1D GWAS analyses.

Estimation of NGG efficiency for 1D GWAS on real data

We further benchmark our method on state-of-the-art available datasets and modeling approaches [4, 23]. For this we first compared unidimensional (i.e., 1D or classical) GWAS results using the 107 Arabidopsis phenotypes studied in the landmark paper Atwell et al. [4]. We observed that major signals retrieved with the EMMA algorithm [4, 23] are also retrieved by NGG (Fig. 3A and Additional file 1: Fig. S5 for the 107 phenotypes). For instance, EMMA and NGG methods both identify a major peak for the phenotype 88: bacterial disease resistance (Fig. 3A). This peak directly identifies the resistance gene RESISTANCE TO P. SYRINGAE PV MACULICOLA 1 (RPM1) [24]. It is worth noting that for this particular phenotype, some signals emerge in NGG that are not detected by EMMA (Fig. 3) and that for certain phenotypes, NGG and EMMA converge towards a $$x$$2 relationship (Additional file 1: Fig. S6 S7). The opposite is also true although less frequent (see for the 107 phenotypes Additional file 1: Fig. S6 S7). Similar analyses have been realized on the 18 phenotypes from the Campos et al. paper [25], leading to the same conclusions (Additional file 1: Fig. S6 S7). Interestingly, NGG clearly identifies in the top hits the effect of FLOWERING LOCUS C (FLC), a major gene in the control of flowering time [26, 27] in contrast to EMMA. Here we took this gene as an example of which NGG may be good at retrieving such important signals since it is intrinsically built to retrieve $$\widehat{\theta }$$, considering the other SNP effects (Model 1 and Model 2). For this, we compared the capacity of NGG and EMMA to detect signals in the vicinity of the FLC locus (20 kb window). Interestingly, Fig. 2B shows that the NGG model indeed retrieves FLC as being the second strongest signal when EMMA reports it as the 30th signal (Fig. 3B). Finally, for every single phenotype, we quantified the overlap between the increasing k-top $$\widehat{\theta }$$ and the EMMA signal. For the vast majority of signals, we found a good congruence between signals varying between 40 and 100% for Atwell et al. phenotypes and between 9 and 59% for Campos et al. phenotypes (Additional file 1: Fig. 5).

It is noteworthy that the observed good correlation between EMMA and NGG results may indicate that NGG performs genome population structure correction comparably to mixed models. Population structure correction is explained in the light of the NGG procedure (fully described in Additional file 1: Text 1) and as follows. Equation #16, in Additional file 1: Text 1, solves the compressed problem by utilizing a compressed version of the kinship matrix estimation (the AXtXAt matrix), which automatically incorporates a renormalization via this estimated projected kinship matrix during resolution. Furthermore, the final algorithm (Additional file 1: Text 1 and provided code), which has a specific piecewise structure and involves averaging, enables the estimation of effects by breaking any connections that may exist between the coordinates. However, further simulations incorporating diverse genetic architectures and models of population structure would be required to fully validate this observation.

Having shown that NGG is able to retrieve GWAS signals on original datasets we decided to evaluate its speed in comparison to other algorithms.

Estimating acceleration towards 2D-GWAS

We then compared the runtime of NGG with permGWAS which is the fastest GWAS to date [29]. PermGWAS was itself challenged on the same dataset, containing 1 million SNPs and 1000 individuals, to GEMMA and SNPTESTv2. Their respective runtimes were 6, 29, and 23 min. Regarding these results then further report only the comparison of NGG to permGWAS only.

All runtime experiments were measured on the same server using Ubuntu 20.04.3 LTS with 40 CPUs, 377 GB of memory, and 4 Quadro RTX 6000 GPU, each with 24 GB of memory.

First we estimate the effect of the number of markers on the runtime (Additional file 1: Fig. S4a), we fixed the number of samples to 1000 and varied the marker between 50 K and 2 M SNPs. As summarized in Additional file 1: Fig. 4, both permGWAS and NGG are proportional to the number of SNPs with a logarithmic relationship. The NGG is more than two orders of magnitude faster than permGWAS. For the maximum marker size (2 M) NGG took approximately 5 s, while for permGWAS the runtime took more than 460 s (~ 7 min).

Then we estimated the effect of the number of samples on the runtime (Additional file 1: Fig. S4b), we fixed the number of markers to 1 M and varied the number of samples between 1 and 10 K. Additional file 1: Fig. 4 allows us to observe that NGG outperforms again permGWAS by at least two orders of magnitude. For the maximum number of samples (10 K), NGG took approximately 6 s, while for permGWAS the runtime was more than 985 s (~ 16 min).

In conclusion, these results demonstrate that NGG delivers similar and potentially more accurate results in regards to other regular GWAS techniques and is a hundred times faster. This speed improvement brings the calculation for ~ 60 billion SNP or combination of SNPs below an hour making possible the computation of entire epistatic maps.

2D GWAS on real data

Being confident that NGG has the potential to point to true epistatic effects (Fig. 2) and having in mind that the number of individuals greatly improves the detection capacity of our model (Fig. 2E, F; Additional file 1: Fig. S1), we tested NGG analysis on the dataset with the greater number of Arabidopsis ecotypes extracted from work by Campos et al. 2021 [25]. In this work, Campos et al. (2021) provide the elementary composition (18 different elements) for > 1100 different Arabidopsis ecotypes having been fully sequenced by the 1001 genome project [21].

Figure 4 reports results of unidimensional NGG for Phosphorus content (noted P31) of Arabidopsis leaves, that can be displayed at the same time as (i) $$support \,for \,the \,model \,x \,|SNP \,effect|$$ or as (ii) pure $$SNP \,effect$$ ($$\theta$$). The latter provides a Manhattan plot with negative values that can be interpreted as the SNP having a negative effect on the phenotype as compared to the reference genome (here Columbia-0 ecotype) (Fig. 4). Also, the effect reported in this Manhattan plot is now expected to be directly proportional to the effect of the genetic variation as compared to Col-0 phenotype, helping to choose for the best variant or gene to study.

We further proceeded with the computation of full epistatic maps or 2D-NGG for phenotypes retrieved from the Campos et al. [25] and Atwell et al. [4] datasets. We focused on these datasets as they present a relatively high number of ecotypes (> 1000), and an important diversity of well-known phenotypes respectively. To do this, we prefiltered SNPs having a particular Minor Allele Frequency (MAF) because the probability for the combination of SNPs (that we call MIAF for Minor Interaction Allele Frequency) to be of interest for epistatic measurements directly depends on the MAF as $$Z=X\star X$$ (above). For this we prefiltered 346,094 SNPs, for Campos et al., and between 341,067 and 371,956 SNPs, for Atwell et al., having a MAF greater than 0.3. The full epistatic landscape is thus 59.890 billion interactions for Campos et al. [25] and between 58.163 and 69.175 billion interactions for Atwell et al. [4].

Nowadays, this quantity of data represents a challenge on its own to compute, store, and display the results as it relates to a “Very High-Dimensional” (VHD) framework [30]. VHD is mathematically defined in terms of the size of the genotypic matrix $$X$$ (n rows and p columns) and in terms of sparsity of the unknown parameter to be estimated or tested, here the number of “active” SNPs and interactions for a given phenotype: k. In this framework [30], we can evaluate the effects of VHD genotypic input matrices on the performance of several popular methodologies (for hypothesis testing, support estimation, and prediction) and show that when k log(p/k) is large with respect to n then statistical estimation and testing errors inflates dramatically. We believe that this is at least in part the reason for which full epistatic maps (2D-GWAS) were so far out of reach.

Following this line, in our study (Fig. 5), n = 999, and p is around 60 billion. Following a reasonable estimate for sparsity k, granting satisfactory and reliable outputs is estimated to be not more than 50. This is why, in our forthcoming study we mainly consider and analyze in a final stage around 30 significant SNPs interactions or composite components.

Figure 4 displays 2D-NGG results for (i) Arabidopsis flowering times (Fig. 5A–C) [4] and (ii) Arabidopsis phosphorus (P31) leaf content (Fig. 5D–F) [25]. Results are displayed as a square heatmap triangle for which ~ 60 billion signals $$|\widehat{\theta }|$$ are provided. One 2D-NGG result dataset represents ~ 500 Go of data. To navigate through this large dataset, a visualization tool named Luciol has been developed that can be understood as a “Google Earth” for full epistatic maps. Briefly, results are organized in layers as such the max intensity of a genomic region is reported on the higher layers. Here in Fig. 5, layer 11 represents our maximum zoom-out condition. A zoom between layer 11 to layer 0 (the layer for which a given pixel represents a direct SNP/SNP combination) corresponds to a 4.2 million times zoom. In other words, a pixel in layer 11 reports the max intensity of 4.2 million SNP/SNP interactions underlying layer 0. Observation of full epistatic maps as well as local signals informs on the genetic architecture underlying a given phenotype (Fig. 5).

In the case of the flowering trait (Fig. 5A), around 6 major epistatic signals emerge where 2 of them are close to the diagonal. The proximity of the diagonal refers to potential epistatic interactions of neighboring genes (although a few Mbp away). To display unambiguous epistasis we thus decided to report here the fourth stronger effect that lies very far from the diagonal. A zoom at the 2D-locus reveals the structure of a 2D-GWAS peak that appears bi-modal (i.e., supported by at least 2 distant SNP combinations, 2 local bright spots in the epistatic map, Fig. 5B). This peak points to 2 loci predicted to be epistatic. The first locus is at position CHR4:6,524,710, and the second one is at CHR1:6,243,417. Using these coordinates, the matrix X and phenotype Y are parsed to plot the phenotypic distribution following the combination of SNPs (a sort of 2D-haplogroup). This is reported by the box plot in Fig. 5C. Herein we observe that this epistatic effect involves 2 loci having a moderate effect individually as reported to the SNPi and SNPj boxplots (left and right panels). However, the combination of the simple effect cannot predict the effect of the combination since the positive effect of SNPj, from 0 to 1 modality, seems enhanced by the SNPi (0) modality and totally repressed by the SNPi (1) modality. This clearly indicates the potential presence of an epistatic effect between these 2 loci.

We also report (Fig. 5D to F) the epistatic interactions in the control of plant leaf phosphorus content. This epistatic map reports around 8 strong epistatic signals. As an example, we zoomed into 2 of them being the stronger ones with respect to their predicted value ($$|\widehat{\theta }|$$). The first one is relatively close to the diagonal although both epistatic SNPs lie in chromosome 1 but eleven Mbp away (Fig. 5F top panel). The second one concerns an epistatic effect predicted to involve SNPs on 2 different chromosomes namely CHR5 and CHR3 (Fig. 5F bottom panel). These 2 epistatic signals are built upon the effect of a strong combination of SNP effects as it appears impossible to predict the combinatorial output of these SNPs by solely analyzing the effect of the simple SNP modalities (compare box plots of SNPi and SNPj to box plot of SNPi:SNPj). Here, these effects can totally be missed by previous studies of epistasis that, to date, necessarily implied a selection of genetic variables [12].

We succeed in demonstrating that the NGG algorithm allows the computation of full epistatic maps with a gene resolution at least for high gene density genomes such as that of Arabidopsis.

Missing heritability recovery and phenotype prediction

We set out to understand the “missing heritability” explained by our recovered epistatic interactions. To do so, we calculated the increased variance explained (as an $${h}^{2}$$ proxy) being retrieved from 2D-GWAS as compared to regular 1D-GWAS (Fig. 6). The differential heritability between 1 and 1D + 2D GWAS was estimated by Principal Component Regression (PCR) [31], carried out on a set of selected SNP and SNP-interactions (Fig. 6A). The principle of PCR dates back to the late 50’s [31]. PCR combines Principal Component Analysis (PCA) on the input features of a model followed by linear regression [31]. First, a PCA of X provides a low number of principal components and a dimension reduction by selecting fewer components associated with the highest eigenvalues modulus of X. Regression is then performed on this reduced set of components (related to the VHD problem that we described above) that play the role of new synthetic inputs. PCA concentrates the information of the large matrix X or Z in a smaller matrix, removing collinearity as well because the components are, by definition, not correlated.

Here, PCR is carried out (i) on a set of p SNPs and then (ii) on a set of the same SNPs as in (i) + q SNP/SNP-interactions (Fig. 6B). The plots show the retrieved “missing heritability” (difference between blue [1D signal] or red lines [1D signal plus 2D_random], the controls, and the green line [1D signal plus 2D]) as measured by the adjusted R2 when the number of selected components increases (x-axis Fig. 6B). For the vast majority (16 of the 18 phenotypes), a good proportion of heritability is retrieved in the 2D signals. Only, Cobalt or Selenium do not display a radical improvement in the explained variance. By applying this method, we observed that information in the epistatic landscape indeed contains a good proportion of the missing heritability (Fig. 6B). For the Phosphorus content of Arabidopsis leaves, for instance, the heritability measures in the 1D GWAS ranges around 22%. Estimated h2 then increases to 33% when the information in the 2D-GWAS is considered. We note here that we do not strictly evaluate the missing heritability recovered but rather a proxy of it.

Having a comprehensive view of gene resolution epistatic maps opens up possibilities for at least two developments. The first is experimental validation. This process is extremely labor-intensive and could take a considerable amount of time to precisely dissect epistatic interactions. Although these are currently under investigation, we chose to publish our findings primarily due to the second potential development. The second area of advancement is in phenotypic prediction. Essentially, NGG can be viewed as a highly effective variable selection process (Fig. 6A), which could significantly benefit precision medicine and various agronomic selection programs for plants and animals.

We thus further evaluate the role of NGG signals for phenotypic predictions through the use of a broad set of machine learning algorithms including Deep Neural Networks (DNN), Support Vector Machine (SVM), Gaussian Processes (GP), Gradient boosting (GB), Random Forest (RF), Linear regression, Lasso, Elastic Net. These techniques were used to predict the 18 phenotypes from the Campos et al. [25] work (described above). We also crossed these machine learning techniques with an increasing number of 1D and 2D signals/SNPs (10, 100, 500, 1000, 5000, 10,000). To perform a proper control, we repeated this in silico experiment but instead of providing the models with proper 2D signals, we randomly sampled epistatic signals (named 2D_random) to evaluate our capacity to predict plant mineral content. As classification problems are easier to solve and that the number of individuals is still a bit limited for regression approaches, we also used quantiles to rank phenotypes into 5 or 3 classes (Fig. 7A). By crossing all these parameters, we ended up with 1728 different models for 1D + 2D signals (y-axis of the plot Fig. 7B) and the same number of models for 1D + 2D_random (x-axis Fig. 7B). Our capacity to predict phenotype is performed on 50% of the dataset (validation set) that were not used to (i) perform the NGG analysis, (ii) neither to fit or train the models. The quality of the models is evaluated through classical precision/recall curves and F scores. Figure 7A presents the F1 scores of best-predicted classes (measure a precision/recall compromise) for 1D + 2D random against 1D + 2D models. All the models lying above the diagonal (x = y) correspond to models for which predictive power is improved by epistatic signals (Fig. 7A, Additional file 1: Table 1).

We observe that 2D epistatic signals improve phenotypic classification (Fig. 7B) as 57% of the models are improved. Interestingly, models having a low F1 score and models having a high F1 score tend to beneficiate most of the epistatic signals. Furthermore, this improvement is even more dramatic when we consider the models having a lower number of SNP and SNP interactions (30 + 30) (Fig. 7C). In this particular case with a low number (60) of explanatory variables 80% of the machine learning models are improved by 2D signals as compared to randomly picked ones (Fig. 7C). We wish to highlight some particular points for which we observe an increase in our capacity to predict phenotypic classification and an overall good classification outcome (Fig. 7B, D, E). The arrows in Fig. 7B point to 2 models for which the 1D signal alone does not allow a very good classification (xD = 0.534, xE = 0.535), when the same models with an epistatic signal reach an F score of 0.732 (yD) and 0.728 (yE), respectively. We also observe some phenotypes such as Na23 (sodium leaf content) for which most models and parametric values of the machine learning procedures globally beneficiate the epistatic signal showing that retrieved 2D signals are globally bringing new information (purple circle Fig. 7B). Figure 6D and E display an example of our capacity to predict phenotype classification for molybdenum leaf concentrations. This level of precision and recall opens avenues for plant selection procedures assisted by epistatic markers.

In summary, our study successfully presents the creation of comprehensive epistatic 2D maps with sufficient SNP density to achieve gene-level resolution. We applied our method to the model organism Arabidopsis thaliana, leveraging its readily available dataset. Importantly, our approach is universally applicable and can be readily adapted to other biological models, especially in the context of human genetics. As hypothesized before [18], we demonstrate that a substantial part of the missing heritability lies in epistatic interactions (Fig. 6). Finally, we show that this never observed fine grained 2D epistatic signal brings us a bit closer to the prediction of phenotypic values by machine learning procedures on plants, but we hope soon, in other biological models as well.