Scientific Papers

Statistical learning quantifies transposable element-mediated cis-regulation | Genome Biology

Description of Image

craTEs models variations in gene expression as a linear combination of TE-encoded cis-regulatory elements

Using RNA-Seq data, we aimed to systematically uncover TE subfamilies that regulate the expression of protein-coding genes in cis. Integrants of the same TE subfamily share a high level of sequence similarity. Thus, they are predicted to exert a similar cis-regulatory influence on protein-coding genes located in their vicinity, for example through the simultaneous recruitment of a specific set of transcriptional regulators at multiple genomic loci. We thus assumed that the subfamily composition of TE integrants located within cis-regulatory distance of protein-coding genes contributes to a discernible fraction of the variation in gene expression (Fig. 1A) [27, 29]. As a first approach, we set this distance to 50 kb since differentially expressed (DE) genes were found to be enriched within this range of epigenetically perturbed cis-regulatory LTR5-Hs and SVA TE subfamilies [5].

Fig. 1
figure 1

craTEs uncovers cis-regulatory TE subfamilies from RNA-seq. A Overview of the craTEs model. Differences in expression [log(TPM)] for protein-coding genes between treatment and control samples (columns of matrix E) are modeled as a linear combination of the per-subfamily TE counts found in the cis-regulatory region (shaded beige) of each gene (columns of N). Differences in cis-regulatory activities for each treatment vs. control experiment (columns of A) are estimated by least squares. The cis-regulatory regions of each gene are defined as 50-kb long stretches of DNA 5′ and 3′ from promoter regions. Cis-regulatory regions exclude the exons (gray boxes) and promoters (orange boxes) of the genes they are assigned to. Gray bold lines: gene introns. Sequences of introns and exons: transcripts. B Proportion of integrants remaining at each step of the construction of N with respect to the original number of TEs present in the annotation (indicated in gray). “All TEs” refers to all integrants found in the TE database “Repeatmasker RELEASE 20170127” (number of unmerged TEs are indicated in gray). “cis-TEs” refers to integrants found in cis-regulatory regions before (“unfiltered”) and after (“filtered”, numbers indicated in red) removing those overlapping exons and promoters of the corresponding gene. C Seven case studies exemplifying the estimation of the cis-regulatory activities of TE subfamilies from RNA-seq data. Black dots are TE subfamilies with statistically significant (BH-adj. p-value <0.05, t-test) differences in activities between the treatment and control groups. 95% confidence intervals for the estimated cis-regulatory activities are shown as gray bars. Gray dots are TE subfamilies with non-significant differences in activities. Subtitle: p-value from the F-test of overall significance in regression. From left to right: CRISPRi-mediated repression of LTR5-Hs and SVA integrants in naïve hESCs, gRNA #1 (g#1) \(n=3\) (3 treatment samples vs. 3 control samples) [33]; CRISPRi-mediated repression of LTR5-Hs and SVA integrants in naïve hESCs, gRNA #1 (g#1) \(n=3\); CRISPRi-mediated repression of LTR5-Hs/A/B integrants in an embryonal carcinoma cell line (NCCIT), \(n=2\) [35]; CRISPRa-mediated activation of LTR5-Hs/A/B integrants in NCCIT, \(n=2\); CRISPRi-mediated repression of LTR2B integrants in K562, \(n=2\) [34]; overexpression of the pluripotency TF KLF4 in primed hESCs, \(n=4\) [33]; overexpression of the SVA-targeting KZFP ZNF611 in naïve hESCs, \(n=2\) [33]. D Proportion of variance of E explained by craTEs for each experiment in C

Considering two experimental conditions denoted as 1 and 2, for example, “control cells” and “cells with transgene overexpression”, we modeled the variation in gene expression \(\Delta E\) of each of the p protein-coding genes as a linear combination of the per-subfamily TE integrant counts \(N_{pm}\) located within cis-regulatory distance of its promoters (see Methods). \(N_{pm}\) represents the regulatory susceptibility [27] of gene p to TE subfamily m. We trade biological complexity for statistical simplicity by treating members of the same TE subfamily as “regulatory black boxes” of equal cis-regulatory potential. A well-known caveat of currently available ERE annotations is that integrants are often fragmented into multiple sequences [13, 30], causing an artificial inflation of \(N_{pm}\) and potentially deteriorating model performances. We therefore merged closely located (< 100 bp) ERE fragments of the same subfamily into single ERE integrants. LINEs, LTRs, and SVAs were particularly prone to spurious fragmentation (Fig. 1B), with numbers of integrants dropping by 8.3%, 7,9%, and 15%, respectively, after merging. To define the regulatory susceptibility \(N_{pm}\) of each gene p to each TE subfamily m, we counted the number of integrants of subfamily m falling within cis-regulatory distance of the promoters of p. We found that between 45.9% (LTRs) and 72.5% (SVAs) of all integrants were located within cis-regulatory distance, i.e., 50 kb up/downstream, of at least one protein-coding gene promoter (Fig. 1B). In rare instances, TEs overlap gene exons. Since these are used to quantify RNA-seq reads, this may introduce a spurious association between the presence of an annotated TE integrant and gene expression. We addressed this by excluding TEs overlapping exons from the set of putatively cis-regulatory TEs susceptible to regulate the corresponding gene. Finally, we chose to emphasize TE-driven cis-regulation dependent on distal sequences, i.e., located more than 1.5kB up/downstream of a transcription start site, as the role of TEs as alternative promoters has been extensively studied elsewhere [31, 32]. Thus, we prevented TEs overlapping with promoters of gene p from contributing to the set of regulatory susceptibilities \(N_{pm}\) (Fig. 1A–B). The combination of the last two filtering steps excluded 1.2% of TEs found within cis-regulatory distance of protein-coding genes from N (Fig. 1B).

The main purpose of craTEs is the estimation of \(\Delta A_{m\;2-1}\) which we define as the difference in cis-regulatory activity exerted by subfamily m between conditions 1 and 2 (see Eq. 1). For the purpose of this study, we chose the convention that a positive cis-regulatory activity refers to an “enhancer-like” effect in condition 2 with respect to condition 1. Conversely, a negative cis-regulatory activity may reflect either the gain of a “silencer-like” effect or the loss of an “enhancer-like” effect in condition 2 versus condition 1. The cis-regulatory activity \(\Delta A_{m\;2-1}\) has an intuitive interpretation: it is the quantity in expression that would be gained by any gene in condition 2 with respect to condition 1 upon insertion of an integrant of subfamily m within cis-regulatory distance of one of its promoters. An independently and identically distributed Gaussian noise term centered around zero \(\epsilon _{2 + 1}\) captures the variation in gene expression that is not accounted for by the linear model, and represents the sum of the noise terms corresponding to gene expression in each condition.

$$\begin{aligned} \Delta E_{p,2-1} = \Delta A_{0, 2-1} + \sum _{m} N_{pm} \Delta A_{m,2-1} + \epsilon _{2+1} \end{aligned}$$


craTEs estimates the vector of cis-regulatory TE subfamily activities \(\hat{\Delta A_{2-1}}\) by minimizing the squared difference between the observed logged expression values \(\Delta E_{2-1}\) and those modeled as linear combinations of the columns of the susceptibility matrix N, containing the regulatory susceptibilities \(N_{pm}\). Furthermore, craTEs assesses whether there is statistical evidence that \(\hat{\Delta A_{m\;2-1}}\) differs from zero: each component of \(\hat{\Delta A_{2-1}}\) is tested against the null hypothesis \(H_0: \Delta A_{m\;2-1} = 0\), i.e., that there is no difference in activity between conditions 1 and 2 for subfamily m, by means of a t-test (see the “Methods” section). This provides a measure of statistical significance for the estimated differences in TE-dependent cis-regulatory activities between conditions 1 and 2.

craTEs uncovers cis-regulatory TE subfamilies from RNA-seq data

We then assessed the ability of craTEs to detect cis-regulatory TE subfamilies under controlled experimental settings. In particular, we leveraged three RNA-seq datasets derived from experiments in which specific TE subfamilies were epigenetically silenced or activated, thus ablating their cis-regulatory effect on neighboring protein-coding genes [33,34,35]. These datasets provide a biological “ground truth” against which we evaluated the output of craTEs. The targeted epigenetic modulation of specific genomic loci was achieved by means of the CRISPR interference or activation systems [36]. CRISPRa/i relies upon a catalytically dead Cas9 domain (dCas9) that binds to DNA sequences complementary to user-defined guide RNAs (gRNAs). Once bound to the DNA, the dCas9-fused KRAB domain elicits the local deposition of repressive histone marks, thereby suppressing any enhancer activity exerted by the target site (CRISPRi). Conversely, the dCas9-fused VPR transactivator domain recruits a diverse set of transcriptional activators encompassing histone acetyltransferases upon DNA binding, thereby stimulating enhancer activity at the target site (CRISPRa) [37, 38]. As TEs of the same subfamily exhibit high levels of sequence similarity, hundreds of related integrants can be targeted for activation/silencing by only a handful of carefully designed gRNAs [5, 39, 40]. We have previously shown that the hominoid-specific LTR5-Hs and SVA TE subfamilies serve as enhancers in naïve hESCs and that this cis-regulatory activity can be ablated by CRISPRi [5]. We reanalyzed RNA-seq data from naïve hESCs where large fractions of the LTR5-Hs and SVA subfamilies were epigenetically silenced via CRISPRi across two independent experiments, each through a distinct guide RNA (g#1 and g#2). We applied craTEs to the vector \(\Delta E_{CRISPRi -control}\) containing the differences in gene expression between each CRISPRi experiment and control naïve hESCs. The association between the differences in gene expression \(\Delta E_{CRISPRi -control}\) and the cis-regulatory susceptibilities N of promoters to TE subfamilies was statistically significant (g#1: p-val = \(5.47\cdot 10^{-146}\), F-test, g#2: p-val = 0, F-test), strongly suggesting an interrelation between changes in expression observed at protein-coding genes and the genomic distribution of integrants for at least a subset of all TE subfamilies. After correcting for multiple testing using the Benjamini-Hochberg procedure [41], we uncovered 13 (g#1), resp. 39 (g#2) TE subfamilies with statistically significant differences in cis-regulatory activity (Fig. 1C, Table S1), i.e., non-zero \(\Delta A_{m, 2-1}\) activity coefficients. Among these, LTR5-Hs, SVA B, C, and D subfamilies displayed the largest and most statistically significant absolute estimated cis-regulatory activities. The negative activity values reflect the abrogation of the enhancer effect exerted by LTR5-Hs and SVAs in naïve hESCs by the CRISPRi system and are best interpreted as the log2 fold-change in protein-coding gene expression attributable to the presence of a single integrant from the corresponding subfamily near the promoter of the given gene. Specifically, the expression of any given gene bearing an LTR5-Hs integrant in its cis-regulatory window decreases by an estimated fold-change contained within the 95% confidence interval \(2^{-0.133\pm (1.96\cdot 0.0095)} = [0.90;0.92]\), i.e., by approximately 10% upon CRISPRi using g#1 (Table S1). We then applied craTEs to RNA-seq data generated from the CRISPRi-mediated repression of LTR5-Hs, LTR5A and LTR5B in the NCCIT human embryonal carcinoma cell line [35] and found that LTR5-Hs and LTR5B showed the largest and most statistically significant absolute differences in cis-regulatory activity (Fig. 1C), with the related LTR5A subfamily also displaying a weaker yet also statistically significant difference (Table S1). Conversely, craTEs uncovered LTR5-Hs, LTR5A, and LTR5B as the TE subfamilies with the largest and most statistically significant absolute difference in cis-regulatory activity when applied to RNA-seq derived from NCCIT cells subjected to LTR5-Hs/LTR5A/LTR5B CRISPRa (Fig. 1C), this time with positive activities mirroring the increased enhancer effect exerted by CRISPRa-targeted LTR5-Hs/LTR5A/LTR5B on neighboring genes. Thus, craTEs correctly inferred gains and losses of enhancer effect at the subfamilies targeted by CRISPRa/i and did so from the expression of protein-coding genes alone.

As it is well established that TEs are particularly active in hESCs [6, 42], we wondered whether craTEs would be able to recover TE-dependent cis-regulatory changes in other cellular contexts. A subset of LTR2B elements are marked by the enhancer histone mark H2K27ac in various leukemia cell lines, including in chronic myelogenous leukemia-derived K562 cells [43]. We used craTEs to estimate the differences in TE-driven cis-regulatory activities between K562 cells where LTR2B were repressed via CRISPRi and their control counterparts. craTEs correctly identified LTR2B as significantly less active in LTR2B-CRISPRi K562 cells compared to control K562 cells (Fig. 1C). Thus, craTEs recovers TE-dependent cis-regulatory mechanisms beyond the context of hESCs.

Next, we empirically verified whether the ability of craTEs to detect changes in cis-regulatory TE activity is generalized beyond experiments of targeted TE repression via CRISPRi. TEs often exert cis-regulatory effects by serving as docking platforms for TFs. For example, the core pluripotency TF KLF4 is highly expressed in naïve hESCs, where it binds to LTR7, LTR5-Hs, and SVA integrants [5]. Interestingly, these subfamilies also display elevated levels of the enhancer histone mark H3K27ac in naïve hESCs. In contrast, primed hESCs generally express lower levels of KLF4 and TEs than their naïve counterparts [5, 6]. Using craTEs, we assessed the impact of KLF4 overexpression on TE-dependent cis-regulation in primed hESCs. craTEs identified LTR7, LTR5-Hs, and SVA D as the most statistically significant and highly activated TE subfamilies upon KLF4 overexpression, thereby recapitulating our previous findings [5] agnostically with respect to epigenomics data and TE-derived transcripts (Fig. 1C). Interestingly, we previously observed that the KLF4-dependent enhancer activity of SVAs in primed hESCs did not correlate with increased SVA transcription (Fig. S1A) but instead with an accumulation of H3K27ac enhancer histone marks at SVA integrants [5]. This suggests that craTEs detects TE-dependent cis-regulatory effects that would not be inferred from studying the variation in the expression of TE integrants. Furthermore, overexpression of the repressive SVA-binder KRAB-zinc finger protein ZNF611 [9] in naïve hESCs abrogates the enhancer activity of SVAs [5]. We used craTEs to estimate the differences in TE-dependent cis-regulation between ZNF611-overexpressing and control naïve hESCs. As expected, craTEs identified SVAs as the TE subfamilies with the most statistically significant and largest absolute differences in cis-regulatory activity between the two settings (Fig. 1C), with negative activity values reflecting the loss of enhancer effect at SVAs upon ZNF611 overexpression. Of note, the proportion of the variance in gene expression explained by the distribution of TEs across cis-regulatory windows [2–12%, adj. \(R^2\)] (Fig. 1D) overlapped with the typical fraction of explained variance reported upon modeling gene expression as a function of the distribution of TF binding motifs at gene promoters [29]. Together, these results show that craTEs correctly identifies TE-dependent cis-regulatory activity changes beyond the context of targeted TE epigenetic perturbations and demonstrate its utility for identifying TE-dependent regulatory mechanisms under biological perturbations that affect TEs indirectly. In addition, craTEs identifies cis-regulatory TE subfamilies without resorting to mapping RNA-seq reads emanating from transcriptionally active TEs or performing epigenomics assays.

craTEs outperforms enrichment approaches based on differential expression analyses

The notion that differences in gene expression may reveal candidate cis-regulatory TEs has already been exploited in previous studies [5, 44] though the statistical methodologies differ from craTEs in key aspects. More specifically, these methods identify cis-regulatory TEs through a two-step process. First, differentially expressed (DE) genes are identified through ad hoc statistical methods [45, 46]. Then, per-subfamily scores for the enrichment of differentially expressed genes in the vicinity of TE integrants are computed. A high enrichment is reflected by a small probability (p-value) of finding more DE genes in the vicinity of a specific subfamily than the observed number of DE genes. We empirically compared the output of craTEs with that of the enrichment approach on the RNA-seq dataset whereby LTR5-Hs/SVA were silenced via CRISPRi [33]. Using the enrichment approach, we found that DE genes whose expression fell under LTR5-Hs/SVA epigenetic repression (Fig. S1B, p-val <0.05, Fisher’s exact test, lenient DE calling) were statistically significantly enriched in the vicinity of LTR5-Hs, SVA B and SVA D integrants (Table S2, BH-adj. p-val <0.05, hypergeometric test). Note that the DE enrichment approach failed to detect the regulatory link between gene downregulation and the TE subfamily SVA C (adj. p-val = 1, hypergeometric test), whereas these were identified by craTEs (Fig. 1C, Table S1). Moreover, when correcting for multiple testing during differential expression analysis (Fig. S1B, BH-adj. p-val <0.05, Fisher’s exact test, stringent DE calling), DE genes were enriched near SVA D integrants, but not LTR5-Hs or other SVA subfamilies (Table S3). These results indicate that craTEs is more sensitive than DE enrichment approaches in the task of detecting cis-regulatory TE subfamilies from the expression of protein-coding genes. To assess whether this came at the cost of decreased specificity, we quantified the ability of craTEs — resp. the DE enrichment approach — to recover a ground truth set of cis-regulatory TE subfamilies upon CRISPRi-mediated repression of LTR5-Hs and SVAs either using g#1 or g#2 (Figs. 1C, S1D). Since no epigenomic data was available for g#1 [33], we leveraged ATAC-seq to generate chromatin accessibility profiles in naïve hESCs subjected to g#1-mediated CRISPRi against LTR5-Hs/SVAs. We then defined ground truth cis-regulatory TE subfamilies for each gRNA as those (1) with integrants directly targeted by the gRNA (LTR5-Hs, SVA A-F) (2) with enrichment for decreased ATAC-seq (g#1: {LTR5-Hs, SVA A-D}, Table S4; g#2: {LTR5-Hs, SVA A-F}, Table S5) and/or increased H3K9me3 upon CRISPRi (g#2: {LTR5-Hs, SVA A-F}, Table S6). We used the subfamily-specific BH-adjusted p-values computed according to craTEs, the lenient and the stringent DE enrichment approaches to classify subfamilies into two classes — cis-regulatory versus not cis-regulatory and subsequently computed the area under the receiver operating characteristic curves (AUCs) (Fig. S1C). craTEs displayed higher AUCs than DE enrichment approaches for both gRNAs (g#1: 0.996 vs. {0.800, 0.600}, g#2: 1.0 vs. {0.85, 0.88}), noting that the low rates of true positives versus true negatives may partially explain the elevated AUCs. Overall, this suggests that by pooling information across all genes, and not just DE genes, craTEs offers increased statistical power over classical DE enrichment approaches in the task of identifying cis-regulatory TE subfamilies. Moreover, this emphasizes that cis-regulatory subfamilies as identified by craTEs agree with those displaying an enrichment based on differential context-matched epigenomic data.

craTEs estimates cis-regulatory TE activities by considering expression variations across hundreds of protein-coding genes. Consequently, craTEs does not require replicates to estimate TE subfamily cis-regulatory activities. To illustrate this, we reanalyzed the RNA-seq data derived from LTR5-Hs/SVA CRISPRi experiments [33]. We treated each pair of LTR5-HS/SVA CRISPRi and control samples as a single experiment, in effect ignoring the information provided by the replicate structure. We applied craTEs to each of the four replicates for both gRNAs (Fig. S1D). Consistent with our findings while accounting for replicates (Fig. 1C), LTR5-Hs and SVA subfamilies collectively exhibited a statistically significant decrease in cis-regulatory activity upon CRISPRi across replicates, though not all of them passed the significance threshold. In addition, classifying subfamilies as cis-regulatory based on the measure of statistical significance reported by craTEs (BH-adj. p-val, t-test) yielded AUCs ranging from 0.87 to 0.99. Thus, whereas the discovery power of craTEs grows together with the number of replicates, the method can still uncover statistically significant changes in the cis-regulatory activities of TEs even in the absence of replicates. In contrast, any DE enrichment approach requires at least three samples due to the prerequisites of the DE analysis methods [45, 46], and therefore cannot perform better than a random classifier in the absence of replicates (AUC = 0.5). In addition, craTEs not only quantifies the statistical significance of TE subfamily cis-regulatory activities but also provides a measure of the effect size through the estimated coefficient \(\hat{\Delta A_{m,2-1}}\) which can be interpreted as the log2 fold-change in gene expression that would affect any gene upon insertion of an integrant from subfamily m within its cis-regulatory window (Fig. 1A). Overall, this case study suggests that craTEs is more powerful and more informative than DE-based enrichment approaches to discover cis-regulatory TE subfamilies from RNA-seq data, supporting the notion that TEs act as cis-regulatory fine-tuners, the dynamics of which may be overlooked when restricting the analysis to DE genes only.

Influential TE-embedded cis-regulatory information resides up to 500 kb from gene promoters

In a first implementation of craTEs, we defined cis-regulatory regions as 50-kb-long stretches of DNA directly adjacent to the 5′ and 3′ sides of protein-coding gene promoters. Though informed by previous work [5], this choice of genomic distance was based upon data corresponding to LTR5-Hs and SVA subfamilies in hESCs only and may not reflect the general range of action of cis-regulatory TEs across all subfamilies and cellular contexts. We therefore modified the craTEs model to weight the regulatory influence of each integrant i on each gene p as a continuous and decreasing function of its distance \(d_{p,i}\) to the closest promoter of p (Fig. 2A). We defined the regulatory susceptibility \(N_{pm}\) as:

$$\begin{aligned} N_{pm} = \sum _{i\in \{m,c\}}e^{-\frac{-d_{p,i}^2}{2L^2}} \end{aligned}$$


where each weight was computed using a Gaussian kernel applied to the integrant-promoter distances \(d_{p,i}\). We considered all combinations of genes and integrants located on the same chromosome c. Note that integrants falling within exons or promoters of p were excluded from \(N_{pm}\). We computed 11 susceptibility matrices N by varying the bandwidth of the Gaussian kernel L between 1 kilobase (kb) and 10 Gigabases (Gb) thus spanning the entire range of possible cis-regulatory distances. Setting L to 1kb restricts cis-regulatory regions to the direct vicinity of gene promoters. In contrast, at 10 Gb, L exceeds the length of human chromosomes by two orders of magnitude, thus yielding nearly equal regulatory susceptibility scores across genes located on the same chromosome (Fig. S2A). We then tested which of these 11 matrices led to the smallest prediction error using 5-fold cross-validation. For the LTR5-Hs/SVA epigenetic repression, ZNF611 overexpression, KLF4 overexpression [33], and LTR2B epigenetic repression experiments [34], the validation error was minimized for \(L = 100\) kb or \(L = 500\) kb (Fig. 2B). As 95% of the area under a Gaussian curve is contained within two standard deviations from its mean (Fig. S2B), this suggests that TEs encode discernible cis-regulatory information up to distances of approximately 200 kb to 1 million bases (Mb) from gene promoters. We note that errors estimated for small (1 kb) and very large (\(>= 100\) Mb) values of L were unstable due to the high degree of collinearity between predictors. Indeed, a small L results in high numbers of zero-inflated columns in the N matrix. Conversely, very large values of L yield nearly equal weights for TE-gene pairs located on the same chromosome (Fig. S2A). Both cases make the least squares problem ill-posed by making the matrix N singular.

Fig. 2
figure 2

Influential TE-embedded cis-regulatory information resides up to 500kb from gene promoters. A Overview of the weighting process whereby the cis-regulatory influence of TEs decreases as a function of the distance to the closest promoter. The scheme depicts a protein-coding gene with two alternative promoters (in orange), coding for two alternative isoforms (in gray). Gaussian kernels with a maximum value of 1 and of varying bandwidth L are centered on each promoter. Before being added to the corresponding element in the matrix N, each TE is weighted as a function of its distance to the closest gene promoter. TEs overlapping exons (gray boxes) and promoters (orange boxes) of the gene are excluded. B To find the bandwidth L leading to the smallest prediction error, the root-mean-squared error (RMSE) was computed for each validation fold and averaged across the five folds over different values of L. C Overview of the experimental design of the hESC “perturbome” [50]. hESC cell lines carrying a stably integrated dox-inducible transgene overexpression construct were established from individual cells. In each of the 441 transgene overexpression experiments, dox-treated samples (dox+) are compared to the same cell line in the absence of dox (dox−). Note that the number of replicates per experiment varies. D Histogram depicting the number of times each Gaussian kernel bandwidth L — either TAD-informed or agnostic — led to the smallest mean validation RMSE in a 5-fold cross-validation scheme for the 441 transgene overexpression experiments. TAD-informed (red): the cis-regulatory weights linking integrants to genes were restricted by topologically associating domain (TAD) boundaries. TAD-agnostic (black): TAD boundaries were not considered. Individual mean RMSE estimations for GATA6, KLF4, and NEUROG1 are shown as illustrative examples. E Estimation of the cis-regulatory activity of TE subfamilies upon KLF4 overexpression [33] using the matrix N computed with \(L=250\) kb

We wondered whether the optimal cis-regulatory bandwidths estimated from the four datasets treated thus far (Fig. 2B) generalized to other TE subfamilies as well. We took advantage of a recently published RNA-seq dataset where hundreds of transgenes, mostly TFs, were overexpressed in primed hESCs through a dox-inducible system [47] (Fig. 2C). We considered this dataset as a “perturbome” where each overexpressed transgene polarizes the primed hESC transcriptome towards a specific direction, e.g., towards the naïve hESC GRN or down a differentiation path. We used the same 5-fold cross-validation scheme to find the optimal value of L for each transgene overexpression experiment (Fig. 2D), this time comparing the prediction error for gene expression between cis-regulatory weight assignment informed by versus agnostic to hESC-specific topologically associating domains (TADs) [48]. In 436/441 transgene overexpression experiments, the optimal bandwidth L took values between 50 kb and 500 kb. As most of the area below a Gaussian curve is contained within two bandwidths from its mean (Fig. S2B), TE subfamilies encode cis-regulatory information up to distances comprised between 100 kb and 1 Mb from the promoters of protein-coding genes in hESCs. In 187/441 transgene overexpression experiments, \(L = 250\) kb led to the smallest cross-validation error, the majority of which (163/187) did not benefit from TAD-informed cis-regulatory weight assignment (Fig. 2D), although the performance gap separating TAD-agnostic versus TAD-informed TE subfamily activity estimation was modest, as illustrated for GATA6, KLF4, and NEUROG1. Thus, TAD-agnostic cis-regulatory weight assignment according to a bandwidth of \(L = 250\)kb can be chosen to weight cis-regulatory TE integrants such as to maximize predictability. As an illustration, with \(L=250\) kb, a TE located 250 kb away from a gene promoter receives a weight of 0.61 (Fig. S2B). The weight drops to 0.01 for a TE-promoter distance of 750 kb resulting in a virtually negligible contribution to the craTEs model.

A predictor matrix N based on TE contributions weighted by their distance to protein-coding genes (Fig. 2A) has two potential advantages over a predictor matrix N computed from hard distance thresholds as we did when first validating the discovery power of craTEs (Fig. 1A). First, the quality of the predictors is likely to improve, as the optimal distance until which cis-regulation affects gene expression is estimated directly from the expression data. In other words, a continuous and decreasing weighting function may better represent the regulatory potential of TEs on protein-coding genes than a hard threshold approach. Second, as we require that each TE subfamily included in N sums up to a total regulatory potential greater than 150 (see the “Methods” section), the continuously decreasing weighting approach may allow for the inclusion of more TE subfamilies in the columns of N, leading to the discovery of previously overlooked statistically significant cis-regulatory TE subfamilies. We used the KLF4 overexpression RNA-seq dataset we previously generated [33] to illustrate these points. We replaced the regulatory susceptibilities \(N_{pm}\) of matrix N computed according to a hard distance threshold (Fig. 1A) with those corresponding to the same subfamilies, this time computed either through TE-promoter distance weighting (\(L=250\) kB) or according to an approximately equivalent hard-thresholded cis-regulatory window width of 500 kB (Fig. 2A, Eq. 2). All three models thus use the exact same number of predictors, i.e., cover the same TE subfamilies. Running craTEs with the weighted matrix N computed with \(L=250\) kb increased the fraction of gene expression variation explained from 4.5 to 5.4% compared to using the matrix N derived from 100-kB-wide hard-thresholded cis-regulatory windows. As the number of predictors in N remained unchanged, this suggests that the distance weighting approach better approximates the cis-regulatory potential of TE subfamilies than the hard distance threshold approach. Notably, as 500-kB-wide hard-thresholded cis-regulatory windows explained 5.2% of the variance in gene expression, most of the increase in explained variance observed under the weighted (\(L=250\) kB) versus the unweighted model (Fig. 1) likely stems from considering more distant TEs as putatively cis-regulatory. Next, we empirically evaluated whether allowing for the inclusion of TE subfamilies that passed the minimum per-subfamily regulatory potential with distance weighting (Eq. 2) — but not with hard distance thresholding — would uncover additional biologically validated TE-dependent cis-regulatory changes. LTR7Y was identified as the most statistically significantly activated subfamily upon KLF4 overexpression in hESCs (Fig. 2E), in agreement with previously published results [5] while it was absent from the model specified through hard distance thresholding (Fig. 1C, Table S1). In addition, though also absent from the hard distance thresholding model, the primate-specific MER41G subfamily was found as statistically significantly and strongly activated in the distance-weighted model. Regarding the LTR5-Hs/SVA CRISPRi, ZNF611 overexpression, and LTR2B CRISPRi experiments, using the distance-weighted matrix N still uncovered LTR5-Hs/SVAs, LTR2B, resp. SVAs as differentially cis-regulatory (Fig. S2E). To sum up, TE subfamilies typically encode cis-regulatory potential up to distances of approx. 500 kb from the promoters of protein-coding genes, at least in the context of hESCs. This reinforces the notion that TEs form a layer of regulatory fine-tuners exerting a measurable impact on the expression of protein-coding genes.

TFs controlling gastrulation and organogenesis promote the cis-regulatory activity of evolutionarily young TE subfamilies activated during pluripotency

Having validated the ability of craTEs to agnostically recover well-established cases of TE-dependent cis-regulatory activities [5, 39, 43], we next aimed at characterizing the landscape of TF-induced TE-dependent cis-regulation in primed hESCs. As the epigenome of hESCs is markedly more open than that of differentiated cells [49], the number and strengths of the TF-TE regulatory interactions constituting the GRN of hESCs can be understood as upper bounds on those constituting the GRNs of differentiated tissues. We therefore applied craTEs to the “perturbome” dataset, where 441 transgenes, most of them TFs, were individually overexpressed in primed hESCs for 48 h through a dox-inducible system (Fig. 2C) [47]. Using the regulatory susceptibility matrix N computed according to the best-performing cis-regulatory bandwidth (\(L= 250\) kb, Fig. 2D), we estimated the changes in cis-regulatory TE activities associated with each dox-induced transgene overexpression experiment (Additional file 8, Table S7). Dox-treatment alone and dox-induced GFP overexpression were not associated with any robust statistically significant change in cis-regulatory TE activity (Figs. 3A, S3A–B, Table S7) suggesting that neither the addition of doxycycline nor the metabolic cost entailed by strong transgene overexpression measurably altered the cis-regulatory activity of TE subfamilies. Interestingly, overexpression of the core pluripotency TF POU5F1 (also known as OCT4) was not associated with differential TE cis-regulatory activity (Fig. S3A–B), suggesting that overexpressing an already highly expressed gene, namely POU5F1, in a cellular context that largely relies on it, i.e., primed hESCs, may not necessarily alter TE-dependent cis-regulation. Together, these results suggest that TE-dependent cis-regulatory activities inferred from the remaining transgene overexpression experiments are not driven by technical factors inherent to the system used but induced by the overexpressed transgene itself.

Fig. 3
figure 3

TFs controlling gastrulation and organogenesis promote the cis-regulatory activity of evolutionarily young TE subfamilies activated during pluripotency. A TE subfamily cis-regulatory activities (color: activity coefficients; area: statistical significance) estimated from dox-induced transgene overexpression experiments at 48 h in primed hESCs [47] using N computed with L = 250 kb. The number of replicates for each condition varies. Experiments were clustered using complete linkage hierarchical clustering on Euclidean distances computed from activity coefficients. Selected TE subfamilies were ordered by evolutionary age in millions of years, as previously estimated [5]. The number of protein-coding genes with total cis-regulatory weights > 0.13 (weight obtained at a distance of 2L, see Fig. S2) is shown for each subfamily. The color labeling of the estimated activities was saturated at \(|\Delta A| < 0.1\)B Top binding enrichment at selected evolutionarily young TEs by selected TFs controlling germ layer development. TEs (rows) were ordered as in A. ChIP-seq experiments (columns) were ordered by developmental stage or germ layer lineage. Color: number of peaks overlapping with subfamily-specific integrants, normalized for subfamily size. Area: statistical significance. Left: ChIP-seq peaks obtained from the ChIP-Atlas [62]. Right: ChIP-seq peaks obtained from [86] and [70]. C Top: Estimated differences in TE-dependent cis-regulatory activities between hESC-derived EpCAM\(^+\)/INTEGRIN\(\alpha\)6\(^+\) double positive (DP) hPGCLCs and double negative (DN) somatic cells at day 6 of differentiation, replicate #1 \(n=2\) [86]. Bottom: SOX15 KO DP hPGCLCs vs. DP hPGCLCs, day 6, \(n=2\). D Left: hESC-derived differentiating endoderm, 48 h vs 24 h, \(n=3\) [70]. Middle: iPSC-derived GATA6 KO mesendoderm vs. iPSC-derived mesendoderm, \(n=2\) [69]. Right: GATA6 rescue in iPSC-derived GATA6 KO mesendoderm vs. iPSC-derived GATA6 KO mesendoderm, \(n=2\)

To reveal how TE-dependent cis-regulation relies on TF/transgene overexpression in hESCs, we performed hierarchical clustering on the matrix containing the statistical strengths of the estimated cis-regulatory TE activities (Additional file 9, Fig. S3C). Stratifying subfamilies according to size and evolutionary age [5] did not reveal any discernible bias regarding the distribution of statistically significant cis-regulatory activities (Fig. S2C–D). Additionally, the directionality and statistical significance of TE-dependent cis-regulatory activities were robust to varying the cis-regulatory bandwidth L (Fig. S4, Table S7) and consistent across replicates when analyzed individually (Fig. S5). Overexpressed TFs of the same family tended to cluster together — e.g., NEUROD1, NEUROD2, NEUROG3, NEUROD4; PAX2, PAX5, PAX8; SNAI1, SNAI2, SNAI3; RUNX1, RUNX3; HES1, HEY1; LHX1, LHX5; GATA1, GATA2, GATA3 — whether considering effect size (Figs. 3A, S4) or statistical significance (Fig. S3C) of the estimated differences in TE-dependent cis-regulatory activity. This suggests that commonalities in gene expression [50] likely driven by shared DNA binding motifs [51] were partially mirrored by similar cis-regulatory TE activity patterns. Experiments where the core trophoblast TF CDX2 [52, 53] was overexpressed clustered away from all other experiments according to statistical significance (Additional file 9, Fig. S3C) but less so according to cis-regulatory activity estimates (Fig. 3A). This may reflect both the widespread rewiring of TE-dependent cis-regulation as primed hESCs differentiate towards trophectodermal cells [23] and a bias towards the detection of more differentially active cis-regulatory TE subfamilies in the CDX2 overexpression experiments due to a larger sample size compared to the other experiments (Fig. S5) [47].

The LTR7 subfamily clustered away from all other TE subfamilies (Additional file 9) and was statistically significantly less active in 186 out of 441 transgene overexpression experiments, making it the most frequently differentially active TE subfamily in this dataset (Fig. S2D). Among overexpressed transgenes tied to a decrease of LTR7-dependent cis-regulatory activity, we found multiple TFs involved in post-implantation developmental stages (Fig. 3A, Table S7), e.g., the meso/endodermal master TF GATA6 [54] and several homeobox-domain-containing TFs including PDX1 and RUNX1. LTR7 cis-regulatory activity also decreased upon overexpression of organ and tissue-specific TFs, e.g., NEUROD1, NEUROD2, NEUROD3, NEUROD4, MYT1, NR2E1, POU4F1, and POU4F3, all involved in the formation of the nervous lineage [55,56,57,58,59]. Overexpression of TBX5, a key TF in the developing heart [60] also decreased the cis-regulatory activity of LTR7. Lastly, overexpression of TFs involved in the development and maintenance of the placenta, e.g., CDX2, TEAD4 [61] also led to a decrease of LTR7 cis-regulatory activity. Overall, inducing TFs tied to development and differentiation dampened the pluripotency-specific activity of LTR7 elements.

In contrast, we found rare transgenes (9/441) whose overexpression in primed hESCs led to an increase in LTR7 cis-regulatory activity (Additional file 8, Fig. 3A, Table S7). Overexpressing KLFs collectively increased the cis-regulatory activity of LTR7Y elements in agreement with previous studies characterizing KLFs as inducers of LTR7Y enhancer activity in naïve hESCs [5]. Interestingly, induction of KLF5 — but not of KLF1, KLF2, and KLF4 — increased the cis-regulatory activity of LTR7, matching previously reported visual inspections which revealed that among these, only KLF5-overexpressing cells retained an ESC-like phenotype 72 h post-induction [50]. By leveraging a large compendium of homogeneously reprocessed ChIP-seq data [62], we confirmed that KLF4 binding was enriched at LTR7 and LTR7Y in various contexts related to primed hESCs [5, 63] (Fig. 3B). MYB (also known as c-MYB), a TF involved in the maintenance of self-renewal in stem cells of the intestinal crypt, the bone marrow, and the nervous system [64] as well as the formation of stem-like memory CD8 T cells [65], led to a marked increase in LTR7 cis-regulatory activity upon induction and displayed a modest enrichment in binding at related LTR7C integrants in monocytic-derived THP1 cells [66] (Fig. 3B). Thus, MYB overexpression may reinforce self-renewal in the GRN of hESCs, a process tied to an increase in LTR7 cis-regulatory activity. More provocatively, this hints at the possible involvement of a MYB-LTR7 axis in the maintenance of self-renewal and stemness in the adult hematopoietic system. Our analysis thus suggests that a limited set of TFs linked to development and stemness may rely upon the enhancer potential of the LTR7 subfamily to establish, regulate, and maintain these processes throughout development and adult life.

Other primate-specific TE subfamilies displayed partially overlapping patterns of cis-regulatory activity upon transgene overexpression. SVAs and LTR5-Hs were collectively activated by KLF4 and other KLFs (Fig. 3A, Table S7) and enriched for KLF4 binding in hESCs (Fig. 3B), consistent with previous work establishing the KLF4-dependent enhancer activity of these subfamilies in naïve hESCs [5]. Interestingly, the cis-regulatory activity of LTR5-Hs and SVAs also increased upon overexpression of TFAP2C and NR5A1, both of which polarize hESCs towards the naïve state [67, 68]. Overall, these results suggest that recently emerged TE subfamilies form functional collections of enhancer-like CREs during pre-gastrulation embryogenesis.

We then wondered whether the overexpression of transgenes necessary for embryonic development during and after gastrulation was associated with an increase in cis-regulatory activity in recently emerged TE subfamilies. Overexpression of the core meso/endodermal TF GATA6 as well as other GATA family members increased the cis-regulatory activity of SVAs and LTR5-Hs (Fig. 3A, Table S7), thereby resulting in the activation of a TE-dependent cis-regulatory network partially reminiscing that of naïve hESCs [5, 6]. This is surprising given that naïve hESCs resemble cells of the early blastocyst while GATA6 controls post-implantation developmental stages such as the formation of the mesoderm and the endoderm during gastrulation. Furthermore, overexpressing GATA family members increased the cis-regulatory activity of additional primate-specific TE subfamilies including the LTR5-Hs-related HERV-K subfamily LTR5B and the ERV1 subfamilies LTR6A, LTR6B, PRIMA4-LTR, MER4A1, MER4D, and MER4D1. Importantly, LTR6B displayed the largest and most statistically significant increase in TE-dependent cis-regulatory activity along stem cell to endoderm differentiation across two independent datasets (Fig. 3D, S3F) [69,70,71]. Moreover, GATA6 KO [69] reduced LTR6B cis-regulatory activity in differentiating endodermal cells, whereas GATA6 re-expression rescued it (Fig. 3D). Along the same lines, GATA2 deletion in hematopoietic progenitor cells [72] decreased the cis-regulatory activity of LTR5-Hs and LTR5B (Fig. S3D). Of note, GATA ChIP-seq peaks [62, 70] were strongly enriched at SVAs, LTR5-Hs, LTR5B, LTR6A, LTR6B, PRIMA4-LTR, MER4A1, MER4D, and MER4D1 integrants across primitive streak-derived [73] as well as mesendodermal [70], mesodermal [74] — including blood-derived [75,76,77,78] — and placental lineage [79] cells, with enriched binding at SVAs, LTR5-Hs, LTR5B, LTR6A, and LTR6B extending to the endodermal lineage [80, 81] (Fig. 3B). Together, these patterns of binding suggest that the changes in cis-regulatory activity observed upon GATA overexpression in hESCs and meso/endodermal differentiation result from the direct binding of GATA family members to primate-specific TE subfamilies. Interestingly, overexpressing EOMES, another regulator of germ layer formation and mesoendodermal differentiation [82], markedly increased the cis-regulatory activity of LTR6B elements (Fig. 3A), at which it also displayed enriched binding in hESC-derived mesendodermal cells [70] (Fig. 3B). Moreover, overexpression of SOX17, an additional regulator of endodermal differentiation [83], increased the cis-regulatory activity of LTR5-Hs, SVA-C, and LTR6B (Fig. 3A), while SOX17 binding was strongly enriched at SVAs, LTR7, and MER4A1 in germ cell-derived Tcam-2 cancer cells [84] (Fig. 3B), which share some phenotypic features with primordial germ cells (PGCs).

Evidence linking transcription during post-gastrulation embryogenesis with primate-specific TE-mediated cis-regulation extended beyond SOX17 to TFAP2C and SOX15, which both display elevated expression in the PGC lineage [85]. Specifically, overexpression of SOX15 in hESCs markedly increased the cis-regulatory activity of LTR5-Hs (Fig. 3A, Table S7), the latter exhibiting the largest statistically significant increase in cis-regulatory activity in human PGC-like cells (hPGCLCs) compared with cognate hESC-derived somatic cells (Fig. 3C) [86]. Knocking out SOX15 in hPGCLCs led to a drop in LTR5-Hs cis-regulatory activity across two biological replicates (Figs. 3C, S3E), while SOX15 binding was strongly enriched at LTR5-Hs in hPGCLCs (Fig. 3B). Additionally, inducing TFAP2C in hESCs increased the cis-regulatory activity of LTR5-Hs and SVAs (Fig. 3A), both of which displayed a considerable enrichment in TFAP2C binding in hPGCLCs [87] and, interestingly, in cells of the ectoderm lineage [88, 89] (Fig. 3B). In summary, evolutionarily recent and pre-implantation specific TE subfamilies form sets of CREs that regulate the expression of protein-coding genes in cis well past the epiblast stage, including during and after gastrulation, as evidenced firstly by increased cis-regulatory activity following germ layer-specific TF overexpression in hESCs, secondly by enriched TF binding in cells derived from the corresponding germ layers and thirdly by substantial stage-specific increases in cis-regulatory activity which were reverted upon germ layer-specific TF KO.

Older TE subfamilies that emerged prior to the speciation of primates also contribute to GRNs by donating CREs. Despite having spread before the speciation of amniotes hundreds of millions of years ago, AmnSINE1 elements are retained in the genomes of extant amniotes including humans and mice [90], and some AmnSINE1 elements were found to exert long-range enhancer effects on genes controlling brain development [91]. We observed that in primed hESCs, overexpression of several homeobox domain-containing TFs, e.g., RUNX1, a regulator of hematopoietic ontogeny [92], and PDX1, involved in pancreatic development [93], was associated with an increased AmnSINE1 cis-regulatory activity (Fig. 3A, Table S7). Interestingly, AmnSINE1 elements are enriched within active enhancers in epigenomes derived from fetal human cell lines [21]. Lastly, MER135, an ancient subfamily of currently unidentified origin [94] showed increased cis-regulatory activity upon overexpression of homeobox domain-containing TFs in primed hESCs (Fig. 3A). More generally, these results hint that ancient TE subfamilies may retain their cis-regulatory potential at the subfamily level in extant species despite having colonized the genome of an evolutionarily distant common ancestor.

Cis-regulatory activities are more pronounced at epigenetically active TEs

We showed that craTEs agnostically uncovers SVAs and LTR5-Hs as the subfamilies with the most statistically significant and strongest loss of cis-regulatory activity upon CRISPRi-mediated epigenetic repression in naïve hESCs (Fig. 1). However, it is highly likely that only a fraction of all integrants constituting a subfamily truly exert cis-regulatory effects. For example, integrants found within dynamic chromatin regions may be more differentially active than integrants located in stable chromatin regions. To empirically verify this hypothesis, we leveraged the epigenomic profiles matched to the LTR5-Hs/SVA CRISPRi RNA-seq dataset [33] and labeled the following integrants as “functional”: those overlapping with genomic coordinates where loss of chromatin accessibility (ATAC-seq) or gain of the repressive histone mark H3K9me3 (ChIP-seq) were detected upon epigenetic repression of SVAs and LTR5-Hs. Conversely and by complementarity, we considered all other integrants from these subfamilies as “non-functional.” We then expanded the weighted susceptibility matrix N (\(L = 250\) kb) column-wise by splitting TE subfamilies into complementary functional and non-functional integrant subsets (Fig. 4A). Finally, we used craTEs to jointly estimate the differences in cis-regulatory activity for functional and non-functional subsets of TE subfamilies upon epigenetic repression of LTR5-Hs and SVAs in naïve hESCs (Fig. 4B). Functional subsets of SVAs and LTR5-Hs subfamilies displayed greater decreases in cis-regulatory activity upon epigenetic repression than complementary non-functional subsets. In addition, the estimated decrease in cis-regulatory activity was more pronounced for the subset of functional LTR5-Hs than that estimated for the corresponding unsplit subfamily. Of note, functional LTR5-Hs and SVA integrants tended to show slightly lower mappability scores [95] than non-functional integrants (Fig. S7A), raising the concern that difficulties in ChIP-seq and/or ATAC-seq read assignment at repeats [96] may drive functional versus non-functional integrant calling, thereby biasing TE subfamily cis-regulatory activity estimates. However, low mappability functional LTR5-Hs/SVA integrants still exhibited larger and more statistically significant decreases in estimated cis-regulatory activity upon CRISPRi than their low mappability non-functional counterparts (Fig. S7A).

Fig. 4
figure 4

Cis-regulatory activities are more pronounced at epigenetically active TEs. A Overview of the procedure whereby TE subfamilies are split between so-called “functional” and “non-functional” fractions based on additional evidence, e.g., differential chromatin accessibility. The regulatory susceptibility scores tying TE subfamilies to protein-coding genes are distributed between the functional and non-functional fractions of each TE subfamily, leading to an experiment-specific column-wise expansion of N. Concretely, functional and non-functional fractions of TE subfamilies are treated as independent TE subfamilies in the subsequent cis-regulatory activity estimation process. B Estimated differences in cis-regulatory activity for the functional (in red) and non-functional fractions (in blue) of LTR5-Hs and SVA subfamilies under CRISPRi-mediated epigenetic repression in naïve hESCs [33]. The cis-regulatory activities for the unsplit subfamilies were estimated in a separated iteration of craTEs, using the standard distance-weighted N matrix (\(L=250\) kb), and are shown in black. The dotted line represents the significance threshold of BH-adjusted \(p.val = 0.05\). Note that even though only selected subfamilies are plotted for clarity, all TE subfamilies were included in the fitting process. C Estimated differences cis-regulatory activity for the functional and non-functional fractions of selected TE subfamilies according to definitions of the functional state that are either based on differential chromatin states (1st and 3rd panels from the left) or differential TF binding (2nd and 4th panels from the left) at integrants [33]. D Estimated differences in cis-regulatory activities for the functional (bound by both GATA6 and EOMES [70]) vs. non-functional fractions of selected TE subfamilies during hESC-derived endoderm differentiation, 48 h vs. 24 h, \(n=3\) [70] (left), functional (GATA6-bound [70]) vs. non-functional fractions of selected TE subfamilies upon GATA6 KO in iPSC-derived mesendoderm, \(n=2\) [69] (center) and GATA6 rescue in GATA6 KO iPSC-derived mesendoderm, \(n=2\) (right). E Estimated differences in cis-regulatory activity for the functional (SOX15-bound) vs. non-functional fractions of selected TE subfamilies between DP hESC-derived hPGCLCs and DN somatic cells, day 6, \(n=2\) [86] (top) and SOX15 KO in DP hESC-derived hPGCLCs (bottom). F Multiple sequence alignment (MSA) of all 152 LTR6B integrants considered by craTEs (central white rectangle). Gray patches within the central white rectangle indicate gaps. Sequences at loci found in gray rectangles flanking the MSA region are shown for convenience and were not aligned. The intensity of GATA6 (left, \(n=1\)) and EOMES (right, \(n=2\)) ChIP-seq signal is indicated at the corresponding genomic loci. The fraction of sequences adorned with ChIP-seq signal for each position is shown on top. Consensus sequences found underneath high-density ChIP-seq signal regions (\(>\frac{1}{3}\) of sequences overlapping ChIP-seq reads) with the highest density of GATA6 (underlined), resp. EOMES (entire consensus) signals are reported, with GATA consensus DNA-binding sites in bold

Next, we leveraged the epigenomics-informed adaptation of craTEs to test whether differences in TF binding or histone marks could single out integrants with detectable changes in cis-regulatory activity in the context of TF overexpression experiments. To this end, we completed the matched transcriptomics and histone profiles available for KLF4 and ZNF611 overexpression in hESCs [33] by generating ChIP-seq profiles against KLF4 and ZNF611. We used craTEs to estimate differences in cis-regulatory activity for functional integrant subsets defined according to differences in histone marks or TF binding and focused on the main cis-regulatory TE subfamilies identified under KLF4 and ZN611 overexpression in hESCs, namely LTR7, LTR5-Hs, and SVAs (Fig. 4C). For both KLF4 and ZN611 overexpression experiments, histone mark-defined functional subsets had greater cis-regulatory activity than non-functional subsets, except for SVA-D under KLF4 overexpression, as well as SVA-A under ZNF611 overexpression. However, histone mark-defined functional subsets generally displayed only modest increases in cis-regulatory activity over unsplit subfamilies, at the cost of a marked decrease in statistical significance. In contrast, TF-bound-defined functional integrants displayed increased cis-regulatory activity compared with the unsplit subfamily for all subfamilies except SVA-D under KLF4 overexpression, with increased significance in most cases, including when matching integrants for mappability prior to functional versus non-functional subfamily splitting (Fig. S7B). Turning to cellular contexts featuring endogenous TF expression levels, we compared the usefulness of mesendodermal GATA6, EOMES, and H3K27ac ChIP-seq peaks [70] for delineating the most active fraction of TE subfamilies in terms of cis-regulation. GATA6 and EOMES binding, whether considered individually or in combination, proved remarkably informative for discriminating cis-regulatory LTR6B integrants from their inactive counterparts during endoderm differentiation (Figs. 4D, S6A), and conversely aptly accounted for the decrease, resp. increase, in LTR6B, PRIMA4-LTR, MER4D, MER4D1, and LTR5-Hs observed upon GATA6 KO, resp. rescue, in iPSC-derived endodermal cells [69] (Fig. 4D). In contrast, H3K27ac peaks failed to single out cis-regulatory LTR6B integrants in this context. Indeed LTR6B integrants devoid of the canonical enhancer histone mark exerted a more statistically significant cis-regulatory activity than LTR6B integrants overlapping H3K27ac peaks (Fig. S6A). Similarly, SOX15 CUT &TAG peaks [86] pinpointed LTR5-Hs integrants displaying increased cis-regulatory activity in hPGCLCs versus cognate somatic cells, a trend that was inverted upon SOX15 KO in hPGCLCs (Fig. 4E). As implied by differing patterns in statistical significance at functional versus non-functional LTR5-Hs integrants, SOX15 binding better isolated cis-regulatory active LTR5-Hs than chromatin accessibility as assessed by ATAC-seq, although both epigenomic signals were in general agreement (Fig. S6B). Importantly, we did not observe any noticeable difference in mappability between functional and non-functional integrants in ZNF611 overexpressing cell, endodermal cell [70], or hPGCLC-derived [86] epigenomic data (Fig. S7C–E). Thus, TF binding appears to better single out bona fide cis-regulatory integrants than changes in histone marks.

As both EOMES and GATA6 binding were associated with increased LTR6B, LTR6A, LTR5-Hs, MER4D, MER4D1, and PRIMA4-LTR cis-regulatory activity in the differentiating endoderm (Fig. 3B), we performed multiple sequence alignment (MSA) of subfamily-restricted ChIP-seq peaks in search of EOMES and GATA6 TF binding motifs. MSA of GATA6 and EOMES peaks at LTR6A and LTR6B revealed consensus sequences containing several canonical GATA binding sites (Fig. 4F, Additional file 10). Fittingly, the DNA sequence of most LTR6B integrants adorned with GATA6/EOMES peaks contained recognizable GATA6 DNA binding motifs (Fig. S6C), as assessed through motif search [97]. Of note, MSA of EOMES peaks revealed additional GATA6 motifs in per-subfamily consensus sequences at LTR5-Hs, MER4D, MER4D1, and PRIMA4-LTR (Additional file 10), while we failed to call consensus sequences from GATA6 ChIP-seq peaks at these subfamilies. This may stem from differences in ChIP-seq protocols and/or quality. Intriguingly, MSA of EOMES-bound TE regions did not reveal any canonical EOMES DNA binding site, a finding supported by the absence of EOMES motifs at LTR6B integrants subjected to motif search (Fig. S6C). This suggests that GATA6 may first bind at primate-specific TE subfamilies in the differentiating endoderm, to then promote EOMES recruitment at these loci. Accordingly, LTR6B and MER4D1 integrants carrying GATA6 DNA-binding motifs were more cis-regulatory than those devoid of the motif in the differentiating endoderm, whether considering statistical significance or effect size (Fig. S6D). The presence of a GATA6 motif also separated truly cis-regulatory from non-cis-regulatory LTR6B integrants upon GATA6 KO/rescue in the differentiating endoderm but proved less discriminant than ChIP-seq-derived GATA6/EOMES binding at LTR6A, LTR5-Hs, MER4D, and MER4D1 (Fig. S6D). Overall, the agreement between GATA6/EOMES, resp. SOX15 binding, and primate-specific TE-mediated cis-regulation in endodermal fetal cells, resp. hPGCLCs, suggests that recently evolved TEs spread functional cis-regulatory platforms at which core TFs controlling post-gastrulation embryogenesis directly bind, in turn affecting protein-coding gene expression.

Description of Image

Source link