A naive analysis suggests technical issues in dosage compensation detection
We sought to identify the sources of apparent dosage compensation, both technical and molecular. To this end, we focus on lymphoblastoid cell lines derived from a family of individuals where one child has Down syndrome (Fig. 1A). We reasoned that if dosage compensation did occur, it would either occur via inhibition of transcription or via an increase in the RNA degradation of that transcript. Thus, we examined both steady-state RNA levels via RNA-seq and nascent transcription with global run-on sequencing (GRO-seq), in triplicate (see the “Methods” section, Additional File 2: Supplemental Table 1 for sequencing information, Additional File 1: Figs. S1 and S2).
As an initial baseline, we first examined the typical differential analysis pipeline that leverages DESeq2 . In this naive analysis, we make no adjustments to the defaults inherent to programs within the pipeline. Using the naive approach, we find the median fold change (MFC) of all genes on chromosome 21 in RNA-seq is 1.41, with 57.6% of these genes having a fold change below the expected 1.5-fold change. The trends in GRO-seq are similar, with an overall chromosome 21 median fold change of 1.38 and 48.8% of genes below the expected 1.5-fold change (Additional File 1: Fig. S3). Consistent with previous reports, we also observed many genes which were significantly above gene-dosage levels (e.g., interferon-related genes MX1 and MX2; see also Additional File 1: Figs. S4, S5, S6, S7).
To identify genes as dosage compensated, we must specify specific criteria — i.e., how much below the expected 1.5× levels is unusual? First, we considered a fold change cutoff. In this idealized case, if two populations of genes exist (i.e., dosage compensated and not compensated) one would expect to see two distinct but overlapping distributions of fold change in the data. To examine this, we generated a cumulative distribution function of fold change for all chromosome 21 genes in both RNA-seq and GRO-seq (Additional File 1: Fig. S3, red line). We noted that the majority of genes were distributed around 1.5; however, no apparent secondary population was immediately clear — instead, we observed a smooth distribution of fold changes below 1.5, suggesting fold change is distributed on a continuum. While this distribution does not in and of itself disprove the existence of transcription dosage compensation, these results argue that there is no obvious cutoff for identifying dosage-compensated genes. Indeed, these results are in agreement with Hwang et al. findings that dosage-compensated genes (defined in their approach as FDR q-value< 0.01) are rare in trisomy RNA-seq datasets . This led to the question of how trisomy data influences differential expression and the assessment of dosage compensation. To address this question, we turn our attention to dissecting the typical analysis pipeline, using DESeq2 as a representative technique, in order to identify how T21 influenced these results.
Simulations reveal the technical basis of reduced fold change calculations in trisomic datasets
To carefully assess the impact of trisomy data on the standard differential expression analysis, we need to know a priori the correct answer. To achieve this, we simulated T21 and D21 data sets, using the child with D21 data as a reference. Briefly, we used the D21 data along with parameters (such as variance) utilized by DESeq2 to create an artificial gene counts table (Fig. 1B, see the “Methods” section for full details). The simulated T21 individual was generated in the same way, but now all genes on chromosome 21 are at a 1.5× increase from the simulated D21 individual.
We then run the standard analysis pipeline (Fig. 2A) on the simulated data, calculating the fold change and its significance for all genes from each simulation. First, we run the simulation using the parameters from the naive analysis pipeline (Fig. 2B) and even though the chromosome 21 genes in the simulated data are at 1.5×, we observe distributions and median fold change similar to the biological data (MFC = 1.4). Next, we seek to isolate individual components of the differential expression pipeline (Fig. 2A) by running the simulation several times, modifying the count table’s read depth, replicate number, or variance to test each variable’s effect on both differential expression and fold change estimation. For simplicity, we present only the distributions for chromosome 21 (expected to be at 1.5×) and chromosome 22 (expected to be at 1×; all other disomic chromosomes showed the same results as chromosome 22).
Sequencing depth and read counting methodologies
Tools such as DESeq2 seek to quantify within-condition variability using principled models of read-count data to determine changes between conditions that exceed expected variability . As such, the key first characteristics of sequencing data is the number of replicates available per condition. Therefore, we first examined the impact of replication, assuming initially that all replicates are of relatively high quality (low variance). To this end, we simulated data with varying replication, between 2 and 25 replicates per condition. We found that the median fold change estimation generally increased with additional replicates (Fig. 2C), consistent with the notion that additional replication is always a good strategy.
The parallel concern to replication is sequencing depth. It is important to note that nascent transcription typically has lower overall counts per gene than RNA-seq when the two protocols are sequenced to roughly equivalent depths. This arises because a much larger fraction of the genome is transcribed than is stable. Consistent with this, we noticed that low fold change estimates correlated with low expression levels (Additional File 1: Fig. S8) in actual data. Therefore, we next simulated datasets with varying depths, ranging from 0.1 times to 10 times the depth of the original data. We found that decreasing the depth of the simulated datasets resulted in a decreased fold change estimation for many genes, with concomitant reduced median fold change estimates (Fig. 2D). Thus, it is important to have adequate depth for accurate gene fold change estimates. However, additional sequencing is not an option when reanalyzing public data sets and in some cases, increased sequencing depth can be cost-prohibitive. Importantly, increased depth is not expected to fully alleviate the fold change estimation issue; as with increased depth, unexpressed genes are more likely to have reads assigned to them due to noise. Thus we suggest users employ a minimum coverage filter to remove low signal genes as potential false positives when determining which genes are potentially dosage compensated.
In these initial simulations, we noticed that genes with the most dramatic apparent dosage compensation in our naive analysis disappeared in the simulations. Specifically, highly expressed chromosome 21 genes with many genomic repeats, such as ribosomal genes, often appeared at typical expression levels in our real datasets (Additional File 1: Fig. S9), but not in our simulated data. This suggests that repeat regions shared between chromosome 21 and other chromosomes are sensitive to the mapping strategy. During mapping, these reads can be sponged away from chromosome 21 genes, resulting in a lower fold change estimation. This is, of course, dependent on the employed mapping strategy and how multiple mapped reads are handled. Combining a minimum read cutoff and masking repeat regions or removing multi-mapping reads effectively removed many of these genes as false positive dosage compensations. We suggest that users filter these genes from their list before dosage compensation analysis, either by masking repeat regions before counting reads or manually removing genes with a high number of genomic repeats before subsequent analysis .
Size factor calculation for sample normalization
After counting reads, the next step is to normalize the data between libraries (step II in Fig. 2A). Normalization accounts for differences in sequencing depth between samples and is crucial to proper differential analysis. DESeq2 utilizes a median-of-ratios method to find a normalizing “size factor” for each library . In short, DESeq2 assumes that the majority of genes are similarly expressed from sample to sample. By calculating the ratio between the counts for a gene in one sample versus the mean count in all samples and then finding the median of these ratios, samples can be effectively normalized to the genes most likely to remain unchanged in all samples. We thus wondered if this normalization method was being influenced by the trisomic condition of one of our samples, where we expect a priori for a set of genes to be changed between the two samples simply because they reside on the aneuploid chromosome.
To investigate the impact of trisomy on size factor estimation, we removed chromosome 21 from both the actual and simulated data sets. In both cases, chromosome 21 genes had only a minimal effect on size factor calculation (Additional File 1: Fig. S10), consistent with the relatively small proportion of genes on chromosome 21 (approximately 1%) compared to the rest of the genome. Importantly, this result was robust to overall sequencing depth, which we showed by modulating the sequencing depth of our simulations. While the empirical result suggests including chromosome 21 genes in size factor estimation has little effect on the results, we nevertheless recommend removing chromosome 21 genes for the size factor calculation, as this is more consistent with the theory behind the median-of-ratios method.
Dispersion estimation and sample replication
We next investigated the effects of the model fitting process of DESeq2 on fold change estimation in T21 cells. In general, gene expression is estimated by fitting a negative binomial distribution with two parameters: the mean and the dispersion (both of which inform the variance of the distribution). Both values can be inferred directly by maximum likelihood estimation, which calculates the mean expression level for each group of replicates, and then determines the dispersion value. However, this method is susceptible to error at lowly expressed genes or when a low number of replicates are available, as systemic noise begins to dominate . More optimal methods employ a Bayesian process allowing information sharing across multiple genes and replicates. In particular, DESeq2 assumes that genes with similar expression levels exhibit similar dispersion values. Information can thus be shared across genes and samples to more accurately estimate gene-wise dispersion, which increases the fidelity of fold change estimation and dispersion calls even within noisier data. However, if this assumption fails — i.e., if a cluster of similarly expressed genes has higher dispersion than expected – the resulting fits will not accurately reflect the underlying biology. We thus endeavored to answer this question: does the presence of a T21 sample affect the model estimation steps of differential analysis?
The default method used in DESeq2 for calculating gene-wise dispersion (step III in Fig. 2A) involves starting with the maximum likelihood estimate of dispersion, plotting these values against expression levels, and then fitting an asymptotic curve in the form of \(y = a + b/x\) . Here, a and b are the fitting parameters, y is the dispersion estimate, and x is the gene expression level. The parameters a and b control the two ends of the curve; for low-expression genes, the parameter b will be more meaningful, leading to higher dispersion estimates for the fitted curve at these genes, a phenomenon we refer to as extra-Poisson noise. For high-expression genes, the b/x term trends toward 0, and thus the fitted curve will asymptotically approach the value of parameter a, a trend we refer to as asymptotic dispersion. The resulting fitted curve is then used to inform one more round of dispersion estimation, effectively shrinking gene-wise dispersion towards the fitted curve. An increase in either parameter increases the value of the initial dispersion estimate. Still, the effects are felt asymmetrically depending on the expression level and the amount of information available to each gene (i.e., replicates and sequencing depth).
To determine the effects of T21 on dispersion estimation, we extracted the fitting parameters used in dispersion estimation from our biological data sets, with and without chromosome 21 genes. As a control, we also compared removing an equivalent number of random genes from the other chromosomes (see the “Methods” section). We observed an increase in both fitting parameters relating to gene-wise dispersion when trisomic genes were included (Additional File 1: Fig. S11), leading to the question: how does this increase in the initial dispersion fit affect differential calls and fold change estimates?
To address this question, we generated data sets sampled from a negative binomial distribution with the same means but varying dispersion values for chromosome 21 genes in the T21 simulations. When we compared across a wide range of dispersion estimates, we noted higher dispersion parameters resulted in decreased fold change estimations for chromosome 21 genes (Fig. 2E, F; Additional File 1: Fig. S12). Specifically, the distribution of fold change drastically shifted towards 1.0, albeit with a broader spread (MFC = 1.28). The effects of higher dispersion on fold change estimation could be partially offset in simulated data by adding more replicates and depth (Additional File 1: Fig. S13), consistent with the fact that more replication and read depth are critical to overcoming the effects of systemic noise. Notably, the fact that a high number of replicates creates a MFC closer to 1.5× may also contribute to the disparate results regarding dosage compensation reported in previous studies; large-scale studies that integrate several available data sets result in more confident fold change estimation indicating no dosage compensation (i.e., most genes are at the expected 1.5× fold change) .
Fold change shrinkage and hypothesis testing
The final steps of differential expression analysis are fold change estimation and hypothesis testing. Here, DESeq2 provides the option to utilize a Bayesian method for fold change estimation, using a prior distribution centered around a fold change of 1.0, which effectively shrinks fold change estimates towards 1.0. As with dispersion estimation, the resulting shrinkage effect is more substantial for low-expression genes. These estimates (known as maximum a posteriori or MAP estimates) are generally considered more reliable than MLE calculations of fold change for low expression genes . However, this assumption can fail if the prior distribution does not represent the underlying biology, as is true for trisomic genes (Additional File 1: Fig. S15). Researchers who utilize MAP estimates will thus note more genes that appear dosage compensated, although this apparent “compensation” is mainly due to the shrinkage effects of the prior distribution. In general, we recommend users exercise caution when interpreting MAP estimates as evidence for dosage compensation; genes that experience strong fold-change shrinkage should be filtered out from analysis, or MLE calculations should be used instead.
Hypothesis testing in the standard differential analysis pipeline uses the default null hypothesis that each gene’s expression levels are equal in both groups. Given that chromosome 21 genes exist in three copies and DESeq2 seeks to identify deviations from typical, one might expect most of chromosome 21 to be differentially expressed, e.g., statistically significant. However, when we ran our simulation using parameters similar to the real data (Fig. 2B, padj \(< .01\)), no genes on chromosome 21 are called as statistically significant. This arises because while the median fold change of chromosome 21 encoded genes is elevated (MFC = 1.4), the dispersion genome wide is high enough that this change is not deemed significant. In our simulations, we altered dispersion over a broad range (Additional File 1: Fig. S12) and while the median fold change varied (Fig. 2E, F), no genes on chromosome 21 were deemed statistically significant. In contrast, increasing replicates and read depth, even with higher dispersion values, did yield 126/262 differentially expressed chromosome 21 genes in RNA-seq simulations (MFC = 1.43) (Additional File 1: Figs. S13, S14). In all cases, significant genes showed fold change estimates near or greater than the expected 1.5-fold. Thus, with high replication DESeq2 detects the typical 1.5× transcription levels of chromosome 21 as statistically significant, but finds no genes with lower-than-expected expression levels.
Arguably, this arises because differential expression analysis is a distinct question from identifying dosage compensation. The default null hypothesis used in hypothesis testing is incorrect for identifying dosage-compensated genes, as all of chromosome 21 is expected to be elevated. Consequently, to identify dosage compensation using the standard differential expression pipeline it is necessary to either adjust the null hypothesis or the input data.
The first method is to change the fold-change threshold for the hypothesis tests of chromosome 21 genes from the default value of 1 to the dosage-informed value of 1.5, such that significant gene calls deviate from the DNA dosage-informed expectation. Under these tests, significant gene calls below a fold change of 1.5 are potential candidates for dosage compensation. In both our actual and simulated data, most chromosome 21 genes are not considered below levels expected by gene dosage when using this method (Additional File 1: Fig. S16). However, we note that this method cannot reliably utilize the MAP estimates of fold change, as the prior distribution does not reflect the new alternative hypothesis.
The second method, which we prefer, is to perform an additional normalization step before the differential analysis. In DESeq2, the ploidy number of each gene in each sample can be loaded into a normalizing matrix. The resulting read counts are normalized both by the library’s size factor and the gene’s ploidy number. Subsequent fold change shrinkage and hypothesis testing can thus utilize the default parameters, as even trisomic genes are expected to exhibit a fold change of 1.0 under these conditions (Fig. 2G, Additional File 1: Fig. S17). Thus, significant genes with a fold change of less than 1.0 are candidate dosage-compensated genes. Given we simulated data with a 1.5-fold change, this approach correctly finds no significant chromosome 21 genes in our simulated data, regardless of the dispersion parameters used (Additional File 1: Fig. S18).
In biological data, ploidy normalization brings the distribution of chromosome 21 fold change estimates in line with other chromosomes (Fig. 3A–C, Additional File 1: Fig. S19). Thus chromosome 21 genes exhibited an MFC of 0.96 with dosage normalization in RNA-seq (Fig. 3C). Furthermore, 1009 genes were considered differentially expressed between the two brothers, of which only five were on chromosome 21 (out of 143 total chromosomes 21 genes, after read count filtering). In GRO-seq analysis, chromosome 21 genes had an MFC of 0.97 and 3820 genes were differentially expressed, of which 20 genes fell on chromosome 21 (out of 144 total chromosome 21 genes, after filtering) (Additional File 1: Fig. S20). We note that the proportion of significant genes on chromosome 21 is similar to those obtained when we compare two disomic individuals; the distribution of fold changes and the number of differential genes are consistent so long as the reads are normalized by the ploidy number (Additional File 1: Fig. S21). Normalizing by ploidy is additionally advantageous, as MAP estimates of fold change can be utilized for visualization and downstream analysis. Furthermore, subsequent power analyses are more relevant to trisomic data once the counts have been adjusted .
In all, our simulations led us to the construction of a differential analysis pipeline more suited to trisomic samples. First, we find that multi-mapped reads can cause a handful of genes to appear dosage-compensated; we thus suggest removing these reads prior to analysis. Next, we note that low expression genes are especially prone to appearing dosage compensated; we thus suggest employing a minimum coverage filter to avoid these false positives. Finally, our trisomic samples appeared to be noisier relative to their disomic counterparts; this variance resulted in lower fold change estimations for many chromosome 21 genes. As such, we suggest either employing a normalization matrix to adjust all gene counts by their ploidy number or using the appropriate null hypothesis to determine potential dosage compensation in trisomy. Rather than relying on cutoffs, our trisomy-aware pipeline uses DESeq2’s significance calls to determine which gene expression levels are significantly different than their expected values based on gene dosage (Fig. 3C, D, Additional File 1: Fig. S22). Furthermore, by accounting for the genes which contribute the most variance in fold-change calculations, we remove likely false positives from the analysis. We thus contend that our pipeline is a more rigorous determinant for dosage-compensated genes which properly accounts for the underlying variance of the data.
Reduced fold change on chromosome 21 are consistent with identified eQTLs
After adjusting the data analysis pipeline to be trisomy-aware, we conclude that nearly all chromosome 21 encoded gene transcription levels (GRO-seq) and expression levels (RNA-seq) are proportional to DNA dosage(Additional File 1: Fig. S18). However, a small number of genes on chromosome 21 remain potentially dosage compensated, as precisely five genes in RNA-seq and 20 genes in nascent transcription (Fig. 3D, Additional File 1: Fig. S20) have lower-than-expected levels.
Given the small number of genes remaining and the lack of known dosage compensation mechanisms in humans, we reasoned that there might be a genetic basis for the reduced expression of these genes. More concretely, we hypothesize that apparent dosage compensation could arise from allele-specific sequence variation. To explore this possibility, we sought to compare the genome sequence of these individuals to known expression quantitative trait loci (eQTL) where a specific sequence variant is associated with lower expression within the population . Thus we performed whole genome DNA sequencing on the cell lines of all four family members (see the “Methods” section for full details; Additional File 2: Supplemental Table 1 for sequencing information).
We used the GATK package to call single-nucleotide polymorphisms (SNPs) in each member of the quartet . Briefly, reads were mapped and realigned before variants were called using Haplotype caller (see the “Methods” section for full description). Because chromosome 21 is triploid in the individual with Down syndrome, Haplotype caller was used twice — once with the default ploidy of two and once with ploidy set to three. Variants called using the ploidy of three versions were kept only for chromosome 21 in the individual with Down syndrome; otherwise, the default calls were used.
We reasoned that any genetic variation that reduces chromosome 21 gene expression in the general population could be present in our T21 sample, leading to reduced expression levels in the individual. To test this hypothesis, we compared genome variations identified in each individual to previously identified eQTLs [21, 26]. We limited our search to eQTLs identified in either lymphoblastoid cell lines or their nearest related tissue (whole blood). For this analysis, we required the variant to be identifiable in at least one of the parent samples as well as the child with trisomy 21. In RNA-seq, we found that the reduced expression of three of the five apparently dosage-compensated genes (CLIC6, ITSN1, C2CD2) could potentially be explained by known expression controlling polymorphisms (Fig. 3D, Additional File 3: Supplemental Table 2). For example, the rs1075704 eQTL exists in the human population as a GG, GT, or TT (Fig. 3E), and the T allele correlates with lower expression of CLIC6 in lymphoblastoid cells. This SNP, rs1075704, shows variable CLIC6 gene expression across the disomic samples in both our RNA-seq and GRO-seq (Fig. 3F). In our trisomic sample, CLIC6 has an expression level less than 1.5× the average of all disomics, with the genotype TTT. So the lower-than-expected expression observed at CLIC6 can be explained by the genotype, which is not considered by the typical differential expression pipeline. Consistent with the allele identity, these three genes also have lower-than-expected transcription in GRO-seq.
We also reasoned that the eQTL data in GTEx may be incomplete, that there may exist other alleles that lead to reductions in expression data. To explore this possibility, we next compared the two parents to each other, reasoning that differences observed between the parents could be subsequently inherited by either of the two children. Using this technique, we found that 9 of the remaining 17 genes in GRO-seq were also differentially transcribed between the parental samples, suggesting the lower than expected levels in the trisomy sample may be modulated by an inherited parental haplotype (Additional File 1: Fig. S22). Altogether, we found reasonable explanations for most genes (60%) which fell below expected levels in T21 (Fig. 3). These results were consistent in both RNA- and GRO-seq (Additional File 1: Fig. S22).