Scientific Papers

A validated heart-specific model for splice-disrupting variants in childhood heart disease | Genome Medicine


Study cohorts

Congenital heart disease (CHD) cases

The overall cohort included 1101 probands, of which 875 had tetralogy of Fallot (TOF) and 226 had dextro-transposition of the great arteries (TGA) (Table S1) [16]. Among these cases, 505 TOF and 226 TGA were enrolled through the Heart Centre Biobank Registry at the Hospital for Sick Children (Ontario, Canada), 245 TOF were enrolled through the Kids Heart BioBank at the Heart Centre for Children, The Children’s Hospital at Westmead (Sydney, Australia), and 125 TOF were enrolled through the CONCOR registry at the Amsterdam Medical Center (Netherlands).

CHD patients (n = 154) with both DNA for GS and myocardium for RNA-sequencing (RNA-Seq) were divided into a training set for model development, i.e., Discovery cohort (n = 106) and a set for testing, i.e., Validation cohort (n = 48). A second validation cohort included 43 unrelated cardiomyopathy probands from Ontario [6]. The Extension cohort for model application included 947 unrelated TOF (n = 721) and TGA (n = 226) probands with GS but without myocardium available for RNA-Seq. Two hundred thirty two family members with GS were additionally used for variant segregation analysis.

Collection and use of biospecimens through the registries was approved by local or central Research Ethics Boards and written informed consent was obtained from all patients and/or their parents/legal guardians and study protocols adhered to the Declaration of Helsinki.

Controls

The control cohort included 2570 genome sequencing (GS) data from the Medical Genome Reference Bank (MGRB) [17]. MGRB variants were obtained from the original publication, after alignment to GRCh37 and variant calling for all samples. Control cohort characteristics are provided in Table S1.

Genome sequencing processing, alignment, and variant calling

GS of CHD and cardiomyopathy cases was performed on high-quality DNA from blood or saliva of probands and their family members using the Illumina HiSeq X or NovaSeq platform by The Centre for Applied Genomics (TCAG, The Hospital for Sick Children, Toronto), or Macrogen (South Korea). Illumina TruSeq DNA PCR-Free kits were used for library preparation. GS samples from CHD probands were sequenced to a median average depth of 30.9X (range 12.9–46.6X). In order to identify putatively pathogenic and other DNA variants of interest, GS samples were further processed as follows:

For all Discovery cohort, Extension cohort, and cardiomyopathy Validation cohort GS samples, paired-end raw reads were trimmed and cleaned by trimmomatic v.0.32 [18], then mapped to human reference genome hg38 using bwa v.0.7.15 [19]. The reference genome sequence and training datasets were downloaded from the Genome Analysis Toolkit (GATK) resource bundle (ftp://ftp.broadinstitute.org/bundle) [20]. Mapped reads were realigned and calibrated by base quality score recalibration tools (GATK v4.1.2.0). HaplotypeCaller was used to generate genotype Variant Call Format (gVCF) files for each sample, then gVCF files for batches of samples were combined and joint-called by using CombineGVCFs and GenotypeGVCFs tools. In order to filter out probable artifacts in the calls, single-nucleotide variants (SNVs) and insertion-deletions (indels) were recalibrated separately by variant quality score recalibration (VQSR) tools, and variants that passed VQSR truth sensitivity level 99.5 for SNPs and level 99.0 for indels were retained. The VariantFiltration tool was used to mark out the low Genotype Quality (GQ) SNV and indel sites whose GQ values were lower than 20 and read depths were lower than 10. Copy number variants (CNVs) were called as described in [21], using ERDS [22] and CNVnator [23]. Structural variants (SVs) were called using Manta [24] and Delly [25]. Sample ancestry and relatedness among family members was estimated and verified using somalier v0.2.11 [26] with default parameters.

All CHD Validation cohort GS samples were processed using DRAGEN Bio-IT Platform v3.8.4 [27]. Paired-end reads were aligned to hg38 human genome reference (hg38-alt-aware-graph). Small variants (SNV and Indel), CNVs, and SVs were called according to the above DRAGEN workflow. Files in standard output format were generated, with crams for alignment and vcfs for small variants, CNVs, and SVs. Small variant calls were annotated using an Annovar-based workflow and the CNVs and SVs were annotated using custom scripts. Sample ancestry and relatedness among family members was estimated and verified using somalier v0.2.11 [26] with default parameters.

GS data for the MGRB control cohort were generated as previously reported [17]. Briefly, DNA was extracted from blood and Illumina TruSeq Nano DNA High Throughput kits were used for library preparation. Reads were sequenced on Illumina HiSeq X, then aligned to the 1000 Genomes Phase 3 decoyed version of build 37 of the human genome (GRCh37) using GATK best practices. GATK HaplotypeCaller was used to generate gVCFs for SNVs and indels, then joint-called in a single batch using GATK GenotypeGVCFs. All SNVs and indels were converted to hg38 using LiftoverVcf [28].

Identification and interpretation of pathogenic protein-coding variants

Variant identification

To identify putatively disease-causing variants in cases, we first searched for SNVs, indels, and CNVs that were classified as pathogenic or likely pathogenic according to the ACMG/AMP criteria [29, 30]. SNVs and indels were first annotated for pathogenicity using InterVar v2.0.2 [31]. Variants with internal Human Gene Mutation Database (HGMD) Pro 2019 [32] classifications of Disease-associated polymorphism with supporting functional evidence (DFP) were assigned a PS3 score, while variants with an internal classification of disease-associated polymorphism (DP) or disease causing mutation (DM) were assigned a PP5 score. Variants in CHD patients were then mapped to CHD genes.

Variant mapping to genes

CHD gene list: Tier 1 CHD genes were selected based on a moderate, strong, or definitive association with CHD according to ClinGen criteria[33]. We further annotated and categorized additional CHD genes using (i) published literature; (ii) existing databases including Online Mendelian Inheritance in Man (OMIM) [34], Clinical Genome Resource (ClinGen) [35], and CHDgene [36]; (iii) inclusion in clinical gene panels; and (iv) expert curation. Genes with a limited evidence for association with CHD were classified as Tier 2 genes. This yielded 99 Tier 1 CHD genes with moderate, strong, or definitive associations with CHD according to ClinGen criteria (17 isolated CHD genes, 82 syndromic CHD genes), and 626 Tier 2 CHD genes with limited association with CHD. Canonical transcriptional isoforms were annotated using Matched Annotation from NCBI and EMBL-EBI (MANE) [37]. Gene constraint annotations were obtained from the Genome Aggregation Database (gnomAD) (v2) [38] (Table S2). Cardiomyopathy gene list: Variants in cardiomyopathy patients in the validation cohort were mapped to Tier 1 and 2 cardiomyopathy genes that were annotated and classified using ClinGen criteria [35] as previously reported by us [6].

SNV and indel classification for pathogenicity

SNVs and indels classified by InterVar as “Pathogenic” or “Likely pathogenic” and occurring in Tier 1 or Tier 2 genes were subsequently manually reviewed. Gene inheritance and associated disease conditions were obtained from OMIM [34]. Variants in recessive genes were required to either have a homozygous or bi-allelic genotype. When full trio GS was available, parental data were used to determine phasing. Otherwise, when variants were in close proximity, DNA reads were searched to determine if variants occurred on different alleles. Ratios of observed/expected (o/e) loss-of-function (LoF) or missense variants for affected genes were obtained from the Genome Aggregation Database (gnomAD) v2.1.1 [38]. Population variant allele frequencies were obtained from gnomAD v3.1.2. Where possible, variant segregation among family members was considered. Variants in genes for dominant disorders had allele frequencies < 0.01% for PM2, between 1 and 5% for BS1, and > 5% for BA1. Variants in genes for recessive disorders had allele frequencies < 0.1% for PM2, between 2 and 10% for BS1, > 10% for BA1. The UCSC Genome Browser was used to investigate low mappability and RepeatMasker annotations [39]. Variant reads were manually inspected using the Integrative Genomics Viewer (IGV) [40] to exclude any likely false positive variants with insufficient evidence or insufficient read coverage, consistent with recommended best practices for reducing false-positive variant calls in clinical sequencing [41, 42]. Variants with a heterozygous genotype call and a variant allele fraction of less than 33% or greater than 66%, variants with < 20X coverage, and variants with many mismatched bases in nearby reads were excluded. ClinVar [43, 44] was used to search for any pre-existing classifications or other variants occurring at the same nucleotide or amino acid position.

CNV classification for pathogenicity

For deletions and duplications identified by CNV and/or SV callers, automatic filters were applied and variants were retained if they met the following criteria: (i) were absent from or present at 1% frequency or less in a database of CNVs/SVs, generated from Illumina HiSeq X sequencing data of parents of children with autism spectrum disorder at TCAG [45] and called using the same methodology, (ii) variants that overlapped with an exonic region, and (iii) overlapped with a gene in the CHD gene list (with the exception of de novo variants which were assessed even if they did not overlap a CHD gene). To reduce the number of false-positive CNVs, only variants called by both ERDS and CNVnator were retained. Each variant that passed these automatic filters was queried through the DECIPHER browser [46]. Variants that overlapped with benign/population variants as seen in at least 10 individuals in DGV [47] or gnomAD structural variants [48] were not further considered, depending on the suspected mode of inheritance. The remaining variants were visualized using either IGV [40] or Samplot [49], depending on their size and complexity, to confirm their authenticity.

To visually determine if a variant was a true positive, the read depth, insert size, and orientation of paired reads in IGV or Samplot were assessed. Variants of interest were required to be associated with a noticeable drop (for deletions) and increase (for duplications) in coverage compared to the surrounding region to pass visual inspection. Coloring alignments by insert size in IGV was used to recognize differences in expected versus observed insert size which was helpful in detecting deletions and insertions. Orientation of paired reads was used as evidence to support structural variants such as inversions. True inversions were accounted for by the presence of cyan (forward) and purple (reverse) reads on both breakpoints.

Variants that passed the above inspections proceeded to manual ACMG/AMP classification [30]. Intragenic CNVs/SVs were submitted to AutoPVS1 [50], to automatically assign a PVS1 criterion for haploinsufficient genes (i.e., genes with a probability of being loss-of-function intolerant (pLI) score greater than or equal to 0.9, a ClinGen dosage curation indicating haploinsufficiency, and/or literature evidence supporting a loss-of-function pathogenic mechanism). Complex variants (called as dual DUP-DEL, DUP-DEL-INV, etc.) were only considered if there was phenotype support for the genes harboring them. Literature and databases such as OMIM and ClinVar were surveyed to identify similar CNVs/SVs reported in individuals with the phenotype of interest. Parental genome read alignments, if available, were visualized together with the proband, when determining variant inheritance (i.e., whether the variant was de novo, parentally inherited, or unknown), for the aforementioned ACMG/AMP guidelines.

All putative pathogenic/likely pathogenic protein-coding variants that were identified in GS data, but which had not been previously reported upon clinical genetic testing, were evaluated by our return of results committee [51] and were subsequently re-confirmed using clinical genetic testing.

Identification of putatively splice-disrupting variants in GS data

Putatively splice-disrupting SNVs and indels were identified using SpliceAI [52], the tool recommended by the ClinGen Sequence Variant Interpretation Subgroup [13]. Exome annotations and splice junctions were similarly obtained from SpliceAI [52]. Pre-computed masked SpliceAI delta scores were utilized where possible (https://basespace.illumina.com/projects/66029966); otherwise, SpliceAI (v1.3.1) was used to generate masked delta scores with a maximum distance of 100 bp between the variant and gained/lost splice site. SNVs and indels with a “PASS” flag were extracted using bcftools v1.9 [53]. Variants with a SpliceAI delta score ≥ 0.2 were retained and subsequently annotated with the predicted effect (VEP v102 [54]), reported pathogenicity (ClinVar 2022–04-03 and HGMD Pro 2019), control allele frequency (gnomAD v2.1.1 and v3.1.2), gene constraint (gnomAD v2.1.1), genomic low complexity regions (https://github.com/lh3), genomic RepeatMasker regions (https://www.repeatmasker.org), and wild-type splicing branchpoints [55, 56]. Select variants were manually annotated with CADD-Splice PHRED scores [57], in order to assess the correlation between those scores and those derived by SpliceAI. Ensembl RNA transcripts were further annotated as canonical by MANE v1.0 (MANE Select or MANE Plus Clinical) [37]. These variants were then analyzed in corresponding myocardial RNA-Seq data from the same patient to determine if they were associated with splicing events.

Identification of aberrant splicing events in myocardial RNA-Seq

Myocardial RNA-Seq

To detect aberrant splicing events at the tissue level, RNA-Seq was performed on ventricular myocardial samples available from 154 unrelated TOF probands and 43 unrelated cardiomyopathy probands from SickKids Heart Centre Biobank. In patients who had consented to biobanking, myocardium was obtained from leftover tissue at the time of cardiac surgery and was immediately snap-frozen in the operating room and stored in liquid nitrogen. Among CHD cases, the median age at surgery was 0.5 (range 0.1–14.3) years. None of the patients received inhibitors of nonsense-mediated decay prior to tissue resection. Total RNA was extracted from myocardial samples using the RNeasy Mini kit (QIAGEN, Canada). RNA samples were sent to TCAG (The Hospital for Sick Children, Toronto) for ribosomal RNA depletion and library preparation using the Illumina Stranded Total RNA Prep Ligation with Ribo-Zero Plus, and sequenced using Illumina HiSeq 2500 or NovaSeq platforms to generate paired end reads of 150 bases. Raw sequencing reads were trimmed by Trimmomatic v0.36 [18] for quality trimming and adapter clipping. The remaining reads were aligned to the GRCh37 reference genome (1000 Genomes Project reference genome, hs37d5) using STAR (v2.6.1.c) [58] with basic two-pass mode and Ensembl GTF (release version 87) [59] was used for the annotation. Gene and transcript expression level quantification were prepared using RSEM (v1.2.22) [60].

Identification of aberrant splicing events in myocardial RNA-Seq data

Aberrant splicing events were identified in RNA-Seq data using FRASER v1.8.1 [61]. Introns with unreliable detection were filtered out using the “filterExpressionAndVariability” method with default parameters except for requiring a minimal read count of 15 in at least one sample. An “AE” beta-binomial denoising autoencoder was used to fit the splicing models, with hyperparameters ψ5 = 14, ψ3 = 11, and θ = 5 for the Discovery cohort, ψ5 = 6, ψ3 = 5, and θ = 2 for the CHD Validation cohort, and ψ5 = 5, ψ3 = 4, and θ = 2 for the cardiomyopathy Validation cohort. Cohorts utilized distinct hyperparameters due to their different sizes and properties. Optimal autoencoder hyperparameters for each cohort were determined using the “optimHyperParams” method. Splice events were annotated using biomaRt as part of the “annotateRanges” method [62]. Observed events were considered to be significant with a false discovery rate < 0.2, an absolute Z-score ≥ 1, an absolute Δψ/θ score ≥ 0.2, and ψ/θ − Δψ/θ ≤ 0.1 or ≥ 0.9. Events annotated as not mapping to a gene or to multiple genes were excluded. All reported splicing events in CHD genes were visually inspected. A large number of RNA splicing outliers annotated as being in MYH6 or between MYH6 and MYH7 were excluded, despite MYH6 being a Tier 1 autosomal dominant CHD gene. MYH6 and MYH7 are adjacent in the genome, and these two genes share highly homologous exons. Visual inspection of the anomalous RNA-Seq data demonstrated that reads were often being aligned between homologous MYH6MYH7 exon junctions, rather than being true splicing outliers, and were therefore excluded as likely false-positives.

Identification of gene expression outliers in myocardial RNA-Seq data

As aberrant RNA splicing may result in nonsense-mediated decay, gene expression outliers in RNA-Seq data were identified using OUTRIDER (v1.8.0) [63]. OUTRIDER was run on the Discovery (TOF), CHD Validation, and cardiomyopathy Validation cohorts separately. Low-expressed genes were first filtered out by only selecting genes with at least 10 read counts in more than 50% of the input samples in each cohort. Before fitting the input cohort to the OUTRIDER model, the optimal encoding dimension “q” was first determined by using the “findEncodingDim” method. The optimal encoding dimension was estimated to be 16 for the CHD Discovery cohort, 8 for the CHD Validation cohort, and 8 for the cardiomyopathy Validation cohort. To further identify whether variants were associated with reduced RNA expression, we used the OUTRIDER Z-scores calculated across each gene.

Generation and validation of random forest models for selecting heart-specific splice-disrupting variants

To identify true-positive splice disrupting variants, genome-wide variants were classified by whether or not they were associated with confirmed splicing events in the myocardium by searching for matching significant events within ± 100 bp of altered splicing boundaries called by FRASER. Variants associated with significantly altered outlier splicing in the same gene, but occurring outside of these boundaries were deemed indeterminate and were excluded from model training and validation. To test for the enrichment of univariable features that associate with variants that validated by FRASER, we used two-sided Mann–Whitney U tests for continuous variables and two-sided Fisher’s exact tests for binary variables. The R package randomForest (v4.7–1.1) [64] was used with default parameters to create and test four machine learning models for the prediction of splice-disrupting variants. Model 1 used only SpliceAI Δ scores as input; model 2 included DNA variant features associated with tissue splicing events, i.e., the variant distance to the nearest annotated splice junction, the variant type (SNV or indel), and whether the variant occurred in a branchpoint region, low complexity region, and/or repetitive region, while excluding SpliceAI Δ scores; model 3 included all of the DNA variant features from model 2 in addition to the corresponding median gene expression TPM value in RNA-Seq data, and model 4 included the maximum SpliceAI Δ score as well as additional DNA variant features and gene expression TPM values. Variants with missing feature values were omitted by the models (i.e., missing values were not imputed). To account for the imbalance in the training class frequencies, training models were either inversely weighted by the corresponding number of observations in the training data (Weighted models), or used Synthetic Minority Over-sampling Technique (SMOTE models) to artificially create new training inputs of the minority positive (variant resulting in confirmed splice-disruption) class [65]. All models were internally evaluated for performance by area under the curve (AUC), sensitivity, specificity, odds ratios, and Fisher’s exact test p-values using internal five-fold cross-validation. To reduce bias in odds ratios calculations and to avoid “zero cells” in the contingency tables, 0.5 was added to each observed cell frequency (Haldane-Anscombe correction). Bias-variance–covariance decomposition analysis was performed on binary classifications using the R library “randomUniformForest” (v1.1.6). The models were subsequently retrained on the entire Discovery cohort prior to its application to additional cohorts.

Validation of model performance in independent CHD and cardiomyopathy cohorts

In order to independently assess the performance of random forest models for selecting splice-disrupting variants, all models were applied to two independent cohorts of CHD (n = 48) and cardiomyopathy (n = 43) cases. Putative splice-disrupting variants in these samples were identified and annotated as described above. Variants were further limited to those found in only one sample in each cohort (i.e., an internal minor allele frequency (MAF) < 0.03), as only internally rare variants were expected to be associated with outlier splicing events. Each variant was annotated by whether or not it was associated with a confirmed splicing event in the myocardium as detected by FRASER, and whether or not it was selected by each model. Model performance was evaluated within each validation cohort by AUC, sensitivity, specificity, odds ratios, and Fisher’s exact test p-values. In addition, in order to determine whether putative splice-disrupting variants led to nonsense-mediated decay, gene expression Z-score values were obtained using OUTRIDER for each associated sample harboring a variant in the gene. The values were stratified for variants selected by each model versus those rejected by the model, and then compared using two-sided t-tests. The weighted model 4, which validated as having optimal performance, was then applied to the remainder of the CHD cohort to identify additional high-confidence splice-disrupting variants.

Identification of high-confidence splice-disrupting variants in extended CHD cases and controls using the optimal random forest model

The optimal random forest weighted model 4 was applied to all GS data from the Extension and Control cohorts in order to identify high-confidence splice-disrupting variants. Variants were further filtered to include only those rare in gnomAD control populations [38] (gnomAD v2 allele frequency < 0.0001, and gnomAD v3 PopMax allele frequency at 95% confidence < 0.0001). Both gnomAD v2 and v3 statistics were used in order to account for differences in the reference genomes used to align the Extension (hg38) and Control (GRCh37) cohorts). Variants in CHD genes in the Extension cohort were subsequently visually inspected. Six putative splice-disrupting indels in SMAD4 found in two TOF probands were subsequently excluded. The DNA reads associated with these variants had poor alignments upon visual inspection, and further investigation identified a report of a rare pseudogene containing homologous SMAD4 exons, which is found in < 1% of the population [66], suggesting that these DNA variants were false positives. Alignment files for control samples were not available, and thus variants in this cohort were not visually confirmed or otherwise excluded.

Gene set enrichment analysis

To identify which phenotypic abnormalities in human disease were enriched for splice-disrupting variants, enrichment analysis of gene sets harboring 133 high-confidence splice-disrupting variants in CHD genes among all CHD probands (n = 1101) was performed using g:Profiler core tool g:GOSt [67, 68]. Phenotypic abnormality gene sets were obtained from the Human Phenotype Ontology [69]. We limited the search to a custom background gene set containing all genes annotated by Matched Annotation from NCBI and EMBL-EBI (MANE) [37], that were expressed in the RNA-Seq profiles derived from patient myocardium in the Discovery cohort (median TPM ≥ 1), and that had at least one annotation in Human Phenotype Ontology gene sets. Genes were input as symbols derived from SpliceAI annotations. An adjusted p-value threshold of 0.01 was used to determine significance of all gene sets. Adjusted p-values were calculated using the g:SCS (Set Counts and Sizes) method which considers dependencies between multiple tests by taking into account the overlap in functional terms [67, 68].

Case–control burden analyses

To directly compare the burden of short variants in cases and controls, we first determined whether technical differences between cases and controls (e.g., sequencing facility, GS platform, reference genome version, variant detection workflows) contribute to a systematically different burden of synonymous variants genome-wide. Our strategy to filter for rare synonymous variants in samples was similar to how we had identified high-confidence splice-disrupting variants. We limited to only SNVs and indels with a “PASS” flag, and then annotated variants using VEP v102 [54]. Variants were filtered to only include those with an internal minor allele frequency (MAF) < 0.01, gnomAD v2 allele frequency < 0.0001, and gnomAD v3 PopMax allele frequency < 0.0001 [38]. Both gnomAD v2 and v3 statistics were used in order to account for differences in the reference genomes used to align the Extension (hg38) and Control (GRCh37) cohorts, and internal allele frequencies were used to further reduce possible false-positive variant calls. We calculated for each sample the number of rare variants predicted to result in a synonymous substitution. Protein-coding regions of the genome were filtered using the UCSC Table Browser [70] for assembly hg38 with the GENCODE v44 track [71], and limited to genes in autosomal chromosomes 1–22. Synonymous variants were further required to be annotated as such in a MANE canonical transcript [37]. The burden of synonymous variants between cases and controls was derived using a Mann–Whitney U test to compare the continuous allele frequency of variants across samples. For burden analysis of splice-disrupting variants between cases and controls, we identified the set of rare SNVs and indels across all samples that were selected by the optimal machine learning model as being high-confidence splice-disrupting variants. For case–control burden of splice-disrupting variants, p-values were calculated using a two-sided Fisher’s exact test, comparing the number of samples in each cohort that harbored at least one high-confidence splice-disrupting variant versus those that harbored none. To reduce bias in odds ratios calculations and to avoid “zero cells” in the contingency tables, 0.5 was added to each observed cell frequency (Haldane-Anscombe correction).

Data analyses and visualizations

All aforementioned statistical analyses, as well as data visualizations, were carried out using the R Programming Environment v4.1.2. Graphical data plots were created using the ggplot2 [72] and pROC [73] libraries.



Source link