Scientific Papers

Identification of transcription factor high accumulation DNA zones | BMC Bioinformatics


In the following subsections, we comprehensively present our TFHAZ R/Bioconductor package and the original methodology we developed and implemented in it. Notably, we highlight its different analytical alternatives, input data settings and main strengths: three quantification strategies to compute accumulation from DNA single-base to region-oriented exploration, two methods to identify the HOT zones, and the possibility of using a moving window strategy to account for the influence of the accumulation of DNA bases close to that/those under exam.

TFHAZ R/bioconductor package

With the aim of providing ready-to-use software to apply our proposed search methodology for HOT DNA zones, we developed a computationally efficient R/Bioconductor package named TFHAZ (Transcription Factor High Accumulation Zones) [27]. It allows users to easily find DNA zones of high accumulation from datasets containing the genomic positions of TF binding regions. Input data must be loaded as a GRanges object, a widely used R/Bioconductor data structure with a section for genomic ranges where TF binding regions are listed and a metadata section where the name of the bound TF is annotated for each region. Note that IRanges, GenomicRanges and GenomicFeatures [28] packages offering this kind of scalable data structure are the core of the R/Bioconductor infrastructure for handling genomic data. They also provide efficient functions for data extraction and range-based operations, including coverage calculation. This latter one is used in TFHAZ to compute the accumulation vector, modelled as a run-length-encoded (RLE) object, which is a compact representation for very long vectors fully supported by IRanges functions. Hence, TFHAZ is fully integrated with existing R/Bioconductor functions and data structures, guaranteeing effective treatment and efficient processing of the considered genomic data, and interoperability with other R/Bioconductor packages (e.g., [28,29,30]).

TFHAZ R/Bioconductor package is freely available both in the official Bioconductor release (currently, version 3.16)Footnote 1 and on a GitHub repositoryFootnote 2 together with its complete documentation and its vignette. In its first five years, TFHAZ has been progressively enhanced, and overall it counts more than 5,500 downloads by more than 2,200 distinct IPs from Bioconductor only. TFHAZ is easy-to-use thanks to its clear documentation and vignette: these latter allow exploring all the package functionalities along with some examples of possible usage. Taking the TF binding regions of interest (as the ones extracted from ChIP-seq data) as input, TFHAZ offers functions for the computation of the different accumulation types, for the search of TF-dense DNA zones and the identification of the HOT ones. Additionally, it provides evaluation and plotting functions to compare the results obtained with different moving window sizes, accumulation types and threshold values. Specifically, to identify DNA HOT zones based on the computed accumulation vector, its high_accumulation_zones() function allows specifying the moving window size, the method to be used (binding region or overlap), and the desired thresholding procedure. Furthermore, the resulting HOT zones are returned together with the min, max, mean, median and standard deviation of their lengths.

Methodology

In Fig. 1, we schematically illustrate our proposed procedure to find HOT zones implemented in the TFHAZ package. A dense zone (potentially HOT) is here defined as a DNA area in which a high number of different TFs bind, as summarized by its corresponding accumulation index; when this index exceeds a given threshold value, the dense zone is considered a HOT zone. This definition clarifies how to distinguish HOT zones from other TF-dense zones and leaves open the alternatives for computing the decisive accumulation index and threshold, as well as for the choice of the areas of interest between entire input binding regions or more specific subsets of DNA bases. Indeed, our procedure offers innovative ways to explore DNA areas of different sizes, also providing local single-base evaluation of TF-DNA interactions and three alternative quantification strategies to compute accumulation. Furthermore, we formalize two different methods to trace HOT zones, named binding region and overlap methods, as described in “Identification of DNA HOT zones” section. Such methods can work on a single chromosome as well as on the entire genome (based also on the ChIP-seq data provided as input), and are included in a systematic and fully-reproducible procedure for searching HOT zones.

Fig. 1
figure 1

Workflow for the identification of high accumulation DNA zones, starting from input binding regions and using two alternative approaches: the Binding region or the Overlap method

The binding region method (top panel in Fig. 1), among all input binding regions, identifies those with a total number of different TFs greater than a computed threshold. Instead, the overlap method (bottom panel in Fig. 1), starting from the local evaluation of each DNA base under exam, finds contiguous bases having a number of overlapping bindings greater than a computed threshold. Regardless of the chosen method, the procedure starts with a local evaluation of the accumulation on every DNA base under analysis as to construct an accumulation vector. This can be obtained from ChIP-seq TF binding region data using the TF accumulation, the binding region accumulation, or the genomic base accumulation strategy alternatively, as detailed in “Accumulation types” section. Additionally, the accumulation value of each base can be influenced by the accumulation values of the bases on its neighborhood, according to the use of a moving window innovatively introduced for this purpose, as illustrated in “Moving window” section. This allows exploring different granularities of analysis together with distinct accumulation types. Local accumulation values are then aggregated to associate each zone of interest with an accumulation index. The binding region method focuses on the input binding regions (each bound by at least one TF) and requires an intermediate scoring step to compute region accumulation indexes. Conversely, the overlap method extracts as regions of interest all the sets of contiguous bases with the same local accumulation value: this simply becomes the accumulation index of the corresponding region. The procedure ends with a thresholding step, used to distinguish HOT zones from dense ones with a lower accumulation index; alternative options to compute a proper threshold are discussed in “Thresholding procedures” section.

Fig. 2
figure 2

Overview of the strategies and parameters included in our methodology, and, at the bottom of the Figure, brief descriptions of 7 typical analytical scenarios for users. Together with the presentation and with some usage details for every strategy/parameter, advised settings are indicated on Figure right, considering each of the 7 analytical scenarios separately

In Fig. 2, we briefly recap all the strategies and parameters of our proposed methodology and indicate suggested choices for a series of typical analytical scenarios. Indeed, suitable parameter values and methodological options depend on the analysis intents and on the required output for following investigations. Factors such as the need for dense regions of bigger width rather than narrower peaks (e.g., using or not a moving window and adjusting the stringency of the thresholding), the trade-off between sensitivity and precision in recognizing HOT zones (e.g., using the binding region or the overlap method, respectively), the selection of a given portion of zones with higher density (e.g., using top k percentage thresholding) rather than of actual outlier zones (e.g., using an over k standard deviations thresholding) are decisive in guiding user choices. HOT zones can be retrieved following a specific setting or comparing alternative adequate options to strengthen the obtained results. Yet, at the state-of-the-art there are no absolute criteria to define the best HOT zones since there are no definitive sets of true HOT zones to track, nor a well-recognized gold standard to trace them. Thus, we used some comparative evaluation criteria to assess the reliability of the results found by our methodology, which emerged as crucial to offer a standardized approach ensuring complete repeatability but also flexibility to adapt to different analytical scenarios. Our HOT zones demonstrated both a solid core of concordant results between our two defined search methods (when both are adequate for a given analysis), and impressive coherence with existing studies, like [14], or with databases including comprehensive regulatory hotspot annotations, like [8, 9]; furthermore, they reflect known characteristics of the HOT zones, like enrichment in promoters, CpG islands, and promotorial CpG islands.

Input data settings

As input data, we consider all the TF binding regions organized within a matrix, where each row specifies a binding region with its genomic location and the name of the bound TF. This kind of data mostly comes from ChIP-seq narrow peaks. The first four columns of an input matrix represent the genomic coordinates of each binding region (i.e., the chromosome, the start and end positions, and the strand, respectively), while an additional fifth column hosts the corresponding TF. As for genomic coordinates, we consider a 1-base inclusive coordinate system: it directly numbers every DNA base (instead of specifying coordinates that flank each base) and includes in the analysis also the extreme bases at the start and end positions of the considered regions.

Moving window

ChIP-seq experiments generate millions of short reads that are usually retained only if they match unique locations (uni-reads) once mapped to the reference genome. This practice is mostly not harmful because many uni-reads are adjacent to discarded multi-reads and allow identifying an underlying peak of a binding site [31]. Yet, this can lead to a partial underestimate of the occupancy of binding regions, which may appear sparser and with contiguous bases much less homogeneous than they really are. To face this issue, a moving window mode was used to innovatively introduce the concept of neighborhood in the search for HOT zones. During the local analysis, performed one base at a time, the accumulation value of each base is computed considering what happens not only in the base itself but also in its symmetrical neighborhood of size \(2*w\). Indeed, a moving window of semi-width w is centered on each DNA base under exam. The w value is an input parameter of this algorithm to search for dense zones; its variation offers multi-scale analyses with different resolutions in the local investigation of the binding regions.

A not-null moving window (i.e., of semi-width \(w \ne 0\)) allows capturing differences based on the chosen accumulation type, as discussed in the following “Accumulation types” section. Also, it mitigates the local discontinuities that clearly emerge in its absence, each time that consecutive bases have very different accumulation values. The smoothing effect grows as the moving window semi-width increases since the accumulation value of each base is more influenced by farther bases. Yet, too high values of w can lead to the aggregation of very distant bases and provide undesired fake artefacts on dense zones. Overall, the analysis detailed in Additional file 1: Section S1.1 indicates 1000 bases as the ideal semi-width of a moving window able to smooth local discontinuities without aggregating too far bases.

Accumulation types

The procedure computes local accumulation values, which are stored within an accumulation vector where each position refers to a specific DNA base. These values are obtained according to one of the following quantification strategies.

  1. 1

    TF accumulation returns the number of distinct input TFs that bind the considered DNA base b, or any portion of its neighborhood (of amplitude \(2 * w { + 1}\) centered on the base, if \(w > 0\)), i.e.,:

    $$\begin{aligned} TF_{acc} = \sum _{t=1}^{n_{TF}}t: \exists _{i=b-w}^{b+w} \left\{ t(i) \ne 0 \right\} \end{aligned}$$

    where \(n_{TF}\) is the number of distinct TFs, i spans all over the DNA bases in the neighborood of b and \(t(i) \ne 0\) denotes that a given TF t binds the base i.

  2. 2

    binding region accumulation returns the number of all the different input binding regions, for either new or already considered TFs, that include the examined DNA base b, or any portion of its neighborhood (if \(w > 0\)), i.e.,:

    $$\begin{aligned} BR_{acc} = \sum _{r=1}^{n_{IR}}r: \left\{ r \cap (b\pm w) \ne 0 \right\} \end{aligned}$$

    where \(n_{IR}\) is the number of distinct input binding regions, and \(r \cap (b\pm w)\) is the intersection between a given input region r and the neighborood of the base b under exam.

  3. 3

    genomic base accumulation returns the amount of DNA bases from all the different input TF binding regions that overlap the considered DNA base b, or its neighborhood (if \(w > 0\)), i.e.,:

    $$\begin{aligned} GB_{acc} = \sum _{r=1}^{n_{IR}}\sum _{d=1}^{n_{r}}d: \left\{ d \cap (b\pm w) \ne 0 \right\} \end{aligned}$$

    where \(n_{IR}\) is the number of distinct input binding regions, \(n_{r}\) is the number of bases of a given input region r, and \(d \cap (b\pm w)\) is the intersection between a single DNA base d from the input region r and the neighborood of the base b under exam.

Without considering any neighborhood (i.e., for \(w = 0\)), the results of the three strategies are equal as long as no overlapping binding regions for the same transcription factor are present in the input data (as it usually is). Just in this latter case, a DNA base in the overlapping area of more binding regions for the same transcription factor has its region and base accumulation values equal and greater than its TF accumulation value.

Fig. 3
figure 3

Accumulation values of an example dataset with 10 input transcription factor (TF) binding regions of three different TFs X, Y and Z (a). When considering a null moving window (\(w = 0\)) (b, c), the TF accumulation (b) is different from the region accumulation and base accumulation (c), due to the overlap of input binding regions for the same TF starting at position 4250. For an ideal moving window of 1000 bases of semi-width (\(w = 1000\)), each of the three accumulation types shows well-differentiated values (d, e, f)

For an explanatory example, let us consider an input dataset with 10 input binding regions of 3 distinct TFs (X, Y, Z) as illustrated in Fig. 3a, where the eighth and ninth regions are both bound by the TF Z with a partial overlap in the coordinate range [4250, 4500]. Accordingly, for \(w = 0\) all three accumulation types return the same values everywhere except in the range [4250, 4500]. There, the TF accumulation is equal to 2 (Fig. 3b) since both TF Y and TF Z (although within two distinct regions) bind there; instead, the region and base accumulation are equal to 3 in that range (Fig. 3c) because exactly 3 regions and 3 bases from all input regions occur in [4250, 4500] when such range is analyzed one base at a time (without any neighborhood).

Differences in the three accumulation types are appreciable when using a w value different from zero to calculate the accumulation values. In this case, not only all DNA bases of each input region r, but also those bases included within every pair of intervals starting from the extremes of each region r and extending outward it for w positions have a not-null accumulation value. This is clearly visible when comparing panels b, c (w = 0) and d, e, f (w = 1000) in Fig. 3. Each accumulation value is influenced by the considered neighborhood: based on the used accumulation type, this implies that additional TFs, input regions or bases can contribute to computing each local accumulation. The region accumulation of a base under exam increases each time the neighborhood of the base intercepts another input binding region; conversely, the TF accumulation does it only if another binding region of a different TF is intercepted. This can be noticed by comparing these two accumulation types for w = 1000 at position 3250 in panels d and e of Fig. 3: in panel e, the region accumulation becomes 4 since the moving window intercepts the starting base of another input region, the ninth, located in [4250, 6000]; yet, the TF of the ninth region is Z, which is already accounted for the TF accumulation in panel d because it also binds the eighth region, which includes the base at position 3250. Accordingly, the TF accumulation at position 3250 is 3, which is also the maximum possible value of the TF accumulation for a dataset including three different TFs.

However, the moving window is incisive mainly in computing the base accumulation, since its dynamic range of values can change several orders of magnitude as the window size increases. To better explain how base accumulation is affected, we can focus on a single input region, as in the cases represented and described in Additional file 1: Section S1.2. Obviously, when generalizing to the typical dataset with multiple TFs and binding regions, all input region bases intercepting the neighborhood of each position under exam are summed and contribute to computing the base accumulation value. This leads to obtain wider dense zones, where base accumulation rises more smoothly but reaches much higher values than for TF or region accumulation; this is evident even in our simple example data when comparing panel f of Fig. 3 with all the other reported cases.

Identification of DNA HOT zones

Two different methods have been defined to find DNA HOT zones, the binding region and overlap methods, which investigate as DNA bases of interest only those with a not-null accumulation value.

The binding region method, commonly used in the literature, identifies regions characterized by a certain number of co-occurring bindings of different TFs. Specifically, its regions of interest are the original input binding regions without any neighborhood: yet, the ones with at least one base of overlap are combined in a unique region. A scoring step is performed to move from the base-level evaluation of TF accumulation to a region-oriented evaluation. This scoring assigns each region r with an accumulation index, which is the number of different TFs that bind at least one base of r. Thus, this accumulation index is always equal to or greater than the maximum TF accumulation that we can obtain for a single base within r. Dense zones can be easily traced by selecting regions based on their accumulation index: a final thresholding step extracts as HOT zones all those dense zones having an index at least equal to a data-driven threshold, computed as detailed in “Thresholding procedures” section.

The overlap method focuses on DNA zones of finer granularity compared to the input regions as to identify contiguous bases with accumulation peaks. Given whichever accumulation type and moving window size, the accumulation vector is examined to extract zones made of contiguous bases having the same accumulation value: this value is inherited as the accumulation index for the specific DNA zone. Dense zones can be selected based on their accumulation index, while HOT zones are just those with an accumulation index not lower than a threshold value (see “Thresholding procedures” section).

Since the overlap method searches for contiguous bases of high accumulation and the accumulation index of the bases forms peaks, the so-identified HOT zones may be more in number but always of smaller dimension than those found with the binding region method. In addition, within dense zones obtained with the overlap method, each base has an accumulation value equal to the accumulation index value of its zone; consequently, HOT zones include only bases with local accumulations not lower than the considered threshold. Conversely, with the binding region method, typically, not all bases of any dense or HOT zone reach the same number of co-present TFs. Some bases may have a local accumulation value smaller than the accumulation index value of their zone and even equal to 1 only. Such main differences are clearly visible in Additional file 1: Fig. S3, where, for both binding region and overlap methods, the same threshold is used to identify HOT zones in the example dataset presented in Fig. 3. In Additional file 1: Fig. S3, all bases of a HOT zone that have accumulation lower than the threshold are shown in yellow; they are extracted only by the binding region approach. The overlap method instead avoids this issue, overcoming the region-level aggregation to compute the accumulation index used in the binding region method.

Thresholding procedures

A proper choice of the threshold needed to identify the DNA HOT zones is crucial. It may be defined arbitrarily, but data-driven thresholds, calculated based on the distribution of the dense zones and of their accumulation values as resulting from the described methodological steps, are warmly suggested. Specifically, two alternative types of thresholds are here following discussed: the top k percentage and the over k standard deviations, both related to the k parameter.

The top k percentage threshold is determined through a parametric generalization of a state-of-the-art approach to deal with different accumulation methods, resolution levels, and amounts of binding regions and TFs under exam. After sorting all zones of interest in increasing order of their accumulation index, the threshold is computed as the minimum accumulation index of the zones belonging to the top k% of the total number of zones. The more the k value increases, the more the number of resulting HOT zones increases.

The over k standard deviations threshold is instead defined by an alternative procedure, which obtains a threshold value based on the main statistical indicators of the distribution of the single-base accumulation value. Considering all and only the DNA bases with not-null accumulation values (regardless of the chosen accumulation type), the threshold is computed as the ceiling of the mean of these accumulation values plus k-times their standard deviation. Here, increasing k implies a higher threshold, which leads to finding fewer and likely smaller HOT zones.



Source link