Scientific Papers

PerFSeeB: designing long high-weight single spaced seeds for full sensitivity alignment with a given number of mismatches | BMC Bioinformatics


Sequence alignment

Let there be an alphabet \({\mathcal{A}}\) of symbols and two strings x and y of length n consisting of characters from \({\mathcal{A}}\). The term distance d(xy) between two strings x and y was introduced in [25] for binary alphabet \({\mathcal{A}} =\{0, 1\}\) in the space of \(2^n\) points as the number of mismatches between points x and y. In general, we may introduce a list of elementary operations which convert a source string x into a target string y. Each function may have a different cost. The distance is then a sum of all costs. If we cannot transform x into y, then the cost is \(\infty\). A good review of possible distances can be found in [26]. The most common distances are the following ones.

  1. 1.

    Levenshtein (or edit) distance (insertions, deletions, substitutions are allowed at equal cost of 1) [27]. For example, we may transform word “health” into “shale” as \(\text{“health”}\rightarrow \text{“healt”}\rightarrow \text{“heale”}\rightarrow \text{“sheale”}\rightarrow \text{“shale”}\) (2 deletions, 1 substitution, 1 insertion), so \(d(\text{“health”},\text{“shale”}) = 4\).

  2. 2.

    Hamming distance (only substitutions) [28].

  3. 3.

    Longest common subsequence distance (insertions/deletions) [29].

Suppose we have two sequences \(x=\texttt{CTTGTCGTTGGAGATCGGAAGAGCA}\) and \(y=\texttt{TAGGTGCTCG}\) of length \(n_1=25\) and \(n_2 = 10\), respectively. We define \(m=n_1 – n_2 + 1 = 16\) and for each index \(j=1,\ldots ,m\) we find distance \(d_j=\sum _{k=1}^{n_2} \delta (x_{j+k}, y_k)\), where \(\delta (a,b) = 1\) if symbols a and b are different, otherwise \(\delta (a,b)=0\). For example, to find \(d_8\) we align the corresponding strings and calculate values for \(\delta (x_{j+k}, y_k)\):

$$\begin{aligned} \begin{array}{ll} \text{string 1} &{}\texttt{CTTGTCGTTGGAGATCGGAAGAGCA}\\ \text{string 2} &{}\texttt{\_\_\_\_\_\_\_TAGGTGCTCG\_\_\_\_\_\_\_\_}\\ \delta _k &{}\texttt{\_\_\_\_\_\_\_0100101000\_\_\_\_\_\_\_\_} \end{array} \end{aligned}$$

So, \(d_8 = 3\), in a similar way we get \(d = (6, 5, 9, 7, 6, 9, 9, 3, 7, 8, 7, 8, 8, 8, 7, 7)\). We compare all \(n_2\) symbols for both sequences. However, by introducing spaced seeds we may ignore some symbols. For example, we may use seed of length \(n_2\): 1011011111 (binary notation used in [8]) or #-##-##### (notation used in [9]), where 1 or # stand for a match and 0 or for ‘do not care’. Thus, to find the distance \(d_8\) accounting for the spaced seed we get

$$\begin{aligned} \begin{array}{ll} \text{string 1} &{}\texttt{CTTGTCGTTGGAGATCGGAAGAGCA}\\ \text{string 2} &{}\texttt{\_\_\_\_\_\_\_TAGGTGCTCG\_\_\_\_\_\_\_\_}\\ \delta _k&{}\texttt{\_\_\_\_\_\_\_0100101000\_\_\_\_\_\_\_\_}\\ \text{seed}&{}\texttt{\_\_\_\_\_\_\_\#}\text{-}\texttt{\#\#}\text{-}\texttt{\#\#\#\#\#\_\_\_\_\_\_\_\_}\\ \text{new}\, \delta _k &{}\texttt{\_\_\_\_\_\_\_0000001000\_\_\_\_\_\_\_\_} \end{array} \end{aligned}$$

or \(d_8 = 1\). Similarly, we get \(d = (5, 3, 7, 6, 5, 7, 7, 1, 5, 6, 7, 6, 7, 6, 5, 5)\). From now, we will use binary notation as it is closer to binary numbers utilized for storage and computation.

Now let us consider the library of records. Suppose we have a contiguous seed \(s=\texttt{1111}\) of weight \(w=4\). So, if we take the first four symbols of vector x, then we encode string \(\texttt{CTTG}\) as \(1 + 4\cdot 3 + 4^2\cdot 3 + 4^3\cdot 2 = 1 + 12 + 48 + 128 = 189\) (if we set \(\texttt{A} = 0\), \(\texttt{C} = 1\), \(\texttt{G} = 2\), \(\texttt{T} = 3\)). Therefore the library of records for the reference sequence x and seed s has 22 pairs (“key”, “value”):

$$\begin{aligned} \begin{array}{llll} (1, \texttt{CTTG} = 189) &{} (2, \texttt{TTGT} = 239) &{} (3, \texttt{TGTC} = 123) &{} (4, \texttt{GTCG} = 158) \\ (5, \texttt{TCGT} = 231) &{} (6, \texttt{CGTT} = 249) &{} (7, \texttt{GTTG} = 190) &{} (8, \texttt{TTGG} = 175) \\ (9, \texttt{TGGA} = 43) &{} (10, \texttt{GGAG} = 138) &{} (11, \texttt{GAGA} = 34) &{} (12, \texttt{AGAT} = 200) \\ (13, \texttt{GATC} = 114) &{} (14, \texttt{ATCG} = 156) &{} (15, \texttt{TCGG} = 167) &{} (16, \texttt{CGGA} = 41) \\ (17, \texttt{GGAA} = 10) &{} (18, \texttt{GAAG} = 130) &{} (19, \texttt{AAGA} = 32) &{} (20, \texttt{AGAG} = 136) \\ (21, \texttt{GAGC} = 98) &{} (22, \texttt{AGCA} = 24) &{} \end{array} \end{aligned}$$

Similarly, we may find “values” for sequence y:

$$\begin{aligned} \texttt{TAGG}&= 163, \texttt{AGGT} = 232, \texttt{GGTG} = 186, \texttt{GTGC} = 110,\\ \texttt{TGCT}&= 219, \texttt{GCTC}= 118, \texttt{CTCG} = 157. \end{aligned}$$

We may see that no “values” for y can be found in the library generated for x. Therefore the use of “seed and extend” approach does not allow us to align y with respect to x, since no candidate positions can be found.

Suppose we use seed \(s=\texttt{101011}\) (also of weight 4). We generate a new library for x (now containing 20 pairs):

$$\begin{aligned} \begin{array}{llll} (1, \texttt{CTTC} = 125) &{} (2, \texttt{TGCG} = 155) &{} (3, \texttt{TTGT} = 239) &{} (4, \texttt{GCTT} = 246) \\ (5, \texttt{TGTG} = 187) &{} (6, \texttt{CTGG} = 173) &{} (7, \texttt{GTGA} = 46) &{} (8, \texttt{TGAG} = 139) \\ (9, \texttt{TGGA} = 43) &{} (10, \texttt{GAAT} = 194) &{} (11, \texttt{GGTC} = 122) &{} (12, \texttt{AACG} = 144) \\ (13, \texttt{GTGG} = 174) &{} (14, \texttt{ACGA} = 36) &{} (15, \texttt{TGAA} = 11) &{} (16, \texttt{CGAG} = 137) \\ (17, \texttt{GAGA} = 34) &{} (18, \texttt{GAAG} = 130) &{} (19, \texttt{AGGC} = 104) &{} (20, \texttt{AACA} = 16) \end{array} \end{aligned}$$

For sequence y we get “values”:

$$\begin{aligned} \texttt{TGTG} = 187, \texttt{AGGC} = 104, \texttt{GTCT} = 222, \texttt{GGTC} = 122, \texttt{TCCG} = 151. \end{aligned}$$

There are three “values” (187, 104 and 122) found in the new library. Therefore we need to try three positions: \(5 – 1 + 1 = 5\), \(19 – 2 + 1 = 18\), \(11- 4+1 = 8\). For the first position (5) we get

$$\begin{aligned} \begin{array}{ll} \text{string 1} &{}\texttt{CTTGTCGTTGGAGATCGGAAGAGCA}\\ \text{string 2} &{}\texttt{\_\_\_\_TAGGTGCTCG\_\_\_\_\_\_\_\_\_\_\_}\\ \delta _k &{}\texttt{\_\_\_\_0101001111\_\_\_\_\_\_\_\_\_\_\_} \end{array} \end{aligned}$$

and \(d_5 = 6\), the second position (18) cannot be used, since the aligned string y will be out of the range for string x and the third position (8) gives us

$$\begin{aligned} \begin{array}{ll} \text{string 1} &{}\texttt{CTTGTCGTTGGAGATCGGAAGAGCA}\\ \text{string 2} &{}\texttt{\_\_\_\_\_\_\_TAGGTGCTCG\_\_\_\_\_\_\_\_}\\ \delta _k &{}\texttt{\_\_\_\_\_\_\_0100101000\_\_\_\_\_\_\_\_} \end{array} \end{aligned}$$

and \(d_8 = 3\). So, the seed \(\texttt{101011}\) allows us to find the best alignment position of y with respect to x.

We are interested in aligning genetic sequences. Reference sequences may consist of several separate sequences. For example, T2T_CHM13v2.0 Telomere-to-Telomere assembly (see [30]) contains 24 separate sequences of four symbols A, C, G, T. However, GRCh38.p14 release has 705 separate sequences (chromosomes, genomic scaffolds and patches) and five symbols (A, C, G, T, and N for unknown symbols). We may always concatenate separate sequences into one long sequence by adding extra “void” symbols in between to avoid forming library records containing symbols from different original sequences. In the case of the T2T reference genome, we may use two bits to encode each symbol. For the GRCh38.p14 reference genome, we have five possible values to combine every three consecutive symbols into a 7-bit number (as \(5^3 = 125 < 128 = 2^7\)). So, if we do not use any data compression, then storing the T2T sequence may require 0.73 GB, and for GRCh38.p14, we need about 0.84 GB. These numbers are minimal, even for a budget computer. Therefore we may use more storage space to achieve better performance. The 2-bit encoding is preferable even if we need to exclude some substrings containing any character except A, C, G, T.

Modern computers allow users to exploit SIMD (single instruction, multiple data) properties of CPUs (central processing units). For example, one instruction can be applied to a 128-, 256- or 512-bit number. CPUs can do various logical and shift operations very fast. A list of Intel’s SIMD intrinsics designed for various CPU instruction sets can be found in [31]. Most computers (servers, workstations and home computers) built for the past decade support the 128-bit instruction set. Thus, we will use 128-bit instructions. There are similar instructions for 256- and 512-bit numbers, so the ideas discussed in the paper can also be applied for new architectures.

We may split the reference sequence into groups of 32 symbols, then form 128 bits such that the first 32 bits are 1s if the corresponding symbols are A, the second 32 bits are for symbols C, the third and fourth groups of 32 bits are for G and T, respectively. For example, the string CATAGNCACGTGATCCTAGNCATGTTACCTGT of 32 symbols has the following components

$$\begin{aligned} \begin{array}{cll} \text{sequence} &{} \texttt{CATAGNCACGTGATCCTAGNCATGTTACCTGT} &{} \text{32-bit number}\\ \texttt{A} &{} \texttt{01010001000010000100010000100000} &{} \texttt{0x0422108a}\\ \texttt{C} &{} \texttt{10000010100000110000100000011000} &{} \texttt{0x1810c141}\\ \texttt{G} &{} \texttt{00001000010100000010000100000010} &{} \texttt{0x40840a10}\\ \texttt{T} &{} \texttt{00100000001001001000001011000101} &{} \texttt{0xa3412404}\\ \texttt{A|C|G|T} &{} \texttt{11111011111111111110111111111111} &{} \texttt{0xfff7ffdf} \end{array} \end{aligned}$$

If an element is N symbol, then the corresponding bits of all four 32-bit numbers are zeros. We use symbol “|” for logical OR operation, similarly, symbol “&” is for logical AND operation.

Let there be two sequences of length 32. We want to find how many symbols are the same for them (if one symbol is N, then there is no match).

$$\begin{aligned} \begin{array}{cc} \texttt{m1} &{} \texttt{CATAGNCACGTGATCCTAGNCATGTTACCTGT}\\ \texttt{m2} &{} \texttt{GCCTCAGTTTTCACTCTATCAATATGTAATAA}\\ \texttt{m1 \& m2} &{} \texttt{\_\_\_\_\_\_\_\_\_\_T\_A\_\_CTA\_\_\_AT\_T\_\_\_\_T\_\_} \end{array} \end{aligned}$$

There are nine such symbols. We may apply logical AND operation for corresponding 32-bit components of m1 and m2

$$\begin{aligned} \begin{array}{ccccc} &{} \texttt{A} &{} \texttt{C} &{} \texttt{G} &{} \texttt{T}\\ \texttt{m1} &{} \texttt{0x0422108a} &{} \texttt{0x1810c141} &{} \texttt{0x40840a10} &{} \texttt{0xa3412404}\\ \texttt{m2} &{} \texttt{0xd8b21020} &{} \texttt{0x0008a816} &{} \texttt{0x02000041} &{}\texttt{0x25454788}\\ \texttt{m1 \& m2} &{} \texttt{0x00221000} &{} \texttt{0x00008000} &{} \texttt{0x00000000} &{} \texttt{0x21410400} \end{array} \end{aligned}$$

Since 0x00221000 | 0x00008000 | 0x00000000 | 0x21410400 = 0x21639400, then the number of 1-bits in 0x21639400 equals 9 and is the total number of matches. Now we approach the main goal of the manuscript, i.e. how to design seeds that will allow us to find candidate positions.

Choice of seeds

Suppose there is a short sequence we want to align (read) for a known long reference sequence. The length of the read is \(n_r\). We may always assume that a read contains only four symbols (A, C, G and T). While there may be cases of reads containing symbols N for unknown letters, the number of such cases is often negligible. Let there be a seed s of length \(n_s\) (total number of all bits, 1s and 0s) and weight w (the number of 1s). If we have 1-bit, then the corresponding symbol for a given sequence is taken into account; otherwise (in the case of 0-bit), it is ignored. We may also assume that a seed’s first and last bits are 1-bits. Contiguous seeds have only 1-bits, and the lengths and weights of these seeds are equal. Spaced seeds have at least one 0-bit.

Seeds allow us to form a shorter sequence from a longer one. We apply a seed of length \(n_s\) to a sequence of the same length and form a new sequence of length w when symbols of the original sequence are in front of 1-bits of the seed. We consider the new shorter sequence as a characteristic/signature of the longer sequence. In the case of a 4-letter alphabet, we may perform one-to-one mapping of the short sequence and a 2w-bit number. We use these numbers to find pairs of (“key”, “value”) in the library of records generated for the reference sequence. These pairs should have “value”-components equal to our 2w-bit numbers. We may generate only \((n_r-n_s+1)\) 2w-bit numbers for each read since the first and last bits of a seed should be within the read’s boundaries.

Suppose we have a seed of weight four, and by applying it at some reads’ position, we get “value” ACGT. Let there be N pairs in the reference library with the same “value”. Now we expand the original seed and assume the new seed at the chosen position will also be within reads’ boundaries. This means that the new seed may provide us with four new “values” (ACGTA, ACGTC, ACGTG, ACGTT). The number of (“key”, “value”) pairs found for the new seeds will be less or equal to N. For example, if the reference sequence is AGTGACGT, then we have one pair for ACGT and no pairs for the 5-symbol sequences. Of course, it might happen that the number of pairs in the library generated for “value” ACGTA is N, and there are no pairs found for ACGTC, ACGTG and ACGTT. However, we may often think that the number of pairs for each 5-symbol sequence is around N/4.

For a library of records generated for a reference sequence we may estimate the total number of pairs found for a given “value” as \(\alpha /4^w\), where \(\alpha\) is a constant. We generate \((n_r-n_s+1)\) numbers (“values”) for a read. Based on the above empirical rule, we should expect to consider about \(\alpha (n_r-n_s+1)/4^w\) candidate positions in the reference sequence where we try to align the given read. Each “value” is found for a substring of a read with a given shift. Since the goal is to pre-align the read, the “keys” we found should be corrected by the corresponding shifts of a substring to the read’s first element. For example, if a read has an exact match within a read, then all its “values” should also have an exact match. So, in the best scenario, we should have about \(\alpha /4^w\) unique candidate positions (when a read can be placed exactly at several locations of the reference sequence). However, it is better to estimate the number of candidate positions from the above as

$$\begin{aligned} \alpha \frac{n_r-n_s+1}{4^w}. \end{aligned}$$

(1)

This number indicates what steps we should perform to reduce the number of candidates’ locations and thus improve performance of sequence alignment algorithms. The steps are

  1. 1.

    find seeds of maximum weight,

  2. 2.

    among the found seeds choose seeds of maximum length.

Of course, the trivial approach is to maximise \(n_s\) and w. So, a contiguous seed of length \(n_r\) is the best candidate. However, there are restrictions as we should account for various mismatches between the reference and “patient” sequences. In this paper we consider only pointwise mismatches (SNPs) and suppose all reads have at most \(n_m\) mismatches. Hereafter, we assume that all seeds we try to find meet this rule.

Suppose there are two seeds \(s_1\) and \(s_2\) and we may align \(s_1\) with respect to \(s_2\) in such a way that all 1-bits of seed \(s_1\) are also present in \(s_2\). Then \(s_1\) is a subset of \(s_2\) and can be denoted as \(s_1\subset s_2\). For example, if \(s_1=\texttt{10111}\), \(s_2=\texttt{11011101}\), then by shifting \(s_1\) by one element to the right we get

$$\begin{aligned} \begin{array}{ll} s_2 &{} \texttt{11011101}\\ s_1 &{} \texttt{\_10111\_\_} \end{array} \end{aligned}$$

Note that there may be multiple possible shifts and \(s_1\) may have more 0-bits. For example, \(s_3=\texttt{10101}\subset s_1\) and we get two possible alignments

$$\begin{aligned} \begin{array}{ll} s_2 &{} \texttt{11011101}\\ s_3 &{}\texttt{\_10101\_\_}\\ s_3 &{} \texttt{\_\_\_10101} \end{array} \end{aligned}$$

Seed \(s_4=\texttt{1101011} \not \subset s_2\).

The human genome has many repeated regions, so several candidate seeds have similar weights/lengths. Therefore, it is reasonable to choose a seed with a more uniform distribution of 1-bits rather than seeds with grouped 1-bits. In any case, if we have many repeated experiments providing us with reads of the same length for a known reference sequence, it is worth performing a statistical analysis of the “keys” distribution to avoid cases when for some “values” we have thousands of “keys”.

According to [24], we may consider two main alignment problems: lossless alignment when we detect all locations in the reference sequence and lossy alignment when we may miss some of them. As for any statistical analysis, we may have true-positive (TP), true-negative (TN) and false-positive (FP) and false-negative (FN) events. We aim to find seeds that provide us with lossless alignment. Therefore for a given read and seed, we construct all “values” according to the procedure described above, and then we do not miss any candidate location. Thus the number of false negative events is zero. Since

$$\begin{aligned} \text{sensitivity} = \frac{\text{TP}}{\text{TP} + \text{FN}}, \end{aligned}$$

(2)

then for our seeds \(\text{FN} = 0\) and \(\text{sensitivity} = 100\%\). So, our seeds have full sensitivity or we have lossless seeds.

Seed validation

Let us consider an example. We check if seed \(s=\texttt{110101011001011}\) is a full sensitivity seed for reads of length 20 and at most two mismatches. The length of s is 15. Therefore there are \((20-15+1) = 6\) possible positions of the seed with respect to a read. We shift the seed and pad it with zeros (five extra zeros for each row). Thus we get the following six rows of length 20:

$$\begin{aligned} L_1 &{} = {} \texttt{11010101100101100000},\\ L_2 &{} = {} \texttt{01101010110010110000},\\ L_3 &{} = {} \texttt{00110101011001011000},\\ L_4 &{} = {} \texttt{00011010101100101100},\\ L_5 &{} = {} \texttt{00001101010110010110},\\ L_6 &{} = {} \texttt{00000110101011001011}. \end{aligned}$$

A seed is valid if for any two positions of mismatches within a read, there is at least one row of the matrix with both 0-elements at the given columns. For example, we choose columns 7 and 12, then we may pick up the third row with both 0-elements (underlined):

$$\begin{aligned} \mathtt{001101\underline{0}1011\underline{0}01011000} \end{aligned}$$

Note that 0-elements can be from the original seed or the padded ones. For example, if we choose columns 5 and 16, then the last row of the matrix has both 0-elements (the first 0-element is the padded element):

$$\begin{aligned} \mathtt{0000\underline{0}1101010110\underline{0}1011} \end{aligned}$$

However, if we choose columns 4 and 13, then for each row of the matrix there is at least one 1-element:

$$\begin{aligned} \mathtt{110\underline{1}01011001\underline{0}1100000}\\ \mathtt{011\underline{0}10101100\underline{1}0110000}\\ \mathtt{001\underline{1}01010110\underline{0}1011000}\\ \mathtt{000\underline{1}10101011\underline{0}0101100}\\ \mathtt{000\underline{0}11010101\underline{1}0010110}\\ \mathtt{000\underline{0}01101010\underline{1}1001011} \end{aligned}$$

Therefore seed \(s=\texttt{110101011001011}\) is not a valid full sensitivity seed for \(n_m=2\), \(n_r =20\). To check validity we should check all possible combinations, for the above example we have to check \(C_{20}^2\equiv 20!/2!(20-2)! = 190\) cases, where \(m!\equiv 1\cdot 2\cdot 3 \cdots m\). For example, we have two sequences AAAAAAAAAAAAAAAAAAAA (reference sequence) and AAATAAAAAAAATAAAAAAA (read). For the reference sequence we create its library of records:

$$\begin{aligned}&(1, \texttt{AAAAAAAAA}),\quad (2, \texttt{AAAAAAAAA}),\quad (3, \texttt{AAAAAAAAA}),\\&(4, \texttt{AAAAAAAAA}),\quad (5, \texttt{AAAAAAAAA}),\quad (6, \texttt{AAAAAAAAA}), \end{aligned}$$

i.e. all “values” are the same. The read provides us with the following six “values”:

$$\begin{aligned}&\texttt{AATAAAAAA},\qquad \texttt{AAAAAATAA},\qquad \texttt{ATAAAAAAA},\\&\texttt{TAAAAAAAA},\qquad \texttt{AAAAATAAA},\qquad \texttt{AAAATAAAA}. \end{aligned}$$

There is no read’s “value” equal to “values” in the library.

Now consider a general case. Let there be a seed s of length \(n_s\). We want to check if the seed meets the full sensitivity requirement for any read of length \(n_r\) and a maximum number of \(n_m\) mismatches. Suppose \(n_m\) and \(n_r\) are set. Then we create \(n_m\)-vector of indices \(i_k\), \(k=1,2,\ldots ,n_m\) such that \(1\le i_k\le n_r\). By definition, the length of a seed is not greater than the length of a read, \(n_s\le n_r\). We may shift the first element of a seed by \(\delta\) with respect to the first element of the read. As the last element of the seed should be within the read’s elements, then \(\delta\) can vary from 0 to \((n_r-n_s)\). For each value of \(\delta\), we check that none of the 1-elements of the seed shifted by \(\delta\) has indices \(i_k\), \(k=1,2,\ldots ,n_m\). If for any possible combination of indices \(i_k\), there is at least one value of \(\delta\) (depending on \(i_k\), \(k=1,\ldots ,n_m\)), the requirement is met, then the seed is a valid seed for given \(n_m\) and \(n_r\).

Parameter \(\delta\) has \((n_r-n_s+1)\) values. By padding the seed vector with 0-elements from left and right, we can form \((n_r-n_s+1)\) vectors of length \(n_r\). We pad from left and right and the total number of 0-elements to be used for each vector is \((n_r-n_s)\). One may perform padding differently; however, it may be more convenient to have the first element of the seed aligned with the first element of the read for the first vector. The next vectors are just the previous vectors padded by one 0-element from the left (as is done above). Or we may align the last elements of the seed and read and pad by 0-element from the right.

In any case, for a given seed, we form \((n_r-n_s+1)\) vectors of length \(n_r\). We generate all possible combinations of \(i_k\) indices. There are \(n_m\) indices, and we want them to be different. Thus we get

$$\begin{aligned} C_{n_r}^{n_m} \equiv \frac{n_r!}{n_m! (n_r- n_m)!} \end{aligned}$$

(3)

such combinations. If, for any of these combinations, all \((n_r-n_m+1)\) vectors/rows contain at least one 1-element for chosen indices \(i_k\), then the seed cannot be used. Otherwise, the seed meets the requirements, i.e. it is a valid seed.

Now we discuss how to perform seed validation as vector operations. As it is shown above, by padding seed s of length \(n_s\) with 0-elements from the left and right, we create \((n_r – n_s+1)\) rows \(L_p\) of length \(n_r\). If there are \(n_m\) mismatches, then we need to process \(C_{n_r}^{n_m}\) cases. We create a vector V for each case, so all its elements are 1s except \(n_m\) elements, which are 0-elements. For the first example, we choose elements 7 and 12, so one can write the corresponding \(V_1\) vector as \(V_1=\texttt{11111101111011111111}\). Then one needs to check that \(V | L_p\) equals V for at least one index \(p=1,\ldots ,(n_r-n_s+1)\). For the above example, we check \(L_3\):

$$\begin{aligned} \begin{array}{lcl} V_1 &{}= &{} \texttt{11111101111011111111},\\ L_3 &{} = &{} \texttt{00110101011001011000},\\ V_1 | L_3 &{} = &{} \texttt{11111101111011111111}, \end{array} \end{aligned}$$

so we get \(V_1 | L_3 = V_1\). Note that logical OR operations are bitwise, i.e., applied to each element of a vector.

The third example (elements 4 and 13) gives us vector \(V_3 = \texttt{11101111111101111111}\), then for all vectors \(L_p\), \(p=1,\ldots ,6\), we find \(V_3|L_p\):

$$\begin{aligned} \begin{array}{lclcl} V_3 | L_1 &{} = &{} \texttt{11111111111101111111} &{} \ne &{} V_3,\\ V_3 | L_2 &{} = &{} \texttt{11101111111111111111} &{} \ne &{} V_3,\\ V_3 | L_3 &{} = &{} \texttt{11111111111101111111} &{} \ne &{} V_3,\\ V_3 | L_4 &{} = &{} \texttt{11111111111101111111} &{} \ne &{} V_3,\\ V_3 | L_5 &{} = &{} \texttt{11111111111111111111} &{} \ne &{} V_3,\\ V_3 | L_6 &{} = &{} \texttt{11111111111111111111} &{} \ne &{} V_3. \end{array} \end{aligned}$$

Therefore, one cannot use the seed for the given combination of indices, and, as a result, it cannot provide us with full sensitivity. Only when the procedure is successful for all \(C_{n_r}^{n_m}\) combinations, then the seed is valid.

It is clear that if we pad all vectors \(L_p\) and V with 1-elements, then the validation criterion is still valid since \(1 | 1 = 1\). Therefore we may use various SIMD intrinsics to perform logical OR operations on 128-bit numbers.

There is an alternative approach to validating seeds. We create \(n_r\) binary vectors (columns) \(U_t\), \(t=1,\ldots , n_r\), of length \((n_r-n_m+1)\), see Fig. 1. As before, we may pad them with 1-elements to form numbers of a given length, e.g. 32-, 64- or 128-bit numbers. The task is to consider all \(C_{n_r}^{n_m}\) combinations of columns \(U_t\), perform logical OR operations, i.e.

$$\begin{aligned} U_{j_1} | U_{j_2} |\ldots | U_{j_{n_m}}, \end{aligned}$$

(4)

and check if the resultant column has all 1-elements (let us call it the saturated vector).

Fig. 1
figure 1

Seed validation based on columns. When the requirement is met, the row is in green colour, otherwise, in red colour

Now we try to reduce the number of columns. We may identify the same columns and leave only different ones (remove 10 out of 36 columns in the example in Fig. 2). So, as \(n_m = 4\) and \(n_r = 36\), then instead of \(C^4_{36} = 58905\) combinations, we should check only \(C^4_{26}=14950\). The next step is identifying those columns that are subsets of other columns. We denote \(U_k \subset U_m\) if \(U_k | U_m\) is \(U_m\). This means that all 1-elements of column \(U_k\) are also 1-elements of \(U_m\) (however, there may be positions such that element q of \(U_m\) is 1-element but element q of \(U_k\) is 0-element). For example, \(U_{14}=(\texttt{0010000100})^T\), \(U_{19}=(\texttt{0010001100})^T\) and \(U_{14}\subset U_{19}\). If there is a combination of vectors \(U_k\) that includes \(U_{14}\) and provides us with the saturated resultant vector, then the same combination but with \(U_{19}\) also provides us with the saturated vector. So, we may exclude \(U_{14}\). Removing columns that are subsets of other columns allows us to decrease the number of combinations further. For the example in Fig. 2, we excluded extra 16 columns, so the total number of combinations is \(C^{4}_{36-10-16} = C^{4}_{10} = 210\), i.e. the number of combinations we need to check is 280 times fewer compared to the original case.

Fig. 2
figure 2

Reducing the number of combinations when checking seed’s validity

We may use similar approaches when forming resultant columns. Suppose \(n_m > 2\). As \(U_{i_1} | U_{i_2} = U_{i_2} | U_{i_1}\) and \(U_{i_1} | U_{i_1} = U_{i_1}\), then we vary \(i_1\) from 1 to \((n_r-1)\) and then \(i_2\) from \((i_1 + 1)\) to \(n_r\), so we consider \(n(n-1)/2\) combinations. \(U_t\) are binary vectors, so the resultant vectors or OR operations. So all vectors can be considered as numbers in binary representation and sorted in ascending/descending order. Likely, some of the \(n(n-1)/2\) resultant vectors are the same, so we exclude them. Therefore by keeping resultant vectors for intermediate steps and checking if new vectors (i.e. \(U_{i_3}\)) are subsets of vectors from the set of intermediate resultant vectors (i.e. \(U_{i_3} \subset \left( U_{i_1} | U_{i_2}\right)\)), we may speed up processing.

Seed expansion

The first and last elements of a seed are 1-elements. There is only one seed of length 1 (1), one seed of length 2 (11), two seeds of length 3 (101, 111), four seeds of length 4 (1001, 1011, 1101, 1111). Therefore for a read of length \(n_r\), there are

$$\begin{aligned} 1 + 1 + 2 + 4 + \ldots + 2^{n_r-2} = 2^{n_r-1} \end{aligned}$$

(5)

seeds in total. As validation of each seed is also quite time-consuming, we should identify ways to reduce the number of candidate seeds. For example, seed 10001110100100011101 is valid for \(n_r=30\) and \(n_m=3\). Any subset of this seed is also valid (seeds of a lower weight). So, 1110100100011101 (the first four elements of the original seed are removed), 100011101001000111 (the last two elements), 10001010100100010101 (two random 1-elements are removed) are also valid seeds. Therefore, we can implement the following procedure.

  1. 1.

    Suppose we have already found all valid seeds of length less or equal to k.

  2. 2.

    We pad all these seeds with 0-elements from their right ends, so we get vectors of length k.

  3. 3.

    We concatenate them with 1-element (also at the right ends of the padded seeds).

  4. 4.

    Before applying the validation procedure, we form a seed by removing the first 1-element and all 0-elements from the left end and checking that the seed is in the list of valid seeds.

For example, if an original valid seed was 100111011 (of length 9), and we aim to find valid seeds of length 13, then the extended seed is 1001110110001, and we need to check if 1110110001 is already in the list of valid seeds.

Clear that a seed is valid if and only if the reverse seed (the order of elements is reversed) is valid. So, seeds 1001110110001 and 1000110111001 are valid (or are not valid) simultaneously.

Periodic seeds

The above procedure allows us to reduce the number of candidate seeds. However, there are still a lot of seeds to be validated, so finding all possible seeds for a read’s length of more than 45 is very slow. However, we can observe a very important property. When for given \(n_r\) and \(n_m\) we find all seeds of maximum weight, then there are almost always periodic seeds such that

$$\begin{aligned} n_r = n_s + T – 1, \end{aligned}$$

(6)

where T is the size of the periodic block. Such seeds have a whole number \(n_b\) of these periodic blocks and the “remainder” (the first \(n_d > 0\) elements of the block), so \(n_s = n_b \cdot T + n_d\).

For example, for \(n_r=17\) and \(n_m=3\) we get three pairs of seed: 111011, 1101000011, 1100100011 (and the reversed seeds). For the first seed, we have \(\texttt{111011} = \texttt{1110} + \texttt{11}\), so \(T=4\) and \(n_s = 6\). We check that \(n_s+T-1 = 6+4-1 = 9\ne 17 = n_r\). However, for the second seed we get \(\texttt{1101000011} = \texttt{11010000} + \texttt{11}\), \(T = 8\), \(n_s=10\), so \(n_s+T-1 = 10 + 8 -1 = 17 = n_r\). Thus, for four seeds Eq. (6) is met but for other two seeds it is not.

There may be seeds of different period T for the same length \(n_r\) of a read. For example, in Fig. 3 there are seeds found for \(n_r=43\), \(n_m=4\). The maximum weight is \(w=12\). We get two pairs of seeds for \(T=19\), three pairs for \(T=17\) and one for \(T=13\).

Fig. 3
figure 3

All seeds of maximum weight (12) found for reads of length 43, number of mismatches is 4. Reversed seeds are not shown

Note that when one can find such seeds, other (shorter) seeds are often available. There are also several exceptions from the observation. For example, for some values of \(n_r\) and \(n_m\), we may have seeds such that (6) is valid for smaller values of \(n_r\), e.g. \(n_r-1\), \(n_r-2\). However, there may be cases when the formula is not true for the best seeds. See Table 1. For example, if \(n_r=16\), \(n_m=5\) the best seeds are 1101 and 1011 (\(n_s=4\) and \(T=3\)), so \(n_s+T-1 = 4+3-1 = 6\ne 16\). However, for 80% of read lengths, formula (6) is valid for a given value of \(n_r\), for 87% is true for smaller values of \(n_r\).

Table 1 Exceptions from the observed formula

Suppose there is a periodic seed such that \(n_s = n_b T + n_d\) and formula (6) is true. We want to check if a seed is valid for a given \(n_m\); see Fig. 4. We need to generate indices \(j_k\), \(k=1,\ldots ,n_m\) such that \(1\le j_k\le n_r\) and check if the resultant column \(U_{j_1} | U_{j_2} |\ldots |U_{n_m}\) is the saturated vector. A periodic block should meet the same requirements if the seed is valid. For this purpose, we choose \(j_k\) such that \((1+ T)\le j_k \le 2T\). Now we assume that the resultant column for the periodic block is not the saturated one for all possible combinations of indices. For any index \(j_k\), \(1\le j_k\le n_r\), we may write down \(j_k = j_k^* + m\cdot T\), where \((1+T)\le j_k^*\le 2T\) and m is an integer number. Therefore \(U_{j_k} \subset U_{j^*_k}\) and instead of \(U_{j_k}\) we may consider \(U_{j_k^*}\).

Fig. 4
figure 4

A seed is valid only when its periodic block is valid

Periodic blocks

We validate a periodic block as the whole seed:

  • generating all possible combinations of indices \(j_k\), \(k=1,\ldots ,n_m\), such that \(1\le j_k\le T\);

  • checking if the resultant column is the saturated one.

A periodic block is valid if and only if its reverse is valid. We may also perform a cyclic rotation of elements of the block; those new blocks are valid at the same time. As a result, we have groups of 2T periodic blocks of length T (some may be identical) that are valid/not valid simultaneously. We call them equivalent blocks. Therefore it is enough to consider the validity of only one block. To reduce the number of blocks to be considered, we require that the first element is always 0-element and the last element is always 1-element. See examples of equivalent blocks generated for a given periodic block in Fig. 5.

Fig. 5
figure 5

Equivalent periodic blocks generated for block 1011101001000100111000011010100010101. Only blocks with the first 0-element and last 1-element are shown. The bottom part is for the reverse block. The circled block is the one we choose from the group

We need a procedure to choose only one block instead of 2T blocks. Of course, for some blocks (e.g. 00101101), there may be cases of identical blocks obtained via reversion/cyclic rotation. By varying indices \(j_k\), we create patterns of 0-elements. Those patterns should be within the periodic block (or one of the equivalent blocks obtained by cyclic rotation) since there should be a row such that the corresponding elements of the chosen columns are all 0-elements. Therefore each periodic block must have a contiguous chunk of \(n_m\) 0-elements. Thus when we generate various periodic blocks, we may always assume that the first \(n_m\) elements are 0-elements.

To choose one block, we pick up the one with the highest number of 0-elements at the beginning. Among blocks obtained for the reverse block, there will also be a block with the maximum number of 0-elements. Then, we need to perform a further comparison. We may count the number of 1-elements in contiguous blocks from the left/right side of the zero-block and choose the block with fewer 1-elements. If the contiguous blocks are the same, then consider neighbouring blocks of 0-elements (and choose the smallest one). We repeat the procedure if it is needed.

Algorithm 1 allows us to validate a block, i.e. check if it can be used for reads of at most \(n_m\) mismatches. Suppose we have a binary vector s of length T. By performing a cyclic rotation of the vector when the last n elements of the vector become the first elements of a new vector (the order of elements is preserved), we generate T vectors \(\eta _i\) of length T. Then we just need to consider all possible combinations of \(n_m\) vectors and check if the resultant vector obtained by applying logical bitwise OR operation is the saturate vector, i.e. the vector with all 1-elements. For this purpose we create \(n_m\) indices \(k_p\), \(p=1,\ldots , n_m\). When we have \(n_m\) binary vectors and apply OR operation, then the order of vectors does not matter. Therefore we may assume that \(k_p<k_{p+1}\), \(p=1,\ldots ,(n_m-1)\). So, we initialise the indices as \(k_p=p\), \(p=1,\ldots ,n_m\), and to create a new set of indices we increment the last index \(k_{n_m}\). When \(k_{n_m}\) reaches the maximum value (T), we start incrementing the previous index \(k_{n_m-1}\) and reset \(k_{n_m}\) to the smallest value permitted. This procedure is applied in a similar way to other indices. Clearly, that if after incrementing of \(k_p\) by one we get \(k_{p}\) equal to \((T+p-n_m+1)\), then we should increment \(k_{p-1}\) are reset all other indices \(k_p, k_{p+1}, k_{p+2}, \ldots\).

figure a

The algorithm should stop when \(p-1=0\). However, we may stop earlier when \(p-1=1\). This means that we may actually always set \(k_1=1\). Suppose that we have an arbitrary combination of ordered indices \(k_p\), \(p=1,\ldots , n_m\) and form a resultant vector v after applying bitwise OR operations. However we may create another set of indices \({\bar{k}}_p\) such that \({\bar{k}}_p\equiv (k_p-k_1)\%T + 1\). The resultant vector found for \(k_p\) indices is the resultant vector found for \({\bar{k}}_p\) indices but after the cyclic shift by \((k_1-1)\) elements was performed. So, both resultant vectors are (or are not) saturated at the same time.

We store intermediate \(n_m\) resultant vectors \(u_p\) and perform bitwise OR operation with the corresponding \(\eta _{k_p}\) to find \(u_{p+1}\). If \(\eta _{k_p}\) is a subset vector for \(u_p\), then as it was discussed in previous subsections we may ignore \(\eta _{k_p}\) and consider other vectors.

As we want to find seeds of maximum weight, it is reasonable to find periodic blocks of maximum weight. Suppose there are blocks found for weight w. By replacing 1-elements with 0-elements we form periodic blocks of weight \((w-1)\). However, there may also be blocks of weight \((w-1)\) that cannot be formed by replacing 1-elements in blocks of weight w. For example, there are five blocks for \(T=11\), \(n_m=2\) and weight \(w = 7\): 00011011111, 00101011111, 00101110111, 00101111011, 00110101111. Block 00010110111 of weight \(w=6\) cannot be formed from those blocks. Seeds are formed of an integer number of periodic blocks and a “remainder”. Therefore, there is a possibility that the best seeds are formed of blocks of non-maximum weight. However, when we generated seeds of non-maximum weight, they never formed seeds of weight larger than seeds formed from maximum-weight blocks.

Thus to generate all seeds of maximum weight for blocks of length T and \(n_m\) mismatches we set maximum weight w (at most \(T-n_m\)) and consider all binary vectors such that the first \(n_m\) elements are zeros and the last element is one. Then validate these seeds using Algorithm 1. If no seeds are found for a given w, then reduce w by 1 and generate a new set of candidate vectors.

All ideas mentioned above allow us to reduce the number of blocks to be validated. It is also possible to parallelise processing so each CPU thread validates only specific seeds from a pre-generated list. Together with SIMD instructions for validation steps, we sped up the generation of periodic blocks. It may take less than a second for \(n_r < 35\), however, finding periodic blocks for \(n_r\approx 50\) may still take hours as trillions of blocks should be validated.

Algorithm 1 for validation of seed blocks can be implemented on various computational architectures. The authors implemented it using various SIMD operations. The algorithm’s performance will depend on input parameters and possible optimisation done by a compiler and specific elementary operations. This algorithm has five major operations: OR, XOR, binary shift and extraction of a number applied to 128-bit numbers (all SIMD operations) and elementary addition (as increment/decrement operations) for 32-bit numbers. We may ignore the time spent on other operations. In principle, the number of XOR, binary shift and extraction operations is the same for this algorithm. So, validating a single block will take about \(O(n_{\text{OR}} + 3 n_{\text{XOR}} + n_{add})\) elementary operations. Thus, we mention only OR (\(n_{\text{OR}}\)), XOR (\(n_{\text{XOR}}\)) and elementary addition (\(n_{add}\)) operations. These numbers are in Table 2. We may see that the ratio \(n_{\text{XOR}}/n_{\text{OR}}\) is almost the same (\(\approx 0.92\)), while \(n_{add}/n_{\text{OR}}\) is increased with \(n_m\) (from 0.3 for \(n_m=2\) to 0.9 for \(n_m=9\)). The total number of OR operations may differ for different T; however, we usually see a five-fold increase when \(n_m\) is increased by one. The main issue when generating all possible seed blocks is exponential (as a function of T, different functions for different values of \(n_m\), e.g. \(\sim T^{12}\) or \(T^{20}\)) increase of the number of blocks to be validated even after various procedures to reduce the number of candidate blocks.

Table 2 Number of operations used for Algorithm 1: number of mismatches (\(n_m\)), length of a periodic block (T), total number of seeds to be validated (# tests), averaged numbers of SIMD OR (\(n_{\text{OR}}\)), XOR (\(n_{\text{XOR}})\) and standard addition (\(n_{add}\)) operations for each binary block as an absolute number or as percentage of \(n_{\text{OR}}\) operations

The times needed to generate periodic blocks depend on CPU architecture. Run times for Intel Core i5-9600K processor (6 cores, 12 threads, base frequency 3.70 GHz) can be found in Table 3 and Fig. 6. One may see that calculations for \(n_m=4\) are the slowest, and in this case, the run times increase exponentially as \(2.85\cdot 10^{-10}\cdot 2.03^T\) (in seconds). If blocks are less than 30, we can perform all calculations in less than a second. When blocks have 40 elements, we need an hour to complete the task, while 50 elements may require us to spend a week or use a CPU with tens of cores.

Fig. 6
figure 6

Times required to generate periodic blocks of maximum weight for a given number of mismatches \(n_m\) and a known block size

Table 3 Calculation times (in seconds) for Algorithm 1 for Intel Core i5-9600K processor (\(n_m\) is the number of mismatches, T is the size of the periodic block)

Forming periodic spaced seeds

The final step of PerFSeeB approach (periodic full sensitivity blocks) is to form spaced seeds of maximum weight. For this purpose, we consider all periodic blocks formed for a given number \(n_m\) of mismatches. As \(n_s = n_b T + n_d\), \(n_b \ge 1\), \(1 \le n_d < T\), and \(n_r = n_s + T-1\), then \(n_r = n_b T + n_d + T – 1 = (n_b + 1)T +(n_d-1) \ge 2T\) or \(T \le n_r/2\). For each periodic block we form all equivalent blocks and check if

  1. 1.

    the first symbol of the equivalent block is 1-element,

  2. 2.

    the \(n_d\)-th element is also 1-element.

In principle, there is no need to consider blocks obtained from the reverse periodic block as the final seed will be a reverse of seeds formed from the original block. Once the seed is formed we count its weight. By processing all periodic blocks we find seeds of maximum weight. The detailed procedure is shown in Algorithm 2. We assume that we already have files with periodic blocks found for a set of lengths, from \(T_{min}\) to \(T_{max}\), and valid for at most \(n_m\) mismatches. After application of Algorithm 2 we get a list \(\Omega\) of \(n_{sol}\) periodic seeds of maximum weight. Note that for the same weight w and the number of mismatches \(n_m\) we may have seeds made of periodic blocks of different length/weight. Time required to run Algorithm 2 is the sum of times to process all blocks available for given values of \(n_m\) and T. While the number of valid blocks is different for a pair of \(n_m\) and T (from single entries to a million), it takes at most 5 s on Intel’s i5-9600K CPU to find the best seed for a given length of a read.

figure b



Source link