The pipeline \(\text{O} \textsc{rtho} \text{FFGC}\) (with optimal capping) was previously integrated into the \(\textsc {FFGC}\) workflow [12, 22], which includes the pre-computation of gene similarities, allowing therefore the automatic generation of families directly from the genome data. We implemented the new pipeline \(\text{O} \textsc{rtho} \text{FFGC}\!\!\thickapprox\) (with heuristic capping) as another extension of the same workflow. The optional \(\textsc {mcl}\) step for refining ambiguous families is integrated to both \(\text{O} \textsc{rtho} \text{FFGC}\) and \(\text{O} \textsc{rtho} \text{FFGC}\!\!\thickapprox\). The implementation and its documentation can be downloaded from our GitLab server at https://gitlab.ub.uni-bielefeld.de/gi/FFGC or as a Conda package at https://anaconda.org/bioconda/ffgc. (Note that mcl is an external dependency, automatically installed only if the download is obtained from Conda.)

An important recent modification of our pipeline is that now it performs pairwise similarity computations by default via \(\textsc {diamond}\) [23] and no longer via \(\textsc {blast}\) [24]. This change and some further optimizations improved greatly the running times of \(\text{O} \textsc{rtho} \text{FFGC}\!\!\thickapprox\), that are now closer to the fastest alternative tools. Suprisingly, adopting \(\textsc {diamond}\) also produced a small improvement of the quality of our gene families, compared to the version previously published [18]. The current values for quality and running times can be found in the end of this section.

### Alternative inference tools used for comparison purposes

\(\text{P} \textsc {rotein} \text{O} \textsc {rtho}\) and \(\textsc {Poff}\). \(\text{P} \textsc {rotein} \text{O} \textsc {rtho}\) [16] is a fast tool that clusters genes to find significant orthologous groups, based on a heuristic of reciprocal best hits of the corresponding sequence similarities. Its \(\textsc {Poff}\) extension [17] takes into account gene adjacencies as an additional criterion for the discrimination of orthologs.

\(\textsc {Oma}\). Based on sequence similarities and on phylogeny, \(\textsc {Oma}\) [25] is the underlying tool of the homonym online orthology browser. The standalone tool allows custom genomes to be compared to infer orthologous groups. Two types of families are reported: *hierarchical orthologous groups* (\(\textsc {OmaHOGs}\)), which may include ambiguous families, and *resolved groups* (\(\textsc {OmaGroups}\)).

### Gene similarities in \(\text{O} \textsc{rtho} \text{FFGC}\)/\({\textsc {OrthoFF}\textsc {GC}}\!\!\thickapprox\)

The computation of gene similarities is done via \(\textsc {FFGC}\) and is described as follows. For any given pair of genes *x* and *y*, the software diamond is used for computing the value \(\textsc {bitscr}(x\rightarrow y)\), that corresponds to the bitscore of gene *x* *with respect to* gene *y*. If gene *x* is in genome \({\mathbb {A}}\) and gene *y* is in genome \({\mathbb {B}}\), both \(\textsc {bitscr}(x\rightarrow y)\) and \(\textsc {bitscr}(y\rightarrow x)\) must be taken into consideration for calculating the similarity \(\sigma (x,y)\), that is equivalent to their *relative reciprocal score* [26]:

$$\begin{aligned} \sigma (x,y)=\frac{\textsc {bitscr}(x\rightarrow y) + \textsc {bitscr}(y\rightarrow x)}{\textsc {bitscr}(x\rightarrow x)+\textsc {bitscr}(y\rightarrow y)}. \end{aligned}$$

Negligible similarities are then identified and discarded by the filter \(\digamma\), by requiring that (1) for a first given threshold \(\digamma _{\epsilon } \in (0, 1]\), \(\sigma (x, y) \ge \digamma _{\epsilon }\) (*absolute filter*); and (2) for a second given threshold \(\digamma _{t} \in (0, 1]\), the value \(\textsc {bitscr}(x\rightarrow y)\) must be at least a \(\digamma _{t}\)-fraction of the highest bitscore of *x* with respect to any gene in \({\mathbb {B}}\) and the value \(\textsc {bitscr}(y\rightarrow x)\) must be at least a \(\digamma _{t}\)-fraction of the highest bitscore of *y* with respect to any gene in \({\mathbb {A}}\) (*relative minimum reciprocal similarity for additional hits*, which was developed and used in the alternative tools \(\text{P} \textsc {rotein} \text{O} \textsc {rtho}\) [16] and \(\textsc {Poff}\) [17]). Once these conditions are fulfilled, the edge *xy* is added to the gene similarity graph with score \(\sigma (xy)=\sigma (x,y)\). The adopted values are \(\digamma _{\epsilon }=0.1\) and \(\digamma _{t}=0.8\), the latter also being adopted for \(\text{P} \textsc {rotein} \text{O} \textsc {rtho}\) and \(\textsc {Poff}\) whenever these tools were used in our experiments.

The default values of the other parameters for the pre-computation of gene similarities via \(\textsc {FFGC}\) were kept, except for one of them: the minimum number of genomes for which each gene must share some similarity in is set to 1, otherwise genes not similar to any other gene, which should be still considered in indels, would not appear in the chromosomal gene order.

### Computational environment and additional parameters of \(\text{O} \textsc{rtho} \text{FFGC}\!\!\thickapprox\)

We ran experiments in a 2.7GHz multi-core machine. Whenever possible, tasks ran using 8 cores. As an ILP solver, we used Gurobi.

For the post-processing refinement of ambiguous families in \(\text{O} \textsc{rtho} \text{FFGC}\!\!\thickapprox\), we used the default parameters of mcl with a conservative “inflation” value of 1.4 suggested in its manual (https://micans.org/mcl).

In the following we will describe our experiments, based on genome assemblies fetched from NCBI. We performed two distinct comparisons. First, based on a set of five completely assembled primate genomes, we compared \(\text{O} \textsc{rtho} \text{FFGC}\!\!\thickapprox\) families (inferred with the heuristic capping) to \(\text{O} \textsc{rtho} \text{FFGC}\) families (inferred with the optimal capping). Then, based on the set of 11 *Drosophilas* including partially assembled genomes, we compared \(\text{O} \textsc{rtho} \text{FFGC}\!\!\thickapprox\) families to the gene families inferred by other tools, using \(\text{F} \textsc {ly} \text{B} \textsc{ase}\) families as reference.

### Analysis of completely assembled primate genomes: comparing \(\text{O} \textsc{rtho} \text{FFGC}\!\!\thickapprox\) to \(\text{O} \textsc{rtho} \text{FFGC}\)

The goal of this experiment is to evaluate the impact of the heuristic capping on running times and on the quality of results, with respect to the optimal capping in a dataset of large genomes. We compared families inferred by \(\text{O} \textsc{rtho} \text{FFGC}\) and \(\text{O} \textsc{rtho} \text{FFGC}\!\!\thickapprox\), considering the following five primate genomes: bonobo (*P. paniscus*), chimpanzee (*P. troglodytes*), gorilla (*G. gorilla*), human (*H. sapiens*) and orangutan (*P. abelii*). Those genomes comprise roughly \(20{,}000 \sim 22{,}000\) genes distributed in 25 chromosomes, except for the human genome that has 24 chromosomes. The average number of edges (similarities) for each vertex (gene) in the gene similarity graphs is 1.09, totaling 429268 vertices and 234297 edges. Considering only the genes with multiple similarities, the average degree is 2.93.

For the capping heuristic in \(\text{O} \textsc{rtho} \text{FFGC}\!\!\thickapprox\), we set \(\epsilon =0.1\) and \(\tau = 2\), a choice that reduces significantly the number of capping sets, but still gives some options to the ILP solver. With these choices, the largest number of edges in \({\widehat{\mathcal {C}}}({\mathbb {A}},{\mathbb {B}})\) was for the comparison of gorilla and human. In this case, \({\widehat{\mathcal {C}}}\) has 27 edges distributed among 21 components with 2 vertices, and 2 components with 4 vertices (including 1 dummy segment). That corresponds to at most \(\sim 1.1 \times 10^6\) capping-sets in \(\theta _\thickapprox\), while the pairwise comparison with the optimal \(\theta _\star\) has 50! capping-sets.

This significant reduction of the search space enabled 7 out of 10 pairwise comparisons with the heuristic capping to finish within the time limit of 60 min, in contrast to the lengthy comparisons with the optimal capping, that were limited to 1440 min (24 h). The results are shown in Table 1, where we can also see that the size of the ortholog-sets \({\mathcal {O}}\), the number of DCJ-indel operations \(\textrm{d} _\textsc {dcj} ^\textsc {id}\), and the weighted rearrangement distances \(\textrm{wd} _\textsc {dcj}^\textsc {id}\) of the two approaches are almost identical for all pairwise comparisons. While 21090 families were inferred using \(\text{O} \textsc{rtho} \text{FFGC}\), 21101 were inferred using \(\text{O} \textsc{rtho} \text{FFGC}\!\!\thickapprox\), and 99.2% of those families are the same, showing that moving from the optimal capping to the heuristic capping of the family-free relational diagram had very little impact on the gene family inference. The total running time of \(\text{O} \textsc{rtho} \text{FFGC}\!\!\thickapprox\) was of about 321 min (of which 306 min were spent by the pairwise ILP computations). The numbers of classified genes and of inferred families are displayed in Table 2.

This experiment accomplished its goal, by showing that the heuristic capping can have the same quality of the optimal capping, while having substantial impact in speeding up the ILP computations. Indeed, the time required to analyze the dataset of five primates with \(\text{O} \textsc{rtho} \text{FFGC}\!\!\thickapprox\) was quite reasonable. However it was still considerably longer than the fastest alternative tool: the total running time of \(\text{P} \textsc {rotein} \text{O} \textsc {rtho}\) for the same dataset was of about 19 min only, with the corresponding numbers of classified genes and inferred families being also displayed in Table 2. It is worth mentioning that in \(\text{O} \textsc{rtho} \text{FFGC}\!\!\thickapprox\) a resolved family has the additional confidence of resulting from independently computed pairwise ortholog-sets. Therefore it is remarkable that \(\text{O} \textsc{rtho} \text{FFGC}\!\!\thickapprox\) could identify a significantly higher number of resolved complete families, including almost all resolved complete families also inferred by \(\text{P} \textsc {rotein} \text{O} \textsc {rtho}\).

### Analysis of partially assembled *Drosophila* (fruit flies) genomes

The \(\text{F} \textsc {ly} \text{B} \textsc{ase}\) consortium (https://flybase.org) sequenced, assembled and annotated the genomes of 12 *Drosophilas* with \(\sim 12{,}000\)–16,000 protein-coding genes (for genes with multiple transcripts, we kept only the longest), however only 11 of those genomes are available on NCBI together with the complete annotation: *D. ananassae*, *D. erecta*, *D. grimshawi*, *D. melanogaster*, *D. mojavensis*, *D. persimilis*, *D. sechellia*, *D. simulans*, *D. virilis*, *D. willistoni* and *D. yakuba*.

In the family-free analysis, the total number of vertices and edges in the gene similarity graphs are 1514930 and 633521, respectively, with average degrees of 0.83 considering all vertices, and 2.22 looking only at those genes with multiple similarities. For the capping heuristic in \(\text{O} \textsc{rtho} \text{FFGC}\!\!\thickapprox\), we again set \(\epsilon =0.1\) and \(\tau = 2\).

#### Average numbers of cap edges and capping-sets in \(\theta _\star\) and in \(\theta _\thickapprox\)

The analyzed *Drosophila* genomes have 507 contigs on average, therefore each optimally capped family-free relational diagram has \(1{,}014 \times 1{,}014 = 1{,}028{,}196\) cap edges and an unfeasible total of 1, 014! capping-sets on average.

In contrast, considering the perfect shared-content graphs for all pairwise *Drosophila* comparisons, 99.7% of the components in those graphs have only 1 linear segment in each part of the graph. In the remaining 0.3%, 80% have 7 or fewer linear segments in each part, with the largest component having 76 linear segments in each part. The perfect shared-content graphs have an average of 1,419 edges. For that number of edges, each heuristically capped family-free relational diagram has 5,676 cap edges on average. As the exact number of perfect matchings in arbitrary graphs is not trivial to estimate, we computed an upper limit for the average number of distinct capping-sets by the Bregman-Cinc inequality [27] of the permanent of a squared matrix: \(\sim 45!\), with median \(\sim 18!\).

#### Benchmark for our experiments

Reference families were obtained directly from \(\text{F} \textsc {ly} \text{B} \textsc{ase}\) (https://flybase.org). Since the set of genes classified in \(\text{F} \textsc {ly} \text{B} \textsc{ase}\) is slightly different from the set of genes present with their coding sequences in database files, we filtered out a small portion (\(\sim 7\%\)) of genes in \(\text{F} \textsc {ly} \text{B} \textsc{ase}\) families so that only those present in the NCBI databases with their coding sequences were kept. Prior to any comparison of inferred families to \(\text{F} \textsc {ly} \text{B} \textsc{ase}\) families, we also filtered out from the inferred families genes not present in \(\text{F} \textsc {ly} \text{B} \textsc{ase}\) families.

#### Comparing \(\text{O} \textsc{rtho} \text{FFGC}\!\!\thickapprox\) to \(\text{P} \textsc {rotein} \text{O} \textsc {rtho}\), \(\textsc {Poff}\) and \(\textsc {Oma}\)

We analyzed the dataset of 11 *Drosophila* genomes for comparing the performance of \(\text{O} \textsc{rtho} \text{FFGC}\!\!\thickapprox\) against \(\text{P} \textsc {rotein} \text{O} \textsc {rtho}\), \(\textsc {Poff}\) and \(\textsc {Oma}\), providing as additional input to the latter the phylogenetic tree (obtained from FlyBase) of the same 11 *Drosophilas*. Unlocalized contigs were not filtered out, resulting in genomes with 11 to 1041 linear segments.

##### Quality of inferred families

The numbers of families in \(\text{F} \textsc {ly} \text{B} \textsc{ase}\) and those inferred by \(\textsc {OmaHOGs}\), \(\textsc {OmaGroups}\), \(\text{P} \textsc {rotein} \text{O} \textsc {rtho}\), \(\textsc {Poff}\) and \(\text{O} \textsc{rtho} \text{FFGC}\!\!\thickapprox\) are all bigger than 12,000. In order to have a hint on the quality of results, we focus on the intersections with \(\text{F} \textsc {ly} \text{B} \textsc{ase}\) families and on *precision* and *recall* values computed for all methods.

The numbers of classified genes and families inferred by all methods, together with the respective intersections with \(\text{F} \textsc {ly} \text{B} \textsc{ase}\) are shown in Table 3. Figure 7 shows the comparative picture focusing on the numbers of resolved incomplete and complete families for all methods.

We also counted the numbers of pairwise gene homologies that are classified as *true positive* (TP), *false positive* (FP) and *false negative* (FN) as follows. First, denote a subset of size two by *2-subset*. Now let \({\mathcal {H}}_\textsc {fly}\) be the set composed of 2-subsets of all \(\text{F} \textsc {ly} \text{B} \textsc{ase}\) families, and, for any considered set of families *X*, let \({\mathcal {H}}_X\) be the set composed of 2-subsets of all families in *X*. Then TP of *X* is the size of \({\mathcal {H}}_\textsc {fly}\! \cap \!{\mathcal {H}}_X\), FN of *X* is the size of \({\mathcal {H}}_\textsc {fly}\!\setminus \!{\mathcal {H}}_X\) and FP of *X* is the size of \({\mathcal {H}}_X\!\setminus \!{\mathcal {H}}_\textsc {fly}\). Based on that we computed the values of *precision* \(\left( \frac{\text {TP}}{\text {TP}+\text {FP}}\right)\) and *recall* \(\left( \frac{\text {TP}}{\text {TP}+\text {FN}}\right)\) for the sets of families inferred by the considered methods.

The results (Fig. 8) show that, while all methods performed quite well, our tool \(\text{O} \textsc{rtho} \text{FFGC}\!\!\thickapprox\) had the lowest precision. However, when the optional \(\textsc {mcl}\) refinement step is enabled in our pipeline, its overall results were improved, with an increase of the precision that is bigger than the corresponding decrease of the recall, reflected on the increase of the \(F_1\)-score.

##### Running times

We compiled the running times of the four methods in Table 4. The (preprocessing) step 1 of computing the pairwise sequence similarities is required by all methods, but, while \(\textsc {Oma}\) has an internal implementation of the Smith-Waterman algorithm, the other methods used diamond [23], which is the fastest known tool for accomplishing that task.

Having at hand the sequence similarities, \(\text{P} \textsc {rotein} \text{O} \textsc {rtho}\) and \(\textsc {Oma}\) build families taking into consideration alignments and similarities. Additionally, \(\textsc {Oma}\) takes into consideration the provided phylogenetic tree of the 11 *Drosophilas*. The running times of these procedures are given in step 3. For the other methods, we separated in step 2 core procedures that use additional criteria to find pairwise orthologs: \(\textsc {Poff}\) does it via the analysis of synteny by means of conserved gene adjacencies, while \(\text{O} \textsc{rtho} \text{FFGC}\!\!\thickapprox\) generates and solves the ILPs for obtaining pairwise \(\textsc {OrthoFF}\!\!\thickapprox\) gene orthologies.

Adopting \(\textsc {diamond}\) for similarity computations instead of \(\textsc {blast}\) was an important recent modification of our pipeline. This change and the use of more restrictive parameters for the heuristic capping improved greatly the running times of our method compared to its previous version [18]. With these improvements the running times of \(\text{O} \textsc{rtho} \text{FFGC}\!\!\thickapprox\) are now the fastest among the compared tools for the *Drosophila* dataset.

## Add Comment