The code of all our heuristics and for generating data is written in Python and is available at https://github.com/estherjulien/learn2cherrypick. All experiments ran on an Intel Xeon Gold 6130 CPU @ 2.1 GHz with 96 GB RAM. We conducted experiments on both synthetic and real data, comparing the performance of Rand, TrivialRand, ML and TrivialML, using threshold \(\tau =0\). Similar to the training data, we generated two synthetic datasets by first growing a binary orchard network *N* using [22], and then extracting \(\mathcal {T}\) as a subset of the exhaustive trees displayed in *N*. We provide details on each dataset in Sect. “Experimental results“.

We start by analysing the usefulness of tree expansion, the heuristic rule described in Sect. “Improving heuristic TrivialRand via tree expansion“. We synthetically generated 112 instances for each tree set size \(|\mathcal {T}|\in \{5,10,20,50,100\}\) (560 in total), all consisting of trees with 20 leaves each, and grouped them by \(|\mathcal {T}|\); we then ran TrivialRand 200 times (both with and without tree expansion) on each instance, selected the best output for each of them, and finally took the average of these results over each group of instances. The results are in Fig. 6, showing that the use of tree expansion brought the output reticulation number down by at least 16% (for small instances) and up to 40% for the larger instances. We consistently chose to use this rule in all the heuristics that detect trivial cherries, namely, TrivialRand, TrivialML, ML (although ML does not explicitly favour trivial cherries, it does check whether a selected cherry is trivial using feature number 2), and the non-learned heuristic that will be introduced in Sect. “A non-learned heuristic based on important features“.

### Prediction model

The random forest is implemented with Python’s scikit-learn [24] package using default settings. We evaluated the performance of our trained random forest models on different datasets in a holdout procedure: namely, we removed 10% of the data from each training dataset, trained the models on the remaining 90% and used the holdout 10% for testing. The accuracy was assessed by assigning to each test data point the class with the highest predicted probability and comparing it with the true class. Before training the models, we balanced each dataset so that each class had the same number of representatives.

Each training dataset differed in terms of the number *M* of networks used for generating it and the number of leaves of the networks. For each dataset, the number *L* of leaves of each generated network was uniformly sampled from \([2, \max L]\), where \(\max L\) is the maximum number of leaves per network. We constructed LGT networks using the LGT generator of [22]. This generator has three parameters: *n* for the number of steps, \(\alpha\) for the probability of lateral gene transfer events, and \(\beta\) for regulating the size of the biconnected components of the network (called blobs). The combination of these parameters determines the level (maximum number of reticulations per blob), the number of reticulations, and the number of leaves of the output network. In our experiments, \(\alpha\) was uniformly sampled from [0.1, 0.5] and \(\beta =1\) (see [22] for more details).

To generate normal networks we used the same generator with the same parameters, but before adding a reticulation we check if it respects the normality constraints and only add it if it does. Each generated network gave rise to a number of data points: the total number of data points per dataset is shown in Table 3 in Appendix B. Each row of Table 3 corresponds to a dataset on which the random forest can be trained, obtaining as many ML models. We tested all the models on all the synthetically generated instances: we show these results in Figs. 18, 19 and 20 in Appendix C. In Sect. “Experimental results” we will report the results obtained for the best-performing model for each type of instance.

Among the advantages of using a random forest as a prediction model, there is the ability of computing feature importance, shown in Table 4 in Appendix B. Some of the most useful features for a cherry (*x*, *y*) appear to be ‘Trivial’ (the ratio of the trees containing both leaves *x* and *y* in which (*x*, *y*) is a cherry) and ‘Cherry in tree’ (the ratio of trees that contain (*x*, *y*)). This was not unexpected, as these features are well-suited to identify trivial cherries.

‘Leaf distance’ (t,d), ‘LCA distance’ (t) and ‘Depth *x*/*y*’ (t) are also important features. The rationale behind these features was to try to identify reticulated cherries. This was also the idea for the feature ‘Before/after’, but this has, surprisingly, a very low importance score. In future work, we plan to conduct a thorough analysis of whether some of the seemingly least important features can be removed without affecting the quality of the results.

### Experimental results

We assessed the performance of our heuristics on instances of four types: normal, LGT, ZODS (binary non-orchard networks), and real data. Normal, LGT and ZODS data are synthetically generated. We generated the normal instances much as we did for the training data: we first grew a normal network using the LGT generator and then extracted all the exhaustive trees displayed in the network. We generated normal data for different combinations of the following parameters: \(L \in \{20, 50, 100\}\) (number of leaves per tree) and \(R \in \{5, 6, 7\}\) (reticulation number of the original network). Note that, for normal instances, \(|\mathcal {T}| = 2^R\). For every combination of the parameters *L* and *R* we generated 48 instances: by instance group we indicate the set of instances generated for one specific parameter pair.

For the LGT instances, we grew the networks using the LGT generator, but unlike for the normal instances we then extracted only a subset of the exhaustive trees from each of them, up to a certain amount \(|\mathcal {T}| \in \{20, 50, 100\}\). The other parameters for LGT instances are the number of leaves \(L \in \{20, 50, 100\}\) and the number of reticulations \(R \in \{10, 20, 30\}\). For a fixed pair \((L,|\mathcal {T}|)\), we generated 16 instances for each possible value of *R*, and analogously, for a fixed pair (*L*, *R*) we generated 16 instances for each value of \(|\mathcal {T}|\). The 48 instances generated for a fixed pair of values constitute a LGT instance group.

We generated non-orchard binary networks using the ZODS generator [25]. This generator has two user-defined parameters: \(\lambda\), which regulates the speciation rate, and \(\nu\), which regulates the hybridization rate. Following [26] we set \(\lambda = 1\) and we sampled \(\nu \in [0.0001, 0.4]\) uniformly at random. Like for the LGT instances, we generated an instance group of size 48 for each pair of values \((L,|\mathcal {T}|)\) and (*L*, *R*), with \(L \in \{20, 50, 100\}\), \(|\mathcal {T}| \in \{20, 50, 100\}\), \(R \in \{10, 20, 30\}\).

Finally, the real-world dataset consists of gene trees on homologous gene sets found in bacterial and archaeal genomes, was originally constructed in [27] and made binary in [3]. We extracted a subset of instances (Table 2) from the binary dataset, for every combination of parameters \(L\in \{20, 50, 100\}\) and \(|\mathcal {T}|\in \{10,20,50,100\}\).

For the synthetically generated datasets, we evaluated the performance of each heuristic in terms of the output number of reticulations, comparing it with the number of reticulations of the network *N* from which we extracted \(\mathcal {T}\). For the normal instances, *N* is the optimal network [23, Theorem 3.1]; this is not true, in general, for the LGT and ZODS datasets, but even in these cases, *r*(*N*) clearly provides an estimate (from above) of the optimal value, and thus we used it as a reference value for our experimental evaluation.

For real data, in the absence of the natural estimate on the optimal number of reticulations provided by the starting network, we evaluated the performance of the heuristics comparing our results with the ones given by the exact algorithms from [3] (TreeChild) and from [7] (Hybroscale), using the same datasets that were used to test the two methods in [3]. These datasets consist of rather small instances (\(|\mathcal {T}|\le 8\)); for larger instances, we run TrivialRand 1000 times for each instance group, selected the best result for each group, and used it as a reference value (Fig. 10).

We now describe in detail the results we obtained for each type of data and each of the algorithms we tested.

#### Experiments on normal data

For the experiments in this section we used he ML model trained on 1000 normal networks with at most 100 leaves per network (see Fig. 18 in Appendix C). We ran the machine-learned heuristics once for each instance and then averaged the results within each instance group (recall that one instance group consists of the sets of all the exhaustive trees of 48 normal networks having the same fixed number of leaves and reticulations). The randomised heuristics Rand and TrivialRand were run \(\min \{x(I), 1000\}\) times for each instance *I*, where *x*(*I*) is the number of runs that can be executed in the same time as one run of ML on the same instance. We omitted the results for LowPair because they were at least 44% worse on average than the worst-performing heuristic we report.

In Fig. 7 we summarise the results. Solid bars represent the ratio between the average reported reticulation number and the optimal value, for each instance group and for each of the four heuristics. Dashed bars represent the ratio between the average (over the instances within each group) of the best result among the \(\min \{x(I), 1000\}\) runs for each instance *I* and the optimum.

The machine-learned heuristics ML and TrivialML seem to perform very similarly, both leading to solutions close to optimum. The average performance of TrivialRand is around 4 times worse than the machine-learned heuristics; in contrast, if we only consider the best solution among the multiple runs for each instance, they are quite good, having only up to 49% more reticulations than the optimal solution, but they are still at least 4% worse (29% worse on average) than the machine-learned heuristics’ solutions: see the right graph of Fig. 7.

The left graph of Fig. 7 shows that the performance of the randomised heuristics seems to be negatively impacted by the number of reticulations of the optimal solution, while we do not observe a clear trend for the machine-learned heuristics, whose performance is very close to optimum for all the considered instance groups. Indeed, the number of existing phylogenetic networks with a certain number of leaves grows exponentially in the number of reticulations, thus making it less probable to reconstruct a “good” network with random choices. This is consistent with the existing exact methods being FPT in the number of reticulations [3, 28].

The fully randomised heuristic Rand always performed much worse than all the others, indicating that identifying the trivial cherries has a great impact on the effectiveness of the algorithms (recall that ML implicitly identifies trivial cherries).

#### Experiments on LGT data

For the experiments on LGT data we used the ML model trained on 1000 LGT networks with at most 100 leaves per network (see Fig. 19 in Appendix C). The setting of the experiments is the same as for the normal data (we run the randomised heuristics multiple times and the machine-learned heuristics only once for each instance), with two important differences.

First, for LGT data we only take proper subsets of the exhaustive trees displayed by the generating networks, and thus we have two kinds of instance groups: one where in each group the number of trees extracted from a network and the number of leaves of the networks are fixed, but the trees come from networks with different numbers of reticulations; and one where the number of reticulations of the generating networks and their number of leaves are fixed, but the number of trees extracted from a network varies.

The second important difference is that the reference value we use for LGT networks is not necessarily the optimum, but it is just an upper bound given by the number of reticulations of the generating networks which we expect to be reasonably close to the optimum (see Sect. “Obtaining training data“).

The results for the LGT datasets are shown in Fig. 8. Comparing these results with those of Fig. 7, it is evident that the LGT instances were more difficult than the normal ones for all the tested heuristics: this could be due to the fact that the normal instances consisted of all the exhaustive trees of the generating networks, while the LGT instances only have a subset of them and thus carry less information.

The machine-learned heuristics performed substantially better (up to 80% on average) than the best randomised heuristic TrivialRand in all instance groups but the ones with the smallest values for parameters \(R,|\mathcal {T}|\) and *L*, for which the performances are essentially overlapping. On the contrary, the advantage of the machine-learned methods is more pronounced when the parameters are set to the highest values. This is because the larger the parameters, the more the possible different networks that embed \(\mathcal {T}\), thus the less likely for the randomised methods to find a good solution.

From the graphs on the right of Fig. 8, it seems that the number of reticulations has a negative impact on both machine-learned and randomised heuristics, the effect being more pronounced for the randomised ones. The effect of the number of trees \(|\mathcal {T}|\) on the quality of the solutions is not as clear (Fig. 8, left). However, we can still see that the trend of ML and TrivialRand is the same: the “difficult” instance groups are so for both heuristics, even if the degradation in the quality of the solutions for such instance groups is less marked for ML than for TrivialRand.

#### Experiments on ZODS data

For the experiments on ZODS data we used the ML model trained on 1000 LGT networks with at most 100 leaves per network (see Fig. 20 in Appendix C). The setting of the experiments is the same as for the LGT data, and the results are shown in Fig. 9.

At first glance, the performance of the randomised heuristics seems to be better for ZODS data than for LGT data (compare Figs. 8 and 9), which sounds counterintuitive. Recall, however, that all the graphs show the ratio between the number of reticulations returned by our methods and a reference value, i.e., the number of reticulations of the generating network: while we expect this reference to be reasonably close to the optimum for LGT networks, this is not the case for ZODS networks. In fact, a closer look to ZODS networks shows that they have a large number of redundant reticulations which could be removed without changing the set of trees they display, and thus their reticulation number is in general quite larger than the optimum. This is an inherent effect of the ZODS generator not having any constraints on the reticulations that can be introduced, and it is more marked on networks with a small number of leaves.

Having a reference value significantly larger than the optimum makes the ratios shown in Fig. 9 small (close to 1, especially for TrivialRand on small instances) without implying that the results for the ZODS data are better than the ones for the LGT data. The graphs of Figs. 8 and 9 are thus not directly comparable.

The reference value for the experiments on ZODS data not being realistically close to the optimum, however, does not invalidate their significance. Indeed, the scope of such experiments was just to compare the performance of the machine-learned heuristics on data entirely different from those they were trained on with the performance of the randomised heuristics, which should not depend on the type of network that was used to generate the input.

As expected and in contrast with normal and LGT data, the results show that the machine-learned heuristics perform worse than the randomised ones on ZODS data, consistent with the ML methods being trained on a completely different class of networks.

#### Experiments on real data

We conducted two sets of experiments on real data, using the ML model trained on the dataset trained on 1000 LGT networks with at most 100 leaves each. For sufficiently small instances, we compared the results of our heuristics with the results of two existing tools for reconstructing networks from binary trees: TreeChild[3] and Hybroscale[7]. Hybroscale is an exact method performing an exhaustive search on the networks displaying the input trees, therefore it can only handle reasonably small instances in terms of the number of input trees. TreeChild is a fixed-parameter (in the number of reticulations of the output) exact algorithm that reconstructs the best tree-child network, a restricted class of phylogenetic networks, and due to its fast-growing computation time cannot handle large instances either.

We tested ML and TrivialRand against Hybroscale and TreeChild using the same dataset used in [3], in turn taken from [27]. The dataset consists of ten instances for each possible combination of the parameters \(|\mathcal {T}|\in [2,8]\) and \(L\in \{10,20,30,40,50,60,80,100,150\}\). In Fig. 10 we show results only for the instance groups for which Hybroscale or TreeChild could output a solution within 1 h, consistent with the experiments in [3]. As a consequence of Hybroscale and TreeChild being exact methods (TreeChild only for a restricted class of networks), they performed better than both ML and TrivialRand on all instances they could solve, although the best results of TrivialRand are often close (no worse than 15%) and sometimes match the optimal value.

The main advantage of our heuristics is that they can handle much larger instances than the exact methods. In the conference version of this paper [19] we showed the results of our heuristics on large real instances, using a ML model trained on 10 networks with at most 100 leaves each. These results demonstrated that consistently with the simulated data, the machine-learned heuristics gave significantly better results than the randomised ones for the largest instances. When we first repeated the experiments with the new models trained on 1000 networks with \(\textsf {max}L=100\), however, we did not obtain similar results: instead, the results of the randomised heuristics were better or only marginally worse than the machine-learned ones on almost all the instance groups, including the largest.

Puzzled by these results, we conducted an experiment on the impact of the training set on real data. The results are reported in Fig. 11, and show that the choice of the networks on which we train our model has a big impact on the quality of the results for the real datasets. This is in contrast with what we observed for the synthetic datasets, for which only the class of the training networks was important, not the specific instances of the networks themselves. According to what was noted in [3], this is most likely due to the fact that the real phylogenetic data have substantially more structure than random synthetic datasets, and the randomly generated training networks do not always reflect this structure. By chance, the networks we used for training the model we used in [19] were similar to real phylogenetic networks, unlike the 1000 networks in the training set of this paper.

#### Experiments on scalability

We conducted experiments to study how the running time of our heuristics scales with increasing instance size for all datasets. In Fig. 12 we report the average of the running times of ML for the instances within each instance group with a 95% confidence interval, for an increasing number of reticulations (synthetic datasets) or number of trees (real dataset). The datasets and the instance groups are those described in the previous sections. Note that we did not report the running times of the randomised heuristics because they are meant to be executed multiple times on each instance, and in all the experiments we bounded the number of executions precisely using the time required for one run of ML.

We also compared the running time of our heuristics with the running times of the exact methods TreeChild and Hybroscale. The results are shown in Fig. 13 and are consistent with the execution times of the exact methods growing exponentially, while the running time of our heuristics grows polynomially. Note that networks with more reticulations are reduced by longer CPS and thus the running time increases with the number of reticulations.

#### Experiments on non-exhaustive input trees

The instances on which we tested our methods so far all consisted of a set of exhaustive trees, that is, each input tree had the same set of leaves which coincided with the set of leaves of the network. However, this is not a requirement of our heuristics, which are able to produce feasible solutions also when the leaf sets of the input trees are different, that is when their leaves are proper subsets of the leaves of the optimal networks that display them.

To test their performance on this kind of data, we generated 18 LGT instance groups starting from the instances we used in Sect. “Experiments on LGT data” and removing a certain percentage *p* of leaves from each tree in each instance uniformly at random. Specifically, we generated an instance group for each value of \(p\in \{5,10,15,20,25,50\}\) starting from the LGT instance groups with \(L=100\) leaves and \(R\in \{10,20,30\}\) reticulations. Since the performances of the two machine-learned heuristics were essentially overlapping for all of the other experiments, and since TrivialRand performed consistently better than the other randomised heuristics, we limited this test to ML and TrivialRand. The results are shown in Fig. 14.

In accordance with intuition, the performance of both methods decreases with an increasing percentage of removed leaves, as the trees become progressively less informative. However, the degradation in the quality of the solutions is faster for ML than for TrivialRand, consistent with the fact that ML was trained on exhaustive trees only: when the difference between the training data and the input data becomes too large, the behaviour of the machine-learned heuristic becomes unpredictable. We demand the design of algorithms better suited for trees with missing leaves for future work.

#### Effect of the threshold on ML

We tested the effectiveness of adding a threshold \(\tau >0\) to ML on the same datasets of Sects. “Experiments on normal data“, “Experiments on LGT data” and “Experiments on ZODS data” (normal, LGT and ZODS). Recall that each instance group consists of 48 instances. We ran ML ten times for each threshold \(\tau \in \{0,0.1,0.3,0.5,0.7\}\) on each instance, took the lowest output reticulation number and averaged these results within each instance group.

The results are shown in Fig. 15. For all types of data, a threshold \(\tau \le 0.3\) is beneficial, intuitively indicating that when the probability of a pair being reducible is small it gives no meaningful indication, and thus random choices among these pairs are more suited. The seemingly best value for the threshold, though, is different for different types of instances. The normal instances seem to benefit from quite high values of \(\tau\), the best among the tested values being \(\tau =0.7\). While the optimal \(\tau\) value for normal instances could be even higher, we know from Figure 7 that it must be \(\tau <1\), as the random strategies are less effective than the one based on machine learning for normal data. For the LGT and the ZODS instances, the best threshold seems to be around \(\tau =0.3\), while very high values (\(\tau =0.7\)) are counterproductive. This is especially true for the LGT instances, consistent with the randomised heuristics being less effective for them than for the other types of data (see Fig. 8).

These experiments should be seen as an indication that introducing some randomness may improve the performance of the ML heuristics, at the price of running them multiple times. We defer a more thorough analysis to future work.

### A non-learned heuristic based on important features

In this section we propose FeatImp, yet another heuristic in the CPH framework. Although FeatImp does not rely on a machine learning model, we defined the rules to choose a cherry on the basis of the features that were found to be the most relevant according to the model we used for ML and TrivialML.

To identify the most suitable rules, we trained a classification tree using the same features and training data as the ones used for the ML heuristic (see Fig. 17 in Appendix A). We then selected the most relevant features used in such tree and used them to define the function PickNext listed by Algorithm 3: namely, the features 4, \(8_t\), \(11_d\) and \(12_t\) of Table 1 (the ratio of trees having both leaves *x* and *y* in which (*x*, *y*) is reducible, the average of the topological leaf distance between *x* and *y* scaled by the depth of the trees, the average of the ratios \(d(x,\textsf {LCA}(x,y))/d(y,\textsf {LCA}(x,y))\) and the average of the topological distance from *x* to the root over the topological distance from *y* to the root, respectively).

To compute and update these quantities we proceed as described in Sect. “Time complexity” and Appendix A. The general idea of the function PickNext used in FeatImp is to mimic the first splits of the classification tree by progressively discarding the candidate reducible pairs that are not among the top \(\alpha \%\) scoring for each of the considered features, for some input parameter \(\alpha\).

We implemented FeatImp and test it on the same instances as Sects. “Experiments on normal data“, “Experiments on LGT data” and “Experiments on ZODS data” with \(\alpha =20\). The results are shown in Figure 16. As expected, FeatImp works consistently worse than ML on all the tested datasets, and it also performs worse than TrivialRand on most instance groups. However, it is on average 12% better than TrivialRand on the LGT instance group having 50 leaves and 30 reticulations and on all the LGT instance groups with 100 leaves, which are the most difficult for the randomised heuristics, as already noticed in Sect. “Experiments on LGT data“. The results it provides for such difficult instances are only on average 20% worse than those of ML, with the advantage of not having to train a model to apply the heuristic.

These experiments are not intended to be exhaustive, but should rather be seen as an indication that machine learning can be used as a guide to design smarter non-learned heuristics. Possible improvements of FeatImp include using different values of \(\alpha\) for different features, introducing some randomness in Line 8, that is, instead of choosing the single top scoring pair to choose one among the top \(\alpha \%\) at random, or to use fewer/more features.

## Add Comment