This scoping review search identified a total of 98 studies from PubMed and 159 studies from Embase. After duplicate removal and title and abstract screening, potentially relevant studies were selected for full-article review. Among these, 28 peer-reviewed studies, each involving at least one unique ML model across a broad range of patient populations and settings, were included in this review (Fig. 1).
Data source and sample size
The majority of these studies (N = 14) were conducted using data from the US [26,27,28,29,30,31,32,33,34,35,36,37,38,39]. Among these US studies, most used administrative claims datasets [26, 27, 30, 31, 36, 39] (N = 6), including SEER-Medicare and Veterans Health Administration claims, followed by electronic health records or electronic medical records [32,33,34, 37, 38] (N = 5), and a few used patient registry cohorts [28, 29, 35] (N = 3). The remaining non-US studies used datasets from Europe [40,41,42,43,44,45] (N = 6), including Italy, the Netherlands, Denmark, Switzerland, and Germany, while others used data from England (N = 3), China (N = 4), or India (N = 1). The median sample size was 10,614 patients (range: 142–247,960).
Study population and time-to-event outcomes
Most of the studies applying ML-based prediction to survival analyses focused on cancer patients [26, 27, 30, 31, 34, 36, 38, 39, 42, 43, 46, 47] (N = 12); in these oncology studies, the models were used to predict survival outcomes or cancer recurrence.
The remaining studies focused on patients in cardiology [28, 35, 48, 49], COVID-19 [37, 50, 51], and diabetes [29, 40, 41, 45], as well as patients with schizophrenia, patients with HBV infection, general inpatients, heart transplant recipients, and intensive care unit (ICU) patients. Across these non-cancer disease areas, the ML models predicted clinical outcomes such as the development of cardiovascular events [29, 40, 41, 45]; the incidence of sudden cardiac arrest, venous thromboembolism, or ventricular fibrillation; and death. Only one study used ML for treatment outcomes. A detailed summary of the included studies is provided in e-supporting Table 1.
Characteristics of ML Models
Use of ML for survival outcomes
The types of ML algorithms used are reported in Table 1. The most popular ML algorithms for survival analyses were random survival forests (N = 16) [26,27,28, 31,32,33,34, 36, 42, 43, 45,46,47,48,49, 53], boosted tree methods [31, 34, 42, 43, 45, 51, 53], and artificial neural networks [30, 31, 37, 39,40,41, 43, 44, 46, 47, 49, 50]. Support vector machines [34, 35, 42, 53] and regularization methods (LASSO, ridge, elastic net) [43, 49, 52, 53] were also common; other algorithms included naïve Bayes [29, 35, 53], K-nearest neighbors, and multi-layer perceptrons. Table 2 provides a description of these ML algorithms.
ML model performance
Across these studies, three [28, 33, 45] did not report model performance as AUC; the others reported AUC for model evaluation. Among the studies reporting AUC, performance varied, with a mean of 0.7852 and a median of 0.789 (IQR: 0.73–0.847; range: 0.6–0.9503). While one study developed an ML model with an AUC below 0.7, most studies developed at least one ML model with an AUC above 0.70. Boxplots and beeswarm plots of model performance based on AUC, stratified by type of ML algorithm, are shown in Fig. 2. Descriptive findings for the AUCs across these ML models are shown in Table 3.
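As context for interpreting the reported AUCs, the metric equals the probability that a randomly chosen patient who experienced the event is assigned a higher predicted risk than a randomly chosen patient who did not. A minimal sketch in pure Python (the risk scores and outcomes below are hypothetical, not drawn from any included study):

```python
def auc(risks, events):
    """AUC via the Mann-Whitney U formulation: fraction of
    event/non-event pairs where the event case has the higher
    predicted risk (ties count as half)."""
    pos = [r for r, e in zip(risks, events) if e == 1]  # event cases
    neg = [r for r, e in zip(risks, events) if e == 0]  # non-event cases
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical model outputs: predicted risks and observed outcomes
risks = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]
events = [1, 1, 0, 1, 0, 1, 0, 0]
print(auc(risks, events))  # 13 of 16 pairs concordant -> 0.8125
```

An AUC of 0.5 corresponds to chance-level discrimination and 1.0 to perfect discrimination, which contextualizes the 0.70 threshold discussed above.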
Among all included studies, twenty-five (89%) applied model validation; Table 4 details the validation methods used. Nineteen studies used internal validation: fifteen randomly split their datasets into a training set and a test set to validate model performance [26, 27, 29, 31, 32, 36, 38,39,40,41, 44, 46, 49, 50, 53], while four used cross-validation [35, 42, 48, 52]. Six studies applied external validation, either using an independent dataset to validate model performance [30, 34, 37, 43, 47] or using prospective validation. Three studies did not report any validation methods [28, 33, 45].
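The two internal-validation schemes above can be sketched as follows (illustrative only; the patient IDs, 80/20 split fraction, and five-fold count are assumptions, not taken from any included study):

```python
import random

def train_test_split(ids, test_frac=0.2, seed=42):
    """Random split into a training set and a held-out test set."""
    rng = random.Random(seed)
    shuffled = ids[:]
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_frac)
    return shuffled[n_test:], shuffled[:n_test]  # (train, test)

def k_folds(ids, k=5, seed=42):
    """Partition IDs into k disjoint folds for cross-validation;
    each fold serves once as the held-out test set."""
    rng = random.Random(seed)
    shuffled = ids[:]
    rng.shuffle(shuffled)
    return [shuffled[i::k] for i in range(k)]

patients = list(range(100))            # hypothetical patient IDs
train, test = train_test_split(patients)
folds = k_folds(patients)
print(len(train), len(test), [len(f) for f in folds])  # 80 20 [20, 20, 20, 20, 20]
```

Cross-validation reuses every patient for both training and testing across folds, which is why it is often preferred over a single random split for smaller cohorts.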
Comparison of model performance between ML and CPH
A total of 17 studies (61%) compared the performance of ML models with that of the traditional regression-based CPH model. Most (N = 15, 88%) reported that ML performed better than CPH [26, 30,31,32, 34, 36, 38,39,40,41,42,43, 48,49,50]. Only one study reported that ML algorithms did not surpass the CPH model, and one study included CPH but made no comparison. Details can be found in e-supporting Table 1.
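For survival data, ML-vs.-CPH comparisons are often made with Harrell's concordance index (C-index), which handles right censoring directly. A minimal sketch in pure Python (the follow-up times, event indicators, risk scores, and "model A"/"model B" labels are all hypothetical):

```python
def c_index(times, events, risks):
    """Harrell's C-index for right-censored survival data: among
    comparable pairs (one patient observed to fail before the other's
    time), the fraction where the earlier failure was assigned the
    higher predicted risk (ties count as half)."""
    conc, comp = 0.0, 0
    n = len(times)
    for i in range(n):
        for j in range(n):
            # pair (i, j) is comparable only if i's event occurred
            # before j's observed time
            if events[i] == 1 and times[i] < times[j]:
                comp += 1
                if risks[i] > risks[j]:
                    conc += 1.0
                elif risks[i] == risks[j]:
                    conc += 0.5
    return conc / comp

# Hypothetical follow-up times (months), event indicators
# (1 = death, 0 = censored), and risk scores from two toy models
times = [2, 4, 6, 8, 10, 12]
events = [1, 1, 0, 1, 0, 1]
risk_a = [0.9, 0.8, 0.3, 0.6, 0.2, 0.5]  # e.g., an ML model
risk_b = [0.9, 0.2, 0.3, 0.6, 0.8, 0.5]  # e.g., a CPH baseline
print(c_index(times, events, risk_a), c_index(times, events, risk_b))
```

In this toy example, model A ranks every comparable pair correctly (C-index = 1.0) while model B misranks several; differences of this kind are what the head-to-head comparisons in the included studies quantify.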
Among the included studies, the majority were of high quality based on appraisal across the six domains of the QUIPS tool. Details of the quality assessment for all included studies are summarized in e-supporting Table 3.