Data collection and inclusion criteria
We searched the Gene Expression Omnibus (GEO) for gene expression datasets using the following search terms: ((Breast cancer) AND chemotherapy) AND “Homo sapiens”[porgn: txid9606], applying the filters: entry type: series, study type: gene expression by array, and attribute name: tissue. The inclusion criteria consisted of breast fine needle aspiration (FNA) or core biopsy samples obtained from patients with TNBC before initiating NACT, with information about the pathologic response.
In our preprocessing steps, we ensured that all datasets were normalized and log-scaled before analysis, and we Z-transformed the gene expression of each dataset separately to ensure that all datasets were on the same scale. Subsequently, we combined a portion of the datasets together in a single metadata based on a subset of common genes. We then partitioned the data into a training set and a testing set (first testing dataset) comprising 70% and 30% of the samples, respectively. By using balanced stratification, we ensured that both divisions had an equal representation of important covariates, such as age, tumor grade, T-stage, N-stage, and the parent dataset. An additional dataset was used for an independent assessment of the signature performance as a second testing dataset (Additional file 1: Table S1).
Construction of the Notch mechanism
Well-established biological knowledge supports the role of the Notch signaling pathway in mediating cancer stem cell plasticity and chemoresistance in several cancer types, including BC . To improve the performance and robustness of our classifier, we embedded this preexisting knowledge into its decision rules by building a set of gene pairs that captured Notch signaling biology. Specifically, we used the Molecular Signature Database (MSigDB)  to retrieve gene sets comprising genes regulated by Notch signaling and extracted their individual genes. Subsequently, we paired genes upregulated with those downregulated by NOTCH signaling to build a matrix of gene pairs, each consisting of a gene upregulated and another downregulated by Notch signaling. This matrix was then used as biological constraints during the training of the k-top scoring pairs (k-TSPs) algorithm as described below.
Training the k-top scoring pairs model
The k-TSPs algorithm was used to identify gene pairs whose relative ordering consistently switched between the two classes of interest, namely, pCR versus RD. To reduce noise and identify informative features, we imposed biological constraints on the model training process by restricting the pair search process to the Notch mechanism described above. Subsequently, the k-TSPs algorithm was employed to identify all possible gene pairs whose expression consistently changed in samples from patients who achieved pCR and those who had RD after NACT. These pairs were further refined using a robust feature selection process to select the smallest set of pairs that could distinguish between these two phenotypes.
We performed a feature selection process using a regularized random forest (RRF) [14, 15] on the training data. The RRF algorithm is similar to random forest (RF) but adds a penalty on the features used for splitting if their information gain is similar to features used at previous splits [14, 15]. We bootstrapped the training data 100 times, and we trained an RRF model on each. To control overfitting, we used a regularization coefficient of 0.5 and an initial feature set of 0.1. To account for the imbalance in class size, we used prior weights of 1.0 and 0.6 for pCR and RD, respectively. Given that this training process is repeated on 100 different resamples of the training data, it is expected that the selected features would be different with each training round; however, important features should be selected more frequently. We ranked the selected features based on their frequency across the 100 iterations and selected the most frequent gene pairs to be our final classifier. It is important to note that while RRF was used for feature selection, the final classifier is still rank-based, and its decision rules follow those of the k-TSPs algorithm [16, 17].
Evaluation of performance
We evaluated the resulting k-TSPs signature on the testing data using different performance metrics, including the area under the receiver operating characteristic (ROC) curve (AUC), balanced accuracy, sensitivity, specificity, and Matthews correlation coefficient (MCC). In addition to the testing set generated using the stratified sampling approach, we used an additional dataset [18, 19] for extra validation of the signature performance. Notably, p-values for the AUC-ROC curves were derived from the Delong’s test  comparing the classifier’s ROC-AUC to a random ROC curve with AUC of 0.5. Similarly, we assessed the performance of our signature against other signatures using Delong’s method, specifically testing if our signature’s AUC was significantly higher.
Testing the signature on non-triple-negative breast cancer samples
To test the specificity of the signature, we used it to distinguish RD from pCR in all non-TNBC samples extracted from the same datasets used for the training and testing process. The prediction probability scores were then used to plot the ROC curve and calculate the AUC.
Comparison with existing signatures in predicting the response to NACT
To further assess the predictive power and potential clinical utility of our signature, we compared the signature’s performance at distinguishing RD and pCR with other signatures available from similar studies. We queried the literature and identified several studies introducing gene signatures that are either predictive of the response to NACT in BC patients or based on signaling pathways and processes related to the chemotherapeutic response . For instance, Zhao et al. developed a gene signature based on the expression of 143 genes and used it to develop a probability score for response to NACT in patients with TNBC . Other signatures included in this analysis included signatures associated with acidosis response , androgen receptor (AR) signaling, epithelial-mesenchymal transition (EMT), growth factors, stemness, proliferation, DNA damage, and WNT signaling  together with other signatures associated with response to taxane therapy in TNBC . All signatures were trained and tested on the same training and testing sets, respectively, to provide a fair assessment of performance. Genes from each signature were compiled into a logistic regression model to predict the response to NACT (pCR versus RD), and their testing performance was compared using the AUC metric.
Association of the gene signature with recurrence-free survival
We then proceeded to test whether the signature was also associated with RFS in TNBC patients who received adjuvant chemotherapy. For this purpose, we used the Molecular Taxonomy of Breast Cancer International Consortium (METABRIC) cohort . First, we computed the probability of RD in each sample based on the gene signature and then categorized this probability into low and high scores. These were then used to compute the RFS over time using Kaplan‒Meier survival estimates . Furthermore, we performed a multivariate Cox proportional hazard model  accounting for other clinical and pathological variables, including T stage, age, type of breast surgery, menopausal status, and radiotherapy administration.
Software and packages
All steps of this analysis were performed using R version 4.0.3. The feature selection process was performed using the RRF and boot packages [14, 15, 29], while the k-TSPs model was trained using the SwitchBox R package . Finally, survival analysis was performed using the survival [31, 32] and survminer  packages.