### Patient selection

In this study, 364 CSVD patients (217 patients with aSVD, 106 patients with CAA, and 41 patients with CADASIL) were enrolled between January 2017 and May 2022 at three independent centers: the Third Affiliated Hospital, Sun Yat-sen University (SYSUTH), the Maoming People’s Hospital (MMPH), and the First Affiliated Hospital of Shantou University Medical College (STUMFH). All patients with aSVD had vascular risk factors, such as hypertension, and their MRI neuroimaging met the STandards for ReportIng Vascular changes on nEuroimaging (STRIVE) for CSVD [3, 13, 30]. Patients with CAA met the diagnostic criteria for probable CAA according to the Boston criteria version 2.0 [15, 16]. CADASIL was confirmed by a genetic diagnosis of *NOTCH3* gene mutation or by granular osmiophilic material identified in a skin biopsy [31, 32]. Detailed inclusion criteria can be found in the supplementary materials.

### Dataset curation

SWS of the enrolled patients were collected to develop and test the model, regardless of SWS parameters, to improve robustness and applicability. The SWS acquisition details are shown in the supplementary materials (Additional file 1: Table S1). SWS of patients from SYSUTH (a hospital in Guangzhou city) and MMPH (a hospital in Maoming city) served as the development set, and SWS of patients from STUMFH (a hospital in Shantou city) served as the external test set to test the model’s performance across geographic regions [33]. Additionally, tenfold cross-validation was used to evaluate the stability of the DL model.
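The split described above can be sketched as follows. This is a minimal illustration only: the record layout (a `center` key per SWS), the function names, and the random seed are assumptions, and the text does not state whether the folds were stratified by CSVD type or grouped by patient, so plain shuffled folds are used here.

```python
import random


def split_by_center(records, external_center="STUMFH"):
    """Separate records into a development set and an external test set by center."""
    dev = [r for r in records if r["center"] != external_center]
    ext = [r for r in records if r["center"] == external_center]
    return dev, ext


def tenfold_indices(n_samples, n_folds=10, seed=0):
    """Yield (train_idx, val_idx) index lists for k-fold cross-validation.

    Assumption: simple shuffled folds without stratification.
    """
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    # Fold k takes every n_folds-th shuffled index, starting at position k.
    folds = [idx[k::n_folds] for k in range(n_folds)]
    for k in range(n_folds):
        val = folds[k]
        train = [i for j, fold in enumerate(folds) if j != k for i in fold]
        yield train, val
```

Each of the ten iterations yields one training/validation partition of the development set, so ten models can be fitted and later aggregated.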

### SWS image preprocessing

The SWS images were preprocessed before being input into the DL model. First, the orientation of all SWS was uniformly adjusted to left-posterior-superior, and the brain area was extracted from the irrelevant background. Subsequently, the pixel values in the image were ranked, and the value at the 99.5th percentile was denoted as \({x}_{u}\). Finally, the pixel values of the entire image were normalized and scaled (\(X=x/{x}_{u}\)).
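The intensity scaling step can be illustrated with a short NumPy sketch. It assumes that \({x}_{u}\) is the 99.5th-percentile intensity of the brain-extracted volume; the function name is hypothetical, and the text does not say whether values above \({x}_{u}\) were clipped, so no clipping is applied here.

```python
import numpy as np


def normalize_sws(volume, percentile=99.5):
    """Scale intensities by the value at the given percentile: X = x / x_u."""
    volume = np.asarray(volume, dtype=np.float32)
    x_u = np.percentile(volume, percentile)
    if x_u <= 0:
        raise ValueError("percentile intensity must be positive")
    # Intensities above x_u map to values slightly greater than 1 (no clipping assumed).
    return volume / x_u
```

Dividing by a high percentile rather than the maximum makes the scaling less sensitive to a few extreme pixel values.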

### Annotation procedures

To train the model for detecting CMBs, an author (R.Z.W), who was blinded to patients’ clinical data, performed segmentations of the CMBs using the software ITK-SNAP (version 3.8.0). Another author (B.Q) reviewed the results. To evaluate the reproducibility of the manual segmentations, 30 SWS were randomly selected from the development dataset, and an additional segmentation was performed 3 months after the initial segmentation.

To train and test the model for classifying CSVD into aSVD, CAA, and CADASIL, all SWS underwent additional annotation procedures to confirm the type of CSVD, i.e., the SWS label. A committee of three senior neurologists (B.Q, Z.Q.L, and H.L) annotated labels based on inclusion criteria, with more resource-intensive neurologist annotations being reserved. In cases of disagreement regarding a label, the committee discussed and reached a consensus. If consensus could not be achieved, the SWS was excluded. In summary, each SWS label received one committee consensus annotation, which was regarded as the gold standard for model evaluation. In the external test set, four neurologists (L.W, T.T.L, X.H.H, X.L) who were not part of the committee provided individual annotations for labels, and these labels were used for comparison with the model’s performance. To evaluate the consistency and accuracy of the annotations, we employed the Fleiss Kappa coefficient to measure the inter-rater agreement among committee members.
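For reference, the Fleiss Kappa coefficient can be computed from the raters' labels as in the following sketch. It is a plain-Python illustration of the standard formula, not the authors' code; the function name and input layout (one list of labels per SWS, one label per rater) are assumptions. The statsmodels package also provides a `fleiss_kappa` function in `statsmodels.stats.inter_rater`.

```python
from collections import Counter


def fleiss_kappa(ratings, categories):
    """Fleiss' kappa for a fixed number of raters per subject.

    ratings: list of per-subject label lists, one label per rater.
    categories: list of all possible labels.
    """
    n_subjects = len(ratings)
    n_raters = len(ratings[0])
    if any(len(r) != n_raters for r in ratings):
        raise ValueError("every subject needs the same number of raters")
    # counts[i][j] = number of raters assigning subject i to category j
    counts = [[Counter(r)[c] for c in categories] for r in ratings]
    # Observed agreement per subject, then averaged over subjects.
    p_i = [(sum(n * n for n in row) - n_raters) / (n_raters * (n_raters - 1))
           for row in counts]
    p_bar = sum(p_i) / n_subjects
    # Chance agreement from the overall category proportions.
    p_j = [sum(row[j] for row in counts) / (n_subjects * n_raters)
           for j in range(len(categories))]
    p_e = sum(p * p for p in p_j)
    # Undefined (division by zero) if all ratings fall in a single category.
    return (p_bar - p_e) / (1 - p_e)
```

A value of 1 indicates perfect agreement, and a value near 0 indicates agreement no better than chance.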

### Network architecture

Figure 6 shows the details of our architecture. Mask R-CNN was used to detect CMBs slice-wise and to obtain a semantic segmentation mask for CMBs [34]. Next, a multi-instance learning (MIL) network was applied to classify CSVD sequence-wise [35]. The semantic segmentation masks of CMBs generated by Mask R-CNN were concatenated with the corresponding original SWS to form a two-channel image as the input for CSVD classification, inducing the model to concentrate on representations of CSVD on SWS. ResNet50 served as the backbone to extract representations of each SWS slice layer by layer, and each slice was represented as a 1024-dimensional vector. The representations of all slices belonging to the same SWS were aggregated into one feature vector by the attention module, which assigned higher weights to slices more relevant to the final CSVD type prediction and lower weights to less relevant slices. This feature vector was concatenated with the corresponding patient’s age and sex to generate a 1026-dimensional vector, serving as the sequence’s representation. Finally, a fully connected layer predicted the CSVD type from the sequence representation. Gradient-weighted Class Activation Mapping (Grad-CAM) [36] was applied to visualize where the MIL network focused its attention when making a prediction.
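The aggregation step can be illustrated with a small NumPy sketch. The exact form of the attention module is not given in the text, so a common tanh-based attention pooling is assumed here; the parameter names `V` and `w`, the hidden size, and the numeric encoding of age and sex are all hypothetical, and in the trained model these parameters would be learned rather than supplied by hand.

```python
import numpy as np


def attention_pool(slice_feats, V, w):
    """Aggregate per-slice features (n_slices, d) into a single (d,) vector.

    Assumed attention form: softmax over w^T tanh(V^T h_i) for each slice i.
    """
    scores = np.tanh(slice_feats @ V) @ w       # one score per slice
    scores = scores - scores.max()              # subtract max for numerical stability
    weights = np.exp(scores) / np.exp(scores).sum()
    return weights @ slice_feats, weights


def sequence_representation(slice_feats, age, sex, V, w):
    """Append age and sex to the pooled feature vector (1024 + 2 = 1026 values)."""
    pooled, weights = attention_pool(slice_feats, V, w)
    return np.concatenate([pooled, [age, sex]]), weights
```

The attention weights sum to 1 across slices, so the pooled vector is a weighted average in which slices with higher scores contribute more.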

### Model training

Adjacent slices in the SWS were combined to form a three-channel two-dimensional image and input into the Mask R-CNN network to identify CMBs. The manual segmentation of CMBs on the middle channel of this image was used as the ground truth. In CMB detection training, the loss functions used for segmentation and bounding box placement included per-pixel sigmoid loss, binary loss, and regression loss. During training for CSVD classification, balanced random sampling of the three CSVD categories was achieved using weighted sampling, and the batch size was set to 1, so that only one sample was input into the model at a time. The loss function for CSVD classification was the cross-entropy loss. The parameters of the model, including the feature extraction module, were trained end to end by backpropagating the cross-entropy loss. The Adam optimizer was used to optimize the parameters of the second model, with an initial learning rate of 0.001. An early stopping mechanism was used to limit overfitting during training: when the loss on the internal test set of the development set did not decline for five consecutive epochs, training was terminated and the best model was preserved. To classify CSVD, we used tenfold cross-validation to obtain ten models on the development set. To test the model’s applicability for classifying CSVD, the performance of the ensemble model obtained by aggregating the ten models was compared with that of four neurologists (L.W, T.T.L, X.H.H, X.L) on the external test set. The training parameters and source code can be found online (https://github.com/Huatsing-Lau/CSVD-CMBs-Detection-and-Classification.git).
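The early stopping rule can be sketched as follows. This is an illustrative helper with a hypothetical name, operating on a precomputed list of per-epoch losses rather than an actual training loop; it assumes that "did not decline" means no improvement over the best loss seen so far.

```python
def early_stopping_epochs(val_losses, patience=5):
    """Return (best_epoch, stop_epoch) for a sequence of per-epoch losses.

    Training stops once the loss has failed to improve on the best value
    for `patience` consecutive epochs; the best epoch's model is kept.
    """
    best_loss = float("inf")
    best_epoch = -1
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return best_epoch, epoch
    # Patience never exhausted: training ran through all available epochs.
    return best_epoch, len(val_losses) - 1
```

For example, with made-up losses `[1.0, 0.8, 0.7, 0.75, 0.72, 0.71, 0.74, 0.9, 0.6]`, the best value occurs at epoch 2 and training stops at epoch 7, so the later drop to 0.6 is never reached.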

An AMD EPYC 7742 2.25-GHz CPU and an A100 GPU (Nvidia) were used, running on a Linux system (Ubuntu, version 7.5.0) with CUDA version 11.7. Model implementation was performed using open-source software (Python, version 3.8.3; Python Software Foundation), PyTorch version 1.13.0, and torchvision version 0.14.0.

### Statistical analysis

Mean and standard deviation (SD) were used to describe continuous variables, while percentages were used to describe categorical variables. Differences between groups were tested using one-way ANOVA and the Chi-square test, respectively.

The IoU, Dice score, and recall were calculated to evaluate the model’s performance in detecting CMBs. The IoU, expressed by Eq. (1), was used to assess the performance of bounding boxes. Bounding box placement was considered correct for an IoU above 0.5. The Dice score, expressed by Eq. (2), was used to assess the performance of the segmentation.

$$\text{IoU} = \frac{TP}{FP + TP + FN}, \tag{1}$$

$$\text{Dice score} = \frac{2TP}{FP + 2TP + FN}, \tag{2}$$

where TP, FP, and FN represent true positive pixels, false positive pixels, and false negative pixels, respectively. The recall, expressed by Eq. (3), was used to evaluate the proportion of true CMBs detected by the model.

$$\text{Recall} = \frac{TP}{TP + FN}, \tag{3}$$

where TP and FN represent the numbers of true positive and false negative CMBs, respectively, for the Mask R-CNN detections (only predicted CMBs with a Dice score exceeding 0.5 were considered true positive CMBs).
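Eqs. (1) to (3) translate directly into code. The sketch below uses plain Python on flattened binary masks; the function names are hypothetical, and the counts passed to `recall` would be lesion-level counts (CMBs) rather than pixel counts, as described above.

```python
def pixel_counts(pred, truth):
    """Count TP, FP, and FN pixels over two equally sized flat binary masks."""
    tp = sum(1 for p, t in zip(pred, truth) if p and t)
    fp = sum(1 for p, t in zip(pred, truth) if p and not t)
    fn = sum(1 for p, t in zip(pred, truth) if not p and t)
    return tp, fp, fn


def iou(tp, fp, fn):
    """Eq. (1): intersection over union."""
    return tp / (fp + tp + fn)


def dice_score(tp, fp, fn):
    """Eq. (2): Dice score."""
    return 2 * tp / (fp + 2 * tp + fn)


def recall(tp, fn):
    """Eq. (3): proportion of true CMBs that were detected."""
    return tp / (tp + fn)
```

All three functions raise `ZeroDivisionError` when both the prediction and the ground truth are empty, a case that would need separate handling in practice.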

Confusion matrices, ROC curves, accuracy, precision, and F1 score were used to evaluate the performance of the model in classifying CSVD. Confusion matrices were used to illustrate the CSVD labels for which the model prediction or the individual neurologist predictions were discordant with the committee consensus. ROC analysis was used to calculate the AUC to assess the discrimination of the model for each label class. The ROC curve was drawn with the false positive rate (FPR) as the horizontal coordinate and the true positive rate (TPR) as the vertical coordinate, and the AUC was calculated from it. The formulas of TPR and FPR are expressed in Eq. (4) and Eq. (5), respectively. Accuracy, precision, and F1 score were calculated as complementary performance measures to the AUC [37], especially in the context of multi-class prediction, and to reduce sensitivity to class imbalance. The weighted average of the indices was calculated to account for the imbalance in sample numbers between categories. The formulas of accuracy, precision, and F1 score are expressed in Eq. (6), Eq. (7), and Eq. (8), respectively.

$$\text{TPR} = \frac{TP}{TP + FN}, \tag{4}$$

$$\text{FPR} = \frac{FP}{FP + TN}, \tag{5}$$

$$\text{Accuracy} = \frac{TP + TN}{TP + FN + FP + TN}, \tag{6}$$

$$\text{Precision} = \frac{TP}{TP + FP}, \tag{7}$$

$$\text{F1 score} = \frac{2TP}{2TP + FN + FP}, \tag{8}$$

where TP, TN, FP, and FN represent the true positive, true negative, false positive, and false negative numbers, respectively.
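For a three-class problem, Eqs. (4) to (8) can be evaluated per class from a confusion matrix and then averaged. The sketch below assumes a one-vs-rest reading of TP, TN, FP, and FN and a weighted average in which each class is weighted by its number of true samples (the convention used by scikit-learn's `average="weighted"` option); the function names are hypothetical.

```python
def per_class_counts(cm, k):
    """One-vs-rest TP, FP, FN, TN for class k (rows = true label, columns = predicted)."""
    total = sum(sum(row) for row in cm)
    tp = cm[k][k]
    fn = sum(cm[k]) - tp
    fp = sum(row[k] for row in cm) - tp
    tn = total - tp - fn - fp
    return tp, fp, fn, tn


def class_metrics(cm, k):
    """Eqs. (4)-(8) for class k."""
    tp, fp, fn, tn = per_class_counts(cm, k)
    return {
        "tpr": tp / (tp + fn),
        "fpr": fp / (fp + tn),
        "accuracy": (tp + tn) / (tp + fn + fp + tn),
        "precision": tp / (tp + fp),
        "f1": 2 * tp / (2 * tp + fn + fp),
    }


def weighted_average(cm, name):
    """Average one metric over classes, weighted by each class's true sample count."""
    supports = [sum(row) for row in cm]
    return sum(class_metrics(cm, k)[name] * s
               for k, s in enumerate(supports)) / sum(supports)
```

With a made-up confusion matrix `[[8, 1, 1], [2, 6, 2], [0, 1, 4]]`, the first class has TP = 8, FP = 2, FN = 2, and TN = 13, which gives a precision of 0.8 and an F1 score of 0.8.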

Statistical significance was set at P < 0.05. The 95% confidence interval (CI) was calculated for each index. IBM SPSS Statistics (version 26), scikit-learn (version 0.24.2), and statsmodels (version 0.13.5) were used to analyze the metrics of the models.
