Weakly supervised video-based cardiac detection for hypertensive cardiomyopathy | BMC Medical Imaging

This study was approved by the institutional review board of our hospital and was conducted in compliance with institutional human research policy. All participants gave written informed consent before enrollment. From August 2018 to January 2021, 185 HTCM patients and 112 healthy normal controls (N) were enrolled in this study, as shown in Table 1.

Inclusion and exclusion criteria

The inclusion criteria were as follows: the diagnosis met the criteria of the Chinese guidelines for the prevention and treatment of HTN (2018 revision), namely systolic blood pressure ≥ 140 mmHg and/or diastolic blood pressure ≥ 90 mmHg (note: 1 mmHg = 0.133 kPa), and/or regular use of antihypertensive drugs within two weeks. The course of disease was 2–5 years with adherence to drug treatment, LVEF ≥ 55%, and left ventricular mass index (LVMI) > 115 g/m² (male), LVMI > 95 g/m² (female), or relative wall thickness (RWT) > 0.42 [17]. The exclusion criteria were as follows: arrhythmia, congenital heart disease, coronary heart disease, valvular disease, cardiomyopathy, and secondary hypothyroidism. All subjects in both groups were in sinus rhythm.

Data curation

The echocardiographic data set used in this study was collected at Henan Provincial People’s Hospital (Zhengzhou, China) with a GE Vivid E95 ultrasound system in Digital Imaging and Communications in Medicine (DICOM) format. We used pydicom, a pure Python package, to access video frames in the source DICOM data. The frames were decoded from the pydicom objects directly because off-the-shelf functions or software for converting private tag data in DICOM files into numerical arrays were unavailable. Uninformative echocardiographic videos were excluded by a trained image-based binary CNN that separated clear, informative frames of the apical four-chamber view from meaningless, noisy, or blurred frames. We also observed that long videos, specifically those lasting more than 1000 frames, were prone to contain meaningless frames recorded while clinicians searched for the right cardiac view or put the transducer aside to do something else. In view of this and of the heavy computational cost of 3-D convolutions, only videos with no more than 300 frames were kept.
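The two exclusion rules above can be sketched as a small filter. This is an illustrative sketch only: the per-frame flags stand in for the output of the binary CNN, and `min_informative_fraction` is a hypothetical aggregation threshold, since the text does not state how frame-level predictions were combined into a video-level decision. Only the 300-frame limit comes from the text.

```python
def keep_video(frame_is_informative, max_frames=300, min_informative_fraction=0.5):
    """Decide whether a video passes the curation step.

    frame_is_informative: one boolean per frame (stand-in for the binary CNN output).
    max_frames: length limit taken from the text (no more than 300 frames).
    min_informative_fraction: hypothetical threshold, not given in the source.
    """
    n_frames = len(frame_is_informative)
    # Empty videos and videos longer than the limit are rejected outright.
    if n_frames == 0 or n_frames > max_frames:
        return False
    informative = sum(1 for flag in frame_is_informative if flag)
    return informative / n_frames >= min_informative_fraction
```

For example, `keep_video([True] * 200)` accepts a clean 200-frame video, while any video of 301 frames or more is rejected regardless of its frame quality.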

Deep learning model

In practice, the diagnosis of an echocardiographic video arises from unusual left ventricular motion across consecutive frames. An untrimmed video, however, exhibits extremely complex motion dynamics, and the informative motion may be infrequent and subtle, accounting for only a small portion of the whole video. Discriminative snippets were therefore given more attention (weight): the representations of the proposal snippets segmented from one echo video were assigned different probabilities and combined into a final weighted representation for downstream cardiac disease detection and domain classification, as shown in Fig. 1. The beat-to-beat sampling detected every cardiac cycle in the video through a trained ES/ED timing detection model (Fig. 2), and a clip of 32 frames starting from end-diastole was selected. This length was found sufficient to cover a complete cardiac contraction and dilatation process when the frame rate of the GE Vivid E95 ultrasound system (GE Healthcare, Chicago, IL, USA) was set to 30 frames per second (fps) and the sampling rate of the clips was 2.
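As a rough illustration of the clip selection, the sketch below builds the frame indices of one 32-frame clip starting at an end-diastole frame. It assumes that a sampling rate of 2 means taking every second frame, which the text does not spell out; the function names and the choice to drop clips that run past the end of the video are likewise assumptions.

```python
def clip_frame_indices(ed_index, n_video_frames, clip_len=32, stride=2):
    """Return the source-frame indices of one clip, or None if it does not fit.

    ed_index: index of an end-diastole frame (from the ES/ED timing model).
    stride: assumed meaning of "sampling rate 2" (every second frame).
    """
    last = ed_index + (clip_len - 1) * stride
    # Assumption: clips that would run past the end of the video are dropped.
    if ed_index < 0 or last >= n_video_frames:
        return None
    return list(range(ed_index, last + 1, stride))


def clip_duration_seconds(clip_len=32, stride=2, fps=30):
    """Time span of the source frames covered by one clip."""
    return ((clip_len - 1) * stride + 1) / fps
```

Under these assumptions a clip spans 63 source frames, about 2.1 s at 30 fps, which is consistent with the statement that one clip covers a complete contraction and dilatation.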

Fig. 1

Overall pipeline for detection of hypertensive cardiomyopathy. Two sampling methods are compared (uniform sampling and cardiac cycle-based sampling), and the latter is favored. The cardiac cycle-based sampling further requires a trained ES/ED timing detection neural network based on bi-directional LSTM and 3D spatiotemporal convolutions, which is illustrated in Fig. 2

Fig. 2

Cardiac cycle-based detection model. ES and ED timing can be determined by the last frame in consecutive systole and diastole durations, respectively. Consequently, the last '0' in a run of consecutive '0's and the last '1' in a run of consecutive '1's indicate end-systole and end-diastole timing, respectively
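The rule in the caption of Fig. 2 can be sketched in a few lines, taking the per-frame phase labels (0 for systole, 1 for diastole) as given. The function name is illustrative, and the sketch assumes that a run cut off by the end of the video does not mark a true ES or ED frame, so only label changes are counted.

```python
def find_es_ed(labels):
    """Locate end-systole (ES) and end-diastole (ED) frames.

    labels: per-frame phase labels, 0 = systole, 1 = diastole.
    Returns (es_indices, ed_indices).
    """
    es_indices, ed_indices = [], []
    for i in range(len(labels) - 1):
        if labels[i] != labels[i + 1]:
            # Frame i is the last frame of its run.
            if labels[i] == 0:
                es_indices.append(i)  # last '0' of a systole run
            else:
                ed_indices.append(i)  # last '1' of a diastole run
    return es_indices, ed_indices
```

For the label sequence `[1, 1, 0, 0, 0, 1, 1, 1, 0, 0]`, the ED frames are 1 and 7 and the ES frame is 4; the final run of zeros is ignored because the video ends before the phase changes.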


Multiple instance learning (MIL) is well suited to medical problems such as pathological diagnosis using whole-slide imaging (WSI), where images are huge and pixel-level annotations are lacking. The attention-based MIL method learns the contribution of each sliced patch through trainable parameters [18,19,20,21,22]. Unlike classical MIL, which assumes that instances are independent and identically distributed, the representations of proposal snippets are temporally correlated; a bidirectional LSTM was therefore introduced to model temporally dependent attention. The representation ensemble module (Fig. 1) sequentially outputs an ordered list of weights for clip representations and converts clip-level representations into a video-level representation. The MIL-based feature ensemble module in our proposed pipeline is shown in Fig. 3.
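The conversion from clip-level to video-level representations can be illustrated with plain attention pooling. In the sketch below the attention scores are passed in as numbers; in the pipeline they would come from the trainable, bidirectional-LSTM-based module, which is not reproduced here. The softmax normalization and the function names are assumptions made for illustration.

```python
import math


def softmax(scores):
    """Turn raw attention scores into weights that sum to 1."""
    m = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(clip_features, attention_scores):
    """Weighted sum of clip-level features, one weight per clip.

    clip_features: list of equal-length feature vectors, in temporal order.
    attention_scores: one raw score per clip (assumed to be given).
    """
    weights = softmax(attention_scores)
    dim = len(clip_features[0])
    video_feature = [
        sum(w * feat[d] for w, feat in zip(weights, clip_features))
        for d in range(dim)
    ]
    return weights, video_feature
```

With equal scores every clip contributes equally; a clip with a higher score dominates the video-level representation, which is how discriminative snippets receive more weight.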

Fig. 3

The multiple instance learning (MIL) based feature ensemble module in our proposed pipeline

Echocardiography examination

The LVEF was calculated by Simpson’s biplane method in apical views. The left ventricular end-diastolic diameter (LVDd), left ventricular end-diastolic volume (LVEDV), left ventricular end-systolic volume (LVESV), interventricular septum thickness (IVST), left ventricle posterior wall thickness (LVPWT), and left ventricle mass (LVM) were measured by conventional M-mode echocardiography in all patients. The longitudinal peak strain dispersion (PSD) and global longitudinal strain (GLS) were acquired using a commercially available analysis workstation (EchoPAC; GE Healthcare). All measurements were performed by two experienced sonographers, and each parameter was measured three times and averaged.

Domain adversarial neural networks

Acquiring echocardiographic videos of the desired standard cardiac views involves a multitude of parameters, which vary even within the same lab over time, mainly because of patient attributes and sonographers’ immediate preferences. This results in significant variation in image quality that hinders the generalization of deep learning algorithms. It is not uncommon for practitioners’ preferences in adjusting the transducer to acquire a high-quality echo to change over time, which may further affect the clinical diagnosis. In the majority of cases, implicit approaches such as data augmentation are used intensively, extending the accessible training data domain to a considerable degree. Nevertheless, the artifacts they introduce can needlessly destroy subtle yet clinically discriminative information. In echocardiograms, for example, added noise or pixel jittering may obscure the thin mitral valve in the apical four-chamber view. Domain adversarial neural networks [23] instead remove domain information from the model representation. We assumed that all frames extracted from one echocardiographic video originate from the same unique data distribution and thus constitute a domain, and added a branch bifurcating from the baseline network (Table 1) to train a domain classifier, so that the learned representation is less discriminative with respect to the domains.
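For orientation, the general domain-adversarial objective can be written as follows. This is the standard textbook form rather than the exact loss used in this study: the symbols θ_f (feature extractor), θ_y (disease classifier), θ_d (domain classifier), and the trade-off weight λ are notation introduced here, and the value of λ is not given in the text.

```latex
E(\theta_f, \theta_y, \theta_d)
  = \mathcal{L}_y(\theta_f, \theta_y) - \lambda \, \mathcal{L}_d(\theta_f, \theta_d)

(\hat{\theta}_f, \hat{\theta}_y)
  = \arg\min_{\theta_f, \theta_y} E(\theta_f, \theta_y, \hat{\theta}_d),
\qquad
\hat{\theta}_d
  = \arg\max_{\theta_d} E(\hat{\theta}_f, \hat{\theta}_y, \theta_d)
```

The domain classifier is trained to minimize its own loss, while the feature extractor is pushed in the opposite direction, so that features useful for disease classification carry as little video-specific (domain) information as possible.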

Experimental details

The non-local I3D model [24, 25] pre-trained on the Kinetics-400 data set [26] served as the 3D ConvNet feature extractor (Fig. 1). The architecture of the 3D ConvNet feature extractor is illustrated in Fig. 4. During training, we decomposed each echocardiographic video into proposal snippets with the trained ES/ED timing detection model and forwarded these snippets through the 3D ConvNet feature extractor and the weakly supervised feature ensemble module to make the final cardiac disease and domain predictions in an end-to-end manner. The input image size was 224 × 224 by convention, and the sampling rate was 2 by default. We treated every video as an individual domain; thus, the number of domain classes was equal to the number of all videos in the training and testing data sets. In each iteration, a training video was first propagated through the whole pipeline to produce the predicted cardiac disease class and the predicted domain class; another video was then randomly selected from the whole data set and forwarded to produce only its predicted domain class, so that domains in the test data set were included. The three predicted outputs were compared with the corresponding ground truth labels to yield the loss, and the PyTorch [27] framework back-propagated it and updated the parameters once. Uniform training was employed to alleviate the effect of data imbalance by training on the same number of healthy and hypertensive videos in each epoch. The parameters for the HTCM detection model shown in Fig. 1 were as follows: stochastic gradient descent optimizer with an initial learning rate of 0.0001, cosine decay learning rate strategy [28], weight decay of 4e-5, dropout of 0.5 [29], label smoothing of 0.1 [30], and batch size of 1. The model was trained for 100 epochs, a total of 50000 iterations, using a Tesla V100 GPU (NVIDIA Corp., Santa Clara, CA, USA).
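The cosine decay strategy with the stated initial learning rate and iteration count can be sketched as below. The schedule details are assumptions: the text does not say whether warm-up or restarts were used, whether the rate was updated per iteration or per epoch, or what the final learning rate was, so this sketch decays once from 0.0001 to 0 over 50000 iterations.

```python
import math


def cosine_lr(step, total_steps=50000, base_lr=1e-4, min_lr=0.0):
    """Learning rate at a given iteration under a single cosine decay.

    base_lr and total_steps follow the text; min_lr = 0 and per-iteration
    updates are assumptions.
    """
    step = min(max(step, 0), total_steps)  # clamp to the valid range
    cosine = 0.5 * (1.0 + math.cos(math.pi * step / total_steps))
    return min_lr + (base_lr - min_lr) * cosine
```

Under these assumptions the rate starts at 0.0001, reaches half that value at iteration 25000, and decays to zero at the final iteration.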
During inference, to evaluate the performance metrics fully, the experiments were carried out on both the subject classification task (subject based) and the video classification task (video based). We first ran the algorithms on all the individual videos to obtain the video classification results, and then averaged the classification results (the outputs of the last layer of the deep neural networks) of the videos belonging to the same subject to obtain the subject classification results. Only apical four-chamber view videos were used.
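The step from video-level to subject-level results amounts to a grouped average, sketched below. The data layout (pairs of subject identifier and last-layer output) and the function names are illustrative assumptions.

```python
from collections import defaultdict


def subject_level_outputs(video_outputs):
    """Average last-layer outputs over all videos of the same subject.

    video_outputs: list of (subject_id, [output per class]) pairs.
    Returns a dict mapping subject_id to the averaged output vector.
    """
    grouped = defaultdict(list)
    for subject_id, output in video_outputs:
        grouped[subject_id].append(output)
    averaged = {}
    for subject_id, outputs in grouped.items():
        n = len(outputs)
        # zip(*outputs) iterates over classes across this subject's videos
        averaged[subject_id] = [sum(column) / n for column in zip(*outputs)]
    return averaged


def predicted_class(output):
    """Index of the largest averaged output."""
    return max(range(len(output)), key=lambda k: output[k])
```

A subject with two videos whose outputs are `[0.2, 0.8]` and `[0.6, 0.4]` receives the averaged output `[0.4, 0.6]` and is therefore assigned class 1, even though one of the two videos on its own favored class 0.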

Fig. 4

The architecture of the 3D ConvNet feature extractor used in our proposed pipeline

Statistical analysis

The Python scikit-learn [31] library was used for statistical analysis. Continuous variables are expressed as mean ± standard deviation when normally distributed, or as median (interquartile range) otherwise. Categorical variables are presented as numbers and percentages. The hypertensive cardiomyopathy group and the normal group were compared with the two-sample t-test if normality and homogeneity of variance were both satisfied, the t’-test when only normality was satisfied, or the Wilcoxon signed-rank test when normality was not satisfied. Receiver operating characteristic (ROC) curves were used to assess diagnostic accuracy, and the areas under the ROC curves (AUCs) were calculated directly with the trapezoidal rule. The accuracy (ACC), sensitivity (SEN), specificity (SPE), positive predictive value (PPV), negative predictive value (NPV), positive likelihood ratio (PLR), and negative likelihood ratio (NLR) were calculated. A p-value < 0.05 was considered statistically significant.
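The listed metrics follow from the four cells of a confusion matrix, and the AUC from the trapezoidal rule applied to the ROC points. The sketch below uses the standard definitions in plain Python instead of scikit-learn; the function names are illustrative, and it assumes non-degenerate counts (no zero denominators).

```python
def diagnostic_metrics(tp, fp, tn, fn):
    """Standard diagnostic metrics from confusion-matrix counts.

    Assumes every denominator is non-zero (for example, SPE < 1 for PLR).
    """
    sen = tp / (tp + fn)  # sensitivity
    spe = tn / (tn + fp)  # specificity
    return {
        "ACC": (tp + tn) / (tp + fp + tn + fn),
        "SEN": sen,
        "SPE": spe,
        "PPV": tp / (tp + fp),
        "NPV": tn / (tn + fn),
        "PLR": sen / (1.0 - spe),
        "NLR": (1.0 - sen) / spe,
    }


def trapezoid_auc(fpr, tpr):
    """Area under the ROC curve by the trapezoidal rule.

    fpr, tpr: ROC points sorted by increasing false positive rate.
    """
    area = 0.0
    for i in range(1, len(fpr)):
        area += (fpr[i] - fpr[i - 1]) * (tpr[i] + tpr[i - 1]) / 2.0
    return area
```

For example, with 40 true positives, 5 false positives, 45 true negatives, and 10 false negatives, sensitivity is 0.80, specificity is 0.90, and the positive likelihood ratio is about 8.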
