
A 4D-CBCT correction network based on contrastive learning for dose calculation in lung cancer | Radiation Oncology


Patient data

4D images of 20 patients with thoracic tumors were selected to train and test the deep-learning model. Patient data were obtained from a publicly available dataset in The Cancer Imaging Archive (TCIA), created by the National Cancer Institute [28, 29]. All patients had locally advanced non-small cell lung cancer and received concurrent chemoradiotherapy, with a total dose ranging from 59.4 to 70.2 Gy delivered in daily fractions of 1.8 or 2 Gy. The clinical information of all patients used for training and testing is shown in Table 1. All patients underwent 4D-CT imaging at least once during treatment, and most received 4D-CBCT imaging during treatment fractions. Consequently, the dataset consisted of a total of 82 4D-CT and 507 4D-CBCT images from these 20 patients.

Table 1 Clinical information for 20 patients

Image data


4D-CT images were acquired on a 16-slice helical CT simulator (Brilliance Big Bore, Philips Medical Systems, Andover, MA, USA) with a tube voltage of 120 kVp, tube currents of 50–114 mA, and exposure times of 3.53–5.83 ms. The respiratory signals obtained from the RPM respiratory gating system were divided into 10 phases from 0 to 90% in phase order, with the 0% phase corresponding to the end of inspiration. The slice thickness for each phase was 3 mm, and the image size was 512 × 512 with a pixel spacing of 0.9766 × 0.9766 mm².


4D-CBCT images were acquired on a commercial CBCT scanner (On-Board Imager v1.3, Varian Medical Systems, Inc.) with 360° scanning at a tube voltage of 125 kVp, a tube current of 20 mA, and an exposure time of 20 ms. To ensure accurate calculation of the radiotherapy dose, CT number to electron density (CT-ED) calibration was performed on 4D-CBCT with the Model 062M Electron Density Phantom (CIRS, Norfolk, Virginia, US). During scanning, the respiratory surrogate used for 4D-CT was integrated into the 4D-CBCT acquisition system. The projections were sorted into the same 0–90% phases according to the respiratory signal of the surrogate. Each phase was reconstructed using the Feldkamp–Davis–Kress reconstruction algorithm with a slice thickness of 3 mm, an image size of 512 × 512, and a pixel spacing of 0.8789 × 0.8789 mm².
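The phase sorting described above can be illustrated with a minimal sketch. It assumes that each projection already carries a respiratory phase expressed as a fraction in [0, 1) and that each bin starts at its label (for example, the 10% bin covers 0.10 to 0.20); the source does not state how the bin edges were defined, and the names `phase_bin` and `sort_projections` are hypothetical.

```python
def phase_bin(phase_fraction, n_bins=10):
    """Map a respiratory phase in [0, 1) to its bin label in percent (0, 10, ..., 90)."""
    if not 0.0 <= phase_fraction < 1.0:
        raise ValueError("phase_fraction must lie in [0, 1)")
    # Assumption: bins are equal-width windows that start at their label.
    return int(phase_fraction * n_bins) * (100 // n_bins)


def sort_projections(phases, n_bins=10):
    """Group projection indices by phase bin label."""
    bins = {label: [] for label in range(0, 100, 100 // n_bins)}
    for index, phase in enumerate(phases):
        bins[phase_bin(phase, n_bins)].append(index)
    return bins


# Four projections with illustrative (not measured) phase values.
groups = sort_projections([0.02, 0.15, 0.95, 0.50])
print(groups[0], groups[10], groups[50], groups[90])
```

Each group of projection indices would then be reconstructed separately to give one phase of the 4D-CBCT.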

4D-sCT based on CLCGAN

Image preprocessing

The training dataset comprised 4D images of 10 phases from 20 patients. Each phase comprised 50 slices, giving a total of 10,000 4D-CT and 10,000 4D-CBCT slices. The images of each patient were centered on the lung cancer region and included the whole lung. Each phase of the 4D-CT images was adjusted to the same size and resolution as the 4D-CBCT images using an open-source registration toolkit, elastix [30, 31]. The adjusted images were used for paired training with CLCGAN, and random flipping was applied during training for data augmentation.
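Because training was paired, any flip has to be applied identically to the CT slice and its CBCT counterpart. The sketch below illustrates this under stated assumptions: the flip axis (left-right) and the probability of 0.5 are not given in the source, images are represented as nested lists for simplicity, and the function names are hypothetical.

```python
import random


def flip_horizontal(image):
    """Return a left-right mirrored copy of a 2D image stored as a list of rows."""
    return [row[::-1] for row in image]


def paired_random_flip(ct_slice, cbct_slice, rng, p=0.5):
    """Flip both slices together with probability p so the pair stays aligned."""
    if rng.random() < p:
        return flip_horizontal(ct_slice), flip_horizontal(cbct_slice)
    return ct_slice, cbct_slice


rng = random.Random(0)  # fixed seed for a reproducible illustration
ct = [[1, 2], [3, 4]]
cbct = [[5, 6], [7, 8]]
ct_aug, cbct_aug = paired_random_flip(ct, cbct, rng)
```

Drawing a single random number per pair, rather than one per image, is what keeps the augmented CT and CBCT slices in correspondence.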

Network architecture

The CLCGAN network model applied contrastive learning (CL) to the dual-domain CycleGAN. It used only the features that were similar across the two domains for image generation, and thereby removed streak artifacts. Accordingly, CLCGAN comprised two branches: CycleGAN and CL. CycleGAN realized the mutual mapping of CBCT/CT to CT/CBCT to obtain the feature information of the two samples. CL imposed constraints on the feature space to better guide image generation. Figure 1a shows the network architecture of CLCGAN. The implementation details of these two branches are described as follows.

Fig. 1

Architecture and module details of the CLCGAN network: a illustrates the overall architecture of CLCGAN; and b showcases the detailed principle of contrastive learning, where (b−1) and (b−2) show the internal diagrams of the generator and discriminator, respectively

CycleGAN contained two symmetric sub-networks for generating 4D-sCT (CT → sCBCT → sCT) and 4D-sCBCT (CBCT → sCT → sCBCT). Each sub-network comprised two generators and one discriminator. Figure 1b shows the architecture of the generators: each generator comprised a three-layer encoder, nine residual blocks, and a three-layer decoder, whereas the discriminator comprised a four-layer encoder. The two sub-networks were trained simultaneously to extract features from CBCT and CT and thus form a feature space for regularization. Network performance was improved by optimizing the loss function between the generated and original images; the model tended to converge once the discriminator could no longer distinguish sCT and sCBCT from CT and CBCT. Streak artifacts in 4D-CBCT were thereby removed by generating 4D-sCT, although the effect of artifact removal was weak.

Accordingly, we combined CL to constrain the feature space and realize streak-artifact removal in latent space. CL is an unsupervised learning method. The main idea is to label low-difference features with similar or common properties in CBCT and CT as “positive” and the remaining features as “negative”. During training, only “positive” features were used for image reconstruction or image recovery. To maintain the model architecture, features were extracted directly from the encoder of the generator, and the features from each layer were sent to a two-layer multilayer perceptron. In the feature embedding space, the feature \(\hat{x}\) from one side (CT or CBCT) served as the query, whereas the other side provided the positive feature \(\hat{x}^{+}\) and k negative features \(\{\hat{x}_{i}^{-}\}_{i=1}^{k}\). Positive features lay close to the query, so they were correlated with each other (non-streaking → ← non-streaking); negative features were kept apart from the query (streaking ← → non-streaking).

To visualize the impact of CL, the features extracted for image generation with and without CL were visualized using t-distributed Stochastic Neighbor Embedding (t-SNE) [32]. The results are shown in Fig. 2. The two feature sets lay closer together and overlapped more after CL was applied. When t-SNE is used to compare two feature sets that are similar to some degree, the corresponding data points tend to overlap and intermingle in the two-dimensional embedding rather than exhibit distinct boundaries. Therefore, the features selected for generating the sCT were free of streak artifacts.

Fig. 2

t-SNE Plots of Learned Features with and without CL. a and b represent the feature distribution obtained without and with the incorporation of contrastive learning, respectively

Loss function

In the experiment, the final loss function included a loss function \(L_{cont}\) for enforcing the distribution of the specified features, a loss function \(L_{adv}\) for minimizing the difference between the expected and predicted values of 4D-CT/4D-CBCT, and a loss function \(L_{cyc}\) for minimizing the difference between the original images of 4D-CT/4D-CBCT and the generated images. To further preserve the structure and content information of the images, a frequency loss function \(L_{freq}\) was utilized to fully leverage the frequency domain information. The overall loss function is represented as

$$L_{{{\text{total}}}} = \lambda_{1} L_{cont} + \lambda_{2} L_{cyc} + \lambda_{3} L_{adv} + \lambda_{4} L_{freq} ,$$


where \(\lambda_{i}\) is the weight of each term; we set \(\lambda_{1}\), \(\lambda_{2}\), \(\lambda_{3}\), and \(\lambda_{4}\) to 2, 1, 1, and 0.01, respectively.
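As a small illustration of how the four terms combine, the sketch below evaluates the weighted sum with the weights 2, 1, 1, and 0.01 applied to the contrastive, cycle consistency, adversarial, and frequency terms in the order they appear in \(L_{total}\). The constant and function names are hypothetical, and the loss values passed in are placeholders rather than results from the study.

```python
# Weights of the contrastive, cycle consistency, adversarial, and frequency terms.
LAMBDA_CONT = 2.0
LAMBDA_CYC = 1.0
LAMBDA_ADV = 1.0
LAMBDA_FREQ = 0.01


def total_loss(l_cont, l_cyc, l_adv, l_freq):
    """Weighted sum of the four loss terms."""
    return (LAMBDA_CONT * l_cont
            + LAMBDA_CYC * l_cyc
            + LAMBDA_ADV * l_adv
            + LAMBDA_FREQ * l_freq)


# Placeholder values, chosen only to show the arithmetic.
print(total_loss(0.5, 0.2, 0.7, 10.0))
```

The small weight on the frequency term reflects that an unnormalized squared spectral difference is typically much larger in magnitude than the other terms.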

Contrastive loss: The features were normalized as \(f = E(\hat{x})\), \(f^{+} = E(\hat{x}^{+})\), and \(f_{i}^{-} = E(\hat{x}_{i}^{-})\), and the loss that regularizes the feature distribution is denoted as

$$L_{cont} (G_{SN}, G_{NS}) = E_{s\sim S,n\sim N} \left[ - \log \frac{\text{sim}(f,f^{+})}{\text{sim}(f,f^{+}) + \sum\nolimits_{i = 1}^{k} \text{sim}(f,f_{i}^{-})} \right],$$


$$\text{sim}(u,v) = \exp \left( \frac{u^{\top} v}{\left\| u \right\|\left\| v \right\|\tau} \right),$$


where sim(u, v) is the exponential of the temperature-scaled cosine similarity between the feature vectors u and v, and τ is the temperature parameter, which was set to 0.07.
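To make the contrastive term concrete, the following minimal sketch evaluates it for a single query, one positive feature, and a list of negative features, using plain Python lists as feature vectors. It is an illustration of the expressions above, not the authors' implementation; the function names are hypothetical, and the expectation over samples is omitted.

```python
import math


def sim(u, v, tau=0.07):
    """Exponential of the cosine similarity between u and v, scaled by 1/tau."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return math.exp(dot / (norm_u * norm_v * tau))


def contrastive_loss(query, positive, negatives, tau=0.07):
    """Negative log of the positive similarity relative to all similarities."""
    pos = sim(query, positive, tau)
    neg = sum(sim(query, n, tau) for n in negatives)
    return -math.log(pos / (pos + neg))


# A positive aligned with the query gives a loss near 0 ...
low = contrastive_loss([1.0, 0.0], [1.0, 0.0], [[0.0, 1.0]])
# ... whereas a positive orthogonal to the query gives a large loss.
high = contrastive_loss([1.0, 0.0], [0.0, 1.0], [[1.0, 0.0]])
print(low, high)
```

With τ = 0.07 the exponent reaches about 14.3 for perfectly aligned vectors, so the loss separates aligned and misaligned pairs sharply.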

Adversarial loss: The discriminator \(D_{N}\)/\(D_{S}\) was trained so that its output on real 4D-CT/4D-CBCT images was close to 1 and its output on generated 4D-sCT/4D-sCBCT images was close to 0. The generator was trained to minimize \(L_{adv}\), so that the discriminator output on the generated images was as close to 1 as possible. The adversarial loss function is therefore denoted as

$$L_{adv} (G_{SN}, D_{N}) = E_{n\sim N} \left[ \log D_{N} (n) \right] + E_{s\sim S} \left[ \log (1 - D_{N} (G_{SN} (s))) \right],$$
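The adversarial expression can be evaluated directly once the discriminator outputs are known. The sketch below assumes the outputs are probabilities in [0, 1] and replaces each expectation with a mean over a small batch; the small constant `eps` is an added numerical safeguard, and the function name is hypothetical.

```python
import math


def adversarial_loss(d_real, d_fake, eps=1e-12):
    """Mean log D(real) plus mean log(1 - D(fake)).

    d_real: discriminator outputs on real images (here, 4D-CT).
    d_fake: discriminator outputs on generated images (here, 4D-sCT).
    eps is an assumption added to avoid log(0); it is not part of the formula.
    """
    real_term = sum(math.log(p + eps) for p in d_real) / len(d_real)
    fake_term = sum(math.log(1.0 - p + eps) for p in d_fake) / len(d_fake)
    return real_term + fake_term


# A discriminator that separates the two sets perfectly reaches the maximum, 0.
print(adversarial_loss([1.0, 1.0], [0.0, 0.0]))
# A discriminator that outputs 0.5 everywhere gives 2 * log(0.5).
print(adversarial_loss([0.5], [0.5]))
```

The value falls as the generated images push the discriminator output toward 1, which is the direction in which the generator is trained.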


Cycle consistency loss: The generators \(G_{SN}\)/\(G_{NS}\) were trained to minimize \(L_{cyc}\) so that the difference between the cycle-reconstructed image and the real sample s/n was minimized. The cycle consistency loss function is denoted as

$$L_{cyc} = E_{n\sim N} \left[ \left\| G_{SN} \left( G_{NS} \left( n \right) \right) - n \right\|_{1} \right] + E_{s\sim S} \left[ \left\| G_{NS} \left( G_{SN} \left( s \right) \right) - s \right\|_{1} \right],$$
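A single-sample version of the cycle consistency term is sketched below. The cycled images are passed in as arguments so that the example needs no generator network; images are nested lists, the function names are hypothetical, and the L1 norm is taken as a plain sum of absolute differences (many implementations use the mean instead, which differs only by a constant factor).

```python
def l1_distance(image_a, image_b):
    """Sum of absolute pixel differences between two equally sized 2D images."""
    return sum(abs(a - b)
               for row_a, row_b in zip(image_a, image_b)
               for a, b in zip(row_a, row_b))


def cycle_consistency_loss(n, n_cycled, s, s_cycled):
    """L1 error of both cycles for one sample from each domain.

    n_cycled stands for G_SN(G_NS(n)) and s_cycled for G_NS(G_SN(s)).
    """
    return l1_distance(n_cycled, n) + l1_distance(s_cycled, s)


n = [[0, 100], [200, 300]]
s = [[10, 90], [210, 310]]
# A perfect cycle for n and a one-pixel error of 5 for s.
print(cycle_consistency_loss(n, n, s, [[10, 95], [210, 310]]))
```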


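Frequency loss: The frequency loss penalizes the difference between the Fourier transforms (FT) of the cycle-reconstructed images and those of the original images. It is denoted as

$$L_{freq} = E_{n\sim N} \left[ \left\| FT(G_{SN} (G_{NS} (n))) - FT(n) \right\|_{2}^{2} \right] + E_{s\sim S} \left[ \left\| FT(G_{NS} (G_{SN} (s))) - FT(s) \right\|_{2}^{2} \right],$$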
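The frequency term can be illustrated on very small images with a direct two-dimensional discrete Fourier transform. This sketch assumes that FT denotes an unnormalized discrete Fourier transform, which the source does not specify; the naive transform below is only practical for tiny inputs, and all function names are hypothetical.

```python
import cmath


def dft2(image):
    """Naive unnormalized 2D discrete Fourier transform of a list-of-rows image."""
    rows, cols = len(image), len(image[0])
    spectrum = []
    for u in range(rows):
        spectrum_row = []
        for v in range(cols):
            total = 0j
            for y in range(rows):
                for x in range(cols):
                    angle = -2.0 * cmath.pi * (u * y / rows + v * x / cols)
                    total += image[y][x] * cmath.exp(1j * angle)
            spectrum_row.append(total)
        spectrum.append(spectrum_row)
    return spectrum


def squared_l2(spec_a, spec_b):
    """Sum of squared magnitudes of the difference between two spectra."""
    return sum(abs(a - b) ** 2
               for row_a, row_b in zip(spec_a, spec_b)
               for a, b in zip(row_a, row_b))


def frequency_loss(n, n_cycled, s, s_cycled):
    """Squared spectral error of both cycles for one sample from each domain."""
    return (squared_l2(dft2(n_cycled), dft2(n))
            + squared_l2(dft2(s_cycled), dft2(s)))


n = [[1.0, 2.0], [3.0, 4.0]]
print(frequency_loss(n, [[1.0, 2.0], [3.0, 5.0]], n, n))
```

For an unnormalized transform, Parseval's theorem implies that this value equals the number of pixels times the squared pixel-domain error, so a one-pixel error of 1 in a 2 × 2 image gives a value of 4.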


Parameter selection

During training, a batch size of 1 and instance normalization were used. The training images were randomly cropped into 512 × 512 blocks in a paired manner for CL. The Adam optimizer with \(\beta_{1}\) = 0.5, \(\beta_{2}\) = 0.999, and a learning rate of 0.0002 was adopted, and the model was trained for 100 epochs starting from epoch 0. The network was implemented in the PyTorch framework on a deep-learning server (Intel(R) Xeon(R) Gold 6133 CPU @ 2.50 GHz, NVIDIA A100 80 GB, 256 GB).
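To show what these optimizer settings mean in practice, the sketch below performs one Adam update on a single scalar parameter with \(\beta_{1}\) = 0.5, \(\beta_{2}\) = 0.999, and a learning rate of 0.0002. It is a plain-Python illustration of the standard Adam rule rather than the training code; the value `eps=1e-8` is an assumed default that the source does not state, and `adam_step` is a hypothetical name.

```python
import math


def adam_step(param, grad, m, v, t, lr=2e-4, beta1=0.5, beta2=0.999, eps=1e-8):
    """One Adam update for a scalar parameter; t is the 1-based step count."""
    m = beta1 * m + (1.0 - beta1) * grad          # first-moment estimate
    v = beta2 * v + (1.0 - beta2) * grad * grad   # second-moment estimate
    m_hat = m / (1.0 - beta1 ** t)                # bias correction
    v_hat = v / (1.0 - beta2 ** t)
    param = param - lr * m_hat / (math.sqrt(v_hat) + eps)
    return param, m, v


# On the first step the update size is approximately the learning rate,
# regardless of the gradient magnitude.
param, m, v = adam_step(param=1.0, grad=3.0, m=0.0, v=0.0, t=1)
print(param)
```

A lower \(\beta_{1}\) than the common 0.9 shortens the memory of the gradient average, a choice often made for adversarial training.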

Evaluation methods

Image-quality assessment

To evaluate the effect of the CLCGAN model in removing image artifacts, we selected five cases comprising 2500 paired 4D-CT and 4D-CBCT slices that were not used for training. The resolution and size of the testing data were kept consistent with those of the training data. The evaluation comprised two parts: comparing the generated 4D-sCT with the original 4D-CT, and comparing the 4D-sCT generated by the CLCGAN and CycleGAN networks.

To quantitatively evaluate image quality, the 4D-CBCT and the 4D-sCT generated by CycleGAN and by CLCGAN were compared with the original 4D-CT using the structural similarity index (SSIM) and the peak signal-to-noise ratio (PSNR). To enable better use of 4D-sCT for guidance and dose calculation in lung-cancer radiation therapy, the CT values of 4D-CBCT and 4D-sCT were compared with those of 4D-CT using the mean error (ME) and mean absolute error (MAE). To ensure an accurate evaluation of the training results, the precision of the registration was measured by calculating the mutual information (MI). Lastly, paired t-tests were performed in Statistical Product and Service Solutions (SPSS) software to assess significant differences between all 4D-sCT and 4D-CBCT results. Because multiple hypothesis tests were conducted, all p-values were assessed after Bonferroni correction, and differences were considered significant when the p-value was less than 0.003. The corresponding expressions are shown below:

$$SSIM(X,Y) = \frac{{\left( {2{\mu}_{X} {\mu}_{Y} + C_{1} } \right)(2{\sigma}_{X} {\sigma}_{Y} + C_{2} )}}{{({\mu}_{X}^{2} + {\mu}_{Y}^{2} + C_{1} )\left( {{\sigma}_{X}^{2} + {\sigma}_{Y}^{2} + C_{2} } \right)}},$$


$$PSNR = 10\log_{10} \frac{{\max \left| {X\left( {i,j} \right)} \right|^{2} }}{MSE},$$


$$MSE = \frac{1}{M \times N}\sum\limits_{i = 1}^{M} {\sum\limits_{j = 1}^{N} {(X(i,j) - Y(i,j))^{2} } } ,$$


$$ME(X,Y) = \frac{1}{M \times N}\sum\limits_{i = 1}^{M} {\sum\limits_{j = 1}^{N} {(X(i,j) - Y(i,j))} } ,$$


$$MAE(X,Y) = \frac{1}{M \times N}\sum\limits_{i = 1}^{M} {\sum\limits_{j = 1}^{N} {|X(i,j) - Y(i,j)|} } ,$$


$$p_{i} = h_{i} /\left( {\sum\limits_{j = 0}^{N - 1} {h_{j} } } \right),$$


$$H\left( Y \right) = - \sum\limits_{i = 0}^{N - 1} {p_{i} } \log p_{i} ,$$


$$H\left( {X,Y} \right) = - \sum\limits_{x,y} {p_{xy} \left( {x,y} \right)\log p_{xy} (x,y)} ,$$


$$MI(X,Y) = H(X) + H(Y) - H(X,Y),$$
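The entropy-based expressions above can be sketched in a few lines. The example treats each image as a flat list of integer gray levels, estimates the probabilities from histogram counts, and uses the natural logarithm, since the source does not state the base; the function names are hypothetical.

```python
import math
from collections import Counter


def entropy(values):
    """Shannon entropy of a sequence, with probabilities taken from its histogram."""
    counts = Counter(values)
    total = len(values)
    return -sum((c / total) * math.log(c / total) for c in counts.values())


def mutual_information(x, y):
    """MI(X, Y) = H(X) + H(Y) - H(X, Y) for two equally sized flattened images."""
    if len(x) != len(y):
        raise ValueError("images must contain the same number of pixels")
    joint = list(zip(x, y))  # joint histogram over pairs of gray levels
    return entropy(x) + entropy(y) - entropy(joint)


x = [0, 0, 1, 1]
print(mutual_information(x, x))             # identical images: MI equals H(X)
print(mutual_information(x, [0, 1, 0, 1]))  # unrelated gray levels: MI is 0
```

Identical images give the largest value for a given histogram, which is why a higher MI is read as better registration.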


In the expression for SSIM, X represents 4D-CBCT or 4D-sCT, and Y represents 4D-CT. \(\mu_{X}\) and \(\mu_{Y}\) denote the average pixel values of images X and Y, respectively, and \(\sigma_{X}\) and \(\sigma_{Y}\) denote their standard deviations, so that \(\sigma_{X}^{2}\) and \(\sigma_{Y}^{2}\) are the variances. \(C_{1}\) and \(C_{2}\) are regularization constants taken as \((0.01 \times 2000)^{2}\) and \((0.03 \times 2000)^{2}\), respectively. The dynamic range of the image pixels was 4095. In the expressions for the mean-square error (MSE), ME, and MAE, X represents 4D-CBCT or 4D-sCT, whereas Y represents 4D-CT. M and N represent the width and height of the input images, respectively. PSNR was computed from the ratio of the squared maximum pixel value to the MSE. In the expressions for \(p_{i}\), H(Y), H(X,Y), and MI, X and Y denote two images, \(h_{i}\) represents the number of pixels in image Y with gray level i, N represents the number of gray levels in image Y, and \(p_{i}\) represents the probability of gray level i. H(Y) denotes the entropy of an image, and H(X,Y) denotes the joint entropy of X and Y. MI reflects the amount of information shared between two images, with values ranging from 0 to positive infinity. The higher the similarity or overlap between the images, the smaller the joint entropy and the greater the MI. After conducting paired t-tests, statistical significance was observed in the SSIM, PSNR, ME, MAE, and MI of the 4D-sCT images.
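The pixel-wise metrics defined above are simple enough to check by hand. The following sketch computes MSE, ME, MAE, and PSNR for two small images stored as nested lists, with X as the evaluated image and Y as the reference; the function names are hypothetical and the pixel values are placeholders rather than measured CT numbers. SSIM is left out to keep the example short.

```python
import math


def _pixel_pairs(x, y):
    """Pair up corresponding pixels of two equally sized 2D images."""
    return [(a, b) for row_x, row_y in zip(x, y) for a, b in zip(row_x, row_y)]


def mse(x, y):
    pairs = _pixel_pairs(x, y)
    return sum((a - b) ** 2 for a, b in pairs) / len(pairs)


def me(x, y):
    pairs = _pixel_pairs(x, y)
    return sum(a - b for a, b in pairs) / len(pairs)


def mae(x, y):
    pairs = _pixel_pairs(x, y)
    return sum(abs(a - b) for a, b in pairs) / len(pairs)


def psnr(x, y):
    """10 * log10(max|X|^2 / MSE); undefined when the images are identical."""
    peak = max(abs(a) for row in x for a in row)
    return 10.0 * math.log10(peak ** 2 / mse(x, y))


x = [[100, 200], [300, 400]]
y = [[110, 190], [300, 400]]
print(mse(x, y), me(x, y), mae(x, y), psnr(x, y))
```

In this example the signed errors cancel, so ME is 0 while MAE is 5, which illustrates why the two are reported together.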

To measure the local information of CT values, 35 × 35, 15 × 15, and 25 × 25 regions of interest (ROIs) were outlined in the lungs, bones, and soft tissues, respectively, on the 4D-CBCT, 4D-CT, and 4D-sCT images of five patients, and the mean CT values were measured. The mean CT value difference between 4D-sCT and 4D-CT was smaller than that between 4D-CBCT and 4D-CT, and the images generated by CLCGAN had the smallest differences. Moreover, to evaluate the CT value errors of the lung tumor, 15 × 15 ROIs were outlined in the region of the lung tumor on the 4D-CBCT, 4D-CT, and 4D-sCT images of the five patients. The results indicated that the CT value error of CLCGAN was smaller.
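A mean CT value over a square ROI can be computed as in the sketch below. It assumes that the ROI is centered on a chosen pixel and has an odd side length, as is the case for the 35, 15, and 25 pixel sizes; how the ROIs were actually placed is not described in the source, and `roi_mean` is a hypothetical name.

```python
def roi_mean(image, center_row, center_col, size):
    """Mean pixel value in a size x size ROI centered on (center_row, center_col)."""
    if size % 2 == 0:
        raise ValueError("size must be odd so the ROI is centered on a pixel")
    half = size // 2
    top, bottom = center_row - half, center_row + half
    left, right = center_col - half, center_col + half
    if top < 0 or left < 0 or bottom >= len(image) or right >= len(image[0]):
        raise ValueError("ROI extends outside the image")
    values = [image[r][c]
              for r in range(top, bottom + 1)
              for c in range(left, right + 1)]
    return sum(values) / len(values)


# A 5 x 5 test image whose pixel value is 5 * row + column.
image = [[5 * r + c for c in range(5)] for r in range(5)]
print(roi_mean(image, 2, 2, 3))
```

Applying the same ROI coordinates to the registered 4D-CT, 4D-CBCT, and 4D-sCT slices gives the values whose differences are compared.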

Dose evaluation

To assess the accuracy of dose calculations, the dose distributions of 4D-CT, 4D-CBCT, and 4D-sCT were compared and the relative percentage difference (RPD) was calculated. Targets were delineated on each phase of the 4D-CT of the five tested patients, and the gross tumor volume (GTV) and planning target volume (PTV) contours averaged over the ten phases were used for volumetric-modulated arc therapy planning in a planning system (Monaco 5.1, Elekta). A prescription dose of 6000 cGy over 30 days was applied. Subsequently, the 4D-CBCT and the 4D-sCT generated by both methods were rigidly registered with the reference 4D-CT, and the structure contours and treatment plans from the reference 4D-CT were copied to each image. Dose calculations were performed on all images, and dose–volume histogram (DVH) parameters were assessed for the PTV, left lung, right lung, and spinal cord. For the PTV, the dose at D98% and D2% was calculated, whereas for the spinal cord, the dose at D2% was calculated. For the left and right lungs, the lung volume was calculated at V20Gy and V5Gy, respectively.
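The DVH parameters named above follow common definitions: Dx% is the minimum dose received by the hottest x% of a structure's volume, and VxGy is the percentage of the volume receiving at least x Gy. The sketch below computes both from a flat list of voxel doses, assuming voxels of equal volume. It illustrates the definitions only, since the planning system's internal calculation is not described in the source; the function names and dose values are hypothetical.

```python
import math


def dose_at_volume(doses, volume_percent):
    """D_x%: minimum dose received by the hottest volume_percent of voxels."""
    ordered = sorted(doses, reverse=True)
    count = max(1, math.ceil(len(ordered) * volume_percent / 100.0))
    return ordered[count - 1]


def volume_at_dose(doses, threshold_gy):
    """V_xGy: percentage of voxels receiving at least threshold_gy."""
    covered = sum(1 for d in doses if d >= threshold_gy)
    return 100.0 * covered / len(doses)


# Placeholder structure with voxel doses of 1, 2, ..., 100 Gy.
doses = [float(d) for d in range(1, 101)]
print(dose_at_volume(doses, 2), dose_at_volume(doses, 98))
print(volume_at_dose(doses, 20), volume_at_dose(doses, 5))
```

D2% and D98% act as robust stand-ins for the maximum and minimum dose, which is why they are reported for the PTV and the spinal cord.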

$$RPD = \frac{|A - F|}{{(A + F)/2}} \times 100\% ,$$


In the expression for RPD, A represents the dose or volume obtained from 4D-CT, and F represents the dose or volume obtained from 4D-CBCT or 4D-sCT (CycleGAN and CLCGAN).
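The RPD expression translates directly into a short function. The sketch below is an illustration with placeholder numbers rather than values from the study, and the function name is hypothetical.

```python
def relative_percentage_difference(a, f):
    """RPD between a reference value a (4D-CT) and a compared value f, in percent."""
    mean = (a + f) / 2.0
    if mean == 0:
        raise ValueError("RPD is undefined when the mean of a and f is zero")
    return abs(a - f) / mean * 100.0


# Placeholder DVH values, chosen only to show the arithmetic.
print(relative_percentage_difference(110.0, 90.0))
```

Because the difference is divided by the mean of the two values, the measure is symmetric: swapping the reference and the compared value gives the same RPD.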
