
A combined encoder–transformer–decoder network for volumetric segmentation of adrenal tumors | BioMedical Engineering OnLine


Patient selection

This retrospective study was approved by the ethics committee of Tongde Hospital of Zhejiang Province (approval number 2022-183). The data were anonymized, and the requirement for informed consent was waived.

A total of 182 patients diagnosed with adrenal metastases between January 2014 and August 2019 were considered for inclusion in this study. An experienced radiologist and a novice radiologist in the radiology department of our institution labeled the adrenal tumors on the CT images. Patients first received 100–120 ml of contrast material (Ultravist, Bayer Schering Pharma, Berlin, Germany) through an intravenous cannula inserted into a forearm vein and then underwent the multidetector CT examination. The original images were reconstructed from a 5-mm slice thickness into a 1.5-mm slice thickness. The CT acquisition parameters were: 120 kV, 250–300 mA, 1.5–2.5 mm detector collimation, 1:1 table pitch, and 5-mm slice thickness. One author (J.W., abdominal imaging fellow) manually reviewed each sample to ensure that every tumor measured at least 10 mm along its long axis, for two reasons: (a) when a tumor is very small, its boundary with normal adrenal tissue may not be apparent; and (b) very small tumors are more difficult to delineate manually, which lowers labeling consistency. Tumors smaller than 10 mm were therefore excluded.

Data pre-processing

For each CT sample, the window level and width were set to 40 and 300, respectively, to suppress intensity information irrelevant to tumor segmentation. Taking one sample as an example, its size along the x, y, and z axes is \(512 \times 512 \times 101\). The x and y axes were down-sampled to 0.5 times the original size, and the z-axis was resampled to a 1-mm slice thickness according to the voxel spacing. Cubic interpolation was used for resampling. Finally, the intensities were normalized to simplify computation and unify the value scale.
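
The windowing step can be illustrated with a minimal sketch. A level of 40 and a width of 300 correspond to the intensity range [-110, 190]. The function name `window_and_normalize` is hypothetical, and min-max scaling to [0, 1] is an assumption, since the text does not state which normalization was used.

```python
def window_and_normalize(values, level=40.0, width=300.0):
    """Clip intensities to the CT window, then rescale to [0, 1].

    Min-max scaling is an assumption; the source only says that
    a normalization operation is performed.
    """
    low = level - width / 2.0   # 40 - 150 = -110
    high = level + width / 2.0  # 40 + 150 = 190
    clipped = [min(max(v, low), high) for v in values]
    return [(v - low) / (high - low) for v in clipped]


if __name__ == "__main__":
    # Values outside the window saturate at 0 or 1.
    print(window_and_normalize([-1000, -110, 40, 190, 1000]))
```

Intensities outside the window saturate at the ends of the range, which is how the step discards contrast that is not relevant to the tumor.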

Proposed method

Our overall network architecture is shown in Fig. 6, with the encoder, decoder, and Transformer as the main framework. The encoder first down-samples the 3D input data while gradually extracting local features to generate a high-dimensional feature map. Then, the Transformer processes the high-dimensional feature map and establishes the global dependency. Finally, through skip connections, the new feature maps are combined at each step with the feature maps saved by the encoder at the corresponding down-sampling stage, and the segmentation results are obtained after up-sampling.

Fig. 6 Overall architecture of the network

Network encoder

For the volumetric medical image segmentation performed in this study, our input data were three-dimensional: \(\mathrm{X}\in {\mathrm{R}}^{\mathrm{C}\times \mathrm{W}\times \mathrm{H}\times \mathrm{D}}\), where C is the number of channels, W is the width of each image slice, H is the height of each image slice, and D is the number of slices in the current input sample. In this model, the input image has dimensions \(256\times 256\times 32\). The encoder stage performs down-sampling four times. At each down-sampling step, a 3D CNN halves the spatial dimensions and doubles the channel dimension. The feature maps generated at each stage are temporarily saved. Because the receptive field of the shallow network is relatively small, the extracted features generally contain edge, texture, and angle information.

In contrast, because of its greater receptive field, the deeper network can extract more abstract information, i.e., deep semantic information. To extract deeper features while ensuring a smaller number of parameters, we used DSConv for the down-sampling. The dimensions of the feature map after the last down-sampling are \({\mathrm{X}}^{\mathrm{^{\prime}}}\in {\mathrm{R}}^{{\mathrm{C}}^{\mathrm{^{\prime}}}\times \frac{\mathrm{W}}{16}\times \frac{\mathrm{H}}{16}\times \frac{\mathrm{D}}{16}}\), where \({\mathrm{C}}^{\mathrm{^{\prime}}}=256\).
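
The shape bookkeeping of the encoder can be traced with a short sketch. The helper `encoder_shapes` is hypothetical, and the starting channel count of 16 is an inference rather than a stated value: it is the number that reaches \({\mathrm{C}}^{\mathrm{^{\prime}}}=256\) after four doublings.

```python
def encoder_shapes(in_shape=(256, 256, 32), base_channels=16, num_down=4):
    """Return the (C, W, H, D) shape before and after each down-sampling.

    base_channels=16 is an assumption chosen so that four doublings
    give the reported C' = 256.
    """
    channels = base_channels
    w, h, d = in_shape
    shapes = [(channels, w, h, d)]
    for _ in range(num_down):
        channels *= 2                       # channel dimension doubles
        w, h, d = w // 2, h // 2, d // 2    # each spatial axis halves
        shapes.append((channels, w, h, d))
    return shapes


if __name__ == "__main__":
    for stage, shape in enumerate(encoder_shapes()):
        print(stage, shape)
```

Under these assumptions, the last entry is (256, 16, 16, 2), which matches \(\frac{\mathrm{W}}{16}\times \frac{\mathrm{H}}{16}\times \frac{\mathrm{D}}{16}\) for a \(256\times 256\times 32\) input.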

Transformer for feature embedding

(1) Linear projection and positional embedding

The Transformer cannot directly process the high-dimensional feature map \({\mathrm{X}}^{\mathrm{^{\prime}}}\) that is finally generated by the encoder. This feature map first needs to undergo a linear mapping operation. The linear mapping further extracts features using a convolution layer and reshapes the feature map to 256 patches of dimension \(\frac{\mathrm{W}}{16}\times \frac{\mathrm{H}}{16}\times \frac{\mathrm{D}}{16}\). Because the patches are taken from the high-level feature map rather than cut directly from the original image, the Transformer can model local contextual features in the spatial and depth dimensions. The Transformer requires the input to be sequence data, so we use the Reshape operation to flatten the feature map to \(256 \times \mathrm{N}\), where \(\mathrm{N}=\frac{\mathrm{W}}{16}\times \frac{\mathrm{H}}{16}\times \frac{\mathrm{D}}{16}\).
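
As a worked check, substituting the input size stated for this model (\(256\times 256\times 32\)) into the expression for \(\mathrm{N}\) gives the length of the flattened dimension:

$$\begin{array}{c}\mathrm{N}=\frac{256}{16}\times \frac{256}{16}\times \frac{32}{16}=16\times 16\times 2=512\end{array}$$

For this input size, the flattened feature map therefore has dimensions \(256\times 512\).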

The position of each pixel in the image cannot be ignored, because such spatial information is indispensable for the accurate segmentation of tumors. Therefore, we also add a learnable position embedding to retain location information.

(2) Transformer layer

The architecture of the Transformer layer, which consists of a multi-head attention (MHA) and feed forward network, is shown in Fig. 1.

The MHA consists of eight single attention heads, which can be viewed as mapping a collection of query vectors to output vectors based on key-value pairs. The details are shown in Formulas 1 and 2.

$$\begin{array}{c}\mathrm{MultiHead}\left(\mathrm{Q},\mathrm{K},\mathrm{V}\right)=\mathrm{Concat}\left({\mathrm{head}}_{1},\dots ,{\mathrm{head}}_{8}\right){\mathrm{W}}^{\mathrm{O}}\\ \text{where } {\mathrm{head}}_{\mathrm{i}}=\mathrm{Attention}\left(\mathrm{Q}{\mathrm{W}}_{\mathrm{i}}^{\mathrm{Q}},\mathrm{K}{\mathrm{W}}_{\mathrm{i}}^{\mathrm{K}},\mathrm{V}{\mathrm{W}}_{\mathrm{i}}^{\mathrm{V}}\right)\end{array}$$




where \({\mathrm{W}}^{\mathrm{O}}\in {\mathrm{R}}^{8{\mathrm{d}}_{\mathrm{v}}\times \mathrm{D}}\), \({\mathrm{W}}_{\mathrm{i}}^{\mathrm{Q}}\in {\mathrm{R}}^{\mathrm{D}\times {\mathrm{d}}_{\mathrm{k}}}\), \({\mathrm{W}}_{\mathrm{i}}^{\mathrm{K}}\in {\mathrm{R}}^{\mathrm{D}\times {\mathrm{d}}_{\mathrm{k}}}\), and \({\mathrm{W}}_{\mathrm{i}}^{\mathrm{V}}\in {\mathrm{R}}^{\mathrm{D}\times {\mathrm{d}}_{\mathrm{v}}}\) are learnable parameter matrices, and Q, K, and V are query, key, and value, respectively.
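
The Attention function used by each head is not written out in this section. The sketch below assumes the standard scaled dot-product form, softmax of the query-key products divided by the square root of \({\mathrm{d}}_{\mathrm{k}}\), applied to the values. It covers a single head on plain Python lists, and the function names are illustrative only.

```python
import math


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(q, k, v):
    """Scaled dot-product attention for one head (assumed standard form).

    q: list of query vectors, k: list of key vectors (same length d_k),
    v: list of value vectors, one per key.
    """
    d_k = len(k[0])
    out = []
    for q_row in q:
        # Similarity of this query to every key, scaled by sqrt(d_k).
        scores = [
            sum(a * b for a, b in zip(q_row, k_row)) / math.sqrt(d_k)
            for k_row in k
        ]
        weights = softmax(scores)
        # Output is the attention-weighted average of the value vectors.
        out.append([
            sum(w * v_row[j] for w, v_row in zip(weights, v))
            for j in range(len(v[0]))
        ])
    return out
```

In the multi-head case, eight such heads run on separately projected inputs, and their outputs are concatenated and multiplied by \({\mathrm{W}}^{\mathrm{O}}\).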

The Feed Forward Network comprises a fully connected neural network and an activation function.

Network decoder

Before up-sampling, the patches are mapped back to the original feature space (feature mapping), and the up-sampling operation is then performed. The decoder also up-samples four times, with the overall operation mirroring the down-sampling. Because some spatial context information is inevitably lost during down-sampling, we use a skip connection to connect the feature maps from the corresponding down-sampling stage. This skip connection ensures that the new feature maps contain both shallow low-level information and high-level abstract semantic information.

Comparison details

To verify the effectiveness of our proposed method, we compared it with mainstream medical image segmentation methods. The implementation details of the compared methods are as follows:

  a. 3D U-Net first constructs a \(7\times 7\) convolution block, then constructs four encoder and decoder blocks. Finally, a final convolution block is constructed, including a transposed convolution and two sub-convolution blocks.

  b. TransBTS is constructed with a series of components. It starts with four encoder blocks, which are followed by a classical Transformer module containing four Transformer layers, each equipped with eight attention heads. Subsequently, four decoder blocks are added to the model. To complete the architecture, TransBTS incorporates a final convolutional layer and the softmax function.

  c. ResUNet combines ResNet and U-Net by integrating the residual block in each encoder and decoder block. During the skip-connection phase, convolutional blocks are additionally constructed to match the dimensions of the encoder output and the decoder output at the corresponding stage.

  d. UNet++ aggregates 1 to 4 layers of U-Net together and builds a convolutional layer and sigmoid function at the end.

  e. The structure of Attention U-Net is generally the same as that of U-Net, with the difference that Attention U-Net adds a layer of attention gates before the skip connection.

  f. Channel U-Net builds six encoder and decoder blocks and adds the Global Attention Upsample module before the skip connection.

Training details

All networks were implemented in the PyTorch framework and trained on four NVIDIA RTX 3080 GPUs with 10 GB of memory each. We divided the entire data set into an 80% training set and a 20% testing set. The testing set was used to evaluate the segmentation performance of our proposed model, and the results can be seen in Fig. 1. Given the large size of a single sample, 32 consecutive slices were randomly selected from each sample as input data. We adopted the Adam optimizer for training. The weight decay was set to \(1\times {10}^{-5}\); the learning rate was \(2\times {10}^{-4}\) at epoch 0 and \(4\times {10}^{-7}\) at epoch 999 (all networks were trained for 1000 epochs); the batch size was set to 2; and the random seed was set to 1000.
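
The text gives only the two endpoint learning rates, not the schedule connecting them. As an illustration, the sketch below assumes a polynomial decay with power 0.9, which reproduces both endpoints to the reported precision. The schedule, the power, and the names `poly_lr` and `random_slab` are all assumptions. The second helper shows one way to draw 32 consecutive slices from a volume.

```python
import random


def poly_lr(epoch, base_lr=2e-4, max_epochs=1000, power=0.9):
    """Polynomial learning-rate decay (assumed; not stated in the text)."""
    return base_lr * (1 - epoch / max_epochs) ** power


def random_slab(depth, rng, slab=32):
    """Pick `slab` consecutive slice indices from a volume of `depth` slices."""
    start = rng.randint(0, depth - slab)  # randint is inclusive at both ends
    return range(start, start + slab)


if __name__ == "__main__":
    print(poly_lr(0), poly_lr(999))
    rng = random.Random(1000)  # seed value taken from the text
    print(list(random_slab(101, rng))[:3])
```

With these assumed settings, the rate at epoch 999 is about \(3.99\times {10}^{-7}\), which rounds to the reported \(4\times {10}^{-7}\).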

Evaluation metrics

To evaluate the effectiveness of our proposed method, we used the Dice similarity coefficient (DSC) and intersection over union (IOU), which are widely used to evaluate the similarity between segmentation results and ground truth data in medical image segmentation. Furthermore, we used the Hausdorff distance and average surface distance (ASD) to evaluate the similarity of the surface between the segmentation results and ground truth. Mean absolute error (MAE) is used to assess the absolute error. These metrics are defined in Eqs. 3, 4, 5, 6, and 7, respectively:

$$\begin{array}{c}Dice=\frac{2\left|\mathrm{X}\bigcap \mathrm{Y}\right|}{\left|\mathrm{X}\right|+\left|\mathrm{Y}\right|}\end{array}$$


$$\begin{array}{c}IOU=\frac{\left|\mathrm{X}\bigcap \mathrm{Y}\right|}{\left|\mathrm{X}\bigcup \mathrm{Y}\right|}\end{array}$$


$$\begin{array}{c}H\left(\mathrm{A},\mathrm{B}\right)=\max\left(h\left(\mathrm{A},\mathrm{B}\right),h\left(\mathrm{B},\mathrm{A}\right)\right)\\ \text{where } h\left(\mathrm{A},\mathrm{B}\right)=\underset{\mathrm{a}\in \mathrm{A}}{\mathrm{max}}\left\{\underset{\mathrm{b}\in \mathrm{B}}{\mathrm{min}}\Vert \mathrm{a}-\mathrm{b}\Vert \right\},\; h\left(\mathrm{B},\mathrm{A}\right)=\underset{\mathrm{b}\in \mathrm{B}}{\mathrm{max}}\left\{\underset{\mathrm{a}\in \mathrm{A}}{\mathrm{min}}\Vert \mathrm{b}-\mathrm{a}\Vert \right\}\end{array}$$


$$\begin{array}{c}ASD\left(\mathrm{X},\mathrm{Y}\right)=\frac{1}{\left|\mathrm{X}\right|}{\sum }_{\mathrm{x}\in \mathrm{X}}{\mathrm{min}}_{\mathrm{y}\in \mathrm{Y}}d\left(\mathrm{x},\mathrm{y}\right)\end{array}$$


$$\begin{array}{c}MAE=\frac{1}{\mathrm{m}}{\sum }_{\mathrm{i}=1}^{\mathrm{m}}\left|{\mathrm{y}}_{\mathrm{i}}-\mathrm{f}\left({\mathrm{x}}_{\mathrm{i}}\right)\right|\end{array}$$
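
The five metrics can be sketched directly from their definitions. This is a minimal illustration on small sets of voxel coordinates, not the evaluation code used in the study. It assumes Euclidean distance for \(d\), and it treats every point in a set as a surface point, which is a simplification for the Hausdorff distance and ASD.

```python
import math


def dice(x, y):
    """Dice coefficient of two sets of voxel coordinates."""
    return 2 * len(x & y) / (len(x) + len(y))


def iou(x, y):
    """Intersection over union of two sets of voxel coordinates."""
    return len(x & y) / len(x | y)


def directed_hausdorff(a, b):
    """h(A, B): the largest nearest-neighbour distance from A to B."""
    return max(min(math.dist(p, q) for q in b) for p in a)


def hausdorff(a, b):
    """Symmetric Hausdorff distance H(A, B)."""
    return max(directed_hausdorff(a, b), directed_hausdorff(b, a))


def asd(x, y):
    """Average distance from each point of X to its nearest point in Y."""
    return sum(min(math.dist(p, q) for q in y) for p in x) / len(x)


def mae(targets, preds):
    """Mean absolute error between paired values."""
    return sum(abs(t - p) for t, p in zip(targets, preds)) / len(targets)


if __name__ == "__main__":
    X = {(0, 0), (0, 1), (1, 1)}   # toy prediction
    Y = {(0, 1), (1, 1), (1, 2)}   # toy ground truth
    print(dice(X, Y), iou(X, Y), hausdorff(X, Y), asd(X, Y))
```

On the toy sets in the usage example, two of the three voxels overlap, so the Dice coefficient is 2/3 and the IOU is 1/2.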


For statistical analysis, we compare the difference between the prediction results of the proposed method and other methods. We first use the Levene test to check the homogeneity of variance and then conduct Student’s t test between our proposed method and other methods.
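
The two test statistics can be computed as in the sketch below. It assumes the mean-centered form of the Levene test and the pooled-variance (equal-variance) form of Student's t test, since the text does not say which variants were used. Converting either statistic to a p-value needs the F or t distribution from a statistics library and is left out. The function names are illustrative.

```python
import math
import statistics


def levene_w(*groups):
    """Levene W statistic using absolute deviations from each group mean."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    # Absolute deviation of every observation from its own group mean.
    z = [[abs(v - statistics.mean(g)) for v in g] for g in groups]
    z_means = [statistics.mean(zi) for zi in z]
    z_grand = sum(sum(zi) for zi in z) / n_total
    between = sum(len(zi) * (m - z_grand) ** 2 for zi, m in zip(z, z_means))
    within = sum((v - m) ** 2 for zi, m in zip(z, z_means) for v in zi)
    return (n_total - k) / (k - 1) * between / within


def student_t(a, b):
    """Two-sample Student's t statistic with pooled variance."""
    n1, n2 = len(a), len(b)
    pooled_var = (
        (n1 - 1) * statistics.variance(a) + (n2 - 1) * statistics.variance(b)
    ) / (n1 + n2 - 2)
    return (statistics.mean(a) - statistics.mean(b)) / math.sqrt(
        pooled_var * (1 / n1 + 1 / n2)
    )
```

Running the Levene test first is the usual check on the equal-variance assumption that the pooled t statistic relies on.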

Table 3 Quantitative analysis of different transformer layers
