Patient selection
This retrospective study was approved by the ethics committee of Tongde Hospital of Zhejiang Province (approval number 2022183). The data were anonymized, and the requirement for informed consent was waived.
A total of 182 patients diagnosed with adrenal metastases between January 2014 and August 2019 were considered for inclusion in this study. An experienced radiologist and a novice radiologist in the radiology department of our institution labeled the adrenal tumors on the CT images. Each patient first received 100–120 ml of contrast material (Ultravist, Bayer Schering Pharma, Berlin, Germany) through an intravenous cannula inserted into a forearm vein and then underwent multidetector CT examination. The CT acquisition parameters were: 120 kV, 250–300 mA, 1.5–2.5 mm detector collimation, 1:1 table pitch, and 5 mm slice thickness. The original images were reconstructed from the 5 mm slice thickness to a 1.5 mm slice thickness. One author (J.W., abdominal imaging fellow) manually reviewed each sample to ensure that every tumor measured at least 10 mm along the long axis, for two reasons: (a) when a tumor is tiny, its boundary with normal adrenal tissue may not be apparent; (b) tiny tumors are harder to delineate manually, which lowers inter-reader consistency. Tumors smaller than 10 mm were therefore excluded.
Data preprocessing
For each CT sample, the window level and width were set to 40 and 300, respectively, to suppress information irrelevant to tumor segmentation. Taking one sample as an example, its size along the x, y, and z axes is \(512 \times 512 \times 101\). The x and y axes were downsampled to half the original resolution, and the z-axis was resampled to a 1 mm slice thickness according to the voxel spacing, using cubic interpolation. Finally, a normalization operation was performed to simplify computation and unify the value range.
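The windowing, resampling, and normalization steps above can be sketched as follows. This is a minimal illustration, not the authors' code: the `(z, y, x)` array layout, the `scipy.ndimage.zoom` resampler, and min-max normalization are our assumptions (the text specifies only cubic interpolation and the target spacings).

```python
import numpy as np
from scipy.ndimage import zoom

def preprocess(volume, spacing_z, level=40.0, width=300.0):
    """Window, resample, and normalize one CT volume laid out as (z, y, x)."""
    # Clip intensities to the window [level - width/2, level + width/2].
    lo, hi = level - width / 2, level + width / 2
    vol = np.clip(volume.astype(np.float32), lo, hi)
    # Halve x and y resolution; resample z to a 1 mm slice spacing.
    factors = (spacing_z / 1.0, 0.5, 0.5)
    vol = zoom(vol, factors, order=3)  # order=3 -> cubic interpolation
    # Min-max normalization to [0, 1] (assumed normalization scheme).
    vol = (vol - vol.min()) / (vol.max() - vol.min() + 1e-8)
    return vol
```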
Proposed method
Our overall network architecture, with the encoder, decoder, and Transformer as the main components, is shown in Fig. 6. The encoder first downsamples the 3D input while gradually extracting local features to generate a high-dimensional feature map. The Transformer then processes this high-dimensional feature map and establishes global dependencies. Finally, the resulting feature maps are combined, via skip connections, with the feature maps saved at each encoder downsampling step, and the segmentation result is obtained after upsampling.
Network encoder
For the volumetric medical image segmentation performed in this study, the input data are three-dimensional: \(\mathrm{X}\in {\mathrm{R}}^{\mathrm{C}\times \mathrm{W}\times \mathrm{H}\times \mathrm{D}}\), where C is the number of channels, W and H are the width and height of each slice, and D is the number of slices of the current input sample. In this model, the input size is \(256\times 256\times 32\). The encoder performs downsampling four times; at each stage the spatial resolution is halved by a 3D CNN and the number of channels is doubled. The feature map generated at each stage is temporarily saved for the skip connections. Because the receptive field of the shallow layers is relatively small, their features mainly capture edge, texture, and corner information.
In contrast, the deeper layers, with their larger receptive fields, extract more abstract information, i.e., deep semantic information. To extract deeper features while keeping the number of parameters small, we used DSConv for the downsampling. After the last downsampling, the feature map has dimensions \({\mathrm{X}}^{\mathrm{^{\prime}}}\in {\mathrm{R}}^{{\mathrm{C}}^{\mathrm{^{\prime}}}\times \frac{\mathrm{W}}{16}\times \frac{\mathrm{H}}{16}\times \frac{\mathrm{D}}{16}}\), where \({\mathrm{C}}^{\mathrm{^{\prime}}}=256\).
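A downsampling block of the kind described could look like the sketch below. We assume DSConv denotes the standard depthwise-separable convolution (a depthwise convolution followed by a 1×1×1 pointwise convolution); the stride-2 kernel size, BatchNorm, and ReLU choices are ours, since the paper does not give the block internals.

```python
import torch
import torch.nn as nn

class DSConv3d(nn.Module):
    """Depthwise-separable 3D convolution block: a stride-2 depthwise conv
    halves the spatial size, then a 1x1x1 pointwise conv changes the channel
    count, using far fewer parameters than a full Conv3d."""
    def __init__(self, in_ch, out_ch, stride=2):
        super().__init__()
        self.depthwise = nn.Conv3d(in_ch, in_ch, kernel_size=3, stride=stride,
                                   padding=1, groups=in_ch, bias=False)
        self.pointwise = nn.Conv3d(in_ch, out_ch, kernel_size=1, bias=False)
        self.norm = nn.BatchNorm3d(out_ch)
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.act(self.norm(self.pointwise(self.depthwise(x))))
```

Stacking four such blocks (doubling channels each time) reduces a \(256\times 256\times 32\) input to the \(16\times 16\times 2\) feature map with 256 channels described above.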
Transformer for feature embedding
(1) Linear projection and positional embedding
The Transformer cannot directly process the high-dimensional feature map \({\mathrm{X}}^{\mathrm{^{\prime}}}\) produced by the encoder. This feature map first undergoes a linear mapping, which further extracts features with a convolution layer and reshapes the feature map into 256 patches of dimension \(\frac{\mathrm{W}}{16}\times \frac{\mathrm{H}}{16}\times \frac{\mathrm{D}}{16}\). Because the original image is not segmented directly into patches, the Transformer can model local contextual features across the spatial and depth dimensions of the high-level feature map. The Transformer requires sequence data as input, so we use a Reshape operation to flatten the feature map to \(256 \times \mathrm{N}\), where \(\mathrm{N}=\frac{\mathrm{W}}{16}\times \frac{\mathrm{H}}{16}\times \frac{\mathrm{D}}{16}\).
The position of each pixel in the image cannot be ignored; such spatial information is indispensable for accurate tumor segmentation. We therefore also add a learnable position embedding to restore location information.
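The projection and embedding steps can be sketched as below. The text is ambiguous about which axis forms the token sequence; following the TransBTS-style convention, this sketch treats each of the N spatial positions as a token with a C′ = 256-dimensional embedding, and uses a 1×1×1 convolution as the linear mapping. Both choices are assumptions.

```python
import torch
import torch.nn as nn

class PatchEmbed3d(nn.Module):
    """Project the encoder feature map to a token sequence and add a
    learnable positional embedding. For a 256x256x32 input the feature map
    is (C'=256, 16, 16, 2), giving N = 16*16*2 = 512 tokens."""
    def __init__(self, channels=256, n_tokens=512):
        super().__init__()
        self.proj = nn.Conv3d(channels, channels, kernel_size=1)  # linear map
        self.pos = nn.Parameter(torch.zeros(1, n_tokens, channels))

    def forward(self, x):                 # x: (B, C', W/16, H/16, D/16)
        x = self.proj(x)
        x = x.flatten(2).transpose(1, 2)  # -> (B, N, C') token sequence
        return x + self.pos               # add learnable position embedding
```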
(2) Transformer layer
The architecture of the Transformer layer, which consists of multi-head attention (MHA) and a feed-forward network, is shown in Fig. 1.
The MHA consists of eight single attention heads, which can be viewed as mapping a collection of query vectors to output vectors based on keyvalue pairs. The details are shown in Formulas 1 and 2.
$$\begin{array}{c}MultiHead\left(\mathrm{Q},\mathrm{K},\mathrm{V}\right)=Concat\left({\mathrm{head}}_{1},\dots ,{\mathrm{head}}_{8}\right){\mathrm{W}}^{\mathrm{O}}\\ where\,{\mathrm{head}}_{\mathrm{i}}=Attention\left({\mathrm{QW}}_{\mathrm{i}}^{\mathrm{Q}},{\mathrm{KW}}_{\mathrm{i}}^{\mathrm{K}},{\mathrm{VW}}_{\mathrm{i}}^{\mathrm{V}}\right)\end{array}$$
(1)
$$\begin{array}{c}Attention\left(\mathrm{Q},\mathrm{K},\mathrm{V}\right)=softmax\left(\frac{{\mathrm{QK}}^{\mathrm{T}}}{\sqrt{{\mathrm{d}}_{\mathrm{k}}}}\right)V\end{array}$$
(2)
where \({\mathrm{W}}_{\mathrm{i}}^{\mathrm{O}}\in {\mathrm{R}}^{8{\mathrm{d}}_{\mathrm{v}}\times \mathrm{D}}\), \({\mathrm{W}}_{\mathrm{i}}^{\mathrm{Q}}\in {\mathrm{R}}^{\mathrm{D}\times {\mathrm{d}}_{\mathrm{k}}}\) and \({\mathrm{W}}_{\mathrm{i}}^{\mathrm{V}}\in {\mathrm{R}}^{\mathrm{D}\times {\mathrm{d}}_{\mathrm{v}}}\) are learnable parameter matrices, and Q, K, and V are query, key, and value, respectively.
The feed-forward network comprises fully connected layers and an activation function.
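One Transformer layer of the kind described can be sketched with standard PyTorch modules. The pre-norm placement, GELU activation, and 4× hidden expansion of the feed-forward network are our assumptions; the text specifies only the eight attention heads and the MHA-plus-feed-forward structure.

```python
import torch
import torch.nn as nn

class TransformerLayer(nn.Module):
    """One Transformer layer: 8-head self-attention plus a feed-forward
    network, each wrapped with layer normalization and a residual
    connection (pre-norm arrangement assumed)."""
    def __init__(self, dim=256, heads=8, mlp_ratio=4):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.ffn = nn.Sequential(
            nn.Linear(dim, dim * mlp_ratio),
            nn.GELU(),
            nn.Linear(dim * mlp_ratio, dim),
        )

    def forward(self, x):  # x: (B, N, dim) token sequence
        h = self.norm1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]  # MHA + residual
        return x + self.ffn(self.norm2(x))                 # FFN + residual
```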
Network decoder
Before upsampling, the patch sequence is first mapped (feature mapping) back to the original spatial layout; the upsampling is then performed. The decoder also upsamples four times, with each stage corresponding to a downsampling stage. Because some spatial context is inevitably lost during downsampling, we use skip connections to combine the feature maps from the corresponding downsampling stages. These skip connections ensure that the new feature maps contain both shallow low-level information and high-level abstract semantic information.
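A single decoder stage of this kind can be sketched as below. The transposed-convolution upsampler, channel-wise concatenation for the skip connection, and the 3×3×3 fusion block are our assumptions about the block internals.

```python
import torch
import torch.nn as nn

class UpBlock3d(nn.Module):
    """One decoder stage: upsample by 2 with a transposed convolution,
    concatenate the saved encoder feature map from the matching stage
    (the skip connection), then fuse with a convolution."""
    def __init__(self, in_ch, skip_ch, out_ch):
        super().__init__()
        self.up = nn.ConvTranspose3d(in_ch, out_ch, kernel_size=2, stride=2)
        self.fuse = nn.Sequential(
            nn.Conv3d(out_ch + skip_ch, out_ch, kernel_size=3, padding=1),
            nn.BatchNorm3d(out_ch),
            nn.ReLU(inplace=True),
        )

    def forward(self, x, skip):
        x = self.up(x)                   # double each spatial dimension
        x = torch.cat([x, skip], dim=1)  # skip connection from the encoder
        return self.fuse(x)
```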
Comparison details
To verify the effectiveness of our proposed method, we compared it with mainstream medical image segmentation methods. The implementation details of the compared methods are as follows:

a.
3D UNet first constructs a \(7\times 7\) convolution block, then four encoder and four decoder blocks. Finally, a last convolution block is constructed, comprising a transposed convolution and two sub-convolution blocks.

b.
TransBTS starts with four encoder blocks, followed by a classical Transformer module containing four Transformer layers, each with eight attention heads. Four decoder blocks are then added, and the architecture is completed with a convolutional layer and a softmax function.

c.
ResUNet combines ResNet and UNet by integrating a residual block into each encoder and decoder block. During the skip-connection phase, additional convolutional blocks are constructed to match the dimensions of the encoder output and the decoder output at the corresponding stage.

d.
UNet++ aggregates UNets of depths 1 to 4 and appends a convolutional layer and a sigmoid function at the end.

e.
The structure of Attention UNet is generally the same as that of UNet, with the difference that Attention UNet adds a layer of attention gates before each skip connection.

f.
Channel UNet builds six encoder and decoder blocks and adds a Global Attention Upsample module before each skip connection.
Training details
All networks were implemented in the PyTorch framework and trained on four NVIDIA RTX 3080 GPUs with 10 GB of memory each. We split the entire dataset into an 80% training set and a 20% testing set. The testing set was used to evaluate the segmentation performance of our proposed model; the results are shown in Fig. 1. Given the large size of a single sample, 32 consecutive slices were randomly selected from each sample as input. We used the Adam optimizer with a weight decay of \(1\times {10}^{-5}\); the learning rate was \(2\times {10}^{-4}\) at epoch 0 and decayed to \(4\times {10}^{-7}\) by epoch 999 (all networks were trained for 1000 epochs). The batch size was set to 2, and the random number seed was set to 1000.
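The optimizer setup can be sketched as below. The text gives only the two learning-rate endpoints, so the linear decay between them is an assumption, and the single-convolution `model` is a placeholder for the actual network.

```python
import torch

# Placeholder model standing in for the segmentation network.
model = torch.nn.Conv3d(1, 2, kernel_size=3, padding=1)

# Adam with the stated weight decay and initial learning rate.
optimizer = torch.optim.Adam(model.parameters(), lr=2e-4, weight_decay=1e-5)

# Assumed schedule: linear decay from 2e-4 (epoch 0) to 4e-7 (epoch 999),
# stepping the scheduler once per epoch.
scheduler = torch.optim.lr_scheduler.LinearLR(
    optimizer, start_factor=1.0, end_factor=4e-7 / 2e-4, total_iters=999)
```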
Evaluation metrics
To evaluate the effectiveness of our proposed method, we used the Dice similarity coefficient (DSC) and Intersection over Union (IOU), which are widely used in medical image segmentation to measure the similarity between segmentation results and the ground truth. Furthermore, we used the Hausdorff distance and the Average surface distance (ASD) to evaluate the similarity between the surfaces of the segmentation results and the ground truth. The mean absolute error (MAE) is used to assess the absolute error. These metrics are defined in Eqs. 3, 4, 5, 6, and 7, respectively:
$$\begin{array}{c}Dice=\frac{2\left|\mathrm{X}\bigcap \mathrm{Y}\right|}{\left|\mathrm{X}\right|+\left|\mathrm{Y}\right|}\end{array}$$
(3)
$$\begin{array}{c}IOU=\frac{\left|\mathrm{X}\bigcap \mathrm{Y}\right|}{\left|\mathrm{X}\bigcup \mathrm{Y}\right|}\end{array}$$
(4)
$$\begin{array}{c}H\left(\mathrm{A},\mathrm{B}\right)=max\left(\mathrm{h}\left(\mathrm{A},\mathrm{B}\right),\mathrm{h}\left(\mathrm{B},\mathrm{A}\right)\right)\\ where\,h\left(\mathrm{A},\mathrm{B}\right)=\underset{\mathrm{a}\in \mathrm{A}}{\mathrm{max}}\,\underset{\mathrm{b}\in \mathrm{B}}{\mathrm{min}}\Vert \mathrm{a}-\mathrm{b}\Vert ,\;h\left(\mathrm{B},\mathrm{A}\right)=\underset{\mathrm{b}\in \mathrm{B}}{\mathrm{max}}\,\underset{\mathrm{a}\in \mathrm{A}}{\mathrm{min}}\Vert \mathrm{b}-\mathrm{a}\Vert \end{array}$$
(5)
$$\begin{array}{c}ASD\left(\mathrm{X},\mathrm{Y}\right)=\frac{1}{\left|\mathrm{X}\right|}{\sum }_{\mathrm{x}\in \mathrm{X}}\underset{\mathrm{y}\in \mathrm{Y}}{\mathrm{min}}\,d\left(\mathrm{x},\mathrm{y}\right)\end{array}$$
(6)
$$\begin{array}{c}MAE=\frac{1}{\mathrm{m}}{\sum }_{\mathrm{i}=1}^{\mathrm{m}}\left|{\mathrm{y}}_{\mathrm{i}}-\mathrm{f}\left({\mathrm{x}}_{\mathrm{i}}\right)\right|\end{array}$$
(7)
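The overlap metrics of Eqs. 3 and 4 can be computed directly from binary masks, as in this minimal sketch (NumPy arrays of 0/1 values assumed):

```python
import numpy as np

def dice_iou(pred, gt):
    """Dice (Eq. 3) and IOU (Eq. 4) for binary segmentation masks."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    inter = np.logical_and(pred, gt).sum()   # |X ∩ Y|
    union = np.logical_or(pred, gt).sum()    # |X ∪ Y|
    dice = 2.0 * inter / (pred.sum() + gt.sum())
    iou = inter / union
    return dice, iou
```

Surface-based metrics such as the Hausdorff distance and ASD are typically computed with dedicated tools (e.g. MedPy or SimpleITK) rather than by hand.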
For statistical analysis, we compared the prediction results of the proposed method with those of the other methods. We first used the Levene test to check the homogeneity of variance and then conducted Student's t-test between our proposed method and each of the other methods.
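The two-step test can be sketched with SciPy. The 0.05 threshold on the Levene p-value and the fall-back to Welch's t-test when variances differ are our assumptions; the text specifies only the Levene test followed by Student's t-test.

```python
from scipy import stats

def compare_methods(scores_ours, scores_other, alpha=0.05):
    """Levene test for equal variances, then an independent-samples t-test
    (Student's if variances are homogeneous, Welch's otherwise)."""
    _, p_levene = stats.levene(scores_ours, scores_other)
    equal_var = p_levene >= alpha  # homogeneous variances?
    t, p = stats.ttest_ind(scores_ours, scores_other, equal_var=equal_var)
    return t, p
```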