This section describes the proposed network in detail. We first present the overall framework, then the boundary refinement module of the spatial path, the multi-scale feature selection module of the context path, and the dual-path feature fusion module. Finally, we introduce the multivariate loss function.

### ADFFNet architecture

The overall framework, named the Attention-based Dual-path Feature Fusion Network (ADFFNet), is shown in Fig. 2. We use the classic VGG16 model as the network backbone, which has five basic convolutional blocks. Each convolutional block is composed of a different number of \(3 \times 3\) convolutional layers followed by a max pooling layer. Each time the feature map size is halved, the number of filters is doubled to keep the time complexity of each layer unchanged. A spatial path and a context path are created in the lower and upper layers of the network, respectively, so that the network can extract more spatial information from the low-level features and more context information from the high-level features.

First, in the spatial path, the boundary refinement module enhances the extraction of edge information around the lesion area, which addresses the coarse boundary segmentation of other lesion segmentation networks and corrects the final semantic segmentation result. Second, in the context path, the multi-scale feature selection module enhances the effective information and compresses the redundant information in the multi-scale context, so that the network achieves better prediction and segmentation. Finally, the dual-path feature fusion module uses the attention mechanism to make the network pay more attention to the lesion foreground area of the skin melanoma image during the fusion stage and to output a more refined segmentation result.
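As a rough illustration of the backbone layout described above, the following sketch traces how the feature map size and channel count change across the five VGG16 blocks. The 224×224 input size and the helper name `block_output_shapes` are assumptions made only for illustration. The block configuration is that of the standard VGG16, in which the filter count stops doubling once it reaches 512. The assignment of blocks 1 and 2 to the spatial path and blocks 3 to 5 to the context path follows the module descriptions in this section.

```python
# Standard VGG16 configuration: (number of 3x3 conv layers, output channels) per block.
VGG16_BLOCKS = [(2, 64), (2, 128), (3, 256), (3, 512), (3, 512)]

def block_output_shapes(height, width):
    """Return (path, n_convs, H, W, C) after each block's 2x2 max pooling."""
    shapes = []
    for index, (n_convs, channels) in enumerate(VGG16_BLOCKS, start=1):
        height, width = height // 2, width // 2  # pooling halves the spatial size
        path = "spatial" if index <= 2 else "context"
        shapes.append((path, n_convs, height, width, channels))
    return shapes

# Hypothetical 224x224 input, chosen only for illustration.
shapes = block_output_shapes(224, 224)
for path, n_convs, h, w, c in shapes:
    print(f"{path:8s} convs={n_convs} map={h}x{w} channels={c}")
```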

### Boundary refinement module

Pigmented skin lesions have irregular shapes and blurred boundaries, which makes fine boundary segmentation difficult. However, lesion segmentation requires the network to recover the detailed boundary of the lesion. We find that abundant spatial information, especially image boundary detail, exists in the low-level layers of the network. Therefore, a boundary refinement (BR) module is designed in the spatial path at the low level of the network to focus spatial attention on the target foreground region, which helps fine segmentation of the boundary. As shown in Fig. 3, we denote the output feature maps of C_Block1 and C_Block2 as *C*1 and *C*2, respectively. *C*1 has more texture detail but also contains a lot of background interference, while *C*2 has more semantic information, so we compute the attention from *C*2. \(C2'\) is obtained by upsampling *C*2. Long-strip kernels can concentrate on a specific area and avoid the irrelevant information introduced by a traditional convolution kernel, so that the output feature map pays more attention to the image foreground. As shown in Eqs. (1) and (2), to enlarge the receptive field and obtain global information without increasing the number of parameters, we apply two convolutional layers to the feature map \(C2'\), one with a \(1\times k\) kernel and the other with a \(k\times 1\) kernel. Then, the Sigmoid operation is applied to obtain the attention map *A*, as shown in Eq. (3). As shown in Eq. (4), the final output *X* of the BR module is obtained by weighting *C*1 with *A*.

$$\begin{aligned} X_1=Conv_{k\times 1}(Conv_{1\times k}(C2',W_1^{1}),W_1^{2}) \end{aligned}$$

(1)

$$\begin{aligned} X_2=Conv_{1\times k}(Conv_{k\times 1}(C2',W_2^{1}),W_2^{2}) \end{aligned}$$

(2)

$$\begin{aligned} A=Sigmoid(X_1+X_2) \end{aligned}$$

(3)

$$\begin{aligned} X=A\cdot C1 \end{aligned}$$

(4)

Where *W* denotes the parameters of the convolution kernels, and \(Conv_{1\times k}\) and \(Conv_{k\times 1}\) denote convolution layers with kernels of size \(1\times k\times C\) and \(k\times 1\times 1\), respectively. In our experiments, we set \(k=3\).
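The data flow of Eqs. (1) to (4) can be sketched as follows. This is a minimal NumPy illustration under several assumptions that are not taken from the paper: the feature maps are single-channel, the strip convolutions use zero "same" padding, the weights are random, and the names `strip_conv` and `boundary_refinement` are hypothetical.

```python
import numpy as np

def strip_conv(x, kernel):
    """Zero-padded 'same' cross-correlation of a 2D map with a 1xk or kx1 kernel (k odd)."""
    kh, kw = kernel.shape
    ph, pw = kh // 2, kw // 2
    padded = np.pad(x, ((ph, ph), (pw, pw)))
    out = np.zeros(x.shape, dtype=float)
    for i in range(kh):
        for j in range(kw):
            out += kernel[i, j] * padded[i:i + x.shape[0], j:j + x.shape[1]]
    return out

def boundary_refinement(c1, c2_up, w11, w12, w21, w22):
    x1 = strip_conv(strip_conv(c2_up, w11), w12)  # Eq. (1): 1xk first, then kx1
    x2 = strip_conv(strip_conv(c2_up, w21), w22)  # Eq. (2): kx1 first, then 1xk
    a = 1.0 / (1.0 + np.exp(-(x1 + x2)))          # Eq. (3): attention map A
    return a * c1, a                              # Eq. (4): X = A * C1

rng = np.random.default_rng(0)
k = 3                                             # value used in the experiments
c1 = rng.normal(size=(8, 8))                      # stand-in for C1
c2_up = rng.normal(size=(8, 8))                   # stand-in for the upsampled C2
w11, w22 = rng.normal(size=(1, k)), rng.normal(size=(1, k))
w12, w21 = rng.normal(size=(k, 1)), rng.normal(size=(k, 1))
x, a = boundary_refinement(c1, c2_up, w11, w12, w21, w22)
print(x.shape, float(a.min()), float(a.max()))
```

Each pair of strip kernels holds \(2k\) weights, compared with \(k^2\) for a full \(k\times k\) kernel covering the same window.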

### Multi-scale feature selection module

Multi-level feature maps contain rich context-aware information but also some redundant information that is useless for segmentation. Therefore, we use context-aware pyramid features to extract multi-scale context information at the high level of the network and then filter these features, adaptively assigning more attention to the feature channels that are more effective for segmenting the lesion area, thereby improving the segmentation of the skin lesion area.

#### Multi-scale context information extraction

Context information is very important for image semantic segmentation. Existing CNN models often extract object features by stacking multiple convolution and pooling layers. Because the contrast and colour of the skin lesion area are uneven, an effective skin lesion segmentation method should be able to segment the lesion area according to the context information of its surroundings. The segmentation of an ambiguous skin lesion area depends on the segmentation of the salient lesion area. However, lesion areas vary greatly in size and shape, so single-scale context information from the input image cannot effectively guide feature extraction at different scales, and a single-size convolution kernel can hardly extract the multi-scale information of the lesion area. Therefore, inspired by [27], we design a multi-scale context information extraction module based on atrous convolution.

For the ISIC 2018 skin lesion dataset, an excessively large dilation rate is not appropriate. Too large a dilation rate makes the extracted features contain too much useless information and lose useful information, which leads to unclear edges and missed lesion regions. Therefore, we use small dilation rates of 3, 5, and 7. As shown in Fig. 4, the module consists of a \(1\times 1\) convolution in parallel with three \(3\times 3\) dilated convolutions with dilation rates of 3, 5, and 7, respectively. The original information of the image is preserved through the \(1\times 1\) convolution, and the semantic information is extracted by the dilated convolutions. Finally, all the feature maps are concatenated along the channel dimension. Small convolution kernels focus on the local details of skin lesions, while large convolution kernels focus on their contour information. Multi-scale dilated convolution therefore not only extracts better local details but also better characterizes the contour of the skin lesion area.
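To make the effect of the dilation rates concrete, the sketch below computes the span covered by each branch using the standard relation \(k + (k-1)(r-1)\) for a kernel of size \(k\) and dilation rate \(r\), together with the zero padding that would keep the spatial size unchanged at stride 1. The helper names are hypothetical, and the padding choice is an assumption, since the paper does not state it.

```python
def effective_kernel_size(kernel_size, rate):
    """Span covered by a dilated kernel: k + (k - 1) * (rate - 1)."""
    return kernel_size + (kernel_size - 1) * (rate - 1)

def same_padding(kernel_size, rate):
    """Zero padding per side that keeps the spatial size at stride 1 (assumed)."""
    return (effective_kernel_size(kernel_size, rate) - 1) // 2

# Branches of the module: one 1x1 convolution and three 3x3 dilated convolutions.
branches = [(1, 1), (3, 3), (3, 5), (3, 7)]  # (kernel size, dilation rate)
summary = [(k, r, effective_kernel_size(k, r), same_padding(k, r), k * k)
           for k, r in branches]
for k, r, span, pad, weights in summary:
    print(f"{k}x{k} rate={r}: span={span}x{span}, padding={pad}, "
          f"weights per filter and input channel={weights}")
```

The three dilated branches cover 7×7, 11×11, and 15×15 windows while each still uses only nine weights per filter and input channel.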

#### Feature selection based on attention mechanism

To further filter the multi-scale semantic information captured by the multi-scale context information extraction module, we design a feature selection structure inspired by SKNet [20]. During training, the feature selection structure assigns larger weights to the channels that contribute most to the segmentation of the lesion area and adaptively selects the most suitable receptive field and feature size to achieve accurate prediction. Figure 5 shows the detailed feature selection structure, where \(U'\), \(U''\), \(U'''\in R^{H\times W\times C}\) denote the feature maps obtained from the convolutional blocks \(C_{Block3}\), \(C_{Block4}\), and \(C_{Block5}\), respectively, after multi-scale context information extraction. As shown in Eq. (5), \(\bar{U}\) is obtained by element-wise summation, which integrates the information of the branches:

$$\begin{aligned} \bar{U}=U'+U''+U''' \end{aligned}$$

(5)

Then, the attention feature vectors \(\alpha\), \(\beta\), \(\gamma\) are obtained by applying global average pooling, two fully connected layers, and a Softmax layer to \(\bar{U}\). The final output feature map is \(V=[V_1,V_2,\ldots ,V_c]\), where *c* represents the number of channels. The \(i\)-th channel \(V_i\) (\(1\le i\le c\)) is:

$$\begin{aligned} V_i=\alpha _iU'+ \beta _iU''+ \gamma _iU''' \end{aligned}$$

(6)

where \(\alpha _i\), \(\beta _i\), \(\gamma _i\) are the \(i\)-th elements of the vectors \(\alpha\), \(\beta\), \(\gamma\), respectively, and satisfy Eq. (7):

$$\begin{aligned} \alpha _i+ \beta _i+ \gamma _i=1 \end{aligned}$$

(7)
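A minimal NumPy sketch of the selection step in Eqs. (5) to (7) is given below. Several details are assumptions rather than the authors' implementation: the three branches are taken to already share the same shape, a ReLU is placed between the two fully connected layers, the second layer is a single matrix whose output is reshaped into three per-branch vectors, and the weights and the names `feature_selection`, `w_reduce`, and `w_expand` are hypothetical.

```python
import numpy as np

def softmax(z, axis):
    z = z - z.max(axis=axis, keepdims=True)   # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def feature_selection(branches, w_reduce, w_expand):
    """branches: three (H, W, C) maps. Returns the fused map V and weights of shape (3, C)."""
    u = np.stack(branches)                    # (3, H, W, C)
    u_bar = u.sum(axis=0)                     # Eq. (5): element-wise sum
    s = u_bar.mean(axis=(0, 1))               # global average pooling -> (C,)
    z = np.maximum(s @ w_reduce, 0.0)         # first FC layer + ReLU (assumed) -> (d,)
    logits = (z @ w_expand).reshape(3, -1)    # second FC layer -> (3, C)
    weights = softmax(logits, axis=0)         # alpha, beta, gamma; Eq. (7) holds per channel
    v = (weights[:, None, None, :] * u).sum(axis=0)   # Eq. (6), channel by channel
    return v, weights

rng = np.random.default_rng(0)
H, W, C, d = 4, 4, 6, 3                       # toy sizes, chosen for illustration
branches = [rng.normal(size=(H, W, C)) for _ in range(3)]
v, weights = feature_selection(branches,
                               rng.normal(size=(C, d)),
                               rng.normal(size=(d, 3 * C)))
print(v.shape, weights.sum(axis=0))
```

Taking the Softmax across the branch axis is what enforces the constraint of Eq. (7) for every channel.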

### Dual-path feature fusion module

Skin melanoma images usually contain complex foreground and background information, such as hair occlusion. In our proposed model, the low-level feature map obtained through the spatial path carries rich detail but also a lot of useless background information, whereas the high-level feature map obtained through the context path contains rich semantic information but lacks detail. However, simple feature fusion in semantic segmentation networks, such as pixel-wise summation and channel concatenation, often ignores the inconsistency between semantic information and detailed features. Therefore, we design a Dual-path Feature Fusion (DFF) module based on the attention mechanism, which pays more attention to the lesion foreground area of the skin melanoma image.

As shown in Fig. 6, \(H\in R^{H\times W\times C}\) represents the high-level semantic feature from the context path, and \(L\in R^{H\times W\times C}\) represents the low-level detailed feature from the spatial path. Firstly, the high-level semantic features and the low-level detailed features are concatenated and convolved, batch normalization (BN) is used to balance the data distribution of the features, and the feature map *x* is obtained through the ReLU activation function, as shown in Eq. (8).

$$\begin{aligned} x=ReLU(BN(Conv(Concat(H,L)))) \end{aligned}$$

(8)

Where, *Conv* denotes the convolution operation, and *Concat* denotes the operation of concatenating the channels of *H* and *L*.

The mathematical expression of *ReLU* function is as follows:

$$\begin{aligned} f(x)=max(0,x) \end{aligned}$$

(9)

*x* represents the input value.

The mathematical expression of *BN* function is as follows:

$$\begin{aligned} BN(x)= \gamma \frac{x-\mu }{\sqrt{\delta ^2 + \epsilon }} + \beta \end{aligned}$$

(10)

Where, *x* represents the input data. \(\mu\) is the mean of the input data over the batch. \(\delta ^2\) is the variance of the input data over the batch. \(\gamma\) is the scaling factor. \(\beta\) is the shifting factor. \(\epsilon\) is a small constant. The purpose of the *BN* function is to ensure that the input to each layer in the network maintains a certain mean and variance, thereby accelerating network training and improving model stability.
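Eq. (10) can be checked numerically with a few lines of NumPy. In this sketch the batch axis is taken to be axis 0, and the default values \(\gamma =1\), \(\beta =0\), and \(\epsilon =10^{-5}\) are assumptions chosen for illustration.

```python
import numpy as np

def batch_norm(x, gamma=1.0, beta=0.0, eps=1e-5):
    """Eq. (10): normalize each feature over the batch axis, then scale and shift."""
    mu = x.mean(axis=0)                      # batch mean
    var = x.var(axis=0)                      # batch variance
    return gamma * (x - mu) / np.sqrt(var + eps) + beta

# Two features on very different scales, three samples in the batch.
x = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
y = batch_norm(x)
print(y.mean(axis=0), y.std(axis=0))
```

After normalization both features have approximately zero mean and unit standard deviation, regardless of their original scale.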

Secondly, as shown in Eq. (11), Global Average Pooling (GAP) is applied to the feature map *x* to extract the global average vector, and the attention weight vector *a* is then obtained by applying the ReLU activation function and the Sigmoid operation in turn.

$$\begin{aligned} a=Sigmoid(ReLU(GAP(x))) \end{aligned}$$

(11)

The mathematical expression of *Sigmoid* function is as follows:

$$\begin{aligned} f(x)=\frac{1}{1+ e^{-x}} \end{aligned}$$

(12)

As shown in Eq. (13), the attention weight vector *a* is used to weight *L*, and the final output \(M\in R^{H\times W\times C}\) is obtained by adding *H* to the weighted feature map.

$$\begin{aligned} M=Attention(a,L)+H \end{aligned}$$

(13)

*Attention*(*a*, *L*) represents the calculation of the Attention mechanism, and its mathematical expression is as follows:

$$\begin{aligned} Attention(a,L)=Sigmoid(MLP(Avgpool(a,L))+MLP(Maxpool(a,L))) \end{aligned}$$

(14)

The DFF module uses high-level features to provide semantic information that guides feature fusion, so that the network can correctly focus on the foreground information of the lesion area in the skin melanoma image, thereby generating more discriminative fusion features and improving segmentation accuracy.
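The fusion steps of Eqs. (8), (11), and (13) can be sketched in NumPy as follows. This is a simplified reading, not the authors' implementation: *Conv* is assumed to be a \(1\times 1\) convolution that maps \(2C\) channels back to \(C\), batch normalization is computed over the spatial positions of a single image with \(\gamma =1\) and \(\beta =0\), and *Attention*(*a*, *L*) is reduced to the channel-wise product of *a* and *L*, following the prose description rather than Eq. (14). The function name `dff` and the random weights are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def dff(high, low, w, eps=1e-5):
    """high, low: (H, W, C) maps; w: (2C, C) weights of an assumed 1x1 convolution."""
    cat = np.concatenate([high, low], axis=-1)           # Concat(H, L) -> (H, W, 2C)
    conv = cat @ w                                       # 1x1 convolution -> (H, W, C)
    mu = conv.mean(axis=(0, 1))
    var = conv.var(axis=(0, 1))
    x = np.maximum((conv - mu) / np.sqrt(var + eps), 0)  # Eq. (8): ReLU(BN(Conv(...)))
    a = sigmoid(np.maximum(x.mean(axis=(0, 1)), 0))      # Eq. (11): Sigmoid(ReLU(GAP(x)))
    m = a * low + high                                   # Eq. (13), with Attention(a, L) = a * L (assumed)
    return m, a

rng = np.random.default_rng(0)
H, W, C = 4, 4, 3                                        # toy sizes, chosen for illustration
high = rng.normal(size=(H, W, C))
low = rng.normal(size=(H, W, C))
m, a = dff(high, low, rng.normal(size=(2 * C, C)))
print(m.shape, a)
```

Under this reading, *x* is non-negative after the ReLU in Eq. (8), so every entry of *a* is at least 0.5.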

### Multivariate loss function

In machine learning and mathematical optimization, the loss function can be regarded as an evaluation of the prediction result. The cross-entropy loss is commonly used to evaluate segmentation quality. However, lesion segmentation suffers from class imbalance between foreground and background and from boundaries that are difficult to separate. The cross-entropy loss does not handle these problems well: the model tends to favour the category with the larger pixel ratio during training, which makes it difficult to train fully and leads to poor segmentation of the smaller category. Therefore, we design a multivariate loss function.

Semantic boundary. For the segmentation of the lesion area, the difficulty lies in the accurate segmentation of the boundary. Traditional edge detection operators can strengthen region boundary segmentation. Thus, in this article, the second-order Laplace edge detection operator is used to obtain the boundary information of the ground-truth label and the predicted semantic segmentation mask.

We can find a discrete convolution kernel \(K_{Laplace}\), which is equivalent to the Laplacian operator, as shown in Eq. (15).

$$\begin{aligned} K_{Laplace} =\left[ \begin{array}{ccc} -1 &{} -1 &{} -1 \\ -1 &{} 8 &{} -1 \\ -1 &{} -1 &{} -1 \end{array}\right] \end{aligned}$$

(15)

As shown in Eq. (16), we use convolution to obtain the Laplace edge detection map \(\nabla T(x,y)\in R^{H\times W\times C}\) from the segmentation ground-truth map \(T(x,y)\in R^{H\times W\times C}\).

$$\begin{aligned} \nabla T(x,y)= Conv(T(x,y),K_{Laplace}) \end{aligned}$$

(16)

$$\begin{aligned} E_t= ReLU(Tanh(\nabla T)) \end{aligned}$$

(17)

To get a clearer map, as shown in Eq. (17), we first use the Tanh function to map the element values of \(\nabla T\) into \([-1,1]\). The ReLU function then truncates the result so that only the positive activations are kept, which finally gives the true segmentation boundary label \(E_t\in R^{H\times W\times C}\), as shown in Fig. 7. After the Dual-path Feature Fusion (DFF) module, we obtain the predicted segmentation map \(P(x,y)\in R^{H\times W\times C}\), which is then processed with Eqs. (16) and (17) to obtain the predicted segmentation boundary map \(E_p\in R^{H\times W\times C}\).
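The boundary extraction described above (convolution with \(K_{Laplace}\), then Tanh, then ReLU) can be illustrated on a toy binary mask. The sketch assumes a single-channel mask and replicates the border pixels when padding so that no artificial edge appears at the image border; the padding choice and the function name `laplace_boundary` are assumptions.

```python
import numpy as np

K_LAPLACE = np.array([[-1, -1, -1],
                      [-1,  8, -1],
                      [-1, -1, -1]], dtype=float)   # kernel of Eq. (15)

def laplace_boundary(mask):
    """Convolve with K_LAPLACE, squash with tanh, keep positive responses."""
    rows, cols = mask.shape
    padded = np.pad(mask.astype(float), 1, mode="edge")   # border replication (assumed)
    grad = np.zeros((rows, cols), dtype=float)
    for i in range(3):
        for j in range(3):
            grad += K_LAPLACE[i, j] * padded[i:i + rows, j:j + cols]
    return np.maximum(np.tanh(grad), 0.0)

# Toy ground truth: a 3x3 lesion inside a 7x7 image.
mask = np.zeros((7, 7))
mask[2:5, 2:5] = 1.0
edges = laplace_boundary(mask)
print(np.round(edges, 3))
```

Only the eight pixels on the inner rim of the square respond. The interior pixel and all background pixels are zero, because their Laplacian response is zero or negative and is removed by the ReLU.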

Boundary-oriented loss function. Because the pixels around the boundary are sparse, the predicted semantic boundary has a high miss rate. To reduce this effect, we define \(W_{Laplace}\) and the boundary-oriented loss function \(L_E\) to help locate the detailed boundary. \(W_{Laplace}\) is the boundary weight, which represents the proportion of boundary pixels relative to the pixels in the segmentation ground-truth label, and its value lies in \([0,1]\), as shown in Eq. (18).

$$\begin{aligned} W_{Laplace}= \sum \limits _{i=1}^{N} E_t^{i} / \sum \limits _{i=1}^{N} T^{i} \end{aligned}$$

(18)

Where \(E_t\) represents the true segmentation boundary map, *T* represents the segmentation ground-truth label, and *N* means the total number of pixels. Thus, the boundary-oriented loss function \(L_E\) is finally formulated as below:

$$\begin{aligned} L_E= - \frac{1}{N} \sum \limits _{i=1}^{N}\left[ W_{Laplace}E_t^{i}log E_p^{i}+ (1- W_{Laplace})(1-E_t^{i})log(1- E_p^{i})\right] \end{aligned}$$

(19)

Where \(E_p\) represents the predicted segmentation boundary map.
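Eqs. (18) and (19) translate directly into NumPy. In the sketch below, Eq. (18) is implemented as written, that is, the sum of \(E_t\) divided by the sum of *T*. The boundary maps are binary for simplicity, and the clipping of \(E_p\) before the logarithm is an assumption added for numerical safety. The function names are hypothetical.

```python
import numpy as np

def boundary_weight(e_t, t):
    """Eq. (18): sum of boundary responses divided by the sum of the ground-truth label."""
    return e_t.sum() / t.sum()

def boundary_loss(e_t, e_p, w, eps=1e-7):
    """Eq. (19): weighted binary cross-entropy between boundary maps."""
    e_p = np.clip(e_p, eps, 1.0 - eps)       # avoid log(0); not part of Eq. (19)
    pos = w * e_t * np.log(e_p)
    neg = (1.0 - w) * (1.0 - e_t) * np.log(1.0 - e_p)
    return -(pos + neg).mean()

# Toy example: 3x3 lesion in a 5x5 image, boundary = rim of the lesion.
t = np.zeros((5, 5))
t[1:4, 1:4] = 1.0
e_t = t.copy()
e_t[2, 2] = 0.0                              # interior pixel is not a boundary pixel

w = boundary_weight(e_t, t)                  # 8 boundary pixels / 9 lesion pixels
loss_good = boundary_loss(e_t, e_t, w)       # prediction equal to the true boundary
loss_bad = boundary_loss(e_t, 1.0 - e_t, w)  # fully inverted prediction
print(w, loss_good, loss_bad)
```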

Multivariate loss function. The multivariate loss function attends to the pixel-level classification accuracy of the entire image and of the boundary at the same time. It also makes model training more stable and helps overcome the imbalance between positive and negative samples, thereby yielding more accurate segmentation results.

The multivariate loss function \(L_{Total}\) is shown in Eq. (20), which consists of \(L_E\) for boundary segmentation, a cross-entropy loss \(L_{BCE}\) in Eq. (21) and a dice coefficient loss \(L_{Dice}\) in Eq. (22) for semantic segmentation.

$$\begin{aligned} L_{Total}=\lambda _1L_E + \lambda _2L_{BCE} + \lambda _3L_{Dice} \end{aligned}$$

(20)

$$\begin{aligned} L_{BCE}=- \frac{1}{N} \sum \limits _{i=1}^{N}\left[ T^{i}log P^{i}+ (1-T^{i})log(1- P^{i})\right] \end{aligned}$$

(21)

$$\begin{aligned} L_{Dice}=1- \frac{\sum _{i=1}^{N} P^{i}T^{i}+\varepsilon }{\sum _{i=1}^{N}(P^{i}+T^{i})+\varepsilon }- \frac{\sum _{i=1}^{N} (1-P^{i})(1-T^{i})+\varepsilon }{\sum _{i=1}^{N}(2-P^{i}-T^{i})+\varepsilon } \end{aligned}$$

(22)

Where \(\lambda _1\), \(\lambda _2\) and \(\lambda _3\) are hyperparameters, which we set to 0.2, 1, and 1, respectively, based on experience. *P* represents the predicted segmentation map and *T* represents the segmentation ground-truth map. \(\varepsilon\) is a settable smoothing constant, which is used to prevent overfitting. In our experiments, we set it to 1.
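The combined loss of Eqs. (20) to (22) can be sketched as below, using the reported settings \(\lambda _1=0.2\), \(\lambda _2=\lambda _3=1\), and \(\varepsilon =1\). The Dice term follows Eq. (22) as written, with each denominator summed over all pixels before \(\varepsilon\) is added. The boundary loss \(L_E\) is passed in as a precomputed number, the clipping inside the cross-entropy is an assumption for numerical safety, and the function names are hypothetical.

```python
import numpy as np

def bce_loss(p, t, eps=1e-7):
    """Eq. (21): binary cross-entropy averaged over all pixels."""
    p = np.clip(p, eps, 1.0 - eps)           # avoid log(0); not part of Eq. (21)
    return -(t * np.log(p) + (1.0 - t) * np.log(1.0 - p)).mean()

def dice_loss(p, t, smooth=1.0):
    """Eq. (22): foreground and background overlap terms with smoothing constant."""
    fg = ((p * t).sum() + smooth) / ((p + t).sum() + smooth)
    bg = (((1.0 - p) * (1.0 - t)).sum() + smooth) / ((2.0 - p - t).sum() + smooth)
    return 1.0 - fg - bg

def total_loss(l_e, p, t, lambdas=(0.2, 1.0, 1.0)):
    """Eq. (20): weighted sum of the boundary, cross-entropy, and Dice losses."""
    lam1, lam2, lam3 = lambdas
    return lam1 * l_e + lam2 * bce_loss(p, t) + lam3 * dice_loss(p, t)

# Toy example: 3x3 lesion in a 5x5 image.
t = np.zeros((5, 5))
t[1:4, 1:4] = 1.0
good, bad = t.copy(), 1.0 - t                # perfect and fully inverted predictions

loss_good = total_loss(0.0, good, t)         # l_e = 0.0 is a placeholder value
loss_bad = total_loss(0.0, bad, t)
print(loss_good, loss_bad)
```

Because both the foreground and the background overlap enter the Dice term, a prediction that misses a small lesion is penalized even when most background pixels are classified correctly.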
