Local query domain creation method in offline phase
Improved fuzzy C-means (IFCM) algorithm
Traditional soft sensor modeling methods based on just-in-time learning (JITL) usually select similar sample points by searching the entire sample dataset, which is cumbersome. When the historical dataset is very large, the search time becomes long and the soft sensor model cannot predict the output in time.
In particular, the sample data of the time-varying, continuous, non-linear biochemical reaction process are not concentrated in their distribution, and the predictive performance of the model is limited when all relevant samples are used to build a single overall model. Therefore, this paper uses the improved fuzzy C-means (IFCM) algorithm to divide the query domain and address the above issues.
Firstly, considering that the traditional fuzzy C-means (FCM) algorithm [10,11,12,13] is sensitive to the initial cluster centers and requires many iterations, this paper proposes an improved FCM (IFCM) algorithm. The initial values of FCM are usually set manually, so the model is prone to falling into local optima. Here, the number of clusters is determined by the “elbow method” [14] to avoid human intervention. In addition, the Mahalanobis distance from every data sample point to the origin is calculated, as shown in Eq. (1).
$${D}_{M}\left(x,y\right)=\sqrt{{\left(x-y\right)}^{T}{\Sigma }^{-1}\left(x-y\right)}$$
(1)
where: \(\Sigma\) denotes the covariance matrix of the multidimensional random variables \(x\) and \(y\).
The core metric of the elbow method is the sum of squared errors (SSE), which is used to represent the clustering error. As the number of clusters \(k\) increases, the sample division will become finer, the degree of aggregation of each cluster will gradually increase, and the SSE will naturally become progressively smaller. Moreover, when \(k\) is less than the actual number of clusters, the decrease in SSE will be significant because an increase in \(k\) will substantially increase the degree of aggregation of each cluster. However, when \(k\) reaches the actual number of clusters, the return on the degree of aggregation obtained by increasing \(k\) will decrease rapidly, so the decline in SSE will decrease sharply and then level off as the value of \(k\) continues to increase. This means that the graph of the relationship between SSE and \(k\) is the shape of an elbow, and the value of k corresponding to this elbow is the actual number of clusters of the data.
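To make the selection concrete, the following minimal sketch (assuming the historical samples are stored in a NumPy array `X` and using the k-means inertia as the SSE) illustrates how an elbow point can be located; it is an illustrative aid rather than the exact implementation used in this work.

```python
# Minimal elbow-method sketch: compute SSE for k = 1..k_max and pick the
# k with maximum curvature. Using KMeans inertia as the SSE is an assumption
# made for illustration.
import numpy as np
from sklearn.cluster import KMeans

def sse_curve(X, k_max=10):
    """SSE (inertia) for each candidate number of clusters k = 1..k_max."""
    sse = []
    for k in range(1, k_max + 1):
        km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
        sse.append(km.inertia_)            # sum of squared errors for this k
    return np.array(sse)

def pick_elbow(sse):
    # Crude heuristic: the elbow is where the SSE curve bends most sharply,
    # i.e. where the second difference of the SSE is largest.
    return int(np.argmax(np.diff(sse, 2))) + 2   # +2 maps the index back to a k value
```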
Secondly, the sample points are ranked by Mahalanobis distance and the ranking is divided into equally sized subsets. In each subset, the middle sample point is selected as an initial cluster center for the FCM algorithm, and the local query domains are then constructed.
Finally, the membership matrix of the FCM algorithm divides the historical sample dataset into a reasonable number of sub-databases, as shown in Fig. 1. Each of these sub-databases is a local query domain for JITL. Creating local query domains narrows the search range of the algorithm. The objective function of FCM is shown in Eq. (2).
$${J}_{m}=\sum_{i=1}^{n}\sum_{j=1}^{c}{u}_{ij}^{m}{\Vert {x}_{i}-{v}_{j}\Vert }^{2},2\le m<\infty$$
(2)
where: \(\Vert {x}_{i}-{v}_{j}\Vert\) is the Euclidean distance from the sample point \(x_i\) to the cluster center \(v_j\); \(u_{ij}\) is the membership function; \(m\left(m>1\right)\) is the fuzzy index, generally taken as \(m = 2\); \(n\) denotes the number of samples and \(c\) denotes the number of cluster centers.
Because the data samples of the actual biochemical reaction process follow a high-dimensional non-linear distribution, the Euclidean distance used in the traditional FCM algorithm, while well suited to spherical cluster structures, has computational disadvantages for high-dimensional data problems such as the biochemical reaction process. Therefore, the IFCM algorithm can cluster the sample data more accurately and consistently than direct use of the FCM algorithm.
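The initialization step described above can be sketched as follows; the sketch assumes the samples are rows of a NumPy array `X` and only produces the initial cluster centers that would then seed the FCM iterations.

```python
# Minimal sketch of the IFCM initialization: rank samples by Mahalanobis
# distance to the origin (Eq. (1)), split the ranking into c equal subsets,
# and take the middle sample of each subset as an initial FCM center.
import numpy as np

def ifcm_initial_centers(X, c):
    cov_inv = np.linalg.pinv(np.cov(X, rowvar=False))   # pseudo-inverse guards against a singular covariance
    # Mahalanobis distance of every sample to the origin (data assumed standardized)
    d_m = np.sqrt(np.einsum('ij,jk,ik->i', X, cov_inv, X))
    order = np.argsort(d_m)                              # rank samples by distance
    subsets = np.array_split(order, c)                   # c roughly equal partitions
    centers = np.array([X[s[len(s) // 2]] for s in subsets])  # middle sample of each subset
    return centers                                       # used to seed the FCM iterations
```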
Adaptive pruning database
In the actual biochemical reaction process, the accumulation of sample data in the sub-databases over time can seriously affect the response speed of the JITL model. To address this problem, this paper proposes an adaptive data pruning mechanism that updates the sample data in the database automatically. The specific mechanism is as follows:
A similarity query label \(\gamma_i\) is created for each sample in every sub-database, together with a minimum threshold \({\eta }_{\mathrm{min}}\) and a maximum threshold \({\eta }_{\mathrm{max}}\) on the number of similarity queries; \(\gamma_i\) is incremented whenever the sample \({x}_{i}\) in the sub-database is involved in just-in-time learning. When a sample with \({\gamma }_{i} = {\eta }_{\mathrm{max}}\) exists in the sub-database, the current database automatically deletes all samples with \({\gamma }_{i} \le {\eta }_{\mathrm{min}}\), adds all previously predicted results and the corresponding auxiliary variables to the current database, and then re-initializes the values of \({\gamma }_{i}\), as shown in Fig. 2.
The adaptive data pruning mechanism dynamically maintains the amount of data in each biochemical reaction process sub-database so that it continues to meet the requirements of the JITL strategy.
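A minimal sketch of this pruning mechanism is given below; the threshold values, the sub-database layout, and the helper names are illustrative assumptions rather than the exact implementation.

```python
# Sketch of a sub-database with similarity query labels (gamma) and the
# eta_min / eta_max pruning rule described above.
import numpy as np

class SubDatabase:
    def __init__(self, X, y, eta_min=2, eta_max=50):      # thresholds are illustrative
        self.X, self.y = X, y
        self.gamma = np.zeros(len(X), dtype=int)           # similarity query label per sample
        self.eta_min, self.eta_max = eta_min, eta_max
        self.pending_X, self.pending_y = [], []            # predictions awaiting insertion

    def mark_used(self, idx):
        """Increment gamma for the samples selected by a just-in-time query."""
        self.gamma[idx] += 1
        if self.gamma.max() >= self.eta_max:
            self._prune()

    def store_prediction(self, x_new, y_pred):
        self.pending_X.append(np.asarray(x_new))
        self.pending_y.append(y_pred)

    def _prune(self):
        keep = self.gamma > self.eta_min                   # drop rarely queried samples
        kept_X, kept_y = self.X[keep], self.y[keep]
        if self.pending_X:                                 # add previously predicted samples
            self.X = np.vstack([kept_X] + self.pending_X)
            self.y = np.concatenate([kept_y, np.array(self.pending_y)])
        else:
            self.X, self.y = kept_X, kept_y
        self.pending_X, self.pending_y = [], []
        self.gamma = np.zeros(len(self.X), dtype=int)      # re-initialize the labels
```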
Dynamic filtering of query domain auxiliary variables
Among the many auxiliary variables measured online during the biochemical reaction process, some correlate poorly with the dominant variables, and too many input variables increase the complexity of the model and reduce its response speed. In addition, as the process characteristics and various parameters change across the different stages of the biochemical reaction process, the auxiliary variables that represent its dynamic characteristics change accordingly. This paper therefore uses the K-nearest neighbor mutual information estimation (K-MI) method to screen the auxiliary variables dynamically in real time and improve the predictive performance of the soft sensor model.
Mutual information was first proposed by Shannon [15]. Intuitively, mutual information measures the interrelationship between two random variables. However, the probability distribution of each variable is unknown in an actual soft sensor implementation, which makes the direct calculation of mutual information difficult. Therefore, the mutual information between variables is estimated with the K-nearest neighbor mutual information estimation (K-MI) method [16,17,18]. The estimate of the mutual information \(I\left(x,y\right)\) is:
$$I\left(x,y\right)=\Psi \left(k\right)-1/k-\langle \Psi \left({n}_{x}\right)+\Psi \left({n}_{y}\right)\rangle +\Psi (N)$$
(3)
where: \(\Psi \left( k \right)\) is the digamma function, \(\Psi \left( x \right) = {\Gamma }^{-1}\left( x \right)d\Gamma \left( x \right)/dx\); \(k\) is generally taken between 2 and 6 and is set to 4 in this paper; \(N\) is the number of samples; \(\langle\cdots\rangle\) means that the values of the digamma function are averaged over all samples, i.e., \(\langle \cdots \rangle ={N}^{-1}\sum\nolimits_{i=1}^{N}E\left[\cdots (i)\right]\).
Using the K-MI algorithm to screen the auxiliary variables not only reduces the model’s complexity and improves its response time but also benefits its predictive performance.
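As an illustration, the screening step can be sketched with scikit-learn's kNN-based mutual information estimator, using k = 4 neighbours as stated above; the retention ratio is an assumption made for the example.

```python
# Sketch of K-MI auxiliary-variable screening: rank variables by estimated
# mutual information with the dominant variable y and keep the strongest ones.
import numpy as np
from sklearn.feature_selection import mutual_info_regression

def select_auxiliary_variables(X, y, k=4, keep_ratio=0.6):
    """X: (n_samples, n_aux_vars) array, y: dominant variable measurements."""
    mi = mutual_info_regression(X, y, n_neighbors=k, random_state=0)  # kNN-based MI estimate
    n_keep = max(1, int(np.ceil(keep_ratio * X.shape[1])))            # keep_ratio is illustrative
    selected = np.argsort(mi)[::-1][:n_keep]                          # indices of retained variables
    return np.sort(selected), mi
```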
JITL strategy in online modeling phase
JITL principles
JITL is an online local modeling method that builds a historical database by collecting a large number of samples offline. When a prediction sample arrives, the model first searches the historical database for samples similar to it, then uses these similar data to build a local model, and finally predicts the output. As soon as the prediction result is output, the local model is discarded and the system waits for the next measurement sample to arrive. The JITL strategy is therefore better suited to the biochemical reaction process than a traditional offline global model. The comparison between traditional modeling methods and the JITL modeling framework is shown in Fig. 3.
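The JITL cycle can be summarized by the following schematic sketch, in which the similarity function, the local model builder, and the database object are placeholders rather than the concrete components of this work.

```python
# Schematic JITL cycle: search for similar samples, build a throwaway local
# model, predict, then discard the model.
def jitl_predict(x_query, database, similarity, build_local_model, n_local=50):
    # 1. search the historical database for the samples most similar to x_query
    weights = similarity(database.X, x_query)
    idx = weights.argsort()[::-1][:n_local]
    # 2. build a temporary local model from those samples only
    model = build_local_model(database.X[idx], database.y[idx], weights[idx])
    # 3. predict, then the caller discards the model and waits for the next query
    return model.predict(x_query.reshape(1, -1))
```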
JITL method based on multiple similarity metrics
In JITL, a single similarity metric cannot accurately portray the relationship between input and output, which leads to poor model generalization. Several weighting functions are available for assessing similarity, such as truncated and Gaussian functions; however, it has been pointed out that the choice of weighting function influences the modeling performance less than the choice of similarity measure [19]. Hence, the definition of similarity plays a significant role in the success of the JITL modeling framework. Based on this, this paper uses multiple similarity metrics to assess the similarity between samples, which increases model diversity and enhances model robustness. Euclidean distance (ED), covariance weighted distance (CWD), and a similarity metric based on distance and angle are used to select suitable sample sets.
(1) ED Similarity. This metric is defined based on the Euclidean distance between two data points.
$${\omega }_{i} = {e}^{\left(-{d}_{i}^{2}/{\varphi }_{1}{\sigma }_{d}\right)}$$
(4)
$${d}_{i}=\sqrt{{\left({x}_{i}-{x}_{q}\right)}^{T}\left({x}_{i}-{x}_{q}\right)}$$
(5)
where: \(d_i\), given in Eq. (5), is the Euclidean distance between the query sample \(x_q\) and the historical sample \(x_i\); in Eq. (4), \(\sigma_d\) is the standard deviation of the distance vector and \({\varphi }_{1}\) is a local adjustment parameter.
(2) CWD Similarity. This metric considers the relationships among the input variables and between the input and output variables.
$${d}_{i}=\sqrt{{\left({x}_{i}-{x}_{q}\right)}^{T}H\left({x}_{i}-{x}_{q}\right)}$$
(6)
$$H={\left({X}^{T}y\right)}^{T}\left({X}^{T}y\right)/{\Vert {X}^{T}y\Vert }^{2}$$
(7)
where: \(H\) is the weighting matrix, and \(X\) and \(y\) are the input and output matrices, respectively.
(3) Similarity metric based on distance and angle. This metric uses the angle between two sample vectors in the data space to measure the degree of similarity between samples.
$$\mathrm{cos}\left({\theta }_{i}\right)=\langle {x}_{i},{x}_{q}\rangle / \left({\Vert {x}_{i}\Vert }_{2}{\Vert {x}_{q}\Vert }_{2}\right)$$
(8)
$${\omega }_{i} = \lambda \sqrt{{e}^{\left(-{d}_{i}^{2}/{\varphi }_{2}{\sigma }_{d}\right)}}+ \left(1-\lambda \right)\mathrm{ cos}\left({\theta }_{i}\right),\mathrm{ cos}\left({\theta }_{i}\right)\ge 0$$
(9)
where: \({d}_{i}\) and \(\mathrm{cos}\left({\theta }_{i}\right)\) denote the distance and the angular similarity between the query sample and the historical sample, respectively; \(\lambda\) is a weighting coefficient that balances the two terms, and \({\varphi }_{2}\) is a local adjustment parameter.
In this paper, three local models are constructed using three similarity metrics to filter the queried domain and generate diverse local state identification results.
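The three similarity measures can be sketched as follows; the adjustment parameters \({\varphi }_{1}\), \({\varphi }_{2}\) and the weighting coefficient \(\lambda\) are set to illustrative values, the exponential weighting of Eq. (4) is assumed to apply to the CWD distance as well, and Eq. (7) is interpreted as an outer product so that \(H\) is a square weighting matrix.

```python
# Sketch of the three similarity weights of Eqs. (4)-(9).
import numpy as np

def ed_similarity(X, x_q, phi1=1.0):
    d = np.linalg.norm(X - x_q, axis=1)                      # Eq. (5)
    return np.exp(-d**2 / (phi1 * d.std()))                  # Eq. (4)

def cwd_similarity(X, y, x_q, phi1=1.0):
    h = X.T @ y
    H = np.outer(h, h) / (np.linalg.norm(h) ** 2)            # Eq. (7), read as an outer product
    diff = X - x_q
    d = np.sqrt(np.einsum('ij,jk,ik->i', diff, H, diff))     # Eq. (6)
    return np.exp(-d**2 / (phi1 * d.std()))                  # same weighting form as Eq. (4), assumed

def dist_angle_similarity(X, x_q, phi2=1.0, lam=0.5):
    d = np.linalg.norm(X - x_q, axis=1)
    cos_t = (X @ x_q) / (np.linalg.norm(X, axis=1) * np.linalg.norm(x_q))  # Eq. (8)
    w = lam * np.sqrt(np.exp(-d**2 / (phi2 * d.std()))) + (1 - lam) * cos_t  # Eq. (9)
    return np.where(cos_t >= 0, w, 0.0)                      # samples with cos < 0 are excluded (assumption)
```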
Local XGBoost model construction
In the online modeling stage, the XGBoost algorithm is chosen as the base learner for the soft sensor because of its stability and speed. XGBoost is implemented in a gradient boosting framework, in which base learners are built sequentially during boosting, each one learning from the previous base learner and updating the residuals. By analyzing the base learners’ residuals and updating the sample weights at each iteration, a strong learner is eventually formed, as shown in Fig. 4.
First, define a decision tree whose output function is shown in Eq. (10).
$$f\left(x\right)={\omega }_{q\left(x\right)},\ \omega \in {R}^{T},\ q:{R}^{d}\to \left\{1, 2,\dots ,T\right\}$$
(10)
where: \(x\) is the input vector; \(q\) is the structure of the tree, which maps a sample to a leaf index; \(\omega\) is the corresponding vector of leaf scores; \(T\) is the number of leaf nodes in the tree; and \(d\) is the dimensionality of the data features. Then, assuming that the leaf score of sample \(i\) in the \(j\)th decision tree is \(\omega_{ij}\), the output of this sample after \(t\) decision tree iterations is given by Eq. (11). The objective function of the XGBoost algorithm is shown in Eq. (12).
$${\widehat{y}}_{i}^{\left(t\right)}=\sum_{j=1}^{t}{f}_{j}\left({x}_{i}\right)=\sum_{j=1}^{t}{\omega }_{ij}$$
(11)
$${Obj}^{\left(t\right)}=\sum_{i=1}^{N}L\left({y}_{i},{\widehat{y}}_{i}^{\left(t\right)}\right) +\sum_{j=1}^{t}\Omega \left({f}_{j}\right)$$
(12)
where: \(\sum_{i=1}^{N}L\left({y}_{i},{\widehat{y}}_{i}^{\left(t\right)}\right)\) is the loss function, which represents the sum of the errors between the true values \(y_i\) and the predicted values \({\widehat{y}}_{i}^{\left(t\right)}\).
If the XGBoost algorithm did not constrain the number of nodes, the tree structure would split maximally and the XGBoost model would overfit. Therefore, a regularization term \(\Omega \left({f}_{j}\right)\) is added to the objective function to prevent overfitting; the penalty terms for a single decision tree are shown in Eq. (13).
$$\Omega \left({f}_{t}\right)=\gamma T +\frac{1}{2} \lambda \sum_{j=1}^{T}{\omega }_{j}^{2}$$
(13)
where: \(T\) is the number of leaf nodes; \(\omega_j\) is the score of the \(j\)th leaf node; \(\gamma\) and \(\lambda\) are hyperparameters that control the generalization error and prevent overfitting.
It should be noted that the XGBoost algorithm applies a second-order Taylor expansion to the loss function, which not only improves the accuracy of the model but also allows the gradient to converge faster, much as Newton’s method converges faster than first-order gradient descent. After simplification, the objective function at iteration \(t\) is as follows:
$${Obj}^{\left(t\right)} =\gamma T + \sum_{j=1}^{T}\left({\omega }_{j}{G}_{j}+ \frac{1}{2} {\omega }_{j}^{2} \left(\lambda +{H}_{j}\right)\right)$$
(14)
where: \({G}_{j}=\sum_{i\in {I}_{j}}{g}_{i}\) and \({H}_{j}=\sum_{i\in {I}_{j}}{h}_{i}\), in which \({g}_{i}\) and \({h}_{i}\) are the first- and second-order derivatives of the loss function with respect to the prediction, and \({I}_{j}\) is the set of samples assigned to leaf \(j\).
To find the optimal solution of the objective function, \({Obj}^{(t)}\) must be minimized, i.e., the value \({\omega }_{j}^{*}\) of \({\omega }_{j}\) that minimizes it must be found. The optimal leaf score is defined by Eq. (15), and the resulting optimal value of the objective function is shown in Eq. (16).
$${\omega }_{j}^{*}=\mathrm{arg\,min}\left({\omega }_{j}{G}_{j} +\frac{1}{2}{\omega }_{j}^{2}\left(\lambda +{H}_{j}\right)\right)$$
(15)
$${{Obj}^{\left(t\right)}}^{*}=\gamma T-\frac{1}{2}\sum_{j=1}^{T}\frac{{G}_{j}^{2}}{\lambda +{H}_{j}}$$
(16)
In addition, the XGBoost model is engineered to support parallelized training, and the problem of being unable to load all feature values into local memory for distributed datasets is solved by an approximate histogram algorithm. At the same time, XGBoost’s cache-aware access and block-based out-of-core computation efficiently increase the system’s resource utilization. These engineering optimizations significantly improve the speed of XGBoost modeling and make the XGBoost model particularly suitable as a base learner for the JITL strategy compared with other models.
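As an illustration, training one similarity-weighted local XGBoost model for a query sample can be sketched as follows; the hyperparameter values are placeholders, since in this work they are tuned by the improved sparrow search algorithm described in the next subsection.

```python
# Sketch of a similarity-weighted local XGBoost regressor; parameter values
# are illustrative defaults, not the tuned values used in the paper.
import xgboost as xgb

def train_local_xgb(X_local, y_local, weights, params=None):
    params = params or {"learning_rate": 0.1, "max_depth": 4,
                        "n_estimators": 100, "reg_lambda": 1.0}
    model = xgb.XGBRegressor(objective="reg:squarederror", **params)
    # similarity weights emphasise the samples closest to the query point
    model.fit(X_local, y_local, sample_weight=weights)
    return model
```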
Improved sparrow algorithm
In the modeling process, the accuracy and robustness of a freely growing XGBoost model are easily affected by its parameters, and allowing the XGBoost model to grow freely results in an over-fitted model. In addition, although a freely growing XGBoost model can improve prediction accuracy, it significantly reduces computational efficiency and increases system lag, making it unsuitable for an online modeling strategy. Therefore, parameters such as the learning rate, the maximum number of iterations, and the maximum tree depth of XGBoost need to be optimized to balance all aspects of the XGBoost model’s performance, i.e., to improve the convergence speed of the model without losing prediction accuracy. Among the various parameter optimization algorithms, the Sparrow Search Algorithm (SSA) has been widely used. It is a new intelligent optimization algorithm that mainly simulates the foraging and anti-predation behavior of a sparrow flock [20], and its foraging model consists of discoverers, followers, and early warners. The specific search process is as follows:
(1) The mathematical expression for the iterative update of the discoverer position is shown in Eq. (17).
$${x}_{i,d}^{t+1}=\left\{\begin{array}{ll}{x}_{i,d}^{t}\cdot \mathrm{exp}\left(\frac{-i}{\alpha \cdot {g}_{max}}\right), & r<\beta \\ {x}_{i,d}^{t}+q\cdot l, & r\ge \beta \end{array}\right.$$
(17)
(2) The mathematical expression for the iterative update of the follower position is shown in Eq. (18).
$${x}_{i,d}^{t+1}=\left\{\begin{array}{ll}q\cdot \mathrm{exp}\left(\frac{{x}_{worst}^{t}-{x}_{i,d}^{t}}{{t}^{2}}\right), & i>\frac{n}{2}\\ {x}_{p}^{t+1}+\left|{x}_{i,d}^{t}-{x}_{p}^{t+1}\right|\cdot {a}^{+}\cdot l, & i\le \frac{n}{2}\end{array}\right.$$
(18)
(3) The mathematical expression for the anti-predation behavior of an early warner when it becomes aware of danger is shown in Eq. (19).
$${x}_{i,d}^{t+1}=\left\{\begin{array}{c}{x}_{best}^{t+1} + \rho \cdot \left|{x}_{i,d}^{t}-{x}_{best}^{t}\right|,{f}_{i} > {f}_{g}\\ {x}_{i,d}^{t}+k\cdot \left(\frac{\left|{x}_{i,d}^{t}-{x}_{best}^{t}\right|}{\left({f}_{i}-{f}_{\omega }\right)+\varepsilon }\right), {f}_{i} ={f}_{g}\end{array}\right.$$
(19)
It is worth noting that the sparrow population requires an extensive optimization search in the early iterations, while its diversity decreases in the late iterations, leading to premature convergence and a tendency to fall into local extremes. To address this problem, this paper proposes a hybrid mutation optimization strategy, forming an improved SSA (ISSA): the standard Cauchy distribution and standard Gaussian distribution are used to enhance the diversity of the sparrow population so that the followers have a stronger ability to jump out of local optima.
The hybrid mutation strategy introduces dynamic mutation parameters \({\lambda }_{1}\) and \({\lambda }_{2}\) that depend on the number of iterations.
$${x}_{i,d}^{{t+1}{\prime}}= {x}_{best}^{t+1}\left[1+{\lambda }_{1} Cauchy\left(\mathrm{0,1}\right) +{\lambda }_{2} Gauss\left(\mathrm{0,1}\right)\right]$$
(20)
$${\lambda }_{1} =1-\frac{{t}^{2}}{{T}^{2}}$$
(21)
$${\lambda }_{2} =\frac{{t}^{2}}{{T}^{2}}$$
(22)
where: \(t\) is the current number of iterations; \(T\) is the maximum number of iterations; and the standard Gaussian distribution function and standard Cauchy distribution function are shown below:
$$f\left(x\right)=\frac{1}{\sqrt{2\pi }}\mathrm{exp}\left(-\frac{{x}^{2}}{2}\right) -\infty <x <+\infty$$
(23)
$$f\left(x\right)=\frac{1}{\pi \left(1+{x}^{2}\right)} -\infty <x <+\infty$$
(24)
In the hybrid mutation strategy, a new position is generated after each iteration from the follower position of the current iteration, and the fitness values of the two positions are compared. During the iterative process, parameter \({\lambda }_{1}\) gradually decreases and parameter \({\lambda }_{2}\) gradually increases, which enhances the algorithm’s ability to jump out of local extremes and to search globally. This paper uses the ISSA algorithm to optimize the XGBoost model, resulting in superior robustness and predictive power.
The structure of the ISSA algorithm for optimizing the XGBoost model is shown in Fig. 5.
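The hybrid mutation step of Eqs. (20)-(22) can be sketched as follows; the fitness function and the acceptance rule are placeholders for the ISSA-XGBoost tuning loop.

```python
# Sketch of the hybrid Cauchy/Gaussian mutation of Eqs. (20)-(22).
import numpy as np

rng = np.random.default_rng(0)

def hybrid_mutation(x_best, t, T):
    lam1 = 1.0 - (t ** 2) / (T ** 2)          # Eq. (21): shrinks as iterations progress
    lam2 = (t ** 2) / (T ** 2)                # Eq. (22): grows as iterations progress
    perturb = (lam1 * rng.standard_cauchy(x_best.shape)
               + lam2 * rng.standard_normal(x_best.shape))
    return x_best * (1.0 + perturb)           # Eq. (20)

def accept_if_better(x_old, x_new, fitness):
    # keep the mutated position only when it improves the (minimised) fitness value
    return x_new if fitness(x_new) < fitness(x_old) else x_old
```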
Model stacking strategy based on multilayer perceptron
Since JITL modeling with multiple similarity metrics establishes several primary XGBoost learners, these models need to be further integrated. Currently, most multi-model fusion approaches use a weighting scheme in the integration strategy and determine the model weights by cross-validation. However, cross-validation does not guarantee that the best model is selected in terms of the actual generalization performance on the test set [21]. To enhance the generalization performance of the whole soft sensor model, this paper uses a model stacking strategy to improve its prediction performance. At the same time, to prevent overfitting, a weakly fitted multilayer perceptron (MLP) is used as the second-layer meta-learner; the structure of the MLP is shown in Fig. 6.
As shown in Fig. 6, a multilayer perceptron with a feed-forward structure is constructed, and the complexity of the model is balanced by adjusting the number of hidden layers and neurons. When the model overfits, its generalization ability can be increased by reducing the number of hidden layers and neurons in the MLP; conversely, when the model underfits, its complexity can be increased by adding hidden layers and neurons.
In addition, in most stacking strategies the training set for the secondary learner is also obtained by k-fold cross-validation. However, for the freely growing XGBoost model, k-fold cross-validation does not substantially improve the generalization of the meta-model, carries a risk of data leakage, and requires several experimental simulations to find a suitable number of folds, which greatly increases the model construction time. Therefore, this paper adopts a new strategy to optimize the stacking model, replacing the k-fold cross-validation scheme by pre-separating the dataset. Firstly, the dataset is obtained through the multiple similarity measures, and the similar datasets of each model are arranged according to similarity; then, a portion of the dataset is extracted by uniform sampling, which captures the characteristics of the data to a greater degree; finally, the separated dataset is used as the training set of the meta-learner, as shown in Fig. 7. This optimized scheme is better adapted to the JITL strategy than the original cross-validation (CV) scheme in stacking: it largely prevents the reuse of data, reduces the risk of information leakage, and makes the system more responsive.
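A minimal sketch of this stacking scheme is given below; the layer sizes and the split are illustrative assumptions, with the three similarity-specific local models forming the first layer and a small MLP acting as the weakly fitted meta-learner trained on the pre-separated subset.

```python
# Sketch of stacking with a weakly fitted MLP meta-learner trained on the
# pre-separated (uniformly sampled) data rather than on k-fold CV predictions.
import numpy as np
from sklearn.neural_network import MLPRegressor

def fit_stacked_model(local_models, X_meta, y_meta):
    """local_models: already-fitted first-layer models (one per similarity metric);
    X_meta, y_meta: the pre-separated meta training set."""
    # first-layer prediction matrix: one column per similarity-specific model
    Z = np.column_stack([m.predict(X_meta) for m in local_models])
    meta = MLPRegressor(hidden_layer_sizes=(8,), max_iter=2000, random_state=0)
    meta.fit(Z, y_meta)                      # small hidden layer keeps the fit weak
    return meta

def stacked_predict(local_models, meta, X_query):
    Z = np.column_stack([m.predict(X_query) for m in local_models])
    return meta.predict(Z)
```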
Modeling process
The flow of the modeling method proposed in this paper is shown in Fig. 8.
To better illustrate the process of online soft sensor modeling in this paper, the modeling process is described as follows:
Step 1: In the offline stage, multiple query domains are divided using the IFCM algorithm, and the main auxiliary variables in each query domain are determined separately using the K-MI algorithm.
Step 2: In the online prediction stage, when the query data arrive, the Kullback-Leibler (KL) divergence is used to determine the sub-database (query domain) to which they belong.
Step 3: The multiple similarity measures are used to extract the data in the query domain, and the extracted data are sorted and segmented. The data remaining after segmentation are fed into the ISSA-XGBoost local algorithms for model training, respectively.
Step 4: The separated dataset is also fed into the ISSA-XGBoost models, and the prediction matrix produced by the first layer of models is then fed into the MLP algorithm for meta-model training.
Step 5: The query data are sent to the JS-ISSA-XGBoost model for prediction, and the output results are stored in the storage database to await the sub-database update.
Step 6: Once the query data have obtained the current prediction result through the JS-ISSA-XGBoost model, the query data and the JS-ISSA-XGBoost model are released, and the system waits for the arrival of the next query data (a condensed sketch of this online cycle is given below).
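To tie the steps together, the online cycle (Steps 2 to 6) can be condensed into the following schematic sketch; every helper it calls (the KL-divergence lookup, the similarity-based split, the ISSA tuning routine, and the stacking function) is a hypothetical stand-in for the components sketched earlier, not the authors' exact code.

```python
# Schematic online JS-ISSA-XGBoost cycle; all helpers are illustrative placeholders.
import numpy as np

def online_cycle(x_query, sub_databases, similarities, issa_tune, fit_stacked_model):
    # Step 2: choose the sub-database (local query domain) via KL divergence
    db = min(sub_databases, key=lambda d: d.kl_divergence(x_query))          # hypothetical helper
    # Step 3: one ISSA-tuned local XGBoost model per similarity measure
    local_models, meta_split = [], None
    for sim in similarities:
        X_sim, y_sim, X_meta, y_meta = db.split_by_similarity(x_query, sim)  # hypothetical helper
        local_models.append(issa_tune(X_sim, y_sim))
        meta_split = (X_meta, y_meta)        # pre-separated meta training data
    # Step 4: train the MLP meta-learner on the pre-separated data
    meta = fit_stacked_model(local_models, *meta_split)
    # Steps 5-6: predict, store the result for the next database update, discard the models
    z = np.column_stack([m.predict(x_query.reshape(1, -1)) for m in local_models])
    y_pred = float(meta.predict(z)[0])
    db.store_prediction(x_query, y_pred)
    return y_pred
```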