
Analytical reference framework to analyze non-COVID-19 events | Population Health Metrics

The framework proposed and implemented in this work, named “Analytics for Non-COVID-19 Events” (hereafter ANE), adapts the ASUM-DM (Analytics Solutions Unified Method for Data Mining) [8] methodology to handle the specific characteristics of health events and their underlying data.

Figure 1 shows the eight consecutive stages that compose the framework, which starts with understanding the business and proceeds through collecting and processing the input data, building the predictive models, and providing feedback and visualization of the results, among other tasks. Each stage and its implementation in our case study are described in the following subsections (Data analysis, Model generation, and Visual analytics).

Fig. 1

Data analysis

According to the Business understanding stage, the main purpose of this project is to provide a framework that exploits data on the number of cases reported in Colombia for specific non-COVID-19 health events in order to measure the impact of the coronavirus on those events. From an initial list of four events, we selected tuberculosis and suicide attempts for our case study, following the recommendation of different actors in the sector. For these events, the data were collected from the National Public Health Surveillance System (SIVIGILA, from its acronym in Spanish).

The Analytical approach stage consists of reviewing the literature and selecting the techniques used to address the objective of this project, i.e., the supervised learning algorithms (specifically, linear regressors and time series methods) that forecast the number of cases of a specific disease using historical open data from the Colombian health information system from 2015 to 2020. The information collected from SIVIGILA during the Data collection stage is registered according to the International Classification of Diseases (ICD) format. For each disease, in collaboration with the National Health Institute (INS, from its acronym in Spanish), we selected the codes that make it possible to obtain the number of cases reported per week from the national level down to the municipality level, including departments and their capitals. In cases such as tuberculosis, the datasets contain several codes for specific types of the disease, and the choice was made using the most frequent type in the Colombian population as the criterion. For this reason, “pulmonary tuberculosis” was chosen for the analysis.

Additionally, to compare results among geographical regions, population density and the codification of Colombia’s geographic division were taken from the National Administrative Department of Statistics (DANE—from its acronym in Spanish).

In the Data understanding stage, the data are inspected for the construction of the analytical models. Multivariate descriptive analysis and statistical measures (such as mean, standard deviation, minimum, maximum, and coefficient of variation) make it possible to define and validate the granularity of the data used in the ANE framework. For the geographic units (entities), data at the national, departmental, and capital levels are used. For the time unit, epidemiological periods are used instead of months (i.e., 4 consecutive weeks, giving 13 epidemiological periods per year). Finally, only data from 2017 to 2020 (which are statistically relevant for the analysis) are used. These decisions were validated by a multidisciplinary team (physicians, engineers, and managers, among others).
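The change of time unit can be illustrated with a minimal sketch. It assumes weekly counts keyed by (year, epidemiological week) and folds an occasional week 53 into period 13; both the data layout and the week-53 rule are assumptions made for illustration, not details given in the paper.

```python
from collections import defaultdict


def epidemiological_period(week: int) -> int:
    """Map an epidemiological week (1-53) to its 4-week period (1-13)."""
    if not 1 <= week <= 53:
        raise ValueError(f"week out of range: {week}")
    # Weeks 1-4 -> period 1, weeks 5-8 -> period 2, ..., weeks 49-52 -> period 13.
    # Assumption: a week 53, when present, is counted in period 13.
    return min((week - 1) // 4 + 1, 13)


def aggregate_by_period(weekly_cases: dict) -> dict:
    """Sum weekly counts {(year, week): cases} into {(year, period): cases}."""
    totals = defaultdict(int)
    for (year, week), cases in weekly_cases.items():
        totals[(year, epidemiological_period(week))] += cases
    return dict(totals)
```

Working with 13 equal-length periods keeps every time unit at exactly four weeks, which calendar months do not.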

Model generation

With this in mind, during the Data preparation stage, a common file structure was used as input for the forecast models. Figure 2 shows the file structure using tuberculosis data as an example. The file is organized in seven columns: date, entity, year, month, week, period (epidemiological period), and value. Entities correspond to a department or capital, and value corresponds to the number of reported cases for each entity. Regarding data quality, the analysis revealed missing values, duplicate values, and standardization problems, which expose inconsistencies in the weekly records of several municipalities. Duplicated records were removed, and missing data (less than 5%) for a geographic unit were filled in with the median of the variable over the time units with data. The standardization problem is mostly due to inconsistent spelling of the same geographic unit, for example, “Narino” instead of “Nariño”. To avoid this, the code of the administrative division of Colombia (Divipola) provided by DANE is used as the entity key.
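The cleaning rules described above can be sketched as follows. The record layout (a tuple of Divipola code, year, period, and value, with None marking a missing count) and the function name are hypothetical, and duplicates are assumed to be exact copies of a record; the paper does not specify these details.

```python
from statistics import median


def clean_records(records):
    """Drop exact duplicates, then fill missing values with the per-entity median.

    Each record is a (divipola_code, year, period, value) tuple; value is None
    when the count is missing. This layout is assumed for illustration only.
    """
    # dict.fromkeys keeps the first occurrence of each record, in order.
    unique = list(dict.fromkeys(records))

    # Group observed values by Divipola code rather than by entity name,
    # so spelling variants such as "Narino"/"Nariño" cannot split an entity.
    observed = {}
    for code, _year, _period, value in unique:
        if value is not None:
            observed.setdefault(code, []).append(value)

    cleaned = []
    for code, year, period, value in unique:
        if value is None and observed.get(code):
            # Median over the time units that do have data for this entity.
            value = median(observed[code])
        cleaned.append((code, year, period, value))
    return cleaned
```

An entity with no observed values at all is left untouched in this sketch, since there is no median to impute from.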

Fig. 2. File structure of event data about the national, department, and capital behavior for tuberculosis

The Modeling stage includes the model selection and the analytic model construction. The main objective in this stage is to create a model to accurately estimate the number of cases v of a specific event x for n periods. The equation is given below:

$$v = \mathrm{forecast}\left( {x,n} \right)$$


The resulting model estimates the value of v that would have occurred if the COVID-19 pandemic had not happened (referred to as “without COVID-19”): it is trained on data up to 2019 and predicts the 2020 periods.

We tested three methods to select the model class: polynomial regression (PR), exponential smoothing (ES) with the Holt-Winters (or triple exponential smoothing) method, and the SARIMA method (the last two methods were selected according to the literature recommendations). The PR method served as the baseline model because of its simplicity and its ability to reproduce the trend of the data. The Holt-Winters method is a time series forecasting method that considers the trend and seasonality of the series. The SARIMA model belongs to the Box–Jenkins family and is based on ARIMA (AutoRegressive Integrated Moving Average) models, which use the correlation between data points; like the Holt-Winters method, it includes a seasonal component.

Thus, the approach used to construct the forecasting models is based on time series, in line with the data characteristics and literature recommendations (in fact, the SARIMA model shows the best results in the literature review). For all model types, statistical tests were used to validate compliance with the assumptions about the behavior and nature of the data series over time.

The type of model was selected using the tuberculosis data and evaluated with the mean absolute percentage error (MAPE) and the tracking signal, measured as the number of periods in which the forecast stays within ± 3.2 standard deviations. For each method, the hyperparameters were adjusted using a minimum threshold of 4.0 for the MAPE (to avoid overfitting the models) and a minimum of 1 available tolerance period for the tracking signal. The Holt-Winters (triple exponential smoothing) method requires adjusting the alpha, beta, gamma, phi, trend, damped, seasonal, seasonal periods, and boxcox hyperparameters. The SARIMA models, on the other hand, have seven hyperparameters (p, d, q, P, D, Q, and m) that must also be adjusted to obtain the best model. Hyperparameter fitting for both the Holt-Winters and SARIMA models uses an exhaustive search that optimizes the performance measures (the MAPE and the number of periods of the tracking signal). The best results were obtained with the SARIMA approach, as shown in Table 1.
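The idea of an exhaustive hyperparameter search can be sketched in a few lines. To stay self-contained, this sketch swaps in simple exponential smoothing with a single hyperparameter (alpha) instead of the Holt-Winters or SARIMA models used in the paper, and it scores candidates by MAPE only; the function names and the grid are hypothetical, and the tracking signal and the 4.0 MAPE threshold are deliberately left out.

```python
from itertools import product


def mape(actual, predicted):
    """Mean absolute percentage error, in percent (actual values must be non-zero)."""
    return 100.0 * sum(abs((a - p) / a) for a, p in zip(actual, predicted)) / len(actual)


def ses_forecast(train, horizon, alpha):
    """Simple exponential smoothing: a flat forecast at the last smoothed level."""
    level = train[0]
    for y in train[1:]:
        level = alpha * y + (1 - alpha) * level
    return [level] * horizon


def exhaustive_search(train, test, grid):
    """Evaluate every combination in `grid`; return (best_params, best_mape)."""
    names = list(grid)
    best_params, best_score = None, float("inf")
    for values in product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        score = mape(test, ses_forecast(train, len(test), **params))
        if score < best_score:  # lower MAPE is better
            best_params, best_score = params, score
    return best_params, best_score
```

With a real SARIMA model, the grid would instead enumerate combinations of p, d, q, P, D, Q, and m, and the cost of the search grows with the product of the grid sizes.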

The analytical model construction step consists of selecting the best predictive model for each of the 65 geographic units of each event, using the method that provided the best results in the previous step. The selection is performed using a greedy approach over a finite search space of the hyperparameters available for the SARIMA algorithm. In the model training step, different size distributions were tested for the training and testing datasets (70%-30%, 80%-20%, and 90%-10%). After analyzing the results of the generated models together with the experts, we decided to use the 80%-20% distribution, since it not only offered the best results but also avoided overestimating and underestimating the absolute percentage error in the predictions. Subsequently, for each model created, the Root Mean Square Error (RMSE) and the Mean Absolute Percentage Error (MAPE) are calculated, and the model with the lowest MAPE and RMSE is selected. In case of ambiguity, the AIC and BIC coefficients are used to select the best model.
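A minimal sketch of the split and of the error-based selection follows. It assumes the split is chronological (the earliest 80% of periods for training) and that candidates are ranked by MAPE first, with RMSE as a tie-breaker; the paper states only that the model with the lowest MAPE and RMSE is chosen, so this ordering and all function names are assumptions. The AIC/BIC fallback is not shown.

```python
from math import sqrt


def chronological_split(series, train_fraction=0.8):
    """Split a time series in order: the first part trains, the rest tests."""
    cut = int(len(series) * train_fraction)
    return series[:cut], series[cut:]


def rmse(actual, predicted):
    """Root mean square error."""
    return sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))


def mape(actual, predicted):
    """Mean absolute percentage error, in percent (actual values must be non-zero)."""
    return 100.0 * sum(abs((a - p) / a) for a, p in zip(actual, predicted)) / len(actual)


def select_model(candidates, actual):
    """Return the label of the forecast with the lowest (MAPE, RMSE) pair.

    `candidates` maps a model label to its forecast for the test window.
    """
    return min(
        candidates,
        key=lambda name: (mape(actual, candidates[name]), rmse(actual, candidates[name])),
    )
```

Splitting in time order rather than at random avoids training on periods that come after the ones being predicted.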

Visual analytics

The objective of the Indicator definition stage is to define indicators to quantify the impact of COVID-19 on other health events. The underlying assumption is that, without the measures taken to control the COVID-19 pandemic, it would be reasonable to expect the number of reported cases to be close to the forecast. Then, according to the report on excess mortality presented by DANE [4], and the studied events, two types of impact indicators are proposed. First, for a given period t, we analyze the difference between the forecast without COVID-19 (\({v}_{t}^{\mathrm{woCOVID}}\)) and the reported cases (\({O}_{t}\)). This is called “indirect impact” and it is expressed as a percentage (%) (see Eq. 2). Second, as the sanitary emergency evolved over the year, it was desirable to understand the “cumulative impact” from January to a given period t. This is estimated as the difference between the cumulative values of the forecast and the reported cases (see Eq. 3).

$${\mathrm{Indirect}}_{t}= \frac{ {v}_{t}^{\mathrm{woCOVID}}-{O}_{t} }{{v}_{t}^{\mathrm{woCOVID}}}$$


$${\mathrm{Cumulative}}_{t}=\left(\sum_{i=1}^{t}{v}_{i}^{\mathrm{woCOVID}}-\sum_{i=1}^{t}{O}_{i}\right) / \sum_{i=1}^{t}{v}_{i}^{\mathrm{woCOVID}}$$
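As a worked illustration of the two indicators, the sketch below computes them from a list of “without COVID-19” forecasts and a list of reported cases, one value per period. The function names are hypothetical, and reporting the cumulative indicator as a percentage (like the indirect one) is an assumption.

```python
def indirect_impact(forecast, observed):
    """Per-period indirect impact (Eq. 2), in percent."""
    return [100.0 * (v - o) / v for v, o in zip(forecast, observed)]


def cumulative_impact(forecast, observed):
    """Cumulative impact up to each period t (Eq. 3), in percent."""
    result = []
    cum_forecast = cum_observed = 0.0
    for v, o in zip(forecast, observed):
        cum_forecast += v  # running sum of forecasts up to period t
        cum_observed += o  # running sum of reported cases up to period t
        result.append(100.0 * (cum_forecast - cum_observed) / cum_forecast)
    return result
```

For example, with forecasts of 100 and 200 cases and reported counts of 80 and 100, the indirect impact is 20% and 50%, while the cumulative impact after the second period is (300 - 180) / 300 = 40%. A positive value means fewer cases were reported than forecast.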


In the Analytical model deployment stage, an interactive tool implemented in Power BI supports stakeholder interaction. This tool is published on Alianza Caoba’s public-policy web page. The tool comprises three types of pages for each health event: the first presents the descriptive analysis, the second the forecast results, and the third the impact indicators. On these pages, decision-makers can filter the information of interest by time and geographic unit, and thus focus on a specific part of an event to understand its behavior and improve their decisions. Figures 11 and 12 in Appendix 1 show screenshots of the interactive visualization tool with the prediction for tuberculosis at the national level under the with/without COVID-19 scenarios. They show the results of the prediction models with and without pandemic data, reflecting the estimated impact on each health event in each of the 32 departments and 32 capitals.
