On the evaluation of synthetic longitudinal electronic health records | BMC Medical Research Methodology


Dataset description

We select a dataset of longitudinal EHRs to illustrate our discussion on evaluation metrics. The dataset was obtained from the MIMIC-IV repository [54] on PhysioNet [55] (version 2.2), a freely available resource of de-identified EHRs from the Beth Israel Deaconess Medical Center in Boston, Massachusetts, collected between 2008 and 2019. We select patients diagnosed with ischemic heart disease, extracting their ICD-9 (International Classification of Diseases, 9th revision) code sequences and the static patient attributes age, race, gender, and deceased (whether the patient passed away in-hospital). Since there are over 13,000 possible ICD-9 codes, we encode diagnoses by their ICD-9 chapter to reduce computational complexity. The chapters complications of pregnancy, childbirth, and the puerperium; congenital anomalies; and certain conditions originating in the perinatal period are extremely rare in patients with ischemic heart disease, so we omit these diagnoses completely.

Missing data plays a role mostly in the patient attributes. In the sequences of diagnosis codes, missingness would show up as a sequence being shorter than it would otherwise be, although it cannot be determined whether values are missing from any specific sequence; variable-length sequences are handled with appropriate methods such as DTW. Among the patient attributes, race has 11% missingness, while age and gender have none. Since values might be missing not-at-random, we encode missing values as a separate category (unknown).

The final dataset contains 18,245 patients, each with 4 static attributes and a single diagnosis sequence of length between 5 and 37. Note that this dataset of longitudinal EHRs is less complex than what many real-world clinical tasks require: ICD-9 codes are encoded by their chapter, and only a small set of patient attributes and sequential health data is selected. This is because the dataset serves purely illustrative purposes, to support the discussion on evaluation metrics for synthetic EHRs.

Synthetic data generating models

We generate synthetic patient data using two distinct deep learning models available in open-source software libraries: firstly, a Generative Adversarial Network (GAN) [56], DoppelGANger [57], and secondly, a Conditional Probabilistic Auto-Regressive network (CPAR) [58]. The DoppelGANger implementation used does not support mixed-length sequences, so we implement a mask in the package following Lin et al. [57]. Both models generate data in two steps: static attributes are generated first, followed by sequential data conditional on the generated attributes. This allows the models to capture the relationship between patient attributes and the progression of diagnosis codes. For both DoppelGANger and CPAR we generate the same number of records as present in the real dataset.
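As an illustration of the two-step setup (static attributes as context, sequences conditional on them), the following is a minimal sketch assuming the SDV library's PARSynthesizer implementation of CPAR; the long-format dataframe `df` and its column names ('patient_id', 'visit_rank', and the static attributes) are hypothetical, and hyperparameters are illustrative.

```python
# Minimal sketch of fitting CPAR via SDV's PARSynthesizer (assumed API/column names).
from sdv.metadata import SingleTableMetadata
from sdv.sequential import PARSynthesizer

metadata = SingleTableMetadata()
metadata.detect_from_dataframe(df)                       # df: one row per sequence step
metadata.update_column(column_name='patient_id', sdtype='id')
metadata.set_sequence_key(column_name='patient_id')      # groups rows into per-patient sequences
metadata.set_sequence_index(column_name='visit_rank')    # ordering of steps within a sequence

synthesizer = PARSynthesizer(
    metadata,
    context_columns=['age', 'gender', 'race', 'deceased'],  # static attributes generated first
    epochs=128,
)
synthesizer.fit(df)
synthetic = synthesizer.sample(num_sequences=18245)      # same number of patients as the real data
```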

Note that other models have been developed that can generate EHRs with static attributes and sequential data. We opt for DoppelGANger and CPAR because they are contained in easy-to-use open-source libraries, promoting reproducibility of this research. Generating synthetic data of the highest quality is not the goal here, as we are providing a discussion of, and recommendations on, evaluating the quality of synthetic longitudinal EHRs. Other notable methods include Li et al. [16], Theodorou et al. [17], Mosquera et al. [18], and Lee et al. [35]; for high-dimensional datasets we recommend deep generative models such as GANs and VAEs, since these methods reduce the complexity of the learning task to a lower-dimensional continuous latent space.

Evaluating fidelity

Descriptive statistics

The first step in evaluating synthetic data fidelity is investigating descriptive statistics. We evaluate boxplots of numerical variables and relative frequencies of categorical variables. For sequential features we compute these statistics at each sequence step.
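For the diagnosis sequences this amounts to computing, at every step, the relative frequency of each ICD-9 chapter in the real and synthetic data. A minimal pandas sketch is given below; the long-format dataframes `real` and `synthetic` and their columns ('patient_id', 'step', 'icd9_chapter', 'age') are hypothetical.

```python
# Per-step relative frequencies of a categorical sequential feature, plus
# boxplot-style summaries of a static numerical attribute.
import pandas as pd

def chapter_frequencies(df: pd.DataFrame) -> pd.Series:
    """Relative frequency of each ICD-9 chapter at every sequence step."""
    counts = df.groupby(['step', 'icd9_chapter']).size()
    return counts / counts.groupby(level='step').transform('sum')

freq_real = chapter_frequencies(real)
freq_synth = chapter_frequencies(synthetic)

# Static numerical attribute: compare five-number summaries per dataset.
print(real.drop_duplicates('patient_id')['age'].describe())
print(synthetic.drop_duplicates('patient_id')['age'].describe())
```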

Low-dimensional projections

To intuitively assess synthetic to real data similarity, we visualize synthetic and real multivariate samples in two dimensions. If synthetic and real samples mostly overlap in the plot, this indicates they are similar. Additionally, this method may indicate whether mode collapse is present, which is the case when synthetic samples are realistic but of very low variety. In a plot, this may show as synthetic samples clustered into one or more dense clouds, instead of following the same dispersion as real datapoints.

Visualizing multivariate samples in two dimensions requires an algorithm that can adequately project them to two dimensions. We use tSNE and UMAP for this purpose. Both algorithms compress a dataset by constructing datapoints in a low-dimensional space that exhibit divergences between datapoints similar to those in the original data, according to some metric. In this way they aim to preserve the overall structure of the data, even after a major compression of the feature space [20, 21]. An important hyperparameter in both algorithms is the number of neighbours (called perplexity in tSNE), which controls how many datapoints are considered when calculating divergences.

Although there are similarities, many differences exist between tSNE and UMAP. For example, tSNE calculates divergence between all datapoints using Kullback-Leibler divergence, whereas UMAP calculates divergence only between k-nearest neighbours and uses cross-entropy. UMAP intends to improve upon tSNE in terms of speed and quality [21], but some research claims that differences in quality are mainly due to the choice of initialization and can thus easily be mitigated [50].
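A minimal sketch of both projections is given below, assuming a precomputed symmetric pairwise distance matrix `D` over the pooled real and synthetic samples (obtained here with the DTW-based procedure described in the following paragraphs) and the scikit-learn and umap-learn packages; the hyperparameter values are illustrative.

```python
# Two-dimensional embeddings of the pooled samples from a precomputed distance matrix.
from sklearn.manifold import TSNE
import umap

# D: (n_samples, n_samples) distance matrix with zeros on the diagonal.
emb_tsne = TSNE(
    n_components=2, metric='precomputed', init='random', perplexity=30
).fit_transform(D)

emb_umap = umap.UMAP(
    n_components=2, metric='precomputed', n_neighbors=30
).fit_transform(D)

# Plot emb_tsne / emb_umap coloured by real vs. synthetic origin and check
# whether the groups overlap, or whether synthetic points collapse into a few
# dense clusters (a sign of mode collapse).
```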

When considering longitudinal datasets, however, calculating divergences between datapoints for tSNE and UMAP introduces an additional difficulty. For sequential features of variable length, standard metrics do not apply directly, since it is unclear how the datapoints of one sample map to those of another. For this reason we first apply the DTW algorithm, which finds a mapping between the datapoints of two sequences that minimizes the total divergence between them, subject to some conditions: the mapping starts and ends at the start- and end-points of both sequences, and is monotonic and continuous [22].

Lastly, for the DTW algorithm to find a mapping between the datapoints of two sequences, an appropriate divergence metric must be chosen. Since we consider features of mixed data types, we use the Gower distance [59], which is fit for this purpose.

Since the DTW algorithm returns the cumulative Gower distance over aligned sequence steps, this needs to be scaled to a similar range before averaging static and sequential feature distances. We use the DtwParallel package to execute the DTW algorithm, which scales distances by the geometric mean of the sequence lengths. This choice can be justified over arithmetic mean-based scaling, since it penalizes variability in sequence length more heavily.
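To make the computation concrete, a simplified, self-contained sketch of DTW with a per-step Gower distance and geometric-mean scaling is given below (the analysis itself uses the DtwParallel package); sequences are represented as lists of per-step feature dicts with a single hypothetical categorical feature 'chapter' for brevity.

```python
# Simplified DTW between two variable-length sequences using a per-step Gower distance.
import numpy as np

def gower_step(a: dict, b: dict) -> float:
    """Per-step Gower distance: categorical features contribute 0/1;
    numerical features would contribute a range-normalized absolute difference."""
    return float(a['chapter'] != b['chapter'])

def dtw_gower(seq_a: list, seq_b: list) -> float:
    n, m = len(seq_a), len(seq_b)
    acc = np.full((n + 1, m + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = gower_step(seq_a[i - 1], seq_b[j - 1])
            # monotonic, continuous warping path from start to end of both sequences
            acc[i, j] = cost + min(acc[i - 1, j], acc[i, j - 1], acc[i - 1, j - 1])
    # scale the cumulative distance by the geometric mean of the sequence lengths
    return acc[n, m] / np.sqrt(n * m)

# Example: d = dtw_gower(sequence_of_patient_a, sequence_of_patient_b)
```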

Finally, it should be noted that using the Gower distance as divergence metric significantly impacts results. Continuous feature distances only reach 1 in the most dissimilar case, whereas categorical feature distances equal 1 for any difference, which may occur in many cases. For this reason, differences in categorical features tend to overshadow differences in continuous features when measured through the Gower distance.
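A toy illustration of this asymmetry, with one numerical and one categorical feature and purely illustrative values:

```python
# One numerical feature (age, range-normalized) and one categorical feature (gender).
age_range = 91 - 18                   # assumed observed age range
d_age = abs(64 - 59) / age_range      # ~0.07: a small numerical difference
d_gender = float('M' != 'F')          # 1.0: any categorical difference gives the maximum
gower = (d_age + d_gender) / 2        # the categorical term dominates the average
```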

Goodness-of-Fit

The next step in evaluating synthetic data fidelity is a numerical assessment of synthetic-to-real data GoF; in other words, measuring the similarity between the synthetic and real data densities. However, some method to approximate these densities is required, since they are (usually) intractable. This is often framed as a classification task discriminating synthetic from real samples, where an accuracy close to 50% signifies that the densities are similar [16, 35, 36]. Note that this closely relates to classification-based GoF testing as in Friedman [60], although the synthetic data literature omits explicit testing. In this section we describe the strengths and weaknesses of this metric, and how it can be used to statistically test whether the synthetic and real data densities are the same in the scenario of longitudinal datasets.

Firstly, comparing the synthetic to the real data density through this classification task may in some cases be an oversimplification of the problem, since, as we show mathematically below, it implies that the entire multivariate data distribution can be encoded as a simple univariate binary feature, whose density we approximate using a classifier. The proof is as follows.

Let \(\textbf{X}_{\textbf{R}}\) be the original dataset of real samples, \(\textbf{X}_{\textbf{S}}\) a generated synthetic dataset, and pooled dataset \(\textbf{X} = \textbf{X}_{\textbf{R}} \cup \textbf{X}_{\textbf{S}}\). Since densities \(p(\textbf{X}_{\textbf{R}}),p(\textbf{X}_{\textbf{S}})\) are intractable, we encode them with the univariate binary feature \(\textbf{z} = \textbf{1}_{\textbf{X}_{\textbf{S}}}: \textbf{X} \rightarrow \{0,1\}\) and approximate its posterior \(p(\textbf{z}|\textbf{X})\) with \(q_{\lambda }(\textbf{z})\) through variational inference [61]. Here, we choose \(q_{\lambda }\) as some machine learning model where \(\lambda\) are its parameters. Now, we can optimize for \(\lambda\) when minimizing the Kullback-Leibler divergence [62] between the true and approximated posterior of \(\textbf{z}\):

$$\begin{aligned} q_{\hat{\lambda }}(\textbf{z}) & = \arg \underset{\lambda }{\min }\ \text {KL}\left(q_{\lambda }(\textbf{z}) \Vert p(\textbf{z}|\textbf{X})\right) \\ & = \arg \underset{\lambda }{\min }\ \sum \limits _i z_i \log \left( \frac{z_i}{q_i}\right) \\ & = \arg \underset{\lambda }{\min }\ -\sum \limits _i z_i \log q_i \end{aligned} \tag{1}$$

where \(z_i\) are the binary labels and \(q_i\) the label predictions from \(q_{\lambda }(\textbf{z})\). This is equivalent to optimizing a binary classifier \(f_{\theta }\) for minimal cross-entropy between true labels and predictions:

$$\begin{aligned} f_{\hat{\theta }}(\textbf{z}|\textbf{X}) & = \arg \underset{\theta }{\min }\ \text {H}\left(f_{\theta }(\textbf{z}|\textbf{X}),p(\textbf{z}|\textbf{X})\right) \\ & = \arg \underset{\theta }{\min }\ -\sum \limits _i z_i \log q_i \end{aligned} \tag{2}$$

such that approximating \(p(\textbf{z}|\textbf{X})\) can be framed as a simple binary classification task, since the optimization problems in Eqs. (1) and (2) are equivalent. Note that in practice we make a continuous approximation of the binary latent variable on the (0,1) interval. Lastly, note that training, optimizing and reporting results of the classifier should be done with independent train, validation and test sets. This avoids overfitting the training data and results in a more reliable approximation of \(p(\textbf{z}|\textbf{X})\).

Encoding the multivariate synthetic and real distributions as univariate binary labels has some clear advantages. It allows for a relatively simple approximation through classification, and consequently a univariate representation of the multivariate distributions. This enables intuitive visualization to inspect model failures such as mode collapse, but also explicit GoF testing of the latent distribution. When \(H_0: q_{\lambda }(\textbf{z}_{\textbf{S}})=q_{\lambda }(\textbf{z}_{\textbf{R}})\) is rejected after performing a univariate GoF test such as the KS test, we can be certain that \(H_0: p(\textbf{X}_{\textbf{S}})=p(\textbf{X}_{\textbf{R}})\) is also rejected. Through explicit GoF testing, we can thus provide statistical confidence on whether the synthetic and real data densities are similar.

However, when \(H_0: q_{\lambda }(\textbf{z}_{\textbf{S}})=q_{\lambda }(\textbf{z}_{\textbf{R}})\) holds it does not necessarily follow that \(H_0: p(\textbf{X}_{\textbf{S}})=p(\textbf{X}_{\textbf{R}})\) also holds. Firstly, the chosen family of densities (machine learning classifiers) \(q_{\lambda }\) might not be suitable. Secondly, the compression of the multivariate dataset into a univariate binary feature \(\textbf{z}\) might be too simplistic. This metric should thus be approached with care.

To mitigate the risk of binary classification being an oversimplification of the problem, an option is to let \(\textbf{z}\) be a multidimensional Gaussian and approximate \(p(\textbf{z}|\textbf{X})\) using a Variational Auto-Encoder (VAE) [63]. Then, testing \(H_0: q_{\lambda }(\textbf{z}_{\textbf{S}})=q_{\lambda }(\textbf{z}_{\textbf{R}})\) can be done through a multivariate GoF test. This way, the added dimensions allow for capturing more intricate differences between the synthetic and real distribution.

In the scenario of longitudinal datasets, the classifier should also capture differences across the time dimension. An appropriate classifier that can handle mixed-length sequences of mixed data types is an RNN classifier. We use this classifier to analyze the predicted probabilities and to perform the KS test of \(H_0: q_{\lambda }(\textbf{z}_{\textbf{S}})=q_{\lambda }(\textbf{z}_{\textbf{R}})\).

Whenever we use an RNN we specifically mean a neural network with at least one layer containing Gated Recurrent Units (GRU) [64] to process sequential inputs. Long Short-Term Memory (LSTM) cells [65] are likely not necessary since sequences are relatively short [66].
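A minimal PyTorch sketch of this classifier-based GoF test with a GRU discriminator follows; the tensor names (`codes_test`, `lengths_test`, `static_test`, `labels_test`), the illustrative dimensions, and the (omitted) training loop on an independent training split are assumptions.

```python
# GRU discriminator for real vs. synthetic records, followed by a KS test on
# the held-out predicted probabilities.
import torch
import torch.nn as nn
from scipy.stats import ks_2samp

class GRUDiscriminator(nn.Module):
    def __init__(self, n_codes: int, n_static: int, hidden: int = 64):
        super().__init__()
        self.embed = nn.Embedding(n_codes + 1, 32, padding_idx=0)  # 0 = padding index
        self.gru = nn.GRU(32, hidden, batch_first=True)
        self.head = nn.Linear(hidden + n_static, 1)

    def forward(self, codes, lengths, static):
        x = nn.utils.rnn.pack_padded_sequence(
            self.embed(codes), lengths.cpu(), batch_first=True, enforce_sorted=False)
        _, h = self.gru(x)                                   # h: (1, batch, hidden)
        return self.head(torch.cat([h[-1], static], dim=1)).squeeze(-1)  # logits

model = GRUDiscriminator(n_codes=16, n_static=4)  # 16 retained ICD-9 chapters (illustrative)
# ... train `model` with nn.BCEWithLogitsLoss on the training split, tune on validation ...

model.eval()
with torch.no_grad():
    q = torch.sigmoid(model(codes_test, lengths_test, static_test)).numpy()
# labels_test: numpy array with 0 = real, 1 = synthetic, for the held-out test split
stat, p_value = ks_2samp(q[labels_test == 0], q[labels_test == 1])
# Rejecting H0 here implies the synthetic and real data densities also differ.
```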

Evaluating utility

Typical clinical tasks involving patient attributes and diagnosis sequences include in-hospital mortality prediction using RNNs [30, 67] and next-step diagnosis prediction using RNNs and attention networks [27, 28, 68]. To assess the utility of the generated synthetic EHRs, we compare performance on these tasks with the TSTR approach.

It should be noted that we do not aim for the best possible performance here. The dataset was not selected with a specific use case in mind, and serves an illustrative purpose to support the discussion on evaluation metrics. For showcasing the TSTR approach the comparison of performance is key, the level of performance less so, as long as the algorithm has at least some predictive power. When aiming for the best possible performance in these tasks, we recommend including additional features such as socioeconomic status, physiological measurements and medications, which have empirically been shown to be important [30, 67, 69, 70, 71], as well as using full ICD-9 codes rather than their chapter encoding.
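A simplified sketch of the TSTR comparison for in-hospital mortality prediction is given below, using a gradient boosting classifier on aggregated features as a stand-in for the RNN; the `featurize` helper, the dataframes and the label arrays are hypothetical.

```python
# TSTR vs. TRTR: train two identical models, one on synthetic and one on real
# training data, and evaluate both on the same held-out real test set.
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score

X_real_train, y_real_train = featurize(real_train), real_train_labels
X_real_test,  y_real_test  = featurize(real_test),  real_test_labels
X_synth,      y_synth      = featurize(synthetic),  synthetic_labels

def test_on_real(train_X, train_y):
    clf = GradientBoostingClassifier().fit(train_X, train_y)
    return roc_auc_score(y_real_test, clf.predict_proba(X_real_test)[:, 1])

auc_trtr = test_on_real(X_real_train, y_real_train)  # Train on Real, Test on Real
auc_tstr = test_on_real(X_synth, y_synth)            # Train on Synthetic, Test on Real
# Similar TSTR and TRTR scores suggest the synthetic data preserves the
# predictive relationships needed for this task.
```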

Evaluating privacy risk

To assess the risk of attribute inference, we train an RNN to predict sensitive attributes from all non-sensitive features. Here, we take every possible combination from the feature set {age, gender, race} as sensitive target attributes, giving 7 sets in total. Every feature not in the set of target attributes is used as a non-sensitive input feature in that iteration.

We assess the risk of attribute inference through the predictive accuracy of this attribute inference attack (AIA). Here, we focus on interpretable metrics such as accuracy and Mean Absolute Error (MAE) to assess the potential privacy risk in a real-world setting.
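A simplified sketch for one sensitive set, {race}, is given below, using a random forest on aggregated non-sensitive features as a stand-in for the RNN attacker; the `featurize` helper and the dataframes are hypothetical.

```python
# Attribute inference attack: the adversary trains on the released synthetic
# data and tries to infer a sensitive attribute of real patients.
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

attack = RandomForestClassifier().fit(
    featurize(synthetic_nonsensitive),   # non-sensitive synthetic features
    synthetic['race'])                   # sensitive target in the synthetic data

inferred = attack.predict(featurize(real_nonsensitive))
attack_accuracy = accuracy_score(real['race'], inferred)
# Compare against a naive baseline (e.g. always predicting the majority race)
# to judge how much additional information the synthetic data leaks.
```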


