Data source and study population
The data source used for this study was the administrative claims database of the Veneto region (~5 million inhabitants), in Northeast Italy, and, specifically, its prescription medicine and hospital admission repositories with diagnostic discharge codes. Briefly, the Italian healthcare system mandates that all regions collect and share all transactional information on healthcare expenses, including prescription refills and hospitalisations, for reimbursement purposes. As a practical consequence, complete and timestamped information on prescription refills (mapped to ATC codes [10] as per official Ministry tables) and diagnoses at hospital discharge (encoded via ICD-9-CM codes [11]) was available for this study. Additionally, it was possible to query the regional registry of healthcare beneficiaries [12] to confirm demographics, standing with the regional healthcare system (including month of death), and exemptions from co-payment.
The inclusion criteria for this study were the following: Italian citizenship and residence in the Veneto region; T2D as identified via a validated claims-based algorithm (98% precision, 96% sensitivity) [13]; at least two years of eligibility as per the regional registry of healthcare beneficiaries between 11 January 2011 and 30 September 2018; at least four refilled prescriptions of GLMs (ATC class A10, “drugs used in diabetes”) during the period. Exclusion criteria were: evidence of cancer from diagnostic and exemption codes; evidence of heart failure, myocardial infarction, or stroke prior to the start of the observation period.
Outcome definition and modelling question
The cardiovascular outcome of interest for this study was a version of the 4-point MACE (4P-MACE) composite indicator, defined as the occurrence of at least one of the following: hospitalisation for heart failure (ICD-9-CM codes starting with 428), myocardial infarction (codes 410–414), or stroke (codes 431–436); or death from any cause.
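For illustration only, the component-flagging logic amounts to a prefix and range check on the discharge codes; the sketch below assumes an admission's diagnoses are available as strings and that an any-cause death flag is known (names and data layout are hypothetical, not the study's actual code).

```python
# Minimal sketch of 4P-MACE flagging from ICD-9-CM discharge codes.
# `discharge_codes` (list of strings) and `died` (bool) are hypothetical.

def is_4p_mace(discharge_codes, died):
    def in_range(code, lo, hi):
        prefix = code[:3]  # compare on the three-digit ICD-9-CM category
        return prefix.isdigit() and lo <= int(prefix) <= hi

    heart_failure = any(c.startswith("428") for c in discharge_codes)
    myocardial_infarction = any(in_range(c, 410, 414) for c in discharge_codes)
    stroke = any(in_range(c, 431, 436) for c in discharge_codes)
    return heart_failure or myocardial_infarction or stroke or died
```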
As previously stated, our objective was to determine whether and to what extent temporal GLM usage patterns, combined with basic information (age, sex, diabetes duration), could identify patients whose trajectories ended on a 4P-MACE. We formalised this task as the following modelling question: “Given the sequence and timing of all GLM prescriptions refilled by a patient (coding resolution: full ATC code; time resolution: trimesters before end-of-observation), and their age, sex, and diabetes duration, does the sequence end on a 4P-MACE, or with the patient’s event-free exit from the database?” Note that this was a classification, rather than temporal prediction, question: in other words, we were only interested in determining where each GLM usage pattern would immediately lead (4P-MACE vs. no event), and not in developing a predictive model to infer something about the future (e.g., survival analysis to determine 4P-MACE probability).
Data preparation and dataset split
The ground truth for each patient was a set of 5 binary indicators, one for 4P-MACE (primary outcome), and one for each of its components. The 4P-MACE label was equal to 1 if and only if the observed GLM pattern ended immediately before a 4P-MACE, and to 0 in case of event-free exit from the database. Each component label was equal to 1 if and only if the observed 4P-MACE was attributable to that component specifically, and to 0 otherwise (i.e., no 4P-MACE, or 4P-MACE but component not involved); multiple components could be equal to 1 at the same time (e.g., fatal myocardial infarction).
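As a minimal sketch of this labelling rule (the per-patient dictionary of component flags is hypothetical, not the study's data model):

```python
# Illustrative construction of the five ground-truth labels for one patient.
# `components` is a hypothetical dict of booleans describing the event (if
# any) that closed the observation window.

def make_labels(had_event, components):
    label_4p_mace = int(had_event)
    component_labels = {k: int(had_event and v) for k, v in components.items()}
    return label_4p_mace, component_labels

# A fatal myocardial infarction sets both "mi" and "death" to 1:
make_labels(True, {"heart_failure": False, "mi": True,
                   "stroke": False, "death": True})
```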
We encoded the pattern of GLMs, plus age, sex, and diabetes duration, into a single, 2-dimensional, masked tensor of size 51 features × 25 trimesters. The tensor was also right-aligned, meaning that the jth column (j = 1, …, 25) of the tensor captured the situation at the (26 − j)th trimester before end-of-observation, with the trimester immediately preceding end-of-observation in the last column, and the 25th (6.25 to 6 years before end-of-observation) in the first column. Observation periods longer than 25 trimesters were cut short by ignoring the oldest data points (26th and earlier trimesters). In case of observation periods shorter than 25 trimesters, the tensor was masked (masking value = –1), i.e., all columns corresponding to unobserved trimesters were uniformly filled with the masking value. Each row of the 2-dimensional tensor encoded age, sex, diabetes duration, or the usage, trimester by trimester, of one of the 48 GLMs available in Veneto at the time of the experiment.
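A minimal NumPy sketch of this encoding, assuming each patient's observed data arrive as a 51 × T array with the oldest trimester first (variable names are ours):

```python
import numpy as np

N_FEATURES, N_TRIMESTERS, MASK = 51, 25, -1.0

def build_tensor(observed):
    """observed: array of shape (51, T), oldest trimester first."""
    x = np.full((N_FEATURES, N_TRIMESTERS), MASK, dtype=np.float32)
    t = min(observed.shape[1], N_TRIMESTERS)
    # Keep the t most recent trimesters and right-align them, so the last
    # column is the trimester immediately preceding end-of-observation;
    # unobserved columns keep the masking value -1.
    x[:, N_TRIMESTERS - t:] = observed[:, -t:]
    return x
```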
This process resulted in 151,175 2-dimensional tensors, which we split into three subsets: a larger training set for model development comprising the data of 131,175 patients, a validation set of 10,000 patients for hyperparameter tuning (if needed), and a test set of 10,000 patients for final performance evaluation.
Model architecture and output
Our model is based on the deep recurrent neural network (RNN) architecture proposed in [14], and adapted from the context of clinical event prediction to GLM usage pattern classification. The main feature of both the original and our version of the architecture is its input-level attention mechanism, i.e., the presence of a specific layer that establishes a relative importance weighting between ATC classes at each time point [15].
Our model architecture conceptually implements a cascade of four logical steps, namely: tensor ingestion, the attention mechanism, a recurrent layer, and final prediction via fully connected layers (Fig. 1). First, the 2-dimensional input tensor is duplicated: one copy is passed to the attention mechanism, while the other is transposed, ready to be multiplied by an attention matrix. At this point, the network splits into four parallel, identically structured subnetworks, one for each 4P-MACE component. Within each subnetwork, to implement the attention definition used in [14] (plus a bias term) with the computationally efficient tools available in the main deep learning libraries, the first copy of the tensor enters a dense layer of 25 (the number of trimesters) neurons equipped with a softmax activation function. This process results in an attention matrix that assigns a weight to each GLM used in each trimester such that the sum over time of the weights is equal to 1, while the sum over all features of the weights attributed to a trimester is unbounded. In other words, for each subject, the network tries to establish the relative importance of each GLM within each trimester and the overall importance of the trimester. After computation, the attention matrix is transposed and multiplied elementwise by the transposed copy of the input tensor, thus implementing the input-level attention mechanism. As there are four subnetworks, we also obtain four (different) attention matrices and four attention-weighted tensors.
Each weighted tensor then passes through a recurrent layer (an LSTM [16] or GRU [17], possibly with dropout) that squeezes the dynamic, variable-length information carried by the tensor into a single, fixed-length vector. A dense layer with a single neuron and sigmoid activation yields each subnetwork’s output, to be compared to the ground truth of the corresponding 4P-MACE component. Finally, the four subnetworks are brought together via concatenation of the four terminal pre-activation logits, and the resulting 4-element vector is passed to a dense layer with a single neuron, which outputs the final 4P-MACE prediction.
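A minimal Keras-style sketch of this wiring is shown below; layer sizes follow the text, while the layer names, the Permute/Multiply rendering of the transpositions, and the omission of masking handling are our assumptions, not the study's actual code.

```python
from tensorflow.keras import layers, Model

def build_model(n_features=51, n_trimesters=25, units=128):
    # Input arrives as features x trimesters; handling of the -1 masking
    # value is omitted in this sketch.
    inp = layers.Input(shape=(n_features, n_trimesters))
    names = ("hf", "mi", "stroke", "death")
    logits = []
    for name in names:
        # Input-level attention: per feature row, a softmax over the 25
        # trimesters (weights sum to 1 over time; column sums unbounded).
        att = layers.Dense(n_trimesters, activation="softmax",
                           name=f"att_{name}")(inp)
        # Transpose attention and input to (time, features), multiply
        # elementwise, then squeeze the sequence with a recurrent layer.
        weighted = layers.Multiply()(
            [layers.Permute((2, 1))(att), layers.Permute((2, 1))(inp)])
        hidden = layers.LSTM(units)(weighted)  # or GRU, per the grid search
        logits.append(layers.Dense(1, name=f"logit_{name}")(hidden))
    # Component outputs (sigmoid) plus the final 4P-MACE head, which acts
    # on the concatenated pre-activation logits.
    component_outputs = [layers.Activation("sigmoid", name=f"out_{n}")(l)
                         for n, l in zip(names, logits)]
    mace = layers.Dense(1, activation="sigmoid",
                        name="out_4p_mace")(layers.Concatenate()(logits))
    return Model(inputs=inp, outputs=[mace, *component_outputs])
```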
In summary, the model has one primary output, i.e., the score (or probability) associated with the likelihood of an observation window ending on a 4P-MACE vs. on an event-free exit from the database, and four component-specific secondary outputs.
As retrieval of outcome-specific attention maps was possible for all subjects, we produced four average attention matrices, one for each 4P-MACE component. We turned each map into an attention landscape, re-normalised within each trimester for legibility, to show the time-resolved patterns of the GLMs that contributed most to classification.
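A sketch of this post-processing, assuming the per-subject attention matrices for one component are stacked into a single array (shapes as in the text, names ours):

```python
import numpy as np

def attention_landscape(att_stack):
    """att_stack: (n_subjects, 51, 25) attention matrices, one component."""
    mean_att = att_stack.mean(axis=0)               # average 51 x 25 map
    col_sums = mean_att.sum(axis=0, keepdims=True)  # per-trimester totals
    # Re-normalise within each trimester so that each column sums to 1.
    return mean_att / np.where(col_sums == 0, 1.0, col_sums)
```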
Model selection and primary performance evaluation
Given the architecture described above, we selected the final model via hyperparameter tuning based on an exhaustive grid search and early stopping. We tested 216 hyperparameter combinations: presence or absence of a bias term in the attention mechanism; LSTM or GRU as the type of recurrent layer; 64, 128, or 256 as the number of recurrent units; rectified linear unit (ReLU) or hyperbolic tangent as the recurrent layer’s activation function; and 0%, 10%, or 25% as the recurrent layer’s dropout and recurrent dropout (independently). For each combination, we optimised all model parameters using the average binary cross-entropy of 4P-MACE and its components as a cost function (Adam algorithm, learning rate = 0.001); then, we evaluated the area under the receiver-operating characteristic curve (AUROC) for 4P-MACE on the validation set (10,000 patients not used for parameter estimation), stopping the training process after 10 epochs of no improvement and retaining the best epoch’s parameters. We selected the best model among the 216 candidates as the one maximising the 4P-MACE AUROC on the validation set.
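The grid and the stopping rule map naturally onto standard tooling; for example, assuming a Keras training loop that logs the validation 4P-MACE AUROC under the hypothetical metric name "val_auroc":

```python
from itertools import product
from tensorflow.keras.callbacks import EarlyStopping

# 2 x 2 x 3 x 2 x 3 x 3 = 216 hyperparameter combinations.
grid = list(product([True, False],      # bias term in the attention layer
                    ["lstm", "gru"],    # recurrent layer type
                    [64, 128, 256],     # recurrent units
                    ["relu", "tanh"],   # recurrent activation
                    [0.0, 0.1, 0.25],   # dropout
                    [0.0, 0.1, 0.25]))  # recurrent dropout
assert len(grid) == 216

# Stop after 10 epochs without validation improvement, keep the best epoch.
stopper = EarlyStopping(monitor="val_auroc", mode="max", patience=10,
                        restore_best_weights=True)
```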
We evaluated the final model’s performance in terms of the AUROCs associated with 4P-MACE and each of its components on the test set (untouched until this point), including 95% confidence intervals calculated via the DeLong method [18].
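For reference, the DeLong confidence interval for a single AUROC can be computed from placement values as below; this is a generic implementation of the published method [18], not the study's code:

```python
import numpy as np
from scipy.stats import norm

def delong_ci(y_true, y_score, alpha=0.95):
    y_true, y_score = np.asarray(y_true), np.asarray(y_score)
    pos, neg = y_score[y_true == 1], y_score[y_true == 0]
    m, n = len(pos), len(neg)
    # Placement values (ties count one half), per DeLong et al.
    v10 = np.array([(np.sum(p > neg) + 0.5 * np.sum(p == neg)) / n
                    for p in pos])
    v01 = np.array([(np.sum(pos > q) + 0.5 * np.sum(pos == q)) / m
                    for q in neg])
    auc = v10.mean()
    var = v10.var(ddof=1) / m + v01.var(ddof=1) / n
    half = norm.ppf(0.5 + alpha / 2) * np.sqrt(var)
    return auc, max(0.0, auc - half), min(1.0, auc + half)
```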
Secondary benchmarking analyses
The proposed model can leverage the three fundamental aspects of GLM usage (namely, timing, sequence, and type of medication). However, this comes at the cost of having to handle relatively large (51 × 25) input tensors. Hence, to quantify the possible impact of input type and dimensionality on classification performance, we set up three secondary analyses following the same experimental protocol and data splits used in the primary analysis. The outputs of each analysis were the classification AUROC on the test set, including its 95% confidence interval, and the identification of statistically significant differences in performance vs. the proposed model.
First, to understand the impact and efficiency of sequence-based learning with respect to classification performance, we reran the performance evaluation phase on two artificially modified variations of the test set. Namely, we considered a variation where the unmasked portion of the tensor was randomly shuffled through time, and one where the order of refilled prescriptions was completely inverted (we pretended that the first GLM was prescribed at the date of the last GLM, the second at the date of the second-to-last, and so on).
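A sketch of the two perturbations, operating on the right-aligned tensors of the data-preparation step (the function and its arguments are illustrative):

```python
import numpy as np

def perturb(x, mode, rng, mask=-1.0):
    """x: 51 x 25 right-aligned tensor; mode: 'shuffle' or 'invert'."""
    cols = np.flatnonzero(~np.all(x == mask, axis=0))  # observed trimesters
    out = x.copy()
    if mode == "shuffle":                  # random temporal order
        out[:, cols] = x[:, rng.permutation(cols)]
    elif mode == "invert":                 # newest refill becomes oldest
        out[:, cols] = x[:, cols[::-1]]
    return out

# e.g.: rng = np.random.default_rng(42); x_mod = perturb(x, "shuffle", rng)
```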
Second, we implemented a strategy adapted from [19], which requires medication data in the form of variable-length (hence masked and zero-padded to 150 refills) sequences. On the one hand, the switch from tensors to sequences collapsed the “timing” dimension and reduced dimensionality; on the other, it removed the model’s ability to account for simultaneous therapies. In practice, we transformed each tensor into a zero-padded (and masked with masking value = 0) sequence of integers (from 1 to 48, each corresponding to a GLM ATC class), sorted from oldest to newest according to prescription date, and treated age, sex, and diabetes duration as a separate input. We substituted the initial part of the proposed architecture (tensor ingestion and attention mechanism) with the corresponding solution taken from [19], i.e., sequence ingestion, embedding, and concatenation of patient information with the output of the recurrent layer, using the following hyperparameters (216 combinations): learning rate = 0.001; embedding size (64 or 128); recurrent layer type (LSTM or GRU) and number of units (64, 128, or 256); activation function (ReLU or hyperbolic tangent); and dropout and recurrent dropout (for both, independently: no dropout, 10%, or 25%). The rest of the pipeline remained unaltered. Note that, unlike the model in [19], here we tackled a classification (vs. prediction) task, resulting in a much wider observation window of 6.25 years (vs. 1 year) and static (vs. dynamic, 1 to 5 years in the future) ground truth labels.
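A minimal Keras-style sketch of the resulting benchmark network, shown with a single output head for brevity (layer sizes follow the text; the wiring is our assumed adaptation, not the study's code):

```python
from tensorflow.keras import layers, Model

MAX_LEN = 150  # refill sequences zero-padded to 150 entries

def build_sequence_model(embedding_size=64, units=128):
    seq_in = layers.Input(shape=(MAX_LEN,))
    static_in = layers.Input(shape=(3,))  # age, sex, diabetes duration
    # Tokens 1..48 are GLM ATC classes; 0 is padding and is masked out.
    e = layers.Embedding(input_dim=49, output_dim=embedding_size,
                         mask_zero=True)(seq_in)
    h = layers.GRU(units)(e)              # or LSTM, per the grid search
    # Patient information is concatenated after the recurrent layer.
    merged = layers.Concatenate()([h, static_in])
    out = layers.Dense(1, activation="sigmoid")(merged)
    return Model(inputs=[seq_in, static_in], outputs=out)
```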
Third, we developed the simplest possible model, i.e., a logistic regression on the concatenation of age, sex, diabetes duration, and the bag-of-words vector of prescribed GLMs throughout the entire observation period. This analysis further collapsed all information carried by the “sequence” dimension into a static vector of 51 elements.
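A sketch of this benchmark, assuming the 48 GLM rows occupy positions 3–50 of the tensor (an illustrative layout) and using scikit-learn:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def bag_of_words(x, mask=-1.0):
    """x: 51 x 25 tensor; rows 3..50 assumed to hold the 48 GLM features."""
    glm = np.where(x[3:, :] == mask, 0.0, x[3:, :])
    return glm.sum(axis=1)  # total use per GLM over the whole window

# Hypothetical assembly: 3 static features + 48-element bag of words = 51.
# X = np.stack([np.concatenate([static[i], bag_of_words(tensors[i])])
#               for i in range(len(tensors))])
# model = LogisticRegression(max_iter=1000).fit(X, y)
```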