Construction of machine learning diagnostic models for cardiovascular pan-disease based on blood routine and biochemical detection data

Abstract

Background

Cardiovascular disease, also known as circulation system disease, remains the leading cause of morbidity and mortality worldwide. Traditional methods for diagnosing cardiovascular disease are often expensive and time-consuming. This study therefore aimed to construct machine learning models for diagnosing cardiovascular diseases from easily accessible blood routine and biochemical detection data, and to explore the hematologic features characteristic of these diseases, including some metabolic indicators.

Methods

After data preprocessing, blood routine and biochemical detection data from 25,794 healthy people and 32,822 patients with circulation system disease were used in this study. We constructed models with logistic regression, random forest, support vector machine, eXtreme Gradient Boosting (XGBoost), and deep neural network. Finally, the SHAP algorithm was used to interpret the models.

Results

The circulation system disease prediction model constructed with XGBoost performed best (AUC: 0.9921 (0.9911–0.9930); Acc: 0.9618 (0.9588–0.9645); Sn: 0.9690 (0.9655–0.9723); Sp: 0.9526 (0.9477–0.9572); PPV: 0.9631 (0.9592–0.9668); NPV: 0.9600 (0.9556–0.9644); MCC: 0.9224 (0.9165–0.9279); F1 score: 0.9661 (0.9634–0.9686)). Most models distinguishing individual circulation system diseases from one another also performed well; the best was the model distinguishing dilated cardiomyopathy from other circulation system diseases (AUC: 0.9267 (0.8663–0.9752)). Model interpretation with the SHAP algorithm indicated that biochemical detection features, such as potassium (K), total protein (TP), albumin (ALB), and indirect bilirubin (NBIL), contributed most to predicting circulation system disease. For the models distinguishing among circulation system diseases, red blood cell count (RBC), K, direct bilirubin (DBIL), and glucose (GLU) were the top 4 features.

Conclusions

The present study constructed multiple models using 50 features from the blood routine and biochemical detection data for the diagnosis of various circulation system diseases. At the same time, the unique hematologic features of various circulation system diseases, including some metabolic-related indicators, were also explored. This cost-effective work will benefit more people and help diagnose and prevent circulation system diseases.

Background

Cardiovascular diseases (CVDs), also known as circulatory system diseases, encompass a range of conditions including coronary heart disease (CHD), cerebrovascular disease, arrhythmias, valvular heart disease, cardiomyopathy, heart failure, and other related disorders [1]. With the widespread adoption of unhealthy lifestyle habits, CVDs continue to be the leading cause of mortality and morbidity worldwide, imposing a significant health burden and economic strain on both patients and society [2, 3]. The impact of CVDs is particularly severe in China. According to the China Health Statistical Yearbook 2021, CVDs rank first in both morbidity and mortality rates among urban and rural residents, surpassing cancer and other diseases [4].

Traditional diagnostic approaches for CVDs, including electrocardiograms (ECG), echocardiography, coronary angiography, stress testing, magnetic resonance imaging, and intracoronary ultrasonography, are often costly and not ideal for early-stage detection [5]. These methods are frequently inaccessible to primary healthcare facilities and economically disadvantaged regions due to the prohibitive costs of the required equipment. Moreover, many CVDs are asymptomatic in their early stages, and their progression can be slow, leading to clinical diagnoses often occurring at an advanced stage of the disease or incidentally during routine check-ups or assessments for other conditions. Therefore, it is crucial to identify more accessible and early screening indicators for CVDs.

Clinical laboratory tests, including hematological and biochemical analyses, provide quantitative measurements in the blood of both xenobiotics (foods, drugs, and their metabolites) and biotics (biomarkers) using validated, robust assays [6, 7]. Biochemical changes induced by disease can significantly impact various aspects of bioanalysis. Specifically, metabolic changes such as hyperglycemia, hypertriglyceridemia, abnormal high-density lipoprotein (HDL) cholesterol levels, hypertension, and a pro-inflammatory state are often present even in the early stages of CVDs [8]. However, doctors often focus on significantly abnormal parameters and may overlook much of the remaining test data and the interrelationships between laboratory parameters, so the diagnostic potential of these tests can be underestimated. Therefore, it is essential to study the reference ranges and variation characteristics of hematological and biochemical indicators, especially those related to metabolic health, to identify preventable risk factors early and to assist doctors in early-stage CVD detection.

With the advancement of electronic medical record systems, an increasing amount of clinical laboratory test data has become more accessible and reliable. The use of this data, in combination with artificial intelligence (AI), for disease diagnosis, prediction, monitoring, and prognosis is a rapidly growing field [9, 10]. Machine learning (ML), a subset of AI, has shown great promise in aiding the diagnosis of CVDs [1, 11,12,13]. Current ML-based studies on CVDs generally overlook clinical laboratory test data, instead focusing on more expensive and/or invasive imaging techniques such as computed tomography angiography (CTA), heart ultrasound, computed tomography (CT), ECG, and echocardiography [14,15,16,17,18]. Additionally, existing research often emphasizes predicting the risk and prognosis of individual diseases [19, 20]. However, there is limited systematic analysis of the distinguishing features and unique hematological characteristics of CVDs.

In summary, this study has three aims: (1) to develop cost-effective, large-scale screening models based on blood routine and biochemical test data using clinical data from the First Affiliated Hospital of Xiamen University. After multiple rounds of parameter optimization, the models we developed achieved high accuracy and can distinguish cardiovascular disease patients from healthy individuals, as well as differentiate between most types of cardiovascular diseases; (2) to leverage the strengths of machine learning to explore the diagnostic performance of multi-indicator combinations in blood routine and biochemical test data, identifying universal indicators for the diagnosis and classification of cardiovascular diseases; (3) to systematically compare and evaluate the unique hematological and metabolic characteristics of cardiovascular disease patients, providing clinicians with specialized insights for diagnosis and disease prevention.

Methods

Data collection and processing

All raw data came from inpatients in the Departments of Neurology and Cardiology and from healthy people who had physical examinations at the First Affiliated Hospital of Xiamen University between 2018 and 2023, and were extracted from the hospital information system. For patients, we used the blood routine and biochemical test data from the first test after hospitalization as features for model construction; for healthy people, we used the data from the first physical examination of each year. Because too many missing values may affect prediction accuracy, we removed features with a missing value ratio greater than 50%, which left 22 features from the blood routine and 28 features from the biochemical test data (Supplementary Tables 1 and 2). Diagnoses for all patients were assigned according to the International Statistical Classification of Diseases and Related Health Problems, 10th Revision (ICD-10). To ensure a sufficient sample size for each circulation system disease, we removed diseases with fewer than 100 samples. We also deleted samples in which more than 50% of the features were missing. In the end, 25,794 healthy people and 32,822 patients with circulation system disease were used to construct our models (Fig. 1; Table 1). These data were randomly divided into a training set (70%) and a validation set (30%).
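The filtering and splitting steps described above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: each sample is a dictionary of feature values with `None` marking a missing value, and the function names, the toy feature names, and the random seed are hypothetical rather than taken from the study.

```python
import random


def drop_sparse_features(rows, max_missing=0.5):
    """Keep only features whose missing-value ratio is at most max_missing."""
    features = list(rows[0].keys())
    n = len(rows)
    kept = [f for f in features
            if sum(r[f] is None for r in rows) / n <= max_missing]
    return [{f: r[f] for f in kept} for r in rows], kept


def drop_sparse_samples(rows, max_missing=0.5):
    """Drop samples in which more than max_missing of the features are missing."""
    return [r for r in rows
            if sum(v is None for v in r.values()) / len(r) <= max_missing]


def train_validation_split(rows, train_fraction=0.7, seed=0):
    """Shuffle the samples and split them into training and validation sets."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)  # seed is an arbitrary, assumed choice
    cut = int(round(train_fraction * len(shuffled)))
    return shuffled[:cut], shuffled[cut:]
```

In this sketch the feature filter runs before the sample filter, matching the order in which the two steps are described; the text does not say whether the split was stratified by disease, so a simple random split is assumed.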

Table 1 Data distribution of diseases
Fig. 1 The flow chart of this study

Machine learning methods

Logistic regression (LR) is a generalized linear model that is widely used in data mining, automatic disease diagnosis, economic forecasting, and other fields. It estimates the probability of an event from a set of independent variables; because the output is a probability, it ranges between 0 and 1. Random forest (RF) is an ensemble classifier built from many decision trees. It can be applied to classification and regression problems as well as dimensionality reduction, tolerates outliers and noise well, and has better prediction and classification performance than a single decision tree. Support vector machine (SVM) is a supervised binary classifier whose decision boundary is the maximum-margin hyperplane fitted to the training samples. eXtreme Gradient Boosting (XGBoost) is an engineering implementation of the Gradient Boosting Decision Tree (GBDT) algorithm; it is efficient, flexible, and lightweight, and has been widely used in data mining, recommender systems, and other fields. The deep neural network (DNN) is a deep learning framework, defined here as a neural network with at least one hidden layer. Like shallow neural networks, DNNs can model complex nonlinear systems, but the additional layers provide a higher level of abstraction and thus improve the model’s capability.

These algorithms also handle features differently. LR can optimize features through regularization. RF reduces the impact of feature noise by combining multiple decision trees. SVM uses kernel functions and regularization parameters to find an appropriate hyperplane in high-dimensional space, which indirectly affects feature selection. XGBoost optimizes feature usage in the tree structure through gradient boosting. DNN can automatically learn and refine features through multiple hidden layers, particularly when dealing with complex data. Because each algorithm has its own strengths in feature optimization, we selected LR, RF, SVM, XGBoost, and DNN to construct the models and compare their performance [21,22,23,24,25].
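As a small illustration of why the LR output always lies between 0 and 1, the sketch below passes a weighted sum of features through the logistic function. The weights, bias, and feature values are hypothetical; in practice they would be estimated from the training data.

```python
import math


def logistic_probability(features, weights, bias):
    """Map a linear combination of features to a probability in (0, 1)."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    # Numerically stable form of 1 / (1 + exp(-z)) for large |z|.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


# Hypothetical standardized values for two features and made-up coefficients.
p = logistic_probability([0.8, -1.2], weights=[0.5, -0.7], bias=0.1)
```

A sample is then assigned to the positive class when the probability exceeds a chosen threshold, commonly 0.5.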

To eliminate the impact of different feature scales on the accuracy of the prediction models, we standardized both the training and validation sets. We then performed hyperparameter selection for five machine learning algorithms using a combination of grid search cross-validation (CV) and manual fine-tuning. The parameters adjusted for LR were C, max_iter, penalty, and solver. For RF, the parameters were max_depth, min_samples_leaf, and n_estimators. For SVM, the parameters adjusted were C, gamma, and kernel. For XGBoost, the parameters were colsample_bytree, gamma, learning_rate, max_depth, n_estimators, and subsample. For DNN, the adjusted parameters included activation, number of layers, and number of neurons per layer. All optimal parameters were determined within the training set for the models distinguishing cardiovascular disease patients from healthy individuals. A 5-fold cross-validation was employed, with area under the curve (AUC) serving as the primary performance evaluation metric, to identify the best estimator (Supplementary Data 1).
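The standardization and cross-validated grid search described above can be sketched without any ML library. The sketch assumes that the scaling statistics are estimated on the training set and then applied to both sets, and that `score_fn` is a user-supplied, hypothetical callback that trains a model with the given parameters on the training indices and returns its AUC on the held-out fold. None of these names come from the study.

```python
import itertools
import random
import statistics


def fit_standardizer(train_rows):
    """Estimate column means and population standard deviations on the training set."""
    columns = list(zip(*train_rows))
    means = [statistics.fmean(c) for c in columns]
    stds = [statistics.pstdev(c) or 1.0 for c in columns]  # guard constant columns
    return means, stds


def standardize(rows, means, stds):
    """Apply z-score scaling with previously fitted statistics."""
    return [[(x - m) / s for x, m, s in zip(row, means, stds)] for row in rows]


def k_fold_indices(n_samples, k=5, seed=0):
    """Yield (train_idx, val_idx) pairs for k-fold cross-validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        train_idx = [j for fold in folds[:i] + folds[i + 1:] for j in fold]
        yield train_idx, folds[i]


def grid_search(param_grid, score_fn, n_samples, k=5):
    """Return the parameter combination with the highest mean fold score."""
    names = sorted(param_grid)
    best_params, best_score = None, float("-inf")
    for values in itertools.product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        scores = [score_fn(params, tr, va) for tr, va in k_fold_indices(n_samples, k)]
        mean_score = statistics.fmean(scores)
        if mean_score > best_score:
            best_params, best_score = params, mean_score
    return best_params, best_score
```

With scikit-learn, the same search is usually expressed through `GridSearchCV` with `scoring="roc_auc"` and `cv=5`; the hand-written loop is shown only to make the procedure explicit.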

LR, RF, and SVM were implemented with scikit-learn (version 1.3.0), XGBoost with the xgboost package (version 2.0.2), and the DNN with TensorFlow (version 2.0.2), all in Python.

Model performance evaluation

All models were trained using the best estimator and then validated on the validation set. Sensitivity (Sn), specificity (Sp), positive predictive value (PPV), negative predictive value (NPV), F1 score, Matthews correlation coefficient (MCC), and accuracy (Acc) were used to evaluate model performance. Their formulas are shown below [26,27,28]:

$${\text{Sn}} = \frac{{{\text{TP}}}}{{{\text{TP + FN}}}} $$
$${\text{Sp}} = \frac{{{\text{TN}}}}{{{\text{TN + FP}}}} $$
$$ {\text{PPV}} = \frac{{{\text{TP}}}}{{{\text{TP + FP}}}} $$
$$ {\text{NPV}} = \frac{{{\text{TN}}}}{{{\text{TN + FN}}}} $$
$$ {\text{Acc}} = \frac{{{\text{TP + TN}}}}{{{\text{TP + FN + TN + FP}}}} $$
$$ {\text{F1~score}} = \frac{{{\text{2TP}}}}{{{\text{2TP + FN + FP}}}} $$
$$ {\text{MCC}}~ = ~\frac{{{\text{TP}}~ \times ~{\text{TN}}~ - {\text{FP}}~ \times ~{\text{FN}}}}{{\sqrt {({\text{TP}}~ + ~{\text{FP}})({\text{TP}}~ + ~{\text{FN}})({\text{TN}}~ + {\text{FP}})({\text{TN}}~ + ~{\text{FN}})} }} $$

TP, TN, FP, and FN represent true positive, true negative, false positive, and false negative, respectively. We also used the area under the receiver operating characteristic (ROC) curve (AUC) to evaluate model performance comprehensively. Additionally, to assess the robustness of the models, 95% confidence intervals (CI) for all performance metrics were estimated on the validation set by bootstrapping [29,30,31].
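The formulas above, the AUC, and a bootstrap confidence interval can be sketched as follows. The text does not specify the bootstrap variant or the number of resamples, so the percentile method with 1,000 resamples is an assumption, as are all function names. Labels are assumed to be coded 1 for patients and 0 for healthy people.

```python
import math
import random


def _ratio(numerator, denominator):
    """Return numerator / denominator, or NaN when the denominator is zero."""
    return numerator / denominator if denominator else float("nan")


def classification_metrics(y_true, y_pred):
    """Compute Sn, Sp, PPV, NPV, Acc, F1 and MCC from binary labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    mcc_den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "Sn": _ratio(tp, tp + fn),
        "Sp": _ratio(tn, tn + fp),
        "PPV": _ratio(tp, tp + fp),
        "NPV": _ratio(tn, tn + fn),
        "Acc": _ratio(tp + tn, tp + fn + tn + fp),
        "F1": _ratio(2 * tp, 2 * tp + fn + fp),
        "MCC": _ratio(tp * tn - fp * fn, mcc_den),
    }


def roc_auc(y_true, scores):
    """AUC as the probability that a positive outranks a negative (ties count 0.5).

    Quadratic in the number of samples; fine for a sketch, slow for large sets.
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def bootstrap_ci(y_true, y_other, metric_fn, n_resamples=1000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for metric_fn(y_true, y_other) on resampled data."""
    rng = random.Random(seed)
    n = len(y_true)
    stats = []
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]
        yt = [y_true[i] for i in idx]
        if len(set(yt)) < 2:  # skip resamples that contain a single class
            continue
        stats.append(metric_fn(yt, [y_other[i] for i in idx]))
    stats.sort()
    lower = stats[int((alpha / 2) * len(stats))]
    upper = stats[int((1 - alpha / 2) * len(stats)) - 1]
    return lower, upper
```

For example, `bootstrap_ci(y_true, scores, roc_auc)` gives an interval for the AUC, and `bootstrap_ci(y_true, y_pred, lambda t, p: classification_metrics(t, p)["Acc"])` gives one for accuracy.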

Model interpretation

Because machine learning models behave as black boxes, the contribution of each feature is difficult to explain, so the SHAP (SHapley Additive exPlanations) algorithm was introduced in this study. The SHAP algorithm assigns each feature a SHAP value, which explains the impact of that feature on the model’s prediction [32]. SHAP values were computed with the shap Python package (version 0.44.0).
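To make the idea of a SHAP value concrete, the sketch below computes exact Shapley values for a toy three-feature model by enumerating all feature subsets. It is not how the shap package works internally (the package relies on much faster model-specific or sampling-based algorithms), and replacing absent features with a single baseline vector is a simplifying assumption. The model `toy_predict` and all numbers are hypothetical.

```python
import itertools
import math


def shapley_values(predict, instance, baseline):
    """Exact Shapley values of each feature for one instance.

    Features outside a coalition are replaced by their baseline value.
    Cost grows exponentially with the number of features.
    """
    n = len(instance)

    def value(subset):
        x = [instance[i] if i in subset else baseline[i] for i in range(n)]
        return predict(x)

    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for size in range(n):
            weight = (math.factorial(size) * math.factorial(n - size - 1)
                      / math.factorial(n))
            for subset in itertools.combinations(others, size):
                s = set(subset)
                phi[i] += weight * (value(s | {i}) - value(s))
    return phi


def toy_predict(x):
    """Hypothetical model: one additive term plus one interaction term."""
    return 2 * x[0] + x[1] * x[2]


phi = shapley_values(toy_predict, instance=[1, 2, 3], baseline=[0, 0, 0])
```

The values sum to the difference between the prediction for the instance and the prediction for the baseline, and the interaction term is shared equally between the two features involved.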

Identification of features for various types of CVDs

To identify the unique hematological and metabolic features of various cardiovascular diseases, we applied the SHAP algorithm to calculate SHAP values for 50 features across the 69 models distinguishing between different diseases. To ensure that the raw SHAP values were accurately represented in the heatmap, we did not normalize the values. We then performed hierarchical clustering on both rows and columns of the heatmap, reordered them according to the clustering results, and finally plotted the heatmap using Python.

To further explore the universal features distinguishing between various diseases, we selected the top ten features from the 69 models and connected these features with the respective diseases in a network graph. The size of each feature’s node in the network increases if it appears frequently among the top ten features across the models, indicating its potential as a universal distinguishing feature between the diseases. The network was visualized using Cytoscape (version 3.10.2) [33].
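The feature ranking and network construction described in this section can be sketched as below. The sketch assumes that the importance of a feature in one model is the sum of its absolute SHAP values over all samples, and that a feature node's size is the number of models in which it appears among the top k. The disease labels and SHAP numbers in the usage example are invented for illustration.

```python
from collections import Counter


def feature_importance(shap_rows):
    """Sum of absolute SHAP values per feature over all samples of one model."""
    n_features = len(shap_rows[0])
    return [sum(abs(row[j]) for row in shap_rows) for j in range(n_features)]


def top_k_features(importance, feature_names, k=10):
    """Names of the k features with the largest importance."""
    ranked = sorted(zip(feature_names, importance), key=lambda p: p[1], reverse=True)
    return [name for name, _ in ranked[:k]]


def build_feature_network(shap_by_model, feature_names, k=10):
    """Return disease-feature edges and how often each feature is in a top-k list."""
    edges = []
    for disease, rows in shap_by_model.items():
        for feature in top_k_features(feature_importance(rows), feature_names, k):
            edges.append((disease, feature))
    node_size = Counter(feature for _, feature in edges)
    return edges, node_size


# Hypothetical SHAP values: two models, two samples each, three features.
names = ["RBC", "K", "GLU"]
shap_by_model = {
    "DCM": [[0.5, -0.2, 0.1], [-0.4, 0.3, 0.0]],
    "AF": [[0.1, 0.6, -0.3], [0.0, -0.5, 0.2]],
}
edges, node_size = build_feature_network(shap_by_model, names, k=2)
```

The edge list can be exported as a two-column table and imported into Cytoscape, with `node_size` supplied as a node attribute that controls the circle size.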

Results

Circulation system disease prediction model construction

To ensure the accuracy of our prediction models, every circulation system disease included had over 100 samples (Table 1). The male-to-female ratio was similar in healthy people and circulation system disease patients, close to 1:1 in both groups. The largest age group was 40–60 years among healthy people (12,828) and 60–80 years among circulation system disease patients (18,868) (Supplementary Fig. 1). Subsequently, we chose five machine learning methods (LR, RF, SVM, XGBoost, and DNN) and used 22 features from blood routine and 28 features from biochemical detection to construct the circulation system disease prediction models. XGBoost showed the best overall performance (AUC: 0.9921 (0.9911–0.9930); Acc: 0.9618 (0.9588–0.9645); Sn: 0.9690 (0.9655–0.9723); Sp: 0.9526 (0.9477–0.9572); PPV: 0.9631 (0.9592–0.9668); NPV: 0.9600 (0.9556–0.9644); MCC: 0.9224 (0.9165–0.9279); F1 score: 0.9661 (0.9634–0.9686)) (Table 2). We also constructed models using only blood routine or only biochemical detection data, and found that combining blood routine with biochemical detection gave the best performance (Fig. 2B-D and Supplementary Data 2). Considering the imbalance in sample numbers among the 69 circulation system diseases, we also constructed a separate prediction model for each of the 69 diseases. The AUCs of these models all exceeded 0.9, and the highest reached 0.9996 (0.9992–0.9999) (Fig. 2A; Table 2). These models all showed good performance and robustness (Table 2).

Table 2 Model performance evaluation results (circulation system diseases vs. healthy, XGBoost)
Fig. 2 Construction of circulation system disease prediction model using clinical blood samples. (A) The AUC of 69 circulation system disease prediction models. (B–D) ROC curves of five machine learning methods using different data: (B) blood routine combined with biochemical detection; (C) blood routine; (D) biochemical detection

Classification of various circulation system diseases

To further subdivide circulation system diseases, we constructed 69 models, each distinguishing one circulation system disease from all the others, for example venous thrombosis of the lower extremities versus other circulation system diseases. XGBoost was selected to construct these models because of its good performance. The AUCs of these models ranged from 0.5256 to 0.9267. Surprisingly, the best-performing model was the one distinguishing dilated cardiomyopathy (DCM) from other circulation system diseases. DCM is a type of cardiomyopathy characterized by enlargement of the left or both ventricles with systolic dysfunction, and its diagnosis depends primarily on echocardiography and cardiac magnetic resonance rather than on blood routine and biochemical detection. These results indicated that the models could help doctors distinguish between different circulation system diseases (Fig. 3 and Supplementary Data 3).
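The one-versus-rest setup behind these 69 models can be sketched as follows. The sketch assumes one diagnosis label per patient and reuses the 100-sample minimum from the data processing step; the function name and the toy labels are hypothetical.

```python
from collections import Counter


def one_vs_rest_tasks(diagnoses, min_samples=100):
    """Build one binary label vector per disease (1 = this disease, 0 = any other).

    Diseases with fewer than min_samples patients are removed before the
    label vectors are built, so they do not appear in the "rest" group either.
    """
    counts = Counter(diagnoses)
    kept = [d for d in diagnoses if counts[d] >= min_samples]
    tasks = {disease: [int(d == disease) for d in kept]
             for disease in sorted(set(kept))}
    return kept, tasks


# Toy example with a lowered threshold so the filter is visible.
kept, tasks = one_vs_rest_tasks(["A", "A", "A", "B", "B", "C"], min_samples=2)
```

Each label vector in `tasks` would then be paired with the same feature matrix to train one binary classifier.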

Fig. 3 AUCs of the 69 models, each distinguishing one circulation system disease from the others

Analysis of circulation system disease-specific indicators

To better understand the contributions of the 50 features to the circulation system disease prediction model and to find circulation system disease-specific indicators, we used the SHAP algorithm to compute the contribution of each feature. For the model constructed using only the blood routine, the top 10 features were lymphocyte percentage (LY%), red blood cell count (RBC), absolute monocyte count (MO#), hematocrit (HCT), absolute neutrophil count (NE#), mean corpuscular hemoglobin concentration (MCHC), plateletcrit (PCT), white blood cell count (WBC), platelet distribution width (PDW), and mean platelet volume (MPV) (Fig. 4A). For the model constructed using only the biochemical detection data, the top 10 features were potassium (K), albumin (ALB), total protein (TP), indirect bilirubin (NBIL), direct bilirubin (DBIL), sodium (Na), glucose (GLU), triglycerides (TG), cholesterol (CHO), and apolipoprotein A1 (APOA1) (Fig. 4B). Interestingly, for the model combining the blood routine with biochemical detection, only one blood routine feature, LY%, was among the top 10 features (Fig. 4C). These results indicated that features from biochemical detection made the major contributions to predicting circulation system disease (Fig. 4D). Additionally, to further validate the importance of these features, we calculated the top 20 features ranked by the other four machine learning methods. Although feature rankings varied slightly across methods, the top 20 features overlapped considerably, indicating that our model interpretation approach is stable (Supplementary Data 4).

Fig. 4 The top 20 features for the circulation system disease prediction model using different data. (A) Blood routine. (B) Biochemical detection. (C) Blood routine combined with biochemical detection. Red represents a high feature value and blue a low feature value. A positive SHAP value represents a positive effect of the feature on the model, and vice versa. All features are listed in order of importance from top to bottom. (D) The joyplot of numerical distributions of K, TP, ALB, and NBIL among various circulation system diseases and healthy people

To verify whether the performance of the XGBoost was affected by redundant features, we constructed the model to distinguish cardiovascular disease patients from healthy individuals using only the top 10 features ranked by the SHAP algorithm (Fig. 4C). The results showed the model built using all 50 features performed better, indicating that the performance of our models was not impacted by redundant features (Supplementary Fig. 2).

Analysis of characteristic indicators of discrimination between various circulation system diseases

After exploring circulation system disease-specific indicators, we further explored the indicators that discriminate between individual circulation system diseases. We displayed the SHAP value of each feature in a heatmap. Rows and columns were clustered separately, so that more similar features or diseases lie closer together. We found that every circulation system disease had distinctive characteristics (Fig. 5A and Supplementary Data 5). A network of the features shared among diseases showed that RBC, K, DBIL, and GLU were the top 4 features for subdividing circulation system diseases (Fig. 5B). Elevated GLU is often associated with diabetes, but we found that GLU could also be used to distinguish between different circulation system diseases. DBIL, also known as conjugated bilirubin, is formed in the liver when indirect bilirubin is conjugated with glucuronic acid by glucuronosyltransferase, and its elevation is usually related to liver dysfunction. Our results show that it also has considerable potential for distinguishing circulation system diseases. The numerical distributions of the top 4 features differed among the circulation system diseases and healthy people (Fig. 5C), which supports the reliability of our models.

Fig. 5 Analysis of specific indicators for differentiation between different circulation system diseases. (A) The heatmap displays SHAP values of 50 features for each disease differentiation model. The positive SHAP value is added to the absolute value of the negative SHAP value to form the final SHAP value to be displayed. (B) The network shows the top 10 features shared among different disease differentiation models. The red circles represent circulation system diseases, and the blue circles represent features. The larger the blue circle, the more disease differentiation models include that feature among their top 10. (C) The joyplot of numerical distributions of RBC, K, DBIL, and GLU among various circulation system diseases and healthy people

Discussion

Cardiovascular disease (CVD) remains the leading cause of death globally [2]. Early-stage detection of CVD is an important way of reducing this toll. Earlier detection is required to improve therapeutic strategies and patient risk stratification. Therefore, there is an urgent need for novel, effective, and targeted therapies with more precise risk stratification, which necessitates a deeper understanding of the molecular mechanisms that drive the progression of CVD.

From a 6-year population-based cohort of the First Affiliated Hospital of Xiamen University, this study enrolled 32,822 CVD and 25,794 CVD-free participants. We implemented five ML-based data-driven pipelines (LR, RF, SVM, XGBoost, and DNN) to identify predictors from 50 candidate variables, comprising 22 features from the blood routine and 28 features from blood biochemical tests, and assessed the resulting classifiers to establish CVD risk prediction models. Our models achieved satisfactory discriminative performance, with a best AUC of 0.9921. We further constructed predictive models to distinguish among 69 common CVDs. Most of these models can discriminate among multiple CVDs, with particularly notable performance in distinguishing DCM (AUC = 0.9267) from the others.

In this study, we developed predictive models using blood routine and biochemical test data. These models have the potential to be reliable methods for early diagnosis and large-scale screening for CVDs in populations. Recent studies have shown that bilirubin is not just a byproduct of heme degradation but also a crucial endogenous antioxidant [34]. The biochemical processes underlying the relationship between raised DBIL and higher CHD risk remain unclear, although in middle-aged and older adults, DBIL is independently linked to a linear dose-response increased risk for CHD incidence [35]. It has been reported that DBIL is more readily available in an active state because it is soluble in serum and only weakly bound to albumin. In the meantime, it may be hard for water-soluble DBIL to penetrate the vascular intima of the atherosclerotic plaque and function as an antioxidant [36]. This affirmed the significance of circulating bilirubin, including DBIL and NBIL, as predictive features in our models. The diagnostic relevance pertains to the utility of circulating bilirubin concentrations as a novel and reliable marker of cardiovascular disease risk. This biomarker can be readily measured in clinical laboratories and implemented in medical practice.

Diabetes mellitus is associated with a significantly increased risk of cardiovascular diseases. Chronic hyperglycemia is known to induce mitochondrial dysfunction and endoplasmic reticulum stress, promote the accumulation of reactive oxygen species (ROS), and consequently lead to cardiovascular damage [37]. Similarly, hypertriglyceridemia has been implicated in promoting cardiovascular disease through multiple mechanisms, including the upregulation of signaling pathways that mediate inflammation, oxidative stress, thrombosis, endothelial dysfunction, and vascular impairment [38]. Furthermore, serum cholesterol (CHO) levels have been demonstrated to be associated with an increased risk of CVD [39]. Apolipoprotein (APO) A1, the principal apolipoprotein of plasma high-density lipoproteins (HDLs), possesses multiple well-documented cardioprotective functions [40]. Our models align with these established metabolic risk factors: GLU, TG, CHO, and APOA1 were all among the top predictors in this study. This alignment enhances the models’ clinical utility, demonstrating their potential to identify individuals at risk of cardiovascular events based on readily accessible parameters.

Furthermore, vascular inflammation and associated chronic pro-inflammatory states are considered key factors in the development of CVD [41]. Previous clinical investigations have demonstrated that peripheral blood lymphocytes are associated with the prognosis of heart failure [42]. Lymphocytopenia in chronic heart failure patients may result from programmed lymphocyte death due to excessive sympathetic activation and increased oxidative stress and pro-inflammatory status [43]. The model constructed in our work demonstrates the role of lymphocyte percentage in the diagnosis of CVD, which is consistent with these previous investigations.

Overall, the predictors derived from our data-driven pipeline have been validated by numerous studies, which supports the reliability of our model; however, this is the first time that the ten predictors have been combined to establish a CVD risk prediction model. Our models underscore the importance of blood lipid and glucose levels, as well as circulating bilirubin, in the prediction of CVDs.

One notable strength of our study is that all the top 10 predictors for model development can be easily obtained through blood sampling, which provides the general population with the opportunity to perform automated and rapid health screening. It also gives clinicians a tool to help them diagnose heart problems early on. As a result, it will be easier to treat patients effectively and avoid serious repercussions.

While earlier studies have primarily focused on the prediction and diagnosis of specific cardiovascular diseases, such as coronary artery disease [44, 45], atrial fibrillation [46], major adverse cardiovascular events in patients with diabetes [47], and heart failure [48], comprehensive approaches that encompass the entire cardiovascular system remain relatively underexplored. We performed an extensive analysis of 69 prevalent cardiovascular diseases and developed diagnostic models. Additionally, our comprehensive approach in constructing the model included an analysis of distinctions between different CVDs, thereby providing physicians with improved diagnostic differentiation.

While the analysis of clinical data is commonly employed in diagnosis, this practice is less prevalent in CVD diagnostics, where available data are often limited to advanced imaging modalities and invasive hemodynamic assessments [49,50,51]. The availability of data is an essential prerequisite for advancing the clinical application of machine learning. Our research uses hematological data, which are not only more readily accessible but also significantly more cost-effective.

Several caveats should be considered. Given the effect of biological variables such as sex and age on cardiovascular risk [52], it is imperative to integrate datasets from increasing numbers of donors to evaluate the influence of these variables on human cardiovascular disease. Moreover, patients with cardiovascular disease, especially those of advanced age, often have comorbidities such as diabetes mellitus, obesity, and high blood pressure, which will need to be considered in the analysis and interpretation. This study’s limitation is its single-center retrospective design, with a sample confined to patients from the First Affiliated Hospital of Xiamen University. Consequently, some results may not be generalizable to other populations. Further validation requires studies involving diverse populations and multiple centers.

Conclusions

In summary, our study developed cost-effective, large-scale screening models based on blood routine and biochemical test data. These models are capable of distinguishing not only cardiovascular disease patients from healthy individuals but also differentiating between various types of cardiovascular diseases (Supplementary Fig. 3). We identified K, TP, ALB, and NBIL as universal indicators for distinguishing cardiovascular disease patients from healthy individuals, while RBC, K, DBIL, and GLU were found to be universal indicators for distinguishing between different types of cardiovascular diseases. Additionally, we identified unique hematological and metabolic characteristics for each type of cardiovascular disease, which could provide clinicians with specialized insights for early disease prevention and diagnosis.

Availability of data and materials

No datasets were generated or analysed during the current study.

Abbreviations

CVDs: Cardiovascular diseases
LR: Logistic regression
RF: Random forest
SVM: Support vector machine
XGBoost: Extreme Gradient Boosting
GBDT: Gradient boosting decision tree
DNN: Deep neural network
DCM: Dilated cardiomyopathy
RBC: Red blood cell count
LY%: Lymphocyte percentage
HCT: Hematocrit

References

  1. Cheng X, Manandhar I, Aryal S, et al. Application of artificial intelligence in cardiovascular medicine. Compr Physiol. 2021;11(4):2455–66.

  2. Roth GA, Mensah GA, Johnson CO, et al. Global burden of cardiovascular diseases and risk factors, 1990–2019: update from the GBD 2019 study. J Am Coll Cardiol. 2020;76(25):2982–3021.

  3. Lindstrom M, DeCleene N, Dorsey H, et al. Global burden of cardiovascular diseases and risks collaboration, 1990–2021. J Am Coll Cardiol. 2022;80(25):2372–425.

  4. The W. Report on cardiovascular health and diseases in China 2022: an updated summary. Biomed Environ Sci. 2023;36(8):669–701.

  5. Leening MJ, Siregar S, Vaartjes I, et al. Heart disease in the Netherlands: a quantitative update. Neth Heart J. 2014;22(1):3–10.

  6. Bandesh K, Jha P, Giri AK, et al. Normative range of blood biochemical parameters in urban Indian school-going adolescents. PLoS ONE. 2019;14(3): e0213255.

  7. Wolthuis A. Impact of disease on interferences in blood bioanalysis. Bioanalysis. 2011;3(19):2223–31.

  8. Menotti A, Lanti M, Zanchetti A, et al. The role of HDL cholesterol in metabolic syndrome predicting cardiovascular events. The Gubbio population study. Nutr Metab Cardiovasc Dis. 2011;21(5):315–22.

  9. Rabbani N, Kim G, Suarez CJ, et al. Applications of machine learning in routine laboratory medicine: current state and future directions. Clin Biochem. 2022;103:1–7.

  10. Ronzio L, Cabitza F, Barbaro A, et al. Has the flood entered the basement? A systematic literature review about machine learning in laboratory medicine. Diagnostics (Basel). 2021;11(2).

  11. Mathur P, Srivastava S, Xu X, et al. Artificial intelligence, machine learning, and cardiovascular disease. Clin Med Insights Cardiol. 2020;14:1522409556.

  12. Attia ZI, Harmon DM, Behr ER, et al. Application of artificial intelligence to the electrocardiogram. Eur Heart J. 2021;42(46):4717–30.

  13. Fernandez-Luque L, Imran M. Humanitarian health computing using artificial intelligence and social media: a narrative literature review. Int J Med Inform. 2018;114:136–42.

  14. Panjiyar BK, Davydov G, Nashat H, et al. A systematic review: Do the use of machine learning, deep learning, and artificial intelligence improve patient outcomes in acute myocardial ischemia compared to clinician-only approaches? Cureus. 2023;15(8): e43003.

  15. Chen L, Han Z, Wang J, et al. The emerging roles of machine learning in cardiovascular diseases: a narrative review. Ann Transl Med. 2022;10(10):611.

  16. Muse ED, Topol EJ. Guiding ultrasound image capture with artificial intelligence. Lancet. 2020;396(10253):749.

  17. Attia ZI, Kapa S, Lopez-Jimenez F, et al. Screening for cardiac contractile dysfunction using an artificial intelligence-enabled electrocardiogram. Nat Med. 2019;25(1):70–4.

  18. Shu S, Ren J, Song J. Clinical application of machine learning-based artificial intelligence in the diagnosis, prediction, and classification of cardiovascular diseases. Circ J. 2021;85(9):1416–25.

  19. Roh J, Houstis N, Rosenzweig A. Why don’t we have proven treatments for HFpEF? Circ Res. 2017;120(8):1243–5.

  20. Shah SJ, Katz DH, Selvaraj S, et al. Phenomapping for novel classification of heart failure with preserved ejection fraction. Circulation. 2015;131(3):269–79.

  21. Wang H, Liang P, Zheng L, et al. eHSCPr discriminating the cell identity involved in endothelial to hematopoietic transition. Bioinformatics. 2021;37(15):2157–64.

  22. Tang H, Zhao YW, Zou P, et al. HBPred: a tool to identify growth hormone-binding proteins. Int J Biol Sci. 2018;14(8):957–64.

  23. Kumar A, Loharch S, Kumar S, et al. Corrigendum to "Exploiting cheminformatic and machine learning to navigate the available chemical space of potential small molecule inhibitors of SARS-CoV-2" [Computational and Structural Biotechnology Journal 19 (2021) 424–438]. Comput Struct Biotechnol J. 2023;21:4408.

  24. Zhang D, Xu ZC, Su W, et al. iCarPS: a computational tool for identifying protein carbonylation sites by novel encoded features. Bioinformatics. 2021;37(2):171–7.

  25. Eichler J. Protein glycosylation. Curr Biol. 2019;29(7):R229–31.

  26. Wu H, Wu Y, Jiang Y, et al. scHiCStackL: a stacking ensemble learning-based method for single-cell Hi-C classification using cell embedding. Brief Bioinform. 2022;23(1).

  27. Meng L, Chan WS, Huang L, et al. Mini-review: recent advances in post-translational modification site prediction based on deep learning. Comput Struct Biotechnol J. 2022;20:3522–32.

  28. Liu M, Zhou J, Xi Q, et al. A computational framework of routine test data for the cost-effective chronic disease prediction. Brief Bioinform. 2023;24(2).

  29. Ning W, Lei S, Yang J, et al. Open resource of clinical data from patients with pneumonia for the prediction of COVID-19 outcomes via deep learning. Nat Biomed Eng. 2020;4(12):1197–207.

  30. Altan G. DeepOCT: an explainable deep learning architecture to analyze macular edema on OCT images. Eng Sci Technol Int J-JESTECH. 2022;34.

  31. Altan G. Breast cancer diagnosis using deep belief networks on ROI images. Pamukkale Üniversitesi Mühendislik Bilimleri Dergisi. 2022;28(2):286–91.

  32. Wang K, Tian J, Zheng C, et al. Interpretable prediction of 3-year all-cause mortality in patients with heart failure caused by coronary heart disease based on machine learning and SHAP. Comput Biol Med. 2021;137: 104813.

  33. Shannon P, Markiel A, Ozier O, et al. Cytoscape: a software environment for integrated models of biomolecular interaction networks. Genome Res. 2003;13(11):2498–504.

  34. Seppen J, Bosma P. Bilirubin, the gold within. Circulation. 2012;126(22):2547–9.

  35. Lai X, Fang Q, Yang L, et al. Direct, indirect and total bilirubin and risk of incident coronary heart disease in the Dongfeng-Tongji cohort. Ann Med. 2018;50(1):16–25.

  36. Franchini M, Targher G, Lippi G. Serum bilirubin levels and cardiovascular disease risk: a Janus Bifrons? Adv Clin Chem. 2010;50:47–63.

  37. Fiorentino TV, Prioletta A, Zuo P, et al. Hyperglycemia-induced oxidative stress and its role in diabetes mellitus related cardiovascular diseases. Curr Pharm Des. 2013;19(32):5695–703.

  38. Reiner Ž. Hypertriglyceridaemia and risk of coronary artery disease. Nat Rev Cardiol. 2017;14(7):401–11.

  39. Stamler J, Daviglus ML, Garside DB, et al. Relationship of baseline serum cholesterol levels in 3 large cohorts of younger men to long-term coronary, cardiovascular, and all-cause mortality and to longevity. JAMA. 2000;284(3):311–8.

  40. Nacarelli GS, Fasolino T, Davis S. Dietary, macronutrient, micronutrient, and nutrigenetic factors impacting cardiovascular risk markers apolipoprotein B and apolipoprotein A1: a narrative review. Nutr Rev. 2024;82(7):949–62.

  41. Silveira RJ, Barbalho SM, Reverete DAR, et al. Metabolic syndrome and cardiovascular diseases: going beyond traditional risk factors. Diabetes Metab Res Rev. 2022;38(3): e3502.

  42. Ommen SR, Hodge DO, Rodeheffer RJ, et al. Predictive power of the relative lymphocyte concentration in patients with advanced heart failure. Circulation. 1998;97(1):19–22.

  43. Weng TP, Fu TC, Wang CH, et al. Activation of lymphocyte autophagy/apoptosis reflects haemodynamic inefficiency and functional aerobic impairment in patients with heart failure. Clin Sci (Lond). 2014;127(10):589–602.

  44. Shapiro D, Lee K, Asmussen J, et al. Evolutionary action-machine learning model identifies candidate genes associated with early-onset coronary artery disease. J Am Heart Assoc. 2023;12(17): e029103.

  45. Trigka M, Dritsas E. Long-term coronary artery disease risk prediction with machine learning models. Sensors (Basel). 2023;23(3).

  46. Lu Y, Chen Q, Zhang H, et al. Machine learning models of postoperative atrial fibrillation prediction after cardiac surgery. J Cardiothorac Vasc Anesth. 2023;37(3):360–6.

  47. Abegaz TM, Baljoon A, Kilanko O, et al. Machine learning algorithms to predict major adverse cardiovascular events in patients with diabetes. Comput Biol Med. 2023;164: 107289.

  48. Kyodo A, Kanaoka K, Keshi A, et al. Heart failure with preserved ejection fraction phenogroup classification using machine learning. ESC Heart Fail. 2023;10(3):2019–30.

  49. Wang YJ, Yang K, Wen Y, et al. Screening and diagnosis of cardiovascular disease using artificial intelligence-enabled cardiac magnetic resonance imaging. Nat Med. 2024;30(5):1471–80.

  50. Sun Z. Multislice computed tomography angiography in the diagnosis of cardiovascular disease: 3D visualizations. Front Med. 2011;5(3):254–70.

  51. Givertz MM, Fang JC, Sorajja P, et al. Executive summary of the SCAI/HFSA clinical expert consensus document on the use of invasive hemodynamics for the diagnosis and management of cardiovascular disease. J Card Fail. 2017;23(6):487–91.

  52. You J, Guo Y, Kang JJ, et al. Development of machine learning-based models to predict 10-year risk of cardiovascular disease: a prospective cohort study. Stroke Vasc Neurol. 2023;8(6):475–85.

Acknowledgements

The authors would like to thank all the patients who participated in this trial as well as their families. We thank other team members for assisting with manuscript preparation.

Funding

This work was supported by the National Key R&D Program of China [2021ZD0201300 and 2022YFC2704300], the National Natural Science Foundation of China [82371200 and 82171474], and the Natural Science Foundation of Fujian Province [2020J05310]. The funders had no role in the design and conduct of the study; collection, management, analysis, and interpretation of the data; preparation, review, or approval of the manuscript; and decision to submit the manuscript for publication. The corresponding author had full access to all the data in the study and had final responsibility for the decision to submit for publication.

Author information

Contributions

ZW, WN, YG, and SL wrote the manuscript together. YY, ZW, WN, QC, YG, and LH completed the data collection, investigation, and analysis. ZW and LH contributed to the methodology. WN, YY, and GH designed the study and contributed to conceptualization, funding acquisition, reviewing, and editing. All authors read and approved the final version of the manuscript.

Corresponding authors

Correspondence to Yunyun Yang, Guolin Hong or Wanshan Ning.

Ethics declarations

Ethics approval and consent to participate

The protocol was approved by the Ethics Committee of the First Affiliated Hospital of Xiamen University (XMYY-2023KYSB088).

Consent for publication

Not applicable.

Competing interests

The authors declare no competing interests.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.

About this article

Cite this article

Wang, Z., Gu, Y., Huang, L. et al. Construction of machine learning diagnostic models for cardiovascular pan-disease based on blood routine and biochemical detection data. Cardiovasc Diabetol 23, 351 (2024). https://doi.org/10.1186/s12933-024-02439-0

Keywords