Machine-learning to stratify diabetic patients using novel cardiac biomarkers and integrative genomics

Table 2 Overview of six machine-learning model analyses on all 345 features in binary classification

Model	Training	Training (StDev)	Testing	Testing (StDev)	F1 score	Important features	Important feature bias	AUC
LR	0.608	0.301	0.667	0.0	0.640	Complex III, Complex I, CpG31, CpG28, CpG30, Complex IV, CpG8, CpG4, CpG12, Age	(− 2.688), (− 1.688), (1.648), (− 1.163), (− 1.016), (0.982), (0.945), (0.887), (0.882), (0.848)	NA
LDA	0.567	0.203	0.556	0.0	0.400	SNP16245, SNP16344, SNP151, SNP5463, SNP4295, SNP13722, SNP94, SNP15884, SNP9055, SNP477	(− 3.896E+15), (− 3.896E+15), (− 3.896E+15), (− 3.896E+15), (− 2.719E+15), (− 2.719E+15), (3.398E+14), (3.398E+14), (3.398E+14), (0.266)	0.700
KNN	0.642	0.239	0.444	0.0	0.430	NA	NA	0.600
NB	0.725	0.227	0.778	0.0	0.780	Mito 5hmC, Methyltransferase	(1.000), (0.000)	0.775
SVM	0.583	0.337	0.667	0.0	0.640	Complex III, CpG31, Complex I, CpG28, CpG8, CpG22, CpG12, CpG29, CpG4, CpG35	(− 0.732), (0.488), (− 0.443), (− 0.372), (0.350), (− 0.349), (0.322), (− 0.260), (0.259), (0.257)	NA
CART	0.790	0.209	0.711	0.1	0.714	CpG24, CpG28, Nuc 5mC, CpG11, CpG23, CpG1, CpG4	(0.587%), (0.213%), (0.040%), (0.040%), (0.040%), (0.040%), (0.040%)	0.715

Each model analysis was run five times, and averages are reported for the resulting training accuracy, training standard deviation, testing accuracy, testing standard deviation, F1 score, and area under the curve (AUC). The important biomarker features of each trained model are listed in order of influence within the model, together with the influence value for each feature. For LR, LDA, and SVM, the feature bias is a signed influence parameter: a positive value indicates that the biomarker favors classification towards one label, a negative value indicates that it favors the opposite label, and the larger the magnitude, the more strongly the feature shifts classification. For NB, feature influence indicates the most important biomarker per class in the binary (0, 1) classification scheme. For CART, feature bias percentages indicate each feature's influence on the resulting classification tree; larger percentages indicate a feature that arises near the beginning of the tree, before subsequent branching. Influence is not provided for KNN due to model restrictions. LR logistic regression, LDA linear discriminant analysis, KNN k-nearest neighbors, NB naive Bayes, SVM support vector machine, CART classification and regression tree
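
As an illustration of how the signed influence values for LR, LDA, and SVM can be read, the following minimal sketch ranks features by the magnitude of their influence and reports the sign of each. The values are copied from the LR row of Table 2. The function name `rank_by_influence` and the direction labels are hypothetical and are not part of the original analysis; which class label a positive or negative sign corresponds to is not stated in the table, so the sketch reports only the sign.

```python
# Signed influence values copied from the LR row of Table 2.
lr_influence = {
    "Complex III": -2.688,
    "Complex I": -1.688,
    "CpG31": 1.648,
    "CpG28": -1.163,
    "CpG30": -1.016,
    "Complex IV": 0.982,
    "CpG8": 0.945,
    "CpG4": 0.887,
    "CpG12": 0.882,
    "Age": 0.848,
}


def rank_by_influence(influence):
    """Return (feature, value, direction) tuples sorted by descending |value|.

    Hypothetical helper: the magnitude sets the rank, and the sign only says
    which of the two labels the feature pushes the classification towards.
    """
    ranked = sorted(influence.items(), key=lambda kv: abs(kv[1]), reverse=True)
    result = []
    for name, value in ranked:
        if value > 0:
            direction = "positive"
        elif value < 0:
            direction = "negative"
        else:
            direction = "neutral"
        result.append((name, value, direction))
    return result


ranked_lr = rank_by_influence(lr_influence)
for name, value, direction in ranked_lr:
    print(f"{name:12s} {value:+.3f}  ({direction})")
```

Sorting by absolute value rather than by raw value keeps strongly negative features such as Complex III at the top of the ranking, which matches the order in which the features are listed in the table.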

ISSN: 1475-2840