XGBoost and cost-sensitive CART for imbalanced multiclass diabetes classification in Iraq

Authors:

Nabila A. Alsharif, Inaam Aboud Hussain, Loaiy F. Naji

DOI NO:

https://doi.org/10.26782/jmcms.2026.02.00001

Keywords:

Classification, XGBoost, CART, Class imbalance, Diabetes, Pre diabetic

Abstract

Diabetes imposes a substantial public health burden. According to the International Diabetes Federation, there were about 3.4 million diabetes-related deaths worldwide in 2024; in Iraq, the Federation reports that one in nine adults was living with diabetes in 2024, with 14,683 adult deaths attributable to diabetes and a total diabetes-related health expenditure of 2,078 million United States dollars. The dataset analyzed in this study contains 1,000 records collected in 2020 from two Iraqi teaching hospitals and includes multiple clinical and laboratory measurements with three outcome classes: Non diabetic, Pre diabetic, and Diabetic. The Pre diabetic class has low prevalence, and the overall class distribution is imbalanced. The data are challenging because they contain many outliers, non-homogeneous covariance matrices across classes, exact duplicate rows, and linear correlations among certain variables. The study objective was to train and evaluate models that discriminate among the three classes and yield accurate, well-calibrated predictions for future cases in similar clinical settings. Because these diagnostic properties of the data limited the applicability of classical discriminant functions, two supervised learners were employed: Classification and Regression Trees (CART) and Extreme Gradient Boosting (XGBoost). Preprocessing removed the exact duplicate rows and excluded VLDL, which is algebraically derived from triglycerides (in mmol per liter) as VLDL = triglycerides / 2.2 and would therefore introduce redundancy and multicollinearity.
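The two preprocessing steps described above (dropping exact duplicate rows and excluding the derived VLDL column) can be illustrated with a minimal sketch. This is not the authors' code: the column names `TG`, `VLDL`, and `HbA1c`, the helper names, and the toy values are all assumptions made for illustration.

```python
from typing import Dict, List

Record = Dict[str, float]


def drop_exact_duplicates(rows: List[Record]) -> List[Record]:
    """Keep the first occurrence of each fully identical row."""
    seen = set()
    unique = []
    for row in rows:
        key = tuple(sorted(row.items()))  # order-independent row signature
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


def vldl_from_tg(tg_mmol_per_l: float) -> float:
    """VLDL derived from triglycerides in mmol/L, as stated in the abstract."""
    return tg_mmol_per_l / 2.2


def drop_column(rows: List[Record], name: str) -> List[Record]:
    """Return copies of the rows without the named column."""
    return [{k: v for k, v in row.items() if k != name} for row in rows]


# Hypothetical toy records; the real dataset has many more variables.
rows = [
    {"TG": 2.2, "VLDL": 1.0, "HbA1c": 6.1},
    {"TG": 2.2, "VLDL": 1.0, "HbA1c": 6.1},  # exact duplicate of the first row
    {"TG": 1.1, "VLDL": 0.5, "HbA1c": 5.2},
]

unique_rows = drop_exact_duplicates(rows)

# Check that VLDL carries no information beyond TG before dropping it.
vldl_is_redundant = all(
    abs(r["VLDL"] - vldl_from_tg(r["TG"])) < 1e-9 for r in unique_rows
)

cleaned = drop_column(unique_rows, "VLDL")
```

In real laboratory data the recorded VLDL values may be rounded, so a looser tolerance than the one used here would likely be needed for the redundancy check.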
On the held-out test set, XGBoost achieved higher Accuracy (98.18 percent versus 97.58 percent for CART) and higher Balanced Accuracy (93.84 percent versus 88.16 percent for CART), indicating that XGBoost provided the stronger overall operating point for this three-class task, while CART remains useful when simple and transparent rules are required.
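The gap between Accuracy and Balanced Accuracy reported above is typical of imbalanced data. The sketch below assumes that Balanced Accuracy means the unweighted mean of per-class recall, which is the common multiclass definition; the abstract does not spell out the formula. The labels and predictions are hypothetical toy values, not the study's results.

```python
from collections import defaultdict
from typing import Sequence


def accuracy(y_true: Sequence[str], y_pred: Sequence[str]) -> float:
    """Fraction of cases whose predicted class matches the true class."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def balanced_accuracy(y_true: Sequence[str], y_pred: Sequence[str]) -> float:
    """Unweighted mean of per-class recall (assumed definition)."""
    total = defaultdict(int)
    hits = defaultdict(int)
    for t, p in zip(y_true, y_pred):
        total[t] += 1
        if t == p:
            hits[t] += 1
    recalls = [hits[c] / total[c] for c in total]
    return sum(recalls) / len(recalls)


# Toy test set: 6 Diabetic, 3 Non diabetic, 1 Pre diabetic (hypothetical).
y_true = ["Diabetic"] * 6 + ["Non diabetic"] * 3 + ["Pre diabetic"]
# The model gets every case right except the single Pre diabetic one.
y_pred = ["Diabetic"] * 6 + ["Non diabetic"] * 3 + ["Diabetic"]

acc = accuracy(y_true, y_pred)            # 9 of 10 correct
bal_acc = balanced_accuracy(y_true, y_pred)  # mean of recalls 1, 1, 0
```

Here a single error on the rare class barely moves Accuracy (0.9) but pulls Balanced Accuracy down to about 0.67, which is why the second metric is the more informative one for the low-prevalence Pre diabetic class.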

Reference:

I. American Diabetes Association Professional Practice Committee. 2. Diagnosis and Classification of Diabetes: Standards of Care in Diabetes, 2024. Diabetes Care. 2024;47(Suppl 1):S20–S42. 10.2337/dc24-S002.
II. Breiman, L., Friedman, J. H., Olshen, R. A., & Stone, C. J. (1984). Classification and Regression Trees. Monterey, CA: Wadsworth. (Reprinted by Chapman & Hall/CRC, Taylor & Francis, 2017). 10.1201/9781315139470
III. Brier, G. W. (1950). Verification of forecasts expressed in terms of probability. Monthly Weather Review, 78(1), 1–3. 10.1175/1520-0493(1950)078<0001:VOFEIT>2.0.CO;2
IV. Chen T, Guestrin C. XGBoost: A Scalable Tree Boosting System. Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 2016:785–794. doi:10.1145/2939672.2939785.
V. Duncan BB, et al. IDF Diabetes Atlas 11th edition 2025: global prevalence and key metrics. Nephrology Dialysis Transplantation. 2025. 10.1093/ndt/gfaf177.
VI. Friedewald, W. T., Levy, R. I., & Fredrickson, D. S. (1972). Estimation of the concentration of low-density lipoprotein cholesterol in plasma, without use of the preparative ultracentrifuge. Clinical Chemistry, 18(6), 499–502. 10.1093/clinchem/18.6.499
VII. Friedman, J. H. (2001). Greedy function approximation: A gradient boosting machine. The Annals of Statistics, 29(5), 1189–1232. 10.1214/aos/1013203451
VIII. Fawcett, T. (2006). An introduction to ROC analysis. Pattern Recognition Letters, 27(8), 861–874. https://doi.org/10.1016/j.patrec.2005.10.010
IX. Guo, R., Liu, J., Yang, X., Wang, X., Xu, Z., & Wu, K. (2025). Enhance health evidence quality in classification tasks. Digital Health, 11, 20552076251314097. 10.1177/20552076251314097
X. Hastie, T., Tibshirani, R., & Friedman, J. (2009). The Elements of Statistical Learning: Data Mining, Inference, and Prediction (2nd ed.). Springer. 10.1007/978-0-387-84858-7
XI. International Diabetes Federation (IDF). IDF Diabetes Atlas, 11th edition (global facts and projections). Available at: https://diabetesatlas.org/data-by-location/global/ and Global Factsheet PDF (2025).
XII. Idhom, M., Fauzi, A., Muhaimin, A., & Caesarendra, W. (2025). Evaluation of CART and XGBoost Methods on Customer Loan Risk Prediction Based on Consumer Behavior. TEM Journal, 14(3), 2624–2630. 10.18421/TEM143-64
XIII. Nuankaew, P.; Chaising, S.; Temdee, P. (2021). Average Weighted Objective Distance-Based Method for Type 2 Diabetes Prediction. IEEE Access, 9, 137015–137028. 10.1109/ACCESS.2021.3117269
XIV. Rashid, Ahlam (2020), “Diabetes Dataset”, Mendeley Data, V1, 10.17632/wj9rwkp9c2.1
XV. Rainio, O., Teuho, J., & Klén, R. (2024). Evaluation metrics and statistical tests for machine learning. Scientific Reports, 14, 6086. 10.1038/s41598-024-56706-x
XVI. Sabariah, M. K.; Hanifa, A.; Sa’adah, S. (2014). Early detection of Type II Diabetes Mellitus with Random Forest and Classification and Regression Tree (CART). In: Proceedings of the 2014 International Conference on Advanced Informatics: Concept, Theory and Application (ICAICTA), 238–242. IEEE. 10.1109/ICAICTA.2014.7005947
XVII. Sahid, M. A.; Babar, M. U. H.; Uddin, M. P. (2024). Predictive modeling of multi-class diabetes mellitus using machine learning and filtering Iraqi diabetes data dynamics. PLOS ONE, 19(5), e0300785. 10.1371/journal.pone.0300785
XVIII. Sokolova, M., & Lapalme, G. (2009). A systematic analysis of performance measures for classification tasks. Information Processing & Management, 45(4), 427–437. 10.1016/j.ipm.2009.03.002
