MACHINE LEARNING BASED PREDICTION AND INSIGHTS OF DIABETES DISEASE: PIMA INDIAN AND FRANKFURT DATASETS

Authors:

Mohammad Raquibul Hossain, Md. Jamal Hossain, Md. Mijanoor Rahman, Mohammad Manjur Alam

DOI NO:

https://doi.org/10.26782/jmcms.2025.01.00007

Keywords:

Machine Learning Methods, Diabetes Prediction, Logistic Regression, Classification, Random Forest, XGBoost

Abstract

This paper predicts diabetes using machine learning models, an active and important area of research. Six machine learning methods were evaluated on three diabetes datasets to compare model performance. The methods are Logistic Regression, k-Nearest Neighbour, Gaussian Naïve Bayes, Decision Tree, Random Forest, and XGBoost. The datasets are the Pima Indian dataset, the Frankfurt Hospital dataset, and a combined dataset; each has eight feature variables and one target variable. Because the train-test split ratio can make a significant difference in model performance, two split ratios, 50-50 and 90-10, were tested. Model performance was evaluated using four metrics: precision, recall, F1-score, and accuracy. Random Forest and XGBoost were the best-performing methods across all performance metrics, all datasets, and both train-test split ratios. They performed comparatively better on the combined dataset of 2768 instances, indicating the importance of a large dataset for better results. The 90-10 split also produced better results than the 50-50 split for all datasets and for almost all models.
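The four evaluation metrics named in the abstract can be illustrated with a minimal Python sketch. It assumes binary labels where 1 marks the positive (diabetic) class; the abstract does not state which averaging scheme the authors used, so this shows only the plain positive-class definitions. The function name and the toy labels are hypothetical and are not taken from the paper or its datasets.

```python
from typing import Dict, Sequence


def binary_classification_metrics(
    y_true: Sequence[int], y_pred: Sequence[int]
) -> Dict[str, float]:
    """Compute accuracy, precision, recall and F1 for binary labels (1 = positive)."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if len(y_true) == 0:
        raise ValueError("inputs must be non-empty")

    # Count the four cells of the confusion matrix.
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)

    accuracy = (tp + tn) / len(pairs)
    # Fall back to 0.0 when a denominator is zero (no predicted or no actual positives).
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Toy labels for illustration only (assumption: not drawn from any dataset in the paper).
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]
metrics = binary_classification_metrics(y_true, y_pred)
print(metrics)
```

With these toy labels there are three true positives, three true negatives, one false positive, and one false negative, so all four metrics come out to 0.75.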

