1Student Research Committee, School of Pharmacy, Shahid Beheshti University of Medical Sciences, Tehran, Iran.
2Faculty of Pharmacy, University of Iran Medical Science, 123 University Avenue, Tehran 56789, Iran.
3School of Business Administration, Lakehead University, Thunder Bay, Ontario, Canada.
*Corresponding Author : Yalda Ghazizadeh
Student Research Committee, School of Pharmacy, Shahid Beheshti University of Medical Sciences, Tehran, Iran.
Email: [email protected]
Received : Apr 06, 2025
Accepted : Apr 28, 2025
Published : May 05, 2025
Archived : www.jcimcr.org
Copyright : © Ghazizadeh Y (2025).
Background: Diabetes is a chronic metabolic disease whose hallmark is hyperglycemia caused by impaired insulin production or action. Its rising prevalence worldwide underscores the need for accurate and efficient prediction techniques. By analyzing medical data, Machine Learning (ML) has shown potential in the diagnosis and prognosis of diabetes.
Objective: Using data from 768 female individuals, all aged 21 years or older and residing near Phoenix, Arizona, this study assesses how well different machine learning classifiers predict diabetes. The goal is to identify the best model while addressing issues with feature selection and data preparation.
Methods: Six machine learning methods were analyzed. The dataset was preprocessed using feature standardization, missing-value handling, and Recursive Feature Elimination (RFE) to choose the best features. Model performance was assessed using accuracy, sensitivity, specificity, positive predictive value, and negative predictive value.
Results: The most accurate model was Random Forest (RF) (84%), followed by Support Vector Machine (SVM) and Logistic Regression (both 82%). Decision Tree and Naïve Bayes performed competitively but marginally worse. The results suggest that traditional machine learning models can produce accurate diabetes predictions with only simple preprocessing.
Conclusion: Because they balance interpretability and accuracy, traditional machine learning models, Random Forest and SVM in particular, show great promise for diabetes prediction. Although sophisticated deep learning techniques can improve prediction, conventional ML models remain useful for clinical applications because of their transparency and ease of use. To increase predictive accuracy and generalizability, future studies should investigate hybrid techniques that combine deep learning and conventional models.
Keywords: Diabetes prediction; PIMA Indian dataset; Diabetes diagnoses; Machine learning.
Diabetes is a metabolic condition that affects how well the body processes blood glucose, also referred to as blood sugar. It is characterized by hyperglycemia caused by deficiencies in insulin production, insulin action, or both [1,2]. Diabetes has emerged as a global public health emergency. The International Diabetes Federation estimated that 463 million people worldwide had diabetes as of 2019 [3]. Because of its fast-rising incidence, this figure is expected to reach 578 million (10.2%) by 2030 and 700 million (10.9%) by 2045 [4]. Predictive analysis uses machine learning algorithms, data mining, and statistical methods to analyze past and present data in order to predict future occurrences. Applying predictive analysis to healthcare data can support important decisions and predictions. Predictive analytics applies machine learning techniques to diagnose diseases more accurately, improve patient care, optimize resources, and improve clinical outcomes [5]. The application of machine learning to diabetes prediction has been the subject of several studies. For example, Kavakiotis et al. [6] carried out a thorough assessment of machine learning’s use in diabetes research and emphasized its value in decision support and predictive analytics. Comparing several classification models, Sisodia et al. [7] found that ensemble approaches, such as Random Forest, performed better than conventional algorithms. Even though machine learning has shown promise, issues including feature selection, data quality, and model interpretability remain. Feng et al. [8] proposed a machine learning-based system for diabetes classification that used the Synthetic Minority Over-sampling Technique with Edited Nearest Neighbors (SMOTEENN) and Generative Adversarial Networks (GANs) to address issues with feature analysis, class imbalance, and data preparation.
Their method showed the potential of cutting-edge AI models to enhance diabetes prediction and achieved high accuracy (96.27% for binary classification).

Predictive medicine leverages advanced bioinformatics, genomics, and artificial intelligence to assess an individual’s risk of developing diseases and to tailor preventive or therapeutic strategies accordingly [9-13]. By analyzing genetic variations, biomarkers, and patient history, predictive models can identify predispositions to conditions such as cancer, cardiovascular diseases, and neurodegenerative disorders [14-18]. High-throughput sequencing and machine learning algorithms enable early diagnosis, prognosis estimation, and personalized treatment plans based on a patient’s molecular profile [19-22]. Additionally, pharmacogenomics, a key component of predictive medicine, optimizes drug selection and dosage by predicting individual responses to medications, reducing adverse effects and enhancing treatment efficacy [23]. This approach is shifting healthcare from a reactive to a proactive model, with the aim of improving patient outcomes and reducing medical costs.

This study uses data from 768 female individuals, all aged 21 years or older and residing near Phoenix, Arizona, to evaluate how well different machine learning algorithms predict diabetes. By comparing several methods, we aim to identify the model that produces the most accurate predictions and to investigate possible enhancements for further studies.
Participants: This study focused on 768 female individuals, all aged 21 years or older, residing near Phoenix, Arizona, with data collected by the National Institute of Diabetes and Digestive and Kidney Diseases. Diabetes diagnosis followed the World Health Organization’s criteria, which classify an individual as diabetic if their plasma glucose level reaches or exceeds 200 mg/dL (11.1 mmol/L) two hours after a glucose load during a survey examination.
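The two units quoted for the diagnostic threshold can be checked with a short conversion. The sketch below is illustrative only: the function and constant names are hypothetical, and the conversion factor assumes a glucose molar mass of about 180.16 g/mol (so 1 mmol/L is roughly 18.016 mg/dL).

```python
# Assumed conversion factor: 180.16 g/mol for glucose, divided by 10 dL per L.
GLUCOSE_MG_DL_PER_MMOL_L = 18.016

# Two-hour post-load threshold quoted in the text (mg/dL).
TWO_HOUR_THRESHOLD_MG_DL = 200.0


def mg_dl_to_mmol_l(mg_dl: float) -> float:
    """Convert a glucose concentration from mg/dL to mmol/L."""
    return mg_dl / GLUCOSE_MG_DL_PER_MMOL_L


def meets_two_hour_criterion(two_hour_glucose_mg_dl: float) -> bool:
    """Return True if the 2-hour plasma glucose reaches or exceeds the threshold."""
    return two_hour_glucose_mg_dl >= TWO_HOUR_THRESHOLD_MG_DL


print(round(mg_dl_to_mmol_l(200.0), 1))  # 11.1
```

The rounded result agrees with the 11.1 mmol/L figure given alongside 200 mg/dL.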
Study features: Each patient record includes the following features.
• Pregnancies – Number of times pregnant.
• Glucose – Plasma glucose concentration at 2 hours in an oral glucose tolerance test.
• Blood Pressure – Diastolic blood pressure (mm Hg).
• Skin thickness – Triceps skinfold thickness (mm), indicative of body fat percentage.
• Insulin – 2-hour serum insulin level (µU/mL).
• BMI – Body mass index, a measure of body fat based on weight and height.
• Diabetes pedigree function – A score indicating genetic predisposition to diabetes.
• Age – Age of the patient (years).
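The features listed above map naturally onto a typed record. The following is a minimal sketch, not the authors' code: the CSV column names are assumed to match the commonly distributed copy of the dataset, the two sample rows are illustrative, and treating a value of 0 as "missing" for the physiological measurements is an assumption that the text does not state.

```python
import csv
import io
from dataclasses import dataclass
from typing import List, Optional

# Assumption: a 0 in these columns is physiologically implausible and is
# treated as a missing value. The text does not describe this encoding.
ZERO_MEANS_MISSING = ("Glucose", "BloodPressure", "SkinThickness", "Insulin", "BMI")


@dataclass
class PatientRecord:
    pregnancies: int
    glucose: Optional[float]
    blood_pressure: Optional[float]
    skin_thickness: Optional[float]
    insulin: Optional[float]
    bmi: Optional[float]
    pedigree: float
    age: int
    outcome: int  # 1 = diabetic, 0 = non-diabetic


def _measurement(row: dict, column: str) -> Optional[float]:
    """Parse a numeric field, mapping 0 to None for the assumed-missing columns."""
    value = float(row[column])
    if column in ZERO_MEANS_MISSING and value == 0:
        return None
    return value


def parse_records(csv_text: str) -> List[PatientRecord]:
    """Parse CSV text (with a header row) into PatientRecord objects."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [
        PatientRecord(
            pregnancies=int(row["Pregnancies"]),
            glucose=_measurement(row, "Glucose"),
            blood_pressure=_measurement(row, "BloodPressure"),
            skin_thickness=_measurement(row, "SkinThickness"),
            insulin=_measurement(row, "Insulin"),
            bmi=_measurement(row, "BMI"),
            pedigree=float(row["DiabetesPedigreeFunction"]),
            age=int(row["Age"]),
            outcome=int(row["Outcome"]),
        )
        for row in reader
    ]


# Illustrative rows only; the header is an assumed column layout.
SAMPLE = """Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
6,148,72,35,0,33.6,0.627,50,1
1,85,66,29,0,26.6,0.351,31,0
"""

records = parse_records(SAMPLE)
print(records[0].insulin)  # None, because 0 is treated as missing
```

Keeping missing values as `None` rather than 0 prevents them from silently pulling group means downward before imputation.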
Statistical analysis
This research comprised data preprocessing, data normalization, feature selection, and evaluation of the results (Figure 1). Before model training, the data were prepared by handling missing values, standardizing features, and eliminating outliers. To resolve missing values, we used a hybrid strategy combining mean and median imputation to preserve the dataset’s distribution while reducing potential bias. Identifying and managing outliers is critical for reliable data analysis. The Insulin variable in the original dataset contained a substantial number of outliers that persisted even after missing values were imputed; these outliers were therefore removed. Six machine learning models were trained to predict diabetes: Support Vector Machine (SVM), K-Nearest Neighbors (KNN), Naïve Bayes (NB), Decision Tree (DT), Random Forest (RF), and Logistic Regression (LR) [24-26]. The dataset was split 80/20 into training and test sets, and 10-fold cross-validation was used to validate model performance [27].
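The preprocessing and validation steps described above can be sketched with the Python standard library alone. This is an illustration under stated assumptions, not the authors' pipeline: all function names are hypothetical, only median imputation is shown (the study reports a hybrid of mean and median), the sample values are made up, and applying the 10 folds to the training portion is one plausible reading of the text.

```python
import random
import statistics
from typing import Iterator, List, Optional, Tuple


def impute_median(values: List[Optional[float]]) -> List[float]:
    """Replace missing entries (None) with the median of the observed values."""
    observed = [v for v in values if v is not None]
    fill = statistics.median(observed)
    return [fill if v is None else v for v in values]


def standardize(values: List[float]) -> List[float]:
    """Scale values to zero mean and unit (population) standard deviation."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    if sd == 0:
        return [0.0 for _ in values]
    return [(v - mean) / sd for v in values]


def train_test_split_indices(
    n: int, test_fraction: float = 0.2, seed: int = 0
) -> Tuple[List[int], List[int]]:
    """Shuffle row indices and split them into training and test index lists."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    n_test = round(n * test_fraction)
    return indices[n_test:], indices[:n_test]


def k_fold_indices(
    indices: List[int], k: int = 10
) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (training, validation) index lists for k-fold cross-validation."""
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        training = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield training, folds[i]


# Made-up insulin values with two missing entries.
insulin = [None, 94.0, 168.0, None, 88.0]
filled = impute_median(insulin)
scaled = standardize(filled)

# Hypothetical dataset of 100 rows: 80/20 split, then 10 folds on the training part.
train_idx, test_idx = train_test_split_indices(100)
folds = list(k_fold_indices(train_idx, k=10))
print(len(train_idx), len(test_idx), len(folds))  # 80 20 10
```

In practice, the imputation value and the scaling mean and standard deviation would be computed on the training rows only and then applied to the validation and test rows, so that no information leaks from held-out data.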
Data availability: The dataset is available at https://www.kaggle.com/datasets/uciml/pima-indians-diabetes-database
Diabetic individuals were 37 years old on average, while non-diabetic individuals were 31 years old on average. Pregnancies, glucose, skin thickness, insulin, and BMI differed significantly between the diabetic and non-diabetic groups, underscoring their importance in diabetes prediction; diabetes pedigree function and age also differed significantly, whereas blood pressure did not (p = 0.072). Table 1 summarizes the characteristics of each group.
With a sensitivity of 0.74, specificity of 0.90, accuracy of 0.84, PPV of 0.80, and NPV of 0.87, the Random Forest (RF) model exhibits the best overall performance. This implies that RF classifies both positive and negative cases accurately. With an accuracy of 0.82, SVM and Logistic Regression (LR) both perform well. LR has the greater specificity (0.91), whereas SVM has a higher sensitivity (0.70) than LR (0.65). With an accuracy of 0.80 and a balance between sensitivity (0.72) and specificity (0.84), Naïve Bayes (NB) performs slightly worse than SVM. With an accuracy of 0.77, a lower PPV, and modest sensitivity, Decision Tree (DT) and KNN perform the worst.
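All five reported metrics follow from the four cells of a binary confusion matrix. The sketch below shows the standard definitions; the class name is hypothetical and the counts are invented for illustration, not taken from the study's results.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConfusionMatrix:
    tp: int  # diabetic, predicted diabetic
    fp: int  # non-diabetic, predicted diabetic
    tn: int  # non-diabetic, predicted non-diabetic
    fn: int  # diabetic, predicted non-diabetic

    @property
    def sensitivity(self) -> float:
        """True positive rate: TP / (TP + FN)."""
        return self.tp / (self.tp + self.fn)

    @property
    def specificity(self) -> float:
        """True negative rate: TN / (TN + FP)."""
        return self.tn / (self.tn + self.fp)

    @property
    def accuracy(self) -> float:
        """Share of all cases classified correctly."""
        return (self.tp + self.tn) / (self.tp + self.fp + self.tn + self.fn)

    @property
    def ppv(self) -> float:
        """Positive predictive value: TP / (TP + FP)."""
        return self.tp / (self.tp + self.fp)

    @property
    def npv(self) -> float:
        """Negative predictive value: TN / (TN + FN)."""
        return self.tn / (self.tn + self.fn)


# Hypothetical counts for a 100-case test set (not the study's data).
cm = ConfusionMatrix(tp=30, fp=5, tn=55, fn=10)
print(f"sensitivity={cm.sensitivity:.2f} specificity={cm.specificity:.2f} "
      f"accuracy={cm.accuracy:.2f} PPV={cm.ppv:.2f} NPV={cm.npv:.2f}")
```

Sensitivity and specificity depend only on the classifier's behavior within each true class, whereas PPV and NPV also shift with the proportion of diabetic cases in the test set, which is why the study reports them separately.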
Table 1: Characteristics of the non-diabetic and diabetic groups.

Feature | Non-Diabetic, mean (SD) | Diabetic, mean (SD) | p-value |
---|---|---|---|
Pregnancies | 3.30 (3.02) | 4.87 (3.74) | <0.001 |
Glucose | 109.98 (26.14) | 141.26 (31.94) | <0.001 |
Blood Pressure (mm Hg) | 68.18 (18.06) | 70.82 (21.49) | 0.072 |
Skin Thickness (mm) | 19.66 (14.89) | 22.16 (17.68) | 0.038 |
Insulin (µU/mL) | 68.79 (98.87) | 100.34 (138.69) | <0.001 |
BMI | 30.30 (7.69) | 35.14 (7.26) | <0.001 |
Diabetes Pedigree Function | 0.43 (0.30) | 0.55 (0.37) | <0.001 |
Age (years) | 31.19 (11.67) | 37.07 (10.97) | <0.001 |
Yahyaoui et al. proposed a diabetes prediction framework leveraging machine learning and deep learning methodologies [28]. They applied RF, SVM, and a convolutional neural network to identify and diagnose diabetes patients. The outcomes indicated that the RF model surpassed the deep learning and SVM techniques, attaining an overall accuracy of 83.67%. Sharma et al. used approaches such as NB [28], LR, decision tree, and artificial neural network for diabetes prediction. Among these strategies, LR yielded the highest precision, 80.43%, in identifying whether a patient has diabetes. Haritha et al. applied a KNN classifier and the Cuckoo fuzzy KNN algorithm to the PIMA dataset [30], obtaining an accuracy of 81.00%. Patra and Kuntia [31] introduced a Standard Deviation KNN (SDKNN) algorithm for diabetes classification on the PIDD. The model applied the standard deviation of KNN attributes to determine the distance between training and testing data, achieving an accuracy of 83.76% for the enhanced weighted SDKNN. By combining Generative Adversarial Networks (GANs) and the Synthetic Minority Over-sampling Technique with Edited Nearest Neighbors (SMOTEENN), recent studies, such as the one by Feng et al. (2023), have shown even greater classification accuracy (96.27%). Although these techniques enhance model performance, they also increase computational complexity and require larger training datasets. In contrast, our work shows that conventional machine learning models can still perform competitively without advanced data augmentation or deep learning methods [32]. Although this study concentrated on conventional machine learning models, incorporating Artificial Neural Networks (ANNs) and Convolutional Neural Networks (CNNs) might improve prediction accuracy, especially for larger datasets.
Using real-time health monitoring data from wearable technology (such as smartwatches and glucose monitors) may enable a more dynamic and individualized approach to diabetes risk assessment. Using datasets that include people from different ethnic backgrounds would improve the generalizability of machine learning models for diabetes prediction. Integrating machine learning-based diagnostic support systems with Electronic Health Records (EHRs) may help physicians make better decisions, which might ultimately result in earlier identification and better patient outcomes. Our study demonstrates that conventional machine learning models remain effective in predicting diabetes and provide notable benefits in terms of accessibility and interpretability. Our findings indicate that Random Forest and SVM are appropriate for clinical applications because they can achieve good prediction accuracy with comparatively little preprocessing. Overall, this work contributes to the continuing investigation of machine learning-based diabetes prediction. Even if deep learning has the potential to lead to further advances, our results support the view that conventional machine learning techniques remain valuable for medical diagnosis. To develop more reliable, interpretable, and clinically useful models for diabetes prediction and treatment, future research should try to integrate the advantages of both conventional and deep learning approaches.