Diabetes is a chronic condition characterized by high blood sugar levels due to the body's inability to produce or effectively use insulin. It is broadly classified as type 1 or type 2. As of 2021, approximately 422 million adults worldwide were living with diabetes.
The Machine Learning Model for Diabetes Prediction project aims to develop an intelligent system that can accurately predict the likelihood of an individual developing diabetes. In this project, four popular algorithms are trained and compared, and the best one is chosen. They are:

- Support Vector Classifier (SVC)
- Random Forest Classifier (RFC)
- AdaBoost Classifier (ABC)
- Gradient Boosting Classifier (GBC)
The dataset used for this project is the well-known Pima Indians Diabetes Dataset, obtained from Kaggle.
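As a minimal sketch of how the data could be loaded (the local filename `diabetes.csv` is an assumption; the column names below are the standard ones in the Kaggle version):

```python
import pandas as pd

# Load the Pima Indians Diabetes Dataset (assumed local filename: diabetes.csv).
df = pd.read_csv("diabetes.csv")

# Eight clinical features plus the binary target column 'Outcome'.
X = df.drop(columns="Outcome")
y = df["Outcome"]

print(df.shape)          # (768, 9) in the standard Kaggle version
print(y.value_counts())  # class balance of the target
```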
AIM: To gain insights into the dataset, identify patterns, and understand the relationships between different variables.
We can clearly see that 'Glucose' has a strong positive correlation with 'Outcome', while 'Blood Pressure' has the lowest positive correlation with it.
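A sketch of how such a correlation analysis is typically produced with pandas and seaborn (not necessarily the notebook's exact code):

```python
import matplotlib.pyplot as plt
import seaborn as sns

# Pairwise Pearson correlations between all columns, including the target.
corr = df.corr()

plt.figure(figsize=(9, 7))
sns.heatmap(corr, annot=True, fmt=".2f", cmap="coolwarm")
plt.title("Correlation matrix of the Pima diabetes features")
plt.show()

# Correlation of each feature with 'Outcome', sorted from strongest to weakest.
print(corr["Outcome"].sort_values(ascending=False))
```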
To be more specific, I am going to plot a histplot of the 'Glucose' feature, split by the target 'Outcome'. From the plot, we again see the positive correlation: higher glucose values are associated with a positive (diabetic) outcome.
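A minimal sketch of that plot with seaborn:

```python
import matplotlib.pyplot as plt
import seaborn as sns

# Distribution of 'Glucose', colored by the target class (0 = no diabetes, 1 = diabetes).
sns.histplot(data=df, x="Glucose", hue="Outcome", kde=True)
plt.title("Glucose distribution by Outcome")
plt.show()
```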
To view more plots with detailed explanations, view the entire notebook here.
As I mentioned earlier, I am going to consider four algorithms: Support Vector Classifier (SVC), Random Forest Classifier (RFC), AdaBoost Classifier (ABC), and Gradient Boosting Classifier (GBC).
SVC is a variant of SVM that aims to find the best hyperplane in a high-dimensional feature space to separate different classes of data. It works by mapping the input data into a higher-dimensional space and finding the optimal decision boundary that maximizes the margin between classes. The choice of kernel plays a crucial role in determining the shape of the decision boundary. In my project, I demonstrated the impact of four kernels (linear, polynomial, radial basis function, and sigmoid) on the model's performance. For each kernel, I evaluated the performance of SVC on the training and testing data, then visualized the accuracy scores, as seen below.
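Below is a sketch of how that kernel comparison could be reproduced; the 80/20 split, the random seed, and the standard scaling are assumptions, so the exact scores may differ from the notebook's:

```python
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Assumed 80/20 split; the notebook's exact split and seed may differ.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# SVC is sensitive to feature scale, so standardize (an assumed preprocessing step).
scaler = StandardScaler()
X_train_s = scaler.fit_transform(X_train)
X_test_s = scaler.transform(X_test)

# Fit one SVC per kernel and record training/testing accuracy.
kernels = ["linear", "poly", "rbf", "sigmoid"]
acc = {"train": [], "test": []}
for kernel in kernels:
    model = SVC(kernel=kernel).fit(X_train_s, y_train)
    acc["train"].append(model.score(X_train_s, y_train))
    acc["test"].append(model.score(X_test_s, y_test))

# Grouped bar chart of training vs. testing accuracy per kernel.
pd.DataFrame(acc, index=kernels).plot.bar(rot=0)
plt.ylabel("Accuracy")
plt.title("SVC accuracy by kernel")
plt.show()
```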
We can see that the 'poly' and 'rbf' kernels have significantly higher training accuracy than testing accuracy. This suggests they are overfitting the training data and may not perform well on unseen data. We will select the 'linear' kernel, since it has the highest testing accuracy, meaning it generalizes best to unseen data and provides the most reliable predictions.
Random Forest (RF) is an ensemble learning method that combines the predictions of multiple decision trees. Here, we use 500 decision trees, each with a maximum depth of 10.
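A sketch of the corresponding model, reusing the train/test split from above (the random seed is an assumption):

```python
from sklearn.ensemble import RandomForestClassifier

# 500 trees, each limited to a depth of 10, as described above.
rfc = RandomForestClassifier(n_estimators=500, max_depth=10, random_state=42)
rfc.fit(X_train, y_train)
print("RFC testing accuracy:", rfc.score(X_test, y_test))
```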
The AdaBoostClassifier is a boosting algorithm that combines multiple weak learners to create a strong learner. It iteratively adjusts the weights of misclassified samples so that subsequent learners focus on the difficult examples, which improves classification performance. I used 500 base estimators with a learning rate of 0.5; the learning rate controls the contribution of each weak learner to the final prediction.
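A sketch with those hyperparameters (again, the random seed is an assumption):

```python
from sklearn.ensemble import AdaBoostClassifier

# 500 weak learners, each contributing with a learning rate of 0.5.
abc = AdaBoostClassifier(n_estimators=500, learning_rate=0.5, random_state=42)
abc.fit(X_train, y_train)
print("ABC testing accuracy:", abc.score(X_test, y_test))
```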
The GradientBoostingClassifier is another boosting algorithm that combines multiple decision trees to create a strong learner. It iteratively fits new trees to the residuals of the previous trees, gradually improving the model's performance. I used 500 base estimators with a learning rate of 0.5 and a maximum depth of 5.
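And the corresponding sketch:

```python
from sklearn.ensemble import GradientBoostingClassifier

# 500 boosting stages with a learning rate of 0.5 and trees of maximum depth 5.
gbc = GradientBoostingClassifier(
    n_estimators=500, learning_rate=0.5, max_depth=5, random_state=42
)
gbc.fit(X_train, y_train)
print("GBC testing accuracy:", gbc.score(X_test, y_test))
```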
| ML Model | Accuracy Score |
|---|---|
| Support Vector Classifier | 0.72077 |
| Random Forest Classifier | 0.74026 |
| AdaBoost Classifier | 0.74675 |
| Gradient Boosting Classifier | 0.73376 |
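For completeness, here is a sketch of how such a comparison could be assembled from the fitted models above (the linear-kernel SVC is refit on the scaled data; since the split and preprocessing were assumptions, the exact figures may differ):

```python
import pandas as pd
from sklearn.svm import SVC

# Refit the selected linear-kernel SVC on the standardized training data.
svc_linear = SVC(kernel="linear").fit(X_train_s, y_train)

# Testing accuracy per model; the tree ensembles were fit on unscaled data.
scores = pd.Series(
    {
        "Support Vector Classifier": svc_linear.score(X_test_s, y_test),
        "Random Forest Classifier": rfc.score(X_test, y_test),
        "AdaBoost Classifier": abc.score(X_test, y_test),
        "Gradient Boosting Classifier": gbc.score(X_test, y_test),
    },
    name="Accuracy Score",
)
print(scores.sort_values(ascending=False))
```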
Out of the four models, we can clearly see that the AdaBoost Classifier achieves the highest accuracy score (0.74675) on our dataset. Hence, this is the model I am going to choose for my predictions.
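Finally, a sketch of using the chosen AdaBoost model on a new, hypothetical patient record (the feature values below are made up purely for illustration):

```python
import pandas as pd

# Hypothetical patient, with values in the same order and units as the dataset:
# Pregnancies, Glucose, BloodPressure, SkinThickness, Insulin, BMI,
# DiabetesPedigreeFunction, Age.
new_patient = pd.DataFrame(
    [[2, 140, 70, 25, 100, 32.0, 0.45, 35]],
    columns=X.columns,
)
print("Predicted Outcome:", abc.predict(new_patient)[0])  # 1 = diabetic, 0 = not
```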