Diabetes Prediction Using Machine Learning

Author: Famous Ghanyo Tay

Diabetes is a chronic condition characterized by high blood sugar levels due to the body's inability to properly regulate insulin. It is broadly classified as type 1 or type 2. As of 2021, approximately 422 million adults worldwide were living with diabetes.

Project Scope

Diabetes

The Machine Learning Model for Diabetes Prediction project aims to develop an intelligent system that can accurately predict the likelihood of an individual developing diabetes. In this project, four popular algorithms are compared and the best one is chosen: Support Vector Classifier (SVC), Random Forest Classifier (RFC), AdaBoost Classifier (ABC), and Gradient Boosting Classifier (GBC).

Dataset

The dataset used for this project is the well-known Pima Indians Diabetes Dataset, obtained from Kaggle.

Exploratory Data Analysis on the Data

AIM: To gain insights into the dataset, identify patterns, and understand the relationships between different variables.

Visualizing the Correlation between the Features and the Target

Heatmap

We can clearly see that 'Glucose' has a strong positive correlation with the 'Outcome', while 'Blood Pressure' has the lowest positive correlation with 'Outcome'.
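The correlations behind the heatmap can be computed directly with pandas. The sketch below uses a small synthetic frame as a stand-in for the real dataset (the column names follow the Pima dataset; the values are illustrative only):

```python
import pandas as pd

# Synthetic stand-in for the Pima dataset (illustrative values only)
df = pd.DataFrame({
    "Glucose":       [85, 168, 122, 190, 99, 145],
    "BloodPressure": [66, 72, 70, 60, 64, 80],
    "Outcome":       [0, 1, 0, 1, 0, 1],
})

# Pearson correlation of every feature with the target
corr_with_outcome = df.corr()["Outcome"].drop("Outcome")
print(corr_with_outcome.sort_values(ascending=False))
```

Passing the full correlation matrix `df.corr()` to `seaborn.heatmap` produces the heatmap shown above.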

To be more specific, I am going to plot a histogram of the 'Glucose' feature against the target 'Outcome'. Histplot

From the above plot, we see a positive linear correlation.

  • As the value of `Glucose` increases, the count of patients having diabetes (i.e. an `Outcome` of 1) increases.
  • Also, after a `Glucose` value of 125, there is a steady increase in the number of patients having an `Outcome` of 1.
  • Note that a `Glucose` value of 0 means the measurement is missing. We need to fill those values with the *mean* or *median* for the plot to make sense.
  • So, there is a significant amount of positive linear correlation.
  • To view more plots with detailed explanations, see the entire notebook here.
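The zero-handling noted above amounts to a simple median replacement. A minimal sketch with synthetic glucose values (the real cleaning step would apply this to every affected column):

```python
import numpy as np
import pandas as pd

# Synthetic glucose readings; 0 marks a missing measurement
glucose = pd.Series([148, 85, 183, 0, 137, 0, 78])

# Treat 0 as missing, then fill with the median of the valid readings
glucose = glucose.replace(0, np.nan)
glucose = glucose.fillna(glucose.median())
print(glucose.tolist())
```

The median is preferred over the mean when the feature has outliers, since it is not pulled toward extreme values.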

Building and Evaluation of the ML Models

As I mentioned earlier, I am going to consider four algorithms: Support Vector Classifier (SVC), Random Forest Classifier (RFC), AdaBoost Classifier (ABC), and Gradient Boosting Classifier (GBC).

Ai Image

Support Vector Classifier

SVC is a variant of SVM that aims to find the best hyperplane in a high-dimensional feature space to separate different classes of data. It works by mapping the input data into a higher-dimensional space and finding the optimal decision boundary that maximizes the margin between classes. The choice of kernel plays a crucial role in determining the shape of the decision boundary. In my project, I demonstrated the impact of the kernels (linear, polynomial, radial basis function, and sigmoid) on the model's performance. For each kernel, I evaluated the performance of SVC on the training and testing data. The accuracy scores were then visualized as seen below:

Kernels

We can see that the 'poly' and 'rbf' kernels have significantly higher training accuracy than testing accuracy. This suggests they are overfitting the training data and may not perform well on unseen data. We will select the 'linear' kernel since it has the highest testing accuracy, which means it generalizes well to unseen data and provides accurate predictions.
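The kernel comparison described above can be sketched as follows. Synthetic data from `make_classification` stands in for the real train/test split, so the exact scores will differ from those in the plot:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic binary-classification data as a stand-in for the Pima dataset
X, y = make_classification(n_samples=500, n_features=8, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Compare training vs. testing accuracy for each kernel
for kernel in ["linear", "poly", "rbf", "sigmoid"]:
    model = SVC(kernel=kernel).fit(X_train, y_train)
    print(f"{kernel:8s} train={model.score(X_train, y_train):.3f} "
          f"test={model.score(X_test, y_test):.3f}")
```

A large gap between the train and test scores for a kernel is the overfitting signal discussed above.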

Random Forest Classifier

Random Forest (RF) is an ensemble learning method that combines multiple decision trees to make predictions. Here, we are using 500 decision trees, each having a maximum depth of 10.
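With scikit-learn, the configuration above maps directly onto `RandomForestClassifier` parameters. A sketch on synthetic stand-in data:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in data for the Pima dataset
X, y = make_classification(n_samples=500, n_features=8, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# 500 trees, each limited to a depth of 10, as described above
rfc = RandomForestClassifier(n_estimators=500, max_depth=10, random_state=42)
rfc.fit(X_train, y_train)
print(f"test accuracy: {rfc.score(X_test, y_test):.3f}")
```

Capping `max_depth` limits how complex each individual tree can get, which helps keep the ensemble from overfitting.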

AdaBoost Classifier

The AdaBoostClassifier is a boosting algorithm that combines multiple weak learners to create a strong learner. It iteratively adjusts the weights of misclassified samples to focus on difficult examples. By combining weak learners, AdaBoost can improve classification performance. I used 500 base estimators with a learning rate of 0.5. The learning rate controls the contribution of each weak learner to the final prediction.
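The same settings in scikit-learn, again sketched on synthetic stand-in data:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in data for the Pima dataset
X, y = make_classification(n_samples=500, n_features=8, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# 500 weak learners (decision stumps by default), learning rate 0.5
abc = AdaBoostClassifier(n_estimators=500, learning_rate=0.5, random_state=42)
abc.fit(X_train, y_train)
print(f"test accuracy: {abc.score(X_test, y_test):.3f}")
```

A smaller learning rate shrinks each weak learner's vote, so more estimators are typically needed to reach the same fit.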

Gradient Boosting Classifier

The GradientBoostingClassifier is another boosting algorithm that combines multiple decision trees to create a strong learner. It iteratively fits new trees to the residuals of the previous trees, gradually improving the model's performance. I used 500 base estimators with a learning rate of 0.5 and a maximum depth of 5.
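And the corresponding scikit-learn configuration, sketched on the same kind of synthetic stand-in data:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in data for the Pima dataset
X, y = make_classification(n_samples=500, n_features=8, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# 500 trees fitted sequentially to the previous trees' residuals
gbc = GradientBoostingClassifier(n_estimators=500, learning_rate=0.5,
                                 max_depth=5, random_state=42)
gbc.fit(X_train, y_train)
print(f"test accuracy: {gbc.score(X_test, y_test):.3f}")
```

Unlike the random forest, the trees here are not independent: each one corrects the errors left by the ensemble so far.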

Choosing The Best Model

| ML Model | Accuracy Score |
| --- | --- |
| Support Vector Classifier | 0.72077 |
| Random Forest Classifier | 0.74026 |
| AdaBoost Classifier | 0.74675 |
| Gradient Boosting Classifier | 0.73376 |

ROC Curves

ROC

Out of the four models, we can clearly see that the AdaBoost model performs best on our dataset. Hence, this is the model I am going to choose for my predictions.
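The ROC comparison can be summarized numerically with `roc_auc_score` (the area under each model's ROC curve). A sketch comparing two of the models on synthetic stand-in data; the real AUC values on the Pima dataset will differ:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data for the Pima dataset
X, y = make_classification(n_samples=500, n_features=8, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# AUC summarizes each model's ROC curve as a single number in [0, 1]
for name, clf in [
    ("Random Forest", RandomForestClassifier(random_state=42)),
    ("AdaBoost", AdaBoostClassifier(random_state=42)),
]:
    clf.fit(X_train, y_train)
    auc = roc_auc_score(y_test, clf.predict_proba(X_test)[:, 1])
    print(f"{name}: AUC = {auc:.3f}")
```

An AUC of 0.5 corresponds to random guessing, so the closer a model's curve hugs the top-left corner (AUC near 1), the better it separates the two classes.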

Link to Github Repo