Diabetes is a chronic condition characterized by high blood sugar levels due to the body's inability to produce or effectively use insulin. It is broadly classified as type 1 or type 2. As of 2021, approximately 422 million adults worldwide were living with diabetes.
The Machine Learning Model for Diabetes Prediction project aims to develop an intelligent system that can accurately predict the likelihood of an individual developing diabetes. In this project, four popular algorithms are trained and compared, and the best one is chosen. They are:

- Support Vector Classifier (SVC)
- Random Forest Classifier (RFC)
- AdaBoost Classifier (ABC)
- Gradient Boosting Classifier (GBC)
The dataset used for this project is the well-known Pima Indians Diabetes Dataset, obtained from Kaggle.
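As a minimal sketch of how the data could be loaded (the local filename `diabetes.csv` is an assumption; the column names below are the standard ones in the Kaggle version):

```python
import pandas as pd

# Load the Pima Indians Diabetes Dataset (assumed local filename: diabetes.csv).
df = pd.read_csv("diabetes.csv")

# Eight clinical features plus the binary target column 'Outcome'.
X = df.drop(columns="Outcome")
y = df["Outcome"]

print(df.shape)          # (768, 9) in the standard Kaggle version
print(y.value_counts())  # class balance of the target
```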
AIM: To gain insights into the dataset, identify patterns, and understand the relationships between different variables.
We can clearly see that 'Glucose' has a strong positive correlation with 'Outcome', while 'Blood Pressure' has the lowest positive correlation with it.
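A sketch of how such a correlation analysis is typically produced with pandas and seaborn (not necessarily the notebook's exact code):

```python
import matplotlib.pyplot as plt
import seaborn as sns

# Pairwise Pearson correlations between all columns, including the target.
corr = df.corr()

plt.figure(figsize=(9, 7))
sns.heatmap(corr, annot=True, fmt=".2f", cmap="coolwarm")
plt.title("Correlation matrix of the Pima diabetes features")
plt.show()

# Correlation of each feature with 'Outcome', sorted from strongest to weakest.
print(corr["Outcome"].sort_values(ascending=False))
```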
To be more specific, I am going to plot a histplot of the 'Glucose' feature, split by the target 'Outcome'. From the plot, we again see the positive correlation: higher glucose values are associated with a positive (diabetic) outcome.
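A minimal sketch of that plot with seaborn:

```python
import matplotlib.pyplot as plt
import seaborn as sns

# Distribution of 'Glucose', colored by the target class (0 = no diabetes, 1 = diabetes).
sns.histplot(data=df, x="Glucose", hue="Outcome", kde=True)
plt.title("Glucose distribution by Outcome")
plt.show()
```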
To view more plots with detailed explanations, view the entire notebook here.
As I mentioned earlier, I am going to consider four algorithms: Support Vector Classifier (SVC), Random Forest Classifier (RFC), AdaBoost Classifier (ABC), and Gradient Boosting Classifier (GBC).
SVC is a variant of SVM that aims to find the best hyperplane in a high-dimensional feature space to separate different classes of data. It works by mapping the input data into a higher-dimensional space and finding the optimal decision boundary that maximizes the margin between classes. The choice of kernel plays a crucial role in determining the shape of the decision boundary. In my project, I demonstrated the impact of four kernels (linear, polynomial, radial basis function, and sigmoid) on the model's performance. For each kernel, I evaluated the performance of SVC on the training and testing data, then visualized the accuracy scores, as seen below.
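Below is a sketch of how that kernel comparison could be reproduced; the 80/20 split, the random seed, and the standard scaling are assumptions, so the exact scores may differ from the notebook's:

```python
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Assumed 80/20 split; the notebook's exact split and seed may differ.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# SVC is sensitive to feature scale, so standardize (an assumed preprocessing step).
scaler = StandardScaler()
X_train_s = scaler.fit_transform(X_train)
X_test_s = scaler.transform(X_test)

# Fit one SVC per kernel and record training/testing accuracy.
kernels = ["linear", "poly", "rbf", "sigmoid"]
acc = {"train": [], "test": []}
for kernel in kernels:
    model = SVC(kernel=kernel).fit(X_train_s, y_train)
    acc["train"].append(model.score(X_train_s, y_train))
    acc["test"].append(model.score(X_test_s, y_test))

# Grouped bar chart of training vs. testing accuracy per kernel.
pd.DataFrame(acc, index=kernels).plot.bar(rot=0)
plt.ylabel("Accuracy")
plt.title("SVC accuracy by kernel")
plt.show()
```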
We can see that the 'poly' and 'rbf' kernels have significantly higher training accuracy than testing accuracy. This suggests they are overfitting the training data and may not perform well on unseen data. We will select the 'linear' kernel, since it has the highest testing accuracy, meaning it generalizes best to unseen data and provides the most reliable predictions.
Random Forest (RF) is an ensemble learning method that combines the predictions of multiple decision trees. Here, we use 500 decision trees, each with a maximum depth of 10.
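A sketch of the corresponding model, reusing the train/test split from above (the random seed is an assumption):

```python
from sklearn.ensemble import RandomForestClassifier

# 500 trees, each limited to a depth of 10, as described above.
rfc = RandomForestClassifier(n_estimators=500, max_depth=10, random_state=42)
rfc.fit(X_train, y_train)
print("RFC testing accuracy:", rfc.score(X_test, y_test))
```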
The AdaBoostClassifier is a boosting algorithm that combines multiple weak learners to create a strong learner. It iteratively adjusts the weights of misclassified samples so that subsequent learners focus on the difficult examples, which improves classification performance. I used 500 base estimators with a learning rate of 0.5; the learning rate controls the contribution of each weak learner to the final prediction.
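A sketch with those hyperparameters (again, the random seed is an assumption):

```python
from sklearn.ensemble import AdaBoostClassifier

# 500 weak learners, each contributing with a learning rate of 0.5.
abc = AdaBoostClassifier(n_estimators=500, learning_rate=0.5, random_state=42)
abc.fit(X_train, y_train)
print("ABC testing accuracy:", abc.score(X_test, y_test))
```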
The GradientBoostingClassifier is another boosting algorithm that combines multiple decision trees to create a strong learner. It iteratively fits new trees to the residuals of the previous trees, gradually improving the model's performance. I used 500 base estimators with a learning rate of 0.5 and a maximum depth of 5.
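And the corresponding sketch:

```python
from sklearn.ensemble import GradientBoostingClassifier

# 500 boosting stages with a learning rate of 0.5 and trees of maximum depth 5.
gbc = GradientBoostingClassifier(
    n_estimators=500, learning_rate=0.5, max_depth=5, random_state=42
)
gbc.fit(X_train, y_train)
print("GBC testing accuracy:", gbc.score(X_test, y_test))
```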
| ML Model | Accuracy Score |
|---|---|
| Support Vector Classifier | 0.72077 |
| Random Forest Classifier | 0.74026 |
| AdaBoost Classifier | 0.74675 |
| Gradient Boosting Classifier | 0.73376 |
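For completeness, here is a sketch of how such a comparison could be assembled from the fitted models above (the linear-kernel SVC is refit on the scaled data; since the split and preprocessing were assumptions, the exact figures may differ):

```python
import pandas as pd
from sklearn.svm import SVC

# Refit the selected linear-kernel SVC on the standardized training data.
svc_linear = SVC(kernel="linear").fit(X_train_s, y_train)

# Testing accuracy per model; the tree ensembles were fit on unscaled data.
scores = pd.Series(
    {
        "Support Vector Classifier": svc_linear.score(X_test_s, y_test),
        "Random Forest Classifier": rfc.score(X_test, y_test),
        "AdaBoost Classifier": abc.score(X_test, y_test),
        "Gradient Boosting Classifier": gbc.score(X_test, y_test),
    },
    name="Accuracy Score",
)
print(scores.sort_values(ascending=False))
```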
Out of the four models, we can clearly see that the AdaBoost Classifier achieves the highest accuracy score (0.74675) on our dataset. Hence, this is the model I am going to choose for my predictions.
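Finally, a sketch of using the chosen AdaBoost model on a new, hypothetical patient record (the feature values below are made up purely for illustration):

```python
import pandas as pd

# Hypothetical patient, with values in the same order and units as the dataset:
# Pregnancies, Glucose, BloodPressure, SkinThickness, Insulin, BMI,
# DiabetesPedigreeFunction, Age.
new_patient = pd.DataFrame(
    [[2, 140, 70, 25, 100, 32.0, 0.45, 35]],
    columns=X.columns,
)
print("Predicted Outcome:", abc.predict(new_patient)[0])  # 1 = diabetic, 0 = not
```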