Diabetes Prediction Analysis in Python and its Implementation in Flask - Part 1

Devleena Banerjee
4 min read · Jul 1, 2020

The journey of learning data science has not been easy. After trying to learn and understand the intricacies of different algorithms and how to apply them in a Jupyter Notebook, I thought I had learned pretty much the basics. But there's more….

During my daily course of studies, and also while trying to get my hands dirty with some hackathons, I came across the production deployment phase of the data science pipeline. Please don't judge me for that, as we all learn something new every day. That's when I came to know about Flask and its use for creating a wonderful interface to share your model's predictions.

So here I am going to share the details of the project I have worked on and its implementation with Flask to create a user interface. If you are fairly familiar with the modelling techniques and want to know what comes next, then you are at the right place.

In this article, I have used the diabetes dataset from Kaggle. You can download the data directly from here. I will also share the link to my GitHub profile, from where you can download the complete code.

I will walk you through the various snippets showing how I selected the model and how I created the Flask interface.

This is a classification problem, which means the target variable (Outcome in this case) is binary.

I have used different modelling techniques to check which one gives the best accuracy along with good precision and recall values, since this is a medical problem.

Importing the different libraries and reading the CSV file into Python
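The original screenshot for this step is not reproduced here, but a minimal sketch of what it would contain looks roughly like this (the file name diabetes.csv and the exact library list are my assumptions):

```python
# Minimal sketch of the setup step; "diabetes.csv" is the assumed file name
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Read the Kaggle diabetes data into a DataFrame
df = pd.read_csv("diabetes.csv")
print(df.shape)
print(df.head())
```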

Check for the missing values.

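Something along these lines would do the check (continuing with the df DataFrame from the previous snippet):

```python
# Count missing (NaN) values per column; df comes from the earlier read_csv step
print(df.isnull().sum())

# Overview of dtypes and non-null counts
df.info()
```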

Next, I performed some data visualization.
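The heatmap discussed below can be drawn with seaborn; this is only a sketch along those lines (the figure size and colour map are my choices), not the author's exact plotting code:

```python
import matplotlib.pyplot as plt
import seaborn as sns

# Correlation heatmap of all features; df is the DataFrame from the earlier snippets
plt.figure(figsize=(10, 8))
sns.heatmap(df.corr(), annot=True, cmap="coolwarm", fmt=".2f")
plt.title("Feature correlation heatmap")
plt.show()
```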

From the above heatmap, we can see strong correlations between BMI and SkinThickness (0.94), DiabetesPedigreeFunction and Pregnancies (0.90), BMI and Age (0.85), and SkinThickness and Age (0.83).

Next, I did some feature engineering. We can see that there are no null or NA values in the data, but there are some rows with zero values, and we certainly don't want any 0 values in our data. So I replaced them with the median value of that particular feature.

Replaced the 0 values with the median values of the feature
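A sketch of how that replacement might look; the column list is my assumption based on the standard Kaggle columns (Pregnancies legitimately contains zeros, so it is left out), and whether the median is computed before or after masking the zeros is a detail the original notebook may handle differently:

```python
import numpy as np

# Columns where a value of 0 is physiologically implausible (assumed column list)
zero_cols = ["Glucose", "BloodPressure", "SkinThickness", "Insulin", "BMI"]

for col in zero_cols:
    # Treat zeros as missing, then fill with the median of the remaining values
    df[col] = df[col].replace(0, np.nan)
    df[col] = df[col].fillna(df[col].median())
```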

Modelling Part

I used models like Naive Bayes, Random Forest and XGBoost and checked their accuracy. I then removed the feature 'SkinThickness', which is highly correlated with BMI and Age, and the accuracy and F1 score improved.
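A sketch of that comparison is shown below; the train/test split ratio, random_state values and default hyperparameters are my assumptions, not necessarily what the original notebook used (xgboost needs to be installed separately):

```python
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, f1_score
from xgboost import XGBClassifier  # requires the xgboost package

# Split features and target; df is the cleaned DataFrame from the earlier steps
X = df.drop("Outcome", axis=1)
y = df["Outcome"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

models = {
    "Naive Bayes": GaussianNB(),
    "Random Forest": RandomForestClassifier(random_state=42),
    "XGBoost": XGBClassifier(random_state=42),
}

# Fit each model and compare accuracy and F1 on the held-out test set
for name, model in models.items():
    model.fit(X_train, y_train)
    preds = model.predict(X_test)
    print(f"{name}: accuracy={accuracy_score(y_test, preds):.3f}, "
          f"F1={f1_score(y_test, preds):.3f}")
```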

But I still wanted to reduce the False Positives and False Negatives even more, so I oversampled the 'Outcome' variable using SMOTE, as sketched below.

Modelling with Random Forest and checking the various metrics
Dropped “SkinThickness” from train and test data
Random Forest Modelling after the Feature is removed
Oversampling and rerunning the model
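A rough sketch of those steps is given below, continuing from the train/test split in the earlier snippet; imbalanced-learn provides SMOTE, and applying it to the training split only is my choice of convention rather than something confirmed by the original notebook:

```python
from imblearn.over_sampling import SMOTE  # requires the imbalanced-learn package
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix

# Drop the highly correlated feature from both splits (X_train/X_test from the earlier snippet)
X_train_red = X_train.drop("SkinThickness", axis=1)
X_test_red = X_test.drop("SkinThickness", axis=1)

# Oversample the minority class in the training data with SMOTE
smote = SMOTE(random_state=42)
X_res, y_res = smote.fit_resample(X_train_red, y_train)

# Refit Random Forest on the resampled data and inspect the confusion matrix and metrics
rf = RandomForestClassifier(random_state=42)
rf.fit(X_res, y_res)
preds = rf.predict(X_test_red)

print(confusion_matrix(y_test, preds))
print(classification_report(y_test, preds))
```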

We can see that the accuracy was better with oversampling, but while the False Negative Rate went down, the False Positive Rate went up. So we can select whichever model we need as per the requirement.

The Flask implementation is continued in the next article.

I hope you liked this article. Please give a clap, share and leave a comment, and do share any suggestions you may have. Have a great day and great learning ahead.

REFERENCES

https://www.geeksforgeeks.org/ml-handling-imbalanced-data-with-smote-and-near-miss-algorithm-in-python/

https://machinelearningmastery.com/smote-oversampling-for-imbalanced-classification/

https://www.kaggle.com/johndasilva/diabetes
