Data OilSt.

  • Krishna Kankipati

Analysis of Diabetes dataset

Diabetes has become one of the most common non-communicable diseases in society. Over the years, millions of people have fallen prey to it through negligence, lack of knowledge or lack of awareness.

What is Diabetes Mellitus?

Diabetes is a metabolic disorder with high glucose levels in blood.

What happens?

The components of food are digested and transported to different parts of the body through the blood. Glucose is transported in the same way.

Insulin is a hormone produced in the pancreas. Its action causes the cells to take up glucose for their energy needs or to store it (in the liver as glycogen), which ultimately lowers blood glucose levels. However, in some conditions Insulin is not produced in adequate quantities, or it is produced in adequate quantities but the body has become resistant to it.

All these conditions lead to increased glucose (sugar) in the blood, the condition known as Diabetes Mellitus.

Normal values

When blood glucose homeostasis is disrupted, blood glucose levels become either too high (hyperglycemia) or too low (hypoglycemia).

Diabetes is characterised by hyperglycemia. Normally 4 mmol/lit of sugar is present in the blood.

Fasting blood sugar

Blood sugar is tested after at least 8 hours of fasting, i.e. blood is usually taken early in the morning before breakfast. It should be 70-100 mg/dl (<5.6 mmol/lit).

Sugar levels >110 mg/dl should raise a suspicion of Diabetes Mellitus. If needed, an oral glucose tolerance test (OGTT) is done. Usually, glucose >125 mg/dl (7 mmol/lit) indicates diabetes, and the fasting blood glucose test should be repeated.

Random Blood Sugar

Taken at a random time in a non-fasting subject.

Normal: if 80-140 mg/dl (4.4-7.8 mmol/lit)

Prediabetes: if 140-200 mg/dl (7.8-11 mmol/lit)

Diabetes Mellitus: if >=200 mg/dl  
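The mg/dl and mmol/lit figures quoted above are the same measurements in two units. Here is a minimal sketch of the conversion, assuming the molar mass of glucose (about 180.16 g/mol, so 1 mmol/lit is roughly 18 mg/dl). The function names are illustrative and do not come from the original analysis.

```python
# Assumption: glucose has a molar mass of about 180.16 g/mol,
# so 1 mmol/lit corresponds to 18.016 mg/dl.
MG_DL_PER_MMOL_L = 18.016


def mg_dl_to_mmol_l(mg_dl: float) -> float:
    """Convert a glucose reading from mg/dl to mmol/lit."""
    return mg_dl / MG_DL_PER_MMOL_L


def mmol_l_to_mg_dl(mmol_l: float) -> float:
    """Convert a glucose reading from mmol/lit to mg/dl."""
    return mmol_l * MG_DL_PER_MMOL_L


# Check the cut-offs quoted in the text.
for mg_dl in (100, 140, 200):
    print(f"{mg_dl} mg/dl = {mg_dl_to_mmol_l(mg_dl):.1f} mmol/lit")
```

The rounded results agree with the paired values given in the text (140 mg/dl with 7.8 mmol/lit, 200 mg/dl with about 11 mmol/lit).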

Postprandial Blood Sugar

The ADA does not consider this test diagnostic of Diabetes Mellitus.

Normal if glucose levels are <180mg/dl 2 hours after a meal.


The A1C test gives the average blood sugar level over the past 2-3 months. An average is more reliable for confirmation than any single result.

Normal : if A1C < 5.7%

Prediabetes: if A1C 5.7% to 6.4%

Diabetes: if A1C >= 6.5%

OGTT(Oral Glucose Tolerance Test):

Tests the blood sugar levels before and 2 hours after taking a drink given by the doctor.

Normal : if  < 140mg/dl

Prediabetes: if 140mg/dl to 199mg/dl

Diabetes: if >= 200mg/dl
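The A1C and OGTT cut-offs listed above translate directly into simple threshold checks. The sketch below is only an illustration of those ranges. The function names are hypothetical, and readings that fall between the printed ranges (for example an A1C of 6.45%) are assigned to the lower category as an assumption.

```python
def classify_a1c(a1c_percent: float) -> str:
    """Map an A1C value (in %) to the category given in the text."""
    if a1c_percent >= 6.5:
        return "Diabetes"
    if a1c_percent >= 5.7:
        return "Prediabetes"
    return "Normal"


def classify_ogtt(glucose_mg_dl: float) -> str:
    """Map a 2-hour OGTT value (in mg/dl) to the category given in the text."""
    if glucose_mg_dl >= 200:
        return "Diabetes"
    if glucose_mg_dl >= 140:
        return "Prediabetes"
    return "Normal"


print(classify_a1c(6.0))     # falls in the 5.7% to 6.4% band
print(classify_ogtt(192))    # below the 200 mg/dl cut-off
```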


  • Approximately 463 million adults were living with Diabetes Mellitus. By 2045 this will rise to 700 million.

  • 1 in 5 people above 65 years of age has Diabetes Mellitus.

  • Diabetes Mellitus caused 4.2 million deaths.

  • Diabetes Mellitus leads to many complications of the eyes, kidneys and cardiovascular system.

  • In fact, type-2 Diabetes is a leading cause of blindness and kidney failure, and a major cause of heart attacks and strokes.

Types of Diabetes

Type 1 Diabetes Mellitus:

Previously called Juvenile Diabetes Mellitus because Type-1 Diabetes Mellitus usually precipitates in children and teenagers.

The body's immune system attacks the insulin-producing cells in the pancreas (an autoimmune disease), so Insulin is not properly produced, leading to Diabetes Mellitus.

Patients are therefore treated with daily Insulin.

Type II Diabetes Mellitus:

About 90% of patients with Diabetes Mellitus have Type II. It is also called adult-onset Diabetes Mellitus, since it usually manifests after 35 years of age. However, it may be seen in the young too.

Here the insulin is produced normally but the body cannot use it well (Insulin Resistance).

It is usually seen in people who are overweight and have a sedentary lifestyle.

Dataset -

We have chosen a dataset which is originally from the National Institute of Diabetes and Digestive and Kidney Diseases. The data consists of various medical variables of about 15000 patients. All patients here are females at least 21 years old of Pima Indian heritage.

"The Pima (or Akimel O'odham, also spelled Akimel O'otham, "River People", formerly known as Pima) are a group of Native Americans living in an area consisting of what is now central and southern Arizona. The majority population of the surviving two bands of the Akimel O'odham are based in two reservations: the Keli Akimel O'otham on the Gila River Indian Community (GRIC) and the On'k Akimel O'odham on the Salt River Pima-Maricopa Indian Community (SRPMIC)." Wikipedia

The main objective is to analyse several medical variables that may lead to Diabetes. The variables include the number of pregnancies the patient has had, their BMI, insulin level, age, and so on.

Dataset Sample

Dataset Info

It shows that there are 15,000 rows and 8 columns. Most columns hold integer values, except BMI and DiabetesPedigree, which hold float (i.e. decimal) values.

Statistics of the Dataset

This info provides the minimum and maximum values of each variable along with other info like Percentiles, Mean and Standard Deviation.
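The kind of summary described above can be reproduced for a single column with Python's standard library. This is a sketch under stated assumptions: the five Age values are made up for illustration and are not rows from the dataset, and the original analysis may well have used a different tool to produce its table.

```python
import statistics

# Hypothetical Age values for five patients (not taken from the dataset).
ages = [21, 22, 24, 35, 43]

summary = {
    "count": len(ages),
    "mean": statistics.mean(ages),
    "std": statistics.stdev(ages),  # sample standard deviation (n - 1)
    "min": min(ages),
    # 25th, 50th and 75th percentiles, with linear interpolation
    "quartiles": statistics.quantiles(ages, n=4, method="inclusive"),
    "max": max(ages),
}

for name, value in summary.items():
    print(f"{name}: {value}")
```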

The dataset consists of 10,000 patients who do not have diabetes and 5,000 patients who have diabetes.

0 represents that the patient does not have Diabetes.

1 represents that the patient has Diabetes.


Correlation is a measure of the strength of the linear relationship between two variables.

Range: [-1, 1]. 1: perfect positive linear relationship, -1: perfect negative linear relationship, 0: no linear relationship.
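To make the range concrete, here is a small sketch of the Pearson correlation coefficient. It assumes that Pearson's r is the measure meant here, since the article does not name one. The helper `pearson_r` and the sample lists are illustrative only.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations (unnormalised covariance)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Sums of squared deviations (unnormalised variances)
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


x = [1, 2, 3, 4, 5]
r_up = pearson_r(x, [2, 4, 6, 8, 10])    # y rises in step with x
r_down = pearson_r(x, [10, 8, 6, 4, 2])  # y falls in step with x
r_curve = pearson_r(x, [1, 4, 9, 4, 1])  # symmetric curve, no linear trend

print(round(r_up, 3), round(r_down, 3), round(r_curve, 3))
```

The third case shows why a value near 0 only rules out a linear relationship: the two lists are clearly related, but not along a straight line.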

Correlation Heatmap

The correlation plot shows the relation between the parameters.

  1. Pregnancies and Age are the parameters most correlated with Diabetes.

  2. Serum Insulin and BMI have little correlation with Diabetes.

  3. PlasmaGlucose, TricepsThickness and DiabetesPedigree have very little correlation with Diabetes.

  4. There is little correlation between DiastolicBloodPressure and Diabetes.

* These observations are specific to this dataset. The correlations may differ in other data.


A foetus needs sugars and other nutrients to grow. To make them available, the mother's body releases glucose from its stores, which raises her blood glucose levels. If this process exceeds the normal level, the glucose in the mother’s blood rises too far, leading to Gestational Diabetes Mellitus.

We can observe that the majority of the patients have 0 pregnancies and no diabetes.

Number of Diabetic patients in each pregnancy group
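A count like the one in this chart can be built by grouping patients on the number of pregnancies. The sketch below uses a handful of made-up (Pregnancies, Diabetic) pairs instead of the real rows. The label name `Diabetic` is an assumption, with 1 marking a diabetic patient and 0 a non-diabetic one.

```python
from collections import Counter

# Hypothetical (Pregnancies, Diabetic) pairs; not rows from the dataset.
rows = [(0, 0), (0, 0), (0, 1), (1, 0), (3, 1), (3, 1), (6, 1), (6, 0)]

# Patients per pregnancy group, and diabetic patients per pregnancy group.
total_per_group = Counter(preg for preg, _ in rows)
diabetic_per_group = Counter(preg for preg, diabetic in rows if diabetic == 1)

for preg in sorted(total_per_group):
    print(f"{preg} pregnancies: "
          f"{diabetic_per_group[preg]} diabetic of {total_per_group[preg]}")
```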

Plasma Glucose: 

Plasma glucose concentration at 2 hours in an oral glucose tolerance test.

Oral Glucose Tolerance Test(OGTT) -

Tests the blood sugar levels before and 2 hours after taking a drink given by the doctor.

Normal : <140 mg/dl

Prediabetes: 140mg/dl to 199 mg/dl

Diabetes: >= 200 mg/dl

The majority of the patients have an Oral Glucose Tolerance Test (OGTT) value in the range 75 - 125.

There are no patients with OGTT greater than 200. The maximum OGTT value among the patients is 192.

There are 5000 patients with Diabetes. Surprisingly, the majority of them have an OGTT value below 140.

Diastolic Blood Pressure: 

The pressure that blood exerts within the arteries between heartbeats. Normal diastolic blood pressure is 80 mmHg or below. The relationship between Diastolic BP and Diabetes is not fully understood, but the following are believed to contribute to both:

  • Obesity

  • A diet high in fat and sodium

  • Chronic inflammation

  • Sedentary lifestyle

Frequency and Probability Distribution of Diastolic Blood Pressure values in patients

Concentration of Diastolic BP and Swarm Plot of Diastolic BP

Tricep Thickness: 

Skin thickness is primarily determined by collagen content and is increased in insulin-dependent diabetes mellitus (IDDM).

Diabetic subjects, especially women, showed a significant shift toward centripetal distribution of fat. The data indicated that centripetal fat distribution is a masculine characteristic. It is suggested that in diabetes there is a disturbance of male/female hormonal balance, responsible for centripetal fat distribution in women, and for exaggeration of centripetal fat distribution in men. Furthermore, the data suggested that persons with diabetes have more total fat than their nondiabetic counterparts. 

- American Diabetes Association. 

Distribution of Tricep Thickness of patients

All the patients with Tricep Thickness greater than 60 are Diabetic

Serum Insulin:

In Type I Diabetes Mellitus, Insulin isn’t produced due to Autoimmunity.

In Type II Diabetes Mellitus, Insulin is at normal levels but the cells resist it, leading to an accumulation of glucose in the blood.

Distribution Plot of Serum Insulin levels

Patients with Diabetes


An increase in body fat is generally associated with an increase in risk of metabolic diseases such as type 2 diabetes mellitus, hypertension and dyslipidaemia. Body mass index (BMI) criteria are currently the primary focus in obesity treatment recommendations, with different treatment cutoff points based upon the presence or absence of obesity-related comorbid disease. In addition, many patients with these metabolic diseases are either overweight or obese. While these simple clinical concepts may be well-accepted among many clinicians and researchers, and assumed to be readily accessible in the medical literature, the authors are unaware of any previous reports in which data regarding the important relationship between BMI and metabolic disease are summarised in a comprehensive manner. Defining the relationship between body weight and metabolic disease is critical toward a better understanding of the underlying pathophysiological processes leading to excessive fat-related metabolic disease.

Defining the relationship between body weight and metabolic diseases is critical toward better understanding of the underlying pathophysiological processes leading to these diseases. Data from the two national surveys reported here support the common clinical observation that patients with higher BMI are at higher risk for having diabetes mellitus, hypertension and dyslipidaemia. They also confirm the converse – the majority of patients with these metabolic diseases are either overweight or obese.

  • The relationship of Body Mass Index to diabetes mellitus, hypertension and dyslipidaemia: comparison of data from two national surveys

Distribution of BMI values of the patients

  1. Patients with a BMI in the range of 25 to 29.9, i.e. 'Overweight', tend to have diabetes.

  2. Patients with a BMI value of 30 and above are more prone to diabetes.
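The BMI bands used in these observations can be written as a small lookup. The 'Overweight' band (25 to 29.9) and the 30-and-above band come from the text. The 'Underweight' and 'Normal weight' cut-offs and the label 'Obese' follow the commonly used WHO adult categories and are added here as an assumption. The function name is illustrative.

```python
def bmi_category(bmi: float) -> str:
    """Return the weight category for an adult BMI value."""
    if bmi < 18.5:
        return "Underweight"    # assumed WHO cut-off, not from the text
    if bmi < 25:
        return "Normal weight"  # assumed WHO cut-off, not from the text
    if bmi < 30:
        return "Overweight"     # 25 to 29.9, as in the text
    return "Obese"              # 30 and above, as in the text


for value in (22.0, 27.5, 31.2):
    print(value, bmi_category(value))
```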

Diabetes Pedigree:

It provides information about diabetes history in relatives and genetic relationship of those relatives with patients. Higher Pedigree Function means the patient is more likely to have diabetes.

Distribution of Diabetes Pedigree values

We can observe that a higher Diabetes Pedigree value is associated with Diabetes in this dataset.


The epidemic of type 2 diabetes is clearly linked to increasing rates of overweight and obesity in the U.S. population, but projections by the Centers for Disease Control and Prevention (CDC) suggest that even if diabetes incidence rates level off, the prevalence of diabetes will double in the next 20 years, in part due to the aging of the population. Other projections suggest that the number of cases of diagnosed diabetes in those aged ≥65 years will increase by 4.5-fold (compared to 3-fold in the total population) between 2005 and 2050.

The incidence of diabetes increases with age until about age 65 years, after which both incidence and prevalence seem to level off. As a result, older adults with diabetes may either have incident disease (diagnosed after age 65 years) or long-standing diabetes with onset in middle age or earlier.

Distribution of Age parameter

We can observe that patients above 40 years are more prone to Diabetes.

The above analysis relates strictly to the dataset mentioned. Other medical variables and lifestyle may affect the condition of Diabetes.


Nowadays life on earth has changed. People are resorting to a sedentary lifestyle with less physical work, more automation and software technology.

This contributes to several Non-Communicable Diseases.

Each condition can be a risk factor for the others, and having several such conditions together, like Obesity, Hypertension and Diabetes, increases the risk further.

Exercise is a well-established tool to prevent and combat diabetes. A strict diet plan plays a crucial role alongside exercise. Cutting down on sugary foods and drinks such as soft drinks helps to restore health.

Regular health checkups are advised so that the condition can be treated at the prediabetic stage.

Diabetes is a silent pandemic: approximately 463 million adults (20-79 years) were living with diabetes, and by 2045 this will rise to 700 million.