Foundations of Data Science Project - Diabetes Analysis


Context


Diabetes is one of the most common diseases worldwide, and the number of diabetic patients is growing every year. The main cause of diabetes remains unknown, yet scientists believe that both genetic factors and environment and lifestyle play a major role in diabetes.

A few years ago, research was carried out on a tribe in America called the Pima tribe. It was found that women in this tribe are prone to developing diabetes at an early age. Several constraints were placed on the selection of these instances from a larger database. In particular, all patients are females at least 21 years old and of the Pima tribe.


Objective


Here, we analyze different aspects of diabetes in the Pima diabetes dataset through Exploratory Data Analysis (EDA).


Data Dictionary


The dataset has the following information:

Q 1: Import the necessary libraries and briefly explain the use of each library (3 Marks)

Write your Answer here:

Ans 1:

NumPy: NumPy (Numerical Python) is used to perform a wide variety of mathematical operations on datasets. It supports large multi-dimensional arrays for data handling.

Pandas: Pandas is useful for cleaning and analyzing datasets. It can also import data efficiently from different file formats such as CSV and JSON. It is built on top of NumPy.

Matplotlib: Matplotlib is a data visualization library for plotting basic graphs such as line and bar graphs. Its pyplot module (matplotlib.pyplot) is imported to draw the plots.

Seaborn: Seaborn is a higher-level data visualization library based on Matplotlib. It provides an interface for statistical graphics and makes it easy to draw more sophisticated plots such as box plots, heatmaps and pair plots.
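A minimal sketch of the imports described in Ans 1. The aliases (np, pd, plt, sns) are the usual community convention, not something stated in this notebook, and the small array and DataFrame are made-up values used only to show the libraries working together.

```python
import numpy as np               # numerical arrays and math operations
import pandas as pd              # tabular data: loading, cleaning, analysis
import matplotlib.pyplot as plt  # basic plotting interface
import seaborn as sns            # statistical plots built on matplotlib

# Made-up values, only to confirm that the libraries are usable together.
values = np.array([1.0, 2.0, 3.0])
toy = pd.DataFrame({"x": values, "y": values ** 2})
print(toy)
```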

Q 2: Read the given dataset (1 Mark)
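The file name of the dataset is not given in this section, so the sketch below reads a small made-up CSV from an in-memory string instead. In the notebook, the same pd.read_csv call would receive the path to the dataset file. The name pima_toy and the rows are hypothetical.

```python
import io

import pandas as pd

# Made-up rows in CSV form; a stand-in for the real file, whose name is not given here.
csv_text = """Pregnancies,Glucose,BMI,Outcome
1,85,26.6,0
8,150,23.3,1
0,137,43.1,1
"""

# pd.read_csv accepts a file path or any file-like object.
pima_toy = pd.read_csv(io.StringIO(csv_text))
print(pima_toy)
```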

Q3. Show the last 10 records of the dataset. How many columns are there? (1 Mark)

Write your Answer here:

Ans 3: Total columns: 9

Q4. Show the first 10 records of the dataset (1 Mark)
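Q3 and Q4 can be answered with tail() and head(). The sketch below uses a made-up 25-row frame (the name toy and its values are hypothetical) so that the effect on the row index is easy to see.

```python
import pandas as pd

# Made-up frame with 25 rows, index 0..24.
toy = pd.DataFrame({"Glucose": range(100, 125), "Age": range(21, 46)})

last_10 = toy.tail(10)        # rows with index 15..24
first_10 = toy.head(10)       # rows with index 0..9
n_columns = len(toy.columns)  # number of columns

print(last_10)
print(first_10)
print(n_columns)
```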

Q5. What do you understand by the dimension of the dataset? Find the dimension of the pima dataframe. (1 Mark)

Write your Answer here:

Ans 5: The dimension (shape) of the dataset is a tuple of two elements. The first element is the number of rows, i.e. 768, and the second element is the number of columns, i.e. 9.

Q6. What do you understand by the size of the dataset? Find the size of the pima dataframe. (1 Mark)

Write your Answer here:

Ans 6: The size of the dataset is the total number of elements in the data (rows * columns), which is 768 * 9 = 6912.
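The relationship between shape and size can be checked on a placeholder frame with the same dimensions as reported in Ans 5. The frame below is filled with zeros and is not the Pima data; only its dimensions match.

```python
import numpy as np
import pandas as pd

# Placeholder with 768 rows and 9 columns; the zeros are filler, not real values.
frame = pd.DataFrame(np.zeros((768, 9)))

n_rows, n_cols = frame.shape  # shape is a (rows, columns) tuple
n_elements = frame.size       # size is rows * columns

print(frame.shape)  # (768, 9)
print(frame.size)   # 6912
```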

Q7. What are the data types of all the variables in the data set? (2 Marks)

Hint: Use the info() function to get all the information about the dataset.

Write your Answer here:

Ans 7: There are two data types in the given dataset, int64 and float64, and both are numerical.

1. Integer (int64) is a number without a decimal point. The dataset has 7 such columns: Pregnancies, Glucose, BloodPressure, SkinThickness, Insulin, Age, Outcome (the values of the Outcome column are 0 and 1).

2. Float (float64) is a number with a decimal point. The dataset has 2 such columns: BMI, DiabetesPedigreeFunction.
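A small sketch of how the data types can be inspected. The frame toy and its values are made up; only the column names are borrowed from the dataset.

```python
import pandas as pd

# Made-up values; whole numbers become int64 and decimals become float64.
toy = pd.DataFrame({
    "Age": [21, 29, 41],
    "BMI": [26.6, 32.0, 43.1],
    "Outcome": [0, 1, 1],
})

toy.info()         # prints column names, non-null counts and dtypes
print(toy.dtypes)  # dtype of each column as a Series

# Group the column names by dtype.
int_cols = toy.select_dtypes(include="int64").columns.tolist()
float_cols = toy.select_dtypes(include="float64").columns.tolist()
print(int_cols, float_cols)
```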

Q8. What do we mean by missing values? Are there any missing values in the pima dataframe? (2 Marks)

Write your Answer here:

Ans 8: Missing values are entries for which no information was recorded, i.e. blank/null cells in the dataset. The Pima Diabetes dataset has no missing values, since the result of the above code is False.
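The code cell referred to in Ans 8 is not shown in this section. One common way to get a single True/False answer is sketched below, on two made-up frames (complete and with_gap are hypothetical names).

```python
import numpy as np
import pandas as pd

# Two made-up frames: one complete, one with a single missing cell.
complete = pd.DataFrame({"Glucose": [85, 120], "BMI": [26.6, 33.6]})
with_gap = pd.DataFrame({"Glucose": [85, np.nan], "BMI": [26.6, 33.6]})

# isnull() marks every cell True/False; .values.any() collapses that to one answer.
complete_has_missing = bool(complete.isnull().values.any())
gap_has_missing = bool(with_gap.isnull().values.any())
print(complete_has_missing, gap_has_missing)

# Per-column counts show where the missing cells are.
missing_per_column = with_gap.isnull().sum()
print(missing_per_column)
```

Note that isnull() only detects NaN/None cells. A placeholder value such as 0 would not be counted as missing by this check.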

Q9. What do the summary statistics of the data represent? Find the summary statistics for all variables except 'Outcome' in the pima data. Take one column/variable from the output table and explain all its statistical measures. (3 Marks)

Write your Answer here:

Ans 9: 'Age' column:

1. The minimum and maximum ages of the women in the dataset are 21 and 81 respectively.
2. The mean (average) age is 33.
3. The standard deviation measures how widely the values are spread around the mean. Here the std of Age is 11.
4. The count gives the number of entries in the Age column. It has 768 entries in total.
5. Median (50%): the median age is 29 years, meaning 50% of the women are aged 29 or younger.
6. Lower quartile (25%): the value is 24, meaning 25% of the women are aged 24 or younger.
7. Upper quartile (75%): the value is 41, meaning 75% of the women are aged 41 or younger.

The mean is greater than the median, which suggests that the distribution of Age is skewed to the right.
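A minimal sketch of producing the summary table without the 'Outcome' column and of the mean-versus-median comparison used in Ans 9. The frame toy and its values are made up.

```python
import pandas as pd

# Made-up ages and outcomes, not the Pima values.
toy = pd.DataFrame({
    "Age": [21, 22, 24, 29, 33, 41, 60],
    "Outcome": [0, 0, 0, 1, 0, 1, 1],
})

# Drop 'Outcome' first, then summarize the remaining columns.
summary = toy.drop(columns="Outcome").describe()
print(summary)

# A mean above the median is a quick hint of right skew.
age_mean = toy["Age"].mean()
age_median = toy["Age"].median()
mean_above_median = bool(age_mean > age_median)
print(age_mean, age_median, mean_above_median)
```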

Q 10. Plot the distribution plot for the variable 'BloodPressure'. Write detailed observations from the plot. (2 Marks)

Write your Answer here:

Q 11. What is the 'BMI' of the person having the highest 'Glucose'? (1 Mark)

Write your Answer here:

Ans 11: The BMI of the patient with the highest Glucose level (row index 661) is 42.9.
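One way to look up a value in the row where another column is at its maximum is idxmax(), which returns the row label rather than the maximum itself. The frame toy and its values are made up.

```python
import pandas as pd

# Made-up values, not the Pima data.
toy = pd.DataFrame({
    "Glucose": [85, 150, 137, 170, 116],
    "BMI": [26.6, 23.3, 43.1, 35.0, 25.6],
})

row_label = toy["Glucose"].idxmax()           # index label of the highest Glucose
bmi_at_max_glucose = toy.loc[row_label, "BMI"]
max_glucose = toy["Glucose"].max()

print(row_label, max_glucose, bmi_at_max_glucose)
```

Keeping the row label and the maximum value in separate variables avoids confusing one for the other when reporting the result.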

Q12.

12.1 What is the mean of the variable 'BMI'?

12.2 What is the median of the variable 'BMI'?

12.3 What is the mode of the variable 'BMI'?

12.4 Are the three measures of central tendency equal?

(3 Marks)

Write your Answer here:

Ans 12: The mean, median and mode of BMI are 32.45, 32 and 32 respectively. The median and mode are equal, whereas the mean is very slightly greater, so the three measures of central tendency are close but not exactly equal.
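A short sketch of computing the three measures with pandas. The values are made up. Note that mode() returns a Series, because a column can have more than one mode, so the first entry is taken here.

```python
import pandas as pd

# Made-up BMI values, not the Pima data.
toy = pd.DataFrame({"BMI": [30.0, 32.0, 32.0, 33.5, 36.0]})

bmi_mean = toy["BMI"].mean()
bmi_median = toy["BMI"].median()
bmi_mode = toy["BMI"].mode()[0]  # first (here: only) mode

all_equal = bmi_mean == bmi_median == bmi_mode
print(round(bmi_mean, 2), bmi_median, bmi_mode, all_equal)
```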

Q13. How many women's 'Glucose' levels are above the mean level of 'Glucose'? (1 Mark)

Write your Answer here:

Ans 13: The mean Glucose level is 121.67. There are 343 women whose Glucose levels are above the mean, which is about 44.7% of the 768 women.
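Counting rows above the mean is a boolean comparison followed by a sum. The sketch uses made-up glucose values and then redoes the percentage arithmetic for the counts reported in this notebook (343 of 768).

```python
import pandas as pd

# Made-up glucose values, not the Pima data.
toy = pd.DataFrame({"Glucose": [90, 100, 110, 150, 160]})

mean_glucose = toy["Glucose"].mean()
above_mean = toy["Glucose"] > mean_glucose    # boolean Series
n_above = int(above_mean.sum())               # True counts as 1
share_above = n_above / len(toy) * 100

print(mean_glucose, n_above, share_above)

# Percentage for the counts reported in the notebook.
reported_share = 343 / 768 * 100
print(round(reported_share, 1))
```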

Q14. How many women have their 'BloodPressure' equal to the median of 'BloodPressure' and their 'BMI' less than the median of 'BMI'? (2 Marks)

Write your Answer here:

Ans 14: There are 22 women whose BloodPressure equals the median BloodPressure and whose BMI is less than the median BMI.
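Two conditions can be combined with & as long as each one is wrapped in parentheses. A minimal sketch on made-up values (the frame toy is hypothetical):

```python
import pandas as pd

# Made-up values, not the Pima data.
toy = pd.DataFrame({
    "BloodPressure": [70, 72, 72, 72, 80],
    "BMI": [25.0, 28.0, 33.0, 31.0, 40.0],
})

bp_median = toy["BloodPressure"].median()   # 72.0
bmi_median = toy["BMI"].median()            # 31.0

# Parentheses are required because & binds tighter than == and <.
mask = (toy["BloodPressure"] == bp_median) & (toy["BMI"] < bmi_median)
n_matching = int(mask.sum())

print(toy[mask])
print(n_matching)
```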

Q15. Create a pairplot for the variables 'Glucose', 'SkinThickness', and 'DiabetesPedigreeFunction'. Write your observations from the plot. (4 Marks)

Write your Answer here:

Ans 15: Between Glucose (x-axis) and DiabetesPedigreeFunction (y-axis), diabetic patients tend to have higher glucose levels, which shows up as orange dots concentrated towards the right of the graph. The diabetes pedigree function, however, is spread across its whole range.

Between SkinThickness (y-axis) and Glucose (x-axis), skin thickness values are spread across the whole range, so we cannot say that either group (diabetic or non-diabetic) has significantly thicker skin than the other.

Q16. Plot the scatterplot between 'Glucose' and 'Insulin'. Write your observations from the plot. (2 Marks)

Write your Answer here:

Ans 16: The majority of people, with both low and high glucose levels, lie in the lower part of the graph (Insulin < 300). It is interesting to see that some people with a glucose level > 140 have high insulin levels and lie at the top right of the graph, along with a couple of outliers.

Q 17. Plot the boxplot for the 'Age' variable. Are there outliers? (2 Marks)

Write your Answer here:

Ans 17: Yes, there are outliers in the data (Age > 65).
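The cut-off for outliers can be worked out by hand, assuming the boxplot uses the common 1.5 * IQR whisker rule (the default in Matplotlib and Seaborn). The quartiles 24 and 41 are the Age values reported in Ans 9; the ages series is made up to show the filter.

```python
import pandas as pd

# Age quartiles as reported in Ans 9.
q1, q3 = 24, 41
iqr = q3 - q1                  # 17

# Assumed whisker rule: points beyond 1.5 * IQR from the box are outliers.
lower_fence = q1 - 1.5 * iqr   # -1.5
upper_fence = q3 + 1.5 * iqr   # 66.5

# Made-up ages, only to show which values the rule would flag.
ages = pd.Series([21, 29, 41, 66, 67, 81])
outliers = ages[(ages < lower_fence) | (ages > upper_fence)]

print(lower_fence, upper_fence)
print(outliers.tolist())
```

With these quartiles the upper fence is 66.5, so ages of 67 and above would be drawn as outlier points, which is consistent with the outliers seen above age 65.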

Q18. Plot histograms for the 'Age' variable to understand the number of women in different age groups given whether they have diabetes or not. Explain both histograms and compare them. (3 Marks)

Ans 18: Among the women with diabetes, 166 are under age 40, which is 62% of all diabetic patients. This tells us that more than half of the diabetic women in the data already have diabetes before age 40.

Among the non-diabetic women, the majority are under age 30.
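A sketch of drawing the two histograms side by side with shared axes, so that the age groups can be compared directly, and of computing the share of diabetic women under 40. It uses Matplotlib only; the frame toy, its values and the bin edges are made up.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display; omit in a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Made-up ages and outcomes, not the Pima data.
toy = pd.DataFrame({
    "Age": [22, 25, 31, 38, 45, 52, 23, 24, 27, 29, 35, 61],
    "Outcome": [1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0],
})

diabetic = toy.loc[toy["Outcome"] == 1, "Age"]
non_diabetic = toy.loc[toy["Outcome"] == 0, "Age"]

# Same bins and shared axes make the two panels directly comparable.
bins = list(range(20, 91, 10))
fig, axes = plt.subplots(1, 2, sharex=True, sharey=True, figsize=(8, 3))
axes[0].hist(diabetic, bins=bins)
axes[0].set_title("Diabetic (Outcome = 1)")
axes[1].hist(non_diabetic, bins=bins)
axes[1].set_title("Non-diabetic (Outcome = 0)")
for ax in axes:
    ax.set_xlabel("Age")
axes[0].set_ylabel("Number of women")

# Share of diabetic women under 40, the kind of figure quoted in Ans 18.
under_40_share = (diabetic < 40).mean() * 100
print(round(under_40_share, 1))
```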

Q 19. What is the Interquartile Range of all the variables? Why is this used? Which plot visualizes the same? (2 Marks)

Write your Answer here:

Ans 19: The above are the interquartile ranges (IQR) of each variable. The IQR is the difference between the third quartile (Q3) and the first quartile (Q1), i.e. the range covered by the middle 50% of the data. It is used to measure the dispersion of the data (how widely the values are spread), and it also helps to identify outliers. A boxplot can be used to visualize the quartiles and outliers in the data.
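A minimal sketch of computing the IQR of every column at once as Q3 - Q1. The frame toy and its values are made up.

```python
import pandas as pd

# Made-up values, not the Pima data.
toy = pd.DataFrame({
    "Age": [21, 24, 29, 41, 81],
    "Glucose": [90, 100, 110, 150, 160],
})

q1 = toy.quantile(0.25)  # first quartile of each column
q3 = toy.quantile(0.75)  # third quartile of each column
iqr = q3 - q1            # one IQR per column

print(iqr)
```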

Q 20. Find and visualize the correlation matrix. Write your observations from the plot. (3 Marks)

Write your Answer here:

Ans 20: Age and Pregnancies are moderately positively correlated, with a correlation coefficient of 0.54. SkinThickness and BMI are moderately positively correlated, with a correlation coefficient of 0.53. Some pairs of variables have a weak negative correlation, such as Pregnancies vs DiabetesPedigreeFunction and Insulin vs Pregnancies. From the above data, we can infer that no pair of variables is strongly correlated, i.e. no correlation coefficient is greater than 0.7.
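A sketch of computing a correlation matrix, drawing it as an annotated heatmap, and listing the pairs above a chosen threshold. The frame toy and its values are made up (so the coefficients differ from those in Ans 20), and the 0.7 threshold is the one used in Ans 20.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display; omit in a notebook
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Made-up values, not the Pima data.
toy = pd.DataFrame({
    "Pregnancies": [0, 1, 2, 4, 6],
    "BMI": [30.0, 25.0, 35.0, 28.0, 33.0],
    "Age": [21, 25, 30, 40, 50],
})

corr = toy.corr()  # Pearson correlation by default

fig, ax = plt.subplots(figsize=(4, 3))
sns.heatmap(corr, annot=True, vmin=-1, vmax=1, cmap="coolwarm", ax=ax)

# List each pair once (upper triangle) whose |r| exceeds the threshold.
cols = list(corr.columns)
strong_pairs = [
    (a, b)
    for i, a in enumerate(cols)
    for b in cols[i + 1:]
    if abs(corr.loc[a, b]) > 0.7
]
print(strong_pairs)
```

Listing the pairs programmatically is a useful cross-check on what is read off the heatmap by eye.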