## Description

1. The dataset “weights.tsv” contains weight values (in pounds) for 20 people: 164, 158,

172, 153, 144, 156, 189, 163, 134, 159, 143, 176, 177, 162, 141, 151, 182, 185, 171,

152.

1.1. Import the data from the file weights.tsv into a Pandas Series object in Python.

(4 points)

1.2. Create a new series object with weights converted to kilograms from pounds (1

pound = 0.453592 kilograms). Round the results to two decimal places.

(4 points)

1.3. Find the mean, median, and standard deviation of both series objects using

Pandas functions. (4 points)

1.4. Plot a histogram of weight (in kilograms) using matplotlib library with 10 bins.

(4 points)

2. For this problem statement, you are given a dataset named “boston.csv”. This dataset

contains information collected by the US Census Service concerning housing in the

areas of Boston, Mass. The data was originally published by Harrison, D., & Rubinfeld,

D.L. (1978). Hedonic prices and the demand for clean air. Journal of Environmental

Economics and Management,5,81–102.

Here is the description of variables/columns in this dataset:

Column name Description

CRIM per capita crime rate by town

ZN proportion of residential land zoned for lots over 25,000 sq.ft.

Page 2 of 4

NDUS proportion of non-retail business acres per town

CHAS Charles River dummy variable

(1 if tract bounds river; 0 otherwise)

NOX nitric oxides concentration (parts per 10 million)

RM mean number of rooms per dwelling

AGE proportion of owner-occupied units built prior to 1940

DIS weighted distances to five Boston employment centres

RAD index of accessibility to radial highways

TAX full-value property-tax rate per $10,000

PTRATIO pupil-teacher ratio by town

LSTAT % lower status of the population

MEDV median value of owner-occupied homes in $1000’s

2.1. Import the dataset “boston.csv” into a Pandas dataframe and obtain the number

of rows and columns for the dataframe. (3 points)

2.2. What is the owner-occupied home value (MEDV) for the lowest nitric oxide

concentration (NOX) from the dataframe? (3 points)

2.3. Create a boxplot of per capita crime rate (CRIM) using Matplotlib. Obtain the

interquartile range for crime rate (CRIM) using Pandas functions. (4 points)

2.4. Subset all columns of the dataframe for rows with outliers of crime rate into a new

dataframe. Compare the mean of AGE between the two dataframes with respect

to crime rate, what do you interpret? (Hint: Outliers exist 1.5 times of interquartile

range above third quartile and below first quartile) (4 points)

2.5. Create scatterplot between distances to employment centers (DIS) and nitric

oxide levels (NOX). Obtain correlation index between the two columns and

interpret their relation. (4 points)

2.6. Similarly, create a scatterplot between highway accessibility index (RAD) and

property tax rate (TAX). Obtain correlation index, compare it to the scatter-plot,

and interpret the relation between RAD and TAX. Take appropriate action on the

data based on your observation. (6 points)

Page 3 of 4

3. We will be using the “tips” dataset from seaborn package for this problem statement.

This dataset contains information about restaurant bills and tips made by people

classified by their gender along with few other attributes which are self-explanatory. You

can import this dataset into a pandas dataframe as follows:

3.1. Calculate percentage of tip amounts for bill totals, rounded to two decimal places

and create a new column “tip_percent” in the same dataframe. (3 points)

3.2. For what days in the week do we have the data, and which day has the highest

bill mean? (Hint: lookup for “groupby” in pandas documentation) (3 points)

3.3. Are there more dinners or lunches? Create a dataframe with this data. Are there

more smokers during lunches or dinners? Create another dataframe with this

data. Join the two dataframes by time of day and calculate the percent of

smokers at lunch and dinner. Compare the results. (6 points)

3.4. Using the boxplot function from seaborn package, create plots on “tip” column for

Male and Female from “sex” column. Compare the boxplots and provide your

interpretation on outliers between males and females. (4 points)

3.5. Create the same boxplots as above for “tip_percent” and “sex”, for tip percent

below 70. Now compare the boxplots between male and female, which boxplot

has more outliers and which one is more symmetric? (4 points)

4. For this last problem statement, you will work on the “avocado.csv” dataset which

contains information related to avocado sales across multiple regions/cities over the

years 2015 to 2018 organised by date. The data contains 10 columns which are self

explanatory.

Page 4 of 4

4.1. Import the dataset file into a Pandas dataframe and identify the count of missing

values per column. Handle missing values based on column type and explain

your reasons behind selecting appropriate techniques. (8 points)

4.2. Convert the fields Type, Year and Region to categorical data type and subset the

dataframe to exclude region “TotalUS” and “West” and sort the dataframe by date

in ascending order. Is the mean price of an avocado higher in 2017 compared to

2016? (4 points)

4.3. Sum up the total volume of avocado sales by region and create a horizontal bar

plot using Matplotlib. Which state from the region has the highest sales of

avocados by volume? Subset the data for that state, create a histogram of mean

price and interpret it. Obtain the correlation index between mean price and total

volume for that state, what do you find? (6 points)

4.4. Provide your observations of the following timeline plot of avocado sales by

volume. Which month consistently has the highest volume of sales every year? In

general, what could be some possible reasons driving this surge in sales? (2

points)