Well done! Having spent years analyzing financial records for big banks, you have finally scratched your idealistic itch and joined the education sector. In your latest role, you've become the Chief Data Scientist for your city's school district. In this capacity, you'll be helping the school board and mayor make strategic decisions regarding future school budgets and priorities.
As a first task, you haveve been asked to analyze the district-wide standardized test results. You will be given access to every student's math and reading scores, as well as various information on the schools they attend. Your responsibility is to aggregate the data to and showcase obvious trends in school performance.
Your final report should include each of the following:
District Summary
Create a high level snapshot (in table form) of the district's key metrics, including:
Create an overview table that summarizes key metrics about each school, including:
Create a table that highlights the top 5 performing schools based on Overall Passing Rate. Include:
Create a table that highlights the bottom 5 performing schools based on Overall Passing Rate. Include all of the same metrics as above.
Create a table that lists the average Math Score for students of each grade level (9th, 10th, 11th, 12th) at each school.
Create a table that lists the average Reading Score for students of each grade level (9th, 10th, 11th, 12th) at each school.
Create a table that breaks down school performances based on average Spending Ranges (Per Student). Use 4 reasonable bins to group school spending. Include in the table each of the following:
Repeat the above breakdown, but this time group schools based on a reasonable approximation of school size (Small, Medium, Large).
Repeat the above breakdown, but this time group schools based on school type (Charter vs. District).
As final considerations:
Hints and Considerations
# Import Dependencies
import csv
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
# Read in file
school_file = "PyCitySchools/raw_data/schools_complete.csv"
student_file = "PyCitySchools/raw_data/students_complete.csv"
# Store imported files as new DFs
school_df = pd.read_csv(school_file)
student_df = pd.read_csv(student_file)
## Print statements are included for checking purposes - can be deleted if neccessary
## Print rows of each DF:
# school_df
# student_df
# Rename col "name" in school_df to "school name"
school_df.rename_axis({"name" : "school name"}, axis=1, inplace=True)
# Make a copy of school_df to manipulate and reuse later throughout the code
# Store renamed school_df as new df
copy_school_sum = school_df.copy()
## Print statements are included for checking purposes - can be deleted if neccessary
## Print both DFs to verify that the columns are the same
#school_df.columns
#copy_school_sum.columns
# Rename col "school" in student_df to "school name"
# Store renamed student_df as new df
renamed_student_df = student_df.rename(columns={"school" : "school name"})
## Print statements are included for checking purposes - can be deleted if neccessary
## Print both DFs to verify that the columns are the same
#renamed_student_df.columns
## Total Schools
total_schools = school_df["school name"].count()
#print(total_schools)
## Total Students
total_students = renamed_student_df["school name"].count()
#print(total_students)
## Total Budget
total_budget = school_df["budget"].sum()
## Average Math Score
avg_math_score = renamed_student_df["math_score"].mean()
## Average Reading Score
avg_read_score = renamed_student_df["reading_score"].mean()
## % Passing Math based on 70
math_pass = renamed_student_df.loc[(student_df["math_score"] >= 70)]
count_pass_math = math_pass["math_score"].count()
#print(count_pass_math)
per_math_pass = (count_pass_math/total_students)*100
#print(per_math_pass)
## % Passing Reading based on 70
read_pass = renamed_student_df.loc[(student_df["reading_score"] >= 70)]
count_pass_read = read_pass["reading_score"].count()
#print(count_pass_read)
per_read_pass = (count_pass_read/total_students)*100
#print(per_read_pass)
## Overall Passing Rate (Average of the above two)
overall_pass = (per_math_pass + per_read_pass)/2
#print(overall_pass)
district_summary = {"Total Schools" : total_schools,
"Total Students" : total_students,
"Total Budget" : total_budget,
"Average Math Score" : avg_math_score,
"Average Reading Score" : avg_read_score,
"% Passing Math" : per_math_pass,
"% Passing Reading" : per_read_pass,
"% Overall Passing" : overall_pass
}
district_summary_df = pd.DataFrame([district_summary])
district_summary_df = district_summary_df[["Total Schools",
"Total Students","Total Budget","Average Math Score", "Average Reading Score",
"% Passing Math", "% Passing Reading", "% Overall Passing" ]]
district_summary_df
We will use copy_school_sum for this portion
# First, we will delete the School ID col from copy_school_sum because we will not need it for future use
del copy_school_sum['School ID']
# We can print copy_school_sum to verify the del
# copy_school_sum
## Calculate the Per Student Budget
## We will pull the school budget and divide it by the school size to determine the budget per student
## Then, we will create a new col named 'Per Student Budget'
copy_school_sum['Per Student Budget'] = copy_school_sum['budget']/copy_school_sum['size']
# copy_school_sum
## Average Math Score & Average Reading Score
## We will use a groupby function to group on school name and display both reading and math scores from above
## then reset the index
avg_math_read_tbl = renamed_student_df.groupby(['school name'])['reading_score', 'math_score'].mean().reset_index()
# avg_math_read_tbl
## Now, we merge avg_math_read_tbl with copy_school_sum df
## We want to merge on School Name and by 'outer' to include everything
copy_school_sum = copy_school_sum.merge(avg_math_read_tbl, on='school name', how="outer")
#copy_school_sum
## We will use conditionals to find the % Passing Math and % Passing Reading
## Then, we will store them as new variabels
## We are using a passrate of 70% or higher
# % Passing Reading
summary_passing_read = renamed_student_df[renamed_student_df['reading_score']>=70]
#% Passing Math
summary_passing_math = renamed_student_df[renamed_student_df['math_score']>=70]
#summary_passing_read
#summary_passing_math
## Count the number of students passing in reading
pass_read_count_sum = summary_passing_read.groupby(["school name"])['reading_score'].count().reset_index()
## Then, rename the column the 'reading_score' to 'Reading Count'
pass_read_count_sum.rename_axis({'reading_score' : 'Reading Count'}, axis=1, inplace=True)
#pass_read_count_sum
## Count the number of students passing in math
pass_math_count_sum = summary_passing_math.groupby(["school name"])['math_score'].count().reset_index()
## Then, rename the column the 'math_score' to 'Math Count'
pass_math_count_sum.rename_axis({'math_score' : 'Math Count'}, axis=1, inplace=True)
#pass_math_count_sum
## Merge pass_math_count_sum with pass_read_count_sum
## We want to merge on School Name and by 'inner' to include only the contents found in both columns
pass_count = pass_math_count_sum.merge(pass_read_count_sum, on="school name", how='inner')
#pass_count
## Merge copy_school_sum with pass_count we just created
## We want to merge on School Name and by 'outer' to include everything
copy_school_sum = copy_school_sum.merge(pass_count, on="school name", how='outer')
#copy_school_sum
## Calc % passing math and reading
## Take the subject count and divide by the school size, then pmultiply by 100 to get percentage
# % Passing Math
copy_school_sum['% Passing Math'] = (copy_school_sum['Math Count']/copy_school_sum['size'])*100
# % Passing Reading
copy_school_sum['% Passing Reading'] = (copy_school_sum['Reading Count']/copy_school_sum['size'])*100
#copy_school_sum
# Now, we will delete the Math Count and Reading Count cols from copy_school_sum
# because we will not need the counts for future use
del copy_school_sum['Math Count']
del copy_school_sum['Reading Count']
#copy_school_sum
## Calc % overall passing
## Overall Passing Rate (Average of the above two)
copy_school_sum['% Overall Passing'] = (copy_school_sum['% Passing Math'] + copy_school_sum['% Passing Reading'])/2
#copy_school_sum
# now, rename axis for reading and math scores to Avg. Reading Score and Avg. Math Score in copy_school_sum
copy_school_sum.rename_axis({'reading_score':'Avg. Reading Score',
'math_score':'Avg. Math Score'}, axis=1, inplace=True)
copy_school_sum
Create a table that highlights the top 5 performing schools based on Overall Passing Rate. Include:
## Create a table that highlights the top 5 performing schools based on Overall Passing Rate.
## Found the top 5 performing by sorting copy_school_sum on the '% Overall Passing' col in copy_school_sum df
## By sorting, we can find the five top performing
top_performing_by_pr_df = copy_school_sum.sort_values(by=['% Overall Passing'], ascending=False).head(5)
top_performing_by_pr_df.head(5)
Include:
## Create a table that highlights the 5 worst performing schools based on Overall Passing Rate.
## Include all of the same metrics as above.
## Using the copy_school_sum df found the bottom 5 performing schools based on Overall Passing Rate col
worst_performing_by_pr_df = copy_school_sum.sort_values(by=['% Overall Passing']).head(5)
worst_performing_by_pr_df
Create a table that lists the average Math Score for students of each grade level (9th, 10th, 11th, 12th) at each school.
# We will use the pivot table group to display the requested information
math_scores_by_grade_df = pd.pivot_table(student_df, values=['math_score'], index=['school'],
columns=['grade'])
math_scores_by_grade_df = math_scores_by_grade_df.reindex_axis(labels=['9th',
'10th',
'11th',
'12th'], axis=1, level=1)
math_scores_by_grade_df
Create a table that lists the average Reading Score for students of each grade level (9th, 10th, 11th, 12th) at each school.
# will use the pivot table group to display
read_scores_by_grade_df = pd.pivot_table(student_df, values=['reading_score'], index=['school'],
columns=['grade'])
read_scores_by_grade_df = read_scores_by_grade_df.reindex_axis(labels=['9th',
'10th',
'11th',
'12th'], axis=1, level=1)
read_scores_by_grade_df
Create a table that breaks down school performances based on average Spending Ranges (Per Student). Use 4 reasonable bins to group school spending. Include in the table each of the following:
# Copy the copy_school_sum and save as scores_by_school_spending
scores_by_school_spending = copy_school_sum.copy()
# Print columns of each to verify they're the same
# copy_school_sum
# scores_by_school_spending.columns
# Create bins - we will need labels and bins
bins = [0, 585, 615, 645, 675]
spending_labels = ['$0-585', '$586-615', '$616-645', '$646-675']
# Use bins and labels to sort through data and divide it up appropriately
# save bined data as bins_school_spending variable
bins_school_spending = pd.cut(scores_by_school_spending['Per Student Budget'], bins, labels=spending_labels)
# Convert bins_school_spending to df
bins_school_spending = pd.DataFrame(bins_school_spending)
# add Spending Level col
copy_school_sum['Spending Level'] = bins_school_spending
# Show cols for bins_school_spending to verify
#bins_school_spending.columns
# Do a groupby on Spending Level and school name
scores_by_school_spending = copy_school_sum.groupby(['Spending Level'])['Avg. Reading Score',
'Avg. Math Score',
'% Passing Reading',
'% Passing Math',
'% Overall Passing'
].mean()
scores_by_school_spending
# Create a copy of copy_school_sum and save as scores_by_school_size
scores_by_school_size = copy_school_sum.copy()
# Print scores_by_school_size to verify
# scores_by_school_size
# Create bins - we will need labels and bins
bins = [0, 1000, 2000, 5000]
size_labels = ['Small', 'Medium', 'Large']
# Use bins and labels to sort through data and divide it up appropriately
# save bin data as bins_school_size variable
bins_school_size = pd.cut(scores_by_school_size['size'], bins, labels = size_labels)
# Convert bins_school_spending to df
bins_school_size = pd.DataFrame(bins_school_size)
# add 'School Population' col
copy_school_sum['School Population'] = bins_school_size
# Show cols for bins_school_size to verify
# bins_school_size.columns
# Do a groupby on School Population and school name
scores_by_school_size = copy_school_sum.groupby(['School Population','school name'])['Avg. Reading Score',
'Avg. Math Score',
'% Passing Reading',
'% Passing Math',
'% Overall Passing'
].mean()
scores_by_school_size
# Create a copy of copy_school_sum and save as scores_by_school_type
scores_by_school_type = copy_school_sum.copy()
# Convert scores_by_school_type to df
scores_by_school_type = pd.DataFrame(scores_by_school_type)
# Print scores_by_school_type cols to verify
# scores_by_school_type.columns
# Do a groupby on School Type
scores_by_school_type = copy_school_sum.groupby(['type'])['Avg. Reading Score',
'Avg. Math Score',
'% Passing Reading',
'% Passing Math',
'% Overall Passing'
].mean()
scores_by_school_type.head()