This post covers some of the steps involved in basic exploratory data analysis in Python, primarily using the Pandas, Matplotlib, and Seaborn libraries. We won’t be diving into machine learning, modeling deep neural networks, or using gradient descent to minimize the cost function for a logistic regression model, primarily because this is my first data science project, so deep neural networks are slightly out of the question (for now)! This post is for true beginners.
If there’s anything that I would want a beginner like me to know when starting out their data science journey, it’s that reading books, articles, blogs, and watching video tutorials are all great (and necessary) things to do to expand your knowledge, but completing data science projects is the #1 method to gain applied knowledge and instill concepts that you have learned. I’m sharing my experience with hopes that other aspiring data scientists can not only relate to the pains and frustrations, but also the overwhelming feeling of success after completing a project, no matter how elementary you think it might be. Personally, I had to knock down a ginormous wall of self-doubt and completing my first project was key to doing that (not saying all self-doubt is completely gone, but I’ve learned not to be so hard on myself).
This project is the first of five as part of Flatiron School’s Online Data Science program. We were given a fictional scenario: Microsoft is looking to enter the movie industry, and we need to provide recommendations along with meaningful visualizations to present to non-technical stakeholders at Microsoft as a foundation for steps to success in the movie biz. I won’t be covering all the details of the project in this post, and the project itself is definitely not entirely comprehensive, but we’ll look at some highlights that I think may be insightful. The overall idea is to practice working with real-world data and presenting clear analytics as a result.
The data we were given were pulled from several sources: Box Office Mojo (https://www.boxofficemojo.com), IMDB (https://www.imdb.com/interfaces/), Rotten Tomatoes (https://www.rottentomatoes.com), and The Movie Database (https://www.themoviedb.org/?language=en-US). There were 11 files/tables overall that could be utilized (additional data mining/web scraping/API calls were allowed, but I decided for multiple reasons not to go that route). You may notice the data is slightly dated, so please don’t accept my recommendations as hard facts. They are solely based on analysis from this data.
Importing the necessary libraries
Along with Pandas, Matplotlib, and Seaborn, I also used NumPy and SciPy, but I won’t be focusing on those libraries here.
import pandas as pd
import numpy as np
from scipy.stats import pearsonr
import seaborn as sns
import matplotlib.pyplot as plt
Importing and previewing the data
Pandas allows you to manipulate data easily in a two-dimensional data structure in the form of rows and columns. Since these particular files were zipped CSV files, you can either unzip them manually and then import them into a Pandas DataFrame, or import them with the compression parameter as shown below.
movie_gross_df = pd.read_csv('zippedData/bom.movie_gross.csv.gz', compression='gzip')
movie_basics_df = pd.read_csv('zippedData/imdb.name.basics.csv.gz', compression='gzip')
movie_title_akas_df = pd.read_csv('zippedData/imdb.title.akas.csv.gz', compression='gzip')
movie_title_basics_df = pd.read_csv('zippedData/imdb.title.basics.csv.gz', compression='gzip')
movie_title_crew_df = pd.read_csv('zippedData/imdb.title.crew.csv.gz', compression='gzip')
movie_title_principals_df = pd.read_csv('zippedData/imdb.title.principals.csv.gz', compression='gzip')
movie_title_ratings_df = pd.read_csv('zippedData/imdb.title.ratings.csv.gz', compression='gzip')
movie_info_df = pd.read_csv('zippedData/rt.movie_info.tsv.gz', compression='gzip', sep='\t')
movie_reviews_df = pd.read_csv('zippedData/rt.reviews.tsv.gz', compression='gzip', sep='\t', encoding='cp1252')
tmdb_movies_df = pd.read_csv('zippedData/tmdb.movies.csv.gz', compression='gzip')
movie_budgets_df = pd.read_csv('zippedData/tn.movie_budgets.csv.gz', compression='gzip')
After the data was imported into Pandas DataFrames, I created a list of all the data frames and used a for loop to preview the first five rows of each with the head method and print the output of the info method, which shows the number of non-null values, the column names, and the data type of each column. You can also use the Pandas describe method to get a statistical summary of the numeric columns. Together, these methods give you an initial sense of which features you might want to engineer and potentially draw conclusions from.
all_dfs = [movie_gross_df, movie_basics_df, movie_title_akas_df, movie_title_basics_df,
           movie_title_crew_df, movie_title_principals_df, movie_title_ratings_df, movie_info_df,
           movie_reviews_df, tmdb_movies_df, movie_budgets_df]

df_names = ['movie_gross', 'movie_basics', 'movie_title_akas', 'movie_title_basics',
            'movie_title_crew', 'movie_title_principals', 'movie_title_ratings', 'movie_info',
            'movie_reviews', 'tmdb_movies', 'movie_budgets']

# Preview the first five rows and summary info of each data frame
for i in range(len(all_dfs)):
    print(df_names[i])
    print(all_dfs[i].head())
    all_dfs[i].info()
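If you haven’t used describe before, here’s a minimal sketch of what it returns, run on a toy data frame (the columns are made up for illustration, not taken from the project data):

```python
import pandas as pd

# Toy data frame standing in for one of the movie tables
toy_df = pd.DataFrame({
    'domestic_gross': [100.0, 250.0, 75.0, 410.0],
    'year': [2014, 2015, 2016, 2017],
})

# describe() returns count, mean, std, min, quartiles, and max
# for each numeric column
summary = toy_df.describe()
print(summary)

# Individual statistics are accessible by row and column label
print(summary.loc['mean', 'domestic_gross'])  # 208.75
```

This is handy for spotting suspicious values early, like a minimum gross of zero or a maximum that looks like a data-entry error.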
Cleaning the data
Cleaning the data took the majority of my time on this project. I won’t go through every detail of how I cleaned the data, but most of the data frames had missing values in one or more columns, and the right way to handle them differed from column to column. There were a few columns I dropped entirely, since most of their data was missing and, for the purposes of this project, they were essentially useless (shown below):
movie_basics_df.drop(['birth_year', 'death_year'], axis=1, inplace=True)
movie_title_akas_df.drop(['language', 'attributes', 'types'], axis=1, inplace=True)
movie_title_principals_df.drop(['job', 'characters'], axis=1, inplace=True)
Another technique was imputing missing values with the median in order to avoid affecting the distribution of the data:
movie_gross_df['domestic_gross'].fillna(movie_gross_df.domestic_gross.median(), inplace=True)
movie_gross_df['foreign_gross'].fillna(movie_gross_df.foreign_gross.median(), inplace=True)
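To see why the median is the safer choice here, consider a toy, skewed gross column (the numbers are invented): the mean is dragged upward by one blockbuster, while the median stays representative of a typical movie:

```python
import pandas as pd
import numpy as np

# Toy grosses with one extreme outlier and a missing value
gross = pd.Series([10.0, 12.0, 15.0, 900.0, np.nan])

print(gross.mean())    # 234.25 (pulled up by the outlier)
print(gross.median())  # 13.5   (robust to the outlier)

# Filling with the median keeps the imputed value typical of the column
filled = gross.fillna(gross.median())
print(filled.tolist())
```

Box office data is heavily right-skewed, so mean imputation would quietly inflate the missing rows.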
Along with filling missing values, I also needed to ensure data types were correct: converting object data types to integers or floats as necessary, converting dates to the proper datetime format, and splitting lists stored in a single column into several columns. Here’s an example of converting a column from an object data type to a float (I had to remove commas before the conversion would work):
# Remove extraneous commas
movie_gross_df['foreign_gross'].replace(',', '', regex=True, inplace=True)

# Change foreign_gross to float64
movie_gross_df['foreign_gross'] = movie_gross_df['foreign_gross'].astype('float64')
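I mentioned splitting lists stored in a single column into several columns; here’s a rough sketch of that technique using str.split with expand=True (the genres column and values below are made up for the example, not the project’s actual data):

```python
import pandas as pd

# Toy example: a genres column storing comma-separated lists as strings
df = pd.DataFrame({'title': ['Movie A', 'Movie B'],
                   'genres': ['Action,Adventure', 'Drama,Romance']})

# str.split with expand=True spreads the pieces across new columns
split_genres = df['genres'].str.split(',', expand=True)
split_genres.columns = ['genre_1', 'genre_2']

# Attach the new columns back to the original frame
df = pd.concat([df, split_genres], axis=1)
print(df)
```

Once the genres live in their own columns, you can group or filter on them individually.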
Using the Pandas merge method, I combined several data frames on shared column values or indices in order to analyze combinations of different features. In most cases, I used an inner join so that only rows with matching keys were kept. This method is similar to joining tables in SQL:
genres_df = movie_budgets_df.merge(movie_title_basics_df, left_on='movie',
                                   right_on='original_title', how='inner')
crew_df = genres_df.merge(movie_title_crew_df, on='tconst', how='inner')
combined_df = crew_df.merge(movie_basics_df, left_on='directors',
                            right_on='nconst', how='inner')
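If inner joins are new to you, here’s a tiny sketch (with invented data, not the project tables) of how merge keeps only the rows whose keys appear in both frames:

```python
import pandas as pd

left = pd.DataFrame({'movie': ['Avatar', 'Titanic', 'Gigli'],
                     'budget': [237, 200, 75]})
right = pd.DataFrame({'original_title': ['Avatar', 'Titanic'],
                      'genre': ['Sci-Fi', 'Romance']})

# Inner join: only movies present in both frames survive
merged = left.merge(right, left_on='movie', right_on='original_title',
                    how='inner')
print(merged)  # 'Gigli' is dropped because it has no match
```

A left join (how='left') would instead keep 'Gigli' with a NaN genre, which is sometimes what you want when you can’t afford to lose rows.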
Visualizing the data
Once you’ve determined the questions you want to answer with the data, making the data speak through visualizations is incredibly important. Seaborn is built on top of Matplotlib and provides a higher-level interface with more polished default styles than a standard Matplotlib pyplot. I used several visualizations throughout my project to help answer questions like “When hiring a director, how much experience do they need?” The visualization below is a Seaborn box plot showing the number of movies directed vs. net profit:
# Plot net profit vs. count of movies directed
# (column names below are assumed for illustration; substitute your own engineered features)
sns.boxplot(x='total_movies_directed', y='net_profit', data=combined_df)
plt.title('Net Profit vs. Movies Directed', fontsize=16)
plt.xlabel('Total Movies Directed')
plt.ylabel('Net Profit (in billions)')
plt.show()
It seems that directors who have directed four or more movies have a positive net profit. Knowing this, let’s use that as a filter and look at the top 25 directors by net profit with a bar plot:
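As a rough sketch of how that filter might be built in Pandas (the data frame and column names here are invented for illustration, not the project’s actual ones):

```python
import pandas as pd

# Toy director-level data: one row per director
directors = pd.DataFrame({
    'director': ['A', 'B', 'C', 'D'],
    'movies_directed': [6, 2, 4, 1],
    'net_profit': [900, 150, 500, 40],
})

# Keep only directors with four or more movies,
# then rank by net profit and take the top 25
experienced = directors[directors['movies_directed'] >= 4]
top = experienced.sort_values('net_profit', ascending=False).head(25)
print(top)
```

The boolean mask plus sort_values pattern is the bread and butter of this kind of “filter, then rank” question.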
# Plot bar graph (top_25_directors is an assumed variable holding the filtered, sorted results)
sns.barplot(x='director', y='net_profit', data=top_25_directors)
plt.xticks(rotation=90)
plt.ylabel('Average Profit (in 100 millions)')
plt.title('Top 25 Directors with Four or More Movies', fontsize=16)
plt.show()
Looks like Joss Whedon, Christopher Nolan, Michael Bay, and Jon Favreau top the list with over 500 million in net profit! (Keep in mind the data is slightly dated.)
Although I definitely didn’t touch on every aspect of this project, I hope this helps you get an initial understanding of some basic exploratory data analysis of the movie industry in Python using Pandas, Matplotlib and Seaborn. For more details on this project, feel free to check it out here: https://github.com/dbarth411/dsc-mod-1-project-v2-1-online-ds-sp-000.
As an aspiring data scientist, I happily welcome any comments, questions, or recommendations!