This analysis examines how genre, release year, and inflation-adjusted budget relate to a movie's overall rating and profit, using data from The Movie Database (TMDb), which includes information, classifications, and statistics for nearly 11,000 movies.
The questions to be answered are: which genre is the most highly rated in each year, which genre is the most profitable in each year, and whether a higher budget is associated with a higher rating.
Please note that the explanations for the executed code will precede the code itself throughout the report.
Begin by importing the libraries needed for the analysis and enabling inline plotting.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import ast
%matplotlib inline
Read the .csv file into a pandas dataframe.
df = pd.read_csv('tmdb-movies.csv')
df.head()
Examine the shape of the dataframe, finding 10,866 rows, or movies, and 21 columns of information.
df.shape
Check for and drop any duplicated rows.
df.duplicated().sum()
df.drop_duplicates(inplace = True)
Look at general statistics about the dataframe.
df.describe()
df.info()
Plot a histogram of each numeric column to examine its distribution.
df.hist(figsize = (10, 10));
Drop the columns that are not useful for this analysis. Budget may help account for higher ratings, and revenue allows us to calculate profit. The release_year, genres, budget_adj, and vote_average columns are the main columns of information we need to answer the questions posed.
df.drop(['id', 'imdb_id', 'popularity', 'original_title', 'cast', 'homepage', 'director', 'tagline', 'keywords', 'overview', 'runtime', 'production_companies', 'release_date', 'vote_count'], axis = 1, inplace = True)
Search for any missing information, like missing genre fields.
df.info()
Only 23 movies are missing genre information, so given the size of the dataset, the simplest solution is to remove these movies from the dataset for this analysis.
df.dropna(axis = 0, inplace = True, subset = ['genres'])
df.info()
Each value in the genres column of this dataframe is a single string of genre names separated by pipes (the | character).
To explore the genres, we need to divide the movies into groups based on genres. Since each movie can have multiple genres, the simplest way to analyze genre information is to include a movie in the group for each genre it has, even if that means that a movie is included in multiple dataframes.
This does limit the report, because it will not treat every combination of genres as a separate group. .nunique() shows us that there are 2,039 different combinations of genres in this dataset, which is far too many to analyze separately in this report.
df.genres.nunique()
Instead, we will create a separate dataframe for each individual genre, including a movie if the genre appears in its list of genres. To begin, we need to find each individual genre that exists in this dataset.
We start by creating a NumPy array out of the genre column of the movie dataframe.
genre_array = df.genres.values
genre_array = genre_array.astype('U')
print(genre_array.dtype)
We then split the elements of the genre array by the | delimiter and store the individual genres in a new array.
split_genre_array = np.char.split(genre_array, sep = '|')
print(split_genre_array.dtype)
print(split_genre_array)
Counting the total number of genre tags across all rows, along with the length of the longest tag, lets us preallocate a correctly sized NumPy array instead of growing one element by element.
total_words = 0
max_length = 0
for row in split_genre_array:
    for word in row:
        total_words += 1
        if len(word) > max_length:
            max_length = len(word)
print('total_words: ' + str(total_words))
print('max_length: ' + str(max_length))
Initialize a numpy array with the shape of 1 x the total number of words representing genres.
combined_genre_array = np.empty((1, total_words), dtype = ('U' + str(max_length)))
combined_genre_array.shape
Set each element in the array.
count = 0
for row in split_genre_array:
    # Each row is already a list of genre names, so it can be iterated directly.
    for word in row:
        combined_genre_array[0, count] = word
        count += 1
print(combined_genre_array)
Finally, find each individual genre in the combined_genre_array.
genre_list = np.unique(combined_genre_array)
print(genre_list)
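The same list of individual genres can also be obtained without a hand-built array. The following is a minimal sketch, not the method used in this report, assuming a small stand-in dataframe (toy_df and toy_genre_list are hypothetical names) with the same pipe-delimited genres column.

```python
import pandas as pd

# Toy stand-in for the real dataframe; only the pipe-delimited 'genres' column matters here.
toy_df = pd.DataFrame({
    'genres': ['Action|Adventure', 'Drama', 'Comedy|Drama|Romance', 'Action'],
})

# Split each string on '|' into a list, then give every genre tag its own row.
exploded_genres = toy_df['genres'].str.split('|').explode()

# Sorted unique genre names, comparable to np.unique on the combined array.
toy_genre_list = sorted(exploded_genres.unique())
print(toy_genre_list)
```

On the real data, the same two calls applied to df['genres'] should yield the genres found above.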
Now that we have a list of all of the genres in the dataset, we can create a new dataframe for each genre.
#_df = df.loc[df.genres.str.contains('')]
action_df = df.loc[df.genres.str.contains('Action')]
adventure_df = df.loc[df.genres.str.contains('Adventure')]
animation_df = df.loc[df.genres.str.contains('Animation')]
comedy_df = df.loc[df.genres.str.contains('Comedy')]
crime_df = df.loc[df.genres.str.contains('Crime')]
documentary_df = df.loc[df.genres.str.contains('Documentary')]
drama_df = df.loc[df.genres.str.contains('Drama')]
family_df = df.loc[df.genres.str.contains('Family')]
fantasy_df = df.loc[df.genres.str.contains('Fantasy')]
foreign_df = df.loc[df.genres.str.contains('Foreign')]
history_df = df.loc[df.genres.str.contains('History')]
horror_df = df.loc[df.genres.str.contains('Horror')]
music_df = df.loc[df.genres.str.contains('Music')]
mystery_df = df.loc[df.genres.str.contains('Mystery')]
romance_df = df.loc[df.genres.str.contains('Romance')]
science_fiction_df = df.loc[df.genres.str.contains('Science Fiction')]
tv_movie_df = df.loc[df.genres.str.contains('TV Movie')]
thriller_df = df.loc[df.genres.str.contains('Thriller')]
war_df = df.loc[df.genres.str.contains('War')]
western_df = df.loc[df.genres.str.contains('Western')]
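As an alternative to one hand-written variable per genre, the per-genre dataframes can be kept in a dictionary keyed by genre name. This is only a sketch on toy data (toy_df, toy_genre_list, and genre_dfs are hypothetical names). It also matches whole genre names rather than substrings, which avoids accidental matches if one genre name were ever contained in another.

```python
import pandas as pd

# Toy stand-in for the movie dataframe.
toy_df = pd.DataFrame({
    'genres': ['Action|Adventure', 'Drama', 'Comedy|Drama', 'Action'],
    'vote_average': [6.5, 7.1, 5.9, 6.0],
})
toy_genre_list = ['Action', 'Adventure', 'Comedy', 'Drama']

# Split once so membership is tested against whole genre names, not substrings.
genre_lists = toy_df['genres'].str.split('|')

# One dataframe per genre, keyed by the genre name.
genre_dfs = {
    genre: toy_df[genre_lists.apply(lambda names: genre in names)]
    for genre in toy_genre_list
}

print(len(genre_dfs['Action']))
print(genre_dfs['Drama']['vote_average'].mean())
```

A dictionary lookup such as genre_dfs['Drama'] would then take the place of building a variable name as a string and evaluating it.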
Let's see how many movies of each genre are in this dataset by getting the length of the index of each genre dataframe, and then sort the counts from highest to lowest.
genre_count = []
for genre in genre_list:
    temp = genre.lower() + "_df"
    temp = temp.replace(" ", "_")
    number = len(eval(temp + '.index'))
    genre_count.append([temp[:-3], number])
genre_count.sort(key = lambda x:x[1], reverse = True)
for index, genre in enumerate(genre_count):
    genre_count[index] = [genre[0].title().replace("_", " "), genre[1]]
print(genre_count)
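The per-genre counts can also be cross-checked in a single expression. The sketch below assumes a toy dataframe (toy_df and toy_counts are hypothetical names) and counts how many movies carry each genre tag, sorted from most to least common.

```python
import pandas as pd

# Toy stand-in for the movie dataframe.
toy_df = pd.DataFrame({
    'genres': ['Action|Adventure', 'Drama', 'Comedy|Drama', 'Action|Drama'],
})

# One row per (movie, genre) pair, then count the occurrences of each genre.
toy_counts = toy_df['genres'].str.split('|').explode().value_counts()
print(toy_counts)
```

Because a movie is counted once for every genre it has, the counts sum to the total number of genre tags rather than to the number of movies.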
Now let's visualize this information.
plt.figure(figsize=(20, 10))
x, y = [*zip(*genre_count)]
graph = plt.bar(x, y)
plt.xticks(rotation = 'vertical')
plt.xlabel('Genres')
plt.ylabel('Number of Movies')
# Place the values above the bars.
for p in graph.patches:
    plt.annotate(p.get_height(), (p.get_x() + p.get_width() / 2, p.get_height()), ha = 'center', va = 'center', fontsize = 11, xytext = (0, 10), textcoords = 'offset points')
plt.show()
genre, amount = [*zip(*genre_count)]
plt.figure(figsize = (10, 10))
plt.title('Proportion of the Total Number of Movies in Each Genre')
plt.pie(amount, labels = genre, textprops = {'fontsize': 10})
plt.show()
From these charts, it is easy to see which genres have the most movies. The top three genres are Drama, Comedy, and Thriller.
Before we continue visualizing data, let's go ahead and create more data from the dataframes and store it in an organized fashion for later.
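The information we want to calculate for each genre is the mean vote_average, budget_adj, revenue_adj, and profit, both overall and for each release year.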
To begin getting all of this information, we need to know the full range of years in the overall dataset.
years = df['release_year'].unique()
years.sort()
print(years)
There are 56 years, from 1960 to 2015, in this dataset.
We will create a dataframe to store all of the results we need with rows for each calculation and columns for each genre.
result_rows = ['vote_average_mean', 'budget_adj_mean', 'revenue_adj_mean', 'profit_adj_mean']
for year in years:
    result_rows.append(str(year) + '_vote_average')
    result_rows.append(str(year) + '_budget_adj')
    result_rows.append(str(year) + '_revenue_adj')
    result_rows.append(str(year) + '_profit_adj')
results_df = pd.DataFrame(index = result_rows, columns = genre_list)
for genre in genre_list:
    # Look up the dataframe of the current genre's movies by its variable name.
    genre_df = eval(genre.lower().replace(' ', '_') + '_df')
    # Set the overall means for the vote_average, budget_adj, and revenue_adj columns, and calculate the overall profit mean.
    results_df.at['vote_average_mean', str(genre)] = genre_df['vote_average'].mean()
    results_df.at['budget_adj_mean', str(genre)] = genre_df['budget_adj'].mean()
    results_df.at['revenue_adj_mean', str(genre)] = genre_df['revenue_adj'].mean()
    results_df.at['profit_adj_mean', str(genre)] = (genre_df['revenue_adj'].mean() - genre_df['budget_adj'].mean())
    # Set these four values for each year.
    for year in years:
        temp_year_df = genre_df.loc[genre_df['release_year'] == year]
        results_df.at[str(year) + '_vote_average', str(genre)] = temp_year_df['vote_average'].mean()
        results_df.at[str(year) + '_budget_adj', str(genre)] = temp_year_df['budget_adj'].mean()
        results_df.at[str(year) + '_revenue_adj', str(genre)] = temp_year_df['revenue_adj'].mean()
        results_df.at[str(year) + '_profit_adj', str(genre)] = (temp_year_df['revenue_adj'].mean() - temp_year_df['budget_adj'].mean())
results_df.info()
Convert the results_df dtype from object to float.
results_df = results_df.apply(pd.to_numeric, axis = 1, errors = 'coerce')
results_df.info()
results_df
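The overall and yearly means per genre can also be computed with a groupby on a "long" table that has one row per movie and genre. This is a sketch on toy data rather than the approach taken in this report. The names toy_df, long_df, overall_means, and yearly_means are hypothetical, and profit is taken here per movie as revenue_adj minus budget_adj.

```python
import pandas as pd

# Toy stand-in for the movie dataframe.
toy_df = pd.DataFrame({
    'genres': ['Action|Adventure', 'Drama', 'Action|Drama'],
    'release_year': [2014, 2014, 2015],
    'vote_average': [6.0, 7.0, 8.0],
    'budget_adj': [100.0, 20.0, 60.0],
    'revenue_adj': [300.0, 50.0, 90.0],
})

# Add a per-movie profit column, then produce one row per (movie, genre) pair.
long_df = toy_df.assign(
    profit_adj=toy_df['revenue_adj'] - toy_df['budget_adj'],
    genre=toy_df['genres'].str.split('|'),
).explode('genre')

value_columns = ['vote_average', 'budget_adj', 'revenue_adj', 'profit_adj']

# Mean of each value per genre, and per (year, genre) combination.
overall_means = long_df.groupby('genre')[value_columns].mean()
yearly_means = long_df.groupby(['release_year', 'genre'])[value_columns].mean()

print(overall_means)
print(yearly_means)
```

When no values are missing, the mean of the per-movie profit equals the mean revenue minus the mean budget, so this agrees with the subtraction of means used above.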
We can now use these results to find which genre had the highest mean vote_average in each year.
results_max = results_df.idxmax(axis = 1)
vote_average_max = results_max[0::4]
budget_average_max = results_max[1::4]
revenue_average_max = results_max[2::4]
profit_average_max = results_max[3::4]
print(vote_average_max)
# Use [1:] to avoid counting the overall mean in the graphs.
vote_x = vote_average_max[1:].value_counts().index.values.tolist()
vote_y = vote_average_max[1:].value_counts().tolist()
graph = plt.bar(vote_x, vote_y)
plt.title('Number of Years Genres Have the Highest Vote Average')
plt.xticks(rotation = 'vertical')
plt.yticks([0, 5, 10, 15, 20, 25, 30, 35])
plt.xlabel('Movie Genres')
plt.ylabel('Number of Years')
# Place the values above the bars.
for p in graph.patches:
    plt.annotate(p.get_height(), (p.get_x() + p.get_width() / 2, p.get_height()), ha = 'center', va = 'center', fontsize = 11, xytext = (0, 10), textcoords = 'offset points')
plt.show()
plt.figure(figsize = (10, 10))
plt.title('Proportion of Years Genres Have the Highest Vote Average')
plt.pie(vote_y, labels = vote_x, textprops = {'fontsize': 10})
plt.show()
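To make the row slicing above easier to follow, here is a small sketch of the same idea on a toy results table with only two metrics per block, so the stride is 2 instead of 4. All names and numbers here (toy_results, toy_vote_max, and so on) are made up for illustration.

```python
import pandas as pd

# Rows alternate between the two metrics: the first block holds the overall means,
# and each later block holds one year.
toy_results = pd.DataFrame(
    {
        'Drama': [6.5, 30.0, 7.0, 25.0, 6.0, 35.0],
        'Adventure': [6.0, 80.0, 6.2, 90.0, 6.4, 70.0],
    },
    index=[
        'vote_average_mean', 'budget_adj_mean',
        '2014_vote_average', '2014_budget_adj',
        '2015_vote_average', '2015_budget_adj',
    ],
)

# For every row, the column (genre) holding the largest value.
toy_max = toy_results.idxmax(axis=1)

# Every second row starting at 0 is a vote_average row; starting at 1, a budget_adj row.
toy_vote_max = toy_max.iloc[0::2]
toy_budget_max = toy_max.iloc[1::2]

# Skip the first entry (the overall mean) and count how many years each genre wins.
vote_year_counts = toy_vote_max.iloc[1:].value_counts()
budget_year_counts = toy_budget_max.iloc[1:].value_counts()
print(vote_year_counts)
print(budget_year_counts)
```

With four metrics per block, as in results_df, the same pattern uses a stride of 4 and start offsets 0 through 3.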
We will then do the same for the mean budget_adj values.
print(budget_average_max)
# Use [1:] to avoid counting the overall mean in the graphs.
budget_x = budget_average_max[1:].value_counts().index.values.tolist()
budget_y = budget_average_max[1:].value_counts().tolist()
graph = plt.bar(budget_x, budget_y)
plt.title('Number of Years Genres Have the Highest Average Budget')
plt.xticks(rotation = 'vertical')
plt.yticks([0, 5, 10, 15])
plt.xlabel('Movie Genres')
plt.ylabel('Number of Years')
# Place the values above the bars.
for p in graph.patches:
    plt.annotate(p.get_height(), (p.get_x() + p.get_width() / 2, p.get_height()), ha = 'center', va = 'center', fontsize = 11, xytext = (0, 6), textcoords = 'offset points')
plt.show()
plt.figure(figsize = (10, 10))
plt.title('Proportion of Years Genres Have the Highest Average Budget')
plt.pie(budget_y, labels = budget_x, textprops = {'fontsize': 10})
plt.show()
Then for the average revenue_adj values.
print(revenue_average_max)
# Use [1:] to avoid counting the overall mean in the graphs.
revenue_x = revenue_average_max[1:].value_counts().index.values.tolist()
revenue_y = revenue_average_max[1:].value_counts().tolist()
graph = plt.bar(revenue_x, revenue_y)
plt.title('Number of Years Genres Have the Highest Average Revenue')
plt.xticks(rotation = 'vertical')
plt.yticks([0, 5, 10, 15, 20])
plt.xlabel('Movie Genres')
plt.ylabel('Number of Years')
# Place the values above the bars.
for p in graph.patches:
    plt.annotate(p.get_height(), (p.get_x() + p.get_width() / 2, p.get_height()), ha = 'center', va = 'center', fontsize = 11, xytext = (0, 10), textcoords = 'offset points')
plt.show()
plt.figure(figsize = (10, 10))
plt.title('Proportion of Years Genres Have the Highest Average Revenue')
plt.pie(revenue_y, labels = revenue_x, textprops = {'fontsize': 10})
plt.show()
The Documentary genre is the most highly rated movie genre on average, and the Adventure genre has the highest budgets and revenues according to this dataset.
Let's see which movie genres are the most profitable per year on average and then visualize the overall picture.
print(profit_average_max)
# Use [1:] to avoid counting the overall mean in the graphs.
profit_x = profit_average_max[1:].value_counts().index.values.tolist()
profit_y = profit_average_max[1:].value_counts().tolist()
graph = plt.bar(profit_x, profit_y)
plt.title('Number of Years Genres Have the Highest Average Profit')
plt.xticks(rotation = 'vertical')
plt.yticks([0, 5, 10, 15, 20])
plt.xlabel('Movie Genres')
plt.ylabel('Number of Years')
# Place the values above the bars.
for p in graph.patches:
    plt.annotate(p.get_height(), (p.get_x() + p.get_width() / 2, p.get_height()), ha = 'center', va = 'center', fontsize = 11, xytext = (0, 10), textcoords = 'offset points')
plt.show()
plt.figure(figsize = (10, 10))
plt.title('Proportion of Years Genres Have the Highest Average Profit')
plt.pie(profit_y, labels = profit_x, textprops = {'fontsize': 10})
plt.show()
Adventure movies are also the most profitable on average. Their profitability might be why so many of them are given high budgets, or the higher budgets might themselves contribute to the higher profits.
Finally, let's look at the association between a higher movie budget and a higher average rating by plotting these values for all movies in the dataset on a scatter plot.
plt.figure(figsize = (20, 20))
plt.title('Average Budget and Vote Average')
plt.xlabel("Budget (In 100,000,000's)")
plt.ylabel('Vote Average')
graph = plt.scatter(df['budget_adj'], df['vote_average'])
plt.show()
Here is the same chart with a line of best fit.
plt.figure(figsize = (20, 20))
plt.title('Average Budget and Vote Average')
plt.xlabel("Budget (In 100,000,000's)")
plt.ylabel('Vote Average')
graph = plt.scatter(df['budget_adj'], df['vote_average'])
plt.plot(df['budget_adj'], np.poly1d(np.polyfit(df['budget_adj'], df['vote_average'], 1))(df['budget_adj']), color = 'black')
plt.show()
From this scatter plot and line of best fit, we can see that a higher budget does have a slight correlation with a higher vote_average, but the absolute highest rated movies have lower budgets.
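One way to put a number on how slight such an association is would be the Pearson correlation coefficient, alongside the slope of the fitted line. The sketch below uses made-up toy arrays (toy_budget and toy_rating are hypothetical), not the TMDb values, so the printed figures say nothing about the real dataset.

```python
import numpy as np

# Made-up values for illustration only.
toy_budget = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
toy_rating = np.array([5.0, 6.5, 5.5, 7.0, 6.0])

# Pearson correlation coefficient: the off-diagonal entry of the 2 x 2 correlation matrix.
r = np.corrcoef(toy_budget, toy_rating)[0, 1]

# Slope and intercept of the degree-1 least-squares fit, as used for the line of best fit.
slope, intercept = np.polyfit(toy_budget, toy_rating, 1)

print(r, slope, intercept)
```

On the real data, the equivalent call would be np.corrcoef(df['budget_adj'], df['vote_average'])[0, 1]. A value close to 0 would be consistent with the very weak trend visible in the plot.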
This report has analyzed The Movie Database (TMDb) data to answer these three questions.
The most highly rated movie genre varied by year, but the Documentary genre had the highest average rating across the most years. This may be because documentaries are more serious productions that tend to be polished and because fewer of them are produced.
The most profitable movie genre varied by year as well, but the Adventure genre had the highest average profit across the most years. This may be because this genre also tended to have the highest budgets, and because it is one of the more popular genres with a wider audience.
Finally, a higher budget does appear to have a very slight association with a higher vote average, but more statistical analysis is needed to establish this. Additionally, both the highest and the lowest vote averages in the dataset belong to lower-budget movies.
These results suggest that, to make a more successful movie as measured by ratings, one could make a documentary with the highest budget possible, and that, to make a more successful movie as measured by profit or revenue, one could make adventure movies with higher budgets when able.
Further analysis would benefit from factoring in the vote_count values to reduce the bias from movies with very high or very low vote_average values but low vote_count values. Additionally, this analysis is limited because the yearly and overall averages were divided and compared by individual genre. Doing the same for genre pairs would give more insight into the nuances of genre popularity and success.
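As one possible way to factor vote_count into the ratings, a genre's mean rating could be weighted by the number of votes each movie received. This is a sketch on made-up numbers (toy_df is hypothetical). It would also require keeping the vote_count column, which was dropped earlier in this report.

```python
import numpy as np
import pandas as pd

# Made-up ratings: one movie has a very high rating but very few votes.
toy_df = pd.DataFrame({
    'vote_average': [9.0, 6.0, 7.0],
    'vote_count': [10, 1000, 990],
})

# Plain mean treats every movie equally, regardless of how many people voted.
plain_mean = toy_df['vote_average'].mean()

# Weighted mean lets movies with more votes contribute proportionally more.
weighted_mean = np.average(toy_df['vote_average'], weights=toy_df['vote_count'])

print(plain_mean, weighted_mean)
```

Here the sparsely rated movie pulls the plain mean upward but barely moves the weighted mean, which is the kind of outlier effect described above.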
from subprocess import call
call(['python', '-m', 'nbconvert', 'Investigate_a_Dataset.ipynb'])
McKinney, W. (2018). Python for Data Analysis: Data Wrangling with Pandas, NumPy, and IPython (2nd ed.). Sebastopol, CA: O'Reilly Media.
Code for line of best fit of a scatter plot in python. (2015, August 15). Retrieved September 15, 2018, from https://stackoverflow.com/a/31800660
The online documentation for pandas, NumPy, and Matplotlib.
Thanks go to the Python for Data Analysis textbook for teaching most of the general concepts used in this report, and to the Stack Overflow answer cited above for being the last example I needed to understand how to plot a line of best fit.