Investigating the TMDb Movie Data

Matthew Unrue, Fall 2018

Udacity Data Analyst Nanodegree Project 2

Introduction

This analysis looks into the relations that genre, release year, and budget (adjusted for inflation) have with a movies' overall rating and profit based on the data from The Movie Database (TMDb), which includes information, classifications, and statistics about nearly 11,000 movies.

The questions to be answered are:

  1. What is the most highly rated movie genre in each year?
  2. Which movie genre is the most profitable on average by year?
  3. How much did the average budget of the genres' movies appear to influence the ratings of the movies?

Please note that the explanations for the executed code will precede the code itself throughout the report.

Begin by importing the libraries for needed for analysis and set inline plotting.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import ast

%matplotlib inline

Data Wrangling

General Properties

Read the .csv file into a pandas dataframe.

In [2]:
df = pd.read_csv('tmdb-movies.csv')
df.head()
Out[2]:
id imdb_id popularity budget revenue original_title cast homepage director tagline ... overview runtime genres production_companies release_date vote_count vote_average release_year budget_adj revenue_adj
0 135397 tt0369610 32.985763 150000000 1513528810 Jurassic World Chris Pratt|Bryce Dallas Howard|Irrfan Khan|Vi... http://www.jurassicworld.com/ Colin Trevorrow The park is open. ... Twenty-two years after the events of Jurassic ... 124 Action|Adventure|Science Fiction|Thriller Universal Studios|Amblin Entertainment|Legenda... 6/9/15 5562 6.5 2015 1.379999e+08 1.392446e+09
1 76341 tt1392190 28.419936 150000000 378436354 Mad Max: Fury Road Tom Hardy|Charlize Theron|Hugh Keays-Byrne|Nic... http://www.madmaxmovie.com/ George Miller What a Lovely Day. ... An apocalyptic story set in the furthest reach... 120 Action|Adventure|Science Fiction|Thriller Village Roadshow Pictures|Kennedy Miller Produ... 5/13/15 6185 7.1 2015 1.379999e+08 3.481613e+08
2 262500 tt2908446 13.112507 110000000 295238201 Insurgent Shailene Woodley|Theo James|Kate Winslet|Ansel... http://www.thedivergentseries.movie/#insurgent Robert Schwentke One Choice Can Destroy You ... Beatrice Prior must confront her inner demons ... 119 Adventure|Science Fiction|Thriller Summit Entertainment|Mandeville Films|Red Wago... 3/18/15 2480 6.3 2015 1.012000e+08 2.716190e+08
3 140607 tt2488496 11.173104 200000000 2068178225 Star Wars: The Force Awakens Harrison Ford|Mark Hamill|Carrie Fisher|Adam D... http://www.starwars.com/films/star-wars-episod... J.J. Abrams Every generation has a story. ... Thirty years after defeating the Galactic Empi... 136 Action|Adventure|Science Fiction|Fantasy Lucasfilm|Truenorth Productions|Bad Robot 12/15/15 5292 7.5 2015 1.839999e+08 1.902723e+09
4 168259 tt2820852 9.335014 190000000 1506249360 Furious 7 Vin Diesel|Paul Walker|Jason Statham|Michelle ... http://www.furious7.com/ James Wan Vengeance Hits Home ... Deckard Shaw seeks revenge against Dominic Tor... 137 Action|Crime|Thriller Universal Pictures|Original Film|Media Rights ... 4/1/15 2947 7.3 2015 1.747999e+08 1.385749e+09

5 rows × 21 columns

Examine the shape of the dataframe, finding 10,866 rows, or movies, and 21 columns of information.

In [3]:
df.shape
Out[3]:
(10866, 21)

Check for and drop any duplicated rows.

In [4]:
df.duplicated().sum()
Out[4]:
1
In [5]:
df.drop_duplicates(inplace = True)

Look at general statistics about the dataframe.

In [6]:
df.describe()
Out[6]:
id popularity budget revenue runtime vote_count vote_average release_year budget_adj revenue_adj
count 10865.000000 10865.000000 1.086500e+04 1.086500e+04 10865.000000 10865.000000 10865.000000 10865.000000 1.086500e+04 1.086500e+04
mean 66066.374413 0.646446 1.462429e+07 3.982690e+07 102.071790 217.399632 5.975012 2001.321859 1.754989e+07 5.136900e+07
std 92134.091971 1.000231 3.091428e+07 1.170083e+08 31.382701 575.644627 0.935138 12.813260 3.430753e+07 1.446383e+08
min 5.000000 0.000065 0.000000e+00 0.000000e+00 0.000000 10.000000 1.500000 1960.000000 0.000000e+00 0.000000e+00
25% 10596.000000 0.207575 0.000000e+00 0.000000e+00 90.000000 17.000000 5.400000 1995.000000 0.000000e+00 0.000000e+00
50% 20662.000000 0.383831 0.000000e+00 0.000000e+00 99.000000 38.000000 6.000000 2006.000000 0.000000e+00 0.000000e+00
75% 75612.000000 0.713857 1.500000e+07 2.400000e+07 111.000000 146.000000 6.600000 2011.000000 2.085325e+07 3.370173e+07
max 417859.000000 32.985763 4.250000e+08 2.781506e+09 900.000000 9767.000000 9.200000 2015.000000 4.250000e+08 2.827124e+09
In [7]:
df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 10865 entries, 0 to 10865
Data columns (total 21 columns):
id                      10865 non-null int64
imdb_id                 10855 non-null object
popularity              10865 non-null float64
budget                  10865 non-null int64
revenue                 10865 non-null int64
original_title          10865 non-null object
cast                    10789 non-null object
homepage                2936 non-null object
director                10821 non-null object
tagline                 8041 non-null object
keywords                9372 non-null object
overview                10861 non-null object
runtime                 10865 non-null int64
genres                  10842 non-null object
production_companies    9835 non-null object
release_date            10865 non-null object
vote_count              10865 non-null int64
vote_average            10865 non-null float64
release_year            10865 non-null int64
budget_adj              10865 non-null float64
revenue_adj             10865 non-null float64
dtypes: float64(4), int64(6), object(11)
memory usage: 1.8+ MB

Visualize the data by column to examine the data.

In [8]:
df.hist(figsize = (10, 10));

Data Cleaning

Drop the columns that are not useful for this analysis. Budget may help account for higher ratings, and revenue allows us to calculate profit. The release_year, genres, budget_adj, and vote_average columns are the main columns of information we need to answer the questions posed.

In [9]:
df.drop(['id', 'imdb_id', 'popularity', 'original_title', 'cast', 'homepage', 'director', 'tagline', 'keywords', 'overview', 'runtime', 'production_companies', 'release_date', 'vote_count'], axis = 1, inplace = True)

Search for any missing information, like missing genre fields.

In [10]:
df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 10865 entries, 0 to 10865
Data columns (total 7 columns):
budget          10865 non-null int64
revenue         10865 non-null int64
genres          10842 non-null object
vote_average    10865 non-null float64
release_year    10865 non-null int64
budget_adj      10865 non-null float64
revenue_adj     10865 non-null float64
dtypes: float64(3), int64(3), object(1)
memory usage: 679.1+ KB

Only 23 movies are missing genre information, so given the size of the dataset, the simplest solution is to remove these movies from the dataset for this analysis.

In [11]:
df.dropna(axis = 0, inplace = True, subset = ['genres'])
In [12]:
df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 10842 entries, 0 to 10865
Data columns (total 7 columns):
budget          10842 non-null int64
revenue         10842 non-null int64
genres          10842 non-null object
vote_average    10842 non-null float64
release_year    10842 non-null int64
budget_adj      10842 non-null float64
revenue_adj     10842 non-null float64
dtypes: float64(3), int64(3), object(1)
memory usage: 677.6+ KB

Exploratory Data Analysis

The genre column in this dataframe is made up of a string of genre names separated by pipes, or the | character.

To explore the genres, we need to divide the movies into groups based on genres. Since each movie can have multiple genres, the simplest way to analyze genre information is to include a movie in the group for each genre it has, even if that means that a movie is included in multiple dataframes.

This does limit the report because it will not look at every combination of genre as separate groups. .nunique() shows us that there are 2,039 different combinations of genres in this dataset; far too many to analyze each combination separately in this report.

In [13]:
df.genres.nunique()
Out[13]:
2039

Instead, we will create separate dataframes for each individual genre, including a movie if the genre is included in its list of genres. To begin this, we need find each individual genre that exists in this dataset.

We start by creating a NumPy array out of the genre column of the movie dataframe.

In [14]:
genre_array = df.genres.values
genre_array = genre_array.astype('U')
print(genre_array.dtype)
<U51

We then split the elements of the genre array by the | delimiter and store the individual genres in a new array.

In [15]:
split_genre_array = np.core.defchararray.split(a = genre_array, sep = '|')
In [16]:
print(split_genre_array.dtype)
print(split_genre_array)
object
[list(['Action', 'Adventure', 'Science Fiction', 'Thriller'])
 list(['Action', 'Adventure', 'Science Fiction', 'Thriller'])
 list(['Adventure', 'Science Fiction', 'Thriller']) ...
 list(['Mystery', 'Comedy']) list(['Action', 'Comedy']) list(['Horror'])]

By getting a total number and the maximum length of words representing genre tags in all dataframe genre column values, we can initialize a properly sized array to utilize numpy's efficiency better.

In [17]:
total_words = 0
max_length = 0

for index, row in enumerate(split_genre_array):
    for word in split_genre_array[index]:
        total_words += 1
        if len(word) > max_length:
            max_length = len(word)
        
print('total_words: ' + str(total_words))
print('max_length: ' + str(max_length))
total_words: 26955
max_length: 15

Initialize a numpy array with the shape of 1 x the total number of words representing genres.

In [18]:
combined_genre_array = np.empty((1, total_words), dtype = ('U' + str(max_length)))
combined_genre_array.shape
Out[18]:
(1, 26955)

Set each element in the array

In [19]:
count = 0

for index, row in enumerate(split_genre_array):
    row_list = ast.literal_eval(str(row))
    
    for word in row_list:
        combined_genre_array[0, count] = word
        count += 1
        
In [20]:
print(combined_genre_array)
[['Action' 'Adventure' 'Science Fiction' ... 'Action' 'Comedy' 'Horror']]

Finally, find each individual genre in the combined_genre_array.

In [21]:
genre_list = np.unique(combined_genre_array)
print(genre_list)
['Action' 'Adventure' 'Animation' 'Comedy' 'Crime' 'Documentary' 'Drama'
 'Family' 'Fantasy' 'Foreign' 'History' 'Horror' 'Music' 'Mystery'
 'Romance' 'Science Fiction' 'TV Movie' 'Thriller' 'War' 'Western']

Now that we have a list of all of the genres in the dataset, we can create a new dataframe for each genre.

In [22]:
#_df = df.loc[df.genres.str.contains('')]
action_df = df.loc[df.genres.str.contains('Action')]
adventure_df = df.loc[df.genres.str.contains('Adventure')]
animation_df = df.loc[df.genres.str.contains('Animation')]
comedy_df = df.loc[df.genres.str.contains('Comedy')]
crime_df = df.loc[df.genres.str.contains('Crime')]
documentary_df = df.loc[df.genres.str.contains('Documentary')]
drama_df = df.loc[df.genres.str.contains('Drama')]
family_df = df.loc[df.genres.str.contains('Family')]
fantasy_df = df.loc[df.genres.str.contains('Fantasy')]
foreign_df = df.loc[df.genres.str.contains('Foreign')]
history_df = df.loc[df.genres.str.contains('History')]
horror_df = df.loc[df.genres.str.contains('Horror')]
music_df = df.loc[df.genres.str.contains('Music')]
mystery_df = df.loc[df.genres.str.contains('Mystery')]
romance_df = df.loc[df.genres.str.contains('Romance')]
science_fiction_df = df.loc[df.genres.str.contains('Science Fiction')]
tv_movie_df = df.loc[df.genres.str.contains('TV Movie')]
thriller_df = df.loc[df.genres.str.contains('Thriller')]
war_df = df.loc[df.genres.str.contains('War')]
western_df = df.loc[df.genres.str.contains('Western')]

Let's see how many of each genre is in this dataset by getting the length of the index in each genre dataframe. Then sort the values from highest to lowest.

In [23]:
genre_count = []
for genre in genre_list:
    temp = genre.lower() + "_df"
    temp = temp.replace(" ", "_")
    number = len(eval(temp + '.index'))
    
    genre_count.append([temp[:-3], number])
    
genre_count.sort(key = lambda x:x[1], reverse = True)

for index, genre in enumerate(genre_count):
    genre_count[index] = [genre[0].title().replace("_", " "), genre[1]]

print(genre_count)
[['Drama', 4760], ['Comedy', 3793], ['Thriller', 2907], ['Action', 2384], ['Romance', 1712], ['Horror', 1637], ['Adventure', 1471], ['Crime', 1354], ['Family', 1231], ['Science Fiction', 1229], ['Fantasy', 916], ['Mystery', 810], ['Animation', 699], ['Documentary', 520], ['Music', 408], ['History', 334], ['War', 270], ['Foreign', 188], ['Tv Movie', 167], ['Western', 165]]

Now let's visualize this information.

In [24]:
plt.figure(figsize=(20, 10))
x, y = [*zip(*genre_count)]
graph = plt.bar(x, y)
plt.xticks(rotation = 'vertical')
plt.xlabel('Genres')
plt.ylabel('Number of Movies')


# Place the values above the bars.
for p in graph.patches:
    plt.annotate(p.get_height(), (p.get_x() + p.get_width() / 2, p.get_height()), ha = 'center', va = 'center', fontsize = 11, xytext = (0, 10), textcoords = 'offset points')

plt.show()
In [25]:
genre, amount = [*zip(*genre_count)]
plt.figure(figsize = (10, 10))
plt.title('Proportion of the Total Number of Movies in Each Genre')
plt.pie(amount, labels = genre, textprops = {'fontsize': 10})

plt.show()

From these charts, it is easy to see which genres have the most movies. The top three genres are Drama, Comedy, and Thriller.

Before we continue visualizing data, let's go ahead and create more data from the dataframes and store it in an organized fashion for later.

The information we want to calculate is

  1. The mean of the vote_average column across all years for each dataframe.
  2. The mean of the budget_adj column across all years for each dataframe.
  3. The mean of the revenue_adj column across all years for each dataframe.
  4. The mean of the vote_average, budget_adj, revenue_adj, and the profit (revenue - budget) per year for each dataframe.

To begin getting all of this information, we need to know the full range of years in the overall dataset.

In [26]:
years = df['release_year'].unique()
years.sort()
print(years)
[1960 1961 1962 1963 1964 1965 1966 1967 1968 1969 1970 1971 1972 1973
 1974 1975 1976 1977 1978 1979 1980 1981 1982 1983 1984 1985 1986 1987
 1988 1989 1990 1991 1992 1993 1994 1995 1996 1997 1998 1999 2000 2001
 2002 2003 2004 2005 2006 2007 2008 2009 2010 2011 2012 2013 2014 2015]

There are 56 years, from 1960–2015, in this dataset.

We will create a dataframe to store all of the results we need with rows for each calculation and columns for each genre.

In [27]:
result_rows = ['vote_average_mean', 'budget_adj_mean', 'revenue_adj_mean', 'profit_adj_mean']

for year in years:
    result_rows.append(str(year) + '_vote_average')
    result_rows.append(str(year) + '_budget_adj')
    result_rows.append(str(year) + '_revenue_adj')
    result_rows.append(str(year) + '_profit_adj')
    
    
results_df = pd.DataFrame(index = result_rows, columns = genre_list)
In [28]:
for genre in genre_list:
    # Create the string to eval the dataframe of the current genre's movies.
    genre_df = genre.lower().replace(' ', '_') + '_df'
    
    # Set the overall means for the vote_average, budget_adj, and the revenure_adj columns, and calculate the overall profit mean.
    results_df.at['vote_average_mean', str(genre)] = eval(genre_df)['vote_average'].mean()
    results_df.at['budget_adj_mean', str(genre)] = eval(genre_df)['budget_adj'].mean()
    results_df.at['revenue_adj_mean', str(genre)] = eval(genre_df)['revenue_adj'].mean()
    results_df.at['profit_adj_mean', str(genre)] = (eval(genre_df)['revenue_adj'].mean() - eval(genre_df)['budget_adj'].mean())
    
    # Set these four values for each year.
    for year in years:
        temp_year_df = eval(genre_df).loc[eval(genre_df)['release_year'] == year]
        results_df.at[str(year) + '_vote_average', str(genre)] = temp_year_df['vote_average'].mean()
        results_df.at[str(year) + '_budget_adj', str(genre)] = temp_year_df['budget_adj'].mean()
        results_df.at[str(year) + '_revenue_adj', str(genre)] = temp_year_df['revenue_adj'].mean()
        results_df.at[str(year) + '_profit_adj', str(genre)] = (temp_year_df['revenue_adj'].mean() - temp_year_df['budget_adj'].mean())
    
In [29]:
results_df.info()
<class 'pandas.core.frame.DataFrame'>
Index: 228 entries, vote_average_mean to 2015_profit_adj
Data columns (total 20 columns):
Action             228 non-null object
Adventure          228 non-null object
Animation          212 non-null object
Comedy             228 non-null object
Crime              228 non-null object
Documentary        172 non-null object
Drama              228 non-null object
Family             228 non-null object
Fantasy            220 non-null object
Foreign            156 non-null object
History            220 non-null object
Horror             228 non-null object
Music              216 non-null object
Mystery            224 non-null object
Romance            228 non-null object
Science Fiction    228 non-null object
TV Movie           156 non-null object
Thriller           224 non-null object
War                220 non-null object
Western            204 non-null object
dtypes: object(20)
memory usage: 47.4+ KB

Convert the results_df dtype from object to float.

In [30]:
results_df = results_df.apply(pd.to_numeric, axis = 1, errors = 'coerce')
In [31]:
results_df.info()
<class 'pandas.core.frame.DataFrame'>
Index: 228 entries, vote_average_mean to 2015_profit_adj
Data columns (total 20 columns):
Action             228 non-null float64
Adventure          228 non-null float64
Animation          212 non-null float64
Comedy             228 non-null float64
Crime              228 non-null float64
Documentary        172 non-null float64
Drama              228 non-null float64
Family             228 non-null float64
Fantasy            220 non-null float64
Foreign            156 non-null float64
History            220 non-null float64
Horror             228 non-null float64
Music              216 non-null float64
Mystery            224 non-null float64
Romance            228 non-null float64
Science Fiction    228 non-null float64
TV Movie           156 non-null float64
Thriller           224 non-null float64
War                220 non-null float64
Western            204 non-null float64
dtypes: float64(20)
memory usage: 47.4+ KB
In [32]:
results_df
Out[32]:
Action Adventure Animation Comedy Crime Documentary Drama Family Fantasy Foreign History Horror Music Mystery Romance Science Fiction TV Movie Thriller War Western
vote_average_mean 5.787752e+00 5.940585e+00 6.403147e+00 5.905167e+00 6.124889e+00 6.908462e+00 6.165546e+00 5.997563e+00 5.863537e+00 5.981383e+00 6.410479e+00 5.337447e+00 6.480392e+00 5.946790e+00 6.042874e+00 5.665582e+00 5.788024 5.750671e+00 6.297778e+00 6.083030e+00
budget_adj_mean 3.296011e+07 4.393734e+07 2.529535e+07 1.592031e+07 2.153351e+07 6.232226e+05 1.485353e+07 2.692389e+07 3.742932e+07 2.379151e+06 2.655802e+07 7.764841e+06 1.298886e+07 1.948231e+07 1.565922e+07 2.952429e+07 296021.519438 2.068296e+07 2.893007e+07 2.449744e+07
revenue_adj_mean 9.175827e+07 1.416752e+08 8.525613e+07 4.789700e+07 5.663641e+07 2.258210e+06 4.041714e+07 8.773671e+07 1.112342e+08 1.784286e+06 4.794858e+07 2.408271e+07 4.665592e+07 5.088930e+07 4.820022e+07 8.697842e+07 349635.347523 5.528610e+07 6.998611e+07 4.610126e+07
profit_adj_mean 5.879816e+07 9.773789e+07 5.996078e+07 3.197669e+07 3.510290e+07 1.634988e+06 2.556360e+07 6.081282e+07 7.380484e+07 -5.948650e+05 2.139056e+07 1.631787e+07 3.366706e+07 3.140699e+07 3.254100e+07 5.745413e+07 53613.828085 3.460314e+07 4.105605e+07 2.160382e+07
1960_vote_average 6.050000e+00 6.680000e+00 NaN 6.312500e+00 5.900000e+00 NaN 6.276923e+00 7.033333e+00 6.650000e+00 4.900000e+00 6.180000e+00 6.642857e+00 6.600000e+00 NaN 6.350000e+00 6.766667e+00 NaN 6.816667e+00 5.500000e+00 5.950000e+00
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
2014_profit_adj 6.442447e+07 1.108748e+08 5.923856e+07 2.454596e+07 5.437214e+06 1.207010e+04 1.261628e+07 5.185299e+07 1.040524e+08 NaN 2.028777e+07 5.676366e+06 1.167274e+07 2.016965e+07 1.383936e+07 7.936568e+07 0.000000 1.854078e+07 5.582573e+07 -2.083203e+06
2015_vote_average 5.629907e+00 6.075362e+00 6.492308e+00 6.019136e+00 5.915686e+00 7.007018e+00 6.078846e+00 6.379545e+00 5.915152e+00 NaN 6.513333e+00 5.008800e+00 6.442424e+00 5.642857e+00 6.092982e+00 5.682558e+00 6.100000 5.471345e+00 6.222222e+00 6.316667e+00
2015_budget_adj 3.047847e+07 4.848398e+07 2.114820e+07 1.260273e+07 2.512320e+07 2.663157e+04 7.617659e+06 2.750590e+07 3.344422e+07 NaN 1.274506e+07 2.168255e+06 3.964362e+06 1.188333e+07 6.946804e+06 2.622459e+07 137999.939280 1.142763e+07 2.187555e+07 3.860932e+07
2015_revenue_adj 1.136456e+08 1.816888e+08 9.814404e+07 4.361023e+07 8.150678e+07 5.152083e+05 2.262774e+07 1.066024e+08 1.130950e+08 NaN 1.660334e+07 7.131260e+06 1.964576e+07 2.114214e+07 2.759893e+07 1.028802e+08 0.000000 4.166720e+07 7.217865e+07 1.056374e+08
2015_profit_adj 8.316709e+07 1.332048e+08 7.699585e+07 3.100750e+07 5.638358e+07 4.885767e+05 1.501008e+07 7.909652e+07 7.965076e+07 NaN 3.858276e+06 4.963005e+06 1.568140e+07 9.258813e+06 2.065213e+07 7.665557e+07 -137999.939280 3.023956e+07 5.030311e+07 6.702806e+07

228 rows × 20 columns

We can now use these results to find out which genre was the most popular in each year by the mean vote_average.

In [33]:
results_max = results_df.idxmax(axis = 1)
In [34]:
vote_average_max = results_max[0::4]
budget_average_max = results_max[1::4]
revenue_average_max = results_max[2::4]
profit_average_max = results_max[3::4]
print(vote_average_max)
vote_average_mean    Documentary
1960_vote_average         Family
1961_vote_average        Mystery
1962_vote_average        Western
1963_vote_average            War
1964_vote_average            War
1965_vote_average      Animation
1966_vote_average      Animation
1967_vote_average            War
1968_vote_average        Mystery
1969_vote_average          Crime
1970_vote_average      Animation
1971_vote_average         Family
1972_vote_average    Documentary
1973_vote_average         Comedy
1974_vote_average          Music
1975_vote_average    Documentary
1976_vote_average    Documentary
1977_vote_average    Documentary
1978_vote_average        Foreign
1979_vote_average          Music
1980_vote_average       TV Movie
1981_vote_average          Music
1982_vote_average      Animation
1983_vote_average        History
1984_vote_average    Documentary
1985_vote_average    Documentary
1986_vote_average      Animation
1987_vote_average        History
1988_vote_average    Documentary
1989_vote_average    Documentary
1990_vote_average    Documentary
1991_vote_average    Documentary
1992_vote_average    Documentary
1993_vote_average    Documentary
1994_vote_average    Documentary
1995_vote_average    Documentary
1996_vote_average    Documentary
1997_vote_average            War
1998_vote_average            War
1999_vote_average    Documentary
2000_vote_average    Documentary
2001_vote_average    Documentary
2002_vote_average        Western
2003_vote_average    Documentary
2004_vote_average    Documentary
2005_vote_average    Documentary
2006_vote_average    Documentary
2007_vote_average    Documentary
2008_vote_average    Documentary
2009_vote_average    Documentary
2010_vote_average            War
2011_vote_average    Documentary
2012_vote_average          Music
2013_vote_average    Documentary
2014_vote_average    Documentary
2015_vote_average    Documentary
dtype: object
In [35]:
# Use [1:] to avoid counting the overall mean in the graphs.
vote_x = vote_average_max[1:].value_counts().index.values.tolist()
vote_y = vote_average_max[1:].value_counts().tolist()
graph = plt.bar(vote_x, vote_y)

plt.title('Number of Years Genres Have the Highest Vote Average')
plt.xticks(rotation = 'vertical')
plt.yticks([0, 5, 10, 15, 20, 25, 30, 35])
plt.xlabel('Movie Genres')
plt.ylabel('Number of Years')

# Place the values above the bars.
for p in graph.patches:
    plt.annotate(p.get_height(), (p.get_x() + p.get_width() / 2, p.get_height()), ha = 'center', va = 'center', fontsize = 11, xytext = (0, 10), textcoords = 'offset points')

plt.show()
In [36]:
plt.figure(figsize = (10, 10))
plt.title('Proportion of Years Genres Have the Highest Vote Average')
plt.pie(vote_y, labels = vote_x, textprops = {'fontsize': 10})

plt.show()

We will then do the same for the mean budget_adj values.

In [37]:
print(budget_average_max)
budget_adj_mean    Adventure
1960_budget_adj      History
1961_budget_adj    Animation
1962_budget_adj          War
1963_budget_adj      History
1964_budget_adj       Family
1965_budget_adj      History
1966_budget_adj          War
1967_budget_adj          War
1968_budget_adj    Adventure
1969_budget_adj      History
1970_budget_adj      Foreign
1971_budget_adj    Animation
1972_budget_adj      Fantasy
1973_budget_adj    Animation
1974_budget_adj       Action
1975_budget_adj      Romance
1976_budget_adj       Family
1977_budget_adj          War
1978_budget_adj      Fantasy
1979_budget_adj       Action
1980_budget_adj        Music
1981_budget_adj          War
1982_budget_adj      History
1983_budget_adj      History
1984_budget_adj    Adventure
1985_budget_adj          War
1986_budget_adj      Western
1987_budget_adj      History
1988_budget_adj      Foreign
1989_budget_adj    Adventure
1990_budget_adj    Adventure
1991_budget_adj      History
1992_budget_adj      History
1993_budget_adj      Western
1994_budget_adj          War
1995_budget_adj       Action
1996_budget_adj    Adventure
1997_budget_adj          War
1998_budget_adj      Western
1999_budget_adj    Adventure
2000_budget_adj      Western
2001_budget_adj       Action
2002_budget_adj      Western
2003_budget_adj    Adventure
2004_budget_adj          War
2005_budget_adj          War
2006_budget_adj    Adventure
2007_budget_adj      Fantasy
2008_budget_adj    Adventure
2009_budget_adj    Adventure
2010_budget_adj      Western
2011_budget_adj      Western
2012_budget_adj    Adventure
2013_budget_adj      Western
2014_budget_adj    Adventure
2015_budget_adj    Adventure
dtype: object
In [38]:
# Use [1:] to avoid counting the overall mean in the graphs.
budget_x = budget_average_max[1:].value_counts().index.values.tolist()
budget_y = budget_average_max[1:].value_counts().tolist()
graph = plt.bar(budget_x, budget_y)

plt.title('Number of Years Genres Have the Highest Average Budget')
plt.xticks(rotation = 'vertical')
plt.yticks([0, 5, 10, 15])
plt.xlabel('Movie Genres')
plt.ylabel('Number of Years')

# Place the values above the bars.
for p in graph.patches:
    plt.annotate(p.get_height(), (p.get_x() + p.get_width() / 2, p.get_height()), ha = 'center', va = 'center', fontsize = 11, xytext = (0, 6), textcoords = 'offset points')

plt.show()
In [39]:
plt.figure(figsize = (10, 10))
plt.title('Proportion of Years Genres Have the Highest Average Budget')
plt.pie(budget_y, labels = budget_x, textprops = {'fontsize': 10})

plt.show()

Then for the average revenue_adj values.

In [40]:
print(revenue_average_max)
revenue_adj_mean          Adventure
1960_revenue_adj            History
1961_revenue_adj          Animation
1962_revenue_adj          Adventure
1963_revenue_adj             Action
1964_revenue_adj             Family
1965_revenue_adj              Music
1966_revenue_adj                War
1967_revenue_adj          Animation
1968_revenue_adj            Mystery
1969_revenue_adj              Crime
1970_revenue_adj                War
1971_revenue_adj             Action
1972_revenue_adj              Crime
1973_revenue_adj           Thriller
1974_revenue_adj            Western
1975_revenue_adj             Horror
1976_revenue_adj              Music
1977_revenue_adj    Science Fiction
1978_revenue_adj              Music
1979_revenue_adj          Adventure
1980_revenue_adj    Science Fiction
1981_revenue_adj          Adventure
1982_revenue_adj          Adventure
1983_revenue_adj          Adventure
1984_revenue_adj             Action
1985_revenue_adj                War
1986_revenue_adj          Adventure
1987_revenue_adj              Music
1988_revenue_adj          Animation
1989_revenue_adj             Family
1990_revenue_adj            Western
1991_revenue_adj            History
1992_revenue_adj            Western
1993_revenue_adj            Mystery
1994_revenue_adj          Animation
1995_revenue_adj            Mystery
1996_revenue_adj                War
1997_revenue_adj                War
1998_revenue_adj            Western
1999_revenue_adj          Adventure
2000_revenue_adj          Adventure
2001_revenue_adj            Fantasy
2002_revenue_adj            Fantasy
2003_revenue_adj            Fantasy
2004_revenue_adj            Fantasy
2005_revenue_adj          Adventure
2006_revenue_adj          Adventure
2007_revenue_adj            Fantasy
2008_revenue_adj          Adventure
2009_revenue_adj          Adventure
2010_revenue_adj          Adventure
2011_revenue_adj          Adventure
2012_revenue_adj          Adventure
2013_revenue_adj          Adventure
2014_revenue_adj          Adventure
2015_revenue_adj          Adventure
dtype: object
In [41]:
# Use [1:] to avoid counting the overall mean in the graphs.
revenue_x = revenue_average_max[1:].value_counts().index.values.tolist()
revenue_y = revenue_average_max[1:].value_counts().tolist()
graph = plt.bar(revenue_x, revenue_y)

plt.title('Number of Years Genres Have the Highest Average Revenue')
plt.xticks(rotation = 'vertical')
plt.yticks([0, 5, 10, 15, 20])
plt.xlabel('Movie Genres')
plt.ylabel('Number of Years')

# Place the values above the bars.
for p in graph.patches:
    plt.annotate(p.get_height(), (p.get_x() + p.get_width() / 2, p.get_height()), ha = 'center', va = 'center', fontsize = 11, xytext = (0, 10), textcoords = 'offset points')

plt.show()
In [42]:
plt.figure(figsize = (10, 10))
plt.title('Proportion of Years Genres Have the Highest Average Revenue')
plt.pie(revenue_y, labels = revenue_x, textprops = {'fontsize': 10})

plt.show()

The Documentary genre is the most highly rated movie genre on average, and the Adventure genre has the highest budgets and revenues according to this dataset.

What is the most profitable movie genre for each year?

Let's see which movie genres are the most profitible per year on average and then visualize the overarching information.

In [43]:
print(profit_average_max)
profit_adj_mean          Adventure
1960_profit_adj            History
1961_profit_adj          Animation
1962_profit_adj          Adventure
1963_profit_adj             Action
1964_profit_adj             Family
1965_profit_adj              Music
1966_profit_adj                War
1967_profit_adj          Animation
1968_profit_adj            Mystery
1969_profit_adj              Crime
1970_profit_adj                War
1971_profit_adj             Action
1972_profit_adj              Crime
1973_profit_adj           Thriller
1974_profit_adj            Western
1975_profit_adj             Horror
1976_profit_adj              Music
1977_profit_adj    Science Fiction
1978_profit_adj              Music
1979_profit_adj          Adventure
1980_profit_adj    Science Fiction
1981_profit_adj          Adventure
1982_profit_adj          Adventure
1983_profit_adj          Adventure
1984_profit_adj             Action
1985_profit_adj                War
1986_profit_adj                War
1987_profit_adj              Music
1988_profit_adj          Animation
1989_profit_adj             Family
1990_profit_adj            Western
1991_profit_adj            History
1992_profit_adj          Animation
1993_profit_adj            Mystery
1994_profit_adj          Animation
1995_profit_adj            Mystery
1996_profit_adj                War
1997_profit_adj            Romance
1998_profit_adj            Western
1999_profit_adj          Adventure
2000_profit_adj          Adventure
2001_profit_adj            Fantasy
2002_profit_adj            Fantasy
2003_profit_adj            Fantasy
2004_profit_adj            Fantasy
2005_profit_adj          Adventure
2006_profit_adj          Adventure
2007_profit_adj            Fantasy
2008_profit_adj          Adventure
2009_profit_adj          Adventure
2010_profit_adj             Family
2011_profit_adj          Adventure
2012_profit_adj          Adventure
2013_profit_adj          Adventure
2014_profit_adj          Adventure
2015_profit_adj          Adventure
dtype: object
In [44]:
# Use [1:] to avoid counting the overall mean in the graphs.
profit_x = profit_average_max[1:].value_counts().index.values.tolist()
profit_y = profit_average_max[1:].value_counts().tolist()
graph = plt.bar(profit_x, profit_y)

plt.title('Number of Years Genres Have the Highest Average Profit')
plt.xticks(rotation = 'vertical')
plt.yticks([0, 5, 10, 15, 20])
plt.xlabel('Movie Genres')
plt.ylabel('Number of Years')

# Place the values above the bars.
for p in graph.patches:
    plt.annotate(p.get_height(), (p.get_x() + p.get_width() / 2, p.get_height()), ha = 'center', va = 'center', fontsize = 11, xytext = (0, 10), textcoords = 'offset points')

plt.show()
In [45]:
plt.figure(figsize = (10, 10))
plt.title('Proportion of Years Genres Have the Highest Average Profit')
plt.pie(profit_y, labels = profit_x, textprops = {'fontsize': 10})

plt.show()

Adventure movies are also the most profitible on average. This might be why so many of them have a high budget, but this also might be because they have a higher average budget instead.

Is a higher movie budget associated with a higher average vote?

Finally, let's look at the association between a higher movie budget and a higher average rating by graphing these values for all movies in the dataset to a scatter plot.

In [46]:
plt.figure(figsize = (20, 20))
plt.title('Average Budget and Vote Average')
plt.xlabel("Budget (In 100,000,000's)")
plt.ylabel('Vote Average')
graph = plt.scatter(df['budget_adj'], df['vote_average'])

plt.show()

Here is the same chart with a line of best fit.

In [47]:
plt.figure(figsize = (20, 20))
plt.title('Average Budget and Vote Average')
plt.xlabel("Budget (In 100,000,000's)")
plt.ylabel('Vote Average')
graph = plt.scatter(df['budget_adj'], df['vote_average'])

plt.plot(df['budget_adj'], np.poly1d(np.polyfit(df['budget_adj'], df['vote_average'], 1))(df['budget_adj']), color = 'black')

plt.show()

From this scatter plot and line of best fit, we can see that a higher budget does have a slight correlation with a higher vote_average, but the absolute highest rated movies have lower budgets.

Conclusions

This report has analyzed the The Movie Database (TMDb) to determine the answers to these three questions.

  1. What is the most highly rated movie genre in each year?
  2. Which movie genre is the most profitable on average by year?
  3. How much did the average budget of the genres' movies appear to influence the ratings of the movies?

The most highly rated movie genre by year varied, but the Documentary genre was the genre that had the highest average rating across the most years. This may be because documentaries are more serious productions that tend to be polished and because less of them are produced.

The most profitible movie genre by year varied as well, but the Adventure genre was the genre that had the highest average profit across the most years. This may be because this genre also tended to have the highest budget, but also because it is one of the more popular movie genres that have a wider audience.

Finally, a higher average budget does appear to have a very slight association with a higher average vote rating, but more statistical analysis will need to be performed to prove anything. Additionally, the highest and lowest vote ratings of all the dataset were with lower budget movies.

With this information it can be determined that in order to make a more successful movie measured by voted ratings, one should make a documentary with the highest budget possible, and that to make a more successful movie by profit or revenue, one should make adventure movies with higher budgets when able.

Further analysis would be beneficial by factoring the vote_count values into the calculations to eliminate bias or statistical outliers in movies with very high or very low vote_average values but low vote_count values. Additionally, this analysis is limited because of how the averages of each year and overall were divided and compared by individual genre. Doing the same for genre pairs would give more insightful information into the nuances of genre popularity and success.

In [48]:
from subprocess import call
call(['python', '-m', 'nbconvert', 'Investigate_a_Dataset.ipynb'])
Out[48]:
4294967295

Citations

McKinney, W. (2018). Python for Data Analysis: Data Wrangling With pandas, numpy, and ipython (2nd ed.). Sebastopol, CA: OReilly Media.

Code for line of best fit of a scatter plot in python. (2015, August 15). Retrieved September 15, 2018, from https://stackoverflow.com/a/31800660

The online documentations of pandas, numpy, and matplotlib.

Thanks go to the Python for Data Analysis textbook for teaching most of the general concepts used in this report and to the above stackoverflow.com discussion for being the last example I needed to understand how to plot a line of best fit.

In [ ]: