Predicting the Box Office Success of Movies

CMSC320 Final Project

By: Sumit Nawathe, Eric Zhu

Table of Contents

  1. Introduction
  2. Data Collection and Preprocessing
    • Raw Movielens
    • Kaggle Movielens Companion
    • Combining Datasets
    • Preprocessing and Encoding
  3. Data Exploration
    • Importance of Budget
    • Differences By Genre
    • Distribution of Dates
    • Miscellaneous
  4. Modeling
    • Partitioning Data
    • k-Nearest Neighbors
    • Random Forest
    • Category Reduction
    • Neural Network
    • Summary of Models
  5. Upcoming Movie Predictions
  6. Conclusion

Section 1: Introduction

It goes without saying that movies have been a pivotal part of pop culture over the past half-century and have been influential in shaping our generation. Franchises such as Marvel, Harry Potter, and Star Wars have become household names that almost universally recognized. Yet, not every movie attains this level of success: while some niche films remain cult classics, many fall into obscurity. This has happened to even large and ambitious projects, once highly anticipated but now lost in the sea of entertainment, leading fans to wonder: why?

Our goal is to analyze whether a movie's success can be predicted, and if so, what aspects contribute to a movie's success. We will define success as a movie's box office gross revenue, since this is a easily quantifiable metric and is a strong indicator of popularity and demand.

To do this, we will collect a dataset of movie characteristics and examine it to find what factors are most influential. Then, we will construct models to predict a movie's box office success, and finally apply our model to potentially popular upcoming films (as of time of this tutorial, December 2022).

Section 2: Data Collection and Preprocessing

To do any sort of rigorous science, we first need data to analyze. We will make use of publicly-available online datasets of movie characteristics. The dataset sources are provided in their respective sections.

When reading in data, our strategy will be to initially read everything as strings, and then manually convert elements to the right types. This allows us to observe the data beforehand and prevents our libraries from making false assumptions about the structure of the data.

In [1]:
from google.colab import drive
drive.mount('/content/drive')
Mounted at /content/drive
In [2]:
%cd "/content/drive/MyDrive/CMSC 320 Final"
/content/drive/.shortcut-targets-by-id/11YmBoDmAQLT2Wr3r0yyE0hLBy6-UppHE/CMSC 320 Final

We begin by installing the Python libraries necessary for our analysis. These include libraries for data loading, graphic generation, and numerical processing. Some must be installed or upgraded before being imported; documentation on the important or nonstandard libraries will be provided as they are used.

In [3]:
!pip install calmap
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting calmap
  Downloading calmap-0.0.9-py2.py3-none-any.whl (7.1 kB)
Requirement already satisfied: pandas in /usr/local/lib/python3.8/dist-packages (from calmap) (1.3.5)
Requirement already satisfied: numpy in /usr/local/lib/python3.8/dist-packages (from calmap) (1.21.6)
Requirement already satisfied: matplotlib in /usr/local/lib/python3.8/dist-packages (from calmap) (3.2.2)
Requirement already satisfied: cycler>=0.10 in /usr/local/lib/python3.8/dist-packages (from matplotlib->calmap) (0.11.0)
Requirement already satisfied: python-dateutil>=2.1 in /usr/local/lib/python3.8/dist-packages (from matplotlib->calmap) (2.8.2)
Requirement already satisfied: pyparsing!=2.0.4,!=2.1.2,!=2.1.6,>=2.0.1 in /usr/local/lib/python3.8/dist-packages (from matplotlib->calmap) (3.0.9)
Requirement already satisfied: kiwisolver>=1.0.1 in /usr/local/lib/python3.8/dist-packages (from matplotlib->calmap) (1.4.4)
Requirement already satisfied: six>=1.5 in /usr/local/lib/python3.8/dist-packages (from python-dateutil>=2.1->matplotlib->calmap) (1.15.0)
Requirement already satisfied: pytz>=2017.3 in /usr/local/lib/python3.8/dist-packages (from pandas->calmap) (2022.6)
Installing collected packages: calmap
Successfully installed calmap-0.0.9
In [4]:
import numpy as np
import pandas as pd
import scipy
import matplotlib
import matplotlib.pyplot as plt
import json
import ast
import sklearn
import sklearn.linear_model
import calmap
import calendar
import seaborn as sns
import warnings
warnings.filterwarnings("ignore")

Section 2.1: MovieLens Dataset

Our first primary dataset comes from MovieLens. It was created by the GroupLens Research group at the University of Minnesota for use in a series of papers that attempt to build a robust movie recommendation system using tag genomes. You can learn more about the data by reading the original paper The MovieLens Datasets: History and Context.

We downloaded the stable 25M movie ratings dataset from their website. There are multiple related files; we will read the 'movies.csv' file, which lists movie names and genres, and the 'links.csv' file, which lists IMDB ID's for each movie.

In [5]:
df_raw_movies = pd.read_csv('ml-25m/movies.csv')
display(df_raw_movies.head())
movieId title genres
0 1 Toy Story (1995) Adventure|Animation|Children|Comedy|Fantasy
1 2 Jumanji (1995) Adventure|Children|Fantasy
2 3 Grumpier Old Men (1995) Comedy|Romance
3 4 Waiting to Exhale (1995) Comedy|Drama|Romance
4 5 Father of the Bride Part II (1995) Comedy
In [6]:
df_raw_links = pd.read_csv('ml-25m/links.csv')
display(df_raw_links.head())
movieId imdbId tmdbId
0 1 114709 862.0
1 2 113497 8844.0
2 3 113228 15602.0
3 4 114885 31357.0
4 5 113041 11862.0

We can combine these two files into one dataframe using the 'movieId' column, which is common among both.

In [7]:
df_raw = df_raw_movies.merge(df_raw_links, on='movieId')
display(df_raw.head())
movieId title genres imdbId tmdbId
0 1 Toy Story (1995) Adventure|Animation|Children|Comedy|Fantasy 114709 862.0
1 2 Jumanji (1995) Adventure|Children|Fantasy 113497 8844.0
2 3 Grumpier Old Men (1995) Comedy|Romance 113228 15602.0
3 4 Waiting to Exhale (1995) Comedy|Drama|Romance 114885 31357.0
4 5 Father of the Bride Part II (1995) Comedy 113041 11862.0

Since this is a production dataset, thankfully almost no preprocessing needs to be done. We only specify the column types and conclude.

In [8]:
df_raw = df_raw.astype({
  'movieId': int,
  'title': 'string',
  'genres': 'string',
  'imdbId': int,
  'tmdbId': 'Int64'
})

The strength of this dataset is its genome classification. However, this is likely far too granular for our purposes: the dataset includes a classification score for each movie against thousands of niche adjectives. This is probably not as useful before the movie is released and public opinion is known (or would require insider information). Thus, we primarily use this dataset for its robust classification of genres.

Section 2.2: Kaggle Movies Dataset

Our second dataset is The Movies Dataset on Kaggle by Rounak Banik. This is constructed as a companion to the MovieLens dataset, with much more extensive information on a smaller set of movies. We download the dataset from Kaggle and import the movie metadata file.

In [9]:
df_meta = pd.read_csv('movielens_metadata/movies_metadata.csv', dtype='string')
display(df_meta.head())
adult belongs_to_collection budget genres homepage id imdb_id original_language original_title overview ... release_date revenue runtime spoken_languages status tagline title video vote_average vote_count
0 False {'id': 10194, 'name': 'Toy Story Collection', ... 30000000 [{'id': 16, 'name': 'Animation'}, {'id': 35, '... http://toystory.disney.com/toy-story 862 tt0114709 en Toy Story Led by Woody, Andy's toys live happily in his ... ... 1995-10-30 373554033 81.0 [{'iso_639_1': 'en', 'name': 'English'}] Released <NA> Toy Story False 7.7 5415
1 False <NA> 65000000 [{'id': 12, 'name': 'Adventure'}, {'id': 14, '... <NA> 8844 tt0113497 en Jumanji When siblings Judy and Peter discover an encha... ... 1995-12-15 262797249 104.0 [{'iso_639_1': 'en', 'name': 'English'}, {'iso... Released Roll the dice and unleash the excitement! Jumanji False 6.9 2413
2 False {'id': 119050, 'name': 'Grumpy Old Men Collect... 0 [{'id': 10749, 'name': 'Romance'}, {'id': 35, ... <NA> 15602 tt0113228 en Grumpier Old Men A family wedding reignites the ancient feud be... ... 1995-12-22 0 101.0 [{'iso_639_1': 'en', 'name': 'English'}] Released Still Yelling. Still Fighting. Still Ready for... Grumpier Old Men False 6.5 92
3 False <NA> 16000000 [{'id': 35, 'name': 'Comedy'}, {'id': 18, 'nam... <NA> 31357 tt0114885 en Waiting to Exhale Cheated on, mistreated and stepped on, the wom... ... 1995-12-22 81452156 127.0 [{'iso_639_1': 'en', 'name': 'English'}] Released Friends are the people who let you be yourself... Waiting to Exhale False 6.1 34
4 False {'id': 96871, 'name': 'Father of the Bride Col... 0 [{'id': 35, 'name': 'Comedy'}] <NA> 11862 tt0113041 en Father of the Bride Part II Just when George Banks has recovered from his ... ... 1995-02-10 76578911 106.0 [{'iso_639_1': 'en', 'name': 'English'}] Released Just When His World Is Back To Normal... He's ... Father of the Bride Part II False 5.7 173

5 rows × 24 columns

Unfortunately, this dataset is not as problem-free as the previous one. If we look at the actual file, we find that in some of the descriptions, newline characters are used. However, the dataset author did not take care to mark these fields, so those rows are truncated and extra rows have the truncated data in the wrong columns. To fix this, we look at the 'id' field, which follows immediately after the description and in normal circumstances should not be null. We identify the problem rows, and then manually iterate through the columns to set the appropriate entries to what they should be, and delete the extraneous row.

In [10]:
# identify problem row indices
bad_idx = []
for idx, row in df_meta.iterrows():
  try:
    int(row.id) # will throw an exception if null
  except Exception:
    bad_idx.append(idx-1) # null on extra row, actual problem (description) on previous row

# iterate through bad indices
for idx in bad_idx:
  # fix description with newline
  i = np.where(df_meta.columns == 'overview')[0]
  df_meta.iloc[idx, i] = df_meta.iloc[idx, i] + '\n' + df_meta.iloc[idx+1, 0]

  # iterate through broken columns
  j = 1
  i += 1
  while i < len(df_meta.columns):
    # copy entries to correct columns
    df_meta.iloc[idx, i] = df_meta.iloc[idx+1, j]
    i += 1
    j += 1

# delete extraneous rows
df_meta.drop(index=list(map(lambda x: x+1, bad_idx)), inplace=True)

# reset index
df_meta.reset_index(inplace=True)
df_meta.drop(columns='index', inplace=True)

This dataset has one additional issue: some movie are repeated. Looking at the individual rows, sometimes they are pure duplicates, while other times one small piece of information is changed. We can ascertain how many times this happens by counting unique rows, whicy we identify by IMDB ID.

In [11]:
df_meta = df_meta[pd.notnull(df_meta.imdb_id)] # remove rows with null IMDB ID
print(f"Number of rows: {len(df_meta)}")
print(f"Number of unique IMDB IDs: {len(np.unique(df_meta.imdb_id))}")
print(f"Number of duplicatd IMDB IDs: {len(df_meta) - len(np.unique(df_meta.imdb_id))}")
Number of rows: 45446
Number of unique IMDB IDs: 45416
Number of duplicatd IMDB IDs: 30

There are very few instances of this, so we just remove the duplicate entries.

In [12]:
df_meta.drop_duplicates('imdb_id', inplace=True)
print(f"Number of rows after dropping duplicate IMDB IDs: {len(df_meta)}")
Number of rows after dropping duplicate IMDB IDs: 45416

Now that the data is formatted properly, we cast select column datatypes to what they really represent:

In [13]:
df_meta = df_meta.astype({
    'adult': bool,
    'budget': int,
    'id': int,
    'popularity': float,
    'revenue': 'Int64',
    'vote_average': float,
    'vote_count': int,
    'revenue': float,
    'runtime': float
})

Finally, some columns are not useful to us, such as links to movie posters or web pages, so we simply discard them.

In [14]:
df_meta.drop(columns=['homepage', 'poster_path', 'video'], inplace=True)

Section 2.3: Combining Datasets

To put these two datasets together, we need some common information. Thankfully, both datasets have the IMDB ID for each movie, in some form. In the origina dataset, it is already an interger; in the companion Kaggle dataset, it is a string with additional unnecessary components. We create a preprocessed column on the latter dataset that puts in in the same format as the former, use it to combine the datasets, and then delete the preprocessed column.

In [15]:
# create preprocessed column: stripped IMDB ID as type int
df_meta['imdb_id_proc'] = df_meta.imdb_id.map(lambda s: s[2:] if pd.notnull(s) and len(s)>2 else None)
df_meta.drop(df_meta[pd.isnull(df_meta.imdb_id_proc)].index, inplace=True)
df_meta.imdb_id_proc = df_meta.imdb_id_proc.astype(int)

# merge dataframes on IMDB ID
df = df_raw.merge(df_meta, left_on='imdbId', right_on='imdb_id_proc')

# drop preprocessed column
df.drop(columns='imdb_id_proc', inplace=True)

Both datasets had title and genre columns, but we should only keep one of each. We keep the MovieLens genres column since they have better range and accuracy, but keep the Kaggle titles because they are simpler (the MovieLens titles have the release year attached).

In [16]:
df.drop(columns=['title_x', 'genres_y'], inplace=True)
df.rename({'genres_x':'genres', 'title_y':'title'}, axis=1, inplace=True)

Section 2.4: Preprocessing and Encoding

Unfortunately, much of the data is still not in a usable state. Many of the fields contain complex datatypes and interactions, and many entries are not actually usable for our purposes. We spend time cleaning up the dataset now so that the exploration and modeling phases will go smoothly.

First, we need to deal with missing entries in both the budget and revenue. Unfortunately, the majority of rows do not have data on at least one of these, but they are crucial for our analysis. Missing entries are marked by 0s. Thus, we unfortunately must cut all of these rows. There are still several thousand rows left after this operation, so we will have enough data to model.

In [17]:
print(f"Number of rows before trimming: {len(df)}")
df = df[pd.notnull(df.revenue) & (df.revenue != 0) & pd.notnull(df.budget) & (df.budget != 0)]
print(f"Number of rows after trimming: {len(df)}")
Number of rows before trimming: 42756
Number of rows after trimming: 5318

We must also remove entries that are not marked as released, as well as those with no/zero listed runtime, since we thus cannot be sure of the quality of this data.

In [18]:
df = df[pd.notnull(df.release_date) & (df.status == 'Released')]
df = df.reset_index(drop = True)

The release date by itself is difficult for some models to parse or manipulate. We convert the release date into datetime objects, then use Panda's dt datetime accessor, which allows us to easily manipulate entire series of datetime objects, in order to extract the important components into separate columns.

In [19]:
df.release_date = pd.to_datetime(df.release_date)
df['year'] = df.release_date.dt.year
df['month'] = df.release_date.dt.month
df['day'] = df.release_date.dt.day

There are several data columns that are text entries, but which really represent some combination/specification of a (relatively) small number of categories. They are still difficult to work with in their raw (string) form -- we would much rather be able to work with some numerical type and easily see which rows fall under which category. This is made possible through one-hot encoding, which essentially makes boolean columns for each individual category and fills in whether each row belongs to that category or not. Applying this transformation will make modeling much easier down the line.

We begin by encoding genres. Each movie has a list of associated genres in a formatted string; we obtain a list of all possible genres and create an encoded column for each.

In [20]:
# preprocess empty genre lists
df['genres'] = df['genres'].apply(lambda elm: elm if elm != '(no genres listed)' else None)

# create set of all possible genres
genres_set = set(df.genres.apply(lambda s: [] if pd.isnull(s) else s.split('|')).sum())
print(f"Number of genres: {len(genres_set)}")
print(f"List of genres: \n{chr(10).join(genres_set)}")

# interate through set, create column for presence of each genre
for genre in genres_set:
  df[f"genre_{genre}"] = df.genres.apply(lambda s: False if pd.isnull(s) else (genre in s))

# remove obsolete column
df.drop(columns='genres', inplace=True)
Number of genres: 19
List of genres: 
Drama
Thriller
Adventure
Romance
Musical
Children
Western
Horror
Film-Noir
War
Sci-Fi
Crime
Comedy
IMAX
Action
Documentary
Fantasy
Mystery
Animation

Next, we encode what languages are spoken in the movie, which is likely an important feature as it affects the movie's target audience. However, there are far too many languages: some have many movies (such as English), while some have almost none. To limit our data processing and prevent models from overfitting, we restrict ourselves to only encode spoken languages that are in at least some number of movies. We choose 50 for the threshold, as it is approximately 10% of our dataset; no columns are created for languages that do not meet this threshold.

This time, the processing itself poses some unique challenges due to how the data is stored. The 'spoken languages' column is comprised of string representations of arrays of objects as the example below shows:

In [21]:
print(f"Example of spoken languages entry: \"{df.spoken_languages[0]}\"")
Example of spoken languages entry: "[{'iso_639_1': 'en', 'name': 'English'}]"

We are only interested in the abbreviation of each language in this list, but this would be a pain to parse manually. However, fortunately, this format is exactly how arrays, dictionaries, and strings are written in Python itself (perhaps the dataset author just dumped Python objects as strings). We can thus take use of the Python eval() function, which takes a string, treats it as Python code, and executes it. Applying this function to these strings will allow us to work with normal Python datastructures and access fields with ease.

In [22]:
# get set of all spoken languages
df_spok_lang_lists = df.spoken_languages.apply(lambda s: list(map(lambda o: o['iso_639_1'], eval(s))))
spok_lang_set = set(df_spok_lang_lists.sum())

# filter set to only include languages above threshold of number of movies
limited_spok_lang = set()
for lang in spok_lang_set:
    n = df_spok_lang_lists.map(lambda l: 1 if lang in l else 0).sum()
    if n > 50:
        limited_spok_lang.add(lang)
print(f"Number of selected spoken languages: {len(limited_spok_lang)}")

# create column for each chosen language
for lang in limited_spok_lang:
    df[f"spoken_lang_{lang}"] = df_spok_lang_lists.map(lambda l: (lang in l))

# remove obsolete column
df.drop(columns="spoken_languages", inplace=True)
Number of selected spoken languages: 13

We apply similar preprocessing techniques to encode the production countries of these movies, once again limiting possible countries to those that appear in above some threshold of movies.

In [23]:
# get set of all production countries
df_prod_count_lists = df.production_countries.apply(lambda s: list(map(lambda o: o['name'], eval(s))))
prod_country_set= set(df_prod_count_lists.sum())

# filter set to only include countries above threshold of number of movies
limited_prod_countries = set()
for country in prod_country_set:
    n = df_prod_count_lists.map(lambda l: 1 if country in l else 0).sum()
    if n > 50:
        limited_prod_countries.add(country)
print(f"Number of selected production countries: {len(limited_prod_countries)}")

# create column for each chosen country
for country in limited_prod_countries:
    df[f"prod_country_{country}"] = df_prod_count_lists.map(lambda l: (country in l))

# remove obsolete column
df.drop(columns='production_countries', inplace=True)
Number of selected production countries: 13

Once again, we perform the same procedure to production companies, which may be an important factor (some studios are known for producing good movies).

In [24]:
# get set of all production companies
df_prod_comp_lists = df.production_companies.apply(lambda s: list(map(lambda o: o['name'], eval(s))))
prod_company_set= set(df_prod_comp_lists.sum())

# filter set to only include companies above threshold of number of movies
limited_prod_companies = set()
for company in prod_company_set:
    n = df_prod_comp_lists.map(lambda l: (company in l)).sum()
    if n > 100: # higher cutoff, otherwise too many columns to be useful later
        limited_prod_companies.add(company)
print(f"Number of selected production companies: {len(limited_prod_companies)}")

# create column for each chosen company
for company in limited_prod_companies:
    df[f"prod_company_{company}"] = df_prod_comp_lists.map(lambda l: (company in l))

# make a other company column for companies that were not accounted for
df['prod_company_other'] = True # create a column of true
for column in df.columns:
  if column[:13] == 'prod_company_' and column != 'prod_company_other':
    df['prod_company_other'] = df['prod_company_other'] & ~df[column]

# remove obsolete column
df.drop(columns='production_companies', inplace=True)
Number of selected production companies: 11

There are a few more columns that are not important to us, which we can safely discard. Some have no impact on revenue; others are only known after a movie's release (such as popularity votes), which we must discard for our modeling paradigm.

In [25]:
df.drop(columns=['belongs_to_collection', 'popularity', 'vote_average', 'vote_count', 'movieId', 'imdbId', 'tmdbId', 'id', 'imdb_id'], inplace=True)

Our final preprocessed dataframe is as follows:

In [26]:
display(df.head())
adult budget original_language original_title overview release_date revenue runtime status tagline ... prod_company_Columbia Pictures prod_company_Columbia Pictures Corporation prod_company_Paramount Pictures prod_company_Warner Bros. prod_company_Walt Disney Pictures prod_company_Touchstone Pictures prod_company_Metro-Goldwyn-Mayer (MGM) prod_company_Relativity Media prod_company_Universal Pictures prod_company_other
0 True 30000000 en Toy Story Led by Woody, Andy's toys live happily in his ... 1995-10-30 373554033.0 81.0 Released <NA> ... False False False False False False False False False True
1 True 65000000 en Jumanji When siblings Judy and Peter discover an encha... 1995-12-15 262797249.0 104.0 Released Roll the dice and unleash the excitement! ... False False False False False False False False False True
2 True 16000000 en Waiting to Exhale Cheated on, mistreated and stepped on, the wom... 1995-12-22 81452156.0 127.0 Released Friends are the people who let you be yourself... ... False False False False False False False False False False
3 True 60000000 en Heat Obsessive master thief, Neil McCauley leads a ... 1995-12-15 187436818.0 170.0 Released A Los Angeles Crime Saga ... False False False True False False False False False False
4 True 35000000 en Sudden Death International action superstar Jean Claude Van... 1995-12-22 64350171.0 106.0 Released Terror goes into overtime. ... False False False False False False False False True False

5 rows × 71 columns

Section 3: Data Exploration

Before we try to construct a model to predict box office success, it would be good to get a feel of what factors affect a movie's financial performance, and by how much. We take a look at a few specific characteristics and use visualizations and simple statistical techniques to better understand our dataset.

Section 3.1: Importance of Budget

A movie is really a huge investment of financial and human capital. A movie's budget is likely a huge determining factor for the movie's revenue, since it dictates what actors can be cast, how much production work can be done, use of visual effects and extent of marketing, et cetera. We believe that a movie's budget will be highly correlated with its revenue: let's use statistics to back up that assertion.

In additional to graphically plotting the revenue against budget for each movie, we will also perform and plot a linear regression, which finds the best-fit line for the dataset. This is a simple type of model that is easily interpretable, and will give us information about the strength of the trend, if it exists. We use sklearn to generate this regression.

In [27]:
# create linear regression model and fit it
model_lin_reg = sklearn.linear_model.LinearRegression()
model_lin_reg.fit(np.array(df.budget).reshape(-1, 1), np.array(df.revenue))

# create scatter plot of revenue against budget
fig, ax = plt.subplots(figsize=(12, 8))
plt.scatter(df.budget, df.revenue)

# set axis scale manually
ax.get_xaxis().set_major_formatter(
    matplotlib.ticker.FuncFormatter(lambda x, p: format(int(x//1000000), ',')))
ax.get_yaxis().set_major_formatter(
    matplotlib.ticker.FuncFormatter(lambda x, p: format(int(x//1000000), ',')))

# plot linear model_lin_reg on domain inside graph
domain = np.linspace(0, df.budget.max(), 100)
plt.plot(
    domain,
    model_lin_reg.predict(domain.reshape(-1, 1)), 
    color='red', 
    label = f'(Revenue) = {model_lin_reg.coef_[0]:.4f}*(Budget) + {model_lin_reg.intercept_:.4f}'
)

# label axes
plt.xlabel('Budget in Millions')
plt.ylabel('Revenue in Millions')
plt.title('Movie Revenue Against Budget With Linear Regression')
plt.legend()
plt.show()

The model reflects the increasing trend of revenue against budget, as we would expect: a movie with higher budget is a bigger investment, which will tend to yield bigger rewards. There is a simple interpretation to the model coefficient: on average, each dollar increase in the budget (one dollar increase in investment) will yield a 3.02 dollar increase in revenue. This is a very good return on investment!

The model seems to fit moderately well. We can quantify how well by calculating the linear regression's $r^2$ value:

In [28]:
model_lin_reg.score(np.array(df.budget).reshape(-1, 1), df.revenue)
Out[28]:
0.5323277516251919

This means that over 50% of the variance in the revenue is explained by this linear relationship with the budget. Budget is therefore an extremely important (possibly the most important) factor in determining a movie's revenue.

Section 3.2: Differences by Genre

Movie genres are strong indicators a a movie's content and target audience: people tend to have a preference for certain styles, plot elements, character archetypes, and themes. Genre is also influences a movie's marketability and reach, so it likely has an effect on box office success. To begin analyzing this, we draw a violin plot of revenue for every genre in our dataset side-by-side to compare their distributions.

In [29]:
# get list of all genres
genres = list(filter(lambda c: 'genre' in c, df.columns))

# create violinplot
plt.figure(figsize=(25, 8))
ax = plt.subplot(111)
plt.violinplot([np.log(np.asarray(df[df[g]].revenue)) for g in genres], positions=range(len(genres)), widths=1, showmeans=True)
ax.set_xticks(list(range(len(genres))))
ax.set_xticklabels(list(map(lambda g: g[6:], genres))) # set cleaned genre names on x axis
ax.set_ylabel('Log Revenue (Dollars)')
ax.set_title('Revenue Distribution By Genre')
plt.show()

We see that IMAX movies perform very well, while documentaries don't. This makes intuitive sense: IMAX movies are shown in theatres and often recieve significant advertising, while documentaries are harder to market for and often target a specific demographic or interest group. Similar interpretations can be made for other contrasting genre pairs. Almost all of the distributions have thin tails towards the low-end of revenue, but the region around the mean is fairly symmetric / bulbous. This indicates an approximately lognormal distribution about the means (the y-axis of the graph has a log scale). The majority of "typical" genres (Action, Romance, Sci-Fi) don't seem to have a major effect on revenue distribution; only the "niche" genres do.

However, we note that the effects of genre may not be constant over time. Cultural preferences and values have changed over time, so it is likely that genre relative popularity has changed from year to year. Perhaps some popular genre in one decade fell out of fashion in another. To examine whether such cultural shifts occurred, we will plot the average revenues of each genre for each clump of 5 years (not per year, since there is too little data and thus too much variance in each group if the time window is small).

In [30]:
# group data into 5 year intervals
fig, ax = plt.subplots(figsize=(20, 8))
df_copy = df.copy()
df_copy['group_of_5'] = df['year'].apply(lambda elm: elm - (elm % 5))

# loop over columns
for column in df.columns:
  if column[0:6] == 'genre_': # if this column is for a genre
    # restrict data to only this genre
    only_curr_genre_info = df_copy[df[column]]
    # get instances and averages of revenues over 5 year intervals
    years_of_5 = []
    avg_per_5_years = []
    for year, df_corresponding in only_curr_genre_info.groupby('group_of_5'):
      if year > 1950 and year < 2015:
        years_of_5.append(year)
        avg_per_5_years.append(df_corresponding['revenue'].mean())
    # plot averages where they exist
    ax.plot(years_of_5, avg_per_5_years, label = column[6:])

# set axis scales and labels
ax.legend()
ax.get_yaxis().set_major_formatter(
    matplotlib.ticker.FuncFormatter(lambda x, p: format(int(x//1000000), ',')))
ax.set_ylabel('Revenue in Millions')
ax.set_xlabel('Beginning of Every 5 Year Interval')
ax.set_title('Average Revenue of Each Genre For Every Group of 5 Years')
plt.show()

It appears that relative genre performance does fluxuate slightly as the years go on. A big example is that the revenue for drama seems to dominate in the 1990s and then taper off as we hit the 2000s. There are other examples of the some genres dominating only to perform worse later on. However, in general, genres seem to have about the same rankings in terms of revenue performance.

Notice that almost every genre has a slight upward trend. However, this may just be due to inflation and increasing population/demand, rather than increasing interest overall.

Section 3.3: Distribution of Dates

Movie releases are carefully timed events. Releases often align with certain times of the year (summer blockbusters, holiday season), but also try to not overlap with other major releases which could draw crowds away. We want to quantify these trends in movie release times from multiple angles.

We begin by plotting the distributions of release day-of-the-week.

In [31]:
# create histogram
fig, ax = plt.subplots(figsize=(12, 8))
values, bins, bars = ax.hist(df.release_date.dt.day_of_week, edgecolor='black', bins=np.arange(7+1))

# align bins, label axes
ax.set_xticks(np.arange(7)+0.5)
ax.set_xticklabels(['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday'])
ax.set_xlabel('Day of Week')
ax.set_ylabel('Number of Movie Releases')
ax.set_title('Frequency Distribution of Release Day-Of-The-Week')
plt.show()

Movies very often release on Fridays, followed the the middle of the week, with few releases over the weekend or start of the week. This makes some sense: people get off work on Friday and can go to see the first viewings, then the opening weekend will take place immediately after. This timing is intuitive and strategic.

We next plot the distribution of release month-of-the-year.

In [32]:
# create histogram
fig, ax = plt.subplots(figsize=(12, 8))
values, bins, bars = ax.hist(df.month, edgecolor='black', bins=np.arange(13)+1)

# align bins, label axes
ax.set_xticks(np.arange(12)+1.5)
ax.set_xticklabels(list(calendar.month_abbr)[1:])
ax.set_xlabel('Month of Year')
ax.set_ylabel('Number of Movie Releases')
ax.set_title('Frequency Distribution of Release Month-Of-The-Year')
plt.show()

The two most popular months are September and December, which align with the end-of-summer and holiday releases. However, these months don't dominate as strongly -- plenty of movies release throughout the year.

Finally, we wish to plot the distribution of movie releases over the calendar year. This is tricky, as the movie releases in the dataset span many years, and it would be impractical to visualize them separately. What we are mainly interested in is whether certain specific days are popular, so preserving day-of-the-week is not a concern. Thus, we take an approximation: for each release date, we transform its (day, month) combination onto the year 2020 (chosen because it is a leap year, so all days will be accounted for). We plot this frequency distribution below using the calmap library.

In [33]:
# map each (day, month) to 2020
plt.figure(figsize=(25, 20))
dates_mapped_to_2020 = df.release_date.map(lambda rd: pd.to_datetime(f"{rd.month}-{rd.day}-2020"))

# create calendar heatmap
calmap.yearplot(pd.Series(1, index=dates_mapped_to_2020))
plt.title('Frequency Distribution of Release Day-Of-The-Year')
plt.show()

We see that New years, Christmas and its surrounding days, as well as the first week of September are all common release dates. This aligns with the previous analysis that released summer-blockbuster and holiday releases to be popular. For the rest of the calendar dates, releases seem to be mostly uniformly distributed, or at least without a significant pattern.

Section 3.4: Miscellaneous

There are a few more factors which likely have a smaller or insignificant effect on a movie's performance, which we illustrate here. We begin by performing an analysis on the revenue distribution for production companies in a manner similar to how we analyzed genres. Often times, certain companies have a "brand name" where more popular companies will gain a consumer following just because viewers are acquainted with the style of movie that a company creates, so intuitively we would expect some effect.

In [34]:
# get list of all companies
companies = list(filter(lambda c: 'prod_company' in c, df.columns))

# create violinplot
plt.figure(figsize=(25, 8))
ax = plt.subplot(111)
plt.violinplot([np.log(np.asarray(df[df[c]].revenue)) for c in companies], positions=range(len(companies)), widths=1, showmeans=True)

# label distributions and axes
ax.set_xticks(list(range(len(companies))))
ax.set_xticklabels(list(map(lambda c: c[13:].replace(' ', '\n'), companies))) # set cleaned company names on x axis
ax.set_ylabel('Log Revenue (Dollars)')
ax.set_title('Revenue Distribution by Company')
plt.show()

There doesn't appear to be much difference between the company and the performance of their movies: they all have relatively the same mean except for the violin plot of "other movies". Part of the reason we are seeing similar revenues is because we are only displaying companies with more with 100 movies produced. These are old companies with big budgets, so we should expect them to all perform relatively the same. On the contrary, we are not displaying smaller companies with less than 100 movies which likely have lower performance. This is likely why our violin plot for "other" companies (ie the ones with less than 100 movies) do worse on average than the rest of the companies.

Next, we examine movies' runtime in our dataset. To first get a sense for the distribution of runtimes (since it is not immediately obvious), we begin by plotting a frequency distribution of movie runtimes in our dataset as a histogram, and also overlay a fitted lognormal distribution.

In [35]:
# create histogram
plt.figure(figsize=(12, 8))
df_runtime = df[pd.notnull(df.runtime) & (df.runtime != 0)].runtime
n, bins, patches = plt.hist(df_runtime, bins=100)

# fit and plot lognormal distribution
shape,loc,scale = scipy.stats.lognorm.fit(df_runtime)
x = np.linspace(0, max(df_runtime), 100)
scale_factor = len(df_runtime) * (bins[1] - bins[0]) # ad-hoc factor to match histogram
pdf = scipy.stats.lognorm.pdf(x, shape, loc, scale) * scale_factor
plt.plot(x, pdf, 'r', label=f"Lognorm(loc={loc:.3f}, scale={scale:.3f})")

# label axes
plt.xlabel('Runtime in Minutes')
plt.ylabel('Frequency')
plt.title('Frequency Distribution of Runtimes')
plt.legend()
plt.show()

The lognormal distribution fits the data quite well. The runtime peak at around 100 minutes, and taper off, with the upper bound being around 200 minutes. This matches our experience/intuition.

Next, we graph each movie's revenue against its runtime. It is not immediately clear what the effect will be, if any. Movies with shorter runtimes could either be short films or children's films, so will likely have smaller revenues. Longer movies could perform better as they have more content, but padding/"filler" time is a known problem in movie production.

In [36]:
# create scatter plot
fig, ax = plt.subplots(figsize=(12, 8))
plt.scatter(df.runtime, df.revenue)

# set axis scale manually
ax.get_yaxis().set_major_formatter(
    matplotlib.ticker.FuncFormatter(lambda x, p: format(int(x//1000000), ',')))

# label axes
plt.xlabel('Runtime in Minutes')
plt.ylabel('Revenue in Millions of Dollars')
plt.title('Movie Revenue Against Runtime')
plt.show()

There does not appear to be any strong pattern, as we suspected. There is a very slight upwards trend, but such variance around the central runtime region that any model solely based on this would have little predictive power.

Section 4: Models and Regression

Our hope is to create effective models to predict the performance of upcoming movies and in the box office. To do so, we will test various regression techniques. Currently, the only prediction method we have created is a linear regression between budget and revenue; we will proceed with more advanced machine learning methods.

In all cases, our method of testing the model will be the same as for the linear regression: evaluating the model's $r^2$ score, which describes what proportion of the variance in the dataset is explained by the model.

Section 4.1: Paritioning Data

There are several columns we can discard. Original language and release date have already been processed. Original title, title, tagline, and overview are all textual data, which is very difficult to process and likely cannot be used at this scale. Finally, we have already enforced that all movies in our dataset have been released, so status is no longer needed.

In [37]:
df_reg = df.drop(columns = ['original_language', 'original_title', 'overview', 'release_date', 'status', 'tagline', 'title'])

We will convert all bool columns to int64 to make it easier for sklearn to work with.

In [38]:
for column in df_reg.columns:
  if df_reg.dtypes[column] == bool:
    df_reg[column] = df_reg[column].astype(int)

df_reg.dropna(inplace = True)
df_reg = df_reg.astype(float)

Finally, before running our machine-learning models, we will perform a train-test split. This partitions our dataset into to parts, one that we can use to adjust our model, and the other to evaluate our model's predictive power. This is done to adquately evaluate the generalizeability of our model.

In [39]:
from sklearn.model_selection import train_test_split
df_train, df_test = train_test_split(df_reg, random_state = 1345)

Section 4.2: k-Nearest Neighbors

Our first attempt at regression will be using k-Nearest Neighbors, which creates a prediction using the revenues of movies with similar characteristics to its target. This seems intuitively reasonable: movies with similar features would likely attract similar audiences in similar volumes, thus achieving similar revenues.

When creating the model, we need to provide a good value of k (the number of neighbors looked at for each input). To find a good value, we will make use of grid search hyperparameter tuning. This essentially tests several models for various possible hyperparameter values, and keeps the best one.

When finding a good k value, we restrict our search to be between 1 and 6 inclusive; larger values of k will likely result in overfitting.

In [40]:
# import model
from sklearn.neighbors import KNeighborsRegressor
from sklearn.model_selection import GridSearchCV

# set up grid search on model
params = {'n_neighbors':range(1,7)}
grid_search_cv_knn = GridSearchCV(KNeighborsRegressor(), 
                              params, cv=10, scoring = 'r2')

# fit training data while tuning hyperparameters
grid_search_cv_knn.fit(df_train.drop(columns = ['revenue'], axis = 1), df_train['revenue'])

# print best hyper parameter value:
print('our best k value hyper parameter is:')
print(grid_search_cv_knn.best_params_)
print('our corresponding r^2 measurement is:')
print(grid_search_cv_knn.best_score_)
our best k value hyper parameter is:
{'n_neighbors': 6}
our corresponding r^2 measurement is:
0.4880127805221036

After creating this model, we test our best scoring KNN regressor on our test dataset:

In [41]:
grid_search_cv_knn.score(df_test.drop(columns = ['revenue'], axis = 1), df_test['revenue'])
Out[41]:
0.5288748639888083

This score is relatively low, very close to the initial linear regression. This could be due to the fact that our budget is unnormalized and so when comparing distance, budget dominates any comparison because it is orders of magnitude greater than anything else.

Section 4.3: Random Forest

A Random Forest regression is another intuitively reasonable approach. This type of regression finds patterns among the various fields that can be used to predict revenue using decision trees, then filters out the bad patterns to find the best predictors.

Once again, we use grid search hyperparameter tuning to find the best model. For the number of trees, we test between 50 and 150 (around the default of 100); for the maximum depth (number of splits for each tree), we search between 2 and 10 (large values will result in overfitting).

In [42]:
# import model
from sklearn.ensemble import RandomForestRegressor

# import model
from sklearn.neighbors import KNeighborsRegressor
from sklearn.model_selection import GridSearchCV

# set up grid search on model
params = {'n_estimators':range(50,150, 10), 'max_depth':range(2,12,2)}
grid_search_cv_rand_fore = GridSearchCV(RandomForestRegressor(random_state = 345), 
                              params, cv=10, scoring = 'r2')

# fit training data while tuning hyperparameters
grid_search_cv_rand_fore.fit(df_train.drop(columns = ['revenue'], axis = 1), df_train['revenue'])

# print best hyper parameter value:
print('our best hyper parameter is:')
print(grid_search_cv_rand_fore.best_params_)
print('our corresponding r^2 measurement is:')
print(grid_search_cv_rand_fore.best_score_)
our best hyper parameter is:
{'max_depth': 10, 'n_estimators': 130}
our corresponding r^2 measurement is:
0.569018326323284

We again test the best Random Forest model on the test dataset.

In [43]:
grid_search_cv_rand_fore.score(df_test.drop(columns = ['revenue'], axis = 1), df_test['revenue'])
Out[43]:
0.6080214379532558

This regression model seems to perform much better than k-Nearest Neighbor. This is likely due to the large quanitity of data for each movie after our preprocessing: while Random Forest is able to filter for the most important components, other models may not do so as easily.

Section 4.4: Category Reduction

Unfortunately, many of these categories are not useful and do not provide much meaning to the data; to improve the performance of our models, it is prudent to manually discard the unnecessary data. This will also be useful later when trying to apply our model to upcoming movies, as some of the current data is hard to obtain. We will use our data analysis to guide the process.

The genre one-hot encoding should be kept bcause there was a moderate amount of difference on movie revenue by genre as evidenced by our previous violin plots. However, the similar analysis for production company found that there was no significant difference in revenue among the most frequent companies. There is not enough data among the infrequent companies to justify using them in a model without running into overfitting problems, so it will be discarded. Spoken languages will be dropped because the vast majority of movies in our dataset are in English. Similarly, we will drop the columns for production country because most of them are in the United States.

In [44]:
# create thinned dataframe
df_reg = df_reg[
    [
      'adult',
      'budget',
      'revenue',
      'runtime',
      'year',
      'month',
      'day',
      'genre_Mystery',
      'genre_Adventure',
      'genre_War',
      'genre_Animation',
      'genre_Musical',
      'genre_IMAX',
      'genre_Western',
      'genre_Sci-Fi',
      'genre_Crime',
      'genre_Romance',
      'genre_Children',
      'genre_Documentary',
      'genre_Action',
      'genre_Drama',
      'genre_Fantasy',
      'genre_Film-Noir',
      'genre_Comedy',
      'genre_Thriller',
      'genre_Horror',
  ]
]

# perform train-test split
df_train, df_test = train_test_split(df_reg, random_state = 1345)

We will attempt the same regression techniques on this smaller dataset. We begin with a gridsearch on k-Nearest Neighbors. We will choose to restrict our k value to be less than or equal to 6 because anything over that value can result in significant overfitting.

In [45]:
# import model
from sklearn.neighbors import KNeighborsRegressor
from sklearn.model_selection import GridSearchCV

# set up grid search on model
params = {'n_neighbors':range(1,7)}
grid_search_cv_knn = GridSearchCV(KNeighborsRegressor(), 
                              params, cv=10, scoring = 'r2')

# fit training data while tuning hyperparameters
grid_search_cv_knn.fit(df_train.drop(columns = ['revenue'], axis = 1), df_train['revenue'])

# print best hyper parameter value:
print('our best k value hyper parameter is:')
print(grid_search_cv_knn.best_params_)
print('our corresponding r^2 measurement is:')
print(grid_search_cv_knn.best_score_)
our best k value hyper parameter is:
{'n_neighbors': 6}
our corresponding r^2 measurement is:
0.48973040428060893
In [46]:
grid_search_cv_knn.score(df_test.drop(columns = ['revenue'], axis = 1), df_test['revenue'])
Out[46]:
0.5285619877502238

The KNN regressor performs almost the same as before dropping the unnecessary categories.

We try again with the Random Forest regressor.

In [47]:
# import model
from sklearn.ensemble import RandomForestRegressor

# import model
from sklearn.neighbors import KNeighborsRegressor
from sklearn.model_selection import GridSearchCV

# set up grid search on model
params = {'n_estimators':range(50,150, 10), 'max_depth':range(2,12,2)}
grid_search_cv_rand_fore = GridSearchCV(RandomForestRegressor(random_state = 345), 
                              params, cv=10, scoring = 'r2')

# fit training data while tuning hyperparameters
grid_search_cv_rand_fore.fit(df_train.drop(columns = ['revenue'], axis = 1), df_train['revenue'])

# print best hyper parameter value:
print('our best hyper parameter values are:')
print(grid_search_cv_rand_fore.best_params_)
print('our corresponding r^2 measurement is:')
print(grid_search_cv_rand_fore.best_score_)
our best hyper parameter values are:
{'max_depth': 10, 'n_estimators': 60}
our corresponding r^2 measurement is:
0.5664144928251216
In [48]:
grid_search_cv_rand_fore.score(df_test.drop(columns = ['revenue'], axis = 1), df_test['revenue'])
Out[48]:
0.6111421866032312

The Random Forest regression also performs about the same as before (in fact, slightly better).

Section 4.5 Neural Network

As our last regression technique, we will create and run a neural network. This type of advanced model tries to find patterns in data using essentially a large number of chained and bounded linear functions. It is computationally expensive but can give good results by picking up on subtle patterns in data.

Setting up a neural network is also quite complicated. We will use the PyTorch framework to create, load, train, and evaluate our model. We begin by creating a data loader, which allows PyTorch to sequentially read the data from our dataframes.

In [49]:
# import and configure PyTorch
import torch
from torch.utils.data import Dataset
torch.set_default_dtype(torch.float64)
torch.set_grad_enabled(True)

# create class to represent dataset in a manner PyTorch can read
class RegressionDataset(Dataset):
  # constructor: store datasets
  def __init__(self, data):
        data = data.reset_index(drop = True)
        self.targets = torch.from_numpy(np.asarray(data['revenue'].values, dtype = np.float64))
        self.input_data = torch.from_numpy(np.asarray(data.drop('revenue', axis = 1).values, dtype = np.float64))
  
  # access element of dataset by index
  def __getitem__(self, index):
      return (self.input_data[index], self.targets[index])
  
  # get length of dataset
  def __len__ (self):
      return len(self.input_data)

# create data loader objects for train and test datasets
from torch.utils.data import DataLoader
dataloader_train = DataLoader(dataset = RegressionDataset(df_train))
dataloader_test = DataLoader(dataset = RegressionDataset(df_test))

Next, we create the actual network itself, which consists of several linear layers back-to-back. Each layer performs a linear transformation to its input. Between each linear layer, we apply a ReLU activation function, which forces all of the neurons at each stage to be postiive. This also gives the model expressiveness -- if not for the activation function, the linear layers would combine to produce only a single linear transformation.

In [50]:
import torch.nn as nn
from torch.nn.functional import relu, logsigmoid


class RegressionNetwork(nn.Module):
    # initialize model: construct linear layers
    def __init__(self):
        super(RegressionNetwork, self).__init__()
        self.fc1 = nn.Linear(25, 20) 
        self.fc2 = nn.Linear(20, 15)
        self.fc3 = nn.Linear(15, 10)
        self.fc4 = nn.Linear(10, 5)
        self.fc5 = nn.Linear(5, 1)

    # applies model to input: sequence of linear+relu functions
    def forward(self, x):
        x = relu(self.fc1(x))
        x = relu(self.fc2(x))
        x = relu(self.fc3(x))
        x = relu(self.fc4(x))
        x = relu(self.fc5(x))
        return x

Finally, we train the neural network, which adjusts the weights in each linear layer to achieve a better score. Our weights are adjusted using an Adam Optimizer based on gradient descent with a few tweeks, which finds the best direction to change the model parameters to minimize the loss (error) produced. Our loss is measured using the standard mean squared error. We optimize our network over several epochs, or training periods.

In [51]:
# import functions, set randomization seeds
import torch.optim as optim
from torch.autograd import Variable
import random
random.seed(0)
np.random.seed(0)
torch.manual_seed(0)

# initialize network and optimizer
net = RegressionNetwork()
print(net)
criterion = nn.MSELoss()
optimizer = optim.Adam(net.parameters(), lr=0.02)

# train over several epochs
epochs = 20
for epoch in range(epochs):
  # compute loss over training data
  train_loss_epoch = 0
  for input, target in dataloader_train:
    input = Variable(input, requires_grad = False)
    optimizer.zero_grad()
    # print(input)
    output = net(input)
    # print(loss)
    loss = criterion(output, target)
    loss.backward()
    # print(loss)
    optimizer.step()
    train_loss_epoch += loss.item() 
  
  # compute loss over test data
  test_loss_epoch = 0
  for input, target in dataloader_test:
    input = Variable(input, requires_grad = False)
    optimizer.zero_grad()
    # print(input)
    output = net(input)
    # print(loss)
    loss = criterion(output, target)
    loss.backward()
    # print(loss)
    optimizer.step()
    test_loss_epoch += loss.item() 
  
  # print results every few epochs
  if epoch % 5 == 4:
    print('epoch ' + str(epoch+1) + ':')
    print('training loss:', train_loss_epoch / len(dataloader_train))
    print('test loss:', test_loss_epoch / len(dataloader_test))
RegressionNetwork(
  (fc1): Linear(in_features=25, out_features=20, bias=True)
  (fc2): Linear(in_features=20, out_features=15, bias=True)
  (fc3): Linear(in_features=15, out_features=10, bias=True)
  (fc4): Linear(in_features=10, out_features=5, bias=True)
  (fc5): Linear(in_features=5, out_features=1, bias=True)
)
epoch 5:
training loss: 1.3387224360040906e+16
test loss: 1.930931147919102e+16
epoch 10:
training loss: 1.3545441034648388e+16
test loss: 1.958371446939039e+16
epoch 15:
training loss: 1.3456595874100664e+16
test loss: 1.9166326336006788e+16
epoch 20:
training loss: 1.3194131703424232e+16
test loss: 1.915289442757476e+16

We finally evaluate our neural nerwork by finding its $r^2$ score.

In [52]:
# convert data to use with PyTorch
from sklearn.metrics import r2_score
tensor_test = torch.from_numpy(np.asarray(df_test.drop('revenue', axis = 1).values, dtype = np.float64))

# compute r2 score
model_outputs_test = net(tensor_test)
r2_score(df_test['revenue'], model_outputs_test.detach())
Out[52]:
0.5339704750482415

Interestingly our Neural Network doesn't do as well as a random forest or a knn. We believe part of the reason this is happening is because there isn't much data to train the neural network on. There is only 5000 total entries and often neural networks make more than that to train. Additionally, our neural network is very small, so there may not be enough power in our network to understand more advanced processes.

Section 4.6: Summary of models

It appears that of the models trained, random forest was the most effective regression technique. It is the only model to predict at a R^2 level above 0.6. This makes sense because our data is full of categories, and often decision trees are the most effective at dealing with a large amount of categories.

Additionally, it appears that the language and production company information for each movies largely seems irrelevant. When we removed these columns, we still ended out with roughly the same effectiveness in our regressions, as evidenced by our R^2 values.

Section 5: Upcoming Movie Predictions

Our goal was to predict the box office revenue of upcoming movies, and now we have the knowledge and tools to do so! We choose five upcoming movies to analyze:

  • Avatar 2: The Way of Water
  • Aquaman 2: The Lost Kingdom
  • Black Panther 2: Wakanda Forever
  • Black Adam
  • Babylon

To run our regression models, we need data on all of these movies, which we obtain from this website.

In [53]:
upcomingMovieData = {
    'name':['Avatar 2', 'Aquaman 2', 'Black Panther 2', 'Black Adam', 'Babylon'],
    'adult':[False, False, False, False, True],
    'budget' : [375 * 10**6, 235 * 10**6, 250 * 10**6, 192.5 * 10**6, 109 * 10**6],
    'runtime' : [192, 143, 161, 124, 189],
    'year': [2022, 2023, 2022, 2022, 2022],
    'month' : [12, 12, 11, 11, 12],
    'day': [16, 25, 11, 21, 25],
    'genre_Mystery': [False, False, False, False, False],
    'genre_Adventure': [False, True, False, False, False],
    'genre_War': [True, False, True, False, False],
    'genre_Animation': [False, False, False, False, False],
    'genre_Musical': [False, False, False, False, False],
    'genre_IMAX': [True, True, True, False, False],
    'genre_Western': [False, False, False, False, False],
    'genre_Sci-Fi': [True, False, True, False, False],
    'genre_Crime': [False, False, False, False, False],
    'genre_Romance': [False, False, False, False, True],
    'genre_Children': [False, False, False, False, False],
    'genre_Documentary': [False, False, False, False, False],
    'genre_Action': [True, True, True, True, False],
    'genre_Drama': [True, False, False, False, True],
    'genre_Fantasy': [True, True, True, True, False],
    'genre_Film-Noir': [False, False, False, False, False],
    'genre_Comedy': [False, False, False, False, False],
    'genre_Thriller': [False, False, False, False, False],
    'genre_Horror': [False, False, False, False, False],
}

# make dataframe from map
upcoming_movie_df = pd.DataFrame(upcomingMovieData)
upcoming_movie_df.head()

# remove names from the df
names = upcoming_movie_df['name']
predictors = upcoming_movie_df.drop('name', axis = 1)
predictors = predictors.astype(np.float64)

We apply our three machine-learning models, as well as our initial budget/revenue linear regression, on these five movies.

In [54]:
import warnings
warnings.filterwarnings("ignore")

predictions = []
for i in range(5):
  predictor_row = np.array(predictors.iloc[i])
  predictions.append(
      {
       'Neural Network': net(torch.Tensor(predictor_row)).detach().item(), 
       'K Nearest Neighbors' : grid_search_cv_knn.predict(predictor_row.reshape(1, -1))[0],
       'Random Forest' : grid_search_cv_rand_fore.predict(predictor_row.reshape(1, -1))[0],
       'Linear Regression \n(between budget and revenue)' : model_lin_reg.predict(np.array([[predictor_row[1]]]))[0]
      }
  )

We graph the expected revenue outputs for each of these models on all the movies togethr on one graph to compare the results.

In [55]:
# set up plot scale and axes
fig, ax = plt.subplots(1)
fig.set_figwidth(15)
fig.set_figheight(10)
ax.get_yaxis().set_major_formatter(
  matplotlib.ticker.FuncFormatter(lambda x, p: format(int(x), ','))) # add commas to the y axis integer values
ax.set_ylabel('Revenue in dollars')
ax.set_xlabel('Movie')
X_axis = np.arange(len(names))

# bars for neural network
labels = [names.iloc[i] for i in range(len(names))]
values = []
for i in range(len(names)):
  predictor_row = np.array(predictors.iloc[i])
  values.append(net(torch.Tensor(predictor_row)).detach().item())
ax.bar(X_axis-0.05*1.5, values, color = 'red', width = 0.1*1.5, label = 'Neural Net')

# bars for KNN
values = []
for i in range(len(names)):
  predictor_row = np.array(predictors.iloc[i])
  values.append(grid_search_cv_knn.predict(predictor_row.reshape(1, -1))[0])
ax.bar(X_axis-0.15*1.5, values, color = 'blue', width = 0.1*1.5, label = 'K-Nearest Neighbors')

# bars for Random Forest
values = []
for i in range(len(names)):
  predictor_row = np.array(predictors.iloc[i])
  values.append(grid_search_cv_rand_fore.predict(predictor_row.reshape(1, -1))[0])
ax.bar(X_axis+0.05*1.5, values, color = 'green', width = 0.1*1.5, label = 'Random Forest')

# bars for linear regression
values = []
for i in range(len(names)):
  predictor_row = np.array(predictors.iloc[i])
  values.append(model_lin_reg.predict(np.array([[predictor_row[1]]]))[0])
ax.bar(X_axis+0.15*1.5, values, color = 'grey', width = 0.1*1.5, label = 'Linear Regression\n(budget vs revenue)')

# make the x axis names of the models
plt.xticks(X_axis, labels)
plt.legend()
plt.title("Movie Revenue Predictions of Major Upcoming Movies")
Out[55]:
Text(0.5, 1.0, 'Movie Revenue Predictions of Major Upcoming Movies')

The bar plot above shows various regression predictions of movie performance at the box office. There is great variance between what the models predict for each movie. For example, KNN predicts much lower Avatar 2 revenue than the rest of the regressions, while Random Forest predicts much higher revenue for Aquaman 2 than the rest of the predictors.

The neural network and the linear regression have almost the same predictions consistently, whereas the random forest and knn seem to produce occasionally novel results from the others.

From what appears from the various regressions, we believe that Avatar 2 and Aquaman 2 are going to raise the most revenue. We have confidence that Babylon is going to perform the worst in the box office because each regression's lowest prediction is for Babylon.

Section 6. Conclusion

In this walkthrough, we collected, processed, combined, and encoded complex datasets regarding past movies and their performances. We explored our dataset through several avenues, including budget, genre, and release time, to heuristically determine what factors influence a movie's box office success; while these factors were important, features such as production company and language were not. We then utilized several machine-learning regression models to try to predict box office success given movie characteristics, and achieved moderate success. Finally, we used our predictors to analyze upcoming movies.

In our analysis, we found that a movie's budget is extremely significant in determining a movie's revenue and return on investment. Movie genres have some effect, but these preferences seem to fluctuate over time. Movies are often released during holiday seasons or when people will be likely to watch them. While these and other factors can be used to predict movie box office success, we were able to explain at most 60% of the variance, suggesting that there are several components we were unable to account for in our analysis.

Our list of factors is not complete -- there are certainly more characteristics of movies that serve as important factors in a movie's performance; these could include star actors or advertising costs. A larger dataset that includes these features may allow us to achieve better predictive performance, which could be the subject of further analysis. We hope you have enjoyed reading this tutorial and have learned data science techniques that will be useful in your future exploration!