Abstract

Netflix, a data-driven entertainment company, relied in large part on its algorithms when it decided to bid more than a hundred million U.S. dollars for the rights to the first two seasons of ‘House of Cards’. Given its success, one might conclude that data can indeed predict a blockbuster. However, Netflix had unique access to invaluable information generated by millions of its customers, something that is not available in even the most detailed movie industry report. Thus, the real question is: ‘Can Open Data predict a blockbuster?’ The overall aim of this report is to answer precisely this question and to develop a comprehensive model that can predict the local box-office receipts of any movie screened in the United States prior to its official release date, using only publicly available information.

Most of the movie data is acquired through several Application Programming Interfaces (APIs) and by screen scraping two specialized open data sources: the Internet Movie Database (IMDb) and The Numbers. Some auxiliary information is also extracted from the Box Office Mojo and Fxtop websites. This data is then used to engineer multiple movie features that help explain the variation in box-office receipts across movies.

The prediction task is approached both as a classification problem, with four classes representing different box-office ranges, and as a regression problem. Depending on the actual needs, each approach has certain advantages over the other. The former can be used to determine whether a given movie is going to be a blockbuster or, possibly, a box-office bomb, whereas the latter produces the actual box-office estimate. Thus, two two-level models are built: the Ensemble Stacked Classification Model (ESCM) and the Ensemble Stacked Regression Model (ESRM). These models represent two separate variations of the meta-ensembling technique, which combines the individual predictive power of the first-level models and uses it within the second-level model to make the final predictions.
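The two-level idea can be illustrated with a minimal sketch. It assumes scikit-learn (the report does not state which library was used), synthetic data in place of the engineered movie features, and arbitrary illustrative hyperparameters. The first-level and second-level learners mirror the classification variant described in this abstract; the variable names are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the movie feature matrix (assumption):
# four classes play the role of the four box-office ranges.
X, y = make_classification(
    n_samples=400, n_features=12, n_informative=6,
    n_classes=4, random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# First level: neural network, support vector machine, random forest.
# Scaling is added for the two scale-sensitive learners.
first_level = [
    ("ann", make_pipeline(
        StandardScaler(),
        MLPClassifier(hidden_layer_sizes=(16,), max_iter=1000, random_state=0),
    )),
    ("svm", make_pipeline(StandardScaler(), SVC(probability=True, random_state=0))),
    ("rf", RandomForestClassifier(n_estimators=100, random_state=0)),
]

# Second level: logistic regression trained on the out-of-fold class
# probabilities produced by the first-level models.
stack = StackingClassifier(
    estimators=first_level,
    final_estimator=LogisticRegression(max_iter=1000),
    cv=5,
    stack_method="predict_proba",
)
stack.fit(X_train, y_train)

predicted_classes = stack.predict(X_test)
accuracy = stack.score(X_test, y_test)
print(f"held-out accuracy on synthetic data: {accuracy:.3f}")
```

Training the second-level model on cross-validated (out-of-fold) predictions, as `cv=5` does here, keeps it from simply learning how well the first-level models memorized the training set. The accuracy printed by this sketch says nothing about the report's results, since the data is synthetic.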

The ESCM consists of an Artificial Neural Network, a Support Vector Machine and a Random Forest at its first level and a Logistic Regression at the second level, whereas the ESRM has an Artificial Neural Network and a Support Vector Regressor with a Gaussian kernel at the first level and another Artificial Neural Network at the second level. The proposed classification model achieves an accuracy of 65%, outperforming a random classifier by 160%, whereas the regression model predicts box-office receipts with a Root Mean Squared Error of $29,988,856. Moreover, it appears that the strengths of the two models can be combined if they are used together.
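The two headline metrics can be made concrete with a short sketch. It assumes that the random classifier picks one of the four classes uniformly, giving a 25% baseline, which is consistent with the 160% figure. The helper names and the toy receipts below are hypothetical and are not taken from the report's data.

```python
import math


def relative_improvement(accuracy: float, n_classes: int) -> float:
    """Relative gain over a uniform random classifier (assumed baseline 1/n_classes)."""
    baseline = 1.0 / n_classes
    return (accuracy - baseline) / baseline


def rmse(actual: list[float], predicted: list[float]) -> float:
    """Root Mean Squared Error between actual and predicted receipts."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# 65% accuracy against a 25% baseline: (0.65 - 0.25) / 0.25 = 1.6, i.e. 160%.
gain = relative_improvement(0.65, n_classes=4)
print(f"improvement over random: {gain:.0%}")

# Toy receipts in U.S. dollars (made-up values, for illustration only).
toy_actual = [100e6, 50e6, 20e6]
toy_predicted = [90e6, 70e6, 20e6]
toy_rmse = rmse(toy_actual, toy_predicted)
print(f"toy RMSE: ${toy_rmse:,.0f}")
```

Because RMSE squares each error before averaging, a few badly mispredicted high-grossing movies can dominate the figure, which is worth keeping in mind when reading an error expressed in tens of millions of dollars.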

To enhance the predictive accuracy of the proposed models, a separate analysis is carried out using additional data extracted from Twitter. Controlling for several newly engineered popularity measures of actors and directors allows the new classification model to correctly classify 69.5% of movies. This analysis of the predictive power of data extracted from one of the most popular social networks is presented in Appendix A.