In this project, a dataset from the National Collegiate Athletic Association (NCAA) covering basketball games played by Ivy League colleges from 2006 to 2015 is analysed in order to identify the features that drive a team’s performance.

In every sport, the ultimate goal of a team is to win. However, the final outcome of a game might not be representative of a team’s performance. For instance, a very productive team might lose a game by a very small margin. In such a case, one cannot conclude that the team performed poorly. Under this view of a “good performance”, a binary win-or-lose variable is an ill-specified metric for capturing the performance of a team. Therefore, the point difference in each game, defined as the points scored by the target team (an Ivy League team) minus the points scored by the opponent, is used as the response variable when identifying the features that explain the variation in each team’s performance.
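The contrast between the two candidate targets can be sketched as follows. The scores are made-up values for illustration only, and the function names are hypothetical rather than taken from the project.

```python
# Hypothetical (team_points, opponent_points) pairs; not real NCAA results.
games = [(70, 71), (55, 80), (66, 64)]


def point_margin(team_points, opponent_points):
    """Signed point difference from the target team's perspective."""
    return team_points - opponent_points


def won(team_points, opponent_points):
    """Binary outcome: 1 for a win, 0 for a loss."""
    return int(team_points > opponent_points)


margins = [point_margin(t, o) for t, o in games]
outcomes = [won(t, o) for t, o in games]
```

The first two games receive the same binary outcome (a loss), while their margins of -1 and -25 separate a narrow defeat from a heavy one.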

Additional data is acquired through web scraping in order to supplement the provided dataset with engineered features that will best explain a team’s performance. For instance, data on the coach and the team composition in each year is collected, and regressors are derived from it, such as consecutive coaching years, the coaches’ total number of NCAA appearances, and the similarity of each team’s composition between the current and the previous year. In order to identify the most important variables, different feature selection techniques are implemented. More specifically, a univariate selection method (F-regression), a Random Forest Regressor, and a Randomised Lasso stability selection process are proposed. Each technique has its own limitations, and thus all of the aforementioned selection methods are taken into consideration in order to balance out their relative differences and weaknesses. Based on these selection techniques, together with conceptual arguments and a backward-forward selection process, the final set of variables is selected while controlling for multicollinearity and functional form misspecification among the regressors.
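A minimal sketch of how the three selection methods could be combined is given below, assuming scikit-learn is used. The data is synthetic, the rank-averaging step is one possible way to balance the methods and is not necessarily the one used in the project, and the stability selection is hand-rolled (a Lasso refitted on random half-samples) because recent scikit-learn versions no longer ship a `RandomizedLasso` estimator. Unlike the original Randomised Lasso, this simplified version does not randomly rescale the penalty per feature.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import f_regression
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)

# Synthetic stand-in data: 300 games, 6 candidate features.
# Only features 0 and 1 drive the simulated point difference.
n_games, n_features = 300, 6
X = rng.normal(size=(n_games, n_features))
y = 5.0 * X[:, 0] - 3.0 * X[:, 1] + rng.normal(size=n_games)

# 1) Univariate F-regression: one F-statistic per feature.
f_scores, p_values = f_regression(X, y)

# 2) Random forest: impurity-based feature importances.
forest = RandomForestRegressor(n_estimators=200, random_state=0).fit(X, y)
rf_importance = forest.feature_importances_

# 3) Simplified stability selection: fit a Lasso on many random
#    half-samples and record how often each coefficient is non-zero.
#    alpha=0.5 is an arbitrary illustrative penalty.
n_rounds = 100
selected = np.zeros(n_features)
for _ in range(n_rounds):
    idx = rng.choice(n_games, size=n_games // 2, replace=False)
    lasso = Lasso(alpha=0.5).fit(X[idx], y[idx])
    selected += (lasso.coef_ != 0)
stability = selected / n_rounds


def ranks(scores):
    """Rank features by descending score (0 = most important)."""
    return np.argsort(np.argsort(-scores))


# Average the three rankings so that no single method dominates.
mean_rank = (ranks(f_scores) + ranks(rf_importance) + ranks(stability)) / 3
top_two = sorted(np.argsort(mean_rank)[:2].tolist())
```

Averaging ranks rather than raw scores avoids having to put F-statistics, impurity importances and selection frequencies on a common scale.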

After selecting the most important features that explain the variation in a team’s performance in each game, a linear regression model with robust standard errors is fitted, describing the outcome of a game in terms of the selected predictors.
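The final modelling step can be illustrated with a small NumPy sketch of ordinary least squares with heteroskedasticity-robust standard errors. The text does not state which robust estimator is used, so the HC1 (White) sandwich estimator below is an assumption, and the data is synthetic.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic stand-in data with heteroskedastic noise (illustrative only).
n = 500
x = rng.normal(size=(n, 2))
noise = rng.normal(size=n) * (1.0 + np.abs(x[:, 0]))  # spread depends on x
y = 2.0 + 4.0 * x[:, 0] - 1.5 * x[:, 1] + noise

# Design matrix with an intercept column.
X = np.column_stack([np.ones(n), x])
k = X.shape[1]

# OLS coefficients: beta = (X'X)^{-1} X'y
XtX_inv = np.linalg.inv(X.T @ X)
beta = XtX_inv @ X.T @ y
resid = y - X @ beta

# Classical standard errors assume a constant error variance.
sigma2 = resid @ resid / (n - k)
se_classical = np.sqrt(np.diag(sigma2 * XtX_inv))

# HC1 robust standard errors: sandwich estimator whose "meat" uses the
# squared residuals, with a small-sample scaling of n / (n - k).
meat = X.T @ (X * resid[:, None] ** 2)
cov_hc1 = XtX_inv @ meat @ XtX_inv * n / (n - k)
se_robust = np.sqrt(np.diag(cov_hc1))
```

The coefficient estimates are the same under both approaches; only the standard errors, and hence the inference, change. In practice a statistics library would normally be used instead of the explicit matrix algebra, for example statsmodels via `OLS(y, X).fit(cov_type="HC1")`.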