Take Me Out to the Ballgame: Demystifying Data Science Through Baseball

Introduction

The language of data science is… problematic.

Buzzwords like “Predictive Analytics,” “Big Data” and “Machine Learning Algorithms” often push away the very people who could benefit most from leveraging data science in their daily work.

Enter our “Demystifying Data Science” brown bag series, where we break down basic data visualization and machine learning concepts to show you how applying just a few fundamental ideas can dramatically improve day-to-day decision making.

We started with a branch of machine learning called “Supervised Learning,” though we opted for the term “Predictive Modeling,” which does a better job of describing what you’re actually trying to accomplish: predicting something.

We decided this type of machine learning would be best illustrated with a real-world example, so we started with a simple question: What do you think the 2018 opening game attendance will be at AT&T Park? At ROI·DNA, we happen to have season tickets for the Giants, so most of us have spent our fair share of time at the ballpark. But without any hard data to inform our predictions, our initial guesses varied wildly.

Fig. 1 – As we can see above, most guesses ranged between 38,000 and 80,000, while Texas native Pat, apparently confusing the Giants with Beyoncé, came in a touch aggressive at 175,000.

We then asked our colleagues to brainstorm what data they would need to make more accurate predictions. Specifically, we wanted them to come up with a list of variables they thought might influence attendance. For instance, it might make sense that attendance is lower at rainy games than at sunny ones. Or we might see that weekend games draw bigger crowds, or that games against rivals have the highest turnout of all.

The goal with predictive modeling is to discover the variables with the greatest predictive power and plug them into an equation that will help to predict future events or other unknown outcomes. In essence, we’re hunting for signals within the noise of what would otherwise appear to be random variation.

Among our colleagues, we found some consensus around the factors that might influence game attendance:

– Weekend game vs. weekday game
– Day vs. night game
– Sunny vs. rainy weather
– Opposing team being played
– Opening Day vs. any other game
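
For the curious, here is one way those factors could be turned into numbers a model can work with. This is only an illustrative sketch, not the code we used: the `Game` class and its field names are made up for this example, and it assumes the opposing team is simplified to a single rival / non-rival flag.

```python
from dataclasses import dataclass


@dataclass
class Game:
    """One home game, described by the factors in the list above (hypothetical fields)."""
    weekend: bool      # weekend game vs. weekday game
    day_game: bool     # day game vs. night game
    rainy: bool        # rainy vs. sunny weather
    rival: bool        # assumption: opponent reduced to rival / non-rival
    opening_day: bool  # Opening Day vs. any other game


def to_features(game: Game) -> list[int]:
    """Encode each yes/no factor as 1 or 0 so it can be plugged into an equation."""
    return [
        int(game.weekend),
        int(game.day_game),
        int(game.rainy),
        int(game.rival),
        int(game.opening_day),
    ]


# A sunny weekday day game against a non-rival on Opening Day.
example_game = Game(weekend=False, day_game=True, rainy=False, rival=False, opening_day=True)
features = to_features(example_game)
print(features)
```

Each game becomes a short row of ones and zeros, which is all a regression model needs to get started.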

There are a number of different supervised learning algorithms we could have used, but we opted to go with something called multiple linear regression, which tends to be the easiest to understand given its roots in basic algebra.
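
To make “roots in basic algebra” a little more concrete, here is a minimal sketch of multiple linear regression in plain Python. It assumes the textbook approach (ordinary least squares solved through the normal equations), which may differ from the tooling behind our demo. The function names are invented for this example, and the four rows of attendance figures are toy numbers, not real Giants data.

```python
def fit_linear_regression(rows, targets):
    """Ordinary least squares: solve (X^T X) b = X^T y for the coefficients b."""
    # Prepend a 1 to every row so the first coefficient acts as the baseline (intercept).
    X = [[1.0] + [float(v) for v in row] for row in rows]
    n = len(X[0])
    # Build the augmented matrix [X^T X | X^T y].
    A = [
        [sum(x[i] * x[j] for x in X) for j in range(n)]
        + [sum(x[i] * y for x, y in zip(X, targets))]
        for i in range(n)
    ]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(A[r][col]))
        if abs(A[pivot][col]) < 1e-12:
            raise ValueError("predictors are collinear; cannot solve")
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(n):
            if r != col:
                factor = A[r][col] / A[col][col]
                A[r] = [a - factor * b for a, b in zip(A[r], A[col])]
    return [A[i][n] / A[i][i] for i in range(n)]


def predict(coefs, row):
    """Baseline plus each coefficient times its variable."""
    return coefs[0] + sum(c * v for c, v in zip(coefs[1:], row))


# Toy data (invented): columns are (weekend, rainy); targets are attendance.
toy_rows = [(0, 0), (1, 0), (0, 1), (1, 1)]
toy_attendance = [40_000, 41_500, 39_000, 40_500]

coefs = fit_linear_regression(toy_rows, toy_attendance)
print([round(c) for c in coefs])
print(round(predict(coefs, (1, 0))))
```

The fitted coefficients read like plain English: a baseline crowd, plus a bump for weekend games, minus a penalty for rain.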

We then tested the strength of our predictors on every Giants home game since 2001, looking to see which variables had the greatest impact on attendance. As it turned out, all our predictors were fairly weak, meaning each accounted for only a small share of the variation in attendance. Had we been in a production environment, we likely would have stopped at this point and gone back to the drawing board. As this was only a demonstration, we decided to press on.
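
One common way to put a number on a predictor’s strength is R², the share of the variation in the outcome that the predictor explains. The sketch below assumes that measure (we are not claiming it is the exact test we ran) and uses four invented games rather than real attendance records.

```python
def r_squared(xs, ys):
    """Squared Pearson correlation: the share of variation in ys explained by xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return (sxy * sxy) / (sxx * syy)


# Toy data (invented): 1 = weekend game, 0 = weekday game.
weekend = [0, 0, 1, 1]
attendance = [40_000, 41_000, 41_000, 42_000]

score = r_squared(weekend, attendance)
print(score)
```

A score near 1 means the predictor explains almost everything, while a score near 0 means it explains almost nothing. A weak predictor sits toward the low end of that scale.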

After splitting our data into a “training” set and a “test” set, we constructed several models built on different combinations of our predictors. For each of those combinations, we used the “training” set to train the model, and the “test” set to evaluate its performance on games it had never seen (a simple form of validation known as the holdout method, and a close cousin of cross-validation).
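
Here is a rough sketch of the two mechanical pieces of that step: listing the combinations of predictors to try, and setting aside a test set. The predictor names, the 80/20 split, and the fixed random seed are all assumptions made for this example. The sketch also lists every possible combination, whereas we only built several.

```python
import random
from itertools import combinations

# Hypothetical names for the five factors our colleagues brainstormed.
PREDICTORS = ["weekend", "day_game", "rainy", "rival", "opening_day"]


def candidate_models(predictors):
    """Every non-empty combination of predictors a model could be built on."""
    return [
        combo
        for size in range(1, len(predictors) + 1)
        for combo in combinations(predictors, size)
    ]


def train_test_split(rows, test_fraction=0.2, seed=0):
    """Shuffle the rows, then hold back a fraction of them as the test set."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


models = candidate_models(PREDICTORS)
games = list(range(100))  # stand-in IDs for 100 games, not real data
train, test = train_test_split(games)

print(len(models), len(train), len(test))
```

The key design point is that the test games are never shown to the model during training, so the test score is an honest estimate of how the model handles games it has not seen.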

At the end of this process we selected the best model and asked it to predict the attendance for the game in question given what we already knew: that it would be 1) an Opening Day game, 2) against a non-rival, 3) played during the day, and 4) on a weekday. In the end, our model predicted 41,735 people would turn out for a sunny game, and 40,550 for a rainy game.

Armed now with the knowledge of our model’s strengths and weaknesses, as well as its predictions, we gave everyone a chance to refine their initial guesses and added a little extra incentive (a $50 Giants’ Dugout gift card) for the person with the closest prediction.

Ultimately, 40,910 people attended the opening game at AT&T Park, about 2% below our model’s prediction for a sunny day and only 145 below the prize-winning guess of 41,055.
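
For anyone who wants to check our math, the comparison works out as follows. The numbers are the ones quoted in this post; the `pct_error` helper is just a name chosen for this sketch.

```python
actual = 40_910         # reported attendance
model_sunny = 41_735    # model prediction for a sunny game
model_rainy = 40_550    # model prediction for a rainy game
winning_guess = 41_055  # closest guess from our colleagues


def pct_error(prediction, actual_value):
    """Absolute error as a percentage of the actual value."""
    return abs(prediction - actual_value) / actual_value * 100


for label, value in [
    ("model (sunny)", model_sunny),
    ("model (rainy)", model_rainy),
    ("winning guess", winning_guess),
]:
    print(f"{label}: off by {abs(value - actual):,} ({pct_error(value, actual):.2f}%)")
```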

Fig. 2 – In a true victory for the “Wisdom of the Crowd”, the average of our combined guesses ended up being just 185 off the actual attendance (0.45%).

While we were obviously pleased with the end result of our model and the predictions it helped our colleagues make, we were most impressed with how quickly teams were able to identify applications in their day-to-day work. Immediately, our SEM experts began to discuss ways they could leverage predictive modeling to forecast keyword conversions, while our project managers wondered if we could use it to more accurately predict the number of hours needed for a given project.

As we suspected, when freed from complex math and technical jargon, most people found the basics of predictive modeling to be fairly intuitive.

Stay tuned for Part 2 of this blog series, where we’ll tackle Data Discovery/Unsupervised Learning models. We’ll explore how this class of algorithms can detect hidden patterns and relationships within your data.