Goal the second: Elasticsearch, Logstash, & Kibana (ELK)
Accomplishing this second goal with ELK v5 and getting everything working took significant effort and was a lot of fun. My previous familiarity with ELK helped a lot, even if it was only some Elasticsearch basics and some time getting up to speed with Kibana. Here are the overall steps I took:
Installed Elasticsearch and Kibana on a t2.medium Linux in AWS
Installed Logstash on my local machine
Wrote a Python script to clean the full season csv data and export as daily csv “logs”
Created a Logstash config file to handle the parsing of the csv files
“Stashed” the pitch records into Elasticsearch and visualized the data in Kibana
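The cleaning-and-splitting step can be sketched roughly as follows. This is not the original script, just a minimal illustration of the idea. The column name `game_date`, the `pitches_<date>.csv` file naming, and the assumption that dates are ISO-style (safe to use in a file name) are all mine.

```python
import csv
from collections import defaultdict
from pathlib import Path


def split_by_day(season_csv, out_dir, date_field="game_date"):
    """Write one CSV per distinct value of `date_field`; return the paths.

    `date_field` and the output naming are illustrative assumptions.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)

    rows_by_day = defaultdict(list)
    with open(season_csv, newline="", encoding="utf-8") as src:
        reader = csv.DictReader(src)
        fieldnames = reader.fieldnames
        for row in reader:
            # Trim stray whitespace so blank stats become truly empty strings.
            # Cells beyond the header (key None) are dropped.
            cleaned = {
                key: (value or "").strip()
                for key, value in row.items()
                if key is not None
            }
            rows_by_day[cleaned[date_field]].append(cleaned)

    written = []
    for day, rows in sorted(rows_by_day.items()):
        path = out_dir / f"pitches_{day}.csv"
        with open(path, "w", newline="", encoding="utf-8") as dst:
            writer = csv.DictWriter(dst, fieldnames=fieldnames)
            writer.writeheader()
            writer.writerows(rows)
        written.append(path)
    return written
```

This holds the whole season in memory before writing, which is simple and should be fine for a single season of pitches. A streaming version would be worth it for larger inputs.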
Installing Elasticsearch was mostly straightforward. I simply followed the instructions in the documentation: https://www.elastic.co/guide/index.html. Initially I installed Elasticsearch on a t2.micro instance, you know, trying to save a few pennies. As it turns out, version 5 installs with a default memory setting (the JVM heap size) larger than the RAM a t2.micro provides (currently 1 GiB). Here’s the catch: there is no immediate feedback that something is wrong. The command to start the service executes without error. And then there is nothing. Rather than edit the configs to decrease the minimum memory, I opted to spring for the extra $0.035 per hour and bump up to a t2.medium.
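Since the service fails silently, one way to catch this up front is to compare the configured heap with the instance's RAM before starting anything. The sketch below is only an illustration. It assumes the heap is set with `-Xms`/`-Xmx` lines in a `jvm.options` file (as I understand it, the 5.x packages ship with `-Xms2g` and `-Xmx2g`), and the sample text is made up rather than copied from a real install.

```python
import re

# Multipliers for the size suffixes the JVM accepts on -Xms / -Xmx.
_UNITS = {"k": 1024, "m": 1024 ** 2, "g": 1024 ** 3}


def configured_heap_bytes(jvm_options_text, flag="-Xms"):
    """Return the size set by `flag` (e.g. -Xms2g) in bytes, or None if absent."""
    for line in jvm_options_text.splitlines():
        line = line.strip()
        if line.startswith("#"):
            continue  # skip commented-out settings
        match = re.fullmatch(re.escape(flag) + r"(\d+)([kKmMgG]?)", line)
        if match:
            number, unit = match.groups()
            return int(number) * _UNITS.get(unit.lower(), 1)
    return None


def heap_fits(jvm_options_text, total_ram_bytes):
    """True only if a minimum heap is configured and is smaller than total RAM."""
    heap = configured_heap_bytes(jvm_options_text)
    return heap is not None and heap < total_ram_bytes


if __name__ == "__main__":
    sample = "# illustrative jvm.options content\n-Xms2g\n-Xmx2g\n"
    gib = 1024 ** 3
    print("t2.micro (1 GiB):", heap_fits(sample, 1 * gib))
    print("t2.medium (4 GiB):", heap_fits(sample, 4 * gib))
```

In practice the heap should sit well below total RAM, not just under it, so treat a passing check as the bare minimum.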
Getting my Logstash config file to execute without error took a lot of trial-and-error. A lot. One issue I ran into related to data type mapping in Elasticsearch. I had set the field “hit_speed” to be type float in my Logstash config. No rocket science there. But the data set I was using is ALL pitches, not just hits. So the records for the pitches that weren’t hit, by definition, would have no hit speed (or any other hit-related stat). In SQL Server, those records would import as either NULL or whatever default I set for NULL values. Elasticsearch just threw an error trying to put “” into a float field. And that totally makes sense. So I got cozy with the mutate filter in Logstash to remove the three hit_* stats when the hit_speed is blank.
if ("" in [hit_speed]) {
  mutate {
    remove_field => ["hit_speed", "hit_angle", "hit_distance_sc"]
  }
}
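The same rule could also be applied earlier, in the Python cleaning step, so the blank values never reach Logstash at all. Here is a minimal sketch of that alternative. It is not what I actually ran, and it assumes each pitch is handled as a plain dict keyed by column name.

```python
# The three hit-related stats that are blank when the pitch was not put in play.
HIT_FIELDS = ("hit_speed", "hit_angle", "hit_distance_sc")


def drop_blank_hit_stats(record):
    """Return a copy of `record` without the hit_* stats when hit_speed is blank."""
    if (record.get("hit_speed") or "").strip() == "":
        return {key: value for key, value in record.items() if key not in HIT_FIELDS}
    return dict(record)


if __name__ == "__main__":
    # Made-up records, purely for illustration.
    ball = {"pitcher": "A", "hit_speed": "", "hit_angle": "", "hit_distance_sc": ""}
    single = {"pitcher": "B", "hit_speed": "98.1", "hit_angle": "12.5", "hit_distance_sc": "210"}
    print(drop_blank_hit_stats(ball))
    print(drop_blank_hit_stats(single))
```

Either way the effect is the same: a document without a hit simply has no hit_* fields, so Elasticsearch never sees an empty string where it expects a float.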
The Kibana installation was also quite easy. One thing that I did when testing the whole setup was to spin up a Windows EC2 instance. I used that instance just to connect to Kibana via Chrome. Since the instance was in the same AWS environment as my Elasticsearch instance, I reduced the number of factors that could trip me up. Once I knew Kibana was working properly, I terminated the Windows instance and connected directly from my laptop.
So, what was so great about loading the baseball data into Elasticsearch? It does seem a bit Rube-Goldberg-ish just to do some simple calculations in Kibana. And it is. Before I get to a couple benefits, here’s the dashboard in Kibana showing the spin rate calculations.
A cool benefit of having the data available in Kibana is simply the opportunity to discover. Put another way, sometimes it’s fun to just surf through the data. Consider the following two examples:
How many triple plays were there in 2016? Sure, I could just search MLB.com, but why do that when I’ve already gone to all the trouble to load it into ELK? The search is pretty easy: "triple play" AND type:X. The search results show a basic histogram of the results and each raw record in Elasticsearch, with my search terms highlighted:
With very little effort, I can make these search results a bit more readable. I'll add specific fields to the results. When I do that, Kibana also removes the "_source" field from the results, which makes for a cleaner display. I added the home and away teams, the half inning the play occurred, the pitcher's name, and the description of the play. Here's a screenshot of the results now with these specific fields:
On the screenshot, I also showed a handy feature in Kibana (this is not new to Kibana 5). When I click on the field name (here I selected "home_team"), Kibana will show me a quick count of the values. I've found this feature quite useful in understanding the data, although it's not quite as insightful when only seven records are returned.
Adding the additional fields shows me a couple of things. First, the Chicago White Sox turned three triple plays in 2016? Wow. Also, because I dropped in the full description of the play, I can see that their third triple play -- on July 8th -- seemed pretty unusual. I found the clip on mlb.com to see how the play actually unfolded.
http://m.mlb.com/news/article/188746916/tim-anderson-turns-third-white-sox-triple-play/
Now let’s look at inside-the-park home runs. These are pretty rare plays, and I’m thinking there were fewer inside-the-park homers than triple plays. Enter search: "inside the park" AND type:X. In 2016 batters hit inside-the-park home runs nine times. Now, say that in your best Ed Rooney voice, “Niiiine tiiiimes.” At least for 2016, triple plays were rarer than inside-the-park home runs, but I would still categorize both as very uncommon if not truly "rare."
Because all of the data is already in Elasticsearch, I can easily search across the available fields as my mind thinks through different questions. For example, looking at these 9 inside-the-park home runs, I notice that eight of the nine are solo home runs -- no other runners were on base. If I were analyzing these in more detail, that might be something I'd want to look at across multiple years.
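The searches I typed into Kibana are Lucene query strings, so the same text can be sent straight to Elasticsearch through a `query_string` query. Below is a small sketch that only builds the request body. The idea of posting it to a `_search` endpoint on an index named `pitches` is my assumption for illustration, since the index name isn't something I've shown here.

```python
import json


def build_search_body(query, fields=None, size=10):
    """Build an Elasticsearch search body from a Kibana-style query string.

    `fields`, if given, limits the returned _source to those fields,
    much like picking columns in Kibana's Discover view.
    """
    body = {"query": {"query_string": {"query": query}}, "size": size}
    if fields:
        body["_source"] = list(fields)
    return body


if __name__ == "__main__":
    # The two searches from this post; "home_team" is a field from the data set.
    for query in ('"triple play" AND type:X', '"inside the park" AND type:X'):
        body = build_search_body(query, fields=["home_team"])
        # This JSON would be the body of a request to a hypothetical pitches/_search.
        print(json.dumps(body, indent=2))
```

Keeping the query as a plain string means whatever I try in Kibana's search bar can be pasted into a script unchanged.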
While searching the data, I contrasted searching in Kibana to performing the same exercise in Tableau. Could I create a calculated field to determine whether a play was a triple play or an inside-the-park home run? Certainly! But it wouldn't be as simple, and I’d have to create a new one for every question. In Tableau, it's true that I can also view the source data to see the full text of the plays, but it’s not right there in front of me. Note that these aren’t points against Tableau. They highlight that the tool I select has a great deal of influence on HOW I explore the data and answer questions. I’d also argue that it has a great deal to do with determining what questions to even ask in the first place.
Result? GOAL COMPLETED
Other posts in this series: