Classifying MLB Pitch Zones and Predicting MiLB Zones (2024)

Zones

Within the MLB Stats API and Baseball Savant API, pitch locations are classified into 13 zones, as illustrated in Figure 1. These zones, as seen from the catcher’s perspective, define the strikezone and the relative location of a pitch. For example, a pitch inside “Zone 1” would be classified as a “Strike” and would be “up and away” to a left-handed batter. In the same vein, a pitch located in Zone 11, 12, 13, or 14 is outside of the strikezone and would be classified as a “Ball”.

All pitches located in Zones 1–9 are considered “In-Zone”.

All pitches located in Zones 11–14 are considered “Out-of-Zone”.
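
The In-Zone and Out-of-Zone rule above can be sketched in a few lines of Python. This is only an illustration; the function name `zone_type` is hypothetical and does not come from the MLB Stats API.

```python
# Zone sets as described above: 1-9 are inside the strikezone, 11-14 are outside.
IN_ZONE = set(range(1, 10))
OUT_OF_ZONE = {11, 12, 13, 14}


def zone_type(zone):
    """Map a zone number to "In-Zone" or "Out-of-Zone" (hypothetical helper)."""
    if zone in IN_ZONE:
        return "In-Zone"
    if zone in OUT_OF_ZONE:
        return "Out-of-Zone"
    raise ValueError(f"Unknown zone: {zone}")


print(zone_type(1))   # In-Zone
print(zone_type(14))  # Out-of-Zone
```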

Figure 1: The 13 pitch zones, as seen from the catcher’s perspective.

Zone Metrics

The zone classification of pitches allows analysts to calculate plate discipline metrics beyond conventional metrics such as Whiff% and Swing%. One such metric is Out-of-Zone Swing Rate, also known as Chase% or O-Swing%. O-Swing% is the rate at which a batter swings at pitches that are “Out-of-Zone”, and it is a large driver of a batter’s plate discipline outcomes. Intuitively, batters who swing at fewer pitches outside the zone tend to have higher walk rates (BB%): the more “Balls” a batter takes, the more likely they are to walk.
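
As a rough sketch of how O-Swing% could be computed from pitch-level records, assume each pitch is a dictionary with a `zone` number and an `is_swing` flag. Both field names, and the helper `swing_rate`, are made up for this illustration.

```python
OUT_OF_ZONE = {11, 12, 13, 14}


def swing_rate(pitches, zones):
    """Share of pitches located in `zones` that were swung at.

    Returns None when the batter saw no pitches in those zones.
    """
    seen = [p for p in pitches if p["zone"] in zones]
    if not seen:
        return None
    swings = sum(1 for p in seen if p["is_swing"])
    return swings / len(seen)


# Made-up pitches: four are Out-of-Zone and one of those four is a swing.
pitches = [
    {"zone": 5, "is_swing": True},
    {"zone": 11, "is_swing": False},
    {"zone": 13, "is_swing": True},
    {"zone": 14, "is_swing": False},
    {"zone": 12, "is_swing": False},
]

o_swing = swing_rate(pitches, OUT_OF_ZONE)
print(f"O-Swing%: {o_swing:.1%}")  # O-Swing%: 25.0%
```

Passing a different set of zones to the same helper gives the swing rate for any other region.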

Another example of a zone-based plate discipline metric is In-Zone Swing Rate (Z-Swing%). Similar to O-Swing%, Z-Swing% is the rate at which a batter swings at pitches that are “In-Zone”. In terms of contact, batters perform significantly better against pitches thrown inside the strikezone than against pitches thrown outside it. The following table summarizes the Expected wOBA on Contact (xwOBACON) for pitches inside and outside of the strikezone during the 2023 MLB Season.

Table: xwOBACON for pitches inside and outside of the strikezone, 2023 MLB Season.

Batters who consistently swing at pitches inside the strikezone are more likely to perform at a higher level than those who do not. Figure 2 illustrates the xwOBACON of each of the zones during the 2023 MLB Season. Pitches thrown near the heart of the plate tend to be more favourable for batters than those further away from the centre.

Figure 2: xwOBACON by zone, 2023 MLB Season.

MLB Gameday is an app created by Major League Baseball which allows fans to follow games in real-time, including scores and pitch-by-pitch data. Figure 3 shows a screenshot of MLB Gameday.

Figure 3: Screenshot of MLB Gameday.

Statcast is a high-speed, high-accuracy, automated tool developed by MLB to track player and ball movements during a game.

Pitch Tracking

In games which have Statcast, the MLB API provides two different coordinate systems for pitch locations. For simplicity, I will refer to these systems as the “Statcast” and “Gameday” coordinate systems. Using these systems, the location of a pitch is defined as:

Statcast

  • pX — Horizontal position in feet of the ball as it crosses the front axis of home plate.
  • pZ — Vertical position in feet above home plate of the ball as it crosses the front axis of home plate.

Gameday

  • x — X coordinate where pitch crossed front of home plate in pixels from the origin.
  • y — Y coordinate where pitch crossed front of home plate in pixels from the origin.

These definitions read very similarly, but the Gameday coordinates are in pixels from an unknown origin and there are no “Zones”. Plotting each system will help us compare them and see if there are any discrepancies.

Plotting Statcast vs Gameday Pitch Locations

The dataset used contains all the pitches during the 2023 MLB Season and their respective pitch locations in both the Statcast and Gameday Coordinate System. Additionally, the Statcast Zone Types are defined for each pitch. This data was gathered from the MLB Stats API.

First, we will take a look at the distribution of pitches in the dataset, and also the distribution of “In-Zone” vs “Out-of-Zone” pitches.

The distribution of pitch locations is approximately normal. It is interesting to note that the distribution for “In-Zone” pitches along the x-direction truncates abruptly, rather than following a shape similar to the z-direction distribution. This observation makes sense, as the bounds of the zone in the x-direction are fixed in relation to the width of home plate, while the bounds in the z-direction vary depending on the batter.

Now let’s take a look at the relationship between the Statcast and Gameday Systems.

At first glance, we can see a major difference between the two systems:

  1. The Gameday System seems to be rotated 180° from the Statcast System. From this, we can infer that the origin of the Gameday System is above the strikezone and to the right (catcher’s perspective), where positive values indicate distance below the origin (towards the ground) and to the left (catcher’s perspective).

Let’s transform the plot, so they are both in the same orientation.
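
A 180° rotation about the origin simply negates both coordinates. The sketch below assumes that is all the alignment step needs; the function name `rotate_180` is hypothetical, and the exact transformation used for the plot is not shown in this article.

```python
def rotate_180(x, y):
    """Rotate a point 180 degrees about the origin by negating both coordinates."""
    return (-x, -y)


# A made-up Gameday location in pixels. After rotation, larger (less negative)
# values point right and up, matching the Statcast orientation.
print(rotate_180(100.0, 160.0))  # (-100.0, -160.0)
```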

Now that we have the systems aligned, we can identify another main difference:

  1. The Statcast System’s origin (0,0) is located at the centre of home plate and at a distance of 0 ft above home plate. The Gameday system’s origin (0,0) is not plotted, as no pitches in this system are near the origin.

Despite this difference, the pitch locations in both systems are relatively similar. Let’s take a look at the relationship between the Statcast and Gameday coordinates. A scatter plot will illustrate the relationship (if any) between the coordinate systems.

The relationship between the Gameday and Statcast coordinates is linear. A linear regression model can be trained to predict Statcast locations from Gameday locations, and vice versa. Additionally, since Statcast data includes both the zones and the Gameday system, a classification machine learning model can be trained to predict zone locations using Gameday data. This means that in instances where only Gameday data is available (like MiLB games), both Statcast pitch locations and zones can be predicted.

As we can see, the Gameday Data is somewhat noisy, as not all the Gameday locations follow the same linear relationship. To address this, we can train a linear regression model to help identify and potentially remove the noisy data, which will allow our zone location classification model to perform better.
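
To make the regression step concrete, here is a minimal ordinary least squares fit for a single axis in plain Python. It assumes each axis is fitted separately, which may differ from the actual model in the repository, and the sample values are made up rather than taken from the dataset.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of ys = slope * xs + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Made-up example: Gameday x in pixels, Statcast pX in feet.
gameday_x = [60.0, 90.0, 120.0, 150.0, 180.0]
statcast_x = [1.5, 0.75, 0.0, -0.75, -1.5]

slope, intercept = fit_line(gameday_x, statcast_x)


def predict(x):
    """Predicted Statcast value for one Gameday value."""
    return slope * x + intercept


# Residuals (actual minus predicted) are what the noise check is based on.
residuals = [y - predict(x) for x, y in zip(gameday_x, statcast_x)]
print(slope, intercept)
```

The negative slope in this toy data reflects the flipped orientation of the raw Gameday axis relative to Statcast.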

The Coefficient of Determination (R²) is extremely close to 1, which indicates that the independent variable (Gameday Location) explains nearly all of the variability of the dependent variable (Statcast Location).
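
For reference, R² is one minus the ratio of the residual sum of squares to the total sum of squares. A small sketch with made-up numbers (not values from the dataset):

```python
def r_squared(actual, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot


actual = [1.0, 2.0, 3.0, 4.0]
predicted = [1.1, 1.9, 3.2, 3.8]

r2 = r_squared(actual, predicted)
print(round(r2, 2))  # 0.98
```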

Let’s take a look at the residuals of the model to see whether there is any pattern in the data that does not follow the linear relationship.

There does seem to be a slight correlation between the corresponding direction and the residual size. However, there are multiple points which have drastically different residual values. Since we know the relationship is effectively linear, we can assume that any data falling outside a specific residual threshold was incorrectly collected and can therefore be removed from the dataset.

As a sanity check, we can take a look at an “outlier” pitch to see if there is a discrepancy in the original Statcast Data vs the Gameday Data.

There is a pitch in the dataset which has the following metrics:

  • pitch_x_statcast: 1.96
  • pitch_z_statcast: -0.16
  • pitch_x_statcast_new: 0.28
  • pitch_z_statcast_new: 5.95
  • zone: 14

According to these metrics, the actual location of the pitch was down and to the right of the catcher, which is where Zone 14 is located. According to the predicted location, this pitch was thrown approximately 6 ft above the centre of home plate, which would be in Zone 12.

Thankfully, MLB has a repository of all pitches from the 2023 season, so we can verify the location of the pitch. Let’s take a look at that pitch.

https://sporty-clips.mlb.com/5cb43712-941a-4359-b876-75189b61c425.mp4

As seen in the replay of the pitch, the location was in fact what was measured by Statcast. This means that the Gameday System did not accurately reflect the location of the pitch. Given the strong linear correlation between the Statcast and Gameday Systems, we will remove from the dataset all pitches which fall outside the 99th percentile of either the x or z residual.
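
One way to apply this filter is sketched below. It assumes the cut is made on absolute residuals using a nearest-rank percentile; both choices, along with the field names `resid_x` and `resid_z`, are assumptions for illustration. The single bad row uses the residuals implied by the outlier pitch above (1.96 − 0.28 and −0.16 − 5.95).

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list, with pct in (0, 100]."""
    ordered = sorted(values)
    rank = math.ceil(pct / 100 * len(ordered))
    return ordered[max(rank, 1) - 1]


def remove_outliers(rows, pct=99):
    """Keep rows whose absolute x and z residuals are both within the percentile."""
    thr_x = percentile([abs(r["resid_x"]) for r in rows], pct)
    thr_z = percentile([abs(r["resid_z"]) for r in rows], pct)
    return [
        r for r in rows
        if abs(r["resid_x"]) <= thr_x and abs(r["resid_z"]) <= thr_z
    ]


# 200 made-up, well-behaved pitches plus the one mistracked pitch.
rows = [{"resid_x": 0.01 * (i % 5), "resid_z": 0.01 * (i % 7)} for i in range(200)]
rows.append({"resid_x": 1.68, "resid_z": -6.11})

clean = remove_outliers(rows)
print(len(rows), len(clean))  # 201 200
```

On real, continuous data a 99th percentile cut on each axis discards roughly 1% to 2% of pitches, including some that were tracked correctly.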

Following the removal of the outliers, we can draw another scatter plot to see the new data. Visually, the noise in the dataset appears to have been removed, which should improve the accuracy of a classification model which predicts the zone locations using Gameday Data.

A Random Forest Model is an ensemble method that combines multiple decision trees. In short, the model learns the boundary of each class from its features and uses those boundaries to predict the class of an input data point. Since the zones are bounded by simple polygons, the classification is straightforward and well suited to a Random Forest.

The features of the Random Forest are the following:

  • pitch_x_gameday: MLB Gameday X Coordinates
  • pitch_z_gameday: MLB Gameday Z Coordinates
  • k_zone_top: Top of the batter’s strike zone set by the operator when the ball is halfway to the plate.
  • k_zone_bot: Bottom of the batter’s strike zone set by the operator when the ball is halfway to the plate.

The target of the Random Forest is the following:

  • Zone: Zone location of the ball when it crosses the plate from the catcher’s perspective

These features were selected as they define the location of the pitch and the size of the strikezone. The top and bottom of the strikezone vary between batters, as the size of the strikezone is proportional to the height of the batter. Considering these features will assist in producing an accurate model to predict the zone of each pitch.
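
A minimal sketch of this setup with scikit-learn is shown below. It trains on synthetic stand-in data in feet, not real Gameday pixels, and labels it with a simplified 3×3 grid plus four outer quadrants. The grid rule, the 0.83 ft half-width, and the hyperparameters are all assumptions for illustration and are not taken from the repository.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 5000

# Synthetic stand-ins for pitch_x_gameday, pitch_z_gameday, k_zone_top, k_zone_bot.
pitch_x = rng.uniform(-2.0, 2.0, n)
pitch_z = rng.uniform(0.5, 4.5, n)
k_zone_top = rng.uniform(3.2, 3.8, n)
k_zone_bot = rng.uniform(1.4, 1.8, n)

# Simplified labelling rule (an assumption): Zones 1-9 form a 3x3 grid inside
# the strikezone and Zones 11-14 are the four quadrants surrounding it.
HALF_WIDTH = 0.83
third = (k_zone_top - k_zone_bot) / 3
inside = (
    (np.abs(pitch_x) <= HALF_WIDTH)
    & (pitch_z >= k_zone_bot)
    & (pitch_z <= k_zone_top)
)
col = np.digitize(pitch_x, [-HALF_WIDTH / 3, HALF_WIDTH / 3])  # 0, 1 or 2
row = np.where(
    pitch_z > k_zone_top - third, 0,
    np.where(pitch_z > k_zone_bot + third, 1, 2),
)
mid_z = (k_zone_top + k_zone_bot) / 2
outside_zone = np.where(
    pitch_z >= mid_z,
    np.where(pitch_x < 0, 11, 12),
    np.where(pitch_x < 0, 13, 14),
)
zone = np.where(inside, row * 3 + col + 1, outside_zone)

# Four features per pitch, one zone label per pitch.
X = np.column_stack([pitch_x, pitch_z, k_zone_top, k_zone_bot])
X_train, X_test, y_train, y_test = train_test_split(
    X, zone, test_size=0.25, random_state=0
)

model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_train, y_train)

accuracy = model.score(X_test, y_test)  # mean accuracy on held-out pitches
print(f"Held-out accuracy: {accuracy:.3f}")
```

Including `k_zone_top` and `k_zone_bot` as features lets the trees adapt the vertical boundaries to each batter instead of learning a single fixed zone height.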

Assessing The Model

Now that we have a Random Forest trained, we should assess it. The accuracy of the model is 96%, which is a very strong indicator that the model is effective. Let’s look deeper into it and plot the Testing Data to compare the actual Zone Classifications to the Predicted Zone Classifications. The data was limited to 5000 points for clarity.

The model seems to be performing well. The model uses more than 2 features, so a scatter plot will not be able to capture all the dimensions of the model, but we can take a look at a confusion matrix to understand the distribution of predicted results vs actual results.

The high accuracy of the model is clearly presented in this matrix. The largest discrepancies occur between zones which are vertically adjacent to one another. This makes sense as the horizontal size of the strikezone does not change, but the vertical size is variable, mostly defined by the height of the batter.
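
A confusion matrix is just a count of (actual, predicted) pairs. The sketch below builds one with made-up labels in which the only misses land in a vertically adjacent zone (2 and 8 confused with 5), mirroring the pattern described above.

```python
from collections import Counter


def confusion_counts(actual, predicted):
    """Count how often each (actual zone, predicted zone) pair occurs."""
    return Counter(zip(actual, predicted))


# Made-up labels, not model output.
actual = [2, 2, 5, 5, 8, 8, 11, 14]
predicted = [2, 5, 5, 5, 8, 5, 11, 14]

counts = confusion_counts(actual, predicted)
correct = sum(c for (a, p), c in counts.items() if a == p)
accuracy = correct / len(actual)

print(counts[(2, 5)], counts[(8, 5)])  # 1 1
print(accuracy)  # 0.75
```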

We can use the predicted zone locations to also predict whether a pitch was In-Zone or Out-of-Zone.

We can also plot a confusion matrix for In-Zone vs Out-of-Zone predictions to see the distribution of true positive, true negative, false positive, and false negative predictions.

For predicting whether a pitch was “In-Zone” or “Out-of-Zone”, the model has an accuracy of 98.5%.
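
The In-Zone vs Out-of-Zone accuracy is higher than the zone-level accuracy because a wrong zone still counts as correct when it falls on the same side of the strikezone boundary. A small sketch with made-up labels (the helper names are hypothetical):

```python
def is_in_zone(zone):
    """Zones 1-9 are inside the strikezone; Zones 11-14 are outside."""
    return 1 <= zone <= 9


def in_zone_accuracy(actual_zones, predicted_zones):
    """Share of pitches whose In-Zone/Out-of-Zone label is predicted correctly."""
    matches = sum(
        is_in_zone(a) == is_in_zone(p)
        for a, p in zip(actual_zones, predicted_zones)
    )
    return matches / len(actual_zones)


# Made-up labels: three exact zone matches out of six, but five of six pitches
# land on the correct side of the strikezone boundary.
actual = [1, 4, 9, 11, 13, 3]
predicted = [2, 4, 14, 11, 12, 3]

zone_acc = sum(a == p for a, p in zip(actual, predicted)) / len(actual)
binary_acc = in_zone_accuracy(actual, predicted)
print(zone_acc, round(binary_acc, 3))  # 0.5 0.833
```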

We now have two models:

  1. A linear regression model which predicts Statcast coordinates given Gameday coordinates
  2. A Random Forest Classifier which predicts zones given Gameday coordinates

All Minor League Baseball (MiLB) Games have Gameday data, but not all MiLB Games have Statcast Data. We can use the models we trained to predict the zone of each MiLB Pitch and also calculate Zone metrics.

Limitations

We do run into some limitations with the MiLB pitch data.

  1. MiLB Pitch Data is manually recorded, whereas MLB Pitch Data is recorded through Hawk-Eye cameras to get exact measurements. Due to the manual nature of the pitch tracking, errors in the data are much more likely.
  2. When there is room for human error, the predictions made by the model may not be accurate.

Knowing these limitations and the issues they may pose, we can take a look at the High-A pitch location distribution and get a sense of the MiLB data.

MiLB High-A Pitch Location Distribution

This analysis will focus on the 2023 MiLB High-A Season. The data was gathered from the MLB Stats API.

Let’s take a look at the distribution of pitches during the 2023 MiLB High-A Season.

Unfortunately, it seems that the manually tracked data from the 2023 High-A season is not normally distributed like the MLB Statcast locations were. While this won’t impact the ability of the zone prediction model to run, it most likely means that the model is making predictions on a large portion of inaccurate data.

We will continue with the prediction of pitch zones. We acknowledge that the accuracy of the predictions may be in question, but we will assume that the data collection was accurate enough for meaningful analysis to be conducted.

High-A Zone Predictions

Let’s plot the predicted zones of the data.

The model classifies the data into its distinct clusters effectively. With these predictions, we can calculate High-A Zone Metrics.
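
Once each pitch has a predicted zone, per-batter zone metrics are a simple group-and-count. The sketch below assumes each pitch record has `batter`, `zone`, and `is_swing` fields. Those names, the helper functions, and the `min_pitches` parameter are hypothetical.

```python
from collections import defaultdict


def _swing_share(pitches):
    """Fraction of the given pitches that were swung at, or None if empty."""
    if not pitches:
        return None
    return sum(1 for p in pitches if p["is_swing"]) / len(pitches)


def batter_zone_metrics(pitches, min_pitches=1):
    """Z-Swing% and O-Swing% per batter, skipping batters below min_pitches."""
    by_batter = defaultdict(list)
    for p in pitches:
        by_batter[p["batter"]].append(p)

    metrics = {}
    for batter, seen in by_batter.items():
        if len(seen) < min_pitches:
            continue
        in_zone = [p for p in seen if 1 <= p["zone"] <= 9]
        out_zone = [p for p in seen if p["zone"] >= 11]
        metrics[batter] = {
            "z_swing": _swing_share(in_zone),
            "o_swing": _swing_share(out_zone),
        }
    return metrics


# Made-up pitches with predicted zones.
pitches = [
    {"batter": "A", "zone": 5, "is_swing": True},
    {"batter": "A", "zone": 2, "is_swing": False},
    {"batter": "A", "zone": 13, "is_swing": True},
    {"batter": "A", "zone": 14, "is_swing": False},
    {"batter": "B", "zone": 11, "is_swing": False},
]

metrics = batter_zone_metrics(pitches, min_pitches=2)
print(metrics)  # {'A': {'z_swing': 0.5, 'o_swing': 0.5}}
```

A minimum pitch count keeps small samples from producing extreme rates.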

High-A Zone Metric Plots

The following scatter plots illustrate zone metrics for High-A batters who saw at least 1500 pitches during the 2023 season. In each plot, the top right quadrant is most favourable for the batter.

Plate discipline metrics which account for the location of the pitches a batter faces provide a deeper understanding of a batter’s plate approach and swing tendencies. Through the use of Hawk-Eye and Statcast, the locations and metrics are accurately calculated and made accessible to the public. Unfortunately, MiLB metrics such as these are not provided by MLB, and thus inaccuracies in the dataset will continue to be present until Statcast data for all MiLB parks becomes public. Through the training and testing of both a linear regression model and a random forest classifier, zone locations were predicted for all pitches thrown during the 2023 MiLB High-A Season, and subsequently, zone metrics for all batters and pitchers were calculated. These metrics provide coaches, scouts, and analysts with information needed to make decisions which will ultimately shape the future of baseball and its players.

Check out my GitHub Repository for the code and outputs of this project: https://github.com/tnestico/mlb_zone_class
