6 min read

Of Sixes and Fours - Analyzing the IPL using the tidyverse

We are back with another post on the Indian Premier League. This is the fourth post in the series. We will assume that you have already read the previous article analyzing strike rates here. One change since the last article is that Cricsheet now has updated data available - so we have the details of all matches played up to 2019.

The initial part of the code is very similar to the previous article. We will not explain the details here, but the code has been repeated below.

batsmen_striker <- match_deliveries %>%
  distinct(delivery_batsman)
nrow(batsmen_striker)
## [1] 514
batsmen_matches <- match_deliveries %>%
  distinct(delivery_batsman, id) %>%
  group_by(delivery_batsman) %>%
  summarise(n_matches = n()) %>%
  ungroup() %>%
  arrange(desc(n_matches))
summary(batsmen_matches$n_matches)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00    3.00    8.00   21.97   23.00  189.00
batsmen_matches <- batsmen_matches %>%
  filter(n_matches >= 20)
batsmen_match_details <- match_deliveries %>%
  inner_join(batsmen_matches, by = c("delivery_batsman" = "delivery_batsman"))
table(batsmen_match_details$delivery_extras_type)
## 
##    byes legbyes noballs penalty   wides 
##     375    2546     608       2    4823
noballs <- batsmen_match_details %>%
  filter(delivery_extras_type == "noballs", delivery_runs_total > 1)
byes <- batsmen_match_details %>%
  filter(delivery_extras_type == "byes", delivery_runs_batsman > 1)
legbyes <- batsmen_match_details %>%
  filter(delivery_extras_type == "legbyes", delivery_runs_batsman > 1)
wides <- batsmen_match_details %>%
  filter(delivery_extras_type == "wides", delivery_runs_batsman > 1)
penalty <- batsmen_match_details %>%
  filter(delivery_extras_type == "penalty")
batsmen_match_details <- batsmen_match_details %>%
  filter(!(delivery_extras_type %in% c("wides", "penalty")))

At the end of the code above, the dataset batsmen_match_details is restricted to batsmen who have played at least 20 matches. We have also removed the deliveries which don’t contribute to runs scored by a batsman.

Anyone who follows the game of cricket knows that T20 cricket is primarily a batsman’s game. The crowd loves it when the batsmen send the ball into the stands. More elegant players might not use a lot of power in their shots, but may gracefully place it in the gaps for four runs. In this article, we examine the percentage of runs scored boundaries (sixes and fours) by top batsmen in this tournament.

We need to be careful about one particular aspect of the data. Even though four or six runs may be scored off a ball, they are not necessarily to be counted as boundaries. The dataset has a field called delivery_runs_non_boundary to identify such cases.

table(batsmen_match_details$delivery_runs_non_boundary)
## 
##      0      1 
## 153590     12
batsmen_match_details$delivery_runs_batsman[
  batsmen_match_details$delivery_runs_non_boundary == 1]
##  [1] 4 4 4 6 4 4 4 6 4 6 6 4

Looking at the above results, there are only 12 such cases. In all such cases, the runs scored by a batsman in that ball was 4 or 6. For each delivery, we use the case_when function from dplyr to derive two fields - runs_boundary and runs_non_boundary. We also verify that the total runs scored by the batsman in the delivery adds up to the sum of these two fields.

batsmen_match_runs <- batsmen_match_details %>%
  mutate(
    runs_boundary = case_when(
      delivery_runs_non_boundary == 0 &
        (delivery_runs_batsman %in% c(4, 6)) ~ delivery_runs_batsman,
      TRUE ~ 0L
    ),
    runs_non_boundary = case_when(
      delivery_runs_non_boundary == 1 |
        !(delivery_runs_batsman %in% c(4, 6)) ~ delivery_runs_batsman,
      TRUE ~ 0L
    )
  ) %>%
  select(id, innings_num, delivery_batsman, delivery_runs_batsman,
         runs_boundary, runs_non_boundary)
nrow(subset(
  batsmen_match_runs,
  delivery_runs_batsman != runs_boundary + runs_non_boundary
))
## [1] 0

We first summarise the data at the batsman level. We also verify that the total number of runs scored by a batsman in their IPL career adds up to the runs scored in boundaries and non-boundaries.

batsmen_runs_summary <- batsmen_match_runs %>%
  group_by(delivery_batsman) %>%
  summarise(
    n_balls = n(),
    delivery_runs_batsman = sum(delivery_runs_batsman),
    runs_boundary = sum(runs_boundary),
    runs_non_boundary = sum(runs_non_boundary)
  ) %>%
  ungroup() %>%
  arrange(desc(delivery_runs_batsman))
nrow(subset(
  batsmen_runs_summary,
  delivery_runs_batsman != runs_boundary + runs_non_boundary
))
## [1] 0

As there are 151 unique batsmen in the data which is too much to deal with, we restrict the data to those who have scored at least 1000 runs in their IPL career. After this step, we are left with 70 batsmen.

batsmen_runs_summary <- batsmen_runs_summary %>%
  filter(delivery_runs_batsman > 1000)
nrow(batsmen_runs_summary)
## [1] 70

We now derive two fields in the data. runs_boundary_pct is the percentage of runs scored as boundaries in their IPL career. career_strike_rate is the strike rate of the batsman in their IPL career.

batsmen_runs_summary <- batsmen_runs_summary %>%
  mutate(runs_boundary_pct = runs_boundary / delivery_runs_batsman,
         career_strike_rate = delivery_runs_batsman / n_balls) %>%
  arrange(desc(runs_boundary_pct))
summary(batsmen_runs_summary$runs_boundary_pct)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.4820  0.5508  0.5906  0.5975  0.6397  0.7886

The summary statistics show that of all batsmen who have played at least 20 matches and scored 1000 runs, no one has scored less than 48% of their runs in boundaries. There is one (or more) batsmen who have scored close to 80% of their runs in boundaries. We’ll soon see who they are.

A reasonable hypothesis is that those who have a large percentage of runs from boundaries will also have higher career strike rates. Let us take a look at a plot to see if this is indeed the case.

ggplot(batsmen_runs_summary,
       aes(runs_boundary_pct, career_strike_rate)) +
  geom_point(colour = "blue") +
  geom_smooth(method = "lm", se = FALSE, alpha = 0.1, colour = "black") +
  scale_x_continuous(labels = scales::percent) +
  scale_y_continuous(labels = scales::percent) +
  xlab("% Runs (Boundaries)") +
  ylab("IPL Career Strike Rate") +
  theme_bw() +
  theme(
    panel.border = element_blank(),
    legend.position = "none",
    plot.title = element_text(size = 20, hjust = 0.5, vjust = 0.5),
    axis.title = element_text(size = 10),
    axis.text = element_text(size = 10)
  )

The hypothesis appears to be true. The linear regression line, as produced by geom_smooth and the lm method has a positive slope. There seems to be one outlier in the data - someone with a strike rate close to 190% and around 80% of their runs from boundaries. Can you guess who that is? We’ll see soon.

We finally plot the data for each of the batsmen. We use a dot plot to visualize the results.

ggplot(batsmen_runs_summary,
       aes(reorder(delivery_batsman, runs_boundary_pct),
           runs_boundary_pct)) +
  geom_point(colour = "blue") +
  scale_y_continuous(labels = scales::percent) +
  xlab(NULL) +
  ylab("% Runs (Boundaries)") +
  theme_bw() +
  theme(
    panel.border = element_blank(),
    legend.position = "none",
    plot.title = element_text(size = 20, hjust = 0.5, vjust = 0.5),
    axis.title = element_text(size = 10),
    axis.text = element_text(size = 10)
  ) +
  coord_flip()

Most people would have guessed Chris Gayle or Andre Russell for the outlier in the previous figure. I believe Russell’s recent performance in the 2019 IPL has ensured that he occupies the top spot as of the date of writing this article. I was a bit surprised to see that five Australians appear in the top 10 - Gilchrist, Maxwell, Lynn, Watson and Hayden. Considering that Dwayne Smith also appears in the top 10, the domination by the Aussies and the West Indians in this list appears complete.

As always, the complete code is available in Github.