# Statsketball 2019

Below is my submission methodology for the 2019 American Statistical Association Statsketball Draft Challenge. Having won the contest in 2018, I knew that I had to up my game this year to defend my crown.

# Contest Background

The Statsketball Draft Challenge gives participants the chance to select NCAA Tournament teams using a budget of 224 draft points. The cost of each team, in draft points, is based on the team’s seed.

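| seed | cost |
|---|---|
| 1 | 75 |
| 2 | 40 |
| 3 | 25 |
| 4 | 20 |
| 5 | 17 |
| 6 | 15 |
| 7 | 12 |
| 8 | 10 |
| 9 | 9 |
| 10 | 8 |
| 11 | 7 |
| 12 | 6 |
| 13 | 5 |
| 14 | 4 |
| 15 | 3 |
| 16 | 1 |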

The deeper each team goes in the tournament, the more points it accrues, with each round more valuable than the last. Wins in rounds 1 through 6 are worth 1, 2, 3, 5, 8, and 13 points, respectively. The goal of the contest is to score the most points by selecting a group of teams within the allotted budget.
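To make the scoring rule concrete, here is a minimal sketch in R. The names `round_points` and `team_points` are my own illustrative choices, not part of the contest materials.

```r
# Points awarded for a win in rounds 1 through 6 (from the contest rules)
round_points <- c(1, 2, 3, 5, 8, 13)

# Total points earned by a team that wins `n_wins` games (0 to 6).
# `team_points` is a hypothetical helper name used only for illustration.
team_points <- function(n_wins) {
  stopifnot(n_wins >= 0, n_wins <= 6)
  sum(round_points[seq_len(n_wins)])
}

team_points(0)  # eliminated in the first round: 0 points
team_points(4)  # reaches the Final Four: 1 + 2 + 3 + 5 = 11 points
team_points(6)  # wins the title: 32 points
```

A champion is therefore worth 32 points, and more than half of that total comes from the final two wins.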

# Methodology

### NCAA Hoops Model

A key ingredient of my submission is the NCAA men’s basketball model that I have developed over the last 3 years. In very simple terms, the model uses weighted least squares regression to predict the score differential between any two teams, and then uses logistic regression to transform the predicted point spread into a win probability for each team. The specifics of this model are not the primary focus of the Statsketball submission, but for those curious, a detailed write-up of the model methodology can be found here. For the purposes of this contest, all that matters is that given teams $$A$$ and $$B$$, the model outputs $$P_A$$ and $$P_B$$, the chances that team $$A$$ and team $$B$$ win the game, respectively.

### Simulating the NCAA Tournament

The second step in my submission involved running Monte Carlo simulations of the NCAA tournament. A single simulation of the tournament used the following steps:

1. Use the NCAA Hoops Model to predict the win probabilities for all 32 games in the first round.
2. Draw 32 random numbers from a uniform distribution on $$[0, 1]$$, one for each game.
3. Let game $$i$$ be played between team $$A_i$$ and team $$B_i$$. If the $$i^{th}$$ random number is $$\leq$$ team $$A_i$$’s chances of winning, advance team $$A_i$$ to the next round. Otherwise, advance team $$B_i$$ to the next round.
4. Repeat step 3 for all games in the given round.
5. Repeat steps 1-4 for subsequent rounds, until a champion is crowned.
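The steps above can be sketched in a few lines of R. This is a simplified illustration, not the code used for the submission: `sim_bracket`, the four made-up teams, their `ratings`, and the `toy_prob` function (including its 0.15 coefficient) are all assumptions standing in for the real bracket and the NCAA Hoops Model.

```r
# Simulate one single-elimination bracket.
# `teams`: character vector in bracket order (length a power of 2), so that
#          teams[1] plays teams[2], teams[3] plays teams[4], and so on.
# `win_prob`: hypothetical vectorized function(a, b) returning P(a beats b).
# Returns a named integer vector with the number of wins for each team.
sim_bracket <- function(teams, win_prob) {
  wins <- setNames(integer(length(teams)), teams)
  alive <- teams
  while (length(alive) > 1) {
    a <- alive[seq(1, length(alive), by = 2)]
    b <- alive[seq(2, length(alive), by = 2)]
    p_a <- win_prob(a, b)                # step 1: win probabilities
    u <- runif(length(a))                # step 2: one uniform draw per game
    winners <- ifelse(u <= p_a, a, b)    # steps 3-4: advance the winners
    wins[winners] <- wins[winners] + 1L
    alive <- winners                     # step 5: move on to the next round
  }
  wins
}

# Toy example with made-up ratings (NOT output of the real model)
ratings <- c(A = 10, B = 4, C = 6, D = 0)
toy_prob <- function(a, b) unname(plogis(0.15 * (ratings[a] - ratings[b])))

set.seed(2019)
sims <- replicate(2000, sim_bracket(names(ratings), toy_prob))

# Share of simulations in which each team wins at least 1 and at least 2 games
reach <- rbind(final = rowMeans(sims >= 1), champ = rowMeans(sims >= 2))
reach
```

Averaging indicator values over simulations, as in the last step, is how per-round probabilities like those in the table of results can be estimated.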

I ran 10,000 simulations of the NCAA tournament to obtain estimates of the probabilities of each team reaching each round. R code for running these simulations can be found here. The results of my simulations are shown below.

| team | seed | region | r64 | r32 | s16 | e8 | f4 | ncg | champ |
|---|---|---|---|---|---|---|---|---|---|
| Gonzaga | 1 | West | 1.0000 | 0.9954 | 0.9133 | 0.7932 | 0.6129 | 0.4113 | 0.2653 |
| Duke | 1 | East | 1.0000 | 0.9928 | 0.9036 | 0.7341 | 0.5039 | 0.3367 | 0.1928 |
| Virginia | 1 | South | 1.0000 | 0.9866 | 0.8822 | 0.7261 | 0.5267 | 0.2890 | 0.1682 |
| North Carolina | 1 | Midwest | 1.0000 | 0.9889 | 0.8877 | 0.6631 | 0.4617 | 0.2354 | 0.1081 |
| Michigan St. | 2 | East | 1.0000 | 0.9824 | 0.7774 | 0.6230 | 0.3082 | 0.1867 | 0.0870 |
| Kentucky | 2 | Midwest | 1.0000 | 0.9695 | 0.7446 | 0.5019 | 0.2380 | 0.0954 | 0.0362 |
| Michigan | 2 | West | 1.0000 | 0.9549 | 0.7001 | 0.4205 | 0.1518 | 0.0688 | 0.0276 |
| Tennessee | 2 | South | 1.0000 | 0.9508 | 0.7411 | 0.4412 | 0.1914 | 0.0728 | 0.0275 |
| Purdue | 3 | South | 1.0000 | 0.9398 | 0.6942 | 0.3801 | 0.1453 | 0.0517 | 0.0193 |
| Texas Tech | 3 | West | 1.0000 | 0.9165 | 0.6242 | 0.3324 | 0.1060 | 0.0427 | 0.0160 |
| Auburn | 5 | Midwest | 1.0000 | 0.7955 | 0.5087 | 0.1826 | 0.0935 | 0.0312 | 0.0093 |
| Virginia Tech | 4 | East | 1.0000 | 0.9012 | 0.5951 | 0.1621 | 0.0678 | 0.0277 | 0.0080 |
| Houston | 3 | Midwest | 1.0000 | 0.9020 | 0.5159 | 0.2142 | 0.0721 | 0.0193 | 0.0056 |
| Wisconsin | 5 | South | 1.0000 | 0.7462 | 0.4815 | 0.1321 | 0.0573 | 0.0158 | 0.0042 |
| Florida St. | 4 | West | 1.0000 | 0.8603 | 0.5624 | 0.1190 | 0.0493 | 0.0144 | 0.0040 |
| Buffalo | 6 | West | 1.0000 | 0.7971 | 0.3297 | 0.1366 | 0.0336 | 0.0110 | 0.0033 |
| Louisville | 7 | East | 1.0000 | 0.7231 | 0.1890 | 0.1093 | 0.0288 | 0.0110 | 0.0033 |
| LSU | 3 | East | 1.0000 | 0.8281 | 0.5024 | 0.1455 | 0.0378 | 0.0121 | 0.0030 |
| Kansas | 4 | Midwest | 1.0000 | 0.8524 | 0.3988 | 0.1145 | 0.0476 | 0.0111 | 0.0020 |
| Iowa St. | 6 | Midwest | 1.0000 | 0.6897 | 0.3652 | 0.1494 | 0.0486 | 0.0123 | 0.0017 |
| Kansas St. | 4 | South | 1.0000 | 0.8050 | 0.3783 | 0.0852 | 0.0297 | 0.0061 | 0.0015 |
| Marquette | 5 | West | 1.0000 | 0.6964 | 0.3216 | 0.0455 | 0.0161 | 0.0035 | 0.0013 |
| Mississippi St. | 5 | East | 1.0000 | 0.8290 | 0.3609 | 0.0747 | 0.0233 | 0.0063 | 0.0010 |
| Wofford | 7 | Midwest | 1.0000 | 0.6854 | 0.2001 | 0.0909 | 0.0237 | 0.0056 | 0.0009 |
| Villanova | 6 | South | 1.0000 | 0.6152 | 0.2016 | 0.0725 | 0.0179 | 0.0029 | 0.0006 |
| Nevada | 7 | West | 1.0000 | 0.5528 | 0.1717 | 0.0652 | 0.0127 | 0.0026 | 0.0006 |
| Maryland | 6 | East | 1.0000 | 0.7421 | 0.3775 | 0.0956 | 0.0215 | 0.0057 | 0.0005 |
| Utah St. | 8 | Midwest | 1.0000 | 0.5534 | 0.0668 | 0.0184 | 0.0041 | 0.0009 | 0.0003 |
| Cincinnati | 7 | South | 1.0000 | 0.5308 | 0.1392 | 0.0442 | 0.0078 | 0.0014 | 0.0002 |
| Yale | 14 | East | 1.0000 | 0.1719 | 0.0457 | 0.0044 | 0.0005 | 0.0002 | 0.0002 |
| Florida | 10 | West | 1.0000 | 0.4472 | 0.1223 | 0.0390 | 0.0064 | 0.0014 | 0.0001 |
| Syracuse | 8 | West | 1.0000 | 0.5412 | 0.0490 | 0.0193 | 0.0063 | 0.0013 | 0.0001 |
| Saint Mary’s (CA) | 11 | South | 1.0000 | 0.3848 | 0.0944 | 0.0259 | 0.0039 | 0.0007 | 0.0001 |
| UCF | 9 | East | 1.0000 | 0.4608 | 0.0406 | 0.0099 | 0.0020 | 0.0005 | 0.0001 |
| Baylor | 9 | West | 1.0000 | 0.4588 | 0.0377 | 0.0130 | 0.0031 | 0.0004 | 0.0001 |
| Iowa | 10 | South | 1.0000 | 0.4692 | 0.1131 | 0.0352 | 0.0064 | 0.0007 | 0.0000 |
| VCU | 8 | East | 1.0000 | 0.5392 | 0.0557 | 0.0165 | 0.0032 | 0.0007 | 0.0000 |
| Oklahoma | 9 | South | 1.0000 | 0.5250 | 0.0609 | 0.0220 | 0.0055 | 0.0006 | 0.0000 |
| Ole Miss | 8 | South | 1.0000 | 0.4750 | 0.0549 | 0.0194 | 0.0054 | 0.0006 | 0.0000 |
| Washington | 9 | Midwest | 1.0000 | 0.4466 | 0.0445 | 0.0108 | 0.0019 | 0.0003 | 0.0000 |
| Minnesota | 10 | East | 1.0000 | 0.2769 | 0.0324 | 0.0128 | 0.0016 | 0.0003 | 0.0000 |
| Ohio St. | 11 | Midwest | 1.0000 | 0.3103 | 0.1036 | 0.0275 | 0.0045 | 0.0002 | 0.0000 |
| Seton Hall | 10 | Midwest | 1.0000 | 0.3146 | 0.0522 | 0.0149 | 0.0021 | 0.0002 | 0.0000 |
| Murray St. | 12 | West | 1.0000 | 0.3036 | 0.0836 | 0.0079 | 0.0011 | 0.0002 | 0.0000 |
| Oregon | 12 | South | 1.0000 | 0.2538 | 0.1059 | 0.0124 | 0.0024 | 0.0001 | 0.0000 |
| New Mexico St. | 12 | Midwest | 1.0000 | 0.2045 | 0.0705 | 0.0090 | 0.0021 | 0.0001 | 0.0000 |
| Temple | 11 | East | 0.4049 | 0.0883 | 0.0199 | 0.0018 | 0.0003 | 0.0001 | 0.0000 |
| Belmont | 11 | East | 0.5951 | 0.1696 | 0.0545 | 0.0074 | 0.0009 | 0.0000 | 0.0000 |
| Vermont | 13 | West | 1.0000 | 0.1397 | 0.0324 | 0.0021 | 0.0005 | 0.0000 | 0.0000 |
| UC Irvine | 13 | South | 1.0000 | 0.1950 | 0.0343 | 0.0024 | 0.0003 | 0.0000 | 0.0000 |
| Arizona St. | 11 | West | 0.5599 | 0.1223 | 0.0192 | 0.0033 | 0.0001 | 0.0000 | 0.0000 |
| Liberty | 12 | East | 1.0000 | 0.1710 | 0.0251 | 0.0018 | 0.0001 | 0.0000 | 0.0000 |
| Northern Ky. | 14 | West | 1.0000 | 0.0835 | 0.0145 | 0.0012 | 0.0001 | 0.0000 | 0.0000 |
| Georgia St. | 14 | Midwest | 1.0000 | 0.0980 | 0.0153 | 0.0010 | 0.0001 | 0.0000 | 0.0000 |
| Saint Louis | 13 | East | 1.0000 | 0.0988 | 0.0189 | 0.0009 | 0.0001 | 0.0000 | 0.0000 |
| Northeastern | 13 | Midwest | 1.0000 | 0.1476 | 0.0220 | 0.0016 | 0.0000 | 0.0000 | 0.0000 |
| St. John’s (NY) | 11 | West | 0.4401 | 0.0806 | 0.0124 | 0.0009 | 0.0000 | 0.0000 | 0.0000 |
| Montana | 15 | West | 1.0000 | 0.0451 | 0.0059 | 0.0009 | 0.0000 | 0.0000 | 0.0000 |
| Old Dominion | 14 | South | 1.0000 | 0.0602 | 0.0098 | 0.0006 | 0.0000 | 0.0000 | 0.0000 |
| Gardner-Webb | 16 | South | 1.0000 | 0.0134 | 0.0020 | 0.0004 | 0.0000 | 0.0000 | 0.0000 |
| Colgate | 15 | South | 1.0000 | 0.0492 | 0.0066 | 0.0003 | 0.0000 | 0.0000 | 0.0000 |
| Abilene Christian | 15 | Midwest | 1.0000 | 0.0305 | 0.0031 | 0.0002 | 0.0000 | 0.0000 | 0.0000 |
| Bradley | 15 | East | 1.0000 | 0.0176 | 0.0012 | 0.0002 | 0.0000 | 0.0000 | 0.0000 |
| Iona | 16 | Midwest | 1.0000 | 0.0111 | 0.0010 | 0.0000 | 0.0000 | 0.0000 | 0.0000 |
| North Dakota St. | 16 | East | 0.7344 | 0.0061 | 0.0001 | 0.0000 | 0.0000 | 0.0000 | 0.0000 |
| N.C. Central | 16 | East | 0.2656 | 0.0011 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 |
| Fairleigh Dickinson | 16 | West | 0.5314 | 0.0025 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 |
| Prairie View | 16 | West | 0.4686 | 0.0021 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 |

### Expected Points

Now that we have the probability that each team reaches each round of the tournament, we can compute the expected number of points a given team will score. Let $$S_i$$ denote the number of points team $$i$$ will score in the Statsketball contest. The expected number of points team $$i$$ will score, $$\mathbb{E}(S_i)$$, is computed as follows:

$$\mathbb{E}(S_i) = \sum_{j = 1}^6 P(\text{Team } i \text{ wins in Round } j)\times(\text{Points for win in Round } j)$$

It’s not surprising that teams with better seeds tend to have the highest expected points. However, given the increased cost, are they necessarily worth it? Looking at a plot of expected points divided by team cost (an estimate of how many points we can expect per draft point we spend), it’s clear that many of the 1-seeds, with the exception of Gonzaga, are overvalued.
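As a worked example of the expected points calculation, the sketch below applies the formula to two of the 1-seeds, using the simulated probabilities from the results table (columns `r32` through `champ`, since reaching round $$j + 1$$ is the same event as winning in round $$j$$). The helper name `expected_points` is my own.

```r
round_points <- c(1, 2, 3, 5, 8, 13)

# Expected points given P(win in round j) for j = 1, ..., 6.
# `expected_points` is a hypothetical helper name used for illustration.
expected_points <- function(p_win) sum(p_win * round_points)

# Probabilities copied from the simulation table (r32, s16, e8, f4, ncg, champ)
p_win <- list(
  Gonzaga = c(0.9954, 0.9133, 0.7932, 0.6129, 0.4113, 0.2653),
  Duke    = c(0.9928, 0.9036, 0.7341, 0.5039, 0.3367, 0.1928)
)

exp_pts <- sapply(p_win, expected_points)
exp_pts        # roughly 15.0 for Gonzaga and 12.7 for Duke
exp_pts / 75   # expected points per draft point at the 1-seed cost of 75
```

Both teams cost 75 draft points, so Gonzaga returns about 0.20 expected points per draft point against about 0.17 for Duke.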

### Selecting Teams: The Knapsack Problem

I’d like to select the teams that maximize the number of expected points in my entry, subject to the budget restriction. This is a version of the classic Knapsack Problem. In the Knapsack Problem, there are $$n$$ objects with values $$p_1, ..., p_n$$ and weights $$w_1, ..., w_n$$. The problem asks for the most valuable subset of objects that can fit in the knapsack, which has weight capacity $$C$$. In the context of Statsketball, the objects are the teams, the values $$p_i$$ are the expected points scored $$\mathbb{E}(S_i)$$, the weights $$w_i$$ are the team costs, and the capacity $$C$$ is our budget of 224 draft points. Because the costs are integers, the Knapsack Problem can be solved exactly using dynamic programming, and there is an implementation of a knapsack solver in the adagio R package. Note that we have to combine the expected points of the First Four teams, because the contest rules ask only that we select the pair of teams in each play-in game slot.
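One way to combine the First Four teams is to sum the expected points of the two teams sharing a slot, which is valid because the simulated probabilities are unconditional and only one of the two teams can advance. The sketch below does this for the two 11-seed play-in slots. The `slot` column and the `first_four` data frame are hypothetical stand-ins, and I am assuming (following the six-round formula above) that the play-in win itself earns no points.

```r
round_points <- c(1, 2, 3, 5, 8, 13)

# Probabilities copied from the simulation table (r32, s16, e8, f4, ncg, champ)
probs <- rbind(
  "Belmont"         = c(0.1696, 0.0545, 0.0074, 0.0009, 0.0000, 0.0000),
  "Temple"          = c(0.0883, 0.0199, 0.0018, 0.0003, 0.0001, 0.0000),
  "Arizona St."     = c(0.1223, 0.0192, 0.0033, 0.0001, 0.0000, 0.0000),
  "St. John's (NY)" = c(0.0806, 0.0124, 0.0009, 0.0000, 0.0000, 0.0000)
)

# `slot` is a hypothetical label for teams that share a play-in game
first_four <- data.frame(
  team    = rownames(probs),
  slot    = c("East 11", "East 11", "West 11", "West 11"),
  cost    = 7,
  exp_pts = as.vector(probs %*% round_points)
)

# One row per slot: the slot's expected points are the sum over its two teams
slots <- aggregate(exp_pts ~ slot + cost, data = first_four, FUN = sum)
slots
```

Each combined slot then enters the knapsack as a single object with the slot's cost and summed expected points.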

    library(adagio)
    kn_solved <- knapsack(ncaa_sims$cost, ncaa_sims$exp_pts, 224)
    kn_solved
    ## $capacity
    ## [1] 224
    ##
    ## $profit
    ## [1] 42.297
    ##
    ## $indices
    ## [1]  2 18 21 32 35 48 49 51
    ncaa_sims$team[kn_solved$indices]
    ## [1] "Michigan St." "Kentucky"     "Auburn"       "Iona"
    ## [5] "Purdue"       "Gardner-Webb" "Gonzaga"      "Texas Tech"

We aren’t expecting much from either of the 16 seeds, Gardner-Webb and Iona, but at 1 draft point each they are chosen simply to use up the last of the budget. Not surprisingly, the other teams chosen are among the best value picks in seeds 1-5 identified above.
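For readers curious about what a knapsack solver does under the hood, here is a minimal dynamic programming sketch in base R. It is not the adagio implementation; `knapsack_dp` is a hypothetical name, the expected points are rounded values computed from the simulation table, and the budget of 140 is a toy value chosen to keep the example small.

```r
# 0/1 knapsack by dynamic programming over integer weights.
# Returns the best total value and the indices of the chosen items.
knapsack_dp <- function(weights, values, capacity) {
  n <- length(weights)
  # best[i + 1, cap + 1]: best value using the first i items with capacity cap
  best <- matrix(0, nrow = n + 1, ncol = capacity + 1)
  for (i in seq_len(n)) {
    for (cap in 0:capacity) {
      best[i + 1, cap + 1] <- best[i, cap + 1]              # skip item i
      if (weights[i] <= cap) {
        take <- best[i, cap - weights[i] + 1] + values[i]   # take item i
        if (take > best[i + 1, cap + 1]) best[i + 1, cap + 1] <- take
      }
    }
  }
  # Trace back through the table to recover which items were taken
  chosen <- integer(0)
  cap <- capacity
  for (i in n:1) {
    if (best[i + 1, cap + 1] != best[i, cap + 1]) {
      chosen <- c(i, chosen)
      cap <- cap - weights[i]
    }
  }
  list(profit = best[n + 1, capacity + 1], indices = chosen)
}

# Toy problem: six teams, rounded expected points, made-up budget of 140
teams   <- c("Gonzaga", "Duke", "Michigan St.", "Kentucky", "Purdue", "Texas Tech")
cost    <- c(75, 75, 40, 40, 25, 25)
exp_pts <- c(15.01, 12.72, 8.57, 6.39, 4.86, 4.24)

res <- knapsack_dp(cost, exp_pts, 140)
teams[res$indices]
res$profit
```

On this toy problem the sketch selects Gonzaga, Michigan St., and Purdue, which exactly exhaust the 140-point budget.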

### Limitations

There are two main limitations of maximizing expected points. The first is that we don’t know which teams other entrants to the Statsketball contest are selecting. Any team that is selected by several entrants becomes relatively less valuable, because its points do not help separate one entry from another. The second is that maximizing expected points tends to select fewer teams. Many times, we can obtain 90% of the expected points of 1 team for the same (or lower) cost using some combination of 2 or 3 teams. Having lots of teams sounds great in theory, as we keep more doors open in case of upsets, but at the same time, the lower seeded a team is, the less likely it is to exceed its expected value. Many distributions of team points in Statsketball will be heavily skewed towards 0, as lower seeded teams are likely to be knocked out in round 1 of the tournament. Because we are comparing means and not distributions, there might be some other combination of teams with a lower expected value that is more likely to outscore the group of teams I’ve selected.
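The difference between comparing means and comparing distributions can be illustrated with a small sketch. Everything here is hypothetical: the team names, the hand-written matrix of simulated win counts, and the helpers `wins_to_points` and `prob_outscore`. With real simulation output, the same two functions could be applied to a teams-by-simulations matrix of wins.

```r
round_points <- c(1, 2, 3, 5, 8, 13)

# Convert a teams x simulations matrix of win counts into points
wins_to_points <- function(wins) {
  cum_pts <- c(0, cumsum(round_points))   # points after 0, 1, ..., 6 wins
  matrix(cum_pts[wins + 1], nrow = nrow(wins), dimnames = dimnames(wins))
}

# Share of simulations in which the teams in `a` outscore the teams in `b`
prob_outscore <- function(points, a, b) {
  mean(colSums(points[a, , drop = FALSE]) > colSums(points[b, , drop = FALSE]))
}

# Made-up win counts over 5 simulations (NOT real simulation output)
wins <- rbind(
  Boom    = c(6, 0, 0, 0, 0),   # wins it all once, otherwise out in round 1
  SteadyA = c(1, 2, 1, 2, 1),
  SteadyB = c(1, 1, 2, 1, 1)
)
points <- wins_to_points(wins)

mean(points["Boom", ])                               # 6.4 points on average
mean(colSums(points[c("SteadyA", "SteadyB"), ]))     # 3.2 points on average
prob_outscore(points, c("SteadyA", "SteadyB"), "Boom")  # 0.8
```

In this toy data the pair of steady teams has half the expected points of the boom-or-bust team, yet outscores it in 4 of the 5 simulations.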

By no means is this method perfect, but I think it does a reasonably good job of identifying and selecting undervalued teams given the cost constraints of the tournament.