Financial Mathematics: The Horse Race Problem: A Subspace Solution

The "horse race problem" asks for a reasonable estimate of the relative ability of entrants in a competition given that we know the probabilities of winning. The question is vague, but implies the competition awards a win to the contestant with the highest (or lowest) score, and perhaps further implies that the distribution of contestant scores is a translation family where "ability" is a location parameter. This post formalizes the problem and provides a simple practical solution.

Geelong Cup Photo Finish 2005

Inferring running time distributions from win probabilities

The question is hardly new. As noted in Probability models for horse race outcomes by Mukhtar M. Ali, one way to estimate race outcome probabilities given a liquid market for the winner is to calibrate distributions of their running times. For example, one might assume the distribution of the $i$'th horse's running time $\{\tau_i\}$is gaussian, with known variance, and that the location of each horses' distribution is a free parameter to be estimated. More generally we assume all running time distributions belong to a translation family where $f$ is fixed and a single parameter $a_i$ characterizes each horse via its running time density $$ f_i(x) = f(x;a_i) = f(x-a_i) $$ We denote the corresponding probability that the $i$'th horse's running time $\tau_i$ is less than $x$ by $$ F(x - a_i) = Prob( \tau_i < x).$$ Under these assumptions the probability $p_i$ that the $i$'th horse wins the race is\begin{equation} p_i = \int f(x;a_i) \prod_{j \not= i}^k \left(1-F\left(x;a_j\right) \right) dx. \end{equation} If we arbitrarily fix $a_1=0$ we can presumably solve for $\{a_i\}_{i>1}$. More on that in a moment.

Motivation (or lack thereof)

On application of the horse race problem is the estimation of so-called exotic probabilities: the probabilities of relevance for racegoers betting on quinellas, exactas (forecast) and trifectas. This presupposes that the market for horses winning a race is more efficient than the markets for various permutations of finishers. There is a small literature aimed at exploiting inefficiencies between the win and place (show) markets at racetracks.

The evidence is not altogether compelling, incidentally. It has been argued here that place prices on at least one betting exchange are roughly as efficient as the win market. And the ability for professional punters to beat the track take is greater for trifectas than the win market (because knowledge of two or three horses can be multiplied, as it were) suggesting that the smart money might be in trifectas instead. In a past life I gathered a fair amount of evidence to suggest just that, so don't rush off to the Dapto dogs armed only with a quaint assumption. Still, the horse race problem as stated remains a interesting mathematical problem and has motivated some people to study order statistics.

Normal running times

As noted by Ali, in the case where $f = \phi $ is the normal distribution an approximate solution for the $a_i$ given known $p_i$ is given by Henery $$ a_i = \frac{ (k-1) \phi\left( \Phi^{-1}(\frac{1}{k}) \right) \left(\Phi^{-1}(p_i)- \Phi^{-1}(\frac{1}{k})\right) }{ \Phi^{-1}\left( \frac{i-\frac{3}{8}}{k + \frac{3}{4} } \right) } $$ Some other approximations are mentioned in this presentation by Victor Lo who summarizes some empirical findings.

Gamma running times

I leave it to the reader to explore another territory where analytic solutions are available. For a treatise on ranking models with gamma processes see Gamma processes, paired comparisons, and ranking by Hal Stern. (Related: Generalized Bradley-Terry models).

Arbitrarily distributed running times by brute force optimization

Here I restrict myself to some observations on the problem where $f$ is arbitrary. As mentioned above we can set $a_1=0$ and throw $\{a_i\}_{i>1}$ to a solver for $F=0$. Here $F = F\left(\{a_i\}_{i\not=1}\right)$ is an $n$ dimensional vector function taking $n-1$ arguments, and the $i$'th entry of $F$ is the difference between the computed ($q_i$) and desired probabilities: \begin{equation} F_i = \overbrace{\int f(x;a_i) \prod_{j \not= i}^k \left(1-F\left(x;a_j\right) \right) dx}^{q_i} - p_i.\end{equation}This works fine for moderate $n$, but for $n>100$ it is time consuming, as can be seen in the following figure.

Computation time in seconds (left) as a function of the number of runners in the race.
Log scale on right. Matlab's implementation of Levenberg-Marquardt optimizer was used.

Solving a race with 200 runners takes on the order of one hour.

A novelty? Arbitrarily distributed running times by subspace optimization

We now introduce a method that cuts the computation time dramatically while retaining a high level of accuracy in the solution. The idea is to restrict the $n-1$ dimensional search to a subspace of dimension $m < n-1$. In particular if $m$ is fixed (say $m=15$) then we can compute the solution in approximately linear time. A one hour computation for a large race becomes a one minute computation (not as cool as detexity, but not bad).

Obviously the subspace must be chosen with some care, lest it lie nowhere near the best solution. But before getting to that it also helps to introduce a scaled measure of discrepancy between the desired vector $p$ and the one computed. We use the log-odds discrepancy:
$$ D(p_i, q_i) = log(p_i)-log(1-p_i)-log(q_i)+log(1-q_i)$$ rather than the naked difference of probabilities in the equation above.

Now to optimize over a subspace we define a map $\phi(z \in R^m) \mapsto a \in R^n$ as follows. We assume the probabilities are sorted from highest to lowest. We let $\{c_k\}_{k=1}^{m-1}$ be a collection of centroids obtained by k-means clustering of the differenced log-probabilities $\{\log(p_1)-\log(p_i)\}$ into $m-1$ clusters. We then enlarge the collection by adding $0$ and $\log(p_1)-\log(p_n)$. After sorting the enlarged list of cluster centroids (padded with extreme points) we have a collection $c = \{0, \dots, \log(p_1)-\log(p_n)\}$ of cardinality $m+1$. Then we define an $n$ by $m+1$ matrix $A$ by setting $A_{1j} = \delta_{1j}$, $A_{n,j} = \delta_{m+1,j}$ where $\delta$ is the Dirac delta function. For $2 \le i \le m$, choosing the convex combination of the nearest two cluster points that returns $p_i$. Thus there are two non-zero entries of $A_{ij}$ for all but the first and last rows (unless a $p_i$ coincides with a centroid). With slight abuse of notation, $A$ as a linear transformation from $R^{m+1}$ to $R^n$ satisfying $p = A(c)$.

The map $\phi(z \in R^m) \mapsto a \in R^n$ is defined by composition:
$$
\phi = A \cdot \iota
$$where in matlab notation we would write $\iota(z) = [0;z]$. That is to say $\iota$ is the trivial map from $R^m$ to $R^{m+1}$ that assumes the first coordinate is zero. Finally then, we pass the vector function $\tilde{F}$ taking $m$ arguments to a solver. The $i$'th coordinate $\tilde{F}$ is given by
$$
\tilde{F}_i = D\left( q_i\left( \phi\left(z\right)\right), p_i \right)
$$
This works surprisingly well. The intuition is that the sorted abilities $\{a_i\}$ can be interpolated reasonably (and very roughly, vary with the $\log$ of $p_i$) so long as we have enough knot points. It helps to place the knot points near population centers to reduce the amount of interpolation. As we vary $z$ we are varying the locations of the knot points and forcing the $a_i$'s to move in a subspace, but since the interpolation is pretty good the subspace lies quite close to the solution.

And as noted, it is much faster. The figure below shows computation time as a function of the number of starters when we use subspaces of dimension $5$ and $15$. No code optimization has been attempted and undoubtedly, better numbers could be achieved in any number of ways. But the point is that using a subspace search alone results in a $50$x improvement or so for $n \approx 200$. The improvement would of course be much greater, in relative terms, for $n \gg 200$.

Computation time as a function of race size when using a subspace search

The error introduced by the subspace approximation will vary, of course, but as a rough guide the ratio of $p_i$ to $q_i$ differs from unity by about half a percent when using a subspace of dimension five, even when there are $200$ runners. So a horse at 100/1 might be calibrated at 99.5/1. This error drops by another order of magnitude when using a subspace of dimension $15$, whereupon the calibrated and desired probabilities are, as a practical matter, indistinguishable.

Possible improvements

For known distributions we ought to be able to use more accurate interpolation. Here I merely assumed that probabilities fall off as the exponential of ability difference. For very large races we should use the stability postulate instead, at least for those in the tail.

A generalization?

It's usually around this point where start to suspect there can't be anything new here. After all, any similar optimization problem where we have an ounce of intuition as to the solution that could be similarly addressed (i.e. by minimizing interpolation with clustering). Perhaps somebody could point me to method XYZ and conclude that the horse race solution presented here is merely an application of the same.

9 comments:

JamesOctober 4, 2013 at 1:32 AM
This comment has been removed by the author.
JamesOctober 4, 2013 at 1:33 AM
Hello,

Interesting that someone linked to my website from yours and yet there is no reference to my site (www.betfairprotrader.co.uk) on yours.

I wrote a piece about place pricing for horse races, which discussed the efficiency of such prices on the Betfair betting exchange. Conclusion - they are.

You may be interested in Edelman's alternative approach on page 72 of his The Compleat Horseplayer, which is a lot easier/faster to compute than Henery/Harville et al and yields a comparable figure.

I've never seen a 200 runner horse race though you do get 40 runner fields in The National. Would be an interesting calavalry charge to the first fence.
Peter CottonOctober 4, 2013 at 11:52 AM
Hi James,

There is a link to your column from this article (where it says "it has been argued..."). Wasn't entirely clear how to reference the article better. I certainly found it interesting. Thanks for mentioning Compleat Horseplayer. David's an old friend.

Kind regards,

Peter
AnonymousJanuary 31, 2014 at 10:37 PM
Lets say there are 8 horses in a race. The favourite is $3.30 (30.3% chance of winning, ignoring the bookies clip on the event). From just these two pieces of information can we calcualte (a) the probability of the horse coming 2nd (b) and prob of coming 3rd? I thought it might be 30.3% * 69.7% = 21% prob of coming 2nd and 30.3% * (21% + 30.3%) = prob of coming 3rd. This results in a 67% probability of coming 1,2,3 meaning place odds greater than $1.49 are generous. Or am I way off?

Financial Mathematics

Sunday, September 15, 2013

The Horse Race Problem: A Subspace Solution

9 comments:

About Me