Geelong Cup photo finish, 2005
Inferring running time distributions from win probabilities
The question is hardly new. As noted in Probability models for horse race outcomes by Mukhtar M. Ali, one way to estimate race outcome probabilities, given a liquid market for the winner, is to calibrate distributions of the horses' running times. For example, one might assume the distribution of the \(i\)'th horse's running time \(\tau_i\) is Gaussian with known variance, and that the location of each horse's distribution is a free parameter to be estimated. More generally, we assume all running time distributions belong to a translation family where \(f\) is fixed and a single parameter \(a_i\) characterizes each horse via its running time density $$ f_i(x) = f(x;a_i) = f(x-a_i). $$ We denote the corresponding probability that the \(i\)'th horse's running time \(\tau_i\) is less than \(x\) by $$ F(x - a_i) = \mathrm{Prob}( \tau_i < x).$$ Under these assumptions the probability \(p_i\) that the \(i\)'th horse wins a race of \(k\) runners is \begin{equation} p_i = \int f(x-a_i) \prod_{j \neq i} \left(1-F\left(x-a_j\right) \right) dx. \end{equation} If we arbitrarily fix \(a_1=0\) we can presumably solve for \(\{a_i\}_{i>1}\). More on that in a moment.
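Before attacking the inverse problem it helps to have the forward map in hand. Below is a minimal numerical sketch for the Gaussian case, evaluating the win probability integral on a fixed grid; the function name win_probabilities and the grid choice are mine, for illustration only.

```python
import numpy as np
from scipy.stats import norm

def win_probabilities(a, grid=None):
    """Win probabilities p_i under a Gaussian translation family.

    a : locations a_i, one per horse (larger means slower).
    Computes p_i = integral of f(x - a_i) * prod_{j != i} (1 - F(x - a_j)) dx
    by trapezoidal quadrature on a grid covering the bulk of the mass.
    """
    a = np.asarray(a, dtype=float)
    if grid is None:
        grid = np.linspace(a.min() - 8.0, a.max() + 8.0, 2001)
    dens = norm.pdf(grid[:, None] - a[None, :])   # f(x - a_i), one column per horse
    surv = norm.sf(grid[:, None] - a[None, :])    # 1 - F(x - a_j)
    p = np.empty(len(a))
    for i in range(len(a)):
        others = np.prod(np.delete(surv, i, axis=1), axis=1)
        p[i] = np.trapz(dens[:, i] * others, grid)
    return p
```

As a sanity check, win_probabilities([0.0, 0.1, 0.3]) returns probabilities summing to one, with the first (lowest \(a_i\)) horse favored.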
Motivation (or lack thereof)
One application of the horse race problem is the estimation of so-called exotic probabilities: the probabilities of relevance for racegoers betting on quinellas, exactas (forecasts) and trifectas. This presupposes that the market for horses winning a race is more efficient than the markets for various permutations of finishers. There is a small literature aimed at exploiting inefficiencies between the win and place (show) markets at racetracks.
The evidence is not altogether compelling, incidentally. It has been argued here that place prices on at least one betting exchange are roughly as efficient as the win market. And the scope for professional punters to beat the track take is greater for trifectas than for the win market (because knowledge of two or three horses can be multiplied, as it were), suggesting that the smart money might be in trifectas instead. In a past life I gathered a fair amount of evidence to suggest just that, so don't rush off to the Dapto dogs armed only with a quaint assumption. Still, the horse race problem as stated remains an interesting mathematical problem and has motivated some people to study order statistics.
Normal running times
As noted by Ali, in the case where \(f = \phi \) is the standard normal density an approximate solution for the \(a_i\), given known \(p_i\), is due to Henery: $$ a_i = \frac{ (k-1) \phi\left( \Phi^{-1}(\frac{1}{k}) \right) \left(\Phi^{-1}(p_i)- \Phi^{-1}(\frac{1}{k})\right) }{ \Phi^{-1}\left( \frac{i-\frac{3}{8}}{k + \frac{3}{4} } \right) } $$ Some other approximations are mentioned in this presentation by Victor Lo, who summarizes some empirical findings.
Gamma running times
I leave it to the reader to explore another territory where analytic solutions are available. For a treatise on ranking models with gamma processes see Gamma processes, paired comparisons, and ranking by Hal Stern. (Related: Generalized Bradley-Terry models).
Arbitrarily distributed running times by brute force optimization
Here I restrict myself to some observations on the problem where \(f\) is arbitrary. As mentioned above we can set \(a_1=0\) and hand \(\{a_i\}_{i>1}\) to a solver for \(G=0\). Here \(G = G\left(\{a_i\}_{i>1}\right)\) is an \(n\)-dimensional vector function taking \(n-1\) arguments, where \(n\) is the number of runners, and the \(i\)'th entry of \(G\) is the difference between the computed probability \(q_i\) and the desired one: \begin{equation} G_i = \overbrace{\int f(x-a_i) \prod_{j \neq i} \left(1-F\left(x-a_j\right) \right) dx}^{q_i} - p_i.\end{equation} This works fine for moderate \(n\), but for \(n>100\) it is time consuming, as can be seen in the following figure.
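For concreteness, here is a sketch of the brute-force solve, with scipy's least-squares machinery standing in for Matlab's (the function names are mine, and win_probabilities is the forward-map sketch from earlier):

```python
import numpy as np
from scipy.optimize import least_squares

def fit_abilities_brute(p, win_probabilities):
    """Solve G(a_2, ..., a_n) = q(a) - p = 0 with a_1 pinned at zero."""
    p = np.asarray(p, dtype=float)

    def residual(a_rest):
        a = np.concatenate(([0.0], a_rest))       # fix a_1 = 0
        return win_probabilities(a) - p           # G_i = q_i - p_i

    # rough initial guess: abilities track differenced log-probabilities
    a0 = np.log(p[0]) - np.log(p[1:])
    sol = least_squares(residual, a0, method='lm')  # Levenberg-Marquardt
    return np.concatenate(([0.0], sol.x))
```

Each iteration costs a full evaluation of all \(n\) integrals, which is where the time goes as \(n\) grows.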
Computation time in seconds as a function of the number of runners in the race (left; log scale on right). Matlab's implementation of the Levenberg-Marquardt optimizer was used.
A novelty? Arbitrarily distributed running times by subspace optimization
We now introduce a method that cuts the computation time dramatically while retaining a high level of accuracy in the solution. The idea is to restrict the \((n-1)\)-dimensional search to a subspace of dimension \(m < n-1\). In particular, if \(m\) is fixed (say \(m=15\)) then we can compute the solution in approximately linear time. A one hour computation for a large race becomes a one minute computation (not as cool as detexity, but not bad).
Obviously the subspace must be chosen with some care, lest it lie nowhere near the best solution. But before getting to that, it helps to introduce a scaled measure of discrepancy between the desired probability vector \(p\) and the computed one \(q\). We use the log-odds discrepancy:
$$ D(p_i, q_i) = \log(p_i)-\log(1-p_i)-\log(q_i)+\log(1-q_i) $$ rather than the naked difference of probabilities in the equation above.
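In code this is a one-liner on top of numpy (the helper name is mine):

```python
import numpy as np

def log_odds_discrepancy(p, q):
    """D(p, q): difference of log-odds, elementwise."""
    p, q = np.asarray(p), np.asarray(q)
    return (np.log(p) - np.log1p(-p)) - (np.log(q) - np.log1p(-q))
```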
Now to optimize over a subspace we define a map \(\phi: R^m \to R^n\), \(z \mapsto a\), as follows. We assume the probabilities are sorted from highest to lowest. We let \(\{c_j\}_{j=1}^{m-1}\) be a collection of centroids obtained by k-means clustering of the differenced log-probabilities \(\{\log(p_1)-\log(p_i)\}\) into \(m-1\) clusters. We then enlarge the collection by adding the extreme points \(0\) and \(\log(p_1)-\log(p_n)\). After sorting the enlarged list of cluster centroids we have a collection \(c = \{0, \dots, \log(p_1)-\log(p_n)\}\) of cardinality \(m+1\). Then we define an \(n\) by \(m+1\) matrix \(A\) by setting \(A_{1j} = \delta_{1j}\) and \(A_{nj} = \delta_{m+1,j}\), where \(\delta\) is the Kronecker delta. For \(2 \le i \le n-1\), row \(i\) is the convex combination of the two nearest cluster points that returns \(\log(p_1)-\log(p_i)\). Thus there are two non-zero entries of \(A_{ij}\) in all but the first and last rows (unless a differenced log-probability coincides with a centroid). With slight abuse of notation, \(A\) is a linear transformation from \(R^{m+1}\) to \(R^n\) that recovers the differenced log-probabilities: \(Ac = \left(\log(p_1)-\log(p_i)\right)_{i=1}^n\).
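A sketch of the construction, under the stated assumptions (probabilities sorted in descending order, scikit-learn's KMeans for the clustering step; the function name is mine):

```python
import numpy as np
from sklearn.cluster import KMeans

def interpolation_matrix(p, m):
    """Build the n x (m+1) matrix A and sorted centroid vector c.

    Row i of A is the convex combination of the two centroids bracketing
    ell_i = log(p_1) - log(p_i), so A @ c recovers the differenced
    log-probabilities exactly.
    """
    p = np.asarray(p, dtype=float)
    n = len(p)
    ell = np.log(p[0]) - np.log(p)                # 0 = ell_1 <= ... <= ell_n
    km = KMeans(n_clusters=m - 1, n_init=10, random_state=0)
    km.fit(ell.reshape(-1, 1))                    # k-means on the differenced log-probs
    c = np.sort(np.concatenate(([0.0], km.cluster_centers_.ravel(), [ell[-1]])))
    A = np.zeros((n, m + 1))
    A[0, 0] = 1.0                                 # first row: Kronecker delta
    A[-1, -1] = 1.0                               # last row: Kronecker delta
    for i in range(1, n - 1):
        j = np.searchsorted(c, ell[i], side='right') - 1
        j = min(max(j, 0), m - 1)                 # bracketing interval [c_j, c_{j+1}]
        gap = c[j + 1] - c[j]
        w = (ell[i] - c[j]) / gap if gap > 0 else 0.0
        A[i, j], A[i, j + 1] = 1.0 - w, w         # two non-zero entries per row
    return A, c
```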
The map \(\phi: R^m \to R^n\) is defined by composition:
$$
\phi = A \cdot \iota
$$where in Matlab notation we would write \(\iota(z) = [0;z]\). That is to say, \(\iota\) is the trivial map from \(R^m\) to \(R^{m+1}\) that prepends a zero first coordinate. Finally then, we pass the vector function \(\tilde{G}\), taking \(m\) arguments, to a solver. Its \(i\)'th coordinate is given by
$$
\tilde{G}_i = D\left( q_i\left( \phi\left(z\right)\right), p_i \right)
$$
This works surprisingly well. The intuition is that the sorted abilities \(\{a_i\}\) can be interpolated reasonably (very roughly, they vary with the \(\log\) of \(p_i\)) so long as we have enough knot points. It helps to place the knot points near population centers, reducing the amount of interpolation required. As we vary \(z\) we vary the locations of the knot points and force the \(a_i\)'s to move in a subspace, but since the interpolation is pretty good the subspace lies quite close to the solution.
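Putting the pieces together, here is a sketch of the subspace solve, reusing win_probabilities, log_odds_discrepancy and interpolation_matrix from the snippets above (all names mine, and the choice of solver is illustrative):

```python
import numpy as np
from scipy.optimize import least_squares

def fit_abilities_subspace(p, win_probabilities, m=15):
    """Search over m knot values z; phi = A . iota spreads them over n abilities."""
    p = np.asarray(p, dtype=float)
    A, c = interpolation_matrix(p, m)

    def phi(z):
        return A @ np.concatenate(([0.0], z))     # iota pins the first knot at 0

    def residual(z):                              # coordinates of G-tilde
        return log_odds_discrepancy(win_probabilities(phi(z)), p)

    z0 = c[1:]                                    # start the knots at the centroids
    sol = least_squares(residual, z0)
    return phi(sol.x)
```

Note that the starting point already reproduces the log-probability interpolation of the abilities, so the solver begins close to the answer.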
And as noted, it is much faster. The figure below shows computation time as a function of the number of starters when we use subspaces of dimension \(5\) and \(15\). No code optimization has been attempted and undoubtedly better numbers could be achieved in any number of ways. But the point is that the subspace search alone yields roughly a \(50\)x improvement for \(n \approx 200\). The improvement would of course be much greater, in relative terms, for \(n \gg 200\).
Computation time as a function of race size when using a subspace search
Possible improvements
For known distributions we ought to be able to use more accurate interpolation. Here I merely assumed that probabilities fall off as the exponential of ability difference. For very large races we should use the stability postulate instead, at least for those in the tail.
A generalization?
It's usually around this point that I start to suspect there can't be anything new here. After all, any similar optimization problem where we have an ounce of intuition as to the solution could be addressed the same way (i.e. by minimizing the interpolation burden with clustering). Perhaps somebody can point me to method XYZ and conclude that the horse race solution presented here is merely an application of the same.