Sunday, September 15, 2013

The Horse Race Problem: A Subspace Solution

The "horse race problem" asks for a reasonable estimate of the relative ability of entrants in a competition given that we know the probabilities of winning. The question is vague, but implies the competition awards a win to the contestant with the highest (or lowest) score, and perhaps further implies that the distribution of contestant scores is a translation family where "ability" is a location parameter.  This post formalizes the problem and provides a simple practical solution.

Geelong Cup Photo Finish 2005

Inferring running time distributions from win probabilities

The question is hardly new. As noted in Probability models for horse race outcomes by Mukhtar M. Ali, one way to estimate race outcome probabilities given a liquid market for the winner is to calibrate distributions of their running times. For example, one might assume the distribution of the \(i\)'th horse's running time \(\{\tau_i\}\)is gaussian, with known variance, and that the location of each horses' distribution is a free parameter to be estimated. More generally we assume all running time distributions belong to a translation family where \(f\) is fixed and a single parameter \(a_i\) characterizes each horse via its running time density $$ f_i(x) = f(x;a_i) = f(x-a_i) $$ We denote the corresponding probability that the \(i\)'th horse's running time \(\tau_i\) is less than \(x\) by $$ F(x - a_i) = Prob( \tau_i < x).$$ Under these assumptions the probability \(p_i\) that the \(i\)'th horse wins the race is\begin{equation} p_i = \int f(x;a_i) \prod_{j \not= i}^k \left(1-F\left(x;a_j\right) \right) dx. \end{equation} If we arbitrarily fix \(a_1=0\) we can presumably solve for \(\{a_i\}_{i>1}\). More on that in a moment.

Motivation (or lack thereof)

On application of the horse race problem is the estimation of so-called exotic probabilities: the probabilities of relevance for racegoers betting on quinellas, exactas (forecast) and trifectas. This presupposes that the market for horses winning a race is more efficient than the markets for various permutations of finishers. There is a small literature aimed at exploiting inefficiencies between the win and place (show) markets at racetracks.

The evidence is not altogether compelling, incidentally. It has been argued here that place prices on at least one betting exchange are roughly as efficient as the win market. And the ability for professional punters to beat the track take is greater for trifectas than the win market (because knowledge of two or three horses can be multiplied, as it were) suggesting that the smart money might be in trifectas instead. In a past life I gathered a fair amount of evidence to suggest just that, so don't rush off to the Dapto dogs armed only with a quaint assumption. Still, the horse race problem as stated remains a interesting mathematical problem and has motivated some people to study order statistics.

Normal running times

As noted by Ali, in the case where \(f = \phi \) is the normal distribution an approximate solution for the \(a_i\) given known \(p_i\) is given by Henery $$ a_i = \frac{ (k-1) \phi\left( \Phi^{-1}(\frac{1}{k}) \right) \left(\Phi^{-1}(p_i)- \Phi^{-1}(\frac{1}{k})\right) }{ \Phi^{-1}\left( \frac{i-\frac{3}{8}}{k + \frac{3}{4} } \right) } $$ Some other approximations are mentioned in this presentation by Victor Lo who summarizes some empirical findings.

Gamma running times

I leave it to the reader to explore another territory where analytic solutions are available. For a treatise on ranking models with gamma processes see Gamma processes, paired comparisons, and ranking by Hal Stern. (Related: Generalized Bradley-Terry models).

Arbitrarily distributed running times by brute force optimization

Here I restrict myself to some observations on the problem where \(f\) is arbitrary. As mentioned above we can set \(a_1=0\) and throw  \(\{a_i\}_{i>1}\) to a solver for \(F=0\). Here \(F = F\left(\{a_i\}_{i\not=1}\right)\) is an \(n\) dimensional vector function taking \(n-1\) arguments, and the \(i\)'th entry of \(F\) is the difference between the computed (\(q_i\)) and desired probabilities: \begin{equation} F_i = \overbrace{\int f(x;a_i) \prod_{j \not= i}^k \left(1-F\left(x;a_j\right) \right) dx}^{q_i} - p_i.\end{equation}This works fine for moderate \(n\), but for \(n>100\) it is time consuming, as can be seen in the following figure.

Computation time in seconds (left) as a function of the number of runners in the race.
Log scale on right. Matlab's implementation of Levenberg-Marquardt optimizer was used.
Solving a race with 200 runners takes on the order of one hour.

A novelty? Arbitrarily distributed running times by subspace optimization

We now introduce a method that cuts the computation time dramatically while retaining a high level of accuracy in the solution. The idea is to restrict the \(n-1\) dimensional search to a subspace of dimension \(m < n-1\). In particular if \(m\) is fixed (say \(m=15\)) then we can compute the solution in approximately linear time. A one hour computation for a large race becomes a one minute computation (not as cool as detexity, but not bad).

Obviously the subspace must be chosen with some care, lest it lie nowhere near the best solution. But before getting to that it also helps to introduce a scaled measure of discrepancy between the desired vector \(p\) and the one computed. We use the log-odds discrepancy:
$$ D(p_i, q_i) = log(p_i)-log(1-p_i)-log(q_i)+log(1-q_i)$$ rather than the naked difference of probabilities in the equation above.

Now to optimize over a subspace we define a map \(\phi(z \in R^m) \mapsto a \in R^n\) as follows. We assume the probabilities are sorted from highest to lowest. We let \(\{c_k\}_{k=1}^{m-1}\) be a collection of centroids obtained by k-means clustering of the differenced log-probabilities \(\{\log(p_1)-\log(p_i)\}\) into \(m-1\) clusters. We then enlarge the collection by adding \(0\) and \(\log(p_1)-\log(p_n)\). After sorting the enlarged list of cluster centroids (padded with extreme points) we have a collection \(c = \{0, \dots, \log(p_1)-\log(p_n)\}\) of cardinality \(m+1\). Then we define an \(n\) by \(m+1\) matrix \(A\) by setting \(A_{1j} = \delta_{1j}\), \(A_{n,j} = \delta_{m+1,j}\) where \(\delta\) is the Dirac delta function. For \(2 \le i \le m\), choosing the convex combination of the nearest two cluster points that returns \(p_i\). Thus there are two non-zero entries of \(A_{ij}\) for all but the first and last rows (unless a \(p_i\) coincides with a centroid). With slight abuse of notation, \(A\) as a linear transformation from \(R^{m+1}\) to \(R^n\) satisfying \(p = A(c)\).

The map \(\phi(z \in R^m) \mapsto a \in R^n\) is defined by composition:
       \phi = A \cdot \iota
$$where in matlab notation we would write \(\iota(z) = [0;z]\). That is to say \(\iota\) is the trivial map from \(R^m\) to \(R^{m+1}\) that assumes the first coordinate is zero. Finally then, we pass the vector function \(\tilde{F}\) taking \(m\) arguments to a solver. The \(i\)'th coordinate \(\tilde{F}\) is given by
      \tilde{F}_i = D\left( q_i\left( \phi\left(z\right)\right), p_i \right)
This works surprisingly well. The intuition is that the sorted abilities \(\{a_i\}\) can be interpolated reasonably (and very roughly, vary with the \(\log\) of \(p_i\)) so long as we have enough knot points. It helps to place the knot points near population centers to reduce the amount of interpolation. As we vary \(z\) we are varying the locations of the knot points and forcing the \(a_i\)'s to move in a subspace, but since the interpolation is pretty good the subspace lies quite close to the solution.

And as noted, it is much faster. The figure below shows computation time as a function of the number of starters when we use subspaces of dimension \(5\) and \(15\). No code optimization has been attempted and undoubtedly, better numbers could be achieved in any number of ways. But the point is that using a subspace search alone results in a \(50\)x improvement or so for \(n \approx 200\). The improvement would of course be much greater, in relative terms, for \(n \gg 200\).

Computation time as a function of race size when using a subspace search
The error introduced by the subspace approximation will vary, of course, but as a rough guide the ratio of \(p_i\) to \(q_i\) differs from unity by about half a percent when using a subspace of dimension five, even when there are \(200\) runners. So a horse at 100/1 might be calibrated at 99.5/1. This error drops by another order of magnitude when using a subspace of dimension \(15\), whereupon the calibrated and desired probabilities are, as a practical matter, indistinguishable.

Possible improvements

For known distributions we ought to be able to use more accurate interpolation. Here I merely assumed that probabilities fall off as the exponential of ability difference. For very large races we should use the stability postulate instead, at least for those in the tail.

A generalization?

It's usually around this point where start to suspect there can't be anything new here. After all, any similar optimization problem where we have an ounce of intuition as to the solution that could be similarly addressed (i.e. by minimizing interpolation with clustering). Perhaps somebody could point me to method XYZ and conclude that the horse race solution presented here is merely an application of the same.

Monday, September 2, 2013

Quasi-random sampling subject to linear and moment constraints (or: "how to bypass model calibration")

Its a rainy long weekend in New York and I've been dusting off an old experiment while listening to the thunder at a friend's place. I was particularly impressed by the fact they had an outlet on their porch positioned right by a comfy chair.

Here's the motivation. In finance we sometimes find ourselves attempting to calibrate a probabilistic model for an underlying random variable (such as a stock price) subject to known constraints (such as the price of an option on the same). Often, we then draw samples from the same model for the purpose of pricing other securities, performing risk calculations or satisfying some other need.

\begin{equation} observed\ prices \rightarrow calibrated\ model \rightarrow random\ samples \end{equation} This has several advantages including the built in regularization we get by using a parametrized model. There is on obvious drawback however: one has to come up with the model!

A while ago I was motivated to skip the calibration step. I was inspired by the herding literature in machine learning, and I started playing with a more direct path: \begin{equation} observed\ prices \rightarrow random\ samples \end{equation}.The idea is that we devise quasi-random sampling schemes that automatically achieve the calibration while striving to fill out the space in a way that achieves some of the objectives of a calibrated model.

Best described by an example I feel. And in this one we consider samples on a discrete lattice where smoothness of the sampled distribution is controlled using the discrete Laplacian.


I will lead with the picture. The middle plot demonstrates that we are able to generate samples whose histogram is quite smooth, yet satisfy the constraints provided. In this case we've chosen a two dimensional state space, as might commonly occur when (strike, maturity) or say (attachment point, maturity) axes are in play.

Discrete Laplacian minimizing quasi-random samples (middle) approximating the true distribution (left). For comparison we also show a minimal surface (right) generated using the same constraints with direct optimization.
We've started with some fake distribution on the left, created some functions of the same and used those functions as constraints.

Aspirations and notation

We allow a lattice to be determined by axes \(x^{(1)}=\{x^{(1)}_1 >...> x^{(1)}_{n_1}\}\) and \(x^{(2)}=\{x^{(2)}_1 >... > x^{(2)}_{n_2}\}\). We suppose that \(F = \{f_k\}_{k \in K}\) is a collection of functions on the implied lattice whose means are known constants \(\mu_k\). That is,
\mu_k = E[f_k] \ \  \ \ k \in K.
with respect to some density that we wish to sample from. Specifically we want to draw samples \(z=\{z^i\}_{i=1}^I = \{(z^i_1,z^i_2)\}_{i=1}^I\) such that
 \hat{f_k} := \frac{1}{I}\sum_{i=1}^I f_k(z^i)
satisfies approximate equalities
  \hat{f_k}(z) \approx \mu_k, \ \ k \in K.
It will turn out that using equalities here involves no loss of generality. Indeed a cute feature of the algorithm exhibited here is that inequalities are treated in the same manner as equalities: we simply use the same function twice with the same weight \(w\) but different \(\mu\)'s. The lower \(\mu\) is interpreted as a lower bound and the higher as an upper bound.

Now for notational convenience we might as well consider \(f_k\) to be functions defined on the integer lattice \(\{1..n_1\} \times \{1..n_2\}\), identifying \(f(r,s) := f(x^{(1)}_r,x^{(2)}_s)\). We also identify samples \((z^{(1)}_r,z^{(2)}_s)\) with their integer pairs \((r,s)\). We denote by \(Z^i(r,s)\) the number of samples falling on the lattice point.

A fast algorithm for generating quasi-random samples

In what follows we will use a prior probability \(\phi(r,s). \) defined on the lattice. It might might be determined by minimization of the square of the discrete Laplacian subject to the equality and inequality constraints given, or by other means.

The idea is to start lobbing samples on the lattice (like stackable Go stones) and watch the distribution emerge. By monitoring the discrete Laplacian we can try to coerce the samples to fill in holes - smoothing things out as it were. By monitoring the expectation of the constraint functions \(f_k\) with respect to the pseudo-sample distribution thus far, we can correct for bias.

For this to be efficient we need a very efficient means of bookkeeping.


At the outset we fix a constant lattice function \(A_k(r,s) = signum \left( \mu_k - f(r,s) \right) \) for \(k \in K\). We initialize the laplacian \(D(r,s)=0\) for all lattice points \((r,s)\). We initialize counters \(S^k_0 = 0\) for \(k \in K\).

Selecting the next quasi-random sample:

Beginning with quasi-random sample \(i=1\) we perform the following steps.

  1. Set urgency \(\nu(r,s) = D(r,s)\) then for \(k \in K\) update it by \( \nu(r,s) \rightarrow \nu(r,s) + w_k A_k(r,s) signum( S^k - i\mu_k ) \)
  2. Choose the most urgent lattice point after weighting: \( (r_i,s_i) = \arg \max \nu(r,s) \phi(r,s)\)
  3. Increment the sample count for the lattice point:  \(Z(r_i,s_i) \rightarrow Z(r_i,s_i) + 1 \)
  4. Decrement \(D(r_i,s_i) \rightarrow D(r_i,s_i) -1 \) then increment \(D(r^*,s^*) \rightarrow D(r^*,s^*) + \frac{1}{4} \) for all lattice points \(r^*,s^*\) neighbouring \(r_i,s_i\). 
  5. Update \( S^k \rightarrow S^k + f^k(r_i,s_i) \). 

What we are doing here is selecting a new point that improves as many of the biases as possible, subject to the discrete Laplacian not getting out of whack.

Selecting the next quasi-random sample (slightly more efficient):

It turns out we can be a little sneakier and initialize \( \nu(r,s) = \phi(r,s) \). Then the following algorithm  is equivalent and somewhat more efficient.
  1. Choose the most urgent lattice point after weighting: \( (r_i,s_i) = \arg \max \nu(r,s) \phi(r,s)\)
  2. Increment the sample count for the lattice point:  \(Z(r_i,s_i) \rightarrow Z(r_i,s_i) + 1 \)
  3. Update \(S^k = S^k_{prev} + f^k(r_i,s_i) \)
  4. For \(k \in K\) update \( \nu(r,s) \rightarrow \nu(r,s) + w_k A_k(r,s) \left( signum(S^k-i \mu_k) - signum(S^k_{prev} - (i-1) \mu_k \right) \)
  5. Decrement \(\nu(r_i,s_i) \rightarrow \nu(r_i,s_i) -1 \) then for all four neighbours \((r^*,s^*)\) of \((r_i,s_i)\) increment \( \nu(r_i,s_i) = \nu(r_i,s_i) + \frac{1}{4} \). 
  6. Set \(S^k_{prev} = S^k\). 
Notice that we don't really need to consider the discrete Laplacian and the moment terms separately, since we only care about their direct tradeoff.

A minor practical tweak

It seems to help to use a tuning parameter \(\lambda\) that weighs the relative importance of satisfying the constraints \(f_k\) versus minimizing the discrete Laplacian. Left as an exercise for the interested reader.

Sunday, September 1, 2013

Two Kelly betting rules: one of which I might actually remember

I was having a conversation the other day involving proportional betting and the Kelly criteria, and was embarrassed that I couldn't remember of the top of my head what some of the numbers came out to be. So I decided to try to make this really simple for myself by supplying two rules. The first applies to betting on an even money favorite. The second applies to betting on longshots. Both assume you wish to maximize the log of your wealth.

RULE 1: If "heads" is ten percent more likely than usual, bet ten percent of your wealth.

Here's a plot of \(E[log(1+fX)]\) where \(X\) is a binary random variable taking value \(1\) or \(-1\) with probability \(0.55\) and \(0.45\) respectively. The x-axis shows \(f\), the fraction of one's wealth one decides to put on the coin flip.

Percentage of one's wealth to wager on a coin flip with p=0.55

The y-axis can be interpreted as the average return, if we assume we get paid even money even though we know the true probability is different. Clearly we want to bet about ten percent of our wealth to maximize this.

But is the scale wrong? Since we have a ten percent edge we should, on average, be making ten percent of ten percent return, or one percent return, right?

Well no. You only get half that. If you want to remember why you only get half a percent, not a one percent return, notice that you are approximately maximizing a linear function (the return you expect, not taking into account the \(log\)) subject to a quadratic penalty (the concavity of \(log\)). To be more precise we observe the expansion around  \(f=0\) of
  E[\log(1+X)]  = 0.55 \log(1+f) + 0.45 \log(1-f)
which is given by
0.55 \log(1+f) + 0.45 \log(1-f)  = \overbrace{0.1 f}^{naive} - \overbrace{\frac{1}{2} f^2}^{concavity} + O(f^3)
The next two terms are \(+ \frac{1}{30} f^3 - \frac{1}{4} f^4\) incidentally. But as might be surmised from the plot we are pretty close to a parabola. The marginal return goes to zero just as the parabolic term starts to match the linear one (i.e. its rate) and the absolute return goes all the way to zero when the parabolic term wipes out the linear one completely. So it must be that we optimize half way down, as it were.

Another way to come at this is that the fraction of wealth you should invest is simply \(p-q\). That's surprising at first. But since \(E[log(1+X)]  = p \log(1+f) + q \log(1-f) \) in general we note that
     \frac{\partial}{\partial f} E[\log(1+X)]  = \frac{p}{1+f} - \frac{q}{1-f}
which is zero for
        f = p - q
Incidentally your return grows slightly faster than the square of your edge. As the amount we bet is linear in the edge, we'd expect the return to grow quadratically as we vary \(p\), more or less. And not forgetting the factor of one half it ought to be half the square of our edge. In fact it grows slightly faster, as can be seen by substituting the optimal \(f= p-q\) back into the return equation.

Average return for an even money coin flip. 

The actual return is the red line. The quadratic approximation is the blue. Not that the percentage edge is \(2(p-0.5)\), not \(p-0.5\).

So much for betting on the favourite, as it were, but what about longshots? The the math doesn't change but our intuition might. Instead of fixing the outcome at even money and varying the true probabilities of an outcome, let's instead fix the true probability at \( p = 0.01 \) and vary the odds we receive. All being fair we should receive $100 back, including our stake, for every $1 bet. We'll assume instead we get a payout of \(\alpha\) times the fair payout.  Then we maximize
  E[\log(1+X)]  = p \log\left(1+ f \alpha \left( \frac{1}{p}-1\right) \right) + (1-p) \log(1-f )
This brings us the the second easy to remember betting rule.

RULE 2. If a longshot's odds are ten percent better than they should be, bet to win ten percent of your wealth

Now I know what you are thinking: why do we need rule #1 if we have rule #2? Well you probably don't, actually, but evidently I might because I have managed to forget rule #2 in the past. We'll call it a senility hedge.

But rule 2 the rule is quite reasonable on a larger domain - though eventually it too breaks down. If you can get ridiculously good odds on a longshot (let's say ten times the fair odds) don't go absolutely nuts. Only bet the amount that would double your wealth if the odds were fair. Diminishing returns set in if you are set to win so much that the concavity of \(log\) kills the fun. You can see that in the picture below.

Average return on a 100/1 horse when we are offered 1000/1 and 2000/1 respectively. 

Notice that we bet only marginally more if we get 20 times the fair odds versus 10 times. We do win twice as much, however.