Wednesday, October 29, 2014

Some special cases of the Lambert W function

Suppose \(f(x)\) is given by the functional relation \begin{equation} e^{-a(x) f(x) + b(x)} = c(x) f(x) + d(x) \end{equation} then \begin{equation} f(x) = \frac{1}{a(x)}W\left( \frac{a(x)}{c(x)} e^{b(x)+\frac{a(x)d(x)}{c(x)} } \right) - \frac{d(x)}{c(x)} \end{equation} where \(W\) is the Lambert W Function. You can click through to Wikipedia for a proof. Here I merely list some special cases which every good citizen should instantly recognize (and it can be mildly tedious to reproduce them).

Recovering the definition of \(W(x)\)

Just to check we are on the right planet set \(a=1\), \(c=1\), \(b=\ln(x)\) and \(d=0\) to get \begin{equation} x e^{-f(x)} = f(x) \end{equation} or equivalently \( f(x)e^{f(x)} = x\) which is the definition of the Lambert W function. Sure enough those substitutions recover $$ f(x) = W(x) $$ as expected. A minor modification uses \(a(x)=1\), \(c(x)=1\), \(b(x)=0\) but leaves \(d(x)\) general.

The solution to \( e^{-f(x)} - f(x) = d(x) \)

is \begin{equation} f(x) = W \left( e^{d(x)} \right) \end{equation} Only slightly more generally, set \(a(x)=a\), \(b=\ln(g(x)), c=1, d=1\) for those cases where \(g(x)\) is on the wrong side, as it were.

The solution to \( e^{-af(x)} = f(x) g(x) \)

or equivalently \( f(x) e^{af(x)} g(x) = 1 \) is \begin{equation} f(x) = \frac{1}{a} W\left( a g(x) \right) \end{equation} In particular,

The solution to \( x^k e^{-af(x)} f(x) = 1 \)

which seems to crop up a fair bit for your author is \begin{equation} f(x) = \frac{1}{a} W\left( a x^{-k} \right) \end{equation}

Similarly setting \(a=-1\) ...

The solution to \( x e^{f(x)} f(x) = 1 \)

(i.e. where \(x\) is on the wrong side but we otherwise have the Lambert W definition) must be \begin{equation} f(x) = - W\left( -\frac{1}{x} \right) \end{equation} We might also take \(b = \ln(g(x))\) and thus

The solution to \( g(x) e^{-af(x)} = f(x) \)

or equivalently \( f(x)e^{af(x)} = g(x) \) is \begin{equation} f(x) = \frac{1}{a} W\left( a g(x) \right) \end{equation} which reduces to the Lambert W function for \(a=1\), as we expect. It is also pretty obvious from first principles, because if we multiply both sides by \(a\) we have \begin{equation} (af) e^{(af)} = ag \end{equation} and thus \(af = W(ag)\). Next suppose we want a power of \(f\) to appear. Let \(b=\beta(x)/k\), \(c = \gamma^{1/k}\), \(a = k/\alpha\) and \(d=0\). And then raise both sides to the power \(k\). It follows that...

The solution to \( e^{\alpha f(x) + \beta(x)} = \gamma f(x)^k \)

is \begin{equation} f(x) = \frac{\alpha}{k}W\left( \frac{k e^{\frac{1}{k}\beta(x)} }{\alpha \gamma^{1/k}} \right) \end{equation} and if, in particular, \(\beta(x) = -\ln(g(x))\) then

The solution to \( e^{\alpha f(x)} = g(x) \gamma f(x)^k \)

is \begin{equation} f(x) = \frac{\alpha}{k}W\left( \frac{k }{\alpha g(x)^{1/k} \gamma^{1/k}} \right) \end{equation} and if we take \(c = \frac{1}{\gamma}\) and \(g(x)=x\) and \(\alpha = \frac{s}{2}\) and \(k=2\) then p>

The solution to \( e^{-\frac{s}{2}f(x)} x f(x)^2 = c \)

is \begin{equation} f(x) = \frac{s}{2k}W\left( \frac{2k}{s} \sqrt{ \frac{c}{x}} \right) \end{equation}

Wednesday, August 13, 2014

A new use for the number 59575553 (ISDA shorten's 20 character Legal Entity Identifiers to 10 character Unique Trade Identifier Prefixes)

The International Swaps and Dealers Association (ISDA) faced a minor problem recently. In shortening their 20 character Legal Entity Identifiers (LEI's) into 10 character Universal Trade Identifier (UTI) prefixes they ran into collisions.

Somewhat randomly this came across my desk. So under proposal is an improvement for hashing LEI's into UTI prefixes: the modulus operation lifted from integers to the space of case insensitive alpha-numeric strings.

You can grab the python code here. Of course it's hardly a new idea.

Wednesday, May 28, 2014

Sensitivities of a Kalman Filter estimate with respect to all past observations

Consider a sequence of observations \(t_1,...,t_k\) at which a latent vector process \(x\) is observed indirectly, via an observation equation \begin{equation} y_{t_i} = H_i x_{t_i} + \epsilon_i \end{equation} We assume \(\epsilon_i\) is mean zero multivariate gaussian with covariance \(R_i\). For brevity we refer to \(y_{t_i}\) as \(y_i\), \(x_{t_i}\) as \(x_i\) and so forth. We assume the evolution of \(x\) in between the times specified can be written \begin{equation} x_{i+1} = A_i x_i + u_i \end{equation} where \(u_i\) are also gaussian. In this linear gaussian system the recursive estimation of \(x_t\) is achieved by the well known Kalman filter, and the contemporaneous impact of the next observation \(y_{k+1}\) is also (it is merely proportional to the Kalman gain).

But less well appreciated is a related computation, the derivatives of the Kalman filter estimate with respect to a past observation \(y_i\). This note establishes how said computation can be achieved by making two observations. The first is a re-representation of the Kalman estimate in the form of a weighted least squares problem (not dissimilar to the Duncan Horn representation). The second observation is that sensitivities of any weighted least squares problem can be computed using an adjoint trick.

Step 1: The Kalman filter solution as a (particular) least squares problem

We shall set up a least squares problem involving the current state \(x_k\) only, and all the previous observations. We argue that the solution to this problem is identical to the Kalman filter. Since the current estimate \(\hat{y}_k\) is a simple linear function of the current state \(x_k\), this allows us to compute the derivative of the current estimate with respect to all previous observations.

In the Kalman filter we assume a gaussian prior on the initial state \(x_0\). This can introduce annoying special cases in what follows, but we can clean up the notation instead by introducing: \begin{eqnarray} y_{-1} & = & H_{-1} x_{-1} + \epsilon_{-1} \\ x_0 & = & A_{-1} x_{-1} + u_{-1} \end{eqnarray} provided \(H_{-1}\) and \(A_{-1}\) are identity matrices, \(\epsilon{-1}\) is identically zero, \(y_{-1}\) is set equal to the mean of our prior and \(u_0\) adopts its covariance. With the boundary conditions cleaned up in this fashion we can invert the dynamical equations, assuming only that \(A\)'s have left inverses \(A^{-1}\), as follows: \begin{equation} x_j = A^{-1}_{j}\left( x_{j+1} - u_j \right) \end{equation} and then re-arrange the observation equations so that the only value of \(x_i\) that appears is \(x_k\). \begin{eqnarray} y_k & = & H_k x_{k} + \epsilon_k \\ y_{k-1} & = & H_{k-1} x_{k-1} + \epsilon_{k-1} \\ & = & H_{k-1} \left( A^{-1}_{k-1}\left( x_{k} - u_{k-1} \right) \right) + \epsilon_{k-1} \\ & = & H_{k-1} A^{-1}_{k-1} x_{k} - H_{k-1} A^{-1}_{k-1} u_{k-1} + \epsilon_{k-1} \\ y_{k-2} & = & H_{k-2} x_{k-2} + \epsilon_{k-2} \\ & = & H_{k-2} \left( A^{-1}_{k-2}\left( x_{k-1} - u_{k-2} \right) \right) + \epsilon_{k-2} \\ & = & H_{k-2} A^{-1}_{k-2} x_{k-1} - H_{k-2} A^{-1}_{k-2} u_{k-2} + \epsilon_{k-2} \\ & = & H_{k-2} A^{-1}_{k-2} \left( A^{-1}_{k-1}\left( x_{k} - u_{k-1} \right) \right) - H_{k-2} A^{-1}_{k-2} u_{k-2} + \epsilon_{k-2} \\ & = & H_{k-2} A^{-1}_{k-2} A^{-1}_{k-1} x_{k} - H_{k-2} A^{-1}_{k-2} A^{-1}_{k-1} u_{k-1} - H_{k-2} A^{-1}_{k-2} u_{k-2} + \epsilon_{k-2} \\ & \dots & \end{eqnarray} from which it is apparent that if we write \(Y = (y_k, y_{k-1}, y_{k-2},...,y_{-1} ) \) then \begin{equation} Y = G x_{k} + \eta \end{equation} where \(G\) is the concatenation of the coefficients of \(x_k\) given above and \(\eta\) is the gaussian random variable equal to the sum of \(u_k\)'s and \(\epsilon_k\)'s (again, with coefficients as above, leading to a non-trivial covariance structure).

Step 2. (A review of the adjoint trick)

Suppose \(x\) solves \(Qx = b(y)\). The adjoint trick can be used to compute the derivative of \(g(x)\) w.r.t. y. In particular, if \(y\) is the observation and \(x\) the solution of a generalized least squares problem with error covariance \(R\) we can cast it in this form by writing: \begin{eqnarray} g(x)& = & H x Q & = & H^T R^{-1} H \\ b(y) & = & H^T R^{-1} y \end{eqnarray} Consider now \begin{equation} f(x,y) = 0 \end{equation} where \begin{equation} f(x,y) = Q x - b(y) \end{equation} We use derivatives of \begin{equation} \tilde{g} = g - \lambda^T f(x,y) \end{equation} with respect to \(y\) as a means of computing derivatives of \(g\) with respect to \(y\). Note that \begin{equation} \frac{\partial \tilde{g}}{\partial y} = \frac{\partial g}{\partial x}\frac{\partial x}{\partial y} - \lambda^T \left( \frac{ \partial f }{\partial x }\frac{\partial x}{\partial y} + \frac{\partial f}{\partial y} \right) \end{equation} and this will simplify if we choose \(\lambda\) judiciously as a solution of \begin{equation} \frac{\partial g}{\partial x } = \lambda^T \frac{\partial f}{\partial x} \end{equation} which we call the adjoint equation. For then \begin{eqnarray} \frac{\partial \tilde{g}}{\partial y} & = & \frac{\partial g}{\partial x}\frac{\partial x}{\partial y} - \lambda^T \left( \frac{ \partial f }{\partial x }\frac{\partial x}{\partial y} + \frac{\partial f}{\partial y} \right) \\ & = & -\lambda^T \frac{\partial f}{\partial y} \\ & = & \lambda^T \frac{\partial b}{\partial y} \end{eqnarray} Now specializing to \begin{equation} g(x) = H x \end{equation} and \(b(y)\) as above we can solve for this convenient choice of \(\lambda\) by writing \begin{eqnarray} H & = & \frac{\partial g}{\partial x} \\ & = & \lambda^T \frac{\partial f}{\partial x} \\ & = & \lambda^T Q \\ & = & \lambda^T H^T R^{-1} H \end{eqnarray} where the second equality is the adjoint equation. It should be clear from this how to compute derivatives of \(\tilde{g}\) with respect to \(y\), and thereby compute derivatives of \(g\) with respect to \(y\).

Wednesday, November 20, 2013

Keeping punters log-happy: Some properties of a "pristine parimutuel" market clearing mechanism

Is there such a thing as a perfect probabilistic paradise for punters? A place where participants' prior probabilistic prophecies are pulled apart and placed into a philosophically pristine parimutuel posterior, without painful departure from previously perceived pseudo-optimal particulars? I believe the platonic possibility is promising (though to provoke, in practice we puritans may be prevented from partaking in this powerful portal whilst we are paralyzed by the paranoid parlance of the present day).

                                               A pristine market

Well now that I'm out of spit, here are some characteristics of a market for mutually exclusive, exhaustive outcomes that I would like to see:
  1. Every participant is forced to invest in a manner that optimizes their long run wealth. They will allocate their wealth to maximize \(E^P[\log {\rm wealth}] \) with respect to their best estimate of real world probability \(P\).
  2. Every participant's estimate of probability, on which 1. is evidently based, must include both private information (from which they might derive a probability measure \(R\) say) and the market probabilities themselves (call them \(Q\)). In contrast, to simply equate \(P\) with \(R\) in 1. is a dreadful, though rather common, mistake.
  3. Criteria 2. is achieved without the need for participants to monitor and respond to price feedback. In contrast the usual mechanism for including market information, as we find in racetrack parimutuels, is to provide provisional estimates of \(Q\) in realtime so that punters can react.
The last requirement motivated my ponderings on this topic. In our perceived utopia participants are given inventive to provide their own private probabilities \(R\) well before "race time" (as it were). They can then put their feet up and relax safe in the knowledge that 1. will be satisfied automatically.

Such is non-trivial because even investors whose bets are tiny with respect to the overall market volume (and thus have negligible price impact) must quite rationally react to partial price discovery even though in "theory", they should simply bet proportionately irrespective of the odds on offer (a lovely accident mentioned below). I use scare quotes because unless the odds are truly known, \(Q\) should really enter the optimization via the back door: an update to \(P\) acknowledging the market's ability to aggregate information.

In this setup the market clearing mechanism does the heavy lifting normally performed iteratively and imperfectly. So long as there is a rule governing the manner in which "\(R + Q = P\)" (figuratively speaking) we can anticipate every player's estimate of \(P\) given \(Q\), and thereby solve simultaneously for Q and log-optimal allocations that presumably influence Q. Here we consider the simplest rule of that sort. We shall assume P is a convex combination of \(R\) and \(Q\) with fixed coefficients.

Also for simplicity we consider the case where the market is not subsidized (though that might be an interesting direction for generalization). Then linear pricing forces us to adopt the most straightforward parimutuel clearing mechanism once we known the allocations: divide all the money wagered amongst those choosing the winning horse, in proportion to their bet.

                                       Necessary optimality conditions

Suppose market participant \(k\) allocates all his wealth \(W^k\) across \(I\) mutually exclusive outcomes. Suppose his estimate of probability for the \(i\)'th state is given by \begin{equation} p^k_i = \eta^k r^k_i + (1- \eta^k) q_i \end{equation} where \(r^k\) is his best estimate using only private information, and \(q_i\) is the market implied probability arrived at by means to be discussed.

As noted this equation is a statement of a seemingly rational philosophy, independent of how the market operates. Investor \(k\) might have noticed in the past that his private information adds some explanatory power to the market, but he probably shouldn't ignore the market prices altogether in arriving at the best estimate of real world probability.

We shall further suppose, in what follows, that all \(K\) participants are rational in another sense. They wish to optimize the log of their posterior wealth. Now it is well appreciated that if \(p^k_i\) are considered fixed this log-optimality leads simply to proportional betting, but that is not the case here. Only \(r^k_i\) and \(\eta^k\)'s are fixed, and we shall attempt to construct allocations \(u = \{u^k\}\) and clearing prices \(q_i\) that overtly depend on the investments made by participants.

To that end let \(u^k\) denote the fraction of wealth investor \(k\) invests in the \(i\)'th state. Suppose that the market clears in parimutuel fashion, meaning that all participants receive the same price. The market probability for the \(i\)'th state must be \begin{equation} q_i = \frac{ \sum_k u^k_i W^k } { \sum_{k=1}^K W^k } = \sum_k u^k_i W^k \end{equation} since we might as well suppose w.l.o.g. that \( \sum_k W^k = 1 \).

The question then arises, is there a choice of \(\{u^k_i \}_{i,k}\) such that investment by each participant is log-optimal? Intuitively one would expect a market equilibrium provided \(\eta^k\)'s are strictly between \(0\) and \(1\).

The utility function for the \(k\)'th investor is \begin{eqnarray} U^k( u^k ) & = & E^p\left[ \log\left( \frac{u^k_i}{q_i} \right) \right] \\ & = & \sum_{i=1}^I p^k_i \log\left( \frac{u^k_i}{ \sum_k u^k_i W^k } \right) \\ & = & \sum_{i=1}^I ( \eta^k r^k_i + (1- \eta^k) q_i ) \log\left( \frac{u^k_i}{ \sum_k u^k_i W^k } \right) \end{eqnarray} and by definition of \(u^k_i\) the constraints are \(\sum_i u^k_i = 1 \). Or for every \(k\) we might write \( g(u^k)=0\) where \(g(u^k) = \sum_i u^k_i - 1\). This sets up the first order Lagrangian equations for \( \Lambda(u,\lambda) = U(u) - \lambda \cdot g( u ) \) where we collect the components \(k=1,..K\). As usual for these problems we set \( \nabla \Lambda = 0 \) because for optimality the derivative of the objective function must be proportional to the derivative of the constraint function. This leads to equations of the form \begin{eqnarray} 0 & = & p^k_i( q_i(u)) \left( \frac{1}{u^k_i} - \frac{W^k}{q_i(u)} \right) - \eta^k W^k \log \left( \frac{u_i}{q_i(u)} \right) - \lambda^k, \ \ \ {\rm and} \\ 0 & = & \sum_i u^k_i - 1 \end{eqnarray} relating the allocations \(u^k_i\). Notice there are \(KI+K\) equations and \(KI+K\) free variables, including both the \(u^k_i\)'s and the \(\lambda^k\) multipliers (the \(W^k\) are fixed parameters and we have \(q_i(u) = \sum_k u^k_i W^k\) as noted above). The solution to this system of non-linear equations defines a pristine parimutuel clearing mechanism.

                                  Comparison to proportional betting

In contrast if we imagine that \(q_i\) are not a function of allocations \(u^k_i\) but fixed, and further suppose \(\eta^k = 1\) then we return to the overly stylized world where participants don't take market prices into account, preferring to believe their own homework is a sufficient statistic. The optimality conditions are instead: \begin{eqnarray} 0 & = & \frac{p^r_i}{u^k_i} - \lambda^k, \ \ \ {\rm and} \\ 0 & = & \sum_i u^k_i - 1 \end{eqnarray} It is apparent from the first equation that \(u^k_i \propto r^k_i\) and then, from the second, that \(u^k_i = r^k_i\). We recover the remarkable accident referred to as proportional betting, where the price \(q_i\) does not enter the picture. This works perfectly well for blackjack where \(r^k = p^k\) but not most real world games where markets inevitably supplement one's private information.


While I have caged this discussion in market language, it should be apparent that we have derived an algorithm for combining multiple probabilistic forecasts into a single probability distribution \(Q = \{q_i\}\), for which there is a large literature. See the references to Ranjan and Gneiting for example.

I shall concede that what is philosophically inconsistent is my use of simple linear pooling to arrive at the individual's subjective probability estimates \(p^k = \{p^k_i\}_{i=1}^I\) based on their prior \(r^k\) and the final market price \(q_i\), given that we then derive a more sophisticated combination scheme for meshing between individuals. Why not use the "better" scheme to combine \(Q\) and \(R\)?

I suppose a flippant response is "why not put a picture of a boy holding a cereal box on the cereal box?". Let me think about a more satisfying answer and get back to you.

Thursday, October 31, 2013

Hooke's Law and the Kalman filter. A little "spring theory" emphasizing the connection between statistics and physics.

This post is purely expository and concerns the simplest one dimensional Kalman filter in which we periodically observe Brownian motion subject to gaussian measurement error.

We make an observation that isn't always stressed up front in the statistical or control systems perspective (such as you will find at wikipedia for example). Then we pursue the analogy between statistics and physics a little further, and show how the updating of a location estimate of a gaussian distribution amounts to a combination of center of mass and reduced mass calculations.

That's because simplifying the corresponding physical system, that contains both fixed and free masses, requires a non-trivial change of reference frame. You might call it "mechanical statistics", as distinct, obviously, from statistical mechanics.

Kalman filter equations are just a center of mass calculation

Suppose the prior estimate of location for a particle is \(m\) and the prior covariance is \(P\). Suppose we make an observation \(y\) with error variance \(R\). Our posterior belief is gaussian with location \(m'\) say and variance \(P'\). The update is usually written \begin{eqnarray} m' & = & m + K ( y - m ) \\ P' & = & P(1-K), \ \ {\rm where} \\ K & = & \frac{P}{P+R} \end{eqnarray} However it is in many ways more natural to use the inverses of covariances instead. If we write \(\varphi = 1/R\), \(p = 1/P\) and \( p' = 1/P'\) and multiply by through by \( \frac{P+R}{PR} \) we notice that the Kalman filter update is merely a center of mass calculation: \begin{eqnarray} m' & = & \frac{m/P + y/R} { 1/R + 1/P } = \frac{ pm + \varphi y }{ \varphi + p } \\ p' & = & \frac{1}{P'} = \frac{P+ R}{PR} = \frac{1}{P} + \frac{1}{R} = \varphi + p \end{eqnarray} The analogy works if we treat precision as mass. And in what follows we'll be equally interested in the analogy between force and the derivative of the negative log likelihood function.

This table suggests that in a gaussian world force is linear in distance. And it true that we can construct an analogue Kalman smoother with rods and springs as follows:

An "analogue" gaussian smoother using perfect Hookean springs
Minimizing energy corresponds to minimizing maximizing log-likelihood. And setting the derivative of log-likelihood to zero corresponds to finding the equilibrium where forces cancel out.

Futhermore the fact that combining two pieces of evidence for one latent variable can sometimes be as simple as merging the two observations at their "center of precision" corresponds to a nice accident when forces grow linearly with distance: the impact of two masses on a third is unchanged if they coalesce at their center of mass.

But there is more to the story...

Reading a "spring diagram" in a Hookean universe

To demonstrate a richer physical analogy we consider next a Gaussian distribution whose location is assumed unknown, but also gaussian. 

Figure 1. Hierarchical model where location of a gaussian distribution is itself gaussian

Suppose our prior is \begin{eqnarray} P( x | \mu ) & \propto & e^{-\rho(x-\mu)^2} \\ P(\mu) & \propto & e^{-p(\mu-m)^2} \end{eqnarray} where this time \(m\) represents our guess as to the location of the center of the distribution. Symbolically we might represent the prior with the following diagram.

Figure 2. Spring diagram representing prior knowledge of the location of a gaussian distribution

Here the tuples represent location and precision, or if you prefer, position and mass. It isn't an analogy. If we assume attractive forces vary as a Hookean law, which is to say proportional to the product of masses and distance: \begin{equation} Force \propto M_1 M_2 d \end{equation} then in the example above the yellow and green masses witness attractive force and potential energy given by \begin{eqnarray*} {\rm Force} & = & \frac{p}{\rho} \rho |m - \mu| = p |m - \mu |,\ \ {\rm and\ thus} \\ {\rm Energy} & = & \frac{1}{2} p (m - \mu )^2\\ \end{eqnarray*} Since energy is the negative log likelihood, we "read" the diagram as \( P( x | \mu ) \propto e^{-\rho(x-\mu)^2} \). Similarly the spring between the yellow mass and red "test particle" is read \( P(\mu) \propto e^{-p(\mu-m)^2} \). The system therefore represents the hierarchical model and indeed, it is an exact physical analogue. So in what follows we will ask the following question: what force does the test particle feel as we add other masses to the picture?

Simplifying  a spring diagram using reduced mass

The game begin in earnest when we introduce noisy evidence of our unknown location parameter \(\mu\) for our mysterious distribution. Suppose we take a draw from said distribution \(x_2\). Suppose we don't observe \(x_2\) itself but instead, a noisy measurement \(y\) whose precision (or "mass", if you will) is \(\varphi\). The noisy measurement's distribution conditional on \(x_2\) is \( P(y|x_2) \propto e^{-\varphi(y-x_2)^2}\) and corresponds to the following spring diagram.

Figure 3. Spring diagram representing noisy evidence
As suggested in the diagram, we will attach the evidence to the latent variable \(\mu\) as follows:

Figure 4. Prior location belief plus a noisy measurement
Our task is simplification of this system until, from the perspective of the test mass, it resembles the form of the prior we began with. We should think of the observed evidence (green rectangles) as fixed points whereas the yellow circles represent unknown quantities that are free to move.

We ought to recall here the rules for combining springs in series, or to be more direct, the "reduced mass" trick for replacing a three body problem with a two body problem. In either situation physics reminds us that the combined action of the rightmost two masses can be simplified:

Figure 5. Prior belief plus a noisy measurement simplified using reduced mass

We replace the mass \(\phi\) with a reduced mass \(\frac{\phi}{\phi+\rho}\) because the intermediating unit mass reduces the pull. Since it is well covered elsewhere I will not derive the reduced mass expression but notice why the reduced mass makes sense in the limits. If \(\phi \rightarrow 0\) the relative size of the yellow unit mass is huge and so the mass at \(\mu\) hardly feels the pull from the green mass at \(y\) at all. In the other extreme case, when \(\phi \rightarrow \infty\), the unit mass is sucked into the green mass and is, for all intents and purposes, stationary. Thus it acts like a fixed unit mass pulling the mass at \(\mu\) rather than a floating one.

We proceed to the final simplification of the diagram. This is pretty easy as the two green masses are inertial. Their impact on the yellow mass is equivalent to a single inertial mass at their center of mass. Thus:

Figure 6. Simplification of Figure 4 by reduced mass and center of mass calculation.
Apologies here because the diagram contains a small error. The denominators should read \(\varphi + \rho\), not \(\varphi + p \). Evidently this takes the form of the prior system in Figure 2 and can be easily read. It states that our posterior belief about the location parameter \(\mu\) is gaussian with mean \(m'\) say and precision \(p'\) say, where \begin{eqnarray*} m' & = & \frac{ m \frac{p}{\rho} + y \frac{\varphi}{\varphi+\rho} } { \frac{p}{\rho} + \frac{\varphi}{\varphi + \rho} } \\ p' & = & \frac{p}{\rho} + \frac{\varphi}{\varphi + \rho} \end{eqnarray*}
This closes the loop and demonstrates how updating can be performed for the hierarchical model in Figure 1.

Recovering the Kalman filter update

As a parting note, we see that the limit \(\rho \rightarrow \infty\) leads to update equations \begin{eqnarray} p' & = & \frac{p}{\rho} + \frac{\varphi}{\varphi + \rho} = \frac{p+p/\rho + \varphi}{\varphi +\rho}  \rightarrow \varphi + p \\ m' & = & \frac{ m \frac{p}{\rho} + y \frac{\varphi}{\varphi+\rho} } { \frac{p}{\rho} + \frac{\varphi}{\varphi + \rho} }   \rightarrow \frac{ pm + \varphi y }{ \varphi + p } \end{eqnarray} which is the Kalman update as before. This is to be expected, since in the limit \(\rho \rightarrow \infty\) the problem of locating a distribution with unknown location (noisily observed) shrinks down to the problem of locating a point mass with unknown location (noisily observed).

Sunday, September 15, 2013

The Horse Race Problem: A Subspace Solution

The "horse race problem" asks for a reasonable estimate of the relative ability of entrants in a competition given that we know the probabilities of winning. The question is vague, but implies the competition awards a win to the contestant with the highest (or lowest) score, and perhaps further implies that the distribution of contestant scores is a translation family where "ability" is a location parameter.  This post formalizes the problem and provides a simple practical solution.

Geelong Cup Photo Finish 2005

Inferring running time distributions from win probabilities

The question is hardly new. As noted in Probability models for horse race outcomes by Mukhtar M. Ali, one way to estimate race outcome probabilities given a liquid market for the winner is to calibrate distributions of their running times. For example, one might assume the distribution of the \(i\)'th horse's running time \(\{\tau_i\}\)is gaussian, with known variance, and that the location of each horses' distribution is a free parameter to be estimated. More generally we assume all running time distributions belong to a translation family where \(f\) is fixed and a single parameter \(a_i\) characterizes each horse via its running time density $$ f_i(x) = f(x;a_i) = f(x-a_i) $$ We denote the corresponding probability that the \(i\)'th horse's running time \(\tau_i\) is less than \(x\) by $$ F(x - a_i) = Prob( \tau_i < x).$$ Under these assumptions the probability \(p_i\) that the \(i\)'th horse wins the race is\begin{equation} p_i = \int f(x;a_i) \prod_{j \not= i}^k \left(1-F\left(x;a_j\right) \right) dx. \end{equation} If we arbitrarily fix \(a_1=0\) we can presumably solve for \(\{a_i\}_{i>1}\). More on that in a moment.

Motivation (or lack thereof)

On application of the horse race problem is the estimation of so-called exotic probabilities: the probabilities of relevance for racegoers betting on quinellas, exactas (forecast) and trifectas. This presupposes that the market for horses winning a race is more efficient than the markets for various permutations of finishers. There is a small literature aimed at exploiting inefficiencies between the win and place (show) markets at racetracks.

The evidence is not altogether compelling, incidentally. It has been argued here that place prices on at least one betting exchange are roughly as efficient as the win market. And the ability for professional punters to beat the track take is greater for trifectas than the win market (because knowledge of two or three horses can be multiplied, as it were) suggesting that the smart money might be in trifectas instead. In a past life I gathered a fair amount of evidence to suggest just that, so don't rush off to the Dapto dogs armed only with a quaint assumption. Still, the horse race problem as stated remains a interesting mathematical problem and has motivated some people to study order statistics.

Normal running times

As noted by Ali, in the case where \(f = \phi \) is the normal distribution an approximate solution for the \(a_i\) given known \(p_i\) is given by Henery $$ a_i = \frac{ (k-1) \phi\left( \Phi^{-1}(\frac{1}{k}) \right) \left(\Phi^{-1}(p_i)- \Phi^{-1}(\frac{1}{k})\right) }{ \Phi^{-1}\left( \frac{i-\frac{3}{8}}{k + \frac{3}{4} } \right) } $$ Some other approximations are mentioned in this presentation by Victor Lo who summarizes some empirical findings.

Gamma running times

I leave it to the reader to explore another territory where analytic solutions are available. For a treatise on ranking models with gamma processes see Gamma processes, paired comparisons, and ranking by Hal Stern. (Related: Generalized Bradley-Terry models).

Arbitrarily distributed running times by brute force optimization

Here I restrict myself to some observations on the problem where \(f\) is arbitrary. As mentioned above we can set \(a_1=0\) and throw  \(\{a_i\}_{i>1}\) to a solver for \(F=0\). Here \(F = F\left(\{a_i\}_{i\not=1}\right)\) is an \(n\) dimensional vector function taking \(n-1\) arguments, and the \(i\)'th entry of \(F\) is the difference between the computed (\(q_i\)) and desired probabilities: \begin{equation} F_i = \overbrace{\int f(x;a_i) \prod_{j \not= i}^k \left(1-F\left(x;a_j\right) \right) dx}^{q_i} - p_i.\end{equation}This works fine for moderate \(n\), but for \(n>100\) it is time consuming, as can be seen in the following figure.

Computation time in seconds (left) as a function of the number of runners in the race.
Log scale on right. Matlab's implementation of Levenberg-Marquardt optimizer was used.
Solving a race with 200 runners takes on the order of one hour.

A novelty? Arbitrarily distributed running times by subspace optimization

We now introduce a method that cuts the computation time dramatically while retaining a high level of accuracy in the solution. The idea is to restrict the \(n-1\) dimensional search to a subspace of dimension \(m < n-1\). In particular if \(m\) is fixed (say \(m=15\)) then we can compute the solution in approximately linear time. A one hour computation for a large race becomes a one minute computation (not as cool as detexity, but not bad).

Obviously the subspace must be chosen with some care, lest it lie nowhere near the best solution. But before getting to that it also helps to introduce a scaled measure of discrepancy between the desired vector \(p\) and the one computed. We use the log-odds discrepancy:
$$ D(p_i, q_i) = log(p_i)-log(1-p_i)-log(q_i)+log(1-q_i)$$ rather than the naked difference of probabilities in the equation above.

Now to optimize over a subspace we define a map \(\phi(z \in R^m) \mapsto a \in R^n\) as follows. We assume the probabilities are sorted from highest to lowest. We let \(\{c_k\}_{k=1}^{m-1}\) be a collection of centroids obtained by k-means clustering of the differenced log-probabilities \(\{\log(p_1)-\log(p_i)\}\) into \(m-1\) clusters. We then enlarge the collection by adding \(0\) and \(\log(p_1)-\log(p_n)\). After sorting the enlarged list of cluster centroids (padded with extreme points) we have a collection \(c = \{0, \dots, \log(p_1)-\log(p_n)\}\) of cardinality \(m+1\). Then we define an \(n\) by \(m+1\) matrix \(A\) by setting \(A_{1j} = \delta_{1j}\), \(A_{n,j} = \delta_{m+1,j}\) where \(\delta\) is the Dirac delta function. For \(2 \le i \le m\), choosing the convex combination of the nearest two cluster points that returns \(p_i\). Thus there are two non-zero entries of \(A_{ij}\) for all but the first and last rows (unless a \(p_i\) coincides with a centroid). With slight abuse of notation, \(A\) as a linear transformation from \(R^{m+1}\) to \(R^n\) satisfying \(p = A(c)\).

The map \(\phi(z \in R^m) \mapsto a \in R^n\) is defined by composition:
       \phi = A \cdot \iota
$$where in matlab notation we would write \(\iota(z) = [0;z]\). That is to say \(\iota\) is the trivial map from \(R^m\) to \(R^{m+1}\) that assumes the first coordinate is zero. Finally then, we pass the vector function \(\tilde{F}\) taking \(m\) arguments to a solver. The \(i\)'th coordinate \(\tilde{F}\) is given by
      \tilde{F}_i = D\left( q_i\left( \phi\left(z\right)\right), p_i \right)
This works surprisingly well. The intuition is that the sorted abilities \(\{a_i\}\) can be interpolated reasonably (and very roughly, vary with the \(\log\) of \(p_i\)) so long as we have enough knot points. It helps to place the knot points near population centers to reduce the amount of interpolation. As we vary \(z\) we are varying the locations of the knot points and forcing the \(a_i\)'s to move in a subspace, but since the interpolation is pretty good the subspace lies quite close to the solution.

And as noted, it is much faster. The figure below shows computation time as a function of the number of starters when we use subspaces of dimension \(5\) and \(15\). No code optimization has been attempted and undoubtedly, better numbers could be achieved in any number of ways. But the point is that using a subspace search alone results in a \(50\)x improvement or so for \(n \approx 200\). The improvement would of course be much greater, in relative terms, for \(n \gg 200\).

Computation time as a function of race size when using a subspace search
The error introduced by the subspace approximation will vary, of course, but as a rough guide the ratio of \(p_i\) to \(q_i\) differs from unity by about half a percent when using a subspace of dimension five, even when there are \(200\) runners. So a horse at 100/1 might be calibrated at 99.5/1. This error drops by another order of magnitude when using a subspace of dimension \(15\), whereupon the calibrated and desired probabilities are, as a practical matter, indistinguishable.

Possible improvements

For known distributions we ought to be able to use more accurate interpolation. Here I merely assumed that probabilities fall off as the exponential of ability difference. For very large races we should use the stability postulate instead, at least for those in the tail.

A generalization?

It's usually around this point where start to suspect there can't be anything new here. After all, any similar optimization problem where we have an ounce of intuition as to the solution that could be similarly addressed (i.e. by minimizing interpolation with clustering). Perhaps somebody could point me to method XYZ and conclude that the horse race solution presented here is merely an application of the same.

Monday, September 2, 2013

Quasi-random sampling subject to linear and moment constraints (or: "how to bypass model calibration")

Its a rainy long weekend in New York and I've been dusting off an old experiment while listening to the thunder at a friend's place. I was particularly impressed by the fact they had an outlet on their porch positioned right by a comfy chair.

Here's the motivation. In finance we sometimes find ourselves attempting to calibrate a probabilistic model for an underlying random variable (such as a stock price) subject to known constraints (such as the price of an option on the same). Often, we then draw samples from the same model for the purpose of pricing other securities, performing risk calculations or satisfying some other need.

\begin{equation} observed\ prices \rightarrow calibrated\ model \rightarrow random\ samples \end{equation} This has several advantages including the built in regularization we get by using a parametrized model. There is on obvious drawback however: one has to come up with the model!

A while ago I was motivated to skip the calibration step. I was inspired by the herding literature in machine learning, and I started playing with a more direct path: \begin{equation} observed\ prices \rightarrow random\ samples \end{equation}.The idea is that we devise quasi-random sampling schemes that automatically achieve the calibration while striving to fill out the space in a way that achieves some of the objectives of a calibrated model.

Best described by an example I feel. And in this one we consider samples on a discrete lattice where smoothness of the sampled distribution is controlled using the discrete Laplacian.


I will lead with the picture. The middle plot demonstrates that we are able to generate samples whose histogram is quite smooth, yet satisfy the constraints provided. In this case we've chosen a two dimensional state space, as might commonly occur when (strike, maturity) or say (attachment point, maturity) axes are in play.

Discrete Laplacian minimizing quasi-random samples (middle) approximating the true distribution (left). For comparison we also show a minimal surface (right) generated using the same constraints with direct optimization.
We've started with some fake distribution on the left, created some functions of the same and used those functions as constraints.

Aspirations and notation

We allow a lattice to be determined by axes \(x^{(1)}=\{x^{(1)}_1 >...> x^{(1)}_{n_1}\}\) and \(x^{(2)}=\{x^{(2)}_1 >... > x^{(2)}_{n_2}\}\). We suppose that \(F = \{f_k\}_{k \in K}\) is a collection of functions on the implied lattice whose means are known constants \(\mu_k\). That is,
\mu_k = E[f_k] \ \  \ \ k \in K.
with respect to some density that we wish to sample from. Specifically we want to draw samples \(z=\{z^i\}_{i=1}^I = \{(z^i_1,z^i_2)\}_{i=1}^I\) such that
 \hat{f_k} := \frac{1}{I}\sum_{i=1}^I f_k(z^i)
satisfies approximate equalities
  \hat{f_k}(z) \approx \mu_k, \ \ k \in K.
It will turn out that using equalities here involves no loss of generality. Indeed a cute feature of the algorithm exhibited here is that inequalities are treated in the same manner as equalities: we simply use the same function twice with the same weight \(w\) but different \(\mu\)'s. The lower \(\mu\) is interpreted as a lower bound and the higher as an upper bound.

Now for notational convenience we might as well consider \(f_k\) to be functions defined on the integer lattice \(\{1..n_1\} \times \{1..n_2\}\), identifying \(f(r,s) := f(x^{(1)}_r,x^{(2)}_s)\). We also identify samples \((z^{(1)}_r,z^{(2)}_s)\) with their integer pairs \((r,s)\). We denote by \(Z^i(r,s)\) the number of samples falling on the lattice point.

A fast algorithm for generating quasi-random samples

In what follows we will use a prior probability \(\phi(r,s). \) defined on the lattice. It might might be determined by minimization of the square of the discrete Laplacian subject to the equality and inequality constraints given, or by other means.

The idea is to start lobbing samples on the lattice (like stackable Go stones) and watch the distribution emerge. By monitoring the discrete Laplacian we can try to coerce the samples to fill in holes - smoothing things out as it were. By monitoring the expectation of the constraint functions \(f_k\) with respect to the pseudo-sample distribution thus far, we can correct for bias.

For this to be efficient we need a very efficient means of bookkeeping.


At the outset we fix a constant lattice function \(A_k(r,s) = signum \left( \mu_k - f(r,s) \right) \) for \(k \in K\). We initialize the laplacian \(D(r,s)=0\) for all lattice points \((r,s)\). We initialize counters \(S^k_0 = 0\) for \(k \in K\).

Selecting the next quasi-random sample:

Beginning with quasi-random sample \(i=1\) we perform the following steps.

  1. Set urgency \(\nu(r,s) = D(r,s)\) then for \(k \in K\) update it by \( \nu(r,s) \rightarrow \nu(r,s) + w_k A_k(r,s) signum( S^k - i\mu_k ) \)
  2. Choose the most urgent lattice point after weighting: \( (r_i,s_i) = \arg \max \nu(r,s) \phi(r,s)\)
  3. Increment the sample count for the lattice point:  \(Z(r_i,s_i) \rightarrow Z(r_i,s_i) + 1 \)
  4. Decrement \(D(r_i,s_i) \rightarrow D(r_i,s_i) -1 \) then increment \(D(r^*,s^*) \rightarrow D(r^*,s^*) + \frac{1}{4} \) for all lattice points \(r^*,s^*\) neighbouring \(r_i,s_i\). 
  5. Update \( S^k \rightarrow S^k + f^k(r_i,s_i) \). 

What we are doing here is selecting a new point that improves as many of the biases as possible, subject to the discrete Laplacian not getting out of whack.

Selecting the next quasi-random sample (slightly more efficient):

It turns out we can be a little sneakier and initialize \( \nu(r,s) = \phi(r,s) \). Then the following algorithm  is equivalent and somewhat more efficient.
  1. Choose the most urgent lattice point after weighting: \( (r_i,s_i) = \arg \max \nu(r,s) \phi(r,s)\)
  2. Increment the sample count for the lattice point:  \(Z(r_i,s_i) \rightarrow Z(r_i,s_i) + 1 \)
  3. Update \(S^k = S^k_{prev} + f^k(r_i,s_i) \)
  4. For \(k \in K\) update \( \nu(r,s) \rightarrow \nu(r,s) + w_k A_k(r,s) \left( signum(S^k-i \mu_k) - signum(S^k_{prev} - (i-1) \mu_k \right) \)
  5. Decrement \(\nu(r_i,s_i) \rightarrow \nu(r_i,s_i) -1 \) then for all four neighbours \((r^*,s^*)\) of \((r_i,s_i)\) increment \( \nu(r_i,s_i) = \nu(r_i,s_i) + \frac{1}{4} \). 
  6. Set \(S^k_{prev} = S^k\). 
Notice that we don't really need to consider the discrete Laplacian and the moment terms separately, since we only care about their direct tradeoff.

A minor practical tweak

It seems to help to use a tuning parameter \(\lambda\) that weighs the relative importance of satisfying the constraints \(f_k\) versus minimizing the discrete Laplacian. Left as an exercise for the interested reader.