Financial Mathematics

Filtering bond and credit default swap markets

2016-04-20T07:04:00.003-07:00

I presented this talk at the NYU-Tandon faculty seminar. I covered my own work on the rudimentary framework for Benchmark Solutions pricing service (now BMRK on Bloomberg) but did not go into details on all the other efforts of everyone involved. Some of the material has already been covered in my blog post on Kalman sensitivities via the adjoint state method.

Domino dancing. A medium data research workflow with Ruffus on AWS.

2015-12-14T16:22:00.001-08:00

I don't normally blog about technology because frankly, it is too much like work. But I have been trying Domino data labs of late, a nice product by a team mostly out of Bridgewater who clearly have good experience trying to make life easier for quants. They have built a research platform most readily used with Amazon compute but with the option of an in-house install. I certainly feel their efforts are worthy of note.

I don't really know what a data science platform should be necessarily, but perhaps it is a little like picking a dance partner: they have to know when to get out of the way in order for things to move effortlessly. That was a quality I found lacking in both in-house and competing platforms that seem to assume you are trying to step in a certain direction and in the process, inadvertently make it difficult to do simple things you feel you have already mastered. In short, they feel a bit like dancing with ... err ... me.

My workflow

I'll describe my setup for context, though I doubt there is anything atypical about my needs. I had already decided on Amazon over some in-house technology, incidentally, for several key reasons including the on-demand capability and availability of an attached file system. Another preference was use of a single beefy machine rather than many small ones because some of my steps require all the data together. With 8 GPU machines standard, and a 100 processor option coming in 2016 it is hard to see the merit in administering my own box.

My econometric/algo work (a.k.a. data science these days) pushes thousands of data files through a few dozen processing steps. After running local tests I sync both data and code using the Domino client to AWS, then kick off a job there. When the job is done, I simply sync again and examine the results locally.

I use the free python library Ruffus to create a non-blocking pipeline and handle the splitting and recombining of data files therein. I can't show you my actual computation graph as it is proprietary, so here's a stock image from ruffus instead:

Imagine the same with a few dozen steps and you get the idea. Much of what I do lives on one or more of these growing DAGs.

Serving results behind a REST API

The nodes on my computation tree comprise between one and a few thousand files of the same type. The leaves comprise summary tables (sometimes documentation) for this and that. Some output becomes lookup tables for REST API's deployed on Amazon. Domino provides simple version management for the code and data serving those endpoints, making this effortless.

One merely provides the entry point: in my case a plain python function to respond to the request (no wrapping/unwrapping). At present Domino only allows a single entry point but that is hardly a restriction. I merely register the following dispatcher function to automatically exposes every other function I drop in the same module.

That's REST, pretty much for free.

Monitoring runs

No complaints here. Domino presents your runs to colleagues alongside run diagnostics such as the following.

It could be slightly easier to arrange the results, arguably, but one way around that is to maintain your own markdown index page (README.md) and point users to where you really want them to be. As a minor side point I discovered that the markdown rendering was just slightly non-standard - but I'm sure that will be knocked off the issue list by the time you read this.

Serving richer results

By default Domino tracks the new files produced by your runs and renders them to collaborators browsing the results. If you want to serve up richer content than a pile of .csv's you can have your run create an .html file and use javascript. As suggested by one of the Domino developers, here is one way of adding sortability to output tables:

Thanks John. The same goes for plots. Need something fancy? Just add the package to your requirements.txt and it will be installed prior to the run starting.

Reproducible research

In a sense Domino is pioneering what truly reproducible research might look like. I made a very poor stab at the same vague concept many years back so I am sympathetic as to the difficulties. (That effort lived on this site incidentally, and was so dismal I'd forgotten it myself until just now.)

There is an opportunity. While the workflow for developing code is obviously well established in source control programs and accompanying verbiage, the organizing of code, data and intermediate temporary data tends to be more bespoke. The Domino starting point, philosophically, is that code and data be kept in sync and this ensures absolute reproducibility of results. But they have not baked any religious assumptions into the product and one is free to break that symmetry. I'll get to why it helped me to do so a little later.

Certainly the major annoyances (code sharing with package management) and organizational features (basic comment tracking etc) are cleanly solved if your colleagues are also using Domino. One points them to a link to start the run that reproduces your work, or they can select from a menu on your recent runs.

Transparent research

While multi-directional collaboration is a marketing point for Domino, I find that the weaker variety of this (call it transparency) is a sufficient motivation. There is certainly no barrier here for non-technical end-users who want to re-run your research with different parameters and then compare the results. And that comparison is straightforward. One selects two finished runs in the run history and selects the "compare" checkbox to get the full diff.

Note however, that if you keep your code in the same project this comparison will include all your source changes. If you don't want that for yourself or your readers, a simple hack here is to use .dominoresults to hide your code and echo major meta-parameters to their own output file in lieu. Then the Domino presentation of your backtest results (say) is actually very helpful.

I'm reasonably happy with the level of transparency that the Domino platform provides. I feel it will replace a fair amount of presentation preparation (I'm a bit lazy on that front). I maintain a stable version of data and code for this purpose - or at least I aspire to - so that others can drill down into very large runs. One might hope that the days of manually converting piles of data into long powerpoint presentations are behind me, though that might be optimistic.

Perhaps an area for improvement or enhancement of the Domino platform would involve documentation. Some support for Latex generation of documents for instance.

Separating code and data again

As my data sets grew, I ultimately found it more convenient to use a separate project for code and several different projects for data. There are distinct categories of data living remotely and locally. Some data needs to be synced between local and remote instances to make it easier to run quick experiments. Some data is merely temporary. Some data should only ever live on Amazon due to it's size.

Then there is the pragmatic concern with code. I personally didn't want the possibility of large globs of data corrupting my defacto code repo. Nor was there any need for the bulk of my code to be shared with those wanting to make minor hacks to the results.

These concerns are addressed. The Domino team was very helpful and introducing read-only sharing between projects. Thereafter I moved to a pattern of "code only" and "code-light data heavy" projects. In my project holding source code I allow read-only export to the other projects.

To minimize syncing of large files (and lots of files) to amazon I use a chain of Domino data projects. My medium-data flow is as follows:

Ingest data locally to ~/zipped_data_project/
Sync the compressed data to Amazon with Domino client
Unzip on Amazon to a /mnt//unzipped_data_project/... by allowing this second project to import files from /zipped_data_project, (and also from the source project containing the script in question and the rest of one's code).
Allow a results project /mnt//endpoint_project to read the unzipped data and run the pipeline, creating result files and lookup tables. Use .dominoignore config to decide what is worth keeping.
Publish the endpoint.

I've leaving out source control & backup since there are many ways to do that. I use my own hacks to ensure I always backup right before syncing, because very occasionally one may need to recover from a git issue under the hood. The implicit version control achieved by syncing with the Domino client is, for me, more of a secondary line of defense.

I suspect some future improvements to Domino might obviate the minor inconvenience of the zip/unzip shuffle. And some of my issues there relate mostly to sub-par connectivity.

Overall impression

Sweet. Hopefully not sweet as in the Pepsi versus Coke trick (i.e. ask me again when I've drunk the whole can) but even if it is regarded as mere sugar for Amazon compute, Domino data labs seems to me to represent good, very reasonably priced sugar that doesn't get in the way of itself or you. It isn't trying to replace anything, just bring together a few critical ingredients like git, docker, jupyter, python/R/Julia package management and some friends from the Apache family.

The Domino team has cut the time for basics like REST APIs, job management, launchers for non-technical users, scheduling, sensible run management and results rendering. It is drawn together seamlessly with Scala (okay the fact I know that is a tell. But let's say 99% seamless with only the occasional stack trace finding its way to the user :-). You don't need a separate AWS account and overall, it is very simple.

Many quants or supporting groups might no doubt build their own versions of the same but I've come to realize that pulling all this together, or even a poor man's version, can easily fall foul of Hofstadter's Law. It is a little too soon to say that Domino have commoditized this area, but they seem to be off to a good start and with a decent size engineering team in place, I expect Domino to get better. As a quant, that's good news. Liberating even.

Costs seem to be well below internal costs. And my recommendation is that even for hobby projects involving only python, R and/or Julia one might want to consider whether pricing like the following is unreasonable when it obviates a great deal of messing around.

That's my 2 cents for now (or my 9.3 cents as the case may be). For those who are interested I've written up a few more hacks in this followup post.

Some special cases of the Lambert W function

2014-10-29T10:31:00.002-07:00

Suppose $f(x)$ is given by the functional relation \begin{equation} e^{-a(x) f(x) + b(x)} = c(x) f(x) + d(x) \end{equation} then \begin{equation} f(x) = \frac{1}{a(x)}W\left( \frac{a(x)}{c(x)} e^{b(x)+\frac{a(x)d(x)}{c(x)} } \right) - \frac{d(x)}{c(x)} \end{equation} where $W$ is the Lambert W Function. You can click through to Wikipedia for a proof. Here I merely list some special cases which every good citizen should instantly recognize (and it can be mildly tedious to reproduce them).

Recovering the definition of $W(x)$

Just to check we are on the right planet set $a=1$, $c=1$, $b=\ln(x)$ and $d=0$ to get \begin{equation} x e^{-f(x)} = f(x) \end{equation} or equivalently $ f(x)e^{f(x)} = x$ which is the definition of the Lambert W function. Sure enough those substitutions recover $$ f(x) = W(x) $$ as expected. A minor modification uses $a(x)=1$, $c(x)=1$, $b(x)=0$ but leaves $d(x)$ general.

The solution to $ e^{-f(x)} - f(x) = d(x) $

is \begin{equation} f(x) = W \left( e^{d(x)} \right) \end{equation} Only slightly more generally, set $a(x)=a$, $b=\ln(g(x)), c=1, d=1$ for those cases where $g(x)$ is on the wrong side, as it were.

The solution to $ e^{-af(x)} = f(x) g(x) $

or equivalently $ f(x) e^{af(x)} g(x) = 1 $ is \begin{equation} f(x) = \frac{1}{a} W\left( a g(x) \right) \end{equation} In particular,

The solution to $ x^k e^{-af(x)} f(x) = 1 $

which seems to crop up a fair bit for your author is \begin{equation} f(x) = \frac{1}{a} W\left( a x^{-k} \right) \end{equation}

Similarly setting $a=-1$ ...

The solution to $ x e^{f(x)} f(x) = 1 $

(i.e. where $x$ is on the wrong side but we otherwise have the Lambert W definition) must be \begin{equation} f(x) = - W\left( -\frac{1}{x} \right) \end{equation} We might also take $b = \ln(g(x))$ and thus

The solution to $ g(x) e^{-af(x)} = f(x) $

or equivalently $ f(x)e^{af(x)} = g(x) $ is \begin{equation} f(x) = \frac{1}{a} W\left( a g(x) \right) \end{equation} which reduces to the Lambert W function for $a=1$, as we expect. It is also pretty obvious from first principles, because if we multiply both sides by $a$ we have \begin{equation} (af) e^{(af)} = ag \end{equation} and thus $af = W(ag)$. Next suppose we want a power of $f$ to appear. Let $b=\beta(x)/k$, $c = \gamma^{1/k}$, $a = k/\alpha$ and $d=0$. And then raise both sides to the power $k$. It follows that...

The solution to $ e^{\alpha f(x) + \beta(x)} = \gamma f(x)^k $

is \begin{equation} f(x) = \frac{\alpha}{k}W\left( \frac{k e^{\frac{1}{k}\beta(x)} }{\alpha \gamma^{1/k}} \right) \end{equation} and if, in particular, $\beta(x) = -\ln(g(x))$ then

The solution to $ e^{\alpha f(x)} = g(x) \gamma f(x)^k $

is \begin{equation} f(x) = \frac{\alpha}{k}W\left( \frac{k }{\alpha g(x)^{1/k} \gamma^{1/k}} \right) \end{equation} and if we take $c = \frac{1}{\gamma}$ and $g(x)=x$ and $\alpha = \frac{s}{2}$ and $k=2$ then p>

The solution to $ e^{-\frac{s}{2}f(x)} x f(x)^2 = c $

is \begin{equation} f(x) = \frac{s}{2k}W\left( \frac{2k}{s} \sqrt{ \frac{c}{x}} \right) \end{equation}

A new use for the number 59575553 (ISDA shorten's 20 character Legal Entity Identifiers to 10 character Unique Trade Identifier Prefixes)

2014-08-13T11:09:00.001-07:00

The International Swaps and Dealers Association (ISDA) faced a minor problem recently. In shortening their 20 character Legal Entity Identifiers (LEI's) into 10 character Universal Trade Identifier (UTI) prefixes they ran into collisions.

Somewhat randomly this came across my desk. So under proposal is an improvement for hashing LEI's into UTI prefixes: the modulus operation lifted from integers to the space of case insensitive alpha-numeric strings.

You can grab the python code here. Of course it's hardly a new idea.

Sensitivities of a Kalman Filter estimate with respect to all past observations

2014-05-28T17:45:00.000-07:00

Consider a sequence of observations $t_1,...,t_k$ at which a latent vector process $x$ is observed indirectly, via an observation equation \begin{equation} y_{t_i} = H_i x_{t_i} + \epsilon_i \end{equation} We assume $\epsilon_i$ is mean zero multivariate gaussian with covariance $R_i$. For brevity we refer to $y_{t_i}$ as $y_i$, $x_{t_i}$ as $x_i$ and so forth. We assume the evolution of $x$ in between the times specified can be written \begin{equation} x_{i+1} = A_i x_i + u_i \end{equation} where $u_i$ are also gaussian. In this linear gaussian system the recursive estimation of $x_t$ is achieved by the well known Kalman filter, and the contemporaneous impact of the next observation $y_{k+1}$ is also (it is merely proportional to the Kalman gain).

But less well appreciated is a related computation, the derivatives of the Kalman filter estimate with respect to a past observation $y_i$. This note establishes how said computation can be achieved by making two observations. The first is a re-representation of the Kalman estimate in the form of a weighted least squares problem (not dissimilar to the Duncan Horn representation). The second observation is that sensitivities of any weighted least squares problem can be computed using an adjoint trick.

Step 1: The Kalman filter solution as a (particular) least squares problem

We shall set up a least squares problem involving the current state $x_k$ only, and all the previous observations. We argue that the solution to this problem is identical to the Kalman filter. Since the current estimate $\hat{y}_k$ is a simple linear function of the current state $x_k$, this allows us to compute the derivative of the current estimate with respect to all previous observations.

In the Kalman filter we assume a gaussian prior on the initial state $x_0$. This can introduce annoying special cases in what follows, but we can clean up the notation instead by introducing: \begin{eqnarray} y_{-1} & = & H_{-1} x_{-1} + \epsilon_{-1} \\ x_0 & = & A_{-1} x_{-1} + u_{-1} \end{eqnarray} provided $H_{-1}$ and $A_{-1}$ are identity matrices, $\epsilon{-1}$ is identically zero, $y_{-1}$ is set equal to the mean of our prior and $u_0$ adopts its covariance. With the boundary conditions cleaned up in this fashion we can invert the dynamical equations, assuming only that $A$'s have left inverses $A^{-1}$, as follows: \begin{equation} x_j = A^{-1}_{j}\left( x_{j+1} - u_j \right) \end{equation} and then re-arrange the observation equations so that the only value of $x_i$ that appears is $x_k$. \begin{eqnarray} y_k & = & H_k x_{k} + \epsilon_k \\ y_{k-1} & = & H_{k-1} x_{k-1} + \epsilon_{k-1} \\ & = & H_{k-1} \left( A^{-1}_{k-1}\left( x_{k} - u_{k-1} \right) \right) + \epsilon_{k-1} \\ & = & H_{k-1} A^{-1}_{k-1} x_{k} - H_{k-1} A^{-1}_{k-1} u_{k-1} + \epsilon_{k-1} \\ y_{k-2} & = & H_{k-2} x_{k-2} + \epsilon_{k-2} \\ & = & H_{k-2} \left( A^{-1}_{k-2}\left( x_{k-1} - u_{k-2} \right) \right) + \epsilon_{k-2} \\ & = & H_{k-2} A^{-1}_{k-2} x_{k-1} - H_{k-2} A^{-1}_{k-2} u_{k-2} + \epsilon_{k-2} \\ & = & H_{k-2} A^{-1}_{k-2} \left( A^{-1}_{k-1}\left( x_{k} - u_{k-1} \right) \right) - H_{k-2} A^{-1}_{k-2} u_{k-2} + \epsilon_{k-2} \\ & = & H_{k-2} A^{-1}_{k-2} A^{-1}_{k-1} x_{k} - H_{k-2} A^{-1}_{k-2} A^{-1}_{k-1} u_{k-1} - H_{k-2} A^{-1}_{k-2} u_{k-2} + \epsilon_{k-2} \\ & \dots & \end{eqnarray} from which it is apparent that if we write $Y = (y_k, y_{k-1}, y_{k-2},...,y_{-1} ) $ then \begin{equation} Y = G x_{k} + \eta \end{equation} where $G$ is the concatenation of the coefficients of $x_k$ given above and $\eta$ is the gaussian random variable equal to the sum of $u_k$'s and $\epsilon_k$'s (again, with coefficients as above, leading to a non-trivial covariance structure).

Step 2. (A review of the adjoint trick)

Suppose $x$ solves $Qx = b(y)$. The adjoint trick can be used to compute the derivative of $g(x)$ w.r.t. y. In particular, if $y$ is the observation and $x$ the solution of a generalized least squares problem with error covariance $R$ we can cast it in this form by writing: \begin{eqnarray} g(x)& = & H x Q & = & H^T R^{-1} H \\ b(y) & = & H^T R^{-1} y \end{eqnarray} Consider now \begin{equation} f(x,y) = 0 \end{equation} where \begin{equation} f(x,y) = Q x - b(y) \end{equation} We use derivatives of \begin{equation} \tilde{g} = g - \lambda^T f(x,y) \end{equation} with respect to $y$ as a means of computing derivatives of $g$ with respect to $y$. Note that \begin{equation} \frac{\partial \tilde{g}}{\partial y} = \frac{\partial g}{\partial x}\frac{\partial x}{\partial y} - \lambda^T \left( \frac{ \partial f }{\partial x }\frac{\partial x}{\partial y} + \frac{\partial f}{\partial y} \right) \end{equation} and this will simplify if we choose $\lambda$ judiciously as a solution of \begin{equation} \frac{\partial g}{\partial x } = \lambda^T \frac{\partial f}{\partial x} \end{equation} which we call the adjoint equation. For then \begin{eqnarray} \frac{\partial \tilde{g}}{\partial y} & = & \frac{\partial g}{\partial x}\frac{\partial x}{\partial y} - \lambda^T \left( \frac{ \partial f }{\partial x }\frac{\partial x}{\partial y} + \frac{\partial f}{\partial y} \right) \\ & = & -\lambda^T \frac{\partial f}{\partial y} \\ & = & \lambda^T \frac{\partial b}{\partial y} \end{eqnarray} Now specializing to \begin{equation} g(x) = H x \end{equation} and $b(y)$ as above we can solve for this convenient choice of $\lambda$ by writing \begin{eqnarray} H & = & \frac{\partial g}{\partial x} \\ & = & \lambda^T \frac{\partial f}{\partial x} \\ & = & \lambda^T Q \\ & = & \lambda^T H^T R^{-1} H \end{eqnarray} where the second equality is the adjoint equation. It should be clear from this how to compute derivatives of $\tilde{g}$ with respect to $y$, and thereby compute derivatives of $g$ with respect to $y$.

A game played on an Ornstein-Uhlenbeck bridge

2014-01-13T21:07:00.002-08:00

Let $ X_t $ denote the Ornstein-Uhlenbeck process \begin{equation} dX_t = -\kappa X_t dt + \sigma dW_t \end{equation} where $X_0=b$ is assumed known. Let $Y_t = \exp(X_t)$ and without serious loss of generality, choose $\kappa$ and $\sigma$ so that the unconditional distribution of $X_t$ is unit variance - that is to say $\frac{\sigma^2}{2\kappa} = 1$.
We play a two period game in which we choose a time $t_1$ and then, after $Y_{t_1}$ has been revealed to us (or equivalently $X_{t_1}$) we choose a second time $t_2$. We seek to maximize a utility function \begin{equation} U = c Y_{t_1} + Y_{t_2} \end{equation} We can evidently assume $t_1>0$ without loss of generality. The coefficient $c$ is the relative importance of the first period payoff versus the second, and for concreteness one might set $c=1$.
A time coordinate is used merely to make the mathematical representation more familiar. It does not represent a temporal choice, necessarily, but rather an abstraction of a more general search space (such as a location to drill for oil, a choice of marketing strategy, or a technology product). The process $Y_t$, unknown to us, represent the yet to be discovered variation in revenue as we move across the search space.
We seek to analyze how much exploration one should perform, versus sticking to what we know. Intuitively, if the initial point $X(0)=b$ (or $Y(0)=e^b$) is very high we might not want to risk any departure. Whereas if $b$ is less than zero we're better off forgetting it altogether (that is, choosing $t_1 \rightarrow \infty$). In what follows, we sharpen this intuition.

Lemma A: The single period game.

Conditioned on $X_0 = b$ the mean and variance of $X_t$ are given by \begin{eqnarray} \mu(t) & = & b e^{-\kappa t} \\ \nu(t) & = & \frac{ \sigma^2 }{2 \kappa } \left( 1 - e^{-2 \kappa t} \right) \end{eqnarray} So using our simplification $\frac{\sigma^2}{2\kappa} = 1$ we ought to choose \begin{eqnarray} t_1^* & = & \arg \max E[ e^{X(t)} ] \\ & = & \arg \max \log E[ e^{X(t)} ] \\ & = & \arg \max \left( \mu(t) + \frac{1}{2} \nu(t) \right) \\ & = & \arg \max \left( b \lambda_t + \frac{1}{2} ( 1 - \lambda_t ^2 ) \right) \\ & = & \arg \max -\frac{1}{2} \left( \lambda_t - b \right)^2 + \frac{1}{2} \left( b^2 + 1 \right) \end{eqnarray} for $\lambda_t = e^{-\kappa t } $ taking values in $ [0,1]$. This has qualitatively different behaviour depending on which of three regions $b$ falls into. The three possibilities are tabulated below.

Region	$\lambda^*_t$	Optimal $t$	$ \log E[ e^{X_t} ] $	Strategy
$b < 0$	0	$\infty$	$ \frac{1}{2}$	"Reset"
$0 < b < 1$	b	$-\frac{1}{\kappa} \log( b )$	$\frac{1}{2}\left( b^2+1 \right) $	"Explore"
$ b > 1$	1	$0$	$b $	"Stay"

Thus the one period problem has utility $ \zeta(b)$ where \begin{equation} \zeta( x ) = \left\{ \begin{array}{cc} e^{\frac{1}{2}} & x < 0 \\ e^{\frac{1}{2} \left( x^2+1 \right)} & 0 \le x \le 1 \\ e^x & x > 1 \end{array} \right. \end{equation} We denote the corresponding optimal time $t_1$ by $ \pi^{(1)}( b ) $. Namely: \begin{equation} \pi^{(1)}( b ) = \left\{ \begin{array}{cc} \infty & b < 0 \\ -\frac{1}{\kappa} \log( b ) & 0 \le b \le 1 \\ 0 & b > 1 \end{array} \right. \end{equation}

Lemma B: Optimizing over "outside" strategies.

For brevity we denote two period strategies by a decision function $ \pi( x; t, b ) $ that returns the second time choice $ t_2 = \pi( X(t_1); t, b ) $ after $X(t_1)$ is revealed. A simple one parameter family of strategies, indexed by the choice $t_1$, is given by \begin{equation} t_2 = \pi^{(2)}( x ; t_1, b ) = \left\{ \begin{array}{cc} - \pi^{(1)}(b) & x < b \\ t_1 + \pi^{(1)}(b) & x > b \end{array} \right. \end{equation} We call these the "outside" strategies because we assume the best choice for $t_2$ lies outside the open interval $ (0, t_1) $. If $X(t_1)$ is greater than our starting point $ X(0)=b $ we will never move left, but instead use the one period solution to move right (or stay). On the other hand if the first point is revealed to be lower than our starting point we explore, if anywhere, to the left of $0$ instead, again using the one period solution.
The evaluation of this strategy amounts to integration of the one period utility against the (gaussian) distribution of $X(t_1)$. For instance if $0 < b < 1$. \begin{eqnarray} L( \pi^{(2)} ) & = & c E[ e^{ X_{t_1} } ] + E[ e^{ X_{t_2} } ] \\ & = & c E[ e^{ X_{t_1} } ] + P( X_{t_1} < b ) \zeta( b ) + P( X_{t_1} > b ) E[ \zeta( X_{t_1} ) | X_{t_1} > b ] \end{eqnarray} We've already noted that the first term is just $ c e^{ \mu(t_1) + \frac{1}{2} \nu(t_1) } $. But since $\log \zeta(x) $ is piecewise quadratic, the others are also integrals of the type \begin{eqnarray} I(\mu,\sigma;x_1,x_2;a_0,a_1,a_2) := \frac{1}{\sqrt{2 \pi} \sigma } \int_{x_1}^{x_2} e^{-\frac{1}{2}\left( \frac{x-\mu}{\sigma} \right)} e^{a_0 + a_1 x + a_2 x^2} dx \end{eqnarray} admitting analytical solution. It is sufficient to note the following equalities, each achieved by substitution. \begin{eqnarray} \frac{1}{\sqrt{2 \pi} \sigma } \int_{x_1}^{x_2} e^{-\frac{1}{2}\left( \frac{x-\mu}{\sigma} \right)^2} e^{a_1 x + a_2 x^2} dx & = & \frac{1}{\sqrt{2 \pi} } \int_{x_1/\sigma}^{x_2/\sigma} e^{-\frac{1}{2}\left( u-\mu/\sigma \right)^2} e^{a_1 \sigma x + a_2 \sigma^2 u^2} du \\ \frac{1}{\sqrt{2 \pi}} \int_{x_1}^{x_2} e^{-\frac{1}{2}\left( x-\mu \right)^2} e^{a_1 x + a_2 x^2} dx & = & \frac{1}{\sqrt{2 \pi}} \int_{x_1-\mu}^{x_2-\mu} e^{-\frac{1}{2}u^2} e^{ (a_1 \mu + a_2 \mu^2) + (a_1 + 2 a_2 \mu^2) u + a_2 u^2} du \\ \frac{1}{\sqrt{2 \pi}} \int_{x_1}^{x_2} e^{-\frac{1}{2}x^2 } e^{a_1 x + a_2 x^2} dx & = & \frac{1}{\sqrt{p}} \frac{1}{ \sqrt{2 \pi} } \int_{ x_1 \sqrt{p} }^{ x_2 \sqrt{p} } e^{ -\frac{1}{2}u^2 } e^{ \frac{a_1}{\sqrt{p}} u } du \\ \frac{1}{\sqrt{2 \pi}} \int_{x_1}^{x_2} e^{-\frac{1}{2}x^2 } e^{a_1 x} dx & = & e^{\frac{1}{2}a_1^2} \frac{1}{ \sqrt{2 \pi} } \int_{ x_1-a_1 }^{ x_2 - a_1 } e^{ -\frac{1}{2}u^2 } du \end{eqnarray} In the second to last equality $ p = 1 - 2 a_2 $ and validity requires $ p > 0 $.

Lemma C: If inside strategies are never optimal in the symmetric case, they are never optimal.

For simplicity we'd like to assume what would seem to be the critical case, $X(t_1)=b$. And we shall indeed show that irrespective of $b$ we never want to choose $t_2 \in (0,t_1)$. Intuitively it should also be true that we never wish to choose an inside strategy in the asymmetric case either. To tidy up this dangling thread and establish some formulas we'll need, we set $ X(t_1)=d $. We may assume $ d < b $ without loss of generality (otherwise exchange the roles, remembering that we shall establish our coming result for all values of $ b $ ). Now suppose there is a point $ t_2 \in (0,t_1)$. Using the unconditional means and variances used in the one period problem we apply Bayes Rule to find the conditional density on the Bridge: \begin{eqnarray} \rho( x; t_2 ) & \propto & e^{ -\frac{1}{2} \left( \frac{ x - b e^{-\kappa t_2} } { \sqrt{ 1-e^{-2\kappa t_2} } } \right)^2 } e^{-\frac{1}{2} \left( \frac{ x e^{-\kappa(t_1-t_2) } - d } { \sqrt{ 1-e^{-2\kappa (t_1-t_2) } } } \right)^2 } \\ & = & e^{-\frac{1}{2} \frac{ \left( x - b \lambda_2 \right)^2 } { 1-\lambda_2^2 } } e^{-\frac{1}{2} \frac{ \left( x \lambda - d \right)^2 } { 1-\lambda^2 } } \\ & = & e^{ -\frac{1}{2}\left( g_2 x^2 - 2 g_1 x + \dots \right) } \end{eqnarray} where $ \lambda_2 = e^{-\kappa t_2}$, $ \lambda = e^{-\kappa (t_1-t_2)} $ and matching coefficients of $x^2$ and $x$ in the exponent respectively we find \begin{eqnarray} g_2 & = & \frac{1}{1-\lambda_2^2} + \frac{\lambda^2}{1-\lambda^2} \\ g_1 & = & \frac{b \lambda_2}{1-\lambda_2^2} + \frac{d \lambda}{1-\lambda^2} \end{eqnarray} Now the gaussian conditional density \begin{equation} \rho(x) \propto e^{ -\frac{1}{2}\left( g_2 x^2 - 2 g_1 x + \dots \right) } = e^{ -\frac{1}{2} g_2 \left( x - g_1/g_2 \right)^2 + \dots } \end{equation} evidently has precision $g_2$ and mean $g_1/g_2$. The precision (i.e. $1/variance^2$) is independent of $d$ whereas the conditional mean is increasing in $ d $. It follows that if the low side of the bridge were raised, the inside option would become more attractive.

Lemma D: The "outside" strategy is no worse than backtracking to the middle of the bridge

From the one period problem and the additive nature of the payoff we know that all other strategies with $ t_2 \ge t_1 $ or $ t_2 < 0 $ are worse. So specializing the formula for the conditional mean and precision given above to the case $d=b$ we write the conditional mean and variance as \begin{eqnarray} \mu^c(t) & = & \frac{ \frac{b \lambda_2}{1-\lambda_2^2} + \frac{b \lambda}{1-\lambda^2} } { \frac{1}{1-\lambda_2^2} + \frac{\lambda^2}{1-\lambda^2} } = b \frac{ \lambda_2(1-\lambda^2) + \lambda(1-\lambda_2^2) } { 1-\lambda_2^2 \lambda^2 } = b \frac{ \lambda_2 + \lambda}{ 1 +\lambda \lambda_2}\\ \nu^c(t) & = & \left( \frac{1}{1-\lambda_2^2} + \frac{\lambda^2}{1-\lambda^2} \right)^{-2} = \frac{ (1-\lambda_2^2)^2 (1-\lambda^2)^2 } { ( 1- \lambda_2^2 \lambda^2 )^2 } \end{eqnarray} In the case $\lambda_2 = \lambda$, which is the middle of the bridge, this simplifies further \begin{eqnarray} \mu^c(\lambda_2=\lambda) & = & b \frac{2\lambda}{1+\lambda^2} < b \\ \nu^c(\lambda_2=\lambda) & = & \left( \frac{ 1-\lambda^2 }{ 1+ \lambda^2} \right)^2 \end{eqnarray} But there is an outside point with $\tilde{\lambda} = \frac{2\lambda}{1+\lambda^2}$ we might choose instead. This, by construction, will have the same mean $b \tilde{\lambda}$. We notice it also has the same variance: \begin{eqnarray} \nu(\tilde{\lambda}) & = & 1- \tilde{\lambda}^2 \\ & = & \frac{ (1 + \lambda^2)^2 - 4 \lambda^2 }{ (1 + \lambda^2)^2 } \\ & = & \frac{ (1 - \lambda^2)^2 }{ (1 + \lambda^2)^2 } \\ & = & \nu^c(t). \end{eqnarray} Thus we have established an outside point with equivalent utility.

Exhibit E: Pictures looking for proofs.

Returning to the symmetric case $ X(t_1)=b $ we ask is there any value for $t_2 $ inside the interval $ (0,t_1) $ that is a better choice that going outside the bridge (and using the optimal one period solution). To put it another way, is there any value for $\lambda$ for which there is any choice of $\lambda_2$ such that \begin{equation} D(\lambda_2,\lambda,b) = \log \xi(b) - \psi(b) < 0 \end{equation} where $ \psi(b) = \mu^c(t) + \frac{1}{2} \nu^c(t) $ ? Or, using the formulas above, is there a combination $\lambda_2, \lambda$ for which \begin{equation} \log \xi(b) - \left( b \frac{ \lambda_2 + \lambda}{ 1 +\lambda \lambda_2} + \frac{1}{2} \frac{ (1-\lambda_2^2)^2 (1-\lambda^2)^2 } { ( 1- \lambda_2^2 \lambda^2 )^2 } \right) < 0 ? \end{equation} Here is a picture of the difference between the outside option and inside options in the case $b=1.5$ that would suggest the difference is always positive:

Here is another, for the case $b=1.5$.

Lemma F: For $b \in (0,1)$ the function $S(u) = D(\lambda_2=u,\lambda=u;b)$ satisfies $S(u)\ge0 $ for $u \in (0,1)$ with a single zero at $u=\frac{1-\sqrt{1+b}}{b}$

We consider the middle of the symmetric bridge once more, this time varying the width of the bridge. We claim that for $b \in (0,1)$ there is a particular width bridge for which we are indifferent as to whether we choose the middle of the bridge (i.e. $t_2=\frac{1}{2} t_1$ or the optimal outside solution. Specializing to this case implies $\lambda_2=\lambda$ so we write $S(u) = D(\lambda_2=u,\lambda=u;b)$. From the formula for $D$ given above we have \begin{eqnarray} S(u) &= & \frac{1}{2}(1+b^2) - \left( b \frac{ 2u}{ 1 + u^2} + \frac{1}{2} \frac{ (1-u^2)^4 } { ( 1- u^4 )^2 } \right) \\ &= & \frac{1}{2}(1+b^2) - \rho(u;b) \end{eqnarray} where \begin{equation} \rho(u;b) = b \frac{ 2u}{ 1 + u^2} + \frac{1}{2} \left( \frac{1-u^2}{1+u^2} \right)^2 \end{equation} A little algebra shows that \begin{equation} \frac{ \partial }{\partial u} \rho(u;b) = \frac{ 2(u^2-1)(bu^2-2u+b)}{ (1+u^2)^3} \end{equation} which is zero for any solution of $b u^2-2u+b=0$. If $b<1$ we do indeed want the root in $0,1$ given by \begin{equation} u = \frac{1-\sqrt{1+b}}{b} \end{equation} So substituting $u^2 = \frac{2u-b}{b}$ back into $\rho$ yields, after a little simplification: \begin{equation} \rho\left(u=\frac{1-\sqrt{1+b}}{b};b\right) = \frac{1}{2}(1+b^2) \end{equation} which we recognize as precisely the value of the one period problem. This shows that \begin{eqnarray} S(u) &= & \frac{1}{2}(1+b^2) - \rho(u;b) \\ & \ge & \frac{1}{2}(1+b^2) - \max \rho(u;b) \\ & = & \frac{1}{2}(1+b^2) - \frac{1}{2}(1+b^2) \\ & = & 0 \end{eqnarray} as claimed.

Lemma G: For $b >1 $ the function $S(u) = D(\lambda_2=u,\lambda=u;b)$ satisfies $S(u) > 0 $ for $u \in (0,1)$. It approaches value zero as $u \rightarrow 1$, corresponding to the case of a very short bridge.

The limiting case is obvious algebraically and geometrically, since for $b>1$ we have \begin{eqnarray} S(u) &= & b - \left( b \frac{ 2u}{ 1 + u^2} + \frac{1}{2} \left( \frac{1-u^2}{1+u^2} \right)^2 \right) \\ & \rightarrow & 0 \end{eqnarray} as $u\rightarrow 1$. Moreover we can simplify to \begin{eqnarray} S(u) &= & \left(\frac{1-u^2}{1+u^2} \right) \left( 1+ \frac{1}{2} \frac{(1-u)^2}{(1+u^2)} \right) \end{eqnarray} which is evidently greater than zero.

Keeping punters log-happy: Some properties of a "pristine parimutuel" market clearing mechanism

2013-11-20T20:59:00.002-08:00

Is there such a thing as a perfect probabilistic paradise for punters? A place where participants' prior probabilistic prophecies are pulled apart and placed into a philosophically pristine parimutuel posterior, without painful departure from previously perceived pseudo-optimal particulars? I believe the platonic possibility is promising (though to provoke, in practice we puritans may be prevented from partaking in this powerful portal whilst we are paralyzed by the paranoid parlance of the present day).

A pristine market

Well now that I'm out of spit, here are some characteristics of a market for mutually exclusive, exhaustive outcomes that I would like to see:

Every participant is forced to invest in a manner that optimizes their long run wealth. They will allocate their wealth to maximize $E^P[\log {\rm wealth}] $ with respect to their best estimate of real world probability $P$.
Every participant's estimate of probability, on which 1. is evidently based, must include both private information (from which they might derive a probability measure $R$ say) and the market probabilities themselves (call them $Q$). In contrast, to simply equate $P$ with $R$ in 1. is a dreadful, though rather common, mistake.
Criteria 2. is achieved without the need for participants to monitor and respond to price feedback. In contrast the usual mechanism for including market information, as we find in racetrack parimutuels, is to provide provisional estimates of $Q$ in realtime so that punters can react.

The last requirement motivated my ponderings on this topic. In our perceived utopia participants are given inventive to provide their own private probabilities $R$ well before "race time" (as it were). They can then put their feet up and relax safe in the knowledge that 1. will be satisfied automatically.

Such is non-trivial because even investors whose bets are tiny with respect to the overall market volume (and thus have negligible price impact) must quite rationally react to partial price discovery even though in "theory", they should simply bet proportionately irrespective of the odds on offer (a lovely accident mentioned below). I use scare quotes because unless the odds are truly known, $Q$ should really enter the optimization via the back door: an update to $P$ acknowledging the market's ability to aggregate information.

In this setup the market clearing mechanism does the heavy lifting normally performed iteratively and imperfectly. So long as there is a rule governing the manner in which "$R + Q = P$" (figuratively speaking) we can anticipate every player's estimate of $P$ given $Q$, and thereby solve simultaneously for Q and log-optimal allocations that presumably influence Q. Here we consider the simplest rule of that sort. We shall assume P is a convex combination of $R$ and $Q$ with fixed coefficients.

Also for simplicity we consider the case where the market is not subsidized (though that might be an interesting direction for generalization). Then linear pricing forces us to adopt the most straightforward parimutuel clearing mechanism once we known the allocations: divide all the money wagered amongst those choosing the winning horse, in proportion to their bet.

Necessary optimality conditions

Suppose market participant $k$ allocates all his wealth $W^k$ across $I$ mutually exclusive outcomes. Suppose his estimate of probability for the $i$'th state is given by \begin{equation} p^k_i = \eta^k r^k_i + (1- \eta^k) q_i \end{equation} where $r^k$ is his best estimate using only private information, and $q_i$ is the market implied probability arrived at by means to be discussed.

As noted this equation is a statement of a seemingly rational philosophy, independent of how the market operates. Investor $k$ might have noticed in the past that his private information adds some explanatory power to the market, but he probably shouldn't ignore the market prices altogether in arriving at the best estimate of real world probability.

We shall further suppose, in what follows, that all $K$ participants are rational in another sense. They wish to optimize the log of their posterior wealth. Now it is well appreciated that if $p^k_i$ are considered fixed this log-optimality leads simply to proportional betting, but that is not the case here. Only $r^k_i$ and $\eta^k$'s are fixed, and we shall attempt to construct allocations $u = \{u^k\}$ and clearing prices $q_i$ that overtly depend on the investments made by participants.

To that end let $u^k$ denote the fraction of wealth investor $k$ invests in the $i$'th state. Suppose that the market clears in parimutuel fashion, meaning that all participants receive the same price. The market probability for the $i$'th state must be \begin{equation} q_i = \frac{ \sum_k u^k_i W^k } { \sum_{k=1}^K W^k } = \sum_k u^k_i W^k \end{equation} since we might as well suppose w.l.o.g. that $ \sum_k W^k = 1 $.

The question then arises, is there a choice of $\{u^k_i \}_{i,k}$ such that investment by each participant is log-optimal? Intuitively one would expect a market equilibrium provided $\eta^k$'s are strictly between $0$ and $1$.

The utility function for the $k$'th investor is \begin{eqnarray} U^k( u^k ) & = & E^p\left[ \log\left( \frac{u^k_i}{q_i} \right) \right] \\ & = & \sum_{i=1}^I p^k_i \log\left( \frac{u^k_i}{ \sum_k u^k_i W^k } \right) \\ & = & \sum_{i=1}^I ( \eta^k r^k_i + (1- \eta^k) q_i ) \log\left( \frac{u^k_i}{ \sum_k u^k_i W^k } \right) \end{eqnarray} and by definition of $u^k_i$ the constraints are $\sum_i u^k_i = 1 $. Or for every $k$ we might write $ g(u^k)=0$ where $g(u^k) = \sum_i u^k_i - 1$. This sets up the first order Lagrangian equations for $ \Lambda(u,\lambda) = U(u) - \lambda \cdot g( u ) $ where we collect the components $k=1,..K$. As usual for these problems we set $ \nabla \Lambda = 0 $ because for optimality the derivative of the objective function must be proportional to the derivative of the constraint function. This leads to equations of the form \begin{eqnarray} 0 & = & p^k_i( q_i(u)) \left( \frac{1}{u^k_i} - \frac{W^k}{q_i(u)} \right) - \eta^k W^k \log \left( \frac{u_i}{q_i(u)} \right) - \lambda^k, \ \ \ {\rm and} \\ 0 & = & \sum_i u^k_i - 1 \end{eqnarray} relating the allocations $u^k_i$. Notice there are $KI+K$ equations and $KI+K$ free variables, including both the $u^k_i$'s and the $\lambda^k$ multipliers (the $W^k$ are fixed parameters and we have $q_i(u) = \sum_k u^k_i W^k$ as noted above). The solution to this system of non-linear equations defines a pristine parimutuel clearing mechanism.

Comparison to proportional betting

In contrast if we imagine that $q_i$ are not a function of allocations $u^k_i$ but fixed, and further suppose $\eta^k = 1$ then we return to the overly stylized world where participants don't take market prices into account, preferring to believe their own homework is a sufficient statistic. The optimality conditions are instead: \begin{eqnarray} 0 & = & \frac{p^r_i}{u^k_i} - \lambda^k, \ \ \ {\rm and} \\ 0 & = & \sum_i u^k_i - 1 \end{eqnarray} It is apparent from the first equation that $u^k_i \propto r^k_i$ and then, from the second, that $u^k_i = r^k_i$. We recover the remarkable accident referred to as proportional betting, where the price $q_i$ does not enter the picture. This works perfectly well for blackjack where $r^k = p^k$ but not most real world games where markets inevitably supplement one's private information.

Critique

While I have caged this discussion in market language, it should be apparent that we have derived an algorithm for combining multiple probabilistic forecasts into a single probability distribution $Q = \{q_i\}$, for which there is a large literature. See the references to Ranjan and Gneiting for example.

I shall concede that what is philosophically inconsistent is my use of simple linear pooling to arrive at the individual's subjective probability estimates $p^k = \{p^k_i\}_{i=1}^I$ based on their prior $r^k$ and the final market price $q_i$, given that we then derive a more sophisticated combination scheme for meshing between individuals. Why not use the "better" scheme to combine $Q$ and $R$?

I suppose a flippant response is "why not put a picture of a boy holding a cereal box on the cereal box?". Let me think about a more satisfying answer and get back to you.

Hooke's Law and the Kalman filter. A little "spring theory" emphasizing the connection between statistics and physics.

2013-10-31T21:41:00.000-07:00

Readers will be familiar with statistical mechanics. But what about mechanical statistics?

This post is purely expository and concerns the simplest one dimensional Kalman filter. You'll recall the setup. We periodically observe Brownian motion subject to gaussian measurement error and attempt to infer the ground truth from those measurements. But rather than dive into a bunch of conditional expectations for gaussian linear systems we'll look for some physical intuition. We'll pretend that we've been robbed of our computers and are forced to create analog varieties just as in the good old days.

We make an observation that isn't always stressed up front in the statistical literature (Harrison and West for example) or control systems perspective (such as you will find at the at wikipedia entry for "Kalman filter", for example). Then we pursue the analogy between statistics and physics a little further, and show how the updating of a location estimate of a gaussian distribution amounts to a combination of center of mass and reduced mass calculations.

The perspective we arrive at is that most statistical calculations of a filtering variety are directly mapped to a corresponding physical system containing both fixed and free masses. Resolving these physical systems requires a change of reference frame. The calculations are therefore, somewhat unsurprisingly, precisely "center of mass" and "reduced mass" calculations from physics.

Kalman filter equations are just a center of mass calculation

Suppose the prior estimate of location for a particle is $m$ and the prior covariance is $P$. Suppose we make an observation $y$ with error variance $R$. Our posterior belief is gaussian with location $m'$ say and variance $P'$. The update is usually written \begin{eqnarray} m' & = & m + K ( y - m ) \\ P' & = & P(1-K), \ \ {\rm where} \\ K & = & \frac{P}{P+R} \end{eqnarray} However it is in many ways more natural to use the inverses of covariances instead. If we write $\varphi = 1/R$, $p = 1/P$ and $ p' = 1/P'$ and multiply by through by $ \frac{P+R}{PR} $ we notice that the Kalman filter update is merely a center of mass calculation: \begin{eqnarray} m' & = & \frac{m/P + y/R} { 1/R + 1/P } = \frac{ pm + \varphi y }{ \varphi + p } \\ p' & = & \frac{1}{P'} = \frac{P+ R}{PR} = \frac{1}{P} + \frac{1}{R} = \varphi + p \end{eqnarray} The analogy works if we treat precision as mass. And in what follows we'll be equally interested in the analogy between force and the derivative of the negative log likelihood function.

This table suggests that in a gaussian world force is linear in distance. And it true that we can construct an analogue Kalman smoother with rods and springs as follows:

An "analogue" gaussian smoother using perfect Hookean springs

Minimizing energy corresponds to minimizing maximizing log-likelihood. And setting the derivative of log-likelihood to zero corresponds to finding the equilibrium where forces cancel out.

Futhermore the fact that combining two pieces of evidence for one latent variable can sometimes be as simple as merging the two observations at their "center of precision" corresponds to a nice accident when forces grow linearly with distance: the impact of two masses on a third is unchanged if they coalesce at their center of mass.

But there is more to the story...

Reading a "spring diagram" in a Hookean universe

To demonstrate a richer physical analogy we consider next a Gaussian distribution whose location is assumed unknown, but also gaussian.

Figure 1. Hierarchical model where location of a gaussian distribution is itself gaussian

Suppose our prior is \begin{eqnarray} P( x | \mu ) & \propto & e^{-\rho(x-\mu)^2} \\ P(\mu) & \propto & e^{-p(\mu-m)^2} \end{eqnarray} where this time $m$ represents our guess as to the location of the center of the distribution. Symbolically we might represent the prior with the following diagram.

Figure 2. Spring diagram representing prior knowledge of the location of a gaussian distribution

Here the tuples represent location and precision, or if you prefer, position and mass. It isn't an analogy. If we assume attractive forces vary as a Hookean law, which is to say proportional to the product of masses and distance: \begin{equation} Force \propto M_1 M_2 d \end{equation} then in the example above the yellow and green masses witness attractive force and potential energy given by \begin{eqnarray*} {\rm Force} & = & \frac{p}{\rho} \rho |m - \mu| = p |m - \mu |,\ \ {\rm and\ thus} \\ {\rm Energy} & = & \frac{1}{2} p (m - \mu )^2\\ \end{eqnarray*} Since energy is the negative log likelihood, we "read" the diagram as $ P( x | \mu ) \propto e^{-\rho(x-\mu)^2} $. Similarly the spring between the yellow mass and red "test particle" is read $ P(\mu) \propto e^{-p(\mu-m)^2} $. The system therefore represents the hierarchical model and indeed, it is an exact physical analogue. So in what follows we will ask the following question: what force does the test particle feel as we add other masses to the picture?

Simplifying a spring diagram using reduced mass

The game begin in earnest when we introduce noisy evidence of our unknown location parameter $\mu$ for our mysterious distribution. Suppose we take a draw from said distribution $x_2$. Suppose we don't observe $x_2$ itself but instead, a noisy measurement $y$ whose precision (or "mass", if you will) is $\varphi$. The noisy measurement's distribution conditional on $x_2$ is $ P(y|x_2) \propto e^{-\varphi(y-x_2)^2}$ and corresponds to the following spring diagram.

Figure 3. Spring diagram representing noisy evidence

As suggested in the diagram, we will attach the evidence to the latent variable $\mu$ as follows:

Figure 4. Prior location belief plus a noisy measurement

Our task is simplification of this system until, from the perspective of the test mass, it resembles the form of the prior we began with. We should think of the observed evidence (green rectangles) as fixed points whereas the yellow circles represent unknown quantities that are free to move.

We ought to recall here the rules for combining springs in series, or to be more direct, the "reduced mass" trick for replacing a three body problem with a two body problem. In either situation physics reminds us that the combined action of the rightmost two masses can be simplified:

Figure 5. Prior belief plus a noisy measurement simplified using reduced mass

We replace the mass $\phi$ with a reduced mass $\frac{\phi}{\phi+\rho}$ because the intermediating unit mass reduces the pull. Since it is well covered elsewhere I will not derive the reduced mass expression but notice why the reduced mass makes sense in the limits. If $\phi \rightarrow 0$ the relative size of the yellow unit mass is huge and so the mass at $\mu$ hardly feels the pull from the green mass at $y$ at all. In the other extreme case, when $\phi \rightarrow \infty$, the unit mass is sucked into the green mass and is, for all intents and purposes, stationary. Thus it acts like a fixed unit mass pulling the mass at $\mu$ rather than a floating one.

We proceed to the final simplification of the diagram. This is pretty easy as the two green masses are inertial. Their impact on the yellow mass is equivalent to a single inertial mass at their center of mass. Thus:

Figure 6. Simplification of Figure 4 by reduced mass and center of mass calculation.

Apologies here because the diagram contains a small error. The denominators should read $\varphi + \rho$, not $\varphi + p $. Evidently this takes the form of the prior system in Figure 2 and can be easily read. It states that our posterior belief about the location parameter $\mu$ is gaussian with mean $m'$ say and precision $p'$ say, where \begin{eqnarray*} m' & = & \frac{ m \frac{p}{\rho} + y \frac{\varphi}{\varphi+\rho} } { \frac{p}{\rho} + \frac{\varphi}{\varphi + \rho} } \\ p' & = & \frac{p}{\rho} + \frac{\varphi}{\varphi + \rho} \end{eqnarray*}
This closes the loop and demonstrates how updating can be performed for the hierarchical model in Figure 1.

Recovering the Kalman filter update

As a parting note, we see that the limit $\rho \rightarrow \infty$ leads to update equations \begin{eqnarray} p' & = & \frac{p}{\rho} + \frac{\varphi}{\varphi + \rho} = \frac{p+p/\rho + \varphi}{\varphi +\rho} \rightarrow \varphi + p \\ m' & = & \frac{ m \frac{p}{\rho} + y \frac{\varphi}{\varphi+\rho} } { \frac{p}{\rho} + \frac{\varphi}{\varphi + \rho} } \rightarrow \frac{ pm + \varphi y }{ \varphi + p } \end{eqnarray} which is the Kalman update as before. This is to be expected, since in the limit $\rho \rightarrow \infty$ the problem of locating a distribution with unknown location (noisily observed) shrinks down to the problem of locating a point mass with unknown location (noisily observed).

The Horse Race Problem: A Subspace Solution

2013-09-15T20:54:00.005-07:00

The "horse race problem" asks for a reasonable estimate of the relative ability of entrants in a competition given that we know the probabilities of winning. The question is vague, but implies the competition awards a win to the contestant with the highest (or lowest) score, and perhaps further implies that the distribution of contestant scores is a translation family where "ability" is a location parameter. This post formalizes the problem and provides a simple practical solution.

Geelong Cup Photo Finish 2005

Inferring running time distributions from win probabilities

The question is hardly new. As noted in Probability models for horse race outcomes by Mukhtar M. Ali, one way to estimate race outcome probabilities given a liquid market for the winner is to calibrate distributions of their running times. For example, one might assume the distribution of the $i$'th horse's running time $\{\tau_i\}$is gaussian, with known variance, and that the location of each horses' distribution is a free parameter to be estimated. More generally we assume all running time distributions belong to a translation family where $f$ is fixed and a single parameter $a_i$ characterizes each horse via its running time density $$ f_i(x) = f(x;a_i) = f(x-a_i) $$ We denote the corresponding probability that the $i$'th horse's running time $\tau_i$ is less than $x$ by $$ F(x - a_i) = Prob( \tau_i < x).$$ Under these assumptions the probability $p_i$ that the $i$'th horse wins the race is\begin{equation} p_i = \int f(x;a_i) \prod_{j \not= i}^k \left(1-F\left(x;a_j\right) \right) dx. \end{equation} If we arbitrarily fix $a_1=0$ we can presumably solve for $\{a_i\}_{i>1}$. More on that in a moment.

Motivation (or lack thereof)

On application of the horse race problem is the estimation of so-called exotic probabilities: the probabilities of relevance for racegoers betting on quinellas, exactas (forecast) and trifectas. This presupposes that the market for horses winning a race is more efficient than the markets for various permutations of finishers. There is a small literature aimed at exploiting inefficiencies between the win and place (show) markets at racetracks.

The evidence is not altogether compelling, incidentally. It has been argued here that place prices on at least one betting exchange are roughly as efficient as the win market. And the ability for professional punters to beat the track take is greater for trifectas than the win market (because knowledge of two or three horses can be multiplied, as it were) suggesting that the smart money might be in trifectas instead. In a past life I gathered a fair amount of evidence to suggest just that, so don't rush off to the Dapto dogs armed only with a quaint assumption. Still, the horse race problem as stated remains a interesting mathematical problem and has motivated some people to study order statistics.

Normal running times

As noted by Ali, in the case where $f = \phi $ is the normal distribution an approximate solution for the $a_i$ given known $p_i$ is given by Henery $$ a_i = \frac{ (k-1) \phi\left( \Phi^{-1}(\frac{1}{k}) \right) \left(\Phi^{-1}(p_i)- \Phi^{-1}(\frac{1}{k})\right) }{ \Phi^{-1}\left( \frac{i-\frac{3}{8}}{k + \frac{3}{4} } \right) } $$ Some other approximations are mentioned in this presentation by Victor Lo who summarizes some empirical findings.

Gamma running times

I leave it to the reader to explore another territory where analytic solutions are available. For a treatise on ranking models with gamma processes see Gamma processes, paired comparisons, and ranking by Hal Stern. (Related: Generalized Bradley-Terry models).

Arbitrarily distributed running times by brute force optimization

Here I restrict myself to some observations on the problem where $f$ is arbitrary. As mentioned above we can set $a_1=0$ and throw $\{a_i\}_{i>1}$ to a solver for $F=0$. Here $F = F\left(\{a_i\}_{i\not=1}\right)$ is an $n$ dimensional vector function taking $n-1$ arguments, and the $i$'th entry of $F$ is the difference between the computed ($q_i$) and desired probabilities: \begin{equation} F_i = \overbrace{\int f(x;a_i) \prod_{j \not= i}^k \left(1-F\left(x;a_j\right) \right) dx}^{q_i} - p_i.\end{equation}This works fine for moderate $n$, but for $n>100$ it is time consuming, as can be seen in the following figure.

Computation time in seconds (left) as a function of the number of runners in the race.
Log scale on right. Matlab's implementation of Levenberg-Marquardt optimizer was used.

Solving a race with 200 runners takes on the order of one hour.

A novelty? Arbitrarily distributed running times by subspace optimization

We now introduce a method that cuts the computation time dramatically while retaining a high level of accuracy in the solution. The idea is to restrict the $n-1$ dimensional search to a subspace of dimension $m < n-1$. In particular if $m$ is fixed (say $m=15$) then we can compute the solution in approximately linear time. A one hour computation for a large race becomes a one minute computation (not as cool as detexity, but not bad).

Obviously the subspace must be chosen with some care, lest it lie nowhere near the best solution. But before getting to that it also helps to introduce a scaled measure of discrepancy between the desired vector $p$ and the one computed. We use the log-odds discrepancy:
$$ D(p_i, q_i) = log(p_i)-log(1-p_i)-log(q_i)+log(1-q_i)$$ rather than the naked difference of probabilities in the equation above.

Now to optimize over a subspace we define a map $\phi(z \in R^m) \mapsto a \in R^n$ as follows. We assume the probabilities are sorted from highest to lowest. We let $\{c_k\}_{k=1}^{m-1}$ be a collection of centroids obtained by k-means clustering of the differenced log-probabilities $\{\log(p_1)-\log(p_i)\}$ into $m-1$ clusters. We then enlarge the collection by adding $0$ and $\log(p_1)-\log(p_n)$. After sorting the enlarged list of cluster centroids (padded with extreme points) we have a collection $c = \{0, \dots, \log(p_1)-\log(p_n)\}$ of cardinality $m+1$. Then we define an $n$ by $m+1$ matrix $A$ by setting $A_{1j} = \delta_{1j}$, $A_{n,j} = \delta_{m+1,j}$ where $\delta$ is the Dirac delta function. For $2 \le i \le m$, choosing the convex combination of the nearest two cluster points that returns $p_i$. Thus there are two non-zero entries of $A_{ij}$ for all but the first and last rows (unless a $p_i$ coincides with a centroid). With slight abuse of notation, $A$ as a linear transformation from $R^{m+1}$ to $R^n$ satisfying $p = A(c)$.

The map $\phi(z \in R^m) \mapsto a \in R^n$ is defined by composition:
$$
\phi = A \cdot \iota
$$where in matlab notation we would write $\iota(z) = [0;z]$. That is to say $\iota$ is the trivial map from $R^m$ to $R^{m+1}$ that assumes the first coordinate is zero. Finally then, we pass the vector function $\tilde{F}$ taking $m$ arguments to a solver. The $i$'th coordinate $\tilde{F}$ is given by
$$
\tilde{F}_i = D\left( q_i\left( \phi\left(z\right)\right), p_i \right)
$$
This works surprisingly well. The intuition is that the sorted abilities $\{a_i\}$ can be interpolated reasonably (and very roughly, vary with the $\log$ of $p_i$) so long as we have enough knot points. It helps to place the knot points near population centers to reduce the amount of interpolation. As we vary $z$ we are varying the locations of the knot points and forcing the $a_i$'s to move in a subspace, but since the interpolation is pretty good the subspace lies quite close to the solution.

And as noted, it is much faster. The figure below shows computation time as a function of the number of starters when we use subspaces of dimension $5$ and $15$. No code optimization has been attempted and undoubtedly, better numbers could be achieved in any number of ways. But the point is that using a subspace search alone results in a $50$x improvement or so for $n \approx 200$. The improvement would of course be much greater, in relative terms, for $n \gg 200$.

Computation time as a function of race size when using a subspace search

The error introduced by the subspace approximation will vary, of course, but as a rough guide the ratio of $p_i$ to $q_i$ differs from unity by about half a percent when using a subspace of dimension five, even when there are $200$ runners. So a horse at 100/1 might be calibrated at 99.5/1. This error drops by another order of magnitude when using a subspace of dimension $15$, whereupon the calibrated and desired probabilities are, as a practical matter, indistinguishable.

Possible improvements

For known distributions we ought to be able to use more accurate interpolation. Here I merely assumed that probabilities fall off as the exponential of ability difference. For very large races we should use the stability postulate instead, at least for those in the tail.

A generalization?

It's usually around this point where start to suspect there can't be anything new here. After all, any similar optimization problem where we have an ounce of intuition as to the solution that could be similarly addressed (i.e. by minimizing interpolation with clustering). Perhaps somebody could point me to method XYZ and conclude that the horse race solution presented here is merely an application of the same.

Quasi-random sampling subject to linear and moment constraints (or: "how to bypass model calibration")

2013-09-02T13:35:00.003-07:00

Its a rainy long weekend in New York and I've been dusting off an old experiment while listening to the thunder at a friend's place. I was particularly impressed by the fact they had an outlet on their porch positioned right by a comfy chair.

Here's the motivation. In finance we sometimes find ourselves attempting to calibrate a probabilistic model for an underlying random variable (such as a stock price) subject to known constraints (such as the price of an option on the same). Often, we then draw samples from the same model for the purpose of pricing other securities, performing risk calculations or satisfying some other need.

\begin{equation} observed\ prices \rightarrow calibrated\ model \rightarrow random\ samples \end{equation} This has several advantages including the built in regularization we get by using a parametrized model. There is on obvious drawback however: one has to come up with the model!

A while ago I was motivated to skip the calibration step. I was inspired by the herding literature in machine learning, and I started playing with a more direct path: \begin{equation} observed\ prices \rightarrow random\ samples \end{equation}.The idea is that we devise quasi-random sampling schemes that automatically achieve the calibration while striving to fill out the space in a way that achieves some of the objectives of a calibrated model.

Best described by an example I feel. And in this one we consider samples on a discrete lattice where smoothness of the sampled distribution is controlled using the discrete Laplacian.

Example

I will lead with the picture. The middle plot demonstrates that we are able to generate samples whose histogram is quite smooth, yet satisfy the constraints provided. In this case we've chosen a two dimensional state space, as might commonly occur when (strike, maturity) or say (attachment point, maturity) axes are in play.

Discrete Laplacian minimizing quasi-random samples (middle) approximating the true distribution (left). For comparison we also show a minimal surface (right) generated using the same constraints with direct optimization.

We've started with some fake distribution on the left, created some functions of the same and used those functions as constraints.

Aspirations and notation

We allow a lattice to be determined by axes $x^{(1)}=\{x^{(1)}_1 >...> x^{(1)}_{n_1}\}$ and $x^{(2)}=\{x^{(2)}_1 >... > x^{(2)}_{n_2}\}$. We suppose that $F = \{f_k\}_{k \in K}$ is a collection of functions on the implied lattice whose means are known constants $\mu_k$. That is,
\begin{equation}
\mu_k = E[f_k] \ \ \ \ k \in K.
\end{equation}
with respect to some density that we wish to sample from. Specifically we want to draw samples $z=\{z^i\}_{i=1}^I = \{(z^i_1,z^i_2)\}_{i=1}^I$ such that
\begin{equation}
\hat{f_k} := \frac{1}{I}\sum_{i=1}^I f_k(z^i)
\end{equation}
satisfies approximate equalities
\begin{equation}
\hat{f_k}(z) \approx \mu_k, \ \ k \in K.
\end{equation}
It will turn out that using equalities here involves no loss of generality. Indeed a cute feature of the algorithm exhibited here is that inequalities are treated in the same manner as equalities: we simply use the same function twice with the same weight $w$ but different $\mu$'s. The lower $\mu$ is interpreted as a lower bound and the higher as an upper bound.

Now for notational convenience we might as well consider $f_k$ to be functions defined on the integer lattice $\{1..n_1\} \times \{1..n_2\}$, identifying $f(r,s) := f(x^{(1)}_r,x^{(2)}_s)$. We also identify samples $(z^{(1)}_r,z^{(2)}_s)$ with their integer pairs $(r,s)$. We denote by $Z^i(r,s)$ the number of samples falling on the lattice point.

A fast algorithm for generating quasi-random samples

In what follows we will use a prior probability $\phi(r,s). $ defined on the lattice. It might might be determined by minimization of the square of the discrete Laplacian subject to the equality and inequality constraints given, or by other means.

The idea is to start lobbing samples on the lattice (like stackable Go stones) and watch the distribution emerge. By monitoring the discrete Laplacian we can try to coerce the samples to fill in holes - smoothing things out as it were. By monitoring the expectation of the constraint functions $f_k$ with respect to the pseudo-sample distribution thus far, we can correct for bias.

For this to be efficient we need a very efficient means of bookkeeping.

Initialization:

At the outset we fix a constant lattice function $A_k(r,s) = signum \left( \mu_k - f(r,s) \right) $ for $k \in K$. We initialize the laplacian $D(r,s)=0$ for all lattice points $(r,s)$. We initialize counters $S^k_0 = 0$ for $k \in K$.

Selecting the next quasi-random sample:

Beginning with quasi-random sample $i=1$ we perform the following steps.

Set urgency $\nu(r,s) = D(r,s)$ then for $k \in K$ update it by $ \nu(r,s) \rightarrow \nu(r,s) + w_k A_k(r,s) signum( S^k - i\mu_k ) $
Choose the most urgent lattice point after weighting: $ (r_i,s_i) = \arg \max \nu(r,s) \phi(r,s)$
Increment the sample count for the lattice point: $Z(r_i,s_i) \rightarrow Z(r_i,s_i) + 1 $
Decrement $D(r_i,s_i) \rightarrow D(r_i,s_i) -1 $ then increment $D(r^*,s^*) \rightarrow D(r^*,s^*) + \frac{1}{4} $ for all lattice points $r^*,s^*$ neighbouring $r_i,s_i$.
Update $ S^k \rightarrow S^k + f^k(r_i,s_i) $.

What we are doing here is selecting a new point that improves as many of the biases as possible, subject to the discrete Laplacian not getting out of whack.

Selecting the next quasi-random sample (slightly more efficient):

It turns out we can be a little sneakier and initialize $ \nu(r,s) = \phi(r,s) $. Then the following algorithm is equivalent and somewhat more efficient.

Choose the most urgent lattice point after weighting: $ (r_i,s_i) = \arg \max \nu(r,s) \phi(r,s)$
Increment the sample count for the lattice point: $Z(r_i,s_i) \rightarrow Z(r_i,s_i) + 1 $
Update $S^k = S^k_{prev} + f^k(r_i,s_i) $
For $k \in K$ update $ \nu(r,s) \rightarrow \nu(r,s) + w_k A_k(r,s) \left( signum(S^k-i \mu_k) - signum(S^k_{prev} - (i-1) \mu_k \right) $
Decrement $\nu(r_i,s_i) \rightarrow \nu(r_i,s_i) -1 $ then for all four neighbours $(r^*,s^*)$ of $(r_i,s_i)$ increment $ \nu(r_i,s_i) = \nu(r_i,s_i) + \frac{1}{4} $.
Set $S^k_{prev} = S^k$.

Notice that we don't really need to consider the discrete Laplacian and the moment terms separately, since we only care about their direct tradeoff.

A minor practical tweak

It seems to help to use a tuning parameter $\lambda$ that weighs the relative importance of satisfying the constraints $f_k$ versus minimizing the discrete Laplacian. Left as an exercise for the interested reader.

Two Kelly betting rules: one of which I might actually remember

2013-09-01T16:10:00.002-07:00

I was having a conversation the other day involving proportional betting and the Kelly criteria, and was embarrassed that I couldn't remember of the top of my head what some of the numbers came out to be. So I decided to try to make this really simple for myself by supplying two rules. The first applies to betting on an even money favorite. The second applies to betting on longshots. Both assume you wish to maximize the log of your wealth.

RULE 1: If "heads" is ten percent more likely than usual, bet ten percent of your wealth.

Here's a plot of $E[log(1+fX)]$ where $X$ is a binary random variable taking value $1$ or $-1$ with probability $0.55$ and $0.45$ respectively. The x-axis shows $f$, the fraction of one's wealth one decides to put on the coin flip.

Percentage of one's wealth to wager on a coin flip with p=0.55

The y-axis can be interpreted as the average return, if we assume we get paid even money even though we know the true probability is different. Clearly we want to bet about ten percent of our wealth to maximize this.

But is the scale wrong? Since we have a ten percent edge we should, on average, be making ten percent of ten percent return, or one percent return, right?

Well no. You only get half that. If you want to remember why you only get half a percent, not a one percent return, notice that you are approximately maximizing a linear function (the return you expect, not taking into account the $log$) subject to a quadratic penalty (the concavity of $log$). To be more precise we observe the expansion around $f=0$ of
\begin{equation}
E[\log(1+X)] = 0.55 \log(1+f) + 0.45 \log(1-f)
\end{equation}
which is given by
\begin{equation}
0.55 \log(1+f) + 0.45 \log(1-f) = \overbrace{0.1 f}^{naive} - \overbrace{\frac{1}{2} f^2}^{concavity} + O(f^3)
\end{equation}
The next two terms are $+ \frac{1}{30} f^3 - \frac{1}{4} f^4$ incidentally. But as might be surmised from the plot we are pretty close to a parabola. The marginal return goes to zero just as the parabolic term starts to match the linear one (i.e. its rate) and the absolute return goes all the way to zero when the parabolic term wipes out the linear one completely. So it must be that we optimize half way down, as it were.

Another way to come at this is that the fraction of wealth you should invest is simply $p-q$. That's surprising at first. But since $E[log(1+X)] = p \log(1+f) + q \log(1-f) $ in general we note that
\begin{equation}
\frac{\partial}{\partial f} E[\log(1+X)] = \frac{p}{1+f} - \frac{q}{1-f}
\end{equation}
which is zero for
\begin{equation}
f = p - q
\end{equation}
Incidentally your return grows slightly faster than the square of your edge. As the amount we bet is linear in the edge, we'd expect the return to grow quadratically as we vary $p$, more or less. And not forgetting the factor of one half it ought to be half the square of our edge. In fact it grows slightly faster, as can be seen by substituting the optimal $f= p-q$ back into the return equation.

Average return for an even money coin flip.

The actual return is the red line. The quadratic approximation is the blue. Not that the percentage edge is $2(p-0.5)$, not $p-0.5$.

So much for betting on the favourite, as it were, but what about longshots? The the math doesn't change but our intuition might. Instead of fixing the outcome at even money and varying the true probabilities of an outcome, let's instead fix the true probability at $ p = 0.01 $ and vary the odds we receive. All being fair we should receive $100 back, including our stake, for every $1 bet. We'll assume instead we get a payout of $\alpha$ times the fair payout. Then we maximize
\begin{equation}
E[\log(1+X)] = p \log\left(1+ f \alpha \left( \frac{1}{p}-1\right) \right) + (1-p) \log(1-f )
\end{equation}
This brings us the the second easy to remember betting rule.

RULE 2. If a longshot's odds are ten percent better than they should be, bet to win ten percent of your wealth.

Now I know what you are thinking: why do we need rule #1 if we have rule #2? Well you probably don't, actually, but evidently I might because I have managed to forget rule #2 in the past. We'll call it a senility hedge.

But rule 2 the rule is quite reasonable on a larger domain - though eventually it too breaks down. If you can get ridiculously good odds on a longshot (let's say ten times the fair odds) don't go absolutely nuts. Only bet the amount that would double your wealth if the odds were fair. Diminishing returns set in if you are set to win so much that the concavity of $log$ kills the fun. You can see that in the picture below.

Average return on a 100/1 horse when we are offered 1000/1 and 2000/1 respectively.

Notice that we bet only marginally more if we get 20 times the fair odds versus 10 times. We do win twice as much, however.

The Boy or Girl Paradox: A Martingale Perspective.

2013-07-23T19:49:00.003-07:00

Mr Smith always tells the truth.

Interviewer: Mr Smith, how many children do you have?
Mr Smith: I have two.
Interviewer: Mr Smith, do you have a son?
Mr Smith: Yes

Q1: What is the probability that Mr Smith's other child is also a boy?

This is a clean version of the boy or girl paradox. The correct answer is 1 in 3, as a simple application of Bayes' Rule establishes. Given forty families, thirty will have at least one boy. Of those thirty, ten will have two boys.

Slightly different evidence would result in an answer of 1 in 2. For example:

Interviewer: Mr Smith, how many children do you have?
Mr Smith: I have two.
Interviewer: Mr Smith, is your oldest child a boy?
Mr Smith: Yes

Q2: What is the probability that Mr Smith's other child is also a boy?

The answer to the second question is clearly 1 in 2.

The boy or girl paradox rises to the status of a controversy because many people will answer 1 in 2 in circumstances other than this as well (as pointed out by a reader - see comments below). However both Q1 and Q2 are, I think, entirely clear. The first situation results in a lower posterior probability.

Now don't ask me why I was looking at this right now but I got to wondering, is there an even easier way to see the inequality between the two answers? Bayes Rule is pretty simple when you write it down but humans are notoriously bad at using it semi-consciously, if you know what I mean. Scanning the wikipedia explanation left me a bit unsatisfied. So here's another angle.

The Martingale Argument (Informal)

Suppose you had wagered that Mr Smith had two boys. You paid $1 and you'll receive $4 if he has two boys. It's an investment. What evidence would make you happier about your investment? Learning that at least child of two is a boy, or learning that at least one child of one is a boy? The latter is a priori less likely, and therefore better news.

Now to polish it off, note that in the former case, betting directly on Mr Smith's first answer would turn $1 into $4/3 because fair odds are 3:1 on. But had the news been bad (i.e., none of the two children were boys) our parlay would be dead in the water. Since we are betting with fair odds our wealth is a martingale. To get to $4 you we still have to have to increase your wealth threefold. So it must be that the odds of betting on two children being boys given that you already know that at least one is a boy correspond to a threefold increase in wealth. Thus the probability is 1 in 3.

The Martingale Argument (Formalized)

A reader has asked me to tighten the argument and relate to the martingale notion in mathematics. My title references the fact that when we face a bookmaker who offers fair odds the expected value of our wealth will not change at the moment we enter any bet with him. Our wealth is a martingale no matter what betting strategy we employ. Try as we might, we cannot win or lose money on average.

A martingale is defined with respect to both a stochastic process (here our wealth) and a sequence of information sets (representing information arriving, such as whether Mr Smith has at least one boy). If $\cal{F}_{t}$ represents the information available to us at time $t$ and $\cal{F}_{\tau}$ the information at an earlier time then the statement that our wealth is a martingale is written: \begin{equation} \label{martingale} W_{\tau} = E \left[ W_{t} | \cal{F}_{\tau} \right] \end{equation} In general both sides of the equation are random variables but to warm up with a special case, take $\tau=T$ where $T$ is the time we learn whether Mr Smith has two boys or not, and set $t$ equal to the time before we know anything. Our initial wealth is not random so the left hand side of \begin{equation} W_t = E \left[ W_T | \cal{F}_{t} \right] \end{equation} is just 1. If we further assume we make an all-or-nothing bet with only two outcomes this reads \begin{equation} 1 = p W^{win} \end{equation} where $W^{win}$ is our wealth conditional on winning (i.e. the payout). This is just the statement that for a fair bet the payout is inversely proportional to the probability of winning. Since $p=0.25$ we have $W^{win}=4$.

A somewhat more interesting use of the martingale property considers a time $\tau$ just after we learn some information. If we again take $T$ to be the "terminal" time just after we learn whether Mr Smith has two boys or not we have \begin{equation} W_{\tau}(\omega) = E \left[ W_T | \cal{F}_{\tau} \right](\omega) \end{equation} where I have emphasized that both sides are random variables. If we take an expectation with respect to $\cal{F}_{t} $, the information we have when we make the bet, we'll get the third inequality below: \begin{equation} 1 = W_t = E \left[ W_{\tau} | \cal{F}_{t} \right] = E \left[ E \left[ W_T | \cal{F}_{\tau} \right]| \cal{F}_{t} \right] \end{equation} Here the first equality and the second we have seen. The first is our assumption that we start with one dollar. The second equality is the martingale property.

These expressions become quite trivial in the case where the first revelation of information will result in one of two outcomes and one of these outcomes is fatal for our bet (if we learn that none of Mr Smith's children are boys, or if we learn that the first child is not a boy then obviously we are out of the running). Indeed we have \begin{eqnarray} E \left[ E \left[ W_T | \cal{F}_{\tau} \right]| \cal{F}_{t} \right] & = & p^A E \left[ W_T | \cal{F}_{\tau} \right] \\ & = & p^A p^B W^{win} \end{eqnarray} where $p^A$ is the probability of good news and $p^B$ is the conditional probability of winning our bet at time $T$ given that we received good partial information at time $\tau$ (this second probability is what we are asked for). And noting the chain of equalities we therefore have \begin{equation} 1 = p^A p^B W^{win} \end{equation} and thus \begin{equation} p^B = \frac{1}{p^A W^{win} } \end{equation} My informal argument merely noted that $W^{win}$ is a known quantity (=4) and thus $p^B$ is inversely proportional to $p^A$.

To make this explicit consider two scenarios. In scenario $1$ we learn that at least one child is a boy. In case $2$ we learn that the oldest child is a boy. Denote the corresponding probabilities of this news (good in both cases) by $p^A_1$ and $p^A_2$ respectively. Trivially $p^A_2 = 0.5$. On the other hand $p^A_1$ is the probability that at least one child out of two is a boy. This is true in 3 out of 4 cases so $p^A_1 = 0.75$. It follows that \begin{eqnarray} p^B_1 & = & \frac{1}{p^A_1 W^{win} } = \frac{1}{ \frac{3}{4} 4 } = \frac{1}{3} \\ p^A_2 & = & \frac{1}{p^B_2 W^{win} } = \frac{1}{ \frac{1}{2} 4 } = \frac{1}{2} \\ \end{eqnarray} Thus the answer to the first question about Mr Smith is 1 in 3, whereas the answer to the second is 1 in 2.

Really it is trivial as the statement $1 = p^A p^B 4$ is just a statement about fair odds for a parlay bet (where the winnings from a first bet are invested in another). You can always think of the expected value of your wealth for the original bet as equal to the actual cash you hold in the absolutely equivalent parlay bet.

Keeping up with the Joneses. Falkenstein on the implications of relative status utility.

2013-07-04T07:58:00.001-07:00

Eric Falkenstein's recent book The Missing Risk Premium argues that when it comes to modeling the investment universe and its participants, envy is more important than greed. Fund managers try to optimize their returns relative to an index. Individual's utility functions are closer to relative than absolute wealth metrics.

Falkenstein suggests a relative utility function based on the individual's wealth $w_i$ and also the wealth of the only other agent, which we denote $w_{-i}$. \begin{equation} \label{relative} U^r(w_i;a,w_{-i}) = U^a(w_i-w_{-i};a) \end{equation} Indexing the only two humans alive by by $1$ and $-1$ helps keep the symmetry. Here $a$ is a coefficient of risk aversion, shared by both participants, and $$ U^a(x) = - e^{-a x} $$ is the familiar utility function, more usually applied with $x=w_i$.

The author introduces two assets into this two agent economy. One is riskless with deterministic return $\nu$ and one is risky with mean $\mu$ and variance $\sigma^2$. Normally we would imagine $\mu = 1.2$ for example and $\nu = 1.05$ perhaps. But as we'll see momentarily, Falkenstein observes that risky asset return will be exactly the same as the riskless asset return if both agents' utility is relative, as in (\ref{relative}) above.

From Neumann-Morgenstern to Markowitz

This is to be compared to the case of absolute utility where the risky asset experiences excess return proportional to the risk aversion parameter $a$ and the variance of the risky asset $\sigma^2$. Here is a quick refresher.

Recall that there is strong motivation for maximizing expected utility of some sort and weaker justification for the particular form above. Leaving that philosophy alone we certainly can say that so long as $x$ is gaussian, there is a purely mathematical equivalence between the task of maximizing $U^a(x)$ and the task of maximizing a rather famous expression where the naive mean is penalized by a variance term: \begin{eqnarray} \arg\max_x E[U(x)] & = & \arg\min_x \left( -E[U(x)] \right) \nonumber \\ & = & \arg\min_x \left(\log -E[U(x)] \right) \nonumber \\ & = & \arg\min_x \left(\log E[e^{-ax}] \right) \label{exp} \\ & = & \arg\min_x \left(-aE(x) + \frac{1}{2} a^2 Var[x]\right) \label{var} \\ & = & \arg\max_x \left(E(x) - \frac{1}{2} a Var[x] \label{blah} \right) \end{eqnarray} Unsurprisingly (\ref{blah}) with $x=w$ representing wealth is sometimes taken as an ansatz for investing - subject to the gaussian assumption we need to get from (\ref{exp}) to (\ref{var}). In fact there is another, lesser appreciated caveat I have summarized in an earlier post, but it isn't germane to Falkenstein's point either. His objection relates more to what $x$ should be: absolute utility with $x=w_i$ or relative utility with $x=w_i - w_{-i}$.

Absolute utility. Trading mean for variance.

We first review the former. There should be no first order change in utility for an agent transferring a small amount of his wealth from cash to stock.

For this to be true the increase in expected terminal wealth $E(w)$ must balance the variance penalty term above (that is, the term $\frac{1}{2} a Var[w]$).

Without loss of generality the terminal wealth is $w_1(\lambda_1) = (1-\lambda_1) y + \lambda_1 x$ where $y=\nu$ represents cash. A small increase in portfolio holding from $\lambda_1$ in the risky asset to $\lambda_1 + \delta$ changes the expected terminal utility by \begin{equation} \label{equilibrium} \Delta U = \overbrace{ \delta ( \overbrace{\mu}^{risky} - \overbrace{\nu}^{riskless} ) }^{increased\ mean} - \overbrace{\frac{1}{2} a \sigma^2 \delta\lambda_1}^{increased\ variance\ penalty} = 0 \end{equation} where $\mu = E[x]$ must exceed the riskless expectation $\nu$ of cash. From this equation, expressing the fact that there is no marginal benefit in risk adjusted terms of increasing or decreasing exposure, the excess return of the risky asset in this two agent economy can be expressed: $$ \overbrace{\mu}^{risky} - \overbrace{\nu}^{riskless} = \frac{1}{2} a \sigma^2\lambda_1 $$ where importantly, the variance penalty is proportional to the risk aversion $a$, the variance of the risky asset, but also how much the agent already holds in the risky asset (i.e. $\lambda_1$.

Relative utility. No variance tradeoff exists.

But Falkenstein (pages 90 onwards) argues against the relevance of this traditional volatility tradeoff. We introduce a second participant in the economy so that our first agent can be jealous of them. The second participant holds $\lambda_{-1}$ in the risky asset and the rest in cash. Under the posited relative utility assumption (\ref{equilibrium}) is now quite different, as it involves the second agent. $$ \overbrace{ \delta ( \overbrace{\mu}^{risky} - \overbrace{\nu}^{riskless} ) }^{increased\ mean} - \overbrace{\frac{1}{2} a Var[x]\delta \underbrace{(\lambda_1-\lambda_{-1})}_{= 0} }^{increased\ variance\ penalty} = 0 $$ The variance term vanishes since each agent is identical and therefore, in equilibrium, must hold the same amount of the risky asset. There is no excess return. We simply have: $$ \overbrace{\mu}^{risky} = \overbrace{\nu}^{riskless} $$ In other words, in the limit where keeping up with the Joneses is all that influences the decision making of participants, there is there is no reward for risk taking. That's intuitive, because there is no such thing as systematic risk. Participants might as well ride that rollercoaster because in the event of a stockmarket crash, say, they will still happy. Their neighbours will lose too.

References

The reader is of course referred to Erik's book and blog articles.

Valuing CSOs under exchangeable recovery using only matrix calculations

2013-07-02T20:51:00.001-07:00

It is reasonably well appreciated that in a basket default swap (CSO) the present value of credit event payments and the present value for premium payments may be considered linear functions of a loss surface.

Present values are also linear functions of a default surface (in a very explicit form provided herein) provided the joint distribution of sequential defaults is exchangeable.

The exchangeable recovery assumption

By this we mean the following. Let $R_k$ denote the stochastic recovery of reference entity $k$. The recovery model is a joint distribution $ P(R_1 = r_1,..,R_n = r_n )$. It is said to be exchangeable if
\begin{equation} \label{exchangeable} P( R_1 = r_1,..,R_{n} = r_n ) = P( R_{\pi(1)} = r_1, .. ,R_{\pi(n)} = r_n ) \end{equation} for any permutation $\pi$. Here $R_{(i)}$ is the $i$’th order statistic. The recovery of the first reference entity to default is $R_{(1)}$. The recovery of the second reference entity to default is $R_{(2)}$ and so forth.

It may be said that we are modeling the sequence of recovered amounts independent of the details of each reference entity. A limitation, certainly, although the class of such model is still quite interesting. It includes the ubiquitous constant recovery assumption, as well as models that lead to rather more sensible results for supersenior tranches.

Condition (\ref{exchangeable}) implies that the marginal distribution of recovery for every reference entity is identical, but it does not imply they are independent. A finite version of de Finetti’s theorem does, however, imply that all such models are mixtures of i.i.d. models.

Decomposition into simple contracts

In what follows we assume a reference portfolio of $N$ assets and $M$ different tranches. Suppose we are interested in the implications of $K$ different default time joint distributions over these $N$ reference obligations.

Let $\{\theta\} \in \Theta$ denote the parameters for the exchangeable recovery model and $\{\omega\}_{k=1}^K \in \Omega$ denote the parameters for the default time model. For fixed $\theta$ the present values for the $M$ different tranches under $K$ different default model assumptions comprise an $M,K$ matrix. We shall develop matrix expressions for the same, as follows: \begin{eqnarray} \label{decomp} \left[ PV_{credit}(\theta,\omega) \right]_{M, K} & = & \left[ C(\theta)\right]_{M, N} \left[D(\omega) \right]_{N, K} \nonumber \\ \left[ PV_{premia}(\theta,\omega)\right]_{M,K} & = & \left[p \right]_{M,1}\left[ e\right]_{1,K} \cdot \left[ P(\theta) \right]_{M,N+1} \left[ A(\omega) \right]_{N+1, K} + \left[ U \right]_{M,1} \left[ e\right]_{1,K} \end{eqnarray} where $e$ is a vector of ones used to replicate $U$ and $p$. In turn $U$ is a known, upfront amount paid by the purchaser of protection. The vector $p$ comprises the agreed running premia (typically $100$ or $500$ basis points) for the tranches in questions. The dot denotes elementwise multiplication.

Definitions of other matrices will follow but point is that parameters $\theta$ and $\omega$ appear separately on the right hand side, not together. So any tensor product of assumptions regarding recovery and default can be computed efficiently.

We might call $C(\theta)$ the credit protection matrix. The entry $C_{m,n}$ is the average proportion of the $n$'th credit event payment which will be paid to the buyer of protection in tranche $m$, where said average is taken over realizations of the recovery model with parameter $\theta$.

Similarly, the entry $P_{m,n}$ in what we might call the premia matrix $P(\theta)$ is the mean proportion of the annuity for the $n$'th to default “tranchelet” which will accrue to the holder of tranche $m$. Here again we refer to the mean over all realizations of the recovery model when the parameter is set to $\theta$. It is convenient to allow the special case $n=N+1$. We define $P_{m,N+1}$ to be the proportion of the annuity for the $m$'th tranche that will be paid regardless of whether defaults occur or not. The value may well be zero, but typically will not be zero for the most senior tranche due to “writedown from above”.

In (\ref{decomp}) the matrix $A(\omega)$ has entries $A_{n,k}$. The entry $A_{n,k}$ denotes the present value of a contract which pays the holder $1 per year up until the $n$'th default - computed under the assumptions of the $k$'th default time model. For the special case $n=N+1$, $A_{N+1,k}$ is the present value of a contract paying $1 per year up until the deal maturity regardless of default.

Similarly $D(\omega)$ with entries $D_{n,k}$ comprises the present value of elementary contract which pays the holder $1 at the time of the $n$'th default under the assumptions of the $k$'th default time model.

Pricing of simple contracts

Next we consider the pricing of our simple risky annuities $A$ and default count bets $D$ given a default surface $S$ of dimension $N+1 \times T$. We claim these matrices can be computed as follows: \begin{eqnarray} \label{simple} D(\omega) & = & \left(\left(L S W\right)\cdot \left(R\Lambda'W\right)\cdot \left(R\Gamma'G\right)\right)F/10000 \nonumber \\ A(\omega) & = & \left(\left(H\left(\left(1-LS\right)Z\right)W\right)\cdot \left(H\left(R\Lambda'W\right)\right)\right)F \end{eqnarray} where $\Gamma$ is a vector of times and $\Lambda$ a vector of risk free discount factors.

The remaining matrices are constant. $$ L = \left[ \begin{array}{cccc} 1 & 0 &\dots & 0 & \\ 1 & 1 & \dots & 0 & \\ \vdots & \vdots & \ddots & 0 & \\ 1 & 1 & 1 & 1 & \end{array} \right]_{N+1, N+1} $$ $$ F = \left[ \begin{array}{cccc} 1 & 1 &\dots & 1 \\ 0 & 1 & \dots & 1 \\ \vdots & \vdots & \ddots & 1 \\ 0 & 0 & 0 & 1 \end{array} \right]_{T+1, T+1} $$ $$ H = \left[ \begin{array}{cccc|c} 1 & 0 &\dots & 0 & 0 \\ 0 & 1 & \dots & 0 & 0 \\ \vdots & \vdots & \ddots & 0 & 0 \\ 0 & 0 & 0 & 1 & 0 \end{array} \right]_{N, N+1} $$ $$ R = \left[ \begin{array}{c} 1 \\ 1 \\ \vdots \\ 1 \end{array} \right]_{N+1,1} $$ $$ G = \left[ \begin{array}{cccc} -1 & 0 &\dots & 0 \\ 1 & -1 & \dots & 0 \\ 0 & 1 & \ddots & 0 \\ \vdots & \vdots & \ddots & -1 \\ 0 & 0 & 0 & 1 \end{array} \right]_{T, T-1} $$ $$ W = \left[ \begin{array}{cccc} 1/2 & 0 &\dots & 0 \\ 1/2 & 1/2 & \dots & 0 \\ 0 & 1/2 & \ddots & 0 \\ \vdots & \vdots & \ddots & 1/2 \\ 0 & 0 & 0 & 1/2 \end{array} \right]_{T, T-1} $$ $$ Z = \left[ \begin{array}{ccccc} 1 & -1 & 0 & \dots & 0 \\ 0 & 1 & -1 & \dots & 0 \\ 0 & 0 & 1 & \dots & 0 \\ \vdots & \vdots & \vdots & \ddots & -1 \\ 0 & 0 & 0 & 0 & 1 \end{array} \right]_{T, T} $$ Combining (\ref{decomp}) and (\ref{simple}) we can price all tranches for all default models using only matrix computations. The $\theta$ dependent matrices $C$ and $P$ in (\ref{decomp}) can be computed offline.

Are Monads really Monads? Maybe(Some(R))

2013-07-02T20:46:00.000-07:00

I made a few notes a while back for peeps like myself who are familiar with monads in functional programming but wondered why their definition isn't entirely obvious from the definition of a monad in mathematics, specifically category theory.

I personally made rather slow progress on this front (still don't understand why arrows are strong monads for example). But perhaps something of the connection between the monad pattern in programming and the monad definition in mathematics can be cleared up here.

I confess that part of the motivation comes from being turned off by articles about monads. Newbies like your author might easily be forgiven for thinking that a monad as an endofuntorial elephant. I nearly went mad after reading twenty random monad blogs without finding a single one that states which category or categories we are in. That's like finding descriptions of "Germany" that include Nazi's, white sausages and so forth and yet never mention that Germany is a country.

This lack of precision hardly helps us when we turn to details, like that fact that monad is Haskell isn't actually a functor in the Haskell hierarchy. Indeed as Heunen and Jacobs point out, both monads and arrows are actually monoids, in mathematical parlance.

So let me state this first: a monad (programming speak) is a monad (math speak) in the category of types. If neither of us get anything more from this exercise, so be it.

Introducing MAP, a built-in functor

Let's start from the programming side. One use of a monad pattern emerges from an everyday programming need: the desire to extend the number system (or some other type) to include an exception, or 'missing' value. This is achieved by creating a new type. In Scala (you'll forgive the bias) this is constructed as Option[something]. When you alter a Double so that it is an Option[Double] you are creating an obvious mapping from one type to another, you might even say "the" obvious mapping. But you are not merely mapping objects to Option[objects] - at least not if you wish to get any work done. You are also dragging along all the nice things you'd like to do to the objects, namely hit them with functions. Your dragging is a functor.

A functor is not just a map from one space to another, we recall, but also a map of all the stuff-you-can-do-to-things in that space to stuff-you-can-do-to-things in the other space. To be more precise I should say we are dragging over not just functions but maybe some other stuff like relations, but that is a fine point for now. In the case of Double => Option[Double] you drag f(x) = x*x+3, say, into a function we might call Some(f) which should act on elements of Option[Double] and take Some(x) to Some(x*x+3).

Or take Double and shove it in Mathematica, metaphorically. It is no longer a Double, but merely an expression, like x*x+3 which is yet to be bound to a specific Double. Thing is, you'd like to do to the expressions what you do to real numbers, so evidently we are dragging morphisms like the one taking 3 into 9 and 5 into 15 over to morphisms on expressions (like the one taking x into 3*x). They ain't the same thing but there is clearly some conservation of structure as we go from real, tangible Double into ephemeral 'x'.

For more examples of "obvious" mappings take a bunch of program fragments and shove them together in a combinator pattern. Or play with state machines, whose composition can be simplified to a single state machine. The notion of an incomplete calculation, an unconsumated something, an unsimplified something, or an extension of a domain are all examples where there is a obvious, rich mapping from one type to another. So rich, in fact, that if we are thinking sloppily we often fail to distinguish between the domain and range of the obvious mapping.

Now let's back up and dot a few t's because while a monad is a fairly simple pattern in programming we aren't close to recognizing its mathematical definition (well, I'm not). To be a little formal we need categories, functors and natural transformations. Remember that a category is supposed to be a space together with some morphisms obeying laws? It was a long time ago for me too, but the morphisms in two of the examples I mentioned are merely the set of all functions with the signature Double => Double. Again, a morphism is a more general concept than a function, basically obeying associativity, and for that reason is often called an arrow (in mathematics). But let's not worry about the possible deficiency in our morphism collection right now because the set of all Double => Double functions is interesting enough.

Two quick checks. First, it is obvious that composition of functions in the mathematical sense - no side effects - is associative. Second, we'll need not concern ourselves overly with the existence of an identity morphism - the other category requirement - because constructing the function that takes x and returns x presents little challenge to most programmers! Safe to assume the identify exists.

So jolly good, DOUBLE is a category and, returning to our first example, Some( ) sure looks like it might help make a functor from DOUBLE into something we imaginatively call SOME-DOUBLE. (In general we could call the first category THING and the new category SOME-THING, I suppose). Likewise Symbolic( ), which I just made up, looks like it might help create a functor from DOUBLE into SYMBOLIC-DOUBLE, as we might name it, a category representing a computation but not its answer. Are the images of these maps themselves categories? Its seems pretty obvious that they are, because of the triviality of function application associativity and, to be pedantic, the obvious fact that the identity morphism maps to the identity morphisms in both SOME-DOUBLE and SYMBOLIC-DOUBLE.

We focus on the first functor momentarily, whose implementation in Scala comprises the constructor x=>Some(x) and the lifting f => (x => x map f) which uses the 'built-in' map method. I personally think of this built-in functor as a 2-tuple where MAP._1 acts on objects and MAP._2 acts on functions. I'll just use MAP going forward for either though, slightly abusing notation.

MAP = ( x=>Some(x), f => (x => x map f) )

As an aside, if I had my way Function1 would come with a map method so the morphism mapper f => (x => x map f) would look simpler, but it doesn't matter so long as we remember that morally speaking, the morphism-mapping part of the functor is just 'map'. When I say that map is 'built in' I really mean built in to the collections libraries, incidentally, because map is provided in the scala.collection.Traversible and scala.collection.Iterable traits and all collections are of one of these two types. Thus for all intents and purposes the combination (constructor, map) is an "off the shelf" functor although yes, for the third time, there are indeed plenty of morphisms other than functions - partial orders, paths on graphs et cetera - I'll let it go if you will.

What is the domain of MAP? My guess is that we should extend it to include not just DOUBLE but SOME-DOUBLE. Also SOME-SOME-DOUBLE, SOME-SOME-SOME-DOUBLE and so on ad infinitum. Call that big union the category Omega so that MAP is, in math speak, an endofunctor because it maps the category Omega into itself. We are getting a little closer to the mathematical definition now, I do hope, and here comes the last little jog.

A natural (and obvious) transformation between the built-in functor MAP applied once and the built-in functor MAP applied twice

To humor me apply MAP twice to get MAPMAP = MAP compose MAP, a functor that maps points x => Some(Some(x)) and of course messes with morphisms as well. For example, consider the function f(x) = x*x+3. MAP takes this morphism into another morphism which in turn happens to map Some(x) into Some(Some(x*x+3)). On the other hand MAPMAP would take f into a different morphism taking Some(x) into Some(Some(Some(x*x+3))). This just emphasizes the fact that MAP and MAPMAP are not the same thing, but they are dreadfully close.

The built-in functor MAP takes points to points and functions to functions

In fact with apologies for my handwriting and the mixing of mathematical and Scala notation on this diagram, MAP and MAPMAP are so similar that when applied to a particular morphism f(x)=x*x+2 they result in very similar looking functions g(x) and h(x). In fact g and h "are" the same Scala function - the very same pattern matching snippet of code I have scrawled in the picture! Needless to say the images of points a and b look quite similar as well. It is awfully tempting to relate MAP and MAPMAP by squishing Some(Some(x)) back to Some(x). The nice thing is, it costs us little to be precise.

In category theory there is a notion of a natural transformation between two functors that respects composition of morphisms. If we were category theorists we'd probably say "oh I suppose there must be a natural transformation from MAPMAP to MAP". I did not, come to think of it, have that exact experience myself, admittedly, because natural transformations are natural in some respects but also an abstract head f@#k the first time you come across them (they relate two functors, each of which in turn map functions to functions). The moral of the story: I think we should talk more about natural transformations to make us better people.

Actually we'll only talk about the one natural transformation mentioned: MAPMAP to MAP. As the definition on wikipedia explains (with F=MAPMAP and G=MAP, and C=D=Omega), this amounts to finding what is known as a component function (math speak) that lifts h=MAP(MAP(f)) onto g=MAP(f). That is straighforward because, translating back to Scala speak,

h andThen flatten = flatten andThen g

You couldn't hope for a more "obvious" commutative diagram that this - which is a good sign because in category theory many things are supposed look obvious after untangling of definitions or chasing of diagrams. We only need the top half of the diagram here, incidentally, and I don't think I need to define flatten. Even Scala newbies could implement the bit of it we need via pattern matching: x => x match {Some(y) => y}.

Flatten is the component realizing the natural transformation from MAP MAP to MAP (top half)
whereas Some is the component realizing the natural transformation from Id to MAP (bottom half)

Thus natural transformations are sometimes quite friendly, it would seem, and there is an even more trivial one lurking here in the bottom half of the diagram, namely the natural transformation that takes the identity functor into MAP. That is realized by a component that happens to translate into a constructor (programming speak, of course) x => Some(x). Evidently:

f andThen Some = Some andThen g

so the bottom half commutes.

Finally, the connection to mathematical monads

We're done. If you skip down the wikipedia monad page to the formal definition, ignoring the stuff at the start about adjoint functors which is just confusing, you'll see that a monad comprises a functor F and two natural transformations, one taking the identity functor Id to F and the other taking F compose F to F. The natural transformations are defined by their components which are, respectively, the constructor Some and the squishing operation flatten. To summarize, here is how we would relate the definition of a mathematical monad on Wikipedia to the day-to-day notion of a monad in programming Scala

Mathematical monad ingredients	Example of a common monad
Two categories C,D	One category comprising a an infinite union of similar types C=D=Omega=Union(DOUBLE,SOME-DOUBLE,SOME-SOME-DOUBLE,...)
A functor between the categories F: C->D	The "built-in" functor MAP taking points x=>Some(x) and functions f => (x => x map f)
A natural transformation taking the identify functor Id to F	A natural transformation taking the identify functor Id to MAP
A "component" realizing said natural transformation	A constructor x => Some(x) that "wraps" up x
A second natural transformation taking F F -> F	A natural transformation from MAP MAP to MAP
A second "component" realizing the second natural transformation	A flatten operation x => x match {Some(y)=>y} that unwraps y

As the mathematicians remind us there are some things to check, known as coherence conditions, but I leave that to the reader.

What's the point?

Now here we are transforming functors when what we are looking for is as elementary as unwrapping a burrito. One might argue that understanding the abstraction is no harder than noticing the essential similarity between any two cases where this patterns arises and any good programmer should be on the lookout for that sort of thing. They should also be looking to communicate it to those maintaining code long after they have written it, and make that maintenance as simple as possible. Abstraction is next to cleanliness, as they don't really say.

If one is going to talk loosely about "conserving structure" one might as well talk rigorously about it if it isn't all that much harder. And when you do, there are certainly some nontrivial things that fall out of category theory (for another time). No wonder the seemingly arcane area, once considered so obscure and abstract as to be useless, has made a charge up the Google n-gram leaderboard in recent times.

Popularity of "category theory" and "functional programming"

Still, I think most people would prefer to keep it a little wooly and merely recognize that we have an obvious identification between a doubly constructed type like List(List(3,4)) and one level down List(3,4), and that this identification needs to be carried over to morphisms - sorry functions. That is why you would naturally write flatmap yourself as the need arose, but might not necessarily appreciate the similarity to ComputeThis(ComputeThis(blah )) or ParseMe(ParseMe( blah )) or AbstractMe(AbstractMe( blah )) and so on. Actually Option is kind of like a partial computation too. You can think of Some(x) and Some(Some(x)) as calculations in stasis, if you will, although the only thing that "remains" to be calculated is whether x exists or not. My goodness, it always does!

One might wax mathematical in the hope of achieving a higher state of monadicness. Perhaps, for example, we "understand'' what MAP is doing because we root our intuition to pattern matching, kind of. If I tell you that g = (Some(x)=>Some(x*x+3)) you say "oh I get it, that is like x=>x*x+3 lifted up into Option[Double]. And if I tell you that h = (Some(Some(x))=>Some(Some(x*x+3)) you say "oh I get it, that is like x=>x*x+3 lifted all the way to Option[Option[Double]]". And you won't complain if I point out that g and h are the same, at least on some domain, because they are rooted in the same thing, the naked function x=>x*x+3. Perhaps the obviousness prevents us from seeing the top half of the second diagram independently of the bottom, however.

Footnote: The built-in flatMap

It may seem odd that we have discussed monads in both programming and mathematics and I made only passing reference to flatMap, the built-in "bind" for Scala collections (defined here and here for Traversable and Iterable traits respectively). For another time, I guess.

Compositional models and their relation to Bayesian networks

2013-07-02T19:39:00.000-07:00

Some time ago I noticed that Radim Jirousek published a foundational paper bringing together results on compositional models. A subclass of compositional models are an alternative formulation of Bayesian networks.

The basic idea goes as follows. Suppose $\pi$ is a joint distribution on variables $x_1$ and $x_2$ specified by the following table (see the linked paper page 620).

$\pi$	$x_1=0$	$x_1=1$
$x_2=0$	$\frac{1}{2}$	$\frac{1}{2}$
$x_2=1$	0	0

and further suppose $\nu$ is the uniform distribution on variables $x_1$ and $x_3$:

$\pi$	$x_1=0$	$x_1=1$
$x_2=0$	$\frac{1}{2}$	$\frac{1}{2}$
$x_2=1$	0	0

Then we define $$ (\pi \rhd \nu)(x_1,x_2,x_3) = \frac{ \pi(x_1,x_2) \nu(x_2, x_3) } { \nu^{\downarrow(2)}(x_2) } $$ where the down arrow indicates a marginal distribution. This division may not always be defined. For example the composition $\pi \rhd \nu(x)$ is defined for all combinations of $x_1$,$x_2$ and $x_3$ whereas the composition $\nu \rhd \pi(x)$ is defined for only some, as in the table below which can be compared to that on page 630 of Jirousek's paper.

$x_1$	$x_2$	$x_3$	$\pi \rhd \nu(x)$	$\nu \lhd \pi(x)$
false	false	false	Some(0.25)	Some(0.125)
false	false	true	Some(0.25)	Some(0.125)
false	true	false	Some(0.0)	None
false	true	true	Some(0.0)	None
true	false	false	Some(0.25)	Some(0.125)
true	false	true	Some(0.25)	Some(0.125)
true	true	false	Some(0.0)	None
true	true	true	Some(0.0)	None

Conveniently Scala allows function names like |> and <|, and the monadic representation of results. Here None means undefined, and a finer point about the implementation below is that I might have used a ternary operator to resolve 0*0/0=0, as per Jirousek's convention.

Anyway, here's the code

A little addendum to the compendium of conjugate priors

2013-06-20T19:45:00.002-07:00

The excellent "Compendium of Conjugate Priors" by Daniel Fink contains a little typo in equation (122) for the posterior of Athreya's conjugate distribution to the normal. Made a note here lest it trip someone else up. This arises when a modified inverse gauss distribution looking something like a gamma distribution $$ P(\nu; \alpha, \beta, \gamma) \propto \nu^\alpha e^{-\frac{\beta}{\nu} - \gamma \nu} $$ is used as the marginal distribution for the variance parameter of a normal distribution whose mean and variance are unknown; and when the conditional mean of that distribution is given by $$ P(m ; \nu, \alpha,\beta, \gamma, \tau, \mu) \propto \frac{1}{\sqrt{\nu}} e^{ \frac{1}{\nu \tau}(m-\mu) } $$ to create a five parameter prior distribution for both mean $m$ and variance $\nu$ of the normal distribution parameters proportional to $$ P(m,\nu ; \alpha,\beta, \gamma, \tau, \mu ) \propto \nu^{\alpha-\frac{1}{2}} e^{-\frac{\beta}{\nu} - \gamma \nu} \ e^{ \frac{1}{\nu \tau}(m-\mu) } $$ Upon observation of $n$ independent samples $\{x_i\}_{i=1}^n$ presumably drawn from said distribution the posterior distribution takes the same form provided we update the parameters as follows \begin{eqnarray*} \alpha' & = & \alpha - \frac{n}{2} \\ \beta' & = & \beta + \frac{1}{2} \sum_{i=1}^n (x_i-\bar{x})^2 + + \frac{n}{2}\frac{ (\bar{x} -\mu)^2}{1 + n \tau} \\ \mu' & = & \frac{\mu + \tau n \bar{x}}{1 + n\tau} \\ \tau' & = & \frac{\tau}{1+n\tau}\\ \gamma' & = & \gamma \end{eqnarray*} where $\bar{x}$ is the mean of $\{x_i\}_{i=1}^n$. The mentioned paper contains a mistake in the update for $\beta$.

Conditioning a multivariate normal vector on partial, noisy gaussian evidence

2012-12-29T16:08:00.000-08:00

'cause I'm tired of re-deriving this stuff!

Lemma 1

Suppose we partition the mean and covariance of a gaussian vector $y$ as $$ y = \left[ \begin{array}{c} y_1 \\ y_2 \end{array} \right] \sim N \left( \left[ \begin{array}{c} u_1 \\ u_2 \end{array} \right] , \left[ \begin{array}{cc} C_{11} & C_{12} \\ C_{21} & C_{22} \end{array} \right] \right) $$ then the distribution of the latter part of the vector $y_2$ conditioned on the former taking a known value $y_1=a_1$ is multivariate gaussian with mean and variance indicated below: $$ y_2 \mid y_1=a_1 \sim N \left( u_2 + C_{21} C_{11}^{-1}(a_1-u_1), C_{22}-C_{21}C_{11}^{-1}C_{12} \right) $$ We note this here for convenience. See Wikipedia notes on conditional multivariate gaussian, for example.

Lemma 2a

Suppose $x$ is partitioned as $$ x = \left[ \begin{array}{c} x_1 \\ x_2 \end{array} \right] \sim N \left( \left[ \begin{array}{c} \mu_1 \\ \mu_2 \end{array} \right] , \left[ \begin{array}{cc} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{array} \right] \right) $$ If we hit $x$ with a linear transformation and also add another gaussian random variable, viz: $$ y = \left[ \begin{array}{cc} I & 0 \\ I & 0 \\ 0 & I \end{array} \right] x - \left[ \begin{array}{c} \eta \\ 0 \\ 0 \end{array} \right] $$ where $\eta$ is the multivariate normal random vector $$ \eta \sim N \left(\tilde{\mu}_1 - \tilde{a}_1, \tilde{\Sigma}_{11} \right) $$ with offset mean parameter $ \tilde{\mu}_1$ and variance $\tilde{\Sigma}_{11}$ (and the use of the offset $\tilde{a}_1$ will become apparent), then by properties of affine transformation (see Wikipedia again) and addition of multivariate random variables we have $$ y \sim N\left( \left[ \begin{array}{c} \mu_1 - \tilde{\mu}_1 + \tilde{a}_1 \\ \mu_1 \\ \mu_2 \end{array} \right], \left[ \begin{array}{ccc} \Sigma_{11} + \tilde{\Sigma}_{11} & \Sigma_{11} & \Sigma_{12} \\ \Sigma_{11} & \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{12} & \Sigma_{22} \end{array} \right] \right) $$

Lemma 2b

With $x,y$ as in Lemma 2a, let us reuse notation for the second and third blocks of $y$, i.e. $x_1$ and $x_2$, since the transformation evidently leaves them untouched. We write $\tilde{x}_1$ for the part that is changed: $$ y = \left[ \begin{array}{c} \tilde{x}_1 \\ x_1 \\ x_2 \end{array} \right] $$ But now if we condition on $\tilde{x}_1=\tilde{a}_1$ then the conditional distribution of the second and third parts of $y$ will of course change. It is gaussian by application of Lemma 1 and has mean, covariance given by $$ \left[ \begin{array}{c} x_1^{+} \\ x_2^{+} \end{array} \right] \sim N\left( \left[ \begin{array}{c} \mu_1 \\ \mu_2 \end{array} \right] + \left[ \begin{array}{c} \Sigma_{11} \\ \Sigma_{21} \end{array} \right] \left( \Sigma_{11} + \tilde{\Sigma}_{11}^{-1} \right) \left( \tilde{\mu}_1 - \mu_1 \right), \Sigma - \left[\begin{array}{c} \Sigma_{11} \\ \Sigma_{21} \end{array} \right] \left( \Sigma_{11} + \tilde{\Sigma}_{11} \right)^{-1} \left[ \Sigma_{11} \Sigma_{21} \right] \right) $$

Interpretation of Lemmas 2a and 2b

The formulas have the following interpretation. Suppose we assume $x$ has some prior distribution $$ x = \left[ \begin{array}{c} x_1 \\ x_2 \end{array} \right] \sim N \left( \left[ \begin{array}{c} \mu_1 \\ \mu_2 \end{array} \right] , \left[ \begin{array}{cc} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{array} \right] \right) $$ and we now condition on $x_1 -\nu = 0$ where $ \nu \sim N \left(\tilde{\mu}_1, \tilde{\Sigma}_{11} \right) $. Then the posterior distribution of $x = [x_1\ x_2]'$ is precisely the multivariate gaussian vector $x^{+} = [x^{+}_1 \ x^{+}_2]'$ given above. It has the interpretation of a posterior distribution when one part of the vector is conditioned not on a precise vector of values, but a noisy one as with Kalman filtering. In the calculation above we conditioned on $x_1 - \eta = \tilde{a}_1$ where $ \eta \sim N \left(\tilde{\mu}_1 - \tilde{a}_1, \tilde{\Sigma}_{11} \right) $, but that is evidently the same thing as conditioning on $x_1 = \nu $ where $ \nu \sim N \left(\tilde{\mu}_1 , \tilde{\Sigma}_{11} \right) $.
We remark that if $\tilde{\Sigma}_{11}$ is relatively small with respect to $\Sigma_{11}$ then this more or less forces one part of $x$ to adopt a new mean and variance $\tilde{\mu}_1,\tilde{\Sigma}_{11}$. For example in the scalar case with $ \Sigma_{11} = \sigma_{11} $ and $ \tilde{\Sigma}_{11} = \tilde{\sigma}_{11} $ the posterior mean of $x_1$ is $\tilde{\mu}_1- \frac{1}{1+\tilde{\sigma_{11}}/\sigma_{11}}(\tilde{\mu_1}-\mu_1) $ and this tends to the new, prescribed mean $\tilde{\mu}_1$ as $ \tilde{\sigma}_{11}/\sigma_{11}$ tends to zero.

I'm missing something about contragredient transformations and portfolio return

2012-12-29T16:07:00.003-08:00

This entry probably just indicates that I am missing something obvious about portfolio return.
The instantaneous return $\gamma^{\pi}_t$ on a portfolio with weights $\pi$ decomposes as $$ \gamma^{\pi}_t = \left< \frac{dZ_t}{Z} \right>_t = \pi_t^\top \gamma_t + \underbrace{\frac{1}{2} \left( \pi_t^\top diag(\xi \xi^{T}) - \overbrace{\pi^{T}_t \xi \xi^{T} \pi_t }^{portfolio\ variance} \right)}_{excess\ return\ process} $$ where $\gamma_t^i$ is the drift of $\log(X^i_t)$, the log of the $i$'th asset and the $i$'th row of $\xi$ represents the factor decomposition of the diffusion into $n$ independent Brownian motions comprising a vector $dW_t$. $$ d (log(X^i_t)) = \gamma^i_t dt + \xi dW_t $$ Consider the map taking vectors $x \mapsto y = \exp(C \log(x)) $ and thereby defining a new asset vector $Y_t$. That is, $Y_t$ is related to $X_t$ by a simple matrix multiplication in log coordinates.
We consider portfolios of the new assets $Y^i$ with weights $\varpi_i$ say. And again, the portfolio return is related to the asset return via $$ \frac{dZ^\varpi}{Z^{\varpi}} = \varpi^\top \frac{dY_t}{Y_t} $$ where the fractions indicate coordinate-wise (i.e. pointwise) division. Let us suppose further that any instantaneous portfolio return using assets $\{X^i\}$ can be replicated by a portfolio using assets $\{Y^i\}$ only, and vice versa. $$ \varpi^\top \frac{dY_t}{Y_t} = \frac{dZ^\varpi}{Z^{\varpi}} = \frac{dZ^\pi}{Z^{\pi}} = \pi^\top \frac{dX_t}{X_t} $$
By a slightly heavy handed application of the multivariate Ito's Lemma (hey it's good to have it lying around) with $g(x) = C x$ we have $$ \begin{eqnarray} d (log Y_t) & = & \frac{\partial g}{\partial t} dt + \left(\nabla g\right)^\top d \left( \log (X_t) \right) + \frac{1}{2} \left(d(\log X_t)\right)^\top \left(\nabla^2 g \right) \left(d(\log X_t)\right) \\ & = & 0 + C d \left(\log (X_t)\right) + 0 \\ & = & C\gamma dt + C\xi dW_t \\ \end{eqnarray} $$ so writing the decomposition of portfolio return with $C\gamma$ in place of $\gamma$ and $C\xi$ in place of $\xi$ we observe $$ \begin{eqnarray} \left< \frac{dZ^{\varpi}_t}{Z^{\varpi}} \right>_t & = & \varpi_t^\top C\gamma_t + \frac{1}{2} \left( \varpi_t^\top diag(C\xi \xi^{\top}C^\top) - \varpi^\top_t C \xi \xi^\top C^\top \varpi_t \right) \\ & = & \pi_t^\top \gamma_t + \frac{1}{2} \left( \pi_t^\top (C^{-1})^\top diag(C\xi \xi^{\top} C^\top) - \pi^\top_t \xi \xi^\top \pi_t \right) \end{eqnarray} $$ if we use contragredient weights $\varpi_t = (C^\top)^{-1} \pi_t$. But does the contragredient choice actually result in the same drift? The linear and portfolio variance terms are the same, but on the other hand $$ \pi_t^\top (C^{-1})^\top {\rm diag}(C\xi \xi^{\top} C^\top) \not= \pi_t^\top {\rm diag}\left(\xi \xi^{\top} \right) $$

When A Bar-Bell Bond Portfolio Optimizes Modified Excess Return

2012-12-29T16:06:00.000-08:00

Here's a litte curiosity I might get around to publishing some day: a bar-bell portfolio maximizes modified excess return. At first I thought it maximized excess return (in the sense of Stochastic Portfolio Theory) but felt that couldn't be right. Sure enough I made a mistake and thus was born "modified excess return", a shamelessly reverse engineered criteria to make what follows work.

A simple model for zero coupon bond dynamics

Assume a lattice of zero coupon bonds with prices $B^i(t) = B(t;t+\tau^{i})$ and integer time to maturities $\tau^{i}=i$ as $i$ ranges from $1$ to $n$ years. We assume that all bonds are priced off the same piecewise constant forward curve with knot points also at integer years. We write $$ B^{i}(t) = \exp\left(- \int_t^{t+i} f(t,s) ds \right) $$ and assume further that the changes in forward rates $ f(t,s) $ at time $t$ for different years are independent. We presume the forward rates are driven by standard Brownian motion with the same standard deviation $\eta$. They may also have non-trivial drift but here it suffices to observe that the vector of bonds has dynamics given by $$ d \left[ \begin{array}{c} log B^{1}(t) \\ log B^{2}(t) \\ \dots \\ log B^{n}(t) \end{array} \right] = \left[ \begin{array}{c} \gamma^1(t) \\ \gamma^2(t) \\ \vdots \\ \gamma^n(t) \end{array} \right] \ dt + \eta \left[ \begin{array}{cccc} 1 & 0 & \dots & 0 \\ 1 & 1 & \dots & 0 \\ \vdots & \vdots & \ddots & 0 \\ 1 & 1 & 1 & 1 \end{array} \right] \left[ \begin{array}{c} dW^1(t) \\ dW^2(t) \\ \dots \\ dW^n(t) \end{array} \right] $$ or more succinctly $$ d (log B) = \gamma \ dt + \eta J \ dW $$ for scalar constant $\eta$, an $n$ by $n$ matrix $J$ (implicitly defined by the above) and some drift coefficients $\gamma$ that we don't care too much about in this particular exercise.

Defining modified excess return for a bond portfolio

We consider a portfolio of these bonds with weights $\pi$ summing to unity. By analogy with Stochastic Portfolio theory we consider a modified excess return given by $$ {modified\ excess\ return} = \sum_{i=1}^n \pi_i \sigma_{ii} - 2 \sum_{i,j=1}^{n} \pi_i \pi_j \sigma_{ij} + \sum_{i=1}^n \pi_i^2 \sigma_{ii} $$ where, following Stochastic Portfolio Theory notation, $\sigma_{ij}$ is the log-asset covariance, here equal to $\eta^2$ multiplied by the $i$,$j$'th element of $J J^{\top}$. We make no statement as to what modified excess return represents, except to compare it to $$ excess\ return = \sum_{i=1}^n \pi_i \sigma_{ii} - \sum_{i,j=1}^{n} \pi_i \pi_j \sigma_{ij} $$ which is most certainly meaningful. Indeed, the log-optimal investor may seek to maximize excess return. In contrast, the modified excess return makes the covariance term more important so one might reason that, all else being equal, choosing this modification over the bone fide excess return represents a sacrifice of long term growth in exchange for reduced portfolio variance - though the difference picks up the between-asset terms only, not the variances. $$ \begin{eqnarray*} modified\ excess\ return - excess\ return & = & - \sum_{i,j=1}^{n} \pi_i \pi_j \sigma_{ij} + \sum_{i=1}^n \pi_i^2 \sigma_{ii} \\ & = & -\sum_{i \not= j} \pi_i \pi_j \sigma_{ij} \end{eqnarray*} $$ A similar tradeoff is made, also inadvertently, by those constructing minimum variance portfolios in the tradition of Markowitz and de Finetti. In the minimum variance prescription only the portfolio variance term $\sum_{i,j=1}^{n} \pi_i \pi_j \sigma_{ij}$ is contemplated, not $\sum_{i=1}^n \pi_i \sigma_{ii}$.

Maximizing the modified excess return

Whatever the modification contemplated may imply, we proceed towards its surprising implication by mentally multiplying $J$ above and noting that $ (J J')_{i,j} = min(i,j) $ because $$ \left[ \begin{array}{cccc} 1 & 0 & \dots & 0 \\ 1 & 1 & \dots & 0 \\ \vdots & \vdots & \ddots & 0 \\ 1 & 1 & 1 & 1 \end{array} \right] \left[ \begin{array}{cccc} 1 & 1 & \dots & 1 \\ 0 & 1 & \dots & 1 \\ \vdots & \vdots & \ddots & 1 \\ 0 & 0 & 0 & 1 \end{array} \right] = \left[ \begin{array}{cccc} 1 & 1 & \dots & 1 \\ 1 & 2 & \dots & 2 \\ \vdots & \vdots & \ddots & \vdots \\ 1 & 2 & \dots & n \end{array} \right] $$ Thus in this slightly contrived forward rate model we have modified excess return proportional to $$ \psi(\pi) = \sum_{i=1}^n i \pi_i - 2\sum_{i,j=1}^{n} min(i,j) \pi_i \pi_j + \sum_{i=1}^n i \pi_i^2 $$ This leaves us with a cute little optimization. We claim that $\psi(\pi)$ and hence the modified excess return is maximized, subject to $\sum_{i=1}^n \pi_i = 1$, by setting the portfolio equal to a barbell. Half the portfolio is invested in the first (shortest maturity) bond and the other half on the last (longest maturity) bond. That is, $$ \pi^* = \left[ \begin{array}{c} 1/2 \\ 0 \\ \vdots \\ 0 \\ 1/2 \end{array} \right] $$ corresponding to a modified excess return of $\psi(\pi^*) = \frac{n-1}{4}$. To prove this observe that $\psi(\pi)$ can be re-written as follows (count the number of times each $\pi_i$ and $\pi_i \pi_j$ occurs) $$ \begin{eqnarray*} \psi(\pi) & = & - 2 \sum_{i,j=1}^{n} min(i,j) \pi_i \pi_j + \sum_{i=1}^n i \pi_i^2 + \sum_{i=1}^n i \pi_i \\ & = & -(\pi_1 + \pi_2 + ... + \pi_n)^2 + (\pi_1 + \pi_2 + ... + \pi_n) \\ & & - (\pi_2 + ... + \pi_n)^2 + (\pi_2 + ... + \pi_n)\\ & & \vdots \\ & & - (\pi_{n-1}+\pi_{n})^2 + (\pi_{n-1}+\pi_{n}) \\ & & - (\pi_n)^2 + \pi_n \\ & = & \sum_{i=0}^{n-1} (-u_i^2 + u_i) \\ & = & \sum_{i=1}^{n-1} (-u_i^2 + u_i) \\ & = & \sum_{i=1}^{n-1} \left(-(u_i-1/2)^2+1/4 \right) \\ & = & \frac{n-1}{4} - \sum_{i=1}^{n-1} (u_i-1/2)^2 \end{eqnarray*} $$ where we have introduced $u_i = \sum_{j=i+1}^n \pi_j $ as the sum of portfolio weights leaving out the first $i$, and applied the constraint $u_0=1$. The expression is clearly maximized by setting $u_1 ... u_n$ equal to $1/2$. By back substitution beginning with $\pi_n$ this implies $\pi = \pi^*$ as claimed.

Excess return on a portfolio of lognormal assets

2012-12-29T16:05:00.000-08:00

Here's my attempt at "Stochastic Portfolio Theory in a Nutshell", for those who, like me, hadn't noticed a bug in Markowitz Theory.

Stochastic Portfolio Theory considers the decomposition of the instantaneous return on a continuously rebalanced portfolio into the instantaneous returns of constituent assets \begin{equation} \frac{ d Z_t} {Z_t} = \pi_t^\top \frac{ dX_t}{X_t} \label{portfolio} \end{equation} Here $Z_t$ is the value of the portfolio, $\pi_t$ is a vector of portfolio weights, the fractions indicate pointwise (i.e. pathwise) division and $dX_t=(dX^1_t,...,dX^n_t) $ is a $n x 1$ vector of stocks with lognormal dynamics: $$ d \left(log(X_t\right)_{n x 1} = \left( \gamma_t \right)_{n x 1} dt + \left( \xi \right)_{n x n} \left(dW_t\right)_{n x 1} $$ where $dW_t$ is a vector and we've emphasized the dimensions throughout. Note that by Ito's Lemma we have $$ \frac{ dX_t}{X_t} = \left( \gamma_t + \frac{1}{2} diag \left( \xi \xi^{T} \right) \right) dt + \left( \xi \right)_{n x n} dW_t $$ where $diag$ extracts a vector of diagonal entries. So if we mentally hit this on the left with the transpose of the portfolio weights $\pi^\top$ we can see that the right hand side of our instantaneous return equation will have a Brownian term $\pi_t^{T} \xi dW_t$ and a drift that we'll get back to momentarily. And if we are on the ball we'll remember that $ \frac{dZ}{Z} $ and $d(\log Z)_t $ have the same Brownian terms. Thus $ d(\log Z)_t $ will also be driven by $\pi_t^{T} \xi dW_t$ and can write for some as yet unspecified drift $\gamma^{\pi}$ $$ d (\log Z_t) = \gamma^{\pi}_t dt + \pi_t^{T} \xi dW_t\ $$ To clean this up we apply Ito's Lemma again to retrieve $ Z_t = \exp(\log(Z_t))$ and thereby: $$ \frac{dZ_t}{Z_t} = \left\{ \gamma^{\pi}_t + \frac{1}{2} \pi^{T}_t \xi \xi^{T} \pi_t \right\} dt + \pi_t^{T} \xi dW_t $$ which reveals the drift term on the left hand side of the instantaneous return equations expressed in terms of a highly relevant quantity: the drift of the logarithm of portfolio wealth. Indeed we call $\gamma^{\pi}_t$ the portfolio growth process. And we observe the important equality appearing in too few investment textbooks, if any beside Bob's: $$ \gamma^{\pi} = \pi_t \cdot \gamma_t + \underbrace{\frac{1}{2} \left( \pi_t \cdot diag(\xi \xi^{T}) - \overbrace{\pi^{T}_t \xi \xi^{T} \pi_t }^{portfolio\ variance} \right)}_{excess\ return\ process} $$ Thus in log space we might say that the portfolio growth is the linear combination of the growth in individual stocks plus the term involving curly braces. We refer to that additional kick as the excess growth rate. And we further observe that it decomposes into the difference between the weighted combination of stock variances and the portfolio variance process, denoted $$ \sigma^{\pi\pi}_t = \pi^{T}_t \xi \xi^{T} \pi_t $$ That's it for now. We note that if one is interested in the return on the logarithm of one's portfolio then the full decomposition into linear and excess return is obviously more pertinent than the linear term alone. And we see why minimizing portfolio variance subject to a known linear return is not, despite its significant popularity, the most relevant exercise.

The decomposition is useful independent of the investors utility function.

For more see Dr Fernholz's book on Amazon, of this summary paper.

Region	\(\lambda^*_t\)	Optimal \(t\)	\( \log E[ e^{X_t} ] \)	Strategy
\(b < 0\)	0	\(\infty\)	\( \frac{1}{2}\)	"Reset"
\(0 < b < 1\)	b	\(-\frac{1}{\kappa} \log( b )\)	\(\frac{1}{2}\left( b^2+1 \right) \)	"Explore"
\( b > 1\)	1	\(0\)	\(b \)	"Stay"

\(\pi\)	\(x_1=0\)	\(x_1=1\)
\(x_2=0\)	\(\frac{1}{2}\)	\(\frac{1}{2}\)
\(x_2=1\)	0	0

\(\pi\)	\(x_1=0\)	\(x_1=1\)
\(x_2=0\)	\(\frac{1}{2}\)	\(\frac{1}{2}\)
\(x_2=1\)	0	0

\(x_1\)	\(x_2\)	\(x_3\)	\(\pi \rhd \nu(x)\)	\(\nu \lhd \pi(x)\)
false	false	false	Some(0.25)	Some(0.125)
false	false	true	Some(0.25)	Some(0.125)
false	true	false	Some(0.0)	None
false	true	true	Some(0.0)	None
true	false	false	Some(0.25)	Some(0.125)
true	false	true	Some(0.25)	Some(0.125)
true	true	false	Some(0.0)	None
true	true	true	Some(0.0)	None