Tuesday, July 23, 2013

The Boy or Girl Paradox: A Martingale Perspective.

Mr Smith always tells the truth.

Interviewer: Mr Smith, how many children do you have?
Mr Smith: I have two.
Interviewer: Mr Smith, do you have a son?
Mr Smith: Yes

Q1: What is the probability that Mr Smith's other child is also a boy?

This is a clean version of the boy or girl paradox. The correct answer is 1 in 3, as a simple application of Bayes' Rule establishes. Given forty families, thirty will have at least one boy. Of those thirty, ten will have two boys.

Slightly different evidence would result in an answer of 1 in 2. For example:

Interviewer: Mr Smith, how many children do you have?
Mr Smith: I have two.
Interviewer: Mr Smith, is your oldest child a boy?
Mr Smith: Yes

Q2: What is the probability that Mr Smith's other child is also a boy?

The answer to the second question is clearly 1 in 2.

The boy or girl paradox rises to the status of a controversy because many people will answer 1 in 2 in circumstances other than this as well (as pointed out by a reader - see comments below). However both Q1 and Q2 are, I think, entirely clear. The first situation results in a lower posterior probability.

Now don't ask me why I was looking at this right now but I got to wondering, is there an even easier way to see the inequality between the two answers?  Bayes Rule is pretty simple when you write it down but humans are notoriously bad at using it semi-consciously, if you know what I mean. Scanning the wikipedia explanation left me a bit unsatisfied. So here's another angle.

                                The Martingale Argument (Informal)

Suppose you had wagered that Mr Smith had two boys. You paid $1 and you'll receive $4 if he has two boys. It's an investment. What evidence would make you happier about your investment? Learning that at least child of two is a boy, or learning that at least one child of one is a boy? The latter is a priori less likely, and therefore better news.

Now to polish it off, note that in the former case, betting directly on Mr Smith's first answer would turn $1 into $4/3 because fair odds are 3:1 on. But had the news been bad (i.e., none of the two children were boys) our parlay would be dead in the water. Since we are betting with fair odds our wealth is a martingale. To get to $4 you we still have to have to increase your wealth threefold. So it must be that the odds of betting on two children being boys given that you already know that at least one is a boy correspond to a threefold increase in wealth. Thus the probability is 1 in 3.

                                 The Martingale Argument (Formalized)

A reader has asked me to tighten the argument and relate to the martingale notion in mathematics. My title references the fact that when we face a bookmaker who offers fair odds the expected value of our wealth will not change at the moment we enter any bet with him. Our wealth is a martingale no matter what betting strategy we employ. Try as we might, we cannot win or lose money on average.

A martingale is defined with respect to both a stochastic process (here our wealth) and a sequence of information sets (representing information arriving, such as whether Mr Smith has at least one boy). If \(\cal{F}_{t}\) represents the information available to us at time \(t\) and \(\cal{F}_{\tau}\) the information at an earlier time then the statement that our wealth is a martingale is written: \begin{equation} \label{martingale} W_{\tau} = E \left[ W_{t} | \cal{F}_{\tau} \right] \end{equation} In general both sides of the equation are random variables but to warm up with a special case, take \(\tau=T\) where \(T\) is the time we learn whether Mr Smith has two boys or not, and set \(t\) equal to the time before we know anything. Our initial wealth is not random so the left hand side of \begin{equation} W_t = E \left[ W_T | \cal{F}_{t} \right] \end{equation} is just 1. If we further assume we make an all-or-nothing bet with only two outcomes this reads \begin{equation} 1 = p W^{win} \end{equation} where \(W^{win}\) is our wealth conditional on winning (i.e. the payout). This is just the statement that for a fair bet the payout is inversely proportional to the probability of winning. Since \(p=0.25\) we have \(W^{win}=4\).

A somewhat more interesting use of the martingale property considers a time \(\tau\) just after we learn some information. If we again take \(T\) to be the "terminal" time just after we learn whether Mr Smith has two boys or not we have \begin{equation} W_{\tau}(\omega) = E \left[ W_T | \cal{F}_{\tau} \right](\omega) \end{equation} where I have emphasized that both sides are random variables. If we take an expectation with respect to \(\cal{F}_{t} \), the information we have when we make the bet, we'll get the third inequality below: \begin{equation} 1 = W_t = E \left[ W_{\tau} | \cal{F}_{t} \right] = E \left[ E \left[ W_T | \cal{F}_{\tau} \right]| \cal{F}_{t} \right] \end{equation} Here the first equality and the second we have seen. The first is our assumption that we start with one dollar. The second equality is the martingale property.

These expressions become quite trivial in the case where the first revelation of information will result in one of two outcomes and one of these outcomes is fatal for our bet (if we learn that none of Mr Smith's children are boys, or if we learn that the first child is not a boy then obviously we are out of the running). Indeed we have \begin{eqnarray} E \left[ E \left[ W_T | \cal{F}_{\tau} \right]| \cal{F}_{t} \right] & = & p^A E \left[ W_T | \cal{F}_{\tau} \right] \\ & = & p^A p^B W^{win} \end{eqnarray} where \(p^A\) is the probability of good news and \(p^B\) is the conditional probability of winning our bet at time \(T\) given that we received good partial information at time \(\tau\) (this second probability is what we are asked for). And noting the chain of equalities we therefore have \begin{equation} 1 = p^A p^B W^{win} \end{equation} and thus \begin{equation} p^B = \frac{1}{p^A W^{win} } \end{equation} My informal argument merely noted that \(W^{win}\) is a known quantity (=4) and thus \(p^B\) is inversely proportional to \(p^A\).

To make this explicit consider two scenarios. In scenario \(1\) we learn that at least one child is a boy. In case \(2\) we learn that the oldest child is a boy. Denote the corresponding probabilities of this news (good in both cases) by \(p^A_1\) and \(p^A_2\) respectively. Trivially \(p^A_2 = 0.5\). On the other hand \(p^A_1\) is the probability that at least one child out of two is a boy. This is true in 3 out of 4 cases so \(p^A_1 = 0.75\). It follows that \begin{eqnarray} p^B_1 & = & \frac{1}{p^A_1 W^{win} } = \frac{1}{ \frac{3}{4} 4 } = \frac{1}{3} \\ p^A_2 & = & \frac{1}{p^B_2 W^{win} } = \frac{1}{ \frac{1}{2} 4 } = \frac{1}{2} \\ \end{eqnarray} Thus the answer to the first question about Mr Smith is 1 in 3, whereas the answer to the second is 1 in 2.

Really it is trivial as the statement \(1 = p^A p^B 4\) is just a statement about fair odds for a parlay bet (where the winnings from a first bet are invested in another). You can always think of the expected value of your wealth for the original bet as equal to the actual cash you hold in the absolutely equivalent parlay bet.

Thursday, July 4, 2013

Keeping up with the Joneses. Falkenstein on the implications of relative status utility.

Eric Falkenstein's recent book The Missing Risk Premium argues that when it comes to modeling the investment universe and its participants, envy is more important than greed. Fund managers try to optimize their returns relative to an index. Individual's utility functions are closer to relative than absolute wealth metrics.

Falkenstein suggests a relative utility function based on the individual's wealth \(w_i\) and also the wealth of the only other agent, which we denote \(w_{-i}\). \begin{equation} \label{relative} U^r(w_i;a,w_{-i}) = U^a(w_i-w_{-i};a) \end{equation} Indexing the only two humans alive by by \(1\) and \(-1\) helps keep the symmetry. Here \(a\) is a coefficient of risk aversion, shared by both participants, and $$ U^a(x) = - e^{-a x} $$ is the familiar utility function, more usually applied with \(x=w_i\).

The author introduces two assets into this two agent economy. One is riskless with deterministic return \(\nu\) and one is risky with mean \(\mu\) and variance \(\sigma^2\). Normally we would imagine \(\mu = 1.2\) for example and \(\nu = 1.05\) perhaps. But as we'll see momentarily, Falkenstein observes that risky asset return will be exactly the same as the riskless asset return if both agents' utility is relative, as in (\ref{relative}) above.

From Neumann-Morgenstern to Markowitz

This is to be compared to the case of absolute utility where the risky asset experiences excess return proportional to the risk aversion parameter \(a\) and the variance of the risky asset \(\sigma^2\). Here is a quick refresher.

Recall that there is strong motivation for maximizing expected utility of some sort and weaker justification for the particular form above. Leaving that philosophy alone we certainly can say that so long as \(x\) is gaussian, there is a purely mathematical equivalence between the task of maximizing \(U^a(x)\) and the task of maximizing a rather famous expression where the naive mean is penalized by a variance term: \begin{eqnarray} \arg\max_x E[U(x)] & = & \arg\min_x \left( -E[U(x)] \right) \nonumber \\ & = & \arg\min_x \left(\log -E[U(x)] \right) \nonumber \\ & = & \arg\min_x \left(\log E[e^{-ax}] \right) \label{exp} \\ & = & \arg\min_x \left(-aE(x) + \frac{1}{2} a^2 Var[x]\right) \label{var} \\ & = & \arg\max_x \left(E(x) - \frac{1}{2} a Var[x] \label{blah} \right) \end{eqnarray} Unsurprisingly (\ref{blah}) with \(x=w\) representing wealth is sometimes taken as an ansatz for investing - subject to the gaussian assumption we need to get from (\ref{exp}) to (\ref{var}). In fact there is another, lesser appreciated caveat I have summarized in an earlier post, but it isn't germane to Falkenstein's point either. His objection relates more to what \(x\) should be: absolute utility with \(x=w_i\) or relative utility with \(x=w_i - w_{-i}\).

Absolute utility. Trading mean for variance.

We first review the former. There should be no first order change in utility for an agent transferring a small amount of his wealth from cash to stock.

For this to be true the increase in expected terminal wealth \(E(w)\) must balance the variance penalty term above (that is, the term \(\frac{1}{2} a Var[w]\)).

Without loss of generality the terminal wealth is \(w_1(\lambda_1) = (1-\lambda_1) y + \lambda_1 x\) where \(y=\nu\) represents cash. A small increase in portfolio holding from \(\lambda_1\) in the risky asset to \(\lambda_1 + \delta\) changes the expected terminal utility by \begin{equation} \label{equilibrium} \Delta U = \overbrace{ \delta ( \overbrace{\mu}^{risky} - \overbrace{\nu}^{riskless} ) }^{increased\ mean} - \overbrace{\frac{1}{2} a \sigma^2 \delta\lambda_1}^{increased\ variance\ penalty} = 0 \end{equation} where \(\mu = E[x]\) must exceed the riskless expectation \(\nu\) of cash. From this equation, expressing the fact that there is no marginal benefit in risk adjusted terms of increasing or decreasing exposure, the excess return of the risky asset in this two agent economy can be expressed: $$ \overbrace{\mu}^{risky} - \overbrace{\nu}^{riskless} = \frac{1}{2} a \sigma^2\lambda_1 $$ where importantly, the variance penalty is proportional to the risk aversion \(a\), the variance of the risky asset, but also how much the agent already holds in the risky asset (i.e. \(\lambda_1\).

Relative utility. No variance tradeoff exists.

But Falkenstein (pages 90 onwards) argues against the relevance of this traditional volatility tradeoff. We introduce a second participant in the economy so that our first agent can be jealous of them. The second participant holds \(\lambda_{-1}\) in the risky asset and the rest in cash. Under the posited relative utility assumption (\ref{equilibrium}) is now quite different, as it involves the second agent. $$ \overbrace{ \delta ( \overbrace{\mu}^{risky} - \overbrace{\nu}^{riskless} ) }^{increased\ mean} - \overbrace{\frac{1}{2} a Var[x]\delta \underbrace{(\lambda_1-\lambda_{-1})}_{= 0} }^{increased\ variance\ penalty} = 0 $$ The variance term vanishes since each agent is identical and therefore, in equilibrium, must hold the same amount of the risky asset. There is no excess return. We simply have: $$ \overbrace{\mu}^{risky} = \overbrace{\nu}^{riskless} $$ In other words, in the limit where keeping up with the Joneses is all that influences the decision making of participants, there is there is no reward for risk taking. That's intuitive, because there is no such thing as systematic risk. Participants might as well ride that rollercoaster because in the event of a stockmarket crash, say, they will still happy. Their neighbours will lose too.


The reader is of course referred to Erik's book and blog articles.

Tuesday, July 2, 2013

Valuing CSOs under exchangeable recovery using only matrix calculations

It is reasonably well appreciated that in a basket default swap (CSO) the present value of credit event payments and the present value for premium payments may be considered linear functions of a loss surface.

Present values are also linear functions of a default surface (in a very explicit form provided herein) provided the joint distribution of sequential defaults is exchangeable.

The exchangeable recovery assumption

By this we mean the following. Let \(R_k\) denote the stochastic recovery of reference entity \(k\). The recovery model is a joint distribution \( P(R_1 = r_1,..,R_n = r_n )\). It is said to be exchangeable if
\begin{equation} \label{exchangeable} P( R_1 = r_1,..,R_{n} = r_n ) = P( R_{\pi(1)} = r_1, .. ,R_{\pi(n)} = r_n ) \end{equation} for any permutation \(\pi\). Here \(R_{(i)}\) is the \(i\)’th order statistic. The recovery of the first reference entity to default is \(R_{(1)}\). The recovery of the second reference entity to default is \(R_{(2)}\) and so forth. 

It may be said that we are modeling the sequence of recovered amounts independent of the details of each reference entity. A limitation, certainly, although the class of such model is still quite interesting. It includes the ubiquitous constant recovery assumption, as well as models that lead to rather more sensible results for supersenior tranches.

Condition (\ref{exchangeable}) implies that the marginal distribution of recovery for every reference entity is identical, but it does not imply they are independent. A finite version of de Finetti’s theorem does, however, imply that all such models are mixtures of i.i.d. models.

Decomposition into simple contracts

In what follows we assume a reference portfolio of \(N\) assets and \(M\) different tranches. Suppose we are interested in the implications of \(K\) different default time joint distributions over these \(N\) reference obligations.

Let \(\{\theta\} \in \Theta\) denote the parameters for the exchangeable recovery model and \(\{\omega\}_{k=1}^K \in \Omega\) denote the parameters for the default time model. For fixed \(\theta\) the present values for the \(M\) different tranches under \(K\) different default model assumptions comprise an \(M,K\) matrix. We shall develop matrix expressions for the same, as follows: \begin{eqnarray} \label{decomp} \left[ PV_{credit}(\theta,\omega) \right]_{M, K} & = & \left[ C(\theta)\right]_{M, N} \left[D(\omega) \right]_{N, K} \nonumber \\ \left[ PV_{premia}(\theta,\omega)\right]_{M,K} & = & \left[p \right]_{M,1}\left[ e\right]_{1,K} \cdot \left[ P(\theta) \right]_{M,N+1} \left[ A(\omega) \right]_{N+1, K} + \left[ U \right]_{M,1} \left[ e\right]_{1,K} \end{eqnarray} where \(e\) is a vector of ones used to replicate \(U\) and \(p\). In turn \(U\) is a known, upfront amount paid by the purchaser of protection. The vector \(p\) comprises the agreed running premia (typically \(100\) or \(500\) basis points) for the tranches in questions. The dot denotes elementwise multiplication.

Definitions of other matrices will follow but point is that parameters \(\theta\) and \(\omega\) appear separately on the right hand side, not together. So any tensor product of assumptions regarding recovery and default can be computed efficiently.

We might call \(C(\theta)\) the credit protection matrix. The entry \(C_{m,n}\) is the average proportion of the \(n\)'th credit event payment which will be paid to the buyer of protection in tranche \(m\), where said average is taken over realizations of the recovery model with parameter \(\theta\).

Similarly, the entry \(P_{m,n}\) in what we might call the premia matrix \(P(\theta)\) is the mean proportion of the annuity for the \(n\)'th to default “tranchelet” which will accrue to the holder of tranche \(m\). Here again we refer to the mean over all realizations of the recovery model when the parameter is set to \(\theta\). It is convenient to allow the special case \(n=N+1\). We define \(P_{m,N+1}\) to be the proportion of the annuity for the \(m\)'th tranche that will be paid regardless of whether defaults occur or not. The value may well be zero, but typically will not be zero for the most senior tranche due to “writedown from above”.

In (\ref{decomp}) the matrix \(A(\omega)\) has entries \(A_{n,k}\). The entry \(A_{n,k}\) denotes the present value of a contract which pays the holder $1 per year up until the \(n\)'th default - computed under the assumptions of the \(k\)'th default time model. For the special case \(n=N+1\), \(A_{N+1,k}\) is the present value of a contract paying $1 per year up until the deal maturity regardless of default.

Similarly \(D(\omega)\) with entries \(D_{n,k}\) comprises the present value of elementary contract which pays the holder $1 at the time of the \(n\)'th default under the assumptions of the \(k\)'th default time model.

Pricing of simple contracts

Next we consider the pricing of our simple risky annuities \(A\) and default count bets \(D\) given a default surface \(S\) of dimension \(N+1 \times T\). We claim these matrices can be computed as follows: \begin{eqnarray} \label{simple} D(\omega) & = & \left(\left(L S W\right)\cdot \left(R\Lambda'W\right)\cdot \left(R\Gamma'G\right)\right)F/10000 \nonumber \\ A(\omega) & = & \left(\left(H\left(\left(1-LS\right)Z\right)W\right)\cdot \left(H\left(R\Lambda'W\right)\right)\right)F \end{eqnarray} where \(\Gamma\) is a vector of times and \(\Lambda\) a vector of risk free discount factors.

The remaining matrices are constant. $$ L = \left[ \begin{array}{cccc} 1 & 0 &\dots & 0 & \\ 1 & 1 & \dots & 0 & \\ \vdots & \vdots & \ddots & 0 & \\ 1 & 1 & 1 & 1 & \end{array} \right]_{N+1, N+1} $$ $$ F = \left[ \begin{array}{cccc} 1 & 1 &\dots & 1 \\ 0 & 1 & \dots & 1 \\ \vdots & \vdots & \ddots & 1 \\ 0 & 0 & 0 & 1 \end{array} \right]_{T+1, T+1} $$ $$ H = \left[ \begin{array}{cccc|c} 1 & 0 &\dots & 0 & 0 \\ 0 & 1 & \dots & 0 & 0 \\ \vdots & \vdots & \ddots & 0 & 0 \\ 0 & 0 & 0 & 1 & 0 \end{array} \right]_{N, N+1} $$ $$ R = \left[ \begin{array}{c} 1 \\ 1 \\ \vdots \\ 1 \end{array} \right]_{N+1,1} $$ $$ G = \left[ \begin{array}{cccc} -1 & 0 &\dots & 0 \\ 1 & -1 & \dots & 0 \\ 0 & 1 & \ddots & 0 \\ \vdots & \vdots & \ddots & -1 \\ 0 & 0 & 0 & 1 \end{array} \right]_{T, T-1} $$ $$ W = \left[ \begin{array}{cccc} 1/2 & 0 &\dots & 0 \\ 1/2 & 1/2 & \dots & 0 \\ 0 & 1/2 & \ddots & 0 \\ \vdots & \vdots & \ddots & 1/2 \\ 0 & 0 & 0 & 1/2 \end{array} \right]_{T, T-1} $$ $$ Z = \left[ \begin{array}{ccccc} 1 & -1 & 0 & \dots & 0 \\ 0 & 1 & -1 & \dots & 0 \\ 0 & 0 & 1 & \dots & 0 \\ \vdots & \vdots & \vdots & \ddots & -1 \\ 0 & 0 & 0 & 0 & 1 \end{array} \right]_{T, T} $$ Combining (\ref{decomp}) and (\ref{simple}) we can price all tranches for all default models using only matrix computations. The \(\theta\) dependent matrices \(C\) and \(P\) in (\ref{decomp}) can be computed offline.

Are Monads really Monads? Maybe(Some(R))

I made a few notes a while back for peeps like myself who are familiar with monads in functional programming but wondered why their definition isn't entirely obvious from the definition of a monad in mathematics, specifically category theory.

I personally made rather slow progress on this front (still don't understand why  arrows are strong monads for example). But perhaps something of the connection between the monad pattern in programming and the monad definition in mathematics can be cleared up here.

I confess that part of the motivation comes from being turned off by articles about monads. Newbies like your author might easily be forgiven for thinking that a monad as an endofuntorial elephant. I nearly went mad after reading twenty random monad blogs without finding a single one that states which category or categories we are in. That's like finding descriptions of "Germany" that include Nazi's, white sausages and so forth and yet never mention that Germany is a country.

This lack of precision hardly helps us when we turn to details, like that fact that monad is Haskell isn't actually a functor in the Haskell hierarchy. Indeed as Heunen and Jacobs  point out, both monads and arrows are actually monoids, in mathematical parlance.

So let me state this first: a monad (programming speak) is a monad (math speak) in the category of types. If neither of us get anything more from this exercise, so be it.

          Introducing MAP, a built-in functor

Let's start from the programming side. One use of a monad pattern emerges from an everyday programming need: the desire to extend the number system (or some other type) to include an exception, or 'missing' value. This is achieved by creating a new type. In Scala (you'll forgive the bias) this is constructed as Option[something]. When you alter a Double so that it is an Option[Double] you are creating an obvious mapping from one type to another, you might even say "the" obvious mapping. But you are not merely mapping objects to Option[objects] - at least not if you wish to get any work done. You are also dragging along all the nice things you'd like to do to the objects, namely hit them with functions. Your dragging is a functor.

A functor is not just a map from one space to another, we recall, but also a map of all the stuff-you-can-do-to-things in that space to stuff-you-can-do-to-things in the other space. To be more precise I should say we are dragging over not just functions but maybe some other stuff like relations, but that is a fine point for now. In the case of Double => Option[Double] you drag f(x) = x*x+3, say, into a function we might call Some(f) which should act on elements of Option[Double] and take Some(x) to Some(x*x+3).

Or take Double and shove it in Mathematica, metaphorically. It is no longer a Double, but merely an expression, like x*x+3 which is yet to be bound to a specific Double. Thing is, you'd like to do to the expressions what you do to real numbers, so evidently we are dragging morphisms like the one taking 3 into 9 and 5 into 15 over to morphisms on expressions (like the one taking x into 3*x). They ain't the same thing but there is clearly some conservation of structure as we go from real, tangible Double into ephemeral 'x'.

For more examples of "obvious" mappings take a bunch of program fragments and shove them together in a combinator pattern. Or play with state machines, whose composition can be simplified to a single state machine. The notion of an incomplete calculation, an unconsumated something, an unsimplified something, or an extension of a domain are all examples where there is a obvious, rich mapping from one type to another. So rich, in fact, that if we are thinking sloppily we often fail to distinguish between the domain and range of the obvious mapping. 

Now let's back up and dot a few t's because while a monad is a fairly simple pattern in programming we aren't close to recognizing its mathematical definition (well, I'm not). To be a little formal we need categories, functors and natural transformations. Remember that a category is supposed to be a space together with some morphisms obeying laws? It was a long time ago for me too, but the morphisms in two of the examples I mentioned are merely the set of all functions with the signature Double => Double. Again, a morphism is a more general concept than a function, basically obeying associativity, and for that reason is often called an arrow (in mathematics). But let's not worry about the possible deficiency in our morphism collection right now because the set of all Double => Double functions is interesting enough.

Two quick checks. First, it is obvious that composition of functions in the mathematical sense - no side effects - is associative. Second, we'll need not concern ourselves overly with the existence of an identity morphism - the other category requirement - because constructing the function that takes x and returns x presents little challenge to most programmers! Safe to assume the identify exists.

So jolly good, DOUBLE is a category and, returning to our first example, Some( ) sure looks like it might help make a functor from DOUBLE into something we imaginatively call SOME-DOUBLE. (In general we could call the first category THING and the new category SOME-THING, I suppose). Likewise Symbolic( ), which I just made up, looks like it might help create a functor from DOUBLE into SYMBOLIC-DOUBLE, as we might name it, a category representing a computation but not its answer. Are the images of these maps themselves categories? Its seems pretty obvious that they are, because of the triviality of function application associativity and, to be pedantic, the obvious fact that the identity morphism maps to the identity morphisms in both SOME-DOUBLE and SYMBOLIC-DOUBLE.

We focus on the first functor momentarily, whose implementation in Scala comprises the constructor x=>Some(x) and the lifting f => (x => x map f) which uses the 'built-in' map method. I personally think of this built-in functor as a 2-tuple where MAP._1 acts on objects and MAP._2 acts on functions. I'll just use MAP going forward for either though, slightly abusing notation.

                  MAP  = ( x=>Some(x), f => (x => x map f) )

As an aside, if I had my way Function1 would come with a map method so the morphism mapper f => (x => x map f) would look simpler, but it doesn't matter so long as we remember that morally speaking, the morphism-mapping part of the functor is just 'map'. When I say that map is 'built in' I really mean built in to the collections libraries, incidentally, because map is provided in the scala.collection.Traversible and scala.collection.Iterable traits and all collections are of one of these two types. Thus for all intents and purposes the combination (constructor, map) is an "off the shelf" functor although yes, for the third time, there are indeed plenty of morphisms other than functions - partial orders, paths on graphs et cetera - I'll let it go if you will. 

What is the domain of MAP? My guess is that we should extend it to include not just DOUBLE but SOME-DOUBLE. Also SOME-SOME-DOUBLE, SOME-SOME-SOME-DOUBLE and so on ad infinitum. Call that big union the category Omega so that MAP is, in math speak, an endofunctor because it maps the category Omega into itself. We are getting a little closer to the mathematical definition now, I do hope, and here comes the last little jog. 

A natural (and obvious) transformation between the built-in functor MAP applied once and the built-in functor MAP applied twice

To humor me apply MAP twice to get MAPMAP = MAP compose MAP, a functor that maps points x => Some(Some(x)) and of course messes with morphisms as well. For example, consider the function f(x) = x*x+3. MAP takes this morphism into another morphism which in turn happens to map Some(x) into Some(Some(x*x+3)). On the other hand MAPMAP would take f into a different morphism taking Some(x) into Some(Some(Some(x*x+3))). This just emphasizes the fact that MAP and MAPMAP are not the same thing, but they are dreadfully close.

The built-in functor MAP takes points to points and functions to functions

In fact with apologies for my handwriting and the mixing of mathematical and Scala notation on this diagram, MAP and MAPMAP are so similar that when applied to a particular morphism f(x)=x*x+2 they result in very similar looking functions g(x) and h(x). In fact g and h "are" the same Scala function - the very same pattern matching snippet of code I have scrawled in the picture! Needless to say the images of points a and b look quite similar as well. It is awfully tempting to relate MAP and MAPMAP by squishing Some(Some(x)) back to Some(x). The nice thing is, it costs us little to be precise.

In category theory there is a notion of a natural transformation between two functors that respects composition of morphisms. If we were category theorists we'd probably say "oh I suppose there must be a natural transformation from MAPMAP to MAP". I did not, come to think of it, have that exact experience myself, admittedly, because natural transformations are natural in some respects but also an abstract head f@#k the first time you come across them (they relate two functors, each of which in turn map functions to functions). The moral of the story: I think we should talk more about natural transformations to make us better people.

Actually we'll only talk about the one natural transformation mentioned: MAPMAP to MAP. As the definition on wikipedia explains (with F=MAPMAP and G=MAP, and C=D=Omega), this amounts to finding what is known as a component function (math speak) that lifts h=MAP(MAP(f)) onto g=MAP(f). That is straighforward because, translating back to Scala speak,

                                 h andThen flatten =  flatten andThen g

You couldn't hope for a more "obvious" commutative diagram that this - which is a good sign because in category theory many things are supposed look obvious after untangling of definitions or chasing of diagrams. We only need the top half of the diagram here, incidentally, and I don't think I need to define flatten. Even Scala newbies could implement the bit of it we need via pattern matching: x => x match {Some(y) => y}.

Flatten is the component realizing the natural transformation from MAP MAP to MAP (top half)
 whereas Some is the component realizing the natural transformation from Id to MAP (bottom half)

Thus natural transformations are sometimes quite friendly, it would seem, and there is an even more trivial one lurking here in the bottom half of the diagram, namely the natural transformation that takes the identity functor into MAP. That is realized by a component that happens to translate into a constructor (programming speak, of course) x => Some(x). Evidently:

                             f andThen Some = Some andThen g

so the bottom half commutes.

           Finally, the connection to mathematical monads

We're done. If you skip down the wikipedia monad page to the formal definition, ignoring the stuff at the start about adjoint functors which is just confusing, you'll see that a monad comprises a functor F and two natural transformations, one taking the identity functor Id to F and the other taking F compose F to F. The natural transformations are defined by their components which are, respectively, the constructor Some and the squishing operation flatten. To summarize, here is how we would relate the definition of a mathematical monad on Wikipedia to the day-to-day notion of a monad in programming Scala

Mathematical monad ingredients Example of a common monad
Two categories C,D One category comprising a an infinite union of similar types C=D=Omega=Union(DOUBLE,SOME-DOUBLE,SOME-SOME-DOUBLE,...)
A functor between the categories F: C->D The "built-in" functor MAP taking points x=>Some(x) and functions f => (x => x map f)
A natural transformation taking the identify functor Id to F A natural transformation taking the identify functor Id to MAP
A "component" realizing said natural transformation A constructor x => Some(x) that "wraps" up x
A second natural transformation taking F F -> F A natural transformation from MAP MAP to MAP
A second "component" realizing the second natural transformation A flatten operation x => x match {Some(y)=>y} that unwraps y

As the mathematicians remind us there are some things to check, known as coherence conditions, but I leave that to the reader.
              What's the point?
Now here we are transforming functors when what we are looking for is as elementary as unwrapping a burrito. One might argue that understanding the abstraction is no harder than noticing the essential similarity between any two cases where this patterns arises and any good programmer should be on the lookout for that sort of thing. They should also be looking to communicate it to those maintaining code long after they have written it, and make that maintenance as simple as possible. Abstraction is next to cleanliness, as they don't really say.

If one is going to talk loosely about "conserving structure" one might as well talk rigorously about it if it isn't all that much harder. And when you do, there are certainly some nontrivial things that fall out of category theory (for another time). No wonder the seemingly arcane area, once considered so obscure and abstract as to be useless, has made a charge up the Google n-gram leaderboard in recent times.

Popularity of "category theory" and "functional programming"
Still, I think most people would prefer to keep it a little wooly and merely recognize that we have an obvious identification between a doubly constructed type like List(List(3,4)) and one level down List(3,4), and that this identification needs to be carried over to morphisms - sorry functions. That is why you would naturally write flatmap yourself as the need arose, but might not necessarily appreciate the similarity to ComputeThis(ComputeThis(blah ))  or ParseMe(ParseMe( blah )) or  AbstractMe(AbstractMe( blah )) and so on. Actually Option is kind of like a partial computation too. You can think of Some(x) and Some(Some(x)) as calculations in stasis, if you will, although the only thing that "remains" to be calculated is whether x exists or not. My goodness, it always does!
One might wax mathematical in the hope of achieving a higher state of monadicness. Perhaps, for example, we "understand'' what MAP is doing because we root our intuition to pattern matching, kind of. If I tell you that g = (Some(x)=>Some(x*x+3)) you say "oh I get it, that is like x=>x*x+3 lifted up into Option[Double]. And if I tell you that h = (Some(Some(x))=>Some(Some(x*x+3)) you say "oh I get it, that is like x=>x*x+3 lifted all the way to Option[Option[Double]]". And you won't complain if I point out that g and h are the same, at least on some domain, because they are rooted in the same thing, the naked function x=>x*x+3. Perhaps the obviousness prevents us from seeing the top half of the second diagram independently of the bottom, however.

                     Footnote: The built-in flatMap  

It may seem odd that we have discussed monads in both programming and mathematics and I made only passing reference to flatMap, the built-in "bind" for Scala collections (defined here and here for Traversable and Iterable traits respectively). For another time, I guess.

Compositional models and their relation to Bayesian networks

Some time ago I noticed that Radim Jirousek published a foundational paper bringing together results on compositional models. A subclass of compositional models are an alternative formulation of Bayesian networks.

The basic idea goes as follows. Suppose \(\pi\) is a joint distribution on variables \(x_1\) and \(x_2\) specified by the following table (see the linked paper page 620).

\(\pi\) \(x_1=0\) \(x_1=1\)
\(x_2=0\) \(\frac{1}{2}\) \(\frac{1}{2}\)
\(x_2=1\) 0 0

and further suppose \(\nu\) is the uniform distribution on variables \(x_1\) and \(x_3\):

\(\pi\) \(x_1=0\) \(x_1=1\)
\(x_2=0\) \(\frac{1}{2}\) \(\frac{1}{2}\)
\(x_2=1\) 0 0

Then we define $$ (\pi \rhd \nu)(x_1,x_2,x_3) = \frac{ \pi(x_1,x_2) \nu(x_2, x_3) } { \nu^{\downarrow(2)}(x_2) } $$ where the down arrow indicates a marginal distribution. This division may not always be defined. For example the composition \(\pi \rhd \nu(x)\) is defined for all combinations of \(x_1\),\(x_2\) and \(x_3\) whereas the composition \(\nu \rhd \pi(x)\) is defined for only some, as in the table below which can be compared to that on page 630 of Jirousek's paper.

\(x_1\) \(x_2\) \(x_3\) \(\pi \rhd \nu(x)\) \(\nu \lhd \pi(x)\)
false false false Some(0.25) Some(0.125)
false false true Some(0.25) Some(0.125)
false true false Some(0.0) None
false true true Some(0.0) None
true false false Some(0.25) Some(0.125)
true false true Some(0.25) Some(0.125)
true true false Some(0.0) None
true true true Some(0.0) None

Conveniently Scala allows function names like |> and <|, and the monadic representation of results. Here None means undefined, and a finer point about the implementation below is that I might have used a ternary operator to resolve 0*0/0=0, as per Jirousek's convention.

Anyway, here's the code