Thursday, June 18, 2009

Inferring Something You Can't See

In his book Information Theory, Inference, and Learning Algorithms, David MacKay describes an interesting problem in Bayesian statistics (that school of statistics that purports to describe, quantitatively, rational thought itself). I paraphrase it here, and comment, from my comfy couch in front of the TV, on what it might mean for customer service at Target.

Here is the statement of MacKay's problem:
Unstable particles are emitted from a source and decay at a distance x, a real number that has an exponential probability distribution with characteristic length L. Decay events can be observed only if they occur in a window extending from x = 1 cm to x = 20 cm. N decays are observed at locations {x1,..., xN}. What is L?
This problem interests me because of the little hitch it throws: the window from x = 1 cm to x = 20 cm. This is the part of the problem that turns a trivially easy problem into a much harder one.

No panic is required or desired on the part of the reader. This is not an exercise to be graded, with one's worth thereby judged. I am a scientist, trained in fact to solve this kind of problem (or, admittedly, to fully understand MacKay's solution); someday, perhaps you, gentle reader, will pay me for this kind of junk, but for now it's free. At the moment, my goal is to describe the solution to an easier problem than MacKay's, then describe the solution to the more difficult problem actually posed. The rhetorical part of the key to solving the difficult problem--incorporation of all relevant information--is something that, I can assure you, the reader will be able to understand, criticize, recognize in everyday situations, and argue about with others over agreeable beverages of choice.

To understand the problem requires understanding an exponential distribution. Here is what an exponential distribution looks like:

[Figure: an exponential probability distribution, with probability density falling off as distance x from the source increases]

This graph shows the essential feature of the exponential distribution: as one moves further to the right, away from the particle source, the probability of a particle decaying there decreases (this may seem counterintuitive, but it's true; for the initiated, this is formally a probability density). This means that if we were to watch some number of particles decay, say 100 of them, we would get data that looks like this:

[Figure: histogram of 100 simulated decay distances, roughly tracking the exponential curve]

The smooth-looking exponential distribution is the idealized limit of the rough-looking histogram above: if an infinite number of particles had decayed, the histogram would be as smooth as the distribution. With a finite number of decays, the data is a bit rough. In fact, this roughness is the reason for treating the problem statistically.
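To make the roughness concrete, here is a quick simulation sketch in Python; the true L of 10 cm and the 4 cm bin width are invented for illustration, and `random.expovariate` (which takes the rate, 1/L) supplies the exponentially distributed decay distances:

```python
import random

random.seed(0)
L = 10.0  # the true characteristic length in cm, invented for this sketch
decays = [random.expovariate(1.0 / L) for _ in range(100)]

# Count decays in 4 cm wide bins out to 40 cm: a crude text histogram.
bins = [0] * 10
for x in decays:
    if x < 40.0:
        bins[int(x // 4)] += 1
for i, count in enumerate(bins):
    print(f"{4 * i:2d}-{4 * (i + 1):2d} cm: {'#' * count}")
```

Rerunning with more particles makes the bars hug the smooth curve more closely; with only 100, the bumps and gaps are easy to see.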

If it weren't for the window, the solution to this problem would be very easy. In that case, a good estimate of the "characteristic length" L is simply the average of all the decay distances. Notice that the good estimate is the average of all decay distances, not just the ones that our experimental setup can observe through the window.
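In code, the no-window estimate is essentially one line: simulate decays (again with an invented true L of 10 cm) and average them.

```python
import random

random.seed(0)
L = 10.0  # the true characteristic length in cm, invented for this sketch
decays = [random.expovariate(1.0 / L) for _ in range(100)]

# With every decay visible, the sample mean is the natural estimate of L.
estimate = sum(decays) / len(decays)
print(f"estimate of L from {len(decays)} decays: {estimate:.2f} cm")
```

With only 100 decays the estimate wobbles around the true value, but it is centered in the right place.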

However, when we include the effects of the window, we can't see all the particles decay. Instead, we only detect a subset of them, those that decay between 1 cm and 20 cm from the source, illustrated here:

[Figure: histogram of observed decays, truncated below 1 cm and above 20 cm by the window]

Because of the window, some of the particle decays that occurred are invisible to us; the histogram is truncated before 1 cm and after 20 cm. For this reason, we are left with the terrible result that we can't just average the particle decay distances we see to estimate the value for L, because we'd be neglecting particle decays that we don't see.
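A quick sketch makes the trouble tangible: simulate decays with an invented true L of 10 cm, throw away everything outside the window, and average what is left.

```python
import random

random.seed(1)
L = 10.0  # the true characteristic length in cm, invented for this sketch
decays = [random.expovariate(1.0 / L) for _ in range(100_000)]

# Keep only the decays our apparatus can actually see: 1 cm <= x <= 20 cm.
seen = [x for x in decays if 1.0 <= x <= 20.0]

# The naive average of the visible decays lands well below the true L,
# mainly because the window hides the long-distance tail of decays.
naive = sum(seen) / len(seen)
print(f"true L = {L} cm, naive windowed average = {naive:.2f} cm")
```

The naive windowed average comes out several centimeters short of the true L, which is exactly the bias the rest of the post is about.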

However, the problem can be solved neatly using Bayesian statistics. The essential feature of this method (that is to say, its non-mathematical manifestation) is to incorporate all of the information available to us when estimating L. The key to doing this is to use not the distribution that governs particle decays, but the distribution that governs the particle decays we can observe. This is an exponential distribution, just as before, except that the probability of observing a decay closer than 1 cm to the source or farther than 20 cm from it is now zero. In other words, the probability distribution of the decays that we can see (as opposed to the probability distribution of the decays themselves) looks like this:

[Figure: the truncated exponential distribution, zero outside the 1 cm to 20 cm window]

As can be seen, the distribution is truncated before 1 cm and after 20 cm; we can't see particles decay in those regions, so the probability of observing those things is zero. Hence, the probability distribution is zero in those places.

So you can see that this distribution encodes the information that we're observing particle decays in a window. This is the distribution we feed to the Bayesian beast; its job is to take a distribution and spit out statistics about it, in this case the characteristic decay distance L. If one goes through the math, one finds that the best estimate of L is not the average of the observed distances (as it would have been if we could observe all particle decays), but a much more complicated formula. Still, using immense arcane powers derived mainly from computer algebra systems, this problem can be solved.
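MacKay's treatment is fully Bayesian; with a flat prior on L, though, the peak of the posterior sits at the maximum of the truncated likelihood, so a crude numerical sketch (all the specific numbers here are invented for illustration) can recover L by grid search:

```python
import math
import random

random.seed(2)
true_L, a, b = 10.0, 1.0, 20.0  # true length and window edges in cm, invented for this sketch

# Simulate decays and keep only those landing inside the window.
seen = [x for x in (random.expovariate(1.0 / true_L) for _ in range(50_000))
        if a <= x <= b]
n, s = len(seen), sum(seen)

def log_likelihood(L):
    # Density of an OBSERVED decay: the exponential renormalized over the window,
    #   p(x | L) = (1/L) exp(-x/L) / (exp(-a/L) - exp(-b/L))  for a <= x <= b.
    # Summing its log over all observations needs only n and s = sum of distances.
    norm = math.exp(-a / L) - math.exp(-b / L)
    return -n * math.log(L) - s / L - n * math.log(norm)

# Crude maximum-likelihood search over a grid of candidate lengths.
candidates = [c / 10.0 for c in range(20, 401)]  # 2.0 cm to 40.0 cm in 0.1 cm steps
best = max(candidates, key=log_likelihood)
print(f"best estimate of L: {best:.1f} cm (naive average of the data: {s / n:.1f} cm)")
```

The estimate lands near the true 10 cm even though the naive average does not, because the renormalizing term in the likelihood encodes exactly what the window hides.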

So, what does this have to do with customer service?

At Target, they have a computer you can use to complain about crappy customer service. Likewise, if you had a really great time, you can use the computer to let the company know. However, the company can't use this data (people complaining of crappy service and rejoicing in excellent service) to make inferences about the experiences of the whole collection of customers, since the data thus collected is biased. Or can they?

The answer, if you are thinking about the particles-decaying-in-a-window problem, is heck yes, that data can be used to make inferences about the experiences of all customers. The essential problem is that the responses at the computer (the "data" in the customer service problem) mostly come from people with extreme views about the customer service, so the sample tends to reflect extreme viewpoints. Even when Target offers incentives to fill out surveys, it gets a sample from the kind of person willing to fill out a survey, rather than what it would really like: the customer service experience of all customers, not just the weird customers who fill out surveys and make complaints. The company misses all the people who had an okay experience, and those who can't be bothered to fill out surveys, which I expect is most people. These responses, like particles decaying outside of our window, are invisible to us, but they are still important.

"Biased" here really just means that the company can't "average" the responses that they get in order to determine the overall quality of customer service. The average would be over samples that don't reflect all of the customers, so it would not reflect the experiences of all of the customers.

Back to the particle decay problem: it is possible, using knowledge of the window, to solve the problem, even when we can't just take a simple average. In the customer service case, the solution should be the same: to infer customer service experiences from biased samples, we just have to know what the "window" is that determines what it is we can see. Once I figure out the shape of the window, I'll let you know.

1 comment:

  1. Ah, thank you very much for explaining the problem's solution in such an intuitive way. I just started reading the book and found it rather confusing at first. By the way, can you explain to me what MacKay was trying to do in the first place: "with a little ingenuity and the introduction of ad hoc bins, promising estimators for [lambda] >> 20 could be constructed"?