Wednesday, July 15, 2009

Accepted paper: Bayesian single-exponential rates

Happy news for me today: a (very, very long) paper of mine was accepted for publication in the Journal of Physical Chemistry B. The impact on science is small, however important I think the topic is, but the impact on my career could be huge.

The paper describes a method for estimating the rate of some process essentially by counting the number of times the process occurs. The analogy I like to use is of a road: one plants oneself by the side of the road (lawn chair and cooler are mandatory, just as for those computing rates in molecular simulation) and counts the number of cars that pass in some time period. One estimate of the rate is to take the number of cars that pass and divide by the time period. For instance, if the number of cars was 120 and the time period was 12 hours (this is not a busy road), then one might say the rate is 10 cars per hour.

The same goes for molecular simulation. My real job is to direct computers to run simulations of protein folding. We make models of the proteins in unfolded states, then let them evolve according to Newton's laws of motion; with the luck of statistical mechanics, some of them reach folded states. Say that I observed 10 folding events in 100 microseconds; then an estimate of the rate would be 1 folding event every 10 microseconds, same as with the cars.
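The division method in both examples is the same one-line calculation:

```python
# The "division method": events observed divided by observation time.
def rate_estimate(n_events, time):
    return n_events / time

rate_estimate(120, 12)    # 10.0 cars per hour
rate_estimate(10, 100)    # 0.1 folding events per microsecond
```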

Interestingly, the division method (in this case also known as a maximum likelihood estimate) is not necessarily the best method for estimating a rate. For one thing, this and similar methods provide only point estimates of the rate and do not reflect our uncertainty as to how good the estimate is. To illustrate, imagine observing a road for 1 second and seeing no cars pass; clearly a rate of 0 cars per second (minute, hour) is not a good estimate of the rate. We would prefer a way to know how good our point estimate is.

For this purpose, we can compute a probability distribution of the rate, which describes our beliefs about the rate. That is, we assign a probability to each possible value of the rate. These probabilities in turn describe how surprised we would be, after making an observation, if the true rate turned out to be any particular number. If we had made lots of observations (many cars in some long time period), then the probability distribution would be very sharp: we would and should be very surprised to find that the true value is different from the maximum likelihood estimate (number of observations divided by time period).

In contrast, if we have a little bit of data--no cars in 1 second--then our probability distribution ends up being not sharp but broad. With silly data like this, we would not be surprised to find that the true rate is anything. This is because the maximum likelihood estimate should not be taken seriously given such a small amount of data.
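The two cases above can be sketched in a few lines. This is my own illustration, not the paper's code: under an assumed flat prior, the standard result for a counting process is that after n events in time T the rate follows a Gamma distribution with shape n + 1 and rate T, and its width relative to its mean shrinks as data accumulate.

```python
import math

# Flat-prior Gamma posterior for a rate: shape = n + 1, rate = T.
# (Illustration only; the paper's prior and model may differ.)
def posterior_mean_sd(n_events, time):
    shape = n_events + 1
    mean = shape / time
    sd = math.sqrt(shape) / time
    return mean, sd

# Lots of data: 120 cars in 12 hours -> a sharp distribution
posterior_mean_sd(120, 12.0)   # about 10.1 +/- 0.9 cars per hour

# Almost no data: 0 cars in 1 second -> sd equals mean, maximally broad
posterior_mean_sd(0, 1.0)      # 1.0 +/- 1.0 cars per second
```

With the big data set the standard deviation is under a tenth of the mean; with the silly data set it is as large as the mean itself, which is the "we would not be surprised by anything" regime.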

The manner in which probability distributions of the rate are built is Bayesian inference, that method of statistics that allows things with "true" values, like protein folding rates, to take on probabilities reflecting our belief that the true value lies within some range. As I show in the paper, these methods quite naturally capture the intuition above: lots of data gives sharp, reliable estimates, and a tiny bit of data gives poor ones. Intuition can be made systematic.

Most fun, I can use the methods in the paper to calculate my future performance. I have three papers this year (so far, anyway). If everything stays the same, the probability that I publish between 2 and 5 papers next year is about 59%. (I call this state "quantitative professional scientific happiness," or QPSH; come up with your own pronunciation.) What's more, the probability that I publish fewer than 2 papers next year is only 14%. The probability that I publish more than 5 papers is a little more than 26%.
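A calculation of this flavor can be sketched as follows, again assuming a flat prior on the rate (the paper's actual prior may differ, so these numbers need not match the figures quoted above exactly). With n events in one interval, the flat-prior posterior for the rate is Gamma(n + 1, 1), and the predictive distribution for the count in a second, equal-length interval works out to a negative binomial with p = 1/2:

```python
from math import comb

n = 3  # papers published this year

def p_next_year(k):
    """P(k papers next year | n papers this year, flat prior).

    Negative-binomial predictive from the Gamma(n + 1, 1) posterior;
    the flat prior is my assumption, not necessarily the paper's.
    """
    return comb(k + n, n) * 0.5 ** (k + n + 1)

p_few  = sum(p_next_year(k) for k in range(0, 2))  # fewer than 2
p_mid  = sum(p_next_year(k) for k in range(2, 6))  # between 2 and 5
p_many = 1.0 - p_few - p_mid                       # more than 5
# roughly 0.19, 0.56, and 0.25 under this particular prior
```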

So you should be more surprised, if everything stays the same, if I publish more than 5 papers next year than if I publish between 2 and 5. You should be even more surprised if I publish fewer than 2. But who knows if everything will stay the same? Perhaps we should always be surprised?
