I was hesitant to use this sub-title, relevant as it is, because Anton Chekhov used the same title for one of his short stories, one of the best from the Golden Age of Russian literature. With that small literary diversion aside, let me set up a thought experiment and ask you to make a bet.
Suppose I have a bag of 10,000 coins – 9,999 of which are fair (i.e. balanced for heads and tails) and one biased coin which has two heads (I will use H and T henceforth). The coins are well-mixed in the bag and I reach into the bag and withdraw one coin at random. I do not tell you the identity of the coin, but rather, I will flip the coin repeatedly and tell you the result of the flip – H or T. You are asked to make a bet as to whether I have drawn the biased coin based on the observed data. Now, obviously, if I flip the coin and tell you a result is T, then there is no need to proceed any further since I clearly have selected a fair coin. The only interesting aspect of this thought experiment is to consider a sequence of H’s. The question becomes, “How many consecutive H’s would you need to see before you are willing to bet that I have selected the biased coin?”
Null Hypothesis Significance Testing (NHST)
So, as you are thinking about this, you are defining a decision rule in your mind. Should it be N=6? 8? 10? 12? More? And if you are a thoughtful statistician or other scientist with knowledge of statistical methods, you are likely considering a formal statistical hypothesis test to address this question. It is actually quite straightforward.
Let the null hypothesis be that I have selected a fair coin, stated as
H0: coin is fair or in numeric terms H0: pr(H)=0.50.
The alternative hypothesis is then
Ha: the coin is biased or Ha: pr(H)=1.
As you are pondering your decision rule – what value of N consecutive H’s would be enough evidence for you to bet that I have drawn the biased coin – you can calculate the probability of a false positive finding. That is, the probability that you decide or bet that I have selected the biased coin when in fact I have selected the fair coin. This means you lose the bet. This is also called the probability of a Type 1 error or in statistical jargon, the significance level of the test. It is uniformly denoted by the Greek symbol α (alpha). This probability is written in words as
pr(you say the coin I selected is biased when in fact the coin I selected is fair)
or in hypothesis lingo as
pr(reject H0 when in fact H0 is true)
or in more symbolic terms
pr(reject H0 | H0 is true).
(Note that the vertical bar means “given” or “assuming” when using probability statements.)
A Type 1 error means you lose the bet, and so you probably want this to be small. So, you want to choose N to be large; but how large? Well, if the null hypothesis is true, then the probability of H on any given flip is 0.50. And since each flip is an independent event, the probability of N consecutive H’s using a fair coin is 0.50^N. Now you have a formula to decide how small you want to make your chances of losing the bet (a Type 1 error). You can select N to be whatever you want depending on how much risk you are willing to take for being wrong.
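The formula is easy to tabulate. What follows is a minimal Python sketch, not part of the original thought experiment; the function names `type1_error` and `smallest_n_for_alpha` are illustrative assumptions. It computes the false-positive probability for a few candidate decision rules and finds the smallest N that keeps that probability at or below a chosen risk level.

```python
def type1_error(N: int) -> float:
    """Probability that a fair coin produces N consecutive heads."""
    return 0.5 ** N


def smallest_n_for_alpha(alpha: float) -> int:
    """Smallest N whose Type 1 error, 0.5**N, is at or below alpha."""
    N = 1
    while type1_error(N) > alpha:
        N += 1
    return N


# Candidate decision rules mentioned in the text.
for N in (6, 8, 10, 12):
    print(f"N = {N:2d}: Type 1 error = {type1_error(N):.6f}")

# Risk levels chosen here purely for illustration.
for alpha in (0.05, 0.01, 0.001):
    print(f"alpha = {alpha}: need N >= {smallest_n_for_alpha(alpha)}")
```

Under these assumptions, a risk level of 0.05 requires only 5 consecutive H's, 0.01 requires 7, and 0.001 requires 10.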
Now, suppose in our thought experiment, that I have flipped the coin and reported n=10 consecutive H’s. If your decision rule for making the bet was observing 10 consecutive H’s, you would bet that I selected the biased coin. Once we have observed the data, we can compute the probability of observing 10 consecutive H’s assuming the null hypothesis is true (i.e. the coin is fair). This uses the same formula as the Type 1 error and is
p-value = pr(n consecutive H’s | H0 is true) = 0.50^n = 0.50^10 = 0.0009766.
Perhaps you will notice that I have shifted from N to n in these formulas, and I have done so intentionally. N is a number that is decided before the experiment is run and defines your risk tolerance or the significance level of the test; n is the observed number of H’s – i.e. it is the observed data – and provides a measure for how “far” the data are from the null hypothesis H0. This is known as the p-value or the significance level of the data. [I will save for a later blog the confusion that is created by the conflation of these two distinct quantities – the significance level of the test and the significance level of the data – which are often used (incorrectly) interchangeably.]
With a traditional significance level being 0.05 (or even conservatively 0.01), a p-value < 0.001 would lead one clearly to reject H0 and conclude that it is highly likely that I have selected the biased coin from the bag after observing 10 consecutive H’s.
Rethinking the Decision Strategy
In the usual NHST paradigm, the goal is to reject the null hypothesis. It’s like proof by contradiction in mathematics. You start with a statement. Then you manipulate that statement using logically valid mathematical steps. If you produce a subsequent statement that is known to be false, then you conclude that the original statement was false. In NHST, we start with a statement – the no effect hypothesis – and then do an experiment to gather data regarding that hypothesis. If the data are incompatible with the hypothesis, then we reject that null hypothesis in favor of the alternative (non-null) hypothesis, which is what we wanted to show in the first place.
So, what scientists want is to reject H0, or more precisely, upon doing an experiment and collecting data, they want to know how likely it is that the alternative hypothesis is true – the thing they really want. Equivalently, they want to know the likelihood that the null hypothesis is false given the observed data. Thus, WHAT SCIENTISTS (ACTUALLY ALL OF US) REALLY WANT TO EVALUATE IS
pr(Ha is true | observed data), or equivalently
pr(H0 is false | observed data).
In the context of this thought experiment, the question is “How many consecutive H’s are needed to be willing to bet that I selected the biased coin?” Or, said another way, you want to know what is the probability that I selected the biased coin given n consecutive H’s … or even more precisely, you want to know for what value of n is
pr(I selected the biased coin | n consecutive H’s) > 0.50.
That is, when in a sequence of n consecutive H’s are the odds in your favor, and therefore it is in your best interest to make the bet?
In formal expression, we need to evaluate
pr(biased coin | n consecutive H’s observed) = pr(Ha | n).
Fortunately, we have Bayes’ Theorem to help with this calculation.
pr(Ha|n) = pr(n|Ha) pr(Ha) / [ pr(n|Ha) pr(Ha) + pr(n|H0) pr(H0) ]
In the thought experiment,
- pr(n|Ha) = 1 since one is guaranteed to get all H’s if the biased coin is used.
- pr(Ha) = 1/10,000 since there is only 1 biased coin in the bag of 10,000 coins.
- pr(n|H0) = 0.5^n since a fair coin has a 50% chance of producing an H on each flip.
- pr(H0) = 9,999 / 10,000.
For n = 10, this results in
1 * (1/10,000) / [ 1 * (1/10,000) + 0.5^10 * (9,999/10,000) ] = 0.093.
In words, given that I have observed 10 consecutive H’s, the probability that I have selected the biased coin from the bag of 10,000 coins is 0.093. With such a low probability of having selected the biased coin, one should clearly conclude that it is very unlikely that I have selected the biased coin and therefore should NOT make the bet after 10 consecutive H’s!
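The same Bayes' Theorem calculation can be repeated for any n, which also answers the betting question of when the probability first exceeds 0.50. The Python sketch below is an illustration under the thought experiment's stated setup (10,000 coins, exactly one two-headed); the names `posterior_biased` and `smallest_n_to_bet` are my own assumptions, and exact fractions are used only to avoid rounding.

```python
from fractions import Fraction

N_COINS = 10_000  # coins in the bag, exactly one of which is two-headed


def posterior_biased(n: int, n_coins: int = N_COINS) -> Fraction:
    """pr(biased coin | n consecutive heads), via Bayes' Theorem."""
    prior_biased = Fraction(1, n_coins)
    prior_fair = 1 - prior_biased
    like_biased = Fraction(1)        # a two-headed coin always shows H
    like_fair = Fraction(1, 2) ** n  # a fair coin shows n heads with prob 0.5^n
    numerator = like_biased * prior_biased
    return numerator / (numerator + like_fair * prior_fair)


def smallest_n_to_bet(threshold: Fraction = Fraction(1, 2)) -> int:
    """Smallest n for which the posterior strictly exceeds the threshold."""
    n = 0
    while posterior_biased(n) <= threshold:
        n += 1
    return n


for n in (10, 12, 13, 14, 20):
    print(f"n = {n:2d}: pr(biased | n heads) = {float(posterior_biased(n)):.4f}")

print("Smallest n with posterior > 0.50:", smallest_n_to_bet())
```

With these inputs the posterior at n = 10 is 1024/11023, about 0.093, and it first exceeds 0.50 at n = 14, because 2^13 = 8,192 is less than 9,999 while 2^14 = 16,384 is greater.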
That’s a strikingly different answer than we got using NHST. What’s up?!
Answer the Right Question
There is an adage that says, “do things right” and another that says, “do the right things.” The first is about “how” and the latter is about “what.” If you don’t get the “what” (i.e. the right question) aligned with your goals, even the best “how” efforts are futile since you are doing the wrong things very well. Let me expand on this general truism – first “what”, then “how.”
Let A be the event rain, and let B be the event cloudy. What we want to know is pr(rain|cloudy) or pr(A|B). We all know that pr(B|A) is a very different quantity and even addresses a very different question/concept (the “what”). It can be so different as to render it meaningless or useless, as in pr(cloudy|rain). We would never mistake pr(rain|cloudy) for pr(cloudy|rain) in our personal lives. Furthermore, we would be shocked if weather reporters quoted pr(cloudy|rain) in their forecasts – what we do not want – but conveyed it and convinced us as if it were pr(rain|cloudy) – what we really want. That would be totally unacceptable, and we might even declare such a weather reporter a fraud!
Now, let A = a hypothesis and B = the observed data from an experiment about that hypothesis. [It doesn’t matter for this argument whether A is the null or alternative hypothesis.] Just as in the thought experiment we want to know the probability that I have selected the biased coin, so too in a scientific experiment or clinical trial we want to know the probability that a hypothesis is true or false. What is the pr(drug works)? What is the pr(cigarettes cause cancer)? What is the pr(increased spending on TV ads will increase sales)? What is the pr(use of educational program X will increase learning)? All these questions are answered in the context of observing some data (from a controlled experiment or an observational study) relevant to the hypothesis. And they are all stated from the context of
pr(Ha is true) or equivalently pr(H0 is false).
And so, we can express the right question as, “What is the pr(hypothesis|observed data)?” That is, what can I infer about the underlying reality or the truth of nature based on the data that I have observed? This is precisely what Thomas Bayes (circa 1763) was addressing, and it is the right question to be asking still! It is pr(A|B) and more specifically,
pr(H0 is false | observed data).
In NHST, we report the p-value, which is akin to pr(B|A), or specifically,
pr(data | H0 is true).
There is some meaning to this probability [unlike the pr(cloudy|rain)] in the sense of proof by contradiction. However, it is fundamentally an answer to the wrong question. The Bayesian probability captured in the pr(hypothesis|data) is more akin to direct proof in mathematics and is indeed a direct answer to the question in which we are all most interested – the right question.
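The gap between these two conditional probabilities can be made concrete by counting outcomes in the coin example. The Python sketch below is my own illustration under the thought experiment's assumptions (10,000 coins, one two-headed, 10 flips); it imagines repeating the draw-and-flip experiment 10,240,000 times so that every expected count is a whole number, and all variable names are hypothetical.

```python
from fractions import Fraction

n_coins = 10_000
n_flips = 10
sequences = 2 ** n_flips  # 1,024 equally likely flip sequences for a fair coin

# Imagine 10,000 * 1,024 repetitions, so each coin is drawn 1,024 times.
fair_draws = (n_coins - 1) * sequences  # repetitions in which a fair coin is drawn
biased_draws = 1 * sequences            # repetitions in which the biased coin is drawn

# Expected number of repetitions that end in 10 consecutive heads.
all_heads_from_fair = fair_draws // sequences  # one in 1,024 fair draws: 9,999
all_heads_from_biased = biased_draws           # every biased draw: 1,024
all_heads_total = all_heads_from_fair + all_heads_from_biased

# The p-value conditions on the hypothesis: among fair draws, how often all heads?
pr_data_given_h0 = Fraction(all_heads_from_fair, fair_draws)

# The bet conditions on the data: among all-heads runs, how often a fair coin?
pr_h0_given_data = Fraction(all_heads_from_fair, all_heads_total)

print(f"pr(10 heads | fair coin) = {float(pr_data_given_h0):.7f}")
print(f"pr(fair coin | 10 heads) = {float(pr_h0_given_data):.4f}")
```

Both numbers describe the same 9,999 all-heads runs from fair coins. Only the denominator changes, and that is the difference between a p-value below 0.001 and a probability of about 0.907 that the coin is fair.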
For too long statisticians have been peddling pr(data|hypothesis) in the form of p-values to scientists who have fully adopted their use (and over-use) and in fact, made it the “gold standard” for scientific decision-making. Because most scientists do not fully engage in our mathematical and statistical priesthood of confusing calculations and algorithms, they have accepted what we have told them. In fact, many think (erroneously) of a p-value as exactly the probability that they want – pr(hypothesis|data) – and we as statisticians frequently grumble that scientists do not understand this and misinterpret p-values. For too long, statisticians have been quietly substituting pr(B|A) for what is desired, pr(A|B), all the while lamenting in our offices and lunchrooms that the scientists don’t get it. We are akin to the weather reporter who gives pr(cloudy|rain) – that is, the p-value – but sells it as pr(rain|cloudy). Statisticians (including me) are complicit in this “bait and switch” at least by implicit consent through the mass production and continued prevalence of p-values.
Lastly, a persistent critique of Bayesian approaches is the need for creating a subjective prior. The arguments from Bayesians are many and varied, but a simple counterargument is that frequentist methods and hypothesis testing also require assumptions and models, etc. I am ultimately on the side of no less an authority than John Tukey (1) who stated, “Far better an approximate answer to the right question, which is often vague, than an exact answer to the wrong question, which can always be made precise.”
1. John W. Tukey. The future of data analysis. Ann. Math. Statist. 33, 1-67 (1962).