The Bet
I was hesitant to use this sub-title, though it is very apt, because it is also the title of one of Anton Chekhov's short stories, among the best of the Golden Age of Russian literature. With that small literary diversion aside, let me set up a thought experiment and ask you to make a bet.
Suppose I have a bag of 10,000 coins – 9,999 of which are fair (i.e. balanced for heads and tails) and one of which is biased with two heads (I will use H and T henceforth). The coins are well-mixed in the bag, and I reach in and withdraw one coin at random. I do not tell you the identity of the coin; rather, I will flip the coin repeatedly and tell you the result of each flip – H or T. You are asked to bet on whether I have drawn the biased coin based on the observed data. Now, obviously, if I flip the coin and tell you a result is T, then there is no need to proceed any further, since I clearly have selected a fair coin. The only interesting aspect of this thought experiment is a sequence of H's. The question becomes, "How many consecutive H's would you need to see before you are willing to bet that I have selected the biased coin?"
Null Hypothesis Significance Testing (NHST)
So, as you are thinking about this, you are defining a decision rule in your mind. Should it be N=6? 8? 10? 12? More? And if you are a thoughtful statistician or other scientist with knowledge of statistical methods, you are likely considering a formal statistical hypothesis test to address this question. It is actually quite straightforward.
Let the null hypothesis be that I have selected a fair coin, stated as
H0: the coin is fair, or in numeric terms, H0: pr(H) = 0.50.
The alternative hypothesis is then
Ha: the coin is biased, or in numeric terms, Ha: pr(H) = 1.
As you are pondering your decision rule – what value of N consecutive H's would be enough evidence for you to bet that I have drawn the biased coin – you can calculate the probability of a false positive finding. That is, the probability that you decide, or bet, that I have selected the biased coin when in fact I have selected the fair coin. This means you lose the bet. This is also called the probability of a Type 1 error or, in statistical jargon, the significance level of the test. It is conventionally denoted by the Greek symbol α (alpha). This probability is written in words as
pr(you say the coin I selected is biased when in fact the coin I selected is fair)
or in hypothesis lingo as
pr(reject H0 when in fact H0 is true)
or in more symbolic terms
pr(reject H0 | H0 is true).
(Note that the vertical bar means “given” or “assuming” when using probability statements.)
A Type 1 error means you lose the bet, so you probably want this probability to be small. So, you want to choose N to be large; but how large? Well, if the null hypothesis is true, then the probability of H on any given flip is 0.50. And since each flip is an independent event, the probability of N consecutive H's using a fair coin is 0.50^N. Now you have a formula to decide how small you want to make your chances of losing the bet (a Type 1 error). You can select N to be whatever you want depending on how much risk you are willing to take for being wrong.
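To make this concrete, here is a minimal sketch in Python (my own illustration, not part of the original bet) tabulating this Type 1 error probability for a few candidate decision rules:

    # Chance that a fair coin produces N consecutive H's, i.e. the
    # probability of losing the bet under the rule "bet after N heads"
    for N in (6, 8, 10, 12, 14):
        alpha = 0.5 ** N
        print(f"N = {N:2d}: pr(Type 1 error) = {alpha:.6f}")

For example, the rule N = 10 carries roughly a 1-in-1,000 chance of betting on the biased coin when the coin is actually fair.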
Now, suppose in our thought experiment, that I have flipped the coin and reported n=10 consecutive H’s. If your decision rule for making the bet was observing 10 consecutive H’s, you would bet that I selected the biased coin. Once we have observed the data, we can compute the probability of observing 10 consecutive H’s assuming the null hypothesis is true (i.e. the coin is fair). This uses the same formula as the Type 1 error and is
p-value = pr(n consecutive H's | H0 is true) = 0.50^n = 0.50^10 = 0.0009766.
Perhaps you will notice that I have shifted from N to n in these formulas, and I have done so intentionally. N is a number that is decided before the experiment is run and defines your risk tolerance or the significance level of the test; n is the observed number of H's – i.e. it is the observed data – and provides a measure of how "far" the data are from the null hypothesis H0. This is known as the p-value or the significance level of the data. [I will save for a later blog the confusion that is created by the conflation of these two distinct quantities – the significance level of the test and the significance level of the data – which are often used (incorrectly) interchangeably.]
With a traditional significance level being 0.05 (or even conservatively 0.01), a p-value < 0.001 would lead one clearly to reject H0 and conclude that it is highly likely that I have selected the biased coin from the bag after observing 10 consecutive H’s.
Rethinking the Decision Strategy
In the usual NHST paradigm, the goal is to reject the null hypothesis. It’s like proof by contradiction in mathematics. You start with a statement. Then you manipulate that statement using logically valid mathematical steps. If you produce a subsequent statement that is known to be false, then you conclude that the original statement was false. In NHST, we start with a statement – the no effect hypothesis – and then do an experiment to gather data regarding that hypothesis. If the data are incompatible with the hypothesis, then we reject that null hypothesis in favor of the alternative (non-null) hypothesis, which is what we wanted to show in the first place.
So, what scientists want is to reject H0, or more precisely, upon doing an experiment and collecting data, they want to know how likely it is that the alternative hypothesis is true – the thing they really want. Equivalently, they want to know the likelihood that the null hypothesis is false given the observed data. Thus, WHAT SCIENTISTS (ACTUALLY ALL OF US) REALLY WANT TO EVALUATE IS
pr(Ha is true | observed data), or equivalently
pr(H0 is false | observed data).
In the context of this thought experiment, the question is “How many consecutive H’s are needed to be willing to bet that I selected the biased coin?” Or, said another way, you want to know what is the probability that I selected the biased coin given n consecutive H’s … or even more precisely, you want to know for what value of n is
pr(I selected the biased coin | n consecutive H’s) > 0.50.
That is, at what point in a sequence of consecutive H's are the odds in your favor, so that it is in your best interest to make the bet?
In formal expression, we need to evaluate
pr(biased coin | n consecutive H’s observed) = pr(Ha | n).
Fortunately, we have Bayes’ Theorem to help with this calculation.
pr(Ha | n) = pr(n | Ha) pr(Ha) / [ pr(n | Ha) pr(Ha) + pr(n | H0) pr(H0) ]
In the thought experiment,
- pr(n|Ha) = 1 since one is guaranteed to get all H’s if the biased coin is used.
- pr(Ha) = 1/10,000 since there is only 1 biased coin in the bag of 10,000 coins.
- pr(n|H0) = 0.5^n since a fair coin has a 50% chance of producing an H on each flip.
- pr(H0) = 9,999 / 10,000.
This results in
1 * (1/10,000) / [ 1 * (1/10,000) + 0.5^10 * (9,999/10,000) ] = 0.093.
In words, given that I have observed 10 consecutive H's, the probability that I have selected the biased coin from the bag of 10,000 coins is 0.093. With such a low probability, one should conclude that it is very unlikely that I have selected the biased coin and therefore should NOT make the bet after 10 consecutive H's!
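For readers who want to check this, a short Python sketch (my own, using only the quantities defined above) reproduces the posterior and finds the smallest n at which the bet becomes favorable:

    def posterior_biased(n, n_coins=10_000):
        # pr(biased coin | n consecutive H's) via Bayes' theorem
        prior_biased = 1 / n_coins       # pr(Ha): one two-headed coin
        prior_fair = 1 - prior_biased    # pr(H0): the rest are fair
        lik_fair = 0.5 ** n              # pr(n | H0): fair coin
        num = 1.0 * prior_biased         # pr(n | Ha) = 1: heads guaranteed
        return num / (num + lik_fair * prior_fair)

    print(round(posterior_biased(10), 3))  # 0.093, matching the hand calculation

    n = 1
    while posterior_biased(n) <= 0.5:
        n += 1
    print(n)  # 14: the first n at which the odds favor the bet

So under this prior you should not bet until you have seen 14 consecutive H's, the point at which 2^14 = 16,384 finally outweighs the 9,999-to-1 prior odds against the biased coin.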
That’s a strikingly different answer than we got using NHST. What’s up?!
Answer the Right Question
There is an adage that says, “do things right” and another that says, “do the right things.” The first is about “how” and the latter is about “what.” If you don’t get the “what” (i.e. the right question) aligned with your goals, even the best “how” efforts are futile since you are doing the wrong things very well. Let me expand on this general truism – first “what”, then “how.”
Let A be the event rain, and let B be the event cloudy. What we want to know is pr(rain|cloudy), or pr(A|B). We all know that pr(B|A) is a very different quantity and addresses a very different question/concept (the "what"). It can be so different as to be meaningless or useless, as in pr(cloudy|rain). We would never mistake pr(rain|cloudy) for pr(cloudy|rain) in our personal lives. Furthermore, we would be shocked if weather reporters quoted pr(cloudy|rain) in their forecasts – what we do not want – but conveyed it and convinced us as if it were pr(rain|cloudy) – what we really want. That would be totally unacceptable, and we might even declare such a weather reporter a fraud!
Now, let A = a hypothesis and B = the observed data from an experiment about that hypothesis. [It doesn’t matter for this argument whether A is the null or alternative hypothesis.] Just as in the thought experiment we want to know the probability that I have selected the biased coin, so too in a scientific experiment or clinical trial we want to know the probability that a hypothesis is true or false. What is the pr(drug works)? What is the pr(cigarettes cause cancer)? What is the pr(increased spending on TV ads will increase sales)? What is the pr(use of educational program X will increase learning)? All these questions are answered in the context of observing some data (from a controlled experiment or an observational study) relevant to the hypothesis. And they are all stated from the context of
pr(Ha is true) or equivalently pr(H0 is false).
And so, we can express the right question as, "What is pr(hypothesis | observed data)?" That is, what can I infer about the underlying reality or the truth of nature based on the data that I have observed? This is precisely what Thomas Bayes (circa 1763) was addressing, and it is the right question to be asking still! It is pr(A|B), and more specifically,
pr(H0 is false | observed data).
In NHST, we report the p-value, which is akin to pr(B|A), or specifically,
pr(data | H0 is true).
There is some meaning to this probability [unlike the pr(cloudy|rain)] in the sense of proof by contradiction. However, it is fundamentally an answer to the wrong question. The Bayesian probability captured in the pr(hypothesis|data) is more akin to direct proof in mathematics and is indeed a direct answer to the question about which we are all most interested – the right question.
For too long statisticians have been peddling pr(data|hypothesis) in the form of p-values to scientists, who have fully adopted their use (and over-use) and, in fact, made it the "gold standard" for scientific decision-making. Because most scientists do not fully engage in our mathematical and statistical priesthood of confusing calculations and algorithms, they have accepted what we have told them. In fact, many think (erroneously) of a p-value as exactly the probability that they want – pr(hypothesis|data) – and we as statisticians frequently grumble that scientists do not understand this and misinterpret p-values. For too long, statisticians have been quietly substituting pr(B|A) for what is desired, pr(A|B), all the while lamenting in our offices and lunchrooms that the scientists don't get it. We are akin to the weather reporter who gives pr(cloudy|rain) – that is, the p-value – but sells it as pr(rain|cloudy). Statisticians (including me) are complicit in this "bait and switch," at least by implicit consent, through the mass production and continued prevalence of p-values.
Lastly, a persistent critique of Bayesian approaches is the need to specify a subjective prior. The counterarguments from Bayesians are many and varied, but a simple one is that frequentist methods and hypothesis testing also require assumptions, models, etc. I am ultimately on the side of no less an authority than John Tukey (1), who stated, "Far better an approximate answer to the right question, which is often vague, than an exact answer to the wrong question, which can always be made precise."
(1) John W. Tukey. The future of data analysis. Ann. Math. Statist. 33, 1–67 (1962).
Steve,
I really like your biased coin example. It very nicely illustrates the importance of the "prior" in putting the problem of interest in the right context.
In your example, the question is “What is the chance that the selected coin is a biased one (with two heads) after observing 10 consecutive H’s in coin tossing?” As you show, NHST gives a p-value of <0.001 which indicates that the data strongly suggest that the null hypothesis of a fair coin is unlikely to be true. Hence, one will reject the null hypothesis and conclude that the selected coin is biased.
Although it seems persuasive, it is the right answer to the wrong question.
Very simply put, the answer provided by NHST ignores how many coins are in the bag. Whether there is 1 biased coin and 9 fair coins in the bag or 1 biased coin and 9,999 fair coins in the bag, the answer is the same. This obviously does not make sense.
The Bayesian approach requires the use of a prior. Some may consider specifying the prior to be subjective and therefore not "scientific," because science is supposed to be objective. Yes, science should be factual, impartial, and free of undue influence. But science should also be put in the right context. Whether there is 1 biased coin among 9 fair coins or 1 biased coin among 9,999 fair coins clearly affects the chance that the selected coin is biased. The lower the prior probability of a biased coin (1 in 10,000 versus 1 in 10), the less likely the selected coin is biased, regardless of the outcome of the tosses.
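To put numbers on it, here is a quick Python check (my own sketch of the calculation):

    def posterior_biased(n_heads, n_coins):
        # pr(n_heads | biased) = 1; pr(n_heads | fair) = 0.5 ** n_heads
        prior = 1 / n_coins
        return prior / (prior + 0.5 ** n_heads * (1 - prior))

    print(round(posterior_biased(10, 10), 3))      # 0.991: 1 biased coin among 10
    print(round(posterior_biased(10, 10_000), 3))  # 0.093: 1 biased coin among 10,000

The same 10 consecutive H's yield a posterior of about 99% in one bag and about 9% in the other, while NHST reports the identical p-value in both.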
A similar example can be found in the field of diagnostic testing, which you also cover in your next blog.
Given that a test has 95% sensitivity and 90% specificity, if I test positive, what is the probability that I have the disease?
Even though the test is exceptionally accurate, we cannot answer the question. Why? Because one critical piece of information is missing, and we need it before we can answer: "What is the prevalence of the disease?"
If the prevalence is 50%, the probability that I have the disease given a positive test (i.e., the positive predictive value, PPV) is 90.5%. If the prevalence is 30%, my PPV is 80.3%. If the prevalence is only 1%, the PPV is only 8.8%. In this setting, the prevalence is the prior and the PPV is the posterior. Without knowing the prior, we cannot calculate the posterior.
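For readers who want to verify these numbers, a small Python sketch of the PPV calculation (my own, assuming the 95% sensitivity and 90% specificity stated above):

    def ppv(sens, spec, prev):
        # positive predictive value via Bayes' theorem
        true_pos = sens * prev
        false_pos = (1 - spec) * (1 - prev)
        return true_pos / (true_pos + false_pos)

    for prev in (0.50, 0.30, 0.01):
        print(f"prevalence {prev:.0%}: PPV = {ppv(0.95, 0.90, prev):.1%}")

This prints 90.5%, 80.3%, and 8.8%, matching the values above.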
J. Jack Lee, University of Texas MD Anderson Cancer Center
Jack,
Thanks for the comments. They are “spot on.” I will add a few thoughts briefly.
"Science should be objective." – Let's face it, science is subjective no matter how hard we try. Personal judgments and experience color everything we think and believe. If the University of Nowhere announced astonishing research claiming to have produced anti-gravity (p < 0.001), would you believe it? We would all rightly put such a discovery in the context of our collective knowledge and be very skeptical. The same should be true (perhaps to a lesser extent) when someone says they have discovered the medical cure for Alzheimer's based on a lab experiment or a small clinical trial.
For NHST, "the answer is the same." Absolutely the point I am trying to make as well. NHST assumes knowledge of the truth (that H0 is true), so you should always get the same answer if you KNOW the truth. Of course, we do not know the truth, and the context for the experiment, or for any NHST, is essential. A p-value cannot be interpreted without knowing some context (i.e. the subjective piece of science).
Lastly, EVERYONE with any knowledge or sense about diagnostic testing knows that you can only interpret a positive test result in the context of the prevalence of the underlying disease … and that PPV IS THE ONLY MEANINGFUL INTERPRETATION of a diagnostic result. This has been well-known since the dawn of diagnostic testing and is uniformly adopted by all in the medical profession. Why is it so hard to adopt this IDENTICAL thinking when it comes to a clinical trial or research experiment?!? I think part of the answer is that we statisticians have oversold NHST over the last 8 decades.
Thanks again. Please share with colleagues and students. Please feel free to use any examples for educational purposes etc.