We've now seen the sort of claims that can pop up in a hypothesis test. Agreeing on what these claims are will typically be the first thing you do when you conduct a hypothesis test. In fact, a hypothesis test can broadly be broken up into three steps:
Step 1: Settle on a claim about a population parameter. In particular define the null hypothesis H0 and the alternative hypothesis HA.
Step 2: Collect a sample and use this to test the null hypothesis.
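Step 3: Finish the test with one of the following. Either:
reject the null hypothesis H0, because there is enough evidence against it, or
do not reject the null hypothesis H0, because there is not enough evidence against it.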
The second step in the above process is by far the largest, and covers a substantial part of the methodology of hypothesis testing. We'll leave this step for now and treat it later in this chapter. (In fact, pretty much the rest of this chapter is dedicated to Step 2!) Our main interest at the moment is in the final decision, Step 3. Every hypothesis test you will ever conduct will result in one of the two outcomes in this step, so let's have a look at what each outcome does and does not entail.
Based on what you've read so far, you could be forgiven for wondering why we can only ever conclude that the claim being tested is false, or alternatively conclude that we don't have enough evidence. What about concluding that it is true!?!
Fair coin - conclusions
Fred wants to conduct a hypothesis test with the following null and alternative hypotheses:
H0: π = 0.5
HA: π ≠ 0.5
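After he collects a sample and conducts the test, he will (depending on the results of the test) either:
reject H0 and conclude that the coin is not fair, or
not reject H0, because there is not enough evidence that the coin is not fair.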
But suppose Fred's friend owns the coin and is offended that Fred would even dare suggest that the coin is weighted. In fact, the friend wants to be able to conclude that the coin is fair!
Well, unfortunately he can never do that.
This all sounds terribly unfair. It seems that, no matter how the hypothesis test goes, you cannot conclude that the null hypothesis is correct. Fred's friend can never make a claim like:
We have collected enough evidence to suggest that the population proportion π is exactly 0.5.
But there is a very good reason for this. Albert Einstein is often quoted as saying of his scientific theories:
No amount of experimentation can ever prove me right; a single experiment can prove me wrong.
And no, he wasn't just trying to put himself down! The same rule applies to any scientific theory, and to all of statistical inference.
The reason we can't use a hypothesis test to prove the null hypothesis true comes down to two factors:
how strong a claim each hypothesis makes, and
what a sample can and cannot tell us about each claim.
To address the first point, consider the null and alternative hypotheses in the coin test.
The null hypothesis asserts that π is 0.5, exactly. That is, the null hypothesis asserts that, out of the entire continuous spectrum of possible values between 0 and 1, π is specifically and exactly 0.5.
The alternative hypothesis asserts that π is not 0.5. It simply asserts that, out of the entire continuous spectrum of possible values, π is some value (any value!) other than 0.5.
The point is that the null hypothesis is an extremely strong claim, and the alternative hypothesis is relatively weak.
Now for the second point: how might a sample stand up for or against each of the hypotheses?
Suppose Fred collected his sample by flipping the coin 400 times.
And, for the moment, suppose he got a sample proportion of p = 0.99. Note that this means he got 396 heads and 4 tails. Would this constitute sufficient statistical evidence that the coin is not fair, that π is not equal to 0.5? Yes, it would. Put simply, a sample proportion of 0.99 is so different to 0.5 that we can conclude that the population proportion isn't 0.5.
Samples are good at doing that: they are good at contradicting things. You would look at this (very extreme) sample collected and you would say: 'No way. For the coin to be fair and still give us a result like that is just too unlikely.'
That is, a sample statistic that is sufficiently different to the null hypothesis can be used as evidence against that null hypothesis.
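To put a rough number on 'too unlikely', here is a minimal Python sketch. It assumes the 400 flips are independent and uses the exact binomial distribution, which is not necessarily the method developed later in this chapter. The helper names binom_pmf and upper_tail are our own, chosen purely for illustration.

```python
from math import comb

def binom_pmf(k, n, pi):
    """Probability of exactly k heads in n independent flips when P(heads) = pi."""
    return comb(n, k) * pi**k * (1 - pi)**(n - k)

def upper_tail(k, n, pi):
    """Probability of k or more heads in n independent flips when P(heads) = pi."""
    return sum(binom_pmf(j, n, pi) for j in range(k, n + 1))

n = 400            # number of flips in Fred's sample
heads = 396        # sample proportion p = 396 / 400 = 0.99

# How likely is a result at least this extreme if the coin really is fair?
p_extreme = upper_tail(heads, n, 0.5)
print(f"P(at least {heads} heads | fair coin) = {p_extreme:.3e}")
```

The printed probability is smaller than 10^-100. A fair coin essentially never produces a sample this lopsided, which is why the sample counts as evidence against π = 0.5.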
But now suppose that he got a sample proportion of p = 0.51. Note that this means he got 204 heads and 196 tails. That sample proportion is fairly close to 0.5. But does it constitute sufficient statistical evidence that the coin is fair, that π is exactly 0.5? No, it doesn't. It is consistent with the hypothesis that π is 0.5, because a fair coin could easily give us 204 heads and 196 tails. But the sample is also consistent, for example, with a population proportion of π = 0.51. Indeed, a sample proportion of 0.51 is obviously consistent with a population proportion of 0.51, because it is equal to it! So, even though the sample proportion p = 0.51 is close to 0.5, it by no means proves that the coin is fair.
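We can check the 'consistent with both' claim numerically. The sketch below assumes independent flips and an exact binomial model (the helper name binom_pmf is ours, for illustration only). It computes how likely exactly 204 heads in 400 flips is under π = 0.5 and under π = 0.51.

```python
from math import comb

def binom_pmf(k, n, pi):
    """Probability of exactly k heads in n independent flips when P(heads) = pi."""
    return comb(n, k) * pi**k * (1 - pi)**(n - k)

n = 400        # number of flips
heads = 204    # sample proportion p = 204 / 400 = 0.51

prob_fair = binom_pmf(heads, n, 0.50)   # coin is exactly fair
prob_051 = binom_pmf(heads, n, 0.51)    # coin is very slightly biased

print(f"P(204 heads | pi = 0.50) = {prob_fair:.4f}")
print(f"P(204 heads | pi = 0.51) = {prob_051:.4f}")
```

The two probabilities are of the same order of magnitude, so this sample gives us no way of telling the two values of π apart.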
In fact, even if Fred got exactly 200 heads and 200 tails, even this wouldn't prove that π is 0.5! Even though, obviously, a sample like this is consistent with a population proportion of 0.5, it is also consistent with a population proportion of 0.51, or 0.49, for example. True, these values are close to 0.5. But that is not good enough for the null hypothesis! The null hypothesis specifically claims that the population proportion is exactly 0.5. A sample can never give us enough evidence to be sufficiently sure that this is true.
A sample statistic that is close to the null hypothesis cannot be considered sufficient evidence for that null hypothesis.
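To see numerically why even a 200-heads sample cannot single out π = 0.5, here is a minimal sketch. It assumes independent flips and an exact binomial model, and the names binom_pmf, candidates and relative are our own. It compares several candidate values of π by how likely each makes exactly 200 heads in 400 flips, relative to π = 0.5.

```python
from math import comb

def binom_pmf(k, n, pi):
    """Probability of exactly k heads in n independent flips when P(heads) = pi."""
    return comb(n, k) * pi**k * (1 - pi)**(n - k)

n = 400        # number of flips
heads = 200    # sample proportion p = 200 / 400 = 0.5

candidates = [0.45, 0.49, 0.50, 0.51, 0.55]   # illustrative values of pi
baseline = binom_pmf(heads, n, 0.50)

# Likelihood of the sample under each candidate, relative to pi = 0.5
relative = {pi: binom_pmf(heads, n, pi) / baseline for pi in candidates}

for pi, ratio in relative.items():
    print(f"pi = {pi:.2f}: relative likelihood = {ratio:.3f}")
```

Values such as 0.49 and 0.51 make the sample almost as likely as 0.5 does, whereas values further away, such as 0.45, make it noticeably less likely. The sample can rule out distant values, but it cannot pick out 0.5 from its close neighbours.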
Samples are not good at proving specific things just because they are consistent with them. In contrast, they are good at disproving specific things by contradicting them.
For these reasons, we never say that we have proven the null hypothesis after a test. If there is not enough evidence to reject the null hypothesis, we sometimes say that we retain the null hypothesis. We interpret this to mean just what we have said: that we are not rejecting the null hypothesis, because there is not enough evidence to do so.