Uncertainty in testing

In Chapter 7 on statistical estimation we emphasized the point that estimation always involves uncertainty. We can only have '95% confidence' in a 95% confidence interval estimate - hence the name! What does this mean in estimation? Suppose 100 statisticians all wanted to estimate the same population parameter. They each separately went out and collected their own sample and produced their own confidence interval. On average, 95 of these statisticians would produce interval estimates that do contain the population parameter, while 5 would not.

Another thing that we emphasized back in Chapter 7 is that these 5 statisticians didn't do anything wrong! They didn't make a 'mistake'. It is simply the case that they, like the rest of the statisticians, were trying to use sample data (which is incomplete) to say something about a population.

As with estimation, hypothesis tests always involve some uncertainty. When we draw a conclusion from a hypothesis test, we can't be certain that we are correct in doing so. And as with estimation, this doesn't mean we have made a mistake! Such errors of uncertainty are inherent to testing.

Unfair coin?

Suppose someone gives you a coin and tells you it is fair. You flip the coin 1,000 times and it comes up heads 800 times. You conclude that the person who gave you the coin was wrong: the coin is not fair. Mathematical details aside, this scenario is essentially a hypothesis test.

But you could, of course, be wrong in your conclusion. It is possible that the person was correct, that the coin was fair, and that such an extreme sample as 800 heads (and 200 tails) came from this population. It is extremely unlikely, but possible. But if this is the case, there is nothing you can do about it. The methodology of hypothesis testing would lead you to reject the hypothesis that the coin is fair.
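To get a feel for just how unlikely "extremely unlikely" is, the following minimal sketch computes the exact chance that a fair coin produces 800 or more heads in 1,000 flips, along with the usual test statistic for a proportion. The variable names are illustrative only and are not taken from elsewhere in this chapter.

```python
from math import comb, sqrt

n_flips = 1000      # flips in the example
heads_seen = 800    # heads observed in the example
p_fair = 0.5        # null hypothesis: the coin is fair

# Exact probability of 800 or more heads from a fair coin.
# Every sequence of 1,000 flips has probability (1/2)**1000, so we count
# the sequences with at least 800 heads using exact integer arithmetic.
favourable = sum(comb(n_flips, k) for k in range(heads_seen, n_flips + 1))
upper_tail = favourable / 2**n_flips

# Test statistic for a proportion: the number of standard errors between
# the sample proportion (0.8) and the hypothesised proportion (0.5).
p_hat = heads_seen / n_flips
z = (p_hat - p_fair) / sqrt(p_fair * (1 - p_fair) / n_flips)

print(f"P(at least {heads_seen} heads | fair coin) = {upper_tail:.3g}")
print(f"z = {z:.2f}")
```

The test statistic comes out at roughly 19 standard errors above what a fair coin would give, and the tail probability is vanishingly small. It is still not zero, which is the point of the paragraph above.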

The above example can be put into the language of hypothesis testing like this: if the coin really is fair, then you rejected the null hypothesis when it was true and should not have been rejected.

You may recall that the probability of such an event was defined earlier in this chapter to be the level of significance, α.

For most of this chapter, we have thought of the level of significance as the number that you choose at the beginning of the test and that determines your critical value and region of rejection. And this is, operationally, what α does. But it is defined to be the probability of committing a particular type of error: the error of rejecting the null hypothesis when it is true.

In fact, because α is the probability of rejecting the null hypothesis when it is true, it is actually a conditional probability. Expressed as such, we can say:

α = P(H0 is rejected | H0 is true)

So what is α?

It may seem like α is two entirely different things: it is the probability of a particular error occurring but it is also a number that determines key values in our hypothesis test. Well, it is both of these things but they aren't all that different.

As an example, suppose as a statistician you choose a level of significance α = 0.05 for a two-sided test for a population proportion:

H0: π = 0.3

HA: π ≠ 0.3

This level of significance leads you to the critical values -1.96 and 1.96, and the region of rejection is the set of values outside these critical values. Notice that, by design, 5% of the area under the standard normal curve lies in the region of rejection (2.5% in each tail) and 95% lies between the two critical values.
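The link between α = 0.05 and the critical values can be checked numerically. This is a small sketch using Python's standard-library NormalDist class; the variable names are chosen here for illustration.

```python
from statistics import NormalDist

alpha = 0.05
std_normal = NormalDist()  # mean 0, standard deviation 1

# Two-sided test: split alpha equally between the two tails.
upper_critical = std_normal.inv_cdf(1 - alpha / 2)
lower_critical = -upper_critical

# Area under the curve outside the critical values (the region of rejection).
rejection_area = std_normal.cdf(lower_critical) + (1 - std_normal.cdf(upper_critical))

print(f"critical values: {lower_critical:.2f}, {upper_critical:.2f}")
print(f"area in region of rejection: {rejection_area:.3f}")
```

Changing alpha to, say, 0.01 moves the critical values further from zero and shrinks the region of rejection accordingly.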

Now, under the assumption that the null hypothesis is true, the test statistic from your sample will (approximately, for a sufficiently large sample) follow the standard normal distribution. So if the null hypothesis is true, then 95% of the time you'll get a test statistic between the two critical values (and therefore not reject the null hypothesis) and 5% of the time you'll get a test statistic in the region of rejection (and therefore reject the null hypothesis).

That is, there is a 5% chance that you will reject the null hypothesis in this test if it is true. This is because the level of significance you chose at the beginning was α = 0.05! This is the 'connection' between the two notions of α.

Errors in the test

Instead of thinking about 'percentage chances', you may like to look at it like this: suppose you happen to know that the null hypothesis is true in the above test. That is, suppose you happen to know that π is equal to 0.3. Now suppose 100 statisticians (who aren't lucky enough to know what you know!) decide to test this null hypothesis with a level of significance of α = 0.05. They each separately go out and collect their own sample and calculate their own test statistic. On average, 95 of these statisticians would not reject the null hypothesis. The other 5 would (incorrectly) reject it.
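The thought experiment above can be imitated with a short simulation. This is only a sketch under stated assumptions: the sample size of 1,000 per statistician, the random seed, and the use of 2,000 statisticians rather than 100 (to smooth out simulation noise) are all choices made here for illustration, not values from the text.

```python
import random
from math import sqrt

random.seed(1)           # assumed seed, only so the run is repeatable

true_pi = 0.3            # the null hypothesis is true by construction
sample_size = 1000       # assumed sample size per statistician
n_statisticians = 2000   # more than 100, to smooth out simulation noise
critical_value = 1.96    # two-sided critical value for alpha = 0.05

def rejects_null() -> bool:
    """One statistician: draw a sample, compute z, compare with the critical value."""
    successes = sum(random.random() < true_pi for _ in range(sample_size))
    p_hat = successes / sample_size
    standard_error = sqrt(true_pi * (1 - true_pi) / sample_size)
    z = (p_hat - true_pi) / standard_error
    return abs(z) > critical_value

rejections = sum(rejects_null() for _ in range(n_statisticians))
rejection_rate = rejections / n_statisticians

print(f"{rejections} of {n_statisticians} rejected a true null hypothesis")
print(f"rejection rate: {rejection_rate:.3f}")
```

The rejection rate should land near 0.05 without hitting it exactly, both because of simulation noise and because the standard normal distribution only approximates the behaviour of a sample proportion.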

This situation is similar to that of the 100 statisticians we were discussing earlier, who were constructing 100 confidence intervals. And, just as we pointed out then, the 5 statisticians who incorrectly reject the true null hypothesis haven't made a 'mistake'. They were just 'unlucky'. Hypothesis testing is a part of statistical inference, and statistical inference is always uncertain. And, because it is uncertain, there is always the chance that an error will be made.