Now let's relate this to a particular sample that Fred might have drawn.
For the moment, let's consider the possibility that he flips the coin 400 times and gets heads on 396 of those flips. So the sample proportion in this case would be p = 396/400 = 0.99.
Would you consider this to be evidence that the coin is weighted? Probably. A fair coin producing 396 heads and only 4 tails seems very unlikely - so unlikely that we would consider it as evidence that the coin isn't fair.
But maybe it is fair!
Any statistician will tell you that 396 heads and 4 tails is enough evidence that the coin isn't fair. But it is worth mentioning that it is still possible that the coin is fair.
However, when faced with 396 heads and 4 tails, you basically have two options: either accept that the coin is fair and that something extraordinarily unlikely has just happened, or conclude that the coin is not fair.
Statisticians will always go with the second choice in this situation. And so will we in this textbook.
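To put a rough number on 'very unlikely', the short Python sketch below computes the exact probability that a fair coin gives 396 or more heads in 400 flips. It works directly with the binomial distribution, which is not the approach developed in this section, and the function name `prob_at_least` is purely illustrative.

```python
import math

def prob_at_least(k_min, n, p=0.5):
    """Exact binomial probability of getting k_min or more heads in n flips."""
    return sum(
        math.comb(n, k) * p**k * (1 - p) ** (n - k)
        for k in range(k_min, n + 1)
    )

# Chance that a fair coin produces 396, 397, 398, 399 or 400 heads in 400 flips.
prob = prob_at_least(396, 400)
print(prob)
```

The printed value is smaller than 10^-100, so 'very unlikely' is, if anything, an understatement.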
What if Fred got 200 heads and 200 tails? Well, this certainly doesn't contradict the assumption that the coin is fair. In this case the sample proportion is 200/400 = 0.5, which happens to equal the assumed value of the population proportion. If you saw the coin produce this even split, you wouldn't conclude that the coin was weighted.
In fact, due to sampling variability, you wouldn't conclude that the coin was weighted even if Fred got (for example) 201 heads and 199 tails, or 202 heads and 198 tails. The coin could be fair and still produce samples like these, because samples (and sample proportions) do vary.
So, we've looked at some extreme examples. If Fred got 396 heads, we'd conclude that our assumption that the coin is fair was wrong. At the other extreme, if he got 201 heads, we would have no reason to doubt that assumption.
But in between these two extremes, where do we draw the line? At what point do we say that the number of heads is too far away from 200, and so we will reject our assumption about fairness?
Well, this is where the sampling distribution can help us. We just developed what might be considered a '95% cut-off level' for the test when we argued that, if the null hypothesis is true, then 95% of all sample proportions will be between 0.451 and 0.549. So we might start the hypothesis test out by saying: 'OK, we'll assume that the coin is fair and that π = 0.5. But, as a result of this assumption, we know a region where 95% of all sample proportions should lie. So, if we get a sample proportion that is outside this region, we will reject the assumption.'
We argued earlier that this region was the interval from 0.451 up to 0.549.
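As a quick check on where these two numbers come from, the sketch below recomputes the region from the assumed proportion π = 0.5 and the sample size of 400, using the 1.96 standard deviation rule. The function name `region_95` is an illustrative choice, not notation from this book.

```python
import math

def region_95(pi0, n, z=1.96):
    """Interval holding about 95% of sample proportions if the true proportion is pi0."""
    # Standard deviation of the sample proportion under the assumed value pi0.
    std_dev = math.sqrt(pi0 * (1 - pi0) / n)
    return pi0 - z * std_dev, pi0 + z * std_dev

lower, upper = region_95(0.5, 400)
print(round(lower, 3), round(upper, 3))  # prints 0.451 0.549
```

Here the standard deviation is 0.025, so the region extends 1.96 × 0.025 = 0.049 either side of 0.5.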
So, for example, if Fred got 259 heads, this would produce a sample proportion of 0.6475. This is outside the region and so, according to the hypothesis test, would constitute enough evidence to reject the null hypothesis. If, on the other hand, he got 194 heads then his sample proportion would be 0.485. This is in the region and so, according to the hypothesis test, would not constitute enough evidence to reject the null hypothesis.
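The same comparison can be written as a small Python sketch. The function name `decide` and the wording of its return values are assumptions made for illustration. The bounds 0.451 and 0.549 are the ones found earlier.

```python
def decide(heads, n=400, lower=0.451, upper=0.549):
    """Compare a sample proportion with the 95% region and report the decision."""
    proportion = heads / n
    if proportion < lower or proportion > upper:
        return "reject"          # outside the region: evidence against a fair coin
    return "do not reject"       # inside the region: no evidence against a fair coin

print(259 / 400, decide(259))  # prints 0.6475 reject
print(194 / 400, decide(194))  # prints 0.485 do not reject
```

Note that 'do not reject' is not the same as 'the coin is fair'. It only says that this sample gives no grounds for abandoning the assumption.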
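The point is that we have used the sample as evidence against which we are judging the validity of the assumption that is made at the beginning of the hypothesis test. This is the basic methodology of a hypothesis test. To recap the steps Fred took: he assumed that the coin was fair (π = 0.5); he used this assumption to find the region in which 95% of all sample proportions should lie; he collected a sample and calculated its sample proportion; and he compared that sample proportion with the region to reach a decision.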
In particular, at this last stage, Fred will make one of two different decisions. If the sample proportion he observes is outside the interval from 0.451 to 0.549, he argues that the assumption he made must have been wrong, and so the coin is weighted. On the other hand, if the sample proportion he observes is in the interval, Fred argues that he cannot reject the assumption. The coin may be fair - he cannot tell.
At this stage it is worth pointing out that this (basic) hypothesis test was performed at a 'level' of 95%. In the next section we'll see more precisely what this means, and how to conduct hypothesis tests at other 'levels'. For the moment, the important point is that to conduct our test, we used the fact that 95% of (any) normal distribution lies within 1.96 standard deviations of its mean.
It is also important to note that we might be 'wrong' when conducting a hypothesis test. For example, if the coin is truly fair, then 95% of the time a sample is collected, we will get a sample proportion within the 'healthy' range between 0.451 and 0.549. But that means that 5% of the time we will get a sample proportion outside this range, which would lead us to reject the assumption that the coin is fair - even though the coin is actually fair!
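To see this 5% figure in action, the sketch below adds up the exact binomial probabilities of every head count whose sample proportion falls outside the range 0.451 to 0.549, assuming the coin really is fair. The names used are illustrative. The answer should be close to 0.05 but not exactly equal to it, because head counts are whole numbers while the range comes from a normal approximation.

```python
import math

def rejection_probability(n=400, lower=0.451, upper=0.549):
    """Exact chance that a fair coin gives a sample proportion outside [lower, upper]."""
    outcomes_outside = 0
    for heads in range(n + 1):
        proportion = heads / n
        if proportion < lower or proportion > upper:
            # Number of equally likely flip sequences with this many heads.
            outcomes_outside += math.comb(n, heads)
    # A fair coin makes all 2**n sequences of flips equally likely.
    return outcomes_outside / 2**n

rate = rejection_probability()
print(f"{rate:.3f}")
```

So roughly one sample in twenty from a perfectly fair coin would lead Fred to reject the assumption of fairness.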
Later in this chapter we will have more to say about these 'errors'. For the moment, we will just say that such 'errors' do not mean that you have made a mistake in your method. As with statistical estimation, hypothesis testing will always involve some inherent uncertainty.