In the previous section, when testing the fairness of the coin, we reached a point where we were able to say something like the following:
Assuming the coin is fair, then based on the sampling distribution of the proportion, 95% of all sample proportions will lie between 0.451 and 0.549. Therefore when we collect a sample, we calculate a sample proportion and determine whether or not it is in this range. If the sample proportion is in this range, we do not reject the assumption that the coin is fair. But if it is not in this range, we reject the assumption that the coin is fair and accept the alternative hypothesis that it is not fair.
Of course, the exact endpoints of the range depend on the fact that we have used '95%' in the above statement. To be able to run hypothesis tests at other 'levels', we will need to relate the sampling distribution to the standard normal distribution Z. We will also relate the sample statistic (like the sample proportion) to Z. In doing this, we produce what is known as the test statistic for the sample statistic.
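To see how the endpoints depend on the chosen level, the range quoted above can be reproduced with a short calculation. This is only a sketch: the function name `central_range` is made up for illustration, and the sample size of 400 flips is the one used for the coin example later in this section.

```python
from math import sqrt
from statistics import NormalDist

def central_range(pi0, n, level=0.95):
    """Range holding `level` of all sample proportions if the true proportion is pi0.

    Assumes the sampling distribution of the proportion is approximately normal.
    """
    se = sqrt(pi0 * (1 - pi0) / n)             # standard deviation of the sample proportion
    z = NormalDist().inv_cdf(0.5 + level / 2)  # about 1.96 when level = 0.95
    return pi0 - z * se, pi0 + z * se

low, high = central_range(0.5, 400)
print(f"{low:.3f} to {high:.3f}")  # prints "0.451 to 0.549"
```

Passing a different `level`, such as 0.99, widens the range, which is why the endpoints cannot be stated without first fixing the level.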
The test statistic will always be a number that gives us some standard way of describing where the sample statistic lies within the sampling distribution. To see why we need test statistics, and what they can do for us, consider the following example.
Comparing the evidence provided by two different tests
Sam and Pam each want to run a hypothesis test. Their hypothesis tests have nothing to do with one another, but we will look at both so that we can see which one has the 'stronger' evidence against the null hypothesis.
Sam's test
Census data has shown that 20% of small businesses record a profit in their first year. Sam is testing the hypothesis that this proportion π is the same in her specific city. So she is conducting a hypothesis test with the null and alternative hypotheses:
H0: π = 0.2
HA: π ≠ 0.2
She collects a sample of 400 small businesses and finds that 24% of the small businesses recorded a profit in their first year.
Pam's test
The population-wide approval rating for the government at the last election was 53%. That election was two years ago and Pam would like to test whether this approval rating π is the same today. So Pam is conducting a hypothesis test with the null and alternative hypotheses:
H0: π = 0.53
HA: π ≠ 0.53
She surveys 100 people and finds that the approval rating in this sample is 48%.
Comparing the tests
In both Sam's and Pam's tests, the sample proportion calculated is somewhat different to the value proposed in the null hypothesis. But which one is 'more' different? Can we say 'where' the sample statistic lies in the sampling distribution in each case? That is, where does a sample proportion of 24% 'sit' in Sam's hypothesis test and, likewise, where does a sample proportion of 48% sit in Pam's?
As it turns out, Sam's sample proportion of 24% provides much more evidence against her null hypothesis than Pam's sample proportion of 48% provides against hers. Put simply, this is true because Sam's sample proportion is more standard deviations away from the hypothesised value.
To be more precise, it turns out that Sam's sampling distribution has a standard deviation of 2% (that is, 0.02). So a sample proportion of 24% is 2 standard deviations above the proposed value of 20%. In comparison, Pam's sampling distribution has a standard deviation of approximately 5% (or 0.05). So a sample proportion of 48% is 1 standard deviation below the proposed value of 53%.
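The standard deviations and distances quoted here can be checked with a few lines of code. This is a minimal sketch that assumes the usual formula for the standard deviation of a sample proportion, sqrt(π(1 − π)/n); the helper names `standard_error` and `z_score` are chosen for illustration only.

```python
from math import sqrt

def standard_error(pi0, n):
    """Standard deviation of the sample proportion when the true proportion is pi0."""
    return sqrt(pi0 * (1 - pi0) / n)

def z_score(p, pi0, n):
    """Number of standard deviations between the sample proportion p and pi0."""
    return (p - pi0) / standard_error(pi0, n)

# Sam: H0 proportion 0.20, sample of 400, sample proportion 0.24
sam_z = z_score(0.24, 0.20, 400)
# Pam: H0 proportion 0.53, sample of 100, sample proportion 0.48
pam_z = z_score(0.48, 0.53, 100)

print(f"Sam: sd = {standard_error(0.20, 400):.3f}, z = {sam_z:.2f}")  # sd = 0.020, z = 2.00
print(f"Pam: sd = {standard_error(0.53, 100):.3f}, z = {pam_z:.2f}")  # sd = 0.050, z = -1.00
```

Pam's sample proportion is further from her hypothesised value in raw percentage points (5 versus 4), yet Sam's is further in standard deviations, because her larger sample gives a narrower sampling distribution.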
The point is that the test statistic is a score that gives us a 'standard' way of saying how 'far' a sample statistic is from the value we expect from the null hypothesis.
When we assume that the null hypothesis is correct, we assign some value to a population parameter. As discussed in the previous section, this then enables us to describe the sampling distribution of the corresponding sample statistic. We would like to have some standard way of being able to describe 'where' the sample statistic lies in relation to this sampling distribution. This is what the test statistic does for us. It tells us how far the sample statistic is from the value proposed by the null hypothesis.
Fair coin - test statistic
The hypothesis test for the fairness of the coin uses the null and alternative hypotheses:
H0: π = 0.5
HA: π ≠ 0.5
Under the assumption that the null hypothesis is true, the sampling distribution of the proportion P approximately follows the normal distribution with mean 0.5 and standard deviation 0.025. So P can be transformed into the standard normal distribution Z through the transformation formula:
$$Z = \frac{P - 0.5}{0.025}$$
Now suppose a sample of 400 coin flips is observed and a sample proportion of p = 0.56 heads is recorded (that is, 224 heads). This sample proportion is an example of a sample statistic. The test statistic for this sample statistic is its z-score, that is, its position in the standard normal distribution:
$$z = \frac{p - 0.5}{0.025} = \frac{0.56 - 0.5}{0.025} = 2.4$$
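The same calculation can be carried out starting from the raw counts. The sketch below uses only the numbers given in this example (224 heads in 400 flips, hypothesised proportion 0.5); the variable names are illustrative.

```python
from math import sqrt

heads, flips = 224, 400
pi0 = 0.5                           # proportion proposed by the null hypothesis

p = heads / flips                   # sample proportion (the sample statistic)
sd = sqrt(pi0 * (1 - pi0) / flips)  # standard deviation of the sampling distribution
z = (p - pi0) / sd                  # test statistic

print(f"p = {p:.2f}, sd = {sd:.3f}, z = {z:.1f}")  # prints "p = 0.56, sd = 0.025, z = 2.4"
```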
Other test statistics
Not all test statistics are z-scores. For a given sample statistic, the test statistic depends on which distribution the sampling distribution follows. For example, as mentioned earlier, we'll see a difference when testing the mean if σ is unknown. In that case the sampling distribution doesn't technically have anything to do with Z; instead, it is related to a t-distribution.
But what all test statistics do have in common is that they relate a sample statistic (like a sample proportion or sample mean) to a 'score' in some standard probability distribution.
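As a rough illustration of a test statistic that is not a z-score, the sketch below computes a t-score for a sample mean when σ is unknown and is replaced by the sample standard deviation. It assumes the standard one-sample formula t = (x̄ − μ0) / (s / √n); the data values and the hypothesised mean are entirely hypothetical and serve only to show the mechanics.

```python
from math import sqrt
from statistics import mean, stdev

def t_statistic(sample, mu0):
    """t-score of the sample mean under H0: population mean equals mu0 (sigma unknown)."""
    n = len(sample)
    return (mean(sample) - mu0) / (stdev(sample) / sqrt(n))

# Hypothetical measurements, invented for this illustration
sample = [5.1, 4.9, 5.6, 5.3, 5.2, 4.8, 5.5, 5.4]
mu0 = 5.0  # hypothetical value proposed by the null hypothesis

t = t_statistic(sample, mu0)
print(f"t = {t:.2f} with {len(sample) - 1} degrees of freedom")
```

The structure matches the z-score: a distance from the hypothesised value, divided by a standard deviation. Only the reference distribution changes.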
What we have done for the sample proportion of p = 0.56 is to determine where it lies in the standard normal distribution. That is, we've found its z-score. And that is what the test statistic is for the sample proportion: a z-score.
The diagram below shows where this test statistic lies in Z. It looks like a fairly extreme value, suggesting that a sample proportion of 0.56 is extreme. And given that our knowledge of the sampling distribution came from the assumption that the coin was fair, this may suggest that our assumption was wrong.
But how extreme is the test statistic? Is it enough to reject the assumption that the coin is fair? This will depend on the 'level' at which we run the hypothesis test. This is something we decide at the beginning of the test, before we calculate the test statistic. We'll turn to this topic now.