How likely is the sample?

The key to the P-value method is this: once a sample has been collected, we want to assess how likely - you may like to think of it as how unlikely - that sample is under the assumption that the null hypothesis is true. Our approach to answering this will be to calculate how likely the test statistic is in the probability distribution it follows.

One of the main concepts to become familiar with here is that we are not interested in how likely our particular sample is or how likely our particular test statistic is. Instead, we are interested in finding out how likely a sample or test statistic as extreme as ours is.

Extreme samples

Consider the following two hypothesis tests and the samples collected within them. One is a one-sided hypothesis test, the other is a two-sided hypothesis test. For each test, we will calculate the likelihood of such an extreme sample under the assumption that the null hypothesis is true.

  1. A dose of caffeine over 300 mg is considered an acute overdose. A scientist suspects that a new coffee brand has more than 300 mg in one serve. So she conducts a hypothesis test on the average amount of caffeine in one serve:

    H0: μ = 300

    HA: μ > 300

    She extracts the caffeine from 25 serves of the coffee and records a sample mean of x̄ = 303.2 mg. For the purposes of this study, the population standard deviation is assumed to be σ = 10 mg.

  2. A sociologist knows, by referring to census data, that the proportion of adults who are willing to pay higher rates for electricity if it comes from a renewable source is 65%. The sociologist doesn't have access to the raw data and so cannot tell if this proportion changes at all for people with a university education. He decides to run a hypothesis test on the proportion of university-educated people who are willing to pay more for renewable electricity:

    H0: π = 0.65

    HA: π ≠ 0.65

    A survey of 100 university-educated people is conducted and it is found that 76 of these would be willing to pay more, giving a sample proportion of p = 0.76.

So how likely are the two samples collected in each of the above tests?

Caffeine study - one-sided P-value

Under the assumption that the null hypothesis is true, the sampling distribution of the mean is approximately normal with a mean of μ = 300 and a standard deviation of σ/√n = 10/√25 = 2. Therefore the test statistic for the sample mean of x̄ = 303.2 is:

    z = (303.2 - 300) / 2 = 1.60

Now the size of this test statistic gives an indication as to how extreme the sample of coffee serves is under the assumption that the null hypothesis is correct. This test statistic is a z-score, which means that it is a value in the standard normal distribution Z. So, how 'extreme' is the value 1.60 in Z?

To answer this question properly, and to pin down what 'extreme' means here, it is important to notice that the hypothesis test is one-sided. In particular, the alternative hypothesis proposes that the population mean is greater than the value proposed in the null hypothesis. This matters because, when the scientist conducts this test, she is specifically interested in sample means that are far greater than 300. So she is interested in test statistics that are as large as or larger than 1.60.

So, what is the probability that Z will assume a value greater than 1.60? This question can be answered by referring to the standard normal table, or by using statistical software. There is a 5.48% chance that Z will assume a value greater than or equal to 1.60. This in turn means that, if the null hypothesis is true and the population mean amount of caffeine per serve really is 300 mg, there is only a 5.48% chance of observing a sample with a mean of 303.2 mg or higher.
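The caffeine calculation can be reproduced with statistical software. The following is a minimal sketch in Python, assuming only the standard library's statistics.NormalDist for the standard normal distribution; the variable names are illustrative and not part of the original study.

```python
from math import sqrt
from statistics import NormalDist

# Values taken from the caffeine study described in the text.
mu_0 = 300.0    # population mean under the null hypothesis (mg)
sigma = 10.0    # assumed population standard deviation (mg)
n = 25          # number of serves sampled
x_bar = 303.2   # observed sample mean (mg)

# Standard deviation of the sampling distribution of the mean.
standard_error = sigma / sqrt(n)

# Test statistic: how many standard errors the sample mean lies above 300.
z = (x_bar - mu_0) / standard_error

# One-sided (upper-tail) P-value: P(Z >= z) in the standard normal distribution.
p_value_upper = 1 - NormalDist().cdf(z)

print(f"z = {z:.2f}, P-value = {p_value_upper:.4f}")
```

Run as written, this should agree with the 5.48% quoted from the standard normal table.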

Renewable energy study - two-sided P-value

Under the assumption that the null hypothesis is true, the sampling distribution of the proportion is approximately normal with a mean of π = 0.65 and a standard deviation of √(π(1 - π)/n) = √(0.65 × 0.35/100) = 0.0477. Therefore the test statistic for the sample proportion of p = 0.76 is:

    z = (0.76 - 0.65) / 0.0477 = 2.31

As with the caffeine example, the size of this test statistic gives an indication as to how extreme the survey is under the assumption that the null hypothesis is correct. So, how 'extreme' is the value 2.31 in Z?

Notice that, unlike the caffeine study, this is a two-sided test. This is important, because the sociologist will reject the null hypothesis if the sample differs from the null hypothesis in either direction. So an 'extreme' sample, in this case, will be one with a sample proportion that is far greater than or far less than the value in the null hypothesis. The sample proportion (76%) is 11 percentage points greater than the value in the null hypothesis (65%). The sociologist is therefore interested in samples that differ from the null hypothesis by 11 percentage points or more, in either direction.

This means that he wants to know how likely it is that Z assumes a value greater than 2.31 or less than -2.31. Once he knows this probability, he will be able to answer the question 'If the null hypothesis were true, how likely is it that I would get a sample as different to the null hypothesis as 76%?'

The probability that Z will assume a value greater than 2.31 or less than -2.31 can be found using the standard normal table or statistical software. The probability is 2.08%. So if the null hypothesis is true (that is, the proportion of university-educated people who would be willing to pay more for renewable energy really was 65%) then there is only a 2.08% chance of observing a sample proportion as different from this null hypothesis as the one the sociologist observed.
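The two-sided calculation can be checked in the same way. This is a sketch under the same assumptions as the text (a normal approximation to the sampling distribution of the proportion), using Python's standard-library statistics.NormalDist; the variable names are illustrative. The test statistic is rounded to two decimal places to match the value looked up in the text.

```python
from math import sqrt
from statistics import NormalDist

# Values taken from the renewable energy study described in the text.
pi_0 = 0.65      # population proportion under the null hypothesis
n = 100          # number of people surveyed
p_hat = 76 / n   # observed sample proportion

# Standard deviation of the sampling distribution of the proportion under H0.
standard_error = sqrt(pi_0 * (1 - pi_0) / n)

# Test statistic, rounded to two decimals as it would be for a table lookup.
z = (p_hat - pi_0) / standard_error
z_rounded = round(z, 2)

# Two-sided P-value: P(Z >= |z|) + P(Z <= -|z|), which by symmetry
# is twice the upper-tail probability.
p_value_two_sided = 2 * (1 - NormalDist().cdf(abs(z_rounded)))

print(f"z = {z_rounded:.2f}, P-value = {p_value_two_sided:.4f}")
```

Software carries more decimal places than a printed table, so the result may differ from the quoted 2.08% in the last digit; using the unrounded test statistic shifts it slightly again.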

Suppose a sample has been collected for a hypothesis test and a test statistic has been calculated for this sample. This test statistic will be a score in some probability distribution. The P-value of this test statistic is the chance that the probability distribution will assume a value as extreme as the test statistic.

That is the technical definition, but the main way that we interpret the P-value is that it is the probability of observing a sample as extreme as the one we've just observed in a test, if the null hypothesis is true. As the two examples above demonstrate, the P-value will depend upon whether the hypothesis test is one-sided or two-sided. Use the following guide to help you calculate the P-value in any given hypothesis test.

Calculating a P-value

If your hypothesis test is one-sided and the alternative hypothesis proposes that the population parameter is greater than the value in the null hypothesis, the P-value is the probability that the distribution assumes a value greater than the test statistic.

If your hypothesis test is one-sided and the alternative hypothesis proposes that the population parameter is less than the value in the null hypothesis, the P-value is the probability that the distribution assumes a value less than the test statistic.

If your hypothesis test is two-sided, the P-value is the probability that the distribution assumes a value further from zero than the test statistic. That is, it is the probability of a value greater than the positive value of the test statistic or less than the negative value.
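The three cases in this guide can be gathered into one small helper. This is a sketch that assumes the test statistic is a z-score, so that it follows the standard normal distribution as in both examples above; the function name p_value and the labels "greater", "less" and "two-sided" are hypothetical choices made for illustration.

```python
from statistics import NormalDist


def p_value(z, alternative):
    """Return the P-value of a z test statistic.

    alternative is "greater", "less" or "two-sided", matching the
    direction proposed by the alternative hypothesis.
    """
    std_normal = NormalDist()  # standard normal: mean 0, standard deviation 1
    if alternative == "greater":
        # Probability of a value greater than the test statistic.
        return 1 - std_normal.cdf(z)
    if alternative == "less":
        # Probability of a value less than the test statistic.
        return std_normal.cdf(z)
    if alternative == "two-sided":
        # Probability of a value further from zero than the test statistic,
        # in either direction.
        return 2 * (1 - std_normal.cdf(abs(z)))
    raise ValueError("alternative must be 'greater', 'less' or 'two-sided'")


# The two worked examples from the text.
print(p_value(1.60, "greater"))
print(p_value(2.31, "two-sided"))
```

For test statistics that follow a different distribution, the same three tail calculations would apply with that distribution's cumulative probabilities in place of the standard normal.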

The following diagram shows, in relation to probability distributions, the three different possible P-value calculations for the three different possible types of alternative hypothesis.

[Diagram: P-values for the three types of alternative hypothesis]