Hypothesis testing the proportion


So there is not enough evidence for Seller Door to conclude that the average time taken to conduct a sales routine has changed. Seller Door doesn't mind this result. Their main reason for making the change in the sales routine was to improve sales. In particular, the company wanted to increase the rate of successful sales. As the salespeople met potential customers door-to-door, the company wanted to increase the proportion of these visits that resulted in a sale.

So, has this proportion increased? The company decides to do a hypothesis test to find out.

Under the old routine, the rate of success was 23%. That is, 23% of the customers visited by a Seller Door salesperson would buy something. So the company intends to collect a sample to test whether the current success rate, which is a population proportion π, is greater than 23% (that is, greater than 0.23). The null and alternative hypotheses for this test will be:

H0: π = 0.23

HA: π > 0.23

Notice that this is a one-sided hypothesis test because the alternative hypothesis proposes that the population proportion is greater than 0.23. Because the company wants to be fairly sure before it concludes that the sales rate has improved, it intends to set a low level of significance: α = 0.01. And because the rate of sales is so important to Seller Door, they decide to collect a relatively large sample for this test: they will monitor the success rate over 400 customer visits.

Let's see how the company would conduct this test according to the guide shown in Section 8.3.

State the hypotheses. The company starts by writing down the null and alternative hypotheses that are being included in the test. This makes it clear what is being tested and whether the test is one-sided or two-sided. In this one-sided test, the company is testing whether the rate of successful sales, π, is greater than 23%. We just saw the null and alternative hypotheses, but here they are again:

H0: π = 0.23

HA: π > 0.23

Assume the null hypothesis is true. As always, we assume that the null hypothesis is true for the purposes of doing the test. In this case, this means assuming that the rate of success is 23%. We write the assumption down underneath the two hypotheses: 'Assume π = 0.23'.

Choose a level of significance, α. As we were just discussing, Seller Door would like a lot of evidence before it agrees to reject the null hypothesis. So it sets a level of significance of α = 0.01. In statistical practice, this is considered to be a very low level of significance.

Determine the critical value(s). The step-by-step guide indicates that because this is a one-sided hypothesis test, there is only one critical value. As the guide suggests, this critical value is the positive z-score zα because the alternative hypothesis proposes that π is greater than the value in the null hypothesis. That is, HA proposes that π is greater than 0.23.


The level of significance is α = 0.01 and so the critical value is z0.01. The standard normal table or statistical software can be used to find this z-score. It is z0.01 = 2.326.
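The table lookup can also be checked in software. The following is a minimal sketch using Python's standard library; the choice of Python and the variable names are assumptions made for illustration, not part of the step-by-step guide.

```python
from statistics import NormalDist

alpha = 0.01  # level of significance chosen by Seller Door

# One-sided (upper-tail) critical value: the z-score that has an area of
# alpha to its right under the standard normal curve.
z_crit = NormalDist().inv_cdf(1 - alpha)

print(round(z_crit, 3))  # 2.326
```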

Determine the region of rejection. The step-by-step guide also indicates that because this is a one-sided hypothesis test, the region of rejection is one undivided area in the standard normal distribution. In particular, because the critical value is positive (it is 2.326) the region of rejection is the set of values greater than this critical value. So the region of rejection is the set of values greater than 2.326.

Collect a sample and calculate a sample statistic. In this test, Seller Door will need a sample proportion to test the null hypothesis with. As mentioned earlier, the company intended to collect a sample of n = 400 customer visits. Let's suppose they did this, and that 110 of the 400 customers bought something. This gives a sample proportion of 27.5%, or p = 0.275.

Calculate the test statistic. The sample proportion we just calculated comes from the sampling distribution of the proportion, P. Under the assumption that the null hypothesis is true, P approximately follows the normal distribution with mean π = 0.23 and standard deviation √(π(1 - π)/n) = √(0.23 × 0.77/400) = 0.0210.

Therefore the test statistic for the sample proportion p = 0.275 is the z-score of this value, which is:

z = (p - 0.23)/0.0210 = (0.275 - 0.23)/0.0210 = 2.14

Conclusion. Seller Door is now in a position to conclude the hypothesis test. As usual, the company must look at whether or not the test statistic is in the region of rejection.

The test statistic of 2.14 is not greater than 2.326 and therefore is not in the region of rejection. Therefore, Seller Door does not reject the null hypothesis. There is not enough evidence to conclude that the success rate has increased.
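The whole test can be reproduced in a few lines of code. The sketch below is one possible way to do it in Python; the variable names are illustrative assumptions. It computes the standard deviation of P directly from the formula √(π(1 - π)/n), where 1 - π = 1 - 0.23 = 0.77, and then compares the resulting z-score with the critical value.

```python
from math import sqrt
from statistics import NormalDist

pi_0 = 0.23    # success rate assumed under the null hypothesis
alpha = 0.01   # level of significance
n = 400        # customer visits in the sample
sales = 110    # visits that ended in a sale

p = sales / n                              # sample proportion
std_error = sqrt(pi_0 * (1 - pi_0) / n)    # standard deviation of P under H0
z = (p - pi_0) / std_error                 # test statistic
z_crit = NormalDist().inv_cdf(1 - alpha)   # one-sided critical value

# Region of rejection: z-scores greater than the critical value.
reject = z > z_crit

print(f"p = {p:.3f}, standard deviation of P = {std_error:.4f}")
print(f"z = {z:.2f}, critical value = {z_crit:.3f}")
print("reject H0" if reject else "do not reject H0")
```

With the figures above, the standard deviation comes out at roughly 0.0210 and the test statistic at roughly 2.14, so the sketch ends with 'do not reject H0'.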

It's fairly close though!

Seller Door decided that there wasn't enough evidence to reject the null hypothesis. That is, there wasn't enough evidence to conclude that the rate of successful sales has increased from 23%. In the sample they collected, they got a sample proportion of 27.5%, which is greater than 23%, but the hypothesis test told them that it wasn't far enough above 23% to reject 23% as the population proportion.

But notice that the test statistic is close to the region of rejection. In fact, it is a little too close for comfort. Recall that 110 of the 400 customers in the sample bought something. If just two more customers had purchased an item, so that 112 of the 400 customers had bought something, the result of the test would be to reject the null hypothesis. (As practice, you might like to conduct the hypothesis test with that data, to prove it to yourself.)
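One way to check how close the result is would be to search for the smallest number of sales, out of 400 visits, that puts the test statistic in the region of rejection. The following Python sketch does this under the same assumptions as the test (π = 0.23 under the null hypothesis, α = 0.01, and a standard deviation of √(0.23 × 0.77/400) for P); the helper name `z_statistic` is hypothetical.

```python
from math import sqrt
from statistics import NormalDist

pi_0, alpha, n = 0.23, 0.01, 400
std_error = sqrt(pi_0 * (1 - pi_0) / n)    # standard deviation of P under H0
z_crit = NormalDist().inv_cdf(1 - alpha)   # one-sided critical value


def z_statistic(sales):
    """z-score of the sample proportion sales/n under the null hypothesis."""
    return (sales / n - pi_0) / std_error


# Smallest number of sales whose z-score exceeds the critical value.
min_sales = next(s for s in range(n + 1) if z_statistic(s) > z_crit)

for s in range(110, min_sales + 1):
    decision = "reject H0" if z_statistic(s) > z_crit else "do not reject H0"
    print(f"{s} sales: z = {z_statistic(s):.2f} -> {decision}")
```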

Seller Door would like to conclude that the new sales routine is better. What do we do in a situation where the test statistic is so close to a critical value? Is there any way for the company to use the evidence in the test to conclude that the sales routine has improved? Well, yes and no.

Why not?

The most mathematical and technical answer is: no. There is nothing that Seller Door can do. It cannot say 'Close enough is good enough!', or decide that the test was 'too close to call' and so collect a new set of data to do the test again. The reason that these options are not available to the hypothesis test as described above is that they would invalidate the methodology of such a hypothesis test.

If someone conducted a hypothesis test at the α = 0.01 level of significance and did get a sample statistic that was far enough from the value in the null hypothesis to reject it, they could technically say something like:

Under the method that I have used, I would only obtain a sample this extreme 1% of the time if the null hypothesis is true. I therefore conclude that the null hypothesis is not true.

And the point is: statements like this are no longer true if you decide to 'do the test again' (or take some action other than that described in the hypothesis test method) whenever you get a result that is 'close'. The whole point of deciding on a level of significance and then determining a region of rejection is to fix a decision rule in advance, so that at the end of the test you do exactly one of two things: reject the null hypothesis, or don't.
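To see why such ad hoc rules undermine the 1% statement, it may help to work out how often each rule would reject a true null hypothesis. The sketch below is a rough illustration only. It assumes the test statistic is exactly standard normal when the null hypothesis is true, and it treats any z-score above 2.0 as 'close'; that cut-off is a hypothetical choice made for this example and does not come from the test itself.

```python
from statistics import NormalDist

std_normal = NormalDist()
alpha = 0.01
z_crit = std_normal.inv_cdf(1 - alpha)   # planned critical value, about 2.326
z_close = 2.0                            # hypothetical cut-off for a 'close' result

# Chance, when H0 is true, of a statistic that is 'close' but not in the
# region of rejection.
p_close = std_normal.cdf(z_crit) - std_normal.cdf(z_close)

# Rule A, 'close enough is good enough': reject whenever z > z_close.
rate_close_enough = 1 - std_normal.cdf(z_close)

# Rule B, 'too close to call': after a close result, collect a fresh
# sample once and apply the original test to it.
rate_retest = alpha + p_close * alpha

print(f"planned false rejection rate:  {alpha:.4f}")
print(f"rule A (close enough):         {rate_close_enough:.4f}")
print(f"rule B (retest when close):    {rate_retest:.4f}")
```

Under these assumptions, rule A rejects a true null hypothesis about 2.3% of the time, and even the single retest in rule B pushes the rate slightly above the planned 1%. Either way, the claim 'only 1% of the time' no longer describes the method actually used.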

But then again ...

But let's not fool ourselves here. Seller Door's result is very close. And one can imagine some results being so close that rounding in calculations could have an impact. That is, it is possible to have a hypothesis test where the decision to reject or not reject the null hypothesis could depend on whether the statistician is rounding to 4 decimal places or 5. In such a case, should the distinction between rejecting and not rejecting be so black and white?

And more generally, if the test statistic is extremely close to the critical value, can we look more closely at the evidence provided by the sample, regardless of the reject/not reject outcome? Yes, we certainly can.

Look at Seller Door's test. They set their level of significance at α = 0.01, which is very low. Even though the null hypothesis is not rejected, the sample was still very unlikely if the null hypothesis is true. In fact, if the success rate really was 23%, it can be shown that there is only about a 1.6% chance of getting a sample proportion at least as large as the one collected (this is the area to the right of z = 2.14 under the standard normal curve).

So how do we resolve this dilemma?

One practical answer is that more tests would be run. This does not mean that another hypothesis test would be conducted instead of Seller Door's test. The failure to reject the null hypothesis in their test would be recorded. But, in practice and if possible, more tests would be conducted. Particularly in the scientific community, it is only when a series of tests tends to point in a particular direction that we conclude a matter with much certainty.

Another practical answer that indicates what we might do with the result of this one test is to report the value of the test statistic, and indeed to report the low probability (about 1.6%) of observing such a statistic. This will provide any user of the report with enough information to address any grey area in the results, even if a black-and-white decision must be made.
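The probability mentioned here can be computed from the test statistic as the area to its right under the standard normal curve. The following is a minimal Python sketch, assuming the same figures as in the test (110 sales in 400 visits, π = 0.23 under the null hypothesis) and a standard deviation of √(0.23 × 0.77/400) for P; the variable names are illustrative.

```python
from math import sqrt
from statistics import NormalDist

pi_0, n, sales = 0.23, 400, 110

std_error = sqrt(pi_0 * (1 - pi_0) / n)   # standard deviation of P under H0
z = (sales / n - pi_0) / std_error        # observed test statistic

# Upper-tail area: the chance of a z-score at least this large when the
# null hypothesis is true.
tail_probability = 1 - NormalDist().cdf(z)

print(f"z = {z:.2f}, upper-tail probability = {tail_probability:.4f}")
```

Reporting both numbers lets a reader compare the result with whatever level of significance they prefer, not only with α = 0.01.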

Actually, the practice of reporting the likelihood (or unlikelihood as it were) of the observed test statistic is a very common one. In fact, when we focus on the actual numerical likelihood of an observed test statistic, instead of just on whether or not that likelihood is sufficiently low to reject the null hypothesis, we are using what is called the P-value method of hypothesis testing. We'll finish this chapter and our investigation of testing by turning to this method in the next section.