4.10. Statistical Significance

An observation from an experiment becomes accepted fact only when the data demonstrates statistical significance. That is, we must show that the observation is consistently true and not a random anomaly.

Let us consider a specific historical example. In the nineteenth century, lung cancer was a rare disease. Then during the first half of the twentieth century, the smoking of tobacco products, especially cigarettes, grew in popularity. According to Gallup polling data, self-reported adult smoking in the U.S. peaked in 1954 at 45%. Medical doctors noticed that as smoking rates increased, so did occurrences of lung cancer. But observations alone were not enough to conclude that smoking was causing lung cancer. Some studies in the 1940s and 1950s pointed blame at smoking, but the public was skeptical because so many were smokers, and the tobacco industry claimed that smoking was safe. Then from 1952 to 1955, E. Cuyler Hammond and Daniel Horn, scientists from the American Cancer Society, conducted an extensive study using about 188,000 male volunteers (The Study That Helped Spur the U.S. Stop-Smoking Movement). They showed that the rate of lung cancer among smokers falls far outside the range that could plausibly occur by chance among non-smokers.

The strategy is to show that the data is inconsistent with a null hypothesis, so the null hypothesis must be rejected. A null hypothesis, represented as H_0, claims that the observation in question falls within the range of regular statistical occurrences. It usually advocates for the status quo belief. So a null hypothesis might be: “Smokers of tobacco products and non-smokers are equally likely to develop lung cancer.” An alternate hypothesis, H_a, might instead be: “Smoking tobacco products increases the risk of developing lung cancer.” The alternate hypothesis often promotes belief in a causal effect. It proposes that when some treatment is applied, those receiving the treatment are changed. Example treatments and changes include: smoking leads to lung cancer, taking a medication heals people from a disease, or exercising reduces cholesterol. It is usually harder to directly show that the alternate hypothesis is true than to show that the null hypothesis is false. If we can show that the null hypothesis is wrong, then the alternate hypothesis is accepted in its place.

To build a case for rejecting a null hypothesis, we establish the statistics of the population not receiving the treatment in question and then show that the statistic of the treatment population falls outside the range that is plausible if the null hypothesis were true. So Hammond and Horn established the statistics for non-smokers developing lung cancer and then showed that smokers develop lung cancer at a rate far beyond what is plausible for non-smokers.

The tools that statisticians use to establish confidence intervals for population means and accept or reject null hypotheses with statistical significance require a normal distribution. When the data has a different probability distribution, it is usually possible to frame the problem as an instance of a normal distribution by taking advantage of the central limit theorem, as discussed in Central Limit Theorem.
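
For example, the means of repeated samples drawn from a decidedly non-normal distribution are themselves approximately normally distributed. Here is a minimal MATLAB sketch (the exponential distribution and the sample sizes are arbitrary choices for illustration):

>> X = -log(rand(100, 5000));   % 5000 samples, each of 100 exponential(1) values
>> means = mean(X);             % mean of each sample (column)
>> histogram(means)             % bell-shaped, centered near 1, std near 0.1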

Hypothesis testing is a matter of testing whether two data sets could have come from the same source. We accept the null hypothesis if they likely came from the same source or an equivalent source, and reject the null hypothesis if we conclude that they came from different sources.

If the standard deviation of the null case population is known, use the Z–test with the Z–critical values. When the standard deviation is not known, use the t–test, which uses the sample standard deviation and is based on the t–distribution, a distribution similar to the normal distribution but with more variability.

Counterfeit Competition

As an example, consider the following scenario. Suppose that our company makes a product that is very good at what it does. A new competitor has introduced a product that it claims is just as good as ours, but cheaper. We would like to show that the competitor’s product is not as good as our product. So we draw sample sets of both products to establish the quality of each.

4.10.1. Z–Test

For the Z–test, we only need to calculate the sample mean of the H_a data and then calculate the Z variable that maps the sample mean of the treatment data set to a standard normal distribution.

Z = \frac{\bar{x}_a - \mu_0}{\sigma / \sqrt{n}}

The Z variable is compared to a Z–critical value to determine if the H_a sample mean is consistent with the H_0 population. Note that in addition to the two-tailed test, tests based on either the upper or lower tail may also be used (see Fig. 4.16). Commonly used Z–critical values for one-tailed and two-tailed tests are shown in Table 4.3.

../_images/Zinterval.png

Fig. 4.16 One and two tailed 95% confidence intervals, \alpha = 0.05, Z_\alpha = 1.645, Z_{\alpha/2} = 1.96.

Table 4.3 Z–critical values

                  90%      95%      99%
Z_{\alpha/2}    1.645     1.96    2.576
Z_\alpha        1.282    1.645    2.326
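
These constants can also be verified in base MATLAB (no toolbox needed), since the standard normal inverse CDF can be written in terms of erfinv. A quick check at the 95% confidence level:

>> alpha = 0.05;
>> sqrt(2)*erfinv(1 - alpha)     % two-tailed: Z_{alpha/2} = norminv(1 - alpha/2)
ans =
    1.9600
>> sqrt(2)*erfinv(1 - 2*alpha)   % one-tailed: Z_alpha = norminv(1 - alpha)
ans =
    1.6449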
Counterfeit Competition – Z–Test

For the sake of our example, let’s say that the metric of concern is a Bernoulli event. We obtain 500 units each of our competitor’s product and our own for testing, and split each group of 500 into 10 sampling sets of 50 items.

We can use the Z–test for our product since we have a well-established success rate for it. In several tests, our company’s product has consistently shown a success rate of at least 90%. We just want to verify that our latest product matches our established quality level. We put our successes and failures in a matrix and compute the mean of each column. According to the central limit theorem, the means of the sample sets give us an approximately normal distribution that we can use with both Z–tests and t–tests.

You may wonder about the binomial distribution, which comes from the sum of successes of Bernoulli events. For a large number of trials, a binomial distribution is a discrete facsimile of a normal distribution, and the central limit theorem equations can shift and scale it to approximate a standard normal distribution.
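
A short simulation can illustrate this (a sketch; the number of trials is an arbitrary choice):

>> counts = sum(rand(50, 10000) < 0.9);     % 10,000 binomial(50, 0.9) counts
>> Z = (counts - 50*0.9)/sqrt(50*0.9*0.1);  % shift and scale per the CLT
>> histogram(Z)                             % closely tracks the standard normal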

In the simulation test of our product, we are only concerned that quality may have dropped, so a lower-tail test applies. The calculated Z score is well above the critical value of −1.645, so we accept the null hypothesis that these samples match the quality of our product.

>> X = rand(50, 10) < 0.9;   % simulation setup
>> n = 10;                   % number of sample sets of 50 items each
>> p = 0.9;
>> sigma = sqrt(p*(1-p))     % from Bernoulli equations
sigma =
    0.3000
>> X_bar = mean(X);          % mean of each sample set (column)
>> x_bar = mean(X_bar)       % overall mean of all 500 trials
x_bar =
    0.9240
>> Z = (x_bar - p)/(sigma/sqrt(50*n))   % standard error uses all 500 trials
Z =
    1.7889

4.10.2. t–Test

In practice, the t–test is much like the Z–test except that the sample’s calculated standard deviation (s) is used rather than the population standard deviation (\sigma). Because each sample from the population will have a slightly different standard deviation, we use Student’s t–distribution, which has a slightly wider bell-shaped PDF than the standard normal PDF. The distribution becomes wider for smaller sample sizes (fewer degrees of freedom). As the sample size approaches infinity, the t–distribution approaches the standard normal distribution. The degrees of freedom (df) is one less than the sample size (df = n - 1). The t–distribution PDF plot for three df values is shown in Fig. 4.17.

../_images/tDist.png

Fig. 4.17 t-Distribution PDF plots showing two tailed 95% critical values for infinity, 9, and 3 degrees of freedom.
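
Plots like Fig. 4.17 can be reproduced in base MATLAB (no toolbox needed) directly from the gamma-function form of the t–distribution PDF. A sketch:

>> x = linspace(-4, 4, 400);
>> hold on
>> for df = [3 9 30]
       f = gamma((df+1)/2)/(sqrt(df*pi)*gamma(df/2)) ...
           * (1 + x.^2/df).^(-(df+1)/2);    % t-distribution PDF
       plot(x, f)
   end
>> plot(x, exp(-x.^2/2)/sqrt(2*pi), 'k--')  % standard normal for comparison
>> legend('df = 3', 'df = 9', 'df = 30', 'normal'), hold off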

The T statistic is calculated and compared to critical values to test for acceptance of a null hypothesis, as was done with the Z–test. The critical values are a function of both the desired confidence level and the degrees of freedom (sample size). There are several ways to find the critical values. Tables found on web sites or in statistics books list t–distribution critical values. Interactive web sites as well as several software environments can calculate them. Unfortunately, MATLAB requires the Statistics and Machine Learning Toolbox (the tinv function) to calculate t–critical values. However, as with the Z–test, the t–critical values are constants, so they may be hard-coded into a program to test a null hypothesis.

T = \frac{\bar{x} - \mu_0}{s / \sqrt{n}}
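
Since the critical values are constants, a small hard-coded lookup suffices for a fixed confidence level. A sketch using one-tailed 95% values copied from a standard t-table (with the toolbox, tinv(0.95, df) returns the same numbers):

>> df = [3 5 9 15 30];                        % degrees of freedom
>> t_crit = [2.353 2.015 1.833 1.753 1.697];  % one-tailed, 95% confidence
>> t_crit(df == 9)                            % critical value for df = 9
ans =
    1.8330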

Counterfeit Competition – t–Test

When we apply the t–test to our competitor’s product, we use statistics calculated from the data and the established mean success rate of our product. Although the success rate of the competitor’s product seems only slightly below our own, the calculated T is less than the negative one-sided t–critical value (−1.833 for df = 9 at the 95% confidence level), which establishes with statistical significance that our competitor’s product is inferior to ours.

>> X = rand(50, 10) < 0.85;   % simulated competitor data, 0.85 success rate
>> n = 10;                    % number of sample sets
>> X_bar = mean(X);           % mean of each sample set (column)
>> x_bar = mean(X_bar)
x_bar =
    0.8620
>> s = std(X_bar)             % sample standard deviation of the set means
s =
    0.0382
>> T = (x_bar - 0.9)/(s/sqrt(n))
T =
   -3.1425
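
The accept/reject decision can then be coded against the hard-coded critical value (a sketch using the one-tailed, 95%, df = 9 value from the table above):

>> t_crit = 1.833;   % one-tailed, 95%, df = 9
>> T < -t_crit       % true, so reject H_0: the competitor's product is inferior
ans =
  logical
   1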