4.9. Sampling and Confidence Intervals

As mentioned in Introduction to Statistics, the parameters of a large population are difficult, if not impossible, to determine directly. When sampling from a large population, each sample will likely have a different mean. However, we can calculate a range of values within which we are confident that the actual population mean lies. The central limit theorem and what we know about the standard normal distribution give us the tools to do this. (The central limit theorem is covered in Central Limit Theorem.)

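Because the central limit theorem maps the sample mean to a standard normal distribution, we can establish the Z variable from the sample mean \bar{x}, the population mean \mu, the population standard deviation \sigma, and the sample size n.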

Z = \frac{\bar{x} - \mu}{\left(\frac{\sigma}{\sqrt{n}}\right)}

From the CDF of a standard normal distribution, we can establish probabilities and introduce Z–critical values, which are used to find an interval where we have confidence at a level of C that the true population mean is within the interval ([HOGG78], pg 212). The Z–critical values for establishing levels of confidence are shown in Table 4.2. We start with the following probabilities for a standard normal distribution.

C = Pr \left( -Z_{\alpha/2} < Z <  Z_{\alpha/2} \right)
= 1 - \alpha

Then, to estimate the population mean, we use the sample mean, the population standard deviation, and the sample size. Writing Z in terms of these quantities and solving for \mu gives the following probability relationship, which establishes a confidence interval.

Pr \left( \bar{x} - Z_{\alpha/2}\frac{\sigma}{\sqrt{n}} < \mu
<  \bar{x} + Z_{\alpha/2}\frac{\sigma}{\sqrt{n}} \right) = C
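
As a worked instance of the interval formula, the following minimal sketch evaluates the bounds for one sample. The numbers (a sample mean of 50, a population standard deviation of 20, and a sample size of 100) are illustrative assumptions, not values taken from the text.

```matlab
% Sketch: 95% confidence interval from one sample (illustrative values).
x_bar = 50;     % sample mean (assumed)
sigma = 20;     % population standard deviation (assumed)
n     = 100;    % observations in the sample (assumed)
Za2   = 1.96;   % Z-critical value for C = 0.95

halfWidth = Za2 * sigma / sqrt(n);            % 1.96 * 20 / 10 = 3.92
ci = [x_bar - halfWidth, x_bar + halfWidth]   % [46.08 53.92]
```

Because the half-width scales with 1/\sqrt{n}, quadrupling the sample size halves the width of the interval.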

The Z–critical values carry the subscript \alpha/2 because there are two regions, one greater than and one less than the confidence interval, where the population mean might still be found. Each region has probability \alpha/2, and the combination of the two is called \alpha, so that \alpha = 1 - C. A standard normal PDF plot marking the confidence interval is shown in Fig. 4.15.

Fig. 4.15 Two-tailed 95% confidence interval, C = 0.95, \alpha = 0.05, Z_{\alpha/2} = 1.96.

A confidence level of 95% is most commonly used, but 90% and 99% are also useful. Commonly used Z–critical values are shown in Table 4.2.

Table 4.2 Z–critical values
Confidence level C   80%    85%    90%    95%    99%
Z_{\alpha/2}         1.28   1.44   1.645  1.96   2.576
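
The tabulated values can also be computed rather than looked up. The sketch below is one way to do so, assuming only base MATLAB: for a standard normal variable, Pr(-z < Z < z) = erf(z/\sqrt{2}), so the critical value for level C is \sqrt{2} erfinv(C). The variable names are illustrative.

```matlab
% Sketch: Z-critical values from the confidence level C.
% Uses Pr(-z < Z < z) = erf(z/sqrt(2)), so z = sqrt(2)*erfinv(C).
C   = [0.80 0.85 0.90 0.95 0.99];   % confidence levels
Za2 = sqrt(2) * erfinv(C);          % two-tailed critical values

% Print one "level  value" pair per line.
fprintf('%3.0f%%  %.3f\n', [100*C; Za2]);
```

The results should agree with Table 4.2 to the precision shown there. If the Statistics and Machine Learning Toolbox is available, norminv(1 - (1 - C)/2) is an equivalent alternative.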

In the following simulation of a normal random variable, we create 100 samples. Using one of the samples, we find a 95% confidence interval. In this run the population mean falls within the confidence interval, although that is not guaranteed: about 5% of intervals constructed this way will miss it. As the output of the simulation shows, the means of several other samples are outside of our confidence interval.

% File: testConfidence.m
% Find a 95% confidence interval and see how many
% sample means are inside and outside the interval.
n = 100;                         % observations per sample
nSamples = 100;                  % number of samples
X = 50 + 20*randn(n, nSamples);  % one sample per column
X_bar = mean(X);                 % mean of each sample
Za2 = 1.96;                      % 95% confidence
mu = mean(X(:))                  % population mean
sigma = std(X(:));               % population standard deviation
% look at one sample
x_bar = X_bar(1);
Zlow = x_bar - Za2*sigma/sqrt(n)
Zhigh = x_bar + Za2*sigma/sqrt(n)
% test all samples
inAlpha = nnz(X_bar < Zlow | X_bar > Zhigh)
inC = nnz(X_bar > Zlow & X_bar < Zhigh)
>> testConfidence
mu =
    49.5914
Zlow =
    45.2956
Zhigh =
    53.5387
inAlpha =
    8
inC =
    92  % About as expected for a 95% confidence interval
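
The script above checks where the other sample means fall relative to a single interval. A complementary check, closer to the definition of a confidence level, is to build an interval around every sample mean and count how many of those intervals contain the population mean. The sketch below does this under stated assumptions: it uses the generating parameters (mean 50, standard deviation 20) as the known population values, and the sample size of 100 and the names covered and missed are illustrative choices.

```matlab
% Sketch: coverage of 95% confidence intervals (illustrative values).
mu = 50;          % population mean used to generate the data
sigma = 20;       % population standard deviation used to generate the data
n = 100;          % observations per sample (assumed)
nSamples = 100;   % number of samples
Za2 = 1.96;       % 95% confidence

X = mu + sigma*randn(n, nSamples);   % one sample per column
X_bar = mean(X);                     % row vector of sample means
halfWidth = Za2*sigma/sqrt(n);       % identical for every sample
low = X_bar - halfWidth;             % lower bound of each interval
high = X_bar + halfWidth;            % upper bound of each interval

covered = nnz(low < mu & mu < high)  % intervals that contain mu
missed = nSamples - covered          % intervals that miss mu
```

Over many runs, roughly 95 of the 100 intervals should contain the population mean, though the count in any single run varies.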