4.2. Common Statistical Functions

A statistic is a metric, or measure, computed from a data set. That is to say, it is a value calculated from the data points that gives useful information about the data. We will refer to a random variable with a capital letter, X, and to its individual values with a lower case letter, x.

4.2.1. Minimum and Maximum

The min and max functions take a vector input and can be used with either one or two outputs. The first output is either the minimum or maximum value of the vector. The second output is the index (location) of the value.

>> v = [31;12;8;29;36];

>> mn = min(v)
mn =
    8
>> mx = max(v)
mx =
    36
>> [mn, idx] = min(v)
mn =
    8
idx =
    3
>> [mx, idx] = max(v)
mx =
    36
idx =
    5
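
As a small illustration of the two-output form, the sketch below uses the returned index to read the value back from the vector. The vector w is made up for this example. When the extreme value occurs more than once, min and max return the index of its first occurrence.

```matlab
% Hypothetical data with a repeated minimum (not from the example above).
w = [4 2 7 2 9];

[mn, imn] = min(w);   % mn = 2, imn = 2 (the first of the two 2s)
[mx, imx] = max(w);   % mx = 9, imx = 5

% The index output points back at the value in the original vector.
assert(w(imn) == mn);
assert(w(imx) == mx);

% The range of the data follows directly from the two values.
r = mx - mn;          % r = 7
```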

4.2.2. Mean, Standard Deviation, Median, and Mode

Mean

The mean of a data set is what we also call the average. It is the sum of the values divided by the number of values. MATLAB has a function called mean that takes a vector argument and returns the mean of the data. Outliers (a few values significantly different from the rest of the data) can shift the mean, so it can be a poor estimator of the center of the data. The median is less affected by outliers. The symbol for the sample mean of a random variable, X, is \bar{x}. The symbol for a population mean is \mu, which is the expected value, E(X), of the random variable.

\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_{i}

\mu = E(X)
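
To make the remark about outliers concrete, the following sketch appends a single extreme value to the example vector and compares the mean with the median. The value 500 is an assumption chosen only for illustration.

```matlab
% Example vector from this section, plus one made-up outlier.
v = [31; 12; 8; 29; 36];
v_out = [v; 500];              % 500 is a hypothetical outlier

mean_v = mean(v);              % 23.2
mean_out = mean(v_out);        % 616/6, about 102.67
median_v = median(v);          % 29
median_out = median(v_out);    % (29 + 31)/2 = 30
```

One added value moves the mean from 23.2 to about 102.67, while the median only moves from 29 to 30.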

>> v = [31;12;8;29;36];
>> sum(v)/length(v)
ans =
   23.2000
>> mean(v)
ans =
   23.2000

Standard Deviation

The standard deviation, which is the square root of the variance, is a measure of how widely a random variable is distributed about its mean. As shown in Fig. 4.1, a small standard deviation means that the values are close to the mean. A larger standard deviation means that the values are spread more widely. For a normally distributed (Gaussian) variable, about 68% of the values are within one standard deviation of the mean, about 95% are within two standard deviations, and about 99.7% are within three standard deviations. The symbol for a sample standard deviation is s and the symbol for a population standard deviation is \sigma. See Wikipedia site.

Fig. 4.1 Probability Density Functions with \sigma = 3 and \sigma = 1.

We define the difference between a random variable and its mean as Y = (X - \bar{x}). Note that owing to the definition of the mean, \bar{y} = 0. To account equally for the variability of values less than or greater than the mean, we will use Y^2. The maximum likelihood estimator (MLE) of the variance computes the mean value of Y^2.

s_{MLE}^2 = \frac{1}{n} \sum_{i=1}^{n} (x_{i} - \bar{x})^2
= \left(\frac{1}{n} \sum_{i=1}^{n} x_{i}^2\right) - \bar{x}^2
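
As a numerical check, the sketch below computes the MLE variance both as the mean of the squared deviations described above and as the mean of the squares minus the square of the mean. It uses the same vector as the standard deviation example in this section. The variable names are ours.

```matlab
x = [-1 5 2 6 11 7 10 8 13 11];
n = length(x);
x_bar = mean(x);                     % 7.2

% Form 1: mean of the squared deviations from the mean.
s2_dev = sum((x - x_bar).^2) / n;

% Form 2: mean of the squares minus the square of the mean.
s2_sq = sum(x.^2) / n - x_bar^2;

% Both forms agree up to floating-point round-off (about 17.16 here).
assert(abs(s2_dev - s2_sq) < 1e-10);
```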

It is usually preferred to use the unbiased estimator of the variance, which divides the sum of squared deviations by (n - 1) instead of by n. The reason is that \sum_{i=1}^n y_i = 0, so if we know the first (n - 1) terms, then the last term can be determined. Only (n - 1) of the squared deviations can vary freely. The number (n - 1) is called the degrees of freedom of the variance. [MOORE18]
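
The degrees of freedom argument can be checked numerically. In this sketch, which assumes the same sample vector used in the standard deviation example of this section, the deviations sum to zero, so the last one can be recovered from the other n - 1.

```matlab
x = [-1 5 2 6 11 7 10 8 13 11];
y = x - mean(x);                 % deviations from the sample mean

% The deviations sum to zero, up to floating-point round-off.
assert(abs(sum(y)) < 1e-10);

% So the last deviation is fixed once the first n - 1 are known.
y_last = -sum(y(1:end-1));
assert(abs(y_last - y(end)) < 1e-10);   % y(end) is 3.8 for this data
```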

The unbiased variance is then calculated as

s^2 = \frac{(x_1 - \bar{x})^2 + (x_2 - \bar{x})^2 +
    \cdots + (x_n - \bar{x})^2}{n - 1}
= \frac{1}{n - 1}\sum_{i=1}^n (x_i - \bar{x})^2.

The standard deviation, s, is the square root of the variance. MATLAB has a function called std that takes a vector argument and returns the standard deviation. Similarly, the var function returns the variance.

To find the MLE standard deviation or variance, pass a second argument of 1 to the functions (std(X, 1), var(X, 1)).
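
A brief sketch of the second argument in use, compared against the formulas computed by hand. The variable names are ours, and the data is the same vector used in the example for this section.

```matlab
x = [-1 5 2 6 11 7 10 8 13 11];
n = length(x);
ss = sum((x - mean(x)).^2);      % sum of squared deviations, 171.6

s_unbiased = std(x);             % divides by n - 1, about 4.3665
s_mle = std(x, 1);               % divides by n, about 4.1425
v_mle = var(x, 1);               % about 17.16

% The built-in results match the hand calculations.
assert(abs(s_unbiased - sqrt(ss/(n - 1))) < 1e-10);
assert(abs(s_mle - sqrt(ss/n)) < 1e-10);
assert(abs(v_mle - ss/n) < 1e-10);
```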

>> x = [-1 5 2 6 11 7 10 8 13 11];
>> x_bar = mean(x)
x_bar =
    7.2000
>> n = length(x)
n =
    10
>> s = sqrt(sum((x - x_bar).^2)/(n-1))
s =
    4.3665
>> std(x)
ans =
    4.3665

Median

The median of a data set is the number which is greater than half of the numbers and less than the other half of the numbers. Although the mean is a more commonly used metric for the center value of a data set, the median is often a better indicator of the center. This is because values that are outliers to the main body of data can skew the mean, but have little effect on the median.

When the number of items in the vector is odd, the median is just the center value of the sorted data. When the array has an even number of items, the median is the average of the two center items of the sorted data.

>> vs = sort(v)'
vs =
    8    12    29    31    36
>> vs(3)
ans =
    29
>> md = median(v)
md =
    29

>> vm = sort([v' 17])
vm =
    8    12    17    29    31    36
>> (vm(3) + vm(4))/2
ans =
    23
>> median(vm)
ans =
    23

Mode

The mode of a data set is the value that appears most frequently. MATLAB has a function called mode. If there is a tie for the number of occurrences of values, the smaller value is returned. If no values repeat, then the mode function returns the minimum value of the data set.
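
The three cases described above can be sketched as follows. The vectors are made up for illustration.

```matlab
a = [2 7 7 3];       % 7 appears most often
b = [3 1 3 1 2];     % 1 and 3 both appear twice
c = [5 4 9];         % no value repeats

m_clear = mode(a);   % 7
m_tie = mode(b);     % 1, the smaller of the tied values
m_none = mode(c);    % 4, the minimum of the data
```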

4.2.3. Sort, Sum, Cumsum, and Diff

Use the sort function to sort the values of a vector. By default, the values are sorted in ascending order.

>> va = sort(v)
va =
    8
    12
    29
    31
    36

>> vd = sort(v,'descend')
vd =
    36
    31
    29
    12
    8

Sum and Cumsum

The sum function finds the sum of the values in a vector. The cumsum function finds the cumulative sum, that is, the running total after each value is added.

>> s = sum(v)
s =
   116
>> cs = cumsum(v)
cs =
    31
    43
    51
    80
   116

Diff

The diff function calculates the difference between successive values of a vector. Note that the length of the returned vector is one less than the length of the original vector.

>> diff(v)
ans =
   -19
    -4
    21
     7
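
Although the text above treats them separately, cumsum and diff act as approximate inverses of each other. The sketch below, using the same vector v, shows that diff recovers all but the first element of a cumulative sum, and that cumsum rebuilds the vector from its differences once the first value is supplied.

```matlab
v = [31; 12; 8; 29; 36];

% diff undoes cumsum, apart from the first element.
d = diff(cumsum(v));             % [12; 8; 29; 36], which is v(2:end)
assert(isequal(d, v(2:end)));

% cumsum undoes diff once the first value is put back.
r = cumsum([v(1); diff(v)]);     % recovers v
assert(isequal(r, v));
```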