4.2. Common Statistical Functions
statistic
A statistic is a metric, or measure, taken from the data.
We will refer to the random variable with a capital letter, \(X\), and the individual values of the random variable with a lowercase letter, \(x\).
4.2.1. Minimum and Maximum
The min and max functions take a vector input and can be used with either one or two outputs. The first output is either the vector’s minimum or maximum value. The second output is the index (location) of the value.
>> v = [31;12;8;29;36];
>> mn = min(v)
mn =
8
>> mx = max(v)
mx =
36
>> [mn, idx] = min(v)
mn =
8
idx =
3
>> [mx, idx] = max(v)
mx =
36
idx =
5
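As a small sketch of how the second output can be used, the index returned by min or max retrieves the same value from the vector. The vector w with a repeated minimum is an illustrative assumption, not part of the example above. When the extreme value repeats, the index of its first occurrence is returned.

```matlab
v = [31;12;8;29;36];
[mn, idx] = min(v);        % mn is 8, idx is 3
same = (v(idx) == mn);     % indexing with idx gives the minimum back

w = [4 2 7 2];             % hypothetical vector with a repeated minimum
[wmn, widx] = min(w);      % widx is 2, the first location of the value 2
```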
4.2.2. Mean, Standard Deviation, Median, and Mode
- Mean
The mean of a data set is what we also call the average. For a random variable, the corresponding quantity is the expected value (\(E(X)\)). The mean is the sum of the values divided by the number of values. MATLAB has a function called mean that takes a vector argument and returns the mean of the data. Outliers (a few values significantly different from the rest of the data) can move the mean, so it can be a poor estimator of the center of the data. The median is less affected by outliers. The symbol for the sample mean of a random variable, \(X\), is \(\bar{x}\). The symbol for a population mean is \(\mu\), which is the random variable’s expected value, \(E(X)\).
\[\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_{i} \quad \quad \mu = E(X)\]
>> v = [31;12;8;29;36];
>> sum(v)/length(v)
ans =
23.2000
>> mean(v)
ans =
23.2000
- Standard Deviation
The standard deviation, which is the square root of the variance, measures how widely distributed a random variable is about its mean. As shown in Fig. 4.1, a small standard deviation means that the numbers are close to the mean. A larger standard deviation means that the numbers vary more widely. For a normally (Gaussian) distributed variable, about 68% of the values are within one standard deviation of the mean, about 95% are within two standard deviations, and about 99.7% are within three standard deviations. The symbol for a sample standard deviation is \(s\) and the symbol for a population standard deviation is \(\sigma\).
We define the difference between a random variable and its mean as \(Y = (X - \bar{x})\). Note that owing to the definition of the mean, \(\bar{y} = 0\). To account equally for the variability of values less than or greater than the mean, we will use \(Y^2\). The variance’s maximum likelihood estimator (MLE) computes the mean value of \(Y^2\).
\[s_{MLE}^2 = \frac{1}{n} \sum_{i=1}^{n} y_{i}^2 = \left(\frac{1}{n} \sum_{i=1}^{n} x_{i}^2\right) - \bar{x}^2\]
It is usually preferred to use the unbiased estimator of the variance, which divides the sum by \((n - 1)\) instead of by \(n\). The reason is that \(\sum_{i=1}^n y_i = 0\), so the last deviation can be determined if we know the sum of the first \((n - 1)\) deviations. Only \((n - 1)\) of the squared deviations can vary freely. The number \((n - 1)\) is called the degrees of freedom of the variance [MOORE18].
The unbiased variance is then calculated as
\[s^2 = \frac{(x_1 - \bar{x})^2 + (x_2 - \bar{x})^2 + \cdots + (x_n - \bar{x})^2}{n - 1} = \frac{1}{n - 1}\sum_{i=1}^n (x_i - \bar{x})^2.\]
The standard deviation, \(s\), is the square root of the variance. MATLAB has a function called std that takes a vector argument and returns the standard deviation. Similarly, the var function returns the variance.
To find the MLE standard deviation or variance, pass a second argument of 1 to the functions (std(X, 1), var(X, 1)).
>> x = [-1 5 2 6 11 7 10 8 13 11];
>> x_bar = mean(x)
x_bar =
7.2000
>> n = length(x)
n =
10
>> s = sqrt(sum((x - x_bar).^2)/(n-1))
s =
4.3665
>> std(x)
ans =
4.3665
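The relationship between the two variance estimators can be checked with a short script. This is only a sketch using the same data vector. The variable names (y, var_mle, var_unb, and so on) are chosen here for illustration. It confirms that the deviations sum to zero, and that the MLE and unbiased values match var(x, 1) and var(x) up to round-off.

```matlab
x = [-1 5 2 6 11 7 10 8 13 11];
n = length(x);
y = x - mean(x);                   % deviations from the sample mean

sum_y = sum(y);                    % zero up to floating-point round-off
var_mle = mean(x.^2) - mean(x)^2;  % MLE form: mean of squares minus squared mean
var_unb = sum(y.^2)/(n - 1);       % unbiased form: divide by n - 1

d_mle = abs(var_mle - var(x, 1));  % both differences should be near zero
d_unb = abs(var_unb - var(x));
ratio = var(x, 1)/var(x);          % equals (n - 1)/n, here 0.9
```

For this data set the MLE variance is 17.16 and the unbiased variance is about 19.07. The gap between the two shrinks as n grows.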

Fig. 4.1 Probability Density Functions with \(\sigma = 3\) and \(\sigma = 1\).
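The percentages quoted for a normally distributed variable can be checked numerically. The sketch below computes them from the error function, since the probability of falling within \(k\) standard deviations of the mean is \(\mathrm{erf}(k/\sqrt{2})\). It then compares them with the fractions observed in a random sample. The seed, sample size, and the values of mu and sigma are arbitrary assumptions made for the illustration.

```matlab
k = 1:3;
p_exact = erf(k/sqrt(2));        % P(|X - mu| < k*sigma), k = 1, 2, 3

rng(0);                          % fixed seed (arbitrary) so the run repeats
mu = 5; sigma = 3;               % hypothetical population parameters
xs = mu + sigma*randn(1e5, 1);   % normally distributed sample
p_sample = zeros(1, 3);
for i = k
    % fraction of the sample within i standard deviations of mu
    p_sample(i) = mean(abs(xs - mu) < i*sigma);
end
```

The sample fractions in p_sample should land close to p_exact, but they will differ slightly from run to run if the seed is changed.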
- Median
The median of a data set is the number greater than half of the numbers and less than the other half. Although the mean is a more commonly used metric for the center value of a data set, the median is often a better indicator of the center. Values that are outliers to the main body of data can skew a mean, but have little or no effect on the median value.
When the number of items in the vector is odd, the median is just the center value of the sorted data. When the array has an even number of items, the median is the average of the two center items of the sorted data.
>> vs = sort(v)'
vs =
8 12 29 31 36
>> vs(3)
ans =
29
>> md = median(v)
md =
29
>> vm = sort([v' 17])
vm =
8 12 17 29 31 36
>> (vm(3) + vm(4))/2
ans =
23
>> median(vm)
ans =
23
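To illustrate how differently the mean and the median respond to an outlier, the following sketch appends one extreme value to the example vector. The value 500 is a hypothetical outlier chosen only for the demonstration.

```matlab
v = [31;12;8;29;36];
vo = [v; 500];               % hypothetical outlier appended

m_before = mean(v);          % 23.2
m_after = mean(vo);          % about 102.67, pulled far from the main body of data
md_before = median(v);       % 29
md_after = median(vo);       % 30, the average of 29 and 31
```

A single outlier moves the mean by nearly 80, while the median changes by 1.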
- Mode
The mode of a data set is the value that appears most frequently. MATLAB has a function called mode. If several values tie for the most occurrences, the smallest of them is returned. If no values repeat, then the mode function returns the minimum value of the data set.
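The tie and no-repeat behavior described above can be seen in a brief sketch. The vector t is hypothetical data made up for this example. The second output of mode, which gives the number of occurrences, is not discussed in the text above.

```matlab
t = [3 1 2 2 3];        % hypothetical data: 2 and 3 both appear twice
[m_tie, f] = mode(t);   % m_tie is 2 (the smaller tied value), f is 2

v = [31;12;8;29;36];    % no value repeats
m_none = mode(v);       % 8, the minimum of the data
```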
4.2.3. Sort, Sum, Cumsum, and Diff
- Sort
Use the sort function to sort the values of a vector. By default, the values are sorted in ascending order.
>> va = sort(v)
va =
8
12
29
31
36
>> vd = sort(v,'descend')
vd =
36
31
29
12
8
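Like min and max, sort can return a second output holding the original indices of the sorted values, although the text above does not cover it. The sketch below shows one way that index vector might be used to reorder a second, related vector. The vector w is a hypothetical companion data set.

```matlab
v = [31;12;8;29;36];
[va, idx] = sort(v);          % idx is [3;2;4;1;5]
same = isequal(v(idx), va);   % indexing with idx reproduces the sorted vector

w = [10;20;30;40;50];         % hypothetical values paired with v
w_sorted = w(idx);            % [30;20;40;10;50], reordered to follow v
```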
- Sum and Cumsum
The sum function finds the sum of the values in a vector. The cumsum function finds a cumulative sum as each number is added to the sum.
>> s = sum(v)
s =
116
>> cs = cumsum(v)
cs =
31
43
51
80
116
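One possible use of the cumulative sum is a running mean, where each partial sum is divided by the number of values added so far. This is a sketch for illustration only. The name running_mean is chosen here and is not a MATLAB function.

```matlab
v = [31;12;8;29;36];
n = length(v);
cs = cumsum(v);                 % [31;43;51;80;116]
running_mean = cs ./ (1:n)';    % [31;21.5;17;20;23.2]
total_ok = (cs(end) == sum(v)); % the last cumulative sum equals the total
```

The final entry of running_mean equals mean(v).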
- Diff
The diff function calculates the difference between each successive value of a vector. Note that the length of the returned vector is one less than the original vector.
>> diff(v)
ans =
-19
-4
21
7
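The following sketch shows what diff computes by reproducing it with indexing, and shows that cumsum undoes diff when the first value is supplied. The variable names manual and rebuilt are chosen for this example.

```matlab
v = [31;12;8;29;36];
d = diff(v);                       % [-19;-4;21;7]
manual = v(2:end) - v(1:end-1);    % the same differences by indexing
rebuilt = cumsum([v(1); d]);       % recovers v from its first value and d
```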