4.2. Common Statistical Functions
A statistic is a metric, or measure, taken from a data set. That is to say, it is a value resulting from a calculation on the data points that gives useful information about the data. We will refer to the random variable with a capital letter, $X$, and the individual members of the random variable with a lower case letter, $x_i$.
4.2.1. Minimum and Maximum
The min and max functions take a vector input and can be used with
either one or two outputs. The first output is either the minimum or maximum
value of the vector. The second output is the index (location) of the value.
>> v = [31;12;8;29;36];
>> mn = min(v)
mn =
8
>> mx = max(v)
mx =
36
>> [mn, idx] = min(v)
mn =
8
idx =
3
>> [mx, idx] = max(v)
mx =
36
idx =
5
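The index output is useful for looking up related data stored in a second array. The following is a minimal sketch of that idea; the variable names and the day labels are assumptions made up for illustration.

```matlab
% Assumed example data: the labels are made up for illustration.
temps = [31; 12; 8; 29; 36];
days  = {'Mon'; 'Tue'; 'Wed'; 'Thu'; 'Fri'};

[hottest, idx_max] = max(temps);   % hottest = 36, idx_max = 5
hottest_day = days{idx_max};       % 'Fri'

[coldest, idx_min] = min(temps);   % coldest = 8, idx_min = 3
coldest_day = days{idx_min};       % 'Wed'
```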
4.2.2. Mean, Standard Deviation, Median, and Mode
-
Mean
The mean of a data set is what we also call the average, or expected
value ($E(X)$). The mean is the sum of the values divided by the number
of values. MATLAB has a function called mean that takes a vector argument
and returns the mean of the data. Outliers (a few values significantly
different from the rest of the data) can move the mean, so it can be a poor
estimator of the center of the data. The median is less affected by outliers.
The symbol for the sample mean of a random variable, $X$, is
$\bar{x}$. The symbol for a population mean is $\mu$, which is the
expected value, $E(X)$, of the random variable.
>> v = [31;12;8;29;36];
>> sum(v)/length(v)
ans =
23.2000
>> mean(v)
ans =
23.2000
-
Standard Deviation
The standard deviation, which is the square root of the variance, is a measure of how widely distributed a random variable is about its mean. As shown in Fig. 4.1, a small standard deviation means that the numbers are close to the mean. A larger standard deviation means that the numbers vary quite a bit. For a normally distributed (Gaussian) variable, about 68% of the values are within one standard deviation of the mean, 95% are within two standard deviations, and 99.7% are within three standard deviations. The symbol for a sample standard deviation is $s$ and the symbol for a population standard deviation is $\sigma$. See Wikipedia site.
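The percentages for a normal distribution can be checked numerically. The sketch below is only an illustration: the mean of 5, the standard deviation of 2, and the sample size are arbitrary assumptions, and it uses the std function that is described later in this section. The fractions come from random data, so they are close to, but not exactly, 0.68, 0.95, and 0.997.

```matlab
rng(0);                        % fix the seed so the run is repeatable
x = 5 + 2*randn(100000, 1);    % assumed mean of 5, standard deviation of 2

m = mean(x);
s = std(x);

% Fraction of samples within 1, 2, and 3 standard deviations of the mean.
% The comparison yields a logical vector, and its mean is the fraction.
within1 = mean(abs(x - m) <= s);      % approximately 0.68
within2 = mean(abs(x - m) <= 2*s);    % approximately 0.95
within3 = mean(abs(x - m) <= 3*s);    % approximately 0.997
```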
We define the difference between a random variable and its mean as $x_i - \bar{x}$. Note that owing to the definition of the mean, $\sum_{i=1}^{n} (x_i - \bar{x}) = 0$. To account equally for the variability of values less than or greater than the mean, we will use $(x_i - \bar{x})^2$. The maximum likelihood estimator (MLE) of the variance computes the mean value of $(x_i - \bar{x})^2$.
It is usually preferred to use the unbiased estimator of the variance, which divides the sum by $n - 1$ instead of by $n$. The reason is that $\sum_{i=1}^{n} (x_i - \bar{x}) = 0$. So if we know the sum of the first $n - 1$ terms, then the last term can be determined. Only $n - 1$ of the squared deviations can vary freely. The number $n - 1$ is called the degrees of freedom of the variance. [MOORE18]
The unbiased variance is then calculated as
$$s^2 = \frac{1}{n - 1} \sum_{i=1}^{n} (x_i - \bar{x})^2$$
The standard deviation, $s$, is the square root of the variance.
MATLAB has a function called std that takes a vector argument and returns the
standard deviation. Similarly, the var function returns the variance.
To find the MLE standard deviation or variance, pass a second argument of 1 to
the functions (std(X, 1), var(X, 1)).
>> x = [-1 5 2 6 11 7 10 8 13 11];
>> x_bar = mean(x)
x_bar =
7.2000
>> n = length(x)
n =
10
>> s = sqrt(sum((x - x_bar).^2)/(n-1))
s =
4.3665
>> std(x)
ans =
4.3665
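The difference between the two estimators can be seen with the same data. This sketch computes both by hand so that they can be compared with var(x), var(x, 1), and std(x, 1); the variable names are chosen here for illustration only.

```matlab
x = [-1 5 2 6 11 7 10 8 13 11];
n = length(x);

d = x - mean(x);        % deviations from the sample mean
sum_d = sum(d);         % zero, up to floating-point round-off

var_unbiased = sum(d.^2)/(n - 1);   % 19.0667, same as var(x)
var_mle      = sum(d.^2)/n;         % 17.1600, same as var(x, 1)
s_mle        = sqrt(var_mle);       % 4.1425,  same as std(x, 1)
```

Because the MLE divides by n rather than n - 1, it always gives the smaller of the two values. The gap shrinks as the number of samples grows.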
-
Median
The median of a data set is the number which is greater than half of the numbers and less than the other half of the numbers. Although the mean is a more commonly used metric for the center value of a data set, the median is often a better indicator of the center. This is because values that are outliers to the main body of data can skew the mean, but have little or no effect on the median.
When the number of items in the vector is odd, the median is just the center value of the sorted data. When the array has an even number of items, the median is the average of the two center items of the sorted data.
>> vs = sort(v)'
vs =
8 12 29 31 36
>> vs(3)
ans =
29
>> md = median(v)
md =
29
>> vm = sort([v' 17])
vm =
8 12 17 29 31 36
>> (vm(3) + vm(4))/2
ans =
23
>> median(vm)
ans =
23
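The effect of an outlier on the two measures of center can be sketched with the same vector. The outlier value of 500 is an arbitrary assumption chosen to make the effect obvious.

```matlab
v = [31; 12; 8; 29; 36];
w = [v; 500];            % same data plus one assumed outlier

mean_v   = mean(v);      % 23.2000
median_v = median(v);    % 29

mean_w   = mean(w);      % 102.6667, pulled far toward the outlier
median_w = median(w);    % 30, the average of 29 and 31
```

One extreme value moves the mean by almost 80, while the median changes by only 1.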
-
Mode
The mode of a data set is the value that appears most frequently. MATLAB has
a function called mode. If there is a tie for the number of
occurrences of values, the smaller value is returned. If no values repeat, then
the mode function returns the minimum value of the data set.
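The three cases described above can be illustrated with a short sketch. The small test vectors are assumptions made up for this example. The optional second output of mode is the number of times the returned value occurs.

```matlab
m1 = mode([4 7 7 2 7]);          % 7, the most frequent value

[m2, f2] = mode([1 2 2 3 3]);    % m2 = 2, f2 = 2; 2 and 3 tie, the smaller wins

v = [31; 12; 8; 29; 36];
m3 = mode(v);                    % 8, no value repeats so the minimum is returned
```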
4.2.3. Sort, Sum, Cumsum, and Diff
Use the sort function to sort the values of a vector. By default,
the values are sorted in ascending order.
>> va = sort(v)
va =
8
12
29
31
36
>> vd = sort(v,'descend')
vd =
36
31
29
12
8
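Like min and max, the sort function can return a second output. It holds the original positions of the sorted values, which is handy for reordering a related vector in the same way. A minimal sketch, with variable names chosen here for illustration:

```matlab
v = [31; 12; 8; 29; 36];

[vs, idx] = sort(v);     % vs = [8; 12; 29; 31; 36], idx = [3; 2; 4; 1; 5]

% Indexing the original vector with idx reproduces the sorted vector.
same = isequal(v(idx), vs);    % true
```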
-
Sum and Cumsum
The sum function finds the sum of the values in a vector. The cumsum
function finds a cumulative sum as each number is added to the sum.
>> s = sum(v)
s =
116
>> cs = cumsum(v)
cs =
31
43
51
80
116
-
Diff
The diff function calculates the difference between successive
values of a vector. Note that the length of the returned vector is one less
than the length of the original vector.
>> diff(v)
ans =
-19
-4
21
7
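The cumsum and diff functions undo each other, apart from the first element that diff drops. The sketch below illustrates this with the same vector; the variable names are chosen here for illustration.

```matlab
v = [31; 12; 8; 29; 36];

cs = cumsum(v);                    % [31; 43; 51; 80; 116]
tail = diff(cs);                   % [12; 8; 29; 36], which is v(2:end)

% Prepending the first value to the differences and taking the
% cumulative sum rebuilds the original vector.
rebuilt = cumsum([v(1); diff(v)]); % [31; 12; 8; 29; 36]
```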