.. _vectorStats:

Common Statistical Functions
============================================================

.. include:: ../replace.txt

A *statistic* is a metric, or measure, taken from a data set. That is to
say, it is a value resulting from a calculation on the data points that
gives useful information about the data. We will refer to the random
variable with a capital letter, :math:`X`, and the individual members of
the random variable with a lower case letter, :math:`x`.

.. _minMax:

Minimum and Maximum
-----------------------

.. index:: min, max

The ``min`` and ``max`` functions take a vector input and can be called
with either one or two outputs. The first output is the minimum or maximum
value of the vector. The second output is the index (location) of that
value.

::

    >> v = [31;12;8;29;36];
    >> mn = min(v)
    mn =
         8
    >> mx = max(v)
    mx =
        36
    >> [mn, idx] = min(v)
    mn =
         8
    idx =
         3
    >> [mx, idx] = max(v)
    mx =
        36
    idx =
         5

.. _meanMetric:

Mean, Standard Deviation, Median, and Mode
---------------------------------------------

.. index:: mean, median, std, standard deviation, variance, mode

.. describe:: Mean

    The **mean** of a data set is what we also call the *average*, or
    *expected value* (:math:`E(X)`). The mean is the sum of the values
    divided by the number of values. |M| has a function called ``mean``
    that takes a vector argument and returns the mean of the data.

    Outliers (a few values significantly different from the rest of the
    data) can shift the mean, so it can be a poor estimator of the center
    of the data. The median is less affected by outliers.

    The symbol for the sample mean of a random variable, :math:`X`, is
    :math:`\bar{x}`. The symbol for a population mean is :math:`\mu`, which
    is the expected value, :math:`E(X)`, of the random variable.

    .. math::

        \bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_{i}

    .. math::

        \mu = E(X)

    ::

        >> v = [31;12;8;29;36];
        >> sum(v)/length(v)
        ans =
           23.2000
        >> mean(v)
        ans =
           23.2000

.. describe:: Standard Deviation

    The **standard deviation**, which is the square root of the
    **variance**, is a measure of how widely distributed a random variable
    is about its mean. As shown in :numref:`fig:pdfs`, a small standard
    deviation means that the numbers are close to the mean. A larger
    standard deviation means that the numbers are spread farther from the
    mean. For a normally (Gaussian) distributed variable, 68% of the values
    are within one standard deviation of the mean, 95% are within two
    standard deviations, and 99.7% are within three standard deviations.
    The symbol for a sample standard deviation is :math:`s` and the symbol
    for a population standard deviation is :math:`\sigma`. See
    `Wikipedia site `_.

    .. _fig:pdfs:

    .. figure:: pdfs.png
        :align: center
        :width: 40%

        Probability Density Functions with :math:`\sigma = 3` and
        :math:`\sigma = 1`.

    We define the difference between a random variable and its mean as
    :math:`Y = (X - \bar{x})`. Note that owing to the definition of the
    mean, :math:`\bar{y} = 0`. To account equally for the variability of
    values less than or greater than the mean, we will use :math:`Y^2`. The
    maximum likelihood estimator (MLE) of the variance computes the mean
    value of :math:`Y^2`.

    .. math::

        s_{MLE}^2 & = \frac{1}{n} \sum_{i=1}^{n} (x_{i} - \bar{x})^2 \\
                  & = \left(\frac{1}{n} \sum_{i=1}^{n} x_{i}^2\right) - \bar{x}^2

    It is usually preferred to use the unbiased estimator of the variance,
    which divides the sum by :math:`(n - 1)` instead of by :math:`n`. The
    reason is that :math:`\sum_{i=1}^n y_i = 0`, so if we know the first
    :math:`(n - 1)` deviations, then the last one can be determined. Only
    :math:`(n - 1)` of the squared deviations can vary freely. The number
    :math:`(n - 1)` is called the *degrees of freedom* of the variance.
    [MOORE18]_ The unbiased variance is then calculated as

    .. math::

        s^2 = \frac{(x_1 - \bar{x})^2 + (x_2 - \bar{x})^2 + \cdots
        + (x_n - \bar{x})^2}{n - 1}
        = \frac{1}{n - 1}\sum_{i=1}^n (x_i - \bar{x})^2.

    The standard deviation, :math:`s`, is the square root of the variance.
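
    As a minimal sketch of the two estimators, assuming the vector ``v``
    from the earlier examples (the names ``y``, ``var_mle``, and
    ``var_unb`` are illustrative and are not used elsewhere in this text),
    both variances can be computed directly from the definitions above:

    ::

        >> v = [31;12;8;29;36];
        >> y = v - mean(v);             % deviations from the sample mean
        >> n = length(v);
        >> var_mle = sum(y.^2)/n        % MLE: mean of the squared deviations
        var_mle =
          122.9600
        >> mean(v.^2) - mean(v)^2       % equivalent form of the MLE variance
        ans =
          122.9600
        >> var_unb = sum(y.^2)/(n - 1)  % unbiased: divide by n - 1
        var_unb =
          153.7000

    Because :math:`(n - 1) < n`, the unbiased estimate is never smaller
    than the MLE estimate, and the gap between them shrinks as :math:`n`
    grows.
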
    |M| has a function called ``std`` that takes a vector argument and
    returns the standard deviation. Similarly, the ``var`` function returns
    the variance. To find the MLE standard deviation or variance, pass a
    second argument of 1 to the functions (``std(X, 1)``, ``var(X, 1)``).

    ::

        >> x = [-1 5 2 6 11 7 10 8 13 11];
        >> x_bar = mean(x)
        x_bar =
            7.2000
        >> n = length(x)
        n =
            10
        >> s = sqrt(sum((x - x_bar).^2)/(n-1))
        s =
            4.3665
        >> std(x)
        ans =
            4.3665

.. describe:: Median

    The **median** of a data set is the number that is greater than half
    of the numbers and less than the other half of the numbers. Although
    the mean is a more commonly used metric for the center value of a data
    set, the median is often a better indicator of the center. This is
    because values that are *outliers* to the main body of data can skew
    the mean, but have little effect on the median.

    When the number of items in the vector is odd, the median is the
    center value of the sorted data. When the vector has an even number of
    items, the median is the average of the two center items of the sorted
    data.

    ::

        >> vs = sort(v)'
        vs =
             8    12    29    31    36
        >> vs(3)
        ans =
            29
        >> md = median(v)
        md =
            29
        >> vm = sort([v' 17])
        vm =
             8    12    17    29    31    36
        >> (vm(3) + vm(4))/2
        ans =
            23
        >> median(vm)
        ans =
            23

.. describe:: Mode

    The **mode** of a data set is the value that appears most frequently.
    |M| has a function called ``mode``. If there is a tie for the number
    of occurrences of values, the smaller value is returned. If no values
    repeat, then the ``mode`` function returns the minimum value of the
    data set.

.. _SortSumDiff:

Sort, Sum, Cumsum, and Diff
----------------------------

.. index:: sort

Use the ``sort`` function to sort the values of a vector. By default, the
values are sorted in ascending order.

::

    >> va = sort(v)
    va =
         8
        12
        29
        31
        36
    >> vd = sort(v,'descend')
    vd =
        36
        31
        29
        12
         8

.. describe:: Sum and Cumsum

    .. index:: sum, cumsum

    The ``sum`` function finds the sum of the values in a vector.
    The ``cumsum`` function returns the cumulative sum, in which each
    element is the sum of all values up to and including that position.

    ::

        >> s = sum(v)
        s =
           116
        >> cs = cumsum(v)
        cs =
            31
            43
            51
            80
           116

.. describe:: Diff

    .. index:: diff

    The ``diff`` function calculates the difference between successive
    values of a vector. Note that the returned vector is one element
    shorter than the original vector.

    ::

        >> diff(v)
        ans =
           -19
            -4
            21
             7

.. raw:: latex

    \clearpage
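
The functions in this section are related to one another. As a brief
sketch, assuming the column vector ``v`` used in the earlier examples, the
last element of ``cumsum(v)`` equals ``sum(v)``, and applying ``diff`` to
the cumulative sum recovers every element of ``v`` after the first:

::

    >> v = [31;12;8;29;36];
    >> cs = cumsum(v);     % running totals of v
    >> cs(end)             % last running total is the full sum
    ans =
       116
    >> diff(cs)'           % successive differences give back v(2:end)
    ans =
        12     8    29    36

The transpose is used here only to display the result on one line.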