4.7. Plots of Statistical Data

Here, we generate a data set of 200 random numbers to illustrate two standard plots that show the distribution of data. The data comes from a normally distributed random number generator with a mean of 50 and a standard deviation of 15.

d = 50 + 15*randn(1, 200);    % normal mean=50, std=15
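
As a quick sanity check, the sample statistics of the generated data can be compared with the requested mean and standard deviation. This is a minimal sketch; the seed passed to rng is an arbitrary assumption used only to make the run repeatable, and the variable names m and s are illustrative.

```matlab
rng(1);                        % assumed seed, only so the run is repeatable
d = 50 + 15*randn(1, 200);     % normal mean=50, std=15
m = mean(d);                   % sample mean, should be near 50
s = std(d);                    % sample standard deviation, should be near 15
fprintf('sample mean = %.2f, sample std = %.2f\n', m, s);
```

With only 200 samples, the printed values will be close to, but not exactly, 50 and 15.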

4.7.1. Box Plot

A box plot gives a quick picture of the distribution of the data. The plot makes it easy to see the range of each 25% of the data (called quartiles). An example box plot is shown in Fig. 4.14. The vertical line at the bottom of the plot starts at the lower limit value and extends up to the bottom of the box, which marks the first quartile, \(Q_1\). The box in the center represents the range of the middle 50% of the data. It goes from the first quartile, \(Q_1\), to the third quartile, \(Q_3\). A horizontal line inside the box marks the second quartile, which is the median of the data. Another vertical line extends from the third quartile up to the upper limit value.

The lower and upper limits may be the minimum and maximum values of the data. However, they are often computed relative to the center region of the data so that outliers are excluded from the plotted range. The range of the center region is called the interquartile range, \(IQR = Q_3 - Q_1\). The lower and upper limit values (\(LL\) and \(UL\)) are computed as \(LL = Q_1 - 1.5{\times}IQR\) and \(UL = Q_3 + 1.5{\times}IQR\). Any data points less than \(LL\) or greater than \(UL\) are classified as outliers and may appear as scatter points in the box plot.
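
The quantities described above can be computed directly with standard MATLAB functions. The following is a sketch under one assumption: the quartiles are found by linear interpolation between the sorted data values, which is one common convention. Other conventions give slightly different quartile values, so the numbers may not match a particular box plot function exactly.

```matlab
d = 50 + 15*randn(1, 200);             % same kind of data as above
s = sort(d);                           % sorted data values
n = numel(s);
p = ((1:n) - 0.5) / n;                 % cumulative fraction assigned to each sorted value
Q = interp1(p, s, [0.25 0.50 0.75]);   % Q(1) = Q1, Q(2) = median, Q(3) = Q3
IQR = Q(3) - Q(1);                     % range of the middle 50% of the data
LL = Q(1) - 1.5*IQR;                   % lower limit
UL = Q(3) + 1.5*IQR;                   % upper limit
outliers = d(d < LL | d > UL);         % points outside the limits
fprintf('Q1 = %.2f, median = %.2f, Q3 = %.2f\n', Q);
fprintf('LL = %.2f, UL = %.2f, outliers found: %d\n', LL, UL, numel(outliers));
```

For normally distributed data, only a small fraction of the points is expected to fall outside the limits, so the outlier list is usually short or empty.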

>> boxplot(d')

Fig. 4.14 A Box Plot with Quartiles Noted

The boxplot function from MathWorks is part of the Statistics and Machine Learning Toolbox. Several free box plot functions are available on the MathWorks File Exchange, although some of them depend on additional toolboxes. The free boxplot function [LUENGO15] is a simple alternative that uses only standard MATLAB functions.

The boxplot function expects each data set to be a column. Given a matrix, it draws one box plot per column in the same figure. That is why the row vector d is transposed in the call above.
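
The column convention makes it easy to compare several data sets side by side. The sketch below assumes the boxplot function from the Statistics and Machine Learning Toolbox; the three data sets and their parameters are made up for illustration, and a File Exchange replacement may not accept the 'Labels' option.

```matlab
% Three hypothetical data sets, one per matrix column
data = [50 + 15*randn(200, 1), ...
        60 +  5*randn(200, 1), ...
        40 + 25*randn(200, 1)];
boxplot(data, 'Labels', {'set 1', 'set 2', 'set 3'})   % one box per column
ylabel('value')
```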

4.7.2. Histogram

A histogram plot divides the data into regions (called bins) and shows how many values fall into each region. If the data size is large, a histogram plot will begin to take the shape of the PDF.
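
The binning itself can be inspected without drawing anything. As a small sketch, the standard histcounts function returns the number of values in each bin along with the bin edges; the variable names counts and edges are illustrative.

```matlab
d = 50 + 15*randn(1, 200);
[counts, edges] = histcounts(d, 40);   % 40 bins of equal width
fprintf('bin width = %.2f\n', edges(2) - edges(1));
fprintf('values counted = %d of %d\n', sum(counts), numel(d));
```

Every data value lands in exactly one bin, so the counts add up to the number of data points.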

The histogram function has several possible parameters, but the most common usage is to pass two arguments: the data and the number of bins to use. Another useful pair of options is 'Normalization', 'probability', which scales the height of each bin to the fraction of the data that falls in it, making it convenient to compare histograms of data sets with different sizes. To overlay a PDF curve, use 'Normalization', 'pdf' instead, which scales the bars so that their total area is one. An example histogram is shown in Fig. 4.15.

histogram(d, 40)

Fig. 4.15 A Histogram Plot
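
To compare the histogram with the PDF that generated the data, the bars need to be scaled so that their total area is one, which the 'pdf' normalization option of histogram does. The following sketch evaluates the normal PDF formula directly, so it needs no toolbox functions; the names mu, sigma, x, and pdf_vals are illustrative.

```matlab
d = 50 + 15*randn(1, 200);
histogram(d, 40, 'Normalization', 'pdf')   % bar areas sum to one
hold on
mu = 50;  sigma = 15;                      % parameters used to generate d
x = linspace(min(d), max(d), 200);
pdf_vals = exp(-(x - mu).^2 / (2*sigma^2)) / (sigma*sqrt(2*pi));
plot(x, pdf_vals, 'LineWidth', 2)          % normal PDF curve
hold off
legend('data', 'normal PDF')
```

With only 200 points and 40 bins, the bars will look ragged around the curve. Using more data or fewer bins brings the histogram closer to the shape of the PDF.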