4.7. Plots of Statistical Data

Here we generate a data set of 200 random numbers to illustrate two common plots that show the distribution of the data. The data are drawn from a normal distribution with a mean of 50 and a standard deviation of 15.

d = 50 + 15*randn(1, 200);    % normal mean=50, std=15

4.7.1. Box Plot

A box plot gives a quick picture of the distribution of the data. The three quartiles, Q_1, Q_2, and Q_3, divide the sorted data into four groups that each hold 25% of the values, and the plot makes it easy to see the range of each group. The vertical line at the bottom of the plot starts at the lower limit value and extends up to the bottom of a box at the first quartile, Q_1. The box in the center spans the middle 50% of the data, from the first quartile, Q_1, to the third quartile, Q_3. A horizontal line inside the box marks the second quartile, Q_2, which is the median of the data. Another vertical line extends from the third quartile up to the upper limit value.

The lower and upper limits may be the minimum and maximum values of the data. However, they are often computed relative to the center region of the data so that outliers are kept out of the lines that extend from the box. The range of the center region is called the interquartile range, IQR = Q_3 - Q_1. The lower and upper limit values (LL and UL) are computed as LL = Q_1 - 1.5 \times IQR and UL = Q_3 + 1.5 \times IQR. Any data points less than LL or greater than UL are classified as outliers and may appear as scatter points in the box plot.
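The limit formulas can be checked by hand with a few lines of standard MATLAB. The following is a minimal sketch, not the code used by any particular box plot function. It assumes one common quartile convention (Q_1 and Q_3 taken as the medians of the lower and upper halves of the sorted data), and the fixed seed passed to rng is only there to make the run repeatable. Other conventions interpolate between points and give slightly different quartile values.

```matlab
rng(1);                               % assumed seed, for a repeatable run
d = 50 + 15*randn(1, 200);            % normal mean=50, std=15

s  = sort(d);                         % ascending order
n  = numel(s);
Q2 = median(s);                       % second quartile (the median)
Q1 = median(s(1:floor(n/2)));         % median of the lower half
Q3 = median(s(ceil(n/2)+1:end));      % median of the upper half

IQR = Q3 - Q1;                        % range of the middle 50% of the data
LL  = Q1 - 1.5*IQR;                   % lower limit
UL  = Q3 + 1.5*IQR;                   % upper limit

outliers = d(d < LL | d > UL);        % points outside the limits

fprintf('Q1 = %.2f, Q2 = %.2f, Q3 = %.2f\n', Q1, Q2, Q3);
fprintf('LL = %.2f, UL = %.2f, %d outlier(s)\n', LL, UL, numel(outliers));
```

For normally distributed data, only a small fraction of the values is expected to fall outside LL and UL, so a sample of 200 points typically yields few or no outliers.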

The boxplot function from MathWorks is part of the Statistics and Machine Learning Toolbox, which is an extra purchase. However, there are a few free box plot functions available on the MathWorks File Exchange. Some of those use functions from extra toolboxes. The Free Boxplot function [LUENGO15] is a simple function that uses only standard MATLAB functions.

The boxplot function expects the data as a column vector because it can draw several box plots in one figure when each data set is a column of a matrix. Because d is a row vector, it is transposed in the call. An example box plot is shown in Fig. 4.12.

boxplot(d')
../_images/boxplot.png

Fig. 4.12 A Box Plot with Quartiles Noted
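The column convention matters most when comparing data sets side by side. The sketch below places two data sets in the columns of a matrix. The second data set (mean 60, standard deviation 5) and the seed are hypothetical values chosen only for illustration. Because boxplot is not part of standard MATLAB, the call is guarded so that the script still runs when no boxplot function is on the path.

```matlab
rng(2);                           % assumed seed, for a repeatable run
d1 = 50 + 15*randn(1, 200);       % same form as the data set above
d2 = 60 + 5*randn(1, 200);        % hypothetical second data set

D = [d1', d2'];                   % 200-by-2 matrix, one data set per column

if exist('boxplot', 'file')
    boxplot(D)                    % one box per column of D
else
    disp('No boxplot function found on the MATLAB path.');
end
```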

4.7.2. Histogram

A histogram plot divides the range of the data into regions (called bins) and shows how many values fall into each bin. As the number of data values becomes very large, the shape of the histogram approaches the shape of the probability density function (PDF) of the data.

The histogram function has several possible parameters, but the most common usage is to pass two arguments: the data and the number of bins to use. Another useful option pair is 'Normalization', 'probability', which scales the height of each bin to the fraction of the data values that fall in it. This is a good option if you want to overlay another histogram that holds a different number of values. To overlay a PDF curve, the related option 'Normalization', 'pdf' also divides by the bin width so that the total area of the bars is one. An example histogram is shown in Fig. 4.13.

histogram(d, 30)
../_images/histogram.png

Fig. 4.13 A Histogram Plot
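To compare a histogram with the distribution that generated the data, the bar heights need to be on a density scale. The following sketch uses the 'Normalization', 'pdf' option of histogram for that purpose and overlays the normal PDF, written out from its formula so that only standard MATLAB functions are needed. The seed, the axis labels, and the variable names mu, sigma, and barArea are assumptions made for this illustration.

```matlab
rng(3);                                   % assumed seed, for a repeatable run
mu = 50;  sigma = 15;
d = mu + sigma*randn(1, 200);             % normal mean=50, std=15

% Bar heights scaled so that the total area of the bars is 1
h = histogram(d, 30, 'Normalization', 'pdf');
hold on

% Normal PDF evaluated over the range of the data
x = linspace(min(d), max(d), 200);
y = exp(-(x - mu).^2 / (2*sigma^2)) / (sigma*sqrt(2*pi));
plot(x, y, 'LineWidth', 2)
hold off

xlabel('Value');  ylabel('Probability density')
legend('Histogram of d', 'Normal PDF')

barArea = sum(h.Values .* h.BinWidth);    % total area of the bars
```

With only 200 values spread over 30 bins, the bars follow the curve only roughly. The match improves as the number of data values grows.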