6.9. Numerical Stability

Here we will consider the potential accuracy of solutions to systems of equations of the form \mathbf{A}\,\bm{x} = \bm{b}. In idealistic mathematical notation, we write the solution to a square matrix system of equations as \bm{x} = \mathbf{A}^{-1}\bm{b}. However, the path toward finding the true \bm{x} is fraught with potential problems.

Sometimes it may be difficult to find an accurate solution. It might be that the values in \bm{b} and \bf{A} are not accurate because they came from an experiment where noise in the data is unavoidable. It might be that the rows of \bf{A} are close to parallel, meaning that \bf{A} is nearly singular, which is also called poorly conditioned. It might also be that the numbers in \bm{b} and \bf{A} are likely to incur round-off errors because they contain a mixture of very large numbers, very small numbers, and even some irrational numbers or numbers whose binary representation contains an unresolved repeating pattern of 1's and 0's.

Use of an algorithm that is not numerically stable also leads to inaccurate solution estimates. Numerical stability is a property of the algorithm rather than the problem being solved. Even the best algorithms might not find accurate results under the worst conditions, but numerically stable algorithms find accurate results for most problems. Numerically stable algorithms order the operations on numbers such that round-off errors are minimized. They also avoid computations that tend to degrade the condition of matrices. For example, our idealistic mathematical solution to an over-determined system of equations calls for three matrix multiplications, a matrix inverse, and a matrix–vector multiplication. We will see in this section that such a sequence of operations is not advisable. We already know that the matrix inverse is slow. The matrix multiplications can also introduce round-off errors.

After reviewing the source of round-off errors and the condition number of matrices, we will look at three case studies where modified algorithms have led to improved numerical stability. Then in the next section, we will see how orthogonal matrix factoring provides tools needed for numerically stable solutions to rectangular systems of equations.

6.9.1. Round-off Errors

Round-off errors usually occur on addition or subtraction between a small number and a large number. The source of the degraded precision is the finite number of binary bits used to hold the digits of floating point numbers. Addition or subtraction can cause the less significant binary digits of the smaller number to be removed. When the precision of a number is reduced, a round-off error is introduced into the calculations.

The IEEE Standard for Binary Floating-Point Arithmetic [IEEE85] specifies the format of both 32-bit single precision and 64-bit double precision floating point numbers. We will focus on the double precision format because most scientific numeric calculations use double data type variables, which correspond to the double precision standard and are the default in MATLAB. Floating point numbers are stored in a format similar to scientific notation, except the binary number system (base 2) is used instead of decimal. As shown in equation (6.22), the binary numbers are normalized with adjustments to the exponent so that there is a single 1 to the left of the binary point. The binary digits of the number are called the mantissa. Since the digit to the left of the binary point is always 1, it is not necessary to store that 1. The digits to the right of the binary point are called the significand. As shown in Fig. 6.22, double precision numbers store 52 binary digits of the significand. In equation (6.22), the significand is the binary digits b_{51}b_{50}\ldots b_0. Including the 1 to the left of the binary point, the level of available precision in double data type variables is u = 2^{-53} \approx 1.11 \times 10^{-16} [IEEE85].

(6.22)\begin{array}{l}
    (-1)^{sign} \times (1.b_{51}b_{50}\ldots b_0)_2 \times 2^{exp - 1023}\\
    \text{or} \\
    mantissa \times 2^{exponent}
    \end{array}

../_images/doubleFormat.png

Fig. 6.22 The format of 64-bit double precision floating point numbers.
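
The stored bit pattern of a double can be examined directly in MATLAB. Here is a minimal sketch using the num2hex function, which returns the 16 hexadecimal digits (64 bits) of a double precision number.

% Inspect the IEEE 754 fields of the number 1.2 (a sketch).
h = num2hex(1.2)            % '3ff3333333333333'
% The first three hex digits (12 bits) hold the sign bit and the 11-bit
% biased exponent: hex 3ff is 1023, so the unbiased exponent is 1023 - 1023 = 0.
expField = hex2dec(h(1:3))
% The remaining 13 hex digits (52 bits) are the significand. For 1.2 the
% bits repeat the pattern 0011, which is the hex digit 3.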

Let x_1 = mantissa_1 \times 2^{exponent_1} and let x_2 = mantissa_2 \times
2^{exponent_2}. Further, let x_1 be much greater than x_2, (x_1 \gg x_2). The exponents are added in multiplication,

x_1 \times x_2 = mantissa_1 \times mantissa_2 \times
        2^{(exponent_1 + exponent_2)}.

Division is a similar process, except that the exponents are subtracted. Multiplication and division are mostly immune to numerical problems, except in the rare event of overflow or underflow.

Addition and subtraction require that the exponents match.

\begin{split} x_1 \,\pm\, x_2 &= mantissa_1 \times
    2^{exponent_1} \pm \\ & \quad \left(mantissa_2 \times 2^{(exponent_2 -
    exponent_1)}\right) \times 2^{exponent_1}.  \end{split}

The expression \left(mantissa_2 \times 2^{(exponent_2 - exponent_1)}\right) where exponent_1 \gg exponent_2, means that the binary digits of mantissa_2 are shifted to the right exponent_1 - exponent_2 positions. Any digits of mantissa_2 that are shifted beyond the 52 positions of the significand are lost and a round-off error is introduced into x_2.

Many numbers have 1s in only a few positions to the right of the binary point. These numbers are not likely to lose precision to round-off error. But irrational numbers such as \pi or \sqrt{2} fill out the full 52 bits of the significand. They can suffer a loss of precision from even a single-digit shift to the right, which occurs when adding a number only twice their value. Just as a fraction such as 1/3 holds infinitely repeating digits in decimal (0.333333 \ldots), some numbers have repeating digits in binary. For example, (1.2)_{10} =
(1.\underline{0011}0011\underline{0011}\ldots)_2. A small round-off error occurs when a number as small as 2.4 is added to 1.2.
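
A quick command line experiment shows the effect. The sketch below adds 2.4 to 1.2 and then removes it; any difference left behind is on the order of eps and depends on how the sum rounds.

% Round-off error from adding 2.4 to 1.2 and removing it again (a sketch).
a = 1.2;
r = (a + 2.4) - 2.4;
fprintf('%.17f\n%.17f\n', a, r)   % the printed digits may differ slightly
abs(r - a) <= 2*eps(a)            % any difference is within a couple of eps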

The example in Fig. 6.23 shows the round-off error when 1.2 is added to successive powers of 2. When the variable x is small, variable y is close to zero. But as x gets larger, more of variable a is lost to round-off error in x + a, which causes variable y to converge to one. A logarithmic scale on the y axis of the plot is needed to observe the loss of precision. The stair-step shape of the plot occurs because a holds a repeating pattern of the binary digits 0011. The precision of a is diminished when a binary 1 is lost to round-off error, but no damage to the precision of a occurs when a binary 0 is lost. The plot shows an interesting progression of the round-off error because a number with a repeating binary pattern is stored in the variable a. What do you think the plot will look like when a simple number like 1.5 is placed in a?

../_images/roundoffError.png

Fig. 6.23 The number a = 1.2 fills the full 52-bit significand allowed for double precision floating point numbers. As the number added to a gets larger, more of a is lost to round-off error.

ex = 1:60;                  % exponents of the powers of two added to a
x = 2.^ex;
a = 1.2;
y = abs(x + a - x - a)/a;   % relative portion of a lost to round-off error
y(y < eps) = eps;           % floor the values so the log plot can display them
semilogy(ex, y)
axis tight

6.9.2. Matrix Condition Number

Another problem besides round-off error also degrades the accuracy of solutions to systems of equations. The symptom of the problem is an oversensitivity to variations of the elements of \bm{b} when the system is nearly singular, which is also called poorly conditioned or ill-conditioned. If the coefficients are taken from experiments, then they likely contain a random component. We say that there are perturbations to \bm{b}. In many applications, \bf{A} is a fixed control matrix, so errors are more likely to be seen in \bm{b}.

Errors in \bm{b}, represented as \delta\bm{b}, are conveyed to the solution \bm{x}. The matrix equation \mathbf{A}\,\bm{x} = \bm{b} becomes \mathbf{A}\,\hat{\bm{x}} = \bm{b} + \delta\bm{b}. Our estimate of the correct value of \bm{x} is \hat{\bm{x}} = \bm{x} + \bm{\delta x}.

\begin{aligned}
        \mathbf{A}\,(\bm{x} + \bm{\delta x}) &= \bm{b} + \bm{\delta b} \\
        \mathbf{A}\,\bm{x} + \mathbf{A}\,\bm{\delta x} &= \bm{b}
                + \bm{\delta b} \\
        \mathbf{A}\,\bm{x} = \bm{b}, \quad &\text{and} \quad
                \mathbf{A}\,\bm{\delta x} =  \bm{\delta b}
    \end{aligned}

We care about the magnitudes of \bm{\delta x} and \bm{\delta b} relative to their true values. We want to find the length (norm) of each error vector relative to its unperturbed length, \norm{\bm{\delta x}} / \norm{\bm{x}} and \norm{\bm{\delta b}} / \norm{\bm{b}}. We then have the following two inequalities because the norm of a product is less than or equal to the product of the norms.

(6.23)\norm{\bm{\delta x}} \leq \norm{\mathbf{A}^{-1}}\, \norm{\bm{\delta b}}
    \qquad \frac{1}{\norm{\bm{x}}} \leq
    \norm{\mathbf{A}}\,\frac{1}{\norm{\bm{b}}}

Multiplying the two inequalities of equation (6.23) leads to equation (6.24).

(6.24)\frac{\norm{\bm{\delta x}}}{\norm{\bm{x}}} \leq \norm{\mathbf{A}}
    \norm{\mathbf{A}^{-1}}\,\frac{\norm{\bm{\delta b}}}{\norm{\bm{b}}}

Equation (6.24) defines the condition number, a metric that relates the relative error in \bm{x} to the relative error in \bm{b}. The condition number measures how well or poorly conditioned a matrix is. The condition number of \bf{A} is denoted \kappa({\mathbf{A}}) [WATKINS10].

(6.25)\kappa(\mathbf{A}) = \text{cond}(\mathbf{A}) =
            \norm{\mathbf{A}}\norm{\mathbf{A}^{-1}}

\frac{\norm{\bm{\delta x}}}{\norm{\bm{x}}} \leq \kappa(\mathbf{A})\,
    \frac{\norm{\bm{\delta b}}}{\norm{\bm{b}}}

The combined matrix norm of a matrix and its inverse is larger for matrices that are nearly singular. Matrices that are close to singular have large condition numbers. If \kappa(\mathbf{A}) is small, then small errors in \bm{b} result in only small errors in \bm{x}. Whereas if \kappa(\mathbf{A}) is large, then even small perturbations in \bm{b} can result in a large error in \bm{x}. The condition number of the identity matrix, or of any orthogonal matrix, is 1, but nearly singular matrices can have condition numbers of several thousand to even over a million.

The cond function in MATLAB uses the \norm{\cdot}_2 matrix norm by default, but it may also compute the condition number using other matrix norms defined in Matrix Norms. However, to avoid calculating a matrix inverse, MATLAB calculates the condition number using the singular values of the matrix as described in Condition Number. Because division by a singular value near zero is a concern, MATLAB calculates the reciprocal of the condition number, which it calls RCOND. MATLAB's left-divide operator checks the condition of a matrix before attempting to find a solution. When using the left-divide operator, one may occasionally see a warning message such as the following.

Warning: Matrix is close to singular or badly scaled. Results may be inaccurate. RCOND = 3.700743e-18.

Note

The singular values of a matrix are briefly introduced in The Singular Value Decomposition and discussed in more detail in Singular Value Decomposition (SVD).

A handy rule of thumb for interpreting the condition number is as follows.

If \kappa(\mathbf{A}) \approx 0.d \times 10^k then the solution \bm{x} can be expected to have k fewer digits of accuracy than the elements of \bf{A} [WILLIAMS14].

Thus, if \kappa(\mathbf{A}) = 10 = 0.1 \times 10^1 and the elements of \bf{A} are given with 6 digits, then the solution in \bm{x} can be trusted to 5 digits of accuracy.
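
The rule of thumb is easy to try in MATLAB. The following sketch uses an arbitrary 2-by-2 matrix with nearly parallel rows; the values in the comments are approximate.

% Estimate the digits of accuracy lost from the condition number (a sketch).
A = [2 3; 4 6.01];            % rows are nearly parallel
k = cond(A)                   % roughly 3.3e3, i.e., 0.33 x 10^4
digitsLost = ceil(log10(k))   % expect about 4 fewer digits of accuracy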

Here is an example to illustrate the problem. To be confident of the correct answer, we begin with a well-conditioned system.

(6.27)\spalignsys{17\,x_1 - 2\,x_2 = -3; -13\,x_1 + 29\,x_2 = 29}

It is easily confirmed with MATLAB that the solution is x_1 = -0.0621, x_2 = 0.9722.

We make the assertion that the system of equations in equation (6.27) is well-conditioned based on two pieces of information. First, the condition number of \bf{A} is relatively small (2.3678). Secondly, minor changes to the equation have a minor impact on the solution. For example, if we add 1% of \bm{b}_2 to itself, we get a new solution of x_1 = -0.0609, x_2 = 0.9827, which is a change of 2.00% to the x_1 solution and 1.08% to the x_2 solution.
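
These numbers can be reproduced with a few lines of MATLAB. The following sketch applies the 1% perturbation to \bm{b}_2 described above.

% Check of the well-conditioned example (a sketch).
A = [17 -2; -13 29];  b = [-3; 29];
cond(A)                  % about 2.37
x = A\b                  % [-0.0621; 0.9722]
bp = [-3; 29*1.01];      % add 1% of b_2 to itself
xp = A\bp                % [-0.0609; 0.9827]
abs((xp - x)./x)*100     % percent change in each unknown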

Now, to illustrate a poorly conditioned system, we will alter the first equation of equation (6.27) to be the sum of the second equation and 1/100 of the first equation.

(6.28)\spalignsys{-12.83\,x_1 + 28.98\,x_2 = 28.97; -13\,x_1 + 29\,x_2 = 29}

Perhaps surprisingly, both LU decomposition and RREF are able to find the correct solution to equation (6.28). One can see that equation (6.28) is nearly singular because the two equations are close to parallel. The condition number is large (431). If again we add 1% of \bm{b}_2 to itself, we get a new solution of x_1 = -1.8617, x_2 = 0.1754, which is a 2,898% change to the x_1 solution and an 81.96% change to the x_2 solution.
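
Repeating the same check for the poorly conditioned system shows how the small perturbation is amplified (a sketch).

% Check of the poorly conditioned example (a sketch).
A2 = [-12.83 28.98; -13 29];  b2 = [28.97; 29];
cond(A2)                 % about 431
x2 = A2\b2               % still [-0.0621; 0.9722]
b2p = [28.97; 29*1.01];  % the same 1% change to b_2
x2p = A2\b2p             % [-1.8617; 0.1754]
abs((x2p - x2)./x2)*100  % percent change: roughly 2898% and 82%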

Another important result related to the condition number of matrices is that the condition number of the product of uncorrelated matrices is usually greater than the condition number of either individual matrix, and can be as large as the product of their condition numbers. [1]

\kappa(\mathbf{A\,B}) = \norm{\mathbf{A\,B}}\,\norm{(\mathbf{A\,B})^{-1}}
    \leq \norm{\mathbf{A}}\, \norm{\mathbf{B}}\, \norm{\mathbf{B}^{-1}}\,
        \norm{\mathbf{A}^{-1}} = \kappa(\mathbf{A})\,\kappa(\mathbf{B})

This has implications for the next section (Orthogonal Matrix Decompositions) where we create matrix factors without increasing the condition number of the product. This is accomplished by using orthogonal matrices because the condition number of orthogonal matrices is 1. The following example shows the condition number of random matrices, products of matrices, and the pseudo-inverse of a matrix.

% A and B are 3-by-3
>> kA = cond(A)
kA =
    4.7885
>> kB = cond(B)
kB =
    2.4164
>> kA * kB
ans =
   11.5710
>> cond(A*B)
ans =
    9.7549
>> kA * kA
ans =
   22.9293
>> cond(A'*A)
ans =
   22.9293
>> cond(inv(A'*A))
ans =
   22.9293
% C is 6-by-3
>> cond(C)
ans =
    1.5439
>> ic = inv(C'*C);
>> cond(ic)
ans =
    2.3836
%  pseudo-inverse
>> cond(ic*C')
ans =
    1.5439

6.9.3. The Case for Partial Pivoting

For numerical stability, very small numbers should not be used as pivot variables in the elimination procedure. Each time the pivot is advanced, rows are exchanged so that the absolute value of the pivot variable is as large as possible. The new row order is recorded with a permutation matrix so that the results stay associated with their unknown variables. Reordering rows, or row exchanges, is also called partial pivoting. Full pivoting is when columns are also reordered.

Here is an example showing why partial pivoting is needed for numerical stability. This example uses a 2 {\times} 2 matrix where we could see some round-off error. The \epsilon value used in the example is taken to be a very small number, such as the machine epsilon value (eps) that we used in Example Control Constructs - sinc. Keep in mind that round-off errors will propagate and build when solving a larger system of equations. Without pivoting, we end up dividing potentially large numbers by \epsilon producing even larger numbers. In the algebra, we can avoid the large numbers, but get very small numbers in places. In either case, during the row operation phase of elimination, we add or subtract numbers of very different value resulting in round-off errors. The accumulated round-off errors become especially problematic in the substitution phase where we might divide two numbers that are both close to zero.

\spalignsys{\epsilon x_1 + 2 x_2 = 4;
3 x_1 - x_2 = 7}

\spalignmat{\epsilon, 2; 3, -1} \spalignvector{x_1;x_2}
= \spalignvector{4;7}

Regarding \epsilon as nearly zero relative to the other coefficients, we can use substitution to quickly see the approximate values of x_1 and x_2.

\begin{array}{rl}
  x_2 &\approx \frac{4}{2} = 2 \\
      &\hfill\\
  x_1 &\approx \frac{7 + 2}{3} = 3
\end{array}

6.9.3.1. Elimination Without Partial Pivoting

If we don’t first exchange rows, we get the approximate value of x_2 with some round-off error, but fail to find a correct value for x_1.

\spalignaugmat{\epsilon, 2, 4; 3, -1, 7}

Only one elimination step is needed.

\spalignaugmat{\epsilon, 2, 4;
0, {-(1 + \frac{6}{\epsilon})}, {(7 - \frac{12}{\epsilon})} }

\left(1 + \frac{6}{\epsilon}\right) x_2 =
\frac{12}{\epsilon} - 7

We could multiply both sides of the equation by \epsilon to simplify it, or calculate the fractions. In either case, the 1 and 7 terms will be largely lost to round-off error.

x_2 \approx \frac{12}{6} = 2

Our calculation of x_1 is completely wrong. To get the correct answer would require division of two numbers both close to zero and the numerator likely suffers from significant round-off error.

\begin{array}{rl}
  &\epsilon x_1 + 2 x_2 = 4 \\
  &\hfill\\
  &\epsilon x_1 = 4 - 2 x_2 = 0 \\
  &\hfill\\
  &x_1 = \frac{4 - 2 x_2}{\epsilon} \approx \frac{0}{\epsilon} = 0
\end{array}

6.9.3.2. Elimination With Partial Pivoting

Now we begin with the rows exchanged.

\spalignaugmat{3, -1, 7;\epsilon, 2, 4}

Again, only one elimination step is needed.

\spalignaugmat{3, -1, 7;
0, {(2 + \frac{\epsilon}{3})}, {(4 - \frac{7 \epsilon}{3})} }

We will again lose some accuracy to round-off errors, but our small \epsilon values are paired with larger values by addition or subtraction and have minimal impact on the back substitution equations.

\begin{array}{rl}
  &x_2 \approx \frac{4}{2} = 2 \\ &\hfill \\
  &3 x_1 - x_2 = 7 \\ &\hfill \\
  &x_1 \approx \frac{7 + 2}{3} = 3
\end{array}

MATLAB uses partial pivoting in both RREF and LU decomposition to minimize the impact of the round-off error. It will use the small values in the calculation that we rounded to zero, so the results will be more accurate than our approximations.

>> A = [eps 2; 3 -1];  b = [4; 7];  % assumed setup: the example system with epsilon = eps
>> x = A\b
x =
    3.0000
    2.0000
>> isequal(x, [3;2])  % not completely rounded
ans =
  logical
   0
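
The elimination arithmetic can also be carried out by hand in MATLAB with eps standing in for \epsilon. This is only a rough sketch to show the effect of the pivot choice; the exact wrong value produced without pivoting depends on how the intermediate quantities round.

% Hand-coded elimination on the 2-by-2 example, with eps as epsilon (a sketch).
ep = eps;
A = [ep 2; 3 -1];  b = [4; 7];

% Without a row exchange, the tiny value ep is the pivot.
m  = 3/ep;                          % huge multiplier
x2 = (b(2) - m*b(1)) / (-1 - m*2);  % approximately 2
x1 = (b(1) - 2*x2) / ep             % noticeably wrong; the true value is about 3

% With a row exchange, the pivot is 3.
m  = ep/3;                          % tiny multiplier
x2 = (b(1) - m*b(2)) / (2 + m);     % approximately 2
x1 = (b(2) + x2) / 3                % about 3, as expected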

6.9.4. The Case for LU Decomposition

Both MATLAB functions that use elimination (rref and lu) use partial pivoting to improve the accuracy of the results. However, the rref function is not as accurate as lu, so computer programs should use lu to solve systems of equations. The use case for the rref function is data analysis, exploration, and learning. Still, rref is accurate enough for most applications; only very poorly conditioned systems of equations will cause rref to fail to deliver reasonably accurate results.
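
The difference in accuracy can be observed with an informal experiment such as the sketch below, which builds a random square system with a known solution and compares the relative error from the two approaches.

% Informal accuracy comparison of lu (via backslash) and rref (a sketch).
n = 100;
A = randn(n);
xTrue = ones(n, 1);
b = A*xTrue;                     % build a system with a known solution
xLU = A\b;                       % LU decomposition with partial pivoting
R = rref([A b]);                 % Gauss-Jordan elimination
xRR = R(:, end);
norm(xLU - xTrue)/norm(xTrue)    % relative error from lu
norm(xRR - xTrue)/norm(xTrue)    % relative error from rref, typically larger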

As a case study of the strategies that achieve accurate results, we will briefly consider some of the differences between the Gauss-Jordan elimination algorithm used by rref and Turing's kij LU decomposition algorithm. The kij LU algorithm has four noticeable differences from Gauss-Jordan elimination.

  1. The pivot variables are not normalized to one before row operations. The pivot rows are not divided by the pivot variable.

  2. Row operations are not used to force the elements to the right of the diagonal to zero. Rather, with the kij LU algorithm the row operations directly yield both \bf{L} and \bf{U}. Each extra row operation means more additions or subtractions on the row elements and another opportunity to introduce round-off errors.

  3. Matrix elements below each pivot are never actually set to zero by a row operation. The RREF row operations use subtraction to set matrix elements to zero. With the kij LU algorithm, the elements in \bf{U} that are below the diagonal are simply assigned a value of zero. Assigning known values to variables is a sure step toward reducing round-off errors. If the math says that at a certain point in an algorithm a variable should be zero, or any other set value, then consider setting the variable to what it should be rather than relying on a mathematical operation to produce it.

    The corresponding values of the \bf{L} matrix are set by division, not subtraction.

  4. The extra column of the \bm{b} vector is not a part of the elimination procedure, although the \bm{b} vector is used in the forward and back substitution phases of the LU algorithm.

Two steps likely contribute to most of the inaccurate results from Gauss-Jordan elimination. First, the extra row operations that shift the elements to the right of the diagonal to zero give more opportunities for round-off errors to accumulate. In this comparison, one could assert that doing less is sometimes better than doing more. Secondly, relying on subtraction in row operations to set values to zero can cause round-off errors. When the value of a variable is known, explicitly setting the value with an assignment statement is usually preferred to using a mathematical operation to set it.

6.9.5. The Case for the Modified Gram–Schmidt Process

A comparison of the two Gram–Schmidt algorithms presented in The Gram–Schmidt Algorithm demonstrates the value of ordering the operations of an algorithm so that the difference in magnitude between two numbers is minimized before they are added or subtracted.

The classic Gram–Schmidt algorithm is an excellent application of vector projections and is taught in many linear algebra courses. The objective is to find a set of orthogonal basis vectors from the columns of a matrix. In each step, the portion of a column that is not orthogonal to the previously processed columns is subtracted away. The subtractions lead to an algorithm with poor numerical stability. The modified Gram–Schmidt (MGS) process presented in Implementation of Modified Gram–Schmidt offers better numerical stability than the classic Gram–Schmidt (CGS) process presented in The Gram–Schmidt Algorithm. From observations of the source code for the two functions, the key difference is when the subtractions occur. In the CGS process, the algorithm goes column by column, and the components that are not orthogonal to each previous column are subtracted away. Then, after each column is orthogonal to the previous columns, it is normalized to a unit vector. In the MGS process, an outer loop goes down the diagonal of the matrix. The inner loop processes each sub-column below the current diagonal position. The sub-column is first scaled to a unit vector, and then incremental subtractions remove the non-orthogonal components between unit vectors. Improved accuracy is achieved because, before each subtraction occurs, the two vectors involved have comparable magnitudes.
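
An informal experiment makes the difference visible. The loops below are simplified stand-ins for the CGS and MGS functions referenced above, written inline so the sketch is self-contained; loss of orthogonality is measured by norm(Q'*Q - I).

% Compare the loss of orthogonality of CGS and MGS (a sketch).
A = hilb(8);                 % a poorly conditioned test matrix
[m, n] = size(A);

% Classic Gram-Schmidt: subtract projections of the original column.
Qc = zeros(m, n);
for j = 1:n
    v = A(:, j);
    for i = 1:j-1
        v = v - (Qc(:, i)' * A(:, j)) * Qc(:, i);
    end
    Qc(:, j) = v / norm(v);
end

% Modified Gram-Schmidt: normalize first, then subtract between columns
% that have comparable magnitudes.
Qm = A;
for i = 1:n
    Qm(:, i) = Qm(:, i) / norm(Qm(:, i));
    for j = i+1:n
        Qm(:, j) = Qm(:, j) - (Qm(:, i)' * Qm(:, j)) * Qm(:, i);
    end
end

norm(Qc'*Qc - eye(n))   % orthogonality loss with CGS, typically much larger
norm(Qm'*Qm - eye(n))   % orthogonality loss with MGS, typically much smaller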

In the next section, we will look at the QR factoring algorithm that also finds an orthogonal set of basis vectors from the columns of a matrix. It has a completely different strategy than subtracting away the non-orthogonal components from vectors.

6.9.6. Lessons Learned

We have seen three case studies where improvements to algorithm design lead to better numerical stability. Here is a summary of strategies for algorithm design and programming practices that can improve the accuracy of calculations.

  1. Avoid multiplication of large matrices. Remember that matrix multiplication involves sums of products, where round-off errors can accumulate. The condition number of matrices usually grows from matrix multiplication. The product \mathbf{A}^T\,\mathbf{A} becomes important to us for solving rectangular systems of equations and for finding the SVD factors in Singular Value Decomposition (SVD). We will take advantage of orthogonal matrix decompositions made using unitary transformation matrices to avoid the multiplication.
  2. Avoid division by numbers that are very close to zero. As we saw when elimination is attempted without partial pivoting, problems occur when we divide a large number by a number close to zero, and unpredictable results occur when dividing two numbers that are both very close to zero.
  3. Less work is usually better. Use algorithms that perform fewer numeric operations on matrix elements. This is one reason why Gauss-Jordan elimination with the rref function gives less accurate results than the Gaussian elimination performed by LU decomposition.
  4. Let zero be zero. When the math says that a value after a certain operation should be zero, make an assignment statement to set it to zero, or to whatever known quantity the math says it should be. The numeric operation performed by the computer will likely not set it to exactly zero because of round-off errors. To convince yourself that this simple step improves the accuracy of results, try informal experiments with factoring algorithms such as LU decomposition and QR (QR Decomposition).
  5. Order the steps of algorithms such that when two numbers are added or subtracted, the numbers are close to the same value. The modified Gram-Schmidt process improves numerical accuracy by subtracting only between unit-length vectors.

Footnote:

[1]The exception to this is when the matrices are correlated to each other. For example, the pseudo-inverse (Projections Onto a Hyperplane) has the same condition number as the original matrix.