.. _num-stability:

Numerical Stability
===================

Here we will consider the potential accuracy of solutions to systems of
equations of the form :math:`\mathbf{A}\,\bm{x} = \bm{b}`. In idealistic
mathematical notation, we write the solution to a square matrix system of
equations as :math:`\bm{x} = \mathbf{A}^{-1}\bm{b}`. However, the path toward
finding the true :math:`\bm{x}` is fraught with potential problems, and it may
be difficult to find an accurate solution. It might be that the values in
:math:`\bm{b}` and :math:`\bf{A}` are not accurate because they came from an
experiment where noise in the data is unavoidable. It might be that the rows
of :math:`\bf{A}` are close to parallel, meaning that :math:`\bf{A}` is nearly
singular, which is also called *poorly conditioned*. It might also be that the
absolute values of the numbers in :math:`\bm{b}` and :math:`\bf{A}` are likely
to incur round-off errors because they contain a mixture of very large
numbers, very small numbers, and even some irrational numbers and numbers
that, in binary, contain unresolved repeating patterns of 1's and 0's.

Using an algorithm that is not numerically stable leads to inaccurate solution
estimates. Numerical stability is a property of the algorithm rather than of
the problem being solved. Even the best algorithms might not find accurate
results in the worst conditions, but numerically stable algorithms find
accurate results for most problems. Numerically stable algorithms order the
operations on numbers to minimize round-off errors. They also avoid
computations that tend to degrade the condition of matrices. For example, our
idealistic mathematical solution to an over-determined system of equations
calls for three matrix multiplications, a matrix inverse, and a matrix–vector
multiplication. We will see in this section that such a sequence of operations
is not advisable. We already know that the matrix inverse is slow. The matrix
multiplications can also introduce round-off errors.

After reviewing the source of round-off errors and the condition number of
matrices, we will look at three case studies where modified algorithms led to
improved numerical stability. Then, in the next section, we will see how
orthogonal matrix factoring provides the tools needed for numerically stable
solutions to rectangular systems of equations.

.. index:: numerical stability

.. _roundOffErrors:

Round-off Errors
----------------

Round-off errors usually occur when adding or subtracting a small number from
a large number. The source of the degraded precision is the finite number of
binary bits used to hold the digits of floating-point numbers. Addition or
subtraction can cause the less significant binary digits of small numbers to
be removed. When the precision of a number is reduced, a round-off error is
introduced into the calculations.

The *IEEE Standard for Binary Floating-Point Arithmetic* [IEEE85]_ specifies
the format of 32-bit single precision and 64-bit double precision floating
point numbers. We will focus on the double precision format because most
scientific numeric calculations use double data type variables that correspond
to the double precision standard.

Floating point numbers are stored in a format similar to scientific notation,
except that the binary number system (base 2) is used instead of decimal. As
shown in equation :eq:`eq-floatnums`, the binary numbers are normalized with
adjustments to the exponent so that there is a single 1 to the left of the
binary point.
The binary digits of the number are called the mantissa. Since the digit to
the left of the binary point is always 1, it is not necessary to store that 1.
The digits to the right of the binary point are called the *significand*. As
shown in figure :numref:`fig-doubleFormat`, double precision numbers store 52
binary digits of the significand. In equation :eq:`eq-floatnums`, the
significand is the binary digits :math:`b_{51}b_{50}\ldots b_0`. Including the
1 to the left of the binary point, the level of available precision in double
data type variables is :math:`u = 2^{-53} \approx 1.11 \times 10^{-16}`
[IEEE85]_.

.. math::
    :label: eq-floatnums

    \begin{array}{l}
        (-1)^{sign} \times (1.b_{51}b_{50}\ldots b_0)_2 \times 2^{exp - 1023}\\
        \text{or} \\
        mantissa \times 2^{exponent}
    \end{array}

.. _fig-doubleFormat:

.. figure:: doubleFormat.png
    :figclass: center-caption
    :align: center
    :width: 60%
    :alt: The format of 64-bit double precision floating point numbers.

    The format of 64-bit double precision floating point numbers.

Let :math:`x_1 = mantissa_1 \times 2^{exponent_1}` and let
:math:`x_2 = mantissa_2 \times 2^{exponent_2}`. Further, let :math:`x_1` be
much greater than :math:`x_2` (:math:`x_1 \gg x_2`). The exponents are added
in multiplication,

.. math::

    x_1 \times x_2 = mantissa_1 \times mantissa_2 \times
        2^{(exponent_1 + exponent_2)}.

Division is a similar process, except the exponents are subtracted.
Multiplication and division are immune to numerical problems except in the
rare event of overflow or underflow. Addition and subtraction require that the
exponents match.

.. math::

    \begin{split}
        x_1 \,\pm\, x_2 &= mantissa_1 \times 2^{exponent_1} \pm \\
        & \quad \left(mantissa_2 \times 2^{(exponent_2 - exponent_1)}\right)
        \times 2^{exponent_1}.
    \end{split}

The expression
:math:`\left(mantissa_2 \times 2^{(exponent_2 - exponent_1)}\right)`, where
:math:`exponent_1 \gg exponent_2`, means that the binary digits of
:math:`mantissa_2` are shifted to the right :math:`exponent_1 - exponent_2`
positions. Any digits of :math:`mantissa_2` that are shifted beyond the 52
positions of the significand are lost, and a round-off error is introduced
into :math:`x_2`.

Many numbers have 1s in only a few positions to the right of the binary point.
These numbers are not likely to lose precision to round-off error. But
irrational numbers such as :math:`\pi` or :math:`\sqrt{2}` fill out the full
52 bits of the significand. They can lose precision from even a single-digit
shift to the right, which happens when they are added to a number only twice
their value. Just as a fraction such as :math:`1/3` can hold infinitely
repeating digits in decimal (:math:`0.333333 \ldots`), some numbers have
repeating digits in binary. For example,
:math:`(1.2)_{10} = (1.\underline{0011}0011\underline{0011}\ldots)_2`. A small
round-off error occurs when a number as small as 2.4 is added to 1.2.
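These round-off effects are easy to observe directly in a few lines of Python,
whose ``float`` values use the same IEEE double precision format on nearly all
platforms. The printed results below assume the default round-to-nearest
arithmetic, so they are illustrative rather than guaranteed on every system. ::

    # u = 2^-53 is the precision limit: adding u (or anything smaller) to 1.0
    # has no effect at all.
    u = 2.0**-53
    print(1.0 + u == 1.0)         # True: u is rounded away entirely
    print(1.0 + 2*u == 1.0)       # False: 2^-52 is the smallest step above 1.0

    # 1.2 has a repeating bit pattern that fills the significand, so even
    # adding 2.4 shifts one bit off the end and changes the recovered value.
    a = 1.2
    print((a + 2.4) - 2.4 == a)   # False: a low-order bit of a was lost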
The example in figure :numref:`fig-roundoffError` shows the round-off error
when 1.2 is added to successive powers of 2. When the variable ``x`` is small,
the variable ``y`` is close to zero. But as ``x`` gets larger, variable ``a``
is increasingly lost to round-off error in ``x + a``, which causes variable
``y`` to converge to one. A logarithmic scale on the :math:`y` axis of the
plot is needed to observe the loss of precision. The stair-step shape of the
plot comes from ``a`` holding a repeating pattern of the binary digits 0011.
The precision of ``a`` is diminished when a binary 1 is lost to round-off
error, but no damage to the precision of ``a`` occurs when a binary 0 is lost.
The plot shows an interesting progression of the round-off error when a number
whose binary digits fill the entire significand is placed in the variable
``a``. What do you think the plot will look like when a simple number like 1.5
is placed in ``a``? ::

    ex = 1:60;
    x = 2.^ex;                  % successive powers of 2
    a = 1.2;
    y = abs(x + a - x - a)/a;   % relative error in recovering a
    y(y < eps) = eps;           % floor at machine epsilon for the log plot
    semilogy(ex, y)
    axis tight

.. _fig-roundoffError:

.. figure:: roundoffError.png
    :figclass: center-caption
    :align: center
    :width: 60%
    :alt: The number a = 1.2 fills the full 52-bit significand allowed for
          double precision floating point numbers. As the number added to a
          gets larger, more of a is lost to round-off error.

    The number ``a = 1.2`` fills the full 52-bit significand allowed for
    double precision floating point numbers. As the number added to ``a``
    gets larger, more of ``a`` is lost to round-off error.

.. index:: round-off errors

.. _conditionNum:

Matrix Condition Number
-----------------------

Another problem besides round-off errors may degrade the accuracy of solutions
to systems of equations. The symptom of the problem is an over-sensitivity to
variations of the elements of :math:`\bm{b}` when the system is nearly
singular, which is also called poorly conditioned or ill-conditioned. If the
coefficients are taken from experiments, they likely contain a random
component. We say there are perturbations to :math:`\bm{b}`. In many
applications, :math:`\bf{A}` is a fixed control matrix, so errors are more
likely to be seen in :math:`\bm{b}`.

Errors in :math:`\bm{b}`, represented as :math:`\delta\bm{b}`, are conveyed to
the solution :math:`\bm{x}`. The matrix equation
:math:`\mathbf{A}\,\bm{x} = \bm{b}` becomes
:math:`\mathbf{A}\,\hat{\bm{x}} = \bm{b} + \delta\bm{b}`. Our estimate of the
correct value of :math:`\bm{x}` is
:math:`\hat{\bm{x}} = \bm{x} + \bm{\delta x}`.

.. math::

    \begin{aligned}
        \mathbf{A}\,(\bm{x} + \bm{\delta x}) &= \bm{b} + \bm{\delta b} \\
        \mathbf{A}\,\bm{x} + \mathbf{A}\,\bm{\delta x} &= \bm{b} + \bm{\delta b} \\
        \mathbf{A}\,\bm{x} = \bm{b}, \quad &\text{and} \quad
            \mathbf{A}\,\bm{\delta x} = \bm{\delta b}
    \end{aligned}

We care about the magnitudes of :math:`\bm{\delta x}` and :math:`\bm{\delta b}`
relative to their true values. We want to find the length (norm) of each error
vector relative to its unperturbed length,
:math:`\norm{\bm{\delta x}} / \norm{\bm{x}}` and
:math:`\norm{\bm{\delta b}} / \norm{\bm{b}}`. From
:math:`\mathbf{A}\,\bm{\delta x} = \bm{\delta b}` we have
:math:`\bm{\delta x} = \mathbf{A}^{-1}\bm{\delta b}`, and from
:math:`\mathbf{A}\,\bm{x} = \bm{b}` we have
:math:`\norm{\bm{b}} \leq \norm{\mathbf{A}}\,\norm{\bm{x}}`. We then have the
following two inequalities because the norm of a product is less than or equal
to the product of the norms of each variable.

.. math::
    :label: eq-conNum1

    \norm{\bm{\delta x}} \leq \norm{\mathbf{A}^{-1}}\, \norm{\bm{\delta b}}
    \qquad
    \frac{1}{\norm{\bm{x}}} \leq \norm{\mathbf{A}}\,\frac{1}{\norm{\bm{b}}}

Multiplying the two inequalities of equation :eq:`eq-conNum1` leads to
equation :eq:`eq-conNum2`.

.. math::
    :label: eq-conNum2

    \frac{\norm{\bm{\delta x}}}{\norm{\bm{x}}} \leq
    \norm{\mathbf{A}} \norm{\mathbf{A}^{-1}}\,
    \frac{\norm{\bm{\delta b}}}{\norm{\bm{b}}}

Equation :eq:`eq-conNum2` defines the *condition number* of
:math:`\mathbf{A}`, which is a metric that relates the error in :math:`\bm{x}`
to the error in :math:`\bm{b}`. The condition number measures how well or
poorly conditioned a matrix is. The condition of :math:`\bf{A}` is denoted as
:math:`\kappa({\mathbf{A}})` [WATKINS10]_.

.. math::

    \frac{\norm{\bm{\delta x}}}{\norm{\bm{x}}} \leq \kappa(\mathbf{A})\,
    \frac{\norm{\bm{\delta b}}}{\norm{\bm{b}}}

.. math::
    :label: eq-conNum

    \kappa(\mathbf{A}) = \text{cond}(\mathbf{A}) =
    \norm{\mathbf{A}}\norm{\mathbf{A}^{-1}}
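This definition is easy to check numerically. The following Python sketch
evaluates equation :eq:`eq-conNum` with the :math:`\norm{\cdot}_2` matrix norm
and compares it to NumPy's ``cond`` function; the matrix used here is just an
arbitrary invertible example. ::

    import numpy as np

    A = np.array([[4., 7.], [2., 6.]])    # arbitrary invertible example matrix

    kappa_def = np.linalg.norm(A, 2) * np.linalg.norm(np.linalg.inv(A), 2)
    kappa_cond = np.linalg.cond(A)        # 2-norm condition number by default

    print(np.isclose(kappa_def, kappa_cond))   # True
    print(np.linalg.cond(np.eye(3)))           # identity matrix: 1.0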
Matrices that are close to singular have large condition numbers. If
:math:`\kappa(\mathbf{A})` is small, then small errors in :math:`\bm{b}`
result in only small errors in :math:`\bm{x}`, whereas if
:math:`\kappa(\mathbf{A})` is large, then even small perturbations in
:math:`\bm{b}` can result in a large error in :math:`\bm{x}`. The condition
number of the identity matrix or of any orthogonal matrix is 1, but nearly
singular matrices can have condition numbers of several thousand to even over
a million.

.. index:: cond

The ``cond`` function in Python's ``numpy.linalg`` module uses the
:math:`\norm{\cdot}_2` matrix norm by default, but may also compute the
condition number using other matrix norms defined in :ref:`matNorms`. MATLAB
calculates the condition number using the singular values of the matrix as
described in :ref:`condSVD`. Using the singular values raises a concern about
division by zero, so MATLAB calculates the reciprocal of the condition number,
which it calls RCOND. Linear system solvers check the condition of a matrix
before attempting to find a solution.

A handy rule of thumb for interpreting the condition number is that if
:math:`\kappa(\mathbf{A}) \approx 0.d \times 10^k`, then the solution
:math:`\bm{x}` can be expected to have :math:`k` fewer digits of accuracy than
the elements of :math:`\bf{A}` [WILLIAMS14]_. Thus, if
:math:`\kappa(\mathbf{A}) = 10 = 0.1 \times 10^1` and the elements of
:math:`\bf{A}` are given with six digits, then the solution in :math:`\bm{x}`
can be trusted to five digits of accuracy.

Here is an example to illustrate the problem. To be confident of the correct
answer, we begin with a well-conditioned system.

.. math::
    :label: eq-goodsys

    \left\{\begin{aligned}
        17\,x_1 - 2\,x_2 &= -3 \\
        -13\,x_1 + 29\,x_2 &= 29
    \end{aligned}\right.

It is easily confirmed with MATLAB that the solution is
:math:`x_1 = -0.0621, x_2 = 0.9722`. We assert that the system of equations in
equation :eq:`eq-goodsys` is *well-conditioned* based on two pieces of
information. First, the *condition number* of :math:`\bf{A}` is relatively
small (2.3678). Second, minor changes to the equation have a minor impact on
the solution. For example, if we add 1% of :math:`b_2` to itself, we get a new
solution of :math:`x_1 = -0.0609, x_2 = 0.9827`, which is a change of 2% to
the solution :math:`x_1` and 1.09% to :math:`x_2`.

Now, to illustrate a poorly conditioned system, we replace the first equation
of equation :eq:`eq-goodsys` with the sum of the second equation and only
:math:`1/100` of the original first equation.

.. math::
    :label: eq-badsys

    \left\{\begin{aligned}
        -12.83\,x_1 + 28.98\,x_2 &= 28.97 \\
        -13\,x_1 + 29\,x_2 &= 29
    \end{aligned}\right.

Perhaps surprisingly, both LU decomposition and RREF can find the correct
solution to equation :eq:`eq-badsys`. One can see that equation
:eq:`eq-badsys` is nearly singular because the two equations are close to
parallel. The condition number is large (431). If we again add 1% of
:math:`b_2` to itself, we get a new solution of
:math:`x_1 = -1.8617, x_2 = 0.1754`, which is a change of 2,898% to the
:math:`x_1` solution and 81.96% to the :math:`x_2` solution.
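The perturbation experiment above is easy to reproduce with NumPy. The sketch
below solves each system, adds 1% of :math:`b_2` to itself, and reports the
resulting percentage change in the solution; the printed condition numbers and
percentages should approximately match the rounded values quoted in the
text. ::

    import numpy as np

    # well-conditioned system (eq-goodsys) and poorly conditioned system (eq-badsys)
    systems = [
        (np.array([[17., -2.], [-13., 29.]]), np.array([-3., 29.])),
        (np.array([[-12.83, 28.98], [-13., 29.]]), np.array([28.97, 29.])),
    ]

    for A, b in systems:
        x = np.linalg.solve(A, b)
        b_pert = b.copy()
        b_pert[1] *= 1.01                    # add 1% of b_2 to itself
        x_pert = np.linalg.solve(A, b_pert)
        change = 100 * np.abs((x_pert - x) / x)
        print(f"cond = {np.linalg.cond(A):8.2f}   % change in x = {change}")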
Another important result related to the condition number of matrices is that
the condition number of the product of two uncorrelated matrices is usually
greater than the condition number of either individual matrix, and is at most
equal to the product of the condition numbers of the individual matrices.

.. math::

    \kappa(\mathbf{A\,B}) = \norm{\mathbf{A\,B}}\,\norm{(\mathbf{A\,B})^{-1}}
        \leq \norm{\mathbf{A}}\, \norm{\mathbf{B}}\,
        \norm{\mathbf{B}^{-1}}\, \norm{\mathbf{A}^{-1}}
        = \kappa(\mathbf{A})\,\kappa(\mathbf{B})

In the next section (:ref:`orthogonal`), we multiply orthogonal matrices
without increasing the condition number of the product because the condition
number of an orthogonal matrix is 1.

.. index:: condition number

.. index:: nearly singular matrix

.. index:: rcond number

.. index:: poorly conditioned matrix

.. index:: ill-conditioned matrix

.. _pivoting:

The Case for Partial Pivoting
-----------------------------

For numerical stability, very small numbers should not be used as pivot
variables in the elimination procedure. Each time the pivot is advanced, the
remaining rows are reordered so that the pivot variable has the largest
possible absolute value. The new row order is documented with a permutation
matrix to associate results with their unknown variable. Reordering rows, or
row exchanges, is also called partial pivoting. Full pivoting is when columns
are also reordered.

.. index:: eps

Here is an example showing why partial pivoting is needed for numerical
stability. The example uses a :math:`2{\times}2` matrix where round-off error
is easy to see. The :math:`\epsilon` value used in the example is taken to be
a very small number. Remember that round-off errors propagate and build when
solving a larger system of equations. Without pivoting, we divide potentially
large numbers by :math:`\epsilon`, producing even larger numbers. Arranging
the algebra differently can avoid the large numbers, but then small numbers
appear in different places. In either case, during the row operation phase of
elimination, we add or subtract numbers of very different values, resulting in
round-off errors. The accumulated round-off errors become especially
problematic in the substitution phase, where the denominator of a division
might be near zero.

.. math::

    \left\{\begin{aligned}
        \epsilon \,x_1 + 2\,x_2 &= 4 \\
        3\,x_1 - x_2 &= 7
    \end{aligned}\right.

.. math::

    \begin{bmatrix} \epsilon & 2 \\ 3 & -1 \end{bmatrix}
    \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} =
    \begin{bmatrix} 4 \\ 7 \end{bmatrix}

Regarding :math:`\epsilon` as nearly zero relative to the other coefficients,
we can use substitution to quickly see the approximate values of :math:`x_1`
and :math:`x_2`.

.. math::

    \begin{array}{rl}
        x_2 &\approx \frac{4}{2} = 2 \\
        &\hfill\\
        x_1 &\approx \frac{7 + 2}{3} = 3
    \end{array}

.. index:: row exchanges

.. index:: partial pivoting

Elimination Without Partial Pivoting
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

If we don't first exchange rows, we get the approximate value of :math:`x_2`
with some round-off error, but fail to find a correct value for :math:`x_1`.

.. math::

    \begin{bmatrix} \epsilon & 2 & 4 \\ 3 & -1 & 7 \end{bmatrix}

Only one elimination step is needed.

.. math::

    \begin{bmatrix} \epsilon & 2 & 4 \\
        0 & {-(1 + \frac{6}{\epsilon})} & {(7 - \frac{12}{\epsilon})}
    \end{bmatrix}

.. math::

    \left(1 + \frac{6}{\epsilon}\right) x_2 = \frac{12}{\epsilon} - 7

We could multiply both sides of the equation by :math:`\epsilon` to simplify
it, or calculate the fractions. In either case, the 1 and 7 terms will be
largely lost to round-off error.

.. math::

    x_2 \approx \frac{12}{6} = 2

Our calculation of :math:`x_1` is completely wrong. Getting the correct answer
requires dividing two numbers that are close to zero.

.. math::

    \begin{array}{rl}
        &\epsilon x_1 + 2 x_2 = 4 \\
        &\hfill\\
        &\epsilon x_1 = 4 - 2 x_2 = 0 \\
        &\hfill\\
        &x_1 = \frac{4 - 2 x_2}{\epsilon} \approx \frac{0}{\epsilon} = 0
    \end{array}
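The failure is easy to demonstrate in floating point arithmetic. Here is a
minimal Python sketch of the elimination steps above, written without a row
exchange. The exact digits printed depend on the platform's rounding, but
:math:`x_1` typically loses nearly all of its significant digits while
:math:`x_2` stays accurate. ::

    import numpy as np

    eps = 2.5e-16
    A = np.array([[eps, 2.], [3., -1.]])
    b = np.array([4., 7.])

    m = A[1, 0] / A[0, 0]            # huge multiplier: 3/eps
    a22 = A[1, 1] - m * A[0, 1]      # the -1 is mostly lost next to -6/eps
    b2 = b[1] - m * b[0]             # the 7 is mostly lost next to -12/eps
    x2 = b2 / a22                    # close to 2, as expected
    x1 = (b[0] - A[0, 1] * x2) / A[0, 0]   # divides two numbers near zero

    print(x1, x2)
    print(np.linalg.solve(A, b))     # the pivoted solver returns about [3, 2]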
Elimination With Partial Pivoting
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Now we begin with the rows exchanged.

.. math::

    \begin{bmatrix} 3 & -1 & 7 \\ \epsilon & 2 & 4 \end{bmatrix}

Again, only one elimination step is needed.

.. math::

    \begin{bmatrix} 3 & -1 & 7 \\
        0 & {(2 + \frac{\epsilon}{3})} & {(4 - \frac{7 \epsilon}{3})}
    \end{bmatrix}

We will again lose some accuracy to round-off errors, but now the small
:math:`\epsilon` terms are paired with larger values in the additions and
subtractions, so they have minimal impact on the back substitution equations.

.. math::

    \begin{array}{rl}
        &x_2 \approx \frac{4}{2} = 2 \\
        &\hfill \\
        &3 x_1 - x_2 = 7 \\
        &\hfill \\
        &x_1 \approx \frac{7 + 2}{3} = 3
    \end{array}

Both RREF and LU decomposition use partial pivoting. They keep the small
values that we rounded to zero in the hand calculation, so their results are
more accurate than our approximations. ::

    In [5]: import numpy as np

    In [6]: eps = 2.5e-16

    In [7]: A = np.array([[3, -1], [eps, 2]])

    In [8]: b = np.array([7, 4], dtype=float)

    In [9]: x = np.linalg.solve(A, b); x
    Out[9]: array([3., 2.])  # the display is rounded; x is not exactly [3, 2]

    In [10]: np.all(x == np.array([3, 2]))
    Out[10]: np.False_
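The row exchange is visible if we ask for the LU factorization explicitly.
This sketch assumes SciPy is available; ``scipy.linalg.lu`` returns a
permutation matrix :math:`\mathbf{P}` such that
:math:`\mathbf{A} = \mathbf{P}\,\mathbf{L}\,\mathbf{U}`. ::

    import numpy as np
    from scipy.linalg import lu

    eps = 2.5e-16
    A = np.array([[eps, 2.], [3., -1.]])

    P, L, U = lu(A)
    print(P)           # rows were exchanged: P = [[0, 1], [1, 0]]
    print(U[0, 0])     # the pivot is 3, not eps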