12.4.7. Convergence by Similarity Transforms

Eigenvalue finding algorithms must use an iterative sequence of similarity transformations to drive the matrix toward upper triangular form, where the eigenvalues are on the diagonal. Starting from upper Hessenberg form, the iterations work to make the subdiagonal elements converge toward zero. As each subdiagonal element T_{k, k-1} converges to zero, the diagonal element T_{k,k} converges to eigenvalue \lambda_k. The speed of convergence varies because algorithms use different similarity transformations. Here, we address the basic concept of why repeated applications of similarity transformations cause convergence. The complete explanation of convergence is quite advanced. More advanced numerical analysis texts, such as those by Demmel [DEMMEL97], Watkins [WATKINS10], and Golub and Van Loan [GOLUB13], have extensive coverage of convergence. Watkins makes two notable observations about convergence. First, he states that convergence is not guaranteed, but robust implementations can detect when the algorithm is not progressing toward convergence. When non-convergence is detected, a random shift is applied, which usually shakes things up and puts the algorithm on a path toward convergence. Second, he states that algorithms using similarity transformations and the power method converge for the same reason: changes to the vector subspace.

We can easily see from the power method (The Power Method) that raising each eigenvalue to the power of k causes some terms of the equation to go toward zero while the term for the largest eigenvalue dominates the equation. But there is more going on than a change to the basis vectors. The equations step through a sequence of subspaces with each multiplication by \mathbf{A}. The subspaces, \mathcal{S} \subset \mathbb{R}^n, used by the power method include

\mathcal{S},\, \mathbf{A}\mathcal{S},\, \mathbf{A}^2\mathcal{S},\,
    \mathbf{A}^3\mathcal{S},\, \ldots,\, \text{where } \mathcal{S} = \mathcal{E}_k =
    span\{\bm{e}_1, \bm{e}_2, \ldots, \bm{e}_k\}.

Here, by \mathcal{E}_k we mean the subspace spanned by the first k standard basis vectors, where \bm{e}_i is a standard basis vector with a one in the i\text{th} position and zeros elsewhere.
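For review, here is a minimal sketch of the power method in MATLAB; the test matrix and variable names are illustrative and not taken from earlier chapters. Repeated multiplication by \mathbf{A} pulls the iterate toward the dominant eigenvector, which is the simplest case of the subspace iteration described above.

    % Minimal power method sketch (illustrative setup, not the book's code).
    M = randn(6);
    A = M + M';              % symmetric test matrix, so the eigenvalues are real
    x = randn(6, 1);         % starting vector spanning the initial subspace
    for k = 1:50
        x = A*x;             % move to the next subspace in the sequence
        x = x/norm(x);       % normalize to prevent overflow
    end
    lambda = x'*A*x;         % Rayleigh quotient estimate of the dominant eigenvalue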

As Watkins explains, subspace iteration is how the power method relates to eigenvalue algorithms using similarity transformations [WATKINS11]. But similarity transform algorithms do not change to a subspace derived from powers of \mathbf{A} as the power method does. Rather, in each iteration, the subspace changes from the standard Euclidean subspace to one derived from p(\mathbf{A}) and then back to the Euclidean subspace. The p(\mathbf{A}) function is defined relative to the degree of the iteration, m, and one shift, \rho, per degree. Each shift is an estimate of an eigenvalue near \lambda_k, which is the eigenvalue that we expect to see at position (k,\,k) when the algorithm has converged. The degree is usually one or two, but it could be higher. We are mostly concerned with degree one iterative steps, but the degree two Francis step procedure is presented in Francis Steps of Degree Two.

p(\mathbf{A}) = (\mathbf{A} - \rho_1\,\mathbf{I}) \cdots (\mathbf{A} - \rho_m\,\mathbf{I})

The QR algorithm calculates p(\mathbf{A}) as a matrix, but the Francis algorithm is only concerned with its first column, the column vector p(\mathbf{A})\,\bm{e}_1. The subspaces then include

\mathcal{S},\, p(\mathbf{A})\mathcal{S},\, p(\mathbf{A})^2\mathcal{S},\,
    p(\mathbf{A})^3\mathcal{S},\, \ldots,\, \text{where } \mathcal{S} = \mathcal{E}_k.
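As a brief illustration of why the Francis algorithm only needs the first column, the following hedged sketch computes p(\mathbf{A})\,\bm{e}_1 for a degree-two step; the shift values are placeholders, not Wilkinson shifts. Because \mathbf{A} is upper Hessenberg, the product has at most three nonzero entries.

    % Hedged sketch: first column of p(A) for a degree-two step.
    A = hess(randn(6));                % upper Hessenberg test matrix
    n = size(A, 1);
    rho1 = 0.5;  rho2 = -1.2;          % placeholder shifts, not Wilkinson shifts
    e1 = [1; zeros(n-1, 1)];
    x = (A - rho1*eye(n)) * ((A - rho2*eye(n)) * e1);   % p(A)*e1
    disp(x.')                          % only the first three entries are nonzero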

The unitary transform, \mathbf{Q}, is formed from the orthogonalized columns of p(\mathbf{A}) and applied from the left to \mathbf{A}, giving \mathbf{Q}^H\,\mathbf{A}. This step is a change of the coordinate system from basis vectors \bm{e}_1,\ldots,\bm{e}_n to \bm{q}_1,\ldots,\bm{q}_n, but then we complete the similarity transform (\mathbf{Q}^H\mathbf{A}\,\mathbf{Q}) and the matrix is back to the \bm{e}_1,\ldots,\bm{e}_n basis vectors. Although the matrix is back to the standard coordinate system, the matrix did change. Of course, the eigenvalues are the same, and the matrix is usually still upper Hessenberg [1]. We expect the values on the subdiagonal to have smaller absolute values and those on the diagonal to be closer to eigenvalues. The composite Schur vectors will also have been updated with the columns of \mathbf{Q}. The procedure is repeated as many times as needed with new \rho and p(\mathbf{A}) values until the Schur triangular form is reached.
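The following is a minimal sketch of one explicit degree-one iteration, assuming a symmetric test matrix and a simple shift taken from the (n,\,n) element; the variable names are illustrative and this is not the book's eigQR code. \mathbf{Q} comes from the orthogonalized columns of p(\mathbf{A}), the similarity transform returns the matrix to the standard basis, and the composite Schur vectors are updated with the columns of \mathbf{Q}.

    % One explicit degree-one iteration (hedged sketch, illustrative names).
    M = randn(5);
    A = hess(M + M');               % symmetric test matrix in Hessenberg form
    n = size(A, 1);
    U = eye(n);                     % composite Schur vectors start as the identity
    rho = A(n, n);                  % shift: an estimate of lambda_n
    [Q, ~] = qr(A - rho*eye(n));    % orthogonalized columns of p(A)
    A = Q' * A * Q;                 % similarity transform; eigenvalues unchanged
    U = U * Q;                      % update the composite Schur vectors

Repeating these last three statements with a fresh shift each time gives the basic shifted iteration discussed below.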

In this subspace iteration, the subspace remains fixed (span\{\bm{e}_1,\ldots,\bm{e}_n\}) while the matrix changes. Contrast this with the subspace iteration used by the power method, where the matrix stays fixed and the subspace changes.

Subspaces from powers of \mathbf{A} emphasize the dominant eigenvalue. However, subspaces from powers of p(\mathbf{A}) emphasize the eigenvalue at position (k,\,k), where k is the last column used in the current iteration. The proof of convergence then relies on what happens to a_{k,k-1} in the limit after several iterations of similarity transformations.

Watkins presents the convergence of the Francis algorithm without reference to the QR algorithm in his book on matrix computations [WATKINS10] and in a 2011 journal article [WATKINS11]. The specific details of how subspace iterations allow the Francis algorithm to converge are rather abstract. The more common pedagogy is to explain the convergence of the QR algorithm and then tie the Francis algorithm into the explanation with the implicit Q theorem (Implicit Q Theorem). Observing how convergence occurs with the QR algorithm is a bit easier. So, let's shift our discussion to more pragmatic convergence considerations for both algorithms.

Two aspects of the Francis algorithm require special consideration: the bulge and similarity by Hessenberg factoring. As the bulge travels down the matrix's diagonal, it changes to incorporate the first column of \mathbf{Q} and what it finds in the matrix. The Francis algorithm uses similarity transforms whose immediate purpose is to move the bulge. Restoring the matrix to Hessenberg form is a side effect. The subdiagonal elements are diminished because they are unwitting victims of the bulge-chasing effort. It is more challenging to see exactly how the matrix changes from Hessenberg form to upper triangular because the subdiagonal is reduced by indirect actions. Readers may find it helpful to witness examples of how the Francis algorithm changes a matrix by using the MATLAB debugger to trace its steps.
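To make the bulge concrete, here is a hedged sketch of a single degree-one Francis step using Givens rotations via MATLAB's planerot; the matrix and shift are illustrative, and this is not the book's implementation. The first rotation is determined by p(\mathbf{A})\,\bm{e}_1 and creates the bulge at position (3, 1); the loop then chases it off the bottom of the matrix.

    % Hedged sketch of one degree-one Francis step (illustrative, not the book's code).
    A = hess(randn(6));                          % upper Hessenberg test matrix
    n = size(A, 1);
    rho = A(n, n);                               % simple shift, an estimate of lambda_n
    [G, ~] = planerot([A(1,1) - rho; A(2,1)]);   % rotation determined by p(A)*e1
    A(1:2, :) = G * A(1:2, :);                   % apply from the left
    A(:, 1:2) = A(:, 1:2) * G';                  % apply from the right; bulge appears at (3,1)
    for j = 1:n-2                                % chase the bulge down the diagonal
        [G, ~] = planerot(A(j+1:j+2, j));        % rotation to zero the bulge at (j+2, j)
        A(j+1:j+2, :) = G * A(j+1:j+2, :);
        A(:, j+1:j+2) = A(:, j+1:j+2) * G';      % pushes the bulge to (j+3, j+1), if it exists
    end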

The strategy used by the QR algorithm to reduce the subdiagonal is an obvious contrast to the Francis algorithm's approach. QR temporarily puts the matrix into upper triangular form, and then completing the similarity transform undoes part of that progress. So we have a step forward followed by a half step backward. The forward steps will eventually triumph over the smaller backward steps.
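The forward and backward steps can be watched with a few lines of MATLAB. This is a hedged sketch of the basic shifted QR iteration on a symmetric test matrix, with a simple shift and illustrative names, printing the magnitude of the subdiagonal element at (n,\,n-1) as it converges toward zero.

    % Hedged sketch: basic shifted QR iteration (illustrative, not eigQR).
    M = randn(6);
    A = hess(M + M');                     % symmetric test matrix in Hessenberg form
    n = size(A, 1);
    for iter = 1:25
        rho = A(n, n);                    % simple shift from the (n, n) element
        [Q, R] = qr(A - rho*eye(n));      % step forward: R is upper triangular
        A = R*Q + rho*eye(n);             % half step back: completes Q'*A*Q
        fprintf('iter %2d: |a(n,n-1)| = %.3e\n', iter, abs(A(n, n-1)));
        if abs(A(n, n-1)) < 1e-12         % convergence at position (n, n-1)
            break
        end
    end

With the shift, the last subdiagonal element typically collapses to near zero in only a few iterations, after which deflation would move the work up to the leading (n-1)-by-(n-1) block.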

Here are a few points to consider relative to how these two algorithms converge.

  1. Convergence is reached when \rho = \lambda_k. If the calculated Wilkinson shift from the 2{\times}2 submatrix of rows and columns k-1 and k yields \lambda_k, then the following QR or Francis step will converge to \lambda_k. This can be seen either by working through the similarity transform equations manually or with numeric examples by setting breakpoints in the MATLAB debugger to track the variables [2]. A sketch of the Wilkinson shift calculation follows this list.
  2. Convergence to eigenvalue \lambda_k is marked by a near-zero value at position (k, k-1) on the subdiagonal. If deflation were not used to reduce the matrix, then any similarity transforms beyond this point would be identity matrices. So, it is time to move up the diagonal and work on finding \lambda_{k-1}. Once convergence is complete for all eigenvalues, future \mathbf{Q} matrices will be identity matrices, which, of course, will not change the matrix.
  3. As eigenvalues are found and the algorithm moves up the diagonal with deflation, changes are made to all diagonal and subdiagonal matrix elements. Complex conjugate values move toward each other. When tracking the state of the matrix, perhaps using the debugger, one can see the diagonal values approach the eigenvalues.
  4. Starting from Hessenberg form, the goal of the algorithm is to move the subdiagonal elements to zero, not to place eigenvalues on the diagonal. By the Schur triangularization theorem (Schur Triangularization Theorem), the eigenvalues will be on the diagonal automatically after similarity transforms push the matrix into upper triangular form.
  5. Probability favors convergence. The similarity transforms applied from the left are designed to move matrix elements below the diagonal to zero. Completing the similarity requirement impedes progress, but convergence will eventually win. The left transforms are designed to clear out a column, while the magnitudes of the nonzero values placed below the diagonal by the conjugate transpose of that transform applied from the right are a matter of chance.
  6. The Wilkinson shift is not required to find real eigenvalues of a real matrix, but it always speeds up convergence. A shift is strictly necessary to find complex eigenvalues. The shift speeds up convergence because the shifted value that we seek on the diagonal, \lambda_k - \rho, should be much smaller in magnitude than the shifted values \lambda_i - \rho for i from 1 to k-1. The convergence rate is governed by the ratio of the smallest shifted eigenvalue, |\lambda_k - \rho|, to the next smallest, |\lambda_{k-1} - \rho|; the smaller this ratio, the faster position (k, k-1) converges to zero.
  7. Eigenvalues (\lambda_k) are found when processing column k-1, not column k. That is because the element at position (k, k-1) is set to zero by the Householder or Givens transformation operating on column k-1. In the case of the QR algorithm, the value at position (k, k-1) is not finalized until the similarity transformation is completed with \mathbf{R\, Q}, but the QR decomposition prepares the matrix for convergence.
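As referenced in item 1, here is a hedged sketch of a Wilkinson-style shift computed from the trailing 2{\times}2 submatrix in rows and columns k-1 and k; the function name and interface are illustrative, not the book's code. The shift is the eigenvalue of that submatrix closest to a_{k,k}.

    % Hedged sketch of a Wilkinson-style shift (illustrative name and interface;
    % save as wilkinsonShift.m or place at the end of a script).
    function rho = wilkinsonShift(A, k)
        T = A(k-1:k, k-1:k);                 % trailing 2x2 submatrix
        ev = eig(T);                         % its two eigenvalues
        [~, idx] = min(abs(ev - A(k, k)));   % pick the one closest to a(k,k)
        rho = ev(idx);
    end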
[1] The exceptions are complex eigenvalues with the step of degree two and after convergence to upper triangular form.
[2] Use the eigQR function and set a breakpoint in qrFactor after the Householder transform is applied (R = H*R). Then enter A1 = R*Q in the command window. It may be easier to see what is happening if you start with a symmetric matrix.