Kaczmarz method

The Kaczmarz method or Kaczmarz's algorithm is an iterative algorithm for solving linear equation systems  A x = b . It was first discovered by the Polish mathematician Stefan Kaczmarz,[1] and was rediscovered in the field of image reconstruction from projections by Richard Gordon, Robert Bender, and Gabor Herman in 1970, where it is called the Algebraic Reconstruction Technique (ART).[2] ART includes the positivity constraint, making it nonlinear.[3]

The Kaczmarz method is applicable to any linear system of equations, but its computational advantage relative to other methods depends on the system being sparse. It has been demonstrated to be superior, in some biomedical imaging applications, to other methods such as the filtered backprojection method.[4]

It has many applications, ranging from computed tomography (CT) to signal processing. It can also be obtained by applying the method of successive projections onto convex sets (POCS) to the hyperplanes described by the linear system.[5][6]

Algorithm 1: Kaczmarz algorithm

Let  A x = b be a linear system, let  m be the number of rows of A, let a_{i} be the  i-th row of the complex-valued matrix A, and let x^{0} be an arbitrary complex-valued initial approximation to the solution of  Ax=b . For  k=0,1,\ldots compute:


  x^{k+1} = x^{k} + \frac{b_{i} - \langle a_{i}, x^{k} \rangle}{\lVert a_{i} \rVert^2} \overline{a_{i}}

where  i = k \,\bmod\, m + 1 and \overline{a_i} denotes complex conjugation of a_i.

If the linear system is consistent,  x^k converges to the minimum-norm solution, provided that the iterations start with the zero vector.
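
The iteration above can be sketched in a few lines of Python. This is a minimal illustration rather than a reference implementation: it assumes a real-valued system (so the complex conjugate drops out), stores A as a list of rows, and uses a hypothetical function name `kaczmarz` and a fixed number of sweeps in place of a stopping criterion.

```python
def kaczmarz(A, b, sweeps=200):
    """Cyclic Kaczmarz iteration for a real system A x = b, started at zero."""
    m, n = len(A), len(A[0])
    x = [0.0] * n  # zero start: a consistent system then gives the minimum-norm solution
    for k in range(sweeps * m):
        i = k % m  # 0-based counterpart of i = k mod m + 1
        row = A[i]
        norm_sq = sum(v * v for v in row)
        if norm_sq == 0.0:
            continue  # a zero row defines no hyperplane to project onto
        residual = b[i] - sum(r * xj for r, xj in zip(row, x))
        step = residual / norm_sq
        x = [xj + step * r for xj, r in zip(x, row)]
    return x


# Square example: 2*x1 + x2 = 3 and x1 + 3*x2 = 5, solved by (0.8, 1.4).
x_square = kaczmarz([[2.0, 1.0], [1.0, 3.0]], [3.0, 5.0])

# Underdetermined example: x1 + x2 = 2 has many solutions; (1, 1) has the smallest norm.
x_min_norm = kaczmarz([[1.0, 1.0]], [2.0])
```

Each pass through the loop touches a single row of A, which is why the method is attractive when the rows are sparse.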

A more general algorithm can be defined using a relaxation parameter  \lambda^k :


  x^{k+1} = x^{k} + \lambda^k \frac{b_{i} - \langle a_{i}, x^{k} \rangle}{\lVert a_{i} \rVert^2} \overline{a_{i}}
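
The effect of the relaxation parameter can be seen on a single step. The sketch below assumes real data and a nonzero row; the helper name `relaxed_step` and the argument `lam` (standing for \lambda^k) are illustrative only. With `lam = 1` the step lands exactly on the hyperplane of the chosen equation, and with `lam = 0.5` it stops halfway, leaving half of the residual.

```python
def relaxed_step(x, row, b_i, lam):
    """One relaxed Kaczmarz step for a real row; lam = 1 is the plain projection."""
    norm_sq = sum(v * v for v in row)
    residual = b_i - sum(r * xj for r, xj in zip(row, x))
    scale = lam * residual / norm_sq
    return [xj + scale * r for xj, r in zip(x, row)]


row, b_i = [3.0, 4.0], 25.0
x_full = relaxed_step([0.0, 0.0], row, b_i, lam=1.0)  # lands on the hyperplane
x_half = relaxed_step([0.0, 0.0], row, b_i, lam=0.5)  # stops halfway there
```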

There are versions of the method that converge to a regularized weighted least squares solution when applied to a system of inconsistent equations and, at least as far as initial behavior is concerned, at a lesser cost than other iterative methods, such as the conjugate gradient method.[7]

Algorithm 2: Randomized Kaczmarz algorithm

In 2009, a randomized version of the Kaczmarz method for overdetermined linear systems was introduced by Strohmer and Vershynin,[8] in which the i-th equation is selected randomly with probability proportional to  \lVert a_{i} \rVert ^2 .

This method can be seen as a particular case of stochastic gradient descent.[9]

Under such circumstances  x_{k} converges exponentially fast to the solution of  Ax=b , and the rate of convergence depends only on the scaled condition number  \kappa(A) .
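
A minimal sketch of the randomized selection rule follows, again assuming a real-valued, consistent system stored as a list of rows. The function name `randomized_kaczmarz`, the iteration count and the fixed seed are choices made for this illustration, not part of the method as published.

```python
import random


def randomized_kaczmarz(A, b, iterations=500, seed=0):
    """Randomized Kaczmarz for a real, consistent system A x = b."""
    rng = random.Random(seed)  # fixed seed only to make the sketch reproducible
    m, n = len(A), len(A[0])
    weights = [sum(v * v for v in row) for row in A]  # squared row norms
    x = [0.0] * n
    for _ in range(iterations):
        # Pick row i with probability ||a_i||^2 / sum_j ||a_j||^2.
        i = rng.choices(range(m), weights=weights, k=1)[0]
        row = A[i]
        residual = b[i] - sum(r * xj for r, xj in zip(row, x))
        step = residual / weights[i]  # a zero-weight row is never selected
        x = [xj + step * r for xj, r in zip(x, row)]
    return x


# Overdetermined but consistent: three equations, exact solution (1, 2).
A = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
b = [1.0, 2.0, 3.0]
x = randomized_kaczmarz(A, b)
```

The only change relative to the cyclic method is how the row index is chosen; the projection step itself is identical.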

Theorem

Let  x be the solution of  Ax=b . Then Algorithm 2 converges to  x in expectation, with the average error satisfying:

 \mathbb E{\lVert x_{k}-x \rVert^2} \leq (1-\kappa(A)^{-2})^{k} \cdot {\lVert x_{0}-x \rVert^2}.
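
As a sanity check on this bound, consider the simplest possible system, A = I_n. The worked example below assumes the definition \kappa(A) = \lVert A \rVert_F \lVert A^{-1} \rVert of the scaled condition number, as in Strohmer and Vershynin's paper; the choice of the identity matrix is only an illustration.

```latex
A = I_n: \quad \lVert a_j \rVert = 1, \quad \lVert A \rVert_F^2 = n, \quad \lVert A^{-1} \rVert = 1
\;\Longrightarrow\; \kappa(A)^2 = n .

% Each step picks a coordinate j uniformly at random and sets it to its exact value,
% so coordinate j is still wrong after k steps with probability (1 - 1/n)^k:
\mathbb E \lVert x_k - x \rVert^2
  = \sum_{j=1}^{n} \left(1 - \tfrac{1}{n}\right)^{k} \lvert (x_0 - x)_j \rvert^2
  = \left(1 - \kappa(A)^{-2}\right)^{k} \lVert x_0 - x \rVert^2 .
```

In this special case the bound of the theorem holds with equality.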

Proof

We have


\sum_{j=1}^{m}|\langle z,a_j \rangle|^2 \geq \frac{\lVert z \rVert^2}{\lVert A^{-1} \rVert^2} \qquad\qquad\qquad\qquad (1)
for all  z \in \mathbb C^n .

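Using the fact that  {\lVert A \rVert^2}=\sum_{j=1}^{m}{\lVert a_j \rVert^2} (so that  \lVert A \rVert here denotes the Frobenius norm) and that  \kappa(A) = \lVert A \rVert \lVert A^{-1} \rVert , we can write (1) as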


\begin{align}
\sum_{j=1}^{m} \frac{{\lVert a_j \rVert^2}}{\lVert A \rVert^2}\left|\left\langle z,\frac {a_j}{\lVert a_j \rVert}\right\rangle \right|^2 \geq \kappa(A)^{-2}{\lVert z \rVert^2} \qquad\qquad\qquad\qquad (2)
\end{align}
for all  z \in \mathbb C^n .

The main point of the proof is to view the left hand side in (2) as an expectation of some random variable. Namely, recall that the solution space of the j-th equation of  Ax=b is the hyperplane  \{y : \langle y,a_j \rangle = b_j\} , whose unit normal is  \frac{a_j}{\lVert a_j \rVert}. Define a random vector Z whose values are the normals to all the equations of  Ax=b , with probabilities as in our algorithm:

 Z=\frac {a_j}{\lVert a_j \rVert} with probability  \frac{\lVert a_j \rVert^2}{\lVert A \rVert^2} \qquad\qquad\qquad j=1,\cdots,m

Then (2) says that


\begin{align}
\mathbb E|\langle z,Z\rangle|^2 \geq\kappa(A)^{-2}{\lVert z \rVert^2} \qquad\qquad (3)
\end{align}
for all  z \in \mathbb C^n .

The orthogonal projection P onto the solution space of a random equation of  Ax=b is given by  Pz= z-\langle z-x, Z\rangle Z.

Now we are ready to analyze our algorithm. We want to show that the error {\lVert x_k-x \rVert^2} reduces at each step on average (conditioned on the previous steps) by at least the factor  (1-\kappa(A)^{-2}). The next approximation  x_k is computed from  x_{k-1} as  x_k= P_kx_{k-1}, where  P_1,P_2,\cdots are independent realizations of the random projection  P. The vector  x_{k-1}-x_k is in the kernel of  P_k. It is orthogonal to the solution space of the equation onto which  P_k projects, which contains the vector  x_k-x (recall that  x is the solution to all equations). The orthogonality of these two vectors then yields  {\lVert x_k-x \rVert^2}={\lVert x_{k-1}-x \rVert^2}-{\lVert x_{k-1}-x_k \rVert^2}. To complete the proof, we have to bound  {\lVert x_{k-1}-x_k \rVert^2} from below. By the definition of  x_k , we have  {\lVert x_{k-1}-x_k \rVert}=|\langle x_{k-1}-x,Z_k\rangle|,

where  Z_1,Z_2,\cdots are independent realizations of the random vector  Z.

Thus  {\lVert x_k-x \rVert^2} = \left(1-\left|\left\langle\frac{x_{k-1}-x}{\lVert x_{k-1}-x \rVert},Z_k\right\rangle\right|^2\right){\lVert x_{k-1}-x \rVert^2}.

Now we take the expectation of both sides conditional upon the choice of the random vectors  Z_1,\cdots,Z_{k-1} (hence we fix the choice of the random projections  P_1,\cdots,P_{k-1} and thus the random vectors  x_1,\cdots,x_{k-1} and we average over the random vector  Z_k ). Then

 \mathbb E_{{Z_1,\cdots,Z_{k-1}}}{\lVert x_k-x \rVert^2} = \left(1-\mathbb E_{{Z_1,\cdots,Z_{k-1}}}\left|\left\langle\frac{x_{k-1}-x}{\lVert x_{k-1}-x \rVert},Z_k\right\rangle\right|^2\right){\lVert x_{k-1}-x \rVert^2}.

By (3) and the independence,

 \mathbb E_{{Z_1,\cdots,Z_{k-1}}}{\lVert x_k-x \rVert^2} \leq (1-\kappa(A)^{-2}){\lVert x_{k-1}-x \rVert^2}.

Taking the full expectation of both sides, we conclude that

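 \mathbb E{\lVert x_k-x \rVert^2} \leq (1-\kappa(A)^{-2})\mathbb E{\lVert x_{k-1}-x \rVert^2}.

Applying this inequality repeatedly, from step k down to step 1, gives the bound stated in the theorem.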

 \blacksquare

The superiority of this selection was illustrated with the reconstruction of a bandlimited function from its nonuniformly spaced sampling values. However, it has been pointed out[10] that the reported success by Strohmer and Vershynin depends on the specific choices that were made there in translating the underlying problem, whose geometrical nature is to find a common point of a set of hyperplanes, into a system of algebraic equations. There will always be legitimate algebraic representations of the underlying problem for which the selection method in [8] will perform in an inferior manner.[8][10][11]

Algorithm 3: Gower-Richtarik algorithm

In 2015, Gower and Richtarik[12] developed a versatile randomized iterative method for solving a consistent system of linear equations  Ax = b, which includes the randomized Kaczmarz algorithm as a special case. Other special cases include randomized coordinate descent, randomized Gaussian descent and the randomized Newton method. Block versions and versions with importance sampling of all these methods also arise as special cases. The method is shown to enjoy an exponential rate of decay in expectation (also known as linear convergence) under very mild conditions on the way randomness enters the algorithm. The Gower-Richtarik method is the first algorithm to uncover a "sibling" relationship between these methods, some of which had been proposed independently before, while many others were new.

Insights about Randomized Kaczmarz

Interesting new insights about the randomized Kaczmarz method that can be gained from the analysis of the method include:

Six Equivalent Formulations

The Gower-Richtarik method enjoys six seemingly different but equivalent formulations, shedding additional light on how to interpret it (and, as a consequence, how to interpret its many variants, including randomized Kaczmarz).

We now describe some of these viewpoints. The method depends on two parameters: a positive definite matrix B, which defines the weighted norm \|x\|_B := \sqrt{x^T B x} used below, and a random matrix S with as many rows as A.

1. Sketch and Project

Given previous iterate x^k, the new point x^{k+1} is computed by drawing a random matrix S (in an iid fashion from some fixed distribution), and setting

x^{k+1} = \arg \min_x \| x - x^k \|_B  \text{ subject to } S^T A x = S^T b.

That is, x^{k+1} is obtained as the projection of x^k onto the randomly sketched system S^T Ax = S^T b. The idea behind this method is to pick S in such a way that a projection onto the sketched system is substantially simpler than the solution of the original system Ax=b. The randomized Kaczmarz method is obtained by picking  B to be the identity matrix, and S to be the i^{th} unit coordinate vector with probability p_i = \|a_i\|^2_2/\|A\|_F^2. Different choices of B and S lead to different variants of the method.

2. Constrain and Approximate

A seemingly different but entirely equivalent formulation of the method (obtained via Lagrangian duality) is

 x^{k+1} = \arg \min_x \|x - x^*\|_B \text{ subject to } x = x^k + B^{-1}A^T S y,

where y is also allowed to vary, and where x^* is any solution of the system Ax=b. Hence, x^{k+1} is obtained by first constraining the update to the linear subspace spanned by the columns of the random matrix  B^{-1}A^T S , i.e., to

 \{ h \;:\; h = B^{-1} A^T S y, \quad y \text{ can vary } \},

and then choosing the point x from this subspace which best approximates x^* . This formulation may look surprising, as it seems impossible to perform the approximation step when x^* is not known (after all, this is what we are trying to compute!). However, it is still possible to do this, simply because x^{k+1} computed this way is the same as x^{k+1} computed via the sketch and project formulation, and x^* does not appear there.

5. Random Update

The update can also be written explicitly as

 x^{k+1} = x^k - B^{-1}A^T S (S^T A B^{-1}A^T S)^{\dagger} S^T (Ax^k - b),

where by M^\dagger we denote the Moore-Penrose pseudo-inverse of matrix M. Hence, the method can be written in the form x^{k+1}=x^k + h^k , where  h^k is a random update vector.
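
For instance, the choice B = I and S = e_i named in the sketch and project formulation can be substituted directly into this formula. The short calculation below is a sketch for a real-valued A with nonzero rows, where e_i denotes the i-th unit coordinate vector and a_i the i-th row of A, written as a row vector.

```latex
S^T A = a_i, \qquad
S^T A B^{-1} A^T S = a_i a_i^T = \lVert a_i \rVert^2, \qquad
S^T (A x^k - b) = \langle a_i, x^k \rangle - b_i ,

x^{k+1}
  = x^k - a_i^T \, \frac{\langle a_i, x^k \rangle - b_i}{\lVert a_i \rVert^2}
  = x^k + \frac{b_i - \langle a_i, x^k \rangle}{\lVert a_i \rVert^2} \, a_i^T .
```

This is the Kaczmarz step of Algorithm 1 for real data, with the row index i drawn at random.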

Letting  M = S^T A B^{-1}A^T S, it can be shown that the system M y = S^T (Ax^k - b) always has a solution y^k, and that for all such solutions the vector x^k - B^{-1} A^T S y^k is the same. Hence, it does not matter which of these solutions is chosen, and the method can also be written as x^{k+1} = x^k - B^{-1}A^T S y^k . The pseudo-inverse leads just to one particular solution. The role of the pseudo-inverse is twofold:

6. Random Fixed Point

If we subtract x^* from both sides of the random update formula, denote  Z := A^T S (S^T A B^{-1} A^T S)^\dagger S^T A, and use the fact that  Ax^* = b, we arrive at the last formulation:

 x^{k+1} - x^* = (I - B^{-1}Z) (x^k - x^*),

where I is the identity matrix. The iteration matrix,  I- B^{-1}Z, is random, whence the name of this formulation.

Convergence

By taking conditional expectations in the 6th formulation (conditional on x^k), we obtain

 \mathbb{E} [x^{k+1}-x^* \;|\; x^k] = (I - B^{-1}\mathbb{E}[Z]) [x^k - x^*] .

By taking expectation again, and using the tower property of expectations, we obtain

 \mathbb{E} [x^{k+1}-x^*] = (I - B^{-1}\mathbb{E}[Z]) \mathbb{E}[x^k - x^*] .

Gower and Richtarik[12] show that \rho := \|I-B^{-1}\mathbb{E}[Z]\|_B = \lambda_{\max}(I - B^{-1}\mathbb{E}[Z]), where the matrix norm is defined by  \|M\|_B := \max_{x\neq 0} \frac{\|Mx\|_B}{\|x\|_B}. Moreover, without any assumptions on S one has 0\leq \rho \leq 1. By taking norms and unrolling the recurrence, we obtain

Theorem [Gower & Richtarik 2015]

 \| \mathbb{E} [x^{k}-x^*] \|_B  \leq \rho^k \|x^0 - x^*\|_B.

Remark: A sufficient condition for the expected residuals to converge to 0 is \rho<1. This can be achieved if A has a full column rank and under very mild conditions on S. Convergence of the method can be established also without the full column rank assumption in a different way.

It is also possible to show a stronger result:

Theorem [Gower & Richtarik 2015]

The expected squared norms (rather than norms of expectations) converge at the same rate:  \mathbb{E} \|x^{k}-x^*\|^2_B  \leq \rho^k \|x^0 - x^*\|^2_B.

Remark: This second type of convergence is stronger due to the following identity [12] which holds for any random vector x and any fixed vector x^*:

 \|\mathbb{E}[x - x^*] \|^2 = \mathbb{E}[\|x-x^*\|^2] - \mathbb{E}[\|x-\mathbb{E}[x]\|^2].

Convergence of Randomized Kaczmarz

We have seen that the randomized Kaczmarz method appears as a special case of the Gower-Richtarik method for B=I and S being the i^{th} unit coordinate vector with probability p_i = \|a_i\|_2^2/\|A\|_F^2, where a_i is the i^{th} row of A. It can be checked by direct calculation that \rho = \|I-B^{-1}\mathbb{E}[Z]\|_B = 1 - \frac{\lambda_{\min}(A^T A)}{\|A\|_F^2}.
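
The direct calculation can be sketched as follows, assuming a real-valued A with nonzero rows and writing a_i as a row vector, so that a_i^T a_i is an n-by-n matrix.

```latex
Z = A^T e_i (e_i^T A A^T e_i)^{\dagger} e_i^T A = \frac{a_i^T a_i}{\lVert a_i \rVert_2^2}
\quad \text{with probability } p_i = \frac{\lVert a_i \rVert_2^2}{\lVert A \rVert_F^2},

\mathbb{E}[Z] = \sum_{i=1}^{m} \frac{\lVert a_i \rVert_2^2}{\lVert A \rVert_F^2} \cdot \frac{a_i^T a_i}{\lVert a_i \rVert_2^2}
             = \frac{1}{\lVert A \rVert_F^2} \sum_{i=1}^{m} a_i^T a_i
             = \frac{A^T A}{\lVert A \rVert_F^2},

\rho = \lambda_{\max}\!\left(I - \frac{A^T A}{\lVert A \rVert_F^2}\right)
     = 1 - \frac{\lambda_{\min}(A^T A)}{\lVert A \rVert_F^2}.
```

As an illustrative example, for the matrix A with rows (1, 0), (0, 1) and (1, 1), A^T A has eigenvalues 1 and 3 and \|A\|_F^2 = 4, so \rho = 3/4.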

This article is issued from Wikipedia - version of the Wednesday, April 20, 2016. The text is available under the Creative Commons Attribution/Share Alike but additional terms may apply for the media files.