Parallel-TEBD

Parallel-TEBD is a version of the TEBD algorithm adapted to run on multiple hosts. The task of parallelizing TEBD can be approached in various ways.

This article introduces the conceptual basis of the implementation, using MPI-based pseudo-code for exemplification, while not restricting itself to MPI; the same basic scheme could be implemented using home-grown messaging routines.

Introduction

The TEBD algorithm is a good candidate for parallel computing because the exponential operators used to calculate the time-evolution factorize under the Suzuki-Trotter expansion. A detailed presentation of the way TEBD works is given in the main article. Here we concern ourselves only with its parallel implementation.

Implementation

For our purposes, we will use the canonical form of the MPS as introduced by Guifré Vidal in his original papers. Hence, we will write the state | \Psi \rangle as:

| \Psi \rangle=\sum\limits_{i_1,\ldots,i_N=1}^{M}\sum\limits_{\alpha_1,\ldots,\alpha_{N-1}=0}^{\chi}\Gamma^{[1]i_1}_{\alpha_1}\lambda^{[1]}_{\alpha_1}\Gamma^{[2]i_2}_{\alpha_1\alpha_2}\lambda^{[2]}_{\alpha_2}\Gamma^{[3]i_3}_{\alpha_2\alpha_3}\lambda^{[3]}_{\alpha_3}\cdots\Gamma^{[{N-1}]i_{N-1}}_{\alpha_{N-2}\alpha_{N-1}}\lambda^{[N-1]}_{\alpha_{N-1}}\Gamma^{[N]i_N}_{\alpha_{N-1}} | {i_1,i_2,\ldots,i_{N-1},i_N} \rangle

This function describes an N-point lattice which we would like to distribute over P compute nodes. Let us suppose, for the sake of simplicity, that N = 2k*P, where k is an integer. This means that if we distribute the lattice points evenly among the compute nodes (the easiest scenario), an even number of lattice points, 2k, is assigned to each compute node. Indexing the lattice points from 0 to N-1 (note that the usual indexing is 1 to N) and the compute nodes from 0 to P-1, the lattice points are distributed among the nodes as follows:

 NODE 0: 0, 1, ..., 2k-1
 NODE 1: 2k, 2k+1, ..., 4k-1
 ...
 NODE m: m*2k, ..., (m+1)*2k - 1
 ...
 NODE P-1: (P-1)*2k, ..., N-1
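
As a minimal sketch of this block assignment, the following MPI program lets each rank report the contiguous range of sites it owns, assuming the N = 2k*P layout above; the helper name local_block and the value of k are purely illustrative.

#include <mpi.h>
#include <stdio.h>

/* Illustrative helper: first and last lattice site owned by a rank,
   assuming N = 2k*P sites split into contiguous blocks of 2k. */
static void local_block(int rank, int k, int *first, int *last)
{
    *first = rank * 2 * k;            /* m*2k         */
    *last  = (rank + 1) * 2 * k - 1;  /* (m+1)*2k - 1 */
}

int main(int argc, char **argv)
{
    int rank, size, k = 4;  /* k = 4 chosen arbitrarily for illustration */
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int first, last;
    local_block(rank, k, &first, &last);
    printf("NODE %d: sites %d .. %d of N = %d\n",
           rank, first, last, 2 * k * size);

    MPI_Finalize();
    return 0;
}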

Using the canonical form of the MPS, we define \lambda^{[l]}_{\alpha_l} as "belonging" to node m if m*2k ≤ l ≤ (m+1)*2k - 1. Similarly, we use the index l to assign the {\Gamma}'s to a certain lattice point. This means that \Gamma^{[0]i_0}_{\alpha_{0}} and \Gamma^{[l]i_l}_{\alpha_{l-1}\alpha_{l}}, l = 1, ..., 2k-1, belong to NODE 0, as well as \lambda^{[l]}_{\alpha_l}, l = 0, ..., 2k-2. A parallel version of TEBD implies that the compute nodes need to exchange information among themselves. The information exchanged consists of the MPS matrices and singular values lying at the border between neighbouring compute nodes. How this is done will be explained below.
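
A minimal sketch of the per-node storage this partitioning implies is given below; the struct and field names are hypothetical, with chi and M denoting the bond and physical dimensions of the MPS written above.

#include <complex.h>

/* Hypothetical per-node container for the 2k local sites.
   Gamma holds the local tensors Gamma^[l]i_l (chi x M x chi values per site)
   and lambda the locally stored singular values (chi values per bond).
   The tensors and singular values lying at the border with a neighbouring
   node are what gets exchanged during the sweeps described below. */
typedef struct {
    int             n_sites;  /* number of local sites, 2k               */
    int             n_bonds;  /* number of locally stored lambda vectors */
    int             chi;      /* bond dimension                          */
    int             M;        /* physical (local Hilbert space) dim      */
    double complex *Gamma;    /* n_sites * chi * M * chi entries         */
    double         *lambda;   /* n_bonds * chi entries                   */
} LocalMPS;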

The TEBD algorithm divides the exponential operator performing the time-evolution into a sequence of two-qubit gates of the form:

 e^{\frac{-i\delta}{\hbar}H_{k,k+1}}.

Setting the reduced Planck constant \hbar to 1, the time evolution is expressed as:

| \Psi(t+\delta) \rangle = e^{{-i\delta}\frac{F}{2}}e^{{-i\delta}G}e^{{-i\delta}\frac{F}{2}}|\Psi(t) \rangle,

where H = F + G,

F \equiv \sum_{k=0}^{\frac{N}{2}-1}(H_{2k,2k+1}) = \sum_{k=0}^{\frac{N}{2}-1}(F_{2k}),

G \equiv \sum_{k=0}^{\frac{N}{2}-2}(H_{2k+1,2k+2}) = \sum_{k=0}^{\frac{N}{2}-2}(G_{2k+1}).
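
For example, with N = 4 lattice points (so that the nearest-neighbour bonds are 0-1, 1-2 and 2-3), the two sums reduce to:

F = H_{0,1} + H_{2,3} = F_{0} + F_{2}, \qquad G = H_{1,2} = G_{1}.

Since the terms inside F act on disjoint pairs of sites, they commute with one another, and likewise for the terms inside G; this is what allows e^{{-i\delta}\frac{F}{2}} and e^{{-i\delta}G} to factorize exactly into independent two-qubit gates.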

What we can explicitly compute in parallel are the sequences of gates e^{{-i}\frac{\delta}{2}F_{2k}} and e^{{-i\delta}{G_{2k+1}}}. Each compute node can apply most of these two-qubit gates without needing information from its neighbours. The compute nodes need to exchange information only at the borders, where a two-qubit gate crosses between two nodes or simply requires data from the other side. We will now consider all three sweeps, two even and one odd, and see what information has to be exchanged, looking at what happens on node m during each sweep.
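
Before looking at the individual sweeps, here is a minimal driver sketch of one time step as three sweeps, assuming hypothetical even_sweep() and odd_sweep() routines that apply the gates named above on each node.

/* Hypothetical sweep routines: even_sweep() applies the
   exp(-i*delta/2 * F_{2k}) gates and odd_sweep() the
   exp(-i*delta * G_{2k+1}) gates on the bonds handled by the local node. */
static void even_sweep(void) { /* ... */ }
static void odd_sweep(void)  { /* ... */ }

/* One second-order Trotter step: half-F sweep, G sweep, half-F sweep. */
static void time_step(void)
{
    even_sweep();
    odd_sweep();
    even_sweep();
}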

First (even) sweep

The sequence of gates that has to be applied in this sweep is:

e^{{-i}\frac{\delta}{2}F_{m*2k}}, e^{{-i}\frac{\delta}{2}F_{m*2k + 2}},...,e^{{-i}\frac{\delta}{2}F_{(m+1)*2k - 2}}

Already for computing the first gate, process m needs information from its lower neighbour, m-1. On the other hand, m does not need anything from its upper neighbour, m+1, because it has all the information required to apply the last gate. So the best strategy for m is to send a request to m-1, postpone the calculation of the first gate until later, and continue with the calculation of the other gates. This is what is known as non-blocking communication. Let us look at this in detail. The tensors involved in the calculation of the first gate are:[1]
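
Leaving the explicit tensors aside, a minimal sketch of this non-blocking pattern using standard MPI calls might look as follows; the routine names, the request/border message layout and border_len (the number of complex values in the border data) are illustrative assumptions, and the code that services the request on node m-1 is omitted.

#include <mpi.h>
#include <complex.h>
#include <stdlib.h>

/* Illustrative placeholders: apply_local_even_gates() applies every gate
   that needs only tensors owned by this node; apply_border_gate() applies
   the postponed first gate once the border data from node m-1 is available. */
static void apply_local_even_gates(void) { /* ... */ }
static void apply_border_gate(const double complex *border) { (void)border; /* ... */ }

enum { TAG_REQUEST = 1, TAG_BORDER = 2 };

static void first_even_sweep(int rank, int border_len)
{
    double complex *border = malloc(border_len * sizeof *border);
    MPI_Request reqs[2];
    int dummy = 0;

    if (rank > 0) {
        /* Non-blocking request to the lower neighbour for its border
           tensors, plus a non-blocking receive for the reply. */
        MPI_Isend(&dummy, 1, MPI_INT, rank - 1, TAG_REQUEST,
                  MPI_COMM_WORLD, &reqs[0]);
        MPI_Irecv(border, border_len, MPI_C_DOUBLE_COMPLEX, rank - 1,
                  TAG_BORDER, MPI_COMM_WORLD, &reqs[1]);
    }

    /* Meanwhile, apply all the gates that involve only local tensors. */
    apply_local_even_gates();

    if (rank > 0) {
        /* Wait for the border data, then apply the postponed first gate. */
        MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);
        apply_border_gate(border);
    }

    free(border);
}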

  1. Guifré Vidal, "Efficient Classical Simulation of Slightly Entangled Quantum Computations", Phys. Rev. Lett. 91, 147902 (2003)