Probalign

Probalign is a sequence alignment tool that calculates a maximum expected accuracy alignment using partition function posterior probabilities.[1] Base pair probabilities are estimated using an estimate similar to Boltzmann distribution. The partition function is calculated using a dynamic programming approach.

Algorithm

The following describes the algorithm used by probalign to determine the base pair probabilities.[2]

Alignment score

To score an alignment of two sequences two things are needed:

The score S(a) of an alignment a is defined as:

 S(a) = \sum_{x_i-y_j \in a} \sigma(x_i,y_j) + \text{gap cost}

Now the boltzmann weighted score of an alignment a is:

 e^{\frac{S(a)}{T}} = e^{\frac{\sum_{x_i-y_j \in a} \sigma(x_i,y_j) + \text{gap cost}}{T}} = 
\left( \prod_{x_i - y_i \in a} e^{\frac{\sum_{x_i-y_j \in a} \sigma(x_i,y_j)}{T}} \right) \cdot e^{\frac{gapcost}{T}}

Where T is a scaling factor.

The probability of an alignment assuming boltzmann distribution is given by

Pr[a|x,y] = \frac{e^{\frac{S(a)}{T}}}{Z}

Where Z is the partition function, i.e. the sum of the boltzmann weights of all alignments.

Dynamic Programming

Let Z_{i,j} denote the partition function of the prefixes x_0,x_1,...,x_i and y_0,y_1,...,y_j. Three different cases are considered:

  1. Z^{M}_{i,j}: the partition function of all alignments of the two prefixes that end in a match.
  2. Z^{I}_{i,j}: the partition function of all alignments of the two prefixes that end in an insertion (-,y_j).
  3. Z^{D}_{i,j}: the partition function of all alignments of the two prefixes that end in a deletion (x_i,-).

Then we have: Z_{i,j} = Z^{M}_{i,j} + Z^{D}_{i,j} + Z^{I}_{i,j}

Initialization

The matrixes are initialized as follows:

Recursion

The partition function for the alignments of two sequences x and y is given by Z_{|x|,|y|}, which can be recursively computed:

Base pair probability

Finally the probability that positions x_i and y_j form a base pair is given by:

P(x_i - y_j|x,y) = \frac{Z_{i-1,j-1} \cdot e^{\frac{\sigma(x_i,y_j)}{T}} \cdot Z'_{i',j'}}{Z_{|x|,|y|}}

 Z', i', j' are the respective values for the recalculated Z with inversed base pair strings.

See also

References

  1. U. Roshan and D. R. Livesay, Probalign: multiple sequence alignment using partition function posterior probabilities, Bioinformatics, 22(22):2715-21, 2006 (PDF)
  2. Lecture "Bioinformatics II" at University of Freiburg

External links

This article is issued from Wikipedia - version of the Sunday, January 10, 2016. The text is available under the Creative Commons Attribution/Share Alike but additional terms may apply for the media files.