Stockholm format
Stockholm format is a Multiple sequence alignment format used by Pfam and Rfam to disseminate protein and RNA sequence alignments.[1][2] The alignment editors Ralee [3] and Belvu support Stockholm format as do the probabilistic database search tools, Infernal and HMMER, and the phylogenetic analysis tool Xrate. A simple example of an Rfam alignment (UPSK RNA) with a pseudoknot in Stockholm format is shown below:[4]
# STOCKHOLM 1.0 #=GF ID UPSK #=GF SE Predicted; Infernal #=GF SS Published; PMID 9223489 #=GF RN [1] #=GF RM 9223489 #=GF RT The role of the pseudoknot at the 3' end of turnip yellow mosaic #=GF RT virus RNA in minus-strand synthesis by the viral RNA-dependent RNA #=GF RT polymerase. #=GF RA Deiman BA, Kortlever RM, Pleij CW; #=GF RL J Virol 1997;71:5990-5996. AF035635.1/619-641 UGAGUUCUCGAUCUCUAAAAUCG M24804.1/82-104 UGAGUUCUCUAUCUCUAAAAUCG J04373.1/6212-6234 UAAGUUCUCGAUCUUUAAAAUCG M24803.1/1-23 UAAGUUCUCGAUCUCUAAAAUCG #=GC SS_cons .AAA....<<<<aaa....>>>> //
Here is a slightly more complex example showing the Pfam CBS domain:
# STOCKHOLM 1.0 #=GF ID CBS #=GF AC PF00571 #=GF DE CBS domain #=GF AU Bateman A #=GF CC CBS domains are small intracellular modules mostly found #=GF CC in 2 or four copies within a protein. #=GF SQ 5 #=GS O31698/18-71 AC O31698 #=GS O83071/192-246 AC O83071 #=GS O83071/259-312 AC O83071 #=GS O31698/88-139 AC O31698 #=GS O31698/88-139 OS Bacillus subtilis O83071/192-246 MTCRAQLIAVPRASSLAEAIACAQKMRVSRVPVYERS #=GR O83071/192-246 SA 9998877564535242525515252536463774777 O83071/259-312 MQHVSAPVFVFECTRLAYVQHKLRAHSRAVAIVLDEY #=GR O83071/259-312 SS CCCCCHHHHHHHHHHHHHEEEEEEEEEEEEEEEEEEE O31698/18-71 MIEADKVAHVQVGNNLEHALLVLTKTGYTAIPVLDPS #=GR O31698/18-71 SS CCCHHHHHHHHHHHHHHHEEEEEEEEEEEEEEEEHHH O31698/88-139 EVMLTDIPRLHINDPIMKGFGMVINN..GFVCVENDE #=GR O31698/88-139 SS CCCCCCCHHHHHHHHHHHHEEEEEEEEEEEEEEEEEH #=GC SS_cons CCCCCHHHHHHHHHHHHHEEEEEEEEEEEEEEEEEEH O31699/88-139 EVMLTDIPRLHINDPIMKGFGMVINN..GFVCVENDE #=GR O31699/88-139 AS ________________*____________________ #=GR O31699/88-139 IN ____________1____________2______0____ //
A minimal well formed Stockholm files should contain the header which states the format and version identifier, currently '# STOCKHOLM 1.0'. Followed by the sequences and corresponding unique sequence names:
<seqname> <aligned sequence> <seqname> <aligned sequence> <seqname> <aligned sequence>
'<seqname>' stands for "sequence name", typically in the form "name/start-end" or just "name". Finally, the "//" line indicates the end of the alignment. Sequence letters may include any characters except whitespace. Gaps may be indicated by "." or "-".
The alignment mark-up
Mark-up lines may include any characters except whitespace. Use underscore ("_") instead of space.
#=GF <feature> <Generic per-File annotation, free text> #=GC <feature> <Generic per-Column annotation, exactly 1 char per column> #=GS <seqname> <feature> <Generic per-Sequence annotation, free text> #=GR <seqname> <feature> <Generic per-Residue annotation, exactly 1 char per residue>
Recommended features
#=GF
(See the Pfam and the Rfam documentation under "Description of fields")
Pfam and Rfam may use the following tags:
Compulsory fields: ------------------ AC Accession number: Accession number in form PFxxxxx (Pfam) or RFxxxxx (Rfam). ID Identification: One word name for family. DE Definition: Short description of family. AU Author: Authors of the entry. SE Source of seed: The source suggesting the seed members belong to one family. SS Source of structure: The source (prediction or publication) of the consensus RNA secondary structure used by Rfam. BM Build method: Command line used to generate the model SM Search method: Command line used to perform the search GA Gathering threshold: Search threshold to build the full alignment. TC Trusted Cutoff: Lowest sequence score (and domain score for Pfam) of match in the full alignment. NC Noise Cutoff: Highest sequence score (and domain score for Pfam) of match not in full alignment. TP Type: Type of family -- presently Family, Domain, Motif or Repeat for Pfam. -- a tree with roots Gene, Intron or Cis-reg for Rfam. SQ Sequence: Number of sequences in alignment. Optional fields: ---------------- DC Database Comment: Comment about database reference. DR Database Reference: Reference to external database. RC Reference Comment: Comment about literature reference. RN Reference Number: Reference Number. RM Reference Medline: Eight digit medline UI number. RT Reference Title: Reference Title. RA Reference Author: Reference Author RL Reference Location: Journal location. PI Previous identifier: Record of all previous ID lines. KW Keywords: Keywords. CC Comment: Comments. NE Pfam accession: Indicates a nested domain. NL Location: Location of nested domains - sequence ID, start and end of insert. WK Wikipedia link: Wikipedia page CL Clan: Clan accession MB Membership: Used for listing Clan membership For embedding trees: ---------------- NH New Hampshire A tree in New Hampshire eXtended format. TN Tree ID A unique identifier for the next tree. Other: ------ FR False discovery Rate: A method used to set the bit score threshold based on the ratio of expected false positives to true positives. Floating point number between 0 and 1. CB Calibration method: Command line used to calibrate the model (Rfam only, release 12.0 and later)
- Notes: A tree may be stored on multiple #=GF NH lines.
- If multiple trees are stored in the same file, each tree must be preceded by a #=GF TN line with a unique tree identifier. If only one tree is included, the #=GF TN line may be omitted.
#=GS
Rfam and Pfam may use these features:
Feature Description --------------------- ----------- AC <accession> ACcession number DE <freetext> DEscription DR <db>; <accession>; Database Reference OS <organism> Organism (species) OC <clade> Organism Classification (clade, etc.) LO <look> Look (Color, etc.)
#=GR
Feature Description Markup letters ------- ----------- -------------- SS Secondary Structure For RNA [.,;<>(){}[]AaBb...], For protein [HGIEBTSCX] SA Surface Accessibility [0-9X] (0=0%-10%; ...; 9=90%-100%) TM TransMembrane [Mio] PP Posterior Probability [0-9*] (0=0.00-0.05; 1=0.05-0.15; *=0.95-1.00) LI LIgand binding [*] AS Active Site [*] pAS AS - Pfam predicted [*] sAS AS - from SwissProt [*] IN INtron (in or after) [0-2] For RNA tertiary interactions: ------------------------------ tWW WC/WC in trans For basepairs: [<>AaBb...Zz] For unpaired: [.] cWH WC/Hoogsteen in cis cWS WC/SugarEdge in cis tWS WC/SugarEdge in trans notes: (1) {c,t}{W,H,S}{W,H,S} for general format. (2) cWW is equivalent to SS.
#=GC
The list of valid features includes those shown below, as well as the same features as for #=GR with "_cons" appended, meaning "consensus". Example: "SS_cons".
Feature Description Description ------- ----------- -------------- RF ReFerence annotation Often the consensus RNA or protein sequence is used as a reference Any non-gap character (e.g. x's) can indicate consensus/conserved/match columns .'s or -'s indicate insert columns ~'s indicate unaligned insertions Upper and lower case can be used to discriminate strong and weakly conserved residues respectively MM Model Mask Indicates which columns in an alignment should be masked, such that the emission probabilities for match states corresponding to those columns will be the background distribution.
Notes
- Do not use multiple lines with the same #=GC label.
- For a single sequence, do not use multiple lines with the same #=GR label. Only one unique feature assignment can be made for each sequence.
- "X" in SA and SS means "residue with unknown structure".
- The protein SS letters are taken from DSSP: H=alpha-helix, G=3/10-helix, I=p-helix, E=extended strand, B=residue in isolated b-bridge, T=turn, S=bend, C=coil/loop.)
- The RNA SS letters are taken from WUSS (Washington University Secondary Structure) notation. Matching nested parentheses characters <>, (), [], or {} indicate a basepair. The symbols '.', ',' and ';' indicate unpaired regions. Matched upper and lower case characters from the English alphabet indicate pseudoknot interactions. The 5' nucleotide within the knot should be in uppercase and the 3' nucleotide lowercase.
Recommended placements
- #=GF Above the alignment
- #=GC Below the alignment
- #=GS Above the alignment or just below the corresponding sequence
- #=GR Just below the corresponding sequence
Size limits
There are no explicit size limits on any field. However, a simple parser that uses fixed field sizes should work safely on Pfam and Rfam alignments with these limits:
- Line length: 10000.
- <seqname>: 255.
- <feature>: 255.
See also
References
- ↑ Gardner PP, Daub J, Tate JG, et al. (January 2009). "Rfam: updates to the RNA families database". Nucleic Acids Res. 37 (Database issue): D136–40. doi:10.1093/nar/gkn766. PMC 2686503. PMID 18953034.
- ↑ Finn RD, Tate J, Mistry J, Coggill PC, Sammut SJ, Hotz HR, Ceric G, Forslund K, Eddy SR, Sonnhammer EL, Bateman A (2008). "The Pfam protein families database.". Nucleic Acids Res 36 (Database issue): D281–8. doi:10.1093/nar/gkm960. PMC 2238907. PMID 18039703.
- ↑ Griffiths-Jones S (January 2005). "RALEE--RNA ALignment editor in Emacs". Bioinformatics 21 (2): 257–9. doi:10.1093/bioinformatics/bth489. PMID 15377506.
- ↑ Deiman BA, Kortlever RM, Pleij CW (August 1997). "The role of the pseudoknot at the 3' end of turnip yellow mosaic virus RNA in minus-strand synthesis by the viral RNA-dependent RNA polymerase". J. Virol. 71 (8): 5990–6. PMC 191855. PMID 9223489.