FASTA format
In bioinformatics, FASTA format is a text-based format for representing either nucleotide sequences or peptide sequences, in which nucleotides or amino acids are represented using single-letter codes. The format also allows for sequence names and comments to precede the sequences. The format originates from the FASTA software package, but has now become a standard in the field of bioinformatics.
The simplicity of FASTA format makes it easy to manipulate and parse sequences using text-processing tools and scripting languages like Python, Ruby, and Perl.
Format
A sequence in FASTA format begins with a single-line description, followed by lines of sequence data. The description line is distinguished from the sequence data by a greater-than (">") symbol in the first column. The word following the ">" symbol is the identifier of the sequence, and the rest of the line is the description (both are optional). There should be no space between the ">" and the first letter of the identifier. It is recommended that all lines of text be shorter than 80 characters. The sequence ends if another line starting with a ">" appears; this indicates the start of another sequence. A simple example of one sequence in FASTA format:
>gi|31563518|ref|NP_852610.1| microtubule-associated proteins 1A/1B light chain 3A isoform b [Homo sapiens]
MKMRFFSSPCGKAAVDPADRCKEVQQIRDQHPSKIPVIIERYKGEKQLPVLDKTKFLVPDHVNMSELVKIIRRRLQLNPTQAFFLLVNQHSMVSVSTPIADIYEQEKDEDGFLYMVYASQETFGFIRENE
In this example, "gi|31563518|ref|NP_852610.1|" is the name of the sequence.
To sum up. FASTA formated sequence has at least two, obligatory lines. First line starts with ">" and contains the description (this line cannot be omitted). Next lines (until to the next occurrence of ">") are treated as single sequence (the sequence may or may not be divided into 80 characters chunks in separate lines) with nucleotide or amino acids in single letter code. Next, occurrence of ">" at the beginning of line starts next record (in multiple-sequence file).
History
The original FASTA/Pearson format is described in the documentation for the FASTA suite of programs. It can be downloaded with any free distribution of FASTA (see fasta20.doc, fastaVN.doc or fastaVN.me—where VN is the Version Number).
A sequence in FASTA format is represented as a series of lines, each of which should be no longer than 120 characters and usually do not exceed 80 characters. This probably was to allow for preallocation of fixed line sizes in software: at the time most users relied on DEC VT (or compatible) terminals which could display 80 or 132 characters per line. Most people preferred the bigger font in 80-character modes and so it became the recommended fashion to use 80 characters or less (often 70) in FASTA lines. Also, the width of a standard printed page is 70 to 80 characters (depending on the font).
The first line in a FASTA file starts either with a ">" (greater-than) symbol or, less frequently, a ";" (semicolon) and was taken as a comment. Subsequent lines starting with a semicolon would be ignored by software. Since the only comment used was the first, it quickly became used to hold a summary description of the sequence, often starting with a unique library accession number, and with time it has become commonplace use to always use ">" for the first line and to not use ";" comments (which would otherwise be ignored).
Following the initial line (used for a unique description of the sequence) is the actual sequence itself in standard one-letter code. Anything other than a valid code would be ignored (including spaces, tabulators, asterisks, etc...). Originally it was also common to end the sequence with an "*" (asterisk) character (in analogy with use in PIR formatted sequences) and, for the same reason, to leave a blank line between the description and the sequence.
A few sample sequences:
;LCBO - Prolactin precursor - Bovine
; a sample sequence in FASTA format
MDSKGSSQKGSRLLLLLVVSNLLLCQGVVSTPVCPNGPGNCQVSLRDLFDRAVMVSHYIHDLSS
EMFNEFDKRYAQGKGFITMALNSCHTSSLPTPEDKEQAQQTHHEVLMSLILGLLRSWNDPLYHL
VTEVRGMKGAPDAILSRAIEIEEENKRLLEGMEMIFGQVIPGAKETEPYPVWSGLPSLQTKDED
ARYSAFYNLLHCLRRDSSKIDTYLKLLNCRIIYNNNC*
>MCHU - Calmodulin - Human, rabbit, bovine, rat, and chicken
ADQLTEEQIAEFKEAFSLFDKDGDGTITTKELGTVMRSLGQNPTEAELQDMINEVDADGNGTID
FPEFLTMMARKMKDTDSEEEIREAFRVFDKDGNGYISAAELRHVMTNLGEKLTDEEVDEMIREA
DIDGDGQVNYEEFVQMMTAK*
>gi|5524211|gb|AAD44166.1| cytochrome b [Elephas maximus maximus]
LCLYTHIGRNIYYGSYLYSETWNTGIMLLLITMATAFMGYVLPWGQMSFWGATVITNLFSAIPYIGTNLV
EWIWGGFSVDKATLNRFFAFHFILPFTMVALAGVHLTFLHETGSNNPLGLTSDSDKIPFHPYYTIKDFLG
LLILILLLLLLALLSPDMLGDPDNHMPADPLNTPLHIKPEWYFLFAYAILRSVPNKLGGVLALFLSIVIL
GLMPFLHTSKHRSMMLRPLSQALFWTLTMDLLTLTWIGSQPVEYPYTIIGQMASILYFSIILAFLPIAGX
IENY
A multiple sequence FASTA format would be obtained by concatenating several single sequence FASTA files. This does not imply a contradiction with the format as only the first line in a FASTA file may start with a ";" or ">", hence forcing all subsequent sequences to start with a ">" in order to be taken as different ones (and further forcing the exclusive reservation of ">" for the sequence definition line). Thus, the examples above may as well be taken as a multisequence file if taken together.
Description line
The description line (defline) or header line, which begins with '>', gives a name and/or a unique identifier for the sequence, and may also contain additional information. In a deprecated practice, the header line sometimes contained more than one header, separated by a ^A (Control-A) character.
In the original Pearson FASTA format, one or more comments, distinguished by a semi-colon at the beginning of the line, may occur after the header. Some databases and bioinformatics applications do not recognize these comments and follow the NCBI FASTA specification. An example of a multiple sequence FASTA file follows:
>SEQUENCE_1
MTEITAAMVKELRESTGAGMMDCKNALSETNGDFDKAVQLLREKGLGKAAKKADRLAAEG
LVSVKVSDDFTIAAMRPSYLSYEDLDMTFVENEYKALVAELEKENEERRRLKDPNKPEHK
IPQFASRKQLSDAILKEAEEKIKEELKAQGKPEKIWDNIIPGKMNSFIADNSQLDSKLTL
MGQFYVMDDKKTVEQVIAEKEKEFGGKIKIVEFICFEVGEGLEKKTEDFAAEVAAQL
>SEQUENCE_2
SATVSEINSETDFVAKNDQFIALTKDTTAHIQSNSLQSVEELHSSTINGVKFEEYLKSQI
ATIGENLVVRRFATLKAGANGVVNGYIHTNGRVGVVIAAACDSAEVASKSRDLLRQICMH
Sequence representation
After the header line and comments, one or more lines may follow describing the sequence: each line of a sequence should have fewer than 80 characters. Sequences may be protein sequences or nucleic acid sequences, and they can contain gaps or alignment characters (see sequence alignment). Sequences are expected to be represented in the standard IUB/IUPAC amino acid and nucleic acid codes, with these exceptions: lower-case letters are accepted and are mapped into upper-case; a single hyphen or dash can be used to represent a gap character; and in amino acid sequences, U and * are acceptable letters (see below). Numerical digits are not allowed but are used in some databases to indicate the position in the sequence.
The nucleic acid codes supported are:[1][2]
Nucleic Acid Code | Meaning | Mnemonic |
---|---|---|
A | A | Adenine |
C | C | Cytosine |
G | G | Guanine |
T | T | Thymine |
U | U | Uracil |
R | A or G | puRine |
Y | C, T or U | pYrimidines |
K | G, T or U | bases which are Ketones |
M | A or C | bases with aMino groups |
S | C or G | Strong interaction |
W | A, T or U | Weak interaction |
B | not A (i.e. C, G, T or U) | B comes after A |
D | not C (i.e. A, G, T or U) | D comes after C |
H | not G (i.e., A, C, T or U) | H comes after G |
V | neither T nor U (i.e. A, C or G) | V comes after U |
N | A C G T U | Nucleic acid |
- | gap of indeterminate length |
The codes supported (25 amino acids and 3 special codes) are:
Amino Acid Code | Meaning |
---|---|
A | Alanine |
B | Aspartic acid (D) or Asparagine (N) |
C | Cysteine |
D | Aspartic acid |
E | Glutamic acid |
F | Phenylalanine |
G | Glycine |
H | Histidine |
I | Isoleucine |
J | Leucine (L) or Isoleucine (I) |
K | Lysine |
L | Leucine |
M | Methionine |
N | Asparagine |
O | Pyrrolysine |
P | Proline |
Q | Glutamine |
R | Arginine |
S | Serine |
T | Threonine |
U | Selenocysteine |
V | Valine |
W | Tryptophan |
Y | Tyrosine |
Z | Glutamic acid (E) or Glutamine (Q) |
X | any |
* | translation stop |
- | gap of indeterminate length |
Sequence identifiers
The NCBI defined a standard for the unique identifier used for the sequence (SeqID) in the header line. The formatdb man page has this to say on the subject: "formatdb will automatically parse the SeqID and create indexes, but the database identifiers in the FASTA definition line must follow the conventions of the FASTA Defline Format."
The following list describes the NCBI FASTA defline format (see "The NCBI Handbook", Chapter 16, The BLAST Sequence Analysis Tool.).
Database | Format |
---|---|
GenBank | gb|accession|locus |
EMBL Data Library | emb|accession|locus |
DDBJ, DNA Database of Japan | dbj|accession|locus |
NBRF PIR | pir||entry |
Protein Research Foundation | prf||name |
SWISS-PROT | sp|accession|entry name |
Brookhaven Protein Data Bank | pdb|entry|chain |
Patents | pat|country|number |
GenInfo Backbone Id | bbs|number |
General database identifier | gnl|database|identifier |
NCBI Reference Sequence | ref|accession|locus |
Local Sequence identifier | lcl|identifier |
The vertical bars ("|") in the above list are not separators in the sense of the Backus-Naur form, but are part of the format. Multiple identifiers can be concatenated, also separated by vertical bars.
Compression
The compression of FASTA files requires a specific compressor to handle both channels of information: identifiers and sequence. For improved compression results, these are mainly divided in two streams where the compression is made assuming independence. For example, the algorithm MFCompress [3] performs lossless compression of these files using context modelling and arithmetic encoding.
File extension
There is no standard file extension for a text file containing FASTA formatted sequences. The table below shows each extension and its respective meaning.
Extension | Meaning | Notes |
---|---|---|
fasta (.fas) | generic fasta | Any generic fasta file. Other extensions can be fa, seq, fsa |
fna | fasta nucleic acid | Used generically to specify nucleic acids. |
ffn | FASTA nucleotide of gene regions | Contains coding regions for a genome. |
faa | fasta amino acid | Contains amino acids. A multiple protein fasta file can have the more specific extension mpfa. |
frn | FASTA non-coding RNA | Contains non-coding RNA regions for a genome, in DNA alphabet e.g. tRNA, rRNA |
See also
References
- ↑ Tao Tao (2011-08-24). "Single Letter Codes for Nucleotides". [NCBI Learning Center]. National Center for Biotechnology Information. Retrieved 2012-03-15.
- ↑ "IUPAC code table". NIAS DNA Bank.
- ↑ Pinho, A; Pratas, D (2014). "MFCompress: a compression tool for FASTA and multi-FASTA data.". Bioinformatics 30 (1): 117–118. doi:10.1093/bioinformatics/btt594. PMID 24132931.
External links
- What is FASTA Format? explains the FASTA format.
- HUPO-PSI Standard FASTA Format was describing another FASTA format as put forward by the Human Proteome Organisation's Proteomics Standards Initiative.
- Sequence ID (seqID) Fields in the FASTA Deflines of Sequences from NCBI describes the format of FASTA Deflines.