Computational Biology Research Group University of Oxford
Bioinformatics FAQ - common sequence file formats

Examples of common sequence file formats


gcg format

The programs in the GCG suite of biological analysis software accept sequences in gcg format. A typical gcg formatted sequence looks like:

 !!NA_SEQUENCE 1.0

 test.seq Length: 5390 April 22, 1999 13:50 Type: N Check: 8167 ..

 1  ttatataaaa aatgctgaaa acaggatcaa ggaggaagat ttaaatatag 
 51 atataatata tgggaagaaa cataaaaacg aaataagaac agctaaatat 

The hallmarks of a gcg formatted sequence are:

  • Begins with the line (all uppercase) !!NA_SEQUENCE 1.0 for nucleic acid sequences or !!AA_SEQUENCE 1.0 for amino acid sequences. Do not edit or delete the file type if it's present. (optional)
  • A description line which contains informative text describing what is in the file.
  • A dividing line which contains the number of bases or residues in the sequence, when the file was created, and importantly, two dots (..) which act as a divider between the descriptive information and the following sequence information. (required)
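
As a rough illustration of how the two dots separate the header from the sequence, here is a minimal Python sketch. The function name split_gcg is hypothetical, and the sketch assumes that only letters are sequence symbols, so position numbers and spaces are dropped; it is not how the GCG programs themselves read the file.

```python
def split_gcg(text):
    """Split a gcg format file into (header, sequence).

    Assumption: the first line ending in '..' is the dividing line,
    and only alphabetic characters after it are sequence symbols.
    """
    lines = text.splitlines()
    for i, line in enumerate(lines):
        if line.rstrip().endswith(".."):
            header = "\n".join(lines[: i + 1])
            body = lines[i + 1:]
            break
    else:
        raise ValueError("no '..' dividing line found")
    # Drop position numbers and whitespace, keep the residues.
    sequence = "".join(ch for line in body for ch in line if ch.isalpha())
    return header, sequence
```

Calling split_gcg on the contents of a file like the example above would return the text up to and including the dividing line, plus the sequence as one continuous string.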



staden format

The programs of the Staden suite of biological analysis software accept sequences in staden format. A typical staden format file:

 GGTACGTAGTAGCTGCTGCTACGTGCGCTAGCTAGTACGTCATTA
 CGACGTAGATGCTAGCTGACTCGATGCAGTACGTAGTAGCTGCTG
 CTACGTGCGCTAGCTAGTACGTCACGACGTAGATGCTAGCTGACT
 CGATGC

Staden formatted sequence files contain the sequence and nothing else.



embl format

This is the format in which sequences in the EMBL database are distributed. An example is:

 ID   A1MVRNA2   standard; unassigned RNA; VRL; 2593 BP.
 XX
 AC   X01572; J02002; K02702;
 XX
 SV   X01572.1
 XX
 DT   07-NOV-1985 (Rel. 07, Created)
 DT   17-FEB-1997 (Rel. 50, Last updated, Version 6)
 XX
 DE   Alfalfa mosaic virus (A1M4) RNA 2
 XX
 KW   unidentified reading frame.
 XX
 OS   Alfalfa mosaic virus
 OC   Viruses; ssRNA positive-strand viruses, no DNA stage; Bromoviridae;
 OC   Alfamovirus.
 XX
 RN   [1]
 RP   1-2593
 RX   MEDLINE; 83220723.
 RX   PUBMED; 6304618.
 RA   Cornelissen B.J., Brederode F.T.M., Veeneman G.H., van Boom J.H., Bol J.F.;
 RT   "Complete nucleotide sequence of alfalfa mosaic virus RNA 2";
 RL   Nucleic Acids Res. 11(10):3019-3025(1983).
 XX
 DR   GOA; P03593.
 DR   SWISS-PROT; P03593; V90K_AMVLE.
 XX
 CC   Data kindly reviewed (30-JAN-1986) by J.F. Bol
 XX
 FH   Key             Location/Qualifiers
 FH
 FT   source          1..2593
 FT                   /db_xref="taxon:12321"
 FT                   /mol_type="unassigned RNA"
 FT                   /note="RNA2"
 FT                   /virion
 FT                   /organism="Alfalfa mosaic virus"
 FT   CDS             55..2427
 FT                   /db_xref="GOA:P03593"
 FT                   /db_xref="SWISS-PROT:P03593"
 FT                   /product="89.7 kd protein"
 FT                   /protein_id="CAA25728.1"
 FT                   /translation="MFTLLRCLGFGVNEPTNTSSSEYVPEYSVEEISNEVAELDSVDPL
 FT                   FQCYKHVFVSLMLVRKMTQAAEDFLESFGGEFDSPCCRVYRLYRHFVNEDDAPAWAIPN
 FT                   VVNEDSYDDYAYLREELDAIDSSFELLNEERELSEIT"
 FT   variation       1431..1431
 FT                   /note="c in pAL2-1; t in pAL2-[21,41]"
 XX
 SQ   Sequence 2593 BP; 736 A; 533 C; 547 G; 777 T; 0 other;
      gtttttatct tttcgcgatt gaaaagataa gtttttcagt ttaatctttt caatatgttc        60
      actcttttga gatgtctcgg attcggtgtt aatgaaccta ctaacacttc ctcatcagag       120
      tatgttcccg agtattccgt tgaagagatt tccaacgaag tcgctgaact cgattcagtg       180
      gatccattat tccaatgtta caaacatgtt tttgtatcat tgatgctcgt aagaaagatg       240
      actcaagctg ccgaagactt cctcgagagt tttgggggag a                           281
 //

A lot of information about the sequence can be included in embl formatted files. More detailed information on EMBL sequence files can be found at the EBI.
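
Each line of an embl entry starts with a two-letter code (ID, AC, DE, FT, SQ and so on), XX lines are spacers, and // ends the entry. The following Python sketch groups lines by that code and collects the sequence that follows the SQ line. The function name parse_embl is hypothetical, and the sketch assumes a single entry per file and treats only letters as sequence symbols.

```python
def parse_embl(text):
    """Return (fields, sequence) for one embl entry.

    fields maps each two-letter line code to the list of its lines.
    Assumption: one entry per file; XX spacer lines are skipped.
    """
    fields = {}
    residues = []
    in_sequence = False
    for line in text.splitlines():
        stripped = line.strip()
        if stripped == "//":          # end of entry
            break
        if in_sequence:
            # Sequence lines carry spaces and a trailing position count.
            residues.append("".join(ch for ch in stripped if ch.isalpha()))
            continue
        if not stripped or stripped.startswith("XX"):
            continue
        code, value = stripped[:2], stripped[2:].strip()
        fields.setdefault(code, []).append(value)
        if code == "SQ":
            in_sequence = True
    return fields, "".join(residues)
```

For the entry above, fields["AC"] would hold the accession line and fields["FT"] the feature table lines, in file order.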



clustal

Clustal format files begin with the word CLUSTAL. Sequences can be interleaved (as in the example below) or sequential. Note: the multiple sequence alignment programs ClustalW and ClustalX produce clustal format files by default, but you can choose a different format under "output format options". An example Clustal file:

 CLUSTAL W (1.74) multiple sequence alignment

 seq1 -----------------------KSKERYKDENGGNYFQLREDWWDANRETVWKAITCNA
 seq2 ---------------YEGLTTANGXKEYYQDKNGGNFFKLREDWWTANRETVWKAITCGA
 seq3 ----KRIYKKIFKEIHSGLSTKNGVKDRYQN-DGDNYFQLREDWWTANRSTVWKALTCSD
 seq4 ------------------------SQRHYKD-DGGNYFQLREDWWTANRHTVWEAITCSA
 seq5 --------------------NVAALKTRYEK-DGQNFYQLREDWWTANRATIWEAITCSA
 seq6 ------FSKNIX--QIEELQDEWLLEARYKD--TDNYYELREHWWTENRHTVWEALTCEA
 seq7 -------------------------------------------------KELWEALTCSR
 
 seq1 --GGGKYFRNTCDG--GQNPTETQNNCRCIG----------ATVPTYFDYVPQYLRWSDE
 seq2 P-GDASYFHATCDSGDGRGGAQAPHKCRCDG---------ANVVPTYFDYVPQFLRWPEE
 seq3 KLSNASYFRATC--SDGQSGAQANNYCRCNGDKPDDDKP-NTDPPTYFDYVPQYLRWSEE
 seq4 DKGNA-YFRRTCNSADGKSQSQARNQCRC---KDENGKN-ADQVPTYFDYVPQYLRWSEE
 seq5 DKGNA-YFRATCNSADGKSQSQARNQCRC---KDENGXN-ADQVPTYFDYVPQYLRWSEE
 seq6 P-GNAQYFRNACS----EGKTATKGKCRCISGDP----------PTYFDYVPQYLRWSEE
 seq7 P-KGANYFVYKLD-----RPKFSSDRCGHNYNGDP---------LTNLDYVPQYLRWSDE
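
To show how the interleaved blocks fit together, here is a minimal Python sketch that joins the chunks belonging to each sequence name. The function name parse_clustal is hypothetical. The sketch assumes names contain no spaces and skips any line made up only of the conservation symbols *, : and . that Clustal programs may write under each block.

```python
def parse_clustal(text):
    """Return {name: aligned_sequence} from an interleaved clustal file."""
    lines = [line for line in text.splitlines() if line.strip()]
    if not lines or not lines[0].lstrip().upper().startswith("CLUSTAL"):
        raise ValueError("file does not start with the word CLUSTAL")
    alignment = {}
    for line in lines[1:]:
        parts = line.split()
        # Skip conservation lines and anything without a name + chunk.
        if len(parts) < 2 or set(parts[0]) <= set("*:."):
            continue
        name, chunk = parts[0], parts[1]
        alignment[name] = alignment.get(name, "") + chunk
    return alignment
```

Applied to the example above, each of seq1 to seq7 would map to one 120-column string built from its two 60-column chunks.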



phylip

The first line of the input file contains the number of species and the number of characters, separated by blanks. The information for each species follows, starting with a ten-character species name (which can include punctuation marks and blanks) and continuing with the characters for that species. Phylip format files can be interleaved, as in the example below, or sequential. More information about phylip format is available from the authors at the University of Washington. An example phylip format file:

 7 123
 seq1 ---------- ---------- ---KSKERYK DENGGNYFQL REDWWDANRE 
 seq2 ---------- -----YEGLT TANGXKEYYQ DKNGGNFFKL REDWWTANRE 
 seq3 ---------- ---------- ----SQRHYK D-DGGNYFQL REDWWTANRH 
 seq4 ---------- ---------- NVAALKTRYE K-DGQNFYQL REDWWTANRA 
 seq5 ----KRIYKK IFKEIHSGLS TKNGVKDRYQ N-DGDNYFQL REDWWTANRS 
 seq6 ------FSKN IX--QIEELQ DEWLLEARYK D--TDNYYEL REHWWTENRH 
 seq7 ---------- ---------- ---------- ---------- ---------K 

  TVWKAITCNA --GGGKYFRN TCDG--GQNP TETQNNCRCI G--------- 
  TVWKAITCGA P-GDASYFHA TCDSGDGRGG AQAPHKCRCD G--------- 
  TVWEAITCSA DKGNA-YFRR TCNSADGKSQ SQARNQCRC- --KDENGKN- 
  TIWEAITCSA DKGNA-YFRA TCNSADGKSQ SQARNQCRC- --KDENGXN- 
  TVWKALTCSD KLSNASYFRA TC--SDGQSG AQANNYCRCN GDKPDDDKP- 
  TVWEALTCEA P-GNAQYFRN ACS----EGK TATKGKCRCI SGDP------ 
  ELWEALTCSR P-KGANYFVY KLD-----RP KFSSDRCGHN YNGDP----- 
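
The layout described above can be read with a short Python sketch. The function name parse_phylip_interleaved is hypothetical, and the sketch assumes the strict layout: names occupy exactly the first ten columns of the first block, and later blocks list the species in the same order without names.

```python
def parse_phylip_interleaved(text):
    """Return {name: sequence} from a strict interleaved phylip file."""
    lines = [line for line in text.splitlines() if line.strip()]
    n_species, n_chars = (int(x) for x in lines[0].split()[:2])
    names, seqs = [], []
    # First block: ten-character name, then the first characters.
    for line in lines[1 : 1 + n_species]:
        names.append(line[:10].strip())
        seqs.append("".join(line[10:].split()))
    # Later blocks: characters only, species in the same order.
    for i, line in enumerate(lines[1 + n_species :]):
        seqs[i % n_species] += "".join(line.split())
    for name, seq in zip(names, seqs):
        if len(seq) != n_chars:
            raise ValueError(f"{name}: expected {n_chars} characters, got {len(seq)}")
    return dict(zip(names, seqs))
```

The final length check compares every sequence against the character count given on the first line, which catches files that are truncated or use a different name width.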



msf format

msf formatted multiple sequence files are most often created when using programs of the GCG suite. msf files include the sequence name and the sequence itself, which is usually aligned with other sequences in the file. You can specify a single sequence or many sequences within an msf file.

An example of part of an msf file, created using the GCG multiple sequence alignment program:

 !!AA_MULTIPLE_ALIGNMENT 1.0
 PileUp of: @hsp70.list

 Symbol comparison table: GenRunData:blosum62.cmp CompCheck: 6430

               GapWeight: 8
         GapLengthWeight: 2

 hsp70.msf MSF: 743 Type: P October 6, 1998 18:23 Check: 7784 ..

 Name: S11448 Len: 743 Check: 3635 Weight: 1.00
 Name: S06443 Len: 743 Check: 5861 Weight: 1.00
 Name: S29261 Len: 743 Check: 7748 Weight: 1.00

 //

        1                                                   50
 S11448 ~~~~~~~~~~ ~~~~~~~~~~ ~~~~~~MTFD GAIGIDLGTT YSCVGVWQNE
 S06443 ~~~~~~~~~~ ~~~~~~~~~~ ~~~~~~MTFD GAIGIDLGTT YSCVGVWQNE
 S29261 ~~~~~~~~~~ ~~~~~~~~~~ ~~~~~~~~MG KIIGIDLGTT NSCVAIMDGT
 

Some of the hallmarks of an msf formatted file are the same as those of a single-sequence gcg format file:

  • Begins with the line (all uppercase) !!NA_MULTIPLE_ALIGNMENT 1.0 for nucleic acid sequences or !!AA_MULTIPLE_ALIGNMENT 1.0 for amino acid sequences. Do not edit or delete the file type if it's present. (optional)
  • A description line which contains informative text describing what is in the file. You can add this information to the top of the MSF file using a text editor. (optional)
  • A dividing line which contains the number of bases or residues in the sequence, when the file was created, and importantly, two dots (..) which act as a divider between the descriptive information and the following sequence information. (required)

msf files contain some other information as well:

  • Name/Weight: The name of each sequence included in the alignment, as well as its length and checksum (both non-editable) and weight (editable). (required)
  • Separating Line. Must include two slashes (//) to divide the name/weight information from the sequence alignment. (required)
  • Multiple Sequence Alignment. Each sequence named in the above Name/Weight lines is included. The alignment allows you to view the relationship among sequences.
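
The three parts listed above (Name/Weight lines, the // separator, and the alignment) can be picked apart with a small Python sketch. The function name parse_msf and the regular expression are illustrative assumptions based on the example file, not a description of how the GCG programs parse msf.

```python
import re

# Matches lines such as: Name: S11448 Len: 743 Check: 3635 Weight: 1.00
NAME_LINE = re.compile(
    r"Name:\s*(\S+)\s+Len:\s*(\d+)\s+Check:\s*(\d+)\s+Weight:\s*([\d.]+)"
)

def parse_msf(text):
    """Return (weights, sequences) dictionaries keyed by sequence name."""
    weights, sequences = {}, {}
    in_alignment = False
    for line in text.splitlines():
        stripped = line.strip()
        if not in_alignment:
            match = NAME_LINE.search(stripped)
            if match:
                weights[match.group(1)] = float(match.group(4))
                sequences[match.group(1)] = ""
            elif stripped == "//":      # separating line
                in_alignment = True
            continue
        parts = stripped.split()
        # Only lines starting with a declared name carry residues;
        # coordinate lines (for example "1 ... 50") are ignored.
        if parts and parts[0] in sequences:
            sequences[parts[0]] += "".join(parts[1:])
    return weights, sequences
```

Because only names declared before the // line are accepted in the alignment section, the column-number lines above each block are skipped without special handling.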



GenBank format

This is the format in which sequences in the GenBank database at NCBI are distributed. An example is:

LOCUS       AF012907                 735 bp    DNA     linear   BCT 12-FEB-1998
DEFINITION  Serratia marcescens CarR gene, complete cds.
ACCESSION   AF012907
VERSION     AF012907.1  GI:2873370
KEYWORDS    .
SOURCE      Serratia marcescens
  ORGANISM  Serratia marcescens
            Bacteria; Proteobacteria; Gammaproteobacteria; Enterobacteriales;
            Enterobacteriaceae; Serratia.
REFERENCE   1  (bases 1 to 735)
  AUTHORS   Cox,A.R., Thomson,N.R., Bycroft,B., Stewart,G.S., Williams,P. and
            Salmond,G.P.
  TITLE     A pheromone-independent CarR protein controls carbapenem antibiotic
            synthesis in the opportunistic human pathogen Serratia marcescens
  JOURNAL   Microbiology 144 (Pt 1), 201-209 (1998)
  MEDLINE   98129093
   PUBMED   9467912
REFERENCE   2  (bases 1 to 735)
  AUTHORS   Cox,A.R.J.
  TITLE     Direct Submission
  JOURNAL   Submitted (08-JUL-1997) Biochemistry, University of Cambridge,
            Tennis Court Road, Cambridge CB2 1QW, UK
FEATURES             Location/Qualifiers
     source          1..735
                     /organism="Serratia marcescens"
                     /mol_type="genomic DNA"
                     /strain="ATCC 39006"
                     /db_xref="ATCC:39006"
                     /db_xref="taxon:615"
     CDS             1..735
                     /function="putative transcriptional activator of the
                     carbapenem biosynthetic genes"
                     /note="LuxR homolog"
                     /codon_start=1
                     /transl_table=11
                     /product="CarR"
                     /protein_id="AAC38168.1"
                     /db_xref="GI:2873371"
                     /translation="MNKEISYFIERKLKAYGNVLFAYFMMDKSSLSNPVFISNYPQKC
                     IDTYIDNKLFINDPVIHYSLKRVTPFSWDDNDLAVLRSENEDVAMYLREHDITVGYTF
                     VLHDHDNNLAILTIANNDEKNDFEDFIKNRENDLQMLLVTTHEKAMKHKHFVKGKTAP
                     LDCLQSALITPRETEVLFLVSRGNTYKEVSRTLGISEATVKFHINNSVRKLNVINSRH
                     AISKALELNLFRAFTGSLMTRKLVAI"
ORIGIN      
        1 atgaataaag agatcagtta ttttatagaa agaaagctaa aggcctatgg gaatgtttta
       61 ttcgcttact ttatgatgga taaatcttct ttatcaaatc ctgtttttat ttctaactac
      121 ccccaaaaat gtattgatac ctatattgat aataaacttt ttatcaatga tcctgttata
      181 cattactctt taaaaagagt aactccattt tcctgggatg ataacgatct cgctgtatta
      241 cggtccgaaa atgaagatgt tgccatgtat ctaagggagc atgacatcac tgtaggttac
      301 acatttgttc ttcacgacca tgataacaat ctggcgatcc tgactattgc taacaatgat
      361 gaaaaaaatg attttgagga ttttataaag aacagagaga atgatttaca aatgttgtta
      421 gtgactactc acgaaaaagc aatgaaacat aaacacttcg ttaaaggtaa aacggcgccc
      481 ttggattgct tgcaaagtgc attgattaca ccacgtgaaa cagaagtact tttcttggtc
      541 agtagaggga atacttataa agaggtgtcc agaacactgg gtatcagtga agcaacagtg
      601 aaattccata tcaataactc tgtcagaaaa cttaatgtca ttaattctcg ccatgccata
      661 agcaaagcac ttgagctcaa tctgtttcga gcctttacgg gatctctcat gaccagaaaa
      721 ttggttgcaa tatag
//
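
As a minimal sketch of how such a record can be read, the Python function below pulls the locus name and length from the LOCUS line and the sequence from the lines between ORIGIN and //. The function name parse_genbank is hypothetical; the sketch assumes one record per file and ignores every other section.

```python
def parse_genbank(text):
    """Return a dict with the locus name, stated length, and sequence."""
    record = {"locus": None, "length": None, "sequence": ""}
    chunks = []
    in_origin = False
    for line in text.splitlines():
        if line.startswith("//"):        # end of record
            break
        if in_origin:
            # Strip the leading position number and the spaces.
            chunks.append("".join(ch for ch in line if ch.isalpha()))
        elif line.startswith("LOCUS"):
            parts = line.split()
            record["locus"] = parts[1]
            record["length"] = int(parts[2])
        elif line.startswith("ORIGIN"):
            in_origin = True
    record["sequence"] = "".join(chunks)
    return record
```

Comparing len(record["sequence"]) with record["length"] is a quick way to check that a file was not truncated.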



rsf files

rsf (Rich Sequence Format) files contain one or more sequences that may or may not be related. One way to create a file in rsf format is to use GCG's NetFetch program, which downloads the appropriate entry from NCBI and saves the result in RSF format. For example:
netfetch DQ160058
This yields the following RSF output file:

 
!!RICH_SEQUENCE 1.0
..
{
name  DQ160058
descrip    Taraxacum officinale TO52-2 (To52-2) mRNA, partial cds.
type    DNA
longname  Taraxacum officinale
sequence-ID  DQ160058
checksum    4492
creation-date 09/23/2005 09:57:02
strand  1
comments  
  LOCUS       DQ160058                 389 bp    mRNA    linear   HTC 21-SEP-2005
  DEFINITION  Taraxacum officinale TO52-2 (To52-2) mRNA, partial cds.
  ACCESSION   DQ160058
  VERSION     DQ160058.1  GI:75755890
  KEYWORDS    HTC.
  SOURCE      Taraxacum officinale
    ORGANISM  Taraxacum officinale
              Eukaryota; Viridiplantae; Streptophyta; Embryophyta; Tracheophyta;
              Spermatophyta; Magnoliophyta; eudicotyledons; core eudicotyledons;
              asterids; campanulids; Asterales; Asteraceae; Cichorioideae;
              Cichorieae; Taraxacum.
  REFERENCE   1  (bases 1 to 389)
    AUTHORS   Hulzink,R.J.M., van Dijk,P.J. and Biere,A.
    TITLE     Isolation and characterization of candidate genes for pathogen and
              herbivory defense in common dandelion (Taraxacum officinale) upon
              salicylic acid or methyl jasmonate treatment
    JOURNAL   Unpublished
  REFERENCE   2  (bases 1 to 389)
    AUTHORS   Hulzink,R.J.M., van Dijk,P.J. and Biere,A.
    TITLE     Direct Submission
    JOURNAL   Submitted (09-AUG-2005) Centre for Terrestrial Ecology, Netherlands
              Institute for Ecology, Boterhoeksestraat 48, Heteren 6666 ZG, The
              Netherlands
  FEATURES             Location/Qualifiers
       source          1. .389
                       /organism="Taraxacum officinale"
                       /mol_type="mRNA"
                       /db_xref="taxon:50225"
                       /dev_stage="seedling; fifth leaf stage"
                       /note="salicylic acid or methyl jasmonate responsive
                       genotype: triploid apomictic clone 68"
       gene            <1. .389
                       /locus_tag="To52-2"
       CDS             <1. .175
                       /locus_tag="To52-2"
                       /note="similar to aspartyl-tRNA synthetase or
                       aspartate--tRNA ligase; contains premature internal stop
                       codon"
                       /codon_start=2
                       /product="TO52-2"
                       /protein_id="ABA27003.1"
                       /db_xref="GI:75755891"
                       /translation="NTEIRKNSKENLFWKSMEPSFIFYTGTLLLSGLSTQCHVVTTSC
                       IATHLMFSLEGRR"
feature 1 389 -1 square solid source 
       source          1. .389
                       /organism="Taraxacum officinale"
                       /mol_type="mRNA"
                       /db_xref="taxon:50225"
                       /dev_stage="seedling; fifth leaf stage"
                       /note="salicylic acid or methyl jasmonate responsive
                       genotype: triploid apomictic clone 68"
feature 1 389 6 square solid gene 
       gene            <1. .389
                       /locus_tag="To52-2"
feature 1 175 4 square solid CDS 
       CDS             <1. .175
                       /locus_tag="To52-2"
                       /note="similar to aspartyl-tRNA synthetase or
                       aspartate--tRNA ligase; contains premature internal stop
                       codon"
                       /codon_start=2
                       /product="TO52-2"
                       /protein_id="ABA27003.1"
                       /db_xref="GI:75755891"
                       /translation="NTEIRKNSKENLFWKSMEPSFIFYTGTLLLSGLSTQCHVVTTSC
                       IATHLMFSLEGRR"
sequence  
  taatacagagatccgaaagaactctaaggaaaacttgttttggaaaagtatggaaccgag
  ttttatattctacaccggtaccctcttgctgtccggcctttctacacaatgccatgtcgt
  gacaacgagttgtatagcaactcatttgatgttttcattagaggggaggagataatttca
  ggagctcaacgtgtgcacatacctgaacttttggaggcacgtgcaactgcatgtgggatt
  gatctcaaaaccatatcatcatacattgattccttcaggtatggtgcgcctccacatggc
  gggattggagttggattggaacgtgttgtgatgcttttttgtggccttgataacattcgt
  aaagtctcacttttcccacgtgaccttcg
}

In addition to the sequence data, each sequence can be annotated with descriptive sequence information such as:

  • Creator/author of the sequence
  • Sequence weight
  • Creation date
  • One-line description of the sequence
  • Offset, or the number of leading gaps in a sequence that is part of an alignment or fragment assembly project
  • Known sequence features

rsf files are particularly useful if you are using SeqLab, the graphical (window-based) user interface to GCG. rsf files are more richly annotated than list files or msf files, which do not save sequence feature information as part of the file.
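
The sketch below reads one record of the kind shown above. It is inferred only from the layout of that example, not from a GCG specification: it assumes that a record sits between { and }, that an unindented line starts a new attribute, and that indented lines continue the previous attribute. The function name parse_rsf_record is hypothetical.

```python
def parse_rsf_record(text):
    """Return {attribute: [lines]} for the first { ... } record."""
    record = {}
    key = None
    inside = False
    for line in text.splitlines():
        stripped = line.strip()
        if stripped == "{":
            inside = True
            continue
        if stripped == "}":
            break
        if not inside or not stripped:
            continue
        if line[0].isspace():
            # Indented line: continuation of the current attribute.
            if key is not None:
                record[key].append(stripped)
        else:
            parts = line.split(None, 1)
            key = parts[0]
            record.setdefault(key, [])
            if len(parts) > 1:
                record[key].append(parts[1].strip())
    return record
```

With this layout, "".join(record["sequence"]) gives the sequence as a single string, and repeated attributes such as feature are collected into one list.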



FASTA format

Sequences in FASTA formatted files are preceded by a line starting with >. The first word on this line is the name of the sequence. The rest of the line is a description of the sequence. The remaining lines contain the sequence itself. Blank lines in a FASTA file are ignored, and so are spaces or other gap symbols (dashes, underscores, periods) in a sequence.

An example of a FASTA file containing a single sequence is:

 > seq1 This is the description of my first sequence.
 GGTACGTAGTAGCTGCTGCTACGTGCGCTAGCTAGTACGTCA
 CGACGTAGATGCTAGCTGACTCGATGC

Multi-FASTA format: FASTA files containing multiple sequences are just the same, with one sequence listed right after another. This format is accepted by many multiple sequence alignment programs as well as BLAST.
An example of a FASTA file containing multiple sequences is:

 > seq1 This is the description of my first sequence.
 GGTACGTAGTAGCTGCTGCTACGTGCGCTAGCTAGTACGTCA
 CGACGTAGATGCTAGCTGACTCGATGC
 > seq2 This is the description of my second sequence.
 GGTACGTAGTAGCTGCTGCTACGTGCGCTAGCTAGTACGTCA
 CGACGTAGATGCTAGCTGACTCGATGC
 > seq3 This is the description of my third sequence.
 GGTACGTAGTAGCTGCTGCTACGTGCGCTAGCTAGTACGTCA
 CGACGTAGATGCTAGCTGACTCGATGC
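
A minimal Python sketch of reading single or multi-FASTA text follows. The function name parse_fasta is hypothetical. The sketch skips blank lines and removes spaces inside sequence lines, but it keeps other gap symbols as they are, since whether to drop them depends on the program reading the file.

```python
def parse_fasta(text):
    """Return a list of (name, description, sequence) tuples."""
    records = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue                      # blank lines are skipped
        if line.startswith(">"):
            # First word after '>' is the name, the rest the description.
            header = line[1:].strip()
            name, _, description = header.partition(" ")
            records.append([name, description.strip(), []])
        elif records:
            records[-1][2].append(line.replace(" ", ""))
    return [(name, desc, "".join(parts)) for name, desc, parts in records]
```

For the multi-FASTA example above this would return three tuples, named seq1, seq2 and seq3, each with its description and a single joined sequence string.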


This file last modified Friday March 09, 2007