Examples of common sequence file formats
gcg format
The programs in the GCG suite of biological analysis software accept sequences in gcg format. A typical gcg
formatted sequence looks like:
!!NA_SEQUENCE 1.0
test.seq Length: 5390 April 22, 1999 13:50 Type: N Check: 8167 ..
1 ttatataaaa aatgctgaaa acaggatcaa ggaggaagat ttaaatatag
51 atataatata tgggaagaaa cataaaaacg aaataagaac agctaaatat
The hallmarks of a gcg formatted sequence are:
- Begins with the line (all uppercase) !!NA_SEQUENCE 1.0 for nucleic acid
sequences or !!AA_SEQUENCE 1.0 for amino acid sequences. Do not edit or delete the file type
line if it is present. (optional)
- A description line which contains informative text describing what is in the file.
- A dividing line which contains the number of bases or residues in the sequence, when the file was created,
and importantly, two dots (..) which act as a divider between the descriptive information and the following
sequence information. (required)
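The role of the dividing line can be illustrated with a short Python sketch. This is a minimal example, not part of GCG: the function name parse_gcg is hypothetical, and it assumes the dividing line is the first line that ends with two dots and that the sequence lines contain only position numbers, blanks and sequence symbols.

```python
import re


def parse_gcg(text):
    """Return (declared_length, sequence) for a single-sequence gcg file.

    Assumption: the dividing line is the first line ending in '..'.
    """
    lines = text.splitlines()
    for index, line in enumerate(lines):
        if line.rstrip().endswith(".."):
            divider = index
            break
    else:
        raise ValueError("no '..' dividing line found")

    # The declared length is read from the header, up to and including the divider.
    header = " ".join(lines[: divider + 1])
    match = re.search(r"Length:\s*(\d+)", header)
    declared = int(match.group(1)) if match else None

    # Everything after the divider is sequence; drop position numbers and blanks.
    sequence = "".join(re.sub(r"[\d\s]", "", line) for line in lines[divider + 1:])
    return declared, sequence
```

Comparing the declared length with len(sequence) is a quick check that the file was not truncated or edited by hand.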
staden format
The programs of the Staden suite of biological analysis software accept sequences in staden format. A typical staden
format file looks like:
GGTACGTAGTAGCTGCTGCTACGTGCGCTAGCTAGTACGTCATTA
CGACGTAGATGCTAGCTGACTCGATGCAGTACGTAGTAGCTGCTG
CTACGTGCGCTAGCTAGTACGTCACGACGTAGATGCTAGCTGACT
CGATGC
Staden formatted sequence files contain the sequence and nothing else.
embl format
This is the format in which the sequences in the EMBL database are distributed. An example is:
ID A1MVRNA2 standard; unassigned RNA; VRL; 2593 BP.
XX
AC X01572; J02002; K02702;
XX
SV X01572.1
XX
DT 07-NOV-1985 (Rel. 07, Created)
DT 17-FEB-1997 (Rel. 50, Last updated, Version 6)
XX
DE Alfalfa mosaic virus (A1M4) RNA 2
XX
KW unidentified reading frame.
XX
OS Alfalfa mosaic virus
OC Viruses; ssRNA positive-strand viruses, no DNA stage; Bromoviridae;
OC Alfamovirus.
XX
RN [1]
RP 1-2593
RX MEDLINE; 83220723.
RX PUBMED; 6304618.
RA Cornelissen B.J., Brederode F.T.M., Veeneman G.H., van Boom J.H., Bol J.F.;
RT "Complete nucleotide sequence of alfalfa mosaic virus RNA 2";
RL Nucleic Acids Res. 11(10):3019-3025(1983).
XX
DR GOA; P03593.
DR SWISS-PROT; P03593; V90K_AMVLE.
XX
CC Data kindly reviewed (30-JAN-1986) by J.F. Bol
XX
FH Key Location/Qualifiers
FH
FT source 1..2593
FT /db_xref="taxon:12321"
FT /mol_type="unassigned RNA"
FT /note="RNA2"
FT /virion
FT /organism="Alfalfa mosaic virus"
FT CDS 55..2427
FT /db_xref="GOA:P03593"
FT /db_xref="SWISS-PROT:P03593"
FT /product="89.7 kd protein"
FT /protein_id="CAA25728.1"
FT /translation="MFTLLRCLGFGVNEPTNTSSSEYVPEYSVEEISNEVAELDSVDPL
FT FQCYKHVFVSLMLVRKMTQAAEDFLESFGGEFDSPCCRVYRLYRHFVNEDDAPAWAIPN
FT VVNEDSYDDYAYLREELDAIDSSFELLNEERELSEIT"
FT variation 1431..1431
FT /note="c in pAL2-1; t in pAL2-[21,41]"
XX
SQ Sequence 2593 BP; 736 A; 533 C; 547 G; 777 T; 0 other;
gtttttatct tttcgcgatt gaaaagataa gtttttcagt ttaatctttt caatatgttc 60
actcttttga gatgtctcgg attcggtgtt aatgaaccta ctaacacttc ctcatcagag 120
tatgttcccg agtattccgt tgaagagatt tccaacgaag tcgctgaact cgattcagtg 180
gatccattat tccaatgtta caaacatgtt tttgtatcat tgatgctcgt aagaaagatg 240
actcaagctg ccgaagactt cctcgagagt tttgggggag a 281
//
A great deal of information about the sequence can be included in embl formatted files. More detailed information on EMBL
sequence files can be found at the EBI.
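Because every line of an embl file starts with a two-letter line code, a few fields can be pulled out with very little code. The following Python sketch is illustrative only: the function name parse_embl and the choice of fields (ID, AC, DE and the sequence) are assumptions, and it ignores the feature table and references entirely.

```python
def parse_embl(text):
    """Extract the ID, accession numbers, description and sequence of one entry."""
    record = {"id": None, "accessions": [], "description": "", "sequence": ""}
    description_lines = []
    chunks = []
    in_sequence = False
    for line in text.splitlines():
        if line.startswith("//"):  # end of entry
            break
        if in_sequence:
            # Sequence lines hold blocks of bases plus a running count; keep letters only.
            chunks.append("".join(ch for ch in line if ch.isalpha()))
            continue
        code, _, rest = line.partition(" ")
        rest = rest.strip()
        if code == "ID" and rest:
            record["id"] = rest.split()[0].rstrip(";")
        elif code == "AC":
            record["accessions"].extend(rest.replace(";", " ").split())
        elif code == "DE":
            description_lines.append(rest)
        elif code == "SQ":
            in_sequence = True
    record["description"] = " ".join(description_lines)
    record["sequence"] = "".join(chunks)
    return record
```

For real work, an established parser such as the one in Biopython is likely a better choice than a hand-written one.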
clustal
Clustal format files begin with the word CLUSTAL. Sequences can be interleaved (as in the example
below) or sequential. Note: the multiple sequence alignment programs ClustalW
and ClustalX produce Clustal format files by default, but you can specify in "output format options" if you
want your results in a different format. An example Clustal file:
CLUSTAL W (1.74) multiple sequence alignment
seq1 -----------------------KSKERYKDENGGNYFQLREDWWDANRETVWKAITCNA
seq2 ---------------YEGLTTANGXKEYYQDKNGGNFFKLREDWWTANRETVWKAITCGA
seq3 ----KRIYKKIFKEIHSGLSTKNGVKDRYQN-DGDNYFQLREDWWTANRSTVWKALTCSD
seq4 ------------------------SQRHYKD-DGGNYFQLREDWWTANRHTVWEAITCSA
seq5 --------------------NVAALKTRYEK-DGQNFYQLREDWWTANRATIWEAITCSA
seq6 ------FSKNIX--QIEELQDEWLLEARYKD--TDNYYELREHWWTENRHTVWEALTCEA
seq7 -------------------------------------------------KELWEALTCSR
seq1 --GGGKYFRNTCDG--GQNPTETQNNCRCIG----------ATVPTYFDYVPQYLRWSDE
seq2 P-GDASYFHATCDSGDGRGGAQAPHKCRCDG---------ANVVPTYFDYVPQFLRWPEE
seq3 KLSNASYFRATC--SDGQSGAQANNYCRCNGDKPDDDKP-NTDPPTYFDYVPQYLRWSEE
seq4 DKGNA-YFRRTCNSADGKSQSQARNQCRC---KDENGKN-ADQVPTYFDYVPQYLRWSEE
seq5 DKGNA-YFRATCNSADGKSQSQARNQCRC---KDENGXN-ADQVPTYFDYVPQYLRWSEE
seq6 P-GNAQYFRNACS----EGKTATKGKCRCISGDP----------PTYFDYVPQYLRWSEE
seq7 P-KGANYFVYKLD-----RPKFSSDRCGHNYNGDP---------LTNLDYVPQYLRWSDE
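Reading an interleaved Clustal file amounts to joining the blocks that belong to each sequence name. The Python sketch below shows one way to do this; the function name parse_clustal is hypothetical, and it assumes that sequence names contain no blanks and that conservation lines, if present, start with whitespace.

```python
def parse_clustal(text):
    """Return a dict mapping each sequence name to its full aligned sequence."""
    lines = text.splitlines()
    if not lines or not lines[0].upper().startswith("CLUSTAL"):
        raise ValueError("not a Clustal file: first line must start with CLUSTAL")
    alignment = {}
    for line in lines[1:]:
        # Skip blank lines and conservation lines (which begin with whitespace).
        if not line.strip() or line[0].isspace():
            continue
        fields = line.split()
        if len(fields) < 2:
            continue
        name, chunk = fields[0], fields[1]
        # Append this block to whatever was already collected for the name.
        alignment[name] = alignment.get(name, "") + chunk
    return alignment
```

All sequences in the returned dictionary should have the same length; a mismatch usually means a block was lost or a name was changed between blocks.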
phylip
The first line of the input file contains the number of species and the number of characters, separated by blanks. The
information for each species follows, starting with a ten-character species name
(which can include punctuation marks and blanks) and continuing with the characters for that species. Phylip
format files can be interleaved, as in the example below, or sequential. More information about phylip format is available from
the authors at the University of Washington. An example phylip format file:
7 123
seq1 ---------- ---------- ---KSKERYK DENGGNYFQL REDWWDANRE
seq2 ---------- -----YEGLT TANGXKEYYQ DKNGGNFFKL REDWWTANRE
seq3 ---------- ---------- ----SQRHYK D-DGGNYFQL REDWWTANRH
seq4 ---------- ---------- NVAALKTRYE K-DGQNFYQL REDWWTANRA
seq5 ----KRIYKK IFKEIHSGLS TKNGVKDRYQ N-DGDNYFQL REDWWTANRS
seq6 ------FSKN IX--QIEELQ DEWLLEARYK D--TDNYYEL REHWWTENRH
seq7 ---------- ---------- ---------- ---------- ---------K
TVWKAITCNA --GGGKYFRN TCDG--GQNP TETQNNCRCI G---------
TVWKAITCGA P-GDASYFHA TCDSGDGRGG AQAPHKCRCD G---------
TVWEAITCSA DKGNA-YFRR TCNSADGKSQ SQARNQCRC- --KDENGKN-
TIWEAITCSA DKGNA-YFRA TCNSADGKSQ SQARNQCRC- --KDENGXN-
TVWKALTCSD KLSNASYFRA TC--SDGQSG AQANNYCRCN GDKPDDDKP-
TVWEALTCEA P-GNAQYFRN ACS----EGK TATKGKCRCI SGDP------
ELWEALTCSR P-KGANYFVY KLD-----RP KFSSDRCGHN YNGDP-----
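The rules described above (a header with two counts, ten-character names in the first block, and name-less continuation blocks) can be sketched in Python as follows. The function name parse_phylip_interleaved is hypothetical, and the sketch assumes the strict layout in which the name occupies exactly the first ten columns; files whose names have been shortened or padded differently will not parse correctly.

```python
def parse_phylip_interleaved(text):
    """Return a dict of name -> sequence for a strict interleaved phylip file."""
    lines = [line for line in text.splitlines() if line.strip()]
    n_species, n_chars = (int(value) for value in lines[0].split()[:2])

    names, sequences = [], []
    # First block: ten-character name, then the first stretch of characters.
    for line in lines[1:1 + n_species]:
        names.append(line[:10].strip())
        sequences.append("".join(line[10:].split()))

    # Later blocks carry no names; rows cycle through the species in order.
    for index, line in enumerate(lines[1 + n_species:]):
        sequences[index % n_species] += "".join(line.split())

    for name, sequence in zip(names, sequences):
        if len(sequence) != n_chars:
            raise ValueError(f"{name}: expected {n_chars} characters, got {len(sequence)}")
    return dict(zip(names, sequences))
```

The length check against the header count is what catches a file that shows only part of an alignment, or one where a block has been dropped.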
msf format
msf formatted multiple sequence files are most often created when using programs of the GCG suite. msf files
include the sequence name and the sequence itself, which is usually aligned with other sequences in the file. You can
specify a single sequence or many sequences within an msf file.
An example of part of an msf file, created using the GCG multiple sequence alignment program:
!!AA_MULTIPLE_ALIGNMENT 1.0
PileUp of: @hsp70.list
Symbol comparison table: GenRunData:blosum62.cmp CompCheck: 6430
GapWeight: 8
GapLengthWeight: 2
hsp70.msf MSF: 743 Type: P October 6, 1998 18:23 Check: 7784 ..
Name: S11448 Len: 743 Check: 3635 Weight: 1.00
Name: S06443 Len: 743 Check: 5861 Weight: 1.00
Name: S29261 Len: 743 Check: 7748 Weight: 1.00
//
1 50
S11448 ~~~~~~~~~~ ~~~~~~~~~~ ~~~~~~MTFD GAIGIDLGTT YSCVGVWQNE
S06443 ~~~~~~~~~~ ~~~~~~~~~~ ~~~~~~MTFD GAIGIDLGTT YSCVGVWQNE
S29261 ~~~~~~~~~~ ~~~~~~~~~~ ~~~~~~~~MG KIIGIDLGTT NSCVAIMDGT
Some of the hallmarks of an msf formatted sequence are the same as those of a single-sequence gcg format file:
- Begins with the line (all uppercase) !!NA_MULTIPLE_ALIGNMENT 1.0 for nucleic acid
sequences or !!AA_MULTIPLE_ALIGNMENT 1.0 for amino acid sequences. Do not edit or delete the file type
line if it is present. (optional)
- A description line which contains informative text describing what is in the file. You can add this
information to the top of the MSF file using a text editor. (optional)
- A dividing line which contains the number of bases or residues in the sequence, when the file was created,
and importantly, two dots (..) which act as a divider between the descriptive information and the following
sequence information. (required)
msf files contain some other information as well:
- Name/Weight: The name of each sequence included in the alignment, as well as its length and checksum
(both non-editable) and weight (editable). (required)
- Separating Line. Must include two slashes (//) to divide the name/weight information from the sequence
alignment. (required)
- Multiple Sequence Alignment. Each sequence named in the above Name/Weight lines is included. The
alignment allows you to view the relationship among sequences.
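The structure listed above (Name lines, a // separating line, then the alignment blocks) can be read with a short Python sketch. The function name parse_msf is hypothetical; the sketch assumes the separating line starts with // at the beginning of a line and that sequence names contain no blanks. Gap symbols such as ~ are kept as they appear.

```python
import re


def parse_msf(text):
    """Return a dict of name -> aligned sequence from an msf file."""
    header, separator, body = text.partition("\n//")
    if not separator:
        raise ValueError("no '//' separating line found")

    # Names come from the Name/Weight lines above the separator.
    names = re.findall(r"^\s*Name:\s*(\S+)", header, flags=re.MULTILINE)
    alignment = {name: "" for name in names}

    for line in body.splitlines():
        fields = line.split()
        # Coordinate lines such as '1 50' are skipped: their first field is not a name.
        if fields and fields[0] in alignment:
            alignment[fields[0]] += "".join(fields[1:])
    return alignment
```

Each assembled sequence can then be compared with the Len value on its Name line to confirm that no block is missing.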
GenBank format
This is the format in which the sequences in the GenBank database at
NCBI are distributed. An example is:
LOCUS AF012907 735 bp DNA linear BCT 12-FEB-1998
DEFINITION Serratia marcescens CarR gene, complete cds.
ACCESSION AF012907
VERSION AF012907.1 GI:2873370
KEYWORDS .
SOURCE Serratia marcescens
ORGANISM Serratia marcescens
Bacteria; Proteobacteria; Gammaproteobacteria; Enterobacteriales;
Enterobacteriaceae; Serratia.
REFERENCE 1 (bases 1 to 735)
AUTHORS Cox,A.R., Thomson,N.R., Bycroft,B., Stewart,G.S., Williams,P. and
Salmond,G.P.
TITLE A pheromone-independent CarR protein controls carbapenem antibiotic
synthesis in the opportunistic human pathogen Serratia marcescens
JOURNAL Microbiology 144 (Pt 1), 201-209 (1998)
MEDLINE 98129093
PUBMED 9467912
REFERENCE 2 (bases 1 to 735)
AUTHORS Cox,A.R.J.
TITLE Direct Submission
JOURNAL Submitted (08-JUL-1997) Biochemistry, University of Cambridge,
Tennis Court Road, Cambridge CB2 1QW, UK
FEATURES Location/Qualifiers
source 1..735
/organism="Serratia marcescens"
/mol_type="genomic DNA"
/strain="ATCC 39006"
/db_xref="ATCC:39006"
/db_xref="taxon:615"
CDS 1..735
/function="putative transcriptional activator of the
carbapenem biosynthetic genes"
/note="LuxR homolog"
/codon_start=1
/transl_table=11
/product="CarR"
/protein_id="AAC38168.1"
/db_xref="GI:2873371"
/translation="MNKEISYFIERKLKAYGNVLFAYFMMDKSSLSNPVFISNYPQKC
IDTYIDNKLFINDPVIHYSLKRVTPFSWDDNDLAVLRSENEDVAMYLREHDITVGYTF
VLHDHDNNLAILTIANNDEKNDFEDFIKNRENDLQMLLVTTHEKAMKHKHFVKGKTAP
LDCLQSALITPRETEVLFLVSRGNTYKEVSRTLGISEATVKFHINNSVRKLNVINSRH
AISKALELNLFRAFTGSLMTRKLVAI"
ORIGIN
1 atgaataaag agatcagtta ttttatagaa agaaagctaa aggcctatgg gaatgtttta
61 ttcgcttact ttatgatgga taaatcttct ttatcaaatc ctgtttttat ttctaactac
121 ccccaaaaat gtattgatac ctatattgat aataaacttt ttatcaatga tcctgttata
181 cattactctt taaaaagagt aactccattt tcctgggatg ataacgatct cgctgtatta
241 cggtccgaaa atgaagatgt tgccatgtat ctaagggagc atgacatcac tgtaggttac
301 acatttgttc ttcacgacca tgataacaat ctggcgatcc tgactattgc taacaatgat
361 gaaaaaaatg attttgagga ttttataaag aacagagaga atgatttaca aatgttgtta
421 gtgactactc acgaaaaagc aatgaaacat aaacacttcg ttaaaggtaa aacggcgccc
481 ttggattgct tgcaaagtgc attgattaca ccacgtgaaa cagaagtact tttcttggtc
541 agtagaggga atacttataa agaggtgtcc agaacactgg gtatcagtga agcaacagtg
601 aaattccata tcaataactc tgtcagaaaa cttaatgtca ttaattctcg ccatgccata
661 agcaaagcac ttgagctcaa tctgtttcga gcctttacgg gatctctcat gaccagaaaa
721 ttggttgcaa tatag
//
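In a GenBank entry the sequence itself sits between the ORIGIN line and the closing //, and the LOCUS line states the name and length. A minimal Python sketch that reads just those parts is shown below; the function name parse_genbank is hypothetical, and it assumes the length is the third blank-separated field of the LOCUS line, as in the example above. Annotation such as FEATURES is ignored.

```python
def parse_genbank(text):
    """Return (locus_name, declared_length, sequence) for one GenBank entry."""
    name = None
    declared = None
    chunks = []
    in_origin = False
    for line in text.splitlines():
        if line.startswith("//"):  # end of entry
            break
        if in_origin:
            # Sequence lines start with a position number; keep letters only.
            chunks.append("".join(ch for ch in line if ch.isalpha()))
        elif line.startswith("LOCUS"):
            fields = line.split()
            name = fields[1]
            declared = int(fields[2])
        elif line.startswith("ORIGIN"):
            in_origin = True
    return name, declared, "".join(chunks)
```

For the entry shown above, the declared length is 735, so the returned sequence should be 735 bases long.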
rsf files
rsf (Rich Sequence Format) files contain one or more sequences that may or may not be related. One way to create
a file in rsf format is to use GCG's NetFetch program, which downloads the requested entry from NCBI
and saves the result in RSF format. For example:
netfetch DQ160058
This will yield the following RSF output file:
!!RICH_SEQUENCE 1.0
..
{
name DQ160058
descrip Taraxacum officinale TO52-2 (To52-2) mRNA, partial cds.
type DNA
longname Taraxacum officinale
sequence-ID DQ160058
checksum 4492
creation-date 09/23/2005 09:57:02
strand 1
comments
LOCUS DQ160058 389 bp mRNA linear HTC 21-SEP-2005
DEFINITION Taraxacum officinale TO52-2 (To52-2) mRNA, partial cds.
ACCESSION DQ160058
VERSION DQ160058.1 GI:75755890
KEYWORDS HTC.
SOURCE Taraxacum officinale
ORGANISM Taraxacum officinale
Eukaryota; Viridiplantae; Streptophyta; Embryophyta; Tracheophyta;
Spermatophyta; Magnoliophyta; eudicotyledons; core eudicotyledons;
asterids; campanulids; Asterales; Asteraceae; Cichorioideae;
Cichorieae; Taraxacum.
REFERENCE 1 (bases 1 to 389)
AUTHORS Hulzink,R.J.M., van Dijk,P.J. and Biere,A.
TITLE Isolation and characterization of candidate genes for pathogen and
herbivory defense in common dandelion (Taraxacum officinale) upon
salicylic acid or methyl jasmonate treatment
JOURNAL Unpublished
REFERENCE 2 (bases 1 to 389)
AUTHORS Hulzink,R.J.M., van Dijk,P.J. and Biere,A.
TITLE Direct Submission
JOURNAL Submitted (09-AUG-2005) Centre for Terrestrial Ecology, Netherlands
Institute for Ecology, Boterhoeksestraat 48, Heteren 6666 ZG, The
Netherlands
FEATURES Location/Qualifiers
source 1. .389
/organism="Taraxacum officinale"
/mol_type="mRNA"
/db_xref="taxon:50225"
/dev_stage="seedling; fifth leaf stage"
/note="salicylic acid or methyl jasmonate responsive
genotype: triploid apomictic clone 68"
gene <1. .389
/locus_tag="To52-2"
CDS <1. .175
/locus_tag="To52-2"
/note="similar to aspartyl-tRNA synthetase or
aspartate--tRNA ligase; contains premature internal stop
codon"
/codon_start=2
/product="TO52-2"
/protein_id="ABA27003.1"
/db_xref="GI:75755891"
/translation="NTEIRKNSKENLFWKSMEPSFIFYTGTLLLSGLSTQCHVVTTSC
IATHLMFSLEGRR"
feature 1 389 -1 square solid source
source 1. .389
/organism="Taraxacum officinale"
/mol_type="mRNA"
/db_xref="taxon:50225"
/dev_stage="seedling; fifth leaf stage"
/note="salicylic acid or methyl jasmonate responsive
genotype: triploid apomictic clone 68"
feature 1 389 6 square solid gene
gene <1. .389
/locus_tag="To52-2"
feature 1 175 4 square solid CDS
CDS <1. .175
/locus_tag="To52-2"
/note="similar to aspartyl-tRNA synthetase or
aspartate--tRNA ligase; contains premature internal stop
codon"
/codon_start=2
/product="TO52-2"
/protein_id="ABA27003.1"
/db_xref="GI:75755891"
/translation="NTEIRKNSKENLFWKSMEPSFIFYTGTLLLSGLSTQCHVVTTSC
IATHLMFSLEGRR"
sequence
taatacagagatccgaaagaactctaaggaaaacttgttttggaaaagtatggaaccgag
ttttatattctacaccggtaccctcttgctgtccggcctttctacacaatgccatgtcgt
gacaacgagttgtatagcaactcatttgatgttttcattagaggggaggagataatttca
ggagctcaacgtgtgcacatacctgaacttttggaggcacgtgcaactgcatgtgggatt
gatctcaaaaccatatcatcatacattgattccttcaggtatggtgcgcctccacatggc
gggattggagttggattggaacgtgttgtgatgcttttttgtggccttgataacattcgt
aaagtctcacttttcccacgtgaccttcg
}
In addition to the sequence data, each sequence can be annotated with descriptive sequence information such as:
- Creator/author of the sequence
- Sequence weight
- Creation date
- One-line description of the sequence
- Offset, or the number of leading gaps in a sequence that is part of an alignment or fragment assembly project
- Known sequence features
rsf files are particularly useful if you are using Seqlab, the graphical user interface (windows-like) version of GCG.
rsf files are more richly annotated than list files or msf files, which do not save sequence feature information as
part of the file.
FASTA format
Sequences in FASTA formatted files are preceded by a line starting with >. The first word on this line
is the name of the sequence. The rest of the line is a description of the sequence. The remaining lines
contain the sequence itself. Blank lines in a FASTA file are ignored, and so are spaces or other gap symbols
(dashes, underscores, periods) in a sequence.
An example of a FASTA file containing a single sequence is:
> seq1 This is the description of my first sequence.
GGTACGTAGTAGCTGCTGCTACGTGCGCTAGCTAGTACGTCA
CGACGTAGATGCTAGCTGACTCGATGC
Multi-FASTA format: FASTA files containing multiple sequences follow the same rules,
with one sequence listed right after another. This format is accepted by many multiple sequence alignment programs
as well as BLAST.
An example of a FASTA file containing multiple sequences is:
> seq1 This is the description of my first sequence.
GGTACGTAGTAGCTGCTGCTACGTGCGCTAGCTAGTACGTCA
CGACGTAGATGCTAGCTGACTCGATGC
> seq2 This is the description of my second sequence.
GGTACGTAGTAGCTGCTGCTACGTGCGCTAGCTAGTACGTCA
CGACGTAGATGCTAGCTGACTCGATGC
> seq3 This is the description of my third sequence.
GGTACGTAGTAGCTGCTGCTACGTGCGCTAGCTAGTACGTCA
CGACGTAGATGCTAGCTGACTCGATGC
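The FASTA rules described above (a header line starting with >, the first word as the name, the rest as the description, and blank lines and gap symbols ignored) can be sketched in Python as follows. The function name parse_fasta is hypothetical. The sketch keeps only letters and the * symbol in the sequence, and it tolerates a blank between > and the name, as in the examples above.

```python
def parse_fasta(text):
    """Return a list of (name, description, sequence) tuples."""
    records = []
    name = description = None
    chunks = []
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line:  # blank lines are ignored
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if name is not None:
                records.append((name, description, "".join(chunks)))
            header = line[1:].strip()
            name, _, description = header.partition(" ")
            description = description.strip()
            chunks = []
        elif name is not None:
            # Drop spaces, digits and gap symbols; keep letters and '*'.
            chunks.append("".join(ch for ch in line if ch.isalpha() or ch == "*"))
    if name is not None:
        records.append((name, description, "".join(chunks)))
    return records
```

The same function handles single-sequence and multi-FASTA files, since a single sequence is just a list with one record.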