A large number of bioinformatics programs are available to CBRG account holders via this web site. These
examples (Frequently Asked Questions) illustrate how to use these tools
to answer typical queries posed by researchers.
Sequence formats
The format of a sequence refers to the way a sequence is written. There are many different formats, and many
ways available to convert between formats. We shall consider a few of the most popular here.
Some formats hold only a single sequence per file, some hold either single or multiple sequences, and
some are really only useful when they contain multiple sequences. Some analysis
software accepts sequences in a variety of different formats, while other software only accepts a sequence
in one particular format. In the latter case, you may need to convert the format of your sequence
before you can carry out the analysis.
NB: Unlike GCG, all EMBOSS programs take care of this sequence
inter-conversion automatically, so this task does not have to be carried out manually.
Some common file formats:
- single sequence per file
- multiple sequences per file
- either single or multiple sequences per file
* NB: "fasta" can refer to either the sequence format, or to the fasta database
searching program!
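If you are unsure which format a file is in, its first line is often a good clue. The following is a minimal Python sketch, not part of EMBOSS or GCG, and the function name `guess_format` is hypothetical. It assumes a single-entry file and checks only a few well-known opening markers: ">" for fasta, "LOCUS" for GenBank, "ID" for EMBL and Swiss-Prot entries, "CLUSTAL" for Clustal alignments and "#NEXUS" for Nexus files. Anything else is reported as unknown.

```python
def guess_format(text):
    """Guess a sequence format from the first non-blank line of `text`.

    Illustrative only: recognises a handful of opening markers and
    returns "unknown" for everything else.
    """
    for line in text.splitlines():
        if not line.strip():
            continue  # skip leading blank lines
        if line.startswith(">"):
            return "fasta"
        if line.startswith("LOCUS"):
            return "genbank"
        if line.startswith("ID "):
            # EMBL and Swiss-Prot entries both open with an ID line
            return "embl or swiss"
        if line.startswith("CLUSTAL"):
            return "clustal"
        if line.upper().startswith("#NEXUS"):
            return "nexus"
        return "unknown"  # first non-blank line matched nothing
    return "unknown"  # empty input
```

For example, `guess_format(open("AF282618.gb").read())` should report "genbank" for a standard GenBank entry. Formats without a distinctive first line (plain/raw, for instance) cannot be identified this way.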
Converting using seqret
Although all EMBOSS programs can understand a wide range of sequence file formats,
you may at some stage have to prepare a sequence for use with a program outside the EMBOSS suite. Probably the most convenient
tool for converting sequence file formats is the EMBOSS program 'seqret'. As well as pulling sequences from the databases on orac,
this program converts between many different formats. It handles files containing more than
one sequence as well as those containing a single sequence. The file format conversion is carried out using the
-osformat (output sequence format) qualifier.
Example: To convert a sequence (in a file called 'AF282618.gb') from GenBank format to fasta format and save
the results in another file called 'AF282618.fasta', you would use:
seqret AF282618.gb -osformat fasta AF282618.fasta
For more information, see: seqret
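If you have many files to convert, the same seqret command can be driven from a script. The following is a minimal Python sketch that mirrors the command line shown above. The helper names (`output_name`, `build_seqret_command`, `convert`) are hypothetical, and it assumes that 'seqret' is on your path and that naming the output file after the format is acceptable.

```python
import subprocess
from pathlib import Path


def output_name(infile, osformat):
    """Derive an output file name by swapping the extension for the format name."""
    return str(Path(infile).with_suffix("." + osformat))


def build_seqret_command(infile, outfile, osformat):
    """Build the argument list for the seqret call shown in the example above."""
    return ["seqret", infile, "-osformat", osformat, outfile]


def convert(infile, osformat):
    """Run seqret on one file; raises CalledProcessError if seqret fails."""
    outfile = output_name(infile, osformat)
    subprocess.run(build_seqret_command(infile, outfile, osformat), check=True)
    return outfile
```

Calling `convert("AF282618.gb", "fasta")` would run the command from the example and return 'AF282618.fasta'. To convert every GenBank file in a directory, you could loop over `Path(".").glob("*.gb")` and call `convert` on each name.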
'seqret' has many alternative sequence format options:
- fasta
- text or plain or raw (Plain sequence, no annotation or heading)
- genbank (GenBank entry format)
- embl
- swiss (Swiss-Prot entry format)
- ncbi (NCBI style FASTA format with the database name, entry name and accession number separated by pipe ("|") characters)
- gcg (Wisconsin Package GCG 9.x and 10.x format with the sequence type on the first line of the file)
- gcg8 (GCG 8.x format where anything up to the first line containing ".." is considered as heading, and the remainder is sequence data)
- nbrf or pir (NBRF (PIR database) format)
- ig (Intelligenetics format)
- codata
- strider
- acedb
- staden
- fitch
- msf (Wisconsin Package GCG's MSF multiple sequence format)
- clustal or aln (Clustal multiple sequence format)
- phylip (PHYLIP interleaved format)
- phylip3 (PHYLIP non-interleaved format)
- asn1
- hennig86
- mega
- meganon
- nexus or paup (Nexus/PAUP format)
- nexusnon or paupnon (Nexusnon/PAUPnon format)
- jackknifer
- jackknifernon
- treecon
Converting using GCG
The GCG package comes with a number of useful format conversion programs. Some of the most commonly used include:
- tofasta and fromfasta - convert from gcg format to fasta format, and from fasta format to gcg format, respectively
- tostaden and fromstaden - convert from gcg format to staden format, and from staden format to gcg format, respectively
- fromembl - converts from embl format to gcg format
Converting using ReadSeq
ReadSeq is particularly useful as it automatically detects many sequence formats and
interconverts among them. To access ReadSeq using Unix, type:
ReadSeq
You will be prompted for a name for the output file, then
prompted for the format you want that output to be in, and then for the name
of the input file. You will be further prompted for any required extra options (just press return if none).
If you look in your directory, you should find a file with the name you designated, containing sequence in the required
format.
Formats which ReadSeq currently understands:
- IG/Stanford
- GenBank/GB, genbank flatfile format
- NBRF format
- EMBL, EMBL flatfile format
- GCG, single sequence format of GCG software
- DNAStrider, for common Mac program
- Fitch format, limited use
- Pearson/Fasta, a common format used by Fasta programs and others
- Zuker format, limited use. Input only.
- Olsen, format printed by Olsen VMS sequence editor. Input only.
- Phylip3.2, sequential format for Phylip programs
- Phylip, interleaved format for Phylip programs (v3.3, v3.4)
- Plain/Raw, sequence data only (no name, document, numbering)
- MSF multi sequence format used by GCG software
- PAUP's multiple sequence (NEXUS) format
- PIR/CODATA format used by PIR
- ASN.1 format used by NCBI
- Pretty print with various options for nice looking output. Output only.
WARNING: If the sequences you are trying to convert contain gaps, you
need to take special note of what symbol is being used to designate them. GCG uses "."
while other formats use "-". Many programs won't work if you have the
wrong gap symbol. GCG conversion programs will change a "." to a "-"
or vice versa, but ReadSeq does not! If the GCG conversion programs don't include
the conversion you need, you can convert your sequence using ReadSeq, but you may then
have to use a text editor to convert all the gap symbols to the appropriate type.
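For large files, changing the gap symbols by hand is tedious, and a short script can do the same job. The following is a minimal Python sketch, assuming the file is in fasta format, where header lines begin with ">". The function name `convert_gaps` is hypothetical. Header lines are left untouched so that any "." in a sequence name or description is preserved.

```python
def convert_gaps(fasta_text, old=".", new="-"):
    """Replace gap symbol `old` with `new` in the sequence lines of fasta text.

    Lines starting with ">" are headers and are copied unchanged.
    """
    converted = []
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            converted.append(line)  # header: keep as-is
        else:
            converted.append(line.replace(old, new))
    if not converted:
        return ""
    return "\n".join(converted) + "\n"
```

To go the other way (from "-" to "."), call `convert_gaps(text, old="-", new=".")`. This approach is not safe for annotated formats such as GenBank or EMBL, where "." also appears in the annotation lines.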
Converting using Clustalx
The multiple sequence alignment program Clustalx is very useful not only for aligning sequences, but
also for conversion of multiple sequence files from one format to another. To do this, you need to
enter your sequences into Clustalx, and under the alignment > multiple sequence alignment menu, choose "output
format options". If you select the format you want your multiple sequence file to be
output as, and then choose "create alignment output file(s) now?", you can
output your original multiple sequence file without carrying out the alignment process.
Additional comments:
There are many other file formats you may run across. There are also many other file
format conversion programs. If you find that you need to convert to or from file formats that are not
mentioned here, please contact us (genmail@molbiol.ox.ac.uk).