Computational Biology Research Group University of Oxford
Bioinformatics Frequently Asked Questions

Bioinformatics FAQ

There are a large number of bioinformatics programs available to CBRG account holders via this web site. These examples (Frequently Asked Questions) are used to illustrate the use of these tools in answering typical queries posed by researchers.

The major web-based tools from the CBRG are listed here.


Sequence formats

The format of a sequence refers to the way a sequence is written. There are many different formats, and many ways available to convert between formats. We shall consider a few of the most popular here.

Some formats have only a single sequence per file, some single or multiple sequences, and some are really only useful when they contain multiple sequences. In some cases, the analysis software may accept sequences in a variety of different formats. Others however only accept a sequence in one particular format. In the latter case, you may need to convert the format of your sequence to be able to carry out the analysis.
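To make the idea of a multi-sequence format concrete, here is a minimal Python sketch that reads fasta-formatted text, in which each record starts with a ">" header line followed by one or more lines of sequence. The function name parse_fasta and the two example records are hypothetical and are used only for illustration; this is not part of any CBRG or EMBOSS tool.

```python
from typing import Dict, List


def parse_fasta(text: str) -> Dict[str, str]:
    """Return {identifier: sequence} for every record in fasta-formatted text."""
    records: Dict[str, List[str]] = {}
    current = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # ignore blank lines between records
        if line.startswith(">"):
            # Header line: the identifier is the first word after '>'.
            words = line[1:].split()
            current = words[0] if words else ""
            records[current] = []
        elif current is not None:
            # Sequence line: a record may span several lines.
            records[current].append(line)
    return {name: "".join(chunks) for name, chunks in records.items()}


# Two made-up records, to show that one file can hold several sequences.
example = """>seq1 first hypothetical sequence
ACGTACGT
ACGT
>seq2 second hypothetical sequence
TTGACA
"""

sequences = parse_fasta(example)
print(len(sequences), "sequences found")
```

A format that allows only one sequence per file would have no equivalent of the repeated ">" header, which is why converting a multi-sequence file into such a format generally means producing one file per sequence.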

NB: Unlike GCG, all EMBOSS programs take care of this sequence inter-conversion automatically and this task does not have to be carried out manually.

Some common file formats:

* NB: "fasta" can refer to either the sequence format, or to the fasta database searching program!


Converting using seqret

Although all EMBOSS programs can understand a wide range of sequence file formats, you may at some stage have to prepare a sequence for use with a program outside the EMBOSS suite. Probably the most convenient tool for converting sequence file formats is the EMBOSS program 'seqret'. As well as pulling sequences from the databases on orac, this program converts between many different formats. It handles files containing more than one sequence as well as those containing a single sequence. The output format is chosen with the -osformat (output sequence format) qualifier.

Example: To convert a sequence in a file called 'AF282618.gb' from GenBank format to fasta format, and save the result in another file called 'AF282618.fasta', you would use:
seqret AF282618.gb -osformat fasta AF282618.fasta

For more information, see: seqret
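If you need to run the same conversion from a script, the command line shown in the example can be assembled in Python. The sketch below is only an illustration: the function names build_seqret_command and convert are hypothetical, and convert assumes that seqret is installed and on your PATH.

```python
import shutil
import subprocess
from typing import List


def build_seqret_command(infile: str, out_format: str, outfile: str) -> List[str]:
    """Build the seqret command line used in the example above."""
    return ["seqret", infile, "-osformat", out_format, outfile]


def convert(infile: str, out_format: str, outfile: str) -> None:
    """Run seqret; assumes the EMBOSS suite is installed and on the PATH."""
    if shutil.which("seqret") is None:
        raise RuntimeError("seqret was not found on PATH")
    # check=True raises CalledProcessError if seqret exits with an error.
    subprocess.run(build_seqret_command(infile, out_format, outfile), check=True)


command = build_seqret_command("AF282618.gb", "fasta", "AF282618.fasta")
print(" ".join(command))
```

Passing the arguments as a list rather than as a single string avoids shell quoting problems if a file name contains spaces.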

'seqret' has many alternative sequence format options:

  • fasta
  • text or plain or raw (Plain sequence, no annotation or heading)
  • genbank (GenBank entry format)
  • embl
  • swiss (Swiss-Prot entry format)
  • ncbi (NCBI style FASTA format with the database name, entry name and accession number separated by pipe ("|") characters)
  • gcg (Wisconsin Package GCG 9.x and 10.x format with the sequence type on the first line of the file)
  • gcg8 (GCG 8.x format where anything up to the first line containing ".." is considered as heading, and the remainder is sequence data)
  • nbrf or pir (NBRF (PIR database) format)
  • ig (Intelligenetics format)
  • codata
  • strider
  • acedb
  • staden
  • fitch
  • msf (Wisconsin Package GCG's MSF multiple sequence format)
  • clustal or aln (Clustal multiple sequence format)
  • phylip (PHYLIP interleaved format)
  • phylip3 (PHYLIP non-interleaved format)
  • asn1
  • hennig86
  • mega
  • meganon
  • nexus or paup (Nexus/PAUP format)
  • nexusnon or paupnon (Nexus/PAUP non-interleaved format)
  • jackknifer
  • jackknifernon
  • treecon


Converting using GCG

The GCG package comes with a number of useful format conversion programs. Some of the most commonly used include:

  • tofasta and fromfasta - convert from gcg format to fasta format, and from fasta format to gcg format, respectively
  • tostaden and fromstaden - convert from gcg format to staden format, and from staden format to gcg format, respectively
  • fromembl - converts from embl format to gcg format

Converting using ReadSeq

ReadSeq is particularly useful as it automatically detects many sequence formats and interconverts among them. To access ReadSeq using Unix, type:
ReadSeq
You will be prompted for a name for the output file, then for the format you want that output to be in, and then for the name of the input file. You will be further prompted for any extra options required (just press return if there are none). If you then look in your directory, you should find a file with the name you designated, containing the sequence in the required format.

Formats which ReadSeq currently understands:

  • IG/Stanford
  • GenBank/GB, genbank flatfile format
  • NBRF format
  • EMBL, EMBL flatfile format
  • GCG, single sequence format of GCG software
  • DNAStrider, for common Mac program
  • Fitch format, limited use
  • Pearson/Fasta, a common format used by Fasta programs and others
  • Zuker format, limited use. Input only.
  • Olsen, format printed by Olsen VMS sequence editor. Input only.
  • Phylip3.2, sequential format for Phylip programs
  • Phylip, interleaved format for Phylip programs (v3.3, v3.4)
  • Plain/Raw, sequence data only (no name, document, numbering)
  • MSF multi sequence format used by GCG software
  • PAUP's multiple sequence (NEXUS) format
  • PIR/CODATA format used by PIR
  • ASN.1 format used by NCBI
  • Pretty print with various options for nice looking output. Output only.

WARNING: If the sequences you are trying to convert contain gaps, you need to take special note of what symbol is being used to designate them. GCG uses " . " while other formats use " - ". Many programs won't work if you have the wrong gap symbol. GCG conversion programs will change a " . " to a " - " or vice versa, but ReadSeq does not! If the GCG conversion programs don't include the conversion you need, you can convert your sequence using ReadSeq, but you may then have to use a text editor to convert all the gap symbols to the appropriate type.
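As an alternative to editing the gap symbols by hand, the replacement can be scripted. The following Python sketch assumes the converted file is in fasta format, so that header lines start with ">" and every other line is sequence data; the function name convert_gaps and the example alignment are hypothetical. It would not be safe to apply it unchanged to formats such as GCG or MSF, where "." also appears in the header text.

```python
def convert_gaps(fasta_text: str, old: str = ".", new: str = "-") -> str:
    """Replace gap symbols on sequence lines only, leaving headers untouched."""
    converted = []
    for line in fasta_text.splitlines(keepends=True):
        if line.startswith(">"):
            # Header lines may legitimately contain '.', so keep them as they are.
            converted.append(line)
        else:
            converted.append(line.replace(old, new))
    return "".join(converted)


# A made-up two-sequence alignment using the GCG-style '.' gap symbol.
aligned = ">seq1 version 1.0\nACG..T\n>seq2\nA.GTTT\n"
fixed = convert_gaps(aligned)
print(fixed, end="")
```

To go in the opposite direction, call convert_gaps(text, old="-", new=".").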


Converting using Clustalx

The multiple sequence alignment program Clustalx is very useful not only for aligning sequences, but also for conversion of multiple sequence files from one format to another. To do this, you need to enter your sequences into Clustalx, and under the alignment > multiple sequence alignment menu, choose "output format options". If you select the format you want your multiple sequence file to be output as, and then choose "create alignment output file(s) now?", you can output your original multiple sequence file without carrying out the alignment process.


Additional comments:

There are many other file formats you may run across, and many other file format conversion programs. If you find that you need to convert to or from file formats that are not mentioned here, please contact us (genmail@molbiol.ox.ac.uk).



This file last modified Friday March 09, 2007