Computational Biology Research Group University of Oxford

Basic Local Alignment Search Tool.


Sequence filtering

Many nucleotide and amino acid sequences are highly repetitive. If your query sequence contains regions of low complexity or repeats, BLAST (or FASTA) searches can return many unrelated, high-scoring sequences (e.g. hits against proline-rich regions or poly-A tails). In other cases, your sequence may contain regions of vector sequence, or repeat regions such as Alu sequences, that you either do not want in your sequence at all or, at the very least, want excluded from any similarity searches you carry out.

Programs have been devised to filter unwanted segments out of a larger sequence. For example, filtering low-complexity or repeat regions out of your query sequence before searching a database reduces the number of unrelated sequences that are reported only because they match by chance. Filtering works by running programs that identify regions containing particular types of sequence. These regions are replaced with a series of N's (for nucleotide sequences) or X's (for peptide sequences). One N or one X replaces each residue in the region, so the length of the sequence is unchanged.
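The replacement step described above can be sketched in a few lines of shell. This is only an illustration of the masking itself, not of how any filtering program finds the regions: the helper name mask_region and the coordinates used are made up for the example.

```shell
# mask_region SEQ START END CHAR
# Replace residues START..END (1-based, inclusive) of SEQ with CHAR,
# one mask character per residue, so the sequence length does not change.
mask_region() {
    printf '%s\n' "$1" | awk -v s="$2" -v e="$3" -v c="$4" '{
        masked = ""
        for (i = s; i <= e; i++) masked = masked c
        print substr($0, 1, s - 1) masked substr($0, e + 1)
    }'
}

# A peptide with a proline run at positions 9-14, masked with X
mask_region GPLHIGIGPPPPPPGHILCCHPN 9 14 X
# A nucleotide sequence with a poly-A tail at positions 9-16, masked with N
mask_region ACGTGCTAAAAAAAAA 9 16 N
```

In practice the filtering programs described below choose the regions for you; the point here is only that the masked sequence keeps its original length and coordinates.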



Table 1. Sequence filtering types:

Sequence type: Repeats
Explanation: low complexity DNA sequences, highly repetitive sequences, VNTRs (Variable Number of Tandem Repeats), STRs (Short Tandem Repeats).
Programs to identify/filter these regions: DUST for nucleic acids, XNU for amino acids. RepeatMasker to screen against known human genomic repeat sequences.

Sequence type: Low information content sequences
Explanation: runs of a single amino acid or a few amino acids; runs of pyrimidines or purines in DNA.
Programs to identify/filter these regions: DUST for nucleic acids, SEG for amino acids.

Sequence type: Vector
Explanation: Due to experimental procedures, fragments of vector sequence can often be found in sample sequences, usually at the beginning or end, but occasionally within a sequence.
Programs to identify/filter these regions: Cross_match.




Blast programs and filtering

BLAST programs have an option to run filtering on your query sequence before the actual search is carried out. The search is performed with a filtered version of the query sequence; your original sequence file is not affected. blastall filters query sequences automatically; you must TURN OFF the filtering if you do not want it. By default, queries searched with the blastn program are filtered with DUST. The other programs use SEG.
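In the NCBI blastall program the filter is controlled with the -F option (T to filter, F to switch filtering off). The sketch below only prints the command lines so that you can check them before running anything; the helper name blastall_cmd, the database name mydb and the query file names are placeholders for illustration.

```shell
# blastall_cmd PROGRAM DATABASE QUERY FILTER
# Print (do not run) a blastall command line.
# FILTER is T (filter the query, the default) or F (no filtering).
# The helper name, database name and file names are hypothetical.
blastall_cmd() {
    printf 'blastall -p %s -d %s -i %s -F %s\n' "$1" "$2" "$3" "$4"
}

# blastn search with the default DUST filtering switched off
blastall_cmd blastn mydb myseq.tfa F
# blastp search with the default SEG filtering left on
blastall_cmd blastp mydb myprot.tfa T
```

Copy the printed line to the prompt (substituting your own database and query file) to run the search.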

NB. FASTA searches do not automatically filter for low complexity regions, so you must PRE-FILTER your query sequence(s) to remove/mark low complexity regions. To pre-filter, run the desired filtering program on your sequence and then use the output (the filtered version of your sequence) to carry out the search.




Filtering for low complexity and internal repeat sequences

The three main filtering programs of this type are SEG, XNU and DUST.

SEG and XNU are used to filter amino acid sequences. SEG finds areas of low compositional complexity, for example regions of biased amino acid composition such as histidine-rich domains. XNU finds regions containing internal repeats of short periodicity, for example the poly-GXX of collagen tails. DUST is used to filter nucleotide sequences. You can run more than one filtering program on a sequence before searching, but you must still choose filtering programs appropriate to your sequence type. SEG, XNU and DUST can be run by themselves against your sequences (i.e. they do not have to be run as part of a BLAST search).




Running SEG on the command line

SEG has several options that you can change. To run with the default settings, you can just type:

seg mysequencefile > outfile

..where mysequencefile should be replaced by the name of the file that contains your sequence in FASTA format. The > outfile part of this command causes the output of SEG to be saved to a file called outfile. This is optional. If you do not include this part of the command line, your results will appear on screen but will not be saved.

In the command line below, the information expected is listed between square brackets []:

seg [file] [window] [locut] [hicut] [options]

where:
[file] is the name of the file that contains your sequence in fasta format
[window] is the window size (default value is 12)
[locut] is the low (trigger) complexity (default is 2.2)
[hicut] is the high (extension) complexity (default is 2.5)
An example command line to change these values would be:

seg myseq.tfa 10 2.2 2.7 > outfile

The other options that can be given to seg on the command line are:

  • -x each input sequence is represented by a single output sequence with low-complexity regions replaced by strings of 'x' characters
  • -c <chars> the number of sequence characters per line (default is 60). Note: you should replace <chars> with an integer.
  • -m minimum length for a high-complexity segment (default is 0). Shorter segments are merged with adjacent low-complexity segments
  • -l show only low-complexity segments (fasta format)
  • -h show only high-complexity segments (fasta format)
  • -a show all segments (fasta format)
  • -n do not add complexity information to the header line
  • -o show overlapping low-complexity segments (default merge)
  • -t maximum trimming of raw sequence (default is 100)
  • -p prettyprint each segmented sequence (tree format)
  • -q prettyprint each segmented sequence (block format)
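To get a feel for what the [window], [locut] and [hicut] values mean, it can help to compute a complexity score by hand. The sketch below assumes the score is the Shannon entropy, in bits, of the residue composition of one window. That is a simplification for illustration and not a reimplementation of SEG, which applies further steps. The helper name window_entropy is made up.

```shell
# window_entropy WINDOW
# Print the Shannon entropy (bits) of the residue composition of WINDOW.
# Assumption: this stands in for the complexity score that SEG compares
# against locut and hicut; it is a teaching sketch, not SEG itself.
window_entropy() {
    printf '%s\n' "$1" | awk '{
        n = length($0)
        for (i = 1; i <= n; i++) count[substr($0, i, 1)]++
        h = 0
        for (r in count) {
            p = count[r] / n
            h -= p * log(p) / log(2)
        }
        printf "%.2f\n", h
    }'
}

# 12-residue windows (the default window size)
window_entropy PPPPPPPPPPPP   # a single residue type
window_entropy AAAAAACCCCCC   # two residue types in equal amounts
window_entropy GPLHIGIGHILC   # a more mixed composition
```

Under this assumption, windows scoring below the trigger value (2.2 by default) are the ones that would start a low-complexity segment, so raising [locut] makes the filter more aggressive.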




Running XNU on the command line

XNU has several options that you can change. To run XNU with the default settings, you can just type:

xnu mysequencefile > outfile

..where mysequencefile should be replaced by the name of the file that contains your sequence in FASTA format. The > outfile part of this command causes the output of XNU to be saved to a file called outfile. This is optional. If you do not include this part of the command line, your results will appear on screen but will not be saved.

The following options can be given to xnu on the command line:

  • -60 Use PAM matrix 60
  • -120 Use PAM matrix 120
  • -250 Use PAM matrix 250
  • -p The default is 0.01
  • -s
  • -n <max-search-offset> The default is 4; use 0 to search all diagonals
  • -m The default is 1
  • -a ascending
  • -d descending
  • -. replace filtered residues with a . (default is replacement with an X)
  • -o replace filtered residues with small letter versions of the residues in question (the original sequence is assumed to be in capital letters)
  • -x replace filtered residues with an X (this occurs by default)
  • -r print repeat segments instead of the unique regions: all unique regions are replaced by X's
  • -v verbose output: reports the values used for certain parameters during the filtering at the top of the output
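If you have run XNU with -o, the filtered residues are in lower case rather than replaced. Should a later program expect X's instead, the lower-case residues can be converted afterwards. The following is a minimal sketch using sed; the helper name soft_to_hard is made up, and it assumes (as the -o description does) that the unfiltered residues are in capital letters.

```shell
# soft_to_hard < lowercase-masked.fasta > x-masked.fasta
# Replace every lower-case residue with X, leaving header lines
# (those starting with ">") untouched.
soft_to_hard() {
    sed '/^>/!s/[[:lower:]]/X/g'
}

# Example: a short peptide whose filtered region is in lower case
printf '>pep1 example peptide\nMKTaaaaaaLLV\n' | soft_to_hard
```

The header line is skipped deliberately, because sequence descriptions normally contain lower-case letters that must not be altered.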




Running DUST on the command line

DUST has no options. With your sequence in FASTA format, type:

dust mysequencefile > outfile

..where mysequencefile should be replaced by the name of the file that contains your sequence in Fasta format. The > outfile part of this command causes the output of dust to be saved to a file called outfile. This is optional. If you do not include this part of the command line, your results will appear on screen but will not be saved.

One potential cause of confusion for BLAST users: even when every unfiltered position of your query matches a database sequence, the reported percent match can be lower than 100% if the aligned region contains filtered residues, because the masked positions are not counted as matches. In the following example every position that can match does match, but only 17 of the 23 aligned positions are counted, so the reported percent match is 73%.

     GPLHIGIGXXXXXXGHILCCHPN
     ||||||||      |||||||||
     GPLHIGIGPPPPPPGHILCCHPN 
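The figure for the alignment above can be checked by counting matching positions. The sketch below assumes two ungapped strings of equal length and truncates the percentage to a whole number, which reproduces the 73% quoted here; the helper name percent_identity is made up for the example.

```shell
# percent_identity QUERY SUBJECT
# Count positions where the two (equal-length, ungapped) strings agree
# and print matches/length with the percentage truncated to an integer.
percent_identity() {
    awk -v a="$1" -v b="$2" 'BEGIN {
        n = length(a)
        m = 0
        for (i = 1; i <= n; i++)
            if (substr(a, i, 1) == substr(b, i, 1)) m++
        printf "%d/%d (%d%%)\n", m, n, int(100 * m / n)
    }'
}

# The filtered query against the matching database sequence
percent_identity GPLHIGIGXXXXXXGHILCCHPN GPLHIGIGPPPPPPGHILCCHPN
```

The six X's never equal the prolines they replaced, so they count against the match even though the underlying sequences are identical.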




Cross_match

If you want to screen for vector sequence, and have that sequence X'd out of your query, you can use a program called cross_match, developed by Phil Green. To use this program you need to have two things:

  • a fasta formatted file containing the sequence(s) you want filtered (in the example below this is the file, queryseqs.tfa)
  • a fasta formatted file containing the unwanted sequence(s) that you want filtered out of your sequence (e.g. file vecseqs.tfa)

You then need to type the following on the command line:

cross_match -screen queryseqs.tfa vecseqs.tfa

The screened version of your query sequence(s) will be written to a file called "queryseqs.tfa.screen", in which any sequence that matches a sequence in the vecseqs.tfa file is replaced with X's. You can then use this file to search databases using native BLAST or FASTA programs.
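Before searching with the screened file it is worth checking how much of each sequence was X'd out, since a read that is mostly vector will give little useful signal. A minimal sketch, assuming the screened file is ordinary FASTA with X marking the screened residues; the helper name masked_counts is made up.

```shell
# masked_counts < queryseqs.tfa.screen
# For each FASTA record on standard input print three columns:
# sequence name, number of X residues, total sequence length.
masked_counts() {
    awk '
        function report() { if (name != "") print name, masked, total }
        /^>/ { report(); name = substr($1, 2); masked = 0; total = 0; next }
        { total += length($0); masked += gsub(/[Xx]/, "") }
        END { report() }
    '
}

# Example with two short records standing in for a screened file
printf '>clone1 test read\nXXXXXXXXACGTACGT\nACGTACGTACGT\n>clone2\nACGTACGT\n' | masked_counts
```

On a real run you would type masked_counts < queryseqs.tfa.screen instead of piping in the example records.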




RepeatMasker

Human genome sequence is notoriously rich in dispersed repetitive elements, such as Alu, L1, and MER elements. Masking these elements before searching can greatly increase the signal-to-noise ratio of a BLAST search. (This contrasts with the low-complexity filters DUST and SEG, which mask sequence based on compositional bias. Both low-complexity filters and repetitive element family filters are helpful for BLASTing against genomes.)

By default, RepeatMasker screens DNA sequences against a curated library of human repetitive elements, and masks all the regions that match a known repetitive element family. RepeatMasker can also be used to screen against a set of sequences you define (using the -lib flag). Please refer to the official RepeatMasker documentation at the University of Washington for all the options available. Note: the actual sequence comparisons in RepeatMasker are carried out by Cross_match.

RepeatMasker screens for:

  • simple repeats
  • full length ALU's
  • full length interspersed repeats
  • remaining ALU's
  • short interspersed repeats
  • long interspersed repeats
  • MIR and LINE2
  • retroviral sequences
  • LINE1's
  • more simple repeats
  • low complexity DNA

...in that order.

To run RepeatMasker, you need a Fasta file containing the sequence(s) you want to screen. To screen these sequence(s) type:

repeatmasker myseqs.tfa

If you want to see alignments of your sequence against the repeats/low complexity sequences when matches have been found, type instead:

repeatmasker -a myseqs.tfa

The output file with your masked sequences will be called myseqs.tfa.masked. If you have requested alignments, these can be found in the file myseqs.tfa.align.
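To see where the masked regions lie, you can list the runs of N in the .masked file. The sketch below assumes masked bases appear as N (RepeatMasker's default masking character) and that the file is ordinary FASTA; note that any ambiguous N bases already present in your input will be reported too. The helper name masked_intervals is made up.

```shell
# masked_intervals < myseqs.tfa.masked
# For each FASTA record print one line per run of N:
# sequence name, start position, end position (1-based, inclusive).
masked_intervals() {
    awk '
        function scan(   pos, rest) {
            if (name == "") return
            pos = 0; rest = seq
            while (match(rest, /[Nn]+/)) {
                print name, pos + RSTART, pos + RSTART + RLENGTH - 1
                pos += RSTART + RLENGTH - 1
                rest = substr(rest, RSTART + RLENGTH)
            }
        }
        /^>/ { scan(); name = substr($1, 2); seq = ""; next }
        { seq = seq $0 }
        END { scan() }
    '
}

# Example record standing in for a masked file; runs of N may span lines
printf '>chrA example\nACGTNNNN\nNNACGTNN\n' | masked_intervals
```

The sequence lines of each record are joined before scanning, so a masked region that wraps across line breaks is reported as a single interval.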

RepeatMasker was written by Arian Smit (U Washington Seattle) and uses the Cross_match Smith/Waterman program from Phil Green (U Washington Seattle).


This file last modified Wednesday March 16, 2005




