Basic Local Alignment Search Tool.
Sequence filtering
Many nucleotide and amino acid sequences are highly repetitive in nature. If your query sequence contains regions
of low complexity or repeats, BLAST (or FASTA) searches can return many unrelated, high-scoring sequences
(e.g. hits against proline-rich regions or poly-A tails). In other cases, your sequence may contain
regions of vector sequence, or repeat regions such as Alu sequences, that you either do not want included in your
sequence or, at the very least, want excluded from any sequence similarity searches you carry out.
Programs have been devised to filter out unwanted segments of sequence from within a larger sequence. For example,
filtering low complexity or repeat regions out of your query sequence before searching a database can reduce the
reporting of unrelated sequences that match by chance. Filtering works by running programs that identify regions
containing particular types of sequences. These regions are replaced with a series of N's (in the case of
nucleotide sequences), or X's (in the case of peptide sequences). One N or one X replaces each residue in the region.
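The replacement step described above can be sketched in a few lines of Python. This is only an illustration of the masking idea, not the code used by any of the filtering programs. The function name mask_regions and the 0-based, end-exclusive region coordinates are assumptions made for this example.

```python
def mask_regions(sequence, regions, mask_char):
    """Replace every residue inside the given regions with mask_char.

    regions is a list of (start, end) pairs, 0-based and end-exclusive
    (a convention chosen for this sketch, not taken from any program).
    """
    masked = list(sequence)
    for start, end in regions:
        # Clamp to the sequence bounds so stray coordinates do no harm.
        for i in range(max(start, 0), min(end, len(masked))):
            masked[i] = mask_char
    return "".join(masked)


# Peptide: one X per residue in the proline-rich region.
peptide = "GPLHIGIG" + "PPPPPP" + "GHILCCHPN"
print(mask_regions(peptide, [(8, 14)], "X"))

# Nucleotide: one N per base in the poly-A run.
dna = "ACGT" + "A" * 8 + "CGT"
print(mask_regions(dna, [(4, 12)], "N"))
```

The masked sequence always has the same length as the input, which is why coordinates in search results still refer to positions in your original sequence.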
Table 1. Sequence filtering types:
| Sequence type | Explanation | Programs to identify/filter these regions |
| Repeats | Low complexity DNA sequences, highly repetitive sequences, VNTRs (Variable Number of Tandem Repeats), STRs (Short Tandem Repeats). | DUST for nucleic acids, XNU for amino acids. RepeatMasker to screen against known human genomic repeat sequences. |
| Low information content sequences | Runs of a single amino acid or a few amino acids; runs of pyrimidines or purines in DNA. | DUST for nucleic acids, SEG for amino acids. |
| Vector | Due to experimental procedures, fragments of vector sequence can often be found in sample sequences, usually at the beginning or end, but occasionally within a sequence. | Cross_match |
Blast programs and filtering
BLAST programs have an option to run filtering on your query sequence before the actual search is carried out. The
search is performed with a filtered version of the query sequence; your original sequence file will not be affected.
blastall automatically filters sequences; you must TURN OFF the filtering if you do
not want it. By default, queries searched with the blastn program are filtered with DUST. Other programs use SEG.
NB. FASTA searches do not automatically filter for low complexity regions, so you must PRE-FILTER your query
sequence(s) to remove/mark low complexity regions. To pre-filter, run the desired filtering program on your sequence
and then use the output (the filtered version of your sequence) to carry out the search.
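As a rough sketch of the two situations above, the Python snippet below builds the command lines involved. It assumes the legacy NCBI blastall binary, where the -F option controls query filtering and -F F switches it off, and it assumes seg accepts -x directly after the file name. The helper names, the database name nr and the file name myseq.tfa are placeholders chosen for this example.

```python
import shlex


def blastall_command(program, database, query_file, filtering=True):
    """Build a blastall argument list (assumes legacy NCBI blastall)."""
    cmd = ["blastall", "-p", program, "-d", database, "-i", query_file]
    if not filtering:
        # -F F turns off the default DUST/SEG filtering of the query.
        cmd += ["-F", "F"]
    return cmd


def seg_prefilter_command(query_file):
    """Build a seg command that writes the query with low-complexity
    regions replaced by x characters (for use before a FASTA search)."""
    return ["seg", query_file, "-x"]


# Print the commands rather than running them, so the sketch works
# on a machine without the programs installed.
print(shlex.join(blastall_command("blastp", "nr", "myseq.tfa", filtering=False)))
print(shlex.join(seg_prefilter_command("myseq.tfa")) + " > myseq.filtered.tfa")
```

The second command corresponds to the pre-filtering step: its output file, not the original sequence file, is what you would pass to the FASTA search.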
Filtering for low complexity and internal repeat sequences
The three main filtering programs of this type are SEG, XNU and DUST.
SEG and XNU are used to filter amino acid sequences. SEG finds areas of low compositional complexity, for example
regions of biased amino acid composition like histidine-rich domains. XNU finds regions containing internal repeats
of short periodicity, for example, poly-GXX of collagen tails. DUST is used to filter nucleotide sequences.
You can run more than one filtering program on a sequence before searching, but you must still choose filtering programs
appropriate to your sequence type. SEG, XNU and DUST can be run by themselves against your sequences (i.e. they
do not have to be run as part of a BLAST search).
Running SEG on the command line
SEG has several options that you can change. To run with the default settings, you can just type:
seg mysequencefile > outfile
..where mysequencefile should be replaced by the name of the file that contains your sequence in Fasta
format. The > outfile part of this command causes the output of SEG to be saved to a file called
outfile. This is optional. If you do not include this part of the command line, your results will appear on screen
but will not be saved.
In the command line below, the information expected is listed between square brackets []:
seg [file] [window] [locut] [hicut] [options]
where:
[file] is the name of the file that contains your sequence in fasta format
[window] is the window size (default value is 12)
[locut] is the low (trigger) complexity (default is 2.2)
[hicut] is the high (extension) complexity (default is 2.5)
An example command line to change these values would be:
seg myseq.tfa 10 2.2 2.7 > outfile
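To give a feel for what the window and locut values control, here is a deliberately simplified Python sketch of window-based complexity filtering. It is not SEG's algorithm: it only measures the Shannon entropy (in bits) of each window and masks windows at or below a single cutoff, with no extension stage (hicut) and no trimming. The function names are made up for this example.

```python
import math
from collections import Counter


def window_entropy(window):
    """Shannon entropy, in bits, of the residue composition of a window."""
    counts = Counter(window)
    n = len(window)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def mask_low_entropy(seq, window=12, locut=2.2, mask_char="x"):
    """Mask every window whose entropy is at or below locut.

    Simplified stand-in for a low-complexity filter: one cutoff,
    no extension to a second threshold, no trimming of segments.
    """
    masked = list(seq)
    for start in range(len(seq) - window + 1):
        if window_entropy(seq[start:start + window]) <= locut:
            for i in range(start, start + window):
                masked[i] = mask_char
    return "".join(masked)


# A window of 12 different residues has high entropy and is left alone;
# a run of 12 prolines has zero entropy and is masked.
print(mask_low_entropy("ACDEFGHIKLMN"))
print(mask_low_entropy("PPPPPPPPPPPP"))
```

In this sketch, a smaller window or a higher cutoff masks more of the sequence, which is the general direction in which the corresponding SEG parameters act.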
The other options that can be given to seg on the command line are:
- -x each input sequence is represented by a single output sequence with low-complexity regions replaced by strings of 'x' characters
- -c <chars> the number of sequence characters per line (default is 60). Note: you should replace <chars> with an integer.
- -m minimum length for a high-complexity segment (default is 0). Shorter segments are merged with adjacent low-complexity segments
- -l show only low-complexity segments (fasta format)
- -h show only high-complexity segments (fasta format)
- -a show all segments (fasta format)
- -n do not add complexity information to the header line
- -o show overlapping low-complexity segments (default merge)
- -t maximum trimming of raw sequence (default is 100)
- -p prettyprint each segmented sequence (tree format)
- -q prettyprint each segmented sequence (block format)
Running XNU on the command line
XNU has several options that you can change. To run XNU with the default settings, you can just type:
xnu mysequencefile > outfile
..where mysequencefile should be replaced by the name of the file that contains your sequence in fasta format. The > outfile part of
this command causes the output of XNU to be saved to a file called outfile. This is optional. If you do not
include this part of the command line, your results will appear on screen but will not be saved.
The following options can be given to xnu on the command line:
- -60 Use PAM matrix 60
- -120 Use PAM matrix 120
- -250 Use PAM matrix 250
- -p The default is 0.01
- -s
- -n <max-search-offset> The default is 4; 0 searches all diagonals
- -m The default is 1
- -a ascending
- -d descending
- -. replace filtered residues with a . (default is replacement with an X)
- -o replace filtered residues with small letter versions of the residues in question (the original sequence is assumed to be in capital letters)
- -x replace filtered residues with an X (this occurs by default)
- -r print repeat segments instead of the unique regions: all unique regions are replaced by X's
- -v verbose output: reports the values used for certain parameters during the filtering at the top of the output
Running DUST on the command line
DUST has no options. You need to have your sequence in Fasta format and type:
dust mysequencefile > outfile
..where mysequencefile should be replaced by the name of the file that contains your sequence
in Fasta format. The > outfile part of this command causes the output of dust to be saved to a file called
outfile. This is optional. If you do not include this part of the command line, your results will appear on screen but
will not be saved.
One potential source of confusion for BLAST users: an alignment that matches at every unfiltered position can still be
reported with less than 100% identity, because positions masked by the filter do not count as matches. In the example
below, every residue outside the filtered region matches, yet the reported percent match is 73% (17 of 23 positions).
GPLHIGIGXXXXXXGHILCCHPN
||||||||      |||||||||
GPLHIGIGPPPPPPGHILCCHPN
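The arithmetic behind the 73% figure can be checked with a short Python sketch. It assumes that the percentage is taken over the full alignment length, that masked positions never count as matches, and that the result is truncated to a whole number. The function name percent_identity is made up for this example.

```python
def percent_identity(query, subject, mask_chars="Xx"):
    """Percent of aligned positions that match, as a truncated integer.

    Positions where the query holds a mask character never count as
    matches but still count towards the alignment length.
    """
    if len(query) != len(subject):
        raise ValueError("aligned sequences must have the same length")
    if not query:
        raise ValueError("alignment is empty")
    matches = sum(
        1 for q, s in zip(query, subject)
        if q == s and q not in mask_chars
    )
    return matches * 100 // len(query)


filtered_query = "GPLHIGIGXXXXXXGHILCCHPN"
subject = "GPLHIGIGPPPPPPGHILCCHPN"
print(percent_identity(filtered_query, subject))  # 17 of 23 positions -> 73
print(percent_identity(subject, subject))         # unfiltered query -> 100
```

For nucleotide alignments the same calculation applies with mask_chars set to "Nn".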
Cross_match
If you want to screen for vector sequence, and have that sequence X'd out of your query, you can use a program called
cross_match, developed by Phil Green. To use this program you need to have two things:
- a fasta formatted file containing the sequence(s) you want filtered (in the example below this is the file, queryseqs.tfa)
- a fasta formatted file containing the unwanted sequence(s) that you want filtered out of your sequence (e.g. file vecseqs.tfa)
You then need to type the following on the command line:
cross_match -screen queryseqs.tfa vecseqs.tfa
The screened version of your query sequence(s) will be written to a file called "queryseqs.tfa.screen", in which any regions matching sequences in the vecseqs.tfa file
are replaced with X's. You can then use this file to search databases using native BLAST or Fasta programs.
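Before searching with a screened file, it can be useful to check how much of each sequence was masked. The Python sketch below counts mask characters per record in Fasta-formatted text. It is a small helper written for this page, not part of cross_match, and the function name masked_fraction is made up for this example.

```python
def masked_fraction(fasta_text, mask_chars="X"):
    """Return {record name: fraction of residues that are mask characters}.

    The default counts only X. For masked nucleotide files you may want
    mask_chars="N"; do not count N for peptides, where it is asparagine.
    """
    sequences = {}
    current = None
    for line in fasta_text.splitlines():
        line = line.strip()
        if line.startswith(">"):
            # The record name is the first word of the header line.
            fields = line[1:].split()
            current = fields[0] if fields else ""
            sequences[current] = []
        elif line and current is not None:
            sequences[current].append(line)

    fractions = {}
    for name, chunks in sequences.items():
        seq = "".join(chunks)
        masked = sum(1 for residue in seq if residue in mask_chars)
        fractions[name] = masked / len(seq) if seq else 0.0
    return fractions


example = ">seq1 vector at the start\nXXXXACGT\nACGT\n>seq2\nACGTACGT\n"
for name, fraction in masked_fraction(example).items():
    print(name, round(fraction, 2))
```

To use it on a real file, read the file contents first, for example with open("queryseqs.tfa.screen").read(), and pass the text to the function.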
RepeatMasker
Human genome sequence is notoriously rich in dispersed repetitive elements, such as Alu, L1, and MER elements. Masking these elements can greatly
increase the signal-to-noise ratio of a BLAST search. (This contrasts with the low-complexity filters DUST and SEG,
which mask sequence based on compositional bias. Both low-complexity filters and repetitive element family filters
are helpful for BLASTing against genomes.)
By default, RepeatMasker screens DNA sequences against a curated library of human repetitive elements, and masks all the regions
that match a known repetitive element family. RepeatMasker can also be used to screen against a set of sequences you
define (using the -lib flag). Please refer to the official RepeatMasker
documentation at the University of Washington for all the options available. Note: the actual sequence comparisons
in RepeatMasker are carried out by Cross_match.
RepeatMasker screens for:
- simple repeats
- full length Alus
- full length interspersed repeats
- remaining Alus
- short interspersed repeats
- long interspersed repeats
- MIR and LINE2
- retroviral sequences
- LINE1s
- more simple repeats
- low complexity DNA
...in that order.
To run RepeatMasker, you need a Fasta file containing the sequence(s) you want to screen. To screen these sequence(s) type:
repeatmasker myseqs.tfa
If you want to see alignments of your sequence against the repeats/low complexity sequences when matches have been found, type instead:
repeatmasker -a myseqs.tfa
The output file with your masked sequences will be called myseqs.tfa.masked. If you have requested alignments, these
can be found in the file myseqs.tfa.align.
RepeatMasker was written by Arian Smit (U Washington Seattle) and uses the Cross_match Smith/Waterman program
from Phil Green (U Washington Seattle).