BLAST:Documentation

What are BLAST, mpiBLAST, and wwwBLAST?
Provided by NCBI, the Basic Local Alignment Search Tool (BLAST) finds regions of local similarity between sequences. The program compares nucleotide or protein sequences to sequence databases and calculates the statistical significance of matches. BLAST can be used to infer functional and evolutionary relationships between sequences and to help identify members of gene families.

http://www.ncbi.nlm.nih.gov/blast/

mpiBLAST is a freely available, open-source parallelization of BLAST. mpiBLAST segments the BLAST database and distributes it across cluster nodes, permitting BLAST queries to be processed on many nodes simultaneously.

http://mpiblast.lanl.gov/

wwwBLAST is a web interface for BLAST programs and is provided by NCBI. It can be ported for use with mpiBLAST by using scripts in the mpiBLAST distribution.

http://www.ncbi.nlm.nih.gov/BLAST/download.shtml

How Do They Work?
LAM/MPI, BLAST, and mpiBLAST are installed on each of the nodes and a query is made using mpiBLAST on a single node (usually the first node). mpiBLAST then uses BLAST and LAM/MPI to query protein databases which are also installed on every node. The results are then returned to the single node that executed the command.

wwwBLAST is a web application that executes the mpiBLAST query on the single node from the gateway machine's webserver. The results are then formatted for display in the interface. However, all of this depends on wwwBLAST working, which it currently does not.

Starting LAM/MPI
Before mpiBLAST can be run, LAM (the MPI distribution we use) must be started. More information about this can be found at MPI:Documentation.

Managing BLAST databases
The protein databases currently used by our installation of BLAST are UniProtKB/Swiss-Prot and Tetrahymena thermophila. They are available at

http://ca.expasy.org/sprot/download.html (uniprot_sprot_varsplic.fasta.gz)

ftp://ftp.tigr.org/pub/data/Eukaryotic_Projects/t_thermophila/annotation_dbs/ (TTA1.pep)

The local copies for each node are at, as defined in

After installing a new database, the target BLAST database must be formatted and segmented using mpiformatdb (distributed with mpiBLAST) before a search can be performed with mpiBLAST. For a detailed description of formatting a database, see the mpiBLAST guide at http://mpiblast.lanl.gov/Docs.Guide.html. To execute mpiformatdb you must specify the database to format and the number of partitions to split it into. The command line syntax would look something like this:

In our case, the value given to the -N flag is the number of nodes plus two. So, if our cluster has 25 physical nodes, format the database into 27 partitions. This is because mpiBLAST uses the extra processes for communication with the nodes.
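The arithmetic above can be sketched in a few lines of Python. This is only an illustration: the file name example_db.fasta and the helper names are hypothetical, and the -N, -i, and -p flags are taken from the mpiBLAST guide as we understand it, so check them against the installed version before relying on them. The script only builds and prints the command; it does not run mpiformatdb.

```python
import shlex
from typing import List

# Per the rule stated on this page: partitions = physical nodes + 2.
EXTRA_PARTITIONS = 2


def partition_count(physical_nodes: int) -> int:
    """Return the number of database partitions for a cluster of this size."""
    if physical_nodes < 1:
        raise ValueError("the cluster must have at least one node")
    return physical_nodes + EXTRA_PARTITIONS


def mpiformatdb_command(fasta_path: str, physical_nodes: int,
                        protein: bool = True) -> List[str]:
    """Build (but do not execute) an mpiformatdb argument list.

    Assumed flags: -N partitions, -i input FASTA, -p T/F for protein/nucleotide.
    """
    return [
        "mpiformatdb",
        "-N", str(partition_count(physical_nodes)),
        "-i", fasta_path,
        "-p", "T" if protein else "F",
    ]


if __name__ == "__main__":
    # example_db.fasta is a placeholder, not one of the databases named above.
    print(shlex.join(mpiformatdb_command("example_db.fasta", 25)))
    # prints: mpiformatdb -N 27 -i example_db.fasta -p T
```

Keeping the "plus two" rule in one named constant makes it easy to adjust if the cluster layout or the mpiBLAST version changes.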

Once the database is formatted, it must be distributed to each of the nodes by executing the following script on the gateway machine as root:

Once this step is complete, the database is ready to be queried using mpiBLAST.

Executing a BLAST query
'''All BLAST queries MUST be executed on child nodes. DO NOT RUN QUERIES ON THE HEAD NODE. If you do, you will need to reboot LAM.'''

A sample query is located in the following file:

To execute a BLAST query using mpiBLAST, you must use mpirun, a command that runs MPI programs on LAM nodes. A full description of mpirun can be found at http://www.ira.cnr.it/centrocal/cluster/mpirun.man.html. mpiBLAST requires the following options: -d [database] -i [query file] -p [BLAST program name] -o [output file for results]

This command runs mpiblast in parallel on all available nodes. The objective is to query the sequences in the query file against the named database and write the results to the output file. The -p flag specifies the BLAST program name. Since we are assuming a protein query against a protein database, we use blastp as the program name. For nucleotide queries on nucleotide databases, use blastn. Further information on program names can be found at http://www.ncbi.nlm.nih.gov/blast/producttable.shtml#pstab.
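As a rough illustration of how the options fit together, the following Python sketch assembles an mpirun command line from the four mpiBLAST options listed above. The database name, file names, and process count are hypothetical placeholders, and -np is the standard mpirun option for the number of processes; the right value for our cluster is an assumption here. The script only prints the command and does not start a query.

```python
import shlex
from typing import List

# The five standard NCBI BLAST program names accepted by -p.
BLAST_PROGRAMS = {"blastp", "blastn", "blastx", "tblastn", "tblastx"}


def mpiblast_command(processes: int, database: str, query_file: str,
                     output_file: str, program: str = "blastp") -> List[str]:
    """Build (but do not execute) an mpirun/mpiblast argument list."""
    if program not in BLAST_PROGRAMS:
        raise ValueError(f"unknown BLAST program: {program!r}")
    if processes < 1:
        raise ValueError("at least one MPI process is required")
    return [
        "mpirun", "-np", str(processes),
        "mpiblast",
        "-d", database,      # formatted database name
        "-i", query_file,    # FASTA file holding the query sequences
        "-p", program,       # blastp for protein vs. protein
        "-o", output_file,   # where the results are written
    ]


if __name__ == "__main__":
    # All names and the process count below are placeholders for illustration.
    cmd = mpiblast_command(27, "example_db.fasta", "query.fasta", "results.txt")
    print(shlex.join(cmd))
```

Validating the program name up front catches typos before a job is launched across the nodes.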

BLAST
The BLAST archives contain utilities that allow you to run searches on your own computer. Installation requires downloading a BLAST client and a BLAST database. More information can be found here:

http://www.ncbi.nlm.nih.gov/blast/download.shtml

mpiBLAST
Assuming an MPI implementation is installed, detailed instructions on installing mpiBLAST can be found here:

http://mpiblast.lanl.gov/Docs.Install.html

wwwBLAST
The wwwBLAST distribution can be downloaded at http://www.ncbi.nlm.nih.gov/BLAST/download.shtml, or by FTP at.

Extracting the folder to the webserver (typically at ) on the gateway machine allows you to view the interface, but there is significant work required to make it functional. Documentation for wwwBLAST can be found at http://www.ncbi.nlm.nih.gov/blast/docs/wwwblast.html, but it does not offer much for getting the interface running in a parallel computing environment.

mpiBLAST documentation at http://mpiblast.lanl.gov/Docs.Install.html#web instructs us to use two files, one CGI and one Perl script distributed with mpiBLAST, to replace the interface-to-BLAST functionality provided in wwwBLAST. Unfortunately,  is designed to use a PBS system, and furthermore, it is not yet configured for our system. Until  can be customized for our use, mpiBLAST must be run via the command line.

Currently, a CGI script located at  on the gateway machine serves as the start of a bridge between mpiBLAST and the webserver. This CGI script simply executes an mpiBLAST query on wolf001 via SSH (which in turn executes the query in parallel on each of the nodes) and returns the output to a web browser. This script could be extended to accept a user-supplied query, and eventually we may be able to build our own custom user interface.

Note that the only stipulation for getting this script to work is creating a public/private SSH key to connect to wolf001 as admin for use by apache. This key is located under  on the gateway and   on wolf001. Apache then adds the following flag to SSH in the CGI script:
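The flag itself is not shown on this page. OpenSSH selects a private key with the -i option, so the call made by the CGI script presumably resembles the sketch below. The key path and the remote command are hypothetical placeholders, and -o BatchMode=yes is our own suggestion (it makes ssh fail rather than prompt, which suits a non-interactive CGI process), not something confirmed by the existing script. The code only builds the command; it does not open a connection.

```python
import shlex
from typing import List


def ssh_command(identity_file: str, remote_command: str,
                user: str = "admin", host: str = "wolf001") -> List[str]:
    """Build (but do not execute) an ssh argument list using a key file."""
    return [
        "ssh",
        "-i", identity_file,       # private key readable by the apache user
        "-o", "BatchMode=yes",     # suggestion: never prompt for a password
        f"{user}@{host}",
        remote_command,            # parsed by the remote shell as one string
    ]


if __name__ == "__main__":
    # Placeholder remote command and key path, for illustration only.
    remote = shlex.join(["mpirun", "-np", "27", "mpiblast", "-d", "example_db.fasta",
                         "-i", "query.fasta", "-p", "blastp", "-o", "results.txt"])
    print(shlex.join(ssh_command("/path/to/hypothetical_key", remote)))
```

A real CGI script would pass this list to subprocess.run and write the captured output back to the browser.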

More information about SSH keys can be found by searching the web for "ssh keys".