Rattus norvegicus ESTs with BLAST and Slurm

The following is a short tutorial on running BLAST under Slurm, using FASTA nucleic acid (fna) sequence files for Rattus norvegicus. It assumes that BLAST (Basic Local Alignment Search Tool) is already installed.

First, create a database directory, download the datafile, extract, and load the environment variables for BLAST.

mkdir -p ~/applicationtests/BLAST/dbs
cd ~/applicationtests/BLAST/dbs
wget ftp://ftp.ncbi.nih.gov/refseq/R_norvegicus/mRNA_Prot/rat.1.rna.fna.gz
gunzip rat.1.rna.fna.gz
module load BLAST/2.2.26-Linux_x86_64

Once the file is extracted, there will be an fna-formatted sequence file, rat.1.rna.fna. An example header line for a sequence:

>NM_175581.3 Rattus norvegicus cathepsin R (Ctsr), mRNA
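
Because every sequence in a FASTA file begins with a header line starting with `>`, counting those lines gives a quick sanity check that the download and extraction completed. The following is a minimal sketch; the function name `count_fasta_records` is hypothetical and not part of BLAST.

```shell
# Count the sequences in a FASTA file by counting its header lines,
# i.e. the lines that start with '>'.
# count_fasta_records is an illustrative helper, not a BLAST command.
count_fasta_records() {
    grep -c '^>' "$1"
}

# Example usage (assumes the file is in the current directory):
#   count_fasta_records rat.1.rna.fna
```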

The next step is to format the file using `formatdb`, which prepares protein or nucleotide source databases so that they can be searched by BLAST. This versatile command has many options. In short, the following reads an input file (`-i`, always required), specifies the type of sequence (`-p F`, nucleotide; `T` would indicate protein), and sets the parse option (`-o T`, parse SeqId and create indexes).

formatdb -i rat.1.rna.fna -p F -o T

After formatting, there will be a larger collection of files in the database directory (including four binary packed data files) and a log file, `formatdb.log`.

[lev@spartan dbs]$ ls -lart
total 390836
drwxr-xr-x 4 lev unimelb 4096 Nov 13 08:45 ..
-rw-r--r-- 1 lev unimelb 306886177 Nov 13 09:32 rat.1.rna.fna
-rw-r--r-- 1 lev unimelb 0 Nov 13 09:33 rat.1.fna.nhr
-rw-r--r-- 1 lev unimelb 0 Nov 13 09:33 rat.1.fna.nin
-rw-r--r-- 1 lev unimelb 0 Nov 13 09:33 rat.1.fna.ntm
-rw-r--r-- 1 lev unimelb 1 Nov 13 09:33 rat.1.fna.nsq
-rw-r--r-- 1 lev unimelb 4067346 Nov 13 09:36 rat.1.rna.fna.nsd
drwxr-xr-x 2 lev unimelb 4096 Nov 13 09:36 .
-rw-r--r-- 1 lev unimelb 86416 Nov 13 09:36 rat.1.rna.fna.nsi
-rw-r--r-- 1 lev unimelb 13032873 Nov 13 09:36 rat.1.rna.fna.nhr
-rw-r--r-- 1 lev unimelb 1094580 Nov 13 09:36 rat.1.rna.fna.nin
-rw-r--r-- 1 lev unimelb 73425470 Nov 13 09:36 rat.1.rna.fna.nsq
-rw-r--r-- 1 lev unimelb 765 Nov 13 09:36 formatdb.log
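
Before submitting a long job it can be worth confirming that the core nucleotide database files (`.nhr`, `.nin`, `.nsq`) exist and are not empty. The following is a minimal sketch under that assumption; `check_blast_db` is a hypothetical helper name, and it checks only file presence and size, not database integrity.

```shell
# Check that the header (.nhr), index (.nin) and sequence (.nsq) files
# for a nucleotide database exist and are non-empty.
# check_blast_db is an illustrative helper, not part of BLAST.
check_blast_db() {
    _db="$1"
    _status=0
    for _ext in nhr nin nsq; do
        if [ ! -s "${_db}.${_ext}" ]; then
            echo "missing or empty: ${_db}.${_ext}" >&2
            _status=1
        fi
    done
    return "$_status"
}

# Example usage, run from the dbs directory:
#   check_blast_db rat.1.rna.fna && echo "database files present"
```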

The next step is to acquire the Expressed Sequence Tags (ESTs). An EST is a short sub-sequence of a cDNA sequence that identifies a gene transcript; ESTs are used for gene discovery and gene-sequence determination. Create the directory, download, and extract.

mkdir -p ~/applicationtests/BLAST/rat-ests
cd ~/applicationtests/BLAST/rat-ests
wget http://mirrors.vbi.vt.edu/mirrors/ftp.ncbi.nih.gov/genomes/Rattus_norvegicus/ARCHIVE/2002/rn_est.gz
gunzip rn_est.gz

A sample header takes the following format:

>gi|1154902|emb|X94495.1|X94495 RNSAP11G Rat brain, postnatal day 25 Rattus norvegicus cDNA clone sap11g, mRNA sequence /len=231
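
For a quick trial run it may help to search with only the first few ESTs rather than the whole file. The following is a minimal sketch of one way to do that with `awk`; the function name `fasta_head` and the output file name in the usage comment are hypothetical.

```shell
# Print the first N records of a FASTA file: each header line
# together with the sequence lines that follow it.
# fasta_head is an illustrative helper, not a BLAST command.
fasta_head() {
    awk -v n="$1" '/^>/ { seen++; if (seen > n) exit } { print }' "$2"
}

# Example usage: write the first 100 ESTs to a smaller query file.
#   fasta_head 100 rn_est > rn_est_100
```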

The final step is to submit the Slurm script. The following example uses the default queue and requests a single node with eight cores (one task per core) and a walltime of ten hours.

cd ~/applicationtests/BLAST/
sbatch blast.slurm

The script loads the BLAST module and runs the `blastall` command, taking `rn_est` as the input query file and `rat.1.rna.fna` as the database, running a `blastn` (nucleotide vs. nucleotide) search with an expectation value of 0.05, reporting at most five database sequences per query (this is a test), writing tabular alignments with comment lines to the output file `rat_blast_tab.txt`, and using 8 processor cores. Note that the core count must be passed to `blastall` (`-a 8`) even though the cores have already been allocated by the Slurm script. Allocating cores does not make a program use them; it scales only if it is explicitly told to do so!

#!/bin/bash
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=8
#SBATCH --time=10:00:00
module load BLAST/2.2.26-Linux_x86_64
blastall -i ./rat-ests/rn_est -d ./dbs/rat.1.rna.fna -p blastn -e 0.05 -v 5 -b 5 -T F -m 9 -o rat_blast_tab.txt -a 8
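
To avoid the thread count drifting out of step with the allocation, the value passed to `-a` can be read from the job environment rather than typed twice. The following is a minimal sketch, assuming the job was submitted with `--ntasks-per-node` so that Slurm sets `SLURM_NTASKS_PER_NODE`; the function name `blast_threads` is hypothetical.

```shell
# Resolve the thread count for blastall's -a flag from the Slurm allocation.
# SLURM_NTASKS_PER_NODE is set by Slurm when --ntasks-per-node is requested;
# fall back to 1 when the variable is unset (e.g. outside a job).
# blast_threads is an illustrative helper, not part of BLAST or Slurm.
blast_threads() {
    echo "${SLURM_NTASKS_PER_NODE:-1}"
}

# In blast.slurm, the final flag could then read:
#   -a "$(blast_threads)"
```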

In addition to the output file specified, there will also be an `error.log` file and a Slurm output file (e.g., `slurm-1405519.out`). The head of `rat_blast_tab.txt` takes the following format:

# BLASTN 2.2.26 [Sep-21-2011]
# Query: gi|1154902|emb|X94495.1|X94495 RNSAP11G Rat brain, postnatal day 25 Rattus norvegicus cDNA clone sap11g, mRNA sequence /len=231
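
The tabular format (`-m 9`) writes comment lines beginning with `#` followed by tab-separated hit lines of twelve fields, the first of which is the query identifier. The following is a minimal sketch for summarising such a file by counting hits per query; the function name `hits_per_query` is hypothetical.

```shell
# Count the hit lines per query in tabular BLAST output.
# Comment lines (starting with '#') are skipped; a hit line is taken
# to be any tab-separated line with at least twelve fields.
# Output: hit count, a tab, then the query identifier.
# hits_per_query is an illustrative helper, not a BLAST command.
hits_per_query() {
    awk -F '\t' '
        !/^#/ && NF >= 12 { count[$1]++ }
        END { for (q in count) print count[q] "\t" q }
    ' "$1"
}

# Example usage (the output order is not guaranteed, so sort it):
#   hits_per_query rat_blast_tab.txt | sort -rn | head
```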