Bioinformatics Resources and UVA HPC

The UVA research community has access to numerous bioinformatics software packages, installed directly or available through the bioconda Python modules.
Click here for a comprehensive list of currently-installed bioinformatics software.

Popular Bioinformatics Software

Below are some popular tools and useful links for their documentation and usage:

Tool	Version	Description	Useful Links
BEDTools	2.26.0	BEDTools utilities allow one to intersect, merge, count, complement, and shuffle genomic intervals from multiple files in widely-used genomic file formats such as BAM, BED, GFF/GTF, VCF.	Homepage Tutorial
BLAST+	2.7.1	BLAST+ is a suite of command-line tools that offers applications for BLAST search, BLAST database creation/examination, and sequence filtering.	Web BLAST Manual
BWA	0.7.17	BWA is a software package for mapping low-divergent sequences against a large reference genome, such as the human genome. It consists of three algorithms: BWA-backtrack, BWA-SW and BWA-MEM.	Homepage Manual
Bowtie2	2.2.9	Bowtie2 is a memory-efficient tool for aligning short sequences to long reference genomes.	Homepage Manual
FastQC	0.11.5	FastQC is a Java application that generates a comprehensive quality control report for raw sequencing data.	Homepage Documentation
GATK	4.0.0.0	The Genome Analysis Toolkit provides tools for variant discovery. In addition to SNP and INDEL identification in germline DNA and RNAseq data, GATK tools cover somatic short variant calling as well as copy number and structural variation.	User Guide
Picard	2.1.1	Picard is a set of command line tools for manipulating high-throughput sequencing (HTS) data and formats such as SAM/BAM/CRAM and VCF.	Homepage Documentation
SAMTools	1.7	SAMTools provides various utilities for manipulating alignments in the SAM format, including sorting, merging, indexing and generating alignments in a per-position format.	Homepage Manual
SPAdes	3.10.1	SPAdes provides pipelines for assembling genomes from Illumina and IonTorrent reads, as well as hybrid assemblies using PacBio, Oxford Nanopore and Sanger reads. It supports paired-end reads, mate-pairs and unpaired reads.	Homepage Manual
STAR	2.5.3a	Spliced Transcripts Alignment to a Reference (STAR) is an RNA-seq aligner based on an algorithm that uses sequential maximum mappable seed search in uncompressed suffix arrays, followed by a seed clustering and stitching procedure.	Homepage
vsearch	2.7.1	VSEARCH (short for Vectorized Search) is a toolkit for nucleotide sequence analyses. It supports clustering, chimera detection, database searching, merging of paired-end reads, and other sequence manipulation tasks.	Homepage

Bioinformatics Modules

To get an up-to-date list of the installed bioinformatics applications, log on to UVA HPC and run the following command in a terminal window:

module keyword bio

If you know which package you wish to use, you can look for it with

module spider <software>

For example,

module spider bcftools

This returns

----------------------------------------------------------------------------
  bcftools:
----------------------------------------------------------------------------
    Description:
      SAMtools is a suite of programs for interacting with high-throughput
      sequencing data. BCFtools - Reading/writing BCF2/VCF/gVCF files and
      calling/filtering/summarising SNP and short indel sequence variants

     Versions:
        bcftools/1.3.1
        bcftools/1.9

----------------------------------------------------------------------------
  For detailed information about a specific "bcftools" module (including how to
load the modules) use the module's full name.
  For example:

     $ module spider bcftools/1.9
----------------------------------------------------------------------------

Available versions may change, but the format should be the same.

To obtain more information about a specific module version, including a list of any prerequisite modules that must be loaded first, run the module spider command with the version specified; for example:

module spider bcftools/1.3.1

Using a Specific Software Module

To use a specific software package, run the module load command. The module load command in itself does not execute any of the programs but only prepares the environment, i.e. it sets up variables needed to run specific applications and find libraries provided by the module.

After loading a module, you are ready to run the application(s) provided by the module. For example:

module load bcftools/1.3.1
bcftools --version

Output:

bcftools 1.3.1
Using htslib 1.3.1
Copyright (C) 2016 Genome Research Ltd.
License GPLv3+: GNU GPL version 3 or later
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.

You will need to include the appropriate module load commands in your Slurm script.
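As a minimal sketch of what this could look like, the commands below write a single-core job script that loads the bcftools module from the example above before calling the program. The file name bcftools_job.slurm and the ten-minute time limit are hypothetical choices for illustration. Partition and allocation options are site-specific and are left out here; add the ones your account requires.

```shell
#!/bin/bash
# Write a minimal single-core job script to bcftools_job.slurm
# (hypothetical file name).
cat > bcftools_job.slurm <<'EOF'
#!/bin/bash
#SBATCH -N 1                 # request single node
#SBATCH --ntasks=1           # request a single task
#SBATCH --time=00:10:00      # hypothetical wall-time limit

# Prepare the environment inside the job, then run the program.
module purge
module load bcftools/1.3.1

bcftools --version
EOF

# Submit the script from a login node:
# sbatch bcftools_job.slurm
```

The module load line sits inside the job script, so the job does not depend on whichever modules happen to be loaded in your login session.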

General Considerations for Slurm Jobs

Most bioinformatics software packages are designed to run on a single compute node, with varying support for multithreading and the use of multiple CPU cores. Many can use only one core; for those, please request only a single task.

Some software is multithreaded. The number of threads is usually passed to the program through a command-line option. In this case the Slurm job script should contain the following two SBATCH directives:

#SBATCH -N 1                    # request single node
#SBATCH --cpus-per-task=<X>     # request multiple cpu cores

Replace <X> with the number of CPU cores to request. For many bioinformatics packages, requesting more than 8 CPU cores does not provide any significant performance gain. This limitation comes from the design of the code, not from a UVA HPC constraint.

Make sure that the number of cores you request matches the number you pass to the software. To keep the two in sync, you can often use the environment variable SLURM_CPUS_PER_TASK. For example,

biofoo -n ${SLURM_CPUS_PER_TASK}
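Slurm sets SLURM_CPUS_PER_TASK only when --cpus-per-task is specified, so a script that reads the variable directly can end up passing an empty value. The sketch below shows one way to guard against that with a default of one thread. The helper name get_threads is hypothetical, and biofoo with its -n option is the placeholder program from the example above, not a real tool.

```shell
#!/bin/bash
# Print the thread count to hand to a multithreaded program.
# Falls back to 1 when SLURM_CPUS_PER_TASK is unset or empty, for
# example when the script runs outside a Slurm job.
get_threads() {
    echo "${SLURM_CPUS_PER_TASK:-1}"
}

threads="$(get_threads)"
echo "Using ${threads} thread(s)"

# Substitute the real program and its thread option here:
# biofoo -n "${threads}"
```

With this pattern, the number of threads only needs changing in the #SBATCH --cpus-per-task directive.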

You should only deviate from this general resource request format if you are absolutely certain that the software package supports execution on more than one compute node.

Reference Genomes on the HPC system

Research Computing provides a set of ready-to-use reference sequences and annotations for commonly analyzed organisms in a convenient, accessible location on Rivanna:

/project/genomes/

The majority of files have been downloaded from Illumina’s genomes repository (iGenomes), which contains assembly builds and corresponding annotations from Ensembl, NCBI and UCSC. Each genome directory contains index files of the whole genome for use with aligners like BWA and Bowtie2. In addition, STAR2 index files have been generated for each of the Homo sapiens (human) and Mus musculus (mouse) genomic builds.
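The text does not spell out the directory layout below /project/genomes/, so one way to find the files for a given build is to search for them. The sketch below lists FASTA files under a directory. The function name list_fasta is hypothetical, and it assumes the FASTA files end in .fa or .fasta.

```shell
#!/bin/bash
# List FASTA files below a directory, sorted by path.
# Assumption: FASTA files carry a .fa or .fasta extension.
list_fasta() {
    local root="$1"
    find "$root" -type f \( -name '*.fa' -o -name '*.fasta' \) | sort
}

# Example on the cluster (path taken from the text above):
# list_fasta /project/genomes
```

Pointing the function at a single organism's subdirectory rather than the top-level directory keeps the search short.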

Organism	Source	Build
Arabidopsis thaliana	Ensembl	TAIR9
Arabidopsis thaliana	Ensembl	TAIR10
Arabidopsis thaliana	NCBI	build9.1
Arabidopsis thaliana	NCBI	TAIR10
Chlorocebus sabeus	NCBI	chlSab2
Danio rerio	Ensembl	GRCz10
Danio rerio	UCSC	danRer10
Drosophila melanogaster	Ensembl	BDGP6
Drosophila melanogaster	NCBI	build5.3
Drosophila melanogaster	NCBI	build5.41
Drosophila melanogaster	UCSC	dm6
Escherichia coli strain K12, DH10B	Ensembl	EB1
Escherichia coli strain K12, DH10B	NCBI	2008-03-17
Escherichia coli strain K12, MG1655	NCBI	2001-10-15
Homo sapiens	Ensembl	GRCh37
Homo sapiens	NCBI	GRCh38
Homo sapiens	UCSC	hg19
Homo sapiens	UCSC	hg38
Mus musculus	NCBI	GRCm38
Mus musculus	UCSC	mm9
Mus musculus	UCSC	mm10
Pan troglodytes	Ensembl	CHIMP2.1
Pan troglodytes	Ensembl	CHIMP2.1.4
Pan troglodytes	NCBI	build3.1
Pan troglodytes	UCSC	panTro3
Pan troglodytes	UCSC	panTro4

Updated November 17, 2020 | howto bioinformatics, genomics, rivanna, tools