Overview

Many commonly used bioinformatics software packages on Rivanna are available as individual modules or as Python packages bundled in the bioconda modules.

General considerations

Most bioinformatics software packages are designed to run on a single compute node, with varying support for multi-threading and use of multiple CPU cores. Accordingly, SLURM job scripts should contain the following two SBATCH directives:

#SBATCH -N 1                    # request single node
#SBATCH --cpus-per-task=<X>     # request multiple cpu cores

Replace <X> with the number of CPU cores to request. For many bioinformatics packages, requesting more than 8 CPU cores does not provide any significant performance gain. This limitation stems from how the software is written rather than from a Rivanna constraint.

You should only deviate from this general resource request format if you are absolutely certain that the software package supports execution on more than one compute node.
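
As a minimal sketch of a job script built around these two directives, the example below reads the core count that SLURM exports in the SLURM_CPUS_PER_TASK environment variable and passes it to the program's thread option, so the resource request and the actual thread count stay in sync. The value of 8 cores, the bowtie2 command line, and the file names (genome_index, reads.fq, aligned.sam) are illustrative assumptions, not values taken from Rivanna. The last line only prints the command, so the sketch can be tried outside a job.

```shell
#!/bin/bash
#SBATCH -N 1                    # request single node
#SBATCH --cpus-per-task=8       # request 8 cpu cores (example value)

# Print the number of cores SLURM granted to this task.
# Falls back to 1 when the variable is unset (e.g. outside a job).
thread_count() {
    echo "${SLURM_CPUS_PER_TASK:-1}"
}

# Assemble a multi-threaded command line. bowtie2 and its -p (threads)
# option are used for illustration; index and file names are hypothetical.
build_command() {
    echo "bowtie2 -p $(thread_count) -x genome_index -U reads.fq -S aligned.sam"
}

# Print the command instead of running it. In a real job script, load the
# module for the package first and then execute the command directly.
build_command
```

Deriving the thread count from SLURM_CPUS_PER_TASK rather than hard-coding it means that changing the --cpus-per-task directive is the only edit needed when the request changes.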

Available Bioinformatics Software

To get an up-to-date list of the installed bioinformatics applications, log on to Rivanna and run the following command in a terminal window:

module keyword bio

To get more information about a specific module version, run the module spider command, for example:

module spider bcftools/1.3.1


List of Bioinformatics Software Modules

Module Category Description
amber bio A suite of biomolecular simulation programs. Development began in the late 1970s, and the suite is maintained by an active development community.
ascmeme bio ASC+MEME is a fast motif discovery tool that is 10,000 times faster than MEME while preserving the same accuracy.
augustus bio AUGUSTUS is a program to find genes and their structures in one or more genomes.
bamtools bio BamTools provides both a programmer's API and an end-user's toolkit for handling BAM files.
bart bio BART (Binding Analysis for Regulation of Transcription) is a bioinformatics tool for predicting functional transcription factors (TFs) that bind at genomic cis-regulatory regions to regulate gene expression in the human or mouse genomes, given a query gene set or a ChIP-seq dataset as input.
bbmap bio BBMap includes a short read aligner, and other bioinformatic tools.
bcftools bio BCFtools is a set of utilities that manipulate variant calls in the Variant Call Format (VCF) and its binary counterpart BCF
bcl2fastq2 bio bcl2fastq Conversion Software both demultiplexes data and converts BCL files generated by Illumina sequencing systems to standard FASTQ file formats for downstream analysis.
bedops bio BEDOPS is an open-source command-line toolkit that performs highly efficient and scalable Boolean and other set operations, statistical calculations, archiving, conversion and other management of genomic data of arbitrary scale.
bedtools bio The BEDTools utilities allow one to address common genomics tasks such as finding feature overlaps and computing coverage. The utilities are largely based on four widely-used file formats: BED, GFF/GTF, VCF, and SAM/BAM.
bicseq2-norm bio BICseq2 is an algorithm developed for the normalization of high-throughput sequencing (HTS) data and the detection of copy number variations (CNV) in the genome. BICseq2 can be used for detecting CNVs with or without a control genome. BICseq2-norm is for normalizing potential biases in the sequencing data.
bicseq2-seg bio BICseq2 is an algorithm developed for the normalization of high-throughput sequencing (HTS) data and the detection of copy number variations (CNV) in the genome. BICseq2 can be used for detecting CNVs with or without a control genome. BICseq2-seg is for detecting CNVs based on the normalized data given by BICseq2-norm.
bioconda bio Bioconda is a channel for the conda package manager specializing in bioinformatics software.
bioperl bio Bioperl is the product of a community effort to produce Perl code which is useful in biology. Examples include Sequence objects, Alignment objects and database searching objects.
biopython bio Biopython is a set of freely available tools for biological computation written in Python by an international team of developers. It is a distributed collaborative effort to develop Python libraries and applications which address the needs of current and future work in bioinformatics.
bismark bio A tool to map bisulfite converted sequence reads and determine cytosine methylation states
blast bio Basic Local Alignment Search Tool, or BLAST, is an algorithm for comparing primary biological sequence information, such as the amino-acid sequences of different proteins or the nucleotides of DNA sequences.
bowtie2 bio Bowtie 2 is an ultrafast and memory-efficient tool for aligning sequencing reads to long reference sequences. It is particularly good at aligning reads of about 50 up to 100s or 1,000s of characters, and particularly good at aligning to relatively long (e.g. mammalian) genomes. Bowtie 2 indexes the genome with an FM Index to keep its memory footprint small: for the human genome, its memory footprint is typically around 3.2 GB. Bowtie 2 supports gapped, local, and paired-end alignment modes.
bwa bio Burrows-Wheeler Aligner (BWA) is an efficient program that aligns relatively short nucleotide sequences against a long reference sequence such as the human genome.
caviar bio CAVIAR is a statistical framework that quantifies the probability of each variant being causal while allowing an arbitrary number of causal variants.
cd-hit bio CD-HIT is a very widely used program for clustering and comparing protein or nucleotide sequences.
cellprofiler bio CellProfiler is an image processing package to generate morphometric measurements.
cellranger bio A set of analysis pipelines that perform sample demultiplexing, barcode processing, and single cell 3' gene counting.
cellranger-dna bio Cell Ranger DNA is a set of analysis pipelines that process Chromium single cell DNA sequencing output to align reads, identify copy number variation (CNV), and compare heterogeneity among cells.
circos bio Circos is a software package for visualizing data and information. It visualizes data in a circular layout - this makes Circos ideal for exploring relationships between objects or positions.
clearcut bio Clearcut is the reference implementation for the Relaxed Neighbor Joining (RNJ) algorithm by J. Evans, L. Sheneman, and J. Foster from the Initiative for Bioinformatics and Evolutionary Studies (IBEST) at the University of Idaho.
cp-analyst bio CellProfiler Analyst (CPA) allows interactive exploration and analysis of data, particularly from high-throughput, image-based experiments. Included is a supervised machine learning system which can be trained to recognize complicated and subtle phenotypes, for automatic scoring of millions of cells. CellProfiler is an image processing package to generate morphometric measurements.
cushaw3 bio CUSHAW is a well-established leading next-generation sequencing read alignment software package based on multi-core and many-core computing.
cutadapt bio Cutadapt finds and removes adapter sequences, primers, poly-A tails and other types of unwanted sequence from your high-throughput sequencing reads.
danpos bio Danpos is a toolkit for Dynamic Analysis of Nucleosome and Protein Occupancy by Sequencing, version 2
deeptools bio deepTools contains useful modules to process the mapped reads data for multiple quality checks, creating normalized coverage files in standard bedGraph and bigWig file formats, that allow comparison between different files (for example, treatment and control). Finally, using such normalized and standardized files, deepTools can create many publication-ready visualizations to identify enrichments and for functional annotations of the genome.
diamond bio DIAMOND is a sequence aligner for protein and translated DNA searches and functions as a drop-in replacement for the NCBI BLAST software tools. It is suitable for protein-protein search as well as DNA-protein search on short reads and longer sequences including contigs and assemblies, providing a speedup of BLAST ranging up to x20,000.
eigensoft bio The EIGENSOFT package combines functionality from our population genetics methods (Patterson et al. 2006) and our EIGENSTRAT stratification correction method (Price et al. 2006). The EIGENSTRAT method uses principal components analysis to explicitly model ancestry differences between cases and controls along continuous axes of variation; the resulting correction is specific to a candidate marker’s variation in frequency across ancestral populations, minimizing spurious associations while maximizing power to detect true associations. The EIGENSOFT package has a built-in plotting script and supports multiple file formats and quantitative phenotypes.
emboss bio EMBOSS is 'The European Molecular Biology Open Software Suite'. EMBOSS is a free Open Source software analysis package specially developed for the needs of the molecular biology (e.g. EMBnet) user community.
epic bio epic is a software package for finding medium to diffusely enriched domains in ChIP-seq data. It is a fast, parallel and memory-efficient implementation of the popular SICER algorithm.
exonerate bio Exonerate is a generic tool for pairwise sequence comparison. It allows you to align sequences using many alignment models, with either exhaustive dynamic programming or a variety of heuristics.
fasta bio The FASTA programs find regions of local or global (new) similarity between protein or DNA sequences, either by searching Protein or DNA databases, or by identifying local duplications within a sequence.
fastqc bio FastQC is a Java application which takes a FastQ file and runs a series of tests on it to generate a comprehensive QC report.
fastx-toolkit bio The FASTX-Toolkit is a collection of command line tools for Short-Reads FASTA/FASTQ files preprocessing.
freebayes bio FreeBayes is a Bayesian genetic variant detector designed to find small polymorphisms, specifically SNPs (single-nucleotide polymorphisms), indels (insertions and deletions), MNPs (multi-nucleotide polymorphisms), and complex events (composite insertion and substitution events) smaller than the length of a short-read sequencing alignment.
freesurfer bio FreeSurfer is a set of tools for analysis and visualization of structural and functional brain imaging data. FreeSurfer contains a fully automatic structural imaging stream for processing cross sectional and longitudinal data.
fsa bio FSA (Fast Statistical Alignment) is a probabilistic multiple sequence alignment algorithm which uses a distance-based approach to aligning homologous protein, RNA or DNA sequences.
fsl bio FSL is a comprehensive library of analysis tools for FMRI, MRI and DTI brain imaging data.
gatk bio The Genome Analysis Toolkit or GATK is a software package developed at the Broad Institute to analyse next-generation resequencing data. The toolkit offers a wide variety of tools, with a primary focus on variant discovery and genotyping as well as strong emphasis on data quality assurance. Its robust architecture, powerful processing engine and high-performance computing features make it capable of taking on projects of any size.
gd bio GD.pm - Interface to Gd Graphics Library
gdc-client bio The gdc-client provides several convenience functions over the GDC API which provides general download/upload via HTTPS.
gemma bio Genome-wide Efficient Mixed Model Association
genometools bio The GenomeTools genome analysis system is a free collection of bioinformatics tools (in the realm of genome informatics) combined into a single binary named gt. It is based on a C library named “libgenometools” which consists of several modules.
gmap-gsnap bio GMAP: A Genomic Mapping and Alignment Program for mRNA and EST Sequences. GSNAP: Genomic Short-read Nucleotide Alignment Program.
hic-pro bio HiC-Pro is an optimized and flexible pipeline for Hi-C data processing.
hisat2 bio HISAT2 is a fast and sensitive alignment program for mapping next-generation sequencing reads (both DNA and RNA) against the general human population (as well as against a single reference genome).
htslib bio A C library for reading/writing high-throughput sequencing data. This package includes the utilities bgzip and tabix
idr bio The IDR (Irreproducible Discovery Rate) framework is a unified approach to measure the reproducibility of findings identified from replicate experiments and provide highly stable thresholds based on reproducibility. The IDR method compares a pair of ranked lists of identifications (such as ChIP-seq peaks).
intervene bio Intervene is a tool for intersection and visualization of multiple genomic region sets.
irfinder bio IRFinder is a tool for detecting intron retention from RNA-Seq experiments.
jcuda bio Java bindings for NVIDIA CUDA and related libraries.
juicer bio Juicer is a one-click pipeline for processing terabase scale Hi-C datasets.
kallisto bio Kallisto is a program for quantifying abundances of transcripts from RNA-Seq data, or more generally of target sequences using high-throughput sequencing reads. It is based on the novel idea of pseudoalignment for rapidly determining the compatibility of reads with targets, without the need for alignment.
kraken bio Kraken is a system for assigning taxonomic labels to short DNA sequences, usually obtained through metagenomic studies.
longranger bio Long Ranger is a set of analysis pipelines that processes Chromium sequencing output to align reads and call and phase SNPs, indels, and structural variants.
macs2 bio MACS (Model-based Analysis of ChIP-Seq) identifies transcription factor binding sites. MACS captures the influence of genome complexity to evaluate the significance of enriched ChIP regions, and MACS improves the spatial resolution of binding sites through combining the information of both sequencing tag position and orientation.
manta bio Manta calls structural variants (SVs) and indels from mapped paired-end sequencing reads. It is optimized for analysis of germline variation in small sets of individuals and somatic variation in tumor/normal sample pairs. Manta discovers, assembles and scores large-scale SVs, medium-sized indels and large insertions within a single efficient workflow.
marge bio MARGE is a robust methodology that leverages a comprehensive library of genome-wide H3K27ac ChIP-seq profiles to predict key regulated genes and cis-regulatory regions in human or mouse.
meme bio The MEME Suite allows you to: * discover motifs using MEME, DREME (DNA only) or GLAM2 on groups of related DNA or protein sequences, * search sequence databases with motifs using MAST, FIMO, MCAST or GLAM2SCAN, * compare a motif to all motifs in a database of motifs, * associate motifs with Gene Ontology terms via their putative target genes, and * analyse motif enrichment using SpaMo or CentriMo.
mothur bio Mothur is a single piece of open-source, expandable software to fill the bioinformatics needs of the microbial ecology community.
mummer bio MUMmer is a system for rapidly aligning entire genomes, whether in complete or draft form. AMOS makes use of it.
muscle bio MUSCLE is one of the best-performing multiple alignment programs according to published benchmark tests, with accuracy and speed that are consistently better than CLUSTALW. MUSCLE can align hundreds of sequences in seconds. Most users learn everything they need to know about MUSCLE in a few minutes—only a handful of command-line options are needed to perform common alignment tasks.
ncbi-vdb bio The SRA Toolkit and SDK from NCBI is a collection of tools and libraries for using data in the INSDC Sequence Read Archives.
neuron bio Empirically-based simulations of neurons and networks of neurons.
ngs bio NGS is a new, domain-specific API for accessing reads, alignments and pileups produced from Next Generation Sequencing.
ngsplot bio ngs.plot allows easy visualization of next-generation sequencing (NGS) samples at functional genomic regions.
nseg bio Nseg is used to identify low complexity sequences.
openms bio OpenMS is an open-source software C++ library for LC-MS data management and analyses. It offers an infrastructure for rapid development of mass spectrometry related software.
openslide bio OpenSlide is a C library that provides a simple interface to read whole-slide images.
openslide-python bio Python bindings for the OpenSlide library.
paintor bio PAINTOR is a statistical fine-mapping method that integrates functional genomic data with association strength from potentially multiple populations (or traits) to prioritize variants for follow-up analysis.
patric bio PATRIC is an integration of different types of data and software tools that support research on bacterial pathogens.
peakseq bio PeakSeq is a program for identifying and ranking peak regions in ChIP-Seq experiments. It takes as input, mapped reads from a ChIP-Seq experiment, mapped reads from a control experiment and outputs a file with peak regions ranked with increasing Q-values.
picard bio A set of tools (in Java) for working with next generation sequencing data in the BAM (http://samtools.github.io/hts-specs) format.
plink bio PLINK is a free, open-source whole genome association analysis toolset, designed to perform a range of basic, large-scale analyses in a computationally efficient manner. The focus of PLINK is purely on analysis of genotype/phenotype data, so there is no support for steps prior to this (e.g. study design and planning, generating genotype or CNV calls from raw data). Through integration with gPLINK and Haploview, there is some support for the subsequent visualization, annotation and storage of results.
prokka bio Prokka is a software tool for the rapid annotation of prokaryotic genomes.
proteowiz bio ProteoWizard provides a set of open-source, cross-platform software libraries and tools (e.g. msconvert, Skyline, IDPicker, SeeMS) that facilitate proteomics data analysis. The libraries enable rapid tool creation by providing a robust, pluggable development framework that simplifies and unifies data file access, and performs standard chemistry and LCMS dataset computations.
qiime bio QIIME is an open-source bioinformatics pipeline for performing microbiome analysis from raw DNA sequencing data.
raxml bio RAxML search algorithm for maximum likelihood based inference of phylogenetic trees.
rdp-classifier bio The RDP Classifier is a naive Bayesian classifier that can rapidly and accurately provide taxonomic assignments from domain to genus, with confidence estimates for each assignment.
relion bio RELION (for REgularised LIkelihood OptimisatioN, pronounced rely-on) is a stand-alone computer program that employs an empirical Bayesian approach to refinement of (multiple) 3D reconstructions or 2D class averages in electron cryo-microscopy (cryo-EM).
rsem bio RNA-Seq by Expectation-Maximization
sicerpy bio SICER.py is a Python wrapper for the SICER peak caller software.
slim bio SLiM is an evolutionary simulation package that provides facilities for very easily and quickly constructing genetically explicit individual-based evolutionary models.
tabix bio Generic indexer for TAB-delimited genome position files
taggraph bio TagGraph is a computational tool that provides an unrestricted string-based search method that is as much as 350-fold faster than existing approaches, and a probabilistic validation model that was optimized for post-translational modification assignments.
tophat bio TopHat is a fast splice junction mapper for RNA-Seq reads.
trimgalore bio Trim Galore is a wrapper around Cutadapt and FastQC to consistently apply adapter and quality trimming to FastQ files, with extra functionality for RRBS data.
trimmomatic bio Trimmomatic performs a variety of useful trimming tasks for Illumina paired-end and single-end data.
ucsc-tools bio A set of genome utilities developed at the University of California Santa Cruz.
vcftools bio The aim of VCFtools is to provide easily accessible methods for working with complex genetic variation data in the form of VCF files.
velvet bio Sequence assembler for very short reads
vep bio VEP determines the effect of your variants (SNPs, insertions, deletions, CNVs or structural variants) on genes, transcripts, and protein sequence, as well as regulatory regions.
viennarna bio The Vienna RNA Package consists of a C code library and several stand-alone programs for the prediction and comparison of RNA secondary structures.
vsearch bio VSEARCH supports de novo and reference based chimera detection, clustering, full-length and prefix dereplication, rereplication, reverse complementation, masking, all-vs-all pairwise global alignment, exact and global alignment searching, shuffling, subsampling and sorting. It also supports FASTQ file analysis, filtering, conversion and merging of paired-end reads.

Using a Specific Software Module

To use a specific software package, run the module load command. The module load command by itself does not execute any programs; it only prepares the environment, i.e. it sets the variables needed to run the applications and find the libraries provided by the module.

After loading a module, you are ready to run the application(s) provided by the module. For example:

module load bcftools/1.3.1
bcftools --version

Output:

bcftools 1.3.1
Using htslib 1.3.1
Copyright (C) 2016 Genome Research Ltd.
License GPLv3+: GNU GPL version 3 or later
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
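
Because module load only adjusts the environment, a job script can run for a while before failing if the expected program is not actually on the PATH. One way to catch this early is sketched below. The helper name require_tool is hypothetical, and the check relies only on the shell's standard command -v builtin, not on any feature of the module system.

```shell
#!/bin/bash

# require_tool NAME: succeed if NAME resolves to an executable, function,
# or builtin in the current environment; otherwise report and return 1.
require_tool() {
    if command -v "$1" >/dev/null 2>&1; then
        return 0
    fi
    echo "error: '$1' not found; was the module loaded?" >&2
    return 1
}

# Example: sh is present on any POSIX system, so this check passes.
require_tool sh && echo "sh is available"
```

In a job script, a line such as `require_tool bcftools || exit 1` placed right after the module load line would stop the job before any work is attempted.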

Bioconda Python packages

Many bioinformatics Python packages are now maintained and available for the popular Anaconda Python distribution. Python packages for the Anaconda distribution are maintained in a variety of different bundles, called channels. The bioconda channel is specifically set up for the maintenance and distribution of popular bioinformatics packages. On Rivanna, we offer two bioconda modules, one using Python 2.7 and the other using Python 3.6.

To see the list of available bioconda modules, run the module spider command:

module spider bioconda

Output:

-------------------------------------------------------------------------
bioconda:
-------------------------------------------------------------------------
    Description:
      Bioconda is a channel for the conda package manager specializing in bioinformatics software.

     Versions:
        bioconda/py2.7
        bioconda/py3.6

The bioconda/py2.7 and bioconda/py3.6 modules are backed by Anaconda distributions using Python 2.7 and Python 3.6, respectively. To view an up-to-date list of the Python packages provided by a particular bioconda module, load the bioconda module and run the conda list command. For example:

module load bioconda/py2.7
conda list | grep bioconda

The grep command filters the Python package list to only show the Bioconda channel packages. The output may look like this:

# packages in environment at /apps/software/standard/core/bioconda/py2.7:
bcftools                  1.9                  h4da6232_0    bioconda
biopython                 1.68                np110py27_1    bioconda
htslib                    1.9                  hc238db4_4    bioconda
kallisto                  0.44.0               h7d86c95_2    bioconda
libdeflate                1.0                  h470a237_0    bioconda
macs2                     2.1.1.20160309   py27h7eb728f_3    bioconda
mmtf-python               1.0.2                    py27_0    bioconda
pybigwig                  0.3.12           py27hdfb72b2_0    bioconda
salmon                    0.11.2               h445c947_0    bioconda
samtools                  1.9                  h46bd0b3_0    bioconda

Note that not all bioinformatics packages have been ported from Python 2.7 to Python 3 yet. If you cannot find a specific Python package in the bioconda/py3.6 module, it is worth checking the bioconda/py2.7 module.
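
To check for one particular package rather than scanning the whole list, the conda list output can be filtered on its channel column. The sketch below assumes the four-column layout (name, version, build, channel) shown in the listing above. The function names bioconda_packages and has_package are hypothetical helpers, and the sample rows are copied from that listing so the example runs without conda.

```shell
#!/bin/bash

# bioconda_packages: read `conda list` output on stdin and print
# "name version" for each row whose channel column (4th field) is bioconda.
# Comment lines starting with '#' are skipped.
bioconda_packages() {
    awk '!/^#/ && $4 == "bioconda" { print $1, $2 }'
}

# has_package NAME: read `conda list` output on stdin; succeed only if
# NAME is listed as a Bioconda channel package.
has_package() {
    bioconda_packages | awk -v name="$1" '$1 == name { found = 1 } END { exit !found }'
}

# Sample rows copied from the listing above, so the sketch runs without conda.
sample='# packages in environment at /apps/software/standard/core/bioconda/py2.7:
bcftools                  1.9                  h4da6232_0    bioconda
samtools                  1.9                  h46bd0b3_0    bioconda'

printf '%s\n' "$sample" | bioconda_packages
```

With a bioconda module loaded, `conda list | has_package kallisto` would report through its exit status whether the package is available. Matching on the channel column, unlike a plain grep, avoids false hits from rows that merely contain the word bioconda elsewhere.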

Reference Genomes on Rivanna

Research Computing maintains a set of ready-to-use reference sequences and annotations for commonly analyzed organisms in a convenient, shared storage location on Rivanna.

The majority of files have been downloaded from Illumina’s genomes repository (iGenomes), which contains assembly builds and corresponding annotations from Ensembl, NCBI and UCSC. Each genome directory contains index files of the whole genome for use with aligners like BWA and Bowtie2. In addition, STAR2 index files have been generated for each of the Homo sapiens (human) and Mus musculus (mouse) genomic builds.

Rivanna storage PATH for your genome and index files of interest:

Select a genome of interest and view location of its reference sequence and index files on Rivanna.

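Organism                             Source   Build        Whole Genome FASTA   Index Files (BWA, Bowtie2, STAR2)
Arabidopsis thaliana                 Ensembl  TAIR9
Arabidopsis thaliana                 Ensembl  TAIR10
Arabidopsis thaliana                 NCBI     build9.1
Arabidopsis thaliana                 NCBI     TAIR10
Chlorocebus sabaeus                  NCBI     chlSab2
Danio rerio                          Ensembl  GRCz10
Danio rerio                          UCSC     danRer10
Drosophila melanogaster              Ensembl  BDGP6
Drosophila melanogaster              NCBI     build5.3
Drosophila melanogaster              NCBI     build5.41
Drosophila melanogaster              UCSC     dm6
Escherichia coli strain K12, DH10B   Ensembl  EB1
Escherichia coli strain K12, DH10B   NCBI     2008-03-17
Escherichia coli strain K12, MG1655  NCBI     2001-10-15
Homo sapiens                         Ensembl  GRCh37
Homo sapiens                         NCBI     GRCh38
Homo sapiens                         UCSC     hg19
Homo sapiens                         UCSC     hg38
Mus musculus                         NCBI     GRCm38
Mus musculus                         UCSC     mm9
Mus musculus                         UCSC     mm10
Pan troglodytes                      Ensembl  CHIMP2.1
Pan troglodytes                      Ensembl  CHIMP2.1.4
Pan troglodytes                      NCBI     build3.1
Pan troglodytes                      UCSC     panTro3
Pan troglodytes                      UCSC     panTro4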
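
Since the files come from iGenomes, a path to a particular FASTA or index can often be assembled from the organism, source and build names. The sketch below assumes the shared location mirrors the usual iGenomes directory convention (Organism/Source/Build/Sequence/...), which should be verified against the actual directories on Rivanna. The root /path/to/genomes is a placeholder, not the real storage location, and genome_path is a hypothetical helper name.

```shell
#!/bin/bash

# genome_path ROOT ORGANISM SOURCE BUILD KIND
# Print a path following the iGenomes directory convention (an assumption;
# check it against the real shared location). KIND is fasta, bwa or bowtie2.
genome_path() {
    root="$1"
    # iGenomes directory names use underscores instead of spaces.
    organism=$(printf '%s' "$2" | tr ' ' '_')
    base="${root}/${organism}/$3/$4/Sequence"
    case "$5" in
        fasta)   echo "${base}/WholeGenomeFasta/genome.fa" ;;
        bwa)     echo "${base}/BWAIndex/genome.fa" ;;
        bowtie2) echo "${base}/Bowtie2Index/genome" ;;
        *)       echo "unknown kind: $5" >&2; return 1 ;;
    esac
}

# Placeholder root; substitute the actual shared storage location.
genome_path "/path/to/genomes" "Homo sapiens" UCSC hg38 bowtie2
```

Keeping the root in a single argument means that only one value has to change once the real storage path is known.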