Description


The Snakemake workflow management system is a tool to create reproducible and scalable data analyses.


Software Category: tools

For detailed information, visit the Snakemake
website
.


Available Versions

The current installation of Snakemake
incorporates the most popular packages. To find the available versions and learn how to load them, run:

module spider snakemake

The output of the command shows the available Snakemake
module versions.

For detailed information about a particular Snakemake
module, including how to load the module, run the module spider command with the module’s full version label. For example:

module spider snakemake/7.24.2
ModuleVersion Module Load Command
snakemake7.24.2 module load snakemake/7.24.2

Snakemake workflow:

  • Snakemake is a workflow management system used to create reproducible and scalable data analyses
  • Workflows are written in Python and can be deployed in parallel on the HPC system
  • Workflows can be executed in containerized environments: Conda or Apptainer

Snakemake rules:

Snakemake DAG

  • Snakemake follows the GNU Make paradigm
  • Workflows are defined in rules, starting with the target rule
  • Dependencies between the rules are determined automatically, creating a DAG (directed acyclic graph) of jobs that can be parallelized

Config.yaml file:

  • Config files are for users to input filenames and paths for the workflow
  • In the case below, the user inputs 3 samples for a simple RNA-seq pipeline
  • Threads can be passed as an argument for multithreading
#Inset sample names as a list:
SAMPLES: ["exp1", "exp3", "Sham1"]

#Insert path to GTF file:
GENOME_GTF: "refGene.gtf"

#Insert path to Hisat2 index files:
HISAT2_INDEX_PREFIX: "grcm38_snp/genome_snp"

#Insert the path to the directory that has your fastq files:
DATA_PATH: "/project/some_directory/fastq_files"

#Each sample should have R1 and R2 extensions:
R1_EXT: "_R1_001_paired.fq.gz"
R2_EXT: "_R2_001_paired.fq.gz"

#Insert number of threads you would like to use:
THREADS: 8

Snakefile:

  • The Snakefile contains the rules of your workflow (the steps)
  • The target rule is your final output, Snakemake will determine the order of the rules in order to create that output
  • Each rule consists of 3 required parts: the input files, the output files, and the shell (command)
  • Below is an example of a rule to align sequences using hisat. The log and threads options are optional, but included for reference
  • The target output is a gene count matrix in a csv format
#Target Rule (last output):
rule all:
     input: "differential_expression/gene_count_matrix.csv"

rule align_hisat:
    input:
        fq1= config['DATA_PATH'] + "{sample}" + config['R1_EXT'],
        fq2= config['DATA_PATH'] + "{sample}" + config['R2_EXT'],
        hisat2_index=expand(f"{HISAT2_INDEX_PREFIX}.{{ix}}.ht2", ix=range(1, 9))
    output: "align_hisat2/{sample}.bam"
    log: "align_hisat2/{sample}.alignment.summary"
    threads: config['THREADS']
    shell:
        "hisat2 2>{log} -p {threads} --dta -x {HISAT2_INDEX_PREFIX} "
        "-1 {input.fq1} -2 {input.fq2} | "
        "samtools sort -@ {threads} -o {output}"
  • After the rule align_hisat is completed, the workflow can move to the next rule stringtie_assemble
  • Notice that the output of align_hisat is a .bam file, this is now the input to the rule stringtie_assemble
rule stringtie_assemble:
    input:
        genome_gtf=config['GENOME_GTF'],
        bam="align_hisat2/{sample}.bam"
    output: "stringtie/assembled/{sample}.gtf"
    threads: config['THREADS']
    shell:
        "stringtie -p {threads} -G {input.genome_gtf} "
        "-o {output} -l {wildcards.sample} {input.bam}"
  • You can add as many rules as you like as long as they are sequential with inputs and outputs

Slurm for Snakemake:

  • The Snakemake pipeline can be executed using a SLURM script on the HPC system
  • Below is an example script to submit to the standard partition with 8 threads
  • This script is using a conda environment called rnaseq
#!/bin/bash

#SBATCH -t 5:00:00
#SBATCH -p standard
#SBATCH --mem=50000
#SBATCH -A allocation_name
#SBATCH -n 9

module purge
module load anaconda
source activate rnaseq

snakemake -p -j 8

Dry Runs:

  • Dry-runs are a great way to check your commands before running them
  • The code is printed, but not actually run
  • For a dry run, use snakemake -n