An introduction to Snakemake: Writing reproducible bioinformatics workflows

Dates: 12-16 June 2017

Location: NTNU, Trondheim

Organisers: Pål Sætrom, Endre Bakken Stovner, Antonin Klima

External lecturers: Johannes Koester (author of Snakemake), Chris Tomkins-Tinch (core contributor), Arnar Flatberg (Genomics Core Facility, NTNU)

Register here, by 2nd May 2017

 

Writing reproducible bioinformatics workflows with Snakemake

As a researcher in bioinformatics you use computers for all aspects of your work, from performing analyses and manipulating data to generating graphs. Writing a paper or finishing a project entails writing plenty of small scripts, running software with many different parameters and perhaps even manually altering the data in some way.
 
This is not reproducible science; when you need to redo the analysis it is hard to remember exactly what you did the last time. And for reviewers or others who want to use your method on their own data, it is impossible to retrace your steps.
 
Of course reproducibility isn’t just good for science; it is also good for the individual researcher. Having all necessary steps woven together into one workflow is an incredible time-saver.
 
For these purposes workflow management systems were created. These enable you to write down the steps in your pipeline in an executable way, which makes them easy to rerun. In this workshop we will teach Snakemake, a system especially made with complex bioinformatics pipelines in mind.
 
In short: Snakemake makes your research reproducible and saves you time in the long run.
 

Who this workshop is for

This workshop is intended for bioinformaticians and researchers in biology or medicine who use the command line and programming languages in their daily work. Examples include
  1. writing simple R or Python scripts, or
  2. entering shell instructions on the command line, or
  3. using command line software
This course will teach you how to create a completely reproducible workflow using these elements.
 
On the first day, a crash course in all three topics will be given.
 
Note that Snakemake is not just for Linux users; Snakemake supports OS X and can be run on Windows through a VM.
 

About Snakemake

Snakemake is a combination of the languages Python and Make. Python is famous for being easy to read, write and use, whereas Make is a robust system for creating reproducible workflows. Make is unfortunately inflexible and hard to use. Snakemake aims for the best of both worlds: it is a robust language and system to create reproducible workflows that is easy to read, write and learn.
 
Snakemake makes it easy to incorporate all aspects of a modern bioinformatics pipeline into one workflow. It also allows you to use the three most popular bioinformatics languages together: Snakemake uses the Python language natively, and has excellent support for incorporating R and shell.
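
To make the Make idea that Snakemake builds on more concrete, here is a minimal sketch in plain Python. It is not Snakemake syntax and not how Snakemake is implemented; it only illustrates the underlying rule that a step is rerun when its output file is missing or older than one of its inputs. The function names (`needs_update`, `run_step`, `count_lines`, `demo`) and the file names are hypothetical and chosen for illustration.

```python
import os
import tempfile


def needs_update(output, inputs):
    """Return True if output is missing or older than any input file."""
    if not os.path.exists(output):
        return True
    out_mtime = os.path.getmtime(output)
    return any(os.path.getmtime(path) > out_mtime for path in inputs)


def run_step(output, inputs, action):
    """Run action(inputs, output) only when the output is out of date."""
    if needs_update(output, inputs):
        action(inputs, output)
        return True
    return False


def count_lines(inputs, output):
    # Hypothetical analysis step: count the lines in all input files.
    total = 0
    for path in inputs:
        with open(path) as handle:
            total += sum(1 for _ in handle)
    with open(output, "w") as handle:
        handle.write(f"{total}\n")


def demo():
    """Run the same step three times and report which runs did any work."""
    with tempfile.TemporaryDirectory() as workdir:
        reads = os.path.join(workdir, "reads.txt")
        counts = os.path.join(workdir, "counts.txt")
        with open(reads, "w") as handle:
            handle.write("ACGT\nTTGA\n")
        first = run_step(counts, [reads], count_lines)   # output missing
        second = run_step(counts, [reads], count_lines)  # output up to date
        # Simulate a later edit of the input by moving its timestamp forward.
        later = os.path.getmtime(counts) + 10
        os.utime(reads, (later, later))
        third = run_step(counts, [reads], count_lines)   # input is newer
        return first, second, third
```

In this sketch the first call does the work, the second is skipped because nothing has changed, and the third runs again because the input is newer than the output. A workflow system applies the same check across a whole chain of steps, which is what makes rerunning an analysis cheap.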
 
Since its release in 2012, Snakemake has become a popular tool in bioinformatics. It
  • is used by thousands of researchers
  • has been used in a slew of papers in high-impact journals
  • has a core team of six contributors, with Johannes Koester, the inventor of Snakemake, as the lead developer
  • has a generous and long-lasting grant to ensure its continuing development
  • is innovative, often being the first to introduce a new, useful feature
  • might be the most feature-complete framework
  • has its own Stack Overflow tag, which means you can get help on Stack Overflow to solve your Snakemake problems
 

The Snakemake Workshop

The Snakemake workshop lasts for five days, of which the first and the last are optional. The first day is for helping you install the necessary software and teaching you enough Python, R and shell to understand the rest of the course. The last day will be used to help you translate your own bioinformatics workflows into Snakemake, write a new one from scratch, get help from the instructors with concepts that are still blurry, or anything else you want.
 
*On all days, more than half of the time is dedicated to letting you solve exercises. The exercises are especially made for helping you consolidate the material taught. There will be many course organizers available at all times to help you solve the problems and understand the material taught. The external lecturers will be there all the time from Tuesday to Friday.*
 
Day one – Preparations (optional)
Monday will be used to help people set up their laptops and install the necessary software. We will then teach the basic Python, R and shell needed to understand the rest of the course. 
 
If you
  1. are able to follow our written instructions for installing Conda and Snakemake and getting them running, and
  2. know enough basic Python, R and shell to answer our skill check test correctly
you might want to skip this day.
 
Day two – Introduction
On Tuesday we will introduce Snakemake and teach you how to 
  • write basic pipelines in Snakemake
  • reason at a high level about how Snakemake works
  • follow good practices for doing data analysis
 
Day three – Advanced pipelines
  • a walk-through of a large real-life bioinformatics pipeline (variant calling)
  • writing more advanced Snakemake pipelines: layout, modularity
 
Day four – Testing and debugging
  • testing Snakemake pipelines
  • common errors in pipelines and how to debug them
 
Day five – Do-it-yourself (optional)
On Friday, you will get help to start translating your existing workflows into Snakemake or to write a new Snakemake pipeline from scratch. Participants in this optional part of the workshop should bring their own data and at least a rough idea of the types of analyses intended for their data. The course organizers will be available throughout the day to help you get started, solve issues that may arise, and answer any questions you might have about Snakemake, such as “how can I deploy Snakemake in the cloud?” or “how do the Snakemake internals work?”
 

External lecturers

Johannes Koester
Johannes Koester is the author of Snakemake. He got his PhD from the University of Dortmund and then went on to do a postdoc at Harvard. He is now finishing up a research position at CWI Amsterdam and will soon start a research group in Essen, Germany. 
 
Chris Tomkins-Tinch
Chris is a software engineer at the Broad Institute of MIT and Harvard and a core contributor to Snakemake. 
 
Arnar Flatberg
Arnar is an engineer at the Genomics Core Facility at NTNU.
 

Local lecturers

Antonin and Endre work with complex bioinformatics analyses at NTNU and use Snakemake to make their lives easier.