Sequence comparison and database search (NORBIS901) [NB! CANCELED]

This course has sadly been canceled.

Dates: May 9-13, 2022

Location: University of Bergen

Course code: NORBIS901

Organizers: Inge Jonassen, Cedric Notredame

Lecturer: Cedric Notredame – Center for Genomic Regulation, Barcelona, Spain

Registration: Closed.

The course is limited to 20 participants for the practical sessions and exam. NORBIS members have priority for the practical sessions. Visiting PhD students will have to provide extra information for exam registration. This is specified in the registration form.

Registration deadline: April 22, 2022

Credits: 5 ECTS

Course description

This course provides insight into methods for analyzing biological sequences. Its goal is to present an overview of some advanced concepts of sequence alignment and some of their applications, with a strong emphasis on algorithmic, homology-based multiple sequence alignment modelling, one of the most widely used methods in biology. The recent launch of large-scale genomic sequencing projects, such as the Earth BioGenome Project (https://www.earthbiogenome.org), is making these methodologies increasingly relevant to a wide range of comparative analyses, as is the reliance of milestone AI methods such as AlphaFold2 on large-scale MSAs. For this reason we will introduce the most important notions of large-scale analysis.

The first part is dedicated to molecular evolution. We will focus on the implications of molecular evolution on sequence variation. We will use these concepts to define homology. We will then see how specific mathematical models (the substitution matrices) have been derived in order to quantify the evolutionary relationship between sequences.
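As a rough illustration of the idea behind substitution matrices, the sketch below computes a log-odds score of the general kind used in matrices such as BLOSUM: the observed frequency of an aligned residue pair is compared with the frequency expected by chance. The function name, the scaling factor and the frequencies are hypothetical toy values chosen for illustration, not course material.

```python
import math


def log_odds_score(p_ab, q_a, q_b, scale=2.0):
    """Scaled log-odds score for an aligned residue pair (a, b).

    p_ab  : observed frequency of the pair in trusted alignments (toy value)
    q_a   : background frequency of residue a (toy value)
    q_b   : background frequency of residue b (toy value)
    scale : arbitrary scaling factor (assumption; real matrices pick their own)
    """
    return scale * math.log2(p_ab / (q_a * q_b))


# A pair seen twice as often as expected by chance gets a positive score,
# a pair seen half as often as expected gets a negative one.
favoured = log_odds_score(0.125, 0.25, 0.25)
disfavoured = log_odds_score(0.03125, 0.25, 0.25)
print(favoured, disfavoured)
```

A positive score marks a substitution that is observed more often than chance would predict, which is how such a matrix encodes an evolutionary relationship between residues.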

In the second part we will cover pairwise and single sequence analysis. We will introduce the Needleman-Wunsch algorithm (dynamic programming), a fundamental algorithm that makes it possible to derive pairwise alignments from the sequences while using the substitution matrices. Implementation aspects will be extensively covered, including linear-space implementations and pair-HMM alternatives for pairwise sequence alignment. These dynamic programming algorithms will be compared, in both scope and application, with the Burrows-Wheeler Transform, an essential component of genomic data aligners. This part will finish with an introduction to the Zuker RNA folding algorithm, including an overview of the CYK algorithm.
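For orientation, here is a minimal sketch of global alignment by dynamic programming in the style of Needleman-Wunsch. It assumes a simple match/mismatch score in place of a full substitution matrix and a linear gap penalty; the function name and the default parameter values are illustrative choices, not the implementation used in the course.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-2):
    """Global alignment of strings a and b.

    Returns (score, aligned_a, aligned_b). Scoring is a toy scheme
    (assumption): match/mismatch instead of a substitution matrix,
    and a linear gap penalty.
    """
    n, m = len(a), len(b)
    # score[i][j] = best score for aligning a[:i] with b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap

    # Fill the table row by row.
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + sub,  # align a[i-1] with b[j-1]
                score[i - 1][j] + gap,      # gap in b
                score[i][j - 1] + gap,      # gap in a
            )

    # Trace back from the bottom-right corner to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            sub = match if a[i - 1] == b[j - 1] else mismatch
            if score[i][j] == score[i - 1][j - 1] + sub:
                out_a.append(a[i - 1])
                out_b.append(b[j - 1])
                i -= 1
                j -= 1
                continue
        if i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return score[n][m], "".join(reversed(out_a)), "".join(reversed(out_b))


best, top, bottom = needleman_wunsch("ACGT", "AGT")
print(best)
print(top)
print(bottom)
```

This version stores the full table, so memory grows with the product of the two sequence lengths; the linear-space implementations mentioned above address exactly that cost.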

The third part of the course will deal with Multiple Sequence Alignment algorithms for both RNA and protein sequences. We will introduce the basic notions of this important class of algorithms and will then move to the most recent algorithms, which are able to take protein and RNA 3D structure into account while building the models. We will discuss some recent results on the relationship between MSA computation and tree reconstruction. The course will finish with a detailed description of the most recent large-scale algorithms, including the recently published Regressive Algorithm, and an introduction to the current state of the art for ultra-large-scale analysis, including phylogenetic and structural predictions with AlphaFold2.

Course program

The course will be given over one week (5 days, Monday-Friday) with lectures (34 hours including discussions) in the mornings and practical hands-on sessions in the afternoons. On the last day there will be a written exam (two hours) and a summing-up session where the students can provide feedback.

The students will receive a reading list before the course and are expected to prepare well for the course. In addition to the exam, the students will do a project after the course and deliver a report within a month.

Students must bring a laptop running a UNIX-like operating system (any flavor of Linux, UNIX or Mac) to the course and will receive software installation instructions along with the reading list.

Learning outcomes and competence

Students will learn the principles underlying sequence comparison, including an evolutionary understanding of sequence alignments and the development and use of substitution matrices. Students will gain a solid understanding of the most important classes of sequence matching algorithms used in biology for single sequence, pairwise and multiple sequence analysis. The course will bridge basic concepts, often new to biologists, with state-of-the-art issues raised by large-scale data analysis. At the end of the course, the students will have a practical understanding of dynamic programming, RNA folding algorithms, Hidden Markov modelling of sequence alignment, multiple sequence alignment algorithms and scale-up issues.

Prerequisites

Familiarity with basic concepts in programming and algorithms.

Basic knowledge of Python as well as command-line programs and Linux.

Basic background in pairwise and multiple sequence alignment.

Basic knowledge of linear algebra and statistics.

Evaluation

The students will sit a two-hour written exam. In addition, the written report will need to be approved.

Grades: pass / no-pass.