Sequence comparison and database search (NORBIS901)

Dates: May 25-29, 2020

Course code: NORBIS901

Credits: 5 ECTS

Organisers: Inge Jonassen (UiB), Cedric Notredame (Center for Genomic Regulation, Barcelona, Spain)

Registration: HERE

The course is limited to 20 participants for the practical sessions and exam. The lectures will be open for all. NORBIS members have priority for the practical sessions. Visiting PhD students will have to provide extra information for exam registration. This is specified in the registration form.

Registration deadline: May 15th

Practical information:

This course is given online. Lectures will be open for all.

Program:

Monday-Friday

9:30-12:30 : Lectures (3 x 45 minutes)

13:00-17:00 : Practicals

Course material: Presentations, exercises and reading material will be available from here. You are expected to prepare well ahead of this course, so please remember to read the recommended material.

Software: For the exercises, you will be using Python, so you need access to a Linux shell.

Course description:

This course provides insight into methods for aligning biological sequences. Its goal is to present an overview of the basic concepts of sequence alignment and some of their applications, with a strong emphasis on homology-based multiple sequence alignment modelling, one of the most widely used methods in biology.

The first part is dedicated to molecular evolution. We will focus on the implications of molecular evolution for sequence variation and use these concepts to define homology. We will then see how specific mathematical models (the substitution matrices) have been derived in order to quantify the evolutionary relationship between sequences. In the next part we introduce the Needleman-Wunsch algorithm (dynamic programming), a very basic algorithm that makes it possible to derive pairwise alignments from the sequences while using the substitution matrices. Next, we will see how these pairwise alignment methods can be applied to database searches, and we will develop the main concepts behind the BLAST algorithm. Finally, we will introduce the notion of multiple sequence alignment and show how a group of related sequences can be compared in order to infer common properties. We will further investigate algorithms for multiple sequence alignment, including RNA structure-based alignments. We will then see the main principles behind two multiple sequence alignment packages, the Clustal programs and TCoffee, and the current challenges when modelling sets of homologous sequences (RNA and proteins). The course will also include a section on Hidden Markov Models and their use in alignment and in representing and extending protein domains and families, as well as the use of structural information in alignment, especially RNA secondary structures.
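As a taste of the dynamic programming part, the Needleman-Wunsch recurrence can be sketched in a few lines of Python. This is only an illustrative sketch, not course material: the function name and the simple match/mismatch/linear-gap scores are assumptions made here, and in practice the match/mismatch scores would be replaced by a substitution matrix and usually by a more refined gap penalty.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-2):
    """Return (score, aligned_a, aligned_b) for a global alignment of a and b.

    The scoring scheme (match/mismatch/linear gap) is an illustrative
    assumption; a substitution matrix would normally replace it.
    """
    n, m = len(a), len(b)
    # score[i][j] holds the best score for aligning a[:i] with b[:j].
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap

    # Fill the table: each cell extends the best of three neighbours.
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + sub,  # align a[i-1] with b[j-1]
                score[i - 1][j] + gap,      # gap in b
                score[i][j - 1] + gap,      # gap in a
            )

    # Trace back from the bottom-right corner to recover one optimal alignment.
    i, j = n, m
    out_a, out_b = [], []
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            sub = match if a[i - 1] == b[j - 1] else mismatch
            if score[i][j] == score[i - 1][j - 1] + sub:
                out_a.append(a[i - 1])
                out_b.append(b[j - 1])
                i, j = i - 1, j - 1
                continue
        if i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return score[n][m], "".join(reversed(out_a)), "".join(reversed(out_b))


if __name__ == "__main__":
    s, x, y = needleman_wunsch("GATTACA", "GCATGCU")
    print(s)
    print(x)
    print(y)
```

The table has (n+1) x (m+1) cells, so time and memory both grow with the product of the sequence lengths, which is why database searches rely on heuristics such as BLAST rather than full dynamic programming against every entry.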

The course is based on our previous courses "An introduction to sequence comparison and database search", which was run in November 2015, and "Sequence comparison and database search", which was run in January 2018. Note that this course is more advanced.

Course program:

The course will be given over a week (5 days, Monday-Friday) with lectures (34 hours including discussions) in the mornings and practical hands-on sessions in the afternoons. On the last day there will be a written exam (two hours) and a summing-up session where the students can provide feedback.

The students will receive a reading list before the course and are expected to prepare well for the course. In addition to the exam, the students will do a project after the course and deliver a report within a month.

Students must bring a laptop to the course and will receive software installation instructions along with the reading list.

Learning outcomes and competence:

Students will learn the principles underlying sequence comparison, including an evolutionary understanding of sequence alignments, the development and use of substitution matrices, and the heuristic methods for database sequence homology search in the BLAST programs. Students will also gain an understanding of multiple sequence alignments and their applications. An important focus of the course will be a detailed understanding of evolution-based multiple sequence comparison methods, including the Clustal and TCoffee software series. This will enable the students to integrate various sources of data (protein sequences, protein 3D structures, RNA sequences, RNA 2D/3D models) into high-quality multiple sequence models, thus offering them an entry point into homology modelling for evolutionary, structural and functional analysis. The students will have an understanding of Hidden Markov Models and their use in sequence analysis, and also a basic understanding of the use of structural information in multiple sequence alignment approaches.

Prerequisites:

Basic background in pairwise and multiple sequence alignment.
Experience using command-line programs and Linux.
Familiarity with basic concepts in programming and algorithms.
Basic knowledge in linear algebra and statistics.

Evaluation:

The students will sit a two-hour written exam. In addition, the written report will need to be approved.

Grades: pass / no-pass.