Sequence comparison and database search

Dates: January 29 – February 2, 2018

Course code: NORBIS901

Credits: 5 ECTS

Organisers: Inge Jonassen (UiB), Cedric Notredame (Center for Genomic Regulation, Barcelona, Spain)

Practical information:

Location: Seminar room C, VilVite, Thormøhlens gt 51, 5006 Bergen. Google Maps location here. VilVite is a 5-10 minute walk through the park from Hotel Park; alternatively, take the city light rail (Bybanen) to the Florida stop.

Program: An outline of the schedule can be viewed here. We will start at 10.00 on Monday morning and finish by 16.00 on Friday afternoon, so please arrange your flights accordingly.

Course material: Presentations, exercises and reading material are available here. You are expected to prepare well ahead of this course, so please remember to read the recommended material at the bottom of the page.

Software: For the exercises, you will use VirtualBox to run a Linux image that includes all necessary software. Please follow the instructions given here to install the virtual machine. This may take some time, so kindly make sure everything is in place before you arrive.

Course dinner: We invite all participants to join us for pizza at Bien Centro on Monday 29 January at 18.00!

Course description:

This course provides insight into methods for aligning biological sequences. Its goal is to present an overview of the basic concepts of sequence alignment and some of its applications, with a strong emphasis on homology-based multiple sequence alignment modelling, one of the most widely used methods in biology.

The first part is dedicated to molecular evolution. We will focus on the implications of molecular evolution for sequence variation and use these concepts to define homology. We will then see how specific mathematical models (the substitution matrices) have been derived in order to quantify the evolutionary relationship between sequences. In the next part we introduce the Needleman and Wunsch algorithm (dynamic programming), a very basic algorithm that makes it possible to derive pairwise alignments from the sequences while using the substitution matrices. Next, we will see how these pairwise alignment methods can be applied to database searches, and we will develop the main concepts behind the BLAST algorithm. Finally, we will introduce the notion of multiple sequence alignment and show how a group of related sequences can be compared in order to infer common properties. We will further investigate algorithms for multiple sequence alignment, including RNA structure based alignments. We will then see the main principles behind two multiple sequence alignment packages, the Clustal programs and TCoffee, and the current challenges when modelling sets of homologous sequences (RNA and proteins). The course will also include a section on Hidden Markov Models and their use in alignment and in representing and extending protein domains and families, as well as the use of structural information in alignment, especially RNA secondary structures.
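As a rough illustration of the dynamic programming idea behind the Needleman and Wunsch algorithm, here is a minimal Python sketch. It is not course material: the function names, the simple match/mismatch scoring (standing in for a real substitution matrix) and the linear gap penalty are all assumptions chosen to keep the example short.

```python
# Minimal sketch of Needleman-Wunsch global alignment.
# Assumptions (not from the course): match/mismatch scores replace a real
# substitution matrix, and every gap position costs the same linear penalty.

def score_pair(a, b, match=1, mismatch=-1):
    """Hypothetical stand-in for a substitution matrix lookup."""
    return match if a == b else mismatch


def needleman_wunsch(s, t, gap=-2):
    """Return (score, aligned_s, aligned_t) for a global alignment of s and t."""
    n, m = len(s), len(t)

    # F[i][j] holds the best score for aligning s[:i] with t[:j].
    F = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        F[i][0] = i * gap
    for j in range(1, m + 1):
        F[0][j] = j * gap

    # Fill the table: each cell extends a shorter alignment by one column.
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            F[i][j] = max(
                F[i - 1][j - 1] + score_pair(s[i - 1], t[j - 1]),  # align two residues
                F[i - 1][j] + gap,                                 # gap in t
                F[i][j - 1] + gap,                                 # gap in s
            )

    # Trace back from the bottom-right corner to recover one optimal alignment.
    out_s, out_t = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and F[i][j] == F[i - 1][j - 1] + score_pair(s[i - 1], t[j - 1]):
            out_s.append(s[i - 1])
            out_t.append(t[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and F[i][j] == F[i - 1][j] + gap:
            out_s.append(s[i - 1])
            out_t.append("-")
            i -= 1
        else:
            out_s.append("-")
            out_t.append(t[j - 1])
            j -= 1

    return F[n][m], "".join(reversed(out_s)), "".join(reversed(out_t))
```

Calling `needleman_wunsch("ACGT", "AGT")` aligns the two sequences with a single gap in the shorter one. In practice, `score_pair` would be replaced by a lookup in a substitution matrix, and the gap model is usually more elaborate than the linear penalty assumed here.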

The course is based on our previous course ‘An introduction to sequence comparison and database search’, which was run in November 2015, but is more advanced and will also include RNA structure based alignments.

Course program:

The course will be given over one week (5 days, Monday-Friday) with lectures (34 hours including discussions) in the mornings and practical hands-on sessions in the afternoons. On the last day there will be a written exam (two hours) and a summing-up session where the students can provide feedback.

The students will receive a reading list before the course and are expected to prepare well for the course. In addition to the exam, the students will do a project after the course and deliver a report within a month.

Students must bring a laptop to the course and will receive software installation instructions along with the reading list.

Learning outcomes and competence:

Students will learn the principles underlying sequence comparison, including an evolutionary understanding of sequence alignments, the development and use of substitution matrices, and the heuristic methods for database sequence homology search in the BLAST programs. Students will also gain an understanding of the concept of multiple sequence alignment and its applications. An important focus of the course will be a detailed understanding of evolution-based multiple sequence comparison methods, including the Clustal and TCoffee software series. This will enable the students to integrate various sources of data (protein sequences, protein 3D structures, RNA sequences, RNA 2D/3D models) into high-quality multiple sequence models, thus offering them an entry point into homology modelling for evolutionary, structural and functional analysis. The students will gain an understanding of Hidden Markov Models and their use in sequence analysis, as well as a basic understanding of the use of structural information in multiple sequence alignment approaches.

Prerequisites:

Basic background in pairwise and multiple sequence alignment.
Experience using command-line programs and Linux.
Familiarity with basic concepts in programming and algorithms.
Basic knowledge of linear algebra and statistics.

Evaluation:

The students will sit a two-hour written exam. In addition, the written report will need to be approved.

Grades: pass / no-pass.
