Prediction

Dates: January 16-20, 2017

Lecturers: Manuela Zucknick, Arnoldo Frigessi and others

Course code: IMB9275

Registration: Closed.

Credits: 5 ECTS. UiO students: register for credits here. Students from other institutions: read more and apply here.

Location: University of Oslo, Domus Odontologica (preklinsk III). Monday: Store auditorium A1.1001, Tuesday-Friday: Lille auditorium A1.1004. Take tram 17 or 18 and get off at Rikshospitalet.

Schedule: available for download.

Description: (a concise flyer is also available for download)

The course focuses on the prediction of future and/or unmeasured outcomes based on a variety of high-dimensional molecular data. What do we want to predict? Typically, this is whether or not a therapy given to a patient succeeds (binary or categorical outcome, also called classification); it can be bone mineral density or gene expression in so-called eQTL studies (continuous outcomes); or it can be survival after cancer surgery or time to recurrence of a disease (time-to-event outcomes).

This course does not cover methods for dividing the patients in a study into subgroups, as is the aim when sub-typing a disease (unsupervised clustering). Prediction is based on various models that use molecular data as input (genomic, metabolomic, proteomic and epigenetic data, for example) in addition to other individual variables (demographic, clinical and exposure data). What characterizes these data is their very high dimension (say, all genes or all SNPs, so a large number p of variables) compared with the smaller number n of individuals in a study. Variables can be discrete, categorical or continuous, and can also be related to more complex structures such as ontologies and pathways (networks).

Many methods can be used to predict outcomes from data in a p > n setting. In this course we will focus on:

  • Penalised methods, like lasso, ridge and elastic net, including parameter tuning using cross-validation
  • Bayesian methods, based on prior knowledge and exploiting Markov chain Monte Carlo algorithms
  • Machine learning approaches, including tree-based methods, support vector machines, kernel methods and neural networks/deep learning
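
As a rough illustration of the first item, the sketch below fits ridge regression in a p > n setting and chooses the penalty by cross-validation. It is written in Python with NumPy purely for illustration (the course practicals use R/Bioconductor). The simulated data, the function names ridge_fit and cv_error, the grid of penalty values and the omission of an intercept are all assumptions made for this example, not course material.

```python
import numpy as np


def ridge_fit(X, y, lam):
    """Ridge coefficients via the dual form beta = X^T (X X^T + lam I)^{-1} y.

    Assumes roughly centred data, so no intercept is fitted.
    """
    n = X.shape[0]
    alpha = np.linalg.solve(X @ X.T + lam * np.eye(n), y)
    return X.T @ alpha


def cv_error(X, y, lam, k=5, seed=0):
    """Mean squared prediction error of ridge, estimated by k-fold cross-validation."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), k)
    errors = []
    for test_idx in folds:
        # Fit on all samples outside the current fold, evaluate on the fold.
        train_idx = np.setdiff1d(np.arange(len(y)), test_idx)
        beta = ridge_fit(X[train_idx], y[train_idx], lam)
        errors.append(np.mean((y[test_idx] - X[test_idx] @ beta) ** 2))
    return float(np.mean(errors))


# Simulated (hypothetical) data: n = 40 individuals, p = 200 variables,
# of which only the first 5 influence the outcome.
rng = np.random.default_rng(1)
n, p = 40, 200
X = rng.standard_normal((n, p))
beta_true = np.zeros(p)
beta_true[:5] = 2.0
y = X @ beta_true + rng.standard_normal(n)

# Evaluate an arbitrary grid of penalties on the same folds and keep the best.
lambdas = [0.01, 0.1, 1.0, 10.0, 100.0]
cv_errors = {lam: cv_error(X, y, lam) for lam in lambdas}
best_lam = min(cv_errors, key=cv_errors.get)
print("selected penalty:", best_lam, "CV error:", cv_errors[best_lam])
```

The dual form solves an n x n system rather than a p x p one, which is convenient when p is much larger than n. Every penalty is evaluated on the same folds, so the comparison between penalties is not affected by different random splits.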

We will study ways of combining different predictions with:

  • Boosting, bagging and other ensemble methods

Finally, we will discuss how to compare and evaluate various prediction methods to determine which one performs best:

  • Performance measures of prediction methods and their estimation using resampling methods (bootstrapping, cross-validation)

Additional themes, which will be treated in the course and will appear across the five topics above, include:

  • selection of variables
  • interaction
  • integration of various data sets at different scales
  • resampling methods


Prerequisites:

Basic knowledge of linear algebra and statistics is expected. The practicals will be run using the statistical computing environment R and Bioconductor. We expect students to be familiar with performing data analysis in R/Bioconductor.


Learning outcomes:

After completing the course, the student should:

  • know what prediction is, in contrast to estimation, testing, and clustering
  • know which steps are involved in a prediction task, and which pitfalls need to be avoided
  • be able to identify appropriate methods for a given problem, and to perform prediction tasks using R and Bioconductor packages
  • be able to assess methods in the literature, and to put these in a wider context
  • be able to assess the performance of prediction results, as they are typically reported in publications


Course program:

The course will be given as an intensive one-week course (5 days, Monday-Friday) with lectures in the mornings (three hours, including discussions) and practical hands-on sessions in the afternoons (four hours). During the practical sessions the students will use R/Bioconductor to analyse given datasets using different prediction approaches. On the last day the students will give brief presentations of their prediction results to the class, followed by a summary session. Students will have the opportunity to provide feedback at the end of the course. The students will receive a reading list before the course and are expected to prepare well for the course.

The students will do a project after the course, possibly using their own data, and deliver a written report within a month. The students will be divided into small groups whose members have complementary backgrounds (e.g. one biostatistics, one bioinformatics, and one molecular biology student in each group). Students will need to bring their own laptops. For the course we encourage the use of RStudio (http://www.rstudio.com) and of the reproducible research tools knitr (http://yihui.name/knitr) and R Markdown.


Evaluation:

To pass the class, the students will be required to give a brief oral presentation during the course and to deliver the written report on the take-home project. Grades: pass/fail.