Development and benchmarking of improved computational methods for transcript-level expression analysis using RNA-seq data Grant

description

  • After sequencing of the human genome was completed, scientists were surprised to discover that there are far fewer protein-coding genes than had previously been predicted. One reason that an organism as complex as a human can be built from a relatively small number of genes is that each gene can encode more than one protein. An intermediate molecule, messenger RNA (mRNA), carries the information from the genome in the cell nucleus to the ribosomes, which create proteins. These mRNA molecules are also known as transcripts, and their full complement is termed the transcriptome. Before they mature, these transcripts are edited to form the templates for different proteins. This editing process is called splicing, and the different transcripts that result are called splice variants or isoforms. An additional complexity in the transcriptome arises because each gene has multiple copies (for example, 2 in human, 6 in wheat), and these different copies, called alleles, can be expressed differently under different conditions or in different tissues. The transcriptome is a collection of transcripts that includes all the allele-specific gene isoforms expressed in the cell, along with non-coding RNA molecules. Splicing and allele usage are fundamental ways in which the function of genes can be modulated in a tissue-specific manner. Therefore, developing technologies to measure transcript expression accurately is a necessary step towards understanding and modelling cells and tissues. A recently developed experimental technology called RNA-seq gives unprecedented access to data about the transcriptome. Computational methods are required to interpret these data, which take the form of a list of millions of short RNA sequence fragments. These fragments are difficult to interpret because, for example, the same fragment could have come from any of a large number of different gene isoforms. The question is, which one?
Computational methods can be used to answer this question and infer the concentration of different gene isoforms in the sample given these data. In this project we will develop a new computational method, implemented in publicly available free software, which uses advanced statistical procedures to solve this problem. An important distinguishing feature of the method is its ability to associate each inferred concentration with a degree of uncertainty that captures technical and biological sources of error as well as the inherent difficulty of assigning fragments to gene isoforms. We will create benchmark data that allow us to assess the performance of our method and of other published methods, helping researchers and end-users of the different methods to understand their properties. Finally, we will adapt an existing computer program, puma, to work with the processed RNA-seq data in order to identify genes that change between conditions, that have similar expression patterns, or that contribute most to the variance in the data.

date/time interval

  • October 1, 2012 - September 30, 2015

total award amount

  • 312402 GBP

sponsor award ID

  • BB/J009415/1