Exploiting data driven computational approaches for understanding protein structure and function in InterPro and Pfam Completed Project

description

  • Proteins are biological macromolecules that perform a diverse array of crucial functions, from enzymes (e.g. the entities responsible for fermentation) to transporters (e.g. hemoglobin in the blood) to mechanical structures (e.g. actin and myosin in muscle). Proteins are synthesized as linear polymers of building blocks called amino acids. They usually fold into complex three-dimensional (3D) structures, and typically interact with other proteins and molecules to perform their function. Knowledge of protein sequences can facilitate insights into hitherto undiscovered enzymes with potential applications in the biotechnology sector, or novel drugs of interest to the pharmaceutical industry. Detailed understanding of the functional architecture of proteins, including the arrangement of amino acids in a 3D structure, enables scientists to diagnose diseases as well as design more effective enzymes. Our ability to generate new protein sequences using modern high-throughput DNA sequencing (HTS) techniques now far outstrips our ability to functionally characterise them. Thus, most sequences are annotated computationally: similarities between new sequences and the few experimentally characterised examples are identified and used to infer function. More recently, HTS has been applied directly to environmental samples to discover previously uncultured bacteria and single-celled eukaryotes, and to enable the reconstruction of large and complex genomes, such as those of plants. Such approaches are correcting many of the historical biases in the protein sequence databases. However, for humankind to understand and utilise these data, sequences need to be functionally annotated, which is best accomplished using the information gleaned from sets of related sequences (known as protein families).
InterPro is a world-leading protein family resource that merges information from 13 different specialist databases to present the user with comprehensive functional analysis of sequences. One of its member databases, Pfam, is a collection of protein domain families containing functional annotations. Both InterPro and Pfam are well-established primary resources in the field of protein research. In this application, we propose crucial developments to both of these resources in order to augment their utility, functionality and scalability, as well as uniquely position them to tackle imminent advances in the field. We will leverage pre-established links with other protein databases and concurrently build additional pipelines to develop and exchange the latest information between these existing and new resources. We will improve coverage of protein sequences originating from environmental sources by building families for novel sets (or clusters) of related proteins. Considering the fundamental association between protein structure and function, we will develop a pipeline that will not only import structural models for Pfam entries and present them via the website, but will also ensure that the models remain up to date. To increase coverage and functional annotations in both resources, we will integrate new resources to provide sub-domain classifications, and improve annotations through combined literature searches and enhanced curation tools. To refine annotations, we will add a new algorithm called TreeGrafter to InterProScan (our software package that automatically annotates protein sequences), and integrate controlled vocabularies for protein attributes from databases like PANTHER with those already in InterPro. To improve future scalability, we will evaluate the performance of an upgraded version of the HMMER software, which is widely used to build protein families, including those in Pfam.
Finally, we will focus on eight genomes of agricultural importance, including chicken, salmon, and wheat, by systematically annotating 2000 associated entries in Pfam and, by extension, InterPro.

date/time interval

  • November 1, 2019 - October 31, 2023