Transforming the Structural Landscape of CATH to Aid Variant Analyses in Human and Agricultural Organisms and their Pathogens Grant

description

  • Proteins are Nature's molecular machines involved in most biochemical processes in living systems. Mutations in proteins can affect their stability and/or shape or chemical properties, altering their function. Knowing the 3D structure of the protein can be extremely helpful in understanding whether and how these mutations have this effect. Proteins are typically made up of multiple 'domains' - important functional modules - each associated with a distinct globular shape. Our CATH classification groups domains according to evolutionary ancestry. Relatives are recognised because they have similar structures in their core and often functional features in common, though variations outside the core can modify function. We therefore sub-classify relatives into functional families if they have highly similar structures and functions. Experimental techniques for determining protein structures are challenging: fewer than 1% of known proteins have experimental structures. However, AI technologies for predicting structures have been improving immensely. The best of these use information from millions of protein sequences (1D strings of molecules called residues) to predict how proteins will fold up in 3D. The massive increase in sequence data (more than one billion sequences now known) obtained by sampling diverse environments has empowered new methods such as DeepMind's AlphaFold2 to predict model structures that are as good as experimental structures. DeepMind will provide ~138 million protein structures in 2022, ~200 times more than exist now. We will transform knowledge in our CATH evolutionary classification by bringing in this vast 3D data - and we will also bring in the sequences involved in predicting the structures. This even vaster sequence data will reveal evolutionarily conserved sites highly likely to be linked to function. To handle this massive amount of data we will build powerful new methods.
Our recent trials using a new approach (CATHe) correctly assigned domain sequences to their evolutionary family ~90% of the time. Where we have an AlphaFold2 structure for the domain we will apply accurate structure comparisons to validate the classification. A major aim will be to use this new 3D data and more accurately predicted functional sites to understand how mutations in pathogens (e.g. SARS-CoV-2) can lead to increased virulence or transmission. We'll do this through our CATH-FunVar platform, which examines where mutations lie on the protein structure. Proximity to functional sites means the mutation may damage or enhance the function. We have started using FunVar to analyse variants of concern in SARS-CoV-2. We will extend it to other organisms and pathogens linked to human health and well-being, e.g. crops like wheat and rice that are essential for food security and where knowledge of variant impacts can guide selection and engineering of hardier or faster-growing varieties. To improve FunVar we will improve the accuracy of our predicted functional families and the detection of conserved functional sites in them. To do this we will exploit the vast structure and sequence data and adapt our new AI methods to make them even more powerful for this challenging task. We will build tools to analyse structure-function relationships in these families and develop powerful new visualisations for displaying these insights. Since we'll need to handle massive expansions in the data coming into CATH and lots of new methods for processing it - and since some new data is now captured in a way our computer programs can't read - we will completely re-engineer existing pipelines for classifying domains in CATH. We have already built preliminary pipelines that brought over a quarter of a million AlphaFold2 models into CATH.
This project will allow us to make these methods more robust and then apply them to bring in at least 100-fold more models to expand FunVar and determine the impacts of variants that could affect human health and food security.

date/time interval

  • July 1, 2022 - June 30, 2027

total award amount

  • 875056 GBP

sponsor award ID

  • BB/W018802/1