Automatic identification of the protein fold type using representations from the amino acid sequence and deep learning techniques

Villegas Morcillo, Amelia Otilia

Supervised by:

Victoria Eugenia Sánchez Calle Co-director
Ángel Manuel Gómez García Co-director

Defence university: Universidad de Granada

Defence date: 25 November 2022

Committee:

Noelia Ferruz Capapey Chair
José Andrés González López Secretary
Ahmed Mahfouz Committee member

Type: Thesis

Teseo: 745333

Abstract

Proteins are the building blocks of life as they are present in most of the biological processes of living organisms. The accurate determination of the protein three-dimensional structure is essential for many applications including drug development and protein design. However, the high cost of experimental methods has generated an increasing gap between the number of protein sequences and 3D structures available in public databases. Furthermore, although all the information needed to fold a protein is contained in its amino acid sequence, the computational determination of the protein structure is a challenging problem due to the complexity of the physicochemical interactions that define such structure. One step towards resolving this is the identification of the fold type the protein belongs to by comparing it to solved structures. However, this approach has recently been superseded by several deep learning methods that succeeded in producing highly accurate 3D structures from scratch. Despite this, it remains crucial to develop algorithms that identify sequential and structural similarities between proteins at a low computational cost. Since structures tend to be better conserved than sequences over the course of evolution, protein fold prediction is also a tool to find structurally related proteins that may not be similar in sequence. This could help to annotate rare proteins that are yet to be characterized. The main objective of this Thesis is therefore to advance research on protein fold prediction methods by exploiting the information contained in the amino acid sequences using deep learning algorithms. The results are presented in this dissertation as a compendium of scientific papers that have been published during the doctoral period. The proposed strategies explore different research directions with a common ground: the use of deep learning techniques to learn meaningful embedding representations of protein fold types. 
First, image representations of the protein have been evaluated for the fold recognition task, including estimated and enhanced contact maps, as well as native contact and categorical distance maps (from the 3D structure). Then, a convolutional-recurrent neural network architecture has been proposed for fold recognition, which successfully processes arbitrary-length protein sequences using amino acid residue-level features. Subsequently, more discriminative embedding spaces of protein fold classes have been learned by adjusting the training procedure of neural network models, in particular the loss function and the use of prototype fold class vectors to guide the classification. Finally, the performance of several pre-trained protein language model embeddings has been analyzed for the fold recognition and fold classification tasks; these embeddings have shown promise and great potential for the field.
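As an illustration of the image representations mentioned above, the following minimal sketch derives a native contact map and a categorical distance map from residue coordinates. It is not taken from the thesis: the function names, the 8 Å contact threshold, and the distance bin edges are assumptions chosen for the example, and the coordinates are toy values standing in for one representative atom per residue.

```python
import math

def distance_map(coords):
    """Pairwise Euclidean distances between residue coordinates."""
    return [[math.dist(a, b) for b in coords] for a in coords]

def contact_map(dist, threshold=8.0):
    """Binary map: 1 where two residues are closer than `threshold`.

    The 8.0 threshold (in angstroms) is an assumed, illustrative value.
    """
    return [[1 if d < threshold else 0 for d in row] for row in dist]

def categorical_distance_map(dist, bin_edges=(4.0, 8.0, 12.0, 16.0)):
    """Map each distance to a class index given by the bin it falls into.

    The bin edges are hypothetical; class k means the distance is greater
    than or equal to exactly k of the edges.
    """
    return [[sum(d >= e for e in bin_edges) for d in row] for row in dist]

# Toy coordinates for four residues placed along the x axis.
coords = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0), (20.0, 0.0, 0.0)]

dist = distance_map(coords)
contacts = contact_map(dist)
classes = categorical_distance_map(dist)

for row in classes:
    print(row)
```

Both maps are symmetric L x L matrices for a protein of L residues, which is what allows them to be treated as images by a neural network; the categorical map keeps coarse distance information that the binary contact map discards.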