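New methods for protein-protein interaction prediction using intelligent systems in proteomics databases (original title: Nuevos métodos de predicción de interacción de proteína-proteína utilizando sistemas inteligentes en bases de datos de proteómica)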

Ortiz Urquiza, José Miguel

Supervised by:

Ignacio Rojas Ruiz (Supervisor)
Luis Javier Herrera Maldonado (Co-supervisor)
Héctor Pomares Cintas (Co-supervisor)

Defending university: Universidad de Granada

Defense date: 14 October 2011

Examination committee:

Julio Ortega Lopera (Chair)
Armando Blanco Morón (Secretary)
Manuel Gonzalo Claros Díaz (Member)
Enrique Manuel Muro Sánchez (Member)
Oswaldo Trelles (Member)

Type: Thesis

Teseo: 314267

Abstract

Protein-protein interactions play an important role in many cellular processes. Although experimental techniques for identifying protein-protein interactions (PPIs) have improved in recent years, computational approaches have been proposed to save cost and time and to assist experimentalists. This thesis presents a novel PPI prediction methodology based on the extraction of genomic/proteomic information from well-known databases and the application of data mining techniques. The methodology yields a support vector machine (SVM) model with high sensitivity and specificity in the prediction of PPIs. It has been implemented in three different approaches, all applied to the yeast model organism. The common steps are: 1) extraction of genomic/proteomic features from well-known databases, 2) feature selection, 3) construction of the SVM predictor, and 4) validation of the model. All approaches share the feature extraction process: 26 features are extracted in the first two approaches and 25 in the last. All SVM models are trained on positive and negative examples; the positive dataset is common to the three approaches, whereas the negative dataset is built differently in each. Half of the features were calculated using two new similarity measures presented in this thesis. In the first approach, feature selection is an ensemble of three margin-based filter methods: Simba, G-flip and Relief. One SVM model with a linear kernel and another with a radial basis function (RBF) kernel were then trained on the subset of the 3 best features. These models were compared against SVM models trained on the best 3 features proposed in a previous work in the literature. An ROC analysis was performed for the SVM kernels using the best subset of features selected by the presented feature selection method.
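To make the idea of margin-based feature scoring concrete, the following is a minimal sketch of the basic Relief algorithm in plain Python. It is an illustration only and not the ensemble of Simba, G-flip and Relief used in the thesis; the function name `relief_weights`, the L1 distance, the min-max scaling and the toy data are all assumptions made for this example.

```python
import random


def relief_weights(X, y, n_iter=100, seed=0):
    """Basic Relief for two classes (illustrative sketch).

    A feature gains weight when it differs on the nearest sample of the
    other class (nearest miss) and loses weight when it differs on the
    nearest sample of the same class (nearest hit).
    """
    rng = random.Random(seed)
    n, d = len(X), len(X[0])

    # Min-max scale every feature so that differences are comparable.
    lo = [min(row[j] for row in X) for j in range(d)]
    hi = [max(row[j] for row in X) for j in range(d)]
    span = [(hi[j] - lo[j]) or 1.0 for j in range(d)]
    Z = [[(row[j] - lo[j]) / span[j] for j in range(d)] for row in X]

    def dist(a, b):
        # L1 distance; an assumption, other metrics are possible.
        return sum(abs(p - q) for p, q in zip(a, b))

    w = [0.0] * d
    for _ in range(n_iter):
        i = rng.randrange(n)
        hits = [k for k in range(n) if k != i and y[k] == y[i]]
        misses = [k for k in range(n) if y[k] != y[i]]
        h = min(hits, key=lambda k: dist(Z[i], Z[k]))
        m = min(misses, key=lambda k: dist(Z[i], Z[k]))
        for j in range(d):
            w[j] += (abs(Z[i][j] - Z[m][j]) - abs(Z[i][j] - Z[h][j])) / n_iter
    return w


# Toy data (hypothetical): feature 0 separates the classes, feature 1 does not.
X_toy = [[f0, f1] for f0 in (0.0, 0.1, 0.9, 1.0) for f1 in (0.0, 0.5, 1.0)]
y_toy = [0 if row[0] < 0.5 else 1 for row in X_toy]
weights = relief_weights(X_toy, y_toy, n_iter=50)
```

Features would then be ranked by descending weight and the top-ranked ones passed to the classifier; an ensemble such as the one described above would combine the rankings of several such scorers.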
In this first approach, the negative dataset was formed by randomly selecting pairs from a reliable negative dataset (4 million samples) proposed in a previous work by other authors. In the second approach, feature selection is a filter-wrapper approach that uses Relief (the margin-based method also used in the first approach) as the filter and SVM models as the wrapper. An RBF SVM predictor was then built using the best selected features. The model was validated using experimental, computational and literature datasets previously filtered against the training dataset. An ROC analysis was performed to show the prediction capability of an SVM model trained on the selected subset of features. Two negative datasets were used here: one formed by randomly selected pairs (as before), and the other obtained using a balancing approach, proposed in a previous work by other authors, based on the frequency with which each protein appears in the positive part of the training dataset. In the third approach, feature selection is again a filter-wrapper approach, with the minimal-redundancy-maximal-relevance (mRMR) criterion, based on mutual information, as the filter and SVM models as the wrapper. This filter-wrapper feature selection was parallelized to run on a computer cluster. Subsequently, an RBF SVM predictor was built using the relevant selected features and a specific negative dataset, selected using a hierarchical clustering approach presented in this research work. Again, the model was validated using experimental, computational and literature datasets previously filtered against the training dataset. An ROC analysis was also performed, with the same purpose as in the second approach.
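The mRMR filter mentioned above can be sketched as a greedy selection that, at each step, picks the feature with the highest mutual information with the class minus its average mutual information with the features already chosen. The code below is a minimal sketch under stated assumptions: features are already discretized, the difference form of the criterion is used, and the names `mutual_information` and `mrmr` are hypothetical. It covers only the filter stage, not the SVM wrapper or the parallelization described in the thesis.

```python
from collections import Counter
from math import log


def mutual_information(a, b):
    """Mutual information (in nats) between two equal-length discrete sequences."""
    n = len(a)
    pa, pb, pab = Counter(a), Counter(b), Counter(zip(a, b))
    return sum(
        (c / n) * log(c * n / (pa[x] * pb[y]))
        for (x, y), c in pab.items()
    )


def mrmr(features, target, k):
    """Greedy mRMR ranking (difference form).

    features: dict mapping a feature name to a list of discrete values.
    target:   list of class labels, same length as each feature.
    Returns the names of the k selected features, in selection order.
    """
    relevance = {name: mutual_information(vals, target)
                 for name, vals in features.items()}
    selected = []
    while len(selected) < min(k, len(features)):
        def score(name):
            if not selected:
                return relevance[name]
            redundancy = sum(
                mutual_information(features[name], features[s])
                for s in selected
            ) / len(selected)
            return relevance[name] - redundancy

        candidates = [name for name in features if name not in selected]
        selected.append(max(candidates, key=score))
    return selected


# Toy example (hypothetical data): "a_dup" duplicates "a", so it is redundant.
toy_features = {
    "a": [0, 0, 1, 1] * 2,
    "a_dup": [0, 0, 1, 1] * 2,
    "b": [0, 1, 0, 1] * 2,
}
toy_target = [0, 1, 2, 3] * 2
chosen = mrmr(toy_features, toy_target, k=2)
```

In the toy example, the two selected features are "b" and one of the two identical copies "a" and "a_dup", because the second copy adds no information beyond the first. In a filter-wrapper scheme, a ranking of this kind would then be evaluated by training a classifier on growing prefixes of the ranked list.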
In all approaches, the SVM model has been modified to return a probability that can be seen as a reliability measure or confidence score for the predicted interaction of each pair of proteins, which provides a statistical mechanism for PPI dataset validation. The SVM models performed well, showing high levels of specificity and sensitivity. The new methodology presented in this thesis can be useful for biologists if it is offered as a prediction service. In the last part of the thesis, a new approach is applied to a specific problem, with the purpose of characterizing complex molecular associations: the E3 machinery - protein substrate relationship, which plays an important role in neurodegenerative diseases. Two methods are provided: 1) a network neighborhood analysis between an E3 and a potential substrate, and 2) a biological enrichment process and clustergram analysis as an extension of the first part. The approach is designed to sort out potential ubiquitylation substrates and to help define the structural constituents of the E3-substrate complexes. As a result, interacting partners with similar cellular functions were found in clusters for those domain families associated with multi-protein E3 complexes, and were found grouped under the highest-connectivity groups. This approach may serve as a framework for further research, including its application to other post-translational modifications.
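The abstract does not say how the SVM was modified to return a probability. One widely used option is Platt scaling, which fits a sigmoid to the raw SVM decision values; the sketch below assumes that technique purely for illustration. The names `fit_platt` and `confidence`, the plain gradient-descent fit, and the toy scores are all assumptions, not the procedure used in the thesis.

```python
from math import exp


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + exp(-z))
    e = exp(z)
    return e / (1.0 + e)


def fit_platt(scores, labels, lr=0.1, epochs=2000):
    """Fit P(y = 1 | s) = sigmoid(a * s + b) by gradient descent on the log loss.

    scores: raw decision values of a trained classifier (e.g. an SVM).
    labels: 0/1 ground truth for the same samples (ideally held-out data).
    """
    a, b = 1.0, 0.0
    n = len(scores)
    for _ in range(epochs):
        grad_a = grad_b = 0.0
        for s, y in zip(scores, labels):
            err = sigmoid(a * s + b) - y
            grad_a += err * s
            grad_b += err
        a -= lr * grad_a / n
        b -= lr * grad_b / n
    return a, b


def confidence(score, a, b):
    """Calibrated probability that a protein pair with this score interacts."""
    return sigmoid(a * score + b)


# Toy decision values (hypothetical); two samples fall on the wrong side of 0.
toy_scores = [-2.0, -1.5, -1.0, -0.5, 0.3, -0.2, 0.5, 1.0, 1.5, 2.0]
toy_labels = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
a_fit, b_fit = fit_platt(toy_scores, toy_labels)
```

A probability of this kind can be thresholded to trade sensitivity against specificity, or reported directly as a confidence score for each predicted pair.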