Development of advanced machine learning models for the fusion of heterogeneous biological sources in clinical decision support systems for cancer

  1. Carrillo Pérez, Francisco
Supervised by:
  1. Luis Javier Herrera Maldonado Co-supervisor
  2. Ignacio Rojas Ruiz Co-supervisor

Defense university: Universidad de Granada

Defense date: 27 January 2023

Examination committee:
  1. Fátima Al-Shahrour Chair
  2. José Carlos Prados Salazar Secretary
  3. Pedro María Carmona Sáez Member
  4. Almudena Espín Pérez Member

Type: Thesis

Abstract

Cancer is one of the leading causes of death worldwide, just behind cardiovascular diseases. Early diagnosis is key to the patient’s prognosis, since it allows the most suitable treatment to be applied. To this end, multiple screenings are routinely performed on the patient, involving, for instance, the visual examination of histopathological slides, the analysis of the clinical history, or the detection of alterations in gene expression. These examinations, however, are usually time-consuming, and physicians do not always have the experience needed to analyze them. To help with these tasks, clinical decision support systems have been created in recent years using advances in the machine learning field. Machine learning models are able to learn automatically from these data and find insights that help solve a specific task. This is part of the precision medicine field where, using a data-driven approach, we tailor the diagnosis, treatment, and other clinical outcomes to the specific characteristics of the patient. Thanks to the advances in this field, more heterogeneous sources of biological information are being gathered, and they provide diverse features that can help to accurately diagnose a cancer patient. This makes it possible to create systems that use all the available information to accurately model the patient’s disease. This would be similar to obtaining a separate diagnosis per data modality from a group of expert clinicians, where the final diagnosis is based on each clinician's analysis of the source in which they are expert. Unfortunately, not all these sources are always available, which limits the potential of multi-modal machine learning models. In this thesis, we explore the improvements that multi-modal machine learning models resilient to missing modalities can obtain over single-modality ones in the area of cancer diagnosis.
First, we tackled the problem of lung cancer subtype diagnosis using two of the most widely used biomedical modalities in the literature (gene expression and histopathology images), showing the improvements that can be obtained by fusing these two modalities compared with using each one independently. Next, to study the limits that can be reached by fusing heterogeneous biological sources, we added three new modalities to the proposed problem (micro-RNA, DNA methylation values, and the copy number variation of the genes). We tested which modalities complemented each other and what performance can be obtained by fusing all these modalities in a classification model. Lastly, we approached the problem of data scarcity in biomedical multi-modal problems, presenting advanced methodologies for biological data generation. Inspired by the recent advances in multi-modal generative models for natural images, we focused on generating one modality from a paired one (the RNA-to-image synthesis problem) for healthy tissues. We showed that the synthetically generated data were similar to the real samples and that the model was able to impute missing modalities.
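
The idea of combining one diagnosis per modality while tolerating missing modalities can be illustrated with a minimal late-fusion sketch. The abstract does not state which fusion scheme the thesis uses, so the weighted averaging below is only an assumption chosen for simplicity; the function name `late_fusion`, the modality keys, and the probability values are all hypothetical.

```python
from typing import Dict, List, Optional, Sequence


def late_fusion(
    probs_by_modality: Dict[str, Optional[Sequence[float]]],
    weights: Optional[Dict[str, float]] = None,
) -> List[float]:
    """Weighted average of per-modality class probabilities.

    A modality whose value is None is treated as missing and skipped;
    the weights of the remaining modalities are renormalized to sum to 1.
    This is an illustrative scheme, not the thesis's confirmed method.
    """
    available = {m: p for m, p in probs_by_modality.items() if p is not None}
    if not available:
        raise ValueError("at least one modality must be available")

    n_classes = len(next(iter(available.values())))
    if any(len(p) != n_classes for p in available.values()):
        raise ValueError("all modalities must predict the same number of classes")

    if weights is None:
        # Assumption: equal trust in every available modality.
        weights = {m: 1.0 for m in available}
    total = sum(weights[m] for m in available)
    if total <= 0:
        raise ValueError("weights of available modalities must sum to a positive value")

    fused = [0.0] * n_classes
    for modality, probs in available.items():
        w = weights[modality] / total
        for i, p in enumerate(probs):
            fused[i] += w * p
    return fused


# Hypothetical patient: the histopathology image is missing.
patient = {
    "gene_expression": [0.7, 0.2, 0.1],
    "histopathology": None,
    "mirna": [0.5, 0.4, 0.1],
}
fused = late_fusion(patient)
predicted_class = max(range(len(fused)), key=fused.__getitem__)
```

Because the weights are renormalized over the modalities that are present, the fused output is still a probability distribution when a source is absent, which is what lets a single model serve patients with incomplete data.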