Prostate cancer classification using explainable artificial intelligence from gene expression data

Ramírez Mena, Alberto

Supervised by:

Jesús Alcalá Fernández, Co-supervisor
Luis Javier Martínez González, Co-supervisor

Defense university: Universidad de Granada

Defense date: 26 October 2023

Examination committee:

Carmen Entrala Bernal, Chair
Carlos Cano Gutierrez, Secretary
Javier Perez Florido, Member

Type: Thesis

Teseo: 822933

Abstract

Prostate cancer (PC) is one of the most common cancers in men worldwide. Currently, screening strategies for PC typically focus on the measurement of prostate-specific antigen (PSA) blood levels, a combination of anatomical and functional magnetic resonance imaging, and digital rectal examination. However, PSA blood levels are prostate-specific, not necessarily cancer-specific, and can be elevated for a variety of reasons, including benign prostatic hyperplasia. On the other hand, the accuracy of imaging tests is highly dependent on the expertise and experience of the radiologist interpreting them, which limits their use and calls for more objective, specific and precise methods. The diagnosis of PC is made by transrectal ultrasound (TRUS)-guided puncture biopsy or by fusion biopsy, which combines magnetic resonance imaging (MRI) and ultrasound of the prostate. Although imaging-guided biopsies increase the success rate of diagnosing the disease, they often cause significant discomfort to the patient. For all these reasons, the integration of omics data with clinical data is key to understanding the pathogenesis of the disease, improving its diagnosis, and effectively translating this knowledge into clinical practice. Among omics data, those from RNA are among the most interesting, as RNA is the most dynamic omics component and contains a wealth of information that is seldom exploited in PC diagnosis. However, the ability of transcriptomics to represent the physiological state of a patient at a given point in time is already used in the diagnosis of other diseases, so the application of transcriptomics for PC patient stratification in clinical settings is promising. Many studies in PC have focused on the analysis of extracellular vesicles, free miRNA or, as in the case of other tumors, gene-specific markers such as circulating mRNA molecules.
Several genetic susceptibility markers for PC have also been identified using different approaches, but due to the heterogeneity of this disease, only a few of these markers have been robustly associated with PC. Moreover, all identified genetic markers are involved in tumor development or are biomarkers for increased risk of hereditary PC, but no gene has been described for PC diagnosis or screening, so the identification of new biomarkers at early stages of the disease that allow better detection and classification of PC remains a challenge for researchers. Recently, machine learning (ML) techniques have proven effective in improving the prediction and diagnosis of PC, because they can automatically derive accurate predictive models from large amounts of data. These models can be used to build clinical decision support systems (CDSS) that help specialists diagnose or detect the disease earlier and more accurately. However, the huge advances in ML have caused a wave of concern, as in most cases scientists do not understand how algorithms automatically learn from data or how they make decisions. Therefore, the European Commission has proposed a draft law on Artificial Intelligence (AI) and established the so-called Ethics Guidelines for Trustworthy AI to promote the development of trustworthy AI that is lawful, ethical and robust, which is especially important in particularly sensitive areas such as health and cancer, where decisions based on such systems can have a significant impact on people's lives. Accordingly, the overall objective of this thesis is to design and develop a CDSS capable of predicting PC from gene expression in prostate tissue, using data from PC patients and healthy controls, and then to unravel its predictive mechanisms in order to obtain biologically relevant biomarkers that may be related to the disease.
To this end, genes were selected and filtered according to their biological relevance in PC, based on their differential expression, their gene ontology and the information available in the scientific literature. The selected genes were used to develop several CDSSs from the gene expression information in 550 samples included in The Cancer Genome Atlas, using explainable AI techniques that yield models that are easily understood by humans and/or provide explanations of how the model makes its predictions and what features it takes into account. It should be noted that this approach facilitates the detection and prevention of possible biases and discrimination in the models, as it provides greater visibility and control over how decisions are made. The generated CDSSs performed well on various quality metrics, so the best performing CDSS was further validated on four external populations of diverse ethnic ancestry, with a total of 463 samples, obtaining mean sensitivity and specificity values of 0.9 and 0.8, respectively. Finally, a set of Shapley additive explanations (SHAP) was extracted from the best performing CDSS to help clinicians understand the underlying reasons for each decision. These explanations allowed us to understand how the CDSS uses a number of genes that have been associated with PC in the literature, but never for screening, such as DLX1, MYL9, and FGFR, as well as new genes that have not been previously described, such as CAV2 and MYLK. At the same time, we were able to identify the key role of some genes that are not so relevant in absolute terms but have a certain influence in some individuals, genes never before associated with cancer or prostate function, such as RNF112, APOF or MYOCD, among others. The explanations extracted from the CDSS proposed in this work are consistent with each other and with the literature, opening a horizon for its application in clinical practice. Fig. 2 shows a graphical overview of the CDSS construction process.
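As an illustration of what an additive Shapley explanation is, the following minimal sketch computes exact Shapley values for a single sample of a toy scoring function. It is not the CDSS or the model described in this thesis: the function `toy_score`, the three unnamed "gene" inputs, the baseline values and the helper `shapley_values` are all hypothetical, and practical SHAP tooling relies on approximations rather than this exhaustive enumeration.

```python
from itertools import combinations
from math import factorial


def shapley_values(predict, sample, baseline):
    """Exact Shapley values for one sample.

    Features outside a coalition are replaced by their baseline value.
    Cost grows exponentially with the number of features, so this is
    only practical for very small toy examples.
    """
    n = len(sample)
    features = list(range(n))

    def value(coalition):
        # Coalition members take the sample's value; the rest take the baseline.
        x = [sample[i] if i in coalition else baseline[i] for i in features]
        return predict(x)

    phi = [0.0] * n
    for i in features:
        others = [j for j in features if j != i]
        for size in range(n):
            # Shapley weight for coalitions of this size that exclude feature i.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = set(subset)
                phi[i] += weight * (value(s | {i}) - value(s))
    return phi


def toy_score(x):
    # Hypothetical score over three expression values (assumption, not a real model).
    gene_a, gene_b, gene_c = x
    return 2.0 * gene_a - 1.0 * gene_b + 0.5 * gene_a * gene_c


sample = [3.0, 1.0, 2.0]    # hypothetical expression values for one patient
baseline = [1.0, 1.0, 1.0]  # hypothetical reference (e.g. average) values
phi = shapley_values(toy_score, sample, baseline)
explained_gap = toy_score(sample) - toy_score(baseline)
```

The contributions in `phi` add up to the difference between the score for the sample and the score for the baseline, which is the additivity property that lets each prediction be broken down gene by gene. A feature whose value equals its baseline, like the second input here, receives a contribution of zero.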
To analyze the reliability and feasibility of applying the CDSS in clinical practice, we finally performed an analysis on samples of different types (fresh biopsy, paraffin-embedded biopsy and plasma) from a cohort of patients from the Andalusian Health Service monitored by our research group. We successfully validated its performance in local samples of fresh biopsy and paraffin-embedded biopsy, and we were able to demonstrate that the genes DLX1, TDRD1, AMACR, HPN, HOXC6 and OR51E2 are more highly expressed in tissue with PC than in healthy tissue. In addition, we were able to demonstrate that the expression of the AMACR gene has the potential to predict the aggressiveness of PC. In plasma, the performance of the model was affected because many of the genes lacked quantifiable expression in this medium. Nevertheless, the results obtained are encouraging and open a very interesting line of future work to adapt the design carried out in this thesis to this type of sample.