Metilación diferencial en el genoma humano y su asociación con la transcripción

  1. Lebrón Aguilar, Ricardo
Supervised by:
  1. José Lutgardo Oliver Jiménez Director
  2. Michael Hackenberg Co-director

Defence university: Universidad de Granada

Defence date: 09 July 2019

Committee:
  1. Francisco Perfectti Álvarez Chair
  2. Inmaculada López Flores Secretary
  3. Francisca Martinez Real Committee member
  4. Pedro A. Bernaola Galván Committee member
  5. Pedro María Carmona Sáez Committee member
Department:
  1. GENÉTICA

Type: Thesis

Abstract

A human being is composed of more than 400 cell types, which differ in the specific set of genes they transcribe despite sharing the same genomic sequence. The differences between cell types lie in the epigenetic information accompanying the genome and in the transcription factors present in the cell. In adult human cells, cytosine methylation occurs primarily at CpG sites and is probably the most important epigenetic mark, as it contributes to transcription regulation, remains stable throughout the cell lineage and changes during cell fate establishment. According to the traditional paradigm, promoter methylation is associated with the repression of transcription, although there are cases in which it is associated with activation or in which transcription is independent of methylation. Moreover, the effect of methylation on transcription regulation is not limited to promoters: other regions, such as enhancers and the gene body, are also involved. For more than a decade, it has been possible to measure the methylation level of each cytosine in the genome thanks to a high-throughput sequencing technique known as Whole-Genome Bisulfite Sequencing (WGBS). However, many sources of error affect the reliability of the results, causing erroneous methylation levels at some cytosines and even loss of information in certain regions. In response, many researchers average the methylation levels of the CpG sites within the regions of interest, assuming that errors will cancel each other out, and thereby sacrifice the high resolution the technique offers. Nevertheless, the average methylation of a region is not always relevant and can even lead to erroneous conclusions. It has recently been reported that only 16.6% of CpG sites in promoters have an effect on transcription when their methylation changes.
This highlights the need for methods that detect the methylation level of each cytosine more reliably. The first objective of this Doctoral Thesis was to design and implement a protocol for obtaining methylation maps from WGBS reads that addresses all known problems: i) eliminating low-quality positions and positions introduced during library preparation, as well as duplicate reads, ii) correcting problems arising from read alignment, iii) discarding positions and reads affected by methylation bias and iv) distinguishing between C/T substitutions and non-methylated cytosines. During the development of this protocol, a type of bias caused by the use of the new genome assembly models was discovered. The last two versions of the human genome assembly include alternative haplotypes, which attempt to capture structural and sequence variation from different human populations or ethnicities, in order to prevent reads from these haplotypes from misaligning to other regions of the genome. However, whether this inclusion might cause problems had not been evaluated. This Doctoral Thesis describes for the first time that the use of the new assembly models causes the loss of reads from polymorphic loci, as a consequence of an increase in the percentage of reads with ambiguous alignment. To recover these reads and assign them to the consensus assembly, a two-stage alignment strategy was designed: i) all reads are aligned against the full assembly and ii) those whose alignment proved ambiguous in the first stage are aligned against a version of the assembly without alternative haplotypes. Finally, the uniquely aligned reads from both stages are pooled and used in later stages of the protocol. Once the protocol was mature enough, it was implemented as an open-source program named MethFlow.
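The control flow of the two-stage alignment strategy described above can be sketched as follows. This is only an illustration under stated assumptions: the two aligner callables and their return format are hypothetical stand-ins for real alignment runs, and the sketch does not reproduce MethFlow's actual code or any aligner's interface.

```python
def two_stage_alignment(reads, align_full, align_no_alt):
    """Return {read: position} for reads that align uniquely in either stage.

    align_full and align_no_alt are hypothetical callables that return
    ("unique", position), ("ambiguous", None) or ("unaligned", None).
    """
    unique = {}
    ambiguous = []
    # Stage 1: every read is aligned against the full assembly,
    # including the alternative haplotypes.
    for read in reads:
        status, position = align_full(read)
        if status == "unique":
            unique[read] = position
        elif status == "ambiguous":
            ambiguous.append(read)
    # Stage 2: only the ambiguous reads are retried against the
    # assembly without alternative haplotypes.
    for read in ambiguous:
        status, position = align_no_alt(read)
        if status == "unique":
            unique[read] = position
    return unique


# Toy lookup tables standing in for real alignment results.
full = {"r1": ("unique", "chr1:100"), "r2": ("ambiguous", None), "r3": ("unaligned", None)}
no_alt = {"r2": ("unique", "chr6:500")}
merged = two_stage_alignment(
    ["r1", "r2", "r3"],
    full.get,
    lambda read: no_alt.get(read, ("unaligned", None)),
)
```

In this toy run, the read that is ambiguous against the full assembly is recovered in the second stage, while the unaligned read is dropped.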
The workflow of this program starts from WGBS reads in FASTQ format and ends with methylation maps, after passing through several stages that deal with biases and contamination using third-party programs combined with our own code. The most important stages are the two-stage alignment, in which Bismark is used following the strategy described above, and the detection of methylation levels from the corrected alignments using MethylExtract, chosen because it can distinguish C/T substitutions from non-methylated cytosines. One of the major problems that the scientific community faces today is the lack of reproducibility of results. To ensure reproducibility, the MethFlow architecture was based on: i) containers generated from a configuration file, which specifies the version of each program, its installation process and its configuration, and ii) a sophisticated framework for complex pipelines, providing comprehensive control and a thorough record of the executed processes. Finally, MethFlow was given a modular structure, so that modules could later be added to perform related tasks, such as analyzing changes in methylation or its association with transcription. Once a suitable tool was available to study the methylation levels of individual cytosines, it was hypothesized that, depending on the genomic context in which it occurs and the type of transcription factors involved, methylation may contribute to the positive or negative regulation of transcription or have no effect. To test this hypothesis it was necessary to: i) obtain a collection of human methylation maps covering as many cell types and individuals as possible, ii) characterize the differences in methylation due to cell type and individual and iii) study the association with transcription of methylation changes in individual CpG sites and their possible impact on regulatory elements of transcription.
The Roadmap Epigenomics, ENCODE and Enhancing GTEx projects provide public sets of WGBS reads for a wide range of human samples. Using MethFlow, methylation maps were obtained for 86 human samples from 52 cell types of 29 individuals. For 51 of the 86 samples, transcription profiles were also obtained from ENCODE data. These methylation maps and transcription profiles were fundamental in characterizing methylation changes in the human genome and their association with transcription. Each cell type has a characteristic methylation pattern, partly inherited from the stem cell that precedes it in its lineage and partly modified during cell differentiation. Similarly, the same cell type may show certain differences in methylation between individuals due to genetic and environmental factors. Both types of variability in methylation can be expected to have different biological implications. To study the variability of methylation, samples were compared in pairs, and the methylation changes that were characteristic of the cell type or of the individual were then selected. A method for detecting Differentially Methylated CpGs (DMCs) based on Fisher's Exact Test was developed and incorporated into MethFlow as a module. Two types of DMCs were then defined: i) intra-individual DMCs, whose methylation varies between different cell types of the same individual, and ii) inter-individual DMCs, whose methylation varies between individuals for a given cell type. Once a set of DMCs had been obtained for every pair that could be formed under these two definitions, strict sets of intra-individual and inter-individual DMCs were defined: i) for each sample, the DMCs common to all its pairwise comparisons (intra-individual or inter-individual, as appropriate) were selected and ii) all the selected DMCs were pooled into a single set. It was then necessary to design a method to study the enrichment in DMCs of a given set of genomic elements.
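The pairwise comparison of a single CpG site with Fisher's Exact Test, mentioned above, can be illustrated with a minimal sketch. It assumes that each sample contributes counts of methylated and unmethylated reads at the site; the function names and the significance threshold `alpha` are hypothetical, since the abstract does not state the exact criteria used by the MethFlow module.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    total_tables = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of observing x in the top-left cell.
        return comb(row1, x) * comb(row2, col1 - x) / total_tables

    p_observed = prob(a)
    lowest, highest = max(0, col1 - row2), min(row1, col1)
    # Sum the probabilities of all tables at most as likely as the observed one
    # (a small tolerance absorbs floating-point noise).
    p_value = sum(
        p for p in (prob(x) for x in range(lowest, highest + 1))
        if p <= p_observed * (1 + 1e-7)
    )
    return min(1.0, p_value)


def is_dmc(meth_a, unmeth_a, meth_b, unmeth_b, alpha=0.05):
    """Hypothetical DMC call: read counts of sample A versus sample B at one CpG."""
    return fisher_exact_two_sided(meth_a, unmeth_a, meth_b, unmeth_b) < alpha
```

For example, `is_dmc(10, 0, 0, 10)` compares a fully methylated site with a fully unmethylated one and flags it, whereas `is_dmc(5, 5, 5, 5)` does not. A genome-wide analysis would normally also need a correction for multiple testing, which this sketch omits.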
Since the distribution of CpG sites in the genome is not random, enrichment was defined as the ratio between the percentage of CpG sites that are DMCs within the set of genomic elements and the percentage of CpG sites that are DMCs in the rest of the genome. After applying these methods and definitions to the previously obtained methylation maps, it was found that 3,303,077 (12.19%) and 329,974 (1.22%) of the CpG sites of the human genome are, respectively, intra-individual and inter-individual DMCs. The main genomic elements related to the regulation of transcription (promoters, enhancers and transcription factor binding sites) do not show remarkable differences between intra-individual and inter-individual DMCs. However, open chromatin regions were found to be enriched in intra-individual DMCs but depleted in inter-individual DMCs. DMCs are under-represented in promoters and over-represented in enhancers, suggesting that most methylation changes (both between cell types and between individuals) occur in enhancers. Transcription factor binding sites are also enriched in DMCs, regardless of the type of transcription factor involved. In addition, the proportion of DMCs decreases as the distance to the nearest transcription start site decreases, and it increases as the distance to the nearest transcription end site decreases. As mentioned previously, only 16.6% of CpG sites in promoters have an effect on transcription when their methylation changes. Recently, so-called "CpG traffic lights" (CpG-TLs) have been described: individual CpG sites whose level of methylation is associated with the transcription rate of a nearby gene. These biological markers are well suited to testing the hypothesis that the sign of the association between methylation and transcription depends on the genomic context in which methylation occurs and the type of transcription factors involved.
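The enrichment ratio defined at the start of this paragraph reduces to simple arithmetic. The sketch below assumes that the comparison group is the set of CpG sites not covered by the genomic elements, which is an interpretation of the definition; the function and parameter names are hypothetical.

```python
def dmc_enrichment(dmc_inside, cpg_inside, dmc_outside, cpg_outside):
    """Ratio of the DMC fraction inside a set of elements to the DMC fraction outside it.

    Values above 1 indicate over-representation of DMCs in the element set,
    values below 1 indicate under-representation.
    """
    fraction_inside = dmc_inside / cpg_inside
    fraction_outside = dmc_outside / cpg_outside
    return fraction_inside / fraction_outside


# Toy counts: 30 of 100 CpG sites inside the elements are DMCs,
# compared with 10 of 100 CpG sites elsewhere.
toy_enrichment = dmc_enrichment(30, 100, 10, 100)
```

Normalizing by the number of CpG sites on each side, rather than by region length, is what accounts for the non-random distribution of CpG sites.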
Other authors had detected CpG-TLs in the human genome using Spearman's correlation coefficient and selecting only the results with a negative association. However, this test is sensitive to outliers. To reduce this problem and increase the reliability of the results, a method to detect CpG-TLs was developed in this Doctoral Thesis using a combination of Spearman's correlation coefficient and the Kruskal-Wallis test. Two classes of CpG-TLs were also distinguished: i) red, when the association is negative, and ii) green, when the association is positive. This method is available as a MethFlow module. After applying these methods and definitions to the previously obtained methylation maps and transcription profiles, it was found that the number of green CpG-TLs is almost twice the number of red CpG-TLs: 126,959 (0.49%) and 66,746 (0.26%), respectively, of the CpG sites of the human genome. Red and green CpG-TLs are both over-represented in promoters and enhancers. This suggests that both kinds of element have mechanisms to activate or repress transcription via methylation, probably due to different combinations of transcription factor binding sites. In sites recognized by transcription factors with greater affinity for non-methylated sites, both red and green CpG-TLs are over-represented. In contrast, in sites recognized by transcription factors with greater affinity for methylated sites, red CpG-TLs are under-represented while green CpG-TLs are over-represented. This second type of transcription factor is fundamental in mammalian development, and some are even able to recruit enzymes that remodel methylation. In terms of their distribution around genes, the proportion of green CpG-TLs decreases as the distance to the transcription start site decreases, while the proportion of red CpG-TLs increases. The NGSmethDB methylation database contains an extensive collection of methylation maps for different species, cell types and individuals.
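One way of combining Spearman's correlation coefficient with the Kruskal-Wallis test to call red and green CpG-TLs is sketched below. The abstract does not specify how the two tests are combined, so everything here is an assumption made for illustration: samples are binned by methylation level (below 0.2, intermediate, above 0.8), the Kruskal-Wallis H statistic is computed on expression across those bins without tie correction, and the thresholds `rho_min` and the chi-square critical values at the 0.05 level are hypothetical choices.

```python
def ranks(values):
    """Average ranks (1-based); tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            result[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return result


def spearman_rho(x, y):
    """Spearman coefficient, computed as the Pearson correlation of the ranks."""
    rx, ry = ranks(x), ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant variable carries no rank information
    return cov / (var_x * var_y) ** 0.5


def kruskal_wallis_h(groups):
    """Kruskal-Wallis H statistic (no tie correction) for non-empty groups."""
    pooled = [v for g in groups for v in g]
    r = ranks(pooled)
    n = len(pooled)
    total, start = 0.0, 0
    for g in groups:
        total += sum(r[start:start + len(g)]) ** 2 / len(g)
        start += len(g)
    return 12 / (n * (n + 1)) * total - 3 * (n + 1)


def classify_cpg_tl(methylation, expression, rho_min=0.5):
    """Return "red", "green" or None for one CpG site (hypothetical criteria)."""
    rho = spearman_rho(methylation, expression)
    bins = [[], [], []]  # low, intermediate and high methylation (assumed cut-offs)
    for m, e in zip(methylation, expression):
        bins[0 if m < 0.2 else 2 if m > 0.8 else 1].append(e)
    groups = [g for g in bins if g]
    critical = {1: 3.841, 2: 5.991}  # chi-square critical values at 0.05
    if len(groups) < 2 or abs(rho) < rho_min:
        return None
    if kruskal_wallis_h(groups) < critical[len(groups) - 1]:
        return None
    return "red" if rho < 0 else "green"


meth = [0.05, 0.1, 0.15, 0.5, 0.55, 0.6, 0.9, 0.95, 1.0]
red_call = classify_cpg_tl(meth, [9, 8, 7, 6, 5, 4, 3, 2, 1])
green_call = classify_cpg_tl(meth, [1, 2, 3, 4, 5, 6, 7, 8, 9])
```

Requiring both tests to agree means that a correlation driven by a few outlying samples, which may not shift the rank distribution across methylation bins, is less likely to produce a call.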
To optimize the storage and querying of the large volume of data produced throughout this Doctoral Thesis, including methylation, DMC and CpG-TL maps, this database was completely redesigned. To accelerate comparisons between samples, the data were migrated to the MongoDB database system and stored in a hierarchical structure of JSON documents (a standard format for exchanging tagged, hierarchical data between different programming languages), where: i) each assembly has its own database, ii) each chromosome has its own collection of JSON documents, iii) each CpG site has its own JSON document and iv) each sub-document contains one type of biological information (methylation, differential methylation or association with transcription). In the case of methylation maps, each sub-document is divided into three levels: i) the individual, ii) the sample and iii) the type of data. Several ways of accessing, comparing and visualizing the data contained in NGSmethDB were implemented, most notably: i) programmatic access over HTTPS via a RESTful API server and ii) connectivity to the UCSC Genome Browser through Track Hubs. In this Doctoral Thesis, the reliability of detecting the methylation levels of individual cytosines from WGBS reads has been significantly improved, taking into account all the sources of error known today. This has made it possible to test the hypothesis that the sign of the association between methylation and transcription depends on the genomic context in which methylation occurs and the type of transcription factors involved. In the light of the results obtained, it has not been possible to refute this hypothesis.
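The hierarchical layout of one CpG document, as described above, can be illustrated by nesting flat records into a dictionary and serializing it as JSON. All field names, identifiers and values below are hypothetical placeholders and do not reflect the actual NGSmethDB schema; the assembly and chromosome are omitted because, per the description, they are implied by the database and the collection.

```python
import json


def build_cpg_document(position, records):
    """Nest flat (individual, sample, data_type, value) records into one CpG document."""
    document = {"_id": position, "methylation": {}}
    for individual, sample, data_type, value in records:
        # Three levels inside the methylation sub-document:
        # individual -> sample -> type of data.
        by_sample = document["methylation"].setdefault(individual, {})
        by_sample.setdefault(sample, {})[data_type] = value
    return document


# Hypothetical records for a single CpG site.
records = [
    ("individual_1", "sample_liver", "ratio", 0.85),
    ("individual_1", "sample_liver", "coverage", 20),
    ("individual_1", "sample_lung", "ratio", 0.10),
    ("individual_2", "sample_liver", "ratio", 0.80),
]
cpg_document = build_cpg_document(10468, records)
cpg_json = json.dumps(cpg_document, indent=2)
```

Keeping every sample's values for a site in one document means that a comparison between samples at that site needs a single document lookup, which is consistent with the stated goal of accelerating comparisons.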
An unexpected finding is that the positive association between methylation and transcription appears to be more frequent than previously described, and even more frequent than the negative association. Relatedly, in the binding sites of transcription factors with greater affinity for methylated sites, green CpG-TLs are over-represented while red CpG-TLs are under-represented. These positive associations may be due to a hitherto unknown transcription regulation mechanism, but there are also likely to be cases where hydroxymethylation is positively associated with transcription, since the WGBS technique cannot discriminate between methylation and hydroxymethylation. In further studies, the OxBS-seq or TAB-seq techniques should be used to clarify the true nature of green CpG-TLs.