Topic #1: Hidden Markov Models.
Hidden Markov Models are a popular machine learning approach in bioinformatics. Machine learning algorithms are
presented with training data, which are used to derive important insights about the (often hidden) parameters. Once an
algorithm has been suitably trained, it can apply these insights to the analysis of a test sample. As the amount of
training data increases, the accuracy of the machine learning algorithm typically increases as well. The parameters that are
learned during training represent knowledge; application of the algorithm with those parameters to new data (not used
in the training phase) represents the algorithm's use of that knowledge. The Hidden Markov Model (HMM) approach, considered
hereafter, learns some unknown probabilistic parameters from training samples and uses these parameters in the framework
of dynamic programming to find the best explanation for the experimental data. Our final target here is to find gene promoter
sequences inferred from the so-called CpG islands.
Suppose you are given the dinucleotide frequences in CpG-islands and the dinucleotide frequences outside CpG islands. Design
an HMM for finding CpG-islands in genomic sequences. Apply your algorithm to a sequence mined from the Human Genome reference
in a given gene of interest and check if there are CpG islands that are associated to the gene or to some of its switches ?
What are CpG islands and how to track them? Hints for a solution and results
It is sometimes relevant to predict the presence of genomic DNA features such as promoters, enhancers, silencers, insulators
and locus control regions. Such regulatory elements are called cis-regulatory modules or cis-regulatory switches. Identifying
them is difficult compared to finding protein-coding genes because the DNA sequences of interest may be very short (e.g. a
dozen base pairs for transcription factor-binding sites).
The dinucleotide CG is often referred to as CpG (p denotes the phosphate linkage between two desoxyriboses residues).
CpG islands represent an example of a regulatory element that are found in around \(70\%\) of the genes of eukariotic chromosomes.
The dinucleotide cytosine followed by guanosine (CpG) is approximately five-fold underrepresented in many genomes, partly
because the cytosine residue can be exchanged for thymidine by spontaneous deamination. Cytosine residues on CpG dinucleotides
are often methylated. This in turn leads to the recruitment of protein complexes that include histone diacetylases capable of
removing acetyl groups of histones and therefore inhibiting active transcription. CpG islands are regions of high density of
unmetylated CpG dinucleotides and are commonly found in upstream (5') regulatory regions near the transcription start sites of
constitutively active 'housekeeping' genes.
By one criterion, a CpG island is defined as having:
- a GC content \(\ge 50\%\)
- a length \(\ge 200\, bp\)
- a ratio of observed to expected number of CpG dinucleotides of \(>0.6\).
As an example here, We will focus on the GDF15 gene. The UCSC Genome browser shows this gene is on the forward strand of chromosome 19,
band p13.11. In the human genome reference GRch38/hg38, the physical position is located between 18.385 Mb and 18.450 Mb.
The UCSC browser shows there are 2 exons and 1 intron. By checking 'show' in the CpG islands option in the 'regulation' section
of the UCSC Genome browser, two CpG island sequences are indicated in the gene. We will try to retrieve those CpG islands after
importing the DNA sequence locally on a laptop and screening with a self-made HMM algorithm in Python.
This GDF15 gene encodes a secreted ligand of the TGF-beta (transforming growth factor-beta) superfamily of proteins.
Ligands of this family bind various TGF-beta receptors leading to recruitment and activation of SMAD family transcription
factors that regulate gene expression.
The encoded preproprotein is proteolytically processed to generate each subunit of the disulfide-linked homodimer.
The protein is expressed in a broad range of cell types, acts as a pleiotropic cytokine and is involved in the stress response
program of cells after cellular injury. Increased protein levels are associated with disease states such as tissue hypoxia,
inflammation, acute injury and oxidative stress.
The size and locus positions identified by the UCSC genome browser for the two CpG islands are: CpG island of size 206 bp and
CpG island of size 600 bp at physical positions: chr19: 18,386,254-18,386,479 and chr19: 18,388,288-18,388,887 respectively.
To design a Hidden Markov Model (HMM) algorithm to detect CpG islands in a genomic sequence, we first recall, in an
intuitively simple way, what a HMM is.
Proposed research questions:
- Extract the human genomic DNA sequence encompassing GDF15 on chromosome 19 between 18,385,000 bp and 18,450,000 bp
in FASTA format from the ENSEMBL repository.
- Compute the length of the sequence, compute the frequency table of the 4 bases A, T, C, G for this sequence.
- Compute the frequency table of each dinucleotides in the sequence and plot a histogram.
- Use a Hidden Markov Model to retrieve the CpG islands, indicate where there are and provide statistics about the retrieved
CpG islands. Use the CpG island definition criteria proposed earlier.
- Keep track of the CpG islands relative positions and relative to the GDF15 gene (align with mRNA if necessary) in order
to investigate later if causal SNPs appear in these regions which could be associated with a disease like ankylosing
spondylitis for which GWAS data are available.
- Redo all this workflow for the following genes: GFRAL receptor gene of GDF15, RUNX2 (chr6: 45,320,000 bp-45,560,000 which
has 3 CpG islands) involved in chondrogenesis.