Marc Joiret 's web space Bioinformatics and Computational Biology

Bioinformatics and Computational Biology are new fields in Biology and Biomedicine using classical toolboxes of algorithmics but also are providing new methods to solve original problems in Biology.

The typical first algorithms encountered in Bioinformatics are related to the alignment problem. How do you find and score the alignment of a sequence substring to any other sequence of reference ? A first description of important algorithms is presented hereafter:

• Dynamic programming;
• Optimal pairwise alignment;
• Global alignment;
• Suffix trees applications in detecting repeated and unique substrings or the longest common substring between sequences
• Fast alignments (BLAST)
• Multiple Sequence Alignments (CLUSTAL-W)

We provide in the following link here a brief practical introduction to those algorithms with practical illustrations of implemented programs in Python.

In the following sections, we deal with specific topics in Bioinformatics and Computational Biology of interests to curious biologists :

# Topic #1: Hidden Markov Models.

Hidden Markov Models are a popular machine learning approach in bioinformatics. Machine learning algorithms are presented with training data, which are used to derive important insights about the (often hidden) parameters. Once an algorithm has been suitably trained, it can apply these insights to the analysis of a test sample. As the amount of training data increases, the accuracy of the machine learning algorithm typically increases as well. The parameters that are learned during training represent knowledge; application of the algorithm with those parameters to new data (not used in the training phase) represents the algorithm's use of that knowledge. The Hidden Markov Model (HMM) approach, considered hereafter, learns some unknown probabilistic parameters from training samples and uses these parameters in the framework of dynamic programming to find the best explanation for the experimental data. Our final target here is to find gene promoter sequences inferred from the so-called CpG islands.

Suppose you are given the dinucleotide frequences in CpG-islands and the dinucleotide frequences outside CpG islands. Design an HMM for finding CpG-islands in genomic sequences. Apply your algorithm to a sequence mined from the Human Genome reference in a given gene of interest and check if there are CpG islands that are associated to the gene or to some of its switches ?

# What are CpG islands and how to track them? Hints for a solution and results

It is sometimes relevant to predict the presence of genomic DNA features such as promoters, enhancers, silencers, insulators and locus control regions. Such regulatory elements are called cis-regulatory modules or cis-regulatory switches. Identifying them is difficult compared to finding protein-coding genes because the DNA sequences of interest may be very short (e.g. a dozen base pairs for transcription factor-binding sites).

The dinucleotide CG is often referred to as CpG (p denotes the phosphate linkage between two desoxyriboses residues). CpG islands represent an example of a regulatory element that are found in around $$70\%$$ of the genes of eukariotic chromosomes. The dinucleotide cytosine followed by guanosine (CpG) is approximately five-fold underrepresented in many genomes, partly because the cytosine residue can be exchanged for thymidine by spontaneous deamination. Cytosine residues on CpG dinucleotides are often methylated. This in turn leads to the recruitment of protein complexes that include histone diacetylases capable of removing acetyl groups of histones and therefore inhibiting active transcription. CpG islands are regions of high density of unmetylated CpG dinucleotides and are commonly found in upstream (5') regulatory regions near the transcription start sites of constitutively active 'housekeeping' genes.

By one criterion, a CpG island is defined as having:

• a GC content $$\ge 50\%$$
• a length $$\ge 200\, bp$$
• a ratio of observed to expected number of CpG dinucleotides of $$>0.6$$.
As an example here, We will focus on the GDF15 gene. The UCSC Genome browser shows this gene is on the forward strand of chromosome 19, band p13.11. In the human genome reference GRch38/hg38, the physical position is located between 18.385 Mb and 18.450 Mb. The UCSC browser shows there are 2 exons and 1 intron. By checking 'show' in the CpG islands option in the 'regulation' section of the UCSC Genome browser, two CpG island sequences are indicated in the gene. We will try to retrieve those CpG islands after importing the DNA sequence locally on a laptop and screening with a self-made HMM algorithm in Python. This GDF15 gene encodes a secreted ligand of the TGF-beta (transforming growth factor-beta) superfamily of proteins. Ligands of this family bind various TGF-beta receptors leading to recruitment and activation of SMAD family transcription factors that regulate gene expression. The encoded preproprotein is proteolytically processed to generate each subunit of the disulfide-linked homodimer. The protein is expressed in a broad range of cell types, acts as a pleiotropic cytokine and is involved in the stress response program of cells after cellular injury. Increased protein levels are associated with disease states such as tissue hypoxia, inflammation, acute injury and oxidative stress. The size and locus positions identified by the UCSC genome browser for the two CpG islands are: CpG island of size 206 bp and CpG island of size 600 bp at physical positions: chr19: 18,386,254-18,386,479 and chr19: 18,388,288-18,388,887 respectively.

To design a Hidden Markov Model (HMM) algorithm to detect CpG islands in a genomic sequence, we first recall, in an intuitively simple way, what a HMM is.

Proposed research questions:
• Extract the human genomic DNA sequence encompassing GDF15 on chromosome 19 between 18,385,000 bp and 18,450,000 bp in FASTA format from the ENSEMBL repository.
• Compute the length of the sequence, compute the frequency table of the 4 bases A, T, C, G for this sequence.
• Compute the frequency table of each dinucleotides in the sequence and plot a histogram.
• Use a Hidden Markov Model to retrieve the CpG islands, indicate where there are and provide statistics about the retrieved CpG islands. Use the CpG island definition criteria proposed earlier.
• Keep track of the CpG islands relative positions and relative to the GDF15 gene (align with mRNA if necessary) in order to investigate later if causal SNPs appear in these regions which could be associated with a disease like ankylosing spondylitis for which GWAS data are available.
• Redo all this workflow for the following genes: GFRAL receptor gene of GDF15, RUNX2 (chr6: 45,320,000 bp-45,560,000 which has 3 CpG islands) involved in chondrogenesis.

# Code extracts that produced the solution above:

Click the link here to access the code on GitHub.