Back to homepage

Examples on a human contact map

We give examples on how to use BNMF to study real Hi-C data. Human Hi-C contact maps are provided by Ren Lab. In the paper, we checked the Hi-C data for all human and mouse cell lines. Here, we only show the example on the raw Hi-C matrix for chromosome 22 in IMR90 cell line.

Import required modules

In [1]:
%matplotlib inline
import matplotlib.pyplot as plt
from contact_map import ContactMap
import numpy as np

Create an object for the contact map

In [2]:
map1 = ContactMap('demo-human')

Load the genome information. Basically, we only need to know the chromosome lengths and how to map chromosome names to unique integers. In the file we provided, they are

chr start   end         name
1   0       247249719   chr1
2   0       242951149   chr2
...

which means chr1 in human is shown as number 1 with a length of 247,249,719 base-pairs. All lines starting with # will be ignored.

In [3]:
map1.genome_info('hg18_chr_len.txt')
Read file hg18_chr_len.txt None column and row [1,23)
chr1 with index 1
chr2 with index 2
chr3 with index 3
chr4 with index 4
chr5 with index 5
chr6 with index 6
chr7 with index 7
chr8 with index 8
chr9 with index 9
chr10 with index 10
chr11 with index 11
chr12 with index 12
chr13 with index 13
chr14 with index 14
chr15 with index 15
chr16 with index 16
chr17 with index 17
chr18 with index 18
chr19 with index 19
chr20 with index 20
chr21 with index 21
chr22 with index 22
chrX with index 23
Read file hg18_chr_len.txt 0 column and row [1,23)
Read file hg18_chr_len.txt 1 column and row [1,23)
Read file hg18_chr_len.txt 2 column and row [1,23)

As we focus on the intra-chromosome interactions, we further narrow down the genome region to be on chr22 alone.

In [4]:
map1.focus_chromosome('chr22');

The data matrix we are using is constructed at 40k resolution. So, we first bin the chromosome with the same resolution and then load the whole matrix directly. If you are not sure the data format, please read the instructions at Ren Lab's website.

In [5]:
map1.create_binnedmap(40e3)
map1.contact_map = np.loadtxt('IMR90.uij.chr22')
Create contact map of shape (1243, 1243)

Let's see what the contact map looks like.

In [6]:
map1.plot_map(title='Input Hi-C contact map $X$');

Apply BNMF on it

We have make all computations to be automatic, by calling the decompose_auto function.

In [7]:
map1.decompose_auto();
dim_num = 78
> Solve NMF for size (1243, 1243) <type 'numpy.ndarray'>
Matrix density is 0.67695587648 and mask 0
Optimize available solution for H!
Optimize available solution for S!
Lambda for NMF-PoissonManifoldEqual is set to 1
Initial objective is 327914.751031
Current objective is 327925.357522;  
Density of H is 0.598; Density of S is 0.520;
The best NMF-PoissonManifoldEqual objective for NMF is 327925.357522 with r=78 after 2 iterations.

It creates a file demo-human_nmf.npz that saves the matrix factors which will be loaded when decomposing the contact matrix with same parameters.

Having the matrix factors, we can sort the cluster order according to their major bins in genome.

In [8]:
map1.sort_groups();

Now, we check the factorization solution one by one.

In [9]:
map1.plot_map(map1.contact_group, vmin=0.1, log=False, title='Cluster membership matrix $H$');
In [10]:
map1.plot_map(map1.group_map * map1.contact_group.T, vmin=0.1, log=False, title='Cluster membership matrix $W$');
In [11]:
map1.plot_map(map1.contact_group * map1.group_map * map1.contact_group.T, vmin=0.1, title='Balanced matrix $R=HSH^T$');

Also, we can check the position bias that is corrected. In this case, we only correct the bias on a single chromosome which may be less accurate than the one corrected on the whole contact map.

In [12]:
map1.plot_map(np.outer(map1.bias_vector, map1.bias_vector), log=False, title='Fatorable bias');

Comparing the clusters with topological domains

Topologically associated domains (TADs) is an important structure block found in many genomes. Using the TAD list from Ren Lab, we can check its correlation with our local clusters.

Again, we use IMR90 cell line as an example and focus on a single chromosome. Let's first get the bin indexes of TAD boundaries.

  • st=0: read the file by skipping the first 0 lines
  • ch=0: chromosome indexes are saved in the column of index 0
  • po=1: positions are saved in the column of index 1
  • add=0: shift the bin index in order to change between 0-based index and 1-based index
In [13]:
TAD_st, _ = map1.get_locations('IMR90.domain.txt', st=0, ch=0, po=1, add=0)
TAD_ed, _ = map1.get_locations('IMR90.domain.txt', st=0, ch=0, po=2, add=-1)
Read file IMR90.domain.txt 0 column and row [0,2348)
Read file IMR90.domain.txt 1 column and row [0,2348)
Read file IMR90.domain.txt -1 column and row [0,2348)
!! There are 2322 unmapped locations.
Read file IMR90.domain.txt 0 column and row [0,2348)
Read file IMR90.domain.txt 2 column and row [0,2348)
Read file IMR90.domain.txt -1 column and row [0,2348)
!! There are 2322 unmapped locations.

To better show the cluster membership values, we normalize the matrix to be rows summing to one. It means that for each bin we have a set of values with sum of one corresponding to the local clusters. If we treat those values as the probability of each bin contacting with each cluster, we can further calculate the Gini impurity score by $$Gini(i) = 1 - \sum_j p_{ij}^2$$

In [14]:
P = np.asarray(map1.contact_group * map1.group_map)
ps = P.sum(1)
P /= np.mean(ps[ps>0])

gini = 1-np.power(P,2).sum(1)
gini[ps==0] = 0

Let's plot a small region on the chromosome to check those values. TAD regions are marked as black bars at the top.

In [15]:
region = np.arange(700,800)
plt.plot(region, gini[region], '--k', label='Gini')
for i in xrange(P.shape[1]):
    if P[region,i].max() > 0.5:
        plt.plot(region, P[region,i], label='C%s'%(i+1))
for i,j in zip(TAD_st, TAD_ed):
    if i in region or j in region:
        plt.plot([i,j], [1,1], '-k', linewidth=3)
plt.xlim([region.min(), region.max()])
plt.ylim([0,1.1])
plt.legend(loc='center left', bbox_to_anchor=(1, 0.5));

We can see the local clusters provide a rich information on the multi-to-multi interaction pattern and have a smaller size compared with TADs. Bins with high Gini impurity score indicate the interactions from many clusters. As the Gini impurity can measure the interaction interfaces of local clusters, we check the enrichment of Gini impurity scores at the TAD broundary.

In [16]:
nearby = np.arange(-10,11)
freq = np.ones_like(nearby, dtype='float')
num = 0
for i in TAD_st + TAD_ed:
    sel = i + nearby
    if sel.min() >= 0 and sel.max() < gini.shape[0]:
        freq += gini[sel]
        num += 1
freq /= num
plt.plot(nearby, freq)
plt.xlabel('Bins near TAD broundaries')
plt.ylabel('Average Gini impurity score');

We can see a clear peak at the TAD broundary, which means our local clusters are able to recover TADs.

Back to homepage