We give examples on how to use BNMF to study real Hi-C data. Human Hi-C contact maps are provided by Ren Lab. In the paper, we checked the Hi-C data for all human and mouse cell lines. Here, we only show the example on the raw Hi-C matrix for chromosome 22 in IMR90 cell line.
%matplotlib inline
import matplotlib.pyplot as plt
from contact_map import ContactMap
import numpy as np
map1 = ContactMap('demo-human')
Load the genome information. Basically, we only need to know the chromosome lengths and how to map chromosome names to unique integers. In the file we provided, they are
chr start end name
1 0 247249719 chr1
2 0 242951149 chr2
...
which means chr1 in human is shown as number 1 with a length of 247,249,719 base-pairs. All lines starting with # will be ignored.
map1.genome_info('hg18_chr_len.txt')
As we focus on the intra-chromosome interactions, we further narrow down the genome region to be on chr22 alone.
map1.focus_chromosome('chr22');
The data matrix we are using is constructed at 40k resolution. So, we first bin the chromosome with the same resolution and then load the whole matrix directly. If you are not sure the data format, please read the instructions at Ren Lab's website.
map1.create_binnedmap(40e3)
map1.contact_map = np.loadtxt('IMR90.uij.chr22')
Let's see what the contact map looks like.
map1.plot_map(title='Input Hi-C contact map $X$');
We have make all computations to be automatic, by calling the decompose_auto function.
map1.decompose_auto();
It creates a file demo-human_nmf.npz that saves the matrix factors which will be loaded when decomposing the contact matrix with same parameters.
Having the matrix factors, we can sort the cluster order according to their major bins in genome.
map1.sort_groups();
Now, we check the factorization solution one by one.
map1.plot_map(map1.contact_group, vmin=0.1, log=False, title='Cluster membership matrix $H$');
map1.plot_map(map1.group_map * map1.contact_group.T, vmin=0.1, log=False, title='Cluster membership matrix $W$');
map1.plot_map(map1.contact_group * map1.group_map * map1.contact_group.T, vmin=0.1, title='Balanced matrix $R=HSH^T$');
Also, we can check the position bias that is corrected. In this case, we only correct the bias on a single chromosome which may be less accurate than the one corrected on the whole contact map.
map1.plot_map(np.outer(map1.bias_vector, map1.bias_vector), log=False, title='Fatorable bias');
Topologically associated domains (TADs) is an important structure block found in many genomes. Using the TAD list from Ren Lab, we can check its correlation with our local clusters.
Again, we use IMR90 cell line as an example and focus on a single chromosome. Let's first get the bin indexes of TAD boundaries.
st=0
: read the file by skipping the first 0 linesch=0
: chromosome indexes are saved in the column of index 0po=1
: positions are saved in the column of index 1add=0
: shift the bin index in order to change between 0-based index and 1-based indexTAD_st, _ = map1.get_locations('IMR90.domain.txt', st=0, ch=0, po=1, add=0)
TAD_ed, _ = map1.get_locations('IMR90.domain.txt', st=0, ch=0, po=2, add=-1)
To better show the cluster membership values, we normalize the matrix to be rows summing to one. It means that for each bin we have a set of values with sum of one corresponding to the local clusters. If we treat those values as the probability of each bin contacting with each cluster, we can further calculate the Gini impurity score by $$Gini(i) = 1 - \sum_j p_{ij}^2$$
P = np.asarray(map1.contact_group * map1.group_map)
ps = P.sum(1)
P /= np.mean(ps[ps>0])
gini = 1-np.power(P,2).sum(1)
gini[ps==0] = 0
Let's plot a small region on the chromosome to check those values. TAD regions are marked as black bars at the top.
region = np.arange(700,800)
plt.plot(region, gini[region], '--k', label='Gini')
for i in xrange(P.shape[1]):
if P[region,i].max() > 0.5:
plt.plot(region, P[region,i], label='C%s'%(i+1))
for i,j in zip(TAD_st, TAD_ed):
if i in region or j in region:
plt.plot([i,j], [1,1], '-k', linewidth=3)
plt.xlim([region.min(), region.max()])
plt.ylim([0,1.1])
plt.legend(loc='center left', bbox_to_anchor=(1, 0.5));
We can see the local clusters provide a rich information on the multi-to-multi interaction pattern and have a smaller size compared with TADs. Bins with high Gini impurity score indicate the interactions from many clusters. As the Gini impurity can measure the interaction interfaces of local clusters, we check the enrichment of Gini impurity scores at the TAD broundary.
nearby = np.arange(-10,11)
freq = np.ones_like(nearby, dtype='float')
num = 0
for i in TAD_st + TAD_ed:
sel = i + nearby
if sel.min() >= 0 and sel.max() < gini.shape[0]:
freq += gini[sel]
num += 1
freq /= num
plt.plot(nearby, freq)
plt.xlabel('Bins near TAD broundaries')
plt.ylabel('Average Gini impurity score');
We can see a clear peak at the TAD broundary, which means our local clusters are able to recover TADs.