We give examples on how to use BNMF to study real Hi-C data. Yeast Hi-C contact maps are provided by Noble Research Lab. In the paper, we did the experiment on HindIII interactions at FDR 1% at 10kb resolution.
Here, we only want to show the procedure and so choose the smallest Hi-C contact map that is contained in our package. Please check Noble lab's website for the full data set.
%matplotlib inline
import matplotlib.pyplot as plt
from contact_map import ContactMap
import numpy as np
map1 = ContactMap('demo-yeast')
Load the genome information. Basically, we only need to know the chromosome lengths and how to map chromosome names to unique integers. In the file we provided, they are
chr start end name
1 0 230218 chrI
2 0 813184 chrII
...
which means chrI in yeast is shown as number 1 with a length of 230,218 base-pairs.
All lines starting with #
will be ignored.
map1.genome_info('yeast_chr_len.txt')
Then, we can import contact links from Hi-C data. The format for long-range interaction files is like:
chr1 locus1 chr2 locus2 freq
1 3437 2 9409 7
1 31834 2 301976 6
...
which means location chrI:3437
and location chrII:9409
have 7 interactions and so on. Multiple interaction files can be imported in sequence.
datafiles = ['HindIII_intersect_EcoRI_fdr0.01_inter.txt',
'HindIII_intersect_EcoRI_fdr0.01_intra.txt']
for datafile in datafiles:
map1.add_interactions(datafile)
We group those interations into a contact map with a fixed bin size. We set a large bin size (40kb) in this example through the parameter binsize=40e3
.
map1.create_binnedmap(binsize=40e3)
Let's see what the contact map looks like.
map1.plot_map(title='Input Hi-C contact map $X$');
We have make all computations to be automatic, by calling the decompose_auto
function.
map1.decompose_auto();
It creates a file demo-yeast_nmf.npz
that saves the matrix factors which will be loaded when decomposing the contact matrix with same parameters.
Having the matrix factors, we can sort the cluster order according to their major bins in genome.
map1.sort_groups();
Now, we check the factorization solution one by one.
map1.plot_map(map1.contact_group, vmin=0.1, log=False, title='Cluster membership matrix $H$');
map1.plot_map(map1.group_map * map1.contact_group.T, vmin=0.1, log=False, title='Cluster membership matrix $W$');
map1.plot_map(map1.contact_group * map1.group_map * map1.contact_group.T, vmin=0.1, title='Balanced matrix $R=HSH^T$');
Also, we can check the position bias that is corrected.
map1.plot_map(np.outer(map1.bias_vector, map1.bias_vector), log=False, title='Fatorable bias');
We can test whether a type of genomic feature is enriched in those clusters. Again, we used the list of early origin sites from Noble lab's website.
To do the test, we need to transform the genomic locations into the bin indexes. According to the file format, we have several parameters.
st=0
: read the file by skipping the first 0 linesch=0
: chromosome indexes are saved in the column of index 0po=1
: positions are saved in the column of index 1nm=0
: the location names are saved in the column of index 0 (we don't have location's name)idx, names = map1.get_locations('origins_nonCDR_early.txt', st=0, ch=0, po=1, nm=0)
We test the bins list using the area under the ROC curve (AUC).
srt, val, pval = map1.test_enrichment(idx, method='AUC')
val
saves the AUC values and pval
are the corresponding p-values. srt
gives the indexes from the most significant cluster to the least one.
plt.bar(np.arange(10), val[srt[:10]]);
plt.xticks(np.arange(0.4,11,1), ['C%s'%(i+1) for i in srt[:10]]);
plt.ylim([0.5, val.max()+0.01]);
Let's check the cluster with the largest AUC in the whole contact map.
map1.plot_map(map1.contact_group[:,srt[0]]*map1.contact_group[:,srt[0]].T, log=False, title='Submap for C%s'%(srt[0]+1));
We can see the most enriched cluster is the one for centromeres.