Download
Links
Sample | Type | Download | MD5 |
---|---|---|---|
GM12878 | Per chromosome | HTTP | b6d322687af0bb50612a9bef9bd93775 |
GM12878 | Whole genome, Hi-C (B-B edges) | HTTP | a2c9551b325ee2309a1d986e879e11ac |
K562 | Per chromosome | HTTP | ad99c2c100f39e09cde36eda6dba011b |
K562 | Whole genome, Hi-C (B-B edges) | HTTP | 91cdf1d4a54552c7510079e97ead20e7 |
NHEK | Per chromosome | HTTP | 466a4cd9369dc207989a68ace680806c |
NHEK | Whole genome, Hi-C (B-B edges) | HTTP | 52dcf5c446d97aeed7cec309723eed23 |
HUVEC | Per chromosome | HTTP | e247dd00a5cfc7105eca6e52ba1ec5a2 |
HUVEC | Whole genome, Hi-C (B-B edges) | HTTP | 32dc348ee82887bc6b5d1239cacb9270 |
HMEC | Per chromosome | HTTP | 6c86d09fb77440ee68aeffed02462555 |
HMEC | Whole genome, Hi-C (B-B edges) | HTTP | 0d42af3b692d07d8453e8fba686376cb |
All samples | Whole genome, raw PPI table | HTTP | de9e6c11ecc5e27aa36531c378b538c6 |
Note
If you wish to reproduce per chromosome results for a certain sample (e.g. GM12878), please download the "Per chromosome" and "Whole genome, raw PPI table" packages. To reproduce whole genome results, you will also need the "Whole genome, Hi-C (B-B edges)" package.
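The table above lists an MD5 checksum for each package, so a download can be checked before use. The following is a minimal sketch of such a check; the archive file name in the usage note is hypothetical, since the actual file names are not given here.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hexadecimal MD5 digest of the file at `path`."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read in chunks so that large archives do not need to fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_md5(path, expected_md5):
    """Compare a file's MD5 digest with the value listed in the table."""
    return md5_of_file(path) == expected_md5.strip().lower()
```

For example, `matches_md5("GM12878.tar.gz", "b6d322687af0bb50612a9bef9bd93775")` would check the GM12878 per chromosome package, assuming it was saved under that (hypothetical) name.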
Directory structure
Per chromosome package
From now on, we assume that you have downloaded the GM12878 package. For other samples, replace GM12878 below with the name of the sample you have downloaded.
./GM12878/
./GM12878/node/ # See 1.
./GM12878/node/gene_expr/ # See 1.a.
./GM12878/node/gene_expr/chr1.gene
./GM12878/node/gene_expr/chr2.gene
...
./GM12878/node/gene_expr/chrX.gene
./GM12878/node/bin_DNase/ # See 1.b.
./GM12878/node/bin_DNase/chr1.bin.bed.DNase
./GM12878/node/bin_DNase/chr2.bin.bed.DNase
...
./GM12878/node/bin_DNase/chrX.bin.bed.DNase
./GM12878/edge/ # See 2.
./GM12878/edge/gene_bin/ # See 2.a.
./GM12878/edge/gene_bin/chr1.g_b.csv
./GM12878/edge/gene_bin/chr2.g_b.csv
...
./GM12878/edge/gene_bin/chrX.g_b.csv
./GM12878/edge/bin_bin/ # See 2.b.
./GM12878/edge/bin_bin/chr1.bb
./GM12878/edge/bin_bin/chr2.bb
...
./GM12878/edge/bin_bin/chrX.bb
Whole genome, Hi-C (B-B edges) package
./GM12878_wg/
./GM12878_wg/edge/ # See 2.
./GM12878_wg/edge/bin_bin/ # See 2.b.
./GM12878_wg/edge/bin_bin/all.bb
You may wish to merge the contents of this package with the Per chromosome package of the corresponding sample.
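One possible way to do the merge is to copy the whole genome directory tree into the per chromosome directory, so that all.bb ends up next to the chrN.bb files. This is only a sketch: the function name is hypothetical, and it assumes both packages have been extracted with the layouts shown above.

```python
import shutil
from pathlib import Path


def merge_packages(whole_genome_dir, per_chromosome_dir):
    """Copy the whole genome package into the per chromosome package.

    Existing directories are reused (requires Python 3.8 or later), so
    edge/bin_bin/all.bb is placed alongside the per chromosome .bb files.
    """
    shutil.copytree(whole_genome_dir, per_chromosome_dir, dirs_exist_ok=True)
    return Path(per_chromosome_dir)
```

For example, `merge_packages("./GM12878_wg", "./GM12878")` would leave `./GM12878/edge/bin_bin/all.bb` in place without touching the existing per chromosome files.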
Whole genome, raw PPI table package
./PPI_raw/
./PPI_raw/ppi.txt # See 3.
Package contents
1. Nodes
1.a. Gene expression levels
An excerpt:
> head -n 3 ./GM12878/node/gene_expr/chr1.gene
a.ENSG00000007908$1$169691781$169703220$-1$SELE,0
a.ENSG00000009709$1$18957500$19075360$1$PAX7,0
a.ENSG00000010932$1$171217638$171255117$1$FMO1,0
> tail -n 3 ./GM12878/node/gene_expr/chr1.gene
a.ENSG00000143184$1$168545711$168551315$1$XCL1,806.12
a.ENSG00000143119$1$111415772$111442550$1$CD53,818.15
a.ENSG00000074800$1$8921061$8939308$-1$ENO1,1224.75
The gene expression files are formatted as:
a.<Ensembl gene ID>$<Chromosome>$<Starting position>$<Ending position>$<Strand>$<Gene name>,<RPKM>
In GEEK, we always use 'a.' to identify a gene node, and a.<Ensembl gene ID>$<Chromosome>$<Starting position>$<Ending position>$<Strand>$<Gene name> is the ID of a gene node.
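The format above can be read with a few string splits. The sketch below is illustrative only; the `GeneNode` type and the `parse_gene_line` function are hypothetical names, not part of GEEK.

```python
from typing import NamedTuple


class GeneNode(NamedTuple):
    node_id: str      # full gene node ID, including the 'a.' prefix
    ensembl_id: str
    chromosome: str
    start: int
    end: int
    strand: int       # 1 or -1, as in the excerpt
    name: str
    rpkm: float


def parse_gene_line(line):
    """Parse one line of a chrN.gene file into a GeneNode."""
    # The RPKM follows the last comma; everything before it is the node ID.
    node_id, rpkm = line.strip().rsplit(",", 1)
    if not node_id.startswith("a."):
        raise ValueError("not a gene node ID: " + node_id)
    # Drop the 'a.' prefix, then split the six '$'-separated fields.
    ensembl_id, chromosome, start, end, strand, name = node_id[2:].split("$", 5)
    return GeneNode(node_id, ensembl_id, chromosome,
                    int(start), int(end), int(strand), name, float(rpkm))
```

Applied to the last line of the excerpt, this yields a node named ENO1 on chromosome 1, strand -1, with an RPKM of 1224.75.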
1.b. Genomic bin DNase I hypersensitivity levels
An excerpt:
> head -n 3 ./GM12878/node/bin_DNase/chr1.bin.bed.DNase
v.chr1.0.50000,0.148532
v.chr1.100000.150000,0.148671
v.chr1.1000000.1050000,0.657203
> tail -n 3 ./GM12878/node/bin_DNase/chr1.bin.bed.DNase
v.chr1.99850000.99900000,0.155038
v.chr1.99900000.99950000,0.202276
v.chr1.99950000.100000000,0.176335
The bin DNase I hypersensitivity files are formatted as:
v.chr<Chromosome>.<Bin start position>.<Bin end position>,<DNase I hypersensitivity level>
In GEEK, we always use 'v.' to identify a bin node, and v.chr<Chromosome>.<Bin start position>.<Bin end position> is the ID of a bin node.
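A bin line can be parsed in the same way as a gene line. The sketch below uses hypothetical names (`BinNode`, `parse_bin_line`) and assumes that chromosome names contain no '.' character, as in the excerpt.

```python
from typing import NamedTuple


class BinNode(NamedTuple):
    node_id: str      # full bin node ID, including the 'v.' prefix
    chromosome: str   # e.g. 'chr1'
    start: int
    end: int
    dnase: float      # DNase I hypersensitivity level


def parse_bin_line(line):
    """Parse one line of a chrN.bin.bed.DNase file into a BinNode."""
    # Split on the last comma only, because the level itself contains a '.'.
    node_id, level = line.strip().rsplit(",", 1)
    prefix, chromosome, start, end = node_id.split(".")
    if prefix != "v":
        raise ValueError("not a bin node ID: " + node_id)
    return BinNode(node_id, chromosome, int(start), int(end), float(level))
```

The lines in the excerpt appear to be ordered as text rather than by position (0, 100000, 1000000), so sorting the parsed nodes by `start` may be useful when genomic order matters.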
2. Edges
2.a. Gene-bin associations (G-B edges and B-G edges)
An excerpt:
> tail -n 3 ./GM12878/edge/gene_bin/chr1.g_b.csv
a.ENSG00000171163$1$249144205$249153343$-1$ZNF692,v.chr1.249150000.249200000
a.ENSG00000171161$1$249132409$249143716$1$ZNF672,v.chr1.249100000.249150000
a.ENSG00000185220$1$249200395$249214145$1$PGBD2,v.chr1.249200000.249250000
Given the definitions of the gene node ID (see 1.a.) and the bin node ID (see 1.b.), the gene-bin associations file is formatted as:
<Gene node ID>,<Bin node ID>
Because the association between a gene and a bin is symmetric (bidirectional), this file can be used to build both G-B edges and B-G edges.
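As an illustration, both edge directions can be derived from the same lines. This is a sketch under the format described above; `read_gene_bin_edges` is a hypothetical helper, not a GEEK function.

```python
def read_gene_bin_edges(lines):
    """Return (G-B edges, B-G edges) from lines of a chrN.g_b.csv file."""
    gene_to_bin = []
    bin_to_gene = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        # The bin node ID follows the last comma.
        gene_id, bin_id = line.rsplit(",", 1)
        gene_to_bin.append((gene_id, bin_id))
        # The association is symmetric, so the reverse edge is added as well.
        bin_to_gene.append((bin_id, gene_id))
    return gene_to_bin, bin_to_gene
```

The function accepts any iterable of lines, so an open file handle can be passed directly.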
2.b. Bin-bin interactions (B-B edges, Hi-C)
An excerpt:
> tail -n 3 ./GM12878/edge/bin_bin/chr1.bb
chr1.99950000.100000000,chr1.145050000.145100000,5.65
chr1.99950000.100000000,chr1.222100000.222150000,3.39
chr1.99950000.100000000,chr1.222150000.222200000,2.83
The bin-bin interactions file is formatted as:
<Bin node ID>,<Bin node ID>,<Weight>
Note that, in the excerpt above, the bin node IDs appear without the 'v.' prefix described in 1.b.
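In the excerpt, the bin IDs in the .bb file are written without the 'v.' prefix used in the bin node files. The sketch below parses a line and optionally adds that prefix so the IDs match those in 1.b.; the function name and the `add_prefix` parameter are hypothetical.

```python
def parse_bin_bin_line(line, add_prefix=True):
    """Parse one line of a .bb file into (bin ID, bin ID, weight)."""
    left, right, weight = line.strip().split(",")
    if add_prefix:
        # Add 'v.' only when it is missing, so prefixed input is left as-is.
        left = left if left.startswith("v.") else "v." + left
        right = right if right.startswith("v.") else "v." + right
    return left, right, float(weight)
```

With `add_prefix=True`, the first excerpt line becomes an edge between `v.chr1.99950000.100000000` and `v.chr1.145050000.145100000` with weight 5.65.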
3. PPI table
An excerpt:
> head -n 3 ./PPI_raw/ppi.txt
#ID Interactor A ID Interactor B Alt IDs Interactor A Alt IDs Interactor B Aliases Interactor A Aliases Interactor B Interaction Detection Method Publication 1st Author Publication Identifiers Taxid Interactor A Taxid Interactor B Interaction Types Source Database Interaction Identifiers Confidence Values
entrez gene/locuslink:6416 entrez gene/locuslink:2318 biogrid:112315|entrez gene/locuslink:MAP2K4 biogrid:108607|entrez gene/locuslink:FLNC entrez gene/locuslink:JNKK(gene name synonym)|entrez gene/locuslink:JNKK1(gene name synonym)|entrez gene/locuslink:MAPKK4(gene name synonym)|entrez gene/locuslink:MEK4(gene name synonym)|entrez gene/locuslink:MKK4(gene name synonym)|entrez gene/locuslink:PRKMK4(gene name synonym)|entrez gene/locuslink:SAPKK-1(gene name synonym)|entrez gene/locuslink:SAPKK1(gene name synonym)|entrez gene/locuslink:SEK1(gene name synonym)|entrez gene/locuslink:SERK1(gene name synonym)|entrez gene/locuslink:SKK1(gene name synonym) entrez gene/locuslink:ABP-280(gene name synonym)|entrez gene/locuslink:ABP280A(gene name synonym)|entrez gene/locuslink:ABPA(gene name synonym)|entrez gene/locuslink:ABPL(gene name synonym)|entrez gene/locuslink:FLN2(gene name synonym)|entrez gene/locuslink:MFM5(gene name synonym)|entrez gene/locuslink:MPD4(gene name synonym) psi-mi:"MI:0018"(two hybrid) "Marti A (1997)" pubmed:9006895 taxid:9606 taxid:9606 psi-mi:"MI:0407"(direct interaction) psi-mi:"MI:0463"(biogrid) biogrid:103 -
entrez gene/locuslink:84665 entrez gene/locuslink:88 biogrid:124185|entrez gene/locuslink:MYPN biogrid:106603|entrez gene/locuslink:ACTN2 entrez gene/locuslink:CMD1DD(gene name synonym)|entrez gene/locuslink:CMH22(gene name synonym)|entrez gene/locuslink:MYOP(gene name synonym)|entrez gene/locuslink:RCM4(gene name synonym) entrez gene/locuslink:CMD1AA(gene name synonym) psi-mi:"MI:0018"(two hybrid) "Bang ML (2001)" pubmed:11309420 taxid:9606 taxid:9606 psi-mi:"MI:0407"(direct interaction) psi-mi:"MI:0463"(biogrid) biogrid:117 -
> tail -n 3 ./PPI_raw/ppi.txt
entrez gene/locuslink:5071 entrez gene/locuslink:898 biogrid:111105|entrez gene/locuslink:PARK2|entrez gene/locuslink:KB-152G3.1 biogrid:107338|entrez gene/locuslink:CCNE1 entrez gene/locuslink:AR-JP(gene name synonym)|entrez gene/locuslink:LPRS2(gene name synonym)|entrez gene/locuslink:PDJ(gene name synonym)|entrez gene/locuslink:PRKN(gene name synonym) entrez gene/locuslink:CCNE(gene name synonym) psi-mi:"MI:0415"(enzymatic study) "Gong Y (2014)" pubmed:24793136 taxid:9606 taxid:9606 psi-mi:"MI:0407"(direct interaction) psi-mi:"MI:0463"(biogrid) biogrid:2449021 -
entrez gene/locuslink:5071 entrez gene/locuslink:898 biogrid:111105|entrez gene/locuslink:PARK2|entrez gene/locuslink:KB-152G3.1 biogrid:107338|entrez gene/locuslink:CCNE1 entrez gene/locuslink:AR-JP(gene name synonym)|entrez gene/locuslink:LPRS2(gene name synonym)|entrez gene/locuslink:PDJ(gene name synonym)|entrez gene/locuslink:PRKN(gene name synonym) entrez gene/locuslink:CCNE(gene name synonym) psi-mi:"MI:0004"(affinity chromatography technology) "Gong Y (2014)" pubmed:24793136 taxid:9606 taxid:9606 psi-mi:"MI:0915"(physical association) psi-mi:"MI:0463"(biogrid) biogrid:2449022 -
entrez gene/locuslink:5071 entrez gene/locuslink:55294 biogrid:111105|entrez gene/locuslink:PARK2|entrez gene/locuslink:KB-152G3.1 biogrid:120581|entrez gene/locuslink:FBXW7 entrez gene/locuslink:AR-JP(gene name synonym)|entrez gene/locuslink:LPRS2(gene name synonym)|entrez gene/locuslink:PDJ(gene name synonym)|entrez gene/locuslink:PRKN(gene name synonym) entrez gene/locuslink:AGO(gene name synonym)|entrez gene/locuslink:CDC4(gene name synonym)|entrez gene/locuslink:FBW6(gene name synonym)|entrez gene/locuslink:FBW7(gene name synonym)|entrez gene/locuslink:FBX30(gene name synonym)|entrez gene/locuslink:FBXO30(gene name synonym)|entrez gene/locuslink:FBXW6(gene name synonym)|entrez gene/locuslink:SEL-10(gene name synonym)|entrez gene/locuslink:SEL10(gene name synonym)|entrez gene/locuslink:hAgo(gene name synonym)|entrez gene/locuslink:hCdc4(gene name synonym) psi-mi:"MI:0004"(affinity chromatography technology) "Gong Y (2014)" pubmed:24793136 taxid:9606 taxid:9606 psi-mi:"MI:0915"(physical association) psi-mi:"MI:0463"(biogrid) biogrid:2449023 -
This tab-separated table was downloaded directly from BioGRID and filtered to keep only physical interactions. To construct G-G edges, you can match the gene names in this table against the gene names in the GEEK gene node IDs.
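One way to do this matching is sketched below. It assumes that the official gene symbol is the first "entrez gene/locuslink:" entry in the Alt IDs columns (the third and fourth tab-separated columns), which holds for the excerpt above but is not guaranteed here; the function names are hypothetical and this is not necessarily how GEEK builds its G-G edges.

```python
PREFIX = "entrez gene/locuslink:"


def official_symbol(alt_ids):
    """Return the first 'entrez gene/locuslink:' entry of an Alt IDs field."""
    for item in alt_ids.split("|"):
        if item.startswith(PREFIX):
            return item[len(PREFIX):]
    return None


def ppi_to_gene_edges(ppi_lines, gene_node_ids):
    """Return the set of (gene node ID, gene node ID) pairs found in the table."""
    # Index gene nodes by gene name, the last '$'-separated field of the ID.
    by_name = {}
    for node_id in gene_node_ids:
        by_name.setdefault(node_id.rsplit("$", 1)[1], []).append(node_id)

    edges = set()
    for line in ppi_lines:
        fields = line.rstrip("\n").split("\t")
        # Skip the header line and lines that are too short to hold Alt IDs.
        if line.startswith("#") or len(fields) < 4:
            continue
        name_a = official_symbol(fields[2])
        name_b = official_symbol(fields[3])
        for node_a in by_name.get(name_a, []):
            for node_b in by_name.get(name_b, []):
                edges.add((node_a, node_b))
    return edges
```

Interactions whose genes are not present among the supplied gene nodes are silently dropped, which is what one would want when only the genes of a single chromosome are loaded.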
Data source and processing
(This section is identical to the corresponding section in GEEK's manuscript.)
PPI
We downloaded the human PPIs from BioGRID version 3.4.162. Only physical interactions were included in our network. (This corresponds to the Whole genome, raw PPI table package.)
Genomic bin and reference genome
We used 50kb non-overlapping bins that tiled the whole genome based on the hg19 human reference sequence. The rationales for this bin size and the performance comparisons with two other bin sizes are provided in Supplementary Materials (please refer to GEEK's manuscript).
Genes
We downloaded the gene annotation file of Gencode version 10 from Roadmap Epigenomics 52.
Chromatin accessibility and gene expression
We downloaded DNase-seq and gene expression data of the human lymphoblastoid cells GM12878, human chronic myeloid leukemia cells K562, normal human epidermal keratinocytes NHEK, human umbilical vein endothelial cells HUVEC, and human primary mammary epithelial cells HMEC from Roadmap Epigenomics in the form of p-values and Reads Per Kilobase per Million mapped reads (RPKM), respectively. The DNase I signal of each genomic bin was defined as the average of the negative log p-values of the genomic positions involved. Each genomic bin was then given a DNase class label of 1 if its signal value was larger than the median, or 0 otherwise.
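The averaging and labeling steps can be sketched as follows. The function names are hypothetical, and the set of bins over which the median is taken (for example, one chromosome or the whole genome) is not specified here, so the sketch simply uses whatever bins it is given.

```python
from statistics import fmean, median


def bin_signal(neg_log_pvalues):
    """Average the negative log p-values of the positions within one bin."""
    return fmean(neg_log_pvalues)


def dnase_class_labels(signals):
    """Map each bin ID to 1 if its signal exceeds the median, else 0.

    `signals` maps bin IDs to DNase I signal values. The comparison is
    strict, so bins whose signal equals the median receive label 0.
    """
    cutoff = median(signals.values())
    return {bin_id: int(value > cutoff) for bin_id, value in signals.items()}
```

Because the comparison is strict, at most half of the bins receive label 1.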
Hi-C
We downloaded Hi-C data of mapping quality ≥ 30 of GM12878, K562, NHEK, HUVEC and HMEC from NCBI Gene Expression Omnibus with accession number GSE63525. We then used Juicer version 1.5.6 to identify raw bin-bin interactions at 50kb resolution, and the Python version of Fit-Hi-C version 2.0.7 to call significant interactions at a q-value threshold of 0.005. For the per-chromosome setting, the parameter “chromosome region” was set to “intraOnly” to consider only intra-chromosomal interactions, while for the whole-genome setting, the parameter was set to “All” to consider both intra- and inter-chromosomal interactions. The q-values were then transformed by negative logarithm to serve as bin-bin interaction weights.
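The last step, turning q-values into edge weights, can be sketched as below. Two points are assumptions: the base of the logarithm is not stated, and base 10 is used here because the weights in the 2.b. excerpt (such as 2.83) lie above -log10(0.005) ≈ 2.30; and an interaction whose q-value equals the threshold is treated as significant. The function name is hypothetical.

```python
import math

Q_THRESHOLD = 0.005  # q-value threshold stated in the text


def interaction_weight(q_value, threshold=Q_THRESHOLD):
    """Return -log10(q) for a significant interaction, or None otherwise."""
    if q_value <= 0:
        raise ValueError("q-value must be positive to take its logarithm")
    if q_value > threshold:
        # Not significant at the chosen threshold, so no edge is created.
        return None
    return -math.log10(q_value)
```

Under these assumptions, every retained edge has a weight of at least about 2.30.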