Input data of GEEK

Download

Links

Sample	Type	Download	MD5
GM12878	Per chromosome	HTTP	b6d322687af0bb50612a9bef9bd93775
GM12878	Whole genome, Hi-C (B-B edges)	HTTP	a2c9551b325ee2309a1d986e879e11ac
K562	Per chromosome	HTTP	ad99c2c100f39e09cde36eda6dba011b
K562	Whole genome, Hi-C (B-B edges)	HTTP	91cdf1d4a54552c7510079e97ead20e7
NHEK	Per chromosome	HTTP	466a4cd9369dc207989a68ace680806c
NHEK	Whole genome, Hi-C (B-B edges)	HTTP	52dcf5c446d97aeed7cec309723eed23
HUVEC	Per chromosome	HTTP	e247dd00a5cfc7105eca6e52ba1ec5a2
HUVEC	Whole genome, Hi-C (B-B edges)	HTTP	32dc348ee82887bc6b5d1239cacb9270
HMEC	Per chromosome	HTTP	6c86d09fb77440ee68aeffed02462555
HMEC	Whole genome, Hi-C (B-B edges)	HTTP	0d42af3b692d07d8453e8fba686376cb
All samples	Whole genome, raw PPI table	HTTP	de9e6c11ecc5e27aa36531c378b538c6

Note

To reproduce the per-chromosome results for a given sample (e.g. GM12878), download that sample's Per chromosome package and the Whole genome, raw PPI table package. To reproduce the whole-genome results, you will also need that sample's Whole genome, Hi-C (B-B edges) package.
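The MD5 column in the table above can be used to check a download before unpacking it. The following is a minimal sketch using only Python's standard library; the archive file name in the usage comment is hypothetical, since the actual file names behind the HTTP links are not stated here.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_md5(path, expected_hex):
    """Compare a file's MD5 with the value listed in the download table."""
    return md5_of_file(path) == expected_hex.strip().lower()


# Usage (hypothetical archive name; the MD5 is the GM12878 "Per chromosome" entry):
# matches_md5("GM12878.tar.gz", "b6d322687af0bb50612a9bef9bd93775")
```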

Directory structure

`Per chromosome` package

From here on, we assume you have downloaded the GM12878 package. For any other sample, replace GM12878 below with the name of the sample you downloaded.

./GM12878/
    ./GM12878/node/ # See 1.
        ./GM12878/node/gene_expr/ # See 1.a.
            ./GM12878/node/gene_expr/chr1.gene
            ./GM12878/node/gene_expr/chr2.gene
            ...
            ./GM12878/node/gene_expr/chrX.gene
        ./GM12878/node/bin_DNase/ # See 1.b.
            ./GM12878/node/bin_DNase/chr1.bin.bed.DNase
            ./GM12878/node/bin_DNase/chr2.bin.bed.DNase
            ...
            ./GM12878/node/bin_DNase/chrX.bin.bed.DNase
    ./GM12878/edge/ # See 2.
        ./GM12878/edge/gene_bin/ # See 2.a.
            ./GM12878/edge/gene_bin/chr1.g_b.csv
            ./GM12878/edge/gene_bin/chr2.g_b.csv
            ...
            ./GM12878/edge/gene_bin/chrX.g_b.csv
        ./GM12878/edge/bin_bin/ # See 2.b.
            ./GM12878/edge/bin_bin/chr1.bb
            ./GM12878/edge/bin_bin/chr2.bb
            ...
            ./GM12878/edge/bin_bin/chrX.bb

`Whole genome, Hi-C (B-B edges)` package

./GM12878_wg/
    ./GM12878_wg/edge/ # See 2.
        ./GM12878_wg/edge/bin_bin/ # See 2.b.
            ./GM12878_wg/edge/bin_bin/all.bb

You may wish to merge the contents of this package with the Per chromosome package of the corresponding sample.
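One possible reading of "merge" is to copy all.bb next to the per-chromosome .bb files, so that both packages share a single edge/bin_bin/ directory. The sketch below does that under this assumption; the function name and the choice to copy rather than move are illustrative, not something GEEK prescribes.

```python
import shutil
from pathlib import Path


def merge_whole_genome_edges(per_chrom_root, wg_root):
    """Copy <wg_root>/edge/bin_bin/all.bb into <per_chrom_root>/edge/bin_bin/."""
    source = Path(wg_root) / "edge" / "bin_bin" / "all.bb"
    target_dir = Path(per_chrom_root) / "edge" / "bin_bin"
    target_dir.mkdir(parents=True, exist_ok=True)
    target = target_dir / source.name
    shutil.copy2(source, target)  # keeps the original package untouched
    return target


# Usage: merge_whole_genome_edges("./GM12878", "./GM12878_wg")
```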

`Whole genome, raw PPI table` package

./PPI_raw/
    ./PPI_raw/ppi.txt # See 3.

Package contents

`1.` Nodes

`1.a.` Gene expression levels

An excerpt:

> head -n 3 ./GM12878/node/gene_expr/chr1.gene
a.ENSG00000007908$1$169691781$169703220$-1$SELE,0
a.ENSG00000009709$1$18957500$19075360$1$PAX7,0
a.ENSG00000010932$1$171217638$171255117$1$FMO1,0

> tail -n 3 ./GM12878/node/gene_expr/chr1.gene
a.ENSG00000143184$1$168545711$168551315$1$XCL1,806.12
a.ENSG00000143119$1$111415772$111442550$1$CD53,818.15
a.ENSG00000074800$1$8921061$8939308$-1$ENO1,1224.75

The gene expression files are formatted as:

a.<Ensembl gene ID>$<Chromosome>$<Starting position>$<Ending position>$<Strand>$<Gene name>,<RPKM>

In GEEK, the prefix a. always identifies a gene node, and the full string a.<Ensembl gene ID>$<Chromosome>$<Starting position>$<Ending position>$<Strand>$<Gene name> serves as the ID of that gene node.
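The format above can be parsed with a few string splits. This is a minimal sketch, not GEEK's own loader; the GeneNode type and the parse_gene_line name are introduced here for illustration.

```python
from typing import NamedTuple


class GeneNode(NamedTuple):
    node_id: str      # full ID, including the "a." prefix
    ensembl_id: str
    chromosome: str
    start: int
    end: int
    strand: int       # 1 or -1, as in the excerpt
    name: str
    rpkm: float


def parse_gene_line(line):
    """Parse one line of a chr*.gene file into a GeneNode."""
    node_id, rpkm = line.strip().rsplit(",", 1)
    prefixed_id, chrom, start, end, strand, name = node_id.split("$", 5)
    if not prefixed_id.startswith("a."):
        raise ValueError(f"not a gene node ID: {node_id!r}")
    return GeneNode(node_id, prefixed_id[2:], chrom, int(start), int(end),
                    int(strand), name, float(rpkm))


gene = parse_gene_line("a.ENSG00000074800$1$8921061$8939308$-1$ENO1,1224.75")
print(gene.name, gene.strand, gene.rpkm)
```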

`1.b.` Genomic bin DNase I hypersensitivity levels

An excerpt:

> head -n 3 ./GM12878/node/bin_DNase/chr1.bin.bed.DNase 
v.chr1.0.50000,0.148532
v.chr1.100000.150000,0.148671
v.chr1.1000000.1050000,0.657203

> tail -n 3 ./GM12878/node/bin_DNase/chr1.bin.bed.DNase
v.chr1.99850000.99900000,0.155038
v.chr1.99900000.99950000,0.202276
v.chr1.99950000.100000000,0.176335

The bin DNase I hypersensitivity files are formatted as:

v.chr<Chromosome>.<Bin start position>.<Bin end position>,<DNase I hypersensitivity level>

In GEEK, the prefix v. always identifies a bin node, and the full string v.chr<Chromosome>.<Bin start position>.<Bin end position> serves as the ID of that bin node.
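A bin node line can be parsed in the same spirit. The sketch below is illustrative only; BinNode and parse_bin_line are names introduced here, not part of GEEK.

```python
from typing import NamedTuple


class BinNode(NamedTuple):
    node_id: str      # full ID, including the "v." prefix
    chromosome: str   # e.g. "chr1"
    start: int
    end: int
    dnase: float


def parse_bin_line(line):
    """Parse one line of a chr*.bin.bed.DNase file into a BinNode."""
    node_id, level = line.strip().rsplit(",", 1)
    prefix, chrom, start, end = node_id.split(".")
    if prefix != "v":
        raise ValueError(f"not a bin node ID: {node_id!r}")
    return BinNode(node_id, chrom, int(start), int(end), float(level))


bin_node = parse_bin_line("v.chr1.99950000.100000000,0.176335")
print(bin_node.chromosome, bin_node.end - bin_node.start)
```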

`2.` Edges

`2.a.` Gene-bin associations (G-B edges and B-G edges)

An excerpt:

> tail -n 3 ./GM12878/edge/gene_bin/chr1.g_b.csv
a.ENSG00000171163$1$249144205$249153343$-1$ZNF692,v.chr1.249150000.249200000
a.ENSG00000171161$1$249132409$249143716$1$ZNF672,v.chr1.249100000.249150000
a.ENSG00000185220$1$249200395$249214145$1$PGBD2,v.chr1.249200000.249250000

Given the definition of gene node ID (mentioned previously in 1.a.) and bin node ID (1.b.), the gene-bin associations file is formatted as:

<Gene node ID>,<Bin node ID>

Because the association between a gene and a bin is symmetric (bidirectional), the same file can be used to build both G-B edges and B-G edges.
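As a sketch of how one association file yields both edge directions, the function below builds two adjacency maps from the same lines. The function name and the dictionary-of-sets representation are assumptions made for illustration.

```python
from collections import defaultdict


def load_gene_bin_edges(lines):
    """Return (gene -> bins, bin -> genes) maps from <Gene node ID>,<Bin node ID> lines."""
    gene_to_bins = defaultdict(set)   # G-B edges
    bin_to_genes = defaultdict(set)   # B-G edges
    for line in lines:
        line = line.strip()
        if not line:
            continue
        # The bin ID never contains a comma, so split on the last one.
        gene_id, bin_id = line.rsplit(",", 1)
        gene_to_bins[gene_id].add(bin_id)
        bin_to_genes[bin_id].add(gene_id)
    return gene_to_bins, bin_to_genes


# Usage:
# with open("./GM12878/edge/gene_bin/chr1.g_b.csv") as handle:
#     g_b, b_g = load_gene_bin_edges(handle)
```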

`2.b.` Bin-bin interactions (B-B edges, Hi-C)

An excerpt:

> tail -n 3 ./GM12878/edge/bin_bin/chr1.bb     
chr1.99950000.100000000,chr1.145050000.145100000,5.65
chr1.99950000.100000000,chr1.222100000.222150000,3.39
chr1.99950000.100000000,chr1.222150000.222200000,2.83

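The bin-bin interactions file is formatted as:

<Bin node ID>,<Bin node ID>,<Weight>

Note that, as the excerpt shows, the bin IDs in this file are written without the v. prefix used in 1.b.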
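In the excerpt, the bin IDs appear without the v. prefix used in the node files. The sketch below parses one line and adds the prefix when it is missing, so that the IDs can be matched against bin node IDs; this normalization step is an assumption about how you may want to join the files, and the function names are illustrative.

```python
def normalize_bin_id(raw):
    """Return the bin ID with the 'v.' prefix used in the node files."""
    return raw if raw.startswith("v.") else "v." + raw


def parse_bin_bin_line(line):
    """Parse one <Bin>,<Bin>,<Weight> line into (bin_a, bin_b, weight)."""
    bin_a, bin_b, weight = line.strip().split(",")
    return normalize_bin_id(bin_a), normalize_bin_id(bin_b), float(weight)


edge = parse_bin_bin_line("chr1.99950000.100000000,chr1.145050000.145100000,5.65")
print(edge)
```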

`3.` PPI table

An excerpt:

> head -n 3 ./PPI_raw/ppi.txt
#ID Interactor A    ID Interactor B Alt IDs Interactor A    Alt IDs Interactor B    Aliases Interactor A    Aliases Interactor B    Interaction Detection Method    Publication 1st Author  Publication Identifiers Taxid Interactor A  Taxid Interactor B  Interaction Types   Source Database Interaction Identifiers Confidence Values
entrez gene/locuslink:6416  entrez gene/locuslink:2318  biogrid:112315|entrez gene/locuslink:MAP2K4 biogrid:108607|entrez gene/locuslink:FLNC   entrez gene/locuslink:JNKK(gene name synonym)|entrez gene/locuslink:JNKK1(gene name synonym)|entrez gene/locuslink:MAPKK4(gene name synonym)|entrez gene/locuslink:MEK4(gene name synonym)|entrez gene/locuslink:MKK4(gene name synonym)|entrez gene/locuslink:PRKMK4(gene name synonym)|entrez gene/locuslink:SAPKK-1(gene name synonym)|entrez gene/locuslink:SAPKK1(gene name synonym)|entrez gene/locuslink:SEK1(gene name synonym)|entrez gene/locuslink:SERK1(gene name synonym)|entrez gene/locuslink:SKK1(gene name synonym)    entrez gene/locuslink:ABP-280(gene name synonym)|entrez gene/locuslink:ABP280A(gene name synonym)|entrez gene/locuslink:ABPA(gene name synonym)|entrez gene/locuslink:ABPL(gene name synonym)|entrez gene/locuslink:FLN2(gene name synonym)|entrez gene/locuslink:MFM5(gene name synonym)|entrez gene/locuslink:MPD4(gene name synonym) psi-mi:"MI:0018"(two hybrid)    "Marti A (1997)"    pubmed:9006895  taxid:9606  taxid:9606  psi-mi:"MI:0407"(direct interaction)    psi-mi:"MI:0463"(biogrid)   biogrid:103 -
entrez gene/locuslink:84665 entrez gene/locuslink:88    biogrid:124185|entrez gene/locuslink:MYPN   biogrid:106603|entrez gene/locuslink:ACTN2  entrez gene/locuslink:CMD1DD(gene name synonym)|entrez gene/locuslink:CMH22(gene name synonym)|entrez gene/locuslink:MYOP(gene name synonym)|entrez gene/locuslink:RCM4(gene name synonym)  entrez gene/locuslink:CMD1AA(gene name synonym) psi-mi:"MI:0018"(two hybrid)    "Bang ML (2001)"    pubmed:11309420 taxid:9606  taxid:9606  psi-mi:"MI:0407"(direct interaction)    psi-mi:"MI:0463"(biogrid)   biogrid:117 -

> tail -n 3 ./PPI_raw/ppi.txt
entrez gene/locuslink:5071  entrez gene/locuslink:898   biogrid:111105|entrez gene/locuslink:PARK2|entrez gene/locuslink:KB-152G3.1 biogrid:107338|entrez gene/locuslink:CCNE1  entrez gene/locuslink:AR-JP(gene name synonym)|entrez gene/locuslink:LPRS2(gene name synonym)|entrez gene/locuslink:PDJ(gene name synonym)|entrez gene/locuslink:PRKN(gene name synonym)    entrez gene/locuslink:CCNE(gene name synonym)   psi-mi:"MI:0415"(enzymatic study)   "Gong Y (2014)" pubmed:24793136 taxid:9606  taxid:9606  psi-mi:"MI:0407"(direct interaction)    psi-mi:"MI:0463"(biogrid)   biogrid:2449021 -
entrez gene/locuslink:5071  entrez gene/locuslink:898   biogrid:111105|entrez gene/locuslink:PARK2|entrez gene/locuslink:KB-152G3.1 biogrid:107338|entrez gene/locuslink:CCNE1  entrez gene/locuslink:AR-JP(gene name synonym)|entrez gene/locuslink:LPRS2(gene name synonym)|entrez gene/locuslink:PDJ(gene name synonym)|entrez gene/locuslink:PRKN(gene name synonym)    entrez gene/locuslink:CCNE(gene name synonym)   psi-mi:"MI:0004"(affinity chromatography technology)    "Gong Y (2014)" pubmed:24793136 taxid:9606  taxid:9606  psi-mi:"MI:0915"(physical association)  psi-mi:"MI:0463"(biogrid)   biogrid:2449022 -
entrez gene/locuslink:5071  entrez gene/locuslink:55294 biogrid:111105|entrez gene/locuslink:PARK2|entrez gene/locuslink:KB-152G3.1 biogrid:120581|entrez gene/locuslink:FBXW7  entrez gene/locuslink:AR-JP(gene name synonym)|entrez gene/locuslink:LPRS2(gene name synonym)|entrez gene/locuslink:PDJ(gene name synonym)|entrez gene/locuslink:PRKN(gene name synonym)    entrez gene/locuslink:AGO(gene name synonym)|entrez gene/locuslink:CDC4(gene name synonym)|entrez gene/locuslink:FBW6(gene name synonym)|entrez gene/locuslink:FBW7(gene name synonym)|entrez gene/locuslink:FBX30(gene name synonym)|entrez gene/locuslink:FBXO30(gene name synonym)|entrez gene/locuslink:FBXW6(gene name synonym)|entrez gene/locuslink:SEL-10(gene name synonym)|entrez gene/locuslink:SEL10(gene name synonym)|entrez gene/locuslink:hAgo(gene name synonym)|entrez gene/locuslink:hCdc4(gene name synonym)    psi-mi:"MI:0004"(affinity chromatography technology)    "Gong Y (2014)" pubmed:24793136 taxid:9606  taxid:9606  psi-mi:"MI:0915"(physical association)  psi-mi:"MI:0463"(biogrid)   biogrid:2449023 -

This tab-separated table was downloaded directly from BioGRID and filtered to keep only physical interactions. You can construct G-G edges by matching the gene names in this table against the gene names in GEEK gene node IDs.
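One way to do this matching is sketched below. It assumes that the first entrez gene/locuslink: entry in each Alt IDs column is the gene symbol, which is what the excerpt suggests (e.g. MAP2K4, FLNC) but is not stated explicitly here; all function names are illustrative. Edges are stored as unordered pairs in a set, so repeated rows for the same pair (as in the tail excerpt) collapse into one edge.

```python
from collections import defaultdict

PREFIX = "entrez gene/locuslink:"


def symbol_from_alt_ids(alt_ids):
    """Return the first 'entrez gene/locuslink:' value in an Alt IDs field, or None."""
    for token in alt_ids.split("|"):
        if token.startswith(PREFIX):
            return token[len(PREFIX):]
    return None


def index_by_gene_name(gene_node_ids):
    """Map gene name -> list of gene node IDs (the name is the last '$' field)."""
    index = defaultdict(list)
    for node_id in gene_node_ids:
        index[node_id.split("$", 5)[5]].append(node_id)
    return index


def gene_gene_edges(ppi_lines, name_to_node_ids):
    """Build a set of undirected G-G edges from tab-separated PPI rows."""
    edges = set()
    for line in ppi_lines:
        if line.startswith("#") or not line.strip():
            continue  # skip the header and blank lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 4:
            continue
        sym_a = symbol_from_alt_ids(fields[2])
        sym_b = symbol_from_alt_ids(fields[3])
        for node_a in name_to_node_ids.get(sym_a, ()):
            for node_b in name_to_node_ids.get(sym_b, ()):
                if node_a != node_b:
                    edges.add(tuple(sorted((node_a, node_b))))
    return edges
```

Genes in the PPI table that have no matching gene node are simply ignored in this sketch.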

Data source and processing

(This section is identical to the corresponding section in GEEK's manuscript.)

PPI

We downloaded the human PPIs from BioGRID version 3.4.162. Only physical interactions were included in our network. (Corresponding to the Whole genome, raw PPI table package.)

Genomic bin and reference genome

We used 50kb non-overlapping bins that tiled the whole genome based on the hg19 human reference sequence. The rationales for this bin size and the performance comparisons with two other bin sizes are provided in Supplementary Materials (please refer to GEEK's manuscript).

Genes

We downloaded the gene annotation file of Gencode version 10 from Roadmap Epigenomics [52].

Chromatin accessibility and gene expression

We downloaded DNase-seq and gene expression data of the human lymphoblastoid cells GM12878, human chronic myeloid leukemia cells K562, normal human epidermal keratinocytes NHEK, human umbilical vein endothelial cells HUVEC, and human primary mammary epithelial cells HMEC from Roadmap Epigenomics in the form of p-values and Reads Per Kilobase per Million mapped reads (RPKM), respectively. The DNase I signal of each genomic bin was defined as the average of the negative log p-values of the genomic positions involved. Each genomic bin was then given a DNase class label of 1 if its signal value was larger than the median, or 0 otherwise.
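The median-based labeling described above can be sketched as follows. The set of bins over which the median is taken (one chromosome or the whole genome) is not specified here, so the function simply uses whatever bins it is given; the function name is illustrative.

```python
from statistics import median


def dnase_class_labels(signals):
    """Map bin ID -> 1 if its DNase signal is strictly above the median, else 0."""
    cutoff = median(signals.values())
    return {bin_id: int(value > cutoff) for bin_id, value in signals.items()}


labels = dnase_class_labels({
    "v.chr1.0.50000": 0.148532,
    "v.chr1.100000.150000": 0.148671,
    "v.chr1.1000000.1050000": 0.657203,
})
print(labels)
```

Bins whose signal equals the median receive the label 0, following the "larger than the median" wording.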

Hi-C

We downloaded Hi-C data of mapping quality ≥ 30 of GM12878, K562, NHEK, HUVEC and HMEC from NCBI Gene Expression Omnibus with accession number GSE63525. We then used Juicer version 1.5.6 to identify raw bin-bin interactions at 50kb resolution, and the Python version of Fit-Hi-C version 2.0.7 to call significant interactions at a q-value threshold of 0.005. For the per-chromosome setting, the parameter “chromosome region” was set to “intraOnly” to consider only intra-chromosomal interactions, while for the whole-genome setting the parameter was set to “All” to consider both intra- and inter-chromosomal interactions. The q-values were then transformed by negative logarithm to serve as bin-bin interaction weights.
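The q-value to weight transformation can be sketched as below. The base of the logarithm is not stated; base 10 is assumed here because a threshold of 0.005 then gives a minimum weight of about 2.30, which is consistent with the weights in the excerpt in 2.b. (e.g. 2.83), whereas the natural logarithm would give a minimum of about 5.30. The function name is illustrative.

```python
import math


def weight_from_q_value(q_value):
    """Return -log10(q) as a bin-bin interaction weight (base 10 is an assumption)."""
    if q_value <= 0:
        raise ValueError("q-value must be positive to take its logarithm")
    return -math.log10(q_value)


# The significance threshold itself corresponds to the smallest possible weight.
print(round(weight_from_q_value(0.005), 2))
```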