Input data of GEEK


Download

Links

Sample       Type                            Download  MD5
GM12878      Per chromosome                  HTTP      b6d322687af0bb50612a9bef9bd93775
GM12878      Whole genome, Hi-C (B-B edges)  HTTP      a2c9551b325ee2309a1d986e879e11ac
K562         Per chromosome                  HTTP      ad99c2c100f39e09cde36eda6dba011b
K562         Whole genome, Hi-C (B-B edges)  HTTP      91cdf1d4a54552c7510079e97ead20e7
NHEK         Per chromosome                  HTTP      466a4cd9369dc207989a68ace680806c
NHEK         Whole genome, Hi-C (B-B edges)  HTTP      52dcf5c446d97aeed7cec309723eed23
HUVEC        Per chromosome                  HTTP      e247dd00a5cfc7105eca6e52ba1ec5a2
HUVEC        Whole genome, Hi-C (B-B edges)  HTTP      32dc348ee82887bc6b5d1239cacb9270
HMEC         Per chromosome                  HTTP      6c86d09fb77440ee68aeffed02462555
HMEC         Whole genome, Hi-C (B-B edges)  HTTP      0d42af3b692d07d8453e8fba686376cb
All samples  Whole genome, raw PPI table     HTTP      de9e6c11ecc5e27aa36531c378b538c6
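
The MD5 column can be used to check that a download completed intact. The following is a minimal sketch using only the Python standard library; the function names are illustrative and not part of GEEK, and the archive path in the usage comment is a placeholder because the actual file names are not listed here.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path, expected_md5):
    """Return True if the file's MD5 matches the value listed in the table."""
    return md5_of_file(path) == expected_md5.strip().lower()


# Hypothetical usage (replace the path with your downloaded archive):
# verify_download("path/to/downloaded_archive", "b6d322687af0bb50612a9bef9bd93775")
```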

Note

To reproduce per-chromosome results for a given sample (e.g. GM12878), please download the Per chromosome package of that sample and the Whole genome, raw PPI table package. To reproduce whole-genome results, you will also need the Whole genome, Hi-C (B-B edges) package of that sample.

Directory structure

Per chromosome package

From now on, we assume that you have downloaded the package of sample GM12878. For other samples, simply replace GM12878 below with the name of the sample you have downloaded.

./GM12878/
    ./GM12878/node/ # See 1.
        ./GM12878/node/gene_expr/ # See 1.a.
            ./GM12878/node/gene_expr/chr1.gene
            ./GM12878/node/gene_expr/chr2.gene
            ...
            ./GM12878/node/gene_expr/chrX.gene
        ./GM12878/node/bin_DNase/ # See 1.b.
            ./GM12878/node/bin_DNase/chr1.bin.bed.DNase
            ./GM12878/node/bin_DNase/chr2.bin.bed.DNase
            ...
            ./GM12878/node/bin_DNase/chrX.bin.bed.DNase
    ./GM12878/edge/ # See 2.
        ./GM12878/edge/gene_bin/ # See 2.a.
            ./GM12878/edge/gene_bin/chr1.g_b.csv
            ./GM12878/edge/gene_bin/chr2.g_b.csv
            ...
            ./GM12878/edge/gene_bin/chrX.g_b.csv
        ./GM12878/edge/bin_bin/ # See 2.b.
            ./GM12878/edge/bin_bin/chr1.bb
            ./GM12878/edge/bin_bin/chr2.bb
            ...
            ./GM12878/edge/bin_bin/chrX.bb

Whole genome, Hi-C (B-B edges) package

./GM12878_wg/
    ./GM12878_wg/edge/ # See 2.
        ./GM12878_wg/edge/bin_bin/ # See 2.b.
            ./GM12878_wg/edge/bin_bin/all.bb 

You may wish to merge the contents of this package with the Per chromosome package of the corresponding sample.
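
One possible way to do the merge is to copy each file from the whole-genome package into the same relative location in the per-chromosome package, so that all.bb ends up next to the chr*.bb files. This is only a sketch: the function name is illustrative, and the merged layout is an assumption rather than something GEEK prescribes.

```python
import shutil
from pathlib import Path


def merge_whole_genome(wg_root, sample_root):
    """Copy every file under wg_root to the same relative path under sample_root.

    Assumed layout after merging (not confirmed by the documentation):
    ./GM12878/edge/bin_bin/all.bb alongside the chr*.bb files.
    """
    wg_root, sample_root = Path(wg_root), Path(sample_root)
    copied = []
    for src in sorted(wg_root.rglob("*")):
        if src.is_file():
            dst = sample_root / src.relative_to(wg_root)
            dst.parent.mkdir(parents=True, exist_ok=True)
            shutil.copy2(src, dst)
            copied.append(dst)
    return copied


# Example: merge_whole_genome("./GM12878_wg", "./GM12878")
```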

Whole genome, raw PPI table package

./PPI_raw/
    ./PPI_raw/ppi.txt # See 3.

Package contents

1. Nodes

1.a. Gene expression levels

An excerpt:

> head -n 3 ./GM12878/node/gene_expr/chr1.gene
a.ENSG00000007908$1$169691781$169703220$-1$SELE,0
a.ENSG00000009709$1$18957500$19075360$1$PAX7,0
a.ENSG00000010932$1$171217638$171255117$1$FMO1,0

> tail -n 3 ./GM12878/node/gene_expr/chr1.gene
a.ENSG00000143184$1$168545711$168551315$1$XCL1,806.12
a.ENSG00000143119$1$111415772$111442550$1$CD53,818.15
a.ENSG00000074800$1$8921061$8939308$-1$ENO1,1224.75

The gene expression files are formatted as:

a.<Ensembl gene ID>$<Chromosome>$<Starting position>$<Ending position>$<Strand>$<Gene name>,<RPKM>

In GEEK, we always use 'a.' to identify a gene node, and a.<Ensembl gene ID>$<Chromosome>$<Starting position>$<Ending position>$<Strand>$<Gene name> is the ID of a gene node.
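
As an illustration of the format above, a line of a .gene file could be parsed as follows. This is a minimal sketch; the GeneNode type and the function name are hypothetical helpers, not part of GEEK.

```python
from typing import NamedTuple


class GeneNode(NamedTuple):
    node_id: str      # full ID, including the "a." prefix
    ensembl_id: str
    chromosome: str
    start: int
    end: int
    strand: int       # 1 or -1, as in the excerpt
    name: str
    rpkm: float


def parse_gene_line(line):
    """Parse one line of a .gene file into a GeneNode."""
    # The RPKM value follows the last comma; everything before it is the node ID.
    node_id, rpkm = line.strip().rsplit(",", 1)
    if not node_id.startswith("a."):
        raise ValueError(f"not a gene node ID: {node_id!r}")
    ensembl_id, chromosome, start, end, strand, name = node_id[2:].split("$")
    return GeneNode(node_id, ensembl_id, chromosome,
                    int(start), int(end), int(strand), name, float(rpkm))
```

For example, parsing the last line of the excerpt yields a node named ENO1 on chromosome 1, strand -1, with an RPKM of 1224.75.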

1.b. Genomic bin DNase I hypersensitivity levels

An excerpt:

> head -n 3 ./GM12878/node/bin_DNase/chr1.bin.bed.DNase 
v.chr1.0.50000,0.148532
v.chr1.100000.150000,0.148671
v.chr1.1000000.1050000,0.657203

> tail -n 3 ./GM12878/node/bin_DNase/chr1.bin.bed.DNase
v.chr1.99850000.99900000,0.155038
v.chr1.99900000.99950000,0.202276
v.chr1.99950000.100000000,0.176335

The bin DNase I hypersensitivity files are formatted as:

v.chr<Chromosome>.<Bin start position>.<Bin end position>,<DNase I hypersensitivity level>

In GEEK, we always use 'v.' to identify a bin node, and v.chr<Chromosome>.<Bin start position>.<Bin end position> is the ID of a bin node.
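
A line of a .DNase file can be parsed in the same spirit. The sketch below assumes that chromosome names contain no dots, which holds for the excerpt; BinNode and parse_bin_line are hypothetical names.

```python
from typing import NamedTuple


class BinNode(NamedTuple):
    node_id: str      # full ID, including the "v." prefix
    chromosome: str   # without the "chr" prefix, e.g. "1" or "X"
    start: int
    end: int
    dnase: float


def parse_bin_line(line):
    """Parse one line of a .bin.bed.DNase file into a BinNode."""
    # The DNase level follows the last comma (the level itself contains a dot).
    node_id, level = line.strip().rsplit(",", 1)
    prefix, chrom, start, end = node_id.split(".")
    if prefix != "v" or not chrom.startswith("chr"):
        raise ValueError(f"not a bin node ID: {node_id!r}")
    return BinNode(node_id, chrom[3:], int(start), int(end), float(level))
```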

2. Edges

2.a. Gene-bin associations (G-B edges and B-G edges)

An excerpt:

> tail -n 3 ./GM12878/edge/gene_bin/chr1.g_b.csv
a.ENSG00000171163$1$249144205$249153343$-1$ZNF692,v.chr1.249150000.249200000
a.ENSG00000171161$1$249132409$249143716$1$ZNF672,v.chr1.249100000.249150000
a.ENSG00000185220$1$249200395$249214145$1$PGBD2,v.chr1.249200000.249250000

Given the definition of gene node ID (mentioned previously in 1.a.) and bin node ID (1.b.), the gene-bin associations file is formatted as:

<Gene node ID>,<Bin node ID>

Because the association between a gene and a bin is symmetric (bidirectional), this association file can be used to build both G-B edges and B-G edges.
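
A minimal sketch of building both edge directions from one association file is shown below. The adjacency-dictionary representation and the function name are assumptions made for illustration; GEEK itself may store edges differently.

```python
from collections import defaultdict


def load_gene_bin_edges(lines):
    """Build G-B and B-G adjacency maps from lines of a .g_b.csv file."""
    gene_to_bins = defaultdict(set)   # G-B edges
    bin_to_genes = defaultdict(set)   # B-G edges
    for line in lines:
        line = line.strip()
        if not line:
            continue
        # The bin node ID follows the last comma.
        gene_id, bin_id = line.rsplit(",", 1)
        gene_to_bins[gene_id].add(bin_id)
        bin_to_genes[bin_id].add(gene_id)
    return gene_to_bins, bin_to_genes


# Example:
# with open("./GM12878/edge/gene_bin/chr1.g_b.csv") as handle:
#     g_b, b_g = load_gene_bin_edges(handle)
```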

2.b. Bin-bin interactions (B-B edges, Hi-C)

An excerpt:

> tail -n 3 ./GM12878/edge/bin_bin/chr1.bb     
chr1.99950000.100000000,chr1.145050000.145100000,5.65
chr1.99950000.100000000,chr1.222100000.222150000,3.39
chr1.99950000.100000000,chr1.222150000.222200000,2.83

The bin-bin interactions file is formatted as:

<Bin node ID>,<Bin node ID>,<Weight>

Note that, in the excerpt above, the bin node IDs are written without the 'v.' prefix used in 1.b.
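
The sketch below parses one line of a .bb file. In the excerpt above, the bin IDs appear without the 'v.' prefix used in the node files, so the function optionally adds it to make the IDs comparable with those in 1.b. Whether GEEK normalizes the IDs this way is an assumption, and the function name is hypothetical.

```python
def parse_bin_bin_line(line, add_prefix=True):
    """Parse one line of a .bb file into (bin ID, bin ID, weight)."""
    bin_a, bin_b, weight = line.strip().split(",")
    if add_prefix:
        # Assumption: prepend "v." so the IDs match the bin node IDs of 1.b.
        bin_a, bin_b = (b if b.startswith("v.") else "v." + b
                        for b in (bin_a, bin_b))
    return bin_a, bin_b, float(weight)
```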

3. PPI table

An excerpt:

> head -n 3 ./PPI_raw/ppi.txt
#ID Interactor A    ID Interactor B Alt IDs Interactor A    Alt IDs Interactor B    Aliases Interactor A    Aliases Interactor B    Interaction Detection Method    Publication 1st Author  Publication Identifiers Taxid Interactor A  Taxid Interactor B  Interaction Types   Source Database Interaction Identifiers Confidence Values
entrez gene/locuslink:6416  entrez gene/locuslink:2318  biogrid:112315|entrez gene/locuslink:MAP2K4 biogrid:108607|entrez gene/locuslink:FLNC   entrez gene/locuslink:JNKK(gene name synonym)|entrez gene/locuslink:JNKK1(gene name synonym)|entrez gene/locuslink:MAPKK4(gene name synonym)|entrez gene/locuslink:MEK4(gene name synonym)|entrez gene/locuslink:MKK4(gene name synonym)|entrez gene/locuslink:PRKMK4(gene name synonym)|entrez gene/locuslink:SAPKK-1(gene name synonym)|entrez gene/locuslink:SAPKK1(gene name synonym)|entrez gene/locuslink:SEK1(gene name synonym)|entrez gene/locuslink:SERK1(gene name synonym)|entrez gene/locuslink:SKK1(gene name synonym)    entrez gene/locuslink:ABP-280(gene name synonym)|entrez gene/locuslink:ABP280A(gene name synonym)|entrez gene/locuslink:ABPA(gene name synonym)|entrez gene/locuslink:ABPL(gene name synonym)|entrez gene/locuslink:FLN2(gene name synonym)|entrez gene/locuslink:MFM5(gene name synonym)|entrez gene/locuslink:MPD4(gene name synonym) psi-mi:"MI:0018"(two hybrid)    "Marti A (1997)"    pubmed:9006895  taxid:9606  taxid:9606  psi-mi:"MI:0407"(direct interaction)    psi-mi:"MI:0463"(biogrid)   biogrid:103 -
entrez gene/locuslink:84665 entrez gene/locuslink:88    biogrid:124185|entrez gene/locuslink:MYPN   biogrid:106603|entrez gene/locuslink:ACTN2  entrez gene/locuslink:CMD1DD(gene name synonym)|entrez gene/locuslink:CMH22(gene name synonym)|entrez gene/locuslink:MYOP(gene name synonym)|entrez gene/locuslink:RCM4(gene name synonym)  entrez gene/locuslink:CMD1AA(gene name synonym) psi-mi:"MI:0018"(two hybrid)    "Bang ML (2001)"    pubmed:11309420 taxid:9606  taxid:9606  psi-mi:"MI:0407"(direct interaction)    psi-mi:"MI:0463"(biogrid)   biogrid:117 -

> tail -n 3 ./PPI_raw/ppi.txt
entrez gene/locuslink:5071  entrez gene/locuslink:898   biogrid:111105|entrez gene/locuslink:PARK2|entrez gene/locuslink:KB-152G3.1 biogrid:107338|entrez gene/locuslink:CCNE1  entrez gene/locuslink:AR-JP(gene name synonym)|entrez gene/locuslink:LPRS2(gene name synonym)|entrez gene/locuslink:PDJ(gene name synonym)|entrez gene/locuslink:PRKN(gene name synonym)    entrez gene/locuslink:CCNE(gene name synonym)   psi-mi:"MI:0415"(enzymatic study)   "Gong Y (2014)" pubmed:24793136 taxid:9606  taxid:9606  psi-mi:"MI:0407"(direct interaction)    psi-mi:"MI:0463"(biogrid)   biogrid:2449021 -
entrez gene/locuslink:5071  entrez gene/locuslink:898   biogrid:111105|entrez gene/locuslink:PARK2|entrez gene/locuslink:KB-152G3.1 biogrid:107338|entrez gene/locuslink:CCNE1  entrez gene/locuslink:AR-JP(gene name synonym)|entrez gene/locuslink:LPRS2(gene name synonym)|entrez gene/locuslink:PDJ(gene name synonym)|entrez gene/locuslink:PRKN(gene name synonym)    entrez gene/locuslink:CCNE(gene name synonym)   psi-mi:"MI:0004"(affinity chromatography technology)    "Gong Y (2014)" pubmed:24793136 taxid:9606  taxid:9606  psi-mi:"MI:0915"(physical association)  psi-mi:"MI:0463"(biogrid)   biogrid:2449022 -
entrez gene/locuslink:5071  entrez gene/locuslink:55294 biogrid:111105|entrez gene/locuslink:PARK2|entrez gene/locuslink:KB-152G3.1 biogrid:120581|entrez gene/locuslink:FBXW7  entrez gene/locuslink:AR-JP(gene name synonym)|entrez gene/locuslink:LPRS2(gene name synonym)|entrez gene/locuslink:PDJ(gene name synonym)|entrez gene/locuslink:PRKN(gene name synonym)    entrez gene/locuslink:AGO(gene name synonym)|entrez gene/locuslink:CDC4(gene name synonym)|entrez gene/locuslink:FBW6(gene name synonym)|entrez gene/locuslink:FBW7(gene name synonym)|entrez gene/locuslink:FBX30(gene name synonym)|entrez gene/locuslink:FBXO30(gene name synonym)|entrez gene/locuslink:FBXW6(gene name synonym)|entrez gene/locuslink:SEL-10(gene name synonym)|entrez gene/locuslink:SEL10(gene name synonym)|entrez gene/locuslink:hAgo(gene name synonym)|entrez gene/locuslink:hCdc4(gene name synonym)    psi-mi:"MI:0004"(affinity chromatography technology)    "Gong Y (2014)" pubmed:24793136 taxid:9606  taxid:9606  psi-mi:"MI:0915"(physical association)  psi-mi:"MI:0463"(biogrid)   biogrid:2449023 -

This tab-separated table was downloaded directly from BioGRID and filtered to keep only physical interactions. You can match its entries to GEEK gene node IDs by gene name to construct G-G edges.
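
The name matching could be sketched as follows. Several points here are assumptions rather than GEEK's documented procedure: that the gene name of each interactor is the first "entrez gene/locuslink:" entry in the Alt IDs columns (as in the excerpt), that each gene name maps to a single gene node, and that self-interactions are dropped. The function names are hypothetical.

```python
import csv

SYMBOL_PREFIX = "entrez gene/locuslink:"


def gene_name(alt_ids):
    """Return the first 'entrez gene/locuslink:' entry of an Alt IDs field."""
    for item in alt_ids.split("|"):
        if item.startswith(SYMBOL_PREFIX):
            return item[len(SYMBOL_PREFIX):]
    return None


def gene_gene_edges(ppi_lines, name_to_node):
    """Build undirected G-G edges from tab-separated PPI lines.

    name_to_node maps a gene name (e.g. "ENO1") to a GEEK gene node ID.
    """
    edges = set()
    reader = csv.reader(ppi_lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if len(row) < 4 or row[0].startswith("#"):
            continue  # skip the header and malformed lines
        node_a = name_to_node.get(gene_name(row[2]))
        node_b = name_to_node.get(gene_name(row[3]))
        if node_a and node_b and node_a != node_b:
            edges.add(tuple(sorted((node_a, node_b))))
    return edges
```

The name_to_node dictionary can be filled from the gene names in the .gene files described in 1.a.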

Data source and processing

(This section is identical to the corresponding section of GEEK's manuscript.)

PPI

We downloaded the human PPIs from BioGRID version 3.4.162. Only physical interactions were included in our network. (This corresponds to the Whole genome, raw PPI table package.)

Genomic bin and reference genome

We used 50kb non-overlapping bins that tiled the whole genome based on the hg19 human reference sequence. The rationales for this bin size and the performance comparisons with two other bin sizes are provided in Supplementary Materials (please refer to GEEK's manuscript).
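
To illustrate the tiling, the sketch below maps a genomic position to the ID of the 50kb bin that contains it. It assumes that bin boundaries are multiples of 50,000 starting at 0, as the excerpt in 1.b suggests; the coordinate convention and the handling of the last bin of each chromosome are not specified here, and the function name is hypothetical.

```python
BIN_SIZE = 50_000  # 50kb, as stated above


def bin_id_for_position(chromosome, position, bin_size=BIN_SIZE):
    """Return the bin node ID of the bin containing a genomic position.

    Assumes bins start at multiples of bin_size, beginning at 0.
    """
    start = (position // bin_size) * bin_size
    return f"v.chr{chromosome}.{start}.{start + bin_size}"
```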

Genes

We downloaded the gene annotation file of Gencode version 10 from Roadmap Epigenomics 52.

Chromatin accessibility and gene expression

We downloaded DNase-seq and gene expression data of the human lymphoblastoid cells GM12878, human chronic myeloid leukemia cells K562, normal human epidermal keratinocytes NHEK, human umbilical vein endothelial cells HUVEC, and human primary mammary epithelial cells HMEC from Roadmap Epigenomics in the form of p-values and Reads Per Kilobase per Million mapped reads (RPKM), respectively. The DNase I signal of each genomic bin was defined as the average of the negative log p-values of the genomic positions involved. Each genomic bin was then given a DNase class label of 1 if its signal value was larger than the median, or 0 otherwise.
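
The labeling rule in the last sentence can be sketched as below. The set of bins over which the median is taken (per chromosome or whole genome) is not stated here, so the function simply uses whatever bins it is given; the function name is hypothetical.

```python
from statistics import median


def dnase_class_labels(signals):
    """Label each bin 1 if its DNase signal is above the median, else 0.

    signals maps a bin node ID to its DNase I signal value.
    """
    threshold = median(signals.values())
    return {bin_id: int(value > threshold) for bin_id, value in signals.items()}
```

Bins whose signal equals the median receive label 0, following the "larger than the median" wording.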

Hi-C

We downloaded Hi-C data of mapping quality ≥ 30 of GM12878, K562, NHEK, HUVEC and HMEC from NCBI Gene Expression Omnibus with accession number GSE63525. We then used Juicer version 1.5.6 to identify raw bin-bin interactions at 50kb resolution, and the Python version of Fit-Hi-C version 2.0.7 to call significant interactions at a q-value threshold of 0.005. For the per-chromosome setting, the parameter “chromosome region” was set to “intraOnly” to consider only intra-chromosomal interactions, while for the whole-genome setting, the parameter was set to “All” to consider both intra- and inter-chromosomal interactions. The q-values were then transformed by negative logarithm to serve as bin-bin interaction weights.
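
The q-value transformation can be sketched as follows. The base of the logarithm is not stated; base 10 is used here as an assumption. It is at least consistent with the excerpt in 2.b, where weights such as 2.83 lie above -log10(0.005) ≈ 2.30 but would fall below -ln(0.005) ≈ 5.30. Whether the threshold comparison is strict is also an assumption, and the function names are hypothetical.

```python
import math

Q_VALUE_THRESHOLD = 0.005  # as stated above


def is_significant(q_value, threshold=Q_VALUE_THRESHOLD):
    """Return True if an interaction passes the q-value threshold (assumed strict)."""
    return q_value < threshold


def interaction_weight(q_value, base=10.0):
    """Return the negative logarithm of a q-value (base 10 is an assumption)."""
    if not 0.0 < q_value <= 1.0:
        raise ValueError("q-value must be in (0, 1]")
    return -math.log(q_value, base)
```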