OMSV enables accurate and comprehensive identification of large structural variations from nanochannel-based single-molecule optical maps

--The OMSV caller and supplementary files

Here we provide the latest OMSV package (and the original version) with all source code (C++, bash scripts, and Matlab code) and binary files. The OMSV scripts have been tested on Debian GNU/Linux 9.0 (stretch) and CentOS Linux release 7.3.1611 (Core), and the Matlab code was developed in Matlab R2011b (7.13.0.564) 64-bit (glnxa64). A readme file shows how to start our pipeline in a few steps and describes the programs in detail, including input parameters and supported file formats. The alignment file of NA12878, as well as its 2-round split alignment, is provided for running the pipeline. All data (raw optical mapping data and alignments) are accessible at the Zenodo repository.

1. The OMSV caller

The OMSV package contains the following components:

OMSV: the main component, calling large indels from the alignments of OM molecules.
OMSV_mixedIndel: calling mixed indels from the alignments of OM molecules.
OMSV_CNVs: calling CNV candidates from the alignments of OM molecules.
OMSV_complex: calling other large complex SVs from 2-round split alignments of OM molecules.
OMSV_MediumInversion caller: the module assembled in OMTools, calling medium-size inversions.
Post-processing: removing spurious SVs overlapping N-gaps, fragile sites, and pseudo-autosomal regions.

The OMSV pipeline is described in the following reference:

Le Li, Alden King-Yung Leung, Tsz-Piu Kwok, et al. OMSV enables accurate and comprehensive identification of large structural variations from nanochannel-based single-molecule optical maps. Genome Biology, 2017 (in press).

2. Commands of tools

Alignment of OM data (using the OM data of NA12878 as an example, e.g. NA12878.cmap)

OMBlastMapper (OMTools v1.3) to align molecules to the reference for all OM data:

java -jar OMTools.jar OMBlastMapper --optmapin NA12878.cmap --refmapin hg38_r.cmap --filtermode 1 --alignmentjoinmode 1 --minscore 0.3 --thread 64 --optresout NA12878_OMB.oma

RefAligner (v4287) with parameter optimization (-M: the parameters related to data quality are optimized automatically, so their initial values are not critical) to align molecules to the reference for all other data:

./RefAligner -i NA12878.cmap -ref hg38_r.cmap -o NA12878_Ref -M 3 3 -FP 0.918057 -FN 0.099062 -sf 0.233588 -sd 0.090609 -S 0 -minlen 100 -minsites 5 -T 1e-9 -res 3.5 -resSD 0.7 -maxthreads 72 -maxmem 128 -Mfast 0 -biaswt 0 -A 5 -BestRef 1 -nosplit 2 -outlier 1e-5 -endoutlier 1e-5 -f -hashgen 5 3 2.4 1.4 0.05 5.0 1 1 -hash -hashdelta 10 -Hash_Bits 23 -MHash_Bits 25 -HHash_Bits 18

RefAligner to align molecules to the reference for C666-1 cell line data (suggested):

./RefAligner -ref hg38_r.cmap -o output_file -i input_query.cmap -f -stderr -maxthreads 64 -usecolor 1 -FP 0.73971 -FN 0.16109 -sd -0.086911 -sf 0.254047 -sr 0.026453 -T 1e-3 -usecolor 1 -S -1000 -biaswt 0 -resSD 0.75 -outlier 0.0001 -extend 1 -BestRef 1 -maptype 0 -PVres 2 -HSDrange 1.0 -hashoffset 1 -hashMultiMatch 20 -f -hashgen 5 3 2.4 1.5 0.05 5.0 1 1 1 -hash -hashdelta 10 -insertThreads 25 -maxmem 7 -mres 0.9 -unmapped unmapped_file -NoBpp

2-round split alignment based on OMBlastMapper:

./split_alignment_2round.sh . NA12878.oma hg38_r.cmap NA12878_700bp_hg38_OMB_split.oma

SV detection

OMSV-indel:

mkdir -p temp && ./OMSV -inputLabel 12878 -outputFolder ./ -SVoutputFile Indel -chrMapFile hg38_r.cmap -optAlignFile NA12878_700bp_hg38_combRefOMB.oma -optTempFolder temp/

OMSV-mixedIndel:

mkdir -p temp && ./OMSV_mixedIndel -inputLabel 12878 -outputFolder ./ -SVoutputFile Mixed_Indel -chrMapFile hg38_r.cmap -optAlignFile NA12878_700bp_hg38_combRefOMB.oma -optTempFolder temp/

OMSV-mediumInversion:

java -jar OMTools.jar SVDetection --refmapin hg38_r.cmap --optresin NA12878_700bp_hg38_combRefOMB.oma --mininvsig 4 --svout 12878Med_inv.osv --flanksig 0 --deg 0 -svmode 2 -minsupport 10 && sed -i '/.site/d' 12878Med_inv.osv

OMSV-CNVs:

./CNV_caller.sh NA12878_700bp_hg38_combRefOMB.oma hg38_r.cmap .

OMSV-complex:

./complex_caller.sh NA12878_700bp_hg38_OMB_split.oma .

Pindel (before liftover):

pindel -i chr_config.txt -f hg19 chr24M.fa -o SVs -c chr.fa -T 72 -l true -M 3
*chr_config.txt: chr sorted.bam 290 chr label

Manta (tumor SV):

Run by the Illumina Genome Network (IGN) with their default parameters

BioNano Solve v3.0 (parameters are provided in optArguments_haplotype_irys.xml):

python pipelineCL.py -T 72 -j 72 -t 5678.6119rel/ -l output_folder -r hg38_r.cmap -b NA12878.bnx -a optArguments_haplotype_irys.xml -V 1 -m

Other utilities

Convert .xmap files (output alignment of RefAligner) to .oma files (output alignments of OMBlastMapper):

java -jar OMTools.jar ResultTools --refmapin NA12878_Ref_r.cmap --optmapin NA12878_Ref_q.cmap --optresin NA12878_Ref.xmap --optresout NA12878_Ref.oma

Convert OM data from .cmap to .bnx format:

./cmapTObnx -inputCmapFile NA12878.cmap -outputBNXFile NA12878.bnx

Integrate alignments from OMBlastMapper and RefAligner:

java -Xmx12G -jar OMTools.jar ResultMerger --resultkey RefAligner OMBlast --optresin NA12878_Ref.oma NA12878_OMB.oma --prefix NA12878_700bp_hg38_combRefOMB_ --outtype .oma
echo "#RefAligner_unique, OMBlast_unique, merged together" > NA12878_700bp_hg38_combRefOMB.oma
sed '/#/d' NA12878_700bp_hg38_combRefOMB_OMBlast_unique.oma >> NA12878_700bp_hg38_combRefOMB.oma
sed '/#/d' NA12878_700bp_hg38_combRefOMB_RefAligner_unique.oma >> NA12878_700bp_hg38_combRefOMB.oma
sed '/#/d' NA12878_700bp_hg38_combRefOMB_merged.oma >> NA12878_700bp_hg38_combRefOMB.oma
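
The concatenation step above can also be wrapped in a small helper so that it can be reused for other samples. The following is only a sketch: the function name combine_oma and its two arguments are our own illustration and are not part of the OMSV package. It assumes the three files written by ResultMerger follow the naming shown above (prefix followed by OMBlast_unique.oma, RefAligner_unique.oma, and merged.oma).

```shell
#!/usr/bin/env bash
# Hypothetical helper (not shipped with OMSV): concatenate the three
# ResultMerger outputs into one .oma file with a single header line.
combine_oma() {
    local prefix=$1 out=$2 part
    # Check all inputs first so that no partial output file is left behind.
    for part in OMBlast_unique RefAligner_unique merged; do
        if [[ ! -f "${prefix}${part}.oma" ]]; then
            echo "missing input: ${prefix}${part}.oma" >&2
            return 1
        fi
    done
    echo "#RefAligner_unique, OMBlast_unique, merged together" > "$out"
    for part in OMBlast_unique RefAligner_unique merged; do
        # Same filter as the commands above: drop every line containing '#'.
        sed '/#/d' "${prefix}${part}.oma" >> "$out"
    done
}
```

For example, `combine_oma NA12878_700bp_hg38_combRefOMB_ NA12878_700bp_hg38_combRefOMB.oma` would produce the same file as the four commands above.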

Post-processing of SVs (.osv files only):

./postFilter input.osv 2000

3. Supplementary files

Files required for running OMSV

To call OM SVs, we first need to generate an in-silico human genome (hg38) map with chromosomes 1-22, X, Y, and M (which have been relabelled as 1-25) as reference. To be consistent with our OM data, we set the image resolution limit to 700bp, which means two sites within 700bp on the in-silico human genome map will be merged together into one site. Besides, in the post-processing step we need information about N-gaps, fragile sites, and pseudo-autosomal regions to remove spurious SVs. The N-gaps are directly extracted from the hg38 sequence file and fragile sites are obtained from the hg38 map. Pseudo-autosomal regions include the following four regions: chrX:10000-2781479, chrX:155701382-156030895, chrY:10000-2781479, chrY:56887902-57217415. All these files are included in the OMSV source file package.
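
As an illustration of how the pseudo-autosomal coordinates listed above can be used, the sketch below tests whether an interval overlaps any of the four regions. It is not the implementation used by postFilter: the function name overlaps_par is hypothetical, intervals are treated as closed, and chromosome names are assumed to be written as chrX/chrY (the reference map relabels chromosomes as 1-25, so labels may need to be converted first).

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeed (exit 0) if chr:start-end overlaps a
# pseudo-autosomal region listed in this README, fail (exit 1) otherwise.
overlaps_par() {
    local chr=$1 start=$2 end=$3
    local region rchr rstart rend
    for region in \
        "chrX:10000-2781479" \
        "chrX:155701382-156030895" \
        "chrY:10000-2781479" \
        "chrY:56887902-57217415"; do
        rchr=${region%%:*}                      # text before ':'
        rstart=${region#*:}; rstart=${rstart%-*} # number between ':' and '-'
        rend=${region##*-}                      # number after '-'
        # Closed intervals overlap when each starts before the other ends.
        if [[ $chr == "$rchr" ]] && (( start <= rend && end >= rstart )); then
            return 0
        fi
    done
    return 1
}
```

A call such as `overlaps_par chrX 2500000 2600000 && echo "in PAR"` reports an interval that lies inside the first region.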

Results from C666-1

We provide a sheet showing the basic information of SVs, annotation of SVs by different genomic regions, and overlaps with SVs called from short-read data for the cancer cell line C666-1. To annotate the SVs, we applied snpEff to check the overlaps between our SVs and exons, introns, genes, downstream and upstream regions, 3' UTRs, 5' UTRs, and intergenic regions. These annotations are shown in separate columns. To check the overlaps with SVs called from sequence data, we used Manta and Pindel to call SVs on short reads of C666-1.

The results for Trio samples (NA12891, NA12892, NA12878)

Another Excel file provides the analysis for the trio samples. This file contains two sheets. The first sheet shows all SVs for the three samples, while the second one lists only the confident (either confidently true or confidently false) cases. As in the C666-1 sheet, each sheet contains the basic information of SVs and their annotations. In addition, a section of three columns indicates the calling results of the three individuals for each SV. We used different colors in this section to mark different situations of SV calls: Yellow: concordant with Mendelian inheritance; Blue: Mendelian error only if zygosity is considered; Red: Mendelian error whether or not zygosity is considered.
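
The three color categories can be illustrated with a small sketch. This is our own reading of the categories, not the code used to build the sheet. It assumes an autosomal SV and a hypothetical encoding in which each individual's call is the number of copies of the SV allele (0 = absent, 1 = heterozygous, 2 = homozygous). The function name classify_trio is likewise hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: classify one SV in a trio as yellow, blue, or red.
# Arguments: copies of the SV allele in father, mother, child (each 0, 1, or 2).
classify_trio() {
    local f=$1 m=$2 c=$3
    # Ignoring zygosity, the only violation is a child carrying an SV
    # that neither parent carries.
    if (( c > 0 && f == 0 && m == 0 )); then
        echo red
        return
    fi
    # With zygosity, a parent transmits at least 1 copy if homozygous
    # and at most 1 copy if it carries the SV at all.
    local fmin=$(( f == 2 )) fmax=$(( f >= 1 ))
    local mmin=$(( m == 2 )) mmax=$(( m >= 1 ))
    local lo=$(( fmin + mmin )) hi=$(( fmax + mmax ))
    if (( c >= lo && c <= hi )); then
        echo yellow   # consistent with Mendelian inheritance
    else
        echo blue     # inconsistent only once zygosity is considered
    fi
}
```

For example, `classify_trio 2 0 0` prints blue: a homozygous father must transmit the SV, so its absence in the child is an error only when zygosity is taken into account.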

#Update log

##Latest version v1.1

[June 22, 2017] To regenerate the data used in the paper, please use version v1.0, because the SV list obtained by this version is slightly different from the previous one. The update details are as follows:

Fix some minor bugs in detecting SVs;

Combine the indels and the mixed indels detection together;

##Version v1.0

[June 9, 2017] The original version of OMSV with the optimization of memory usage (~4GB for ~80x OM data).

Should you have any questions, please contact us.