OMSV enables accurate and comprehensive identification of large structural variations from nanochannel-based single-molecule optical maps

--The OMSV caller and supplementary files

Here we provide the latest OMSV package (and the original version) with all source code (C++, bash scripts, and MATLAB code) and binary files. The OMSV scripts have been tested on Debian GNU/Linux 9.0 (stretch) and CentOS Linux release 7.3.1611 (Core), and the MATLAB code was developed in MATLAB R2011b (7.13.0.564) 64-bit (glnxa64). A readme file explains how to start the pipeline in a few steps and describes the programs in detail, including input parameters and supported file formats. The alignment file and the 2-round split alignment of NA12878 are provided for running the pipeline. All data (raw optical mapping data and alignments) are accessible at the Zenodo repository.

1. The OMSV caller

The OMSV package contains the following components:

  1. OMSV: the main component, calling large indels from the alignments of OM molecules.
  2. OMSV_mixedIndel: calling mixed indels from the alignments of OM molecules.
  3. OMSV_CNVs: calling CNV candidates from the alignments of OM molecules.
  4. OMSV_complex: calling other large complex SVs from 2-round split alignments of OM molecules.
  5. OMSV_MediumInversion caller: the module assembled in OMTools, calling medium-size inversions.
  6. Post-processing: removing spurious SVs overlapping N-gaps, fragile sites, and pseudo-autosomal regions.
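The post-processing idea in item 6 can be illustrated with a minimal Python sketch. It is not the OMSV implementation: the tuple layout `(chromosome, start, end)`, the function names, and the rule "discard an SV on any overlap with an excluded region" are assumptions made for illustration. The only data taken from this page are the four pseudo-autosomal regions listed under "Files required for running OMSV"; N-gap and fragile-site intervals would be appended to the same list.

```python
# Pseudo-autosomal regions (hg38) as listed on this page; coordinates are
# treated as closed intervals, which is an assumption.
PSEUDO_AUTOSOMAL_REGIONS = [
    ("chrX", 10000, 2781479),
    ("chrX", 155701382, 156030895),
    ("chrY", 10000, 2781479),
    ("chrY", 56887902, 57217415),
]


def overlaps(sv, region):
    """Return True if two (chromosome, start, end) intervals share any base."""
    sv_chrom, sv_start, sv_end = sv
    r_chrom, r_start, r_end = region
    return sv_chrom == r_chrom and sv_start <= r_end and r_start <= sv_end


def remove_spurious(svs, excluded_regions):
    """Keep only the SVs that overlap none of the excluded regions."""
    return [
        sv for sv in svs
        if not any(overlaps(sv, region) for region in excluded_regions)
    ]


# Hypothetical SV calls, for illustration only.
example_svs = [
    ("chrX", 2000000, 2005000),  # inside the first pseudo-autosomal region
    ("chrX", 5000000, 5010000),  # outside all listed regions
    ("chr1", 20000, 30000),      # different chromosome
]
kept = remove_spurious(example_svs, PSEUDO_AUTOSOMAL_REGIONS)
```

Note that the OMSV reference map relabels chromosomes as 1-25, so real input would need the same naming on both sides of the comparison.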

The OMSV pipeline is described in the following reference:

2. Commands of tools

Alignment of OM data (using the OM data of NA12878 as an example, e.g. NA12878.cmap)

OMBlastMapper (OMTools v1.3) to align molecules to the reference for all OM data:
RefAligner (v4287) with parameter optimization (-M: the parameters related to data quality are optimized automatically, so their initial values are not critical) to align molecules to the reference for all other data:
RefAligner to align molecules to the reference for C666-1 cell line data (suggested):
2-round split alignment based on OMBlastMapper:

SV detection

OMSV-indel:
OMSV-mixedIndel:
OMSV-mediumInversion:
OMSV-CNVs:
OMSV-complex:
Pindel (before liftover):
Manta (tumor SV):
BioNano Solve v3.0 (parameters are provided in optArguments_haplotype_irys.xml):

Other utilities

Convert .xmap files (output alignment of RefAligner) to .oma files (output alignments of OMBlastMapper):
Convert OM data from .cmap to .bnx format:
Integrate alignments from OMBlastMapper and RefAligner:
Post-processing of SVs (.osv files only):

3. Supplementary files

Files required for running OMSV

To call OM SVs, we first need to generate an in-silico human genome (hg38) map with chromosomes 1-22, X, Y, and M (relabelled as 1-25) as the reference. To be consistent with our OM data, we set the image resolution limit to 700 bp, which means that two sites within 700 bp of each other on the in-silico human genome map are merged into one site. In addition, the post-processing step needs information about N-gaps, fragile sites, and pseudo-autosomal regions to remove spurious SVs. The N-gaps are extracted directly from the hg38 sequence file, and the fragile sites are obtained from the hg38 map. The pseudo-autosomal regions are the following four regions: chrX:10000-2781479, chrX:155701382-156030895, chrY:10000-2781479, chrY:56887902-57217415. All these files are included in the OMSV source file package.
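The effect of the 700 bp resolution limit can be sketched in a few lines of Python. This is only an illustration, not the code used to build the reference map. Several details are assumptions: sites are chained when each is less than 700 bp from the previous one, a merged site is placed at the mean position of its cluster, and the function name `merge_close_sites` is hypothetical.

```python
def merge_close_sites(positions, resolution=700):
    """Collapse runs of label sites closer than `resolution` bp into one site.

    Assumptions: a site joins the current cluster when it lies less than
    `resolution` bp from the previous site, and each cluster is represented
    by the mean of its positions.
    """
    merged = []
    cluster = []
    for pos in sorted(positions):
        # A gap of at least `resolution` closes the current cluster.
        if cluster and pos - cluster[-1] >= resolution:
            merged.append(sum(cluster) / len(cluster))
            cluster = []
        cluster.append(pos)
    if cluster:
        merged.append(sum(cluster) / len(cluster))
    return merged


# Hypothetical site positions (bp) on one chromosome.
sites = merge_close_sites([1000, 1500, 5000])
```

Here the sites at 1000 and 1500 are 500 bp apart and collapse into a single site, while the site at 5000 is kept on its own.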

Results from C666-1

We provide a sheet for the cancer cell line C666-1 showing the basic information of the SVs, their annotation by genomic region, and their overlaps with SVs called from short-read data. To annotate the SVs, we used snpEff to check their overlaps with exons, introns, genes, downstream and upstream regions, 3' UTRs, 5' UTRs, and intergenic regions. These annotations are shown in separate columns. To check the overlaps with SVs called from sequencing data, we used Manta and Pindel to call SVs on the short reads of C666-1.

The results for Trio samples (NA12891, NA12892, NA12878)

Another Excel file provides the analysis for the trio samples. This file contains two sheets. The first sheet shows all SVs for the three samples, while the second lists only the confident (either confidently true or confidently false) cases. As in the C666-1 sheet, each sheet gives the basic information of the SVs and their annotations. In addition, a section of three columns indicates the calling results of the three individuals for each SV. Colors in this section mark the different situations of SV calls: Yellow: concordant with Mendelian inheritance; Blue: Mendelian error only if zygosity is considered; Red: Mendelian error regardless of whether zygosity is considered.
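One plausible reading of the three color categories can be written as a short Python sketch. It is not taken from the OMSV code or the spreadsheet. The genotype encoding (0, 1, or 2 copies of the SV allele), the function names, and the interpretation of "zygosity not considered" as "any carrier may be heterozygous or homozygous" are all assumptions, and de novo events are ignored.

```python
from itertools import product

# Alleles (0 = reference, 1 = SV) that a parent with a given genotype can transmit.
TRANSMITTED = {0: {0}, 1: {0, 1}, 2: {1}}


def is_mendelian(father, mother, child):
    """True if the child's SV allele count can arise from one allele per parent."""
    return any(
        a + b == child
        for a, b in product(TRANSMITTED[father], TRANSMITTED[mother])
    )


def possible_genotypes(genotype):
    """Genotypes compatible with a call once zygosity is ignored."""
    return {0} if genotype == 0 else {1, 2}


def classify_trio(father, mother, child):
    """Return 'yellow', 'blue', or 'red' for one SV across the trio."""
    if is_mendelian(father, mother, child):
        return "yellow"  # concordant with Mendelian inheritance
    relaxed = product(
        possible_genotypes(father),
        possible_genotypes(mother),
        possible_genotypes(child),
    )
    if any(is_mendelian(f, m, c) for f, m, c in relaxed):
        return "blue"  # error only when zygosity is considered
    return "red"  # error even on presence/absence alone
```

Under these assumptions, an SV that is heterozygous in the child and absent in both parents is red, while an SV that is homozygous in the father and absent in the child is blue, because it would be consistent if the father were actually heterozygous.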

#Update log

##Latest version v1.1

[June 22, 2017] To regenerate the data used in the paper, please use version v1.0, because the SV list obtained by this version is slightly different from the previous one. The update details are as follows:
  • Fix some minor bugs in detecting SVs;
  • Combine the indels and the mixed indels detection together.

##Version v1.0

[June 9, 2017] The original version of OMSV with the optimization of memory usage (~4GB for ~80x OM data).

Should you have any questions, please contact us.