How To Install Genomon-exome


Genomon-exome is an analysis pipeline distributed under the Genomon License. It is designed and optimized for use on the supercomputer at HGC. The software is freely available and may be modified and redistributed under the License. This document assumes you are analyzing data on the supercomputer at HGC.

System Preferences


Log on to the supercomputer. Open up ~/.bash_profile:

			vi ~/.bash_profile
			

Add the settings shown below. Please be careful when modifying your PATH variable. We suggest setting LANG to en_US; however, if other applications need a different value, you don't need to change it. The path to python2.6 or python2.7 should be included in PATH. Python must be at least version 2.6.

			# set the locale
			LANG=en_US; export LANG
			# add only ONE of the two PATH lines below, depending on your system
			# on Shirokane1, use python2.6
			export PATH=/usr/local/package/python2.6/2.6.5/bin:$PATH
			# on Shirokane2, use python2.7
			export PATH=/usr/local/package/python2.7/2.7.2/bin:$PATH
			# The R libraries are found under this dir
			export R_LIBS=~/.R

After you edit the profile, either log out and log in again, or type "source ~/.bash_profile". The environment variables are then set every time you log on.

Please check that the path to python in your environment is exactly the same as the path shown below for your system:

				which python
				# if you are on Shirokane1
				/usr/local/package/python2.6/2.6.5/bin/python
				# if you are on Shirokane2
				/usr/local/package/python2.7/2.7.2/bin/python
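
The comparison above can also be done by the shell. The following is a minimal sketch, not part of Genomon-exome; is_expected_python is a hypothetical helper name, and the two paths are the ones listed in this document.

```shell
# Hypothetical helper (not part of Genomon-exome): succeeds only if the given
# path is one of the two interpreter paths listed in this document.
is_expected_python() {
    case "$1" in
        /usr/local/package/python2.6/2.6.5/bin/python) return 0 ;;  # Shirokane1
        /usr/local/package/python2.7/2.7.2/bin/python) return 0 ;;  # Shirokane2
        *) return 1 ;;
    esac
}

# Compare against whatever python the current shell would run.
if is_expected_python "$(command -v python 2>/dev/null)"; then
    echo "python path OK"
else
    echo "python path differs from the expected one; re-check ~/.bash_profile"
fi
```

If the second message appears, the PATH line in ~/.bash_profile is probably missing or has not been sourced yet.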

Download Genomon-exome


Get the Genomon-exome source from the Genomon-exome downloads page on GitHub (see below) and download it to your local machine. Look for the file with the extension .tar.gz (or .zip). You'll need to upload the archive, exome_for_HGC-RB_${version}.tar.gz (or .zip), to the supercomputer. (You can put it anywhere under your home directory.) If your local machine runs Windows, WinSCP is a convenient tool for the upload.

Genomon-exome downloads page: https://github.com/Genomon/exome_for_HGC

Log on to the supercomputer. Go to the directory in which the source archive is stored and unpack it. Once unpacking is finished, you can delete the archive file.

		# cd to dir which has Genomon-exome
		cd /dir/to/the/Genomon-exome-archive
		# unpack it
		tar xzvf exome_for_HGC-RB_${version}.tar.gz  # for the .zip archive, use: unzip exome_for_HGC-RB_${version}.zip
		# shorten the dir name
		mv exome_for_HGC-RB_${version} exome

cd to the exome/script directory. Add the exec bit to the files in the exome/script directory. Upon completion, cd to the exome/bin directory.

			cd exome/script
			chmod 740 *
			cd ../install

Writing The Setup Configuration File


Use the exome/script/exon_pipeline.config file to set up your analysis. Usually, you just need to change USER_NAME to your user ID at HGC.

$ cat exon_pipeline.config

		[user-info]
		name=USER_NAME # USER_NAME should be your user id at HGC.
		
		[directory-path]
		project=/home/USER_NAME/exome  # same here. USER_NAME needs change.
		script=script
		ref=ref
		input=data/input
		output=data/output
		result=data/result
		db=db
		sys=sys
		tmp=tmp
		log=log
		inhousedata=
		summarydata=
		
		[data-file]
		hg19fasta=ref/hg19_bwa-0.5.10/hg19.fasta
		dbsnprod=
		
		# read1 and read2 are your adapter sequences.
		# You can list several adapters, separated by commas, e.g.
		# read1=ATGCAT,AACC
		[adapter]
		read1=NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN  # read1 for first pairs
		read2=NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN  # read2 for second pairs
		
		[bin] # Here you can adjust the paths to the binaries and scripts.
		bwa=bin/bwa-0.5.10/bwa
		picard=bin/picard-tools-1.39
		samtools=bin/samtools-0.1.15/samtools
		bedtools=bin/BEDTools-Version-2.14.3/bin
		cutadapt=bin/cutadapt-1.0/cutadapt
		annovar=bin/annovar
		javatools=
		python2.6=/usr/local/package/python2.6/2.6.5/bin/python
		# On Shirokane2, this should be /usr/local/package/python2.7/2.7.2/bin/python.
		java6=/usr/local/package/java/current6/bin/java
		maq=/usr/local/bin/maq
		R=/usr/local/bin/R
		gatk=bin/GenomeAnalysisTK-1.4-21-g30b937d
		gatk1_0=
		
		[db]
		inhouseflg=0
		inhouse_version=v1
		cosmicflg=0
		cosmic_version=v57
		
		[ngsdb]
		dbname=
		hostname=
		port=
		user=
		password=
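
Since the usual change is only the USER_NAME placeholder, the substitution can be scripted. The following is a minimal sketch under that assumption; fill_user_name is a hypothetical helper, not something shipped with Genomon-exome. It replaces every occurrence of USER_NAME, including the ones inside comments.

```shell
# Hypothetical helper (not shipped with Genomon-exome): read a config on
# stdin and write it to stdout with every USER_NAME replaced by $1.
# Assumes the user ID contains no "/" or "&", which sed would misinterpret.
fill_user_name() {
    sed "s/USER_NAME/$1/g"
}

# Usage, from exome/script (writes a new file so the original is kept):
# fill_user_name "$USER" < exon_pipeline.config > exon_pipeline.config.new
```

Review the new file before moving it over the original, since any other settings still need to be checked by hand.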

Edit exome/copy_number/script/copynum.env. This is needed for Copy Number analyses.

			WORKDIR=${HOME}/exome
			
			HG19REF=${WORKDIR}/ref/hg19_bwa-0.5.10/hg19.fasta
			INTERVALDIR=${WORKDIR}/db/interval_list_hg19_nongap
			BAITINFO=${WORKDIR}/db/SureSelect50M.bed # bed file describes exon capture regions.
			
			BEDTOOLS=${WORKDIR}/bin/BEDTools-Version-2.14.3/bin
			SAMTOOLS=${WORKDIR}/bin/samtools-0.1.15
			ANNOPATH=${WORKDIR}/bin/annovar
			PERL=/usr/local/bin/perl
			R=/usr/local/bin/R
			PYTHON=/usr/local/package/python2.6/2.6.5/bin/python
			LOGDIR=${WORKDIR}/copy_number/log
			COMMAND_CN=${WORKDIR}/copy_number/script
			UTIL=${COMMAND_CN}/utility.sh

Edit exome/eb_call/script/config.sh. This is needed for Empirical Bayesian mutation Calling (EBCall).

		# path to the reference genome
		PATH_TO_REF=${HOME}/exome/ref/hg19/hg19.fasta
		
		# path to samtools
		PATH_TO_SAMTOOLS=${HOME}/exome/bin/samtools-0.1.18
		
		# path to R
		PATH_TO_R=/usr/local/bin
		
		# mapping quality threshold
		TH_MAPPING_QUAL=30
		
		# base quality threshold
		TH_BASE_QUAL=15
		
		# mapping quality threshold
		TH_MAPPING_QUAL_REF=30
		
		# base quality threshold
		TH_BASE_QUAL_REF=15
		
		# minimum depth in tumor
		MIN_TUMOR_DEPTH=8
		
		# minimum depth in normal
		MIN_NORMAL_DEPTH=8
		
		# minimum number of variant reads in tumor
		MIN_TUMOR_VARIANT_READ=4
		
		# minimum tumor allele frequency
		MIN_TUMOR_ALLELE_FREQ=0.08
		
		# maximum normal allele frequency
		MAX_NORMAL_ALLELE_FREQ=0.1
		
		# minimum value for the minus logarithm of p-value
		MIN_MINUS_LOG10_PV=3

		# interval list for multi-job operation
		INTERVAL=${HOME}/exome/db/interval_list_hg19_nongap

		# log dir
		LOGDIR=${HOME}/exome/log/ebcall

		# path to annovar
		ANNOPATH=${HOME}/exome/bin/annovar
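
To give a feel for what the depth and allele-frequency settings mean, here is a purely illustrative sketch. It is not EBCall's code, and whether EBCall uses inclusive or exclusive comparisons is an assumption here; passes_thresholds is a hypothetical helper that takes tumor depth, tumor variant reads, normal depth, and normal variant reads.

```shell
# Values copied from the config.sh listing above.
MIN_TUMOR_DEPTH=8
MIN_NORMAL_DEPTH=8
MIN_TUMOR_VARIANT_READ=4
MIN_TUMOR_ALLELE_FREQ=0.08
MAX_NORMAL_ALLELE_FREQ=0.1

# Hypothetical illustration only (NOT EBCall's implementation).
# Args: tumor_depth tumor_variant_reads normal_depth normal_variant_reads
# Succeeds if the candidate clears all five thresholds (inclusive bounds assumed).
passes_thresholds() {
    awk -v td="$1" -v tv="$2" -v nd="$3" -v nv="$4" \
        -v min_td="$MIN_TUMOR_DEPTH" -v min_nd="$MIN_NORMAL_DEPTH" \
        -v min_tv="$MIN_TUMOR_VARIANT_READ" \
        -v min_taf="$MIN_TUMOR_ALLELE_FREQ" \
        -v max_naf="$MAX_NORMAL_ALLELE_FREQ" '
    BEGIN {
        # Depth checks come first, so the divisions below never see a zero depth.
        ok = (td + 0 >= min_td + 0) && (nd + 0 >= min_nd + 0) &&
             (tv + 0 >= min_tv + 0) &&
             (tv / td >= min_taf + 0) && (nv / nd <= max_naf + 0)
        exit (ok ? 0 : 1)
    }'
}

# Example: 10 variant reads out of 50 in tumor, 1 out of 40 in normal.
passes_thresholds 50 10 40 1 && echo "candidate kept"
```

With these values, a site with only 3 variant reads in the tumor would be rejected regardless of depth.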

The Directory Structure


After you finish installing the software, the set of software packages and the dataset should all be placed under the exome directory.
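
A rough way to check the layout is sketched here. It assumes the subdirectory names are the ones given in the [directory-path] and [bin] sections of exon_pipeline.config shown earlier; list_missing_dirs is a hypothetical helper, and some of these directories may only be created by the pipeline at run time.

```shell
# Subdirectory names taken from the [directory-path] and [bin] entries of
# exon_pipeline.config shown earlier; adjust if your config differs.
EXPECTED_DIRS="script ref data/input data/output data/result db sys tmp log bin"

# Hypothetical helper: print each expected subdirectory missing under $1.
list_missing_dirs() {
    for d in $EXPECTED_DIRS; do
        [ -d "$1/$d" ] || echo "$d"
    done
}

# Usage (assumes the project directory is ~/exome):
# list_missing_dirs "$HOME/exome"
```

No output means every listed directory exists.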

Download & Install Dataset on HGC Super Computer


Here you will download the databases required for the Genomon-exome pipeline.

Before you start downloading the databases, please read their terms of use. The Genomon-exome License doesn't cover all of those databases.

Download the hg19 FASTA files from the UCSC site and place them under the exome/ref/hg19 directory. You should be in the exome/ref/hg19 directory before executing the wget commands.

		# get hg19 FASTA files
		wget http://hgdownload.cse.ucsc.edu/goldenPath/hg19/chromosomes/chr1.fa.gz
		wget http://hgdownload.cse.ucsc.edu/goldenPath/hg19/chromosomes/chr2.fa.gz
		wget http://hgdownload.cse.ucsc.edu/goldenPath/hg19/chromosomes/chr3.fa.gz
		wget http://hgdownload.cse.ucsc.edu/goldenPath/hg19/chromosomes/chr4.fa.gz
		wget http://hgdownload.cse.ucsc.edu/goldenPath/hg19/chromosomes/chr5.fa.gz
		wget http://hgdownload.cse.ucsc.edu/goldenPath/hg19/chromosomes/chr6.fa.gz
		wget http://hgdownload.cse.ucsc.edu/goldenPath/hg19/chromosomes/chr7.fa.gz
		wget http://hgdownload.cse.ucsc.edu/goldenPath/hg19/chromosomes/chr8.fa.gz
		wget http://hgdownload.cse.ucsc.edu/goldenPath/hg19/chromosomes/chr9.fa.gz
		wget http://hgdownload.cse.ucsc.edu/goldenPath/hg19/chromosomes/chr10.fa.gz
		wget http://hgdownload.cse.ucsc.edu/goldenPath/hg19/chromosomes/chr11.fa.gz
		wget http://hgdownload.cse.ucsc.edu/goldenPath/hg19/chromosomes/chr12.fa.gz
		wget http://hgdownload.cse.ucsc.edu/goldenPath/hg19/chromosomes/chr13.fa.gz
		wget http://hgdownload.cse.ucsc.edu/goldenPath/hg19/chromosomes/chr14.fa.gz
		wget http://hgdownload.cse.ucsc.edu/goldenPath/hg19/chromosomes/chr15.fa.gz
		wget http://hgdownload.cse.ucsc.edu/goldenPath/hg19/chromosomes/chr16.fa.gz
		wget http://hgdownload.cse.ucsc.edu/goldenPath/hg19/chromosomes/chr17.fa.gz
		wget http://hgdownload.cse.ucsc.edu/goldenPath/hg19/chromosomes/chr18.fa.gz
		wget http://hgdownload.cse.ucsc.edu/goldenPath/hg19/chromosomes/chr19.fa.gz
		wget http://hgdownload.cse.ucsc.edu/goldenPath/hg19/chromosomes/chr20.fa.gz
		wget http://hgdownload.cse.ucsc.edu/goldenPath/hg19/chromosomes/chr21.fa.gz
		wget http://hgdownload.cse.ucsc.edu/goldenPath/hg19/chromosomes/chr22.fa.gz
		wget http://hgdownload.cse.ucsc.edu/goldenPath/hg19/chromosomes/chrX.fa.gz
		wget http://hgdownload.cse.ucsc.edu/goldenPath/hg19/chromosomes/chrY.fa.gz
		wget http://hgdownload.cse.ucsc.edu/goldenPath/hg19/chromosomes/chrM.fa.gz
		# unpack them
		gunzip chr*.fa.gz
		# cat them to one file
		cat chr1.fa chr2.fa chr3.fa chr4.fa chr5.fa chr6.fa chr7.fa chr8.fa chr9.fa chr10.fa chr11.fa chr12.fa chr13.fa chr14.fa chr15.fa chr16.fa chr17.fa chr18.fa chr19.fa chr20.fa chr21.fa chr22.fa chrX.fa chrY.fa chrM.fa > hg19.fasta
		# check the resulting file's md5sum value; if it matches the value below, you are OK.
		md5sum hg19.fasta
		7c1739fd43764bd5e3b9b76ce8635bf0 hg19.fasta
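
The download, unpack, and concatenate steps above could also be written as a loop. The following is a minimal sketch; hg19_chroms and fetch_hg19 are hypothetical helper names, and the URL and chromosome order are the ones used above. fetch_hg19 is only defined here, not called.

```shell
# Print the 25 chromosome names in the order used for hg19.fasta above.
hg19_chroms() {
    i=1
    while [ "$i" -le 22 ]; do
        echo "chr$i"
        i=$((i + 1))
    done
    echo chrX
    echo chrY
    echo chrM
}

# Hypothetical wrapper: download, unpack, and concatenate in that order.
# Run it from exome/ref/hg19. It overwrites any existing hg19.fasta.
fetch_hg19() {
    base=http://hgdownload.cse.ucsc.edu/goldenPath/hg19/chromosomes
    : > hg19.fasta
    for c in $(hg19_chroms); do
        wget -nc "$base/$c.fa.gz" || return 1
        gunzip -f "$c.fa.gz" || return 1
        cat "$c.fa" >> hg19.fasta
    done
    # Compare the printed value with the md5sum shown above.
    md5sum hg19.fasta
}
```

Keeping the order in one function matters because the md5sum only matches if the chromosomes are concatenated in exactly this sequence.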

Next, you need to upload the bed file that describes the designed enrichment regions for the exome sequencing. If your PC runs Windows, please use WinSCP to upload the file. The .bed file should be located under the exome/db directory.

		$ exome/db/xxxxxxxx.bed
		

Remove the header lines from the bed file, if there are any. Also remove lines that don't represent regions in chromosomes 1-22, X, Y, or mitochondrial DNA. Because the hg19.fasta file we prepared contains only chromosomes 1-22, X, Y, and mitochondrial DNA, reads will not be mapped to other chromosomes.

		header=sample         # Header should be removed
		chr1    10000   11000
		chr2    10000   11000 # leave chr1-22, chrX, chrY, and chrM
		chrX    10000   11000 
		chrY    10000   11000 
		chrM    10000   11000 
		chr19_gl000nnn_random  20000   21000  # remove lines chr_xxxx_random
		chrUn_gl0002nn  30000   31000         # also remove lines chrUn_xxxx 
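
The clean-up described above can be done with a single grep. This is a minimal sketch, assuming the chromosome name is the first field and is followed by whitespace; filter_bed is a hypothetical helper name.

```shell
# Keep only lines whose first field is chr1-chr22, chrX, chrY, or chrM.
# Header lines and chr*_random / chrUn_* lines fail the match and are dropped.
filter_bed() {
    grep -E '^chr([1-9]|1[0-9]|2[0-2]|X|Y|M)[[:space:]]'
}

# Usage, from exome/db (writes a new file so the original is kept):
# filter_bed < xxxxxxxx.bed > xxxxxxxx.filtered.bed
```

It is worth comparing line counts before and after to see how many lines were dropped.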

The bed file should be tab-delimited. For Genomon-exome purposes, the fields should be: chrom, chromStart, chromEnd, Other1, Other2 (optional), and strand.

		chr1    10000   11000    A_XX_XXXXXXX  0000  +
		chr2    10000   11000    A_XX_XXXXXXX  0000  -
		chrX    10000   11000    A_XX_XXXXXXX        +
		chrY    10000   11000    A_XX_XXXXXXX        -
		chrM    10000   11000    A_XX_XXXXXXX  0000  +
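
A quick format check can catch files that are space-delimited or lack the strand column. The sketch below is one reading of the rules above (five or six tab-separated fields, numeric start and end, strand as the last field); check_bed is a hypothetical helper and is not part of the pipeline.

```shell
# Print a message for every line that is not tab-delimited with 5 or 6 fields,
# has a non-numeric or inverted start/end, or lacks +/- in the last field.
check_bed() {
    awk -F '\t' '
        NF < 5 || NF > 6 {
            print "line " NR ": expected 5 or 6 tab-separated fields"; next
        }
        $2 !~ /^[0-9]+$/ || $3 !~ /^[0-9]+$/ {
            print "line " NR ": start/end not numeric"; next
        }
        $2 + 0 >= $3 + 0 {
            print "line " NR ": start is not below end"; next
        }
        $NF != "+" && $NF != "-" {
            print "line " NR ": last field is not a strand"; next
        }
    '
}

# Usage: no output means every line passed.
# check_bed < exome/db/xxxxxxxx.bed
```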

Download & Install Software on HGC Super Computer


We need to download the packages required for the Genomon-exome pipeline.

Please make sure you understand the terms of each package's license before using it. Genomon-exome itself is licensed differently.

Each download session should be initiated from the exome/bin directory.

		# Change directory
		cd ${path to the Genomon-exome}/exome/bin

Download the BWA (Burrows-Wheeler Aligner) package.

		# download bwa
		wget http://sourceforge.net/projects/bio-bwa/files/bwa-0.5.10.tar.bz2
		tar xjvf bwa-0.5.10.tar.bz2
		# build the bwa package
		cd bwa-0.5.10
		make
		# go back to the exome/bin dir
		cd ..
		# create hardlink to the hg19.fasta file
		mkdir ../ref/hg19_bwa-0.5.10
		ln ../ref/hg19/hg19.fasta ../ref/hg19_bwa-0.5.10/hg19.fasta
		# in script dir, create index
		cd ../script
		qsub bwa_index.sh bwa-0.5.10
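
Because the index is built by a batch job, it may help to confirm that the index files exist once the job finishes. This is a minimal sketch, assuming bwa_index.sh runs "bwa index" on ../ref/hg19_bwa-0.5.10/hg19.fasta and so leaves the usual .amb, .ann, .bwt, .pac, and .sa files beside it; missing_bwa_index is a hypothetical helper.

```shell
# Hypothetical helper: list the standard "bwa index" outputs missing for $1,
# where $1 is the path of the indexed FASTA file.
missing_bwa_index() {
    for ext in amb ann bwt pac sa; do
        [ -f "$1.$ext" ] || echo "$1.$ext"
    done
}

# Usage, from exome/script after the qsub job has finished:
# missing_bwa_index ../ref/hg19_bwa-0.5.10/hg19.fasta
```

No output means all five files are present.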

Download the Picard package.

		# download picard
		wget http://sourceforge.net/projects/picard/files/picard-tools/1.39/picard-tools-1.39.zip
		wget http://sourceforge.net/projects/picard/files/picard-tools/1.39/README.txt
		# unpack it
		unzip picard-tools-1.39.zip
		# mv readme for later convenience
		mv README.txt picard-tools-1.39.README.txt

Download the GATK (The Genome Analysis Toolkit) package.

		# download GATK
		wget ftp://ftp.broadinstitute.org/pub/gsa/GenomeAnalysisTK/GenomeAnalysisTK-1.4-21-g30b937d.tar.bz2
		# unpack it
		tar xjvf GenomeAnalysisTK-1.4-21-g30b937d.tar.bz2

Download the SAMtools package.

		# download samtools-0.1.15
		wget -nc http://sourceforge.net/projects/samtools/files/samtools/0.1.15/README
		wget -nc http://sourceforge.net/projects/samtools/files/samtools/0.1.15/samtools-0.1.15.tar.bz2
		# unpack it
		tar xjvf samtools-0.1.15.tar.bz2
		mv README samtools-0.1.15.README
		# build the samtools package
		cd samtools-0.1.15
		make
		# go back to the exome/bin dir
		cd ..
		
		# download samtools-0.1.18 for EBCall
		wget -nc http://sourceforge.net/projects/samtools/files/samtools/0.1.18/README
		wget -nc http://sourceforge.net/projects/samtools/files/samtools/0.1.18/samtools-0.1.18.tar.bz2
		# unpack it
		tar xjvf samtools-0.1.18.tar.bz2
		mv README samtools-0.1.18.README
		# build the samtools package
		cd samtools-0.1.18
		make
		

Download the bedtools package.

		# download bedtools
		wget -nc http://bedtools.googlecode.com/files/BEDTools.v2.14.3.tar.gz
		# unpack it
		tar xzvf BEDTools.v2.14.3.tar.gz
		# build the bedtools package
		cd BEDTools-Version-2.14.3
		make

Download the cutadapt package.

		# download cutadapt
		wget -nc http://cutadapt.googlecode.com/files/cutadapt-1.0.tar.gz
		# unpack it
		tar xzvf cutadapt-1.0.tar.gz
		# build the cutadapt package
		cd cutadapt-1.0
		python setup.py build_ext -i

Download the ANNOVAR package. The ANNOVAR perl scripts help you download various databases, including dbSNP build 131. To use this software, you need to register at the ANNOVAR site; you will then receive an email at the address you registered. The email contains the link to the package.
ANNOVAR should be placed under the exome/bin directory.

		# download annovar
		wget -nc ${the link address written in your email}
		# unpack the package
		tar xzvf annovar.tar.gz  # the annovar package you just downloaded
		# download annotation databases
		./annovar/annotate_variation.pl -buildver hg19 -downdb gene annovar/humandb/
		./annovar/annotate_variation.pl -buildver hg19 -downdb -webfrom annovar mce46way annovar/humandb/
		./annovar/annotate_variation.pl -buildver hg19 -downdb segdup annovar/humandb/
		./annovar/annotate_variation.pl -buildver hg19 -downdb -webfrom annovar ljb_all annovar/humandb/
		./annovar/annotate_variation.pl -buildver hg19 -downdb -webfrom annovar 1000g2010nov_all annovar/humandb/
		./annovar/annotate_variation.pl -buildver hg19 -downdb -webfrom annovar snp131 annovar/humandb/
		./annovar/annotate_variation.pl -buildver hg19 -downdb -webfrom annovar avsift annovar/humandb/
		./annovar/annotate_variation.pl -buildver hg19 -downdb -webfrom annovar ljb_pp2 annovar/humandb/
		./annovar/annotate_variation.pl -buildver hg19 -downdb -webfrom annovar ljb_phylop annovar/humandb/
		./annovar/annotate_variation.pl -buildver hg19 -downdb -webfrom annovar ljb_mt annovar/humandb/
		./annovar/annotate_variation.pl -buildver hg19 -downdb -webfrom annovar ljb_lrt annovar/humandb/
		./annovar/annotate_variation.pl -buildver hg19 -downdb -webfrom annovar esp5400_all annovar/humandb/
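
The database downloads above follow one pattern, so they can be generated from a list. The sketch below only prints the commands; annovar_download_cmds is a hypothetical helper, and the database names and options are exactly those listed above.

```shell
# Print one annotate_variation.pl command per database, in the order above.
# "gene" and "segdup" are fetched without "-webfrom annovar", as in the list above.
annovar_download_cmds() {
    for db in gene mce46way segdup ljb_all 1000g2010nov_all snp131 \
              avsift ljb_pp2 ljb_phylop ljb_mt ljb_lrt esp5400_all; do
        case "$db" in
            gene|segdup) src="" ;;
            *)           src=" -webfrom annovar" ;;
        esac
        echo "./annovar/annotate_variation.pl -buildver hg19 -downdb$src $db annovar/humandb/"
    done
}

# Usage, from exome/bin: review the printed commands, then run them with
# annovar_download_cmds | sh
```

Printing first makes it easy to drop a database you already have before running anything.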

GATK (The Genome Analysis Toolkit) needs a dict file (hg19.dict in this case) for realignment. Before creating the dict file, make sure exon_pipeline.config is configured properly.
Please see the section on the setup configuration file above.

		# change dir to the script dir
		cd exome/script
		python realign_gatk_setup.py
		# check status is 0 (program exited normally)
		job id : 477616 failed =0 exit_status=0
		ls -l ../ref/hg19_bwa-0.5.10/hg19.dict # check the file is there 
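
The status check can be automated if the script's output is captured. The sketch below assumes the status line always has the shape of the sample shown above ("failed =0 exit_status=0" at the end); that format is inferred from this one example only, and job_succeeded is a hypothetical helper.

```shell
# Succeeds only if the status line ends with failed=0 and exit_status=0.
# Spaces are removed first because the sample line has "failed =0".
job_succeeded() {
    line=$(printf '%s' "$1" | tr -d ' ')
    case "$line" in
        *failed=0exit_status=0) return 0 ;;
        *) return 1 ;;
    esac
}

# Example with the sample line from this document.
if job_succeeded "job id : 477616 failed =0 exit_status=0"; then
    echo "job finished normally"
fi
```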

Download the DNAcopy package from the Bioconductor site and put it under the exome/bin directory. (Make sure you are in the exome/bin directory before you start wget.)

		mkdir -p ~/.R
		export R_LIBS=~/.R
		wget -nc http://www.bioconductor.org/packages/2.10/bioc/src/contrib/DNAcopy_1.30.0.tar.gz
		cp DNAcopy_1.30.0.tar.gz ~/.R
		R CMD INSTALL DNAcopy_1.30.0.tar.gz

Execute R and check that library(DNAcopy) loads without an error. The R_LIBS environment variable must be exported before you use the library.

		R
		library(DNAcopy)