Genomon-exome is an analysis pipeline distributed under the Genomon License. It is designed and optimized for use on the supercomputer at HGC. This software is freely available and may be modified and redistributed under the License. This document assumes you are analyzing data on the supercomputer at HGC.
Links inside this page:
System Preferences
Download Genomon-exome
Writing The Setup Configuration File
Directory Structure
Download Data Set on HGC Super Computer
Download Software on HGC Super Computer
Log on to the supercomputer. Open up ~/.bash_profile:
vi ~/.bash_profile
Add the settings shown below. Please be careful when modifying your PATH variable. We suggest setting LANG to en_US; however, if other applications need a different value, you do not have to change it. PATH must include the path to python2.6 or python2.7. Python must be version 2.6 or later.
# set the locale
LANG=en_US; export LANG
# on Shirokane1, use python2.6
export PATH=/usr/local/package/python2.6/2.6.5/bin:$PATH
# on Shirokane2, use python2.7
export PATH=/usr/local/package/python2.7/2.7.2/bin:$PATH
# The R libraries are found under this dir
export R_LIBS=~/.R
After you edit the profile, either log out and log in again, or type "source ~/.bash_profile". The environment variables are set every time you log on.
Please check if the path to python in your environment is exactly the same path shown below:
which python
# if you are on Shirokane1
/usr/local/package/python2.6/2.6.5/bin/python
# if you are on Shirokane2
/usr/local/package/python2.7/2.7.2/bin/python
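The requirement that Python be at least 2.6 can also be checked by version number rather than by path. The following is a minimal sketch, not part of Genomon-exome: the helper name version_at_least is hypothetical, and it compares only the major and minor numbers.

```shell
#!/bin/sh
# Hypothetical helper (not part of Genomon-exome): succeed when the
# version in $1 is at least the version in $2. Both arguments must
# contain at least major.minor, e.g. 2.7.2 or 2.6.
version_at_least() {
    have_major=${1%%.*}; have_rest=${1#*.}; have_minor=${have_rest%%.*}
    want_major=${2%%.*}; want_rest=${2#*.}; want_minor=${want_rest%%.*}
    if [ "$have_major" -gt "$want_major" ]; then
        return 0
    fi
    [ "$have_major" -eq "$want_major" ] && [ "$have_minor" -ge "$want_minor" ]
}

# Ask whichever python is first in PATH for its major.minor version.
if command -v python >/dev/null 2>&1; then
    py_version=$(python -c 'import sys; print("%d.%d" % sys.version_info[:2])')
    if version_at_least "$py_version" 2.6; then
        echo "python $py_version is new enough"
    else
        echo "python $py_version is older than 2.6" >&2
    fi
else
    echo "python was not found in PATH" >&2
fi
```

Note that this only tests the lower bound. The paths in this document point to Python 2.6 and 2.7, so a newer major version passing this check does not mean the pipeline will run with it.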
Get the Genomon-exome source from the Genomon-exome downloads page on GitHub (see below) and download it to your local machine. Look for the file with the extension .tar.gz (or .zip). You then need to upload the archive, exome_for_HGC-RB_${version}.tar.gz (or .zip), to the supercomputer; you can put it anywhere under your home directory. If your local machine runs Windows, WinSCP is a convenient tool for the upload.
Genomon-exome downloads page: https://github.com/Genomon/exome_for_HGC
Log on to the supercomputer. Go to the directory in which the source archive is stored and unpack it. Once unpacking has finished, you can delete the archive file.
# cd to dir which has Genomon-exome
cd /dir/to/the/Genomon-exome-archive
# unpack it
tar xzvf exome_for_HGC-RB_${version}.tar.gz
# unzip exome_for_HGC-RB_${version}.zip
# shorten the dir name
mv exome_for_HGC-RB_${version} exome
cd to the exome/script directory. Add the exec bit to the files in the exome/script directory. Upon completion, cd to the exome/bin directory.
cd exome/script
chmod 740 *
cd ../install
Use the exome/script/exon_pipeline.config file to set up your analysis. Usually, you only need to change USER_NAME to your user ID at HGC.
$ cat exon_pipeline.config
[user-info]
name=USER_NAME # USER_NAME should be your user id at HGC.

[directory-path]
project=/home/USER_NAME/exome # same here. USER_NAME needs change.
script=script
ref=ref
input=data/input
output=data/output
result=data/result
db=db
sys=sys
tmp=tmp
log=log
inhousedata=
summarydata=

[data-file]
hg19fasta=ref/hg19_bwa-0.5.10/hg19.fasta
dbsnprod=

# read1 and read2 are your adapter sequences.
# You can add adapters by separating commas.
# read1=ATGCAT,AACC
[adapter]
read1=NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN # read1 for first pairs
read2=NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN # read2 for second pairs

[bin]
# Here you can adjust the paths to the binaries and scripts.
bwa=bin/bwa-0.5.10/bwa
picard=bin/picard-tools-1.39
samtools=bin/samtools-0.1.15/samtools
bedtools=bin/BEDTools-Version-2.14.3/bin
cutadapt=bin/cutadapt-1.0/cutadapt
annovar=bin/annovar
javatools=
python2.6=/usr/local/package/python2.6/2.6.5/bin/python # On Shirokane2, this should be /usr/local/package/python2.7/2.7.2/bin/python.
java6=/usr/local/package/java/current6/bin/java
maq=/usr/local/bin/maq
R=/usr/local/bin/R
gatk=bin/GenomeAnalysisTK-1.4-21-g30b937d
gatk1_0=

[db]
inhouseflg=0
inhouse_version=v1
cosmicflg=0
cosmic_version=v57

[ngsdb]
dbname=
hostname=
port=
user=
password=
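If you prefer not to edit the file by hand, the USER_NAME placeholder can be replaced with sed. This is only a sketch: the function name set_user_name is hypothetical, and it assumes the name= and project= lines look exactly like the ones shown above. It writes a new file instead of editing in place, so that you can review the result first.

```shell
#!/bin/sh
# Hypothetical helper (not part of Genomon-exome): copy a config file,
# replacing the USER_NAME placeholder only on the name= and project=
# lines. USER_NAME inside trailing comments is left untouched.
# Usage: set_user_name <input config> <user id> <output config>
set_user_name() {
    sed -e "s|^name=USER_NAME|name=$2|" \
        -e "s|^project=/home/USER_NAME/|project=/home/$2/|" \
        "$1" > "$3"
}
```

For example, set_user_name exon_pipeline.config "$USER" exon_pipeline.config.new writes the modified copy. Compare it with the original using diff before moving it into place.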
Edit exome/copy_number/script/copynum.env. This is needed for Copy Number analyses.
WORKDIR=${HOME}/exome
HG19REF=${WORKDIR}/ref/hg19_bwa-0.5.10/hg19.fasta
INTERVALDIR=${WORKDIR}/db/interval_list_hg19_nongap
BAITINFO=${WORKDIR}/db/SureSelect50M.bed # bed file describes exon capture regions.
BEDTOOLS=${WORKDIR}/bin/BEDTools-Version-2.14.3/bin
SAMTOOLS=${WORKDIR}/bin/samtools-0.1.15
ANNOPATH=${WORKDIR}/bin/annovar
PERL=/usr/local/bin/perl
R=/usr/local/bin/R
PYTHON=/usr/local/package/python2.6/2.6.5/bin/python
LOGDIR=${WORKDIR}/copy_number/log
COMMAND_CN=${WORKDIR}/copy_number/script
UTIL=${COMMAND_CN}/utility.sh
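Once the packages and databases described later in this document are in place, it can be useful to confirm that the paths in copynum.env really exist. The sketch below is not part of Genomon-exome: report_missing_paths is a hypothetical helper, and the usage note assumes that copynum.env contains plain shell assignments and can therefore be sourced.

```shell
#!/bin/sh
# Hypothetical helper (not part of Genomon-exome): print every argument
# that is not an existing file or directory, and return non-zero if at
# least one of them is missing.
report_missing_paths() {
    missing=0
    for path in "$@"; do
        if [ ! -e "$path" ]; then
            echo "missing: $path"
            missing=1
        fi
    done
    return "$missing"
}
```

A possible use is to run ". exome/copy_number/script/copynum.env" and then call report_missing_paths "$HG19REF" "$INTERVALDIR" "$BAITINFO" "$BEDTOOLS" "$SAMTOOLS" "$ANNOPATH" "$PERL" "$PYTHON". Anything printed still needs to be installed or corrected.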
Edit exome/eb_call/script/config.sh. This is needed for Empirical Bayesian mutation Calling.
# path to the reference genome
PATH_TO_REF=${HOME}/exome/ref/hg19/hg19.fasta
# path to samtools
PATH_TO_SAMTOOLS=${HOME}/exome/bin/samtools-0.1.18
# path to R
PATH_TO_R=/usr/local/bin
# mapping quality threshold
TH_MAPPING_QUAL=30
# base quality threshold
TH_BASE_QUAL=15
# mapping quality threshold
TH_MAPPING_QUAL_REF=30
# base quality threshold
TH_BASE_QUAL_REF=15
# minimum depth in tumor
MIN_TUMOR_DEPTH=8
# minimum depth in normal
MIN_NORMAL_DEPTH=8
# minimum number of variant reads in tumor
MIN_TUMOR_VARIANT_READ=4
# minimum amount of tumor allele frequency
MIN_TUMOR_ALLELE_FREQ=0.08
# maximum amount of normal allele frequency
MAX_NORMAL_ALLELE_FREQ=0.1
# minimum value for the minus logarithm of p-value
MIN_MINUS_LOG10_PV=3
# interval list for multi-job operation
INTERVAL=${HOME}/exome/db/interval_list_hg19_nongap
# log dir
LOGDIR=${HOME}/exome/log/ebcall
# path to annovar
ANNOPATH=${HOME}/exome/bin/annovar
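To make the meaning of the depth, read-count, allele-frequency, and p-value thresholds more concrete, the sketch below applies them to a single candidate variant. This is an assumed reading of the comments in config.sh, not EBCall's actual implementation. The function name passes_thresholds and its argument order are hypothetical. The defaults are the values shown above and are overridden if the corresponding variables are set in the environment.

```shell
#!/bin/sh
# Illustrative only: an assumed interpretation of the thresholds,
# not the code that EBCall runs.
# Arguments: tumor depth, normal depth, tumor variant reads,
#            tumor allele frequency, normal allele frequency,
#            -log10(p-value)
passes_thresholds() {
    awk -v td="$1" -v nd="$2" -v tv="$3" -v taf="$4" -v naf="$5" -v mlp="$6" \
        -v min_td="${MIN_TUMOR_DEPTH:-8}" \
        -v min_nd="${MIN_NORMAL_DEPTH:-8}" \
        -v min_tv="${MIN_TUMOR_VARIANT_READ:-4}" \
        -v min_taf="${MIN_TUMOR_ALLELE_FREQ:-0.08}" \
        -v max_naf="${MAX_NORMAL_ALLELE_FREQ:-0.1}" \
        -v min_mlp="${MIN_MINUS_LOG10_PV:-3}" '
        BEGIN {
            depth_ok  = (td >= min_td && nd >= min_nd)
            reads_ok  = (tv >= min_tv)
            freq_ok   = (taf >= min_taf && naf <= max_naf)
            pvalue_ok = (mlp >= min_mlp)
            if (depth_ok && reads_ok && freq_ok && pvalue_ok) exit 0
            exit 1
        }'
}
```

Under this reading, passes_thresholds 20 15 5 0.25 0.0 4.2 succeeds, while a candidate with only 3 variant reads in the tumor fails.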
After you finish installing the software, the directories should look something like below. The set of software packages and the dataset should be placed under the exome directory.
In this section you download the databases required for the Genomon-exome pipeline.
Before you start downloading the databases, please read their terms of use. The Genomon License does not cover all of those databases.
Download the hg19 fasta files from the UCSC site and place them under the exome/ref/hg19 directory. You should be in the exome/ref/hg19 directory before executing the wget commands.
# get hg19 FASTA files
wget http://hgdownload.cse.ucsc.edu/goldenPath/hg19/chromosomes/chr1.fa.gz
wget http://hgdownload.cse.ucsc.edu/goldenPath/hg19/chromosomes/chr2.fa.gz
wget http://hgdownload.cse.ucsc.edu/goldenPath/hg19/chromosomes/chr3.fa.gz
wget http://hgdownload.cse.ucsc.edu/goldenPath/hg19/chromosomes/chr4.fa.gz
wget http://hgdownload.cse.ucsc.edu/goldenPath/hg19/chromosomes/chr5.fa.gz
wget http://hgdownload.cse.ucsc.edu/goldenPath/hg19/chromosomes/chr6.fa.gz
wget http://hgdownload.cse.ucsc.edu/goldenPath/hg19/chromosomes/chr7.fa.gz
wget http://hgdownload.cse.ucsc.edu/goldenPath/hg19/chromosomes/chr8.fa.gz
wget http://hgdownload.cse.ucsc.edu/goldenPath/hg19/chromosomes/chr9.fa.gz
wget http://hgdownload.cse.ucsc.edu/goldenPath/hg19/chromosomes/chr10.fa.gz
wget http://hgdownload.cse.ucsc.edu/goldenPath/hg19/chromosomes/chr11.fa.gz
wget http://hgdownload.cse.ucsc.edu/goldenPath/hg19/chromosomes/chr12.fa.gz
wget http://hgdownload.cse.ucsc.edu/goldenPath/hg19/chromosomes/chr13.fa.gz
wget http://hgdownload.cse.ucsc.edu/goldenPath/hg19/chromosomes/chr14.fa.gz
wget http://hgdownload.cse.ucsc.edu/goldenPath/hg19/chromosomes/chr15.fa.gz
wget http://hgdownload.cse.ucsc.edu/goldenPath/hg19/chromosomes/chr16.fa.gz
wget http://hgdownload.cse.ucsc.edu/goldenPath/hg19/chromosomes/chr17.fa.gz
wget http://hgdownload.cse.ucsc.edu/goldenPath/hg19/chromosomes/chr18.fa.gz
wget http://hgdownload.cse.ucsc.edu/goldenPath/hg19/chromosomes/chr19.fa.gz
wget http://hgdownload.cse.ucsc.edu/goldenPath/hg19/chromosomes/chr20.fa.gz
wget http://hgdownload.cse.ucsc.edu/goldenPath/hg19/chromosomes/chr21.fa.gz
wget http://hgdownload.cse.ucsc.edu/goldenPath/hg19/chromosomes/chr22.fa.gz
wget http://hgdownload.cse.ucsc.edu/goldenPath/hg19/chromosomes/chrX.fa.gz
wget http://hgdownload.cse.ucsc.edu/goldenPath/hg19/chromosomes/chrY.fa.gz
wget http://hgdownload.cse.ucsc.edu/goldenPath/hg19/chromosomes/chrM.fa.gz
# unpack them
gunzip chr*.fa.gz
# cat them to one file
cat chr1.fa chr2.fa chr3.fa chr4.fa chr5.fa chr6.fa chr7.fa chr8.fa \
    chr9.fa chr10.fa chr11.fa chr12.fa chr13.fa chr14.fa chr15.fa chr16.fa \
    chr17.fa chr18.fa chr19.fa chr20.fa chr21.fa chr22.fa \
    chrX.fa chrY.fa chrM.fa > hg19.fasta
# check the md5sum of the resulting file; if you see the same value as below, you are OK.
md5sum hg19.fasta
7c1739fd43764bd5e3b9b76ce8635bf0  hg19.fasta
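The download, concatenation, and checksum steps can also be written with small helper functions instead of listing every chromosome by hand. The sketch below is an equivalent formulation under stated assumptions: the function names hg19_chromosomes, hg19_urls, and md5_matches are hypothetical, and md5sum is assumed to be the GNU coreutils version.

```shell
#!/bin/sh
# Print the 25 hg19 chromosome names in the order used for hg19.fasta.
hg19_chromosomes() {
    i=1
    while [ "$i" -le 22 ]; do
        echo "chr$i"
        i=$((i + 1))
    done
    echo chrX
    echo chrY
    echo chrM
}

# Print one download URL per chromosome (the same URLs as listed above).
hg19_urls() {
    for chrom in $(hg19_chromosomes); do
        echo "http://hgdownload.cse.ucsc.edu/goldenPath/hg19/chromosomes/$chrom.fa.gz"
    done
}

# Hypothetical helper: succeed when the md5sum of file $1 equals $2.
md5_matches() {
    actual=$(md5sum "$1" | cut -d' ' -f1)
    [ "$actual" = "$2" ]
}
```

With these definitions, "hg19_urls | wget -nc -i -" downloads the files. After gunzip, "for c in $(hg19_chromosomes); do cat "$c.fa"; done > hg19.fasta" concatenates them in the required order, and "md5_matches hg19.fasta 7c1739fd43764bd5e3b9b76ce8635bf0 && echo OK" performs the final check.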
Next, upload the bed file that describes the designed enrichment regions for exome sequencing. If your PC runs Windows, please use WinSCP to upload the file. The .bed file should be located under the exome/db directory.
$ exome/db/xxxxxxxx.bed
Remove any header lines from the bed file. Also remove lines that do not represent regions in chromosomes 1-22, X, Y, or mitochondrial DNA. Because the hg19.fasta file we prepared contains only chromosomes 1-22, X, Y, and mitochondrial DNA, reads will not be mapped to other chromosomes.
header=sample # Header should be removed
chr1 10000 11000
chr2 10000 11000 # leave chr1-22, chrX, chrY, and chrM
chrX 10000 11000
chrY 10000 11000
chrM 10000 11000
chr19_gl000nnn_random 20000 21000 # remove lines chr_xxxx_random
chrUn_gl0002nn 30000 31000 # also remove lines chrUn_xxxx
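The clean-up described above can be done with a single awk filter. This is a minimal sketch, assuming a tab-delimited bed file. The function name filter_bed_chromosomes and the output file name in the usage note are hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: print only the lines of bed file $1 whose first
# column is exactly chr1-chr22, chrX, chrY, or chrM. Header lines and
# chr*_random / chrUn_* lines do not match and are therefore dropped.
filter_bed_chromosomes() {
    awk -F'\t' '$1 ~ /^chr([1-9]|1[0-9]|2[0-2]|X|Y|M)$/' "$1"
}
```

For example, "filter_bed_chromosomes exome/db/xxxxxxxx.bed > exome/db/xxxxxxxx.filtered.bed" writes the filtered regions to a new file and leaves the original untouched.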
The bed file should be tab-delimited. For Genomon-exome purposes, the fields should be: chrom, chromStart, chromEnd, Other1, Other2 (optional), and strand.
chr1 10000 11000 A_XX_XXXXXXX 0000 +
chr2 10000 11000 A_XX_XXXXXXX 0000 -
chrX 10000 11000 A_XX_XXXXXXX +
chrY 10000 11000 A_XX_XXXXXXX -
chrM 10000 11000 A_XX_XXXXXXX 0000 +
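The field layout described above can be checked mechanically before running the pipeline. The following sketch is not part of Genomon-exome: check_bed_fields is a hypothetical helper that assumes each line must have 5 or 6 tab-separated fields, integer start and end coordinates, and a strand of + or - in the last field.

```shell
#!/bin/sh
# Hypothetical helper: report the lines of bed file $1 that do not have
# 5 or 6 tab-separated fields, integer start/end coordinates, and a
# final field of + or -. Returns non-zero if any line is reported.
check_bed_fields() {
    awk -F'\t' '
        {
            fields_ok = (NF == 5 || NF == 6)
            coords_ok = ($2 ~ /^[0-9]+$/ && $3 ~ /^[0-9]+$/)
            strand_ok = ($NF == "+" || $NF == "-")
            if (!(fields_ok && coords_ok && strand_ok)) {
                printf "line %d: unexpected format\n", NR
                bad = 1
            }
        }
        END { exit (bad ? 1 : 0) }
    ' "$1"
}
```

A file that was saved with spaces instead of tabs, or with Windows line endings, is reported by this check, because the fields can then no longer be split as expected.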
We need to download packages required for the Genomon-exome pipeline.
Please make sure you understand the terms of each package's license before using it. These packages are licensed separately from Genomon-exome.
Each download session should be initiated from the exome/bin directory.
# Change directory
cd ${path to the Genomon-exome}/exome/bin
Download the BWA (Burrows-Wheeler Aligner) package.
# download bwa
wget http://sourceforge.net/projects/bio-bwa/files/bwa-0.5.10.tar.bz2
tar xjvf bwa-0.5.10.tar.bz2
# build the bwa package
cd bwa-0.5.10
make
# go back to the exome/bin dir
cd ..
# create hardlink to the hg19.fasta file
mkdir ../ref/hg19_bwa-0.5.10
ln ../ref/hg19/hg19.fasta ../ref/hg19_bwa-0.5.10/hg19.fasta
# in script dir, create index
cd ../script
qsub bwa_index.sh bwa-0.5.10
Download the Picard package.
# download picard
wget http://sourceforge.net/projects/picard/files/picard-tools/1.39/picard-tools-1.39.zip
wget http://sourceforge.net/projects/picard/files/picard-tools/1.39/README.txt
# unpack it
unzip picard-tools-1.39.zip
# mv readme for later convenience
mv README.txt picard-tools-1.39.README.txt
Download the GATK (The Genome Analysis Toolkit) package.
# download GATK
wget ftp://ftp.broadinstitute.org/pub/gsa/GenomeAnalysisTK/GenomeAnalysisTK-1.4-21-g30b937d.tar.bz2
# unpack it
tar xjvf GenomeAnalysisTK-1.4-21-g30b937d.tar.bz2
Download the SAMtools package.
# download samtools-0.1.15
wget -nc http://sourceforge.net/projects/samtools/files/samtools/0.1.15/README
wget -nc http://sourceforge.net/projects/samtools/files/samtools/0.1.15/samtools-0.1.15.tar.bz2
# unpack it
tar xjvf samtools-0.1.15.tar.bz2
mv README samtools-0.1.15.README
# build the samtools package
cd samtools-0.1.15
make
# go back to the exome/bin dir
cd ..
# download samtools-0.1.18 for EBCall
wget -nc http://sourceforge.net/projects/samtools/files/samtools/0.1.18/README
wget -nc http://sourceforge.net/projects/samtools/files/samtools/0.1.18/samtools-0.1.18.tar.bz2
# unpack it
tar xjvf samtools-0.1.18.tar.bz2
mv README samtools-0.1.18.README
# build the samtools package
cd samtools-0.1.18
make
Download the bedtools package.
# download bedtools
wget -nc http://bedtools.googlecode.com/files/BEDTools.v2.14.3.tar.gz
# unpack it
tar xzvf BEDTools.v2.14.3.tar.gz
# build the bedtools package
cd BEDTools-Version-2.14.3
make
Download the cutadapt package.
# download cutadapt
wget -nc http://cutadapt.googlecode.com/files/cutadapt-1.0.tar.gz
# unpack it
tar xzvf cutadapt-1.0.tar.gz
# build the cutadapt package
cd cutadapt-1.0
python setup.py build_ext -i
Download the ANNOVAR package. The ANNOVAR Perl scripts help you download various databases, including dbSNP build 131. To use this software, you need to register at the ANNOVAR site; you will then receive an email at the address you registered with. The email contains the link to the package.
ANNOVAR should be placed under the exome/bin directory.
# download annovar
wget -nc ${the link address written in your email}
# unpack the package (the annovar package you just downloaded)
tar xzvf annovar.tar.gz
# download annotation databases
./annovar/annotate_variation.pl -buildver hg19 -downdb gene annovar/humandb/
./annovar/annotate_variation.pl -buildver hg19 -downdb -webfrom annovar mce46way annovar/humandb/
./annovar/annotate_variation.pl -buildver hg19 -downdb segdup annovar/humandb/
./annovar/annotate_variation.pl -buildver hg19 -downdb -webfrom annovar ljb_all annovar/humandb/
./annovar/annotate_variation.pl -buildver hg19 -downdb -webfrom annovar 1000g2010nov_all annovar/humandb/
./annovar/annotate_variation.pl -buildver hg19 -downdb -webfrom annovar snp131 annovar/humandb/
./annovar/annotate_variation.pl -buildver hg19 -downdb -webfrom annovar avsift annovar/humandb/
./annovar/annotate_variation.pl -buildver hg19 -downdb -webfrom annovar ljb_pp2 annovar/humandb/
./annovar/annotate_variation.pl -buildver hg19 -downdb -webfrom annovar ljb_phylop annovar/humandb/
./annovar/annotate_variation.pl -buildver hg19 -downdb -webfrom annovar ljb_mt annovar/humandb/
./annovar/annotate_variation.pl -buildver hg19 -downdb -webfrom annovar ljb_lrt annovar/humandb/
./annovar/annotate_variation.pl -buildver hg19 -downdb -webfrom annovar esp5400_all annovar/humandb/
GATK (The Genome Analysis Toolkit) needs a dict file (hg19.dict in your case) for realignment. When creating the dict file, it is important that exon_pipeline.config is configured properly.
Please see the setup
# change dir to the script dir
cd exome/script
python realign_gatk_setup.py
# check status is 0 (program exited normally)
job id : 477616 failed =0 exit_status=0
ls -l ../ref/hg19_bwa-0.5.10/hg19.dict # check the file is there
Download the DNAcopy package from the Bioconductor site and put it under the exome/bin directory. (Make sure you are in the exome/bin directory before you start wget.)
mkdir -p ~/.R
export R_LIBS=~/.R
wget -nc http://www.bioconductor.org/packages/2.10/bioc/src/contrib/DNAcopy_1.30.0.tar.gz
cp DNAcopy_1.30.0.tar.gz ~/.R
R CMD INSTALL DNAcopy_1.30.0.tar.gz
Start R and check that library(DNAcopy) loads without an error. The R_LIBS environment variable must be exported before you use the library.
R
library(DNAcopy)