How to Use Genomon-exome


There are 5 main scripts in Genomon-exome for analysis.

* If the FASTQ files are compressed, unpacking may occupy some amount of time.

Putting FASTQ Files in The Target Directory


We'll store fastq files in an organized manner using tools. There two tools, Type 1 and Type 2 for this purpose. If your fastq files aren't output in the manner explained below, please prepare directories yourself.

Type 1:
This is for fastq files processed with CASAVA ver. 1.8 or later. Please put your fastq files in the exome/data/input_org directory.

		# make <sample name>_<RUN date> dir
		exome/data/input_org/sample01_120405/sample01T
		exome/data/input_org/sample01_120405/sample01N
		
		# Here's the contents of the sample01T dir
		# If your sample01T dir looks something like below, use type 1 tool
		-rw-r--r--  1 genomon genomon 1234567890 Aug 10 00:00 sample01T_ACGTAC_L001_R1_001.fastq.gz
		-rw-r--r--  1 genomon genomon 1234567890 Aug 10 00:00 sample01T_ACGTAC_L001_R1_002.fastq.gz
		-rw-r--r--  1 genomon genomon 1234567890 Aug 10 00:00 sample01T_ACGTAC_L001_R1_003.fastq.gz
		-rw-r--r--  1 genomon genomon 1234567890 Aug 10 00:00 sample01T_ACGTAC_L001_R1_004.fastq.gz
		-rw-r--r--  1 genomon genomon 1234567890 Aug 10 00:00 sample01T_ACGTAC_L001_R1_005.fastq.gz
		-rw-r--r--  1 genomon genomon 1234567890 Aug 10 00:00 sample01T_ACGTAC_L001_R1_006.fastq.gz
		-rw-r--r--  1 genomon genomon 1234567890 Aug 10 00:00 sample01T_ACGTAC_L001_R1_007.fastq.gz
		-rw-r--r--  1 genomon genomon 1234567890 Aug 10 00:00 sample01T_ACGTAC_L001_R2_001.fastq.gz
		-rw-r--r--  1 genomon genomon 1234567890 Aug 10 00:00 sample01T_ACGTAC_L001_R2_002.fastq.gz
		-rw-r--r--  1 genomon genomon 1234567890 Aug 10 00:00 sample01T_ACGTAC_L001_R2_003.fastq.gz
		-rw-r--r--  1 genomon genomon 1234567890 Aug 10 00:00 sample01T_ACGTAC_L001_R2_004.fastq.gz
		-rw-r--r--  1 genomon genomon 1234567890 Aug 10 00:00 sample01T_ACGTAC_L001_R2_005.fastq.gz
		-rw-r--r--  1 genomon genomon 1234567890 Aug 10 00:00 sample01T_ACGTAC_L001_R2_006.fastq.gz
		-rw-r--r--  1 genomon genomon 1234567890 Aug 10 00:00 sample01T_ACGTAC_L001_R2_007.fastq.gz

If you figured you should use Type 1 tool, in the exome/script directory, execute:

		qsub putFastqDir.sh [options] arg1 arg2 arg3 arg4
		# arg1: input dir name (full path required)
		# arg2: sample name
		# arg3: Run Date, the date the data were taken. It is important as the same sample gets sequenced more than once.
		# arg4: sample type - normal, tumor, TAM, AMKL, CR, you name it.
		# options
		# The -f option should be used if fileter is required.
		# If there are reads to be filtered in casava1.8, use the -f option.
		
		# For each dir, exec the shell script
		qsub putFastqDir.sh -f /home/genomon/exome/data/input_org/sample01_120402/sampleT sample01 120405 tumor
		qsub putFastqDir.sh -f /home/genomon/exome/data/input_org/sample01_120402/sampleN sample01 120405 normal
		
		# Below you see the results. Types 1 and 2 share the output dir structure and the format.
		# Dir structure and the format: 
		# data/input/{sample name}/{RUN date}/{sample type}/{lane #}/{sample name}_${sample type}_R{pair}.fastq 
		../data/input/sample01/120405/tumor/1/sample01_tumor_R1.fastq
		../data/input/sample01/120405/tumor/1/sample01_tumor_R2.fastq
		../data/input/sample01/120405/normal/2/sample01_normal_R1.fastq
		../data/input/sample01/120405/normal/2/sample01_normal_R2.fastq

Type 2:
If above doesn't apply to your dataset, your fastq files should be found under the exome/data/input_org directory.

		# put fastq file under the <sample name>_<Run date> dir under the input_org dir
		exome/data/input_org/sample01_120405/s_1_1_sequence.txt
		exome/data/input_org/sample01_120405/s_1_2_sequence.txt
		exome/data/input_org/sample01_120405/s_2_1_sequence.txt
		exome/data/input_org/sample01_120405/s_2_2_sequence.txt

This Type 2 tool needs to be exec'ed for each file in the exome/script directory.

		qsub putFastqFile.sh arg1 arg2 arg3 arg4 arg5 arg6 arg7
		# arg1: input file (full path)
		# arg2: sample name
		# arg3: Run date, the date data were taken. The same data gets sequenced twice
		# arg4: sample type - normal, tumor, TAM, AMKL, CR, you name it.
		# arg5: Lane # (1-8)
		# arg6: pair - 1 or 2
		#   1: reads from Read1 sequence primary
		#   2: reads from Read2 sequence primary
		# options
		# The -f option should be used if fileter is required.
		# If there are reads to be filtered in casava1.8, use the -f option.
		
		# execute shell per file
		qsub putFastqFile.sh -f ../data/input_org/sample01_120405/s_1_1_sequence.txt sample01 120405 tumor 1 1
		qsub putFastqFile.sh -f ../data/input_org/sample01_120405/s_1_2_sequence.txt sample01 120405 tumor 1 2
		qsub putFastqFile.sh -f ../data/input_org/sample01_120405/s_2_1_sequence.txt sample01 120405 normal 2 1
		qsub putFastqFile.sh -f ../data/input_org/sample01_120405/s_2_2_sequence.txt sample01 120405 normal 2 2
		
		# Results are shown below. Output dir structure and the format are the same with Type 1
		# Dir strcture and format: 
		# data/input/{sample name}/{RUN date}/{sample type}/{lane #}/{sample name}_${sample type}_R{pair}.fastq 
		../data/input/sample01/120405/tumor/1/sample01_tumor_R1.fastq
		../data/input/sample01/120405/tumor/1/sample01_tumor_R2.fastq
		../data/input/sample01/120405/normal/2/sample01_normal_R1.fastq
		../data/input/sample01/120405/normal/2/sample01_normal_R2.fastq

That's all we have to do to store fastq files.

Once the directories are made and file names are fixed, please don't touch them as programs used in the later steps assume that.

1. Alignment & Generate Summary Table


After alignment is finisehd, we calculate alignment rates and coverages. The shell should be executed in the exome/script directory.

		qsub map_bwa.sh [ options ] arg1 arg2 arg3
		# arg1: sample name
		# arg2: run date (as we might sequence the same data more than once)
		# arg3: capture region file name (Just put file name only, no file extensions. This is the file we uploaded Item 11.)
		# options
		# -j {s|m|l default is set to m}
		# select batch queue type s=sjob, m=mjob, and l=ljob
		# The -s option should be added to convert the quality scores to Sanger format
		# The -a option should be added to remove adaptor sequences from the fastq data
		# specify the adapter sequences in the exon_pipeline.config
		
		# examples
		qsub map_bwa.sh sample01 120405 SureSelect50M -a -s
		# the results
		../data/output/sample01/120405/map_bwa/tumor/ga.bam
		../data/output/sample01/120405/map_bwa/normal/ga.bam
		# alignment rate and other results
		../data/output/sample01/120405/summary/tumor/
		../data/output/sample01/120405/summary/normal/
		# alignment rate by lane #. if there's only one lane under the sample type dir, the results should be the same with the files found under the sample type dir
		../data/output/sample01/120405/summary/tumor/1/
		../data/output/sample01/120405/summary/normal/2/

After you executed the qsub command, the job id will be returned on the screen. Please take the ID number, it is needed to take a look at the logs later.

2. Merge Mapped Bam Files (If Applicable)


If you have couples of bam files that are made from the same sample (say tumor sample), but sequnced independently, you cam merge them together. Multple sequencing may be needed when we need more depth, if that happens you can merge the mapping result bam files together. A "merge" dir will be created and the merged bam files are put in there. Please execute the script in the exome/script directory.

		qsub merge_bamfiles.sh [ options ] arg1 arg2 arg3
		# arg1: bam files to be merged. comma seperated full path required.
		# arg2: sample name
		# arg3: sample type - normal, tumor, ... etc.
		# options
		# -j {s|m|l default is set to m}
		# select the batch queue to use for the job. s=sjob,m=mjob,l=ljob
		# -t {Run date, default is set to merge} 
		# -g {Bam file output dir name, default is set to map_bwa} 
		# -f {Bam filename, default is set to ga.bam} 
		# so, the output dir is nested further like below
		# exome/data/output/<sample name>/<Run date>/<bam file output dir>/<bam filename>
		
		# example:
		qsub merge_bamfiles.sh /home/testuser/exome/data/output/sample01/120405/map_bwa/tumor/ga.bam,home/testuser/exome/data/output/sample01/110415/map_bwa/tumor/ga.bam sample01 tumor
		# The resulting bam file, ga.bam is placed under the dir below.
		../data/output/sample01/merge/map_bwa/tumor/ga.bam

After you executed the qsub command, the job id will be returned on the screen. Please take the ID number, it is needed to take a look at the logs later.

3. Realignment (If Necessary)


Here we do realignment. The job submission scprit should be executed from the exome/script directory.

		qsub realign_gatk.sh [ options ] arg1 arg2 arg3 arg4 arg5
		# arg1: sample name
		# arg2: Run date, if there's a merge dir Run date is set to merge.
		# arg3: sample type - if you have normal and tumor sample, arg3 should be set to normal 
		#  arg4: sample type - to be compared against the sample type specified by arg3
		# arg5: sample type another sample to be compared against the normal
		# options
		# -j {s|m|l default is set to m} 
		# select the batch queue for the job s=sjob,m=mjob,l=ljob
		# -i {interval lis, default is set to a string, "interval_list_hg19_nogap"}
		# The -i option is used to select a dir that contains files to slice up the bam file (we take the advantage of the parallelism)
		# examples:
		qsub realign_gatk.sh sample01 120405 normal tumor
		# the results
		../data/output/sample01/120405/realign_gatk/tumor/realigned.bam
		../data/output/sample01/120405/realign_gatk/normal/realigned.bam

After you executed the qsub command, the job id will be returned on the screen. Please take the ID number, it is needed to take a look at the logs later.

4. Mutation Calling & Annotation (Fisher's exact test)


We execute a data analysis shell script that in the exome/script directory.

		qsub fisher_test.sh [ options ] arg1 arg2 arg3 arg4 arg5
		# arg1: sample name
		# arg2: date you _execute_ this script. shouldn't be the data sequenced date. This is provided as you might need to do this couples of times for the same sample(after adjusting the alignment and base qualities and max Indels)
		# arg3: sample type - if you have normal and tumor samples, arg3 is set to normal 
		# arg4: sample type - to be compared against the sample type specified by arg3
		# arg5: sample type (optional) - another sample to be compared against the normal
		
		# options
		# -j {s|m|l default is set to m}
		# select the batch queue for the job. s=sjob,m=mjob,l=ljob 
		# -m {Mapping_Quality; default val is 30} 
		# For Fisher test. aligned reads whose mapping quality below the threshold are dropped
		# -b {Base_Quality; default val is 15} 
		# For Fisher test. Bases below this threashold are dropped
		# -x {Max_Indel; default val is 2} 
		# For Fisher test. If a read have more than this val, dropped.
		# -i {Interval list; default string is "interval_list_hg19_nogap"}
		# The -i option is used to select a dir that contains files to slice up the bam file (we take the advantage of the parallelism)
		# -t {RUN date; defaul is set to merge}
		# example
		# exome/data/output/sample01/<Run date>/realign_gatk/tumor/realigned.bam
		# -g {Bam file dir name; default name is "realign_gatk"}
		# example:
		# exome/data/output/sample01/merge/<bam file dir name>/tumor/realigned.bam
		# -f {Bam file name; default bam file name is realigned.bam}
		# example:
		# you need to include the .bam extension in the name
		# exome/data/output/sample01/merge/realign_gatk/tumor/<bam file name>
		
		# examples:
		# Use realigned bam files
		# exome/data/output/sample01/merge/realign_gatk/normal/readligned.bam 
		# exome/data/output/sample01/merge/realign_gatk/tumor/readligned.bam 
		qsub fisher_test.sh sample01 120430 normal tumor
# Use non merged, but realigned bam files # exome/data/output/sample01/120415/realign_gatk/normal/readligned.bam # exome/data/output/sample01/120415/realign_gatk/tumor/readligned.bam qsub fisher_test.sh sample01 120430 normal tumor -t 120415
# Use merged, but not realigned bam files # exome/data/output/sample01/merge/map_bwa/normal/ga.bam # exome/data/output/sample01/merge/map_bwa/tumor/ga.bam qsub fisher_test.sh sample01 120430 normal tumor -g map_bwa -f ga.bam
# Use unmerged and not realigned bam files # exome/data/output/sample01/120415/map_bwa/normal/ga.bam # exome/data/output/sample01/120415/map_bwa/tumor/ga.bam qsub fisher_test.sh sample01 120430 normal tumor -t 120415 -g map_bwa -f ga.bam # results are stored under the fisher_test dir exome/data/result/sample01_120430/fisher_test/sum_sample01_120430_tumor.genome.result.txt # result file exome/data/result/sample01_120430/fisher_test/sum_sample01_120430_tumor.exome.result.txt # exome info is extracted from genome.result.txt # If your jobs failed, go look at files under the result dir. These are useful when debugging, please keep them. exome/data/result/sample01_120430/fisher_test/output.tumor_normal.anno exome/data/result/sample01_120430/fisher_test/output.tumor.anno exome/data/result/sample01_120430/fisher_test/output.normal.anno exome/data/result/sample01_120430/fisher_test/enpty_tmp_file.list # These are the temoprary files. In the next release these are removed. exome/data/result/sample01_120430/fisher_test/sum_sample01_120430_tumor.exome_summary.txt exome/data/result/sample01_120430/fisher_test/sum_sample01_120430_tumor.genome_summary.txt

After you executed the qsub command, the job id will be returned on the screen. Please take the ID number, it is needed to take a look at the logs later.

5. Copy Number Data Analysis


The script should be executed in the exome/copy_number/script directory.

			bash copy_num.sh arg1 arg2 arg3 arg4 (arg5)
			# arg1: input tumor bam file
			# arg2: input normal bam file
			# arg3: output dir name
			# arg4: tag name: tag jobs uniquely and it is used to set the output filename (<sample name>_rundate would be fine)
			# arg5: optional you can change the config file. copy_number/script/copynum.env is used by default.
			
			# example:
			bash copy_num.sh /home/genomon/exome/data/output/sample01/120430/map_bwa/tumor/ga.bam /home/genomon/exome/data/output/sample01/120430/map_bwa/normal/ga.bam /home/genomon/exome/data/output/sample01/120430/copyNum Sample01_120430
			# results are found here
			../data/output/sample01/120430/copyNum/sample01_120430.tumor_as.pdf
			../data/output/sample01/120430/copyNum/sample01_120430.tumor_total.pdf
			
			# If you submit two jobs simultaneously, tags should be unique.
			bash copy_num.sh /home/genomon/exome/data/output/sample01/120430/map_bwa/TAM/ga.bam /home/genomon/exome/data/output/sample01/120430/map_bwa/normal/ga.bam /home/genomon/exome/data/output/sample01/120430/copyNum Sample01_120430_TAM
			bash copy_num.sh /home/genomon/exome/data/output/sample01/120430/map_bwa/AMKL/ga.bam /home/genomon/exome/data/output/sample01/120430/map_bwa/normal/ga.bam /home/genomon/exome/data/output/sample01/120430/copyNum Sample01_120430_AMKL

You can check the completion status here.

6. Mutation Calling & Annotation (Empirical Baysian mutation Calling)


Create a list of normal reference samples. This is the list of absolute paths to .bam files for non-paired normal reference samples. Please name the text file as you like (e.g., myNormalRef.txt), and list the paths of .bam files as follows:

			/home/genomon/exome/data/sequence/normalreference1.bam
			/home/genomon/exome/data/sequence/normalreference2.bam
			... 
			/home/genomon/exome/data/sequence/normalreference10.bam  # Removing last blank line.

We compile C++ programs

			# in the exome/eb_call directory.
			make

Type the following command

				bash ebCall.sh arg1 arg2 arg3 arg4 arg5 (arg6)
				# arg1: input tumor bam file
				# arg2: input normal bam file
				# arg3: output dir name
				# arg4: list of normal reference samples
				# arg5: tag name: tag jobs uniquely and it is used to set the output filename (<sample name>_rundate would be fine)
				# arg6: optional you can change the config file. ebcall/config.sh is used by default.
				
				# example:
				bash ebCall.sh /home/genomon/exome/data/output/sample01/120430/map_bwa/tumor/ga.bam /home/genomon/exome/data/output/sample01/120430/map_bwa/normal/ga.bam /home/genomon/exome/data/output/sample01/120430/ebcall /home/genomon/exome/ebcall/list/myNormalRef.txt Sample01_120430
				# results are found here
				../data/output/sample01/120430/ebcall/Sample01_120430/output.txt
				
				# If you submit two jobs simultaneously, tags should be unique.
				bash ebCall.sh /home/genomon/exome/data/output/sample01/120430/map_bwa/TAM/ga.bam /home/genomon/exome/data/output/sample01/120430/map_bwa/normal/ga.bam /home/genomon/exome/data/output/sample01/120430_TAM/ebcall /home/genomon/exome/ebcall/list/myNormalRef.txt Sample01_120430_TAM
				bash ebCall.sh /home/genomon/exome/data/output/sample01/120430/map_bwa/AMKL/ga.bam /home/genomon/exome/data/output/sample01/120430/map_bwa/normal/ga.bam /home/genomon/exome/data/output/sample01/120430_AMKL/ebcall /home/genomon/exome/ebcall/list/myNormalRef.txt Sample01_120430_AMKL

You can check the completion status here.

^ Go to Top