There are 5 main scripts in Genomon-exome for analysis.
Links inside this page:
Putting Fastq Files in The Target Directory
Command1. Alignment & Generate Summary Table
Command2. Merge Mapped Bam
Command3. Realignment
Command4. Mutation Calling & Annotation (Fisher's exact test)
Command5. Copy Number Data Analysis
Command6. Mutation Calling & Annotation (Empirical Baysian mutation Calling)
We'll store fastq files in an organized manner using tools. There two tools, Type 1 and Type 2 for this purpose. If your fastq files aren't output in the manner explained below, please prepare directories yourself.
Type 1:
This is for fastq files processed with CASAVA ver. 1.8 or later. Please put your fastq files in the exome/data/input_org directory.
# make <sample name>_<RUN date> dir exome/data/input_org/sample01_120405/sample01T exome/data/input_org/sample01_120405/sample01N # Here's the contents of the sample01T dir # If your sample01T dir looks something like below, use type 1 tool -rw-r--r-- 1 genomon genomon 1234567890 Aug 10 00:00 sample01T_ACGTAC_L001_R1_001.fastq.gz -rw-r--r-- 1 genomon genomon 1234567890 Aug 10 00:00 sample01T_ACGTAC_L001_R1_002.fastq.gz -rw-r--r-- 1 genomon genomon 1234567890 Aug 10 00:00 sample01T_ACGTAC_L001_R1_003.fastq.gz -rw-r--r-- 1 genomon genomon 1234567890 Aug 10 00:00 sample01T_ACGTAC_L001_R1_004.fastq.gz -rw-r--r-- 1 genomon genomon 1234567890 Aug 10 00:00 sample01T_ACGTAC_L001_R1_005.fastq.gz -rw-r--r-- 1 genomon genomon 1234567890 Aug 10 00:00 sample01T_ACGTAC_L001_R1_006.fastq.gz -rw-r--r-- 1 genomon genomon 1234567890 Aug 10 00:00 sample01T_ACGTAC_L001_R1_007.fastq.gz -rw-r--r-- 1 genomon genomon 1234567890 Aug 10 00:00 sample01T_ACGTAC_L001_R2_001.fastq.gz -rw-r--r-- 1 genomon genomon 1234567890 Aug 10 00:00 sample01T_ACGTAC_L001_R2_002.fastq.gz -rw-r--r-- 1 genomon genomon 1234567890 Aug 10 00:00 sample01T_ACGTAC_L001_R2_003.fastq.gz -rw-r--r-- 1 genomon genomon 1234567890 Aug 10 00:00 sample01T_ACGTAC_L001_R2_004.fastq.gz -rw-r--r-- 1 genomon genomon 1234567890 Aug 10 00:00 sample01T_ACGTAC_L001_R2_005.fastq.gz -rw-r--r-- 1 genomon genomon 1234567890 Aug 10 00:00 sample01T_ACGTAC_L001_R2_006.fastq.gz -rw-r--r-- 1 genomon genomon 1234567890 Aug 10 00:00 sample01T_ACGTAC_L001_R2_007.fastq.gz
If you figured you should use Type 1 tool, in the exome/script directory, execute:
qsub putFastqDir.sh [options] arg1 arg2 arg3 arg4 # arg1: input dir name (full path required) # arg2: sample name # arg3: Run Date, the date the data were taken. It is important as the same sample gets sequenced more than once. # arg4: sample type - normal, tumor, TAM, AMKL, CR, you name it. # options # The -f option should be used if fileter is required. # If there are reads to be filtered in casava1.8, use the -f option. # For each dir, exec the shell script qsub putFastqDir.sh -f /home/genomon/exome/data/input_org/sample01_120402/sampleT sample01 120405 tumor qsub putFastqDir.sh -f /home/genomon/exome/data/input_org/sample01_120402/sampleN sample01 120405 normal # Below you see the results. Types 1 and 2 share the output dir structure and the format. # Dir structure and the format: # data/input/{sample name}/{RUN date}/{sample type}/{lane #}/{sample name}_${sample type}_R{pair}.fastq ../data/input/sample01/120405/tumor/1/sample01_tumor_R1.fastq ../data/input/sample01/120405/tumor/1/sample01_tumor_R2.fastq ../data/input/sample01/120405/normal/2/sample01_normal_R1.fastq ../data/input/sample01/120405/normal/2/sample01_normal_R2.fastq
Type 2: If above doesn't apply to your dataset, your fastq files should be found under the exome/data/input_org directory.
# put fastq file under the <sample name>_<Run date> dir under the input_org dir exome/data/input_org/sample01_120405/s_1_1_sequence.txt exome/data/input_org/sample01_120405/s_1_2_sequence.txt exome/data/input_org/sample01_120405/s_2_1_sequence.txt exome/data/input_org/sample01_120405/s_2_2_sequence.txt
This Type 2 tool needs to be exec'ed for each file in the exome/script directory.
qsub putFastqFile.sh arg1 arg2 arg3 arg4 arg5 arg6 arg7 # arg1: input file (full path) # arg2: sample name # arg3: Run date, the date data were taken. The same data gets sequenced twice # arg4: sample type - normal, tumor, TAM, AMKL, CR, you name it. # arg5: Lane # (1-8) # arg6: pair - 1 or 2 # 1: reads from Read1 sequence primary # 2: reads from Read2 sequence primary # options # The -f option should be used if fileter is required. # If there are reads to be filtered in casava1.8, use the -f option. # execute shell per file qsub putFastqFile.sh -f ../data/input_org/sample01_120405/s_1_1_sequence.txt sample01 120405 tumor 1 1 qsub putFastqFile.sh -f ../data/input_org/sample01_120405/s_1_2_sequence.txt sample01 120405 tumor 1 2 qsub putFastqFile.sh -f ../data/input_org/sample01_120405/s_2_1_sequence.txt sample01 120405 normal 2 1 qsub putFastqFile.sh -f ../data/input_org/sample01_120405/s_2_2_sequence.txt sample01 120405 normal 2 2 # Results are shown below. Output dir structure and the format are the same with Type 1 # Dir strcture and format: # data/input/{sample name}/{RUN date}/{sample type}/{lane #}/{sample name}_${sample type}_R{pair}.fastq ../data/input/sample01/120405/tumor/1/sample01_tumor_R1.fastq ../data/input/sample01/120405/tumor/1/sample01_tumor_R2.fastq ../data/input/sample01/120405/normal/2/sample01_normal_R1.fastq ../data/input/sample01/120405/normal/2/sample01_normal_R2.fastq
That's all we have to do to store fastq files.
Once the directories are made and file names are fixed, please don't touch them as programs used in the later steps assume that.
After alignment is finisehd, we calculate alignment rates and coverages. The shell should be executed in the exome/script directory.
qsub map_bwa.sh [ options ] arg1 arg2 arg3 # arg1: sample name # arg2: run date (as we might sequence the same data more than once) # arg3: capture region file name (Just put file name only, no file extensions. This is the file we uploaded Item 11.) # options # -j {s|m|l default is set to m} # select batch queue type s=sjob, m=mjob, and l=ljob # The -s option should be added to convert the quality scores to Sanger format # The -a option should be added to remove adaptor sequences from the fastq data # specify the adapter sequences in the exon_pipeline.config # examples qsub map_bwa.sh sample01 120405 SureSelect50M -a -s # the results ../data/output/sample01/120405/map_bwa/tumor/ga.bam ../data/output/sample01/120405/map_bwa/normal/ga.bam # alignment rate and other results ../data/output/sample01/120405/summary/tumor/ ../data/output/sample01/120405/summary/normal/ # alignment rate by lane #. if there's only one lane under the sample type dir, the results should be the same with the files found under the sample type dir ../data/output/sample01/120405/summary/tumor/1/ ../data/output/sample01/120405/summary/normal/2/
After you executed the qsub command, the job id will be returned on the screen. Please take the ID number, it is needed to take a look at the logs later.
If you have couples of bam files that are made from the same sample (say tumor sample), but sequnced independently, you cam merge them together. Multple sequencing may be needed when we need more depth, if that happens you can merge the mapping result bam files together. A "merge" dir will be created and the merged bam files are put in there. Please execute the script in the exome/script directory.
qsub merge_bamfiles.sh [ options ] arg1 arg2 arg3 # arg1: bam files to be merged. comma seperated full path required. # arg2: sample name # arg3: sample type - normal, tumor, ... etc. # options # -j {s|m|l default is set to m} # select the batch queue to use for the job. s=sjob,m=mjob,l=ljob # -t {Run date, default is set to merge} # -g {Bam file output dir name, default is set to map_bwa} # -f {Bam filename, default is set to ga.bam} # so, the output dir is nested further like below # exome/data/output/<sample name>/<Run date>/<bam file output dir>/<bam filename> # example: qsub merge_bamfiles.sh /home/testuser/exome/data/output/sample01/120405/map_bwa/tumor/ga.bam,home/testuser/exome/data/output/sample01/110415/map_bwa/tumor/ga.bam sample01 tumor # The resulting bam file, ga.bam is placed under the dir below. ../data/output/sample01/merge/map_bwa/tumor/ga.bam
After you executed the qsub command, the job id will be returned on the screen. Please take the ID number, it is needed to take a look at the logs later.
Here we do realignment. The job submission scprit should be executed from the exome/script directory.
qsub realign_gatk.sh [ options ] arg1 arg2 arg3 arg4 arg5 # arg1: sample name # arg2: Run date, if there's a merge dir Run date is set to merge. # arg3: sample type - if you have normal and tumor sample, arg3 should be set to normal # arg4: sample type - to be compared against the sample type specified by arg3 # arg5: sample type another sample to be compared against the normal # options # -j {s|m|l default is set to m} # select the batch queue for the job s=sjob,m=mjob,l=ljob # -i {interval lis, default is set to a string, "interval_list_hg19_nogap"} # The -i option is used to select a dir that contains files to slice up the bam file (we take the advantage of the parallelism) # examples: qsub realign_gatk.sh sample01 120405 normal tumor # the results ../data/output/sample01/120405/realign_gatk/tumor/realigned.bam ../data/output/sample01/120405/realign_gatk/normal/realigned.bam
After you executed the qsub command, the job id will be returned on the screen. Please take the ID number, it is needed to take a look at the logs later.
We execute a data analysis shell script that in the exome/script directory.
qsub fisher_test.sh [ options ] arg1 arg2 arg3 arg4 arg5 # arg1: sample name # arg2: date you _execute_ this script. shouldn't be the data sequenced date. This is provided as you might need to do this couples of times for the same sample(after adjusting the alignment and base qualities and max Indels) # arg3: sample type - if you have normal and tumor samples, arg3 is set to normal # arg4: sample type - to be compared against the sample type specified by arg3 # arg5: sample type (optional) - another sample to be compared against the normal # options # -j {s|m|l default is set to m} # select the batch queue for the job. s=sjob,m=mjob,l=ljob # -m {Mapping_Quality; default val is 30} # For Fisher test. aligned reads whose mapping quality below the threshold are dropped # -b {Base_Quality; default val is 15} # For Fisher test. Bases below this threashold are dropped # -x {Max_Indel; default val is 2} # For Fisher test. If a read have more than this val, dropped. # -i {Interval list; default string is "interval_list_hg19_nogap"} # The -i option is used to select a dir that contains files to slice up the bam file (we take the advantage of the parallelism) # -t {RUN date; defaul is set to merge} # example # exome/data/output/sample01/<Run date>/realign_gatk/tumor/realigned.bam # -g {Bam file dir name; default name is "realign_gatk"} # example: # exome/data/output/sample01/merge/<bam file dir name>/tumor/realigned.bam # -f {Bam file name; default bam file name is realigned.bam} # example: # you need to include the .bam extension in the name # exome/data/output/sample01/merge/realign_gatk/tumor/<bam file name> # examples: # Use realigned bam files # exome/data/output/sample01/merge/realign_gatk/normal/readligned.bam # exome/data/output/sample01/merge/realign_gatk/tumor/readligned.bam qsub fisher_test.sh sample01 120430 normal tumor
# Use non merged, but realigned bam files # exome/data/output/sample01/120415/realign_gatk/normal/readligned.bam # exome/data/output/sample01/120415/realign_gatk/tumor/readligned.bam qsub fisher_test.sh sample01 120430 normal tumor -t 120415
# Use merged, but not realigned bam files # exome/data/output/sample01/merge/map_bwa/normal/ga.bam # exome/data/output/sample01/merge/map_bwa/tumor/ga.bam qsub fisher_test.sh sample01 120430 normal tumor -g map_bwa -f ga.bam
# Use unmerged and not realigned bam files # exome/data/output/sample01/120415/map_bwa/normal/ga.bam # exome/data/output/sample01/120415/map_bwa/tumor/ga.bam qsub fisher_test.sh sample01 120430 normal tumor -t 120415 -g map_bwa -f ga.bam # results are stored under the fisher_test dir exome/data/result/sample01_120430/fisher_test/sum_sample01_120430_tumor.genome.result.txt # result file exome/data/result/sample01_120430/fisher_test/sum_sample01_120430_tumor.exome.result.txt # exome info is extracted from genome.result.txt # If your jobs failed, go look at files under the result dir. These are useful when debugging, please keep them. exome/data/result/sample01_120430/fisher_test/output.tumor_normal.anno exome/data/result/sample01_120430/fisher_test/output.tumor.anno exome/data/result/sample01_120430/fisher_test/output.normal.anno exome/data/result/sample01_120430/fisher_test/enpty_tmp_file.list # These are the temoprary files. In the next release these are removed. exome/data/result/sample01_120430/fisher_test/sum_sample01_120430_tumor.exome_summary.txt exome/data/result/sample01_120430/fisher_test/sum_sample01_120430_tumor.genome_summary.txt
After you executed the qsub command, the job id will be returned on the screen. Please take the ID number, it is needed to take a look at the logs later.
The script should be executed in the exome/copy_number/script directory.
bash copy_num.sh arg1 arg2 arg3 arg4 (arg5) # arg1: input tumor bam file # arg2: input normal bam file # arg3: output dir name # arg4: tag name: tag jobs uniquely and it is used to set the output filename (<sample name>_rundate would be fine) # arg5: optional you can change the config file. copy_number/script/copynum.env is used by default. # example: bash copy_num.sh /home/genomon/exome/data/output/sample01/120430/map_bwa/tumor/ga.bam /home/genomon/exome/data/output/sample01/120430/map_bwa/normal/ga.bam /home/genomon/exome/data/output/sample01/120430/copyNum Sample01_120430 # results are found here ../data/output/sample01/120430/copyNum/sample01_120430.tumor_as.pdf ../data/output/sample01/120430/copyNum/sample01_120430.tumor_total.pdf # If you submit two jobs simultaneously, tags should be unique. bash copy_num.sh /home/genomon/exome/data/output/sample01/120430/map_bwa/TAM/ga.bam /home/genomon/exome/data/output/sample01/120430/map_bwa/normal/ga.bam /home/genomon/exome/data/output/sample01/120430/copyNum Sample01_120430_TAM bash copy_num.sh /home/genomon/exome/data/output/sample01/120430/map_bwa/AMKL/ga.bam /home/genomon/exome/data/output/sample01/120430/map_bwa/normal/ga.bam /home/genomon/exome/data/output/sample01/120430/copyNum Sample01_120430_AMKL
You can check the completion status here.
Create a list of normal reference samples. This is the list of absolute paths to .bam files for non-paired normal reference samples. Please name the text file as you like (e.g., myNormalRef.txt), and list the paths of .bam files as follows:
/home/genomon/exome/data/sequence/normalreference1.bam /home/genomon/exome/data/sequence/normalreference2.bam ... /home/genomon/exome/data/sequence/normalreference10.bam # Removing last blank line.
We compile C++ programs
# in the exome/eb_call directory. make
Type the following command
bash ebCall.sh arg1 arg2 arg3 arg4 arg5 (arg6) # arg1: input tumor bam file # arg2: input normal bam file # arg3: output dir name # arg4: list of normal reference samples # arg5: tag name: tag jobs uniquely and it is used to set the output filename (<sample name>_rundate would be fine) # arg6: optional you can change the config file. ebcall/config.sh is used by default. # example: bash ebCall.sh /home/genomon/exome/data/output/sample01/120430/map_bwa/tumor/ga.bam /home/genomon/exome/data/output/sample01/120430/map_bwa/normal/ga.bam /home/genomon/exome/data/output/sample01/120430/ebcall /home/genomon/exome/ebcall/list/myNormalRef.txt Sample01_120430 # results are found here ../data/output/sample01/120430/ebcall/Sample01_120430/output.txt # If you submit two jobs simultaneously, tags should be unique. bash ebCall.sh /home/genomon/exome/data/output/sample01/120430/map_bwa/TAM/ga.bam /home/genomon/exome/data/output/sample01/120430/map_bwa/normal/ga.bam /home/genomon/exome/data/output/sample01/120430_TAM/ebcall /home/genomon/exome/ebcall/list/myNormalRef.txt Sample01_120430_TAM bash ebCall.sh /home/genomon/exome/data/output/sample01/120430/map_bwa/AMKL/ga.bam /home/genomon/exome/data/output/sample01/120430/map_bwa/normal/ga.bam /home/genomon/exome/data/output/sample01/120430_AMKL/ebcall /home/genomon/exome/ebcall/list/myNormalRef.txt Sample01_120430_AMKL
You can check the completion status here.