ChIP-Seq pipeline
Here we provide the tools to perform paired end or single read ChIP-Seq analysis including raw data quality control, read mapping, peak calling, differential binding analysis and functional annotation. As input files you may use either zipped fastq-files (.fastq.gz) or mapped read data (.bam files). In case of paired end reads, corresponding fastq files should be named using .R1.fastq.gz and .R2.fastq.gz suffixes.
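The naming convention for paired end input can be checked before starting a run. The following Python sketch is not part of the pipeline; the function name `pair_fastq_files` and the example file names are made up for illustration. It groups file names by the `.R1.fastq.gz` and `.R2.fastq.gz` suffixes and reports any file whose mate is missing.

```python
from collections import defaultdict

R1_SUFFIX = ".R1.fastq.gz"
R2_SUFFIX = ".R2.fastq.gz"


def pair_fastq_files(filenames):
    """Group paired end FASTQ names by sample prefix.

    Returns (pairs, unpaired): pairs maps the sample prefix to an (R1, R2)
    tuple, unpaired is a sorted list of files whose mate was not found.
    Names that carry neither suffix are ignored.
    """
    mates = defaultdict(dict)
    for name in filenames:
        if name.endswith(R1_SUFFIX):
            mates[name[: -len(R1_SUFFIX)]]["R1"] = name
        elif name.endswith(R2_SUFFIX):
            mates[name[: -len(R2_SUFFIX)]]["R2"] = name
    pairs = {s: (m["R1"], m["R2"]) for s, m in mates.items() if len(m) == 2}
    unpaired = sorted(f for m in mates.values() if len(m) < 2 for f in m.values())
    return pairs, unpaired


# Hypothetical file names, for illustration only.
example_files = ["wt_rep1.R1.fastq.gz", "wt_rep1.R2.fastq.gz", "ko_rep1.R1.fastq.gz"]
example_pairs, example_unpaired = pair_fastq_files(example_files)
print("complete pairs:", example_pairs)
print("missing mate:", example_unpaired)
```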
Pipeline Workflow
All analysis steps are illustrated in the pipeline flowchart. Specify the desired analysis details for your data in the essential.vars.groovy file (see below) and run the pipeline chipseq.pipeline.groovy as described here. A markdown file ChIPreport.Rmd will be generated in the output reports folder after running the pipeline. Subsequently, the ChIPreport.Rmd file can be converted to a final html report using the knitr R-package.
The pipeline includes
- raw data quality control with FastQC, BamQC and MultiQC
- mapping reads or read pairs to the reference genome using bowtie2 (default) or bowtie1
- filter out multimapping reads from bowtie2 output with samtools (optional)
- identify and remove duplicate reads with Picard MarkDuplicates (optional)
- generation of bigWig tracks for visualisation of alignment with deeptools bamCoverage. For single end design, reads are extended to the average fragment size
- characterization of insert size using Picard CollectInsertSizeMetrics (for paired end libraries only)
- characterize library complexity by PCR Bottleneck Coefficient using the GenomicAlignments R-package (for single read libraries only)
- characterize phantom peaks by cross correlation analysis using the spp R-package (for single read libraries only)
- peak calling of IP samples vs. corresponding input controls using MACS2
- peak annotation using the ChIPseeker R-package (optional)
- differential binding analysis using the DiffBind R-package (optional). For this, the input peak files must be listed in NGSpipe2go/tools/diffbind/targets_diffbind.txt and the contrasts of interest in NGSpipe2go/tools/diffbind/contrasts_diffbind.txt (see below)
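The PCR Bottleneck Coefficient (PBC) mentioned in the list above is commonly defined, for example in the ENCODE quality guidelines, as the number of genomic positions covered by exactly one uniquely mapped read divided by the number of distinct positions covered by at least one such read. The following Python sketch illustrates that ratio on a toy list of read positions. It is not the pipeline's own code, which relies on the GenomicAlignments R-package; the function name and the (chromosome, start, strand) representation are assumptions made for illustration.

```python
from collections import Counter


def pcr_bottleneck_coefficient(read_positions):
    """Compute PBC from an iterable of (chrom, start, strand) tuples.

    Each tuple stands for one uniquely mapped read. PBC is the number of
    positions seen exactly once divided by the number of distinct positions.
    """
    counts = Counter(read_positions)
    if not counts:
        raise ValueError("no reads given")
    n_distinct = len(counts)
    n_single = sum(1 for c in counts.values() if c == 1)
    return n_single / n_distinct


# Toy data: one position is covered twice, two positions are covered once,
# so 2 of 3 distinct positions are singletons.
toy_reads = [("chr1", 100, "+"), ("chr1", 100, "+"), ("chr1", 250, "-"), ("chr2", 40, "+")]
print(pcr_bottleneck_coefficient(toy_reads))
```

Values close to 1 indicate a complex library, while low values point to many duplicated positions.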
Pipeline-specific parameter settings
- targets.txt: tab-separated txt-file giving information about the analysed samples. The following columns are required:
  - IP: bam file name of IP sample
  - IPname: IP sample name to be used in plots and tables
  - INPUT: bam file name of corresponding input control sample
  - INPUTname: input sample name to be used in plots and tables
  - group: variable for sample grouping (e.g. by condition)
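As an illustration of the targets.txt layout, the Python sketch below parses a small tab-separated table and checks that the five required columns are present. The sample and file names are invented, and the helper `read_targets` is hypothetical rather than something shipped with the pipeline.

```python
import csv
import io

REQUIRED_COLUMNS = ["IP", "IPname", "INPUT", "INPUTname", "group"]


def read_targets(handle):
    """Read a tab-separated targets table and verify the required columns."""
    reader = csv.DictReader(handle, delimiter="\t")
    header = reader.fieldnames or []
    missing = [c for c in REQUIRED_COLUMNS if c not in header]
    if missing:
        raise ValueError(f"targets.txt is missing columns: {missing}")
    return list(reader)


# Invented example content; real file names depend on your project.
EXAMPLE_TARGETS = (
    "IP\tIPname\tINPUT\tINPUTname\tgroup\n"
    "ko_ip_rep1.bam\tKO_rep1\tko_input_rep1.bam\tKO_input_rep1\tKO\n"
    "wt_ip_rep1.bam\tWT_rep1\twt_input_rep1.bam\tWT_input_rep1\tWT\n"
)

example_rows = read_targets(io.StringIO(EXAMPLE_TARGETS))
for row in example_rows:
    print(row["IPname"], "vs", row["INPUTname"], "in group", row["group"])
```

To check a real file, pass an open file handle instead of the `io.StringIO` object.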
- essential.vars.groovy: essential parameters describing the experiment, including:
  - ESSENTIAL_PROJECT: your project folder name
  - ESSENTIAL_BOWTIE_REF: full path to bowtie2 indexed reference genome (bowtie1 indexed reference genome if bowtie1 is selected as mapper)
  - ESSENTIAL_BOWTIE_GENOME: full path to the reference genome FASTA file
  - ESSENTIAL_BSGENOME: Bioconductor genome sequence annotation package
  - ESSENTIAL_TXDB: Bioconductor transcript-related annotation package
  - ESSENTIAL_ANNODB: Bioconductor genome annotation package
  - ESSENTIAL_BLACKLIST: files with problematic 'blacklist regions' to be excluded from analysis (optional)
  - ESSENTIAL_PAIRED: either paired end ("yes") or single read ("no") design
  - ESSENTIAL_READLEN: read length of library
  - ESSENTIAL_FRAGLEN: mean length of library inserts and also minimum peak size called by MACS2
  - ESSENTIAL_THREADS: number of threads for parallel tasks
  - ESSENTIAL_USE_BOWTIE1: if true use bowtie1 for read mapping, otherwise bowtie2 by default
- Additional (more specialized) parameters can be set in the var.groovy-files of the individual pipeline modules.
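Before launching a run it can help to sanity-check the values intended for essential.vars.groovy. The Python sketch below mirrors a few of the parameters in a plain dictionary and applies only the constraints stated above ("yes"/"no" for ESSENTIAL_PAIRED, a boolean for ESSENTIAL_USE_BOWTIE1) plus the assumption that read length, fragment length and thread count are positive integers. The function `check_essential_settings` and all example values are hypothetical; the pipeline itself reads the Groovy file, not this dictionary.

```python
def check_essential_settings(settings):
    """Return a list of problems found in a dict of ESSENTIAL_* values."""
    problems = []
    if settings.get("ESSENTIAL_PAIRED") not in ("yes", "no"):
        problems.append('ESSENTIAL_PAIRED must be "yes" or "no"')
    for key in ("ESSENTIAL_READLEN", "ESSENTIAL_FRAGLEN", "ESSENTIAL_THREADS"):
        value = settings.get(key)
        # bool is a subclass of int in Python, so exclude it explicitly.
        if not isinstance(value, int) or isinstance(value, bool) or value <= 0:
            problems.append(f"{key} must be a positive integer")
    if not isinstance(settings.get("ESSENTIAL_USE_BOWTIE1"), bool):
        problems.append("ESSENTIAL_USE_BOWTIE1 must be true or false")
    return problems


# Placeholder values for illustration; adapt them to your own experiment.
example_settings = {
    "ESSENTIAL_PROJECT": "/path/to/project",
    "ESSENTIAL_PAIRED": "no",
    "ESSENTIAL_READLEN": 50,
    "ESSENTIAL_FRAGLEN": 200,
    "ESSENTIAL_THREADS": 4,
    "ESSENTIAL_USE_BOWTIE1": False,
}
print(check_essential_settings(example_settings))
```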
If differential binding analysis is selected, the following files are additionally required:
- contrasts_diffbind.txt: indicate the intended group comparisons for differential binding analysis, e.g. KOvsWT=(KO-WT) if targets.txt contains the groups KO and WT. Give one contrast per line.
- targets_diffbind.txt: sample information for DiffBind, with the following columns:
  - SampleID: IP sample name (as IPname in targets.txt)
  - Condition: variable for sample grouping (as group in targets.txt)
  - Replicate: number of replicate
  - bamReads: bam file name of IP sample (as IP in targets.txt but with path relative to project directory)
  - ControlID: input sample name (as INPUTname in targets.txt)
  - bamControl: bam file name of corresponding input control sample (as INPUT in targets.txt but with path relative to project directory)
  - Peaks: peak file name obtained from peak caller (path relative to project directory)
  - PeakCaller: name of peak caller (e.g. macs)
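The contrast syntax shown above, name=(groupA-groupB), can be checked against the Condition values before running the differential binding step. The Python sketch below is an illustration only: the helper names are hypothetical, and the regular expression assumes that contrast and group names consist of letters, digits and underscores, which the pipeline itself may not require.

```python
import re

# Matches lines such as "KOvsWT=(KO-WT)"; names are assumed to be word characters.
CONTRAST_PATTERN = re.compile(r"^\s*(\w+)\s*=\s*\(\s*(\w+)\s*-\s*(\w+)\s*\)\s*$")


def parse_contrasts(lines):
    """Parse contrast lines into (name, group_a, group_b) tuples."""
    contrasts = []
    for line in lines:
        if not line.strip():
            continue  # skip empty lines
        match = CONTRAST_PATTERN.match(line)
        if match is None:
            raise ValueError(f"cannot parse contrast line: {line!r}")
        contrasts.append(match.groups())
    return contrasts


def unknown_groups(contrasts, conditions):
    """Return the groups used in contrasts that do not occur in conditions."""
    known = set(conditions)
    return sorted({g for _, a, b in contrasts for g in (a, b) if g not in known})


example_contrasts = parse_contrasts(["KOvsWT=(KO-WT)"])
print(example_contrasts)
print(unknown_groups(example_contrasts, ["KO", "WT"]))
```

An empty result from `unknown_groups` means that every group named in a contrast also appears in the Condition column.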
Programs required
- Bedtools
- Bowtie2
- deepTools
- encodeChIPqc (provided by another project from imbforge)
- FastQC
- MACS2
- MultiQC
- Picard
- R with packages ChIPseeker, DiffBind, GenomicAlignments, spp and genome annotation packages
- Samtools
- UCSC utilities
Version History
Version 1 (earliest) Created 7th Oct 2020 at 08:41 by Sergi Sayols
master
2c13c9a
Last updated: 10th Jan 2022 at 15:19