prepareChIPs
This is a simple snakemake workflow template for preparing single-end ChIP-Seq data.
The steps implemented are:
- Download raw fastq files from SRA
- Trim and filter raw fastq files using AdapterRemoval
- Align to the supplied genome using bowtie2
- Deduplicate alignments using Picard MarkDuplicates
- Call peaks using macs2
A pdf of the rulegraph is available here
Full details for each step are given below.
Any additional parameters for tools can be specified using config/config.yml, along with many of the requisite paths.
To run the workflow with default settings, simply run as follows (after editing config/samples.tsv):
snakemake --use-conda --cores 16
If running on an HPC cluster, a snakemake profile will be required for submission to the queueing system and appropriate resource allocation. Please discuss this with your HPC support team. Nodes may also have restricted internet access, and rules which download files may not work on many HPCs. Please see below or discuss this with your support team.
Whilst no snakemake wrappers are explicitly used in this workflow, the underlying scripts are used where possible to minimise any issues on HPC clusters which restrict internet access.
These scripts are based on v1.31.1 of the snakemake wrappers.
Important Note Regarding OSX Systems
It should be noted that this workflow is currently incompatible with OSX-based systems. There are two unsolved issues:
- fasterq-dump has a bug which is specific to conda environments. This has been fixed in v3.0.3, but the patch has not yet been made available in conda environments for OSX. Please check here to see if this has been updated.
- The following error appears in some OSX-based R sessions, in a system-dependent manner:
Error in grid.Call(C_textBounds, as.graphicsAnnot(x$label), x$x, x$y,  : 
  polygon edge not found
The fix for this bug is currently unknown.
Download Raw Data
Outline
The file samples.tsv is used to specify all steps for this workflow.
This file must contain the columns: accession, target, treatment and input
- accession must be an SRA accession. Only single-end data is currently supported by this workflow
- target defines the ChIP target. All files common to a target and treatment will be used to generate summarised coverage in bigWig files
- treatment defines the treatment group each file belongs to. If only one treatment exists, simply use the value 'control' or similar for every file
- input should contain the accession for the relevant input sample. These will only be downloaded once. Valid input samples are required for this workflow
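A minimal example of samples.tsv is sketched below. The accessions and target are hypothetical, and both ChIP samples share a single input, which will only be downloaded once.

```
accession   target   treatment   input
SRR0000001  H3K27ac  control     SRR0000005
SRR0000002  H3K27ac  treated     SRR0000005
```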
As some HPCs restrict internet access for submitted jobs, it may be prudent to run the initial rules in an interactive session if at all possible. This can be performed using the following (with 2 cores provided as an example)
snakemake --use-conda --until get_fastq --cores 2
Outputs
- Downloaded files will be gzipped and written to data/fastq/raw.
- FastQC and MultiQC will also be run, with output in docs/qc/raw
Both of these directories can be specified as relative paths in config.yml.
Read Filtering
Outline
Read trimming is performed using AdapterRemoval. Settings are customisable using config.yml, with the defaults set to discard reads shorter than 50nt and to trim using a quality-score threshold of Q30.
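Under these defaults, the trimming step is roughly equivalent to the following AdapterRemoval call; the file paths shown are hypothetical.

```bash
# Trim adapters and low-quality (< Q30) bases, discarding reads shorter
# than 50nt after trimming. Paths are illustrative only.
AdapterRemoval \
  --file1 data/fastq/raw/SRR0000001.fastq.gz \
  --output1 data/fastq/trimmed/SRR0000001.fastq.gz \
  --settings output/adapterremoval/SRR0000001.settings \
  --gzip --trimqualities --minquality 30 --minlength 50
```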
Outputs
- Trimmed fastq.gz files will be written to data/fastq/trimmed
- FastQC and MultiQC will also be run, with output in docs/qc/trimmed
- AdapterRemoval 'settings' files will be written to output/adapterremoval
Alignments
Outline
Alignment is performed using bowtie2, and it is assumed that a bowtie2 index is available before running this workflow.
The path and prefix for this index must be provided using config.yml.
The index will also be used to produce the file chrom.sizes, which is essential for converting bedGraph files to the more efficient bigWig format.
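As an illustration, the alignment itself is roughly equivalent to the following, where the index prefix, thread count and file names are all hypothetical.

```bash
# Align trimmed single-end reads, then convert to BAM.
bowtie2 --threads 4 \
  -x /path/to/index/genome \
  -U data/fastq/trimmed/SRR0000001.fastq.gz |
  samtools view -b - > data/aligned/SRR0000001.bam

# A sorted copy of each alignment is also returned (path illustrative).
samtools sort -o data/aligned/sorted/SRR0000001.bam data/aligned/SRR0000001.bam
```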
Outputs
- Alignments will be written to data/aligned
- bowtie2 log files will be written to output/bowtie2 (not the conventional log directory)
- The file chrom.sizes will be written to output/annotations
Both sorted and the original unsorted alignments will be returned.
However, the unsorted alignments are marked with temp() and can be deleted using:
snakemake --delete-temp-output --cores 1
Deduplication
Outline
Deduplication is performed using MarkDuplicates from the Picard set of tools. By default, deduplication will remove the duplicates from the set of alignments. All resultant bam files will be sorted and indexed.
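A sketch of the underlying call, using hypothetical file names, is given below.

```bash
# Remove (rather than simply mark) duplicate alignments, writing metrics,
# then index the resulting bam file. Paths are illustrative only.
picard MarkDuplicates \
  -I data/aligned/sorted/SRR0000001.bam \
  -O data/deduplicated/SRR0000001.bam \
  -M output/markDuplicates/SRR0000001.metrics.txt \
  --REMOVE_DUPLICATES true
samtools index data/deduplicated/SRR0000001.bam
```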
Outputs
- Deduplicated alignments are written to data/deduplicated and are indexed
- DuplicationMetrics files are written to output/markDuplicates
Peak Calling
Outline
This is performed using macs2 callpeak.
- Peak calling will be performed on (a) each sample individually, and (b) merged samples for those sharing a common ChIP target and treatment group
- Coverage bigWig files for each individual sample are produced using CPM values (i.e. Signal Per Million Reads, SPMR)
- For all combinations of target and treatment, coverage bigWig files are also produced, along with fold-enrichment bigWig files. A sketch of the underlying commands is given below
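As a rough sketch, per-sample peak calling with SPMR-scaled coverage, followed by fold-enrichment estimation via macs2 bdgcmp, corresponds to something like the following; the accessions and genome size setting are hypothetical.

```bash
# Call peaks against the matched input, writing SPMR-scaled bedGraph coverage.
# -g would be set to match the genome in use; accessions are illustrative.
macs2 callpeak \
  -t data/deduplicated/SRR0000001.bam \
  -c data/deduplicated/SRR0000005.bam \
  -f BAM -g hs -n SRR0000001 \
  --outdir output/macs2/SRR0000001 -B --SPMR

# Fold enrichment of the treatment pileup over the local background estimate.
macs2 bdgcmp \
  -t output/macs2/SRR0000001/SRR0000001_treat_pileup.bdg \
  -c output/macs2/SRR0000001/SRR0000001_control_lambda.bdg \
  -m FE -o output/macs2/SRR0000001/SRR0000001_FE.bdg
```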
Outputs
- Individual outputs are written to output/macs2/{accession}
  - Peaks are written in narrowPeak format, along with summits.bed
  - bedGraph files are automatically converted to bigWig files, and the originals are marked with temp() for subsequent deletion
  - callpeak log files are also added to this directory
- Merged outputs are written to output/macs2/{target}/
  - bedGraph files are also converted to bigWig and marked with temp()
  - Fold-enrichment bigWig files are also created, with the original bedGraph files marked with temp()
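The conversion from bedGraph to bigWig is typically handled by UCSC bedGraphToBigWig using the chrom.sizes file generated earlier; a minimal sketch, with hypothetical file names, follows.

```bash
# bedGraphToBigWig requires a position-sorted bedGraph (sort -k1,1 -k2,2n).
bedGraphToBigWig \
  output/macs2/SRR0000001/SRR0000001_treat_pileup.bdg \
  output/annotations/chrom.sizes \
  output/macs2/SRR0000001/SRR0000001_treat_pileup.bw
```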
Version History
- v0.1.0 (earliest): Created 9th Jul 2023 at 09:54 by Stevie Pederson. Copied files from PRJNA509779 after resetting for phoenix.