A workflow to process CLIP-seq data
main @ 771a6f9

Workflow Type: Nextflow

CLIP-seq Workflow

A Nextflow workflow for end-to-end processing of CLIP-seq data, supporting multiple CLIP protocols.

Overview

Starting from raw FASTQ files (or un-demultiplexed iCLIP data), the workflow processes reads through quality control, adapter trimming, rRNA removal, genome alignment, and UMI deduplication, then runs shoji to extract crosslink sites and produce per-sample and combined count matrices ready for differential binding analysis (see DEWSeq).

Workflow steps

  • Demultiplexing (optional, iCLIP only)
  • Quality Control
  • UMI pre-processing (optional, R2-CLIP only)
  • Adapter and Quality Trimming
  • Fastq data sketching and similarity comparison
  • rRNA filtering (optional)
  • Alignment
  • Contamination estimation
  • UMI deduplication (optional)
  • Downstream processing
  • Final statistics report
    • Per-sample read counts at each processing stage (raw → trimmed → rRNA-filtered → aligned → deduplicated), plus Kraken2 classification summary

Prerequisites

⚠️ If using conda/mamba, ensure no active conda environment is loaded before launching Nextflow, as it can interfere with the JRE.
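
A launch script can guard against this before calling Nextflow. The sketch below assumes conda's usual behaviour of exporting CONDA_DEFAULT_ENV when an environment (including base) is active; the function name is hypothetical and not part of this workflow.

```shell
#!/bin/sh
# Hypothetical pre-flight check: detect an active conda environment.
# Assumes conda exports CONDA_DEFAULT_ENV on activation (also set for "base").
conda_env_is_active() {
    [ -n "${CONDA_DEFAULT_ENV:-}" ]
}

if conda_env_is_active; then
    echo "Active conda environment: ${CONDA_DEFAULT_ENV}. Run 'conda deactivate' first." >&2
else
    echo "No active conda environment detected."
fi
```

In a real wrapper the warning branch would typically stop the script before the nextflow command is reached.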

Reference files

The genome profiles (conf/genome/) contain organism-specific reference configs.

ℹ️ See this note about creating genome specific configs

Before running, you will need to prepare and configure paths to:

| File | Used by | Description |
| --- | --- | --- |
| STAR genome index | STAR | Build using STAR --runMode genomeGenerate against your genome FASTA or FASTA + GTF |
| GFF3 annotation | shoji | Gene annotation file (GENCODE). See this section about using annotation files from non-GENCODE sources. |
| Genome FAI | tracks | FASTA index (.fa.fai) for the genome, used to set chromosome sizes for track generation (see samtools faidx) |
| rRNA FASTA | bbduk | Reference sequences for rRNA filtering |
| Kraken2 database | Kraken2 | Pre-built Kraken2 database (e.g. from https://benlangmead.github.io/aws-indexes/k2). See kraken2 section |
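
As a rough illustration of preparing the STAR index and the FASTA index, the sketch below wraps the two commands in a dry-run helper. All file names, the output directory and the thread count are placeholders, and the run helper is a hypothetical convenience; only standard STAR and samtools options are used. With DRY_RUN=1 (the default here) the commands are printed rather than executed.

```shell
#!/bin/sh
# Sketch: build a STAR index and a FASTA index. All paths are placeholders.
DRY_RUN="${DRY_RUN:-1}"        # 1 = print commands, anything else = execute
GENOME_FA="genome.fa"          # placeholder genome FASTA
GTF="annotation.gtf"           # placeholder annotation (optional for STAR)
STAR_INDEX_DIR="star_index"    # placeholder output directory
THREADS=8                      # placeholder thread count

# Print the command in dry-run mode, otherwise execute it
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

build_references() {
    run mkdir -p "$STAR_INDEX_DIR"
    run STAR --runMode genomeGenerate \
        --runThreadN "$THREADS" \
        --genomeDir "$STAR_INDEX_DIR" \
        --genomeFastaFiles "$GENOME_FA" \
        --sjdbGTFfile "$GTF"
    # samtools faidx writes the index next to the FASTA (genome.fa.fai)
    run samtools faidx "$GENOME_FA"
}

build_references
```

Set DRY_RUN=0 to execute the commands once the placeholder paths point to real files.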

⚠️ Edit the relevant genome config (e.g. conf/genome/hsa.config or conf/genome/rDNA.config) to point to your local copies.

GFF3

⚠️ When using GFF3 files from sources other than GENCODE, the shoji parameters corresponding to gene id, name, type and (optionally) feature need to be supplied. See the shoji annotation documentation for a description of these parameters. In the hsa and rDNA configs, edit the variable annotation_params to match the attribute names in the GFF3 file being used.
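
To find out which attribute names a given GFF3 file actually uses, it can help to list the keys in column 9 of its gene records. The sketch below is one way to do that with awk; the function name and the tiny example file are made up for illustration, and mapping the keys onto annotation_params remains a matter for the shoji documentation.

```shell
#!/bin/sh
# Sketch: list attribute keys found in column 9 of "gene" features,
# with the number of gene records carrying each key.
list_gff3_gene_attributes() {
    awk -F '\t' '
        /^#/ { next }                       # skip comments and directives
        $3 == "gene" {
            n = split($9, attrs, ";")
            for (i = 1; i <= n; i++) {
                split(attrs[i], kv, "=")
                if (kv[1] != "") count[kv[1]]++
            }
        }
        END { for (k in count) print k "\t" count[k] }
    ' "$1" | sort
}

# Tiny made-up example file (not real annotation data)
tmp_gff=$(mktemp)
printf '##gff-version 3\n' > "$tmp_gff"
printf 'chr1\tsrc\tgene\t1\t100\t.\t+\t.\tID=g1;Name=abc;biotype=protein_coding\n' >> "$tmp_gff"
printf 'chr1\tsrc\tgene\t200\t300\t.\t-\t.\tID=g2;Name=xyz;biotype=lncRNA\n' >> "$tmp_gff"
printf 'chr1\tsrc\texon\t1\t50\t.\t+\t.\tID=e1;Parent=g1\n' >> "$tmp_gff"

list_gff3_gene_attributes "$tmp_gff"
rm -f "$tmp_gff"
```

For the example file this reports the keys ID, Name and biotype, each seen in two gene records.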

Kraken2

⚠️ The Kraken2 config parameters db, nodes and names are placeholder paths. Edit these to point to actual files before running the workflow.

| Parameter | Required file | Description |
| --- | --- | --- |
| db | Kraken2 index file | See Kraken 2 index for a list of downloadable index files |
| nodes | NCBI taxonomy db nodes.dmp file | See this readme |
| names | NCBI taxonomy db names.dmp file | See this readme |

ℹ️ See this shell script for an example of supplying these files using command line parameters

Built-in profiles

Supported protocols

| Profile | Sequencing type | Description |
| --- | --- | --- |
| eCLIP | paired-end | ⚠️ two-step adapter trimming (cutadapt) and UMI deduplication |
| iCLIP | single-end | barcode demultiplexing + UMI extraction via flexbar |
| R2CLIP | paired-end | Read 2 is expected to contain only UMIs; after UMI extraction, Read 1 is processed as single-end |
| soniCLIP | single-end | no demultiplexing or deduplication |

⚠️ The current version of the eCLIP profile is designed to handle UMI-extracted reads available from the ENCODE portal

Genomes

| Profile | Description |
| --- | --- |
| hsa | Human GRCh38 / GENCODE v42 primary assembly |
| rDNA | Human hg38 with rDNA-masked genome (for rRNA binding RBPs); rRNA trimming disabled by default. rDNA genomes for human and mouse are available from this reference |

ℹ️ It is also possible to skip genome configs altogether and supply the reference files as parameters. See this soniCLIP shell script template for an example

Run environments

| Profile | Description |
| --- | --- |
| apptainer | Runs processes inside Apptainer containers (paths should be configured separately - see conf/containers/README.md) |
| conda | Creates and caches conda environments per process (see conf/conda/ and conda config) |
| slurm | SLURM executor settings (see conf/run/embl_hd.config); adapt queue names and resource limits for your cluster |

Using profiles

Profiles are combined with commas. See nextflow.config for the full list.

nextflow run ... -profile slurm,apptainer,eCLIP,hsa

This runs the workflow on a SLURM cluster using Apptainer containers, the eCLIP protocol, and hg38 genome alignment.

ℹ️ The slurm profile is pre-configured for the EMBL Heidelberg HPC. For other SLURM clusters, copy conf/run/embl_hd.config, adjust queue names and resource parameters, and reference your copy in nextflow.config.
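
Wrapper scripts sometimes assemble the comma-separated profile string from separate settings. The sketch below shows one way to do that while rejecting protocol names other than the four listed under Supported protocols; the function name and argument order are hypothetical.

```shell
#!/bin/sh
# Sketch: assemble a -profile value from an executor, a run environment,
# a protocol and a genome profile. Only the protocol name is validated.
build_profile() {
    executor="$1"; environment="$2"; protocol="$3"; genome="$4"
    case "$protocol" in
        eCLIP|iCLIP|R2CLIP|soniCLIP) ;;
        *) echo "Unknown protocol profile: $protocol" >&2; return 1 ;;
    esac
    echo "${executor},${environment},${protocol},${genome}"
}

PROFILE=$(build_profile slurm apptainer eCLIP hsa)
echo "-profile ${PROFILE}"
```

The printed value matches the profile combination used in the example command above and can be appended to a nextflow run call.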

Workflow

Sample sheet format

This workflow uses the nf-schema plugin and its supported sample sheet format.

For the eCLIP, R2-CLIP and soniCLIP protocols, the following columns (in CSV) are expected:

eCLIP

eCLIP: fastq_2 column MUST be provided.

| sample | fastq_1 | fastq_2 |
| --- | --- | --- |
| sample1 | /path/to/sample1_R1.fastq.gz | /path/to/sample1_R2.fastq.gz |
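
As a minimal sketch, a paired-end sample sheet with these columns could be written and sanity-checked as follows. Sample names and paths are placeholders, the check function is hypothetical, and the nf-schema plugin remains the authoritative validator when the workflow runs.

```shell
#!/bin/sh
# Sketch: write an eCLIP sample sheet and check that every row has fastq_2.
# Written to a temporary directory here; any path works in practice.
SHEET="$(mktemp -d)/samplesheet_eclip.csv"
cat > "$SHEET" <<'EOF'
sample,fastq_1,fastq_2
sample1,/path/to/sample1_R1.fastq.gz,/path/to/sample1_R2.fastq.gz
sample2,/path/to/sample2_R1.fastq.gz,/path/to/sample2_R2.fastq.gz
EOF

# Exit status 0 if the header matches and no row has an empty column
check_paired_sheet() {
    awk -F ',' '
        BEGIN { bad = 0 }
        NR == 1 {
            if ($0 != "sample,fastq_1,fastq_2") { print "unexpected header: " $0; bad = 1 }
            next
        }
        $1 == "" || $2 == "" || $3 == "" { print "line " NR ": missing value"; bad = 1 }
        END { exit bad }
    ' "$1"
}

check_paired_sheet "$SHEET" && echo "sample sheet OK"
```

The check only looks at the header and at empty fields; it does not test whether the FASTQ files exist.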

R2-CLIP

R2-CLIP: fastq_1 contains the actual reads, and fastq_2 is expected to contain only UMIs.

umi_tools extract is used to extract UMIs from fastq_2 (based on the parameter bc_pattern in the config file) and add them to the fastq_1 read headers; the fastq_1 reads are then processed as regular single-end reads.

| sample | fastq_1 | fastq_2 |
| --- | --- | --- |
| sample1 | /path/to/sample1_R1.fastq.gz | /path/to/sample1_R2.fastq.gz |

soniCLIP

soniCLIP: only uses fastq_1

| sample | fastq_1 |
| --- | --- |
| sample1 | /path/to/sample1.fastq.gz |

iCLIP

For the iCLIP protocol, the following columns (in CSV) are expected:

| fastq | barcode |
| --- | --- |
| /path/to/run1.fastq.gz | /path/to/run1_barcode.fa |

  • The fastq column contains the path to the raw, un-demultiplexed fastq file.
  • The barcode column contains the path to the fasta file with the barcodes used for demultiplexing.

barcode fasta file format example:

>sample_1
NNNNATATATATNN
>sample_2
NNNNCGCGCGCGNN

ℹ️ flexbar demultiplexes iCLIP data using the provided barcodes, taking each fasta header as the sample name. UMIs (the Ns in the sequences) are extracted from the reads during demultiplexing and added to the fastq header.
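
Before launching a run, a barcode fasta file can be summarised to confirm the sample names and the number of UMI positions per barcode. The sketch below counts the N characters in each sequence; it assumes one sequence line per record, and the function name is hypothetical.

```shell
#!/bin/sh
# Sketch: print sample name, barcode sequence and number of UMI positions
# (N characters) for each record of a barcode FASTA file.
summarise_barcodes() {
    awk '
        /^>/ { name = substr($0, 2); next }   # header line: keep name without ">"
        NF {
            seq = $0
            umi = gsub(/N/, "N", seq)         # gsub returns the number of N characters
            print name "\t" $0 "\t" umi
        }
    ' "$1"
}

BARCODES=$(mktemp)
cat > "$BARCODES" <<'EOF'
>sample_1
NNNNATATATATNN
>sample_2
NNNNCGCGCGCGNN
EOF

summarise_barcodes "$BARCODES"
rm -f "$BARCODES"
```

For the example barcodes above, both samples are reported with 6 UMI positions (four leading and two trailing Ns).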

ℹ️ iCLIP fastq files that are already processed (demultiplexed and UMI extracted) can also be provided, using the same sample sheet format as for eCLIP/R2-CLIP/soniCLIP (with sample, fastq_1 columns) (see section eCLIP, R2-CLIP and soniCLIP).

Running the workflow

Pull the latest version of the workflow before running:

nextflow pull 

Pass the URL of this repository (e.g. `https://github.com/your-org/clip-seq-nf`) as the argument to nextflow pull. The examples below use a local clone. To run directly from the remote repository, replace `/path/to/workflow` with that URL.

⚠️ Most of the example workflows below assume that there is a genome assembly config with appropriate paths and parameters in the genome folder, and that this assembly is included in the nextflow config file

eCLIP with human genome (hg38) on SLURM using conda

See this shell script

iCLIP with human genome (hg38) on SLURM using apptainer

See this shell script

soniCLIP with human genome (hg38) on SLURM using apptainer with custom shoji parameters

See this shell script

soniCLIP without using a genome config on SLURM and conda

See this shell script

ℹ️ The shell script above shows how to use custom genome files without adding a genome config.

Output

Given below is an example output directory structure from this pipeline.

ℹ️ The output directory is set by the Nextflow -output-dir option. The files in this directory are symbolic links to files in the work directory, which is set by the Nextflow -work-dir option.
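
Because the output files are links into the work directory, they would presumably become dangling if the work directory were removed. One way to keep a standalone copy is to dereference the links while copying, as sketched below; the function name and the temporary demonstration directories are made up for illustration.

```shell
#!/bin/sh
# Sketch: copy an output directory while replacing symbolic links with the
# files they point to, so results survive removal of the work directory.
archive_results() {
    # -R: recurse, -L: follow symbolic links and copy the link targets
    cp -RL "$1" "$2"
}

# Self-contained demonstration with temporary directories (stand-ins for
# the directories given via -work-dir and -output-dir)
demo=$(mktemp -d)
mkdir -p "$demo/work" "$demo/results"
echo "chr1 10 11" > "$demo/work/sites.bed"
ln -s "$demo/work/sites.bed" "$demo/results/sites.bed"

archive_results "$demo/results" "$demo/archive"
rm -rf "$demo/work"                 # links in results/ are now dangling
cat "$demo/archive/sites.bed"       # the archived copy is a regular file
```

The archived copy occupies real disk space, so it is worth restricting it to the sub-directories that are actually needed.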

| Directory | Sub-directory | File | Description |
| --- | --- | --- | --- |
| Annotation | | | Shoji annotation files |
| Fastq | | | Fastq files after trimming |
| | rRNA_trim | | after rRNA read removal |
| | trim | | after adapter trimming |
| Genome_align | | | Genome alignments |
| | alignment | | bam files, alignment statistics,... |
| | mapped_fq | | mapped reads in fq format |
| | multimapped_fq | | multimapped reads in fq format |
| | unmapped_fq | | un-mapped reads in fq format |
| Kraken2 | | | Kraken 2 output directory |
| | contamination_check | | Kraken2 classification files and contamination reports |
| QC | | | QC files: fastqc and multiqc files |
| | raw | | raw data QC |
| | rRNA_trim | | QC after rRNA read removal |
| | trim | | QC after adapter trimming |
| Shoji | | | Shoji and related outputs |
| | counts | | count files from shoji count |
| | matrix | | Final output matrices for DEWSeq analysis |
| | sites | | bed formatted output files from shoji extract |
| | tracks | | .bw files for visualization |
| Sourmash | | | Sourmash files and plots |
| | align | | for aligned reads |
| | kraken2 | | after Kraken2 contamination estimation |
| | raw | | for raw reads |
| | rRNA_trim | | after rRNA read trimming |
| | trim | | after adapter trimming |
| Stats | | | Read count statistics |
| | | all_samples_combined_stats.csv | read count statistics for all samples from raw reads to alignment, deduplication (optional) and contamination estimation |
| | | ``_all_stats.json | per sample read count statistics in json format |

Developed at: Hentze Group, EMBL Heidelberg

Version History

main @ 771a6f9 (earliest) Created 24th Jun 2026 at 12:05 by Hentze group

Add MIT License

