A workflow to process CLIP-seq data
main @ 771a6f9

Workflow Type: Nextflow

CLIP-seq Workflow

A Nextflow workflow for end-to-end processing of CLIP-seq data, supporting multiple CLIP protocols.

Overview

Starting from raw FASTQ files (or un-demultiplexed iCLIP data), the workflow processes reads through quality control, adapter trimming, rRNA removal, genome alignment, and UMI deduplication, then runs shoji to extract crosslink sites and produce per-sample and combined count matrices ready for differential binding analysis (see DEWSeq).

Workflow steps

1. Demultiplexing (optional, iCLIP only)
2. Quality Control
3. UMI pre-processing (optional, R2-CLIP only)
4. Adapter and Quality Trimming
5. Fastq data sketching and similarity comparison
6. rRNA filtering (optional)
7. Alignment
8. Contamination estimation
9. UMI deduplication (optional)
10. Downstream processing
11. Final statistics report
    - Per-sample read counts at each processing stage (raw → trimmed → rRNA-filtered → aligned → deduplicated), plus Kraken2 classification summary

Prerequisites

- Java 11+ (required by Nextflow)
- Nextflow, tested versions:
  - 25.04.6
  - 25.10.1
- One of the following for software environments:
  - Apptainer (formerly Singularity), recommended for HPC, see apptainer configs
  - Conda / Mamba, see conda configs

⚠️ If using conda/mamba, ensure no active conda environment is loaded before launching Nextflow, as it can interfere with the JRE.
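
A launch script can guard against this situation. The sketch below assumes conda's `CONDA_DEFAULT_ENV` variable, which `conda activate` sets in the current shell. The function name `check_conda_inactive` is hypothetical and not part of the workflow.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical helper: fail when a conda environment is active in this shell.
# CONDA_DEFAULT_ENV is set by `conda activate`.
check_conda_inactive() {
    if [ -n "${CONDA_DEFAULT_ENV:-}" ]; then
        echo "conda environment '${CONDA_DEFAULT_ENV}' is active; run 'conda deactivate' before launching Nextflow" >&2
        return 1
    fi
}

# Demonstration in subshells so the caller's environment is left untouched.
( unset CONDA_DEFAULT_ENV; check_conda_inactive ) && echo "clean shell: ok to launch"
( CONDA_DEFAULT_ENV=base; check_conda_inactive ) 2>/dev/null || echo "active env: launch blocked"
```

Calling `check_conda_inactive` as the first line of a launch script stops the run early instead of letting it fail later with a JRE-related error.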

Reference files

The genome profiles (conf/genome/) contain organism-specific reference configs.

ℹ️ See this note about creating genome-specific configs

Before running, you will need to prepare and configure paths to:

File	Used by	Description
STAR genome index	STAR	Build using `STAR --runMode genomeGenerate` against your genome FASTA or FASTA + GTF
GFF3 annotation	shoji	Gene annotation file (GENCODE). See this section about using annotation files from non-GENCODE sources.
Genome FAI	tracks	FASTA index (`.fa.fai`) for the genome, used to set chromosome sizes for track generation (see `samtools faidx`)
rRNA FASTA	bbduk	Reference sequences for rRNA filtering
Kraken2 database	Kraken2	Pre-built Kraken2 database (e.g. from https://benlangmead.github.io/aws-indexes/k2). See the Kraken2 section

⚠️ Edit the relevant genome config (e.g. conf/genome/hsa.config or conf/genome/rDNA.config) to point to your local copies.
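
The genome FAI listed in the table is a tab-separated file in which the first two columns hold the sequence name and its length, which is the information needed for chromosome sizes. The sketch below uses a toy index with made-up offsets and hypothetical `demo_*` file names. A real index comes from `samtools faidx genome.fa`, and how the workflow reads it internally is not shown here.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Toy index standing in for the output of `samtools faidx genome.fa`.
# Columns: name, length, byte offset, bases per line, bytes per line.
# The offsets are made up; only name and length matter in this sketch.
printf 'chr1\t248956422\t6\t60\t61\n'     >  demo_genome.fa.fai
printf 'chrM\t16569\t253105708\t60\t61\n' >> demo_genome.fa.fai

# Chromosome sizes are the first two columns of the index.
cut -f1,2 demo_genome.fa.fai > demo_genome.chrom.sizes
cat demo_genome.chrom.sizes
```

Inspecting these two columns is a quick way to confirm that the index matches the chromosome naming (e.g. `chr1` versus `1`) used by the STAR index and the GFF3 annotation.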

GFF3

⚠️ When using GFF3 files from sources other than GENCODE, the shoji parameters corresponding to gene id, name, type and, optionally, feature need to be supplied. See the shoji annotation documentation for a description of these parameters. In the hsa and rDNA configs, edit the variable annotation_params to match the attribute names in the GFF3 file being used.
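
To find out which attribute names a given GFF3 file actually uses, it can help to list the keys found in column 9 of its gene features. The following is a minimal sketch on a made-up annotation. The `demo_*` file names and the attribute names (`ID`, `Name`, `biotype`) are illustrative only and say nothing about what shoji expects.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Toy GFF3 with non-GENCODE attribute names (contents are made up).
{
    printf '##gff-version 3\n'
    printf 'chr1\tsrc\tgene\t100\t900\t.\t+\t.\tID=g1;Name=abc1;biotype=protein_coding\n'
    printf 'chr1\tsrc\texon\t100\t300\t.\t+\t.\tID=e1;Parent=g1\n'
    printf 'chr1\tsrc\tgene\t2000\t2500\t.\t-\t.\tID=g2;Name=xyz2;biotype=lncRNA\n'
} > demo_annotation.gff3

# Count every attribute key (column 9, "key=value" pairs separated by ";")
# that occurs on "gene" features.
awk -F'\t' '
    /^#/ { next }
    $3 == "gene" {
        n = split($9, pairs, ";")
        for (i = 1; i <= n; i++) {
            split(pairs[i], kv, "=")
            if (kv[1] != "") count[kv[1]]++
        }
    }
    END { for (k in count) print k "\t" count[k] }
' demo_annotation.gff3 | LC_ALL=C sort > demo_gene_attributes.tsv
cat demo_gene_attributes.tsv
```

Keys that appear on every gene feature are the candidates for the gene id, name and type settings. For a compressed annotation, the same `awk` program can read from `gzip -dc annotation.gff3.gz` instead of a file.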

Kraken2

⚠️ The Kraken2 config parameters db, nodes and names are placeholder paths. Edit them to point to actual files before running the workflow.

Parameter	Required file	Description
`db`	Kraken2 index file	See Kraken 2 index for a list of downloadable index files
`nodes`	NCBI taxonomy db `nodes.dmp` file	See this readme
`names`	NCBI taxonomy db `names.dmp` file	See this readme

ℹ️ See this shell script for an example supplying these files using command line parameters

Built-in profiles

Supported protocols

Profile	Sequencing type	Description
`eCLIP`	paired-end	⚠️ two-step adapter trimming (cutadapt) and UMI deduplication
`iCLIP`	single-end	barcode demultiplexing + UMI extraction via flexbar
`R2CLIP`	paired-end	Read 2 is expected to contain only UMIs; after UMI extraction, Read 1 is processed as single-end
`soniCLIP`	single-end	no demultiplexing or deduplication

⚠️ The current version of the eCLIP profile is designed to handle UMI-extracted reads available from the ENCODE portal.

Genomes

Profile	Description
`hsa`	Human GRCh38 / GENCODE v42 primary assembly
`rDNA`	Human hg38 with rDNA-masked genome (for rRNA-binding RBPs); rRNA trimming disabled by default. rDNA genomes for human and mouse are available from this reference

ℹ️ It is also possible to skip creating or using genome configs altogether and supply the reference files as parameters. See this soniCLIP shell script template for an example.

Run environments

Profile	Description
`apptainer`	Runs processes inside Apptainer containers (paths should be configured separately - see conf/containers/README.md)
`conda`	Creates and caches conda environments per process (see conf/conda/ and conda config)
`slurm`	SLURM executor settings (see conf/run/embl_hd.config); adapt queue names and resource limits for your cluster

Using profiles

Profiles are combined with commas. See nextflow.config for the full list.

nextflow run ... -profile slurm,apptainer,eCLIP,hsa

This runs the workflow on a SLURM cluster using Apptainer containers, the eCLIP protocol, and hg38 genome alignment.

ℹ️ The slurm profile is pre-configured for the EMBL Heidelberg HPC. For other SLURM clusters, copy conf/run/embl_hd.config, adjust queue names and resource parameters, and reference your copy in nextflow.config.

Workflow

Sample sheet format

This workflow uses the nf-schema plugin and its supported sample sheet format.

For the eCLIP, R2-CLIP and soniCLIP protocols, the following columns (in CSV) are expected:

eCLIP

eCLIP: fastq_2 column MUST be provided.

sample	fastq_1	fastq_2
sample1	/path/to/sample1_R1.fastq.gz	/path/to/sample1_R2.fastq.gz

R2-CLIP

R2-CLIP: fastq_1 contains the actual reads, and fastq_2 is expected to contain only UMIs.

umi_tools extract is used to extract the UMIs from fastq_2 (based on the parameter bc_pattern in the config file) and add them to the fastq_1 headers. The fastq_1 reads are then processed as regular single-end reads.

sample	fastq_1	fastq_2
sample1	/path/to/sample1_R1.fastq.gz	/path/to/sample1_R2.fastq.gz
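
The effect of the UMI pre-processing step on the read headers can be illustrated without running umi_tools. The sketch below is an `awk` stand-in, not the workflow's actual command. It assumes an 8-base UMI (as a bc_pattern of `NNNNNNNN` would imply) and assumes the UMI is joined to the read ID with an underscore, which is the header style umi_tools extract is generally known for. File names are hypothetical.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Toy inputs: one read pair. R2 holds only the UMI, as the R2CLIP profile expects.
printf '@read1 1:N:0\nACGTACGTAC\n+\nIIIIIIIIII\n' > demo_R1.fastq
printf '@read1 2:N:0\nGATTACAG\n+\nIIIIIIII\n'     > demo_R2.fastq

umi_len=8   # assumption: length implied by a bc_pattern of NNNNNNNN

# First pass (R2): remember the UMI of each 4-line record.
# Second pass (R1): append "_<UMI>" to the read ID on every header line.
awk -v n="$umi_len" '
    NR == FNR {
        if (FNR % 4 == 2) umi[(FNR + 2) / 4] = substr($0, 1, n)
        next
    }
    FNR % 4 == 1 {
        rec  = (FNR + 3) / 4
        rest = substr($0, length($1) + 1)   # description after the read ID, if any
        print $1 "_" umi[rec] rest
        next
    }
    { print }
' demo_R2.fastq demo_R1.fastq > demo_R1.umi.fastq
cat demo_R1.umi.fastq
```

After this step only the modified Read 1 file is needed, which is why the rest of the processing can treat R2-CLIP data as single-end. The UMI in the header is what the later deduplication step relies on.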

soniCLIP

soniCLIP: only uses fastq_1

sample	fastq_1
sample1	/path/to/sample1.fastq.gz
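
For many samples, the paired-end sheets shown above can be generated rather than typed. The following is a minimal sketch that assumes files are named `<sample>_R1.fastq.gz` and `<sample>_R2.fastq.gz` in one directory. That naming scheme and the `demo_*` names are assumptions, not requirements of the workflow.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Toy input directory; the <sample>_R1/_R2.fastq.gz naming is an assumption.
mkdir -p demo_fastq
touch demo_fastq/sampleA_R1.fastq.gz demo_fastq/sampleA_R2.fastq.gz \
      demo_fastq/sampleB_R1.fastq.gz demo_fastq/sampleB_R2.fastq.gz

# Absolute paths keep the sheet valid regardless of the launch directory.
fastq_dir="$(cd demo_fastq && pwd)"
sheet="demo_samplesheet.csv"

echo "sample,fastq_1,fastq_2" > "$sheet"
for r1 in "$fastq_dir"/*_R1.fastq.gz; do
    sample="$(basename "$r1" _R1.fastq.gz)"
    r2="${fastq_dir}/${sample}_R2.fastq.gz"
    if [ ! -e "$r2" ]; then
        echo "missing Read 2 file for sample ${sample}" >&2
        exit 1
    fi
    echo "${sample},${r1},${r2}" >> "$sheet"
done
cat "$sheet"
```

For soniCLIP, the same loop works with the `fastq_2` column and the mate check dropped, leaving a two-column sheet with `sample` and `fastq_1`.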

iCLIP

For the iCLIP protocol, the following columns (in CSV) are expected:

fastq	barcode
/path/to/run1.fastq.gz	/path/to/run1_barcode.fa

The fastq column contains the path to the raw, un-demultiplexed FASTQ file.
The barcode column contains the path to the FASTA file with the barcodes used for demultiplexing.

barcode fasta file format example:

>sample_1
NNNNATATATATNN
>sample_2
NNNNCGCGCGCGNN

ℹ️ flexbar demultiplexes iCLIP data based on the provided barcodes, using each FASTA header as the sample name. UMIs (the Ns in the barcode sequences) are extracted from the reads during demultiplexing and added to the FASTQ header.
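
Before launching, a barcode FASTA can be summarised to check sample names, fixed barcode bases and UMI length. This is a minimal sketch using the example barcodes from this section. The `demo_*` file names are hypothetical, and the script only inspects the file without calling flexbar.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Barcode FASTA from the example in this section.
cat > demo_barcodes.fa <<'EOF'
>sample_1
NNNNATATATATNN
>sample_2
NNNNCGCGCGCGNN
EOF

# For each record print: sample name, fixed barcode bases, number of UMI (N) positions.
awk '
    /^>/ { name = substr($0, 2); next }
    {
        bases = $0
        umi_len = gsub(/N/, "", bases)   # gsub returns how many Ns were removed
        print name "\t" bases "\t" umi_len
    }
' demo_barcodes.fa > demo_barcode_summary.tsv
cat demo_barcode_summary.tsv
```

For the two example entries this reports the fixed bases `ATATATAT` and `CGCGCGCG`, each with 6 UMI positions. Differing UMI lengths or duplicate fixed bases between samples would be worth checking before a run.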

ℹ️ iCLIP fastq files that are already processed (demultiplexed and UMI extracted) can also be provided, using the same sample sheet format as for eCLIP/R2-CLIP/soniCLIP (with sample, fastq_1 columns) (see section eCLIP, R2-CLIP and soniCLIP).

Running the workflow

Pull the latest version of the workflow before running:

nextflow pull <repo-url>

Replace `<repo-url>` with the URL of this repository (e.g. `https://github.com/your-org/clip-seq-nf`). The examples below use a local clone. To run directly from a remote URL, replace `/path/to/workflow` with `<repo-url>`.

⚠️ Most of the examples below assume that a genome assembly config with appropriate paths and parameters exists in the genome folder and that this config is included in the Nextflow config file.

eCLIP with human genome (hg38) on SLURM using conda

See this shell script

iCLIP with human genome (hg38) on SLURM using apptainer

See this shell script

soniCLIP with human genome (hg38) on SLURM using apptainer with custom shoji parameters

See this shell script

soniCLIP without using a genome config on SLURM and conda

See this shell script

ℹ️ The shell script above shows how to use custom genome files without adding a genome config.

Output

Given below is an example output directory structure from this pipeline.

ℹ️ The output directory is defined by the Nextflow -output-dir parameter. The files in this directory are symbolic links to files in the work directory, which is defined by the Nextflow parameter -work-dir.

Directory	Sub-directory	File	Description
Annotation			Shoji annotation files
Fastq			Fastq files after trimming
	rRNA_trim		after rRNA read removal
	trim		after adapter trimming
Genome_align			Genome alignments
	alignment		bam files, alignment statistics,...
	mapped_fq		mapped reads in fq format
	multimapped_fq		multimapped reads in fq format
	unmapped_fq		un-mapped reads in fq format
Kraken2			Kraken 2 output directory
	contamination_check		Kraken2 classification files and contamination reports
QC			QC files: fastqc and multiqc files
	raw		raw data QC
	rRNA_trim		QC after rRNA read removal
	trim		QC after adapter trimming
Shoji			Shoji and related outputs
	counts		count files from `shoji count`
	matrix		Final output matrices for DEWSeq analysis
	sites		bed formatted output files from `shoji extract`
	tracks		`.bw` files for visualization
Sourmash			Sourmash files and plots
	align		for aligned reads
	kraken2		after Kraken2 contamination estimation
	raw		for raw reads
	rRNA_trim		after rRNA read trimming
	trim		after adapter trimming
Stats			Read count statistics
		all_samples_combined_stats.csv	read count statistics for all samples from raw reads to alignment, deduplication (optional) and contamination estimation
		``_all_stats.json	per sample read count statistics in json format
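
Because the files in the output directory are symbolic links into the work directory, cleaning up the work directory leaves them dangling. One way to keep a standalone copy is to dereference the links while copying. The sketch below demonstrates this with toy directories. All `demo_*` names and the file content are placeholders.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Toy stand-ins for the -work-dir and -output-dir locations.
mkdir -p demo_work/ab/123456 demo_results/Stats
echo "placeholder content" > demo_work/ab/123456/all_samples_combined_stats.csv
ln -sf "$PWD/demo_work/ab/123456/all_samples_combined_stats.csv" \
       demo_results/Stats/all_samples_combined_stats.csv

# -L follows symbolic links, so the copy holds real files rather than links.
rm -rf demo_results_archive
cp -RL demo_results demo_results_archive

# The archive stays readable even after the work directory is removed.
rm -rf demo_work
cat demo_results_archive/Stats/all_samples_combined_stats.csv
```

The dereferenced copy needs as much disk space as the real files, so it is usually made once, after the run has been checked, and before the work directory is deleted.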

Developed at: Hentze Group, EMBL Heidelberg

SEEK ID: https://workflowhub.eu/workflows/2197?version=1

Version History

main @ 771a6f9 (earliest) Created 24th Jun 2026 at 12:05 by Hentze group

Add MIT License

Frozen main 771a6f9

Creators and Submitter

Creator

Sudeep Sahadevan

Submitter

Hentze group

License

MIT License (MIT)

