Introduction

nf-CBRA-snvs (nf-core - CIBERER Bioinformatics for Rare diseases Analysis - Small Nucleotide Variant) is a workflow optimized for the analysis of rare diseases, designed to detect SNVs and INDELs in targeted sequencing data (CES/WES) as well as in whole-genome sequencing (WGS) data.

This pipeline is built with Nextflow, a workflow management system that enables easy execution across various computing environments. It uses Docker or Singularity containers, which simplifies setup and makes results reproducible. Each process is assigned its own container, so software dependencies are easier to manage and update. When possible, processes are sourced from nf-core/modules, promoting reusability across all nf-core pipelines and contributing to the broader Nextflow community.

Pipeline summary

The pipeline can perform the following steps:

Mapping of the reads to the reference genome (BWA-MEM)
Process BAM file (GATK MarkDuplicates, GATK BaseRecalibrator and GATK ApplyBQSR)
Variant calling with the following tools:
- GATK4 HaplotypeCaller (run_gatk = true). This subworkflow includes:
  - GATK4 HaplotypeCaller.
  - Hard filters and VariantFiltration to mark PASS variants. More information here.
  - Bcftools Filter to keep PASS variants on chr1-22, X, Y.
  - Split multiallelic variants.
- Dragen (run_dragen = true). This subworkflow includes:
  - GATK4 CalibrateDragstrModel.
  - GATK4 HaplotypeCaller with --dragen-mode.
  - VariantFiltration with --filter-expression "QUAL < 10.4139" --filter-name "DRAGENHardQUAL" to mark PASS variants. More information here.
  - Bcftools Filter to keep PASS variants on chr1-22, X, Y.
  - Split multiallelic variants.
- DeepVariant (run_deepvariant = true). This subworkflow includes:
  - DeepVariant makeexamples: Converts the input alignment file to a tfrecord format suitable for the deep learning model.
  - DeepVariant callvariants: Calls variants based on the input tfrecords. The output is also in tfrecord format and needs postprocessing to convert it to VCF.
  - DeepVariant postprocessvariants: Converts variant calls from callvariants to VCF, and also creates GVCF files based on genomic information from makeexamples. More information here.
  - Bcftools Filter to keep PASS variants on chr1-22, X, Y.
  - Split multiallelic variants.
Merging and integration of the VCFs obtained with the different tools.
Annotation of the variants:
- Regions of homozygosity (ROHs) with AUTOMAP
- Effect of the variants with Ensembl VEP using the flag --everything, which includes the following options: --sift b, --polyphen b, --ccds, --hgvs, --symbol, --numbers, --domains, --regulatory, --canonical, --protein, --biotype, --af, --af_1kg, --af_esp, --af_gnomade, --af_gnomadg, --max_af, --pubmed, --uniprot, --mane, --tsl, --appris, --variant_class, --gene_phenotype, --mirna
- Postvep formats the VEP tab-delimited output and filters variants by minor allele frequency (--maf).
- You can enhance the annotation by incorporating gene rankings from GLOWgenes, a network-based algorithm developed to prioritize novel candidate genes associated with rare diseases. Precomputed rankings based on PanelApp gene panels are available here. To include a specific GLOWgenes ranking, use the option --glowgenes_panel (path to the panel.txt), for example: --glowgenes_panel https://raw.githubusercontent.com/TBLabFJD/GLOWgenes/refs/heads/master/precomputed_panelAPP/GLOWgenes_prioritization_Neurological_ciliopathies_GA.txt. Additionally, you can include the Gene-Disease Specificity Score (SGDS) using: --glowgenes_sgds https://raw.githubusercontent.com/TBLabFJD/GLOWgenes/refs/heads/master/SGDS.csv. This score ranges from 0 to 1, where 1 indicates a gene ranks highly for only a few specific diseases (high specificity), and 0 indicates the gene consistently ranks highly across many diseases (low specificity).
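Each variant-calling subworkflow above ends by keeping only PASS variants on chr1-22, X, Y. The pipeline does this with Bcftools Filter; the sketch below only illustrates the selection logic in plain Python on a toy VCF. The function names and the toy records are hypothetical, and accepting contig names both with and without the "chr" prefix is an assumption, not something the pipeline documentation states.

```python
# Illustration only: the pipeline itself uses Bcftools Filter for this step.
CANONICAL = {str(n) for n in range(1, 23)} | {"X", "Y"}


def is_canonical(chrom):
    """True for chromosomes 1-22, X, Y (assumption: 'chr' prefix is optional)."""
    name = chrom[3:] if chrom.startswith("chr") else chrom
    return name in CANONICAL


def keep_pass_canonical(vcf_lines):
    """Yield header lines plus records with FILTER == PASS on canonical contigs."""
    for line in vcf_lines:
        if line.startswith("#"):
            yield line  # header lines are passed through unchanged
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 7:
            continue  # skip blank or malformed lines
        chrom, filter_value = fields[0], fields[6]
        if filter_value == "PASS" and is_canonical(chrom):
            yield line


# Toy records (hypothetical data, not pipeline output).
toy_vcf = [
    "##fileformat=VCFv4.2\n",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n",
    "chr1\t100\t.\tA\tG\t50\tPASS\t.\n",
    "chr1\t200\t.\tC\tT\t5\tDRAGENHardQUAL\t.\n",
    "chrM\t300\t.\tG\tA\t60\tPASS\t.\n",
    "chrX\t400\t.\tT\tC\t70\tPASS\t.\n",
]

kept = [l for l in keep_pass_canonical(toy_vcf) if not l.startswith("#")]
kept_sites = [tuple(l.split("\t")[:2]) for l in kept]
print(kept_sites)
```

In this toy input, the record failing the DRAGENHardQUAL filter and the PASS record on chrM are both dropped, while the PASS records on chr1 and chrX are kept.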

Usage

First, prepare a samplesheet with your input data:

sample,fastq_1,fastq_2
SAMPLE_PAIRED_END,/path/to/fastq/files/AEG588A1_S1_L002_R1_001.fastq.gz,/path/to/fastq/files/AEG588A1_S1_L002_R2_001.fastq.gz

Each row represents one sample and its pair of paired-end FASTQ files.
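Before launching a run, it can help to sanity-check the samplesheet. The following is a minimal sketch, not part of the pipeline: check_samplesheet is a hypothetical helper, and the requirement that FASTQ paths end in .fastq.gz or .fq.gz is an assumption based on the example row above. The pipeline's own validation may differ.

```python
import csv
import io

REQUIRED_COLUMNS = ["sample", "fastq_1", "fastq_2"]
# Assumption: gzip-compressed FASTQ files, as in the example row.
FASTQ_SUFFIXES = (".fastq.gz", ".fq.gz")


def check_samplesheet(handle):
    """Return a list of problems; an empty list means no problem was found."""
    reader = csv.DictReader(handle)
    header = reader.fieldnames or []
    missing = [c for c in REQUIRED_COLUMNS if c not in header]
    if missing:
        return ["missing column(s): " + ", ".join(missing)]

    problems = []
    # Data rows start on line 2 because line 1 is the header.
    for line_number, row in enumerate(reader, start=2):
        for column in REQUIRED_COLUMNS:
            value = (row[column] or "").strip()
            if not value:
                problems.append(f"line {line_number}: empty {column}")
            elif column != "sample" and not value.endswith(FASTQ_SUFFIXES):
                problems.append(f"line {line_number}: unexpected extension in {column}")
    return problems


example = io.StringIO(
    "sample,fastq_1,fastq_2\n"
    "SAMPLE_PAIRED_END,"
    "/path/to/fastq/files/AEG588A1_S1_L002_R1_001.fastq.gz,"
    "/path/to/fastq/files/AEG588A1_S1_L002_R2_001.fastq.gz\n"
)
print(check_samplesheet(example))
```

The check only inspects the text of the samplesheet; it does not verify that the files exist on disk.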

You can run the pipeline using:

nextflow run nf-cbra-snvs/main.nf \
   -profile <docker/singularity> \
   --input samplesheet.csv \
   --outdir <OUTDIR>
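When launching runs from a script, the command line can be assembled programmatically. The sketch below builds the argument list without executing anything. build_command is a hypothetical helper, "results" is a placeholder output directory, and passing the caller switches as --run_gatk true (and similar) assumes that run_gatk, run_dragen and run_deepvariant are ordinary pipeline parameters that can be set on the command line.

```python
import shlex

# Caller switches named in the pipeline summary (run_gatk, run_dragen, run_deepvariant).
KNOWN_CALLERS = ("gatk", "dragen", "deepvariant")


def build_command(profile, samplesheet, outdir, callers=()):
    """Build the argument list for a pipeline run; nothing is executed here."""
    argv = [
        "nextflow", "run", "nf-cbra-snvs/main.nf",
        "-profile", profile,
        "--input", samplesheet,
        "--outdir", outdir,
    ]
    for caller in callers:
        if caller not in KNOWN_CALLERS:
            raise ValueError(f"unknown caller: {caller}")
        # Assumption: the switches are regular params settable as --run_<caller> true.
        argv += [f"--run_{caller}", "true"]
    return argv


command = build_command("docker", "samplesheet.csv", "results", callers=["gatk", "deepvariant"])
print(shlex.join(command))
```

The resulting list can be handed to subprocess.run once the profile and output directory match your environment.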

For more details and further functionality, please refer to the usage documentation.

Pipeline output

For details about the output files and reports, please refer to the output documentation.

Credits

nf-CBRA-snvs was developed within the framework of a call for intramural cooperative and complementary actions (ACCI) funded by CIBERER (Biomedical Research Network Centre for Rare Diseases).

Main Developer