Long Read WGS pipeline
Version 1

Workflow Type: Common Workflow Language

Work-in-progress

Workflow for long read quality control, contamination filtering, assembly, variant calling and annotation.

Steps:

Preprocessing of reference file: https://workflowhub.eu/workflows/1818
LongReadSum before and after filtering (read quality control)
Filtlong filter on quality and length
Flye assembly
Minimap2 mapping of reads and assembly
Clair3 variant calling of reads
Freebayes variant calling of assembly
Optional Bakta annotation of genomes with no reference
SnpEff building or downloading of a database
SnpEff functional annotation
Liftoff annotation lift over

All tool CWL files and other workflows can be found here:
Tools: https://git.wur.nl/ssb/automated-data-analysis/cwl/-/tree/main/tools
Workflows: https://git.wur.nl/ssb/automated-data-analysis/cwl/-/tree/main/workflows

SEEK ID: https://workflowhub.eu/workflows/1868?version=1

Inputs

ID	Name	Description	Type
NCBI_identifier	NCBI genome identifier	NCBI identifier of a genome, from which SnpEff extracts a GenBank file and builds a custom database.	string
annotation_file	annotation file to lift over	GFF or GTF file containing the annotations to lift over.	File
bakta_db	Bakta DB	Bakta database directory (default bakta-db_v5.1-light built in the container).	Directory
coverage_threshold	assembly coverage	Reduced coverage for the initial disjointig assembly. If set, Flye will downsample the reads to the specified coverage before assembly. Useful for high-coverage datasets to reduce memory usage. If not set, Flye will use all available reads.	int
dummy_annotation_file	n/a	n/a	File
dummy_database_folder	n/a	n/a	Directory
genome_size	Genome size	Estimated genome size (for example, 5m or 2.6g).	string
haploid_sensitive	haploid calling mode	Set to true to enable haploid calling mode. This is an experimental flag.	boolean
include_assembly	include assembly	Will include mapping and variant calling of the assembly in the pipeline, default is true.	boolean
include_reads	include filtered reads	Will include mapping and variant calling of the filtered reads in the pipeline, default is true.	boolean
include_snpeff	include SnpEff	Will include functional interpretation of variants with SnpEff in the pipeline, default is true.	boolean
include_strainy	include strainy	Will include strain level analysis on the filtered reads, default is false.	boolean
input_read	long reads input	Long read sequence file in FASTQ format.	File
input_type	input file type	Acceptable input types: fa (FASTA file), fq (FASTQ file), f5 (FAST5 file), f5s (FAST5 file with signal statistics output), seqtxt (sequencing_summary.txt), bam (BAM file), rrms (RRMS BAM file). Defaults to fq in this workflow.	enum of: #main/input_type/fa, #main/input_type/fq, #main/input_type/f5, #main/input_type/f5s, #main/input_type/seqtxt, #main/input_type/bam, #main/input_type/rrms
keep_percent	Keep percentage	Keep only this percentage of the best reads, measured by bases (default 90).	float
length_weight	Length weight	Weight given to the length score (default 10).	float
log_level	level of logging	Logging level (1: DEBUG, 2: INFO, 3: WARNING, 4: ERROR, 5: CRITICAL), defaults to 2.	int
maximum_length	maximum length	Maximum read length threshold.	int
merging_script	merging script	Python script that merges the VCF output of Clair3 and freebayes. Passed externally within the git structure to avoid having to host a new Python Docker image.	File
min_alt_count	min_alt_count	Require at least this count of observations supporting an alternate allele. Defaults to 1 in this pipeline.	int
min_mean_q	minimum mean quality	Minimum mean quality threshold.	float
min_window_q	minimum window quality	Minimum window quality threshold.	float
minimum_length	Minimum read length	Minimum read length threshold (default 1000).	int
model_path	Clair3 Model Directory	Path to the Clair3 model inside the Docker container.	string
ncbi_data_exists	existing NCBI data	The genome used has an existing NCBI identifier. Instead of annotating genes, the GenBank file from NCBI will be used to build a database.	boolean
no_downstream	no downstream changes	Set to true to omit downstream changes.	boolean
no_phasing_for_fa	no phasing in full alignment	Set to true to skip whatshap phasing in full alignment. This is an experimental flag.	boolean
no_upstream	no upstream changes	Set to true to omit upstream changes.	boolean
plasmids	plasmid file(s)	Input plasmid GenBank files, which will be merged with the reference.	array containing File
ploidy	ploidy settings	Settings of the ploidy, for haploid organisms, set to 1 (default).	int
provenance	include provenance information	Will include metadata on tool performance of LongReadSum, Filtlong, and Flye, default is true.	boolean
readtype	read type	Type of read, e.g. PacBio or Nanopore. Used for naming output files. Defaults to Nanopore for this workflow; other read types are untested.	string
reference_gb	reference GenBank file	Reference file in GenBank format. If not provided, an NCBI identifier is required.	File
sample_name	sample name	Sample name, by default is extracted from the file input. Used as output names for LongReadSum, Filtlong, and minimap2.	string
seed	random seed	Random seed for reproducibility; use the same seed number across runs. Default is 1.	int
skip_qc_filtered	skip LongReadSum after filtering	Skip LongReadSum analyses of filtered input data, default is false.	boolean
skip_qc_unfiltered	skip LongReadSum before filtering	Skip LongReadSum analyses of unfiltered input data, default is false.	boolean
snpeff_database_exists	existing SnpEff database	The genome used has an existing database within SnpEff. Instead of building a database, the existing database will be downloaded. Default is false.	boolean
snpeff_genome	genome/database identifier	Identifier for the SnpEff database to download or build (e.g. 'GRCh37.75' for human, or a custom name for microbial strains).	string
target_bases	target bases	Keep only the best reads up to this many total bases.	int
threads	Number of threads	Number of threads to use for computational processes.	int
transfer_annotation	transfer annotation	Whether the annotation of the reference should be carried over to the new assembly (use Liftoff), default is false.	boolean
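
The input IDs above map directly onto a CWL input object (job file), which CWL runners accept as JSON or YAML. The following is a minimal sketch of building such an object in Python. The file paths, the helper names `cwl_file` and `build_job`, and the selection of inputs are assumptions for illustration; which inputs the workflow actually requires is not stated on this page.

```python
import json


def cwl_file(path):
    """Return a CWL File object for a local path."""
    return {"class": "File", "path": str(path)}


def build_job(reads, reference_gb, merging_script, threads=4):
    """Assemble an input object keyed by IDs from the Inputs table.

    All paths and values are hypothetical placeholders, not tested settings.
    """
    return {
        "input_read": cwl_file(reads),
        "reference_gb": cwl_file(reference_gb),
        "merging_script": cwl_file(merging_script),
        "input_type": "fq",        # enum symbol, the documented default
        "readtype": "Nanopore",    # documented default
        "genome_size": "5m",       # example value from the description
        "minimum_length": 1000,    # documented default
        "ploidy": 1,               # documented default for haploid organisms
        "threads": threads,
        "include_reads": True,
        "include_assembly": True,
        "include_snpeff": True,
    }


job = build_job("reads.fastq", "reference.gbk", "merge_vcfs.py")
job_json = json.dumps(job, indent=2)
print(job_json)
```

The printed JSON can be saved to a file and passed to a CWL runner together with the workflow file, for example as the second positional argument to cwltool.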

Steps

ID	Name	Description
bakta	bakta genome annotation	Bacterial genome annotation, only runs when no reference (genbank file(s) or NCBI identifier) is supplied.
clair3	Clair3 variant calling	Variant calling of filtered reads with Clair3 using input models.
filtlong	long read filtering	Filter long reads based on set parameters.
filtlong_files_to_folder	Filtlong folder	Preparation of Filtlong output files to a specific output folder.
flye	Flye assembly	De novo assembly of single-molecule reads with Flye.
flye_files_to_folder	Flye output folder	Preparation of Flye output files to a specific output folder.
freebayes	FreeBayes variant calling	Variant calling of assembly with FreeBayes.
liftoff	Liftoff annotation lift over	Lifting over annotations from reference to assembly.
liftoff_files_to_folder	liftoff assembly output folder	Preparation of Liftoff output files to a specific output folder.
longreadsum_filtered	LongReadSum filtered	LongReadSum Quality assessment of reads after filtering.
longreadsum_unfiltered	LongReadSum unfiltered	LongReadSum Quality assessment of reads prior to filtering.
merging_vcfs	merging vcf files	Merging the VCF output from Clair3 and freebayes.
minimap2_assembly	Minimap2 assembly mapping	Mapping of the assembly using Minimap2.
minimap2_reads	Minimap2 read mapping	Read mapping of filtered reads using Minimap2.
preprocess_reference	plasmid preprocessing	Pre-processing of reference, merging reference with optional plasmid input and extracting GenBank, GFF3 and FASTA files.
provenance_files_to_folder	provenance output folder	Preparation of provenance output files to a specific output folder.
quast	QUAST quality assessment	Quality assessment of assembly with QUAST.
samtools_assembly_index	samtools index assembly	Indexing of assembly BAM file with samtools index.
samtools_faidx_assembly	samtools faidx assembly	Indexing of FASTA file with samtools faidx.
samtools_faidx_reads	samtools faidx	Indexing of FASTA file with samtools faidx.
samtools_reads_index	samtools index reads	Indexing of reads BAM file with samtools index.
snpeff_assembly	SnpEff assembly	Running SnpEff on the assembly variant output of freebayes.
snpeff_assembly_files_to_folder	SnpEff assembly output folder	Preparation of SnpEff assembly output files to a specific output folder.
snpeff_build	SnpEff database building	Building of a custom SnpEff database under the given genome name.
snpeff_download	SnpEff database downloading	Downloading of a SnpEff database based on the genome name within the database.
snpeff_merged	SnpEff merged	Running SnpEff on the merged variant output of both Clair3 and freebayes.
snpeff_merged_files_to_folder	SnpEff merged output folder	Preparation of SnpEff merged output files to a specific output folder.
snpeff_reads	SnpEff reads	Running SnpEff on the reads variant output of Clair3.
snpeff_reads_files_to_folder	SnpEff reads output folder	Preparation of SnpEff reads output files to a specific output folder.
strainy	Strainy strain level analysis	Strain level analysis on assembled reads. Produces multi-allelic phasing, individual haplotypes and strain-specific variant calls.
unzip	unzipping clair3	Unzipping Clair3 VCF file.
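
Several of the steps above are optional and depend on the boolean inputs. The sketch below shows one reading of how the documented flags and defaults could select the optional steps. It is based only on the descriptions on this page, not on the workflow's CWL source, so the actual conditions may differ; the function name `plan_optional_steps` is hypothetical.

```python
# Defaults as documented in the Inputs table.
DEFAULTS = {
    "include_reads": True,
    "include_assembly": True,
    "include_snpeff": True,
    "include_strainy": False,
    "transfer_annotation": False,
    "skip_qc_unfiltered": False,
    "skip_qc_filtered": False,
    "snpeff_database_exists": False,
}


def plan_optional_steps(inputs):
    """Return the optional step IDs that the given inputs would enable.

    Assumption: each flag gates exactly the steps named in its description.
    """
    cfg = {**DEFAULTS, **inputs}
    # bakta is described as running only when no reference is supplied.
    has_reference = bool(cfg.get("reference_gb") or cfg.get("NCBI_identifier"))

    steps = []
    if not cfg["skip_qc_unfiltered"]:
        steps.append("longreadsum_unfiltered")
    if not cfg["skip_qc_filtered"]:
        steps.append("longreadsum_filtered")
    if not has_reference:
        steps.append("bakta")
    if cfg["include_reads"]:
        steps += ["minimap2_reads", "clair3"]
    if cfg["include_assembly"]:
        steps += ["minimap2_assembly", "freebayes"]
    if cfg["include_strainy"]:
        steps.append("strainy")
    if cfg["transfer_annotation"]:
        steps.append("liftoff")
    if cfg["include_snpeff"]:
        # Download an existing database, otherwise build one.
        steps.append(
            "snpeff_download" if cfg["snpeff_database_exists"] else "snpeff_build"
        )
    return steps


print(plan_optional_steps({"reference_gb": "reference.gbk"}))
```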

Outputs

ID	Name	Description	Type
assembly__fasta_index_out	indexed reference	Indexed reference FASTA file.	File
assembly_bam_index_out	indexed mapped assembly	Indexed mapped assembly.	File
bakta_outdir	bakta folder	Folder with bakta output files.	Directory
clair3_outdir	Clair3 output directory	Clair3 output directory containing the vcf file.	Directory
clair3_vcf	Clair3 output file	Output variant file from Clair3.	File
filtlong_outdir	Filtlong folder	Folder with Filtlong output files.	Directory
flye_outdir	Flye folder	Folder with Flye output files.	Directory
freebayes_output	freebayes output file	Output variant file from freebayes.	File
liftoff_outdir	Liftoff folder	Folder with liftoff output files.	Directory
logs_outdir	logs folder	Folder with provenance information.	Directory
longreadsum_filtered_outdir	LongReadSum folder 2	Folder with LongReadSum output files for the filtered reads.	Directory
longreadsum_unfiltered_outdir	LongReadSum folder	Folder with LongReadSum output files for the unfiltered reads.	Directory
merged_output	merged output file	Merged output variant file from both Clair3 and freebayes.	File
minimap2_assembly_bam	mapped assembly	Assembly mapped by minimap2.	File
minimap2_reads_bam	mapped reads	Filtered reads mapped by minimap2.	File
preprocessed_fasta	preprocessed FASTA file	The preprocessed FASTA file. This file is extracted from the preprocessed GenBank file.	File
preprocessed_genbank	preprocessed GenBank file	The preprocessed GenBank file. This file only differs from the input GenBank file (if provided) when plasmids are included.	File
preprocessed_gff3	preprocessed GFF3 file	The preprocessed GFF3 file. This file is extracted from the preprocessed GenBank file.	File
quast_outdir	QUAST folder	Folder with QUAST output files.	Directory
reads_bam_index_out	indexed mapped reads	Indexed filtered mapped reads.	File
reads_fasta_index_out	indexed reference	Indexed reference FASTA file.	File
snpeff_assembly_outdir	SnpEff assembly folder	Folder with SnpEff assembly output files.	Directory
snpeff_merged_outdir	SnpEff merged folder	Folder with SnpEff merged output files.	Directory
snpeff_reads_outdir	SnpEff reads folder	Folder with SnpEff reads output files.	Directory
strainy_outdir	strainy folder	Folder with strainy output files.	Directory
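
A CWL runner reports the outputs above as an output object keyed by output ID, in which File and Directory values carry a `class` and a `location` (and often a `path`). The sketch below summarizes such an object. The example object and its paths are invented for illustration, and the helper name `summarize_outputs` is hypothetical; outputs of steps that did not run are assumed to be reported as null.

```python
def summarize_outputs(output_object):
    """Map each output ID to (class, path or location), or None if absent."""
    summary = {}
    for output_id, value in output_object.items():
        if isinstance(value, dict) and value.get("class") in ("File", "Directory"):
            where = value.get("path") or value.get("location")
            summary[output_id] = (value["class"], where)
        else:
            # Optional outputs of skipped steps are assumed to be null.
            summary[output_id] = None
    return summary


# Hypothetical excerpt of an output object; paths are placeholders.
example = {
    "merged_output": {"class": "File", "location": "file:///out/merged.vcf"},
    "flye_outdir": {"class": "Directory", "path": "/out/flye"},
    "strainy_outdir": None,
}

for output_id, entry in summarize_outputs(example).items():
    print(output_id, entry)
```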