(Hybrid) Metagenomics workflow
Version 3 (latest)

Workflow Type: Common Workflow Language

Work-in-progress

Workflow (hybrid) metagenomic assembly and binning

Workflow Illumina Quality:
- Sequali (control)
- hostile contamination filter
- fastp (quality trimming)
Workflow Longread Quality:
- NanoPlot (control)
- fastplong (quality trimming)
- hostile contamination filter
Kraken2 taxonomic classification of FASTQ reads
SPAdes/Flye (Assembly)
Medaka/PyPolCA (Assembly polishing)
QUAST (Assembly quality report)

(optional)

Workflow binning
- Metabat2/MaxBin2/SemiBin
- Binette
- BUSCO
- GTDB-Tk

(optional)

Workflow Genome-scale metabolic models https://workflowhub.eu/workflows/372
- CarveMe (GEM generation)
- MEMOTE (GEM test suite)
- SMETANA (Species METabolic interaction ANAlysis)

Other UNLOCK workflows on WorkflowHub: https://workflowhub.eu/projects/16/workflows?view=default

All tool CWL files and other workflows can be found here:
https://gitlab.com/m-unlock/cwl/

How to setup and use an UNLOCK workflow:
https://docs.m-unlock.nl/docs/workflows/setup.html

SEEK ID: https://workflowhub.eu/workflows/367?version=3

Inputs

ID	Name	Description	Type
identifier	Identifier	Identifier for this dataset used in this workflow (required)	string
threads	Number of threads	Number of threads to use for each computational process (default 2)	int
memory	Memory usage (MB)	Maximum memory usage in megabytes. This is mostly important for SPAdes assembly. (default 8GB)	int
illumina_forward_reads	Forward reads	Illumina Forward sequence file(s)	File[]?
illumina_reverse_reads	Reverse reads	Illumina Reverse sequence file(s)	File[]?
pacbio_reads	PacBio reads	File(s) with PacBio reads in FASTQ format	File[]?
nanopore_reads	Oxford Nanopore reads	File(s) with Oxford Nanopore reads in FASTQ format	File[]?
fastq_rich	Fastq rich (ONT)	Input fastq is generated by albacore, MinKNOW or guppy with additional information concerning channel and time. Used to create more informative quality plots (default false)	boolean
longread_minimum_length	Minimum length required	Reads shorter will be discarded. (default 100)	int?
longread_length_limit	Maximum length limit	Reads longer than length_limit will be discarded. (default no limit)	int?
longread_qualified_quality_phred	Qualified_quality_phred	The quality value at which a base counts as qualified. (default 9 means phred quality >=Q9 is qualified)	int?
longread_mean_qual	Mean quality	If one read's mean_qual quality score < mean_qual, then this read is discarded. (default 10)	int?
longread_trim_front	Trim_front	Number of bases to trim from the front of each read. (default 0)	int?
longread_trim_tail	trim_tail	Number of bases to trim from the tail of each read. (default 0)	int?
longread_trim_poly_x	Trim_poly_x	Enable polyX trimming in 3' ends. (default false)	boolean?
longread_poly_x_min_len	Poly_x_min_len	The minimum length to detect polyX in the read tail. (default 10 when trim_poly_x is true)	int?
longread_start_adapter	start_adapter	The adapter sequence at read start (5'). (default auto-detect)	string?
longread_end_adapter	End adapter	The adapter sequence at read end (3'). (default auto-detect)	string?
longread_adapter_fasta	Adapter fasta	Specify a FASTA file to trim both read ends by all the sequences in this FASTA file. (default None)	File?
longread_disable_adapter_trimming	Disable adapter trimming	Adapter trimming is enabled by default. If this option is specified, adapter trimming is disabled. (default false)	boolean?
illumina_humandb	Filter human reads	Bowtie2 index folder. Provide the folder in which the index files are located. (optional)	Directory?
longread_humandb	Filter human illumina reads	A FASTA file or a minimap2 index file (.mmi) needs to be provided. A pre-built index is much faster. (optional)	File?
illumina_reference_filter_db	Illumina reference filter db	Custom reference database for filtering with Hostile. Provide the folder in which the bowtie2 index files are located. (optional)	Directory?
longread_reference_filter_db	Longread reference filter db	A FASTA file or a minimap2 index file (.mmi) needs to be provided. A pre-built index is much faster. (optional)	File?
use_reference_mapped_reads	Keep mapped reads	Discard unmapped and keep reads mapped to the given reference. (default false (discard mapped))	boolean
keep_filtered_reads	Keep filtered reads	Keep filtered reads in the final output (default false)	boolean
deduplicate_illumina_reads	Deduplicate illumina reads	Remove exact duplicate Illumina reads with fastp (default false)	boolean
run_kraken2_illumina	Run kraken2 on Illumina reads	Run kraken2 on Illumina reads. A kraken2 database needs to be provided using the input kraken2_database. (default false)	boolean
skip_bracken	Run Bracken	Skip Bracken analysis. Illumina only. A bracken compatible kraken2 database needs to be provided using the input kraken2_database. (default false)	boolean
bracken_levels	Bracken levels	Taxonomy levels on which Bracken estimates abundances. Default runs through all of: [P,C,O,F,G,S]	string[]
illumina_read_length	Read length	Read length; currently used by Bracken only. Usually 50, 75, 100, 150, 200, 250 or 300. (default 150)	int?
kraken2_confidence	Kraken2 confidence threshold	Confidence score threshold must be in [0, 1] (default 0.0)	float?
kraken2_database	Kraken2 database	Database location of kraken2. (optional)	Directory[]?
kraken2_standard_report	Kraken2 standard report	Also output Kraken2 standard report with per read classification. These can be large. (default false)	boolean
genome_size	Genome Size	Estimated genome size (for example, 5m or 2.6g). Used in Flye. (optional)	string?
metagenome	When working with metagenomes	Metagenome option for assemblers (default true)	boolean
run_spades	Use SPAdes	Run with SPAdes assembler (default true)	boolean
only_assembler_mode_spades	Only spades assembler	Run spades in only assembler mode (without read error correction). (default false)	boolean
use_spades_scaffolds	Use SPAdes scaffolds	Use SPAdes scaffolds instead of contigs for post-processing (polishing/mapping/binning). (default false)	boolean
run_flye	Use Flye	Run with Flye assembler. Requires long reads (default false)	boolean
flye_deterministic	Deterministic Flye	Perform disjointig assembly single-threaded in Flye assembler (slower). (default false)	boolean
run_medaka	Use Medaka	Run with Medaka assembly polishing using nanopore (not pacbio) reads only. (default false)	boolean
run_pypolca	Use PyPolCA	Run with PyPolCA assembly polishing using Illumina reads only. (default false)	boolean
assembly_choice	Assembly choice	User's choice of assembly for post-assembly (binning) processes ('spades', 'flye', 'pypolca', 'medaka'). Optional. Only one choice allowed. When none is given, the first available assembly in this order is chosen: pypolca, medaka, flye, spades.	enum of: spades, flye, pypolca, medaka
output_bam_file	Output BAM file	Output BAM file of mapped reads to assembly of choice. (default false)	boolean
ont_basecall_model	ONT Basecalling model used for MEDAKA	Basecalling model that was used with guppy; used by Medaka. Default r941_min_high. Available: r103_fast_g507, r103_fast_snp_g507, r103_fast_variant_g507, r103_hac_g507, r103_hac_snp_g507, r103_hac_variant_g507, r103_min_high_g345, r103_min_high_g360, r103_prom_high_g360, r103_prom_snp_g3210, r103_prom_variant_g3210, r103_sup_g507, r103_sup_snp_g507, r103_sup_variant_g507, r1041_e82_400bps_fast_g615, r1041_e82_400bps_fast_variant_g615, r1041_e82_400bps_hac_g615, r1041_e82_400bps_hac_variant_g615, r1041_e82_400bps_sup_g615, r1041_e82_400bps_sup_variant_g615, r104_e81_fast_g5015, r104_e81_fast_variant_g5015, r104_e81_hac_g5015, r104_e81_hac_variant_g5015, r104_e81_sup_g5015, r104_e81_sup_g610, r104_e81_sup_variant_g610, r10_min_high_g303, r10_min_high_g340, r941_e81_fast_g514, r941_e81_fast_variant_g514, r941_e81_hac_g514, r941_e81_hac_variant_g514, r941_e81_sup_g514, r941_e81_sup_variant_g514, r941_min_fast_g303, r941_min_fast_g507, r941_min_fast_snp_g507, r941_min_fast_variant_g507, r941_min_hac_g507, r941_min_hac_snp_g507, r941_min_hac_variant_g507, r941_min_high_g303, r941_min_high_g330, r941_min_high_g340_rle, r941_min_high_g344, r941_min_high_g351, r941_min_high_g360, r941_min_sup_g507, r941_min_sup_snp_g507, r941_min_sup_variant_g507, r941_prom_fast_g303, r941_prom_fast_g507, r941_prom_fast_snp_g507, r941_prom_fast_variant_g507, r941_prom_hac_g507, r941_prom_hac_snp_g507, r941_prom_hac_variant_g507, r941_prom_high_g303, r941_prom_high_g330, r941_prom_high_g344, r941_prom_high_g360, r941_prom_high_g4011, r941_prom_snp_g303, r941_prom_snp_g322, r941_prom_snp_g360, r941_prom_sup_g507, r941_prom_sup_snp_g507, r941_prom_sup_variant_g507, r941_prom_variant_g303, r941_prom_variant_g322, r941_prom_variant_g360, r941_sup_plant_g610, r941_sup_plant_variant_g610 (required for Medaka)	string?
binning	Run binning workflow	Run with contig binning workflow (default false)	boolean
run_maxbin2	Run Maxbin2	Run with MaxBin2 binner. (default true)	boolean
run_semibin2	Run SemiBin	Run with SemiBin2 binner. (default true)	boolean
semibin2_environment	SemiBin Environment	SemiBin2 built-in models (none/global/human_gut/dog_gut/ocean/soil/cat_gut/human_oral/mouse_gut/pig_gut/built_environment/wastewater/chicken_caecum). Choosing a built-in model is generally faster. Otherwise it will do (single-sample) training on the data. Default global. Choose none if you want to do training on your own data.	enum of: none, global, human_gut, dog_gut, ocean, soil, cat_gut, human_oral, mouse_gut, pig_gut, built_environment, wastewater, chicken_caecum
gtdbtk_data	gtdbtk data directory	Directory containing the GTDBTK repository	Directory?
busco_data	BUSCO dataset	Path to the BUSCO dataset downloaded location. (optional)	Directory?
annotate_bins	Annotate bins	Annotate bins. (default false)	boolean
annotate_unbinned	Annotate unbinned	Annotate unbinned contigs. Will be treated as metagenome. (default false)	boolean
bakta_db	Bakta DB	Bakta Database directory. Default is built-in bakta-light db. (optional)	Directory?
skip_bakta_crispr	Skip bakta CRISPR	Skip bakta CRISPR array prediction using PILER-CR. (default false)	boolean
interproscan_directory	InterProScan 5 directory	Directory of the (full) InterProScan 5 program. Used for annotating bins. (optional)	Directory?
eggnog_dbs	n/a	n/a	record containing Directory? File? File?
run_kofamscan	Run kofamscan	Run with KEGG KO KoFamKOALA annotation. (default false)	boolean
kofamscan_limit_sapp	SAPP kofamscan limit	Limit max number of entries of kofamscan hits per locus in SAPP. (default 5)	int?
run_eggnog	Run eggNOG-mapper	Run with eggNOG-mapper annotation. Requires eggnog database files. (default false)	boolean
run_interproscan	Run InterProScan	Run with InterProScan annotation. Requires InterProScan v5 program files. (default false)	boolean
interproscan_applications	InterProScan applications	Comma separated list of analyses: FunFam,SFLD,PANTHER,Gene3D,Hamap,PRINTS,ProSiteProfiles,Coils,SUPERFAMILY,SMART,CDD,PIRSR,ProSitePatterns,AntiFam,Pfam,MobiDBLite,PIRSF,NCBIfam. (default Pfam,SFLD,SMART,AntiFam,NCBIfam)	string
destination	Output Destination	Optional output destination only used for cwl-prov reporting.	string?
source	Input URLs used for this run	A provenance element to capture the original source of the input data	string[]?
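
The inputs above are supplied to a CWL runner as a job object (JSON or YAML). The following is a minimal sketch, not the project's own tooling: it builds an Illumina-only job object from input IDs in the table and checks a few dependencies that the descriptions state (Flye needs long reads, Medaka needs nanopore reads and a basecall model, Kraken2 needs a database). The helper names `cwl_file`, `build_job` and `check_job`, the file names, the memory value of 8000 MB, and the rule that forward and reverse lists have equal length are all assumptions made for illustration. Inputs left out are assumed to fall back to the defaults listed in the table.

```python
import json


def cwl_file(path):
    """Return a CWL File input object for a local path."""
    return {"class": "File", "path": path}


def build_job(identifier, forward, reverse, threads=2, memory=8000):
    """Build a minimal Illumina-only job object (hypothetical helper)."""
    # Assumption: paired files are given in matching order and number.
    if len(forward) != len(reverse):
        raise ValueError("forward and reverse read lists differ in length")
    return {
        "identifier": identifier,
        "threads": threads,
        "memory": memory,  # megabytes; 8000 is an assumed stand-in for "8GB"
        "illumina_forward_reads": [cwl_file(p) for p in forward],
        "illumina_reverse_reads": [cwl_file(p) for p in reverse],
        "run_spades": True,
        "binning": False,
    }


def check_job(job):
    """Return a list of problems based on requirements stated in the table."""
    problems = []
    has_nanopore = bool(job.get("nanopore_reads"))
    has_long = has_nanopore or bool(job.get("pacbio_reads"))
    if job.get("run_flye") and not has_long:
        problems.append("run_flye requires long reads")
    if job.get("run_medaka"):
        if not has_nanopore:
            problems.append("run_medaka requires nanopore_reads")
        if not job.get("ont_basecall_model"):
            problems.append("run_medaka requires ont_basecall_model")
    if job.get("run_kraken2_illumina") and not job.get("kraken2_database"):
        problems.append("run_kraken2_illumina requires kraken2_database")
    return problems


# Hypothetical sample name and file names.
job = build_job("sample1", ["sample1_R1.fastq.gz"], ["sample1_R2.fastq.gz"])
job_text = json.dumps(job, indent=2)
print(job_text)
print(check_job(job))
```

The printed JSON can be saved to a file and passed to a CWL runner together with the workflow file; see the setup documentation linked above for the supported way to run UNLOCK workflows.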

Steps

ID	Name	Description
workflow_quality_illumina	Illumina quality workflow	Quality, filtering and taxonomic classification workflow for Illumina reads
workflow_quality_nanopore	Oxford Nanopore quality workflow	Quality, filtering and taxonomic classification workflow for Oxford Nanopore reads
workflow_quality_pacbio	PacBio quality and filtering workflow	Quality, filtering and taxonomic classification for PacBio reads
workflow_kraken2_illumina	Kraken2 illumina	Taxonomic classification using kraken2 Illumina reads
spades	SPAdes assembly	Genome assembly using SPAdes with Illumina and/or long reads
spades_assembly	SPAdes contigs or scaffolds	Get chosen spades assembly. Contigs or scaffolds
compress_spades	SPAdes compressed	Compress the large Spades assembly output files
flye	Flye assembly	De novo assembly of single-molecule reads with Flye
medaka	Medaka polishing of assembly	Medaka for (ont reads) polishing of an assembled (flye) genome
workflow_pypolca	Run PyPolCA assembly polishing	PyPolCA polishing of long-read assembly with Illumina reads
get_assembly_to_use	Assembly choice	Get assembly choice
assembly_read_mapping_illumina	Minimap2	Illumina read mapping using Minimap2 on assembled scaffolds
contig_read_counts	Samtools idxstats	Reports alignment summary statistics
workflow_binning	Binning workflow	Binning workflow to create bins
keep_readfilter_files_to_folder	Read filtering output folder	Preparation of read filtering output files to a specific output folder
readfilter_files_to_folder	Read filtering output folder	Preparation of read filtering reports to a specific output folder
spades_files_to_folder	SPADES output to folder	Preparation of SPAdes output files to a specific output folder
flye_files_to_folder	Flye output folder	Preparation of Flye output files to a specific output folder
medaka_files_to_folder	Medaka output folder	Preparation of Medaka output files to a specific output folder
pypolca_files_to_folder	PyPolca output folder	Preparation of PyPolCA output files to a specific output folder
output_bamfile	Output bam file	Step that outputs the BAM file when the output_bam_file option is enabled.
assembly_files_to_folder	Assembly output folder	Preparation of assembly output files to a specific output folder
binning_files_to_folder	Binning output to folder	Preparation of binning output files and folders to a specific output folder
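
The selection rule behind the get_assembly_to_use step is described in the assembly_choice input: an explicit choice wins, otherwise the first available assembly in the order pypolca, medaka, flye, spades is used. The sketch below restates that rule in Python for clarity; it is not the workflow's actual implementation. The function name `choose_assembly` is hypothetical, and raising an error when the requested assembly was not produced is an assumption, since the source does not say what happens in that case.

```python
# Fallback order as stated in the assembly_choice description.
FALLBACK_ORDER = ["pypolca", "medaka", "flye", "spades"]


def choose_assembly(available, assembly_choice=None):
    """Pick the assembly passed on to post-assembly (binning) steps.

    available: set of assembly names that were actually produced.
    assembly_choice: optional explicit user choice.
    """
    if assembly_choice is not None:
        if assembly_choice not in FALLBACK_ORDER:
            raise ValueError(f"unknown assembly: {assembly_choice}")
        # Assumption: an unavailable explicit choice is treated as an error.
        if assembly_choice not in available:
            raise ValueError(f"assembly not available: {assembly_choice}")
        return assembly_choice
    for name in FALLBACK_ORDER:
        if name in available:
            return name
    raise ValueError("no assembly available")


print(choose_assembly({"spades", "flye"}))
print(choose_assembly({"spades", "flye", "medaka"}, "spades"))
```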

Outputs

ID	Name	Description	Type
read_filtering_output_keep	Read filtering output	Read filtering stats + filtered reads	Directory?
read_filtering_output	Read filtering output	Read filtering stats	Directory?
assembly_output	Assembly output	Output from different assembly steps	Directory
binning_output	Binning output	Binning output folders	Directory?
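
CWL runners report results as an output object keyed by the output IDs above, with optional outputs (type `Directory?`) set to null when they were not produced. As a rough sketch of how such an object could be inspected, the code below lists the directories that are present. The function name `produced_directories` and the example object, including its locations, are made up for illustration; the exact fields present depend on the runner.

```python
import json

# Output IDs from the table above.
OUTPUT_IDS = [
    "read_filtering_output_keep",
    "read_filtering_output",
    "assembly_output",
    "binning_output",
]


def produced_directories(output_object):
    """Map each produced output ID to its directory location.

    Optional outputs that are missing or null are skipped.
    """
    found = {}
    for output_id in OUTPUT_IDS:
        entry = output_object.get(output_id)
        if entry is None:
            continue
        found[output_id] = entry.get("location")
    return found


# Hypothetical output object for a run without binning or kept reads.
example = json.loads("""
{
  "read_filtering_output_keep": null,
  "read_filtering_output": {
    "class": "Directory",
    "location": "file:///tmp/run/read_filtering"
  },
  "assembly_output": {
    "class": "Directory",
    "location": "file:///tmp/run/assembly"
  },
  "binning_output": null
}
""")

produced = produced_directories(example)
print(produced)
```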