Workflow Type: Common Workflow Language
Work-in-progress

Workflow for (hybrid) metagenomic assembly and binning

  • Workflow Illumina Quality:
    • Sequali (control)
    • hostile contamination filter
    • fastp (quality trimming)
  • Workflow Longread Quality:
    • NanoPlot (control)
    • fastplong (quality trimming)
    • hostile contamination filter
  • Kraken2 taxonomic classification of FASTQ reads
  • SPAdes/Flye (Assembly)
  • Medaka/PyPolCA (Assembly polishing)
  • QUAST (Assembly quality report)

(optional)

  • Workflow binning
    • Metabat2/MaxBin2/SemiBin
    • Binette
    • BUSCO
    • GTDB-Tk

(optional)

Other UNLOCK workflows on WorkflowHub: https://workflowhub.eu/projects/16/workflows?view=default

All tool CWL files and other workflows can be found here:
https://gitlab.com/m-unlock/cwl/

How to setup and use an UNLOCK workflow:
https://docs.m-unlock.nl/docs/workflows/setup.html


Inputs

ID Name Description Type
identifier Identifier Identifier for this dataset used in this workflow (required)
  • string
threads Number of threads Number of threads to use for each computational process (default 2)
  • int
memory Memory usage (MB) Maximum memory usage in megabytes. This is mostly important for SPAdes assembly. (default 8GB)
  • int
illumina_forward_reads Forward reads Illumina Forward sequence file(s)
  • File[]?
illumina_reverse_reads Reverse reads Illumina Reverse sequence file(s)
  • File[]?
pacbio_reads PacBio reads File(s) with PacBio reads in FASTQ format
  • File[]?
nanopore_reads Oxford Nanopore reads File(s) with Oxford Nanopore reads in FASTQ format
  • File[]?
fastq_rich Fastq rich (ONT) Input fastq is generated by albacore, MinKNOW or guppy with additional information concerning channel and time. Used to create more informative quality plots (default false)
  • boolean
longread_minimum_length Minimum length required Reads shorter than this length will be discarded. (default 100)
  • int?
longread_length_limit Maximum length limit Reads longer than length_limit will be discarded. (default no limit)
  • int?
longread_qualified_quality_phred Qualified_quality_phred The quality value at which a base counts as qualified. (default 9, meaning phred quality >=Q9 is qualified)
  • int?
longread_mean_qual Mean quality If a read's mean quality score is below mean_qual, the read is discarded. (default 10)
  • int?
longread_trim_front Trim_front Number of bases to trim from the front of each read. (default 0)
  • int?
longread_trim_tail trim_tail Number of bases to trim from the tail of each read. (default 0)
  • int?
longread_trim_poly_x Trim_poly_x Enable polyX trimming in 3' ends. (default false)
  • boolean?
longread_poly_x_min_len Poly_x_min_len The minimum length to detect polyX in the read tail. (default 10 when trim_poly_x is true)
  • int?
longread_start_adapter start_adapter The adapter sequence at read start (5'). (default auto-detect)
  • string?
longread_end_adapter End adapter The adapter sequence at read end (3'). (default auto-detect)
  • string?
longread_adapter_fasta Adapter fasta Specify a FASTA file to trim both read ends by all the sequences in this FASTA file. (default None)
  • File?
longread_disable_adapter_trimming Disable adapter trimming Adapter trimming is enabled by default. If this option is specified, adapter trimming is disabled. (default false)
  • boolean?
illumina_humandb Filter human reads Bowtie2 index folder. Provide the folder in which the index files are located. (optional)
  • Directory?
longread_humandb Filter human illumina reads A FASTA file or a minimap2 index file (.mmi) needs to be provided. A pre-built index is much faster. (optional)
  • File?
illumina_reference_filter_db Illumina reference filter db Custom reference database for filtering with Hostile. Provide the folder in which the bowtie2 index files are located. (optional)
  • Directory?
longread_reference_filter_db Longread reference filter db A FASTA file or a minimap2 index file (.mmi) needs to be provided. A pre-built index is much faster. (optional)
  • File?
use_reference_mapped_reads Keep mapped reads Keep reads mapped to the given reference and discard unmapped reads. (default false: mapped reads are discarded)
  • boolean
keep_filtered_reads Keep filtered reads Keep filtered reads in the final output (default false)
  • boolean
deduplicate_illumina_reads Deduplicate illumina reads Remove exact duplicate Illumina reads with fastp (default false)
  • boolean
run_kraken2_illumina Run kraken2 on Illumina reads Run kraken2 on Illumina reads. A kraken2 database needs to be provided using the input kraken2_database. (default false)
  • boolean
skip_bracken Run Bracken Skip Bracken analysis. Illumina only. A bracken compatible kraken2 database needs to be provided using the input kraken2_database. (default false)
  • boolean
bracken_levels Bracken levels Taxonomy levels on which Bracken estimates abundances. Default runs through all of: [P,C,O,F,G,S]
  • string[]
illumina_read_length Read length Read length, currently used by Bracken only. Usually 50, 75, 100, 150, 200, 250 or 300. (default 150)
  • int?
kraken2_confidence Kraken2 confidence threshold Confidence score threshold must be in [0, 1] (default 0.0)
  • float?
kraken2_database Kraken2 database Database location of kraken2. (optional)
  • Directory[]?
kraken2_standard_report Kraken2 standard report Also output Kraken2 standard report with per read classification. These can be large. (default false)
  • boolean
genome_size Genome Size Estimated genome size (for example, 5m or 2.6g). Used in Flye. (optional)
  • string?
metagenome When working with metagenomes Metagenome option for assemblers (default true)
  • boolean
run_spades Use SPAdes Run with SPAdes assembler (default true)
  • boolean
only_assembler_mode_spades Only spades assembler Run spades in only assembler mode (without read error correction). (default false)
  • boolean
use_spades_scaffolds Use SPAdes scaffolds Use SPAdes scaffolds instead of contigs for post-processing (polishing/mapping/binning). (default false)
  • boolean
run_flye Use Flye Run with Flye assembler. Requires long reads (default false)
  • boolean
flye_deterministic Deterministic Flye Perform disjointig assembly single-threaded in Flye assembler (slower). (default false)
  • boolean
run_medaka Use Medaka Run with Medaka assembly polishing using nanopore (not pacbio) reads only. (default false)
  • boolean
run_pypolca Use PyPolCA Run with PyPolCA assembly polishing using Illumina reads only. (default false)
  • boolean
assembly_choice Assembly choice User's choice of assembly for post-assembly (binning) processes ('spades', 'flye', 'pypolca', 'medaka'). Optional. Only one choice allowed. When none is given, the first available assembly in this order is chosen: pypolca, medaka, flye, spades.
  • enum of: spades, flye, pypolca, medaka
output_bam_file Output BAM file Output BAM file of mapped reads to assembly of choice. (default false)
  • boolean
ont_basecall_model ONT Basecalling model used for MEDAKA Used by Medaka. The basecalling model that was used with guppy (default r941_min_high). Available: r103_fast_g507, r103_fast_snp_g507, r103_fast_variant_g507, r103_hac_g507, r103_hac_snp_g507, r103_hac_variant_g507, r103_min_high_g345, r103_min_high_g360, r103_prom_high_g360, r103_prom_snp_g3210, r103_prom_variant_g3210, r103_sup_g507, r103_sup_snp_g507, r103_sup_variant_g507, r1041_e82_400bps_fast_g615, r1041_e82_400bps_fast_variant_g615, r1041_e82_400bps_hac_g615, r1041_e82_400bps_hac_variant_g615, r1041_e82_400bps_sup_g615, r1041_e82_400bps_sup_variant_g615, r104_e81_fast_g5015, r104_e81_fast_variant_g5015, r104_e81_hac_g5015, r104_e81_hac_variant_g5015, r104_e81_sup_g5015, r104_e81_sup_g610, r104_e81_sup_variant_g610, r10_min_high_g303, r10_min_high_g340, r941_e81_fast_g514, r941_e81_fast_variant_g514, r941_e81_hac_g514, r941_e81_hac_variant_g514, r941_e81_sup_g514, r941_e81_sup_variant_g514, r941_min_fast_g303, r941_min_fast_g507, r941_min_fast_snp_g507, r941_min_fast_variant_g507, r941_min_hac_g507, r941_min_hac_snp_g507, r941_min_hac_variant_g507, r941_min_high_g303, r941_min_high_g330, r941_min_high_g340_rle, r941_min_high_g344, r941_min_high_g351, r941_min_high_g360, r941_min_sup_g507, r941_min_sup_snp_g507, r941_min_sup_variant_g507, r941_prom_fast_g303, r941_prom_fast_g507, r941_prom_fast_snp_g507, r941_prom_fast_variant_g507, r941_prom_hac_g507, r941_prom_hac_snp_g507, r941_prom_hac_variant_g507, r941_prom_high_g303, r941_prom_high_g330, r941_prom_high_g344, r941_prom_high_g360, r941_prom_high_g4011, r941_prom_snp_g303, r941_prom_snp_g322, r941_prom_snp_g360, r941_prom_sup_g507, r941_prom_sup_snp_g507, r941_prom_sup_variant_g507, r941_prom_variant_g303, r941_prom_variant_g322, r941_prom_variant_g360, r941_sup_plant_g610, r941_sup_plant_variant_g610 (required for Medaka)
  • string?
binning Run binning workflow Run with contig binning workflow (default false)
  • boolean
run_maxbin2 Run Maxbin2 Run with MaxBin2 binner. (default true)
  • boolean
run_semibin2 Run SemiBin Run with SemiBin2 binner. (default true)
  • boolean
semibin2_environment SemiBin Environment SemiBin2 built-in models (none/global/human_gut/dog_gut/ocean/soil/cat_gut/human_oral/mouse_gut/pig_gut/built_environment/wastewater/chicken_caecum). Choosing a built-in model is generally faster. Choose none to run (single-sample) training on your own data instead. (default global)
  • enum of: none, global, human_gut, dog_gut, ocean, soil, cat_gut, human_oral, mouse_gut, pig_gut, built_environment, wastewater, chicken_caecum
gtdbtk_data gtdbtk data directory Directory containing the GTDBTK repository
  • Directory?
busco_data BUSCO dataset Path to the BUSCO dataset downloaded location. (optional)
  • Directory?
annotate_bins Annotate bins Annotate bins. (default false)
  • boolean
annotate_unbinned Annotate unbinned Annotate unbinned contigs. Will be treated as metagenome. (default false)
  • boolean
bakta_db Bakta DB Bakta Database directory. Default is built-in bakta-light db. (optional)
  • Directory?
skip_bakta_crispr Skip bakta CRISPR Skip bakta CRISPR array prediction using PILER-CR. (default false)
  • boolean
interproscan_directory InterProScan 5 directory Directory of the (full) InterProScan 5 program. Used for annotating bins. (optional)
  • Directory?
eggnog_dbs n/a n/a
  • record containing
    • Directory?
    • File?
    • File?
run_kofamscan Run kofamscan Run with KEGG KO KoFamKOALA annotation. (default false)
  • boolean
kofamscan_limit_sapp SAPP kofamscan limit Limit max number of entries of kofamscan hits per locus in SAPP. (default 5)
  • int?
run_eggnog Run eggNOG-mapper Run with eggNOG-mapper annotation. Requires eggnog database files. (default false)
  • boolean
run_interproscan Run InterProScan Run with InterProScan annotation. Requires InterProScan v5 program files. (default false)
  • boolean
interproscan_applications InterProScan applications Comma-separated list of analyses: FunFam,SFLD,PANTHER,Gene3D,Hamap,PRINTS,ProSiteProfiles,Coils,SUPERFAMILY,SMART,CDD,PIRSR,ProSitePatterns,AntiFam,Pfam,MobiDBLite,PIRSF,NCBIfam (default Pfam,SFLD,SMART,AntiFam,NCBIfam)
  • string
destination Output Destination Optional output destination only used for cwl-prov reporting.
  • string?
source Input URLs used for this run A provenance element to capture the original source of the input data
  • string[]?
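
As an illustration of how the inputs above fit together, the following is a minimal sketch that builds a job parameter object for a short-read-only run and prints it as JSON. The input IDs come from the table above; the helper name `make_job`, the sample file names, and the value 8000 for the "8GB" memory default are assumptions made for this example, not taken from the workflow itself.

```python
import json


def make_job(identifier, forward_reads, reverse_reads, threads=2, memory_mb=8000):
    """Build a job dictionary for a short-read-only run (illustrative only).

    memory_mb=8000 is an assumed reading of the documented "8GB" default.
    """

    def as_file(path):
        # CWL describes File inputs as objects with a class and a path.
        return {"class": "File", "path": path}

    return {
        "identifier": identifier,
        "threads": threads,
        "memory": memory_mb,
        "illumina_forward_reads": [as_file(p) for p in forward_reads],
        "illumina_reverse_reads": [as_file(p) for p in reverse_reads],
        "metagenome": True,   # documented default
        "run_spades": True,   # documented default
        "binning": False,     # documented default
    }


# Hypothetical sample name and file names.
job = make_job("sample1", ["sample1_R1.fastq.gz"], ["sample1_R2.fastq.gz"])
print(json.dumps(job, indent=2))
```

The printed JSON could be saved to a file and passed to a CWL runner as the job file; the setup guide linked above describes the supported way to run UNLOCK workflows.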

Steps

ID Name Description
workflow_quality_illumina Illumina quality workflow Quality, filtering and taxonomic classification workflow for Illumina reads
workflow_quality_nanopore Oxford Nanopore quality workflow Quality, filtering and taxonomic classification workflow for Oxford Nanopore reads
workflow_quality_pacbio PacBio quality and filtering workflow Quality, filtering and taxonomic classification for PacBio reads
workflow_kraken2_illumina Kraken2 illumina Taxonomic classification of Illumina reads using kraken2
spades SPAdes assembly Genome assembly using SPAdes with Illumina and/or long reads
spades_assembly SPAdes contigs or scaffolds Get chosen spades assembly. Contigs or scaffolds
compress_spades SPAdes compressed Compress the large SPAdes assembly output files
flye Flye assembly De novo assembly of single-molecule reads with Flye
medaka Medaka polishing of assembly Medaka polishing of an assembled (Flye) genome using ONT reads
workflow_pypolca Run PyPolCA assembly polishing PyPolCA polishing of long-read assembly with Illumina reads
get_assembly_to_use Assembly choice Get assembly choice
assembly_read_mapping_illumina Minimap2 Illumina read mapping using Minimap2 on assembled scaffolds
contig_read_counts Samtools idxstats Reports alignment summary statistics
workflow_binning Binning workflow Binning workflow to create bins
keep_readfilter_files_to_folder Read filtering output folder Preparation of read filtering output files to a specific output folder
readfilter_files_to_folder Read filtering output folder Preparation of read filtering reports to a specific output folder
spades_files_to_folder SPADES output to folder Preparation of SPAdes output files to a specific output folder
flye_files_to_folder Flye output folder Preparation of Flye output files to a specific output folder
medaka_files_to_folder Medaka output folder Preparation of Medaka output files to a specific output folder
pypolca_files_to_folder PyPolca output folder Preparation of PyPolCA output files to a specific output folder
output_bamfile Output bam file Step needed to output the BAM file, because doing so is optional.
assembly_files_to_folder Assembly output folder Preparation of assembly output files to a specific output folder
binning_files_to_folder Binning output to folder Preparation of binning output files and folders to a specific output folder
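
The fallback order documented for the assembly_choice input (pypolca, medaka, flye, spades), which the get_assembly_to_use step resolves, can be sketched as below. This is an illustrative reimplementation, not the workflow's own code: the function name `choose_assembly` is hypothetical, and raising an error when the requested assembly was not produced is an assumption about behaviour the page does not describe.

```python
# Documented fallback order when no assembly_choice is given.
ASSEMBLY_PRIORITY = ("pypolca", "medaka", "flye", "spades")


def choose_assembly(available, assembly_choice=None):
    """Return the name of the assembly to use for post-assembly steps.

    available: names of the assemblies that were actually produced.
    assembly_choice: optional explicit choice, one of ASSEMBLY_PRIORITY.
    """
    available = set(available)
    if assembly_choice is not None:
        if assembly_choice not in ASSEMBLY_PRIORITY:
            raise ValueError(f"unknown assembly choice: {assembly_choice}")
        if assembly_choice not in available:
            # Assumption: an unavailable explicit choice is treated as an error.
            raise ValueError(f"assembly not available: {assembly_choice}")
        return assembly_choice
    # No explicit choice: take the first available assembly in priority order.
    for name in ASSEMBLY_PRIORITY:
        if name in available:
            return name
    raise ValueError("no assembly available")
```

For example, with both SPAdes and Flye assemblies present and no explicit choice, the sketch selects flye; passing "spades" explicitly overrides that order.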

Outputs

ID Name Description Type
read_filtering_output_keep Read filtering output Read filtering stats + filtered reads
  • Directory?
read_filtering_output Read filtering output Read filtering stats
  • Directory?
assembly_output Assembly output Output from different assembly steps
  • Directory
binning_output Binning output Binning output folders
  • Directory?

Version History

Version 3 (latest) Created 9th Sep 2025 at 13:28 by Bart Nijsse

Major changes: This version changes the way read filtering is performed and replaces DAStool with Binette.


Open master d1190f4

WFP Created 16th Dec 2024 at 07:46 by Bart Nijsse

Workflow version used in analysis: "A metadata managed FAIR end-to-end workflow for microbial community Omics data analysis"


Frozen WFP 7c7adba

Version 1 (earliest) Created 14th Jun 2022 at 09:14 by Bart Nijsse

Initial commit


Frozen Version-1 1e42c47
Activity

Views: 5463   Downloads: 896

Created: 14th Jun 2022 at 09:14

Last updated: 9th Sep 2025 at 14:47
