Research Object Crate for MAGNETO (automated workflow dedicated to MAG reconstruction)

Original URL: https://workflowhub.eu/workflows/1815/ro_crate?version=2

# MAGNETO

MAGNETO is an automated Snakemake workflow dedicated to MAG (Metagenome-Assembled Genome) reconstruction from metagenomic data. It includes a fully automated co-assembly step informed by optimal clustering of metagenomic distances, and implements complementary genome-binning strategies to improve MAG recovery.

# Key Features

- **Quality Control (QC)**: Automatically assesses the quality and contamination of input reads, ensuring that low-quality data are filtered out to improve downstream analyses.
- **Assembly**: MAGNETO uses a high-performance assembler to construct contigs from metagenomic reads.
- **Gene Collection**: Extracts and compiles gene sequences from contigs, providing a comprehensive gene catalog directly after assembly.
- **Binning**: Groups contigs into putative genomes using composition signatures and abundance profiles.
- **Genomes Collection**: Provides taxonomic and functional annotation of reconstructed MAGs.

# Documentation

**Full description in the [wiki pages](https://gitlab.univ-nantes.fr/bird_pipeline_registry/magneto/-/wikis/home)**

# Dependencies

A working installation of **conda** and **git** is mandatory to build MAGNETO. If **mamba** is already installed on your system, the creation of the main environment will be faster.

- python 3.8+
- snakemake 7.32.4
- mamba 1.5.8
- conda 4.10.3
- click 8.01

Other dependencies (such as Python libraries for analysis, or the programs run by the workflow) are installed through `setup.py` and conda management.

Mamba has been the default conda frontend for Snakemake for some time now. Although it is possible to use conda instead of mamba to manage conda environments, MAGNETO's design makes mamba mandatory, as `--conda-frontend conda` does not propagate to the subworkflows.

Unless you use your own databases or have already downloaded them, MAGNETO will also require an internet connection.

# Installation

## Install from bioconda

```bash
conda install -c bioconda magneto
```

## Install from source

### Main conda environment installation

Start by creating a conda environment containing snakemake, mamba and the Python module click:

```bash
conda create -n magneto snakemake-minimal=7.32.4 click=8.01 mamba=1.5.8 -c bioconda -c conda-forge
```

> Note
> - If you already have **mamba** or **micromamba** installed, you can create the environment with it instead of conda.

Then, activate your environment:

```
conda activate magneto
```

### Installation of the magneto module in the conda environment

Installation is performed using `pip`:

```
git clone https://gitlab.univ-nantes.fr/bird_pipeline_registry/magneto.git
python3 -m pip install magneto/
```

**MAGNETO** is now installed in the "magneto" conda environment. Activate this environment whenever you need to run the pipeline!

# Initialization of the working directory

`magneto init --wd <working_directory>`

This will place the configuration files into `<working_directory>/config/`:

- `config.yaml`, in which all parameters for the programs in the workflow may be set;
- SGE/Slurm profiles, to run the workflow on clusters. Two versions are currently available, for SGE and Slurm; you can find them at `<working_directory>/config/[sge or slurm]/config.yaml`. (Certain details may need to be modified to reflect the specific characteristics of the cluster, such as queue or partition names.)
- `cluster_[sge or slurm].yaml`, which specify the resources allocated to the workflow. These files are fully modular, and the resources can be adapted for each Snakemake rule.
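
As a quick sanity check after installation and initialization, you can verify the CLI and inspect the generated configuration. This is a minimal sketch, not part of the official documentation: `--help` is assumed to be available since the CLI is built with click, and `my_project` is an arbitrary example path:

```bash
# Assumes the "magneto" conda environment is active.
magneto --help                # assumption: the click-based CLI exposes --help
magneto init --wd my_project  # "my_project" is an arbitrary working directory
ls my_project/config/         # should list config.yaml and the SGE/Slurm profiles
```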
# Input data

MAGNETO supports both single-end and paired-end reads in fasta/fasta.gz/fastq/fastq.gz format. You will need to provide a YAML file listing your read files, following the general patterns below.

For a sample file containing paired-end read files:

If you have one run per sample:

```
<sample_name>:
  <run_name>:
    - <path/to/reads_R1>
    - <path/to/reads_R2>
<sample_name_2>:
  <...>
```

If you have multiple runs per sample:

```
<sample_name>:
  <run_name_1>:
    - <path/to/run_1_R1>
    - <path/to/run_1_R2>
  <run_name_2>:
    - <path/to/run_2_R1>
    - <path/to/run_2_R2>
<sample_name_2>:
  <...>
```

For a sample file containing single-end read files:

```
<sample_name>:
  <run_name_1>:
    - <path/to/run_1_reads>
  <run_name_2>:
    - <path/to/run_2_reads>
<sample_name_2>:
  <...>
```

Set the path of your sample file in the `samples` field of the config file (`<working_directory>/config/config.yaml`). You can find a template for sample files in `<working_directory>/config/dummy_samples.yaml`, which can be used for testing [(see the Test section below)](#Test).

# Usage

The general command line is as follows:

```
magneto run <submodule> --profile <path/to/profile> --rerun-incomplete
```

with the following submodule names:

- `qc` performs read trimming, using fastp/FastQ Screen;
- `motus` performs taxonomic profiling of reads;
- `assembly` performs the assembly of metagenomic reads into contigs, using MEGAHIT;
- `genes` performs functional and taxonomic annotation of contigs, to obtain the gene collection;
- `binning` performs binning of contigs into putative genomes, using MetaBAT2;
- `genomes` performs bin quality checking and dereplication, using CheckM and dRep;
- `all` runs the complete workflow at once.

`--skip-qc` bypasses the read-trimming step.

You can choose the type of assembly you want to perform: `--config target=[single_assembly or co_assembly]`

A command-line example is shown below:

```
magneto run all --profile config/slurm/ --config target=single_assembly --rerun-incomplete
```

The `--profile` option makes Snakemake use a pre-configured profile to run MAGNETO on clusters. Use either the SGE or Slurm profile, depending on your system. By default, Snakemake will use the `config.yaml` file located in the specified folder (in this example, `config/slurm`). More details [here](https://snakemake.readthedocs.io/en/stable/executing/cli.html#profiles).

# Test

:warning: Test data not currently working, work in progress

To test the workflow you can use the dummy dataset found in the `test` folder. Simply extract the archive to your working directory, e.g.:

```
tar -zxf test/dummy_dataset.tar.gz -C <working_directory>
```

Then launch magneto with the `--dummy` option (or simply set the `samples` field in `<working_directory>/config/config.yaml` to the dummy samples file):

```
magneto run all --dummy --rerun-incomplete --profile config/sge/
```

# Databases management

MAGNETO uses dedicated databases to run fastp and CheckM. It downloads them automatically and stores them by default in `<working_directory>/Database`.

# Output

Outputs of the different steps are stored in the `<working_directory>/intermediate_results` folder, with the following organization:

```
intermediate_results
|
|__reads
|  |
|  |__PE (for paired-end reads)
|  |__SE (for single-end reads)
|
|__assembly
|  |
|  |__single_assembly
|  |  |
|  |  |__megahit
|  |     |
|  |     |__<sample_1>
|  |     |__<sample_2>
|  |     |__...
|  |     |__<sample_n>
|  |
|  |__co_assembly
|  |  |
|  |  |__megahit
|  |     |
|  |     |__<cluster_1>
|  |     |__<cluster_2>
|  |     |__...
|  |     |__<cluster_n>
|  |
|  |__simka (distance matrix computed between samples)
|  |__clusters (repartition of the samples into clusters inferred from the distance matrix)
|
|__binning
   |
   |__single_binning
   |  |
   |  |__<assembly_name>
   |
   |__co_binning
      |
      |__<assembly_name>
```

The final output of the workflow is stored in the `<working_directory>/genomes_collection` subfolder. Graphical reports (notably from FastQ Screen and MultiQC) are stored in `<working_directory>/reports`.
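
Putting the pieces together, here is a minimal end-to-end sketch. It is hypothetical: sample names and read paths are invented, the samples layout follows the paired-end template above, and the final command is the example from the Usage section:

```bash
# Hypothetical example; sample names and file paths are invented.
# 1. Describe the input reads (paired-end, one run per sample).
cat > samples.yaml <<'EOF'
sampleA:
  run1:
    - reads/sampleA_R1.fastq.gz
    - reads/sampleA_R2.fastq.gz
sampleB:
  run1:
    - reads/sampleB_R1.fastq.gz
    - reads/sampleB_R2.fastq.gz
EOF

# 2. Point the "samples" field of config/config.yaml at this file, then
#    run the full workflow in single-assembly mode with the Slurm profile.
magneto run all --profile config/slurm/ --config target=single_assembly --rerun-incomplete
```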
# Steps implemented

**Pre-processing:**

* [x] QC (fastp and FastQ Screen)
* [x] merging (BBMerge)

**mOTUs:**

* [x] mOTUs profiling (motus)

**Assembly:**

* [x] single-assembly (MEGAHIT)
* [x] metagenomic distances between samples (Simka)
* [x] co-assembly (MEGAHIT - [clustering: CAH + silhouette])
* [x] assembly QC and filtering (metacovest.py)
* [x] misassembly detection (DeepMAsED; for single assembly only at this time)
* [x] assembly taxonomic annotation (CAT)

**Genomes collection:**

* [x] single-binning from single-assembly
* [x] single-binning from co-assembly
* [x] co-binning from single-assembly
* [x] co-binning from co-assembly
* [x] filter out contigs inconsistent with their bin's assignment (from CAT/BAT results, homemade script) # requires an update
* [x] improve the collection with external genome databases
* [x] CheckM (by batch)
* [x] dRep (by batch, for every taxonomic level in theory - 0.95: species, 0.99: strain)
* [x] dRep95 followed by dRep99 on the 95% clusters
* [x] GTDB-Tk
* [x] functional annotation (eggNOG-mapper)
* [x] genomes_length table
* [x] genomes_reads_counts table
* [x] genomes_bp_covered table
* [x] genomes_abundance table
* [x] genomes_function table
* [x] genomes_taxo table

**Genes collection:**

* [x] CDS prediction using Prodigal, per sample
* [x] concatenation of all CDS
* [x] CDS clustering using Linclust
* [x] read back-mapping against the gene collection
* [x] eggNOG-mapper
* [x] taxonomy using the MMseqs2 protocol (UniProt db as reference)

**Reports:**

* [x] MultiQC report (pre-processing)
* [x] assembly report # homemade

*MAGNETO workflow*

# Citing the pipeline

Churcheward B, Millet M, Bihouée A, Fertin G, Chaffron S.
MAGNETO: An Automated Workflow for Genome-Resolved Metagenomics.
mSystems. 2022 Jun 15:e0043222. doi: [10.1128/msystems.00432-22](https://doi.org/10.1128/msystems.00432-22)

Author: Samuel Chaffron, Audrey Bihouee, Hugo Lefeuvre

License: GPL-3.0

Contents:

- Main Workflow Description: README.md (9792 bytes)