MGnify genomes catalogue pipeline
A pipeline to perform taxonomic and functional annotation and to generate a catalogue from a set of isolate and/or metagenome-assembled genomes (MAGs) using the workflow described in the following publication:
Gurbich TA, Almeida A, Beracochea M, Burdett T, Burgin J, Cochrane G, Raj S, Richardson L, Rogers AB, Sakharova E, Salazar GA and Finn RD. (2023) MGnify Genomes: A Resource for Biome-specific Microbial Genome Catalogues. J Mol Biol. doi: https://doi.org/10.1016/j.jmb.2023.168016
Detailed information about existing MGnify catalogues: https://docs.mgnify.org/src/docs/genome-viewer.html
Tools used in the pipeline
| Tool/Database | Version | Purpose |
|---|---|---|
| CheckM | 1.1.3 | Determining genome quality |
| dRep | 3.2.2 | Genome clustering |
| Mash | 2.3 | Sketch for the catalogue; placement of genomes into clusters (update only); strain tree |
| GUNC | 1.0.3 | Quality control |
| GTDB-Tk | 2.1.0 | Assigning taxonomy; generating alignments |
| GTDB | r207_v2 | Database for GTDB-Tk |
| Prokka | 1.14.6 | Protein annotation |
| IQ-TREE 2 | 2.2.0.3 | Generating a phylogenetic tree |
| Kraken 2 | 2.1.2 | Generating a Kraken database |
| Bracken | 2.6.2 | Generating a Bracken database |
| MMseqs2 | 13.45111 | Generating a protein catalogue |
| eggNOG-mapper | 2.1.3 | Protein annotation (eggNOG, KEGG, COG, CAZy) |
| InterProScan | 5.57-90.0 | Protein annotation (InterPro, Pfam) |
| CRISPRCasFinder | 4.3.2 | Annotation of CRISPR arrays |
| AMRFinderPlus | 3.11.4 | Antimicrobial resistance gene annotation; virulence factors, biocide, heat, acid, and metal resistance gene annotation |
| AMRFinderPlus DB | 3.11 2023-02-23.1 | Database for AMRFinderPlus |
| SanntiS | 0.9.3.2 | Biosynthetic gene cluster annotation |
| Infernal | 1.1.4 | RNA predictions |
| tRNAscan-SE | 2.0.9 | tRNA predictions |
| Rfam | 14.6 | Identification of SSU/LSU rRNA and other ncRNAs |
| Panaroo | 1.3.2 | Pan-genome computation |
| Seqtk | 1.3 | Generating a gene catalogue |
| VIRify | - | Viral sequence annotation |
| MoMofy | 1.0.0 | Mobilome annotation |
| samtools | 1.15 | FASTA indexing |
Setup
Environment
The pipeline is implemented in Nextflow.
Requirements:
- Nextflow
- Singularity or Docker (see Containers below)
- the reference databases listed below
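Nextflow itself is not installed by the pipeline. If you do not already have it, a common way to get it (this is standard Nextflow setup, not specific to this repository) is:

```bash
# Install the Nextflow launcher into the current directory (requires a recent Java runtime)
curl -s https://get.nextflow.io | bash
# Put the launcher somewhere on your PATH
mv nextflow ~/bin/
```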
Reference databases
The pipeline needs the following reference databases and configuration files (roughly ~150 GB):
- ftp://ftp.ebi.ac.uk/pub/databases/metagenomics/genomes-pipeline/gunc_db_2.0.4.dmnd.gz
- ftp://ftp.ebi.ac.uk/pub/databases/metagenomics/genomes-pipeline/eggnog_db.tgz
- ftp://ftp.ebi.ac.uk/pub/databases/metagenomics/genomes-pipeline/rfams_cms/
- ftp://ftp.ebi.ac.uk/pub/databases/metagenomics/genomes-pipeline/kegg_classes.tsv
- ftp://ftp.ebi.ac.uk/pub/databases/metagenomics/genomes-pipeline/ncrna/
- ftp://ftp.ebi.ac.uk/pub/databases/metagenomics/genomes-pipeline/continent_countries.csv
- https://data.gtdb.ecogenomic.org/releases/release207/207.0/auxillary_files/gtdbtk_r207_v2_data.tar.gz
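As a rough sketch, the databases can be fetched and unpacked with standard tools; the `db/` target directory and the exact unpacked layout below are assumptions and should match the paths configured in your Nextflow profile:

```bash
mkdir -p db && cd db

# Single-file databases
wget ftp://ftp.ebi.ac.uk/pub/databases/metagenomics/genomes-pipeline/gunc_db_2.0.4.dmnd.gz
gunzip gunc_db_2.0.4.dmnd.gz
wget ftp://ftp.ebi.ac.uk/pub/databases/metagenomics/genomes-pipeline/eggnog_db.tgz
tar -xzf eggnog_db.tgz
wget ftp://ftp.ebi.ac.uk/pub/databases/metagenomics/genomes-pipeline/kegg_classes.tsv
wget ftp://ftp.ebi.ac.uk/pub/databases/metagenomics/genomes-pipeline/continent_countries.csv

# Directories: mirror recursively, keeping only the last path component
wget -r -nH --cut-dirs=4 ftp://ftp.ebi.ac.uk/pub/databases/metagenomics/genomes-pipeline/rfams_cms/
wget -r -nH --cut-dirs=4 ftp://ftp.ebi.ac.uk/pub/databases/metagenomics/genomes-pipeline/ncrna/

# GTDB release used by GTDB-Tk
wget https://data.gtdb.ecogenomic.org/releases/release207/207.0/auxillary_files/gtdbtk_r207_v2_data.tar.gz
tar -xzf gtdbtk_r207_v2_data.tar.gz
```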
Containers
This pipeline requires Singularity or Docker as the container engine.
The containers are hosted on BioContainers and in the quay.io/microbiome-informatics repository.
It's possible to build the containers from scratch using the following script:
cd containers && bash build.sh
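When the pipeline is launched through Nextflow, images are normally pulled automatically by the container engine. If you need to fetch an image manually, Singularity can pull directly from the registry; the image name and tag below are hypothetical placeholders:

```bash
# Hypothetical example -- substitute the actual image name and tag used by the pipeline
singularity pull docker://quay.io/microbiome-informatics/<image-name>:<tag>
```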
Running the pipeline
Data preparation
- You need to pre-download your data to directories and make sure that the genomes are uncompressed. Scripts to fetch genomes from ENA (fetch_ena.py) and NCBI (fetch_ncbi.py) are provided and need to be executed separately from the pipeline. If you have downloaded genomes from both ENA and NCBI, put them into separate folders.
- When genomes are fetched from ENA using the fetch_ena.py script, a CSV file with contamination and completeness statistics is also created in the directory where the genomes are saved. If you are downloading genomes using a different approach, this CSV file needs to be created manually (each line should contain the genome accession, % completeness, and % contamination; see the example after this list). The ENA fetching script also pre-filters genomes to satisfy the QS50 cut-off (QS = % completeness - 5 * % contamination).
- You will need the following information to run the pipeline:
- catalogue name (for example, zebrafish-faecal)
- catalogue version (for example, 1.0)
- catalogue biome (for example, root:Host-associated:Human:Digestive system:Large intestine:Fecal)
- min and max accession numbers to be assigned to the genomes (MGnify-specific); Max - Min = total number of genomes (NCBI + ENA)
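A minimal example of the manually created completeness/contamination CSV described above (one genome per line: accession, % completeness, % contamination); the accessions are hypothetical:

```csv
GUT_GENOME000001,98.59,0.70
GUT_GENOME000002,91.32,1.15
```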
Execution
The pipeline is built in Nextflow and uses containers to run the software (we don't support conda at the moment).
In order to run the pipeline, the user needs to create a profile that suits their needs; the ebi profile in nextflow.config can be used as a template.
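As a sketch, a site-specific profile added to nextflow.config could look like the following; the profile name, executor, and database parameter names are assumptions and should be checked against the parameters actually defined in the config:

```groovy
profiles {
    mycluster {
        process.executor       = 'slurm'   // or 'local', 'lsf', ...
        singularity.enabled    = true
        singularity.autoMounts = true

        // Hypothetical parameter names -- use the ones defined in nextflow.config
        params.gtdb_db   = '/path/to/release207_v2'
        params.gunc_db   = '/path/to/gunc_db_2.0.4.dmnd'
        params.eggnog_db = '/path/to/eggnog_db'
    }
}
```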
After downloading the databases and adjusting the config file:
nextflow run EBI-Metagenomics/genomes-pipeline -c <your_config_file> -profile <your_profile> \
--genome-prefix=MGYG \
--biome="root:Host-associated:Fish:Digestive system" \
--ena_genomes=<path to the folder with ENA genomes> \
--ena_genomes_checkm=<path to the completeness/contamination CSV> \
--mgyg_start=0 \
--mgyg_end=10 \
--catalogue_name=zebrafish-faecal \
--catalogue_version="1.0" \
--ftp_name="zebrafish-faecal" \
--ftp_version="v1.0" \
--outdir=<output directory>
Development
Install development tools (including pre-commit hooks to run Black code formatting).
pip install -r requirements-dev.txt
pre-commit install
Code style
Use Black; this tool is configured if you install the pre-commit tools as above.
To run it manually: black .
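pre-commit can also run every configured hook across the whole repository, which is handy before pushing changes:

```bash
pre-commit run --all-files
```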
Testing
This repo has two sets of tests: Python unit tests for some of the most critical Python scripts, and nf-test scripts for the Nextflow code.
To run the Python tests:
pip install -r requirements-test.txt
pytest
To run the Nextflow ones, the databases have to be downloaded manually; we are working to improve this.
nf-test test tests/*
Version History
- v2.3.0 (latest), created 23rd May 2024 at 12:21 by Martin Beracochea: Merge pull request #97 from EBI-Metagenomics/dev (commit 6a0b0b2)
- v2.0.0 (earliest), created 28th Apr 2023 at 10:36 by Martin Beracochea: generate_summary_json, fix empty file comparison (commit 8fa9134)