gSpreadComp: Streamlining Microbial Community Analysis for Resistance, Virulence, and Plasmid-Mediated Spread
Overview
gSpreadComp is a UNIX-based, modular bioinformatics toolkit designed to streamline comparative genomics for analyzing microbial communities. It integrates genome annotation, gene spread calculation, plasmid-mediated horizontal gene transfer (HGT) detection and resistance-virulence ranking within the analysed microbial community to help researchers identify potential resistance-virulence hotspots in complex microbial datasets.
[!TIP] After installation, the user may want to check a detailed tutorial with example input and output data here
Objectives and Features
- Six Integrated Modules: Offers modules for taxonomy assignment, genome quality estimation, ARG annotation, plasmid/chromosome classification, virulence factor annotation, and in-depth downstream analysis, including target-based gene spread analysis and prokaryotic resistance-virulence ranking.
- Weighted Average Prevalence (WAP): Employs WAP for calculating the spread of target genes at different taxonomical levels or target groups, enabling refined analyses and interpretations of microbial communities.
- Reference Pathogen Identification: Compares genomes to the NCBI pathogens database to create a resistance-virulence ranking within the community.
- HTML Reporting: Culminates in a structured HTML report after the complete downstream analysis, providing users with an overview of the results.
Modular Approach and Flexibility
gSpreadComp's modular nature enables researchers to use the tool's main analysis and report generation steps independently or to integrate only specific pieces of gSpreadComp into their pipelines, providing flexibility and accommodating the varying software management needs of investigators.
Using other annotation tools with gSpreadComp
[!TIP] Users can incorporate results from other annotation tools within gSpreadComp's workflow, provided the input is formatted according to gSpreadComp's specifications. This allows for the integration of preferred or specialized tools for specific steps (e.g., alternative ARG or plasmid detection methods) while still benefiting from gSpreadComp's downstream analysis capabilities.
For the quality data it should look like: Quality DataFrame Format
For the taxonomy data it should look like: Taxonomy DataFrame Format
For the gene annotation (e.g. ARGs) data it should look like: Gene annotation DataFrame Format
For the plasmid identification data it should look like: Plasmid identification DataFrame Format
Metadata information data should look like: Metadata Sample
By the end of a successful run, you should have a report that looks like this: Download Example Report
Comprehensive Workflow
gSpreadComp consists of the following modules:
- Taxonomy Assignment: Uses GTDBtk v2 for taxonomic classification.
- Genome Quality Estimation: Employs CheckM for assessing genome completeness and contamination.
- ARG Annotation: Utilizes DeepARG for antimicrobial resistance gene prediction.
- Plasmid Classification: Implements Plasflow for plasmid sequence identification.
- Virulence Factor Annotation: Annotates virulence factors using the Victors and/or VFDB databases.
- Downstream Analysis: Performs gene spread analysis, resistance-virulence ranking, and potential plasmid-mediated HGT detection.
Requirements
Before installing and running gSpreadComp, ensure that your system meets the following requirements:
1. Operating System
- Linux x64 system
2. Package Managers
- Miniconda: Required for creating environments and managing packages.
- Mamba: A faster package manager used within the gSpreadComp installation.
3. Storage
- Approximately 15 GB for software installation.
- Around 92 GB for the entire database requirements.
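Before starting a long download, it can help to confirm that the target filesystem has enough free space. The sketch below is not part of gSpreadComp: `free_gb`, `has_space`, and the `DB_DIR` variable are hypothetical names, and the 92 GB figure is simply the database estimate quoted above.

```shell
# Hypothetical helpers, not shipped with gSpreadComp.

# Print the free space, in whole GB, of the filesystem holding directory $1.
free_gb() {
  # -P forces the portable one-line-per-filesystem layout, -k reports 1024-byte blocks;
  # column 4 of the second line is the available space.
  df -Pk "$1" | awk 'NR == 2 { printf "%d\n", $4 / 1024 / 1024 }'
}

# Succeed only if directory $1 has at least $2 GB free.
has_space() {
  [ "$(free_gb "$1")" -ge "$2" ]
}

# DB_DIR is a placeholder for wherever the databases will be stored.
DB_DIR="${DB_DIR:-.}"
if has_space "$DB_DIR" 92; then
  echo "enough space for the databases in $DB_DIR"
else
  echo "less than 92 GB free in $DB_DIR" >&2
fi
```

The same check can be repeated with 15 for the directory that will hold the software installation.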
Installation
Database Management
gSpreadComp includes an easy-to-use script for automatic download and configuration of the required databases, with scheduled updates every January and July.
Compatibility and Requirements
gSpreadComp is designed for Linux x64 systems and requires approximately 15 GB for software installation and around 92 GB for the full set of databases.
1 - Install miniconda
To avoid conflicting dependencies, gSpreadComp uses miniconda to create automatically orchestrated environments. Mamba, a much faster package manager than conda, is used during the gSpreadComp installation. Consequently, both miniconda and mamba must already be installed on your system. Below is one possible way of installing them. Please be aware that mamba works best when installed in your base environment.
# See documentation: https://docs.conda.io/en/latest/miniconda.html
$ wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
$ chmod +x Miniconda3-latest-Linux-x86_64.sh
$ ./Miniconda3-latest-Linux-x86_64.sh
$ export PATH=~/miniconda3/bin:$PATH
# Install mamba. See documentation: https://mamba.readthedocs.io/en/latest/installation.html
$ conda install mamba -n base -c conda-forge
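Since the installer expects both tools on your PATH, a quick check before moving on may save a failed run. This is a minimal sketch; `missing_tools` is a hypothetical helper name and is not part of gSpreadComp, conda, or mamba.

```shell
# Hypothetical helper: print the name of every listed tool that is not on PATH.
missing_tools() {
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
  done
}

missing="$(missing_tools conda mamba)"
if [ -z "$missing" ]; then
  echo "conda and mamba found"
else
  # $missing is left unquoted on purpose so the names print on one line.
  echo "not found on PATH:" $missing >&2
fi
```

If either tool is reported as missing, open a new shell or re-export the PATH line shown above before retrying.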
2 - Install gSpreadComp
Once you have miniconda and mamba installed and on your PATH, you can proceed to install gSpreadComp. The installation script was designed to install and set up all necessary tools and packages.
# Clone repository
$ git clone https://github.com/mdsufz/gSpreadComp.git
# Go to the gSpreadComp cloned repository folder
$ cd gSpreadComp
# Make sure you have conda ready and that you are in your base environment.
$ conda activate base
$ echo $CONDA_PREFIX
# You should see something like the following:
/path/to/miniconda3
# Run the installation script as follows
$ bash -i installation/install.sh
# Follow the instructions on the screen:
# Enter "y" if you want to install all modules; otherwise, enter "n".
# If you entered "n", enter "y" for each of the modules you would like to install individually.
The MuDoGeR's installation will begin..
( ) ( ) )
) ( ) ( (
( ) ( ) )
_____________
<_____________> ___
| |/ _ \
| | | |
| |_| |
___| |\___/
/ \___________/ \
\_____________________/
This might take a while. Time to grab a coffee...
3 - Install necessary databases
Make sure to run the database setup after gSpreadComp is installed.
Some bioinformatics tools used within gSpreadComp require specific databases to work. We developed a database download and setup tool to simplify this step. You can choose to install only the databases you intend to use: the --dbs flag selects which databases to set up (the default, all, installs every database).
Use this script if you want gSpreadComp to take care of everything.
# Make sure gSpreadComp_env is activated. It should have been created when you ran 'bash -i installation/install.sh'
$ conda activate gspreadcomp_env
# Go to gSpreadComp cloned directory
$ cd gSpreadComp
# Run the database setup script
$ bash -i installation/database-setup.sh --dbs all -o /path/to/save/databases
# You can also check out the database-setup help information
$ bash -i installation/database-setup.sh --help
gSpreadComp database script v=1.0
Usage: bash -i database-setup.sh --dbs [module] -o output_folder_for_dbs
USE THE SAME DATABASE LOCATION OUTPUT FOLDER FOR ALL DATABASES USED WITH gSpreadComp
--dbs all download and install the required and optional databases [default]"
--dbs required download and install the required databases (Victors and VFDB) for gSpreadComp
--dbs optional download and install all the optional (ARGs, GTDB-tk, CheckM) databases for gSpreadComp
--dbs args download and install the required and the ARGs databases.
-o path/folder/to/save/dbs output folder where you want to save the downloaded databases
--help | -h show this help message
--version | -v show database install script version
Usage
Activating the Conda Environment
Before using gSpreadComp, activate the appropriate conda environment using the following command:
conda activate gSpreadComp_env
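Conda environment names are case-sensitive, and the environment name appears in this README with more than one capitalization. If activation fails, listing the environments shows the exact name the installer created. The sketch below assumes the name contains "gspreadcomp" in some capitalization; `find_env_name` is a hypothetical helper, not a conda or gSpreadComp command.

```shell
# Hypothetical helper: read `conda env list` output on stdin and print the
# first column of every line whose name contains "gspreadcomp", ignoring case.
find_env_name() {
  awk 'tolower($1) ~ /gspreadcomp/ { print $1 }'
}

# Only query conda if it is actually available in this shell.
if command -v conda >/dev/null 2>&1; then
  conda env list | find_env_name
fi
```

Pass whatever name is printed to `conda activate`.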
Command-Line Usage
gSpreadComp provides several modules, each performing a specific task within the pipeline. The quick command-line usage is as follows:
gspreadcomp --help
Modules and Their Descriptions
gSpreadComp comprises several modules, each serving a specific purpose in the genome analysis workflow:
1. Taxonomy Assignment
gspreadcomp taxonomy [options] --genome_dir genome_folder -o output_dir
- Assigns taxonomy to genomes using GTDBtk v2.
- Options:
  - --genome_dir STR: folder with the bins to be classified (in fasta format)
  - --extension STR: fasta file extension (e.g. fa or fasta) [default: fa]
  - -o STR: output directory
  - -t INT: number of threads
2. Genome Quality Estimation
gspreadcomp quality [options] --genome_dir genome_folder -o output_dir
- Estimates genome completeness and contamination using CheckM.
- Options:
  - --genome_dir STR: folder with the genomes to estimate quality (in fasta format)
  - --extension STR: fasta file extension (e.g. fa or fasta) [default: fa]
  - -o STR: output directory
  - -t INT: number of threads [default: 1]
  - -h --help: print this message
3. ARG Prediction
gspreadcomp args [options] --genome_dir genome_folder -o output_dir
- Predicts the Antimicrobial Resistance Genes (ARGs) in a genome using DeepARG.
- Options:
  - --genome_dir STR: folder with the genomes to be classified (in fasta format)
  - --extension STR: fasta file extension (e.g. fa or fasta) [default: fa]
  - --min_prob NUM: Minimum probability cutoff for DeepARG [Default: 0.8]
  - --arg_alignment_identity NUM: Identity cutoff for sequence alignment for DeepARG [Default: 35]
  - --arg_alignment_evalue NUM: Evalue cutoff for DeepARG [Default: 1e-10]
  - --arg_alignment_overlap NUM: Alignment read overlap for DeepARG [Default: 0.8]
  - --arg_num_alignments_per_entry NUM: Diamond, minimum number of alignments per entry [Default: 1000]
  - -o STR: output directory
  - -h --help: print this message
4. Plasmid Prediction
gspreadcomp plasmid [options] --genome_dir genome_folder -o output_dir
- Predicts if a sequence within a fasta file is a chromosome, plasmid, or undetermined using Plasflow.
- Options:
  - --genome_dir STR: folder with the genomes to be classified (in fasta format)
  - --extension STR: fasta file extension (e.g. fa or fasta) [default: fa]
  - --threshold NUM: threshold for probability filtering [default: 0.7]
  - -o STR: output directory
  - -h --help: print this message
5. Virulence Factor annotation
gspreadcomp pathogens [options] --genome_dir genome_folder -o output_dir
- Aligns provided genomes to Virulence Factors databases and formats the output.
- Options:
  - --genome_dir STR: folder with the genomes to be aligned against Virulence factors (in fasta format)
  - --extension STR: fasta file extension (e.g. fa or fasta) [default: fa]
  - --evalue NUM: evalue, expect value, threshold as defined by NCBI-BLAST [default: 1e-50]
  - -t INT: number of threads
  - -o STR: output directory
  - -h --help: print this message
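The five annotation modules above share the same --genome_dir, --extension, and -o options, so they can be driven from one small script. The following is a sketch under stated assumptions: `GENOMES`, `OUT`, `EXT`, `THREADS`, and `DRY_RUN` are hypothetical variables, the one-subfolder-per-module output layout is an illustration rather than a gSpreadComp convention, and -t is only passed to the modules whose option lists include it. By default the script only prints the commands it would run.

```shell
# Placeholder values, not gSpreadComp defaults (except EXT, which matches [default: fa]).
GENOMES="${GENOMES:-/path/to/genomes}"
OUT="${OUT:-/path/to/output}"
EXT="${EXT:-fa}"
THREADS="${THREADS:-4}"

# Print the command instead of executing it unless DRY_RUN=0 is set.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

annotate_all() {
  # Modules whose option list includes -t.
  for module in taxonomy quality pathogens; do
    run gspreadcomp "$module" --genome_dir "$GENOMES" --extension "$EXT" -t "$THREADS" -o "$OUT/$module"
  done
  # args and plasmid list no -t option, so none is passed.
  for module in args plasmid; do
    run gspreadcomp "$module" --genome_dir "$GENOMES" --extension "$EXT" -o "$OUT/$module"
  done
}

annotate_all
```

Setting `DRY_RUN=0` executes the commands for real; reviewing the printed commands first is a cheap way to catch a wrong path or extension.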
6. Main Analysis
gspreadcomp gspread [options] -o output_dir
- Runs the main gSpreadComp analysis to compare spread and plasmid-mediated HGT.
- Options:
  - --checkm STR: Path to the formatted Quality estimation dataframe
  - --gene STR: Path to the formatted target Gene dataframe to calculate the spread
  - --gtdbtk STR: Path to the formatted Taxonomy assignment dataframe
  - --meta STR: Path to the formatted Sample's Metadata dataframe
  - --vf STR: Path to the formatted Virulence Factors assignment dataframe
  - --plasmid STR: Path to the formatted Plasmid prediction dataframe
  - --nmag INT: Minimum number of Genomes per Library accepted [default=0]
  - --spread_taxa STR: Taxonomic level to check gene spread [default=Phylum]
  - --target_gene_col STR: Name of the column from the gene dataset with the Gene_ids to analyse [default=Gene_id]
  - -t INT: number of threads
  - -o STR: output directory
  - -h --help: print this message
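Because the main analysis reads several formatted dataframes, a short pre-flight check can catch a missing or empty file before the run starts. This is a sketch, assuming all six tables are supplied: `check_inputs` is a hypothetical helper, and the `TABLES` directory and file names are placeholders, since the actual output names of the annotation modules are not stated here. The --spread_taxa and --target_gene_col values shown are the documented defaults.

```shell
# Hypothetical helper: return non-zero and name every argument that is
# missing or empty.
check_inputs() {
  rc=0
  for f in "$@"; do
    if [ ! -s "$f" ]; then
      printf 'missing or empty input: %s\n' "$f" >&2
      rc=1
    fi
  done
  return "$rc"
}

# Placeholder location and file names for the formatted tables.
TABLES="${TABLES:-/path/to/formatted_tables}"

if check_inputs "$TABLES/quality.csv" "$TABLES/genes.csv" "$TABLES/taxonomy.csv" \
                "$TABLES/metadata.csv" "$TABLES/virulence.csv" "$TABLES/plasmid.csv"; then
  gspreadcomp gspread \
    --checkm "$TABLES/quality.csv" \
    --gene "$TABLES/genes.csv" \
    --gtdbtk "$TABLES/taxonomy.csv" \
    --meta "$TABLES/metadata.csv" \
    --vf "$TABLES/virulence.csv" \
    --plasmid "$TABLES/plasmid.csv" \
    --spread_taxa Phylum \
    --target_gene_col Gene_id \
    -t 4 \
    -o /path/to/output/gspread
fi
```

The analysis is launched only when every listed file exists and is non-empty; otherwise the missing paths are reported on standard error.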
Important Considerations
- gSpreadComp is designed for hypothesis generation and is not a standalone risk assessment tool.
- Results should be interpreted cautiously and used to guide further experimental validation.
- The tool provides relative rankings within analyzed communities, not absolute risk assessments.
Citation
If you use gSpreadComp in your research, please cite:
[Citation information will be added upon publication]
Version History
main @ d34bd3f (earliest) Created 15th Apr 2025 at 11:29 by Jonas Kasmanas
Update README.md