Vignette demonstrating microbial-like sequence discovery workflow¶

Nikolay Oskolkov, SciLifeLab, NBIS Long Term Support, nikolay.oskolkov@scilifelab.se¶

Abstract

In this vignette, we will demonstrate how to prepare and run the workflow detecting microbial-like sequeneces in eukaryotic reference genomes. The workflow accepts a eukaryotic reference in FASTA-format and outputs coordinates of microbial-like regions together with microbial species annotation.

Table of Contents¶

  • Prepare input files
  • Running workflow
  • Interpreting results

Green algae

Prepare input files ¶

For demonstration purposes we are going to use the reference genome of Bathycoccus prasinos which is a green algae (picoplankton) eukaryotic organism related to plants. The reference genome GCF_002220235.1 of this algae is small (15 Mb) and therefore computationally easy to handle. The worflow together with the test-files is available at the following github address: https://github.com/NikolayOskolkov/MCWorkflow. Let us first clone th github repository and inspect its content:

In [1]:
cd /home/nikolay
git clone https://github.com/NikolayOskolkov/MCWorkflow
cd MCWorkflow
Cloning into 'MCWorkflow'...
remote: Enumerating objects: 86, done.
remote: Counting objects: 100% (86/86), done.
remote: Compressing objects: 100% (68/68), done.
remote: Total 86 (delta 44), reused 38 (delta 13), pack-reused 0 (from 0)
Unpacking objects: 100% (86/86), done.
In [2]:
ls -l
total 7640
drwxrwxr-x 2 nikolay nikolay    4096 aug  4 12:01 data
-rw-rw-r-- 1 nikolay nikolay     152 aug  4 12:01 environment.yaml
-rwxrwxr-x 1 nikolay nikolay    7089 aug  4 12:01 extract_coords_micr_contam.R
-rw-rw-r-- 1 nikolay nikolay 3766396 aug  4 12:01 GTDB_fna2name.txt
-rw-rw-r-- 1 nikolay nikolay 3675565 aug  4 12:01 GTDB_sliced_seqs_sliding_window.fna.gz
drwxrwxr-x 2 nikolay nikolay    4096 aug  4 12:01 images
-rw-rw-r-- 1 nikolay nikolay       4 aug  4 12:01 LICENSE.txt
-rw-rw-r-- 1 nikolay nikolay    1191 aug  4 12:01 main.nf
-rwxrwxr-x 1 nikolay nikolay    5269 aug  4 12:01 micr_cont_detect.sh
-rw-rw-r-- 1 nikolay nikolay     860 aug  4 12:01 nextflow.config
-rw-rw-r-- 1 nikolay nikolay    2388 aug  4 12:01 README.md
-rw-rw-r-- 1 nikolay nikolay  306912 aug  4 12:01 vignette.html
-rw-rw-r-- 1 nikolay nikolay   23410 aug  4 12:01 vignette.ipynb

The workflow consists of four main files:

  • micr_cont_detect.sh - main shell-script that builds eukaryotic reference genome index, runs alignments and does pre- and post-processing
  • extract_coords_micr_contam.R - helping R script that extracts coordinates of covered regions and annotates them with microbial species
  • GTDB_sliced_seqs_sliding_window.fna.gz - prepared 60 bp long GTDB microbial pseudo-reads to be aligned to eukaryotic references
  • GTDB_fna2name.txt - annotation of pseudo-reads with scientific microbial names according to GTDB taxonomy

Currently it is important that all the four files are located in the same directory. We recommend to perform all the analysis in this cloned github repository in order to avoid complications with changing the paths and hacking the codes. Please note that the GTDB_sliced_seqs_sliding_window.fna.gz file is a small (3.7 MB) demonstration subset of the complete dataset (234 GB) available at the SciLifeLab Figshare https://doi.org/10.17044/scilifelab.28491956 which should be used for real-world applications, i.e. for your research you will have to replace the toy-dataset from the github repository by the large one (with identical file name) from the SciLifeLab Figshare. The toy-dataset includes pseudo-reads from only one bacterium named as UBA796 sp002707085 (with corresponding reference 5oFfr3yp0G.fna) belonging to the phylum Myxococcota in the GTDB databse.

Next, we recommend to place the eukaryotic reference genomes to be profiled for microbial-like sequences into the folder data within the cloned repository as shown above. Right now, the "data"-folder is already present in the cloned repository and the Bathycoccus prasinos reference genome GCF_002220235.fna.gz has been already placed within this folder. Let us double-check that the eukaryotic reference genome is indeed inside the "data"-folder:

In [3]:
cd data && ls -l && cd ..
total 4832
-rwxrwxr-x 1 nikolay nikolay 4941250 aug  4 12:01 GCF_002220235.fna.gz

Finally, we make sure that we are going to start the workflow, i.e. call the micr_cont_detect.sh script from the cloned github repository:

In [4]:
pwd
/home/nikolay/MCWorkflow

Now everything is ready for launching the workflow.

Running workflow ¶

Before you start the workflow, please make sure that R, awk, Bowtie2 and samtools are installed and available in your path. If you run the workflow on an HPC cluster, please make sure that you have loaded the corresponding modules. The workflow has the following format:

./micr_cont_detect.sh REF_GENOME INPUT_DIR TYPE_OF_PSEUDO_READS THREADS PSEUDO_READS N_ALLOWED_MULTIMAPPERS

where:

  • REF_GENOME - gzipped eukaryotic reference genome in FASTA-format (no path is needed, just the name of the file)
  • INPUT_DIR - directory containing the eukaryotic reference genome (here you need to provide the absolute path)
  • TYPE_OF_PSEUDO_READS - whether RefSeq or GTDB or human pseudo-reads are used, can only be "RefSeq" or "GTDB" or "human"
  • THERADS - number of threads available
  • PSEUDO_READS - GTDB, RefSeq or human pseudo-reads provided together with workflow (no path is needed, just the name of the file)
  • N_ALLOWED_MULTIMAPPERS - number of multi-mapping pseudo-reads allowed by Bowtie2

Now we can start the workflow with the following command line:

In [5]:
./micr_cont_detect.sh GCF_002220235.fna.gz /home/nikolay/MCWorkflow/data GTDB 4 GTDB_sliced_seqs_sliding_window.fna.gz 10
PREPARING FILES FOR ANALYSIS OF GCF_002220235.fna.gz REFERENCE GENOME

BUILDING BOWTIE2 INDEX FOR GCF_002220235.fna.gz REFERENCE GENOME
ALIGNING MICROBIAL READS WITH BOWTIE2 TO GCF_002220235.fna.gz REFERENCE GENOME
[bam_sort_core] merging from 0 files and 4 in-memory blocks...

RANKING GCF_002220235.fna.gz CONTIGS BY NUMBER OF MAPPED MICROBIAL READS
COMPUTING BREADTH OF COVERAGE FOR EACH CONTIG AND COORDINATES OF MICROBIAL CONTAMINATION FOR GCF_002220235.fna.gz REFERENCE GENOME
NC_023997.1 CONTIG OF GCF_002220235.fna.gz
EXTRACTING COORDINATES OF MICROBIAL CONTAMINATION
DELETING BAM AND COMPRESSING BOC FILES
NC_024004.1 CONTIG OF GCF_002220235.fna.gz
EXTRACTING COORDINATES OF MICROBIAL CONTAMINATION
DELETING BAM AND COMPRESSING BOC FILES
NC_024008.1 CONTIG OF GCF_002220235.fna.gz
EXTRACTING COORDINATES OF MICROBIAL CONTAMINATION
DELETING BAM AND COMPRESSING BOC FILES
NC_023992.1 CONTIG OF GCF_002220235.fna.gz
EXTRACTING COORDINATES OF MICROBIAL CONTAMINATION
DELETING BAM AND COMPRESSING BOC FILES
NC_024006.1 CONTIG OF GCF_002220235.fna.gz
EXTRACTING COORDINATES OF MICROBIAL CONTAMINATION
DELETING BAM AND COMPRESSING BOC FILES
NC_024001.1 CONTIG OF GCF_002220235.fna.gz
EXTRACTING COORDINATES OF MICROBIAL CONTAMINATION
DELETING BAM AND COMPRESSING BOC FILES
NC_024000.1 CONTIG OF GCF_002220235.fna.gz
EXTRACTING COORDINATES OF MICROBIAL CONTAMINATION
DELETING BAM AND COMPRESSING BOC FILES
NC_024003.1 CONTIG OF GCF_002220235.fna.gz
EXTRACTING COORDINATES OF MICROBIAL CONTAMINATION
DELETING BAM AND COMPRESSING BOC FILES

AGGREGATING RESULTS FOR GCF_002220235.fna.gz REFERENCE GENOME AND CLEANING
COMPUTING LIST OF MOST ABUNDANT MICROBES CONTMAMINATING GCF_002220235.fna.gz REFERENCE GENOME

ANALYSIS FOR GCF_002220235.fna.gz REFERENCE GENOME FINISHED SUCCESSFULLY

For the toy-dataset and the small eukaryotic reference genome, the workflow takes only a few seconds to finish. Please note that for real-world applications, the alignment step is the most time consuming. Since the full GTDB sliced microbial pseudo-reads data set includes 26 billion reads, to our experience, the alignment to e.g. mammalian reference genomes can take up to 48 hours on an HPC compute node with 20 cores. Multi-threading is crucial here, more available threads may considerable speed up the workflow execution.

Interepreting results ¶

Let us now go through the main outputs-files of the workflow. First of all, we see the bt2l Bowtie2 index-files within the data-folder, and the verbose output of bowtie2-build command was written to bowtie2-build.log file.

In [6]:
cd data && ls -l
total 49432
-rw-rw-r-- 1 nikolay nikolay    33688 aug  4 12:09 bowtie2-build.log
-rwxrwxr-x 1 nikolay nikolay  4941250 aug  4 12:01 GCF_002220235.fna.gz
-rw-rw-r-- 1 nikolay nikolay 13403961 aug  4 12:08 GCF_002220235.fna.gz.1.bt2l
-rw-rw-r-- 1 nikolay nikolay  7519068 aug  4 12:08 GCF_002220235.fna.gz.2.bt2l
-rw-rw-r-- 1 nikolay nikolay      709 aug  4 12:08 GCF_002220235.fna.gz.3.bt2l
-rw-rw-r-- 1 nikolay nikolay  3759531 aug  4 12:08 GCF_002220235.fna.gz.4.bt2l
drwxrwxr-x 2 nikolay nikolay     4096 aug  4 12:09 GCF_002220235.fna.gz_GTDB
-rw-rw-r-- 1 nikolay nikolay 13403961 aug  4 12:09 GCF_002220235.fna.gz.rev.1.bt2l
-rw-rw-r-- 1 nikolay nikolay  7519068 aug  4 12:09 GCF_002220235.fna.gz.rev.2.bt2l

Further, all the main outputs of the workflow were placed to the GCF_002220235.fna.gz_GTDB folder, let us navigate to the folder and check its content:

In [7]:
cd GCF_002220235.fna.gz_GTDB && ls -l
total 128
-rw-rw-r-- 1 nikolay nikolay   319 aug  4 12:09 contigs_abund_sorted_GTDB_GCF_002220235.fna.gz.txt
-rw-rw-r-- 1 nikolay nikolay   477 aug  4 12:09 contigs_boc_sorted_GTDB_GCF_002220235.fna.gz.txt
-rw-rw-r-- 1 nikolay nikolay 36549 aug  4 12:09 coords_micr_contam_GCF_002220235.fna.gz.txt
-rw-rw-r-- 1 nikolay nikolay    23 aug  4 12:09 microbes_abundant_GTDB_GCF_002220235.fna.gz.txt
-rw-rw-r-- 1 nikolay nikolay  1244 aug  4 12:09 NC_023992.1__GCF_002220235.fna.gz.boc.gz
-rw-rw-r-- 1 nikolay nikolay  1858 aug  4 12:09 NC_023997.1__GCF_002220235.fna.gz.boc.gz
-rw-rw-r-- 1 nikolay nikolay  2016 aug  4 12:09 NC_024000.1__GCF_002220235.fna.gz.boc.gz
-rw-rw-r-- 1 nikolay nikolay  2171 aug  4 12:09 NC_024001.1__GCF_002220235.fna.gz.boc.gz
-rw-rw-r-- 1 nikolay nikolay  2147 aug  4 12:09 NC_024003.1__GCF_002220235.fna.gz.boc.gz
-rw-rw-r-- 1 nikolay nikolay  2404 aug  4 12:09 NC_024004.1__GCF_002220235.fna.gz.boc.gz
-rw-rw-r-- 1 nikolay nikolay  2504 aug  4 12:09 NC_024006.1__GCF_002220235.fna.gz.boc.gz
-rw-rw-r-- 1 nikolay nikolay  3108 aug  4 12:09 NC_024008.1__GCF_002220235.fna.gz.boc.gz
-rw-rw-r-- 1 nikolay nikolay 43463 aug  4 12:09 PseudoReads_aligned_to_GCF_002220235.fna.gz.bam
-rw-rw-r-- 1 nikolay nikolay   308 aug  4 12:09 PseudoReads_aligned_to_GCF_002220235.fna.gz.bam.csi

Here you can see a number of files. Probably the main file is coords_micr_contam_GCF_002220235.fna.gz.txt, this is the coordinates of microbial-like regions in BED-format (despite the file does not have *.bed - extension). The columns in this file have the following meaning: 1) name (id) of the eukaryotic reference genome profiles, 2) contig / scaffold / chromosome id within the eukaryotic reference genome containing microvbial-like region, 3) start coordinate of the detected microbial-like region, 4) end coordinate of the detected microbial-like region, 5) genomic length of the microbial-like region, 6) total number of reads aligned to the detected microbial-like region, 7) average number of reads supporting each position within the detected microbial-like region, 8) the next five columns represent top abundant microbial species for each detected microbial-like region (the number of reads is reported for each of the top abundant microbes); if fewer than five unique microbes are dicovered within the microbial-like region, the rest of the columns contain recods "NA_reads_NA".

In [8]:
head coords_micr_contam_GCF_002220235.fna.gz.txt | cut -f1-10
ORGANISM	CONTIG	START	END	LENGTH	N_READS	AVERAGE_DEPTH	MICR1	MICR2	MICR3
GCF_002220235.fna.gz	NC_023997.1	231657	231716	59	1	1	1_reads_UBA796_sp002707085	NA_reads_NA	NA_reads_NA
GCF_002220235.fna.gz	NC_023997.1	231807	232156	349	26	4	26_reads_UBA796_sp002707085	NA_reads_NA	NA_reads_NA
GCF_002220235.fna.gz	NC_023997.1	232752	232891	139	9	4	9_reads_UBA796_sp002707085	NA_reads_NA	NA_reads_NA
GCF_002220235.fna.gz	NC_023997.1	232905	232964	59	1	1	1_reads_UBA796_sp002707085	NA_reads_NA	NA_reads_NA
GCF_002220235.fna.gz	NC_023997.1	233015	233104	89	4	3	4_reads_UBA796_sp002707085	NA_reads_NA	NA_reads_NA
GCF_002220235.fna.gz	NC_023997.1	233145	233369	224	13	3	13_reads_UBA796_sp002707085	NA_reads_NA	NA_reads_NA
GCF_002220235.fna.gz	NC_023997.1	233373	233377	4	1	1	1_reads_UBA796_sp002707085	NA_reads_NA	NA_reads_NA
GCF_002220235.fna.gz	NC_023997.1	233395	233494	99	3	2	3_reads_UBA796_sp002707085	NA_reads_NA	NA_reads_NA
GCF_002220235.fna.gz	NC_023997.1	233515	233694	179	11	4	11_reads_UBA796_sp002707085	NA_reads_NA	NA_reads_NA

Further, congigs within the eukaryotic reference genome sorted by the total number of aligned microbial pseuso-reads and contig-wide breadth of coverage are reported in contigs_abund_sorted_GTDB_GCF_002220235.fna.gz.tx and contigs_boc_sorted_GTDB_GCF_002220235.fna.gz.txt files, respectively.

In [9]:
head contigs_abund_sorted_GTDB_GCF_002220235.fna.gz.txt
ORGANISM	CONTIG	N_READS
GCF_002220235.fna.gz	NC_023997.1	347
GCF_002220235.fna.gz	NC_024004.1	274
GCF_002220235.fna.gz	NC_024008.1	221
GCF_002220235.fna.gz	NC_023992.1	193
GCF_002220235.fna.gz	NC_024006.1	186
GCF_002220235.fna.gz	NC_024001.1	175
GCF_002220235.fna.gz	NC_024000.1	123
GCF_002220235.fna.gz	NC_024003.1	59
In [10]:
head contigs_boc_sorted_GTDB_GCF_002220235.fna.gz.txt
ORGANISM	CONTIG	N_READS	LENGTH	BOC
GCF_002220235.fna.gz	NC_023997.1	347	712459	0.00871348
GCF_002220235.fna.gz	NC_023992.1	193	465570	0.00777327
GCF_002220235.fna.gz	NC_024004.1	274	1019276	0.00525471
GCF_002220235.fna.gz	NC_024008.1	221	1352724	0.00400451
GCF_002220235.fna.gz	NC_024001.1	175	937610	0.00396433
GCF_002220235.fna.gz	NC_024006.1	186	1091008	0.00389915
GCF_002220235.fna.gz	NC_024000.1	123	895536	0.00294349
GCF_002220235.fna.gz	NC_024003.1	59	989707	0.00161462

The N_READS coluumn in both files reports the number of microbial pseudo-reads aligned to each contig / scaffold / chromosome within the eukaryotic reference genome. In addition, the LENGTH and BOC columns in the "boc_sorted" file report the length of each contig / scaffold / chromosome in the eukaryotic reference and the breadth of coverage (BOC - fraction of reference nucleotides covered at least once), respectively. Finally, microbes_abundant_GTDB_GCF_002220235.fna.gz.txt file contains GTDB microbial species (their reference genome ids) sorted by their total abundance within the whole eukaryotic reference genome. Note that this file will be absent when screening a reference genome for human contamination since all aligned pseudo-reads belong to human in this case.

In [11]:
head microbes_abundant_GTDB_GCF_002220235.fna.gz.txt
   1578 5oFfr3yp0G.fna

In this particular example, we had only one GTDB microbial species UBA796 sp002707085 (with the reference 5oFfr3yp0G.fna) belonging to the phylum Myxococcota in the GTDB databse. All the 1578 micobial pseudo-reads aligned to the green algae reference genome belonged to that single microbe. The bam-file containng the alignments is PseudoReads_aligned_to_GCF_002220235.fna.gz.bam, let us quickly check that there are 1578 aligned reads in total.

In [12]:
samtools view PseudoReads_aligned_to_GCF_002220235.fna.gz.bam | wc -l
1578

The *.boc.gz files contain the outputs from samtools depth for each contig / scaffold / chromosome within the eukaryotic reference genome. They are not very important and can be deleted for the sake of disk space.

In [ ]: