# nf-core/createtaxdb: Output

## Introduction

This document describes the output produced by the pipeline. Most of the plots are taken from the MultiQC report, which summarises results at the end of the pipeline.

The directories listed below will be created in the results directory after the pipeline has finished. All paths are relative to the top-level results directory.

## Pipeline overview

The pipeline is built using [Nextflow](https://www.nextflow.io/) and processes data using the following steps:

- [MultiQC](#multiqc) - Aggregate report describing versions and methods text for your pipeline run
- [Pipeline information](#pipeline-information) - Report metrics generated during the workflow execution
- [Bracken](#bracken) - Database files for Bracken
- [ganon](#ganon) - Database files for ganon
- [Centrifuge](#centrifuge) - Database files for Centrifuge
- [DIAMOND](#diamond) - Database files for DIAMOND
- [Kaiju](#kaiju) - Database files for Kaiju
- [KMCP](#kmcp) - Database files for KMCP
- [Kraken2](#kraken2) - Database files for Kraken2
- [KrakenUniq](#krakenuniq) - Database files for KrakenUniq
- [MALT](#malt) - Database files for MALT

The pipeline can also generate downstream pipeline input samplesheets.
These are stored in `<outdir>/downstream_samplesheets`.

### MultiQC

<details markdown="1">
<summary>Output files</summary>

- `multiqc/`
  - `multiqc_report.html`: a standalone HTML file that can be viewed in your web browser.
  - `multiqc_data/`: directory containing parsed statistics from the different tools used in the pipeline.
  - `multiqc_plots/`: directory containing static images from the report in various formats.

</details>

[MultiQC](http://multiqc.info) is a visualization tool that generates a single HTML report summarising all samples in your project. Most of the pipeline QC results are visualised in the report and further statistics are available in the report data directory.

Results generated by MultiQC collate pipeline QC from supported tools e.g. FastQC. The pipeline has special steps which also allow the software versions to be reported in the MultiQC output for future traceability. For more information about how to use MultiQC reports, see <http://multiqc.info>.

### Pipeline information

<details markdown="1">
<summary>Output files</summary>

- `pipeline_info/`
  - Reports generated by Nextflow: `execution_report.html`, `execution_timeline.html`, `execution_trace.txt` and `pipeline_dag.dot`/`pipeline_dag.svg`.
  - Reports generated by the pipeline: `pipeline_report.html`, `pipeline_report.txt` and `software_versions.yml`. The `pipeline_report*` files will only be present if the `--email` / `--email_on_fail` parameters are used when running the pipeline.
  - Reformatted samplesheet files used as input to the pipeline: `samplesheet.valid.csv`.
  - Parameters used by the pipeline run: `params.json`.

</details>

[Nextflow](https://www.nextflow.io/docs/latest/tracing.html) provides excellent functionality for generating various reports relevant to the running and execution of the pipeline. This will allow you to troubleshoot errors with the running of the pipeline, and also provide you with other information such as launch commands, run times and resource usage.

### Bracken

[Bracken](https://github.com/jenniferlu717/Bracken/) (Bayesian Reestimation of Abundance with KrakEN) is a highly accurate statistical method that computes the abundance of species in DNA sequences from a metagenomics sample.

<details markdown="1">
<summary>Output files</summary>

- `bracken/`
  - `<db_name>/`
    - `database100mers.kmer_distrib`: Bracken kmer distribution file
    - `database100mers.kraken`: Bracken index file
    - `database.kraken`: Bracken database file
    - `hash.k2d`: Kraken2 hash database file
    - `opts.k2d`: Kraken2 opts database file
    - `taxo.k2d`: Kraken2 taxo database file
    - `library/`: Intermediate Kraken2 directory containing FASTAs and related files of added genomes
    - `taxonomy/`: Intermediate Kraken2 directory containing taxonomy files of added genomes
    - `seqid2taxid.map`: Intermediate Kraken2 file mapping sequence IDs to taxonomy IDs of added genomes

</details>

Note that all intermediate files are required for a Bracken database, even though Kraken2 itself only requires the `*.k2d` files.

The resulting `<db_name>/` directory can be given to Bracken itself with `bracken -d <your_database_name>` etc.
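
Because every entry in the list above is needed, it can be useful to confirm that a copied or moved database directory is still complete before passing it to Bracken. The following is a minimal sketch, assuming the file names are exactly those listed above; `missing_bracken_entries` is a hypothetical helper and not part of the pipeline or of Bracken.

```shell
# Hypothetical helper: print each expected entry that is missing from a
# Bracken database directory, one per line. Returns non-zero if any is missing.
# The entry names are taken from the "Output files" list above.
missing_bracken_entries() {
  local db_dir="$1"
  local entry
  local status=0
  for entry in \
    database100mers.kmer_distrib database100mers.kraken database.kraken \
    hash.k2d opts.k2d taxo.k2d library taxonomy seqid2taxid.map; do
    if [ ! -e "${db_dir}/${entry}" ]; then
      printf '%s\n' "${entry}"
      status=1
    fi
  done
  return "${status}"
}

# Example call (the path is a placeholder):
# missing_bracken_entries "results/bracken/<db_name>"
```

An empty output and a zero exit status indicate that all listed entries are present.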

### Centrifuge

[Centrifuge](https://github.com/DaehwanKimLab/centrifuge) is a very rapid and memory-efficient system for the classification of DNA sequences from microbial samples.

<details markdown="1">
<summary>Output files</summary>

- `centrifuge/`
  - `database-centrifuge/`
    - `<database>.*.cf`: Centrifuge database files

</details>

The directory path and the shared basename of the `cf` files can be given to Centrifuge with `centrifuge -x /<path>/<to>/<cf_files_basename>` etc.
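
The basename expected by `-x` can be derived from the files on disk. The sketch below assumes that the index files are numbered and that one of them ends in `.1.cf`; `centrifuge_index_basename` is a hypothetical helper introduced only for illustration.

```shell
# Hypothetical helper: print the path prefix shared by the Centrifuge index
# files in a directory, by stripping the ".1.cf" suffix from the first match.
# Returns non-zero if no "*.1.cf" file is found.
centrifuge_index_basename() {
  local dir="$1"
  local first
  for first in "${dir}"/*.1.cf; do
    # If the glob matched nothing, it is left unexpanded and does not exist.
    [ -e "${first}" ] || return 1
    printf '%s\n' "${first%.1.cf}"
    return 0
  done
}

# Example call (the path is a placeholder):
# centrifuge -x "$(centrifuge_index_basename results/centrifuge/database-centrifuge)" ...
```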

### ganon

[ganon](https://github.com/pirovc/ganon/) classifies genomic sequences against large sets of references efficiently, with integrated download and update of databases (refseq/genbank), taxonomic profiling (ncbi/gtdb), binning and hierarchical classification, customized reporting and more.

<details markdown="1">
<summary>Output files</summary>

- `ganon/`
  - `<database>.hibf`: main bloom filter index file
  - `<database>.tax`: taxonomy tree used for taxonomy assignment

</details>

The directory containing these two files can be given to ganon itself by using the database name as a prefix, e.g., `ganon classify -d /<path>/<to>/<database name without extensions>`.
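
The prefix is simply the path to the `.hibf` file with its extension removed. As a small sketch, using a hypothetical helper named `ganon_db_prefix`:

```shell
# Hypothetical helper: turn the path of a ganon ".hibf" file into the
# database prefix by removing the extension with shell parameter expansion.
ganon_db_prefix() {
  printf '%s\n' "${1%.hibf}"
}

# Example call (the path is a placeholder):
# ganon classify -d "$(ganon_db_prefix "results/ganon/<database>.hibf")" ...
```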

### DIAMOND

[DIAMOND](https://github.com/bbuchfink/diamond) is an accelerated BLAST-compatible local sequence aligner particularly used for protein alignment.

<details markdown="1">
<summary>Output files</summary>

- `diamond/`
  - `<database>.dmnd`: DIAMOND dmnd database file

</details>

The `dmnd` file can be given to one of the DIAMOND alignment commands with `diamond blast<x/p> -d <your_database>.dmnd` etc.

### Kaiju

[Kaiju](https://bioinformatics-centre.github.io/kaiju/) is a fast and sensitive taxonomic classifier for metagenomics that utilises nucleotide-to-protein translations.

<details markdown="1">
<summary>Output files</summary>

- `kaiju/`
  - `<database_name>.fmi`: Kaiju FMI index file

</details>

The `fmi` file can be given to Kaiju itself with `kaiju -f <your_database>.fmi` etc.

### KMCP

[KMCP](https://bioinf.shenwei.me/kmcp/) is a metagenomic profiling tool focused on prokaryotic and viral sequences.

<details markdown="1">
<summary>Output files</summary>

- `kmcp/`
  - `database-kmcp-index/`: directory containing KMCP index files

</details>

The `database-kmcp-index/` directory can be given to KMCP itself with `kmcp search --db-dir <your_database>/` etc.; see the [kmcp search documentation](https://bioinf.shenwei.me/kmcp/usage/#search).
Note that the pipeline does not output files from `kmcp-compute` as these are not used in downstream tools.

### Kraken2

[Kraken2](https://ccb.jhu.edu/software/kraken2/) is a taxonomic classification system using exact k-mer matches to achieve high accuracy and fast classification speeds.

<details markdown="1">
<summary>Output files</summary>

- `kraken2/`
  - `<db_name>/`
    - `hash.k2d`: Kraken2 hash database file
    - `opts.k2d`: Kraken2 opts database file
    - `taxo.k2d`: Kraken2 taxo database file
    - `library/`: Intermediate directory containing FASTAs and related files of added genomes (only present if `--build_bracken` or `--kraken2_keepintermediate` supplied)
    - `taxonomy/`: Intermediate directory containing taxonomy files of added genomes (only present if `--build_bracken` or `--kraken2_keepintermediate` supplied)
    - `seqid2taxid.map`: Intermediate file containing taxonomy files of added genomes (only present if `--build_bracken` or `--kraken2_keepintermediate` supplied)

</details>

The resulting `<db_name>/` directory can be given to Kraken2 itself with `kraken2 --db <your_database_name>` etc.

### KrakenUniq

[KrakenUniq](https://github.com/fbreitwieser/krakenuniq) is a metagenomics classifier with unique k-mer counting for more specific results.

<details markdown="1">
<summary>Output files</summary>

- `krakenuniq/`
  - `<db_name>/`
    - `database-build.log`: KrakenUniq build process log
    - `database.idx`: KrakenUniq index file
    - `database.kdb`: KrakenUniq database file
    - `taxDB`: KrakenUniq taxonomy information file

</details>

Note that there may be additional files in this directory; however, the ones listed above are reportedly the required ones.

### MALT

[MALT](https://software-ab.cs.uni-tuebingen.de/download/malt) is a fast replacement for BLASTX, BLASTP and BLASTN, and provides both local and semi-global alignment capabilities.

<details markdown="1">
<summary>Output files</summary>

- `malt/`
  - `malt_index/`: directory containing MALT index files

</details>

The `malt_index` directory can be given to MALT itself with `malt-run --index <your_database>/` etc.

### Downstream samplesheets

The pipeline can also generate input files for the following downstream
pipelines:

- [nf-core/taxprofiler](https://nf-co.re/taxprofiler)

<details markdown="1">
<summary>Output files</summary>

- `downstream_samplesheets/`
  - `taxprofiler.csv`: Partially filled-out nf-core/taxprofiler `--databases` CSV with paths to database directories or `tar.gz` archives relative to the results directory, e.g. `nextflow run nf-core/taxprofiler -profile docker --input samplesheet.csv --databases <createtaxdb_outdir>/downstream_samplesheets/taxprofiler.csv`

</details>

:::warning
Any generated downstream samplesheet is provided on a 'best effort' basis and is not guaranteed to work straight out of the box!
They may not be complete (e.g. some columns may need to be manually filled in).
:::

:::tip
We highly recommend moving all created database directories to a central 'cache' location before running downstream pipelines.
This ensures that the database files are not lost if the pipeline is re-run, and also allows you to share the database files with other users.

If you do so, make sure to update the paths in the corresponding downstream samplesheet files accordingly.
:::
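
After moving the databases, the paths in a samplesheet can be updated with a plain string replacement. The sketch below makes no assumption about the column layout of the file; it only assumes that the old paths share a common leading string that can be swapped for the new cache location. `rewrite_db_paths` and the paths in the example call are hypothetical.

```shell
# Hypothetical helper: replace every literal occurrence of an old path prefix
# with a new one in a samplesheet, writing the result to standard output.
# The match is a plain string match (no regular expressions), so characters
# such as "." in paths are treated literally.
rewrite_db_paths() {
  local old_prefix="$1"
  local new_prefix="$2"
  local samplesheet="$3"
  if [ -z "${old_prefix}" ]; then
    echo "old prefix must not be empty" >&2
    return 1
  fi
  awk -v old="${old_prefix}" -v new="${new_prefix}" '
    {
      line = $0
      out = ""
      # Consume the line piece by piece, swapping each occurrence of old.
      while ((i = index(line, old)) > 0) {
        out = out substr(line, 1, i - 1) new
        line = substr(line, i + length(old))
      }
      print out line
    }
  ' "${samplesheet}"
}

# Example call (all paths are placeholders):
# rewrite_db_paths "/old/results" "/shared/db_cache" taxprofiler.csv > taxprofiler.updated.csv
```

Writing to a new file rather than editing in place keeps the original samplesheet available for comparison.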
