Library curation BOLD
main @ 4a78148

Workflow Type: Snakemake
Work-in-progress

This repository contains scripts and synonymy data for pipelining the automated curation of BOLD data dumps in BCDM TSV format. The goal is to implement the classification of barcode reference sequences that is being developed by the BGE consortium. The criteria for this classification are being developed in a living document.

A further goal of this project is to develop the code in this repository according to community standards for automation, reproducibility, and provenance. In practice, this means including the scripts in a pipeline system such as Snakemake, adopting an environment configuration system such as conda, and organizing the folder structure to meet the requirements of WorkflowHub. The latter will provide the project with a DOI and will help generate RO-Crate documents, so that the entire tool chain is FAIR compliant according to the current state of the art.

Install

Clone the repo:

git clone https://github.com/FabianDeister/Library_curation_BOLD.git

Change directory:

cd Library_curation_BOLD

The code in this repo depends on various tools, which are managed using mamba (a drop-in replacement for conda). The following command sets up an environment in which all required tools are installed:

mamba env create -f environment.yml

Once set up, this is activated like so:

mamba activate bold-curation
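The install steps above assume that git and mamba are already available. A minimal sketch of a pre-flight check follows; the helper name missing_tools is hypothetical and not part of this repository.

```shell
#!/bin/sh
# Hypothetical helper (not part of the repository):
# print the name of every listed tool that is not found on PATH.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || echo "$tool"
    done
}

# The install steps rely on git (to clone) and mamba (to build the environment).
missing="$(missing_tools git mamba | tr '\n' ' ')"
if [ -n "$missing" ]; then
    echo "Missing prerequisites: $missing" >&2
fi
```

If nothing is printed, both tools were found and the clone and environment setup can proceed.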

How to run

Bash

Although the aim of this project is to integrate all steps of the process in a simple Snakemake pipeline, at present this is not implemented. Instead, the steps are executed individually on the command line as Perl scripts within the conda/mamba environment. Because the project keeps its own Perl modules in the lib folder, every script needs to be run with the -I include flag, which adds the module folder to Perl's search path. Hence, from inside the scripts folder, the invocation looks like the following:

perl -I../../lib scriptname.pl -arg1 val1 -arg2 val2
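As an alternative to passing -I on every call, Perl also reads extra module directories from the PERL5LIB environment variable. The sketch below sets it once per shell session. REPO_ROOT is an assumed variable that should point at the top of your clone; it is not defined by the repository.

```shell
#!/bin/sh
# Assumption: REPO_ROOT points at the top of your clone
# (defaults to the current directory if unset).
REPO_ROOT="${REPO_ROOT:-$PWD}"

# Prepend the repository's lib folder to Perl's module search path,
# keeping any value PERL5LIB already had.
export PERL5LIB="$REPO_ROOT/lib${PERL5LIB:+:$PERL5LIB}"

# Scripts can now be started without the -I flag, for example:
#   perl scriptname.pl -arg1 val1 -arg2 val2
```

Because the path is absolute, this works from any working directory, whereas the relative -I../../lib form only resolves correctly from inside the scripts folder.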

Snakemake

Follow the installation instructions above.

Update config/config.yml to define your input data.

Navigate to the directory "workflow" and type:

snakemake -p -c {number of cores} target
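Before a full run it can help to preview what Snakemake intends to do. The sketch below wraps the command above in a function that performs a dry run first (-n is Snakemake's dry-run flag) and only starts the real run if the dry run succeeds. The function name run_pipeline and the CORES variable are illustrative assumptions.

```shell
#!/bin/sh
# Hypothetical wrapper (not part of the repository).
# CORES is an assumed variable; it defaults to 2 if unset.
run_pipeline() {
    cores="${CORES:-2}"
    # Dry run: list the jobs and shell commands without executing anything.
    # The real run starts only if the dry run exits successfully.
    snakemake -n -p -c "$cores" target &&
        snakemake -p -c "$cores" target
}

# Usage, from inside the "workflow" directory:
#   CORES=4 run_pipeline
```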

If running on an HPC cluster with a SLURM scheduler you could use a bash script like this one:

#!/bin/bash
#SBATCH --partition=hour
#SBATCH --output=job_curate_bold_%j.out
#SBATCH --error=job_curate_bold_%j.err
#SBATCH --mem=24G
#SBATCH --cpus-per-task=2

source activate bold-curation

snakemake -p -c 2 target

echo Complete!
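In the script above the core count appears twice, once in --cpus-per-task and once in the snakemake call. SLURM exports the allocated value as SLURM_CPUS_PER_TASK when --cpus-per-task is set, so one possible variation reads it from there. The helper name cores_for_job and the fallback of 1 core are assumptions for illustration.

```shell
#!/bin/sh
#SBATCH --cpus-per-task=2

# Hypothetical helper: report how many cores Snakemake should use.
# Inside a SLURM job this is the allocated CPU count; outside a job
# it falls back to 1 (the fallback value is an assumption).
cores_for_job() {
    echo "${SLURM_CPUS_PER_TASK:-1}"
}

# Inside the job script, the final call would then read:
#   snakemake -p -c "$(cores_for_job)" target
echo "Snakemake would run with $(cores_for_job) core(s)"
```

With this variation, changing the --cpus-per-task line is enough to change the number of cores Snakemake uses.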


Version History

main @ 4a78148 (earliest) Created 24th Apr 2024 at 09:51 by Rutger Vos

omg it works


Creators and Submitter
Creators
  • Rutger Vos
  • Fabian Deister
  • Ben Price
Additional credit

Special thanks to Sujeevan Ratnasingham and the team at CBG for the creation of the BCDM data exchange format that this pipeline operates on.

Citation
Vos, R., Deister, F., & Price, B. (2024). Library curation BOLD. WorkflowHub. https://doi.org/10.48546/WORKFLOWHUB.WORKFLOW.833.1
Last updated: 24th Apr 2024 at 10:09
