ParslRNA-seq Scientific Workflow
master @ 22ad96e

RNA-seq Scientific Workflow

Workflow for RNA sequencing using the Parallel Scripting Library - Parsl.

Reference: Cruz, L., Coelho, M., Terra, R., Carvalho, D., Gadelha, L., Osthoff, C., & Ocaña, K. (2021). Workflows Científicos de RNA-Seq em Ambientes Distribuídos de Alto Desempenho: Otimização de Desempenho e Análises de Dados de Expressão Diferencial de Genes. In Anais do XV Brazilian e-Science Workshop, p. 57-64. Porto Alegre: SBC. DOI: https://doi.org/10.5753/bresci.2021.15789

Requirements

To use the RNA-seq workflow, the following tools must be available: Bowtie2, Samtools, Picard, HTSeq, R (for the DESeq2 script) and Parsl.

You can install Bowtie2 by downloading and unpacking the prebuilt release archive:

bowtie2-2.3.5.1-linux-x86_64.zip

Or, with yum:

sudo yum install bowtie2-2.3.5-linux-x86_64

Samtools is a suite of programs for interacting with high-throughput sequencing data.

Picard is a set of Java command line tools for manipulating high-throughput sequencing (HTS) data and formats.

HTSeq is a native Python library that follows the conventions of many Python packages. You can install it by running:

pip install HTSeq

HTSeq uses NumPy, Pysam and matplotlib. Make sure these tools are installed.

To use the DESeq2 script, make sure the R language is also installed. You can install it by running:

sudo apt install r-base

The recommended way to install Parsl is the approach suggested in Parsl's documentation:

python3 -m pip install parsl

To use Parsl, you need Python 3.5 or above. HTSeq also requires Python, so load a single Python version that serves both.
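Before running the workflow, it can help to check that the requirements above are reachable from your environment. The following is a minimal sketch, not part of the repository: the executable names in COMMANDS are assumptions (in particular, java is listed only because Picard is a set of Java tools, and Rscript because it ships with R), so adjust them to match your installation.

```python
import importlib.util
import shutil

# Assumed executable and module names; adjust them to match your installation.
COMMANDS = ["bowtie2", "samtools", "java", "Rscript"]
MODULES = ["HTSeq", "numpy", "pysam", "matplotlib", "parsl"]


def missing_commands(names):
    """Return the commands from `names` that are not found on PATH."""
    return [name for name in names if shutil.which(name) is None]


def missing_modules(names):
    """Return the top-level Python modules from `names` that cannot be located."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    print("Missing commands:", missing_commands(COMMANDS) or "none")
    print("Missing modules:", missing_modules(MODULES) or "none")
```

Running the check with the same Python interpreter that will run the workflow also confirms that Parsl and HTSeq are installed for that one version.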

Workflow invocation

First, create a comma-separated values (CSV) file. Its first line must be the header sampleName,fileName,condition, with no spaces around the commas. You can use the file "table.csv" in this repository as an example. Shown as a table, the contents of your CSV file will look like this:

sampleName        fileName                condition
tissue control 1  SRR5445794.merge.count  control
tissue control 2  SRR5445795.merge.count  control
tissue control 3  SRR5445796.merge.count  control
tissue wntup 1    SRR5445797.merge.count  wntup
tissue wntup 2    SRR5445798.merge.count  wntup
tissue wntup 3    SRR5445799.merge.count  wntup
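If you prefer to generate the sample table rather than type it by hand, a short script can write it with Python's csv module. This is only a sketch: the helper names (sample_rows, format_table) and the output name my_table.csv are hypothetical, and the sample naming pattern simply mirrors the example rows above.

```python
import csv
import io

HEADER = ["sampleName", "fileName", "condition"]


def sample_rows(accessions_by_condition):
    """Build one row per accession, numbering the samples within each condition."""
    rows = []
    for condition, accessions in accessions_by_condition.items():
        for index, accession in enumerate(accessions, start=1):
            rows.append([f"tissue {condition} {index}",
                         f"{accession}.merge.count",
                         condition])
    return rows


def format_table(rows):
    """Return the CSV text: header first, comma-separated, no padding spaces."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(HEADER)
    writer.writerows(rows)
    return buffer.getvalue()


if __name__ == "__main__":
    samples = {
        "control": ["SRR5445794", "SRR5445795", "SRR5445796"],
        "wntup": ["SRR5445797", "SRR5445798", "SRR5445799"],
    }
    # Hypothetical output name, chosen so the repository's table.csv is not overwritten.
    with open("my_table.csv", "w", newline="") as handle:
        handle.write(format_table(sample_rows(samples)))
```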

Besides the script's name, the command-line arguments passed to the Python script must be, in this order:

  1. The indexed genome;
  2. The number of threads for the bowtie and sort tasks, which is also the number of split files for the split_picard task and the number of CPUs used by the htseq task;
  3. The path to the FASTQ reads, that is, the path of the input files;
  4. The name of the directory where the output files must be placed;
  5. The GTF file;
  6. And, lastly, the DESeq script.
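The argument order above can be sketched as a small parser. This is an illustration only and not necessarily how rna-seq.py reads its arguments: the WorkflowArgs type, its field names and parse_args are hypothetical names introduced here.

```python
import sys
from typing import NamedTuple


class WorkflowArgs(NamedTuple):
    """Hypothetical container for the six positional arguments, in order."""
    genome_index: str
    threads: int
    input_dir: str
    output_dir: str
    gtf_file: str
    deseq_script: str


def parse_args(argv):
    """Map the arguments after the script name onto named fields."""
    if len(argv) != 6:
        raise SystemExit(
            "expected 6 arguments: <indexed genome> <threads> <input dir> "
            "<output dir> <GTF file> <DESeq script>"
        )
    genome, threads, inputs, outputs, gtf, deseq = argv
    # The thread count is the only numeric argument; int() rejects anything else.
    return WorkflowArgs(genome, int(threads), inputs, outputs, gtf, deseq)


if __name__ == "__main__":
    print(parse_args(sys.argv[1:]))
```

Naming the fields makes it harder to swap two paths by accident, which positional arguments otherwise allow silently.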

Make sure all the files needed to run the workflow are in the same directory, with the FASTQ files in a dedicated folder that serves as the input directory. The command line will look like this:

python3 rna-seq.py ../mm9/mm9 24 ../inputs/ ../outputs ../Mus_musculus.NCBIM37.67.gtf ../DESeq.R

Remember to adjust the multithreading and multicore parameter (the second argument, 24 in the example above) to your computational environment. For example, if your machine has 8 cores, set the parameter to 8.
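If you are unsure how many cores your machine has, Python can report it. The sketch below is an assumption about how one might choose the value, not part of the workflow: suggested_threads is a hypothetical helper that caps a requested thread count at the number of CPUs the operating system reports.

```python
import os


def suggested_threads(requested=None):
    """Return a thread count between 1 and the number of CPUs reported by the OS."""
    available = os.cpu_count() or 1  # os.cpu_count() may return None
    if requested is None:
        return available
    return max(1, min(requested, available))


if __name__ == "__main__":
    print("Suggested value for the thread parameter:", suggested_threads())
```

On a shared cluster node the reported CPU count can exceed what your job was allocated, so treat the result as an upper bound.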

Version History

master @ 22ad96e (earliest) Created 6th Dec 2022 at 19:17 by Kary Ocaña

Create Dockerfile


Citation
Cruz, L., Gadelha, L., & Ocaña, K. (2022). ParslRNA-seq Scientific Workflow. WorkflowHub. https://doi.org/10.48546/WORKFLOWHUB.WORKFLOW.411.1