RNA-seq Scientific Workflow
Workflow for RNA sequencing using the Parallel Scripting Library - Parsl.
Reference: Cruz, L., Coelho, M., Terra, R., Carvalho, D., Gadelha, L., Osthoff, C., & Ocaña, K. (2021). Workflows Científicos de RNA-Seq em Ambientes Distribuídos de Alto Desempenho: Otimização de Desempenho e Análises de Dados de Expressão Diferencial de Genes. In Anais do XV Brazilian e-Science Workshop, p. 57-64. Porto Alegre: SBC. DOI: https://doi.org/10.5753/bresci.2021.15789
Requirements
To use the RNA-seq workflow, the following tools must be available:
You can install Bowtie2 by downloading and unpacking the release archive:
bowtie2-2.3.5.1-linux-x86_64.zip
Or
sudo yum install bowtie2-2.3.5-linux-x86_64
Samtools is a suite of programs for interacting with high-throughput sequencing data.
Picard is a set of Java command line tools for manipulating high-throughput sequencing (HTS) data and formats.
HTSeq is a native Python library that follows the conventions of many Python packages. You can install it by running:
pip install HTSeq
HTSeq uses NumPy, Pysam and matplotlib. Make sure these tools are installed.
To use the DESeq2 script, make sure the R language is also installed. You can install it by running:
sudo apt install r-base
The recommended way to install Parsl is the approach suggested in Parsl's documentation:
python3 -m pip install parsl
To use Parsl, you need Python 3.5 or above. HTSeq also runs on Python, so load only one Python version and use it for both.
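A quick way to see which requirements are missing is a small check script. The sketch below is not part of this repository. It assumes the Python packages are importable as `parsl`, `HTSeq`, `numpy`, `pysam` and `matplotlib`, and that Bowtie2, Samtools and R are on the `PATH` as `bowtie2`, `samtools` and `Rscript`. Picard is not checked because how it is invoked depends on the installation. The function name `missing_requirements` is hypothetical.

```python
import importlib.util
import shutil
import sys

# Assumed module and executable names; adjust them to your installation.
PYTHON_MODULES = ["parsl", "HTSeq", "numpy", "pysam", "matplotlib"]
EXECUTABLES = ["bowtie2", "samtools", "Rscript"]


def missing_requirements(modules=PYTHON_MODULES, executables=EXECUTABLES):
    """Return (missing_modules, missing_executables) for this environment."""
    missing_modules = [m for m in modules if importlib.util.find_spec(m) is None]
    missing_executables = [e for e in executables if shutil.which(e) is None]
    return missing_modules, missing_executables


mods, exes = missing_requirements()
print("Python version:", sys.version.split()[0])
print("Python >= 3.5:", sys.version_info >= (3, 5))
print("Missing Python modules:", mods or "none")
print("Missing executables:", exes or "none")
```

Running it with the same interpreter that will run the workflow also confirms that Parsl and HTSeq are installed for the same Python version.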
Workflow invocation
First, create a Comma Separated Values (CSV) file. On the first line, type: sampleName,fileName,condition. Remember, there must be no spaces between items. You can use the file "table.csv" in this repository as an example. Your CSV file will look like this:
sampleName | fileName | condition |
---|---|---|
tissue control 1 | SRR5445794.merge.count | control |
tissue control 2 | SRR5445795.merge.count | control |
tissue control 3 | SRR5445796.merge.count | control |
tissue wntup 1 | SRR5445797.merge.count | wntup |
tissue wntup 2 | SRR5445798.merge.count | wntup |
tissue wntup 3 | SRR5445799.merge.count | wntup |
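If you prefer to generate the file rather than type it, the sketch below writes the same rows with Python's standard `csv` module, which puts no spaces around the commas. The helper name `write_sample_table` and the `SAMPLES` list are illustrative and not part of this repository.

```python
import csv

# Rows mirror the example table: (sampleName, fileName, condition).
SAMPLES = [
    ("tissue control 1", "SRR5445794.merge.count", "control"),
    ("tissue control 2", "SRR5445795.merge.count", "control"),
    ("tissue control 3", "SRR5445796.merge.count", "control"),
    ("tissue wntup 1", "SRR5445797.merge.count", "wntup"),
    ("tissue wntup 2", "SRR5445798.merge.count", "wntup"),
    ("tissue wntup 3", "SRR5445799.merge.count", "wntup"),
]


def write_sample_table(path, samples):
    """Write the header line and one comma-separated row per sample."""
    with open(path, "w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(["sampleName", "fileName", "condition"])
        writer.writerows(samples)
```

For example, `write_sample_table("my_samples.csv", SAMPLES)` creates a file whose first line is exactly `sampleName,fileName,condition`.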
The command line arguments passed to the Python script, after the script's name, must be, in order:
- The indexed genome;
- The number of threads for the bowtie task and the sort task, which is also the number of split files for the split_picard task and the number of CPUs used by the htseq task;
- The path to the FASTQ files, that is, the directory containing the input files;
- The name of the directory where the output files must be placed;
- The GTF file;
- and, lastly, the DESeq script.
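The argument order above can be pictured as six positional parameters. The sketch below uses `argparse` to show that order. It is only an illustration: the workflow script may read its arguments differently, and every parameter name here (`genome_index`, `threads`, and so on) is hypothetical.

```python
import argparse


def build_parser():
    """Build a parser with the six positional arguments, in the listed order."""
    parser = argparse.ArgumentParser(description="RNA-seq workflow arguments (sketch)")
    parser.add_argument("genome_index", help="the indexed genome")
    parser.add_argument("threads", type=int,
                        help="threads, split files and CPUs used by the tasks")
    parser.add_argument("input_dir", help="directory containing the input files")
    parser.add_argument("output_dir", help="directory for the output files")
    parser.add_argument("gtf", help="GTF file")
    parser.add_argument("deseq_script", help="DESeq script")
    return parser


# Parse an explicit example list instead of reading sys.argv.
args = build_parser().parse_args([
    "../mm9/mm9", "24", "../inputs/", "../outputs",
    "../Mus_musculus.NCBIM37.67.gtf", "../DESeq.R",
])
print(args.genome_index, args.threads, args.output_dir)
```

Declaring `threads` with `type=int` means a non-numeric value is rejected before any task starts.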
Make sure all the files needed to run the workflow are in the same directory, with the FASTQ files in a dedicated folder that serves as the input directory. The command line will look like this:
python3 rna-seq.py ../mm9/mm9 24 ../inputs/ ../outputs ../Mus_musculus.NCBIM37.67.gtf ../DESeq.R
Remember to adjust the multithreading and multicore parameter according to your computational environment. For example, if your machine has 8 cores, set the parameter to 8.
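To find a reasonable value for this parameter, you can ask Python how many CPUs it sees. This is a minimal sketch: the function name `suggested_threads` and its `reserve` parameter are hypothetical. On a shared cluster the reported number may be the whole node rather than the cores allocated to your job, so treat the result as an upper bound.

```python
import os


def suggested_threads(reserve=0):
    """Return a thread count from the CPUs visible to Python, never below 1."""
    available = os.cpu_count() or 1  # os.cpu_count() can return None
    return max(1, available - reserve)


print("Suggested value for the thread parameter:", suggested_threads())
```

Passing `reserve=1` leaves one core free for the operating system and the Parsl process itself.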
Version History
master @ 22ad96e (earliest) Created 6th Dec 2022 at 19:17 by Kary Ocaña
Create Dockerfile