HP2NET - Framework for Construction of Phylogenetic Networks on High Performance Computing (HPC) Environment
main @ 20ecbe3

Workflow Type: Python

Framework for construction of phylogenetic networks on High Performance Computing (HPC) environment

Introduction

Phylogeny refers to the evolutionary history and relationship between biological lineages related by common descent. Reticulate evolution refers to the origination of lineages through the complete or partial merging of ancestor lineages. Networks may be used to represent lineage independence events in non-treelike phylogenetic processes.

The methodology for reconstructing networks is still in development. Here we explore two methods for reconstructing rooted explicit phylogenetic networks, PhyloNetworks and PhyloNet, which employ computationally expensive and time-consuming algorithms. The construction of phylogenetic networks follows a coordinated processing flow in which data sets are analyzed and processed by a set of different programs, packages, libraries or pipelines, called workflow activities.

Given the complexity of modeling network experiments, the present work introduces a workflow for phylogenetic network analyses designed to run in High-Performance Computing (HPC) environments. The workflow aims to integrate well-established software, pipelines and scripts. This is a challenging task, since these tools do not consistently take advantage of the HPC environment, which increases the expected makespan and leaves computing resources idle.

Requirements

Python >= 3.8, with the packages:
1. Biopython >= 1.75
2. Pandas >= 1.3.2
3. Parsl >= 1.0
RAxML >= 8.2.12
ASTRAL >= 5.7.1
SNaQ (PhyloNetworks) >= 0.13.0
MrBayes >= 3.2.7a
BUCKy >= 1.4.4
Quartet MaxCut >= 2.10
PhyloNet >= 3.8.2
Julia >= 1.4.1
IQTREE >= 2.0

How to use

Setting up the framework

The framework reads all of its parameters from a settings file. By default it loads default.ini from the config folder, but you can load another file with the argument -s name_of_the_file, e.g. -s config/test.ini.

Edit parsl.env with the environment variables you may need, such as modules loaded in SLURM.
Edit work.config with the directories of your phylogeny studies (the framework receives as input a set of homologous gene alignments of species in the nexus format).
Edit default.ini with the path of each required software tool and the parameters of the execution provider.

By default, the execution logs are created in the runinfo folder. To change it, use the -r folder_path parameter.
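
As a rough illustration of the two command-line options described above, the sketch below parses -s and -r with argparse. Only the flags and their default values come from this document; the function name `build_parser` and the attribute names `settings` and `runinfo` are hypothetical and may not match what parsl_workflow.py actually does.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    # Hypothetical parser: only the -s and -r flags and their defaults
    # are taken from the README; everything else is illustrative.
    parser = argparse.ArgumentParser(description="HP2NET launcher (sketch)")
    parser.add_argument("-s", dest="settings", default="config/default.ini",
                        help="path to the settings file")
    parser.add_argument("-r", dest="runinfo", default="runinfo",
                        help="folder where the execution logs are written")
    return parser


# Simulate: python3 parsl_workflow.py -s config/test.ini
args = build_parser().parse_args(["-s", "config/test.ini"])
print(args.settings, args.runinfo)
```

When -r is omitted, the log folder falls back to runinfo, matching the default described above.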

Contents of the configuration file

General settings

[GENERAL]
ExecutionProvider = SLURM
ScriptDir 		= ./scripts
Environ			= config/parsl.env
Workload		= config/work.config
NetworkMethod   = MP
TreeMethod      = RAXML
BootStrap       = 1000

The framework can be executed in an HPC environment managed by Slurm by setting ExecutionProvider to SLURM, or locally by setting it to LOCAL.
ScriptDir is the path of the scripts folder. An absolute path is recommended to avoid errors.
Environ is the path of the file used to set environment variables. More details can be seen below.
Workload is the path of the file listing the experiments to be performed.
NetworkMethod and TreeMethod are the default network and tree methods used to perform the workloads' studies.
BootStrap is the bootstrap parameter passed to every tool that uses bootstrapping (RAxML, IQTREE and ASTRAL).
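
To make the role of each key concrete, here is a minimal sketch that reads a [GENERAL] section with Python's standard configparser module. It is not the framework's own loader: the function `read_general` and the keys of the returned dictionary are assumptions made for illustration, and only the section name, option names and the SLURM/LOCAL values come from this document.

```python
import configparser

# Sample settings copied from the [GENERAL] section shown above.
SAMPLE_INI = """
[GENERAL]
ExecutionProvider = SLURM
ScriptDir = ./scripts
Environ = config/parsl.env
Workload = config/work.config
NetworkMethod = MP
TreeMethod = RAXML
BootStrap = 1000
"""


def read_general(ini_text: str) -> dict:
    # Hypothetical helper; option lookups in configparser are
    # case-insensitive by default, so "BootStrap" is found as written.
    parser = configparser.ConfigParser()
    parser.read_string(ini_text)
    general = parser["GENERAL"]
    provider = general["ExecutionProvider"].upper()
    if provider not in ("SLURM", "LOCAL"):
        raise ValueError(f"unknown ExecutionProvider: {provider}")
    return {
        "provider": provider,
        "script_dir": general["ScriptDir"],
        "environ": general["Environ"],
        "workload": general["Workload"],
        "network_method": general["NetworkMethod"],
        "tree_method": general["TreeMethod"],
        "bootstrap": general.getint("BootStrap"),
    }


settings = read_general(SAMPLE_INI)
print(settings["provider"], settings["bootstrap"])
```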

Workflow execution settings

When using SLURM, these are the needed parameters:
```
[WORKFLOW]
Monitor			= False
PartCore	= 24
PartNode	= 1
Walltime	= 00:20:00
```
1. Monitor enables Parsl's monitoring module in an HPC environment. It can be True or False. To use it, set it to True and manually change the address in infra_manager.py.
2. When running in an HPC environment (using SLURM), the framework submits a job. PartCore is the number of cores per node; PartNode is the number of nodes of the partition; and Walltime is the maximum amount of time the job will be able to run.
However, if the desired execution method is the LocalProvider, i.e. the execution is performed on your own machine, only these parameters are necessary:
```
[WORKFLOW]
Monitor			= False
MaxCore	= 6
CoresPerWorker	= 1
```
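
Because the required keys differ between the two providers, a small pre-flight check can catch an incomplete [WORKFLOW] section before anything is submitted. The sketch below is an assumption about how such a check could look, not code from the framework; the names `REQUIRED_KEYS` and `missing_workflow_keys` are hypothetical, while the key lists are taken from the two examples above.

```python
import configparser

# Keys listed in this README for each execution provider.
REQUIRED_KEYS = {
    "SLURM": ("Monitor", "PartCore", "PartNode", "Walltime"),
    "LOCAL": ("Monitor", "MaxCore", "CoresPerWorker"),
}


def missing_workflow_keys(provider: str, ini_text: str) -> list:
    # Hypothetical helper: returns the [WORKFLOW] keys that are absent
    # for the given provider (an empty list means the section is complete).
    parser = configparser.ConfigParser()
    parser.read_string(ini_text)
    if not parser.has_section("WORKFLOW"):
        return list(REQUIRED_KEYS[provider])
    section = parser["WORKFLOW"]
    return [key for key in REQUIRED_KEYS[provider] if key not in section]


LOCAL_INI = "[WORKFLOW]\nMonitor = False\nMaxCore = 6\nCoresPerWorker = 1\n"
print(missing_workflow_keys("LOCAL", LOCAL_INI))
print(missing_workflow_keys("SLURM", LOCAL_INI))
```

Checking the local example against the SLURM key list reports PartCore, PartNode and Walltime as missing, which is the situation to expect when a local settings file is reused on a cluster.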

RAxML settings

[RAXML]
RaxmlExecutable = raxmlHPC-PTHREADS
RaxmlThreads 	= 6
RaxmlEvolutionaryModel = GTRGAMMA --HKY85

IQTREE settings

[IQTREE]
IqTreeExecutable = iqtree2
IqTreeEvolutionaryModel = TIM2+I+G 
IqTreeThreads = 6

ASTRAL settings

[ASTRAL]
AstralExecDir 	= /opt/astral/5.7.1
AstralJar 		= astral.jar

PhyloNet settings

[PHYLONET]
PhyloNetExecDir 	= /opt/phylonet/3.8.2/
PhyloNetJar 		= PhyloNet.jar
PhyloNetThreads     = 6
PhyloNetHMax        = 3
PhyloNetRuns        = 5

SNAQ settings

[SNAQ]
SnaqThreads		= 6
SnaqHMax        = 3
SnaqRuns        = 3

Mr. Bayes settings

[MRBAYES]
MBExecutable	= mb
MBParameters	= set usebeagle=no beagledevice=cpu beagleprecision=double; mcmcp ngen=100000 burninfrac=.25 samplefreq=50 printfreq=10000 diagnfreq=10000 nruns=2 nchains=2 temp=0.40 swapfreq=10

Bucky settings

[BUCKY]
BuckyExecutable = bucky
MbSumExecutable = mbsum

Quartet MaxCut

[QUARTETMAXCUT]
QmcExecDir       = /opt/quartet/
QmcExecutable    = find-cut-Linux-64

Workload file

By default the workload file is work.config in the config folder. The file contains the absolute paths of the experiments' folders.

/home/rafael.terra/Biocomp/data/Denv_1

You can comment out a folder by placing the # character at the beginning of its path, e.g. #/home/rafael.terra/Biocomp/data/Denv_1. That way the framework won't read this path.

You can also run a specific flow for a path by appending @TreeMethod|NetworkMethod to the end of the path, where TreeMethod can be RAXML, IQTREE or MRBAYES and NetworkMethod can be MPL or MP (case sensitive). The supported flows are: RAXML|MPL, RAXML|MP, IQTREE|MPL, IQTREE|MP and MRBAYES|MPL. For example:

/home/rafael.terra/Biocomp/data/Denv_1@RAXML|MPL
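
The workload rules described above (commented lines, the optional @TreeMethod|NetworkMethod suffix and the list of supported flows) can be sketched as a small parser. This is an illustration under stated assumptions rather than the framework's own reader: the function `parse_workload` is hypothetical, and falling back to the TreeMethod and NetworkMethod defaults from the [GENERAL] section is assumed from the description of those settings.

```python
# Flows listed as supported in this README.
SUPPORTED_FLOWS = {
    ("RAXML", "MPL"), ("RAXML", "MP"),
    ("IQTREE", "MPL"), ("IQTREE", "MP"),
    ("MRBAYES", "MPL"),
}


def parse_workload(text, default_tree="RAXML", default_network="MP"):
    # Hypothetical helper: returns a list of (path, tree, network) tuples.
    studies = []
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#"):
            continue  # blank or commented-out path
        if "@" in line:
            path, _, flow = line.rpartition("@")
            tree, bar, network = flow.partition("|")
            if not bar:
                raise ValueError(f"expected TreeMethod|NetworkMethod in: {line}")
        else:
            path, tree, network = line, default_tree, default_network
        # Method names are case sensitive, so no normalization is applied.
        if (tree, network) not in SUPPORTED_FLOWS:
            raise ValueError(f"unsupported flow {tree}|{network} for {path}")
        studies.append((path, tree, network))
    return studies


sample = "/data/Denv_1\n#/data/Denv_2\n/data/Denv_3@IQTREE|MPL\n"
for study in parse_workload(sample):
    print(study)
```

The paths in `sample` are placeholders. A line such as /data/Denv_4@MRBAYES|MP would raise a ValueError in this sketch, because that combination is not among the supported flows.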

Environment file

The environment file contains all the environment variables (like module files used in SLURM) used during the framework execution. Example:

module load python/3.8.2
module load raxml/8.2_openmpi-2.0_gnu
module load java/jdk-12
module load iqtree/2.1.1
module load bucky/1.4.4
module load mrbayes/3.2.7a-OpenMPI-4.0.4
source /scratch/app/modulos/julia-1.5.1.sh

Experiment folder

Each experiment folder needs to have an input folder containing a .tar.gz compressed file and a .json file with the following content. The framework assumes that there is only one file of each extension in the input folder.

{
	"Mapping":"",
	"Outgroup":""
}

Mapping is a direct mapping of the taxa, used when there are multiple alleles per species, in the format species1:taxon1,taxon2;species2:taxon3,taxon4 (whitespace is not supported), and Outgroup is the taxon used to root the network. The Mapping parameter is optional (although the key must still be present in the json file with an empty value), but Outgroup is required. Note that the flow MRBAYES|MPL doesn't support multiple alleles per species. Example:

{
  "Mapping": "dengue_virus_type_2:FJ850082,FJ850088,JX669479,JX669482,JX669488,KP188569;dengue_virus_type_3:FJ850079,FJ850094,JN697379,JX669494;dengue_virus_type_1:FJ850073,FJ850084,FJ850093,JX669465,JX669466,JX669475,KP188545,KP188547;dengue_virus_type_4:JN559740,JQ513337,JQ513341,JQ513343,JQ513344,JQ513345,KP188563,KP188564;Zika_virus:MH882543", 
  "Outgroup": "MH882543"
}
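
To show how the Mapping string decomposes, the sketch below parses the JSON content and splits the mapping into a dictionary from species to taxa. The helper names `parse_mapping` and `load_experiment` are hypothetical, and the error raised for a missing Outgroup is an assumption based on the statement that the outgroup is required; the framework may handle these cases differently.

```python
import json


def parse_mapping(mapping: str) -> dict:
    # Format from the README: species1:taxon1,taxon2;species2:taxon3,taxon4
    species_to_taxa = {}
    if not mapping:
        return species_to_taxa  # Mapping is optional and may be empty
    for entry in mapping.split(";"):
        species, _, taxa = entry.partition(":")
        species_to_taxa[species] = taxa.split(",")
    return species_to_taxa


def load_experiment(json_text: str):
    # Hypothetical helper: returns (species_to_taxa, outgroup).
    data = json.loads(json_text)
    outgroup = data.get("Outgroup", "")
    if not outgroup:
        raise ValueError("Outgroup is required")
    return parse_mapping(data.get("Mapping", "")), outgroup


example = '{"Mapping": "dengue_virus_type_3:FJ850079,FJ850094;Zika_virus:MH882543", "Outgroup": "MH882543"}'
mapping, outgroup = load_experiment(example)
print(mapping["dengue_virus_type_3"], outgroup)
```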

Running the framework

In a local machine:

After setting up the framework, just run python3 parsl_workflow.py.

In a SLURM environment:

Create a submission script that runs python3 parsl_workflow.py, for example:

#!/bin/bash
#SBATCH --time=15:00:00
#SBATCH -e slurm-%j.err
#SBATCH -o slurm-%j.out
module load python/3.9.6
cd /path/to/biocomp
python3 parsl_workflow.py

The framework is under heavy development. If you notice any bug, please create an issue here on GitHub.

Running in a DOCKER container

The framework is also available as a Docker image. It can be built from source or pulled from Docker Hub.

Building it from the source code

Adapt the default settings file config/default.ini according to your machine, setting the number of threads and bootstrap. After that, run docker build -t hp2net . in the project's root folder.

Downloading it from Docker Hub

The Docker image can also be downloaded from Docker Hub. To do that, just run the command docker pull rafaelstjf/hp2net:main

Running

The first step to run the framework is to set up your dataset. To test whether the framework runs without problems on your machine, you can use the example datasets.

Extracting the example_data.zip file creates a new folder called with_outgroup. This folder contains four datasets of DENV sequences.

The next step is to create the settings and workload files. For the settings file, download default.ini from this repository and change it to your liking (the paths of all software are already configured to run on Docker). The workload file is a text file containing the absolute paths of the datasets, each followed by the desired pipeline, as shown earlier in this document. Here, for example purposes, the file input.txt was created.

With all the files prepared, the framework can be executed from the example_data folder as follows:

docker run --rm -v $PWD:$PWD rafaelstjf/hp2net:main -s $PWD/default.ini -w $PWD/input.txt

Important: the Docker container doesn't save your logs. To keep them, add the parameter -r $PWD/name_of_your_log_folder.

If you are running it on the Santos Dumont supercomputer, both the download and the execution of the Docker container need to be performed from a submission script, submitted with sg docker -c "sbatch script.sh". The snippet below shows an example submission script.

#!/bin/bash
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=24
#SBATCH -p cpu_small
#SBATCH -J Hp2NET
#SBATCH --exclusive
#SBATCH --time=02:00:00
#SBATCH -e slurm-%j.err
#SBATCH -o slurm-%j.out

DIR='/scratch/pcmrnbio2/rafael.terra/WF_parsl/example_data'
docker  pull rafaelstjf/hp2net:main

docker run --rm -v $DIR:$DIR rafaelstjf/hp2net:main -s ${DIR}/sdumont.ini -w ${DIR}/entrada.txt -r ${DIR}/logs

If you use it, please cite

Terra, R., Coelho, M., Cruz, L., Garcia-Zapata, M., Gadelha, L., Osthoff, C., ... & Ocana, K. (2021, July). Gerência e Análises de Workflows aplicados a Redes Filogenéticas de Genomas de Dengue no Brasil. In Anais do XV Brazilian e-Science Workshop (pp. 49-56). SBC.

Also cite all the coupled software!

SEEK ID: https://workflowhub.eu/workflows/703?version=1

DOI: 10.48546/workflowhub.workflow.703.1

Version History

main @ 20ecbe3 (earliest) Created 9th Jan 2024 at 13:04 by Rafael Terra

Merge branch 'main' of https://github.com/rafaelstjf/biocomp into main

Frozen main 20ecbe3

Creators and Submitter

Creators

Rafael Terra
Diego Carvalho

Submitter

Rafael Terra

Citation

Terra, R., & Carvalho, D. (2024). HP2NET - Framework for Construction of Phylogenetic Networks on High Performance Computing (HPC) Environment. WorkflowHub. https://doi.org/10.48546/WORKFLOWHUB.WORKFLOW.703.1

License

Creative Commons Attribution 4.0 International (CC-BY-4.0)

Activity

Created: 9th Jan 2024 at 13:04

Last updated: 18th Jan 2024 at 17:50

Annotated Properties

Scientific disciplines

Biochemistry, Genetics and Molecular Biology
