plant2human workflow
Introduction
This analysis workflow is centered on foldseek, which enables fast structural similarity searches and supports the discovery of understudied genes by comparing plants, which are distantly related species, with humans, for which there is a wealth of information. Based on the list of genes you are interested in, you can easily create a scatter plot of “structural similarity vs. sequence similarity” by retrieving structural data from the AlphaFold protein structure database.
Report
- 2025-02-02: fix
foldseek easy-search
command process - 2025-09-02: update
makeblstadb
command process
Implementation Background
In recent years, with the AlphaFold protein structure database, it has become easier to obtain protein structure prediction data and perform structural similarity searches even for plant species such as rice. Against this background, searching for hits with “low sequence similarity and high structural similarity” for the gene groups being focused on has become possible. This approach may allow us to discover proteins that are conserved in distantly related species and to consider the characteristics of these proteins based on the wealth of knowledge we have about humans.
Analysis Environment
1. Using Dev Containers (Docker and VScode extension)
Most processes, such as Foldseek, use container (BioContainers), but some involve processing with jupyter notebook, which requires the preparation of some python libraries (e.g., polars.). If you want to experiment with a simple workflow, you can create an analysis environment easily using Dev Containers system, a VScode extension. Using this environment, the version of the python library is probably the one used during development, so errors are unlikely to occur (see Dockerfile for the package version).
Please check the official website for Dev Container details.
2. Executing with cwltool
This analysis workflow is tested using cwltool version 3.1.20250110105449.
3. Python Environment
I've already checked python package version below. Using “Devcontainers” makes it easy to reproduce your execution environment.
python==3.11
polars==1.17.1
matplotlib==3.8.2
seaborn==0.13.2
unipressed==1.4.0
papermill==2.6.0
Example 1 ( Oryza sativa 100 genes vs Homo sapiens)
Here, we will explain how to use the list of 10 rice genes as an example.
1. Creation of a TSV file of gene and UniProt ID correspondences
First, you need the following gene list tsv file. (Please set the column name as "From")
From
Os12g0269700
Os10g0410900
Os05g0403000
Os06g0127250
Os02g0249600
Os09g0349700
Os03g0735150
Os08g0547350
Os06g0282400
Os05g0576750
Os07g0216600
Os10g0164500
Os07g0201300
Os01g0567200
Os05g0563050
Os03g0660050
Os11g0436450
...
The following TSV file is required to execute the following workflow.
From UniProt Accession
Os01g0104800 A0A0N7KC66
Os01g0104800 Q657Z6
Os01g0104800 Q658C6
Os01g0152300 Q9LGI2
Os01g0322300 A0A9K3Y6N1
Os01g0322300 Q657N1
Os01g0567200 A0A0N7KD66
Os01g0567200 Q657K0
Os01g0571133 A0A0P0V4A8
Os01g0664500 A0A8J8XFG3
Os01g0664500 Q5SN58
Os01g0810800 A0A8J8XDQ1
Os01g0810800 B7FAC9
Os01g0875300 A0A0P0VB72
Os01g0924300 A0A0P0VCB7
...
To do this, you need to follow the CWL workflow command below. This yaml file is the parameter file for the workflow, for example.
cwltool --debug --outdir ./test/oryza_sativa_test_100genes_202509/ ./Tools/01_uniprot_idmapping.cwl ./job/oryza_sativa_100_genes/os_uniprot_idmapping.yml
In this execution, mmcif files are also retrieved. The execution results are output with the jupyter notebook.
Note: Network access required in this process!
2. Creating and Preparing Indexes
I'm sorry, but the main workflow does not currently include the creation of an index process (both for foldseek index and BLAST index). Please perform the following processes in advance.
2-1. Creating a Foldseek Index for structural alignment
In this workflow, the target of the structural similarity search is specified as the AlphaFold database to perform comparisons across a broader range of species.
Index creation using the foldseek databases
command is through the following command.
Please select the database you want to use from Alphafold/UniProt,
Alphafold/UniProt50-minimal
, Alphafold/UniProt50
, Alphafold/Proteome,
Alphafold/Swiss-Prot.
# Supported databases in this workflow
Alphafold/UniProt
Alphafold/UniProt50-minimal
Alphafold/UniProt50
Alphafold/Proteome
Alphafold/Swiss-Prot
You can check the details of this database using the following command.
docker run --rm quay.io/biocontainers/foldseek:9.427df8a--pl5321h5021889_2 foldseek databases --help
For example, if you want to specify AlphaFold/Swiss-Prot as the index, you can do so with the following CWL file;
# execute creation of foldseek index using "foldseek databases"
cwltool --debug --outdir ./index/ ./Tools/02_foldseek_database.cwl \
--database Alphafold/Swiss-Prot \
--index_dir_name index_swissprot \
--index_name swissprot \
--threads 16
Note:: urrently, this index creation process is not incorporated into the main analysis workflow (plant2human workflow). This is because the process requires network access, and maybe to be able to respond immediately if there are any changes to the tool requirements.
2-2. Creating a BLAST Index for sequence alignment
An index FASTA file must be downloaded to obtain the amino acid sequence using the blastdbcmd
command from the AlphaFold Protein Structure Database. This workflow uses the version of the protein sequence that was used for structure prediction.
Download URL: https://ftp.ebi.ac.uk/pub/databases/alphafold/sequences.fasta
Note:: This FASTA file is extremely large (92GB), so it's probably best to delete it after creating the index.
# Preparation for BLAST index
mkdir cwl_cache
cd ./index
curl -O https://ftp.ebi.ac.uk/pub/databases/alphafold/sequences.fasta
mv sequences.fasta afdb_all_sequneces.fasta
# execute creation of BLAST index using "makeblastdb"
mv ../
cwltool --debug --cachedir ./cwl_cache/ --outdir ./index/ ./Tools/14_makeblastdb_v2.cwl
Note:: It is estimated to take 2 hours for creating index. We are currently investigating whether it can be executed by another method.
Main Workflow
3. Execution of theIn this process, we perform a structural similarity search using the foldseek easy-search
command and then perform a pairwise alignment of the amino acid sequences of the hit pairs using the needle
and water
commands.
Finally, based on this information, we create a scatter plot and output a jupyter notebook as a report.
Examples of commands are as follows.
cwltool --debug --outdir ./test/oryza_sativa_test_100genes_202509/ ./Workflow/plant2human_v2.cwl ./job/oryza_sativa_100_genes/plant2human_job_example_os100.yml
For example, you can visualize the results of structural similarity and global alignment, as shown below. In this case, the x-axis represents the global alignment similarity match (%), and the y-axis represents the average lDDT score (an indicator of structural alignment).
The hit pairs in the upper-right plot indicate higher sequence similarity and structural similarity.
In this case, the x-axis represents the local alignment similarity match (%), and the y-axis represents the average lDDT score (an indicator of structural alignment).
After Filtering
The report notebook for the plant2human workflow also outputs scatter plots after applying the filtering conditions set in this workflow.
Filtering criteria
-
structural alignment coverage >= 50%
-
If there are hits with the same target for the same gene-derived UniProt ID, the one with the highest qcov is selected, and if the qcov is the same, the one with the highest lDDT is selected.
Note that in this study, we leave the states with the same foldseek hit even if the rice genes are different.
-
Select hits that can be converted to Ensembl gene id and HGNC Gene nomenclature with TogoID API
By applying these filtering conditions, you can examine hit pairs that are easier to investigate!
Global alignment
local alignment
Click and drag the diagram to pan, double click or use the controls to zoom.
Inputs
ID | Name | Description | Type |
---|---|---|---|
INPUT_DIRECTORY | input protein structure file directory | query protein structure file (default: mmCIF) directory for foldseek easy-search input. |
|
FILE_MATCH_PATTERN | file match pattern | file match pattern for listing input files. default: *.cif |
|
FOLDSEEK_INDEX | foldseek index files | "foldseek index files for foldseek easy-search input. default: ../index/index_swissprot/swissprot Note: At this time (2025/02/02), the process of acquiring and indexing index files for execution has not been incorporated into the workflow. Therefore, we would like you to execute the following commands in advance. example: `foldseek databases Alphafold/Swiss-Prot index_swissprot/swissprot tmp --threads 8` " |
|
OUTPUT_FILE_NAME1 | output file name (foldseek easy-search) | output file name for foldseek easy-search result. Currently, this workflow only supports TSV file output. |
|
EVALUE | e-value (foldseek easy-search) | e-value threshold for foldseek easy-search. workflowdefault: 0.1 |
|
ALIGNMENT_TYPE | alignment type (foldseek easy-search) | alignment type for foldseek easy-search. default: 2 (3Di + AA: local alignment) for detailed information, see foldseek GitHub repository. |
|
THREADS | threads (foldseek easy-search) | threads for foldseek easy-search. default: 16 |
|
SPLIT_MEMORY_LIMIT | split memory limit (foldseek easy-search) | split memory limit for foldseek easy-search. default: 120G |
|
TAXONOMY_ID_LIST | taxonomy id list (foldseek easy-search) | taxonomy id list. separated by comma. Be sure to set “9606”. default: 9606 (human), 10090 (mouse), 3702 (Arabidopsis), 4577 (Zea mays), 4529 (Oryza rufipogon) |
|
OUTPUT_FILE_NAME2 | output file name (extract target species) | output file name for extract target species (default: human) python script. |
|
WF_COLUMN_NUMBER_QUERY_SPECIES | column number of query species | column number of query species. default: 1 (UniProt ID list) |
|
OUTPUT_FILE_NAME_QUERY_SPECIES | output file name (extract query species column) | output file name for extract query species column python script. default: foldseek_result_query_species.txt |
|
WF_COLUMN_NUMBER_HIT_SPECIES | column number of hit species | column number of hit species. default: 2 (UniProt ID list) |
|
OUTPUT_FILE_NAME_HIT_SPECIES | output file name (extract hit species column) | output file name for extract hit species column python script. default: foldseek_result_hit_species.txt |
|
SW_INPUT_FASTA_FILE_QUERY_SPECIES | input fasta file (blastdbcmd process) | input fasta file for blastdbcmd. Retrieve files in advance. default: rice UniProt FASTA file |
|
SW_INPUT_FASTA_FILE_HIT_SPECIES | input fasta file (blastdbcmd process) | input fasta file for blastdbcmd. Retrieve files in advance. default: human UniProt FASTA file |
|
ROUTE_DATASET | route dataset (ID conversion using togoID) | route dataset for ID conversion. This operation selects the UniProt ID of the target species (human) for which cross-references exist (final destination is HGNC gene symbol). default: uniprot,ensembl_protein,ensembl_transcript,ensembl_gene,hgnc,hgnc_symbol |
|
OUTPUT_FILE_NAME3 | output file name (ID conversion using togoID) | output file name for ID conversion. default: foldseek_hit_species_togoid_convert.tsv |
|
OUT_NOTEBOOK_NAME | output notebook name (papermill process) | output notebook name for papermill. After the analysis workflow is output, it can be freely customized such as changing the parameter values. default: plant2human_report.ipynb |
|
QUERY_IDMAPPING_TSV | query idmapping tsv (papermill process) | query idmapping tsv file. Retrieve files in advance. default: rice UniProt ID mapping file |
|
QUERY_GENE_LIST_TSV | query gene list tsv (papermill process) | query gene list tsv file. Retrieve files in advance. default: rice random gene list |
|
Steps
ID | Name | Description |
---|---|---|
sub_workflow_foldseek_easy_search | foldseek easy-search sub-workflow process | "Execute foldseek easy-search using foldseek using BioContainers docker image. This workflow supports only TSV file output. Step 1: listing files Step 2: foldseek easy-search process" |
extract_target_species | extract target species (human) process | Extract target species (human) from foldseek easy-search result. execute: ../Tools/12_extract_target_species.cwl |
extract_query_species_column | extract query species column process | Extract query species column (UniProt ID list) from foldseek easy-search result. execute: ../Tools/13_extract_id.cwl |
extract_hit_species_column | extract hit species column process | Extract hit species column (UniProt ID list) from foldseek easy-search result. execute: ../Tools/13_extract_id.cwl |
sub_workflow_retrieve_sequence_query_species | retrieve sequence sub-workflow process using EMBOSS package | "Perform pairwise alignment of protein sequences for pairs identified by structural similarity search. Step 1: retrieve sequence from blastdbcmd result Step 2: makeblastdb: ../Tools/14_makeblastdb.cwl Step 3: blastdbcmd: ../Tools/15_blastdbcmd.cwl Step 4: seqretsplit: ../Tools/16_seqretsplit.cwl Step 5: needle (Global alignment): ../Tools/17_needle.cwl Step 6: water (Local alignment): ../Tools/17_water.cwl" |
togoid_convert | togoid convert process | retrieve UniProt ID to HGNC gene symbol using togoID python script. execute: ../Tools/18_togoid_convert.cwl |
papermill | papermill process | output notebook using papermill. This process allows you to create a scatter plot of structural similarity vs. sequence similarity. execute: ../Tools/19_papermill.cwl |
Outputs
ID | Name | Description | Type |
---|---|---|---|
TSVFILE1 | output file (foldseek easy-search result) | output file for foldseek easy-search all hit result. |
|
TSVFILE2 | output file (extract target species) | extract target species foldseek result file. (in this workflow, human result only) |
|
IDLIST1 | output file (extract query species column) | extract query species column UniProt ID list file. |
|
IDLIST2 | output file (extract hit species column) | extract hit species column UniProt ID list file. |
|
INDEX_DIR1 | index directory (query species) | index directory for query species. |
|
INDEX_FILES1 | index files (query species) | index files for query species. |
|
INDEX_DIR2 | index directory (hit species) | index directory for hit species. |
|
INDEX_FILES2 | index files (hit species) | index files for hit species. |
|
BLASTDBCMD_RESULT1 | blastdbcmd result (query species) | blastdbcmd result file for query species. |
|
BLASTDBCMD_RESULT2 | blastdbcmd result (hit species) | blastdbcmd result file for hit species. |
|
LOGFILE1 | logfile (blastdbcmd query species) | logfile for blastdbcmd query species. |
|
LOGFILE2 | logfile (blastdbcmd hit species) | logfile for blastdbcmd hit species. |
|
DIR1 | directory (seqretsplit query species) | directory for seqretsplit query species. |
|
FASTA_FILES1 | split fasta files (seqretsplit query species) | split fasta files using seqretsplit for pairwise sequence alignment. |
|
DIR2 | directory (seqretsplit hit species) | directory for seqretsplit hit species. |
|
FASTA_FILES2 | split fasta files (seqretsplit hit species) | split fasta files using seqretsplit for pairwise sequence alignment. |
|
DIR3 | needle result directory | needle (global alignment) result directory. |
|
NEEDLE_RESULT_FILE | needle result file (.needle) | needle (global alignment) result files. suffix is .needle. |
|
DIR4 | water result directory | water (local alignment) result directory. |
|
WATER_RESULT_FILE | water result file (.water) | water (local alignment) result files. suffix is .water. |
|
TSVFILE3 | output file (togoid convert) | output file for togoid convert. |
|
REPORT_NOTEBOOK | output notebook (papermill) | output notebook using papermill. notebook name is `plant2human_report.ipynb`. |
|
Version History
main @ 10d8268 (latest) Created 3rd Sep 2025 at 21:20 by Sora Yonezawa
update oryza sativa 100genes test
Frozen
main
10d8268
main @ 6911e7a Created 13th Jul 2025 at 05:00 by Sora Yonezawa
update foldseek database process
Frozen
main
6911e7a
main @ 76a6471 Created 20th Nov 2024 at 05:29 by Sora Yonezawa
update README & add description
Frozen
main
76a6471
main @ 1aa2763 Created 16th Nov 2024 at 10:34 by Sora Yonezawa
main workflow changed
Frozen
main
1aa2763
main @ b8c0b1d (earliest) Created 16th Nov 2024 at 04:56 by Sora Yonezawa
UPDATE README & zey mays test files
Frozen
main
b8c0b1d

Creator
Submitter
Views: 4372 Downloads: 1303
Created: 16th Nov 2024 at 04:56
Last updated: 3rd Sep 2025 at 21:20

None