CroMaSt: A workflow for assessing protein domain classification by cross-mapping of structural instances between domain databases and structural alignment

Workflow Type: Common Workflow Language

CroMaSt: A workflow for domain family curation through cross-mapping of structural instances between protein domain databases

CroMaSt (Cross-Mapper of domain Structural instances) is an automated iterative workflow that clarifies domain definitions by cross-mapping domain structural instances between domain databases. It classifies all structural instances of a given domain type into three categories: core, true, and domain-like.

Requirements

  1. Conda or Miniconda
  2. Kpax

Download and install Conda (or Miniconda) and Kpax by following the instructions on their official sites.

Get it running

(Assuming the requirements are already met)

  1. Clone the repository and change into its directory
git clone https://gitlab.inria.fr/capsid.public_codes/CroMaSt.git
cd CroMaSt
  2. Create the conda environment for the workflow
conda env create --file yml/environment.yml
conda activate CroMaSt
  3. Change the paths of the variables in the parameter file
sed -i 's/\/home\/hdhondge\/CroMaSt\//\/YOUR\/PATH\/TO_CroMaSt\//g' yml/CroMaSt_input.yml
  4. Create the directories to store files from PDB and SIFTS (if they do not already exist)
mkdir PDB_files SIFTS
  5. Download the source input data
cwl-runner Tools/download_data.cwl yml/download_data.yml
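
For readers who prefer not to hand-edit the sed expression, the path substitution can be sketched in Python. This is only an illustration: the prefix /home/hdhondge/CroMaSt/ is taken from the sed command above, while the function name repoint_paths is hypothetical and not part of CroMaSt.

```python
from pathlib import Path

# Prefix hard-coded in the shipped parameter file (taken from the sed command).
OLD_PREFIX = "/home/hdhondge/CroMaSt/"


def repoint_paths(param_text: str, repo_dir: str) -> str:
    """Replace the original prefix with the path of the local checkout."""
    new_prefix = repo_dir.rstrip("/") + "/"
    return param_text.replace(OLD_PREFIX, new_prefix)


if __name__ == "__main__":
    # Only touches the file when run from inside a CroMaSt checkout.
    param_file = Path("yml/CroMaSt_input.yml")
    if param_file.exists():
        param_file.write_text(repoint_paths(param_file.read_text(), str(Path.cwd())))
```

Run from the repository root, this should have the same effect as the sed command with YOUR/PATH/TO_CroMaSt set to the current directory.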

Basic example

1. First, we will run the workflow for the RRM domain with family identifiers RRM_1 and RRM in Pfam and CATH, respectively.

Run the workflow:

cwl-runner --parallel  --outdir=Results/  CroMaSt.cwl yml/CroMaSt_input.yml

2. Once the iteration is complete, check the new_param.yml file in the output directory (Results). If it lists any family identifier under either pfam or cath, run the next iteration with the following command, and repeat until the workflow explores no new families:

cwl-runner --parallel  --outdir=Results/  CroMaSt.cwl Results/new_param.yml
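
The stopping check described in step 2 can be sketched as follows. This is a minimal illustration, assuming that new_param.yml keeps each identifier list on a single line in the same form as the pfam and cath excerpts in this README; the function names are hypothetical, and a full YAML parser would be more robust.

```python
import ast
import re


def pending_families(param_text: str) -> dict:
    """Return the pfam and cath identifier lists found in a parameter file.

    Assumes each list sits on one line as `key: ['ID1', 'ID2']`;
    anything else is treated as an empty list.
    """
    found = {"pfam": [], "cath": []}
    for key in found:
        match = re.search(rf"^{key}:\s*(\[.*\])\s*$", param_text, flags=re.MULTILINE)
        if match:
            try:
                found[key] = [str(item) for item in ast.literal_eval(match.group(1))]
            except (ValueError, SyntaxError):
                found[key] = []
    return found


def needs_another_iteration(param_text: str) -> bool:
    """True when at least one family identifier is left to explore."""
    families = pending_families(param_text)
    return bool(families["pfam"] or families["cath"])
```

Calling needs_another_iteration on the contents of Results/new_param.yml then tells you whether to launch the command above again.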

Extra: Start the workflow with multiple families from one or both databases

If you would like to start the workflow with multiple families from one or both databases, separate the family identifiers with commas.

pfam: ['PF00076', 'PF08777']
cath: ['3.30.70.330']
  • Pro Tip: Don't forget to give a different path to the --outdir option when running the workflow multiple times, or at least move the results to another location after the first run.
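
One way to follow this tip is to derive a separate output directory from the iteration number. The sketch below only assembles the command line; the cwl-runner options are the ones used in this README, whereas the Results_iterN naming scheme and both function names are assumptions made for illustration.

```python
import shlex


def cromast_command(param_file: str, iteration: int, base_outdir: str = "Results") -> list:
    """Build the cwl-runner call with a separate output directory per iteration."""
    outdir = f"{base_outdir}_iter{iteration}/"
    return ["cwl-runner", "--parallel", f"--outdir={outdir}", "CroMaSt.cwl", param_file]


def next_param_file(iteration: int, base_outdir: str = "Results") -> str:
    """Path where the given iteration is assumed to have written new_param.yml."""
    return f"{base_outdir}_iter{iteration}/new_param.yml"


# Print the commands for the first two iterations instead of running them.
print(shlex.join(cromast_command("yml/CroMaSt_input.yml", 1)))
print(shlex.join(cromast_command(next_param_file(1), 2)))
```

The returned list can be passed to subprocess.run as is. Whether the generated parameter file needs further path adjustments when the output directory changes is not covered by this sketch.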

Run the workflow for protein domain of your choice

You can run the workflow for the domain of your choice by replacing the following family identifiers (for pfam and cath) in the yml/CroMaSt_input.yml file with the family identifiers of your choice.

pfam: ['PF00076']
cath: ['3.30.70.330']
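
If you switch domains often, the replacement can be scripted. The following is a sketch under the assumption that the parameter file contains exactly one pfam: line and one cath: line in the single-line form shown above; set_families is a hypothetical helper, not part of the workflow.

```python
import re


def set_families(param_text: str, pfam: list, cath: list) -> str:
    """Rewrite the `pfam:` and `cath:` lines of a parameter file."""
    for key, ids in (("pfam", pfam), ("cath", cath)):
        new_line = f"{key}: {list(ids)!r}"
        # A function replacement avoids any backslash interpretation in new_line.
        param_text, count = re.subn(
            rf"^{key}:.*$", lambda _m, line=new_line: line, param_text, flags=re.MULTILINE
        )
        if count != 1:
            raise ValueError(f"expected exactly one '{key}:' line, found {count}")
    return param_text
```

For example, set_families(text, ['PF00076', 'PF08777'], ['3.30.70.330']) produces the multi-family configuration from the previous section while leaving every other line untouched.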

Data files used in the current version are as follows:

Files in the Data directory can be downloaded as follows:

  1. File used from the Pfam database: pdbmap.gz

  2. File used from the CATH database: cath-domain-description-file.txt

  3. Obsolete entries from RCSB PDB: obsolete_PDB_entry_ids.txt

CATH Version - 4.3.0 (Ver_Date - 11-Sep-2019) FTP site
Pfam Version - 33.0 (Ver_Date - 18-Mar-2020) FTP site

Reference

Poster - 
1. Hrishikesh Dhondge, Isaure Chauvot de Beauchêne, Marie-Dominique Devignes. CroMaSt: A workflow for domain family curation through cross-mapping of structural instances between protein domain databases. 21st European Conference on Computational Biology, Sep 2022, Sitges, Spain. ⟨hal-03789541⟩

Acknowledgements

This project has received funding from the Marie Skłodowska-Curie Innovative Training Network (MSCA-ITN) RNAct, supported by the European Union’s Horizon 2020 research and innovation programme under grant agreement No 813239.


Inputs

ID | Name | Description | Type
pfam | Pfam family ids | n/a | string[]?
cath | CATH family ids | n/a | string[]?
iteration | Iteration number | n/a | int
filename | Filename to store family ids per iteration | n/a | File, string
true_domain_file | To store all the true domain StIs | n/a | File, string
siftsDir | Directory for storing all SIFTS files | n/a | Directory
paramfile | Parameter file for current iteration | n/a | File
db_for_core | Database to select to compute core average structure | n/a | string
core_domain_struct | Core domain structure (.pdb) | n/a | File, string
prev_crossMapped_pfam | Pfam cross-mapped domain StIs from previous iteration | n/a | File
prev_crossMapped_cath | CATH cross-mapped domain StIs from previous iteration | n/a | File
unmapped_analysis_file | Filename with alignment scores for unmapped instances | n/a | string
pdbDir | The directory for storing all PDB files | n/a | Directory
cath_resmap | Filename for residue-mapped CATH domain StIs | n/a | string
cath_lost | Obsolete and inconsistent CATH domain StIs | n/a | string
pfam_resmap | Filename for residue-mapped Pfam domain StIs | n/a | string
pfam_lost | Obsolete and inconsistent Pfam domain StIs | n/a | string
domain_like | To store all the domain-like StIs | n/a | File, string
failed_domain | To store all failed domain StIs | n/a | File, string
min_domain_length | Threshold for minimum domain length | n/a | int
alignment_score | Alignment score from Kpax to analyse structures | n/a | string
score_threshold | Score threshold for given alignment score from Kpax | n/a | float
unmap_pfam_pass | Filename to store unmapped but structurally well aligned instances from Pfam | n/a | string
unmap_pfam_fail | Filename to store unmapped and not properly aligned instances from Pfam | n/a | string
unmap_cath_pass | Filename to store unmapped but structurally well aligned instances from CATH | n/a | string
unmap_cath_fail | Filename to store unmapped and not properly aligned instances from CATH | n/a | string
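
Before launching a run, it can help to sanity-check a few of these inputs once the parameter file has been loaded into a dictionary. The sketch below covers only a subset of the inputs and maps the Type column onto plain Python types; check_params and the EXPECTED_TYPES table are illustrative assumptions, and cwl-runner performs its own validation against the CWL schema.

```python
# Accepted Python types for a subset of the workflow inputs (see the Type column).
EXPECTED_TYPES = {
    "pfam": (list,),
    "cath": (list,),
    "iteration": (int,),
    "min_domain_length": (int,),
    "alignment_score": (str,),
    "score_threshold": (int, float),  # assumption: an integer is fine where a float is expected
    "db_for_core": (str,),
}
OPTIONAL = {"pfam", "cath"}  # declared as string[]?, so they may be absent or null


def check_params(params: dict) -> list:
    """Return a list of human-readable problems; empty when all checks pass."""
    problems = []
    for name, accepted in EXPECTED_TYPES.items():
        value = params.get(name)
        if value is None:
            if name not in OPTIONAL:
                problems.append(f"missing required input: {name}")
            continue
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, accepted):
            problems.append(f"{name}: unexpected type {type(value).__name__}")
        elif isinstance(value, list) and not all(isinstance(item, str) for item in value):
            problems.append(f"{name}: every family id must be a string")
    return problems
```

An empty result means the checked inputs look plausible; it says nothing about the File and Directory inputs, which this sketch does not inspect.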

Steps

ID | Name | Description
get_family_ids | Get domain family ids | Get the CATH and Pfam domain family ids from the parameter file provided by the user.
pfam_domain_instances | Produce a list of residue-mapped domain StIs from Pfam ids | Retrieve and process the PDB structures corresponding to the Pfam family ids, resulting in a list of residue-mapped structural domain instances along with lost structural instances (requires Data/pdbmap downloaded from Pfam and uses the SIFTS resource for UniProt to PDB residue mapping).
cath_domain_instances | Produce a list of residue-mapped domain StIs from CATH ids | Retrieve and process the PDB structures corresponding to the CATH superfamily ids, resulting in a list of residue-mapped structural domain instances along with lost structural instances (requires Data/cath_domain_description_file.txt downloaded from CATH and uses the SIFTS resource for PDB to UniProt residue mapping).
add_crossmapped_to_resmapped | Add cross-mapped to residue-mapped domain StIs | Add cross-mapped domain instances from the last iteration to the current list of residue-mapped domain instances.
compare_instances_CATH_Pfam | Compare residue-mapped domain StIs | Find the intersection between the residue-mapped domain StIs of the Pfam and CATH lists. Allows domain boundaries to vary within a range of +/- 30 aa. Produces three files: common domain instances, domain instances unique to Pfam, and domain instances unique to CATH.
crossmapping_Pfam2CATH | Map unique Pfam domain StIs to CATH db | Maps the unique domain StIs from Pfam to the whole CATH database (using residue numbering from PDB, allowing variable domain boundaries of +/- 30 aa).
crossmapping_CATH2Pfam | Map unique CATH domain StIs to Pfam db | Maps the unique domain StIs from CATH to the whole Pfam database (using residue numbering from UniProt, allowing variable domain boundaries of +/- 30 aa).
format_core_list | Format core domain StIs list | Format the core domain instances list from the common instances list identified at the first iteration; prepares the input for average structure computation.
chop_and_avg_for_core | Compute average of average for core domain instances | Compute the average structure for all averaged structures corresponding to core UniProt domain instances. First computes the average per UniProt domain instance and then averages all averaged structures.
chop_and_avg_for_CATH2Pfam | Compute average of average per cross-mapped Pfam | Compute the average structure for all averaged structures corresponding to UniProt domain instances cross-mapped from CATH to a Pfam family. First computes the average per UniProt domain instance and then averages all averaged structures per Pfam family.
chop_and_avg_for_Pfam2CATH | Compute average of average per cross-mapped CATH | Compute the average structure for all averaged structures corresponding to UniProt domain instances cross-mapped from Pfam to a CATH superfamily. First computes the average per UniProt domain instance and then averages all averaged structures per CATH superfamily.
align_avg_structs_pairwise | Pairwise alignment with core average structure | Align cross-mapped averaged structures pairwise against the core average domain structure using Kpax. Outputs a csv file with all the scores from the pairwise alignments.
check_alignment_scores | Checks the alignment score for given threshold | Checks the alignment score for each aligned structure against the given threshold. Outputs the structural instances passing and failing the threshold in separate files.
unmapped_from_pfam | Averages and aligns the unmapped instances from Pfam | First computes the average per UniProt domain instance and then aligns all the average structures against the core average structure. Outputs the alignment results along with the structures passing and failing the threshold for the given Kpax score.
unmapped_from_cath | Averages and aligns the unmapped instances from CATH | First computes the average per UniProt domain instance and then aligns all the average structures against the core average structure. Outputs the alignment results along with the structures passing and failing the threshold for the given Kpax score.
gather_domain_like | Collects all domain-like structural instances | Collects all domain-like structural instances from Pfam and CATH. Outputs a single list with all domain-like structural instances.
gather_failed_domains | Collects all failed domain instances | Collects all domain instances from both Pfam and CATH that failed to pass the criteria. Outputs a single list with all failed domain instances.
create_new_parameters | Create parameter file for next iteration | Create the parameter file for the next iteration from the previous parameter file. Filters the pairwise alignments to retrieve the family ids passing the threshold for a given Kpax score type.
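
To make the +/- 30 aa tolerance used by compare_instances_CATH_Pfam more concrete, here is a small sketch of one possible matching rule. It is not the workflow's actual code: the StI record, the function names, and the rule itself (same PDB chain, start and end each differing by at most 30 residues) are assumptions chosen for illustration.

```python
from typing import List, NamedTuple, Tuple

TOLERANCE = 30  # residues; the +/- 30 aa range mentioned for compare_instances_CATH_Pfam


class StI(NamedTuple):
    """Hypothetical record for one domain structural instance."""
    pdb_id: str
    chain: str
    start: int
    end: int


def same_instance(a: StI, b: StI, tol: int = TOLERANCE) -> bool:
    """Assumed rule: same chain, and both boundaries differ by at most `tol`."""
    return (
        (a.pdb_id, a.chain) == (b.pdb_id, b.chain)
        and abs(a.start - b.start) <= tol
        and abs(a.end - b.end) <= tol
    )


def compare_instances(pfam: List[StI], cath: List[StI]) -> Tuple[list, list, list]:
    """Split two lists into common pairs, Pfam-only and CATH-only instances."""
    common = [(p, c) for p in pfam for c in cath if same_instance(p, c)]
    matched_pfam = {p for p, _ in common}
    matched_cath = {c for _, c in common}
    only_pfam = [p for p in pfam if p not in matched_pfam]
    only_cath = [c for c in cath if c not in matched_cath]
    return common, only_pfam, only_cath
```

The three returned lists correspond to the three files the step is described as producing: common instances, instances unique to Pfam, and instances unique to CATH.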

Outputs

ID | Name | Description | Type
family_ids_x | Family ids per iteration | n/a | File
resmapped_pfam | All Pfam residue-mapped domain StIs with domain labels | n/a | File
reslost_pfam | Obsolete and inconsistent domain StIs from Pfam | n/a | File
resmapped_cath | All CATH residue-mapped domain StIs with domain labels | n/a | File
reslost_cath | Obsolete and inconsistent domain StIs from CATH | n/a | File
true_domains | True domain StIs per iteration | n/a | File
core_domains_list | Core domain StIs | n/a | File
core_structure | Core domain structure (.pdb) | n/a | File
all_domain_like | Domain-like StIs | n/a | File
all_failed_domains | Failed domain StIs | n/a | File
crossmapped_pfam_passed | Cross-mapped families with Pfam domain StIs passing the threshold | n/a | File
crossmapped_cath_passed | Cross-mapped families with CATH domain StIs passing the threshold | n/a | File
crossres_mappedpfam | Merged cross-mapped and residue-mapped domain StIs from Pfam | n/a | File
crossres_mappedcath | Merged cross-mapped and residue-mapped domain StIs from CATH | n/a | File
unmap_pfam | All Pfam un-mapped domain StIs | n/a | File
allmap_pfam | All Pfam domain StIs cross-mapped to CATH family-wise | n/a | File
unmap_cath | All un-mapped domain StIs from CATH | n/a | File
allmap_cath | All CATH cross-mapped domain StIs family-wise together | n/a | File
pfam_crossmap_cath_avg | Average structures per cross-mapped CATH family for Pfam StIs at family level | n/a | File[]
cath_crossmap_pfam_avg | Average structures per cross-mapped Pfam family for CATH StIs at family level | n/a | File[]
avg_alignment_result | Alignment results from Kpax for all cross-mapped families | n/a | File
next_parmfile | Parameter file for next iteration of the workflow | n/a | File
align_unmap_pfam | Alignment results for Pfam unmapped instances | n/a | File
unmap_pfam_passed | Domain-like StIs from Pfam | n/a | File
unmap_pfam_failed | Failed domain StIs from Pfam | n/a | File
align_unmap_cath | Alignment results for CATH unmapped instances | n/a | File
unmap_cath_passed | Domain-like StIs from CATH | n/a | File
unmap_cath_failed | Failed domain StIs from CATH | n/a | File
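
The passed and failed outputs listed above come from comparing an alignment score with the score_threshold input. A minimal sketch of such a split is shown below; it assumes that higher scores indicate better alignments and that a score equal to the threshold passes, neither of which is stated in this README, and split_by_threshold is a hypothetical name.

```python
from typing import Dict, List, Tuple


def split_by_threshold(scores: Dict[str, float], threshold: float) -> Tuple[List[str], List[str]]:
    """Return (passed, failed) instance ids for one alignment score type.

    Assumes higher is better and that a score equal to the threshold passes.
    """
    passed = sorted(name for name, score in scores.items() if score >= threshold)
    failed = sorted(name for name, score in scores.items() if score < threshold)
    return passed, failed
```

Under these assumptions, unmapped instances in the first list would be reported as domain-like StIs and those in the second as failed domain StIs.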

Version History

v1.1 (latest) Created 20th Jun 2023 at 13:06 by Hrishikesh Dhondge

Pfam v35.0 and Results_archive for publication


Frozen v1.1 b5a9d4b

main @ 9f38328 (earliest) Created 28th Sep 2022 at 12:34 by Hrishikesh Dhondge

Updated input parameter file


Frozen main 9f38328
Citation
Dhondge, H., Chauvot De Beauchêne, I., & Devignes, M.-D. (2022). CroMaSt: A workflow for domain family curation through cross-mapping of structural instances between protein domain databases. WorkflowHub. https://doi.org/10.48546/WORKFLOWHUB.WORKFLOW.390.1

Created: 28th Sep 2022 at 12:34

Last updated: 20th Jun 2023 at 13:06
