Publications

Abstract

Preprint: https://arxiv.org/abs/2110.02168

The landscape of workflow systems for scientific applications is notoriously convoluted with hundreds of seemingly equivalent workflow systems, many isolated research claims, and a steep learning curve. To address some of these challenges and lay the groundwork for transforming workflows research and development, the WorkflowsRI and ExaWorks projects partnered to bring the international workflows community together. This paper reports on discussions and findings from two virtual "Workflows Community Summits" (January and April, 2021). The overarching goals of these workshops were to develop a view of the state of the art, identify crucial research challenges in the workflows community, articulate a vision for potential community efforts, and discuss technical approaches for realizing this vision. To this end, participants identified six broad themes: FAIR computational workflows; AI workflows; exascale challenges; APIs, interoperability, reuse, and standards; training and education; and building a workflows community. We summarize discussions and recommendations for each of these themes.

Authors: Rafael Ferreira da Silva, Henri Casanova, Kyle Chard, Ilkay Altintas, Rosa M Badia, Bartosz Balis, Taina Coleman, Frederik Coppens, Frank Di Natale, Bjoern Enders, Thomas Fahringer, Rosa Filgueira, Grigori Fursin, Daniel Garijo, Carole Goble, Dorran Howell, Shantenu Jha, Daniel S. Katz, Daniel Laney, Ulf Leser, Maciej Malawski, Kshitij Mehta, Loic Pottier, Jonathan Ozik, J. Luc Peterson, Lavanya Ramakrishnan, Stian Soiland-Reyes, Douglas Thain, Matthew Wolf

Date Published: 1st Nov 2021

Publication Type: Journal

Abstract

Motivation: Protein-protein interactions (PPIs) can be used for many applications, such as inferring protein functions or supporting the drug discovery process. For the human species, there is a lot of validated information and many functional annotations for the proteins in its interactome. In other species, the known interactome is much smaller than the human one, and many proteins have few or no annotations by specialists. Understanding the interactomes of other species helps to trace evolutionary characteristics, compare important biological processes, and build interactomes for new organisms from more closely related organisms instead of relying only on the human interactome.

Results: In this study, we evaluate the performance of the PredPrIn workflow in predicting the interactomes of seven organisms in terms of scalability and precision, showing that PredPrIn achieves over 70% precision and takes less than three days even on the largest datasets. We performed a transfer learning analysis, predicting each organism's interactome from every other organism, and then showed how this relates to their evolutionary relationship, as reflected in the number of ortholog proteins shared between these organisms. We also present a functional enrichment analysis showing the proportion of shared annotations between predicted positive and false interactions, and we extract topological features of each organism's interactome, such as proteins acting as hubs and as bridges between modules. For each organism, one of the most frequent biological processes was selected, and the number of proteins and pairs present in it was compared between the interactome available in the HINT database for that organism and the one predicted by PredPrIn. In this comparison, we showed that we covered the proteins and pairs covered in HINT and also enriched these processes for almost all organisms.

Conclusions: In this work, we have shown the efficiency of the PredPrIn workflow for protein interaction prediction in seven different organisms using scalability, performance and transfer learning analyses. We have also made cross-species interactome comparisons showing the most frequent biological processes for each organism, as well as the topological features of each organism's interactome, which are consistent with hypotheses about biological networks. Finally, we described the enrichment made by PredPrIn in selected biological processes, showing that its predictions were important for enhancing information about these organisms' interactomes.

Author: Yasmmin C Martins

Date Published: 7th Jun 2023

Publication Type: Journal

Abstract

Provenance registration is becoming more and more important as we increase the size and number of experiments performed using computers. In particular, when provenance is recorded in HPC environments, it must be efficient and scalable. In this paper, we propose a provenance registration method for scientific workflows that is efficient enough to run on supercomputers (and could therefore also run in other environments with more relaxed restrictions, such as distributed ones). It must also be scalable in order to deal with large workflows, which are more typical in HPC. We also target transparency for the user, shielding them from having to specify how provenance must be recorded. We implement our design using the COMPSs programming model as a Workflow Management System (WfMS) and use RO-Crate as a well-established specification to record and publish provenance. Experiments are provided, demonstrating the run time efficiency and scalability of our solution.

Authors: Raul Sirvent, Javier Conejero, Francesc Lordan, Jorge Ejarque, Laura Rodriguez-Navas, Jose M. Fernandez, Salvador Capella-Gutierrez, Rosa M. Badia

Date Published: 1st Nov 2022

Publication Type: Proceedings

Abstract

In recent years, the improvement of software and hardware performance has made biomolecular simulations a mature tool for the study of biological processes. Simulation length and the size and complexity of the analyzed systems make simulations both complementary and compatible with other bioinformatics disciplines. However, the characteristics of the software packages used for simulation have prevented the adoption of technologies accepted in other bioinformatics fields, such as automated deployment systems, workflow orchestration, or the use of software containers. We present here a comprehensive exercise to bring biomolecular simulations to the “bioinformatics way of working”. The exercise has led to the development of the BioExcel Building Blocks (BioBB) library. BioBB’s are built as Python wrappers to provide an interoperable architecture. BioBB’s have been integrated in a chain of usual software management tools to generate data ontologies, documentation, installation packages, software containers and ways of integration with workflow managers, which make them usable in most computational environments.

Authors: Pau Andrio, Adam Hospital, Javier Conejero, Luis Jordá, Marc Del Pino, Laia Codo, Stian Soiland-Reyes, Carole Goble, Daniele Lezzi, Rosa M. Badia, Modesto Orozco, Josep Ll. Gelpi

Date Published: 1st Dec 2019

Publication Type: Journal

Abstract

Identification of honey bees (Apis mellifera) from various parts of the world is essential for the protection of their biodiversity. The identification can be based on wing measurements, which are inexpensive and easily available. Developing such identification requires reference samples from various parts of the world. We provide a collection of 26481 honey bee fore wing images from 13 countries in Europe: Austria (AT), Croatia (HR), Greece (GR), Moldova (MD), Montenegro (ME), Poland (PL), Portugal (PT), Romania (RO), Serbia (RS), Slovenia (SI), Spain (ES), Turkey (TR). For each country there are three files starting with the two-letter country code (indicated above in parentheses): XX-wing-images.zip, XX-raw-coordinates.csv and XX-data.csv, which contain wing images, raw landmark coordinates and geographic coordinates, respectively. Files with the prefix EU contain combined data from all countries.

Authors: Andrzej Oleksa, Eliza Căuia, Adrian Siceanu, Zlatko Puškadija, Marin Kovačić, M. Alice Pinto, Pedro João Rodrigues, Fani Hatjina, Leonidas Charistos, Maria Bouga, Janez Prešern, Irfan Kandemir, Slađan Rašić, Szilvia Kusza, Adam Tofilski

Date Published: 1st Oct 2022

Publication Type: Journal
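
The per-country file naming described in the honey bee wing image abstract above can be sketched as follows. This is only a minimal illustration, assuming nothing beyond the naming pattern and the country codes stated in the abstract; the helper name `expected_files` is hypothetical and no download location is implied.

```python
# Country codes exactly as listed in the abstract.
COUNTRY_CODES = ["AT", "HR", "GR", "MD", "ME", "PL", "PT", "RO", "RS", "SI", "ES", "TR"]

# File suffix -> content, as described in the abstract.
FILE_KINDS = {
    "wing-images.zip": "wing images",
    "raw-coordinates.csv": "raw landmark coordinates",
    "data.csv": "geographic coordinates",
}


def expected_files(prefix):
    """Return the three file names expected for a prefix (hypothetical helper)."""
    return [f"{prefix}-{suffix}" for suffix in FILE_KINDS]


if __name__ == "__main__":
    # "EU" is the prefix the abstract gives for the combined data.
    for code in COUNTRY_CODES + ["EU"]:
        print(code, expected_files(code))
```

A listing like this could serve as a checklist when verifying that a local copy of the collection is complete.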

Abstract

The third Dutch national airborne laser scanning flight campaign (AHN3, Actueel Hoogtebestand Nederland) conducted between 2014 and 2019 during the leaf-off season (October–April) across the whole Netherlands provides a free and open-access, country-wide dataset with ∼700 billion points and a point density of ∼10(–20) points/m². The AHN3 point cloud was obtained with Light Detection And Ranging (LiDAR) technology and contains for each point the x, y, z coordinates and additional characteristics (e.g. return number, intensity value, scan angle rank and GPS time). Moreover, the point cloud has been pre-processed by ‘Rijkswaterstaat’ (the executive agency of the Dutch Ministry of Infrastructure and Water Management), comes with a Digital Terrain Model (DTM) and a Digital Surface Model (DSM), and is delivered with a pre-classification of each point into one of six classes (0: Never Classified, 1: Unclassified, 2: Ground, 6: Building, 9: Water, 26: Reserved [bridges etc.]). However, no detailed information on vegetation structure is available from the AHN3 point cloud. We processed the AHN3 point cloud (∼16 TB uncompressed data volume) into 10 m resolution raster layers of ecosystem structure at a national extent, using a novel high-throughput workflow called ‘Laserfarm’ and a cluster of virtual machines with fast central processing units, high memory nodes and associated big data storage for managing the large number of files. The raster layers (available as GeoTIFF files) capture 25 LiDAR metrics of vegetation structure, including ecosystem height (e.g. 95th percentiles of normalized z), ecosystem cover (e.g. pulse penetration ratio, canopy cover, and density of vegetation points within defined height layers), and ecosystem structural complexity (e.g. skewness and variability of vertical vegetation point distribution). 
The raster layers make use of the Dutch projected coordinate system (EPSG:28992 Amersfoort / RD New), are each ∼1 GB in size, and can be readily used by ecologists in a geographic information system (GIS) or analytical open-source software such as R and Python. Even though the class ‘1: Unclassified’ mainly includes vegetation points, other objects such as cars, fences, and boats can also be present in this class, introducing potential biases in the derived data products. We therefore validated the raster layers of ecosystem structure using >180,000 hand-labelled LiDAR points in 100 randomly selected sample plots (10 m × 10 m each) across the Netherlands. Besides vegetation, objects such as boats, fences, and cars were identified in the sampled plots. However, the misclassification rate of vegetation points (i.e. non-vegetation points that were assumed to be vegetation) was low (∼0.05) and the accuracy of the 25 LiDAR metrics derived from the AHN3 point cloud was high (∼90%). To minimize existing inaccuracies in this country-wide data product (e.g. ships on water bodies, chimneys on roofs, or cars on roads that might be incorrectly used as vegetation points), we provide an additional mask that captures water bodies, buildings and roads generated from the Dutch cadaster dataset. This newly generated country-wide ecosystem structure data product provides new opportunities for ecology and biodiversity science, e.g. for mapping the 3D vegetation structure of a variety of ecosystems or for modelling biodiversity, species distributions, abundance and ecological niches of animals and their habitats.

Authors: W. Daniel Kissling, Yifang Shi, Zsófia Koma, Christiaan Meijer, Ou Ku, Francesco Nattino, Arie C. Seijmonsbergen, Meiert W. Grootes

Date Published: 1st Feb 2023

Publication Type: Journal
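
The AHN3 pre-classification mentioned in the abstract above can be illustrated with a small sketch. It assumes only the six class codes quoted in the abstract; the point tuples, the helper names `count_by_class` and `vegetation_candidates`, and the toy data are hypothetical and are not part of the Laserfarm workflow.

```python
from collections import Counter

# Class codes as quoted in the abstract.
AHN3_CLASSES = {
    0: "Never Classified",
    1: "Unclassified",
    2: "Ground",
    6: "Building",
    9: "Water",
    26: "Reserved",
}


def count_by_class(points):
    """Count points per class label; each point is (x, y, z, class_code)."""
    counts = Counter(p[3] for p in points)
    return {AHN3_CLASSES.get(code, f"Unknown ({code})"): n for code, n in counts.items()}


def vegetation_candidates(points):
    """Keep class 1 points, which the abstract says mainly (not only) contain vegetation."""
    return [p for p in points if p[3] == 1]


# Toy data for illustration only, not real AHN3 points.
toy_points = [
    (0.0, 0.0, 1.2, 1),
    (0.5, 0.1, 0.0, 2),
    (1.0, 0.2, 7.9, 6),
    (1.5, 0.3, 3.4, 1),
]

if __name__ == "__main__":
    print(count_by_class(toy_points))
    print(len(vegetation_candidates(toy_points)), "vegetation candidate points")
```

Because class 1 can also contain cars, fences, and boats, a filter like this yields candidates only, which is why the abstract describes a separate validation step and an additional mask.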

Abstract

This dataset provides a standardized collection of rasterized Light Detection And Ranging (LiDAR) metrics in GeoTIFF format, derived from country-wide airborne laser scanning (ALS) data across seven demonstration sites in five European countries: Mols Bjerge National Park (Denmark), Reserve Naturelle Nationale du Bagnas (France), Oostvaardersplassen (Netherlands), Salisbury Plain (United Kingdom), Knepp Estate (United Kingdom), Monks Wood (United Kingdom), and the island of Comino (Malta). The sites range in areal size from 0.08 km² to 54 km² and include habitat types such as forests, broadleaf and conifer woodlands, small plantations, dry and wet grasslands, marshes, reedbeds, arable fields, farmland, scrublands and Mediterranean garigue. A total of 35 LiDAR metrics were calculated, of which 28 represent vegetation structural attributes. These include vegetation height (seven metrics), vegetation cover (fourteen metrics), and vegetation vertical variability (seven metrics). Additionally, seven metrics describe point density (one metric), eigenvalues (three metrics), and normal vectors (three metrics). The rasterized LiDAR metrics have a spatial resolution of 10 m, with coverage and extent defined by shapefiles corresponding to each demonstration site. The raw ALS point clouds were clipped to the site boundaries and processed with the 'Laserfarm' workflow, a standardized computational workflow that includes modular pipelines for re-tiling, normalization, feature extraction, and rasterization. Laserfarm employs the feature extraction module of the open-source ‘Laserchicken’ software to compute the LiDAR metrics. The workflow was implemented using the IT services of the Dutch national facility for information and communication technology, SURF. The clipped LiDAR point clouds are available through a public repository, except for the LiDAR point clouds from Comino, Malta, which are not publicly available. 
The 35 rasterized LiDAR metrics (GeoTIFF files, 10 m resolution) from all sites, including Comino, as well as the corresponding site boundary shapefiles (geospatial vector format), are provided in a Zenodo repository. Additionally, the Jupyter Notebooks with Python code for executing the Laserfarm workflow are available to facilitate reproducibility and further computational applications. Users should note that the rasterized LiDAR metrics may contain zero or NA values, particularly over water surfaces, with the pulse penetration ratio metric potentially indicating false high vegetation cover over water. Users may reclassify or mask areas with zero values accordingly. Some pixels exhibit abnormal vegetation height values, which can be filtered before analysis. Certain striping patterns, likely resulting from overlapping flight lines and increased point density, are present in some metrics, though their overall impact appears minimal. This dataset enables diverse applications, including canopy height measurements, mapping of hedgerows, treelines, and forest patches, as well as characterizing vegetation density, vertical stratification, and habitat openness. It supports landscape-scale habitat analysis and contributes to the standardization of vegetation metrics from ALS data for site-specific ecological monitoring (e.g., Natura 2000). Moreover, the dataset demonstrates the automated execution of LiDAR data processing workflows, which is crucial for establishing a transnational and multi-site biodiversity and ecosystem observation network.

Authors: W. Daniel Kissling, Wessel Mulder, Jinhu Wang, Yifang Shi

Date Published: 1st Jun 2025

Publication Type: Journal
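
The abstract above notes that users may mask areas with zero values in the rasterized LiDAR metrics. A minimal sketch of that idea follows, assuming a raster held as a plain list of rows; the function name `mask_zeros` and the toy grid are hypothetical, and a real analysis would more likely read the GeoTIFF files with a geospatial library.

```python
def mask_zeros(raster, nodata=None):
    """Return a copy of the raster with zero-valued cells replaced by `nodata`."""
    return [[nodata if value == 0 else value for value in row] for row in raster]


# Toy 2 x 2 grid for illustration only, not real metric values.
toy_raster = [
    [0.0, 0.4],
    [0.7, 0.0],
]

masked = mask_zeros(toy_raster)

if __name__ == "__main__":
    print(masked)
```

Returning a new grid rather than modifying the input keeps the original values available, for example if zero cells over water need to be inspected separately.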

Abstract

Coordinates of 19 landmarks from honey bee (Apis mellifera) worker wings. They represent 1832 workers, 187 colonies, 25 subspecies and four evolutionary lineages. The material was obtained from the Morphometric Bee Data Bank in Oberursel, Germany.

Authors: Anna Nawrocka, Irfan Kandemir, Stefan Fuchs, Adam Tofilski

Date Published: 1st Apr 2018

Publication Type: Journal

Abstract

Considerable efforts have been made to build the Web of Data. One of the main challenges has to do with how to identify the most related datasets to connect to. Another challenge is to publish a local dataset into the Web of Data, following the Linked Data principles. The present work is based on the idea that a set of activities should guide the user on the publication of a new dataset into the Web of Data. It presents the specification and implementation of two initial activities, which correspond to the crawling and ranking of a selected set of existing published datasets. The proposed implementation is based on the focused crawling approach, adapting it to address the Linked Data principles. Moreover, the dataset ranking is based on a quick glimpse into the content of the selected datasets. Additionally, the paper presents a case study in the Biomedical area to validate the implemented approach, and it shows promising results with respect to scalability and performance.

Authors: Yasmmin Cortes Martins, Fábio Faria da Mota, Maria Cláudia Cavalcanti

Date Published: 2016

Publication Type: Journal

Abstract

The ongoing coronavirus disease 2019 (COVID-19) pandemic, triggered by the emerging SARS-CoV-2 virus, represents a global public health challenge. Therefore, the development of effective vaccines is urgently needed to prevent and control the spread of the virus. One vaccine production strategy uses in silico epitope prediction from the virus genome by immunoinformatic approaches, which assist in selecting candidate epitopes for in vitro and clinical trials research. This study introduces the EpiCurator workflow to predict and prioritize epitopes from SARS-CoV-2 genomes by combining a series of computational filtering tools. To validate the effectiveness of the workflow, SARS-CoV-2 genomes retrieved from the GISAID database were analyzed. We identified 11 epitopes in the receptor-binding domain (RBD) of the Spike glycoprotein, an important antigenic determinant, not previously described in the literature or published in the Immune Epitope Database (IEDB). Interestingly, these epitopes have a combination of important properties: they are recognized in sequences of the current variants of concern and present high antigenicity, conservancy, and broad population coverage. The RBD epitopes were the source for a multi-epitope design for in silico validation of their immunogenic potential. The overall quality of the multi-epitope was computationally validated, endorsing its efficiency in triggering an effective immune response, since it has stability, high antigenicity and strong interactions with Toll-Like Receptors (TLR). Taken together, the findings of the current study demonstrate the efficacy of the workflow for epitope discovery, providing target candidates for immunogen development.

Authors: Cristina S. Ferreira, Yasmmin C. Martins, Rangel Celso Souza, Ana Tereza R. Vasconcelos

Date Published: 2021

Publication Type: Journal
