Open Research Data Projects

Projects funded in the framework of the ORD Program

The joint ORD program of ETH Zurich, EPFL and the four research institutes of the ETH Domain has financially supported more than 60 research projects in the period 2020–2023. Funding supports researchers engaging in, or developing, ORD practices with and for their community and assists these researchers in becoming Open Research Data leaders in their field.

This page provides an overview of these projects. It highlights how researchers in the ETH Domain are currently applying ORD in exemplary ways. Some of the projects have already been completed, others are still in progress. The projects have been divided into three categories.

“Establish” projects help link existing ORD practices to a research agenda to establish them on a broader basis. They contribute to a shared and comprehensive understanding of ORD practices that can then become de facto standards.

“Explore” projects are the most extensive ventures in the program and are designed to explore and test early-stage ORD practices. The goal is to map out what an ORD practice might look like and to develop prototypes. Through these projects, new teams form across disciplines and institutions.

“Contribute” projects help scientists integrate their research data into existing, often international, infrastructures. By standardizing the processes and making them generally accessible, the data are validated, and their potential is considerably expanded.

Prediction of biodegradation potential from (meta)genomes: advancing the enviPathPlus ORD platform

Category

Contribute

Institutions

Eawag

Data type

Biological Magnetic Resonance Bank (BMRB)

Field

Earth sciences

Researchers

Robinson, Serina

Abstract

Chemical pollution has exceeded planetary boundaries, requiring urgent solutions for chemical waste removal. Microbial biodegradation processes are crucial for breaking down chemical contaminants, yet the functions of microbial communities are often challenging to predict. To address this challenge, we aim to contribute a new pipeline, EDGEbp (Enabling Detection of metaGEnomic biodegradation potential), to advance the research capabilities of the ORD biodegradation prediction software, enviPathPlus. Specifically, in EDGEbp, we will build a Hidden Markov Model-based pipeline to identify biodegradation genes and pathways from total microbial community DNA (metagenomic) sequencing data. EDGEbp will output confidence scores to infer the ‘contaminant biodegradation potential’ of a given microbial community based on sequencing information. In other words, our aim is to build a tool that converts hard-to-interpret DNA sequences into easily understood biodegradation confidence scores. This will help us infer the capabilities of a specific microbiome to transform chemical contaminants. The project will therefore advance the sustainable development goal of improving water quality by reducing chemical pollution through microbial biodegradation. Overall, we anticipate that EDGEbp will expand the cutting-edge functionalities of the ORD tool enviPathPlus, support its long-term preservation, and promote community engagement in line with ORD principles.

Fostering Research on Mobile Robotics with High-Quality Data and Open Tooling

Category

Contribute

Institutions

ETH Zurich

Data type

4D STEM data

Field

Materials Science

Researchers

Hutter, Marco

Abstract

Mobile ground robots have become increasingly popular in academia and various industrial applications. However, unlike in other domains such as aerial robotics, autonomous driving, and construction, there is currently no high-quality, large-scale dataset or reliable benchmark established in this field, nor is the tooling available to create one. Such a dataset would be immensely valuable for researchers and developers, fostering research on robust and practical algorithms across diverse environments. Moreover, a standardized benchmarking platform would promote fair comparisons between different approaches, fostering innovation and facilitating rapid progress in mobile ground robot research. Motivated by this, we propose to collect and share a high-quality, versatile, large-scale robotic dataset, “GrandTour”, focusing on legged robots, together with a set of benchmarks and the scalable, automated tooling needed to build and use them.

Open Mechanized Foundations for JavaScript Regular Expressions

Category

Contribute

Institutions

EPFL

Data type

4D STEM data

Field

Materials Science

Researchers

Barrière, Aurèle

Abstract

One of the main forms of ORD in the programming languages research community is two-sided mechanized language specifications (the definition in a proof assistant of the semantics of a language). These mechanized specifications have many benefits: they can be extracted to an executable reference implementation, and used by both implementers (to verify compilers, interpreters and optimizations) and by users (to guarantee the correctness of programs in that programming language).

We propose to contribute the first open, two-sided, mechanized specification of JavaScript regular expressions (regexes). The lack of such mechanization is harming the research community: previous work has mechanized other parts of the JavaScript language but not regexes, and as a consequence researchers use paper-only semantics for JavaScript regexes. These paper semantics are neither executable nor reusable and often incorrect. We will translate to Coq the part of the open ECMAScript standard that describes JavaScript regexes; extract this mechanization to a reference implementation in OCaml; and validate our mechanization with Coq proofs.

Our project will provide a solid foundation for JavaScript regex research. Work that builds on our Coq mechanization could include proving the correctness of regex optimizations, detecting regexes with security issues (ReDoS), or proving the correctness of entire regex engines. Our project will also allow the open JavaScript community to test their regex engines.

Making Temporal Brain Recordings Accessible For Modeling via Brain-Score

Category

Contribute

Institutions

EPFL

Data type

Biological Magnetic Resonance Bank (BMRB)

Field

Life sciences

Researchers

Schrimpf, Martin

Abstract

Brain-Score is an established platform that curates a diverse set of neural and behavioral measurements from neuroscience experiments and facilitates their use in modeling the brain's visual system. By making experimental data accessible to the modeling community in the form of quantitative benchmarks, Brain-Score allows modelers to evaluate computational hypotheses on a broad range of biological data without having to know the details of each experiment. In this proposal, we aim to broaden the scope of Brain-Score model comparisons from the presentation of static images to video inputs. This will enable the modeling of a critical axis of brain processing in the visual cortex that has not yet been explored.

Specifically, we will:

* Contribute new software to Brain-Score to enable the platform to work with temporal data. This involves defining a unified interface for how to provide models with video input, and adding candidate video models from the machine learning community.

* Curate published temporal datasets for Brain-Score. Without Brain-Score, even these public data are often difficult to use for model testing.

* Curate new primate recordings from experimental collaborators (MIT DiCarlo lab) for Brain-Score such that they are accessible for model evaluations. These are among the first electrode recordings in the visual ventral stream where the stimuli are short ecological video clips.

Speckle-OpenCascade Prototype for Enhanced AEC Interoperability through Geometry-Centered Approach

Category

Contribute

Institutions

EPFL

Data type

Workflow management systems (WFMSs)

Field

Materials Science

Researchers

Vouilloz, Raphaël

Abstract

A Speckle connector for Open Cascade Technology will enhance software and data interoperability within the architecture, engineering, and construction (AEC) industry. The project aims to bolster the use of free software by uniting the capabilities of two open-source ecosystems in a sector where proprietary tools currently dominate. On the one hand, Speckle's open-source connectors enable seamless collaboration across diverse AEC software. They ensure accurate and efficient collaborative workflows between various actors and disciplines, thus contributing to the free choice of digital tools and avoiding a captive market. The connectors' open-source design encourages community contributions, fostering continual improvement. On the other hand, Open Cascade Technology is an open-source geometric kernel. It is used in free software alternatives such as FreeCAD or SALOME, and in open-source libraries such as IfcOpenShell, which allows the development of applications based on IFC, the open standard for Building Information Modeling. The open-source nature of Open Cascade Technology also enables many researchers and professionals to develop their own highly specialized digital tools. Our prototype would connect this ecosystem to all AEC industry software via Speckle.

Seizing the treasure: making long-term environmental data available for eLTER and beyond

Category

Contribute

Institutions

WSL

Data type

Data of a large-scale treeline afforestation

Field

Ecology

Researchers

Frei, Esther

Abstract

Established in 1975, the European Long-term Ecosystem Research (eLTER) facility, Stillberg, in the alpine ecosystem near Davos, Switzerland, has amassed extensive environmental and ecological data over almost five decades. These encompass treeline afforestation experiments, meteorological records, plant responses to carbon dioxide enrichment, soil warming effects, plant-snow and plant-soil interactions, and factors influencing tree seedling recruitment. Our project's aim is to contribute these valuable ecological datasets from Stillberg as open research data (ORD). By meticulously curating these datasets and uploading them to national and international ORD platforms like EnviDat and DEIMS-SDR, we enhance their visibility, quality, and accessibility. Sharing this long-term environmental data fosters research syntheses, meta-analyses, and understanding of long-term ecosystem processes in mountain regions, supporting adaptation strategies.

Seismological Software Stack, “Portable self-contained software environments enabling reproducible seismological research”

Category

Contribute

Institutions

ETH Zurich

Data type

Software development

Field

Seismology

Researchers

Brackenhoff, Johannes

Abstract

This project focuses on creating software containers for tools that are crucial to the global seismological community. These tools, widely used but often complex to compile due to their dependencies, will be encapsulated in virtual environments to ensure seamless interaction and ease of use. The portable containers will enhance software sharing and scientific reproducibility. The project involves a postdoctoral researcher and a doctoral student, and Continuous Integration and Continuous Development pipelines will maintain open access repositories. A dedicated workstation with GPU-accelerated hardware will support this work. At the project's midpoint, a workshop with software developers will gather feedback and raise awareness of the containers, and the feedback will guide final enhancements.

PhenoMast - Integration of standardized tree seed mast observations into existing phenology monitoring networks

Category

Contribute

Institutions

WSL

Data type

Seed mast data

Field

Ecology

Researchers

Scherrer, Daniel

Abstract

Switzerland's dominant tree species exhibit masting behavior, characterized by irregular, sporadic seed production patterns. These patterns significantly impact tree regeneration and ecosystem dynamics. Environmental factors and climate change influence these patterns. Despite their ecological importance, many phenological networks overlook seed masting. Our workshop aims to unite managers from various phenological networks to establish an ORD protocol for collecting seed mast data. This standardized method can be integrated into existing phenology networks. Researchers, including modelers, physiologists, and ecologists, will benefit from this comprehensive data collection. The curated data will be publicly available on MastWeb's ORD platform, fostering collaborations with global phenology networks.

Open Workshops on Image Data Best Practices

Category

Contribute

Institutions

ETH Zurich

Data type

Image data

Field

Computational Biology

Researchers

Yamauchi, Kevin

Abstract

Image data are essential in scientific research, from astronomy to microbiology. Advancements in technology have enabled the generation of vast and informative image datasets. To derive valuable insights from these datasets, open distribution is crucial. While centralized efforts exist for hosting open image data, many researchers struggle to prepare and share their data in a Findable, Accessible, Interoperable, and Reusable (FAIR) manner due to data size, the variety of formats, and complex analysis methods. This project's goal is to provide an open-source training resource that educates researchers on processing and preparing image data following ORD best practices. An online handbook will guide users on image ORD best practices and curate existing image ORD resources. Workshops based on the handbook will be conducted to train researchers and establish community consensus on image ORD best practices. These training materials aim to empower ETH Domain researchers to effectively utilize ORD resources.

DISDRODB: A global database of raindrop size distribution observations

Category

Contribute

Institutions

EPFL

Data type

Raindrop concentration

Field

Environmental Remote Sensing

Researchers

Berne, Alexis

Abstract

The raindrop size distribution (DSD) describes the concentration and sizes of raindrops in a volume of air. It is vital for rainfall microstructural analysis, remote sensing interpretation, and accurate representation in atmospheric models. Disdrometers collect DSD observations globally, but the data are dispersed, vary in format, and lack standardized processing tools. As a result, exploring the spatial and temporal variability of the DSD at large scales is challenging. DISDRODB addresses this by advocating common standards for data format, quality control, and processing. It establishes a database and processing framework to store shareable raw measurements, generate clean DSD data, and derive related products (e.g., rain rate, kinetic energy, mean size). This tool benefits scientific communities working with DSD for process understanding, remote sensing, and modeling.
