Research Data Management and Open Research Data Services and Infrastructure

Coming up soon: Central Info Point

Over the next few months, this page will be transformed into a Central Info Point for Research Data Management for researchers across the entire ETH Domain. Initially, the Central Info Point will guide researchers to the ORD-related solutions available across the institutions of the ETH Domain.

Status of implementation

As preliminary work, the Expert Group Services and Infrastructure has conducted an inventory of services and infrastructures that support ORD practices of researchers in the ETH Domain. The inventory is organized along the data life cycle (see below) and shows which services and infrastructures are required to answer specific questions that arise at different points in the research data management process.

Access the report here

The Central Info Point project is being carried out as part of Measure 2 “Improve the ecosystem of research data management (RDM) services and infrastructure that support ORD practices” and is coordinated by the EPFL Library. Within Measure 2, a common storage access API is also being developed for the ETH Domain, along with improved integration of research data repositories, electronic laboratory notebooks and data analysis platforms.

Preliminary insights: ETH Domain ORD services and infrastructures

The research data life cycle provides a simplified model to describe the data-related phases of a typical research process. These are: data management planning, acquisition, storage and annotation, data processing and analysis, data publication, preservation and reuse. The list below shows which services and infrastructures are available to support tasks that arise at different points in the research data management process.

Research Data Lifecycle Diagram

Data management planning

DMP templates for creating, curating, and distributing research data are available at all institutions in the ETH Domain. In addition, most institutions offer individual DMP consulting to support researchers. Some providers of RDM infrastructures and services in the ETH Domain also offer specific DMP templates to their users (e.g. EnviDat, Materials Cloud/AiiDA, openBIS). DMPonline is a web-based platform provided by the Digital Curation Centre (DCC) at the Universities of Edinburgh and Glasgow for research institutions. It is used by a wide range of research institutions worldwide to create, review and share data management plans (DMPs) that meet both institutional and funder requirements. DMPonline includes the latest funder templates and best practice guidance to help users produce high-quality DMPs.

Data acquisition, storage and annotation

All institutions in the ETH Domain provide some sort of basic storage service to their researchers. Examples include various levels of NAS storage, long-term storage (on tapes), and high-performance storage systems (e.g., parallel file systems on HPC clusters). Apart from centrally provided storage solutions, individual institutes or research groups often still operate their own storage infrastructures.

EPFL: Different Electronic Lab Notebooks (ELNs) are preferred depending on the department, for example Slims for life sciences and engineering, Eln.epfl.ch for spectroscopic data, and RSpace in chemistry. For secure data acquisition, RedCap is used at EPFL as well as at other institutions in the ETH Domain. The Open Sample platform, which allows scientists to search for antibodies, plasmids, cells or other biomedical research tools, provides information on whether this material has been used at EPFL.

ETH Zurich: At ETH Zurich, openBIS is used as a combination of data management solution, Electronic Lab Notebook (ELN) and Laboratory Information Management System (LIMS). It is currently used by about 70 research groups in 12 different departments.

Data processing and analysis

Computer Notebooks: Throughout the ETH Domain, a number of central notebook platforms have been established mainly based on JupyterHub and R Markdown. They support interactive scientific computing and are used in many research disciplines for exploratory scientific data analysis. Since they allow the exchange of a “computational history” (code, documentation, results, etc.), they also support OR processes.

Version control: Version control is a key technique for the proper management of software code and text-based data, which form the basis for most data processing and analysis in the ETH Domain. Git is currently the most popular version control system for professional software development. It is widely used in research areas where coding of simulation or analysis workflows is an integral part of the research process. Git-based platforms such as GitLab (open-source) or GitHub not only enable professional code management and reproducible research, but also support ORD by providing researchers with a platform for publishing code and data.

Platforms: Great efforts are being made in the ETH Domain to consolidate the various technologies for data processing and analysis into larger platforms. This is intended to lower the barriers to entry to current best practices. For example, the Swiss Data Science Center (SDSC) created the Renkulab platform, which combines many tools for collaborative data analysis. AiiDA, an open-source Python infrastructure that supports researchers in high-throughput computations that can last from seconds to weeks, was developed at EPFL.

Publication and reuse of data

Once created, research data must be properly stored and annotated (documented). Subsequently, the data are usually processed and analyzed, resulting in derived data sets. Some or all of these data sets are eventually selected for long-term storage and publication in data repositories. Only a small subset of the published data sets is ultimately reused, starting the data cycle again for a new research project. Currently, the ecosystem of research data repositories in Switzerland is markedly diverse.

Publications

All institutions of the ETH Domain provide their researchers with an institutional publication repository. EPFL uses Infoscience as its institutional publication repository, which is managed by the EPFL library. ETH Zurich uses the Research Collection as an institutional repository for publications and research data of ETH Zurich’s own scientists. At the end of 2022, about 241,000 publications were recorded. The four research institutes of the ETH Domain share the platform “Digital Object Repository at the Four Research Institutes” (DORA 4RI) as an institutional repository. It provides a directory of all publications (as of 2023: 75,600) by researchers at the institutes.

Repositories for research data

Research data of all kinds without a subject-specific focus can be stored in general data repositories, for example that of the respective institution. There are also cross-institutional subject-specific data repositories. Researchers are free to choose the most appropriate repository as long as it can be considered compliant with the FAIR standard.

At ETH Zurich, the Research Collection is operated by the ETH Library as an institutional general-purpose data repository. All research data in it are assigned a Digital Object Identifier (DOI).

At EPFL, Zenodo is currently the most widely used general-purpose research data repository. It was built and is operated by CERN and OpenAIRE, with data stored at the CERN Data Center.

The Eawag Research Data Institutional Collection (ERIC) is a repository specifically for archiving and disseminating research data produced by Eawag scientists.

The WSL’s Environmental Data Portal (EnviDat) is a specialized repository using DOIs for environmental research data. It hosts and publishes environmental research data from WSL’s research units and collaborating partners from other institutions, including those in the ETH Domain (e.g. EPFL, ETHZ, Eawag, PSI – limited to environmental data).

At PSI, research data is stored in the central Data Catalog. Data and metadata collected by internal and external users at the photon research facilities SLS and SwissFEL are automatically deposited in the catalog at a rate of 3-4 PB/year.

| Repository name | Hosting institution | Other data-providing institutions | Repository type | Software | PID type | Selected statistics | Remarks |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Digital Object Repository at the Four Research Institutes (DORA) | Lib4RI | Eawag, Empa, PSI, WSL | Publication | Islandora | DOI | ≈75’600 publications | Further statistics can be found here |
| Infoscience | EPFL | None | Publication | Invenio | DOI | ≈162’000 | Next release to include datasets |
| Eawag Research Data Collection (ERIC) | Eawag | None | General data | CKAN | DOI | ≈150 open datasets, ≈500 internal datasets | |
| ETH Research Collection | ETHZ | None | Publication, General data | DSpace | DOI | ≈241’000 publications, ≈1700 datasets (42 TB total volume) | |
| Data Catalog | PSI | Facility users, CSCS (see remarks) | General data | SciCat | DOI, PID | >400’000 datasets, 9 PB total volume, >1600 groups of users | Active proposal to provide broader access to ETH institutions |
| Zenodo | CERN | Open | General data | Invenio | DOI | | Some institutional customization, e.g. via “EPFL groups” |
| EnviDat | WSL | Collaborations approved by WSL (see remarks) | Domain-specific (Environmental Sciences) | CKAN | DOI | ≈540 datasets (20 TB total volume) | Currently hosting environmental datasets from WSL and other collaborating institutions |
| Materials Cloud | EPFL | PSI, Empa | Domain-specific data (Material Sciences) | Invenio (customized) | DOI | ≈22M crystal structures, ≈7.5M simulations | |
| Living Archives | EPFL | None | Domain-specific data (Architecture) | In-house | PID | ≈11’000 items | |

Overview of commonly used repositories in the ETH Domain for publication of research data and outputs.
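
Two of the repositories listed above (ERIC and EnviDat) run on CKAN, whose Action API offers a `package_search` endpoint for dataset queries. The following is a minimal sketch of how such a query could be built and its JSON reply summarized. It assumes a hypothetical instance at `https://ckan.example.org` and uses a made-up sample response, so it does not contact any real repository. The function names are illustrative only.

```python
import json
from urllib.parse import urlencode

# Hypothetical base URL (assumption); replace with the address of a real CKAN instance.
BASE_URL = "https://ckan.example.org"


def build_search_url(base_url, query, rows=5):
    """Return the URL of a CKAN package_search request for a free-text query."""
    params = urlencode({"q": query, "rows": rows})
    return f"{base_url.rstrip('/')}/api/3/action/package_search?{params}"


def summarize(response_text):
    """Return (total hit count, [(name, title), ...]) from a package_search reply."""
    payload = json.loads(response_text)
    if not payload.get("success"):
        raise ValueError("CKAN reported an unsuccessful request")
    result = payload["result"]
    datasets = [(d["name"], d.get("title", "")) for d in result["results"]]
    return result["count"], datasets


# Canned reply in the shape used by the CKAN Action API; the dataset is invented.
SAMPLE = json.dumps({
    "success": True,
    "result": {
        "count": 1,
        "results": [{"name": "example-dataset", "title": "Example dataset"}],
    },
})

if __name__ == "__main__":
    print(build_search_url(BASE_URL, "snow depth"))
    print(summarize(SAMPLE))
```

In practice the URL would be fetched with an HTTP client and the reply passed to `summarize`. Whether a given repository exposes this endpoint publicly is something to confirm with its operator.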

Data preservation and disposal

Long-term archiving ensures that valuable research data and thus scientific results remain interpretable and reusable for years to come. In general, a distinction can be made between curated and non-curated solutions. Curated solutions are often referred to as archives and are typically located in the library domain. They are important for ORD because professional data archives ensure the long-term accessibility and interpretability of digital objects. In non-curated long-term storage (e.g., tape libraries), the interpretability of stored data is generally not guaranteed. File formats may become unreadable, data may be poorly written or not written at all, and even knowledge of the existence of data may fade over time because accessible metadata is not available. Non-curated long-term storage solutions, sometimes referred to as “data graveyards,” are therefore of limited relevance to ORD.

The ETH Data Archive is the storage solution for research data at ETHZ; it is also the storage layer behind the Research Collection for objects that should remain usable for longer than ten years. At EPFL, the Academic Output Archive (ACOUA) serves as the long-term repository for research data produced by EPFL researchers. ACOUA was launched in the first quarter of 2021 and is curated by the EPFL Library.
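
One common safeguard against data that was poorly written or silently altered in long-term storage is a checksum manifest, often called a fixity check: a hash is recorded for every file at deposit time and recomputed later. The sketch below is an illustration under that assumption only, not a description of how the ETH Data Archive or ACOUA operate. The function names are hypothetical.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks to limit memory use."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(directory):
    """Map each file's path (relative to directory) to its SHA-256 digest."""
    root = Path(directory)
    return {
        p.relative_to(root).as_posix(): sha256_of(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def verify(directory, manifest):
    """Return the relative paths that are missing or whose content has changed."""
    root = Path(directory)
    problems = []
    for rel, expected in manifest.items():
        target = root / rel
        if not target.is_file() or sha256_of(target) != expected:
            problems.append(rel)
    return problems
```

A manifest would typically be created once with `build_manifest` when data is deposited, stored alongside the descriptive metadata, and checked periodically with `verify`. A check like this detects corruption or loss but does not by itself keep file formats interpretable.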
