Research Data Management and Open Research Data Services and Infrastructure
Coming up soon: Central Info Point
Over the next few months, this page will be transformed into a Central Info Point for Research Data Management for researchers in the entire ETH Domain. In the first instance, the Central Info Point will guide researchers to the ORD-related solutions available across the institutions of the ETH Domain.
Status of implementation
As preliminary work, the Expert Group Services and Infrastructure has conducted an inventory of services and infrastructures that support ORD practices of researchers in the ETH Domain. The inventory is organized along the data life cycle (see below) and shows which services and infrastructures are required to answer specific questions that arise at different points in the research data management process.
Access the report here.
The Central Info Point project is being carried out as part of Measure 2 “Improve the ecosystem of research data management (RDM) services and infrastructure that support ORD practices” and is coordinated by the EPFL Library. Within Measure 2, a common storage access API is also being developed for the ETH Domain, along with improved integration of research data repositories, electronic laboratory notebooks and data analysis platforms.
Preliminary insights: ETH Domain ORD services and infrastructures
The research data life cycle provides a simplified model to describe the data-related phases of a typical research process. These are: data management planning, acquisition, storage and annotation, data processing and analysis, data publication, preservation and reuse. The list below shows which services and infrastructures are available to support tasks that arise at different points in the research data management process.
Data management planning
DMP templates for creating, curating, and distributing research data are available at all institutions in the ETH Domain. In addition, most institutions offer individual DMP consulting to support researchers. Some providers of RDM infrastructures and services in the ETH Domain also offer specific DMP templates to their users (e.g. EnviDat, Materials Cloud/AiiDA, openBIS). DMPonline is a web-based platform provided by the Digital Curation Centre (DCC) at the Universities of Edinburgh and Glasgow for research institutions. It is used by a wide range of research institutions worldwide to create, review and share data management plans (DMPs) that meet both institutional and funder requirements. DMPonline includes the latest funder templates and best practice guidance to help users produce high-quality DMPs.
Data acquisition, storage and annotation
All institutions in the ETH Domain provide some sort of basic storage service to their researchers. Examples include various levels of NAS storage, long-term storage (on tapes), and high-performance storage systems (e.g., parallel file systems on HPC clusters). Apart from centrally provided storage solutions, individual institutes or research groups often still operate their own storage infrastructures.
EPFL: Depending on the department, different Electronic Lab Notebooks (ELN) are preferred: Slims for life sciences and engineering, Eln.epfl.ch for spectroscopic data, and RSpace in chemistry. For secure data acquisition, REDCap is used at EPFL as well as at other institutions in the ETH Domain. The Open Sample platform, which allows scientists to search for antibodies, plasmids, cells or other biomedical research tools, provides information on whether this material has been used at EPFL.
ETH Zurich: openBIS is used as a combined data management solution, Electronic Lab Notebook (ELN) and Laboratory Information Management System (LIMS). It is currently used by about 70 research groups in 12 different departments.
Data processing and analysis
Computer Notebooks: Throughout the ETH Domain, a number of central notebook platforms have been established mainly based on JupyterHub and R Markdown. They support interactive scientific computing and are used in many research disciplines for exploratory scientific data analysis. Since they allow the exchange of a “computational history” (code, documentation, results, etc.), they also support OR processes.
Version control: Version control is a key technique for the proper management of software code and text-based data, which form the basis for most data processing and analysis in the ETH Domain. Git is currently the most popular version control system for professional software development. It is widely used in research areas where coding of simulation or analysis workflows is an integral part of the research process. Git-based platforms such as GitLab (open-source) or GitHub not only enable professional code management and reproducible research, but also support ORD by providing researchers with a platform for publishing code and data.
Platforms: Great efforts are being made in the ETH Domain to consolidate the various technologies for data processing and analysis into larger platforms. This is intended to lower the barriers to entry to current best practices. For example, the Swiss Data Science Center (SDSC) created the Renkulab platform, which combines many tools for collaborative data analysis. EPFL developed AiiDA, an open-source Python infrastructure that supports researchers in high-throughput computations that can last from seconds to weeks.
Publication and reuse of data
Once created, research data must be properly stored and annotated (documented). Subsequently, the data are usually processed and analyzed, resulting in derived data sets. Some or all of these data sets are eventually selected for long-term storage and publication in data repositories. Only a small subset of the published data sets is ultimately reused, starting the data cycle again for a new research project. Currently, the ecosystem of research data repositories in Switzerland is markedly diverse.
Publications
All institutions of the ETH Domain provide their researchers with an institutional publication repository. EPFL uses Infoscience as its institutional publication repository, which is managed by the EPFL Library. ETH Zurich uses the Research Collection as an institutional repository for publications and research data of ETH Zurich’s own scientists. At the end of 2022, it held about 241,000 publications. The four research institutes of the ETH Domain share the platform “Digital Object Repository at the Four Research Institutes” (DORA 4RI) as an institutional repository. It provides a directory of all publications (as of 2023: 75,600) by researchers at the institutes.
Repositories for research data
Research data of all kinds without a subject-specific focus can be stored in general data repositories, for example that of the respective institution. There are also cross-institutional subject-specific data repositories. Researchers are free to choose the most appropriate repository as long as it can be considered compliant with the FAIR standard. At ETH Zurich, the Research Collection is operated by the ETH Library as an institutional general-purpose data repository. All research data in it are assigned a Digital Object Identifier (DOI). At EPFL, Zenodo is currently the most widely used general-purpose research data repository. It was built and is operated by CERN and OpenAIRE, with data stored at the CERN Data Center. The Eawag Research Data Institutional Collection (ERIC) is a repository specifically for archiving and disseminating research data produced by Eawag scientists. The WSL’s Environmental Data Portal (EnviDat) is a specialized repository using DOIs for environmental research data. It hosts and publishes environmental research data from WSL’s research units and collaborating partners from other institutions, including those in the ETH Domain (e.g. EPFL, ETHZ, Eawag, PSI – limited to environmental data). At PSI, research data is stored in the central Data Catalog. Data and metadata collected by internal and external users at the photon research facilities SLS and SwissFEL are automatically deposited in the catalog at a rate of 3-4 PB/year.
Repository name | Hosting institution | Other data-providing institutions | Repository type | Software | PID type | Selected statistics | Remarks |
---|---|---|---|---|---|---|---|
Digital Object Repository at the Four Research Institutes (DORA) | Lib4RI | Eawag, Empa, PSI, WSL | Publication | Islandora | DOI | ≈75’600 publications | Further statistics can be found here |
Infoscience | EPFL | None | Publication | Invenio | DOI | ≈162’000 | Next release to include datasets |
Eawag Research Data Collection (ERIC) | Eawag | None | General data | CKAN | DOI | ≈150 open datasets, ≈500 internal datasets | |
ETH Research Collection | ETHZ | None | Publication, General data | DSpace | DOI | ≈241’000 publications, ≈1700 datasets (42 TB total volume) | |
Data Catalog | PSI | Facility users, CSCS (see remarks) | General data | SciCat | DOI, PID | >400’000 datasets, 9 PB total volume, >1600 groups of users | Active proposal to provide broader access to ETH institutions |
Zenodo | CERN | Open | General data | Invenio | DOI | | Some institutional customization, e.g. via “EPFL groups” |
EnviDat | WSL | Collaborations approved by WSL (see remarks) | Domain-specific (Environmental Sciences) | CKAN | DOI | ≈540 datasets (20 TB total volume) | Currently hosting environmental datasets from WSL and other collaborating institutions |
Materials Cloud | EPFL | PSI, Empa | Domain-specific data (Material Sciences) | Invenio (customized) | DOI | ≈22M crystal structures, ≈7.5M simulations | |
Living Archives | EPFL | None | Domain-specific data (Architecture) | In-house | PID | ≈11’000 items | |
Overview of commonly used repositories in the ETH Domain for publication of research data and outputs.
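Most of the repositories listed above assign DOIs as persistent identifiers. As a minimal sketch of how a DOI can be turned into a resolvable link, the Python snippet below normalizes a DOI string and builds a URL for the public resolver at https://doi.org/. The function names, the syntax pattern, and the example DOI `10.1234/example.dataset.1` are illustrative assumptions and are not taken from any ETH Domain service; the pattern only checks plausibility and does not confirm that a DOI is actually registered.

```python
import re
from urllib.parse import quote

# Assumption: a pragmatic plausibility check ("10.", a 4-9 digit registrant
# code, "/", and a non-empty suffix). It does not prove the DOI exists.
DOI_PATTERN = re.compile(r"^10\.\d{4,9}/\S+$")

# Prefixes that people commonly paste in front of the bare DOI.
KNOWN_PREFIXES = (
    "https://doi.org/",
    "http://doi.org/",
    "https://dx.doi.org/",
    "http://dx.doi.org/",
    "doi:",
)


def normalize_doi(raw: str) -> str:
    """Return the bare DOI without surrounding whitespace or a known prefix."""
    doi = raw.strip()
    for prefix in KNOWN_PREFIXES:
        if doi.lower().startswith(prefix):
            doi = doi[len(prefix):]
            break
    return doi.strip()


def is_plausible_doi(raw: str) -> bool:
    """Check only the syntax of the DOI, not its registration status."""
    return DOI_PATTERN.match(normalize_doi(raw)) is not None


def doi_to_url(raw: str) -> str:
    """Build a resolver URL for a syntactically plausible DOI."""
    doi = normalize_doi(raw)
    if not DOI_PATTERN.match(doi):
        raise ValueError(f"not a plausible DOI: {raw!r}")
    # Percent-encode characters that are unsafe in a URL path, keeping "/".
    return "https://doi.org/" + quote(doi, safe="/")


if __name__ == "__main__":
    # Hypothetical DOI used purely for illustration.
    print(doi_to_url("doi:10.1234/example.dataset.1"))
```

Such a helper could be used, for example, when citing datasets in a DMP or README so that every identifier is stored in one consistent form. Whether a given link resolves still depends on the repository having registered the DOI.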
Data preservation and disposal
Long-term archiving ensures that valuable research data and thus scientific results remain interpretable and reusable for years to come. In general, a distinction can be made between curated and non-curated solutions. Curated solutions are often referred to as archives and are typically located in the library domain. They are important for ORD because professional data archives ensure the long-term accessibility and interpretability of digital objects. In non-curated long-term storage (e.g., tape libraries), the interpretability of stored data is generally not guaranteed. File formats may become unreadable, data may be poorly written or not written at all, and even knowledge of the existence of data may fade over time because accessible metadata is not available. Non-curated long-term storage solutions, sometimes referred to as “data graveyards,” are therefore of limited relevance to ORD.
The ETH Data Archive is the storage solution for research data at ETHZ; it also serves as the storage layer behind the Research Collection for objects that should remain usable for longer than ten years. At EPFL, the Academic Output Archive (ACOUA) serves as the long-term repository for research data produced by EPFL researchers. ACOUA was launched in the first quarter of 2021 and is curated by the EPFL Library.