The Seedcase Project: Data engineering tools for building and managing open and FAIR research data

Authors

Kristiane Beicher

Signe K. Brødbæk

Luke W. Johnston

Joel Ostblom

Marton Vago

FAIR and open practices are beneficial for science as they encourage reproducible, transparent, and shareable data practices. While individual researchers are sometimes capable of implementing FAIR and open practices on their own, the availability of software and data engineering resources and tools greatly facilitates the adoption of these practices. Unfortunately, there is a lack of systemic structures to incentivise researchers, organisations, and funding agencies to appreciate the importance of software and data engineering in research contexts and to dedicate sufficient resources to implementing them in practice. A major consequence of this incentive misalignment is that researchers trying to adhere to FAIR and open practices face organisational and technical challenges with managing, sharing, building, and using research data, which in turn hampers scientific progress.

In the Seedcase Project (<seedcase-project.org>), we aim to address some of these issues by developing a framework that facilitates the creation and management of FAIR research data. To reach this goal, we are developing a set of interoperable and modular tools that together constitute a modern, organised, and robust framework.

In addition to publishing the data engineering tools, we also publish our internal developer tools, documentation, and guides to effective collaborative practices. Through these publications, we contribute to our ultimate goal of helping researchers do “better science in less time”, not only in academic settings in Denmark but also in industry and across the globe. Our progress towards this goal can be followed in our roadmap at <seedcase-project.org/roadmap/>, which lists all of our products and their development status.

The data engineering tools we’ve published so far include:

Sprout (<sprout.seedcase-project.org>): A Python package that helps create and manage a standardised and organised structure for storing and describing research data as a Data Package. Using modern data engineering and management practices, Sprout promotes research data that are well designed, discoverable, well documented, and ultimately (re)usable for later analyses. This package is built around the Data Package standard (<datapackage.org>).
check-datapackage (<check-datapackage.seedcase-project.org>): A Python package that checks whether a Data Package’s metadata complies with the Data Package standard.
Template Data Package (<template-data-package.seedcase-project.org>): A template for creating a new Data Package that follows the Seedcase structure with the necessary files and configurations in place.
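
To make the idea of checking metadata against the Data Package standard concrete, here is a minimal sketch in plain Python. It does not use the Sprout or check-datapackage APIs. The `find_issues` helper, the example descriptor, and its file paths are hypothetical, and the sketch covers only a few of the standard's requirements: a non-empty `resources` list, and for each resource a `name` plus exactly one of `path` or `data`.

```python
import json

# Hypothetical contents of a datapackage.json file, for illustration only.
descriptor_text = """
{
  "name": "example-study",
  "resources": [
    {"name": "participants", "path": "data/participants.csv"},
    {"name": "visits"}
  ]
}
"""


def find_issues(descriptor: dict) -> list[str]:
    """Return readable problems for a small subset of required properties."""
    resources = descriptor.get("resources")
    if not isinstance(resources, list) or not resources:
        return ["'resources' must be a non-empty list"]

    issues = []
    for index, resource in enumerate(resources):
        if "name" not in resource:
            issues.append(f"resources[{index}]: missing 'name'")
        # The data location must be given by exactly one of 'path' or 'data'.
        if ("path" in resource) == ("data" in resource):
            issues.append(f"resources[{index}]: needs exactly one of 'path' or 'data'")
    return issues


issues = find_issues(json.loads(descriptor_text))
for issue in issues:
    print(issue)
```

In this example the second resource is reported because it has neither `path` nor `data`. A real checker would validate the full standard, including property types and formats, which this sketch leaves out.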

The tools that are planned or in early development include:

Flower (<flower.seedcase-project.org>): A Python package and command-line tool to generate and display the datapackage.json file in a human-friendly way. In early development.
Propagate: A planned tool to make it easier to submit and process reproducible requests for a subset of data from a larger Data Package through a machine-readable set of instructions.
Garden: A planned tool to manage and track research projects that use a particular Data Package.
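
As a rough illustration of what displaying a datapackage.json file in a human-friendly way could mean, the sketch below turns a descriptor into a short plain-text summary. It is not Flower's implementation or API. The `render_summary` function and the example descriptor are hypothetical, and the sketch assumes each resource's `schema` is given inline as an object rather than as a path to a separate file.

```python
def render_summary(descriptor: dict) -> str:
    """Build an indented plain-text overview of a Data Package descriptor."""
    # Prefer the human-readable title, falling back to the machine name.
    heading = descriptor.get("title") or descriptor.get("name") or "(untitled)"
    lines = [f"Data Package: {heading}"]
    for resource in descriptor.get("resources", []):
        lines.append(f"  Resource: {resource.get('name', '(unnamed)')}")
        # Assumes an inline schema object; a schema given as a path is not handled.
        for field in resource.get("schema", {}).get("fields", []):
            field_type = field.get("type", "type not given")
            lines.append(f"    - {field['name']} ({field_type})")
    return "\n".join(lines)


# Hypothetical descriptor, for illustration only.
example = {
    "title": "Example study",
    "resources": [
        {
            "name": "participants",
            "schema": {
                "fields": [
                    {"name": "id", "type": "integer"},
                    {"name": "enrolled", "type": "date"},
                ]
            },
        }
    ],
}

summary = render_summary(example)
print(summary)
```

Reading the descriptor once and rendering it as text keeps the metadata as the single source of truth, so the human-friendly view can be regenerated whenever the datapackage.json file changes.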

This project is funded by a Novo Nordisk Foundation grant (number NNF21OC0069462).