Expanding the Research Data Management Service Portfolio at Bielefeld University According to the Three-pillar Principle Towards Data FAIRness

Research Data Management (RDM) at Bielefeld University is treated as a cross-cutting task shared among central facilities and research groups at the faculties. While it started as the project “Bielefeld Data Informium”, lasting over seven years (2010–2015), it is now being expanded by setting up a Competence Center for Research Data. The evolution of the institutional RDM is based on the three-pillar principle: 1. Policies, 2. Technical infrastructure and 3. Support structures. The problem of data quality and the issues with reproducibility of research data are addressed in the project Conquaire. It is creating an infrastructure for the processing and versioning of research data which will ultimately allow research data to be published in the institutional repository. Conquaire extends the existing RDM infrastructure in three ways: with a Collaborative Platform, Data Quality Checking, and Reproducible Research.


Introduction
Bielefeld University, founded in 1969, is guided by a high demand for interdisciplinary research and quality of research-oriented teaching. This is reflected in the approach of "transcending boundaries", i.e. the problem-oriented transcending of boundaries between subjects, between scientific cultures, and between science and society, together with the support and promotion of openness and transparency in research. Four strategic research priorities are pursued that are interwoven across three cross-cutting themes. In addition, an interdisciplinary approach to the topic of data science is pursued. Bielefeld University hosts about 25,000 students, 1,500 academic staff and 280 professors, and covers a wide range of disciplines across 13 faculties, the Graduate School and the CITEC Cluster of Excellence. Not least, a third-party funding share of more than 20% (EUR 60.4 million out of a total budget of EUR 294.4 million in 2017) underlines the need for a structured, institutional research data management approach.
Building on a strong and diverse multidisciplinary research landscape at the university, two developments sharply increased awareness of structured research data management at the institution a decade ago. The first was a radical shift in research methods from "analogue" experimental laboratories and printed materials to various digital tools and research artefacts, which created a demand for new research infrastructures and services. The second was the set of requirements of research funding agencies at the international and national level, which called for digital infrastructure measures that would implement and sustain the preservation of primary research data (HRK, 2014; Steering Committee for the "Digital Information" Initiative of the Alliance of Science Organisations in Germany, 2017). An important impetus for the implementation of a consistent and systematic data management approach and the development of knowledge services for research was the University's application under the Excellence Initiative of the German Federal Ministry of Education

Policies, Technical Infrastructure and Support
Similar to the practice in other research libraries in Europe (Tenopir et al., 2017), the approach in "Informium" aimed to involve the library in pilots developing subject-specific research infrastructures, in order to derive generic requirements for research data management and to lay the foundation for a basic service portfolio that can be offered across the campus and across disciplines. Consequently, the service portfolio is based on the three pillars of technical infrastructure, policies and support, as illustrated in Figure 1.
Bielefeld University was one of the first universities in Germany to adopt principles and guidelines on the handling of research data (2011, https://data.uni-bielefeld.de/en/policy), which were further developed into a resolution on research data management (2013, https://data.uni-bielefeld.de/en/resolution).
The guidelines and resolution emphasize the consistent consideration of research data management aspects throughout the research lifecycle. They give important guidance on how to proceed and provide for basic technical and advisory services, such as:
• the use of appropriate and relevant subject-specific standards from data acquisition to publication and documentation of research data
• the provision of data management plans, in particular as part of new data-intensive research proposals
• the aim that research data management in institutions and projects makes data widely available and retains it for the long term in order to facilitate its reuse for research, practice, and the general public, while balancing the need to protect intellectual property rights, personal data, and obligations to third parties
• the promotion of a sustained embedding and further development of high-quality research data management by addressing subject-specific methodological training and imparting principles of good academic practice (data literacy)
For permanent support of the implementation of the research data management resolution, the coordination center for research data has been established at the library. It acts as a campus-wide point of contact for all questions on research data management and publishing as well as for concrete, temporary cooperation in pilot projects. One such example is the Data Service Center for Business and Organizational Data (DSZ-BO), developed jointly by researchers at the Faculty of Sociology and the library, where the library has contributed its expertise in data modeling and microdata documentation in the Data Documentation Initiative (DDI) metadata framework. As an international standard, DDI supports the documentation of the entire life cycle of data. A key element of this service is a data catalog for studies encoded in DDI, which enables an interoperable exchange of studies between data centers (Edler et al., 2012).
The cooperation between discipline-specific research projects and the coordination center for research data benefits both sides. As part of the DSZ-BO, requirements regarding the documentation of studies were implemented, and special metadata requirements of organizational research in DDI-3.1 were fed back to the DDI community. DSZ-BO was also an opportunity to roll out and adapt a data publishing platform based on DKAN, a Drupal implementation of CKAN. It provides a flexible modular solution that combines bibliography, backup management and data visualization. Experiences gained from this pilot have been implemented in an infrastructure service offering, which can now systematically be made available to other researchers and their research projects.
Furthermore, the coordination center has been participating in Collaborative Research Centers (CRC), including the CRC-882 "From Heterogeneities to Inequalities" and the CRC-1288 INF "Practices of comparing. Ordering and changing the world". Since 2007 the German Research Foundation (DFG) has supported "Information Infrastructure" (INF) projects to facilitate research data management within a CRC (DFG, 2017). INF projects provide important impetus for continuous innovation regarding the technical infrastructure for research data.
Experiences from these projects have resulted in a number of cross-disciplinary services which form the technical infrastructure for research data management today, including:
• the Data Management Planning Tool, which implements funder-specific templates such as the Open Data Pilot in the EC Horizon 2020 research and innovation framework and templates for research institutes at the university
• the institutional, cross-disciplinary registration of research data and other research outputs with the Digital Object Identifier (DOI)
• the participation in sciebo, the CampusCloud based on ownCloud, and support of the follow-up project sciebo-rds, which aims to extend sciebo as a low-entry-barrier RDM service (Vogl et al., 2016; Wilms et al., 2016)
• GitLab, as a collaborative web-based manager for distributed revision control for software and data (Ram, 2013)
Moreover, the functionality of the institutional repository "Publications at Bielefeld University" (PUB) (https://pub.uni-bielefeld.de/) has been extended by the deposit of research data, the linking of research data with publications (see Figure 2), and their recording in the Data Citation Index. As a quality assurance measure and to increase the acceptance of the repository by the researchers, the repository team applied successfully for the Data Seal of Approval (DSA) certification. With nearly 240 data publications, the number is still low compared to research publications, but the annual growth rate of data deposits is increasing.
The technical infrastructure support is complemented by comprehensive advisory services on publication workflows and data management planning that consider funders' recommendations, policies, and mandates (DFG, 2015; EC, 2016) as well as legal aspects (such as data protection and software licensing). It also includes support for the consideration of RDM aspects in applications for third-party funding in order to identify any issues with research data at an early stage of the research project.
Furthermore, research data literacy and capacity-building activities are organized for students and researchers. These include biannual seminars in the staff development program for research and teaching staff. The topics range from an introduction to research data management and legal aspects of research data (e.g. conformance to the EU General Data Protection Regulation) to keeping research data and software under control with GitLab. The seminars are well attended, with 12 to 15 researchers from various disciplines. Sessions of the seminars on "Research Data Management" presenting hands-on GitLab and other Open Science services of Bielefeld University are well attended by over 50 students. GitLab has found its way into teaching, although we did not anticipate this. When lecturers approached us with this plan, the library asked the computing centre to whitelist all students for use of GitLab.
The university library set up GitLab in 2013 to support its software development projects, but it soon became clear that GitLab is crucial for collaborating with researchers from other universities, something that many other infrastructures do not cater for. As of today, GitLab at Bielefeld University has 412 active users, some of whom belong to one or more of 68 groups. Users have created 641 GitLab projects so far. It is used by the DFG collaborative project RATIO in the field of computer-assisted decision making. A digital humanities project is also among our most active users: the Luhmann co-operative effort (http://www.uni-bielefeld.de/soz/luhmann-archiv/) with the University of Cologne uses our GitLab instance to annotate digitized index cards of Niklas Luhmann based on the XML language TEI.

Data Quality and Analytical Reproducibility
The data disclosure policies of publishers and funders (Peng, 2009; Stodden et al., 2013) require scientific research data to be published alongside the research publication to enable scientific reproducibility and Open Access. However, the scientific and academic community does not have a single or precise definition for the term reproducibility.
We consider three factors important for scientific reproducibility (see Figure 3):
• availability of data
• analytical reproducibility
• quality & compliance
Furthermore, by adopting the principle of continuous integration of data, we ensure regular quality checks that produce high-quality scientific data within a scientific project, thereby validating the data that can be reused (Pasquetto et al., 2017) to reproduce the published analytical result.
The design and development of RDM services must maximize reuse and visibility of research data for the benefit of data producers (researchers) and data infrastructure service providers. To facilitate this process, we structured our data quality management service around GitLab, a repository management tool built around the Git Distributed Version Control System (DVCS) (Ram, 2013). It is a powerful collaborative research tool with web and command-line functionality and a built-in continuous integration (CI) system.

Data Irreproducibility Analyzer (DIRA)
The Data Irreproducibility Analyzer (DIRA) is the first minimal proof-of-concept implementation of our data quality framework for checking the analytical reproducibility status of CSV files.
As shown in the Conquaire system architecture (see Figure 4), when researchers commit their CSV file(s) along with the corresponding format (schema) file to their GitLab repository, the GitLab CI runners are triggered via the YAML file stored in each repository. If the format (*.fmt) file is missing, researchers receive an email notification asking them to declare and commit the format file, which is a human-readable schema file describing their research data.
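The missing-format-file check described above can be illustrated with a minimal sketch. It is not the DIRA implementation: the function name `find_missing_format_files` is hypothetical, and the pairing rule (a file `data.csv` is described by a sibling `data.fmt`) is an assumption, since the naming convention is not specified here.

```python
from pathlib import Path


def find_missing_format_files(repo_root):
    """Return CSV paths (relative to repo_root) that lack a sibling .fmt file.

    Assumption: the format file shares the CSV's base name,
    e.g. data.csv is described by data.fmt in the same directory.
    """
    root = Path(repo_root)
    missing = []
    for csv_path in sorted(root.rglob("*.csv")):
        # Look for a file with the same stem and a .fmt extension.
        if not csv_path.with_suffix(".fmt").is_file():
            missing.append(csv_path.relative_to(root).as_posix())
    return missing
```

A CI job could call such a function on the checked-out repository and trigger the notification whenever the returned list is not empty.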
Then, when the researcher recommits the data files, they receive a status report of the data quality checks via an email with a link to an HTML page displaying the results. On the front-end page, we use a triple colour encoding system to highlight the data errors on a "red", "yellow" or "green" shaded background for easy visibility. The important quality check features that we have implemented in DIRA are:
• The textual definition of the column is as per the format (*.fmt) file.
• Each column in the data record has a name that appears in the format schema file; otherwise, a warning notification is issued.
• Values are checked for being out of range or Null/NaN.
• The range of values is specified in the format file.
• Data is plain text using a character set like ASCII/Unicode (UTF-8).
• The file contains one record per line with the same sequence of fields.
• A data type for each value is declared and all values match the datatype.
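The checks listed above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the DIRA code: the schema is represented here as a plain dictionary (the actual *.fmt syntax is not described in this article), and the names `check_csv`, `CASTS` and `NULL_TOKENS` are hypothetical.

```python
import csv
import io
import math

# Assumed type names for the hypothetical schema dictionary.
CASTS = {"int": int, "float": float, "str": str}
# Assumed spellings of missing values.
NULL_TOKENS = {"", "null", "nan", "na"}


def check_csv(text, schema):
    """Validate CSV text against a schema; return a list of (level, message)."""
    findings = []
    rows = list(csv.reader(io.StringIO(text)))
    if not rows:
        return [("error", "file is empty")]
    header, records = rows[0], rows[1:]
    # Every column name should be declared in the schema, else warn.
    for name in header:
        if name not in schema:
            findings.append(("warning", f"column '{name}' is not declared"))
    for line_no, record in enumerate(records, start=2):
        # One record per line with the same sequence of fields.
        if len(record) != len(header):
            findings.append(("error", f"line {line_no}: expected "
                             f"{len(header)} fields, found {len(record)}"))
            continue
        for name, raw in zip(header, record):
            rule = schema.get(name)
            if rule is None:
                continue
            if raw.strip().lower() in NULL_TOKENS:
                findings.append(("error", f"line {line_no}: '{name}' is null or NaN"))
                continue
            try:
                value = CASTS[rule["type"]](raw)  # declared data type must match
            except ValueError:
                findings.append(("error", f"line {line_no}: '{name}' value "
                                 f"{raw!r} is not {rule['type']}"))
                continue
            if rule["type"] != "str":
                low = rule.get("min", -math.inf)
                high = rule.get("max", math.inf)
                if not low <= value <= high:
                    findings.append(("error", f"line {line_no}: '{name}' value "
                                     f"{value} is outside [{low}, {high}]"))
    return findings
```

For example, with the schema `{"temp": {"type": "float", "min": -50, "max": 60}}`, a record holding `500.0` in the `temp` column would be reported as out of range, while a column absent from the schema would only raise a warning.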
We propose that the quality warnings for PUB come with an important caveat regarding notifications: the researcher must be able to ignore those warnings that do not significantly hamper either the data analysis or the data quality.
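One way to combine the three-colour report with ignorable warnings is sketched below. The mapping is an assumption made for illustration (red for errors, yellow for warnings that have not been ignored, green otherwise); the article does not define what each colour stands for, and the function name `report_colour` and the warning codes are hypothetical.

```python
def report_colour(findings, ignored=frozenset()):
    """Map findings to 'red', 'yellow' or 'green'.

    findings: iterable of (level, code) pairs, level in {'error', 'warning'}.
    ignored:  warning codes the researcher has chosen to ignore.
    Assumption: only warnings can be ignored, errors always count.
    """
    levels = {
        level
        for level, code in findings
        if not (level == "warning" and code in ignored)
    }
    if "error" in levels:
        return "red"
    if "warning" in levels:
        return "yellow"
    return "green"
```

Under this design an ignored warning no longer affects the status, whereas an error keeps the report red even if its code appears in the ignore set.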
The research workflow described above, based on continuous integration principles, has numerous advantages:
• Potential re-users of data sets have a strong starting point to assess the formal value of research data.
• Publishing data sets via a quality verification process using GitLab enables researchers to establish a systematic research workflow.
• Reducing data errors via an automated workflow ensures the value of the data, as opposed to a data dump in the form of a *.zip file (where the data contents are unknown).
• Git and the built-in CI system reduce the ambiguity that accompanies the traditional model of sharing data at the end of the research lifecycle.
As part of our technical support services, we conducted six Git workshops for all the members of our nine partner groups, giving the participants hands-on experience in using the GitLab collaboration tool. This training served as a foundation for the Conquaire system testing workshop that was later conducted in March 2018. Since a few research groups were already using Git, the Conquaire infrastructure structured around the DVCS is considered a useful collaboration tool. Currently we are in the process of connecting PUB with GitLab, which will allow researchers to link their data for each publication with an assigned DOI. Thus, the data quality framework (Cimiano et al., 2015) defined in the Conquaire project supports the FAIR data principles (Wilkinson et al., 2016) by making data findable, archivable and reusable.

Figure 4: Conquaire System Architecture and Workflow.

Reproducibility Experiments
In order to develop a generic framework, it was important to understand the research workflow of each of our case study partner groups. With nine groups from varied disciplines collaborating on the Conquaire project, we were well aware that to create a technical infrastructure geared to solve the "reproducibility" problem, we must obtain a deeper understanding of their research workflow.
To facilitate this, we undertook an intensive Reproducibility Experiment that entailed analytically reproducing one result from a paper already published by each of our partner groups. Our goal was to understand:
• their data analysis process
• their data collection process and the storage process
• the experiment tools used by the research groups
• software analysis and tools used during their research process
• the level of open access and open science adoption within each group
Each reproducibility experiment was conducted independently and with minimal input from the original researcher(s) after the pre-processed data and research artefacts were handed over to us. For the duration of the experiment, we worked independently and spent considerable effort in understanding their data management process. We were also clear that we would not change the existing workflow of the groups drastically, but rather strive to find a collaborative way to steer each group towards Open Science in an open-ended manner.
During the course of our reproducibility experiments, we discovered numerous positive and negative practices that were prevalent within the research groups. A few research groups were already aware of Open Science methodologies and had taken measures to use Free and Open Source Software (FOSS), had some documentation about their data and the scripts, and had also chosen a license for the reuse of their data and software.
However, they were exceptions to the norm, and we discovered some common issues and shortcomings that hinder Open Science, as listed below (in brief):
Some examples are: Origin software, SPSS, etc.
3. File IO issues: Final results were stored in Excel sheets where the column names were missing, the datatype changed in the middle of a column, or there were spaces and special characters in file header names. All these make it difficult to implement a generic CI system that can automatically validate the data quality.
4. Non-homogeneous data storage platform: The lack of a single university-wide cloud storage platform to manage research data storage has resulted in ad hoc data management by each research group. Some groups use private servers or sciebo, which does not provide a reliable backup service; hence the risk of data loss is quite high.
5. Documentation: Many research projects did not document their data, their analysis process, or the scripts used for the statistical analysis and visualization. With respect to reproducibility, if a non-domain researcher were to be given the entire data set, there was no information available regarding the reproducibility aspect.
6. Licensing: Most research projects did not state clearly the licensing terms for their data, the software scripts and the documentation (if any). The lack of licences renders their data unusable by another researcher.
7. Metadata & Ontologies: Many research groups lack a standardised ontology or metadata schema. The non-availability of machine-readable metadata that describes a digital resource for their research data is a hurdle for the library in creating a semantically linked resource for the research data.
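The file IO issues named in item 3 (missing column names, spaces or special characters in header names, and datatypes changing within a column) can be detected mechanically before a file enters a CI pipeline. The following is a minimal sketch under assumptions: the table is already available as a header list and a list of string rows, the naming rule in `SAFE_HEADER` is one possible convention, and the function names `audit_table` and `_kind` are hypothetical.

```python
import re

# Assumed convention: letters, digits and underscores, not starting with a digit.
SAFE_HEADER = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")


def _kind(value):
    """Classify a cell as 'number' or 'text'."""
    try:
        float(value)
        return "number"
    except ValueError:
        return "text"


def audit_table(header, rows):
    """Return human-readable problems found in a table exported from a spreadsheet."""
    problems = []
    for index, name in enumerate(header):
        if not name.strip():
            problems.append(f"column {index + 1}: missing name")
        elif not SAFE_HEADER.match(name):
            problems.append(
                f"column {index + 1}: name {name!r} contains spaces or special characters"
            )
    for index in range(len(header)):
        # Collect the kinds of all non-empty cells in this column.
        kinds = {
            _kind(row[index])
            for row in rows
            if index < len(row) and row[index] != ""
        }
        if len(kinds) > 1:
            problems.append(f"column {index + 1}: mixed value types {sorted(kinds)}")
    return problems
```

Running such an audit on an export would flag, for instance, a header such as `Temp (°C)` and a column in which numeric readings are interleaved with entries like `n/a`.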
Currently, we are in the process of documenting these research findings in the form of a book that will be publicly released towards the end of 2018.

Figure 1: Three pillars of Research Data Management.

Figure 2: PUB as a hybrid repository for publications and data.