Bielefeld University, founded in 1969, is guided by a strong commitment to interdisciplinary research and research-oriented teaching. This is reflected in its approach of “transcending boundaries”: the problem-oriented crossing of boundaries between subjects, between scientific cultures, and between science and society, together with the promotion of openness and transparency in research. The university pursues four strategic research priorities, interwoven across three cross-cutting themes, as well as an interdisciplinary approach to the topic of data science. Bielefeld University hosts about 25,000 students, 1,500 academic staff and 280 professors, and covers a wide range of disciplines across 13 faculties, the Graduate School and the CITEC Cluster of Excellence. Not least, a third-party funding share of more than 20% (EUR 60.4 million of a EUR 294.4 million total budget in 2017) underlines the need for a structured, institutional approach to research data management.
Against the background of the university’s strong and diverse multidisciplinary research landscape, two developments sharply increased awareness of structured research data management at the institution about a decade ago. First, a radical shift in research methods from “analogue” experimental laboratories and printed materials to various digital tools and research artefacts created a demand for new research infrastructures and services. Second, research funding agencies at the international and national level called for digital infrastructure measures to implement and sustain the preservation of primary research data (HRK, 2014; Steering Committee for the “Digital Information” Initiative of the Alliance of Science Organisations in Germany, 2017). An important impetus for implementing a consistent, systematic data management approach and developing knowledge services for research was the university’s application under the Excellence Initiative of the German Federal Ministry of Education and Research and the German Research Foundation (DFG) in 2010 (Kehm, 2013). The university recognized that excellent research requires an excellent research data management infrastructure.
The current concept for research data management at Bielefeld University was developed in the “Bielefeld Data Informium” project (2010–2015, https://data.uni-bielefeld.de/de/informium). Aiming for a distributed service concept for research and education, jointly developed by researchers and central facilities such as the library and the computing center, the project realized in-house-funded pilots. This created a role model for institutional services supporting research data management: one that provides central support on the one hand, while taking subject-specific practices into account (Meier zu Verl and Horstmann, 2011) and building on networking with expert groups and initiatives within and outside the university.
Similar to the practice in other European research libraries (Tenopir et al., 2017), the “Informium” approach aimed to involve the library in pilots developing subject-specific research infrastructures, in order to derive generic requirements for research data management and to establish a basic service portfolio that can be offered across the campus and across disciplines. Consequently, the service portfolio rests on the three pillars of technical infrastructure, policies and support, as illustrated in Figure 1.
Bielefeld University was one of the first universities in Germany to adopt principles and guidelines on the handling of research data (2011, https://data.uni-bielefeld.de/en/policy), which were further developed into a resolution on research data management (2013, https://data.uni-bielefeld.de/en/resolution).
The guidelines and resolution emphasize the consistent consideration of research data management aspects throughout the research lifecycle, provide important guidance on how to proceed, and establish basic technical and advisory services, such as:
To permanently support the implementation of the research data management resolution, a coordination center for research data has been established at the library. It acts as a campus-wide point of contact for all questions on research data management and publishing, as well as for concrete, temporary cooperation in pilot projects. One such example is the Data Service Center for Business and Organizational Data (DSZ-BO), developed jointly by researchers at the Faculty of Sociology and the library, to which the library contributed its expertise in data modeling and microdata documentation in the Data Documentation Initiative (DDI) metadata framework. As an international standard, DDI supports the documentation of the entire data life cycle. A key element of this service is a data catalog of studies encoded in DDI, which enables an interoperable exchange of studies between data centers (Edler et al., 2012).
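To illustrate, a catalog record of this kind can be sketched as a minimal DDI-Codebook fragment. The following Python snippet is only a sketch: the element names follow DDI-Codebook 2.5, but the study identifier and title are hypothetical values, not actual DSZ-BO records.

```python
import xml.etree.ElementTree as ET

# DDI-Codebook 2.5 default namespace
DDI_NS = "ddi:codebook:2_5"
ET.register_namespace("", DDI_NS)


def make_study_record(study_id: str, title: str) -> ET.Element:
    """Build a minimal DDI-Codebook study description (illustrative only)."""
    codebook = ET.Element(f"{{{DDI_NS}}}codeBook")
    study = ET.SubElement(codebook, f"{{{DDI_NS}}}stdyDscr")      # study description
    citation = ET.SubElement(study, f"{{{DDI_NS}}}citation")
    title_stmt = ET.SubElement(citation, f"{{{DDI_NS}}}titlStmt")
    ET.SubElement(title_stmt, f"{{{DDI_NS}}}titl").text = title
    ET.SubElement(title_stmt, f"{{{DDI_NS}}}IDNo").text = study_id
    return codebook


# Hypothetical study identifier and title
record = make_study_record("dsz-bo-0001", "Example organizational panel study")
print(ET.tostring(record, encoding="unicode"))
```

A real study record would additionally document the file and variable level (`fileDscr`, `dataDscr`), which is what makes DDI suitable for describing the entire data life cycle.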
The cooperation between discipline-specific research projects and the coordination center for research data benefits both sides. As part of the DSZ-BO, requirements regarding the documentation of studies were implemented, and special metadata requirements of organizational research in DDI 3.1 were fed back to the DDI community. The DSZ-BO was also an opportunity to roll out and adapt a data publishing platform based on DKAN, a Drupal implementation of CKAN, which provides a flexible, modular solution combining bibliography, backup management and data visualization. Experiences gained from this pilot have been incorporated into an infrastructure service offering that can now systematically be made available to other researchers and their research projects.
Furthermore, the coordination center has been participating in Collaborative Research Centers (CRCs), including CRC-882 “From Heterogeneities to Inequalities” and the INF project of CRC-1288 “Practices of comparing. Ordering and changing the world”. Since 2007, the German Research Foundation (DFG) has supported “Information Infrastructure” (INF) projects to facilitate research data management within a CRC (DFG, 2017). INF projects provide important impetus for continuous innovation in the technical infrastructure for research data.
Experiences from these projects have resulted in a number of cross-disciplinary services which today form the technical infrastructure for research data management, including, for example:
Moreover, the functionality of the institutional repository “Publications at Bielefeld University” (PUB, https://pub.uni-bielefeld.de/) has been extended to support the deposit of research data, the linking of research data with publications (see Figure 2), and their indexing in the Data Citation Index. As a quality assurance measure, and to increase researchers’ acceptance of the repository, the repository team successfully applied for Data Seal of Approval (DSA) certification. With nearly 240 data publications, the number is still low compared to research publications, but data deposits are growing at an increasing annual rate.
The technical infrastructure is complemented by comprehensive advisory services on publication workflows and data management planning, considering funders’ recommendations, policies and mandates (DFG, 2015; EC, 2016) as well as legal aspects (such as data protection and the licensing of software). These services also cover the consideration of RDM aspects in applications for third-party funding, in order to identify any issues with research data at an early stage of a research project.
Furthermore, the coordination center organizes research data literacy and capacity building for students and researchers. This includes biannual seminars in the staff development program for research and teaching staff, with topics ranging from an introduction to research data management and legal aspects of research data (e.g. conformance with the EU General Data Protection Regulation) to keeping research data and software under control with GitLab. The seminars are well attended, with 12 to 15 researchers from various disciplines. Sessions of the “Research Data Management” seminars presenting hands-on GitLab and other Open Science services of Bielefeld University are attended by over 50 students. GitLab has also found its way into teaching, although we did not anticipate this: when lecturers approached us with this plan, the library asked the computing center to whitelist all students for the use of GitLab.
While the university library set up GitLab in 2013 to support its own software development projects, it soon became clear that GitLab is crucial for collaborating with researchers from other universities, which is something many other infrastructures do not cater for. As of today, GitLab at Bielefeld University has 412 active users, some of whom belong to one or more of 68 groups, and users have created 641 GitLab projects so far. It is used by the DFG collaborative project RATIO in the field of computer-assisted decision making. A digital humanities project is also among our most active users: the Luhmann cooperative effort (http://www.uni-bielefeld.de/soz/luhmann-archiv/) with the University of Cologne uses our GitLab instance to annotate digitized index cards of Niklas Luhmann, based on the XML language TEI.
The data disclosure policies of publishers and funders (Peng, 2009; Stodden et al., 2013) require scientific research data to be published alongside the research publication to enable scientific reproducibility and Open Access. However, the scientific and academic community does not have a single, precise definition of the term reproducibility.
The DFG-funded project Continuous quality control for research data (Conquaire, 2016–2019) (Cimiano et al., 2015; Wiljes et al., 2013) at Bielefeld University supports analytical reproducibility (Ayer et al., 2017a) via a generic infrastructure architecture (Ayer et al., 2017b) that enables collaboration and easy data sharing, with access to different versions of research data over time.
We consider three factors important for scientific reproducibility (see Figure 3):
Furthermore, by adopting the principle of continuous integration for data, we ensure regular quality checks that produce high-quality scientific data within a scientific project, thereby validating data that can be reused (Pasquetto et al., 2017) to reproduce the published analytical result.
The design and development of RDM services must maximize the reuse and visibility of research data for the benefit of both data producers (researchers) and data infrastructure service providers. To facilitate this, we structured our data quality management service around GitLab, a repository management tool built on the Git Distributed Version Control System (DVCS) (Ram, 2013). It is a powerful collaborative research tool with web and command-line functionality and a built-in continuous integration (CI) system.
The Data Irreproducibility Analyzer (DIRA) is the first minimal proof-of-concept implementation of our data quality framework for checking the analytical reproducibility status of CSV files.
As shown in the Conquaire system architecture (see Figure 4), when a researcher commits their CSV file(s), along with the corresponding format (schema) file, into their GitLab repository, the GitLab CI runners are triggered via the YAML file stored in each repository. If the format (*.fmt) file is missing, researchers receive an email notification asking them to declare and commit the format file, a human-readable schema file describing their research data.
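Such a trigger can be declared in the repository’s `.gitlab-ci.yml`. The following is a hypothetical minimal sketch, not Conquaire’s actual configuration: the job name, runner image and checker script are assumptions.

```yaml
# Hypothetical CI configuration: run data quality checks on every push
stages:
  - quality

check_data:
  stage: quality
  image: python:3                    # assumed runner image
  script:
    - python dira_check.py data/     # assumed checker entry point
  rules:
    - changes:                       # run only when data or schema files changed
        - "**/*.csv"
        - "**/*.fmt"
```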
When the researcher then recommits the data files, they receive a status report of the data quality checks via email, with a link to an HTML page displaying the results. On this front-end page, we use a triple colour-coding system that highlights data errors on a “red”, “yellow” or “green” background for easy visibility.
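The shape of such a check can be sketched in Python. The snippet below is an illustration of the idea only, not DIRA’s actual code: the one-`name,type`-pair-per-line layout of the `.fmt` file and the severity rules are assumptions.

```python
import csv

def load_format(fmt_path):
    """Read an assumed schema layout: one 'column_name,type' pair per line."""
    with open(fmt_path, newline="") as f:
        return [tuple(line.strip().split(",")) for line in f if line.strip()]

def check_csv(csv_path, schema):
    """Return a traffic-light status ('green'/'yellow'/'red') and the problems found."""
    problems = []
    with open(csv_path, newline="") as f:
        reader = csv.reader(f)
        header = next(reader)
        expected = [name for name, _ in schema]
        if header != expected:
            problems.append(("red", f"header {header} does not match declared {expected}"))
        for lineno, row in enumerate(reader, start=2):
            if len(row) != len(schema):
                problems.append(("red", f"line {lineno}: wrong number of columns"))
                continue
            for value, (name, ctype) in zip(row, schema):
                # Assumed rule: a non-parseable value is a warning, not an error
                if ctype == "int" and not value.lstrip("-").isdigit():
                    problems.append(("yellow", f"line {lineno}: '{name}' is not an integer"))
    if any(severity == "red" for severity, _ in problems):
        return "red", problems
    return ("yellow", problems) if problems else ("green", problems)
```

A report generator would then render the returned problem list on the HTML results page with the corresponding background colour.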
The important quality check features that we have implemented in DIRA are:
The quality warnings proposed for PUB must take into account an important caveat regarding notifications: the researcher must be able to ignore warnings that do not significantly hamper either the data analysis or the data quality.
The above described research workflow based on continuous integration principles has numerous advantages, namely:
As part of our technical support services, we provided training and conducted six Git workshops for the members of our nine partner groups, giving participants hands-on experience in using the GitLab collaboration tool. This training served as a foundation for the Conquaire system testing workshop conducted in March 2018. Since a few research groups were already using Git, the Conquaire infrastructure, structured around the DVCS, is considered a useful collaboration tool.
We are currently connecting PUB with GitLab, which will allow researchers to link the data for each publication with an assigned DOI. Thus, the data quality framework (Cimiano et al., 2015) defined in the Conquaire project supports the FAIR data principles (Wilkinson et al., 2016) by making data findable, accessible and reusable.
In order to develop a generic framework, it was important to understand the research workflow of each of our case study partner groups. With nine groups from varied disciplines collaborating on the Conquaire project, we were well aware that creating a technical infrastructure geared to solving the “reproducibility” problem requires a deeper understanding of each group’s research workflow.
To facilitate this, we undertook an intensive Reproducibility Experiment that entailed analytically reproducing one result from a paper already published by our partner groups. Our goal was to understand:
Each reproducibility experiment was conducted independently and with minimal input from the original researcher(s) once the pre-processed data and research artefacts had been handed over to us. For the duration of the experiment, we worked independently and spent considerable effort understanding each group’s data management process. We were also clear that we would not change the existing workflows of the groups drastically, but rather strive to find a collaborative way to steer each group towards Open Science in an open-ended manner.
During the course of our reproducibility experiments, we discovered numerous positive and negative practices that were prevalent within the research groups. A few research groups were already aware of Open Science methodologies and had taken measures to use Free and Open Source Software (FOSS), had some documentation about their data and the scripts, and had also chosen a license for their data and software reuse.
However, these groups were the exception rather than the norm, and we discovered some common issues and shortcomings that hinder Open Science, as listed below (in brief):
Currently, we are in the process of documenting these research findings in the form of a book that will be publicly released towards the end of 2018.
From the perspective of Bielefeld University Library, research data management has become one of several strategically important fields of activity in recent years. This manifests itself both in the strategy paper of the library (Knorn, 2017), which is coordinated with the university administration, and on the national level in the position paper of the German Library Association which shapes objectives up to 2025 (Deutscher Bibliotheksverband, 2018).
As a result, the previous achievements in research data management at the university are well recognized. There is a growing demand for research data management in terms of human resources, coordination, and technical and subject-specific infrastructure. This is in line with the requirements of the research funding agencies, which both require subject-specific solutions for research data management and research data curation, and demand a statement in funding applications on quality assurance, management and long-term availability of research data.
Therefore, turning new challenges into opportunities, the library, the computing center “BITS” and the Bielefeld Center for Data Science (BiCDaS) jointly developed a concept for a competence center for research data that extends the existing coordination center for research data. It is understood that RDM is a cross-cutting task of central facilities and research groups in the faculties. An inventory of previously published research data and current research projects highlights the heterogeneity of file formats and types, the variability of storage infrastructure requirements, and the spectrum between generic and discipline-specific infrastructures. Additionally, comparisons with surveys at other academic institutions revealed a primary need for:
The Competence Center for Research Data is planned to start its work by the end of 2018, although its continuity will also need to be ensured in the long run. Its main areas of work are shown in Figure 5.
This research/work was supported by the German Research Foundation (DFG) and the Cluster of Excellence Cognitive Interaction Technology ‘CITEC’ (EXC 277) at Bielefeld University, which is funded by the German Research Foundation (DFG).
The authors have no competing interests to declare.
Ayer, V, Pietsch, C, Vompras, J, Schirrwagen, J, Wiljes, C, et al. 2017a. Conquaire: Towards an architecture supporting continuous quality control to ensure reproducibility of research. D-Lib Magazine, 23(1/2). DOI: https://doi.org/10.1045/january2017-ayer
Ayer, V, Pietsch, C, Vompras, J, Schirrwagen, J, Wiljes, C, et al. 2017b. Enabling Git based research data quality control for institutional repositories. http://nbn-resolving.de/urn:nbn:de:0070-pub-29161886. Presented at Research Data Alliance (RDA), Repository Platforms for Research Data IG, 9th Plenary meeting, Barcelona.
Deutscher Bibliotheksverband. 2018. Wissenschaftliche Bibliotheken 2025. https://www.bibliotheksverband.de/fileadmin/user_upload/Sektionen/sektion4/Publikationen/WB2025_Endfassung_endg.pdf.
Cimiano, P, McCrae, J, Jahn, N, Pietsch, C, Schirrwagen, J, et al. 2015. CONQUAIRE: Continuous quality control for research data to ensure reproducibility: An institutional approach. DOI: https://doi.org/10.5281/zenodo.31298
Deutsche Forschungsgemeinschaft. 2015. DFG Guidelines on the Handling of Research Data. http://www.dfg.de/download/pdf/foerderung/antragstellung/forschungsdaten/guidelines%20research%20data.pdf.
Deutsche Forschungsgemeinschaft. 2017. Guidelines Collaborative Research Centres. http://www.dfg.de/formulare/50_06/50_06_en.pdf.
Edler, S, Meyermann, A, Gebel, T, Liebig, S and Diewald, M. 2012. The German Data Service Center for Business and Organizational Data (DSC-BO). Schmollers Jahrbuch, 132(4): 619–634. DOI: https://doi.org/10.3790/schm.132.4.619
EUROPEAN COMMISSION Directorate-General for Research & Innovation. 2016. Guidelines on FAIR Data Management in Horizon 2020. http://ec.europa.eu/research/participants/data/ref/h2020/grants_manual/hi/oa_pilot/h2020-hi-oa-data-mgt_en.pdf.
German Rectors’ Conference (HRK). 2014. Management of research data – a key strategic challenge for university management. Recommendation by the 16th General Meeting of the HRK on 13 May 2014 in Frankfurt/Main. https://www.hrk.de/fileadmin/vmigrated/content_uploads/HRK_Empfehlung_Forschungsdaten_13052014_EN.pdf.
Kehm, BM. 2013. To be or not to be? The impacts of the excellence initiative on the German system of higher education. In: Institutionalization of world-class university in global competition, 81–97. Springer DOI: https://doi.org/10.1007/978-94-007-4975-7_6
Knorn, B. 2017. Strategieprozess der Universitätsbibliothek Bielefeld und seine Folgen. Pro Libris, 22(3). https://www.bibliotheken-nrw.de/fileadmin/Dateien/Daten/Aktuelles/170921-Pro_Libris-web_Einzelseiten.pdf.
Meier zu Verl, C and Horstmann, W. (eds.) 2011. Studies on Subject-Specific Requirements for Open Access Infrastructure. Universitätsbibliothek. DOI: https://doi.org/10.2390/PUB-2011-1
Pasquetto, I, Randles, B and Borgman, C. 2017. On the reuse of scientific data. Data Science Journal, 16: 8. DOI: https://doi.org/10.5334/dsj-2017-008
Peng, RD. 2009. Reproducible research and biostatistics. Biostatistics, 10(3): 405–408. DOI: https://doi.org/10.1093/biostatistics/kxp014
Ram, K. 2013. Git can facilitate greater reproducibility and increased transparency in science. Source Code for Biology and Medicine, 8(1): 7. ISSN: 1751-0473. DOI: https://doi.org/10.1186/1751-0473-8-7
Steering Committee for the “Digital Information” Initiative of the Alliance of Science Organisations in Germany. 2017. Shaping digital transformation in science. “Digital Information” Initiative by the Alliance of Science Organizations in Germany. Mission statement 2018–2022. DOI: https://doi.org/10.2312/allianzoa.016
Stodden, V, Guo, P and Ma, Z. 2013. Toward reproducible computational research: an empirical analysis of data and code policy adoption by journals. PloS one, 8(6): e67111. DOI: https://doi.org/10.1371/journal.pone.0067111
Tenopir, C, Talja, S, Horstmann, W, Late, E, Hughes, D, et al. 2017. Research data services in European academic research libraries. Liber Quarterly, 27(1): 23–44. ISSN: 2213-056X. DOI: https://doi.org/10.18352/lq.10180
Vogl, R, Rudolph, D, Thoring, A, Angenent, H, Stieglitz, S, et al. 2016. How to Build a Cloud Storage Service for Half a Million Users in Higher Education: Challenges Met and Solutions Found. In: 2016 49th Hawaii International Conference on System Sciences (HICSS), 5328–5337. ISSN: 1530-1605. DOI: https://doi.org/10.1109/HICSS.2016.658
Wiljes, C, Jahn, N, Lier, F, Paul-Stueve, T, Vompras, J, et al. 2013. Towards Linked Research Data: An Institutional Approach. In: García Castro, A, Lange, C, Lord, P and Stevens, R (eds.), 3rd Workshop on Semantic Publishing (SePublica), 994: 27–38.
Wilkinson, MD, Dumontier, M, Aalbersberg, IJ, Appleton, G, Axton, M, et al. 2016. The FAIR Guiding Principles for scientific data management and stewardship. Scientific data, 3. DOI: https://doi.org/10.1038/sdata.2016.18
Wilms, K, Meske, C, Stieglitz, S, Rudolph, D and Vogl, R. 2016. How to Improve Research Data Management. In: Yamamoto, S (ed.), Human Interface and the Management of Information: Applications and Services, 434–442. Cham: Springer International Publishing. ISBN: 978-3-319-40397-7. DOI: https://doi.org/10.1007/978-3-319-40397-7_41