Academic libraries are experts at identifying, selecting, organizing, describing, preserving, and providing access to information materials, both print and digital. They have been a safe harbour for research publications of all disciplines for centuries. As a cornerstone of the academic institution, libraries have a demonstrated and sustainable approach to providing services such as collection management, preservation, and access to a broad variety of information. However, we are experiencing a paradigm shift in the way knowledge is shared. Declining usage of traditional library services, the increased availability of eBooks and e-articles, and the rise of open access to academic resources are just some examples. Moreover, the present age of digitalization is enabling fundamental changes in how we experience and use the products of research, innovation and personal ideas.
Therefore, it becomes evident that the academic library’s role as a gateway to high quality research information is changing. The need for physical collections and traditional systems for the organization of information must be reconsidered. The challenge lies in how to invest in and develop innovation while at the same time continuing selected traditional services. It also includes evaluating services, sharing the results and consequently transforming the workflows toward more digital stewardship. Digital curation services for research data (Keil 2014; Tenopir et al. 2014; Swanson & Rinehart 2016) and other content such as audiovisual media, images, digital lab journals, multimedia, 3D objects, statistics and software code (Koltay 2016) are new key activities for academic libraries.
Institutions and researchers recognize that the longevity of digital resources may vary depending on the type of data or research. For example, DNA sequences may be outdated after a decade, whilst taxonomy data may be relevant forever. Possibly the most important challenge, however, was stated in a white paper by the internationally operating foundation on research data issues, the Research Data Alliance (RDA) (RDA Europe 2014): “The right minds get the right data at the right time.” If you slightly change this phrase to ‘make sure that the right minds get the right (digital) resources at the right (any?) time’ the future role of research libraries becomes apparent. Data libraries need to be established that record the old and new research findings of this exciting age of digitization, particularly for research data and other digital research outputs that are not served by big data centres. But how can this goal be achieved? How can we organize a digital network, a virtual library, that helps to preserve research outputs such as research data, audiovisual media, databases, texts, images, spreadsheets, digital lab journals, multimedia, 3D objects, statistics and software code, so that they can be found and cited? And how can these resources be re-used to test new hypotheses, combine data and reproduce scientific findings?
When it comes to digital academic resources such as journal articles and the emerging research data, the use of unique and persistent identifiers (PIDs) has become a central aspect of proper data management and access. DOIs (Digital Object Identifiers = DOI® names) were originally devised for publications of scientific findings, as the core technology for referring to the electronic version of a journal article. A DOI consists of a unique character string that identifies an entity in a digital environment. It therefore identifies the object itself and not the place where it is located. If the object is moved and its location (URL) changes, the only requirement is to update the URL in the underlying central database. This system ensures that the DOI persistently resolves to the location of the object (Paskin 2006). The Digital Object Identifier System is managed and administered by the International DOI Foundation (International DOI Foundation 2012).
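The indirection described above, where the identifier stays fixed while only the registered location changes, can be illustrated with a minimal sketch. It is a toy in-memory model, not the actual DOI resolution infrastructure; the class name, the DOI name and both URLs are hypothetical.

```python
class ToyDoiRegistry:
    """In-memory stand-in for the central DOI database (illustration only)."""

    def __init__(self) -> None:
        self._locations: dict[str, str] = {}

    def register(self, doi: str, url: str) -> None:
        # DOI names are case-insensitive, so a normalized key is stored.
        self._locations[doi.lower()] = url

    def update_location(self, doi: str, new_url: str) -> None:
        # Only the location changes; the identifier itself stays the same.
        key = doi.lower()
        if key not in self._locations:
            raise KeyError(f"unknown DOI: {doi}")
        self._locations[key] = new_url

    def resolve(self, doi: str) -> str:
        return self._locations[doi.lower()]


# Hypothetical DOI name and URLs, used purely for illustration.
registry = ToyDoiRegistry()
registry.register("10.1234/example-dataset",
                  "https://old-repository.example.org/data/42")

# The object moves: citations keep using the same DOI, only the record changes.
registry.update_location("10.1234/example-dataset",
                         "https://new-repository.example.org/records/42")
current_url = registry.resolve("10.1234/EXAMPLE-dataset")
```

A citation that contains the DOI therefore never has to be corrected when a repository reorganizes its URLs.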
Several registration agencies provide DOI services and registration worldwide. One of them is DataCite, a registration agency particularly dedicated to services that support the enhanced search and discovery of research content, especially research data and grey literature. The DOI community and the DataCite consortium maintain a network of services within the DOI system. The relationship of a digital resource to other attributes or resources can be captured in and exposed directly through the service itself as identifier metadata. Identifier metadata may be accessed without visiting the resource itself, thus reducing the load on repositories and catalogues. The usage of persistent identifiers such as DOIs also enables services such as event tracking of citations that otherwise would not have been realizable for dark archives.1 To achieve this, though, a PID service must be able to ensure that the types of information needed are available in the metadata. With PID services at the base, libraries are supporting researchers in depositing their research output in disciplinary research data repositories (e.g. ICPSR, GenBank and Pangaea). Additionally, they are establishing and managing institutional data repositories to support the long tail of data outside of big data disciplines. Furthermore, libraries are helping to prepare data for sharing and re-use much earlier in the research life-cycle (e.g. in the development of a Data Management Plan, a data collection, or long-term data storage).
While developing these services, research libraries need to be active players in national, European or international initiatives (Pinfield et al. 2014). TIB, for example, is active in the Research Data Alliance (RDA), the Force 11 group and ICSTI. Within RDA, the focus is on the topics of libraries for research data, the long tail of research data, publishing, cost recovery for data centres, legal interoperability, metadata and PID services.
Services Established and Lessons Learned
For more than 50 years, the German National Library of Science and Technology (TIB) in Hannover has been providing scientific information from the disciplines of engineering as well as architecture, chemistry, information technology, mathematics and physics. The TIB has an outstanding stock of core and highly specialized technical and scientific literature and represents the largest specialist library in the world for its subject areas. In the course of transitioning from a traditional library to a modern technical information centre, new services such as the DOI service are being implemented and integrated into existing services. Meanwhile, a wide range of new digital content is emerging, using grey literature (i.e. reports (annual, research, technical, project), white papers, government documents, etc.) as a focal point and bridge to comprehensive technical information. Every year, TIB serves customers from around 65 countries. About 55% of these customers belong to the academic research community whereas 45% are commercial clients.
DOI Service: The facts
Research data management offers solutions for the proper storage and curation of datasets and other digital objects and their linking with publications throughout the scholarly research cycle. If authors submit the digital objects that support research papers to (certified) research infrastructures such as repositories, future research studies and information retrieval become much easier. In the light of this, TIB took a fundamental step with the innovative STD-DOI project in 2003 (Figure 1): In collaboration with scientific institutes, TIB developed an infrastructure model for DOI registration, establishing a complete workflow for the referencing of research data. This work subsequently led to the foundation of DataCite in 2009. Nowadays, DataCite is a globally oriented non-profit organization operating from local institutions, with 47 members from more than 20 countries. Members include the British Library, the California Digital Library, the Library of ETH Zurich and the Australian National Data Service. DataCite offers an infrastructure that supports simple and effective methods of data citation, discovery and access. Trusted and certified data centres collaborating with the DataCite network and DOI agencies register and supply DOIs for deposited digital resources. As a result, these objects can be linked to the corresponding publications in a persistent way. Depending on which properties are provided alongside a digital resource, the identifier metadata enables services that support discovery, access, verification of integrity and authenticity and a variety of other use cases. Furthermore, the respective data centres often also provide long-term preservation services for the digital objects. In summary, the use of DOIs enables the scientific community to move beyond journals and make more digital scientific content visible, available and searchable in a citable way.
TIB, in its role as the German National Library of Science and Technology, developed dedicated services around the research data life cycle:
- Over 120 academic institutions are provided with administrative, scientific and technical support from the TIB DOI service. Over 1.5 million DOIs for academic content have been registered since 2005.
- Out of the 1.5 million DOIs registered via TIB, 62% have been assigned to research data, 37% to grey literature and 1% to audiovisual media.
- As part of our national mandate, the TIB DOI service is free of charge in the EU for publicly funded institutions. Clients include major research centres such as PANGAEA, the World Data Center for Climate (WDCC) and the European Southern Observatory (ESO) as well as 51 universities and university libraries.
Metadata quality of digital resources is of utmost importance for their searchability and citeability, especially in the academic context. The research community assumes that the objects identified by a PID are part of the academic content that might typically be referenced in a journal article. Therefore, PID services such as the ones offered by DataCite have a metadata standard based around the typical attributes of academic records (including parameters such as title, creator, publisher and publication year). However, to ensure that the objects identified via PID service providers become a related part of academic content, metadata standards have to be constantly checked against current and new academic standards and evolve accordingly. One example is the new DataCite Metadata Schema Version 4.0, published by the DataCite Metadata Working Group (2016). The new schema includes more mandatory fields, including the description of the resource type, and adds the possibility to include a funding reference as well as a funder identifier. The changes increase interoperability with other PID types such as the ORCID for the persistent identification of a researcher and enhance the discoverability of research objects registered via DOI. They also allow PID networks and service providers to implement new services such as DataCite Event Data (https://eventdata.datacite.org), a service which collects events around DataCite DOIs including references to related data, data citations in journal articles, and new versions of a work. The central goal of the TIB research data management service is to pass on such information in a clear and concise manner to our DOI service clients in Germany and Europe. As a research library, we also provide advice, support and in some cases technical services as part of our responsibility to accelerate the current transition to a digitized, data-saturated research system.
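To illustrate what checking a record against mandatory metadata fields can look like in practice, the following minimal sketch tests a record for the attributes named above (identifier, creator, title, publisher, publication year and resource type). The dictionary keys and the sample record are hypothetical and do not reproduce the element names of the DataCite schema; a real service would validate against the published schema itself.

```python
# Illustrative field names only; they are NOT the schema's actual element names.
REQUIRED_FIELDS = (
    "identifier",
    "creators",
    "title",
    "publisher",
    "publication_year",
    "resource_type",
)


def missing_fields(record: dict) -> list[str]:
    """Return the required fields that are absent or empty in the record."""
    return [name for name in REQUIRED_FIELDS if not record.get(name)]


# Hypothetical record that lacks a resource type; the funding reference is
# treated as an optional extra here and is therefore not checked.
record = {
    "identifier": "10.1234/example-dataset",
    "creators": ["Doe, Jane"],
    "title": "Example field measurements",
    "publisher": "Example Data Centre",
    "publication_year": 2016,
    "funding_reference": {"funder_name": "Example Funder"},
}

problems = missing_fields(record)
```

Running such a check before registration lets a data centre reject or repair incomplete records early, instead of discovering gaps when the identifier metadata is harvested later.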
We also prepare for innovations such as the recently announced European Open Science Cloud as part of the Digital Single Market (Commission High Level Expert Group on the European Open Science Cloud 2016).
Example from the daily DOI business
In order to reach a functional and (in an ideal world) open network of research data and other digital resources, there are two basic types of challenges to overcome; one being technical (can we agree upon common, interoperable standards for data and the associated metadata?) and the other being social (can we agree upon a similar strategy for data management, e.g. like the Joint Declaration of Data Citation Principles by the Data Citation Synthesis Group (2014)?).
While the technical challenge posed by the complexity of data is being addressed by multiple working groups across the world (e.g. European Union 2010; Crosas 2011; Lecarpentier et al. 2013; Starr et al. 2015) and seems to be making good progress, the social (human) side is still lagging and changing slowly (Data Citation Synthesis Group 2014; Roche et al. 2015). A successful example is the COPDESS Statement of Commitment (http://www.copdess.org/statement-of-commitment/), which includes a recommendation to archive data in public data repositories and the acceptance of dataset DOIs in the reference lists of journal articles.
Besides the shortage of experts on the handling of digital resources, there are still major questions concerning the ‘art of resource management’ within and between the scientific communities. One of the historic but still omnipresent questions is whether the scientific community expects the object identified via a PID to contain exclusively academic content, and by which standards the content is evaluated as such. The following describes one example from the practice of TIB, which has been a DataCite member for seven years: In early 2014 we received an inquiry about DOIs for field stations, a topic which has been discussed for some time. The foundation of DataCite was a specific initiative to broaden the definition of academic output, focusing on the persistent identification of research data. Consequently, many questions and discussions have arisen regarding the scope of ‘academic content’. Questions asked were ‘Should field laboratories be identified persistently?’ and ‘Which identifier should be used?’ Sometimes it is hard to tell where the line of academic content should be drawn, and when to cross it. From a research library point of view, we strongly recommended selecting an identifier based on a globally unique scheme for a field station, while considering that not every PID may fit every purpose. In order to identify a field station, a whole new metadata approach and much more insight into geographic location description and temporal resolution are needed. If field stations are located close to each other, detailed spatial and temporal resolution as well as information about the hardware and software of each field station may be of great interest to distinguish between them. The usage of varying spatial reference systems in the different disciplines is another factor that needs to be considered. All these aspects cannot be sufficiently represented by the DataCite metadata schema in its present form.
In conclusion, it was recommended to the inquirer that other identifier initiatives such as the International Geo Sample Number (IGSN) or geocoding might be more suitable here, because the IGSN includes metadata designed to describe sampling sites (http://www.geosamples.org/help/vocabularies/#object) and can be used for the description of field stations. An example of a field station can be found here: https://app.geosamples.org/sample/igsn/JPL00ZM00.
Another example is the adoption of DOIs for scientific audiovisual media such as recordings of conferences, lectures and experiments, reports and presentations of research work. These media are made available via the TIB AV-Portal, a cooperative project of TIB and the Hasso-Plattner Institute. The portal was launched in April 2014 (av.tib.eu) and focuses on the automatic generation of metadata, a semantic search and cross-lingual retrieval (German and English). Content-based filter facets for search results enable the exploration of the increasing number of video assets. Search terms are looked up not only in the film’s metadata, such as author, title or abstract, but also in the spoken texts, text overlays and image information. These technologies allow users to search more efficiently by locating the relevant video segment even if the search term could not be found in the film metadata such as title or abstract. By using the open standard Media Fragment Identifier (MFID) in addition to the DOI, individual segments of a video can also be cited as easily as a chapter or a page in a book. In order to cite a video or a video segment, the provided DOI link – which can be enhanced by the MFID – is simply copied and pasted into a document. By connecting video assets to dynamic digital content such as datasets, the scientific value can be increased and contribute to a better understanding of the content life-cycle, from acquisition and preservation to access.
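How a DOI link can be extended to point at a single video segment is sketched below. The sketch assumes the temporal fragment syntax of the W3C Media Fragments URI recommendation (`#t=start,end`, given in seconds); the function name and the DOI name are hypothetical, and whether a given portal honours the fragment after resolution is an assumption that has to be checked per service.

```python
def cite_segment(doi: str, start_s: int, end_s: int) -> str:
    """Build a citable link to one segment of a video identified by a DOI.

    The temporal fragment '#t=start,end' follows the W3C Media Fragments
    syntax, with both offsets given in whole seconds.
    """
    if start_s < 0 or end_s <= start_s:
        raise ValueError("segment must satisfy 0 <= start_s < end_s")
    return f"https://doi.org/{doi}#t={start_s},{end_s}"


# Hypothetical DOI name; the segment runs from 1:30 to 2:00 of the recording.
link = cite_segment("10.1234/example-video", 90, 120)
```

The resulting link can be pasted into a reference list in the same way as the plain DOI link, with the fragment playing the role of a page range.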
Research and Development Department
The German National Library of Science and Technology underpins its PID and framework services with research on sustainable access to digital information and its visualization, highlighting its connectivity and potential use for other research areas. This research is thus derived from the main mission and is focused on innovation while continuously improving the existing services. A major goal of TIB is to establish its own research capacity and to strengthen trust in the (future) role of libraries as preservers of (digital) information and knowledge.
Present research and development projects at TIB focus on
- innovative, media-specific portals enabled by e.g. an automated video analysis with scene, speech, text and image recognition (an additional service upgrading the TIB AV-Portal).
- bibliometric and linked open data (LOD) services for research products such as author information, full texts and research data as part of the KomFor Consortium2 and the VIVO beta project (https://vivo.tib.eu/vivo/).
- the development of tools and the prototypical operation of a service for automated indexing, storage and presentation of non-textual types of documents, e.g. 3D models of CAD applications (PROBADO-3D project, http://www.probado.de/en_3d.html).
- the area of visual analytics by developing automatic processes that recognize patterns in the data and generate condensed graphic representations of high-dimensional data.
- the development of new ways of software citation and software management tools (e.g. using the Jupyter Notebook, a web application).
- establishing a generic Research Data Repository (RADAR), a collaborative project to preserve research results up to 15 years and assign well-graded access rights, or to publish data with a DOI assignment for an unlimited period of time (www.radar-service.eu/en).
Potential clients for the project results include libraries, research institutions, publishers and open platforms which require an adaptable digital infrastructure to archive, publish, analyze and explore digital resources while considering their institutional requirements and workflows. In the following, three exemplary outputs of TIB’s participation in PID services and supporting frameworks are described: the collaborative project for research data preservation and archiving, RADAR, the planned framework LEIBNIZ DATA and the recently established ORCID DE Consortium.
Globally resolvable, persistent digital identifiers have become an essential tool to enable unambiguous links between published research results and their underlying digital resources. One particular problem for the management of data originating from (collaborating) research infrastructures is their dynamic nature in terms of growth, access rights and quality. On a global scale, systems for access and preservation are in place for the big data domains (e.g. environmental sciences, space, and climate). However, the stewardship for disciplines without a tradition of data sharing, including the fields of the so-called long tail, remains uncertain.
RADAR – Research Data Repository – is an interdisciplinary end-point research data repository which provides both preservation and publication services. The project focuses on the so-called ‘long tail’ of research disciplines and will serve as an addition to established ‘big data’ and/or domain-specific repositories, complementing rather than competing with them. RADAR started in March 2017 and provides data services for customers without their own data repository infrastructures or storage capacities. The repository was developed in the course of a project funded by the German Research Foundation from 2013 to 2016: http://www.radar-projekt.org. The project is placed within the program ‘Scientific Library Services and Information Systems (LIS)’ on restructuring the national information services in Germany. RADAR welcomes data from specialized research disciplines of all areas, i.e. natural, economic, social and cultural sciences. The heterogeneity of research data is a serious issue for many research data repositories, especially when they provide storage and publication services for a wide range of scientific disciplines. RADAR addresses this problem by focusing on real scientific workflows and elaborating a generic best practice approach that will be evaluated and tested with data provided by scientific partners from different research areas.
RADAR is developed as a cooperation project of five research institutes from the fields of natural and information sciences (Razum et al. 2014). The technical infrastructure is provided by the FIZ Karlsruhe – Leibniz Institute for Information Infrastructure and the Steinbuch Centre for Computing (SCC) of the Karlsruhe Institute of Technology (KIT). The Ludwig-Maximilians-Universität Munich (LMU), Faculty for Chemistry and Pharmacy, and the Leibniz Institute of Plant Biochemistry (IPB) provide the scientific knowledge and specifications and ensure that RADAR services can be implemented in the actual scientific workflows of academic institutions and universities. The sustainable management and publication of research data with DOI assignment is provided by the German National Library of Science and Technology. The partners aim to establish an interdisciplinary research data repository based on a stable business model. The data management processes and tools needed to achieve this goal include
- guidelines for researchers to introduce and facilitate research data management in general and to store/publish data in RADAR in particular,
- secure data preservation in compliance with required storage periods (including permanent storage) by the use of distributed data storage mechanisms,
- (optional) data publication with DOI-assignment to secure traceability, access and citeability, and
- technical support for institutions, including an optional provision of a review link that may be sent to reviewers/editors during the peer-review of a corresponding paper and frontend-branding.
Being the proverbial “transmission belt” between data producers and data consumers, RADAR specifically targets researchers, scientific institutions, libraries and publishers. In the data lifecycle, RADAR services are placed in the “Persistent Domain” of the conceptual data management model described in the “domains of responsibility” (Treloar et al. 2008). These domains of responsibility are used to demonstrate duties and responsibilities of the stakeholders involved in research data management. Simultaneously, the domains outline the contexts of shared knowledge about data and metadata information, with the goal of a broad reuse of preserved and published research data. Depositing research data in RADAR ensures that the requirements of funding agencies and of Good Scientific Practice are met. As a generic service, RADAR accepts all types of digital data that are collected in the course of scientific research studies. A dataset deposited in RADAR may comprise raw data, primary data (intermediate working data), secondary data and files describing the data and documenting the research process. RADAR accepts both data underlying scientific articles and standalone data publications, e.g. ‘negative data’.3 Data may be submitted in any file format; however, format recommendations reflecting the requirements for long-term accessibility of digital content will be provided in the author guidelines (e.g. the use of PDF/A or XML based formats for text files). The online service can be used in a collaborative way. RADAR enables clients to upload, edit, structure and describe their (collaborative) data in an organizational workspace. In such a workspace, administrators and curators can manage access and editorial rights before the data enters the preservation and optional publication phase. RADAR applies different PID strategies for closed vs. open data.
For closed datasets, RADAR uses handles as identifiers and offers format-independent data preservation for between 5 and 15 years, which can also be prolonged. By default, preserved data are only available to the respective data curators, who may selectively grant other researchers access to preserved data. For open datasets, RADAR provides a DOI to enable researchers to clearly reference and reuse data and to guarantee data accessibility. RADAR offers the publication service of research data together with format-independent data preservation for an unlimited time period. Each published dataset can be enriched with discipline-specific metadata and an optional embargo period can be specified. Workflows and detailed services of RADAR, including a pricing model, are available online (www.radar-service.eu/en).
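The two PID strategies described above can be summarized as a small decision rule. The following sketch is a reading of the text only and not RADAR's actual implementation; the class and function names are hypothetical, and the possibility of prolonging a retention period is left out.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class PreservationPlan:
    identifier_type: str            # "DOI" for open data, "Handle" for closed data
    retention_years: Optional[int]  # None stands for an unlimited period


def plan_for(published: bool, retention_years: Optional[int] = None) -> PreservationPlan:
    """Pick identifier type and retention period as described in the text."""
    if published:
        # Open (published) datasets: DOI and preservation without time limit.
        return PreservationPlan("DOI", None)
    # Closed datasets: handle and an initial retention period of 5 to 15 years.
    if retention_years is None or not 5 <= retention_years <= 15:
        raise ValueError("closed datasets need a retention period of 5 to 15 years")
    return PreservationPlan("Handle", retention_years)


open_plan = plan_for(published=True)
closed_plan = plan_for(published=False, retention_years=10)
```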
A repository service to publish and preserve research outputs takes up a significant part of an institutional strategy and budget planning process. To aid institutions in such processes, various cost models have become available over the last years. Examples include the European 4C project (4C Project Collaboration to Clarify the Costs of Curation 2016) and the APARSEN project, which maps and compares the various models (Kaur et al. 2014). For RADAR, a “cost by service” approach was selected. The calculation sheets depend on the three central phases (ingest, curation and access) of the repository service. The RADAR pricing model includes yearly payment plans based on institutional contracts depending on required storage volume and duration. In addition to the service fee charged for its use, the pricing model of RADAR is (indirectly) subsidized by governmental funds directed to the academic, non-profit institution operating the service, FIZ Karlsruhe. While this mixed business model of subsidized service fees is still a novel structure in Germany, such models are well known and used elsewhere (e.g. in repositories such as Dryad, http://datadryad.org/). With RADAR, academic institutions pay the fee for services including long-term preservation, without any additional cost for the individual researchers who are members of the respective institution.
RADAR aims to meet demands from a broad range of specialized research disciplines: on the one hand, to provide secure, citable data storage for researchers who need to retain restricted access to data, and on the other, to provide an e-infrastructure which allows research data to meet the FAIR principles (Wilkinson 2016), meaning that research data are findable, accessible, interoperable and re-usable on a digital platform available 24/7.
Planned frameworks: LEIBNIZ DATA and the ORCID DE Consortium
Libraries, IT services and research offices at an institution are increasingly required to collaborate locally, nationally and globally in order to jointly build data libraries that support diverse data and research in the long-tail. A positive trend is the increased communication and formation of connections between research libraries and their engagement with research councils, funding councils and other key national and international stakeholders such as the Research Data Alliance (e.g. the Libraries for Research Data Interest Group) to gain and pass on best practices and guidelines of digital (data) curation, e.g. the project 23 Things: Libraries for Research Data (Witt & Libraries for Research Data Interest Group 2016). As such, they may promote the interoperability of disciplinary, national, and international infrastructures, provide a communication platform for the exchange of knowledge and work together to address central challenges. An exemplary project for such a communication and outreach framework concerning research data management is the Leibniz Network for Open Research Data (LEIBNIZ DATA). To better support researcher networking, TIB hosts the business office of the newly established ORCID DE Consortium.
LEIBNIZ DATA is designed as an infrastructure which offers a reliable and long-term service to those research infrastructures of the Leibniz Roadmap4 and beyond, which work with heterogeneous research data. It provides expertise on the cataloguing, archiving and subsequent use of these diverse and in some instances unique digital research data, and thus ensures that they remain available and usable as a central information source for scientific research and development. The data archives of the Leibniz Association’s specialized and established research data centres are networked with, and made visible within the association, using international metadata standards. To enhance the handling of research data in the institutions and among the researchers, common guidelines and standards on research data management will be established. As a network, LEIBNIZ DATA thus works towards the common and collaborative (further) development of sustainable solutions for the integration of heterogeneous data (https://www.leibniz-gemeinschaft.de/en/infrastructures/leibniz-roadmap-for-research-infrastructures/leibniz-data/).
ORCID DE Consortium
In October 2016, the German ORCID DE Consortium was launched, establishing a new alliance to make research more accessible through the adoption of ORCID researcher identifiers (orcid.org). ORCID (Open Researcher and Contributor ID) is an open, non-profit, community-driven effort with the goal of creating and maintaining a registry of unique researcher identifiers. Researchers may obtain an ORCID to distinguish themselves from other researchers while at the same time managing their records of activities (e.g. publications of papers, dissertations, research data, and attended conferences) and searching for others in the registry. Another core function of ORCID is the provision of APIs that support system-to-system communication and authentication. ORCID makes the code available under an open source license, and posts an annual public data file under a CC0 waiver for free download. Organisations may become members to link their records to ORCID identifiers, to update ORCID records and to register their employees and students for ORCID identifiers.
The project ORCID DE was initiated by the German Initiative for Networking Information (DINI). It intends to promote the Open Researcher and Contributor ID for the persistent identification of researchers across academic institutions in Germany, including research institutes and universities. Currently more than 40 German institutions have expressed their interest in joining the consortium. Since many universities and research institutions in Germany are considering implementing ORCID, the project ORCID DE aims to support its sustainable implementation through an integrated approach. This is done by including international experience, important supra-regional infrastructures such as the Gemeinsame Normdatei (GND) and the Bielefeld Academic Search Engine (BASE), as well as publication formats beyond conventional text-based publications. The project promotes the ORCID standard by using related international groundwork by the Knowledge Exchange network or by the Confederation of Open Access Repositories (COAR), which will be edited and adapted during the project to the specific requirements of the German institutions.
TIB, as a DataCite founding member and service provider for research data, audio-visual media and ORCID in Germany, has learned that researchers and institutions working outside the ‘big data’ disciplines principally
- need assistance, on both a technical and a social level, concerning new PID standards, digital resource management and research data policies,
- need assurance and support to submit their data to data centres and publishers; presently they still keep valuable research resources in many different ways without the appropriate measures for preservation, access and re-use,
- locate research data and other digital resources in a patchy way: via colleagues, search portals and via scholarly literature. They would like to have this digital content (e.g. data and publications) linked, and
- treat digital research objects underlying scientific publications in terms of their validation, linking and accessibility in a wide variety of ways. As such, these processes often do not follow general standards or conventions and need individual consultancy of clients.
As a consequence of the COPDESS Statement of Commitment, several new data policies have been published by large publishing houses including Springer-Nature and Copernicus. They explicitly recommend archiving data in public data repositories (and not “together with the paper”, i.e. as a classical data supplement). In addition, new forms of publishing are being introduced, including data journals, which publish peer-reviewed data descriptions. The data which are published via public data repositories (and not in the article) can be cited in such data papers.
To ensure maximum transparency and re-usable scientific content, researchers first need to decide what kind of digital content is produced during a scientific project, and secondly which digital content (and data) should be shared with the respective community and the general public. Following these decisions, questions on storage, organization, metadata description and citeability need to be addressed in order to guarantee the re-use and citation of the shared resources. Research libraries such as the TIB (in cooperation with data centres and other infrastructures, e.g. DataCite) can provide support and act as multipliers for reproducible science by addressing the demands of the scientific communities on the one hand and bringing them together with the appropriate technical e-infrastructures and data centres on the other. To take further steps in the digital landscape, research libraries such as the TIB will additionally provide
- further development of skills, especially in applying metadata standards to digital objects,
- persistent identifier (PID) support around digital resources (DOI), information infrastructure and person identifiers (ORCID),
- advocating the importance of linked information for user-friendliness and ease of use of new and developing digital infrastructures,
- complementary research data management, preservation and publication services and know-how,
- recommendations and infrastructures which securely store, curate and preserve research data, and
- access to research data while providing the appropriate rights for its reuse.