Introduction

Background and Rationale

Research collections are an important tool for understanding the Earth, its systems, and human interaction (). These collections are very diverse and can include preserved natural history specimens, archeological artifacts, minerals, or historical documents, to name just a few. Maintaining and curating these collections requires a large investment of time and money by institutions and many individuals (). Knowledge is created from collections by many individuals over time, each building on the work of others. For maximum efficiency, work needs to be shared broadly and recorded permanently, and tasks should not be repeated unnecessarily. Unfortunately, the current research cyberinfrastructure does not support this level of efficiency.

Despite the importance of collections, maintaining and curating them to keep them up to date often remains challenging for many reasons (e.g., funding and staffing needs) (). As a result, many collections are not maintained or curated as thoroughly as we would like (). We suggest that a major contributing factor to this maintenance backlog is a lack of professional reward for curatorial actions. Most researchers who are qualified to curate a collection are consumed by activities that reap professional reward, such as writing publications and grants. Proper methods of attribution (at the individual and institutional level) are very important for incentivizing digitization, mobilization, and sharing of data deriving from collections (physical and digital). One strategy for elevating the academic value of curatorial actions is to create the necessary infrastructure that captures the breadth of activities undertaken by curatorial staff. Several programs exist for aggregating metrics for research products other than publications, such as ImpactStory (), OpenVIVO (), Bloodhound, and Altmetric. Thus, there is already infrastructure in place for aggregating these data, if the e-infrastructure for creation of these data is available. What is currently lacking is a standard to best express the actions taken by agents when curating physical and digital collections.

Significant investment has been made in creating the necessary components of the infrastructure that integrate data across a wide variety of disciplines. Many of these components are lists, repositories, or other structures that must be populated with data either by a person or algorithmically (; ; ; ). Even an automatically created data set will require some degree of human curation to ensure quality. Often, very little can be completed without initial work by a person to create reference material. This human component is a major bottleneck: existing infrastructure for collective resources is not being populated with data and is therefore not maximally useful. One way to widen the bottleneck is to create professional incentives for researchers to contribute to maintaining and curating collections. If people could receive professional credit, ideally recognized by their administrators and funding bodies, they would prioritize these traditionally unrewarded tasks. Unfortunately, a unified mechanism to manage information about curatorial actions does not yet exist.

To address this problem, we worked with the Research Data Alliance (RDA) and the Biodiversity Information Standards (TDWG) group to engage a community of users and to develop metadata standards that accurately describe attribution. The RDA is a group of over 7000 people from 137 countries who meet every six months to develop and adopt infrastructure that promotes data sharing and data-driven research (). RDA provides a neutral space for working groups to develop recommendations on an 18-month timeframe. Recommendations are developed within the context of Working Groups (WG) and Interest Groups (IG) that form to address a specific problem. TDWG is a scientific and educational association that fosters collaboration among the creators, managers, and users of biodiversity information. TDWG supports interest groups in the development, ratification, and maintenance of standards for biodiversity information. Endorsement of these recommendations by both organizations was crucial for community adoption and long-term support of the results. This paper presents the recommendations of the working group developed during four RDA biannual meetings and two TDWG annual meetings. They represent the completion of the 18-month development period within RDA and potentially the beginning of the standard ratification process within TDWG.

These recommendations were developed to record the attribution metadata associated with curation and maintenance of research collections, whether they be physical or digital objects. The schema was designed to be adopted as part of existing data models and workflows used by stewards of these collections, e.g., museums. It assumes the pre-existence of a collections management system that includes within it a means to track research objects and record curator identities through a system of unique identifiers. These recommendations are intended to fit within the context of existing, domain-specific vocabularies for recording various types of metadata.

Results

This Working Group recommends a very basic, three-axiom schema based on PROV entities and properties, shown in Figure 1 (and demonstrated in the PROV-O documentation) ().

Figure 1 

Recommended Schema. This working group recommends a basic schema linking Entities to Activities and Activities to Agents. Roles are optional.

The key elements of the model for attribution are:

  Entity wasGeneratedBy Activity
  Activity wasAssociatedWith Agent

with some additional attributes assigned to the Activity class:

  Activity has attribute DateTime
  Activity has attribute Reason (added as comment)

The Entity is the curated data object, whether it be a piece of metadata or a physical object. The Activity is the actual curation activity, such as making a correction or transformation. The Agent is the person performing the curation activity. Every Activity will have a DateTime stamp and, optionally, a Reason it was performed. The above axioms state that an Entity “wasGeneratedBy” an Activity and that the Activity “wasAssociatedWith” an Agent, who performed the Activity. An Activity can be related to an Agent using one of two properties. The first is “wasAssociatedWith”; the second, “qualifiedAssociation”, allows for the assignment of a Role. Assigning a Role to the Agent is optional in this recommendation, but specific reifications of this recommendation, such as PROV, may require it. If no Role is to be assigned, then “wasAssociatedWith” should be used. This ontology design pattern is very similar to work done by Cox and Car ().
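The pattern described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not part of the recommendation: the class names follow the schema, but the `to_statements` helper, the tuple layout, and the `ex:` identifiers are hypothetical. Following the use case example in this paper, the sketch always emits “wasAssociatedWith” and adds a “qualifiedAssociation” only when a Role is supplied.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass(frozen=True)
class Agent:
    identifier: str  # e.g. an ORCID iD (hypothetical placeholder values are used below)


@dataclass(frozen=True)
class Entity:
    identifier: str  # e.g. a DOI or IGSN


@dataclass
class Activity:
    identifier: str
    date_time: str                 # xsd:dateTime string
    reason: Optional[str] = None   # the Reason is optional
    associations: list = field(default_factory=list)  # (Agent, role or None) pairs
    generated: list = field(default_factory=list)     # Entities this Activity generated

    def associate(self, agent: Agent, role: Optional[str] = None) -> None:
        self.associations.append((agent, role))

    def generate(self, entity: Entity) -> None:
        self.generated.append(entity)


def to_statements(activity: Activity) -> list:
    """Flatten an Activity into (subject, predicate, object) statements."""
    out = []
    for entity in activity.generated:
        out.append((entity.identifier, "wasGeneratedBy", activity.identifier))
    for agent, role in activity.associations:
        out.append((activity.identifier, "wasAssociatedWith", agent.identifier))
        if role is not None:
            # The qualified form carries the Role alongside the Agent.
            out.append((activity.identifier, "qualifiedAssociation",
                        (agent.identifier, role)))
    return out
```

An Activity with one Agent holding a Role and one Agent without a Role would therefore yield three association statements plus one “wasGeneratedBy” statement per generated Entity.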

Each specific Entity, Activity, and Agent should be represented by a unique, persistent identifier (). We recommend the use of IGSN for physical objects (), ORCID for people (), and DOI for digital objects () wherever possible. The adoption of IGSN for biological specimens is still being discussed and these recommendations will defer to the future community decision. As such, GUIDs or equivalent standards may be used in place of IGSN. Activities can be identified internally by the curation management system in place. All Activities, Entities, and Agents should be instances of a PROV Activity class, a PROV Entity class, and a PROV Agent class, respectively. If the appropriate class does not exist as a subclass, users should work with the VIVO () community to request a new VIVO subclass, which can be mapped to PROV. Anyone can request that a new term be added to VIVO or raise any other issue via GitHub () and join the active ontology-improvement discussion group wiki (). DateTime should be represented as xsd:dateTime (CCYY-MM-DDThh:mm:ss[Z|(+|-)hh:mm]).
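As a minimal sketch of the DateTime convention, the following Python fragment serializes a timestamp in the CCYY-MM-DDThh:mm:ss form with an optional Z or (+|-)hh:mm suffix and checks the lexical shape of a string. The function names and the regular expression are our own illustrative assumptions; the check covers only the lexical pattern quoted above, not calendar validity or fractional seconds.

```python
import re
from datetime import datetime, timedelta, timezone

# Lexical pattern from the text: CCYY-MM-DDThh:mm:ss with an optional
# time zone written as Z or (+|-)hh:mm.
XSD_DATETIME = re.compile(
    r"^\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}(Z|[+-]\d{2}:\d{2})?$"
)


def to_xsd_datetime(moment: datetime) -> str:
    """Serialize with whole-second precision; a UTC offset is written as 'Z'."""
    text = moment.replace(microsecond=0).isoformat()
    return text[:-6] + "Z" if text.endswith("+00:00") else text


def is_xsd_datetime(text: str) -> bool:
    """Check only the lexical shape, not whether the date exists."""
    return XSD_DATETIME.match(text) is not None


# The start time used in the digital record curation use case.
stamp = to_xsd_datetime(
    datetime(2017, 11, 21, 18, 42, 13, tzinfo=timezone(timedelta(hours=-4)))
)
print(stamp, is_xsd_datetime(stamp))
```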

Justification

The above recommendations are based on an existing ontology, PROV (), that is part of a broader world of interconnected ontologies and vocabularies that are in use and have active community support. The pattern is simple enough to be repurposed in multiple disciplines and on physical and digital objects, yet still conforms to existing semantic frameworks.

This schema supports the following queries identified as important by the use cases:

1. Show me all the Activities performed by Kenji on 16 Sept 2013.

PREFIX foaf: <http://xmlns.com/foaf/0.1/>
PREFIX prov: <http://www.w3.org/ns/prov#>
PREFIX xsd:  <http://www.w3.org/2001/XMLSchema#>

SELECT ?agent ?activity ?startdate ?enddate
WHERE
    {
        ?agent foaf:givenName "Kenji" .
        ?activity prov:wasAssociatedWith ?agent .
        ?activity prov:startedAtTime ?startdate .
        ?activity prov:endedAtTime ?enddate .
        FILTER (?startdate >= "2013-09-16T00:00:00+05:30"^^xsd:dateTime) .
        FILTER (?enddate <= "2013-09-16T23:59:59+05:30"^^xsd:dateTime)
    }

2. What Activities have been performed on this metadata record? When?

PREFIX :     <http://example.org/>
PREFIX prov: <http://www.w3.org/ns/prov#>

SELECT ?activity ?startdate ?enddate
WHERE
    {
        ?activity prov:used | prov:generated :enhanced_image_metadata_record .
        ?activity prov:startedAtTime ?startdate .
        ?activity prov:endedAtTime ?enddate .
    }

3. Which Agents have worked with this digital image?

PREFIX :     <http://example.org/>
PREFIX prov: <http://www.w3.org/ns/prov#>

SELECT DISTINCT ?agent
WHERE
    {
        ?activity prov:wasAssociatedWith ?agent .
        ?activity prov:used | prov:generated :enhanced_digital_image .
    }

4. What Role did Kenji play in this image resubmission?

PREFIX :     <http://example.org/>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
PREFIX prov: <http://www.w3.org/ns/prov#>

SELECT ?agent ?activity ?role
WHERE
    {
        ?agent foaf:givenName "Kenji" .
        ?activity prov:wasAssociatedWith ?agent .
        ?activity prov:qualifiedAssociation ?association .
        ?association prov:hadRole ?role .
        ?association prov:agent ?agent .
        FILTER (?activity = :image_resubmission) .
    }
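To make the matching behind queries 2 and 3 concrete, the sketch below evaluates the same patterns over a small in-memory list of statements using only the Python standard library. It is an illustration rather than a SPARQL engine: the sample statements, the `match` helper, and the names `:digital_image` and `:metadata_update` are hypothetical, and an actual deployment would run the queries above against a triple store.

```python
# Statements as (subject, predicate, object) tuples; all values are illustrative.
TRIPLES = [
    (":image_resubmission", "prov:wasAssociatedWith", ":kenji"),
    (":image_resubmission", "prov:used", ":digital_image"),
    (":image_resubmission", "prov:generated", ":enhanced_digital_image"),
    (":metadata_update", "prov:wasAssociatedWith", ":sarah"),
    (":metadata_update", "prov:used", ":enhanced_digital_image"),
    (":metadata_update", "prov:generated", ":enhanced_image_metadata_record"),
]


def match(triples, s=None, p=None, o=None):
    """Return statements matching the pattern; None plays the role of a variable."""
    return [t for t in triples
            if (s is None or t[0] == s)
            and (p is None or t[1] == p)
            and (o is None or t[2] == o)]


def activities_touching(triples, entity):
    """Mirror the prov:used | prov:generated alternative used in queries 2 and 3."""
    found = set()
    for predicate in ("prov:used", "prov:generated"):
        found.update(s for s, _, _ in match(triples, p=predicate, o=entity))
    return found


def agents_for(triples, entity):
    """Query 3: distinct Agents associated with any Activity touching the entity."""
    return {o
            for activity in activities_touching(triples, entity)
            for _, _, o in match(triples, s=activity, p="prov:wasAssociatedWith")}


print(sorted(agents_for(TRIPLES, ":enhanced_digital_image")))
```

Returning a set plays the part of DISTINCT in query 3, since an Agent associated with several matching Activities is reported once.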

Example from Use Cases

Example RDF Turtle representations and diagrams of three use cases are included. One is presented below and two additional examples are given in Appendix A. Examples of how these recommendations could fit in with the larger landscape of existing relevant ontologies and vocabularies are also represented. The RDF Turtle representation can also be found in the project GitHub repository ().

Use case: Digital record curation

Michael (a researcher) notices that a specimen has an incorrect digital lat/long record. Michael reports the error to Sarah (a data curator), who corrects the record in the database (Figure 2).

Figure 2 

Example of Digital Record Curation. Michael and Sarah are both associated with the correction of a digital metadata record as an error reporter and a record editor, respectively. The correction activity is also associated with the incorrect record, which was used to create the correct record.

Attribution:

  • Michael should receive attribution for reporting the error
  • Sarah should receive attribution for correcting the digital record

RDF/Turtle representation

@prefix :       <http://example.org/> .
@prefix dct:    <http://purl.org/dc/terms/> .
@prefix rdf:    <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix xsd:    <http://www.w3.org/2001/XMLSchema#> .
@prefix rdfs:   <http://www.w3.org/2000/01/rdf-schema#> .
@prefix prov:   <http://www.w3.org/ns/prov#> .
@prefix foaf:   <http://xmlns.com/foaf/0.1/> .
@prefix vivo:   <http://vivoweb.org/ontology/core#> .

# Reason for an Activity; declared as a property because it is used as a predicate below
:activityReason a rdf:Property .

# Agents
:michael
    a prov:Person, prov:Agent ;
    foaf:givenName  "Michael" ;
    vivo:orcidId    "http://orcid.org/NNNN-NNNN-NNNN-NNNN" ;
.
:sarah
    a prov:Person, prov:Agent ;
    foaf:givenName  "Sarah" ;
    vivo:orcidId    "http://orcid.org/NNNN-NNNN-NNNN-NNNN" ;
.

# Contributor roles
:error_reporter a prov:Role .
:record_editor a prov:Role .

# Entities
:incorrect_lat_long_record
    a prov:Entity ;
    prov:value      "-89.747988,43.138092" ;
    dct:references  <https://doi.org/XX.XXXX/XXXXXXX> ;
.
:correct_lat_long_record
    a prov:Entity ;
    dct:references          <https://doi.org/XX.XXXX/XXXXXXX> ;
    prov:value              "43.138092,-89.747988" ;
    prov:wasRevisionOf      :incorrect_lat_long_record ;
.

# Activities
:correction
    a prov:Activity ;
    prov:startedAtTime          "2017-11-21T18:42:13-04:00"^^xsd:dateTime ;
    prov:endedAtTime            "2017-11-21T18:42:15-04:00"^^xsd:dateTime ;
    prov:used                   :incorrect_lat_long_record ;
    prov:generated              :correct_lat_long_record ;
    prov:wasAssociatedWith      :michael ;
    prov:wasAssociatedWith      :sarah ;
    :activityReason             "Incorrect longitude and latitude on the digital record." ;

    # Role associations (still part of the :correction description)
    prov:qualifiedAssociation [
        a prov:Association ;
        prov:agent    :michael ;
        prov:hadRole  :error_reporter ;
    ] ;

    prov:qualifiedAssociation [
        a prov:Association ;
        prov:agent    :sarah ;
        prov:hadRole  :record_editor ;
    ] ;
.
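One way to read attribution out of such a record is to join each qualified association with the Entities its Activity generated. The Python sketch below does this for the use case above; the flattened row layout and the `attribution_report` function are hypothetical conveniences for illustration, and a system working directly on RDF would obtain the same rows with a query.

```python
from collections import defaultdict

# (agent, role, activity) rows, one per prov:qualifiedAssociation in the use case.
ASSOCIATIONS = [
    ("michael", "error_reporter", "correction"),
    ("sarah", "record_editor", "correction"),
]

# Entities each Activity generated (prov:generated).
GENERATED = {"correction": ["correct_lat_long_record"]}


def attribution_report(associations, generated):
    """Map each agent to the (role, entity) pairs they should be credited for."""
    report = defaultdict(list)
    for agent, role, activity in associations:
        for entity in generated.get(activity, []):
            report[agent].append((role, entity))
    return dict(report)


report = attribution_report(ASSOCIATIONS, GENERATED)
for agent, credits in report.items():
    print(agent, credits)
```

Under these assumptions, Michael is credited as error reporter and Sarah as record editor for the corrected record, matching the attribution stated for the use case.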

Discussion and Conclusion

Relationship to Other RDA Recommendations

A discussion of how these recommendations fit in the larger community of RDA Working Groups and Interest Groups can be found in Appendix B.

Relationship to Existing Standards

PROV-O: An ontology for describing provenance. These recommendations use design patterns from this ontology, ensuring compatibility (). The use of PROV means that these recommendations are compatible with VIVO () and BCO (). Users can use entities and properties in PROV, VIVO, and BCO to suit specific provenance needs that are out of scope for these recommendations. For example, linking a transformed image to its original image can be done using the PROV property wasDerivedFrom.

SESAR/IGSN: A system of identifiers and metadata for physical samples. These recommendations include the use of IGSN as identifiers for physical objects where possible (). IGSN provides for recording the collector of a sample. The use of IGSN for biological specimens is still being discussed within the biodiversity community ().

TaDiRAH: A vocabulary focused on digital research in the Humanities. This vocabulary contains relevant terms such as “Annotating”, “Cleanup”, and “Editing” that could be used as an Activity, but is specific to the Humanities (; ). Users should draw terms from a relevant vocabulary or add the terms they need to an existing vocabulary.

CRediT: A vocabulary of contributor roles in research. CRediT is a high-level researcher role vocabulary supported by CASRAI (). If a Role is to be assigned to an Agent, it should come from a controlled vocabulary, such as CRediT; however CRediT is very high-level and may not have the needed terms for collection, curation, and maintenance. Users should draw terms from a relevant vocabulary or add the terms they need to an existing vocabulary.

OpenRIF/VIVO-ISF: An ontology for representing contributor roles, activities, and relationships in clinical research. VIVO () is compatible with PROV (). VIVO might be a good adopter if “Curation” is added as a subclass of “Process”. One important point to remember is that PROV is a W3C recommendation, while VIVO is an OBOFoundry ontology (). The critical difference between PROV and VIVO is in the Role class. In VIVO, the Role is unique to the Agent while in PROV, the Role is a separate class that can be assigned to multiple Agents. The consequences of choosing PROV or VIVO should be carefully weighed by each adopter, but will be less of an issue if Role is not used. A PROV Agent would be equivalent to a Person, Group, or Organization in VIVO (or rather, a FOAF Agent). A PROV Activity would be equivalent to an Event from the event ontology. A PROV Entity would be any OWL Thing. Figure 3 is an attribution model proposed, but not yet implemented, in VIVO. The person (or Agent) is the bearer of a Role which may have any of the CRediT types. The Role is realizedIn an occurrentPart (here called Contributorship and not directly represented in the recommendation) of a workProcess (or Activity) which has output Work (or Entity). The person “participates_in” (RO_0000056) the work process (not shown). A datetime can be added to the contributorship to constrain the time of a person’s contribution as shown above, but also to the work process to indicate start and end times for that process (not shown).

Figure 3 

Proposed VIVO Contribution Model. Unlike roles in the PROV model, roles in the VIVO model inhere in the Agent/Person and are realized in the Activity/Work Process.
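The difference in how the two models treat Role can be sketched as follows. This is a simplified reading of the description above and not an implementation of either ontology: the class names, fields, and the `prov_to_vivo` conversion are hypothetical. In the PROV-style record a single role value is shared by several associations, while the VIVO-style record mints one role instance per bearer.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ProvAssociation:
    """PROV-style: the Role is a separate value that many Agents can be assigned."""
    activity: str
    agent: str
    role: str  # shared role identifier, e.g. "record_editor"


@dataclass(frozen=True)
class VivoRole:
    """VIVO-style, as described in the text: the Role inheres in one person
    and is realized in one work process."""
    role_id: str      # unique per bearer
    role_type: str    # e.g. a term from a controlled vocabulary such as CRediT
    bearer: str
    realized_in: str


def prov_to_vivo(associations):
    """Mint one VivoRole per PROV association so that no role instance is shared."""
    return [VivoRole(f"role-{i}", a.role, a.agent, a.activity)
            for i, a in enumerate(associations, start=1)]


prov = [
    ProvAssociation("correction", "sarah", "record_editor"),
    ProvAssociation("recheck", "kenji", "record_editor"),
]
vivo = prov_to_vivo(prov)
print(vivo)
```

The conversion keeps the role type but loses nothing else, which is consistent with the observation that the choice between the two models matters less when Role is not used.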

Darwin Core: A data standard for biodiversity. Darwin Core does not currently have an extension for describing curation of objects (). This recommendation will form the foundation of future work to develop this extension. There is already some demonstrated community support for this work (, ).

COPDESS: Data publication standards in Earth Science. COPDESS has committed to using IGSN and ORCID, as these recommendations suggest ().

DataCite: Data publication standards. DataCite only allows use of DOI (). These recommendations suggest using DOI for digital objects where possible.

Biological Collections Ontology (BCO): Ontology for describing the collection and treatment of biological samples. This ontology describes some activities that could be considered curatorial, such as the analysis and treatment of biological samples, but is less concerned about attributing those actions to an individual (). BCO and PROV are compatible. Many of the process classes in BCO could serve as Activities in PROV.

Future Work and Implementation

These recommendations represent the results of an 18-month collaboration between RDA and TDWG. After 18 months, RDA requires production of a set of recommendations, but continues support for working groups in maintenance mode as they work with organizations that wish to adopt their recommendations. While this manuscript represents the end of active development within an RDA working group, it also represents the beginning of refinement of these recommendations within a TDWG interest group. These recommendations will form the basis of a proposal to create an extension to the Darwin Core standard (; ). Refinement of these recommendations will continue as adopters come online and within the TDWG standards ratification environment.

Future work that will not necessarily take place within the context of this RDA/TDWG group, but that will be important for adoption, includes:

  • Formal adoption of a specific persistent identifier for specimens by the biological collections community (such as IGSN ()). The biodiversity informatics community has been struggling with the adoption of identifiers for specimens for many years with limited success (; , ). A new approach is in development for uniquely and persistently identifying specimens without a universal identifier system. This method uses the specimen metadata/identifier graph as an identifier (). We will explore this method for use in combining sources of attribution metadata without duplicating specimen records.
  • The expansion of existing vocabularies and ontologies to include needed terms. The addition of activity classes to VIVO is in early discussions and will likely include a collection and a curation class. These discussions will take place in the GitHub repository within the TDWG-sponsored Attribution Interest Group ().
  • The development of a pilot application for displaying curation activities on an ORCID profile. The pilot application is in early phases and will demonstrate the flow of attribution metadata from a collections data manager (Bloodhound) to a data aggregator (ORCID).
  • The quantification of the impact of this new standard on collections management. Studies have shown some impact of robust attribution on changing incentives and research practices (; ; ; ), but these data are hard to get without a dedicated study. Our future plans include applying for funding to explore the effect of improved curation and maintenance attribution on the practice of collections management.

Additional Files

The additional files for this article can be found as follows:

Appendix A

Attribution Metadata Standard and Use Case Examples. DOI: https://doi.org/10.5334/dsj-2019-054.s1

Appendix B

Relationship to Other RDA Recommendations, Working Groups, and Interest Groups. DOI: https://doi.org/10.5334/dsj-2019-054.s2