Introduction

The study of natural ecosystems requires multidisciplinary science teams to understand and model processes from molecular to global scales. Many research activities involve diverse collections of samples and associated field or laboratory measurements. For example, studies of organic matter cycling through plants and soil involve analysis of samples to represent soil biogeochemistry, microbial communities, plant structures, leaf gas exchange, and traits of the specific organisms involved. Each scientific expert, project team, and discipline has a responsibility to ensure that others can interpret, integrate, and reuse their sample data to help solve emerging problems as our global environment continues to change.

Collaboration across disciplines requires a more unified approach to reporting basic information about key data entities, such as samples. One challenge in promoting a unified way of reporting sample data is that some research communities have already developed community-specific conventions, including those for ‘omics samples, biodiversity records, and geoscience samples. A larger challenge is that many researchers use no formal reporting conventions, or exclude information needed to interpret and reuse the data. More coordination is needed across these communities to develop a widely adopted multidisciplinary reporting format for physical samples, or to ensure that existing standards are interoperable. Common reporting would support effective discovery, integration, and reuse of sample data that span scientific domains.

Sample identifiers are also needed to associate and manage important information describing a sample (i.e. metadata), such as the location, date, environmental context, and purpose of sample collection. For multidisciplinary studies, the task of generating and managing unique sample identifiers and associated metadata can be complicated, particularly as important contextual information is added throughout the data lifecycle. Samples are sent to different collaborators, laboratories, and user facilities, and then combined into a variety of digital records and publications (Figure 1). As a result, scientists face challenges with (meta)data management and tracking, and with integrating and reusing valuable sample data. Left unaddressed, these inefficiencies result in (meta)data loss and limit the potential for scientific discovery.

Figure 1 

Tracking interdisciplinary samples throughout the cycle of field collection, transport to collaborators and other labs, various analyses, and digital records.

Our overall goal was to address the sample identification and metadata needs of ecosystem scientists. The work was driven by the user community of the US Department of Energy’s (DOE’s) data repository for Earth and environmental sciences, the Environmental Systems Science Data Infrastructure for a Virtual Ecosystem (ESS-DIVE). The DOE’s Environmental Systems Science (ESS) program relies on multidisciplinary, team-based science to study complex processes within terrestrial ecosystems, spanning from the bedrock through the rhizosphere and vegetation to the atmospheric surface layer. This community is well-positioned to help address specific challenges in standardizing and integrating (meta)data about a variety of environmental samples (e.g. soil, water, plant, and associated biological material used for ‘omics analyses), challenges that apply broadly to environmental research.

We focus on sample identifiers and metadata that support findability, accessibility, interoperability, and reusability (FAIR) from the multidisciplinary domain-science perspective. We therefore use a community-focused approach to: a.) evaluate existing options for sample identifiers and metadata descriptions for ecosystem science samples; b.) pilot the process of standardizing sample information to evaluate practical issues from domain-science perspectives; and c.) outline practical recommendations for sample identifier allocation, tracking, and associated metadata.

Methods

Review of existing sample identifiers, metadata conventions and standards

ESS-DIVE’s work on sample identifiers and metadata began in response to a specific problem with tracking multidisciplinary samples as they are sent to different labs and user facilities, which DOE ESS scientists brought up during community meetings. As a community-focused data repository, our approach to this issue involved leading or participating in a variety of community discussions on sample identifiers and/or associated metadata. These included: presenting identifier options in an ESS community webinar and whitepaper, discussion with each pilot test participant, several meetings with US DOE user facilities and data systems representatives (Joint Genome Institute, National Microbiome Data Collaborative, Environmental Molecular Sciences Laboratory, and DOE Systems Biology Knowledgebase), broader community meetings on identifier and metadata practices for physical samples [Earth Science Information Partners (ESIP), and Research Data Alliance (RDA)], National Microbiome Data Collaborative (NMDC) Ontology workshop, USGS workshop on sample collection metadata for the National Digital Catalogue, and participation in the IGSN 2040 Steering Committee and business planning.

After reviewing the scope and use of available persistent identifier (PID) options (Table 1) and community discussions, we focused additional identifier comparison on International GeoSample Numbers (IGSNs) and Archival Resource Keys (ARKs), which are most commonly used for a variety of sample types (Supplemental Table 1). Considerations in the identifier assessment included: i.) association with a broader international community focused on sample identification and description, ii.) associated metadata to describe samples and their relationships, iii.) availability of user-friendly infrastructure to mint identifiers and validate metadata, iv.) general ease of use, and v.) other technical identifier characteristics listed in Supplemental Table 1.

Table 1

Examples of PIDs that have been used for samples, modified from Guralnick et al.


IDENTIFIER TYPE | IDENTIFIER EXAMPLE | SCOPE
--- | --- | ---
ARK | ark:/12148/btv1b8449691v | Flexible
URN | urn:catalog:UMMZ:Mammals:171041 | Flexible
HTTP URI | http://data.rbge.org.uk/herb/E00115694 | Flexible
DOI | 10.7299/X7VQ32SJ | Flexible, mostly papers and datasets
UUID | EF0A4D3E-702F-4882-81B8-CA737AEB7B28 | Flexible
IGSN | IGSN: IECUR0002 | Geoscience, working to become general physical sample identifier
CETAF URI, based on HTTP URI | http://data.rbge.org.uk/herb/E00421503 | Species Occurrence, Specimens from CETAF institutions
RRID | RRID:MGI:5630441 | Biomedical Research Resources
BioSample accession number | SAMN03983893 | Biological source materials used in experimental assays

Acronyms: ARK = Archival Resource Key, URN = Uniform Resource Name, URI = Uniform Resource Identifier, DOI = Digital Object Identifier, UUID = Universally Unique Identifier, IGSN = International GeoSample Number, CETAF = Consortium of European Taxonomic Facilities, RRID = Research Resource Identifier.

We also reviewed existing metadata standards and templates that are relevant for samples collected by environmental scientists, including general digital object standards, biodiversity records, ‘omics (e.g. genomics, metagenomics) material, and geoscience samples (Supplemental Table 2). We created a translation table comparing 49 metadata elements (Supplemental Table 3) in human-readable format. The translation table shows where metadata elements are common across standards and where they differ.

The core IGSN Descriptive Metadata Schema (https://github.com/IGSN/metadata) includes basic metadata associated with sample collection, which is generally relevant across sample types. This schema links metadata profiles that differ across six currently-functioning IGSN allocating agents. SESAR (the first allocating agent) has no access restrictions for obtaining IGSNs and provides user-friendly services for sample management (https://www.geosamples.org/). The SESAR metadata profile and controlled terms are currently focused on geoscience samples, but the IGSN organization seeks to accommodate multiple disciplines and has already expanded into plant and other biological samples for some IGSN allocating agents. Our translation table for sample metadata allowed us to identify metadata elements and terms that could be revised or extended within the SESAR profile for improved representation of other sample types (Supplemental Table 3).

Biology-related standards are well-established, commonly used in the community, and particularly important for ecosystem science samples. Genomic and metagenomic analyses and data publication require use of standards developed by the Genomic Standards Consortium (GSC), namely Minimum Information about any Sequence (MIxS) and Minimum Information about a Metagenome Sequence (MIMS). Darwin Core is a metadata standard for biodiversity records that has been widely adopted across the biocollections community. It is also required for submitting data to the Global Biodiversity Information Facility (GBIF, www.gbif.org), which allows global search and integration of biodiversity records. GBIF provides a valuable service as a data aggregator, and thus has driven standards adoption and enabled a wide range of data reuse applications in published biodiversity studies, including over 5,000 known citations from studies using biodiversity records (www.gbif.org).

We researched ontologies that could be used to describe a broad set of environmental sample types, including the Biological Collections Ontology (BCO), Environment Ontology (ENVO), Population and Community Ontology (PCO; http://purl.obolibrary.org/obo/pco.owl), and Plant Ontology (PO), to identify additional or alternate terms to generally describe other types of soil, sediment, water, gas, and biology-related samples.

We also engaged with the broader, international community working on sample-related practices. This broader community is led by members of the IGSN organization, with participation across other national agencies (e.g. USGS, CSIRO, and the Australian Research Data Commons, ARDC) and data organizations (ESIP and RDA). This community participation was important for identifying best practices in identifier and metadata use, and for contributing the perspectives of ecosystem sciences to the broader community working on sample standardization. Continued participation in the broader informatics and domain science communities is important for improving interoperability and usability of sample-related standards.

Sample identifier and metadata testing in the field

In order to develop a sample metadata reporting format that was informed by our domain science community, we worked with scientists from eight different Environmental Systems Science projects to conduct a pilot test for using sample PIDs and metadata. In particular, we tested the practicality of the IGSN, which appeared to be the best choice amongst relevant PIDs for our purposes. These projects had varying scopes and sample types, and were all funded by DOE’s Office of Science Environmental Systems Science (ESS) program (Supplemental Table 4).

Prior to sample registration, we discussed the following with representatives from each project: 1) expected sample types involved, 2) how to assign IGSNs and link related samples, 3) essential metadata needed to understand specific sample types, and 4) past sample tracking workflows. Some projects had already collected samples and preferred to register for IGSNs after collection so that the IGSNs could be associated with digital files, while other projects pre-registered their samples before collection, or registered directly after collection. We used initial feedback and background research to identify several core descriptive sample metadata fields likely to be necessary for searches on ESS-DIVE to be most effective, including standardized information on the following (see Supplemental Table 3 for the full translation table comparing metadata elements from existing standards and templates):

  • IGSN and Parent IGSN (where relevant)
  • Sample Name (project-specific sample name, must be unique)
  • Chief Scientist/Collector
  • Sample Type fields:
    • Object Type (e.g. Individual sample, core, site),
    • Material (e.g. Liquid-aqueous, Rock, Soil, Biology),
    • Sampled Feature (primary physiographic feature sample collected from)
  • Location Information (Latitude, Longitude in WGS84; Location description),
  • Date (ISO 8601; e.g. 1954-04-07),
  • Collection Method Description
  • Project

Note that this list represents the metadata fields that we initially proposed as required; the list was subsequently revised after our pilot test work. Many additional metadata fields are available and are recommended or optional depending on the sample type.
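The core fields listed above lend themselves to a simple automated check before registration. The following is a minimal sketch in Python, not part of the SESAR tooling: the field names (e.g. `sample_name`, `collection_date`) and the function `validate_sample` are hypothetical, and the IGSN and Parent IGSN fields are treated as optional because some projects register only after collection.

```python
from datetime import date

# Hypothetical field names, loosely following the core list above.
REQUIRED_FIELDS = (
    "sample_name", "collector", "object_type", "material",
    "latitude", "longitude", "collection_date",
    "collection_method_description", "project",
)


def validate_sample(record, seen_names):
    """Return a list of problems found in one sample record (empty if none)."""
    problems = []

    # Every core field must be present and non-empty.
    for field in REQUIRED_FIELDS:
        if record.get(field) in (None, ""):
            problems.append(f"missing {field}")

    # Project-specific sample names must be unique within the batch.
    name = record.get("sample_name")
    if name:
        if name in seen_names:
            problems.append(f"duplicate sample_name {name!r}")
        seen_names.add(name)

    # Dates are expected in ISO 8601 calendar form, e.g. 1954-04-07.
    raw_date = record.get("collection_date")
    if raw_date not in (None, ""):
        try:
            date.fromisoformat(str(raw_date))
        except ValueError:
            problems.append("collection_date is not ISO 8601 (YYYY-MM-DD)")

    # WGS84 decimal degrees: latitude within +/-90, longitude within +/-180.
    for field, limit in (("latitude", 90.0), ("longitude", 180.0)):
        value = record.get(field)
        if value in (None, ""):
            continue
        try:
            if abs(float(value)) > limit:
                problems.append(f"{field} out of range")
        except (TypeError, ValueError):
            problems.append(f"{field} is not a number")

    return problems
```

Passing one shared `seen_names` set across all rows of a spreadsheet lets the same function catch duplicate sample names as well as per-row problems.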

The researchers involved in our testing used SESAR’s sample management portal (MySESAR, http://www.geosamples.org/mysesar) to register samples and update metadata. We recommended a specific workflow for participants to register their samples and update sample collection metadata, outlined in our GitHub repository (https://github.com/ess-dive-community/essdive-sample-id-metadata) and associated dataset.

We also worked with individuals to map sample history from collection of samples in the field through a variety of analyses, and publication (Figure 2). This exercise helped determine sample tracking needs, and develop recommendations for assigning PIDs and linking highly-related samples and subsamples.

Figure 2 

Sample journey map, using the sample PID and metadata to document sample history and link related samples in the WHONDRS project.

PNNL = Pacific Northwest National Laboratory; EMSL = Environmental Molecular Sciences Laboratory; ORNL = Oak Ridge National Laboratory; GOLD = Genomes Online Database.

After sample collection and registration, we discussed the following: 1) What sample collection metadata is needed to understand resulting sample data?; 2) How much effort did it take to register samples and standardize metadata?; 3) What is needed to make sample PID registration and standardization easier?

Developing the final IGSN-ESS reporting guidelines

We used a combination of research on existing standards and pilot test feedback to develop final recommendations for allocating identifiers and assigning standard metadata. We took extensive notes during meetings with pilot test participants, and compiled specific feedback on improving guidance for allocating identifiers and relationships, on the metadata needed to understand relevant sample types, and on improving the efficiency of sample registration and standardization. Pilot test participants identified metadata elements that needed to be added, modified, or removed to improve relevance for multidisciplinary ecosystem science samples. We then used our translation table (Supplemental Table 3) comparing other existing standards to guide specific recommendations. For example, to address feedback regarding inefficiencies in providing all metadata at the individual sample level, we added the Darwin Core elements Location ID, Collection ID, and Event ID. We then reviewed existing, commonly used ontologies (ENVO, BCO, PO) to select important vocabulary terms to characterize sample type, material, and environmental context. We developed a list of relevant terms based on pilot test studies, and all participants helped decide on our final term lists for object type and material, specifically.

All feedback was addressed in our final recommendations, which we compiled in GitHub and in more user-friendly GitBook documentation. This documentation includes: instructions on registering samples for IGSNs using our revised template; specific definitions, instructions, and examples for each metadata element; lists of terms for elements where controlled vocabulary is needed; and instructions for how to contribute feedback using GitHub and how to cite the final format. To develop the documentation, we used the ESS-DIVE community GitHub for samples, inspired by the user-friendly documentation for Darwin Core, which facilitates additional community feedback (through public GitHub issues) and versioning. We presented our final recommendations and documentation in two additional community webinars, which were advertised to ESS-DIVE users and ESS scientists, and published on the ESS-DIVE website (https://ess-dive.lbl.gov/webinars/). The purpose of the community webinars was to present our conclusions and collect any additional feedback.

As a community-oriented data repository, we will continue to gather feedback and develop additional tools to support users in submitting, searching for, integrating, and reusing high-quality sample data.

Results

Review of Existing Sample Identifier and Metadata Practices

In our review, we found that numerous studies have documented that persistent identifiers (PIDs) enable sample tracking across facilities and publications, and support reuse over time. PIDs are globally unique, stored with descriptive metadata, and arguably essential for supporting data synthesis. While there are several options for obtaining PIDs, including Archival Resource Keys (ARKs), Digital Object Identifiers (DOIs), and Uniform Resource Identifiers (URIs), the International GeoSample Number (IGSN) is the primary PID for physical samples (Table 1, Supplemental Table 1). IGSNs were originally designed for geoscience samples, but have been used for a variety of biological and environmental sample types. The IGSN organization is now expanding to better support multidisciplinary samples, and is leading the Internet of Samples project.

Through community discussions, we determined that the most important factors in selecting a PID were a.) an international community with expertise on sample documentation, b.) associated sample-specific metadata that will eventually enable global sample search and integration, and c.) user-friendly infrastructure to mint PIDs, validate metadata, and provide a sample-specific web landing page (Supplemental Table 1). The IGSN is the only identifier with all of these characteristics, as it is uniquely governed by an international community organization (IGSN e.V.) with a mission to mint and maintain persistent identifiers for physical samples. The System for Earth Sample Registration (SESAR) is the largest IGSN allocating agent, and enabled us to readily test the process of sample registration and standardizing metadata without first building new infrastructure to mint PIDs, print IGSN barcode labels, and submit and validate metadata. SESAR also provides a persistent sample landing page (e.g. IGSN:IEBWE000L) with metadata and links to related resources.

Through our comparison of metadata elements in existing sample-related standards and templates (Supplemental Table 3), we concluded that IGSN metadata contains the basic information needed, and is therefore sufficient as a starting point for our pilot on standardizing sample metadata.

Sample Identifier and Metadata Testing in the Field

Our pilot test included eight DOE ESS-supported projects that collected field-based samples, including studies of biogeochemical responses to contamination, climate change, or other disturbances (Supplemental Table 4). Project sample types included soil cores, core sections, individual soil samples, sediment, gas, porewater, pond water, river water, leaves, and biofilms. Researchers registered their samples with IGSNs to determine the practicalities of using the original SESAR IGSN template (i.e. an Excel spreadsheet with a column for each sample metadata element and a row for each unique sample name/IGSN) in multidisciplinary scientific workflows.

A total of 4,485 IGSNs were registered as part of the pilot (Supplemental Table 4). A primary sample for participating projects was often split into multiple subsamples or replicates, and sent to different labs (2–9 labs/user facilities) for numerous analyses (2–23 analyses; Figure 2, Supplemental Table 4). There was universal agreement among researchers that top-level “parent” samples (e.g. a soil core) and related “child” samples (e.g. subsections of a soil core) should be assigned individual IGSNs. Note that a soil core is a physical parent sample, while in some cases researchers may need to link a set of related samples that have no physical parent sample. One example from our test was a set of water samples collected at different depths at a specific point and time in a pond (Figure 3).

Figure 3 

Options for assigning IDs to sets or chains of highly related samples and subsamples. There is uncertainty among domain scientists about whether to assign new PIDs to subsamples. Based on our pilot test feedback, options 2 and 3 are most efficient for soil cores and water samples, respectively. Relationship metadata can be inferred from the type of ID (e.g. collection or site ID) and the order of Parent IGSNs, and assists machine reconstruction of the sampling hierarchy from original feature or sample through subsequent child samples.
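As an illustration of how Parent IGSN links support machine reconstruction of a sampling hierarchy, the sketch below walks from a child sample up to its top-level parent and groups children under each parent. It assumes the relationships have already been exported into a plain mapping of identifier to parent identifier; the function names and the identifiers in the usage note are placeholders, not real IGSNs or SESAR functions.

```python
def lineage(sample_id, parent_of):
    """Follow parent links from a sample up to its top-level parent."""
    chain = [sample_id]
    while parent_of.get(chain[-1]):
        parent = parent_of[chain[-1]]
        if parent in chain:
            # Guard against malformed metadata that loops back on itself.
            raise ValueError(f"cycle in parent links at {parent}")
        chain.append(parent)
    return chain


def children_by_parent(parent_of):
    """Invert the child -> parent mapping into parent -> list of children."""
    tree = {}
    for child, parent in parent_of.items():
        if parent:
            tree.setdefault(parent, []).append(child)
    return tree
```

For example, with `parent_of = {"CORE-1": None, "SEC-1A": "CORE-1", "SUB-1A-1": "SEC-1A"}`, calling `lineage("SUB-1A-1", parent_of)` walks from the subsample through the core section to the core.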

Most participants were uncertain whether to assign new IGSNs to subsamples or replicates stored in different containers or split for analyses, particularly when they are essentially considered to be the same sample with the same metadata; many researchers preferred qualifiers/extensions from the same primary IGSN in such cases (Figure 3). IGSN extensions are currently allowed by request through SESAR, and are preferred by some users to avoid numerous rounds of IGSN registration and redundant metadata entry. The extensions can allow precise provenance tracking and incorporate additional analytical metadata when subsamples are sent out for a variety of analyses, without requesting new IGSNs. However, this requires that 1) users ensure their extensions are unique, 2) extensions are restricted to a limited number of additional characters, and 3) extended IGSNs are batch registered through the IGSN allocating agent with associated metadata, including at least object/sample type, sample name, and the parent IGSN (and ideally all relevant metadata inherited from the parent IGSN). IGSN allocating agents could consider more efficient approaches for registering IGSN subsamples with the same metadata as their parent IGSNs, such as adding a metadata field to list subsamples (IGSNs with user-specified extensions), or having extended IGSNs automatically resolve to the primary IGSN landing page, as done by the ARK identifier system for containment qualifiers (https://wiki.lyrasis.org/display/ARKs/ARK+Identifiers+FAQ).
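The three conditions on extensions described above could be checked before a batch is submitted to the allocating agent. This is a rough sketch only: the record keys, the function `check_extensions`, and the default limit of three characters are all assumptions made for illustration, since the actual character limit and submission format are set by the allocating agent.

```python
def check_extensions(parent_igsn, records, max_extension_chars=3):
    """Check a batch of extended-IGSN records against the three conditions.

    `max_extension_chars` is an assumed placeholder, not a documented limit.
    """
    problems = []
    seen = set()
    for rec in records:
        ext = rec.get("extension", "")

        # Condition 2: extensions are limited to a few extra characters.
        if not ext or len(ext) > max_extension_chars:
            problems.append(
                f"{ext!r}: extension must be 1-{max_extension_chars} characters"
            )

        # Condition 1: extensions must be unique under the same parent.
        if ext in seen:
            problems.append(f"{ext!r}: extension is not unique")
        seen.add(ext)

        # Condition 3: minimum metadata registered with each extension.
        for field in ("object_type", "sample_name"):
            if not rec.get(field):
                problems.append(f"{ext!r}: missing {field}")
        if rec.get("parent_igsn") != parent_igsn:
            problems.append(f"{ext!r}: parent_igsn does not match {parent_igsn}")
    return problems
```

An empty return value means the batch satisfies the checks in this sketch; it does not guarantee acceptance by an allocating agent.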

Researchers also had different opinions on whether related entities (e.g. location) should get an IGSN/PID. In most cases, project-specific, locally unique IDs were sufficient for collection and location IDs. Some researchers assigned IGSNs to wells that were re-sampled over time.

Use of IGSN metadata and template

Much of the IGSN Core Descriptive Metadata is relevant for samples across research domains, but there are key metadata fields and vocabulary terms that are missing or do not accurately describe some ecological samples. We added two essential metadata elements from Minimum Information about any Sequence (MIxS) (i.e. broad environmental context/biome and sample processing), and added or modified fields based on Darwin Core (i.e. Scientific Name, Depth, and Height fields) to more fully describe ecosystem samples (Figure 5). We concluded that the Environment Ontology (ENVO) includes more relevant terms to describe sample material and environmental context for ecosystem science samples. Because ENVO is used in the MIxS template, it also helps improve interoperability when relating geoscience analyses to ‘omics analyses for samples (Table 2), which is often important in ecosystem studies.

Table 2

Mapping of key fields to promote interoperability between geoscience (IGSN) and associated metagenomic samples (BioSample). Minimum Information about any Sequence (MIxS)/Minimum Information about a Metagenome Sequence (MIMS) templates require or encourage use of the Environment Ontology (ENVO) to describe environmental context and materials, and the Gazetteer ontology (GAZ) for place names.


IGSN FIELD | MIXS/MIMS FIELD
--- | ---
IGSN | Source material ID (can include the full link to sample landing page)
Material | Environmental medium* = ENVO
Related to Material | organism (e.g. soil metagenome)
Physiographic feature | local scale environmental context* = ENVO
N/A | broad scale environmental context* = ENVO
Country | geographic location (country or region) = GAZ
N/A | sample material processing
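The field mapping in Table 2 can be expressed as a small lookup that carries values from an IGSN record into the corresponding MIxS/MIMS fields. The sketch below is illustrative only: the snake_case keys on the IGSN side and the function `to_mixs` are hypothetical, the MIxS labels are copied from Table 2 rather than from the official template, and values are copied as given (the sketch assumes material and feature terms have already been translated to ENVO terms).

```python
# IGSN-side keys are hypothetical; MIxS-side labels follow Table 2.
FIELD_MAP = {
    "igsn": "Source material ID",
    "material": "Environmental medium",
    "physiographic_feature": "local scale environmental context",
    "country": "geographic location (country or region)",
}

# MIxS fields with no IGSN counterpart (the N/A rows in Table 2).
MIXS_ONLY = (
    "broad scale environmental context",
    "sample material processing",
)


def to_mixs(igsn_record, landing_page_url=None):
    """Return (mixs_fields, still_needed) for one IGSN sample record."""
    mixs = {}
    for igsn_field, mixs_field in FIELD_MAP.items():
        value = igsn_record.get(igsn_field)
        if value:
            mixs[mixs_field] = value

    # A full landing-page link, when the caller has one, replaces the bare IGSN.
    if landing_page_url:
        mixs["Source material ID"] = landing_page_url

    # Fields the submitter must still supply by hand.
    still_needed = [field for field in MIXS_ONLY if field not in mixs]
    return mixs, still_needed
```

Returning the unmapped fields alongside the mapped ones makes explicit which MIxS entries cannot be derived from the IGSN record.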

IGSN was designed to allow community-specific metadata profiles along with common high-level metadata to support broader interoperability. However, variations across the communities in high-level vocabularies, such as object/sample type and material terms, can inhibit interoperability if the vocabulary terms are not well defined, managed, and linked. We therefore mapped SESAR IGSN terms to ENVO terms for materials. Unlike IGSN vocabulary terms, ENVO terms have specified definitions, PIDs, and are linked to other related terms across many existing ontologies. We also believe that the broader IGSN community could contribute valuable input to the ENVO terms, and benefit from using this ontology or others as they move towards supporting a wider variety of disciplines. We found community agreement that the IGSN Object type terms also need to be revised, and high-level vocabularies will be addressed in the new ESIP Physical Samples Curation Cluster (https://wiki.esipfed.org/Physical_Sample_Curation).

Participants with extensive sampling campaigns found that the spreadsheet format requiring full documentation for each individual sample was impractical. To partially address this, we followed Darwin Core by adding the option of managing metadata using identifiers for higher-level entities (collectionID, locationID, eventID) to help avoid redundant metadata entry. Managing metadata for larger collections of samples by describing sample collections, locations, or events in separate files (see Figure 4) can allow programmatic transfer of relevant metadata to individual samples. However, with regard to applying IGSN metadata to locations, we encountered several issues, described in Table 3, because the metadata was not intended to fully document site information. We provide additional recommendations in Box 1 that may further improve the efficiency of standardizing sample metadata and/or address practical concerns of researchers.

Figure 4 

Example of using related identifiers to link related samples and information. Related identifiers are listed in blue. All metadata can be provided at the sample level or by providing separate files (depicted as boxes) for higher-level collections of samples, sampling events, methods, and/or locations. When providing separate spreadsheet files, each file (e.g. locations file) contains a row for each unique related identifier (e.g. location ID), with the associated metadata fields (e.g. location description) as columns. Unique identifiers for these related, higher-level entities then allow associating relevant metadata (e.g. latitude and longitude) with individual samples. This practice is flexible and optional, depending on data management needs and preferences.
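A minimal sketch of the programmatic transfer described above, assuming the samples file and the locations file have each been read into a list of dictionaries (one per row) that share a `locationID` column. The function name and the rule that sample-level values take precedence over location-level values are assumptions made for illustration.

```python
def attach_location_metadata(samples, locations):
    """Copy location-level metadata onto each sample via its locationID."""
    by_id = {loc["locationID"]: loc for loc in locations}
    merged = []
    for sample in samples:
        # Samples without a matching location simply keep their own fields.
        loc = by_id.get(sample.get("locationID"), {})
        # Non-empty sample-level values override location-level values.
        own = {k: v for k, v in sample.items() if v not in (None, "")}
        merged.append({**loc, **own})
    return merged
```

The same pattern applies to collection or event files by swapping the key column, so each higher-level entity is described once and inherited by every related sample.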

Table 3

Summary of preliminary issues and solutions encountered in assigning SESAR IGSN metadata to sample locations. While the most basic location information is included (e.g. latitude, longitude, and location description), our community needs more work on interoperability with standards that more fully describe site locations, such as metadata standards developed by the Open Geospatial Consortium. Location information in multidisciplinary ecosystem sciences covers samples as well as other entities, such as sensor infrastructure in monitoring networks and remote sensing data.


Location ID: If there is a project-specific site/location name, you must currently provide this in the free-text location description field. We therefore added LocationID as a field, which can be associated with metadata and does not need to be globally unique. Sample metadata contains location fields, but is not intended to fully describe sites/location information.

Location Hierarchies: We do not address a standard way to represent complex location hierarchies (e.g. basins, watersheds, wells, depths within wells), which is needed but is out of scope for the current effort.

Plot Name: Many projects are located in remote areas where GPS coordinates are not reliable and yet specific locations are necessary. Therefore, plots are formally defined and distance from specific points documented in the field using a relative reference system. Currently, users must describe this within the Location Description metadata field.

Uncertainty or precision of geographic coordinates: We could add a metadata field to provide detail on the uncertainty in the geographic coordinates, as done in Darwin Core. However, we found that participants sometimes do not have this information. Certain instruments (e.g. smartphones) do not provide an easy way to specify uncertainty. It may therefore be more efficient to simply indicate the specific instrument used, which provides information on the likely uncertainty or precision of the coordinates. Additional terms are needed to specify the instrument used.

Sampling feature/well type: There are no controlled vocabularies within the current IGSN template to characterize the type of well. We currently recommend providing this information in the free-text location description.

Box 1 Efficiency recommendations for large sampling campaigns

  1. Field Collection Apps: On- or off-line field collection apps can be programmed with standard metadata, and enable users to collect information directly in the field, such as automatically generated date/time and location. Apps could also generate and record PIDs in the field, and be paired with portable label printers.
  2. Label Material: For IGSNs to remain associated with the physical sample, it is useful to choose label material and adhesive that withstand extreme conditions (e.g. –80 °C freezer, water submersion). Some specific recommendations include: waterproof or cryogenic labels (e.g. https://www.labtag.com/cryogenic-labels/), vinyl or polyester labels (e.g. https://www.dymo.com/en-US/ind-permanent-polyester-labels-3-4-in), and Microcentrifuge Tube Tough-Tags®.
  3. Barcodes and APIs: Sample label barcodes and barcode readers could utilize an API to pull specific metadata from the IGSN record (e.g. sample type, location) to assist with downstream data analysis or processing, and/or to automatically add links to additional metadata or data as they are produced later in the life cycle of a sample.

Sample identifiers for tracking and linking

Researchers generally use their own meaningful sample name for internal sample tracking and individual data analysis workflows (Figure 2). Both the project-specific sample name and the IGSN should therefore be associated with digital records of the sample. The IGSN, as a globally unique PID, is better suited for automated sample tracking and linking related information over the data life cycle, from field collection to open-access publication. With IGSNs, related samples can be more clearly linked on the sample landing page (e.g. IGSN:IEWDR000X). Further, specific location or event IDs clarify common relationships for samples and derivatives in a project studying ecological processes at a given location, for example one involving plant litter, leaf, root, soil, and associated ‘omics samples.

To most effectively link samples, we recommend that all labs and data systems that generate or store sample data utilize the IGSN or other PID, adding it to metadata templates where relevant. Use of the SESAR API to obtain relevant information about samples can facilitate reuse of metadata across multiple labs or facilities. In theory, the IGSN could be used to automatically add links on the sample landing page to data generated at different facilities; however, no tools are currently available to enable automated linkages.

Improvements are needed to link environmental and associated biological samples. Genomic samples, for example, should be assigned a BioSample number when submitted for sequencing, and linked to the original field-collected sample where relevant (Table 2). There is currently no automated way to link such identifiers, so we recommend providing the full URL of the IGSN landing page in the source material ID field of the MIxS template (Table 2).
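As a minimal sketch of this recommendation, the following Python snippet builds a landing-page URL from an IGSN and places it in a MIxS-style row. The resolver base URL (`https://igsn.org/`) and the column names `samp_name` and `source_mat_id` are assumptions made for illustration; they should be checked against the allocating agent's documentation and the MIxS template actually in use. The helper names are hypothetical.

```python
from typing import Dict

# Assumed resolver base URL; confirm the correct landing-page pattern
# with the IGSN allocating agent before using it in real metadata.
IGSN_RESOLVER = "https://igsn.org/"


def igsn_landing_url(igsn: str, resolver: str = IGSN_RESOLVER) -> str:
    """Return a full landing-page URL for an IGSN such as 'IGSN:IEWDR000X'."""
    code = igsn.strip()
    # Drop an optional 'IGSN:' prefix so only the identifier itself remains.
    if code.upper().startswith("IGSN:"):
        code = code[len("IGSN:"):].strip()
    if not code or any(ch.isspace() for ch in code):
        raise ValueError(f"not a usable IGSN: {igsn!r}")
    return resolver.rstrip("/") + "/" + code


def mixs_row(sample_name: str, igsn: str) -> Dict[str, str]:
    """Build a MIxS-style row linking a sequenced sample to its source sample.

    The column names are assumptions; use the names in your MIxS template.
    """
    return {
        "samp_name": sample_name,
        "source_mat_id": igsn_landing_url(igsn),
    }
```

For example, `mixs_row("soil_core_01", "IGSN:IEWDR000X")` would carry both the project-specific sample name and a resolvable link to the field-collected sample, so the relationship survives even without automated linking between identifier systems.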

Discussion

Sample PIDs and metadata in Multidisciplinary Environmental Sciences

We advocate use of IGSNs for ecosystem science samples for a number of reasons. The IGSN is the only PID specifically designed for samples with associated metadata (), and the only one backed by an international community of experts dedicated to identifying, describing, and linking sample data (). Participation in the IGSN community will help improve the usefulness of sample PIDs and the relevance of associated metadata for multidisciplinary ecosystem science. Additionally, other large national agencies have adopted or plan to adopt IGSNs [e.g. United States Geological Survey (USGS), National Oceanic and Atmospheric Administration (NOAA), Commonwealth Scientific and Industrial Research Organisation (CSIRO)]. A recently funded effort, iSamples, will improve infrastructure for samples that use IGSNs and other sample PIDs, and eventually support global search for an even wider variety of sample types ().

Benefits to Data Contributors and Users

Funders of scientific research, such as the US DOE and the National Science Foundation (NSF), require robust data management and publication plans, which should include details for managing and tracking valuable sample data. These data are often poorly described and are missing key information needed to interpret and reuse them, leading to data loss (; ; ). The IGSN-ESS reporting format can assist ecosystem researchers in creating effective sample management plans and preserving their data.

More widespread use of sample PIDs and related metadata will help make sample data more FAIR (; ). Standard information characterizing the sample type, location, and date is particularly useful for finding relevant data (; ). Persistent landing pages for samples allow long-term access to sample (meta)data. Use of a controlled vocabulary for key metadata (e.g. sample material and environmental context) helps make data interoperable and more easily integrated across datasets. In addition, reuse often requires information on collection and processing methods (). Samples with standard metadata can be more easily shared with (i.e. understood and reused by) collaborators, which helps avoid situations where information is lost when people change institutions or retire (). High-quality published data increasingly help scientists achieve greater academic recognition and higher citation rates, and can lead to new opportunities for co-authorship and collaboration (; ).

Multidisciplinary ecosystem science often involves complex workflows, and sample PIDs and common metadata provide essential information to help users automatically track samples and add relevant data throughout the sample life cycle. PIDs (such as IGSNs and DOIs) are essential for tracking use of samples and related data over time (; ; ). This provides the foundation to build tools that automatically link and exchange this information across data systems, with no further input from the user after the initial metadata is provided.

Ecosystems research often relies on sample data combined with other data types, such as remote sensing and environmental sensor data, to answer questions about ecosystem response to increasingly rapid global changes (; ; ; ). One limitation is that our standards comparison was focused on sample-related metadata; we need more work towards incorporating standards suitable for other related entities, such as locations and sensors (; ; ).

More widespread standardization will help reduce the estimated 80% of effort currently spent on data wrangling for synthesis work, and enable more efficient data integration and analysis (). Improved sample data management and reuse will increase the pace of scientific discovery and accelerate new fields of enquiry (; ). Already, publicly available nucleic acid sequences have enabled scientists to build phylogenies and perform comparative genomics studies, and are now essential in community ecology (). Biodiversity records are regularly combined with climate and land use data to predict species distributions and biodiversity, and to explore multi-scale ecological patterns (; ; ; ). With our multidisciplinary reporting format, we can move beyond infrastructure supporting individual data types, towards efficiently integrating multidisciplinary data to understand ecosystem processes from molecular to global scales.

Conclusions

Summary of IGSN-ESS identifier and metadata recommendations

Many multidisciplinary projects have complicated workflows and need an efficient system for tracking samples as they are sent to different collaborators, labs, and user facilities, and then published online (Figure 1). Despite growing need and interest, there was previously no straightforward guidance on how to describe sample collections or multidisciplinary samples. We therefore recommend registering samples with IGSNs, using our modified metadata template for ecosystem sciences (IGSN-ESS; Figure 5). The downloadable template, complete definitions of all terms, and instructions for registering IGSNs with IGSN-ESS and for providing feedback are available in the ESS-DIVE community GitHub repository and the associated data publication ().

Figure 5 

Sample metadata for Environmental Systems Sciences (IGSN-ESS). Each sample metadata element is listed under a general category of information. Required fields are marked with an asterisk*. Fields added to IGSN metadata or revised from Darwin Core (DwC), MIxS, Environment Ontology (ENVO), Biological Collections Ontology (BCO), Plant Ontology (PO) are indicated in parentheses.

To avoid redundancy in describing samples with the same metadata, we add the optional practice of assigning common sample metadata to a collection, location, or event (; ; ; ). A collection ID provides a flexible way for projects to define common metadata for any set of related samples, while location ID can be used to describe project locations/sites, and event ID can describe metadata for a given sampling event (see Figure 4). These related IDs also provide an unambiguous way to automatically link commonly-related samples. This is particularly important for ecosystem science research, as diverse sample types often need to be clearly linked by specific related identifiers (e.g. location).
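The following Python sketch illustrates one way this inheritance could work: metadata defined once for a collection, location, or event is merged into each sample record that references it. The field names (`collection_id`, `location_id`, `event_id`, `sample_name`) and the precedence order are assumptions chosen for illustration, not part of the IGSN-ESS template itself.

```python
from typing import Any, Dict, List

# Hypothetical field names for the three kinds of shared-metadata identifiers.
# Later entries override earlier ones when the same key appears in both.
SHARED_ID_FIELDS = ("collection_id", "location_id", "event_id")

Record = Dict[str, Any]


def resolve_metadata(sample: Record, shared: Dict[str, Dict[str, Record]]) -> Record:
    """Return the sample's metadata with shared metadata filled in.

    `shared` maps each ID field name to a lookup of ID -> common metadata.
    Values set directly on the sample always take precedence.
    """
    merged: Record = {}
    for field in SHARED_ID_FIELDS:
        ref = sample.get(field)
        if ref is None:
            continue  # the sample does not use this kind of shared record
        try:
            merged.update(shared[field][ref])
        except KeyError:
            raise KeyError(f"{field} {ref!r} is not defined") from None
    merged.update(sample)
    return merged


def related_samples(samples: List[Record], field: str, ref: str) -> List[str]:
    """List the names of all samples that share a given related identifier."""
    return [s["sample_name"] for s in samples if s.get(field) == ref]
```

Under these assumptions, a soil sample and a leaf sample that both carry the same `location_id` would each inherit the site coordinates from a single location record, and `related_samples` would return both names when queried with that ID.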

For highly related subsamples with the same metadata, we recommend the option of ID extensions, which could be opaque or meaningful as long as they are unique (Figure 3). Subsample IGSN registration would be more efficient if the primary IGSN metadata could be updated to list the subsamples or replicates under a “subsample” field, instead of registering them separately. Alternatively, the IGSN resolution service could follow the practice of ARKs, where IGSNs with extensions (i.e. containment qualifiers, Supplemental Table 1) automatically resolve to the primary IGSN landing page.
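The ARK-like behaviour described above can be sketched in a few lines of Python. This is an illustration only: the `.` delimiter between the primary identifier and its extension is a hypothetical choice, the function names are invented for this example, and no such resolution service currently exists for IGSNs.

```python
from collections import Counter
from typing import Iterable, List, Tuple

# Hypothetical delimiter separating a primary identifier from its extension.
EXTENSION_DELIMITER = "."


def split_extended_id(identifier: str, delimiter: str = EXTENSION_DELIMITER) -> Tuple[str, str]:
    """Split an identifier into (primary, extension).

    The extension is an empty string when the identifier has none.
    """
    primary, _, extension = identifier.strip().partition(delimiter)
    if not primary:
        raise ValueError(f"identifier has no primary part: {identifier!r}")
    return primary, extension


def primary_id(identifier: str, delimiter: str = EXTENSION_DELIMITER) -> str:
    """Return the primary identifier whose landing page should be shown."""
    return split_extended_id(identifier, delimiter)[0]


def duplicate_ids(identifiers: Iterable[str]) -> List[str]:
    """Return identifiers that occur more than once, since extensions must be unique."""
    counts = Counter(i.strip() for i in identifiers)
    return sorted(i for i, n in counts.items() if n > 1)
```

With this scheme, subsample identifiers such as `IEWDR000X.rep1` and `IEWDR000X.rep2` would both map back to `IEWDR000X`, so only the primary sample needs its own registered metadata record.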

We added or revised fields and vocabulary terms to more accurately describe multidisciplinary samples, and support data linking and reusability (Figure 5). We include controlled vocabularies for relevant subsets of terms from ENVO, (; ), which improves description, search, and integration of a variety of multidisciplinary sample types using key fields (e.g. sample type, sample material, and environmental context; ). We selected terms based on an evaluation of their relevance and likelihood of being used in multiple contexts. We also found that use of ENVO for both local (physiographic feature) and broad (biome) environmental context (e.g. stream ENVO_00000023) is important to fully characterize soil, sediment, and water samples.

Promoting Adoption and other Next Steps

Most ecologists and environmental scientists now understand the importance of data archiving, but struggle to manage data effectively (; ). Removing even trivial barriers can increase the likelihood that researchers will adopt beneficial practices that take effort to achieve (). User-friendly guidance and sample metadata templates are an essential step in promoting standard practices that make data publishing, integration, and reuse easier. However, investments are also needed in training programs (), tools to assist with legacy data and analytical instrument systems, and improved data quality management systems that encourage good management practices throughout the research process (; ). We need tools that translate across existing metadata conventions and use sample and relationship metadata to automatically generate digital resource maps; this could promote adoption by helping users precisely document sample history and linkages to other PIDs and documents (; ). Global sample search (e.g. iSamples Central; ), with integrated results, based on key fields (e.g. sample material, location, environmental context, methods, and associated data variables/analyses) would greatly enhance sample data discovery and reuse, and is likely the most effective tool to promote widespread adoption of sample standards (e.g. GBIF; ).

Overcoming complex challenges that require communities to change behavior and provide standardized data will require a coordinated effort, best achieved through collaborations among key stakeholders who establish community consensus, enforce guidelines, and help solve problems (; ). These stakeholders include a variety of data contributors and users from different scientific domains, as well as laboratory facilities, repositories, funders, and publishers that take part in institutionalizing and rewarding good data management practices (; ; ; ). Community coordination on sample reporting conventions and linked cyberinfrastructure will help solve data management problems, expand access pathways, and make our sample data more useful over time.

Data Accessibility Statement

Data and recommended metadata guidelines generated as part of this work are published in the ESS-DIVE repository (Damerow et al. ()), and future updates will be managed and made available through our community GitHub repository (https://github.com/ess-dive-community/essdive-sample-id-metadata).

Additional Files

The additional files for this article can be found as follows:

Supplemental Table 1

Comparison of ARK and IGSN sample identifier characteristics. DOI: https://doi.org/10.5334/dsj-2021-011.s1

Supplemental Table 2

Overview of existing sample-related standards and templates. DOI: https://doi.org/10.5334/dsj-2021-011.s2

Supplemental Table 3

Translation table comparing standards and templates related to sample metadata. DOI: https://doi.org/10.5334/dsj-2021-011.s3

Supplemental Table 4

Summary of projects involved in IGSN and standard metadata pilot test. DOI: https://doi.org/10.5334/dsj-2021-011.s4