1. Introduction

Metrics are measures that produce quantifiable information. Their applications are many, ranging from monitoring data system performance to benchmarking the success of a project or mission. Metrics (e.g., Table 1) are routinely collected in data repositories and provided to data providers, mission or project management, scientists, and software engineers to analyze (e.g., compare, benchmark) and track a range of performance-related activities, such as system performance, data access and usage in research and applications, allocation of information technology resources, and benchmarking the success of a mission or project (e.g., Par et al. 2019). In short, developing, collecting, and analyzing metrics is essential to better support Earth science research, applications, and education.

Table 1

Four major types of metrics collected at GES DISC.


TYPE OF METRICS | METRICS

Key Metrics | The operational distribution metrics recording overall user data/service access and download activities for the following three major groupings:
  1. Number of Distinct or Registered Users
  2. Number of Distributed Data Files
  3. Size of Distributed Data Volume
in four categories: Country (e.g., United States, Canada), Protocol (e.g., HTTPS, OPeNDAP), Project (e.g., TRMM/GPM, MERRA-2) and Domain (e.g., ‘.edu’, ‘.gov’).

Bugzilla Ticket Metrics | Collecting and retrieving significant and useful information from user questions or feedback in User Assistance tickets:
  1. User Background:
    1. Who they are: Researcher; Professor/Graduate Students; Industry; etc.
    2. Where they come from: USA; Africa; Asia; Australia; Europe; the Middle East; etc.
  2. Number of User Assistance Tickets: Monthly; Seasonal; Yearly distributions (from routine daily collections)
  3. Application/Study: Hydrology; Atmospheric Chemistry; Oceanography; etc.
  4. Portal: Giovanni; MERRA-2; TRMM/GPM; etc.
  5. Data Variable: Air Temperature; Wind Fields; Precipitation; Aerosol; etc.

Giovanni Publication Metrics | Collecting and gleaning significant and useful information from journal publications by Giovanni users with regard to:
  1. Applied Variable: Atmos. Aerosol; Precipitation; Air Temperature; etc.
  2. Product Source: TRMM/GPM; MODIS; MERRA-2; etc.
  3. Studied Subject: Hurricane; Aerosol/Dust; Rain/Water Vapor; etc.
  4. Studied Temporal Period: Long-term; Mid-term; Short-term
  5. Studied Spatial Domain: Global; Regional; Local
  6. Studied Region: Continents; Oceans; Countries; Lakes; etc.
  7. Journal Origins: America; Europe; Asia; Middle East; International/Open Access; etc.

Website Metrics (via Google Analytics) | Collecting useful information on user website access via the Google Analytics tool.
  1. Dataset Keyword search: “rainfall”; “TRMM”; “merra-2”; “trmm”; “GPM”; etc.
  2. Information Keyword search: “precipitation”; “trmm”; “rainfall”; “merra-2”; “Giovanni measurements”; etc.
  3. Content Type search: “Data Collection”; etc.
  4. Traffic and/or Referral Sources: Direct Access; Google; etc.
  5. Datasets subsetted/downloaded directly from search results page: “trmm_3b42_v7”; “m2t1nxslv_v5.12.4”; “m2i3npasm_v5.12.4”; etc.
  6. Most Sorted Columns: “begin date”; “time res.”; “end date”
  7. Most Browsed Categories: “subject”; “measurement”; “source”; “project”; “spatial resolution”; “temporal resolution”
  8. Most Searched Content Types: “data collections”; “data documentation”; “image gallery”; “how-to’s”; “tools”; “faqs”

As one of the largest repositories of Earth science data in the world, NASA’s Earth Science Data and Information System (ESDIS) Project supports twelve Distributed Active Archive Centers (DAACs). Standard metrics have been developed and collected by the ESDIS Metrics System (EMS) for routine analysis at each DAAC. Other metrics, such as user satisfaction, are also collected for both ESDIS and each DAAC.

As the total data volume is expected to grow rapidly and technologies (e.g., cloud computing, AI/ML) continue to improve data discovery and accessibility, opportunities for developing new data services for the Earth science community will also arise. However, developing metrics for such services has become a challenge, because multiple datasets are often needed in this era, while current metrics are designed for a single predefined dataset or service, a disadvantage for collecting metrics for interdisciplinary data services.

In this paper, we use one of the DAACs, the NASA Goddard Earth Sciences Data and Information Services Center (GES DISC), as an example to assess the current status of metrics. We discuss challenges and opportunities, with recommendations for developing metrics for interdisciplinary data and services in conjunction with the FAIR guiding principles and community recommendations.

The structure of the paper is as follows: Section 2 overviews existing datasets and services for collecting metrics; Section 3 lists current metrics, collection methods and operations, along with examples; Section 4 discusses current issues and future needs for new metrics; and Section 5 provides the summary and recommendations.

2. Datasets and Services at GES DISC

The GES DISC provides a large number of NASA Earth science multidisciplinary datasets (i.e., atmospheric composition; water and energy cycles; climate variability; carbon cycle and ecosystems) to research, application, and education communities across the globe. Datasets from several well-known NASA satellite missions and projects are included, such as the NASA-JAXA Tropical Rainfall Measuring Mission (TRMM), and the Modern-Era Retrospective-analysis for Research and Applications (MERRA). As of this writing, more than 1200 datasets are archived at GES DISC.

The GES DISC homepage (Figure 1) is the primary gateway for accessing datasets and relevant information (e.g., technical documents). The homepage contains Web components that include: 1) dataset search across data collections; 2) links to dataset-related publications; 3) access to data tools; 4) archives of dataset-related information (news, alerts, service releases, glossary); and 5) libraries of supporting information (FAQs, Data in Action articles, How-To’s). Several search functions have been developed to help refine and sort search results (e.g., refining by subject, measurement, source, data processing level, project, and temporal and spatial resolutions).

Figure 1 

The homepage of GES DISC with search capabilities for datasets, tools, documentation, alerts, data releases, news, FAQs, publications and more.

Each dataset at GES DISC has its own landing page (e.g., Figure 2), which serves as a one-stop ‘shop’ for data services and information. The dataset landing page is still evolving; at present, it provides: 1) data access methods; 2) a brief dataset summary; 3) dataset citation (e.g., Digital Object Identifier, DOI); and 4) key dataset documentation (e.g., the Algorithm Theoretical Basis Document, also known as ATBD).

Figure 2 

An example of the dataset landing page for the popular NASA Integrated Multi-satellitE Retrievals for GPM (IMERG) monthly dataset. The one-stop ‘shop’ design allows easy access to data and dataset-related information.

3. Existing Systems and Software for Collecting Metrics

3.1. Framework of Collecting Metrics at GES DISC

A framework for collecting metrics has been developed at GES DISC (Figure 3). Four major types of metrics (Table 1) are collected and analyzed at GES DISC, including: 1) key metrics (Figure 4), extracted from the operational distribution metrics, recording overall user data/service access and download activities and submitted to EMS; 2) Bugzilla user ticket metrics (Figure 5), from User Assistance tickets; 3) Giovanni (Liu and Acker 2017) publication metrics (Figure 6), from journal publications by users acknowledging their Giovanni usage; and 4) website metrics (via Google Analytics) (Figure 7), capturing information on user website access.

Figure 3 

A schematic of four “correlated” metrics at GES DISC. More details with examples are shown in Figures 4, 5, 6 and 7.

Figure 4 

Key metrics. A schematic of the collection workflow (top), and the yearly (FY2010–FY2019) distributions of distinct users/IPs (middle), data files (bottom left), and data volume (bottom right).

Figure 5 

Bugzilla metrics – user assistance tickets. Top: a schematic of the collection workflow. Bottom: monthly (2013–2018, left) and yearly (201301–201909, right) ticket distributions presented from two different perspectives.

Figure 6 

Giovanni publication metrics. Top: a schematic of the collection workflow. Middle: Monthly publication distributions of diverse disciplines (left) and the respective distributions of individual disciplines (right) for FY2019. Bottom: Yearly publication distributions for Y2004-Y2019* [*projected to Dec 2019].

Figure 7 

Website metrics: standard “out-of-the-box” metrics (via Google Analytics). Top: a schematic of the collection workflow. Bottom: a workflow to generate GES DISC website custom metrics reports.

3.2. NASA ESDIS Metrics System (EMS)

ESDIS EMS establishes requirements and methods for each DAAC to collect data activity and usage metrics. The metrics, analyses, and reports are generated and provided to NASA management to inform the best allocation of resources for the scientific user community. Metrics are also analyzed at DAACs to better understand how end users use their data products, which can guide data centers in improving data services.

Digital analytics products (e.g., Google Analytics 360) not only include traditional website analytics (e.g., IBM NetInsight) but can also acquire and analyze metrics from many other sources, such as social media and mobile devices. At present, Google Analytics 360 is used at EMS for digital analytics. In short, digital analytics products provide additional information for data service operations and decision makers.

3.3. Collecting Metrics for EMS at GES DISC

Data product collection metadata serve as a key element in the GES DISC reporting capabilities. Product attributes consist of metadata such as instrument, mission, product level, and discipline, all describing the characteristics of a data product. This metadata information, together with the product search term information, is linked or “mapped” to each record within a distribution file based on unique patterns contained within the record (a minimal sketch of this mapping follows Table 2). The accuracy of EMS reports is highly dependent on comprehensive, consistent, and timely updates to the product attributes. The required fields of the collection metadata and their search terms are listed in Table 2. They are used to collect metrics information from all data ingest, archive, and distribution interfaces throughout EOSDIS for analysis and reporting.

Table 2

Required fields of the EMS collection metadata and its search terms.


FIELD NAME | DESCRIPTION | MAX LENGTH | EXAMPLE

product | This is a product identifier or the short name of the dataset… | 80 | AIRIBRAD

metaDataLongName | Identification of the long name associated with the collection or granule. | 1024 | AIRS/Aqua infrared geolocated radiances

productLevel | NASA data processing levels (i.e., 0, 1, 1A, 1B, 2, 3, 4). | 10 | 1B

discipline | Designates the scientific area of application (i.e., Ocean, Atmosphere, Land, Cryosphere, Volcanic, Solar, Raw data, Radiance). | 500 | Atmosphere

mission | An operation to provide scientific measurements with space-based and/or ground-based measurement systems (i.e., platforms, satellites, field experiments, aerial measurements, etc.). For a multi-mission product, list all missions separated by a semi-colon (;). The primary mission should be listed first. Each mission should have one or more instruments associated with it. If there are multiple missions and multiple instruments, the relationships between the missions and instruments should be defined. | 80 | Aqua

instrument | A collection of one or more sensor instruments providing scientific measurements. For a multi-instrument product from one mission, all instruments are listed and separated by a comma (,). If the product (e.g., a combined product) involves multiple missions and multiple instruments, the instruments from each mission are separated by a semi-colon (;). The order of instruments should be in the same sequence as the mission field. If not applicable, enter “N/A”. (NOTE: the number of missions entered must pair evenly with the number of instruments delimited by “;”, i.e., if two missions are entered, “mission1;mission2”, then at least two instruments: “instrument1;instrument2”, “N/A;N/A”, or “instrument1a,instrument1b;instrument2a,instrument2b”, etc.) | 80 | AIRS

processingCenter | Data center where this product was generated. | 80 | GESDISC

archiveCenter | Data center where the data product is archived. This value is usually ‘GESDISC’. | 50 | GESDISC

eosFlag | Flag to indicate whether the data product is an EOS or Non-EOS product. Values: E for EOS and N for Non-EOS. | 1 | E

productFlag | Flag denoting the type of product. Values: 1 = Data Product, 2 = Instrument Ancillary, 3 = System/Spacecraft, and 4 = External. For a non-ECS product, use the value 1. | 1 | 1

publishFlag | Flag to indicate whether the product and its associated granules are to be published to EMS. This value is usually ‘Y’. | 1 | Y

searchTerm | File name, directory, path, ESDT, Data Provider internal product IDs, or other information that uniquely identifies a data product as it appears in an EMS data file. The searchTerm should not include URL query strings and associated name-value pairs. searchTerms can include full strings or substrings. Values within this field are always treated as regular expressions (e.g., ‘.+MOD1[1-9].+’); therefore, reserved grep/egrep characters should be used only when they are needed. By default, the product short name is used. Let OPS staff know if any specific pattern needs to be added to a product. | 200 | AIRIBRAD

dataSource | Assigns the data source (e.g., the system, subsystem, file, table, or other identifying information) where the logs/flat files/metadata are generated (e.g., airs, aura, disc, reason, urs). Currently, GES DISC has identified five data source groups, each with its associated hosts listed below. | 50 | airs
  • ‘airs’ – airscal1u, airscal2u, airspar1u, airspro2u, airspro3u, airspro5u, airsraw1u, airsraw2u, airsraw3u, rep2u, rep1
  • ‘aura’ – acdisc, aurapar1u, aurapar2u, auraraw1u, goldsfs1u, goldsmr1, goldsmr2, goldsmr3, rep5u
  • ‘reason’ – reason, neespi, atrain, agdisc, hydro1
  • ‘disc’ – disc1, disc2, disc3, tads1u, gdata1, gdata2, rep3, rep4
  • ‘urs’ – discnrt1, discnrt2
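To illustrate how the searchTerm field drives this mapping, below is a minimal Python sketch. The product attribute records and the log line are made up for illustration, and the helper function is hypothetical; only the regular-expression treatment of searchTerm reflects Table 2.

```python
import re
from typing import Optional

# Hypothetical product attribute records following Table 2; each searchTerm
# is treated as a regular expression (by default, the product short name).
PRODUCT_ATTRIBUTES = [
    {"product": "AIRIBRAD", "searchTerm": r"AIRIBRAD"},
    {"product": "MOD1x", "searchTerm": r".+MOD1[1-9].+"},
]

# Compile the search terms once; every distribution record is tested against them.
COMPILED = [(re.compile(p["searchTerm"]), p["product"]) for p in PRODUCT_ATTRIBUTES]

def map_record_to_product(record: str) -> Optional[str]:
    """Map one distribution-log record to a product via its searchTerm pattern."""
    for pattern, product in COMPILED:
        if pattern.search(record):
            return product
    return None  # unmapped records would reduce the accuracy of EMS reports

# Example with a made-up log line containing an AIRIBRAD granule path:
print(map_record_to_product("/data/AIRS/AIRIBRAD.005/2019/001/granule.hdf"))
```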

3.4. American Customer Satisfaction Index (ACSI) Reports

The metrics for user activities described so far are passively collected from the servers at GES DISC. By contrast, the ACSI survey, initiated by ESDIS and tasked to the Claes Fornell International (CFI) Group, has been conducted proactively every year since 2004 among the user community of the NASA DAACs. The ACSI survey measures user satisfaction with NASA EOSDIS data services to identify key areas for continuously improving data services to users, and to track trends in user satisfaction with each DAAC.

4. Challenges and Opportunities

Table 1 shows that the key and website metrics collected at GES DISC are likely to be found in other data repositories as well. The other two types, namely Bugzilla ticket and Giovanni publication metrics, depend on whether similar services are available. Nonetheless, the current key metrics at GES DISC are, in general, designed for a single dataset or service. Interdisciplinary research and applications, which necessitate the use of multiple datasets, have increasingly relied on data from multiple sources (e.g., satellites, models, in situ observations). For example, multiple datasets from different disciplines (e.g., meteorology, hydrology, oceanography) are often needed in tropical weather and climate research and applications. For interdisciplinary research, datasets (involving multiple satellite missions or projects) are often acquired from multiple DAACs or even from multiple domestic and/or international organizations (e.g., NOAA, NSF, ESA). Accordingly, adequate and meaningful correlated metrics associated with multiple datasets, along with methods for collecting them, should be further considered and developed.

4.1. Interdisciplinary and Integration Challenges

As technologies evolve and scientific requirements change with time, new Earth science data services have correspondingly been developed and improved. For example, the cloud computing environment improves the capability and potential to develop and provide numerous new data services for handling large-volume remote sensing or model datasets, services that may otherwise be difficult to implement ‘on premises’. Eventually, all datasets at the twelve DAACs will be made available in the cloud environment. Such an environment offers new opportunities to develop customized dataset services capable of generating on-the-fly datasets from one dataset and/or merging several datasets from one DAAC or even multiple DAACs. A new design for such customized datasets is possible by organizing multiple datasets into different groups (e.g., netCDF-4 groups) to form a new dataset, as sketched below. However, existing single-dataset-based data product development guides need to be expanded for this new dataset design.
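Below is a minimal Python sketch of the group-based design, assuming the netCDF4 library and two hypothetical source collections merged into one customized product; the file, group, and variable names are illustrative, not an actual GES DISC product layout.

```python
from netCDF4 import Dataset

# A minimal sketch: place each source dataset in its own netCDF-4 group so
# that several collections can be merged into one customized product file.
with Dataset("customized_product.nc4", "w", format="NETCDF4") as nc:
    for name in ("precipitation", "air_temperature"):
        grp = nc.createGroup(name)           # one group per source dataset
        grp.createDimension("time", None)    # unlimited time dimension
        var = grp.createVariable(name, "f4", ("time",))
        var.units = "mm/hr" if name == "precipitation" else "K"
```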

In the current EMS metrics, metadata are required to be defined in advance in order for metrics to be collected from server logs. With more on-the-fly datasets expected to be available in future data services, developing this kind of metrics will likely become quite a challenge, as those datasets lack full definitions prior to use. Rauber et al. proposed a scalable dynamic data citation methodology with three core recommendations for data (versioning, timestamping, and identification), which can be considered for use with on-the-fly datasets; a minimal sketch of the idea follows.
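The following Python sketch applies those three recommendations to an on-the-fly subset: the subsetting query is canonicalized, timestamped, and hashed into a stable identifier. All names and the hashing scheme are illustrative assumptions, not the implementation of Rauber et al.

```python
import hashlib
import json
from datetime import datetime, timezone

def cite_dynamic_subset(query: dict) -> dict:
    """Version/timestamp an on-the-fly subsetting query and derive a stable ID."""
    issued = datetime.now(timezone.utc).isoformat(timespec="seconds")  # timestamping
    canonical = json.dumps(query, sort_keys=True)                      # versioned query
    pid = hashlib.sha256(canonical.encode()).hexdigest()[:16]          # identification
    return {"query": query, "timestamp": issued, "query_pid": pid}

# Example: an on-the-fly subset defined by variable, region, and time range.
print(cite_dynamic_subset({
    "variable": "precipitation", "bbox": [-10, 30, 10, 50],
    "time": ["2019-01-01", "2019-12-31"], "dataset_version": "V06",
}))
```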

The four major metrics groups at GES DISC have been found to be correlated with each other to a certain degree. It is necessary to integrate these metrics groups for a holistic analysis. Additional dataset-related metrics are needed, in particular science-related metrics. For example, dataset citations are a key measurement of impact in the scientific community. Initial efforts have been carried out at GES DISC to add dataset-related publications to the individual dataset landing pages; however, it remains a challenge to include and sort all related refereed and non-refereed publications, as well as publications in which multiple datasets (not only the individual dataset in question) are involved. For very popular datasets, such as the Global Precipitation Climatology Project (GPCP), which can have well over 1000 citations, proper management capabilities (e.g., for sorting and filtering) may also need to be developed.

4.2. Data Quality Issues

Attention to and studies of data and information quality have grown significantly, especially during the past two decades. Without data quality being genuinely assessed and its information being facilitated to the user community in a timely manner, it is difficult for users to select and use the right dataset in research and applications. This should be another important area for metrics to address. Data quality consists of two main components: 1) the quality information itself, generated by dataset or service providers, and 2) the dissemination of that quality information to users. Ramapriyan et al. introduced and defined four components of information quality (scientific, product, stewardship, and service), aiming to promote consistent quality assessment and quality information on data products and services for the Earth science community. However, many obstacles remain in providing and effectively facilitating such comprehensive information to the user community in a timely manner.

For example, obtaining reliable scientific data quality information for satellite-based global datasets can be difficult. Researchers or application practitioners (e.g., in flood forecasting or crop monitoring) around the world often seek and acquire existing data quality information from the available (limited) ground validation results of previous data quality assessment studies, such as reports or publications. Conducting ground validation activities on a global scale is a challenge because in situ observations are very limited, especially in remote regions and over vast oceans. For datasets relying on multiple satellite observations, obtaining and providing data quality information is even more challenging. Nor is data quality easy to define for derived datasets (e.g., daily products derived from 3-hourly ones).

There are also challenges in developing, implementing, and presenting data quality information in (or along with) datasets. At present, there is no widely implemented community standard for representing data quality parameters or variables, along with their metadata, within a dataset (one possible representation is sketched below). Over the years, the Data Quality Working Group (DQWG), one of NASA’s Earth Science Data System Working Groups (ESDSWG), has developed several documents of data quality recommendations addressing different areas of need and offering useful information and suggested guidance. More efforts (e.g., via education, training, and project management) are needed to implement, continuously evaluate and optimize, and eventually disseminate these recommendations.
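As one possible representation, the following Python sketch embeds a quality flag variable using the CF-convention flag_values and flag_meanings attributes; the variable names, flag scheme, and file are illustrative assumptions, not a GES DISC or DQWG standard.

```python
import numpy as np
from netCDF4 import Dataset

# A minimal sketch: store per-timestep quality flags alongside the data,
# using CF-style flag attributes so tools can interpret them consistently.
with Dataset("precip_with_quality.nc4", "w", format="NETCDF4") as nc:
    nc.createDimension("time", 4)
    qc = nc.createVariable("precipitation_qc", "i1", ("time",))
    qc.standard_name = "status_flag"
    qc.flag_values = np.int8([0, 1, 2])       # allowed flag codes
    qc.flag_meanings = "good suspect bad"     # one meaning per flag value
    qc[:] = np.int8([0, 0, 1, 2])             # hypothetical quality flags
```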

4.3. Application and Research Metrics

Data (and/or service) application metrics, collected from users’ feedback on data and/or service usage, are useful for data providers and project management. For new or novice users, existing applications can serve as examples of using the datasets. Application metrics might be difficult to obtain, however, because they require users to actively report their activities in a timely manner and with some detail. Most NASA Earth observation datasets are global, and many applications (e.g., monitoring crop conditions, landslide prediction, environmental hazard and disaster management) around the world rely on NASA satellite data for development and operations. However, collecting such application metrics has been a challenge, especially from private industries, where such information can be proprietary.

Nonetheless, a new design for comprehensive metrics, especially those related to scientific activities (e.g., citation, data quality, application), is needed, and it requires all involved parties (e.g., DAACs, stakeholders) to participate and set up the requirements. The National Research Council has developed principles for developing metrics in the climate change science program. One simple and crucial principle is that data metrics need to be uncomplicated and easy to understand, along with a few other principles involving funding, leadership, planning, and implementation. Providing straightforward data metrics is a challenge that requires many iterations and collaborations among software developers and stakeholders (e.g., those using the metrics for various purposes) to ensure that data metrics are easy to use and benefit stakeholders’ activities.

4.4. Metrics Interoperability

Metrics interoperability is an important area, especially when datasets come from various sources or repositories. Without interoperability and the new metrics it enables, more time will be spent on data processing (e.g., format conversion), which is less efficient, and it will remain difficult to evaluate the experience and success of multidisciplinary researchers and their research collaborations. The EMS standards have been a good example of ensuring interoperability in metrics among the DAACs, and they are a model for integrating, enhancing, and developing new metrics. On a larger scale (e.g., interagency and international), standardization of metrics is necessary to ensure interoperability. For example, the COUNTER project enables publishers and vendors to report standardized and consistent usage of their electronic resources (e.g., data usage); a sketch of such a standardized usage record follows.
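As a rough illustration of standardized usage reporting, the following Python sketch builds a COUNTER-style usage record; the field names and metric types are loosely modeled on COUNTER concepts and are illustrative assumptions, not the actual COUNTER schema.

```python
import json

# A minimal sketch of a standardized, machine-readable usage record that two
# repositories could exchange; all identifiers and counts here are made up.
usage_record = {
    "report_id": "dataset-usage",            # hypothetical report type
    "dataset_id": "doi:10.5067/EXAMPLE",
    "period": {"begin": "2019-09-01", "end": "2019-09-30"},
    "performance": [
        {"metric_type": "unique_investigations", "count": 1250},
        {"metric_type": "total_requests", "count": 8421},
    ],
}
print(json.dumps(usage_record, indent=2))
```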

4.5. User Surveys

At present, it is common practice to passively collect metrics by recording user activities (e.g., website visits, data downloads, and user inquiries), which may not sufficiently reflect users’ opinions or feedback, e.g., when a user’s choices are limited to an individual DAAC that may not provide a vitally needed dataset. GES DISC occasionally receives user feedback or suggestions through the user support service or Bugzilla tickets, but such feedback may be biased due to the limited sample size, essentially reflecting the viewpoints of individuals. On the other hand, active user surveys often receive low response rates, which is a major challenge. Short and focused surveys in different forms may be more effective, receive better response rates, and thus should be considered as alternatives.

4.6. Dissemination of Metrics

Nowadays, it is imperative to develop and provide an all-in-one dataset landing page (e.g., Figure 2) for users who look for related services and information in one place. Over the years, and especially recently, software engineers and scientists at GES DISC have been working to integrate dataset information (e.g., DOI, documents) and services in one place for easy discovery and access, in contrast to several years ago, when information and services were scattered across different places and difficult to find or remember. More dataset-related items (e.g., FAQs, How-To’s, Data-in-Action articles) have been added to the dataset landing page. Likewise, metrics need to be integrated from different sources and made available on the dataset landing page. Currently, some of these metrics are only available internally and are submitted to EMS, where metrics from other DAACs are integrated. NASA EMS provides an annual report for the entire data system and for each individual DAAC. However, detailed information about an individual dataset is not available in this annual report. A data provider or a project principal investigator (PI) who develops the dataset algorithm has to make a special request for the associated dataset metrics to the DAAC where the dataset is archived. Since mid-2016, user registration has been routinely required to download NASA Earth science data, so it should now be more feasible to produce more accurate usage metrics and make them available on the dataset landing page (e.g., Figure 2) for the respective data providers or PIs, or even future users, to fetch.

As metrics increase in volume and variety, a Big Data approach that integrates different types of metrics from different sources for ensemble analysis is needed to better understand these metrics and reveal their possible interwoven correlations. Han et al. conducted a case study of metrics collected from the CEOS (Committee on Earth Observation Satellites) federated catalog service, with emphasis on catalog service integration. Their integrated and deployed metrics reporting system provides insightful information for different users, such as stakeholders and developers. Google Analytics is another example of providing on-the-fly data analysis and visualization for website metrics, as mentioned earlier. Similarly, newly improved or developed tools should be able to integrate metrics from different sources (e.g., servers, data quality, and citation); a minimal sketch of such an integration follows.
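As a rough illustration, the following Python sketch (using pandas) joins download, ticket, and publication metrics on a common dataset identifier for a holistic view; all column names and values are made up.

```python
import pandas as pd

# Hypothetical metrics from three of the sources discussed above.
downloads = pd.DataFrame({
    "dataset": ["TRMM_3B42_V7", "M2T1NXSLV_V5.12.4"],
    "files_distributed": [120000, 95000],
})
tickets = pd.DataFrame({
    "dataset": ["TRMM_3B42_V7", "M2T1NXSLV_V5.12.4"],
    "user_tickets": [85, 40],
})
publications = pd.DataFrame({
    "dataset": ["TRMM_3B42_V7"],
    "papers_citing": [310],
})

# Merge the three metric sources on the dataset identifier; a left join keeps
# datasets that have no publications yet.
merged = downloads.merge(tickets, on="dataset").merge(
    publications, on="dataset", how="left")
print(merged)
```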

5. Summary and Key Recommendations

Using NASA GES DISC, a multidisciplinary data center, as an example, we have described current metrics for satellite data products and services. The EMS standards are used at each DAAC to collect dataset metrics, requiring registration of dataset attributes in advance. Each DAAC also participates in an annual ACSI survey of its user community.

The pace of scientific progress is rapid, and scientific objectives are dynamic. Meanwhile, technologies continue to evolve. Metrics activities, including definition, collection, analysis, and visualization, need to keep up with these changes.

Key recommendations from the discussion are:

  1. Current predefined metrics are collected mainly for a single dataset or service. Expansion of metrics is needed to support interdisciplinary activity and on-the-fly data services. Recommendations from community efforts can be leveraged.
  2. Data quality metrics are important for research and applications. However, very few metrics or limited quality information is currently available, especially for satellite products with global coverage. More efforts are needed from data product developers and providers.
  3. Application and research metrics are an important part of the information for users, data product developers and providers, management, etc. Collecting such information is a challenge, requiring broad participation from product users and publishers as well as a system to report and collect such information.
  4. Interoperability for metrics is a challenge. The NASA EMS standards may not be interoperable with other data repositories. When developing metrics, it is important to have the FAIR guiding principles as a guideline to optimize or maximize the use of metrics. Existing efforts (e.g., COUNTER) can also be leveraged.
  5. Metrics need to be integrated to form a holistic view. User-friendly tools for analysis and visualization are also needed.
  6. For the Earth science data community as a whole, the challenges are even bigger, requiring all associated stakeholders to collaboratively identify the problems inherent to this scientific endeavor and to work together on common solutions and standards.

Computer Code Availability

No code or software has been developed for this research. Google Analytics 360 was used to collect and generate the GES DISC website custom metrics shown in Figure 7.