1. Introduction

As datasets enter the scientific record, citations connect published data to a larger research network (). Data citations establish precedence for results, provide evidence signaling the quality and significance of research findings, and make it possible to study how researchers use existing data. Citation analysis relies upon the standardization of citations to assess scholarly communication patterns, such as the reach or visibility of ideas across scientific disciplines (e.g., through paper citation networks). Citation indexes of academic papers, like the Science Citation Index (), allow researchers to understand who is highly cited, which published work is highly cited, and which publication outlets are prominent. Recent audits of bibliometric networks have also revealed inequalities, suggesting that citations are not objective (). For instance, citation rates vary by research topic and author race and gender, suggesting that social factors play an important role in researchers’ awareness of published work and decisions to cite it ().

While the infrastructure for studying citation trends for research publications is robust, three main challenges limit the large-scale analysis of data citations. The first challenge is the unambiguous identification of data references. While there are well-established systems for referencing the work of others (), many authors still fail to cite data, instead referring to it informally in their writing, despite data repositories’ guidance on best practices for data citation () and pressure from funders and publishers to ‘make data count’ (). When the burden of linking data to publications falls largely on the author, this often results in partial or inconsistent references to datasets in research articles (). Informal data citation practices make it challenging for readers to understand which data the authors accessed and whether they analyzed data or simply described them ().

A second challenge involves understanding the intent of a data citation. Bibliometric analysis often treats citations as something that can be standardized and universally interpreted as conferring legitimacy to published work (). However, like citations of academic literature, researchers cite data for different purposes. Existing citation typologies account for the variety of reasons that researchers cite materials (e.g., to persuade, to critique, to contrast). Many researchers communicate their findings through empirical studies in which they make claims tied to other scholarly products, including published data. Authors’ claims range in specificity from explicit to implicit and are often supported by data ().

A third related challenge involves inferring the quality of citations. Bibliometric measures for quantitative impact assessment, such as the h-index, indicate the popularity or visibility of a source () but say little about the nature of engagement surrounding it. Prior studies of citations to academic literature distinguish surface citations from those that engage deeply with the source material (; ; ; ) and help determine the purpose or polarity of citations (; ; ; ). For example, citations that pay homage (e.g., to one’s mentors or other influential researchers) also create cumulative advantages, where the best-known researchers receive far more credit for their work (). Thus, the number of citations a source has received does not indicate the purpose of the citations and may not be a reliable proxy for research quality ().

Given the challenges associated with analyzing data references, this study takes a qualitative approach to identify the types of, reasons for, and interactions involving social science data reuse in scientific research. Prior studies of data reference have focused on formal bibliographic citation (; ; ; ). By contrast, we closely analyze even oblique data mentions in papers—sentences in which a dataset, or part of a dataset, is named but not formally cited. We find that data references perform a limited set of functions, which we define in a typology that captures and describes the variety of ways researchers refer to data. We then apply the typology to analyze the use of research data in social science publications.

2. Background

2.1 Defining and citing data

The terms ‘data’ and ‘datasets’ have various meanings depending on their context. Often, what becomes ‘data’ is determined by scientists’ choices as they interact with and record observations. ‘Data,’ then, are a byproduct of interpretation and can be practically understood as ‘referring only to that which is analyzed’ (). Part of this challenge in defining ‘data’ relates to their ‘unruly’ and ‘poorly bounded’ identities, which makes it difficult for them to function as digital objects that can be readily referenced and retrieved (). There is also disagreement on the use of the term ‘dataset’ in technical and scientific literature, which presents challenges for data sharing and preservation (). ‘Data’ are abstract arrangements of symbols that express content; ‘datasets’ are made up of multiple data-bearing entities and may contain additional contextual information about data, such as collection methods (; ). In our analysis, we focus on archived social science datasets that include contextual metadata and documentation, which have been produced and shared by others for research purposes.

Capturing the relationships between datasets and other scholarly works is critical for giving credit to datasets. Data ‘references,’ ‘mentions,’ and ‘citations’ signal importance and enable credit through attribution (). While the terms ‘references’ and ‘citations’ are often used interchangeably, ‘references’ generally indicate that the work is listed in the reference section of a publication (). The term ‘citation’ implies the use of a persistent identifier (PID), which carries a more formal connotation in bibliometrics than ‘references’ or ‘mentions’ (). Citations with PIDs link published works to their usage contexts, enabling the verification and reuse of existing scientific analyses (; ). While authors are beginning to use digital object identifiers (DOIs) to reference data, best practices for when and why they should do so are not widely followed (). For example, a recent review of literature citing datasets using their DOIs found that many authors cited data that they had mentioned or described (e.g., data collection methods) but which had not been re-analyzed or used in other ways (). In our work, we use the general term ‘data reference’ to cover informal data mentions (i.e., use of a dataset name only), formal citations (i.e., the inclusion of an APA-style citation for a dataset DOI), and descriptions of data use.

2.2 Citations in scholarly communication

Citation analysis within scientific disciplines reveals information flows and brings together separate strands of information to construct ‘consensus models’ of subjects within science (). Citations reflect influences on authors, and citation patterns trace communication across active research networks (). Citations function differently at the micro and macro levels. At the micro level, citations indicate professional relations and function as rewards, while at the macro level, groups of citations function as concept symbols that codify knowledge in hierarchical social networks (). When cited, papers can be invoked as symbolic of the ideas expressed in their text (). Citations are powerful in that they are persistent and take on a separate identity from the people involved in their creation. They are ‘speech acts,’ which are brief statements that endure in documents and can be inspected over time (). Cited sources often substantiate statements or assumptions, point to further information, acknowledge previous research in the same area, and draw critical comparisons indicating the quality of the research ().

Studies of research infrastructure rely on citations as metrics for tracing attribution and indicating the impact of scholarly works like datasets or software (). Institutions, like journals and data publishers, enforce disciplinary and cultural norms for writing style and citation through publishing guidelines and style manuals. Data citations that use specific identifiers allow readers to identify, retrieve, and give credit to research data. Despite recommendations and best practices, formal data citation is still not commonplace in scholarly writing (). Incomplete, informal, or improperly formatted citations present obstacles to tracking data use (). Vague or implicit references, for example, make it difficult for readers without an intimate knowledge of variables or other data features to understand which data the authors used and how they used them (). Thus, focusing exclusively on formal citation practices (e.g., using DOIs) means overlooking many potential data references. We seek a more comprehensive understanding of what authors do rhetorically when they formally and informally refer to data in their papers.

2.3 Meaning and motivations for citation

Citations bestow credit and recognition in science (). However, there may be a disconnect between authors’ citation practices and the use of citations to evaluate performance and measure research impact. In other words, citations indicate what is cited and how often but do not explain ‘why’ works are cited. Authors’ motivations for citing publications can be classified as scientific or tactical. Scientific citations provide background, identify gaps, and establish bases for comparison. Tactical citations acknowledge subjective norms and advertise published work (). While it is often assumed that citations indicate high-quality work that has influenced authors’ research, a survey found that authors’ citation decisions were more often motivated by strategic factors rather than their familiarity with the research or perceptions of quality ().

Behavioral surveys and interviews with authors reveal their judgments about what they are citing and why, which are not reflected in the scholarly record (). In one such study, authors considered the recency of publications, their topical specificity, and ease of use when deciding whether to cite them (). Authors also believe that citations reflect the prominence or novelty of a document as a ‘concept marker’ and that citing the concept marker will bolster the authority of one’s work, either through alignment or by critiquing existing work (). Silvello identified six main motivations for data citations that are shared across scientific fields: data attribution (accountability and merit), data connection (to claims in publications), data discovery (identification and retrieval), data sharing (reputational), data impact (assessing exposure), and reproducibility (validation and procedures) (). These motivations alone, however, do not explain authors’ data-citing behaviors and why they vary across venues and contexts.

2.4 Analyzing citation content and context

Many computational approaches for citation analysis have been proposed, building on prior insights about authors’ motivations to cite. Common citation categories identified across multiple content analysis studies included background information, theoretical framework, prior empirical or experimental evidence, negative distinction, and explanation of methodology (). Features, such as the sections of publications in which citations appear, can be used along with the semantic content of citations to predict citation intent (). Various classification schemes have been proposed for labeling authors’ intents in citing published research (). One such scheme accounts for citation purposes (i.e., author intent) and polarity (i.e., author sentiment) by distinguishing and weighting negative, neutral, and positive citations (). More granular, rule-based coding schemes differentiate statements of weakness, contrasts, or comparisons with other work, agreement, compatibility with other work, and neutral citations (). Conversely, less granular schemes support general citation intent classification by distinguishing background information, method, and comparison citations (). Such schemes can also help distinguish citation framing (e.g., uses, motivation, future, extends, compare or contrast, background) ().

Qualitative approaches respond to the difficulty of predicting authors’ intents by focusing on the context and features of references. A review of data citations in academic literature found that they varied along two major dimensions: cited entities and styles (). Data producers (the researchers who created the data) and data providers (the people or the institution from which the data were obtained) were often named in data citations. Another recent study found that researchers tended to use data created by others for comparison (e.g., ground-truthing, calibration, and identifying baseline measures) and integration (e.g., to ask new questions and conduct new studies) (). A large survey found that existing data are often used as the basis for a new study, to prepare for a new project, to generate new ideas, to develop new methods, to verify other data through analysis and sensemaking, and for teaching (). Given that data reuse fulfills diverse needs, researchers’ purposes for citing data also vary. We build upon established insights into data reuse practices to show how researchers give credit to data.

3. Materials and Methods

3.1 Sampling frame

We analyzed a sample of publications referencing one or more datasets available through the Inter-university Consortium for Political and Social Research (ICPSR), a large social science data archive at the University of Michigan. We based our typology on existing citation schemes for academic publications, which we extended and refined through iterative coding. The typology captures structural features and rhetorical functions that authors employ when referencing research data. Prior studies have focused on particular publication styles, such as data papers (; ), or publication outlets, such as PLoS One (). Instead, we drew from multi-disciplinary publications that referenced social science data archived at ICPSR. This approach allowed us to capture a wider variety of data reference contexts. We considered data references as they occurred in the full-length context of research publications. Further, our only selection requirement was that each publication mentioned one or more archived social science datasets.

We analyzed papers retrieved as part of ICPSR’s collection efforts to expand the ICPSR Bibliography of Data-related Literature. The Bibliography includes more than 100,000 publications that use existing social science data available through ICPSR. The review process for the Bibliography involves searching bibliographic databases for references to data available through published ICPSR studies. Staff manually review the metadata and full text of publication search results for evidence of data use. ICPSR maintains strict collection criteria to ensure that publications in the Bibliography reflect data use. Publications are collected if they unambiguously refer to one or more studies available through ICPSR and if it is clear that the authors have accessed and analyzed the data. Publications are rejected from the Bibliography if they fail to demonstrate substantial use of ICPSR data or if the specific studies or series used in the authors’ analyses cannot be determined.

To develop a sampling frame for testing our typology, we first identified five publications from the current ICPSR Bibliography representing the multidisciplinary use of ICPSR data. We closely read these publications to identify data references and develop a provisional typology. We then searched an external index of publication full text provided by the Dimensions bibliometric database () for additional references to any of ICPSR’s 11,639 study DOIs available as of February 2022. With the support of ICPSR Bibliography staff, we evaluated and classified the 2,546 search results into six categories indicating whether the publications met the collection criteria for the ICPSR Bibliography. These categories were proposed by ICPSR staff () and are defined in Appendix A (Supplementary File 1). We then randomly selected publications across each category to include in our analysis, resulting in a total of thirty publications. We gathered additional metadata for each publication, such as the field of research categories from Dimensions, to determine the disciplinary coverage of our sample. We report the publication sampling frame and selection criteria in Appendix A (Supplementary File 1).
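The per-category random selection described above can be sketched as follows. This is a minimal illustration, not our actual selection script; the category labels and the per-category counts are hypothetical placeholders, not ICPSR’s actual screening categories.

```python
import random

def stratified_sample(publications, per_category, seed=0):
    """Randomly select up to `per_category` publications from each screening category."""
    rng = random.Random(seed)
    by_category = {}
    for pub in publications:
        by_category.setdefault(pub["category"], []).append(pub)
    selected = []
    for category in sorted(by_category):
        pubs = by_category[category]
        selected.extend(rng.sample(pubs, min(per_category, len(pubs))))
    return selected

# Hypothetical screening results: 2,546 publications spread over six categories
results = [{"id": i, "category": f"category-{i % 6}"} for i in range(2546)]
sample = stratified_sample(results, per_category=5)
print(len(sample))  # 6 categories x 5 publications = 30
```

Fixing the random seed makes the selection reproducible, which matters when the sampling frame is later audited or extended.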

3.2 Qualitative coding

Our team conducted two phases of coding. The purpose of the first phase was to develop a codebook to describe the diversity of data references. To develop and refine our codes, three annotators from our team read the full text multiple times for each publication labeled ‘phase I’ listed in Appendix A (Supplementary File 1). The annotators independently proposed refinements to a shared version of the codebook. To achieve qualitative reliability, the annotators drew from many examples and discussed them in weekly meetings (). The annotators met weekly to review and incorporate proposed changes in interactive coding sessions. Emerging ideas and conflicting opinions created a dialog from which the codebook emerged. In addition to codes, annotators also discussed the scope of the data references and the codebook’s focus, purpose, and definitions. Each team member independently applied the updated codes for every iteration of the codebook to identify and annotate all data references in the full text. The team repeated this process until saturation was reached and no new codes were proposed (). The first coding phase resulted in a stable codebook, which we report in Appendix B (Supplementary File 2).

We also described the extent to which annotators’ coding aligned. Given that the annotators selected and coded segments from the unstructured, full text of publications, we used Holsti’s Index () as an agreement measure. The final Holsti Index was 67.9%, indicating a relatively high level of agreement, given that annotators selected different text segments, which they coded with multiple codes. A single team member independently applied the typology to the held-out set of twenty-five publications labeled ‘phase II’ in the sampling frame reported in Appendix A (Supplementary File 1). This second phase demonstrated the typology in action and captured findings shared in Section 4.
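Holsti’s Index measures agreement as twice the number of coding decisions both annotators share, divided by the total number of decisions each made. A minimal sketch, with hypothetical segment IDs and code labels drawn from the typology:

```python
def holsti_index(codes_a, codes_b):
    """Holsti's Index: 2 * |agreements| / (|decisions by A| + |decisions by B|).

    codes_a and codes_b are sets of (segment_id, code) pairs, one per coding decision.
    """
    agreements = len(codes_a & codes_b)
    return 2 * agreements / (len(codes_a) + len(codes_b))

# Hypothetical coding decisions by two annotators on the same publication
a = {(1, "Describe/Composition"), (1, "Legitimize/Justification"), (2, "Interact/Manipulation")}
b = {(1, "Describe/Composition"), (2, "Interact/Manipulation"), (3, "Critique/Limitations")}
print(round(holsti_index(a, b), 3))  # 2 * 2 / (3 + 3) = 0.667
```

Because the measure counts exact matches of segment and code, it is conservative when annotators select overlapping but non-identical text spans, which is why we treat 67.9% as relatively high agreement here.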

4. Results

The data reference typology consists of four parent codes (Data Entity, Data Reference, Feature, and Function), which are summarized and defined in Table 1. A Data Entity anchors a Data Reference and is based on the pragmatic distinctions raised in work by Renear et al. () to define components of datasets in scientific literature. Renear et al. distinguish between data content (including files and observations), groupings (the set or study to which they belong), and purposes (metadata used to interpret the data). Similarly, we define Data Entities as ‘one or more words indicating recorded observations’ and record them as ‘Files,’ ‘Metadata,’ ‘Studies,’ or ‘Variables.’ We excluded Data Entities that were not specific or were not discussed in the body of the paper. For example, if the name of a dataset appeared in the title of a publication but data were not described in the main text, we did not consider this a Data Entity. Given that some data analysis discussions were broad (e.g., results of statistical tests), we only considered statements that referenced a specific Data Entity.

Table 1

Overview of parent codes, subcodes, and definitions.


PARENT CODE    | SUBCODES (1ST LEVEL) | SUBCODES (2ND LEVEL)                           | DEFINITION
Data Entity    | File                 |                                                | One or more words indicating recorded observations
               | Metadata             |                                                |
               | Study                |                                                |
               | Variable             |                                                |
Data Reference |                      |                                                | Context window in which one or more Data Entities are mentioned
Feature        | Access               | Provision, Reception                           | Structure, form, and appearance of the data reference
               | Action               | Cites, Mentions, Uses                          |
               | Location             | Abstract, Acknowledgements, Appendix, Caption… |
               | Style                | Acronym, Generic, Name, Parenthetical          |
               | Type                 | Derived, Primary, Secondary                    |
Function       | Critique             | Comparison, Limitations                        | The purposes of the data reference
               | Describe             | Composition, Source                            |
               | Illustrate           | Context, Outlook                               |
               | Interact             | Interpretation, Manipulation                   |
               | Legitimize           | Justification, Transparency                    |

A Data Reference is the context window in which one or more Data Entities appear. We experimented with various context windows and determined that paragraphs captured sufficient detail leading up to and following a Data Entity. We focused on three types of data references, which are introduced in Section 2.1: citations, mentions, and uses. Data citations include a clear pointer to a published data source but do not name a dataset or indicate that authors used the data. Data mentions name a dataset in order to describe it but do not indicate data use. Data uses name a dataset and describe interactions between the author and the data. We applied feature and function codes to each Data Reference. The full codebook, along with definitions, rules, and examples, is provided in Appendix B (Supplementary File 2).
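The paragraph-level context window described above can be extracted programmatically. The following sketch is illustrative only; the entity names and text are invented, and real full text would require more robust matching (e.g., handling acronym variants):

```python
import re

def data_references(full_text, entity_names):
    """Return paragraphs (context windows) that mention at least one Data Entity."""
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", full_text) if p.strip()]
    refs = []
    for i, para in enumerate(paragraphs):
        mentioned = [name for name in entity_names if name.lower() in para.lower()]
        if mentioned:
            refs.append({"paragraph": i, "entities": mentioned, "text": para})
    return refs

# Invented three-paragraph publication mentioning one dataset by acronym
text = "We analyze voting behavior.\n\nWe draw on the ANES 2016 survey.\n\nResults follow."
refs = data_references(text, ["ANES"])
print(refs[0]["paragraph"])  # 1
```

Splitting on blank lines approximates the paragraph segmentation we used; each matching paragraph then becomes a unit to which feature and function codes can be applied.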

4.1 Features of data references

Features describe the structure, form, and appearance of the Data Reference. The twenty-four features of data references that we identified are organized under five subcodes (Access, Action, Location, Style, and Type). Access codes indicate whether the author describes data sharing or retrieval in the reference. Action codes capture the distance between the author and the data along a continuum covering ‘citing’ (i.e., parenthetically referencing data without further context), ‘mentioning’ (i.e., describing or alluding to data), and ‘using’ (i.e., describing active, hands-on work with data). The Action codes build on the distinction proposed by Pasquetto et al. () between comparative and integrative data reuse. The Location code notes the section of the publication in which the Data Reference occurs, such as the abstract, acknowledgments, captions, figures, tables, footnotes, or methods sections stated in the expanded IMRAD structure (). The Style code captures how the author specifies data entities through the use of an acronym such as ‘ANES,’ a generic noun such as ‘data’ or ‘study,’ a formal name such as the ‘American National Election Study, 2016,’ or a parenthetical citation using author and year of publication. The Type code captures whether the data entity is derived from existing data, represents a primary source created by the authors, or is a secondary source published for other researchers to use.
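One way to picture a fully coded Data Reference is as a simple record carrying the five feature subcodes plus any function codes. This is an illustrative sketch of the typology’s shape, not an artifact of our coding workflow; the field values are invented examples:

```python
from dataclasses import dataclass, field

@dataclass
class DataReference:
    """A coded Data Reference, following the feature subcodes in Section 4.1."""
    entity: str                     # Data Entity type, e.g. "Study"
    action: str                     # "Cites" | "Mentions" | "Uses"
    location: str                   # publication section, e.g. "Methods"
    style: str                      # "Acronym" | "Generic" | "Name" | "Parenthetical"
    type: str                       # "Derived" | "Primary" | "Secondary"
    access: list = field(default_factory=list)     # e.g. ["Reception"]
    functions: list = field(default_factory=list)  # e.g. ["Interact/Manipulation"]

# Invented example: an author describes hands-on use of a secondary dataset
ref = DataReference(entity="Study", action="Uses", location="Methods",
                    style="Acronym", type="Secondary",
                    access=["Reception"], functions=["Interact/Manipulation"])
```

Representing references this way makes the typology’s structure explicit: Access and Function are lists because a single reference can carry several such codes, while the remaining features take one value each.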

4.2 Functions of data references

Functions reflect the purposes of each Data Reference. The ten rhetorical functions of data references we identified are organized into five subcodes (Critique, Describe, Illustrate, Interact, and Legitimize). Definitions and examples for each code and subcode are provided in Appendix B (Supplementary File 2).

The Critique code includes (1) Comparison, which contrasts the author’s work with other work that uses the data or findings from other sources. Authors issue Comparisons to draw a contrast between their work and prior findings or to summarize conclusions drawn from the prior use of data. The Critique code also includes (2) Limitations, which signal authors’ awareness and caution when working with data. This code includes acknowledging quality issues, such as potential errors or sampling biases that should limit how data are used.

The Describe code includes (3) Composition, which explains or discusses knowledge about the data or metadata. Composition data references describe what is in the data (e.g., the sampled population) or the study’s context (e.g., the data collection method). The Describe code also includes (4) Source, in which authors describe the provenance of the data. Source references acknowledge the origin of the data associated with the data producer or provider.

References labeled with the Illustrate code are persuasive. This code includes (5) Context, in which the author provides background, findings, or statistics derived from referenced data. In a Context reference, the data are metonymic, standing in for the point or claim that the authors are making. The Illustrate code also includes (6) Outlook, in which authors speculate on potential applications of data that they did not conduct or review in their work. Outlook references claim the potential utility of data based on their properties.

The Interact code describes hands-on work with data. Interact includes a subcode for (7) Interpretation, where authors make an empirical claim derived from the analysis of referenced data. Interpretation references often follow the description of the authors’ analysis. The Interact code also includes a subcode for (8) Manipulation, where authors describe steps performed while working with data. Manipulation involves selecting variables, preparing or transforming data for analysis, and specific data preparation techniques like sampling, correlating, integrating, and validating analyses.

Finally, the Legitimize code is used for references intended to persuade the reader through value statements made about data. The Legitimize code includes (9) Justification, which draws attention to a feature of data that lends credibility or authority to the authors’ choices. Examples of Justification references include authors’ reasons for why data were selected, discussions about the credibility or representativeness of data, and descriptions of previous data uses that qualify their selection. The Legitimize code also includes a subcode for (10) Transparency, in which authors explain why or how an analysis procedure was applied and signal quality or considerations taken in the analysis. Examples of Transparency references include making methods open or reproducible by including analysis in supplementary materials.

4.2.1 Data references provide readers with access to data

Data references provide multiple ways of accessing data. Though some journals require data provision statements, in which authors make the data used in their analyses available to readers, such statements were not common in the publications we reviewed. Examples that we encountered included cases where authors provided access to data derived from their analysis for the stated purpose of replication. Alternatively, authors may also provide access to data as a means of recruiting future collaborations. One such description of data provision read:

The authors have made available the data that underlie the analyses presented in this article (see ), thus allowing replication and potential extensions of this work by qualified researchers. Next users are obligated to involve the data originators in their publication plans, if the originators so desire ().

Statements about authors’ access, or reception, of data from providers were often accompanied by formal, parenthetical data references. We defined data reception as a reference to an existing data entity and specifications for how that data could be accessed. For example, if a reference is parenthetical, the instance in the reference list must provide an access mechanism, such as a URL, by which others may access the source. In the following example, the author formally attributes the data creator through a parenthetical citation, which includes details about the analysis performed and the historical context motivating selection of the dataset:

In order to test whether or not fallout from nuclear testing had persistent effects on the agricultural sector, I create a panel of comparable variables from Historical U.S. Agricultural Censuses for the years 1940 to 1997 Haines et al. (). This Census data comes from the most comprehensive surveys of agriculture in the United States that ranges back to 1840. Starting in 1920, the Agricultural Census started conducting bidecennial surveys. I use this data to explore the effects on radioactive fallout deposition on long run outcomes and agricultural development at a national level ().

4.2.2 Data references indicate authors’ interactions with data

Data references spanned three levels of interactions between authors and data. First, we identified examples of superficial data citations, where authors cited published datasets in the same way as academic articles. In these cases, authors did not name a specific dataset in their writing; instead, they used footnotes or parenthetical citations to formally acknowledge the dataset in their reference list. Most data citations were found in introductory sections and were contextual, meant to provide background, findings, or statistics, which authors used to substantiate a point. It was often unclear, however, how statistics or figures that the authors cited were connected to or derived from the source data. In the following example, the data citation provides findings without direct analysis. No verbs have been used to describe actions performed with or to data; instead, the reader may assume that the authors have some previous experience analyzing the data or that the cited figure is tied to the dataset’s published summary statistics. Here, the author provides statistics with a corresponding footnote, which leads to a formal citation for data from the India Human Development Survey in the article’s reference list:

Slums are associated with poor quality housing, water, sanitation, and other services, leading to, among other outcomes, higher rates of disease and death. Rich households, on the other hand, are often located in areas with piped water and during water shortages can build storage facilities, tap into underground wells, and pay for delivered water. Only 38% of households among the poorest fifth of India’s urban population have access to indoor piped water compared with 62% of the richest fifth ().

When authors mentioned data, instead of citing them, they described the composition or source of a dataset. Unlike citations, mentions name the data in-line. We identified mentions of data primarily in the articles’ Methods, Introduction, Discussion, and footnotes. Many data mentions provided details about the composition of the data product and relayed knowledge about the basis for the study, collection method, or population. Mentioning data provided background information about data that the authors used later in their analysis or acknowledged the authors’ awareness of data that they evaluated but decided not to use. In the following example, the authors describe changes made to the sampled population between waves of a survey in order to qualify their selection method, signal awareness of data quality, and justify their approach:

As regards education, health, relationship status, and employment status, Wave 1 respondents who did not remain within the analytical sample show disadvantages compared with those who did. Accordingly, if those more susceptible to depressive symptoms had lower likelihoods of remaining within the analytical sample, attrition between Waves 1 and 2 might lead to conservative assessments of how contexts undergoing economic declines affect their residents’ depressive symptoms ().

4.2.3 Data references are building blocks for empirical arguments

In examples where data were critiqued, authors described others’ prior efforts or findings to contrast with their approaches. In some cases, authors described how they used the same data differently or decided against using the data based on the reasons that they provided. An example of a data comparison is provided below. The authors present several longitudinal studies covering a similar population and explain potential differences in findings based on differences in their compositions. In this way, the authors signal that they have performed due diligence; they are aware of related studies and can describe their limitations:

Studying an earlier cohort than After the JD, the National Longitudinal Bar Passage Study found that long-term bar passage rates were substantially lower for minorities than for whites. Thus a study of all law degree holders including those who did not pass a bar examination may find larger racial gaps in earnings. Census surveys such as those used in this paper lack bar passage status, and therefore likely include a larger proportion of lower earning individuals compared to After the JD ().

More references indicating data use were found in the Methods and Discussion sections of articles, as well as in captions, figures, and tables. We distinguished mentions from data use statements based on the authors’ use of verbs and personal pronouns. Most use statements described actions, specifically data manipulation (e.g., steps performed while working with data) and interpretation (e.g., making an empirical claim derived from data analysis). Data references describing use also occurred in appendixes and supplementary materials rather than in designated areas of articles, like acknowledgments or data availability statements. Examples of data manipulation included selecting variables from referenced data, as well as preparing, transforming, modifying, sampling, subsetting, comparing, or correlating those data. Data interpretation included building theories, comparing, and interpreting empirical evidence in figures. The following example illustrates how an author refers to two waves of a study, and related variables, in detailing their analytical approach:

To assess the degree to which genetic and environmental factors are stable over time requires an extension of the classical twin design to encompass repeated measurements. Here, we used the bivariate Cholesky decomposition approach: for each of n measured variables, the Cholesky decomposition specifies n latent A, C, and E factors. Viewed as a diagram, with the latent factors arranged above the measured variables, each of these factors is connected to the measured (manifest) variable beneath it, and to all variables to the right. In this way, each latent factor is connected to one fewer variables than the preceding factor. This design is of value for answering the current question as it allows estimation both of A, C, and E effects at Wave 1, and the extent to which these can account for Wave 2 variance, as well the new variance that emerges at Wave 2 ().

5. Discussion

Our typology expands the notion of data use beyond re-analysis. For example, while some researchers may access and re-analyze published survey data, many more may reuse that survey’s questionnaire or sampling design as a gold standard. Users may also critique the survey data by pointing to its limitations in addressing a particular topic. Our approach casts a wide net to capture these kinds of data references, providing insights into how social science data support research. The typology can inform recommendations for researchers about when, why, and how to reference published data. It also provides a basis for novel data reuse metrics that reflect many forms of engagement with data, from the reuse of survey designs to the re-analysis of survey data.

Our typology also reveals some ways in which data references differ from and align with traditional bibliographic citations. First, the referenced entity can vary in scale; we found references to individual files, metadata records, studies overall, and individual variables. While bibliographic citations may similarly range in scale (e.g., a citation of a specific phrase or section of a paper vs. the paper overall), data entities have a different and possibly broader range of constituent parts. Further expansion and refinement of the typology through review of papers in other domains may reveal additional sub-entities (for instance, research in archaeology or paleontology likely refers to specific artifacts, as well as to data derived from those artifacts). Further work is needed to understand the implications of these differing citation scales: are references at different scales (e.g., variable-level versus full dataset-level) associated with different types of use and argumentation? Are different scales of data more or less likely to result in a formal citation of the dataset? Data entities may additionally have multiple versions that could be referenced (though we did not see this in our sample); how does this complicate our ability to trace the flow of scholarly influence?

Second, we find that data references can act as ‘concept symbols’ (), much as bibliographic references do. Informal reference to datasets by acronym or name (and without a formal citation) indicates a familiarity with datasets like that accorded to canonical works of scholarship. In other words, datasets can be referenced with the same familiarity as a biologist references Darwin or an economist references Locke. Future work to identify these foundational or canonical datasets may help reveal how datasets-as-concept-symbols differ from bibliographic references. Datasets may be unique in that they can also have a distinct metonymic function, where a reference to a dataset as a whole stands in for a reference to a specific part or feature of the dataset (as revealed by our ‘context’ code).

Third, data references show interactions with data entities that are not typically found with bibliographic entities, namely the provision and archiving of data. Datasets function both as a resource to be used and as a scholarly product to be cited or made available to others. In the publications we annotated, it was uncommon for authors to provide direct access to the findings that they derived from existing data; more often, authors established credibility and trust by simply describing the data source or data provider that they had accessed. Prior studies of researchers’ attitudes toward data sharing and reuse show that researchers are reluctant to provide access to their data because they do not believe the data would be valuable to others, or because hoarding data provides a way to attract future collaborations (; ). Though our sample was not representative, we found early indications that align with this prior work.

Finally, we also found alignment with prior schemes describing authors’ motivations for citing literature. The rhetorical functions we identified signal the quality, verifiability, or reproducibility of authors’ research findings by allowing readers to discover the data the authors analyzed (). For the most part, the data references we reviewed provided either details about dataset composition or descriptions of data manipulation. In the examples we identified, authors attached additional context about the analysis they performed, connecting a data source to its use. Further, when authors included specific access information for data, this enabled readers to retrieve the same dataset.

5.1 Limitations and future work

This study proposes a typology that models how authors reference research data. We developed the typology by closely reading papers from the ICPSR Bibliography and adding new categories until we reached saturation. The present analysis is not intended to provide quantitative evidence for specific citation trends; that would require annotation at a larger scale, with additional measures in place to verify inter-annotator agreement. In addition, we constructed our sampling frame by selecting papers that were first reviewed and classified by experts (i.e., ICPSR Bibliography staff); the sample is balanced across the categories provided in Appendix A (Supplementary File 1). While this sampling strategy is useful for developing and analyzing the ICPSR Bibliography, future uses of the typology for other purposes may require different selection criteria.

We envision applying our typology to study differences in data references across social science disciplines (e.g., sequences or co-occurrences of data reference strategies as markers of scientific disciplines or analytical methods). A recent study of data citation practices at ICPSR observed unexpected uses of dataset DOIs in published literature, which did not indicate data use (). Our typology can be used to study when and why researchers use dataset DOIs and distinguish references that describe data from those that imply data analysis.

6. Conclusion

Although research data are increasingly important in modern scientific analyses, they have not historically been regarded as primary research products. The publication, long-term preservation, and dissemination of research data, along with descriptive metadata, make it possible for others to discover, use, and cite observations collected by other researchers for other purposes. We introduced a typology of data references that characterizes the functions data serve in scientific publications: critical, descriptive, illustrative, interactive, and legitimizing. The typology captures researchers’ interactions with data (e.g., work or analyses done with data) and judgments about data (e.g., claims about their fitness for use based on what is known about them). Understanding why authors reference research data is essential for giving data producers and providers the scholarly research credit they deserve for facilitating scientific work.

Data accessibility statement

The ICPSR Bibliography of Data-related Literature is available at https://www.icpsr.umich.edu/web/pages/ICPSR/citations/.

Additional Files

The additional files for this article can be found as follows:

Appendix A (Supplementary File 1)

Sampling Frame. DOI: https://doi.org/10.5334/dsj-2023-010.s1

Appendix B (Supplementary File 2)

Codebook. DOI: https://doi.org/10.5334/dsj-2023-010.s2