Introduction

Data science has been called the “sexiest job of the 21st century” (). Data science has also been extensively critiqued by scholars across numerous fields. One particularly vivid critique labels data science as “machinic Neoplatonism,” stating that data science techniques encourage and enable thoughtlessness in the context of decision-making and societal analysis (). Other commentary on the nature of data science is similarly divergent. Data science has been characterized as being little more than statistics relabeled () while also being characterized as encompassing almost every kind of science ().

Within this bubbling commentary, considerable debate can be found on almost every facet of data science. The one thing that most commentators agree on, however, is that data science must be characterized as having an interdisciplinary and/or metadisciplinary nature. To be doing data science, according to almost every description, one must be pulling tools, skills, algorithms, concepts, or data from multiple disciplinary or methodological frameworks. As one of the above quoted scholars phrased it, “The imaginary ideal data scientist is a Renaissance figure with a mastery of all these arts,” referring to programming, statistics, mathematics, and data visualization, among other skills (). Looking across various academic and popular press descriptions of data science, significant differences can be found in the characterizations of the appropriate mélange of skills and tools that constitute a “data scientist.” But the fundamental interdisciplinary nature of data science—the fact that people who do data science cross or transcend traditional disciplinary boundaries, tools, and methods—seems to be a consensus view.

This paper aims to build a lens for understanding the diversity, complexity, and interdisciplinarity or metadisciplinarity of data science by drawing lessons from the history of information science and its precursors. This analysis highlights the historical parallels between the emergence of data science in the 21st century and the emergence and evolution of information science over the past 100 years to provide insight into interdisciplinary challenges facing data science as a professional and academic endeavor.

This comparison is particularly timely. Debates about the disciplinary status of data science are growing within government, corporate, and higher education institutions. A number of recent consensus reports have been written to help shape the present and future of data science (; ; ).

Disciplines are social, organizational, and institutional constructs that often emerge around nascent problems or topics where resources like funding and students are in a growth period (; ). Such is certainly the case for data science.

The term interdisciplinarity can be a linguistic stand-in for modern, creative, and/or progressive ways of working (). As such, interdisciplinarity can be a way of working and/or a way of talking. This tension is one manifestation of how interdisciplinarity can encompass many different things (). Because of this variation in meanings, “good interdisciplinary work requires a strong degree of epistemological reflexivity” (). Epistemological reflexivity may have value in moving data science toward becoming a “critical technical practice” (), that is, an area of work that actively examines and engages with its own limitations and inherent challenges. People working within and around information science have repeatedly debated the relative merits and drawbacks of interdisciplinarity and metadisciplinarity throughout the past century and up to the present (; ; ; ).

This paper begins by characterizing data science as an inter- and metadiscipline by highlighting a number of key features of recent research and professional work in the area. I then depict similar characteristics of information science and finish with a discussion of the following set of questions related to the interdisciplinary pros and cons of current data science:

  1. What will be the focal points around which “data science” and its stakeholders coalesce?
  2. Can data science stakeholders use the lack of disciplinary clarity as a strength?
  3. Can data science feed into an “empowering profession,” namely, a profession that promotes the growth, competence, and autonomy of the people that it serves?

Methodological Approach

This paper is based on a review of the literature in the information and data sciences related to interdisciplinarity. Many of the sources used in the characterizations below are personal narratives that present the perspective of a single individual. In the case of data science, some are white papers, blog posts, or opinion papers. In the case of information science, many relevant sources are papers published in peer-reviewed journals by prominent people in the field, including scholars, educators, and administrators. Any single perspective among these voices may have particular limitations or biases. Taken together, however, such sources prove extremely valuable for tracking the evolution of interdisciplinary research areas (). Personal narratives serve as indicators of the ways that particular issues were discussed at different points in time. These personal narratives have been compared and contrasted with relevant research papers appearing in peer-reviewed journals discussing the nature of information science and data science as disciplines and professions.

The method for gathering relevant peer-reviewed materials for this paper included systematic queries of article databases such as the Web of Science and Google Scholar for articles related to “information science,” “data science,” “interdisciplin*,” and “metadisciplin*.” These sources are useful to find relevant materials related to information science given the long history of the topic area, but they are less useful for tracing longer-term developments relevant to data science given its nascent development as a named entity. For example, as of February 10, 2023, the Web of Science Core Collection returns 14,448 total results when searching for the phrase “data science,” of which only 26 date from 2009 or earlier. Of these, 19 are spurious hits, and 5 are book reviews or news items. Only 2 peer-reviewed articles discuss data science in a way that is close to current understandings: Cleveland (), discussed below, is a foundational article for the statistical aspects of data science. Mezey et al. () discuss a number of aspects of database analysis that “provide challenging tasks and opportunities for data science” but otherwise do not directly discuss data science itself. As a point of comparison, the Google NGram viewer, which quantifies usage of particular words or phrases across the Google Books corpus, shows almost zero use of the term “data science” through 2008, but there is significant year-over-year growth in use of the phrase since 2009 (https://books.google.com/ngrams/).
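As a rough illustration of how the truncated search terms above behave, the following minimal sketch matches them against article titles. It is an assumption-laden illustration, not the query syntax of Web of Science or Google Scholar: the helper names `term_to_regex` and `matching_terms` are hypothetical, and the trailing `*` is simply treated as "any run of word characters."

```python
import re

# Search terms quoted in the text; a trailing "*" is a truncation wildcard.
TERMS = ["information science", "data science", "interdisciplin*", "metadisciplin*"]


def term_to_regex(term):
    """Turn a (possibly truncated) search term into a case-insensitive regex.

    Assumption: "*" stands for any run of word characters. Real databases
    define their own wildcard rules, which may differ.
    """
    escaped = re.escape(term).replace(r"\*", r"\w*")
    return re.compile(r"\b" + escaped + r"\b", re.IGNORECASE)


def matching_terms(title):
    """Return the search terms (in TERMS order) that occur in a title."""
    return [t for t in TERMS if term_to_regex(t).search(title)]


# Counts reported in the text for pre-2010 "data science" hits.
PRE_2010_HITS = {"spurious": 19, "reviews_or_news": 5, "peer_reviewed": 2}

if __name__ == "__main__":
    # Hypothetical title used only to exercise the matcher.
    print(matching_terms("Data Science as an Interdiscipline"))
    print(sum(PRE_2010_HITS.values()))
```

The three categories of pre-2010 hits reported above (19, 5, and 2) account for all 26 early results, which the final line of the sketch checks by simple addition.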

Notably, as of February 2023, the Web of Science does not index any journals that include the term “data science” in the title. Thus, an additional method for finding relevant articles was to directly investigate journals that focus on data science but are not indexed by the Web of Science, either by using Google Scholar or by visiting the journals’ websites and examining issues. Some specific journals that were investigated in this fashion are described further in the section that follows. Another method for finding materials relevant to this article’s discussion, perhaps the most valuable, was citation chaining. Once a relevant article was found, following citation networks both forward and backward in time frequently resulted in the discovery of more relevant articles.
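The citation chaining described above can be sketched as a breadth-first traversal of a citation network in both directions. This is a minimal sketch under stated assumptions, not the procedure actually used for this paper: the function name `citation_chain`, the `max_hops` cutoff, and the toy article identifiers are all hypothetical, and the human judgment of whether each discovered article is relevant is not modeled.

```python
from collections import deque


def citation_chain(seeds, references, max_hops=2):
    """Collect articles reachable from the seeds by following citations.

    `references` maps an article ID to the IDs it cites (backward chaining).
    Forward chaining uses the inverted map (which articles cite this one).
    Returns a dict of article ID -> number of hops from the nearest seed.
    """
    # Invert the reference lists to get "cited by" links.
    cited_by = {}
    for article, refs in references.items():
        for ref in refs:
            cited_by.setdefault(ref, set()).add(article)

    found = {seed: 0 for seed in seeds}
    queue = deque(seeds)
    while queue:
        current = queue.popleft()
        hops = found[current]
        if hops == max_hops:
            continue  # do not expand beyond the hop limit
        neighbors = set(references.get(current, ())) | cited_by.get(current, set())
        for neighbor in sorted(neighbors):
            if neighbor not in found:
                found[neighbor] = hops + 1
                queue.append(neighbor)
    return found


# Hypothetical toy network: B cites A, C cites B, D cites C, E cites A.
TOY_REFERENCES = {"B": ["A"], "C": ["B"], "D": ["C"], "E": ["A"]}

if __name__ == "__main__":
    print(citation_chain(["B"], TOY_REFERENCES, max_hops=2))
```

In practice, each newly found article would be screened for relevance before its own citations are followed, which keeps the chain from expanding into unrelated literature.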

Data Science

At the time of this writing, historical literature related to data science is scant. The best chronological depictions of the development of data science are found in Cao () and Press (). Both articles illustrate how the trajectory of what we now call data science can be dated back at least 50 years, encompassing developments in data analytics and visualization, statistics, database design, and other topics. Phillips () points to other trends related to data gathering and analytics that extend back a century or more. The phrase data science, however, has only been in use for about 20 years. In this section, I highlight particular developments in the past two decades related to the emergence of data science, focusing on characterizations of data science’s boundaries and participants.

Conceptions of data science

The recent growth of data science has been stimulated by the large volumes and varieties of data being made public on the internet via the explosive growth in digital technologies, such as personal computers, cell phones, social media, smart devices, and sensor networks. As the generation of data by these technologies has increased, the need for methods of storing, accessing, analyzing, and presenting data has also increased. Data science has emerged as a panoply of techniques, tools, and skills that can be applied to derive value (economic or intellectual) out of the growing piles of data. The concomitant need for people with skills to work in these areas within the commercial and public sectors has also been a significant driver of the growth of data science ().

From different points of view, data science can be viewed as (1) a proto-discipline, (2) a toolkit of analytical pipelines and platforms, (3) a bundle of transformative forces at work inside and outside the academy (), or even (4) “a community of practice of data-driven scientists of whatever scientific discipline they ask questions about” (). Statistician David Donoho’s () recent paper, “50 Years of Data Science,” provides a useful starting point for this discussion about the scope of data science. This paper has been highly cited since its initial publication, and Donoho is a prominent figure in many discussions of data science. Donoho describes his view of six divisions of “data science” activity:

  1. Data gathering, preparation, and exploration
  2. Data representation and transformation
  3. Computing with data
  4. Data modeling
  5. Data visualization and presentation
  6. Science about data science

Donoho explicitly excludes an engineering component from his typology, namely, the activities involved in building systems to effectively deal with data, move data, and distribute data at different scales. Other commentators, however, call out infrastructure development as a core component of data science and computational work in general (; ; ).

As Donoho acknowledges, his typology builds on earlier works by the prominent statisticians Chambers () and Cleveland (), who stimulated the idea of a “data science” by urging the field of statistics to broaden its focus beyond its traditional emphasis on theoretical analyses. Cleveland’s paper, for example, lays out a precursor to Donoho’s data science typology and depicts how statistics curricula could be expanded to better train people as “data scientists.” The Journal of Data Science, launched in 2003 by two statisticians, provides a publication venue for statistically focused data science research very much in line with Cleveland’s call, focusing on the applications of statistical methods in a variety of contexts. Many prominent articles about data science, however, including Donoho’s, have been published in core statistics journals.

At around the same time as Cleveland’s call for data science within the field of statistics, data science was also becoming a named entity in other sectors. CODATA, the Committee on Data of the International Council for Science (ICSU), published the first issue of its Data Science Journal in 2002. CODATA was established in 1966 “to promote throughout the world the evaluation, compilation and dissemination of data for science and technology and to foster international collaboration in this field” (). The six founders of CODATA included chemists, physicists, and an engineer. The Data Science Journal was formed to facilitate the dissemination of scholarly work on topics related to the committee. The launch of the journal was also specifically motivated by disciplinary aspirations. As stated in a retrospective on the 45th anniversary of CODATA, “A journal gives identity to a discipline” (, italics in original). The first editor of the journal, F. Jack Smith, outlined his view of the key topics of interest within the new discipline and journal:

… the study of the capture of data, their analysis, metadata, fast retrieval, archiving, exchange, mining to find unexpected knowledge and data relationships, visualization in two and three dimensions including movement, and management. Also included are intellectual property rights and other legal issues. ()

As of 2023, the scope of the Data Science Journal had not varied significantly from Smith’s initial focus (; ).

The Data Science Journal’s emphases were largely disjoint from Donoho’s typology of data science and the goals of the aforementioned Journal of Data Science. Some of the Data Science Journal’s areas of emphasis fall into the engineering category that Donoho acknowledges but does not include in his typology, but some others are much further afield, such as the Data Science Journal’s mention of legal issues related to data.

Looking at the Journal of Data Science and the Data Science Journal in parallel, we clearly see two distinct notions of what data science encompasses, each generally exclusive of the other. Numerous other journals and conferences related to data science expand the boundaries of the topic even further, including titles launched since 2019, such as the Harvard Data Science Review, Data Intelligence, and Patterns (; ; ).

Data science is often depicted as a nexus of certain kinds of skills. Drew Conway’s () data science Venn diagram is a commonly referenced visualization for this view, in which data science is depicted as the amalgamation of (1) math and statistics knowledge, (2) “hacking skills,” and (3) “substantive knowledge,” referring to knowledge within a particular disciplinary specialization. Conway is careful to note that this Venn diagram is intended to apply to data science broadly, not necessarily any specific data scientist (). But others, such as Davenport and Patil (), take this view further by stating that computer science and statistical expertise are the defining features that distinguish a data scientist from a traditional scientist. Blei and Smyth () also note this distinction between data scientists and “domain scientists,” but they emphasize that the two groups should be partners (or integrated) whenever possible:

Crucially, the data scientist solves the problem iteratively and collaboratively with the domain expert. (We note they do not need to be two different people; the data scientist and domain expert could simply be two “hats” for the same person). ()

This conceptual separation of regular (or domain) science from data science is in fact necessary for the data scientist to exist as a distinct type of person (). There would be no need to create a new label like data scientist if there were no conceptual or practical distinction between what a data scientist does and what a typical researcher would be doing within chemistry, astronomy, or meteorology. While the tools and methods used are one notable distinction, another could be that data scientists are expected to be able to apply their skills to data regardless of the disciplinary focus of those data. In other words, in the characterization of the above authors, data scientists are expected to be able to work with data for which they have no specific disciplinary training (), while domain scientists are only expected to be able to work with data from within their own discipline, such as chemistry, astronomy, or meteorology.

Key characteristics of data science as an inter- and metadiscipline

In looking at recent discussions of the trajectory of data science, three key issues related to interdisciplinarity repeatedly manifest: (1) the diversity in participants and communities, (2) the diffuse and contested boundaries of data science, and (3) the debated disciplinary status of data science. This section expands on these points.

The diversity in participants and communities

It is clear that data science, however bounded, is a topic area that encompasses many participants and communities and involves people with a multiplicity of skills and backgrounds. The statistics-centric view emphasizes the need for data scientists to be knowledgeable about data representation, transformation, modeling, and visualization. The data management and engineering conception of data science spans computational infrastructure building, metadata development, data retrieval and archiving, and intellectual property regimes for software and data products. Some discussions of data science include components of both views (; ; ), but this is less common. It is also clear that there is an evolving spectrum of skills and expertise that data scientists hold in practice (). People with data science job titles or responsibilities work in nearly every societal sector, including government, industry, nonprofit organizations, and higher education (; ). Many people who could be characterized as data scientists, however, do not fully identify as such, as noted by a recent survey of data scientists in academia, in which many respondents “somewhat” identified as a data scientist ().

The diffuse and contested boundaries of data science

With this diversity of people involved, few individuals follow the same path into the field. A former editor for the Data Science Journal hoped that the journal could serve as “a saloon for data scientists and experts in other fields” (). As such, the boundaries between data science and other fields are porous. Commentators have drawn parallels between data science and numerous other disciplines, ranging from statistics and information science to computer science () and journalism ().

The diffuse and contested boundaries of data science manifest clearly as departments and schools jockey for position to own data science within academic institutions. Educational programs for data science are blooming, albeit in highly heterogeneous ways, which makes identifying any broad trends in curriculum development problematic (). The US National Academies of Sciences report on data science undergraduate curricula provides little closure around what should or should not be part of data science education (). The summary lists nine central conceptual areas within the scope of data science and asks, “Which key components should be included in data science curriculum, both now and in the future? How could these components be prioritized or best conveyed for differing types of data science programs?” The report does not attempt to answer these questions directly.

De Veaux et al. (), on the other hand, define an undergraduate curriculum for data science in great detail, encompassing mathematical and statistical components, as well as data modeling, description, and curation. Their proposed curriculum also includes a significant emphasis on communication, reproducibility, and data ethics. The EDISON Framework likewise breaks data science curricula into a number of competency areas, specifically (1) data analytics, (2) data engineering, (3) data management, (4) research methods and project management, and (5) domain-related competencies (; ). Numerous other curricula can be found, both undergraduate and graduate, each covering various topic areas (; ; ; ).

An ongoing question for many universities is where to situate data science programs within the ecosystem of existing schools and departments (; ). Many of the curriculum topics listed in the previous paragraph are already being taught within statistics, engineering, computer science, and information science programs. Data science students and instructors alike have diverse backgrounds, and it is common for instructors to be active practitioners, not tenured faculty (). One model is to create data science institutes as distinct entities while drawing faculty from multiple existing campus departments (). These institutes provide forums for building coalitions of faculty, student interest, and financial investments and provide testing grounds for broader data science undertakings across a campus ().

Significant diversity exists, however, in how data science has been instituted within university structures. An intensive study conducted by the University of California, Berkeley, assessed 16 different options for providing organizational support for data science, including forming new schools or colleges, creating new divisions within existing schools, creating programs that are spread across multiple schools, and creating new research units or centers (). Many universities, for-profit companies, and nonprofit organizations have also started online data science courses and certification programs (; ; ; ). These online programs have been able to reach much larger numbers of students, including populations beyond the typical undergraduate student (). These programs are responding to the need to scale up the number of graduates to meet employment demands in the private and public sectors ().

The debated disciplinary status of data science

All these factors contribute to the contested disciplinary status of data science (). This debate is rooted in the variable understandings of what the central concept, data, actually means. Defining data is itself an area of active scholarly research, though mostly by philosophers and information scientists (; ; ; ; ). Many discussions of data science that are otherwise very comprehensive, such as Donoho (), Cao (), and the EDISON Project (), do not engage with the fundamental question of defining the core concept of the emerging field. Nonetheless, numerous definitions of data can be found, ranging from disciplinary or technology-centric perspectives to abstract conceptualizations (). The ubiquity of the concept of data, combined with its elusiveness, frames the ongoing debates about the formalization of data science as a discipline.

Here it is important to note the distinction between (a) a formally defined discipline and (b) sets of people who are interested in, working on, or conducting research related to a particular topic or phenomenon. The latter kinds of groups, which might be characterized from different points of view as “invisible colleges,” “epistemic communities,” or “communities of practice” (; ; ), encompass groups of people who are connected via social and/or intellectual networks but may have different formal disciplinary affiliations. This distinction is important in relation to the question about whether the goals of data science should be to develop a “science with data” or a “science of data” (). In the next section, I return to the question of the degree to which data science as a discipline will encompass broader areas of research that focus on data as a phenomenon of interest.

Discussion

This section presents a discussion of the literature review and works through the three central questions of the paper in detail, focusing on the comparison between data science and information science. Table 1 presents high-level parallels between data science today and current and past information science. As shown in the table, the notion that there is an explosion of information and data that is outpacing our ability to manage, use, and understand them is not new to the “big data” or “data science” era. Rhetoric of “information overload” has been used to motivate new developments in information and data management techniques at least as far back as the early 20th century ().

Table 1

Comparison between data science and information science.


POINTS OF COMPARISON | INFORMATION SCIENCE PRECURSORS, 1920S THROUGH 1950S | INFORMATION SCIENCE, 1960S THROUGH 1990S | DATA SCIENCE AND INFORMATION SCIENCE, 1990S THROUGH 2010S

Explosion of information/data resources
  • Growth of US government in 1930s
  • Technical reports classified during World War II made public afterward
  • Seized documents from Axis power countries
  • Cold War–driven expansion of research and research outputs
  • Digital resources distributed through electronic media (magnetic and optical disc formats)
  • Emergence of the web
  • Digital technologies, such as personal computers, cell phones, social media, and sensor networks, that enable faster generation of information and data
  • Large volumes and varieties of data being made public on the internet

New technologies that promise to improve capacity
  • Microfilm and microfiche
  • Punch card–based document sorting and selection tools
  • Early computing technologies
  • Digital computing technologies
  • Personal computers
  • Internet and web technologies
  • High-bandwidth cellular networks
  • Artificial intelligence and machine learning tools
  • Advanced data mining
  • Cloud computing
  • App development on social media platforms

Diversity of participants
  • Engineers
  • Librarians
  • Mathematicians
  • Scientists
  • Computer scientists
  • Economists
  • Engineers
  • Information scientists
  • Librarians and archivists
  • Psychologists
  • Scientists
  • Computer scientists and engineers
  • Information scientists
  • Librarians and archivists
  • Philosophers
  • Scientists
  • Social scientists
  • Statisticians

Information science became a distinct disciplinary and professional label in the 1960s. The prehistory of information science, however, centers on international efforts in the first half of the 20th century that focused on “documentation,” the initial predominant name for the topic (). After World War II ended, the interest and activity related to information and documentation increased dramatically. The governments of many countries, particularly the United States, wanted to leverage the research conducted during the war to facilitate growth of public knowledge (). This resulted in an explosion of technical reports into the public domain after the war. In addition, the victorious Allied forces seized a huge number of government documents from Nazi Germany and other Axis countries (). The challenge of organizing these documents stimulated interest and activity in documentation, information organization, and information retrieval. Information and intelligence work related to the growing Cold War with the Soviet Union likewise stimulated growth in information research and professionalization (; ). Many organizations undertook information-related work during this time, and the number of information workers grew rapidly, including many scientists who encountered information work during the war ().

The information science educational ecosystem expanded through the 1970s, often (though not exclusively) through programs based in library schools. The library and information sciences coalesced enough during this period for a number of specializations to become prominent, if somewhat disconnected (). The following decades saw a serious retrenchment of the educational landscape, as over 20 library and information science schools closed or went through administrative realignments in the 1980s and 1990s (; ). This retrenchment slowed in the 2000s as the internet emerged as a social and technological phenomenon, causing renewed interest in information within governments, universities, and the private sector. In 2005, in the midst of the Web 2.0 period, the “iSchool” caucus was formed by a group of nine library and information science schools (). As of mid-2020, it contained 114 members across six tiers of membership. The iSchool membership is diverse, intellectually and programmatically. Some schools retained strong connections to the earlier information science focus areas, but many differ substantially from what an information science school looked like in prior decades (; ).

Throughout the past century, information science has demonstrated the same characteristics analyzed above for data science: (1) diversity in participants and communities, (2) diffuse and contested boundaries, and (3) debated disciplinary status. As shown in Table 1, as long as information science and its precursors have existed, there has been a diverse and clumpish mix of participants. This diversity of participants and intellectual approaches has provided a constant source of new ideas and contributors within information science, but it has inevitably engendered boundary arguments about what the discipline should (or should not) include. During the 1960s, information science emerged as a contested space, and it has continued to face boundary disputes to this day (; ).

Articulating and negotiating the unique value and niche of information science within the ecosystem of constitutive and related disciplines has been an ongoing challenge (; ; ). The diverse and evolving sets of participants and ongoing boundary challenges have repeatedly engendered debates about the disciplinary status of information science. Many commentators within these debates have noted that interdisciplines face continuous struggles to achieve power and legitimacy inside academic and government institutions that favor (implicitly or explicitly) traditional disciplines. Some question the wisdom of arguing for the field explicitly by championing its interdisciplinary nature (; ). In part, ongoing challenges in articulating the common thread(s) within information science stem from the elusiveness of information as a topic. Definitions and characterizations of the concept of information abound, offered by individuals inside and outside of information science (; ).

What can be drawn from the parallels between the ongoing evolution of information science and the emergence of data science? The disciplinary ecosystems of the two fields are not identical. Marchionini () argues that information science can be considered an academic discipline because it has developed distinct “principles, key research questions, and communities of practice that have given rise to subspecialties, professional standards, curricula and degrees; whereas data science at present consists of a set of techniques that have arisen out of allied fields such as statistics, computer science, and information science and is driven by applications and problems from a variety of endeavors of modern life.”

To build insight from this comparison, the following sections discuss three key questions regarding the future of data science. The intention behind these questions is to identify issues that are either already important in the data science landscape or are likely to become important in the near future. For the stakeholders involved in data science, there is benefit in discussing how these debates can be turned into productive discussions rather than having them manifest as impediments going forward.

1. What will be the focal points around which “data science” and its stakeholders coalesce?

Given the general vagueness described above around the conceptualization of data within data science, why has the term data emerged as the focal point for this conglomeration of activity? Here, the historical comparison to information science may shed light. Information superseded documentation as the central concept of the field in the 1950s, but a considerable body of work since that time has argued that other concepts provide more theoretically robust entities, including “documents” (), “literatures” (), “relevance” (), or, more recently, people and their use of networked computers ().

What then holds information as the central concept of the field? Certainly, the formalization of information science in the 1950s and 1960s was due in large part to the success of Shannon’s “information theory” within the fields of mathematics and electrical signal processing (). Information theory, as developed by Shannon and many others (), provided conceptual metaphors of information “senders,” “channels,” and “receivers” that persist within information science research and education to this day (; ). Perhaps equally important, the success of information theory brought attention and resources to the study of information. Governments, private foundations, and for-profit companies invested in a wide range of information-focused research during the postwar period (). As such, it is tempting to attribute the movement from documentation to information science as being one of status seeking, that is, adopting the term information to align preexisting bodies of work under the documentation label with emergent and highly prestigious research focused on information ().

Such alignments are inevitable and are certainly happening today in the movement to data science. But the information concept provides more than just status. The vagueness of the term provides affordances in how it can be used and understood. As Agre () illustrated, “information” provides a neutral term that enables research and professional communities to make broad intellectual territorial claims without overaligning to any particular technology, institution, or knowledge area.

Many of these same characteristics can be seen in the centering of data within data science, namely, foundational metaphors, pragmatic alignment with trending topics, and a vagueness that enables broad territorial claims. Data certainly comes with a foundational metaphor, detailed by Rosenberg () and Frické (), that is at the root of work in many fields: data being that which underlies facts, evidence, truth, and information. Rosenberg, Frické, and others (cf. ) point out conceptual problems with this metaphor, but it undoubtedly remains strong in most sectors of academia and society. On a practical level, the label data also serves as a pragmatic sign of alignment with emergent and resource-heavy research areas, such as big data, the Internet of Things, and social media analytics. Finally, like information, the term data lacks conceptual baggage that would tie it to any specific technology, institution, or knowledge area. This characteristic is at the root of the “data science as metadiscipline” commentary, namely, that its techniques (whether data organization, processing, analytics, or visualization) can have application in almost any setting.

As such, information and data serve a number of functions, even if they lack unifying conceptual clarity. Their conceptual vagueness presents both benefits and drawbacks. As Hjørland () describes, both centripetal and centrifugal forces exist with regard to the formation of a coherent discipline based on such a diffuse topic. Even as this tension has caused ongoing practical and institutional challenges for people involved in the study of information, it has stimulated considerable intellectual advancement related to the understanding of information as a conceptual and theoretical entity. Whether the development of data science stimulates similar advances in the understanding of the concept of data remains a question for future research.

In its current formative period, data science is perhaps most coherent as a platter of methods and tools, not as a grouping of research or professional areas. Data science projects tend to be distinguished by the kinds of tools and methods used, not the disciplinary topic on which they are working (). As vividly shown in the “Periodic Table of Data Science” (), data scientists may engage with a variety of tools for data collection, cleaning, processing, analysis, archiving, and distribution, including programming languages like Python and R, frameworks for analyzing large data sets like Apache Hadoop and Pangeo, and machine learning and artificial intelligence approaches like decision trees and neural networks (; ). As with any discipline, some specializations within data science will have little crossover with each other. It remains to be seen, however, whether data science specializations will continue to be structured around particular tool sets and methods, or whether particular theoretical developments, topical interests, or social problems will become more prominent focal points.

2. Can data science stakeholders use the lack of disciplinary clarity as a strength?

This leads to the next point of discussion. Conceptual advances in the understanding of data as a foundational concept are taking place, but as noted above, this work is largely being conducted by non–data scientists. This is one demonstration of the blurriness of the boundaries around data science. Boundaries between disciplines are always blurry. Scholars tend to interact most closely with people who work on similar topics and/or with similar methods, regardless of their disciplinary affiliation. Porous boundaries mean that new participants with diverse backgrounds will move into or across the field. This will inevitably lead to rediscovery or reinvention of particular ideas or approaches and periodic circularity in the topics of current interest. This trend has been noted by many prominent information scientists (; ; ).

Such reinvention and circularity can stem from a lack of knowledge of historical predecessors, but it is also reflective of the ongoing nature of many information and data challenges. Some problems reemerge repeatedly, despite the best efforts of many experts. Within information science, it has long been known that information organization and retrieval methods that once worked well will break down if not regularly revisited due to changes in how language is used across space and time (). Data scientists encounter such circularity as they attempt to standardize data within and across organizations, leading to the well-documented fact that a significant portion of recurring data science work involves data wrangling and cleaning (; ; ).

For information science, these characteristics have been viewed as problems that limit the field from gaining status within the broader ecosystem of academic disciplines (). In contrast, sociologist Jerry Jacobs () has argued that innovation in the face of diffuse boundaries is what ensures the vitality of disciplines over time. Attempting to demark disciplinary boundaries is counterproductive when the grounds to claim such boundaries are uneven, as is the case with information and data science (). Disciplinary boundaries are important to demark educational, professional, and funding institutions, but focusing too much on the need to form and define disciplinary boundaries implies a discourse of “weakness,” which can cause unnecessary and repetitious debates about how to make the discipline stronger ().

For data science, embracing porous boundaries by evincing openness to new ideas and people could be a means for continually refreshing the field and for broadening the diversity of data science participants generally. There are positive movements in this direction already. The Academic Data Science Alliance, for example, was created in part as an organization that “advocates for justice, diversity, equity, and inclusion of all backgrounds and lived experiences in data science and more broadly in academia” (). In another example, the development of the CARE principles has been an important motivator and signpost in bringing Indigenous voices into data-focused discussions (). This set of principles outlines key approaches to working with any data related to Indigenous peoples or communities, namely, that there should be collective benefit for the relevant Indigenous communities, that the Indigenous communities should have authority to control their data, that there is a responsibility to engage respectfully with Indigenous communities regarding any data collection or use, and that ethics (of the researchers and Indigenous people) should inform data use (). This has been an important addition to the discourse around data science over the past decade.

The extensive debates on these boundary issues have not “solved” the problems of interdisciplinarity within information science over the past 50-plus years. Views on extant or desired boundaries are inevitably dependent on one’s viewpoint and will thus evolve in concert with the participants involved. But as noted in the discussion of question 1 above, these debates have been highly generative intellectually within information science. Studies focused on data and data science may be able to take a lesson from this duality, namely, that though the recurrence of such debates will cause frustration and occasional points of circular argumentation, new voices adding to these discussions have the potential to significantly advance understanding of the nature of data as a foundational concept.

Finally, understanding data in all of its facets requires coupling (or at least embracing) multiple kinds of research, development, and analytical methods. Buckland () has argued that the ability to be “methodologically versatile” must be a calling card for studies of data and information. Some data phenomena can only be studied via statistical methods, while other phenomena can best be studied via engineering, bibliometric, survey, or ethnographic research methods. Qualitative and quantitative methods used in complementary ways may be more effective in solving data-related problems than either type of method individually (). Versatility to shift between or combine these methods allows those who work in interdisciplinary areas to be flexible in the face of new societal and technical developments (). Because of this need for methodological flexibility and versatility, information or data science programs that emphasize only a single methodological approach may be less resilient over time.

The openness to new voices and the methodological versatility described in this section are particularly critical as society becomes ever more data driven. As of this writing, in mid-2020, questions about data are at the center of national and international politics (use of social media data for targeted election advertising), public health (COVID-19 disease data gathering, sharing, and analysis), and global environmental change (measurements and projections of climate change). Twenty years ago, Saracevic () noted that “contemporary information problems are too important to be left to any one discipline.” To paraphrase this for today, contemporary data problems may be too important to be gathered under any one discipline or professional group.

3. Can data science feed into an “empowering profession”?

Given the broad importance of data within a range of societal sectors, there are many calls for data science stakeholders to embrace ethics as a core competency (; ; ). Machine learning, in particular, is under considerable scrutiny as a tool that can be used for ethically questionable purposes (). These techniques may also produce unethical results, even with good intentions (). If, as noted above, data scientists often work with “domain experts” or stakeholders in a client-like relationship, this connection with ethics will manifest on a day-to-day basis through questions about data bias, reliability, integrity, and quality. Instead of trying to deal with this issue indirectly by building better data products (e.g., visualizations or representations), data scientists have an opportunity to embrace the idea of becoming an “empowering profession” (). “Empowering professions” promote their clients’ growth and competence. They do not withhold information or stand behind a bulwark of “expertise” in limiting what is shared with a client.

This does not mean that data scientists should be trying to train everybody else to be data scientists. Instead, it suggests that data scientists could promote data literacy as a means toward the personal empowerment of the people that they work with (). “Data literacy” in this context refers to enabling clients to understand that the products of a data science project (whether machine learning outputs or data infrastructure developments) come with certain embedded assumptions, limitations, and ethical concerns. It also refers to helping clients understand that data are embedded within particular situational and relational contexts, both when collected and when analyzed (). This also would involve finding ways to ensure transparency and interpretability of the outcomes of data science workflows, particularly when they are used for decision-making ().

A move toward empowerment would seek to understand information and data in relation to concepts like vulnerability, trust, autonomy, and agency and would work to support people in approaching the use of technology, documents, information, and data from their own cultural viewpoints, personal interests, and social settings (). As an example, Pierre () studied social media use by children and showed how digital technologies serve as sources of social support, self-expression, and self-assurance, as much as (or more so than) tools for information, data, or knowledge creation.

Data scientists are inevitably political and ethical actors, even if they do not intend or desire to be (). Open research questions remain about the extent to which the data science profession embraces political, ethical, and empowerment-focused research agendas and professional norms. This may be critical to the future development of the field, as empowering clients—that is, enabling them to better understand their own data, and what can (or cannot) be done with them, without needing the data scientist to shepherd every step—will help data science to build a reputation for trustworthiness and social responsibility.

Conclusion

Information and data work must draw from multiple conceptual and practical domains. Understanding and using information and data involves articulations between people, their societies and institutions, and the technologies they create and use (). Because of the vague and contested nature of data as a central organizing concept, new people, institutions, and technologies will continually enter the ecosystem of data science, engendering continual discussion of its disciplinary status. One response to this dynamic is to push for formalization of a discipline and profession with agreed-upon curricula, skills, and professional responsibilities. This may result in periodic stabilizations of disciplinary characteristics, but the discussion within this paper suggests that the definition of data science and its boundaries will be a source of contestation and debate for the foreseeable future. Similar issues have manifested in information science for a century or more and continue to be points of debate today.

Understanding the potential pitfalls in focusing too much effort on disciplinary formalization is critical for data science moving forward. Creating a discipline involves significant institutional work of establishing social and organizational support structures (). The new technologies and analytical methods depicted in Table 1, such as microfilm photography, early digital computers, the internet, and social media, emerged out of, and into, an interconnected web of social institutions (). If data science is to continue to grow as a distinct entity, attendant institutions must likewise be developed. These may include formally organized entities, such as professional associations or caucuses of academic programs. But institutional development also encompasses the emergence of professions and professional norms of conduct, processes for governing standards and tools, and the development of consortia that mediate institutional interactions (; ).

Many current initiatives assume that solidifying the boundaries around data science is possible or desirable. Examining these kinds of assumptions is central to building data science to be a “critical technical practice” (). The stakeholders who are developing the present and future of data science will need to examine the relative merits of embracing porous boundaries and methodological versatility, and they will have to deal with reinvention and circularity of central topics. Leaders in the many data science communities will also have to address whether ethics and empowerment are central to strengthening the foundation of the emerging field and associated set of professional roles. Finally, over time, funders, universities, and professional leaders will need to identify the kinds of institutional developments that will make data science more robust when it encounters the inevitable societal and technological changes of the next few decades.