Studying an artificial intelligence lab focused on developing tools for knowledge representation in the early 1990s, anthropologist Diana Forsythe () argued that computer scientists tend to ‘delete the cultural’ when rendering knowledge machine readable. In their work to translate knowledge into computer code, the scientists would seek to render visible the complex bodies of information that orient diverse communities, while expunging their own roles as knowledge curators and interpreters from consideration. The argument extended a similarly framed claim from Susan Leigh Star () that computer scientists tend to ‘delete the social,’ privileging technological concerns over social ones in their design practice. This essay will reflect on how discourse on the ‘social’ and the ‘cultural’ has evolved in the data science community, arguing that despite increased attention to sociocultural issues, data scientists tend to overlook the cultures orienting their own work. I then contrast these deletions with the methods that we engage in a course I teach at Smith College called Data Ethnography—a course that aims to foster underprioritized reflexive sensibilities in data science work. I conclude with recommendations for instituting protocols that would prompt researchers and practitioners to document the cultural provenance of their data science policies and infrastructure.

Where Are the ‘Social’ and the ‘Cultural’ in Data Science Discourse?

Fast forward to the early 2020s, and ‘the social’ has gained airtime in data science discourse. Perhaps the most common refrain from technologists that I have heard at data science conferences and meetings has been that the key challenges data science communities face ‘are not technical but social’ and that, as a community, we need to be focused on building social infrastructures in addition to technical ones. Admirably, over the past 20 years, data science organizations and journals have drawn attention to social barriers to data sharing, such as differing incentive structures and levels of training within and across diverse disciplinary communities, scholarly generations, and geographic boundaries (; ; ; ). Research examining how to support the uptake of best practices, such as FAIR (findable, accessible, interoperable, and reusable) guidelines and data management planning, has shown that adoption depends not only on clear guidance and well-designed infrastructure but also on social advocacy, relationship building, and an amenable financial, legal, and policy landscape (). Movements to prioritize the rights and interests of Indigenous peoples in the knowledge economy have highlighted the need for alternative data governance models that privilege self-determination and collective benefits ().

Similarly, ‘the cultural’ has earned a place in data science discourse. I have been in countless data science meetings and conferences where I have heard communities reference the ‘culture problem’ data science faces. How do we bring about the ‘culture change’ necessary to facilitate data sharing (; )? How do we get everyone to adopt the same standards or speak the same language? Or, alternatively, how do we develop the translational tools to map common meanings across different languages? It is notable that these concerns are often raised in the name of fairness, equity, and inclusion and have resulted in commendable educational efforts (; ; ).

While the ‘social’ and the ‘cultural’ have been prioritized in data science discourse, the social and cultural concerns that get raised are almost always outwardly focused—applying more to the communities that data scientists seek to support than to computationally focused data science communities themselves. This outward focus is reinforced by central organizing principles within data science projects, institutions, and policies. As David Ribes et al. () argue, data science and other computational research communities are often seen as sitting independently of ‘domains,’ framing those communities as domain agnostic.

We see this guiding principle in countless settings. For instance, Ribes et al. () detail how this logic has guided funding policies at the US National Science Foundation and the National Academy of Sciences. As data science curricula build out at numerous institutions, there is a tendency to separate computational/analytical data science from more applied data science domains. Work in organizations like the Research Data Alliance tends to separate organically into domain tracks (focused on addressing discipline-specific data infrastructure concerns) and nondomain tracks (focused on developing domain-agnostic infrastructure and policies in support of domains).

Norms of interaction across ‘domain’ and ‘nondomain’ communities further buttress these roles. For instance, to develop data frameworks and infrastructures that can bridge diverse disciplinary cultures, folks in nondomain tracks will draw up use case templates and distribute them to multiple domain communities in an effort to learn more about what makes each community unique—that is, what their assumptions are, what their commitments are, and what distinct disciplinary challenges they face. While prioritizing the acquisition of knowledge about other data cultures, we tend not to think about the cultures of the communities authoring the use case templates, interpreting the information collected from domain groups, and translating that information into infrastructure. This domain-agnostic positioning frames data scientists as neutral translators—as responsible for designing the tools to bridge across disparate social systems and cultures, mitigating ‘data friction’ (). What would it look like to turn an ethnographic focus back on the data science communities responsible for policy and infrastructure design? What are the stakes?

Thick Description for Data Science

In his seminal work The Interpretation of Cultures, anthropologist Clifford Geertz () presents a case for framing anthropology as a science in search of meaning. As an illustrative example, he asks us to consider how we discern the differences in meaning between a twitch of the eye and a wink. On a surface level, we see a wink as one eyelid closing and reopening; however, on an interpretive level, we categorize the movement as a mode of communication—indicating a joke, affection, or greeting. To ultimately discern the meaning of a wink, we have to take into consideration a number of contextual factors beyond the movement of the eyelid; it requires us to detect the symbolism in the action—to draw out its semiotics. The vehicle through which anthropologists move beyond surface-level observations of behaviors like a wink (and toward their contextualized interpretations) is thick description. Engaging thick description involves documenting detailed descriptions of behaviors or events and enriching those descriptions with interpretations of their symbolic cultural meaning.

Let me provide an example from a recent research project—a cultural analysis of semantic web infrastructure—to demonstrate how this applies to the data science community. Since the start of that project, I had been fascinated by debates around the meaning and flexibility of a property in many ontology languages that serves to indicate that one data point is the ‘same as’ another data point. I had been following discussions lamenting the misuse of the ‘same as’ property ‘in the wild’—when everyday web users were leveraging the property to mark equivalence between two things that were not ‘strictly’ identical (). This issue, one conference paper argued, was leading to a logical ‘crisis’ of identity that was turning the interconnected web of data into ‘the semantic equivalent of mushy peas’ (). Studying these concerns as a data ethnographer provided insight into the language ideologies that guided the design of semantic web infrastructure.
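
For readers who have not worked with ontology languages, the sketch below shows what such a ‘same as’ assertion looks like in practice. It is a minimal illustration written with the Python rdflib library; the URIs and the name value are invented for the example and are not drawn from any actual dataset.

```python
# A minimal sketch (using rdflib; the URIs are illustrative) of how
# owl:sameAs links two identifiers that are asserted to denote one thing.
from rdflib import Graph, Literal, URIRef
from rdflib.namespace import FOAF, OWL

g = Graph()
remote = URIRef("http://dbpedia.org/resource/Example_Person")
local = URIRef("http://example.org/people/42")

# An attribute recorded against the local identifier only.
g.add((local, FOAF.name, Literal("Example Person")))

# The contested assertion: these two URIs refer to the very same thing.
g.add((local, OWL.sameAs, remote))

def names_for(graph, uri):
    """Collect foaf:name values for a URI and anything declared sameAs it.

    This follows direct sameAs links in both directions; a full OWL
    reasoner would also close over chains of such links.
    """
    aliases = {uri}
    aliases |= set(graph.objects(uri, OWL.sameAs))
    aliases |= set(graph.subjects(OWL.sameAs, uri))
    return [str(n) for a in aliases for n in graph.objects(a, FOAF.name)]

# Because of the sameAs link, a query about the remote URI surfaces the
# locally recorded name.
print(names_for(g, remote))  # ['Example Person']
```

The stakes of that single assertion are what the debates described below worry over: once two URIs are declared the same, anything said about one is, in effect, said about the other.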

In 2016, while I was sitting at my desk in Troy, New York, trying not to let email pull me away from writing, a message with the subject line ‘Deprecating owl:sameAs’ pinged in my inbox. It had been sent to a World Wide Web Consortium semantic web email list—where a great deal of planning for semantic web infrastructure had been occurring. The email read as follows:

The research that I’ve done makes me conclude that we need to do a massive sweep of the LOD cloud and adopt owl:sameSameButDifferent.

Such an owl:sameSameButDifferent statement indicates that two URI references actually refer to the same thing but may be different under some circumstances. ()

I remember being taken aback initially. Over the next few hours, more responses came flooding in, suggesting additional properties such as ‘owl:differentDifferentButSame,’ ‘owl:isKindaLike,’ ‘owl:sometimesSameAs,’ ‘sameAsItEverWas,’ and ‘owl:actuallySameAsReally.’

Only after reading a few responses did I recognize that the date was April 1 and that many (though notably not all) respondents were jumping on a satirical bandwagon to participate in an April Fools’ joke.

Thick description enables an ethnographer to move from seeing this simply as an exchange of suggestions over email to discerning the significance of the humor. It gets us asking the question, Why is this funny, and what prods folks to join in on the joke? Writing on the significance of discerning irony when ethnographically studying computing communities, anthropologist Nick Seaver () notes,

Only through deep engagement and richly contextual description could the ethnographer distinguish such variety—or, in other words, be in on the joke. Superficial accounts risk taking ironic statements literally or missing the conflicted experience of programmers negotiating between different sets of values.

Drawing on the context of conversations I had heard up to that point, I came to see this satirical exchange as a marker of an emerging collective cynicism, at least among some in this community, around both the proliferation and the precision of data standards. It marked a recognition of the complexity of ‘sameness’ and ‘identity,’ along with a concern over the futility of attempting to nail the concepts down, particularly in a space like the World Wide Web.

Documenting this interaction via thick description helps us see that these shifting beliefs and values inform data infrastructure design work; they become interlaced in the infrastructures we engage as data scientists—and they matter. When digital systems rely on these codified semantics to determine how to generate search results, recommend related content, or make automated decisions, designers’ negotiated beliefs, convictions, and hesitations shape how our knowledge systems portray the world to us.

Teaching Thick Description

Thick description takes center stage in a course I teach at Smith College called Data Ethnography. The aim of the course is to help data science students develop an awareness of and ability to evaluate the cultural logics that orient data science work so that they can recognize and intervene when those logics are out of sync with their own ethics. To develop this awareness, we use thick description to excavate the cultural values and meanings that are often rendered invisible in the data science discipline.

While any of ethnography’s principal methods could guide this course, we focus on thick description for a few reasons. First, most students entering the course have not had an opportunity to reflect on the symbolic cultural provenance of the data resources, tools, and infrastructures they work with. They are often surprised to learn that a dataset documenting the measurements of different iris species—a dataset used in countless data science courses to introduce machine learning concepts—has ties to the eugenics movement (). Reading seminal thick descriptions of data science infrastructures—for example, of classifications like the International Classification of Diseases (), database models like NoSQL (), statistical frameworks like homophily (), and datasets like ImageNet ()—students develop an appreciation for how the designs of data infrastructures are guided by certain belief systems, political commitments, dominant discourses, community rituals, and organizational incentive structures. Turning an ethnographic eye to the communities producing these data infrastructures, we are reminded that their configurations are not given but emerge as data scientists identify priorities and negotiate trade-offs.
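
To give a sense of how invisible this provenance can be in everyday practice, here is a sketch of the sort of snippet through which the iris data typically enters a classroom, assuming scikit-learn’s bundled copy of the dataset (the particular model is incidental). Nothing in this routine workflow prompts a student to ask where the measurements came from or why they were collected.

```python
# A sketch of the iris dataset's typical classroom appearance: a few lines
# of scikit-learn, with no prompt to consider the data's origins.
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, random_state=0
)

model = LogisticRegression(max_iter=200).fit(X_train, y_train)
print(f"Test accuracy: {model.score(X_test, y_test):.2f}")
```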

Consider the debates between the ‘structuralists’ and the ‘minimalists’ that oriented the design of Dublin Core; the debates between the ‘neats’ and the ‘scruffies’ that guided the design of the Web Ontology Language (OWL) (); or how, in its efforts to embody a ‘middle’ ontology (i.e., to avoid becoming an ‘ontology of everything’ while still remaining useful) (), the collaborators working on schema.org had to make some critical judgments regarding what terms should be enumerated within the core schema and how. For example, does a ‘public toilet’ deserve a place as a civic infrastructure in schema.org, and if so, should the schema offer subtypes for different genders ()? Scholarship in critical data studies and information studies demonstrates how these design debates and negotiations are not arbitrary; they impact how knowledge forms and disseminates, at times resulting in unjust and discriminatory representations of communities (; ; ; ; ). Though these debates are critical components of these infrastructures’ provenance, references to them rarely appear in the documentation.
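
To make concrete what is at stake in such judgments, consider how a term plays out downstream once it has been admitted: publishers can mark up a page with it and search engines can index the result, all without any trace of the argument over whether and how the term belonged. The snippet below is purely illustrative, written as a Python dictionary standing in for schema.org JSON-LD markup; the values are invented, and ‘PublicToilet’ stands in for any term whose place in the core schema had to be argued for.

```python
# An illustrative JSON-LD object, expressed as a Python dict; the values
# are made up, and the type stands in for any contested schema.org term.
import json

markup = {
    "@context": "https://schema.org",
    "@type": "PublicToilet",
    "name": "Main Street public restroom",
    "isAccessibleForFree": True,
}
print(json.dumps(markup, indent=2))
```

The markup circulates; the deliberation that produced it does not.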

This erasure of cultural history leads us to the second reason thick description is prioritized in Data Ethnography: engaging thick description marks a methodological counterpoint to the norms and commitments that students have grown accustomed to in data science. The first time I assign students to thickly describe a data environment (such as a classroom of biology students collecting field data or a data science hackathon), I hear questions like the following: How do I make sure that my own biases don’t influence the way I interpret what I’m observing? How do I make sure my presence doesn’t influence the way that people behave (and thus the ethnographic data that I collect)? How do I make sure I get a representative sample of observations?

I remind students that ethnographers tend to approach these questions differently than data scientists: we assume that biases will always influence the way we interpret ethnographic data, that our presence will always influence the data we collect, and that there is no threshold at which culture becomes ‘representative.’ In the ethnographic communities that I work within, these personal biases and influences are part of the cultural phenomena that we aim to analyze and document. In providing this methodological counterpoint, engaging data ethnography encourages students to recognize and interrogate the norms of data science communities while also fostering the reflexive sensibilities that have historically been underrepresented in traditional STEM disciplines. The course asks them to discern their own cultural positioning while analyzing that of others. It is an effort to subvert cultural deletions.

Conclusion

As designers of data policy and infrastructure, data scientists play an integral role in shaping what forms of knowledge production are made possible, who can participate in knowledge production, and how cultural meaning is made from collected data. With this in mind, data science communities have a responsibility to attend not only to the cultures that orient the work of domain communities but also to the cultures that orient their own work.

Recent scholarship has pointed to pathways forward: to encourage reflection on the assumptions and motivations that underlie the creation, distribution, or maintenance of datasets, Gebru et al. () recommend that all dataset producers document their practices in ‘datasheets.’ Further research has shown that the practice of producing these documents has prompted data scientists to recognize and deepen their understanding of ethical issues that emerge in relation to machine learning models (). There are opportunities to extend and implement these reflexive protocols beyond dataset creation. Organizations responsible for the design and dissemination of data science infrastructure and standards (such as the Research Data Alliance and the Committee on Data of the International Science Council) can encourage similar documentation practices whenever a new recommendation or deliverable gets published, prompting designers to report not only on the scope, impact, and use cases for outputs but also on the motivations, assumptions, and debates that guided their design. These organizations can also help connect more computationally focused groups with anthropologists, sociologists, and science and technology studies scholars with expertise in detailing this type of cultural provenance. Funders could mandate datasheets and other forms of reflexive documentation in annual project reporting, and the Data Science Journal can encourage submissions that position new data science policies, applications, and infrastructures in their historical and cultural context.
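
To sketch what such reflexive documentation might record, the snippet below imagines a minimal provenance entry for an infrastructure deliverable. The field names are hypothetical and are not drawn from any existing Research Data Alliance or CODATA template; the point is simply that motivations, assumptions, and unresolved debates could sit alongside the conventional scope and use-case fields.

```python
# A hypothetical sketch of reflexive documentation for an infrastructure
# deliverable; the field names are illustrative, not an existing template.
from dataclasses import dataclass, field
from typing import List

@dataclass
class DeliverableProvenance:
    title: str
    scope: str                        # conventional documentation
    intended_use_cases: List[str]     # conventional documentation
    motivations: List[str]            # why the authoring group took this on
    assumptions: List[str]            # beliefs the design takes for granted
    unresolved_debates: List[str]     # disagreements folded into the design
    authoring_communities: List[str] = field(default_factory=list)

record = DeliverableProvenance(
    title="Example metadata recommendation",
    scope="Descriptive metadata for tabular research data",
    intended_use_cases=["cross-repository search"],
    motivations=["funder mandates for data sharing"],
    assumptions=["every dataset has a single identifiable creator"],
    unresolved_debates=["how finely to subdivide 'creator' roles"],
)
print(record.title, "records", len(record.unresolved_debates), "unresolved debate(s)")
```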

Finally, to ensure that the next generation of data scientists is ready to engage in this form of reflection and documentation, it will be important for college and university data science programs to foster skills in critical pedagogical traditions that typically get excluded from STEM (such as ethnography, hermeneutics, and critical analysis). More generally, addressing data science’s ‘culture problem’ will demand widespread recognition that we can never design data infrastructure from a cultureless place.