Looking Back to the Future: A Glimpse at Twenty Years of Data Science

This paper carries out a lightweight review to explore the potentials of data science in the last two decades and especially focuses on the four essential components: data resources, technologies, data infrastructures, and data education. Considering the barriers of data science, the analysis has been mapped into four essential components, highlighting priorities and challenges in social and cultural, epistemological, scientific and technical, economic, legal, and ethical aspects. As a result, the future development of data science tends to shift toward datafication, data technicity, infrastructuralism, and data literacy empowerment. The data ecosystem, at the macro level, has also been analyzed under the open science umbrella, providing a snapshot for the future development of data science.


INTRODUCTION
Data science, namely the science of data and science facilitated by data, has significantly changed how we work, study, and live. Over the past two decades, statistics show a growing interest in data science (see Figure 1): interest remained relatively steady early in the new millennium and has increased constantly in the most recent decade. Moreover, the incredible power embedded in data science has ignited a worldwide digital revolution in which everyone enters the unprecedented big data era (Science International 2015).
This paper carries out a lightweight literature review to survey data science over the last 20 years. It employs scientometric methods to identify data science essentials and introduces examples illustrating multidimensional challenges. It also includes an analysis of future trends and proactive actions. Hopefully, this paper will help reflect on data science work and support its future development under open science guidance (UNESCO 2021).

EXPLORING THE FULL POTENTIAL OF THE BIG DATA ERA
From no data and little data to big data (Borgman 2015), the new era brings the most significant advances to data science. According to Google Trends, the most relevant data science topics searched fall into different categories, such as data, technology, infrastructure, and education (see Table 1). A word frequency analysis of the data science publications on the Web of Science platform shows similar results (see Figure 2). Therefore, we employ the four essential components (see Figure 3), namely data resources, technologies, data infrastructures, and data education, to depict data science trends.
Nevertheless, data are not alone. The supporting information and communication technologies (ICT) provide opportunities to facilitate the full exploitation of data value. From analysis to analytics (Cao 2017), technologies transform data from facts to insights. From data mining to artificial intelligence, technical development drives data to understand the past and envisage the future (Provost & Fawcett 2013). From centralized data control to cloud solutions, technologies support better data management for scalable research. Thus, as catalysts and boosters, technologies improve data performance by adding value to data within the whole life cycle and every workflow.
Second, as the essential game-changers for science, promising and robust technologies support cutting-edge data exploration by combining certain data into tailored scenarios. For example, long-tail value exploration for massive data, lack of data, and sensitive data coexist in scenarios. Thus, any data curation should be set in the lifelong data cycle to avoid chaos, such as real-time massive data integration in Earth Sciences for Sustainable Development Goals (SDGs) research (Guo et al. 2021).

CONCLUSIONS
The past two decades of data science have been long and exciting, full of difficulties and boundless potential. Measured against thousands of years of human history, two decades of data science are extremely short, yet they are of great significance. The transition to the fourth paradigm of scientific research, the global wave of the digital economy accompanied by the rapid rise of many developing economies, the dramatic development of the global village, and the polarization of digital integration and the digital divide are just a few examples. Nevertheless, the great charm of data science lies in the science itself and permeates the everyday social lives of everyone. The power to master the double-edged data sword will advance data science explosively.
As this paper elaborates, data, technology, infrastructure, and education jointly drive the four-wheeled wagon of data science. Together, the four essentials will effectively consolidate and enhance the construction and development of future data science. Meanwhile, the call for open science provides rich soil for the healthy development of the whole data ecosystem. Therefore, looking into the future, we will embrace a better world driven by open data resources, responsible technologies, open infrastructures, and inclusive data education.

DATA ACCESSIBILITY STATEMENT
Data are captured from the Web of Science by selecting articles with 'data science' in the title. No strict constraints are placed on publication dates so that the theme can be traced evenly. As a result, 4,131 records are returned, including 88 records earlier than 2000. Among them, 3,490 publication abstracts with non-null values are taken as valid textual results for word frequency analysis.

Figure 2 Relevant data science topics in Web of Science publications.

Further, data infrastructures facilitate the deployment of technologies, thus supporting data use and reuse. Technical infrastructures provide connectivity for data exchange, capacities for data storage, computing for data processing, algorithms and software for data analysis, and portals and technical support for the accessibility and reusability of data services. Thus, data and technologies evolve together with the supporting data infrastructures. Take the Scientific Data Program of the Chinese Academy of Sciences (CAS SDP), for example (Figure 4) (Zhang et al. 2021): initiated in the 1980s and sustained for over 30 years, the CAS SDP has established data infrastructures to support disciplinary research. The data infrastructures have grown from early databases to the data grid, data engineering, data cloud, and an open data system of systems, and now lead the exploration of co-building a Global Open Science Cloud (CODATA 2021; CSTCloud 2021). The evolving construction of the supporting e-infrastructures reflects the iteration of data and technologies that has adapted to the dynamic research scenarios of different research periods.

Figure 3 Four essential components of data science.

Figure 4 Typical data science policies and practices from the CAS SDP perspective.

Table 1 Relevant data science topics searched on Google.
25 MOST POPULAR TOPICS RELATED TO DATA SCIENCE (GOOGLE TRENDS, 23 FEB. 2023)
CATEGORIES        TOPICS
Technology        Python; analytics; learning; data analysis; machine learning; analysis; artificial intelligence
Infrastructure    computer; machine; project; engineering
Education         data science; course; university; job; master's degree; master of science; bachelor's degree; salary
Others            science; computer science; statistics; business
Note: Data are captured from Google Trends and manually cleaned.
Data Science Journal, DOI: 10.5334/dsj-2023-007
Fourth, data education (NASEM et al. 2018; Wise 2020) should build capacities for specialists and citizens. Data science overlaps with computer science and statistics and focuses on real-world problem solving. Thus, data professionals' challenges may include establishing a proper social identity, connecting with other relevant societal roles, and mapping real problems into tailored curricula, such as 'precision education and individualized learning' (Luan et al. 2020). Training citizens in skills such as coding for data cleaning, analysis, and visualization may also be popular, but that is not enough. Data science education, both professional and amateur training, should aim bigger, covering the whole life cycle of data in an open-science manner to embrace inclusive and responsible data science in the future.
Specifically, no data challenges come alone, and no data science essentials can break the silos independently. Instead, data challenges require collaborative support from technologies. Moreover, infrastructures will facilitate data sharing and technology implementation. Educating data professionals and the social community will empower data fully by "leaving no one behind" (UN Sustainable Development Group 2022). Therefore, solutions should consider the four data science essentials together, not separately.
[...]ogical thinking (Kempeneer 2021) should be involved in the technical design, pulling data results out of the black box of technologies. Other important concerns include technology neutrality, algorithm ethics, legislation redesigns, and global alignments (George & Walsh 2022) to better serve the public interest, as in the case of ChatGPT (van Dis et al. 2023).
[...] be amplified by accumulative advantages, such as current construction scale, prestige, and popularity. Thus, vulnerabilities may lie in the future opportunities to raise funding as newly established infrastructures, to share data as sovereign resource suppliers, and to access data services as niche-demand users. Technical concerns may include exploring systematic design for flexible and extensible services, effective functionality deployment for efficient data curation, and environmentally friendly development.
[...] the global public good (CODATA et al. 2019), data science may assume many future responsibilities, with inevitable trends toward datafication, data technicity, infrastructuralism, literacy empowerment, and others. The following subsections elaborate on each of these issues.
[...] (Mejias & Couldry 2019), quantified data activities prevail in the big data world. Datafication pulls data from traditional statistics to big data analytics with facts, knowledge, and wisdom. Deeply rooted in the digital revolution, datafication contributes to the booming digital economy while encountering societal and cultural conflicts. To better harness the power of datafication, we should embrace open science (UNESCO 2021) more than ever. A series of guidelines should be followed, such as the FAIR principles (Wilkinson et al. 2016), the CARE principles (Carroll et al. 2020), and the TRUST principles (Lin et al. 2020), as well as others. Open data involves sharing for reuse and closing for protection, commercial and noncommercial models, scientific and pragmatic explorations, and close connections among the research community, social enterprises, and citizens. To open or to close may not be contradictory, but there are inevitably frictions and gaps. For the sake of good science, rules should clarify data boundaries, especially highlighting ways for grey data reuse (Borgman 2018). Research integrity and data ethics are also necessary for responsible open science. For example, Indigenous data sovereignty (Carroll, Rodriguez-Lonebear & Martinez 2019), democratic accountability (Gurumurthy, Chami & Bharthur 2016), and other moral aspects of data are to balance the interests of potential stakeholders for sustained lifelong data curation.
Furthermore, constitutive technicity (Gallope 2011; Ash 2012; Ducassé & Lee 2014; Wiktionary 2022) dramatically tightens the bond between data and technologies. 'Technicity' depicts the prevalence of technology deployments in data management. 'Constitutive' emphasizes that these technologies transition from outsiders to components intrinsically engaged with lifelong data management. Data technicity quickens the pace of value extraction from data and even exceeds human vision in many cases (Silver et al. 2017). The evolving technical design should be interoperable across machines and inclusive to humans. Enhanced collaborative research models and international alignment are to follow, affirming the transparency, flexibility, robustness, and intelligibility of technical development to face the ever-growing data deluge locally and globally. The FAIR use of technologies should also follow the open science paradigm, as in tackling natural hazards, health crises, and climate change and in achieving the UN SDGs.
Data infrastructuralism tends to merge data and technology into streaming and scalable services. 'Infrastructuralism' (Breu & Leo 2022; Brehm 2022), adopted here as a neutral concept, refers to the centrality and materiality of data infrastructures. Future data infrastructures will be incredibly important in integrating multiple-sourced data and complex technologies for user-friendly services. Furthermore, infrastructuralism highlights the predominant roles of data infrastructures in coordinating data science essentials. Guided by open science, data infrastructuralism will overturn the traditional business models and provide a reciprocal environment for research and innovation with everyone involved. To better serve as the engines of global research, future facilities should fit into the growing need for enhanced and open data infrastructures, assembling cutting-edge technologies and optimized data resources. Furthermore, these infrastructures should follow interest-balanced and cost-effective models to leverage the responsibilities and rights of all potential stakeholders. Future data infrastructures should also emphasize collaboration, openness, interconnection, and inclusiveness to reshape trustworthy and reliable worldwide science and technology.
In addition, data education extends literacy empowerment to proliferate data science and cultivate the whole society. Skills training (such as knowledge of programming, algorithms, and systems) is useful to capture the exponentially increasing value of data. At the same time, data literacy (Wolff et al. 2016; Gummer & Mandinach 2015) may also consider epidemiology, policies, participants, impacts, and sustainability to serve particular training objectives. Possible courses should be diversified, covering data culture, rights, and ethics for decent data reuse; data epidemiology, critical thinking, and data skills for efficient data reuse; and data accountability, metrics, and data audits for reliable data reuse. Different training pieces should educate data professionals and citizens, thus establishing shared values on data science. Data education will also undergo interactive processes to help everyone get ready for the changing world.
Meanwhile, besides the four essentials, future data science also calls for a full data picture at the macro level. Accordingly, the data ecosystem should be established based on mutual trust. Under the umbrella of open science, future data science will work efficiently and systematically as an ecosystem, with data, technology, education, and others integrated through infrastructures. Thus, key actions may include establishing and maintaining an open science environment for the data ecosystem, involving potential stakeholders under sound management strategies (i.e., interest-balanced models) and sustained models (i.e., fair and efficient reward systems), opening dialogues between communities, and collaborating on open science and data initiatives. In addition, future data ecosystems should encourage the data-sharing culture and enhance the global alignment between physical and virtual facilities to support the data flow of enormous research scopes across domains and regions. Surely, based on a harmonized data ecosystem, data science and the essentials will help us prepare deeply and widely for the adventurous data journey forward.
The word frequency analysis and word cloud visualization were performed in Python. Data are available at www.webofscience.com [Last accessed 24 February 2023]. Refined data and Python code are available at: https://doi.org/10.57760/sciencedb.07847.
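For readers who wish to try a similar analysis, the following is a minimal sketch of the abstract filtering and word frequency counting described in the Data Accessibility Statement. It is not the authors' code, which is archived at the DOI above. The record layout (a dictionary with an `abstract` field), the stop-word list, the three sample records, and the function names `valid_abstracts` and `word_frequencies` are all assumptions made for illustration. Only the Python standard library is used, and the word cloud rendering step is omitted.

```python
import re
from collections import Counter

# Hypothetical stop-word list (an assumption); a real analysis would
# likely use a fuller list.
STOP_WORDS = frozenset({
    "a", "an", "and", "the", "of", "to", "in", "for", "on",
    "is", "are", "with", "by", "this", "that", "we", "as",
})


def valid_abstracts(records):
    """Keep only records whose abstract is present and non-empty."""
    return [
        r["abstract"].strip()
        for r in records
        if r.get("abstract") and r["abstract"].strip()
    ]


def word_frequencies(abstracts, stop_words=STOP_WORDS):
    """Count lowercase alphabetic tokens, skipping stop words and short tokens."""
    counts = Counter()
    for text in abstracts:
        tokens = re.findall(r"[a-z]+", text.lower())
        counts.update(
            t for t in tokens if t not in stop_words and len(t) > 2
        )
    return counts


# Made-up sample records standing in for an export of bibliographic data.
records = [
    {"title": "Sample A", "abstract": "Data science relies on data infrastructures."},
    {"title": "Sample B", "abstract": None},
    {"title": "Sample C", "abstract": "Machine learning supports data analysis."},
]

abstracts = valid_abstracts(records)
freq = word_frequencies(abstracts)
print(freq.most_common(1))  # [('data', 3)]
```

The resulting counts could then be passed to a visualization tool of choice to draw a word cloud such as the one in Figure 2.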