The continuous evolution of data management systems affords great opportunities for the enhancement of knowledge and advancement of science research. To capitalize on these opportunities, it is essential to understand and develop methods that enable data relationships to be examined and information to be manipulated. Earth Science Data Analytics (ESDA) comprises the techniques and skills needed to holistically extract information and knowledge from all sources of available, often heterogeneous, data sets. This paper reports on the groundbreaking efforts of the Earth Science Information Partners (
The continuous evolution of data management systems affords great opportunities for the enhancement of knowledge and advancement of Earth science research. With the growing need and desire to leverage information from various sources to better understand our environment, it becomes evident – through community experience and foresight – that this can be maximized by accepting new ways to cross-examine this information. As excerpts in the ‘4th Paradigm’ (
‘We have to do better at producing tools to support the whole research cycle—from data capture and data curation to data analysis and data visualization’. (xvii)
‘Clearly, data-intensive science… must move beyond data warehouses and closed systems, striving instead to allow access to data to those outside the main project teams, allow for greater integration of sources, and provide interfaces to those who are expert scientists but not experts in data administration and computation’. (147)
‘We are already seeing some attempts to infer knowledge based on the world’s information’. (167)
To capitalize on these opportunities, it is essential to develop a data analytics framework in which we scope the scientific, technical, and methodological components that contribute to advancing science research. Through this framework, we can categorize discussions amongst individuals of like component interests, instead of attempting to draw specific direction from a set of starting points that may greatly vary.
Data Analytics is the process of examining large amounts of data of a variety of types to reveal hidden patterns, unknown correlations, and other useful information, key to facilitating Earth science research opportunities. Thus, the research presented here is motivated by the need to determine and categorize available data analytics techniques and skills and to identify the gaps where they are still needed.
Today, and well into the foreseeable future, there is a rapid growth in the amount of Earth science data and value-added heterogeneous information that many Earth science researchers have not yet holistically leveraged. It is important to realize that this rate of data growth is new and challenging. It is new in that information technology is just beginning to provide the tools for advancing the analysis of heterogeneous datasets in a ‘big’ way, providing opportunities to discover non-obvious scientific relationships previously invisible to the science eye. The challenge is that it takes individuals, or teams of individuals, with just the right combination of skills to understand the data and develop the methods to glean knowledge from data and information.
The ability to apply the information technology, tools, and services needed to advance Earth science research is becoming more evident, and more necessary, at an accelerating rate. That is, if data manipulation (subsetting, data transformation, format conversion, etc.) extracts information from data, then data analytics techniques and skills glean knowledge from information (
The objectives of the Earth Sciences Information Partners (
In this paper, we present the ESIP-derived definition of ESDA and differentiate it from other publicized definitions of data analytics. We then describe different types of ESDA and their driving goals. This is followed by an exhaustive survey of current techniques and skills for performing ESDA, made available for the benefit of Data Scientists exploring new Earth science data analytics methodologies and their potential use.
The advancement of information use resulting from evolving technologies, newly developed techniques, and refined skills has become the purview of the Data Scientist performing data analytics. Data analytics got its start in the business world, which is why most literature and developed tools reflect back on business as the primary application. In the literature, we find that data analytics comprises five types: Descriptive, Diagnostic, Discoverative, Predictive, and Prescriptive. When the ESIP ESDA Cluster attempted to categorize Earth science research use cases into these data analytics types, the use cases did not fit: the categorization was ambiguous and/or the use cases fit more than one type. Where business data analytics types reflect looking for patterns and predicting (and prescribing) actions, Earth science data analytics also includes assessing, validating, calibrating, and applying the techniques required to prepare raw datasets for co-use. In addition, the characteristics of Earth science data introduce data analytics challenges such as differing formats, differing spatial and temporal data resolutions, inconsistent data acquisition techniques and units for the same measurement, noise, and biasing, to mention a few. This led to the need for a data analytics definition directed specifically at Earth science research goals.
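To make the co-use challenge concrete, the following is a minimal sketch, not a method taken from the ESDA Cluster, of harmonizing two records of the same measurement that differ in units and temporal resolution. The sensor names, sample values, and function names are all hypothetical and chosen only for illustration; it assumes one hourly record in degrees Fahrenheit and one daily record in degrees Celsius.

```python
from collections import defaultdict
from datetime import datetime
from statistics import mean


def fahrenheit_to_celsius(temp_f):
    """Convert a temperature from degrees Fahrenheit to degrees Celsius."""
    return (temp_f - 32.0) * 5.0 / 9.0


def hourly_to_daily_mean(hourly):
    """Aggregate (ISO timestamp, value) pairs into a {date: mean value} mapping."""
    by_day = defaultdict(list)
    for timestamp, value in hourly:
        day = datetime.fromisoformat(timestamp).date().isoformat()
        by_day[day].append(value)
    return {day: mean(values) for day, values in by_day.items()}


def harmonize(hourly_f, daily_c):
    """Return {date: (record A in C, record B in C)} for dates present in both."""
    a_daily = hourly_to_daily_mean(
        [(ts, fahrenheit_to_celsius(value)) for ts, value in hourly_f]
    )
    return {
        day: (a_daily[day], daily_c[day])
        for day in sorted(a_daily)
        if day in daily_c
    }


# Hypothetical sample values, for illustration only.
sensor_a_hourly_f = [
    ("2020-01-01T00:00", 32.0),
    ("2020-01-01T12:00", 50.0),
    ("2020-01-02T00:00", 41.0),
]
sensor_b_daily_c = {"2020-01-01": 4.5, "2020-01-03": 7.0}

paired = harmonize(sensor_a_hourly_f, sensor_b_daily_c)
```

Only the dates covered by both records survive the pairing, which is one simple way to bring records onto a common footing before intercomparison; real datasets would also need spatial matching and quality screening.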
In addition, insights like: ‘Researchers in science must work with colleagues in computer science and informatics to develop field-specific requirement’ (
A significant aspect of ESDA is information literacy, the ability to “recognize when information is needed and have the ability to locate, evaluate, and use information effectively” (
As seen in the literature, there is no shortage of data analytics definitions, or of descriptions of the individuals who perform data analytics, the Data Scientists. The Booz/Allen/Hamilton (B/A/H) Report, ‘The Field Guide to DATA SCIENCE’ (
The National Institute of Standards and Technology provides the following definitions (
Data science is the extraction of actionable knowledge directly from data through a process of discovery, or hypothesis formulation and hypothesis testing.
A data scientist is a practitioner who has sufficient knowledge in the overlapping regimes of business needs, domain knowledge, analytical skills, and software and systems engineering to manage the end-to-end data processes in the data life cycle.
The analytics process is the synthesis of knowledge from information.
The article, ‘8 skills you need to be a Data Scientist’ (
Skills of a Data Scientist (


Basic Tools  Data Munging 
Basic Statistics  Data Visualization & Communication 
Machine Learning  Software Engineering 
Multivariable Calculus and Linear Algebra  Thinking Like a Data Scientist 
The website, Master’s in Data Science (
Technical skills and tools of a Data Scientist (Master’s in Data Science).

Math (e.g. linear algebra, calculus and probability) 
Statistics (e.g. hypothesis testing and summary statistics) 
Machine learning tools and techniques (e.g. k-nearest neighbors, random forests, ensemble methods, etc.) 
Software engineering skills (e.g. distributed computing, algorithms and data structures) 
Data mining 
Data cleaning and munging 
Data visualization (e.g. ggplot and d3.js) and reporting techniques 
Unstructured data techniques 
R and/or SAS languages 
SQL databases and database querying languages 
Python (most common), C/C++, Java, Perl 
Big data platforms like Hadoop, Hive & Pig 
Cloud tools like Amazon S3 
In addition, the McKinsey Global Institute (
‘Doing Data Science’ (
Much information regarding data analytics techniques and skills has been found in presentations given at informatics forums such as the American Geophysical Union (AGU) Earth and Space Science Informatics (ESSI), ESIP, and SciDataCon. Beyond the forum sessions, which are rich with experienced individuals who describe techniques and skills that facilitate data analytics, the authors of this paper have also taken the opportunity, at AGU, to visit science poster presentations to better understand the research methodologies (analytics) utilized. At these meetings, research that involved co-analysis of multiple datasets was sought out. After scanning hundreds of presentations for methodologies used, 31 atmospheric science focused (study of gases) and 12 hydrologic science focused (study of liquid) presentations were targeted in which research techniques were identified (Table
Sampling of science research techniques being used.
Science Research Technologies (Sampling)  

In Atmospheric Research  In Hydrology Research  
Correlation Analysis; Bias Correlation  Spectral Analysis  Linear Regression 
Regression Analysis; Bivariant Regression  Temporal Trending; Trend Analysis  Monte Carlo 
Decision Tree  Spatial Interpolation  Darcy Equation 
Machine Learning  Revised Averaging Scheme  Poisson Regression 
Data Mining  Forward Modeling; Inverse Modeling  Multivariate time series analysis 
Data Fusion  Radiative Transfer Model  BUDYKO formula 
Computational Tools  Bayesian Synthesis Inversion  Smoothing (Gaussian) 
Constrained Variational Analysis  Temporal Stability  Filtering (Destriping) 
Model Simulations  Gaussian Distribution  MESH Model 
Ratios  Exponential Differentiation  
Time Series Analysis 
This small study opens our eyes to the data analytics techniques pertaining specifically to data analysis. Yet many of these techniques also find homes in performing data preparation and data reduction analytics.
For clarity, data analytics activities fall within the scope and expertise of the Data Scientist. Data Scientists study and develop methods for analyzing, storing, and presenting data. When they practice their skills on specific problems, they are performing data analytics, applying tools and techniques to co-analyze heterogeneous data. Data Scientists, as researchers, developers, or data analytics practitioners, require similar skill sets.
Through literature research, analysis of Earth science research use cases, and integration of the science research methods, the ESIP Federation – a collaborative organization of over 170 information-centric partners – has defined and adopted the following definition of ESDA (
This includes:
Data Preparation includes the methods and techniques that uncover, discover, and extract the data of greatest interest. This can involve filtering, mining, format conversion, smoothing, visualization, etc. Data Reduction addresses the very large amounts of heterogeneous data that confront Earth science research. Several methods can be applied for the purpose of data reduction, with the goal of making data transfer, computation, and analysis easier and/or more focused. Analytics to perform Data Analysis is not as clear-cut. Science, by its nature, often utilizes technologies that are not decided upon until the data/information is initially looked at and better understood. It is then that researchers experiment with existing and/or novel analytics techniques. In our paradigm, Data Analysis analytics includes all aspects of science research: hypothesis-driven and data-discovery-driven methods, as well as goal-driven decisions, outcomes, and impacts. Data Analysis analytics aspects can be categorized separately, if desired.
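The three stages described above can be sketched as a small pipeline. This is an illustrative example under stated assumptions, not a prescribed ESDA workflow: the fill value, the block size, the sample pairs, and the function names are hypothetical. Filtering stands in for data preparation, block averaging (aggregation) for data reduction, and a Pearson correlation for data analysis.

```python
from math import sqrt
from statistics import mean

FILL_VALUE = -9999.0  # assumed missing-data marker; real products document their own


def prepare(pairs):
    """Data preparation: drop pairs in which either value is the fill value."""
    return [(x, y) for x, y in pairs if x != FILL_VALUE and y != FILL_VALUE]


def reduce_by_block(pairs, block_size):
    """Data reduction: replace each block of consecutive pairs by its mean."""
    blocks = [pairs[i:i + block_size] for i in range(0, len(pairs), block_size)]
    return [(mean(x for x, _ in b), mean(y for _, y in b)) for b in blocks]


def pearson(pairs):
    """Data analysis: Pearson correlation coefficient of the paired values."""
    xs = [x for x, _ in pairs]
    ys = [y for _, y in pairs]
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in pairs)
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Hypothetical co-located measurements from two datasets, for illustration only.
raw = [
    (1.0, 2.1), (2.0, 3.9), (FILL_VALUE, 5.0), (3.0, 6.2),
    (4.0, 8.0), (5.0, FILL_VALUE), (6.0, 12.1), (7.0, 13.8),
]

prepared = prepare(raw)
reduced = reduce_by_block(prepared, block_size=2)
r = pearson(reduced)
```

Keeping each stage as a separate function mirrors the categorization in the text: a technique such as filtering or aggregation can be swapped out without touching the analysis step.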
ESDA is categorized by the goals of the analytics performed (Table
Earth Science data analytics goals.
ESDA Goals 

To calibrate data 
To validate data (note it does not have to be via data intercomparison) 
To assess data quality 
To perform coarse data preparation (e.g. subsetting data, mining data, transforming data, recovering data) 
To intercompare datasets (i.e. any data intercomparison; could be used to better define validation/quality) 
To tease out information from data 
To glean knowledge from data and information 
To forecast/predict/model phenomena (i.e. a special kind of conclusion) 
To derive conclusions (i.e. that do not easily fall into another type) 
To derive new analytics tools 
With a definition of ESDA, we now have an initial framework within which to speak the same ‘language’, to discuss ESDA scope, infrastructure, and methodologies with a common focus, and on which to grow.
ESDA techniques are considered to be computational methods. Repeatedly, individuals seeking to perform Earth science express their need to utilize mathematics, numerical modeling, statistics, and software engineering, and to integrate data from across multiple domains. There is also a need for expertise in techniques such as rule learning, classification, cluster analysis, data fusion, machine learning, neural networks, anomaly detection, modeling, time series analysis, and visualization.
In addition, many other computational methods have been identified as potential techniques to be used in performing ESDA. Table
Earth science data analytics techniques (sampling).
Earth Science Data Analytics Techniques 


Data Preparation: Bias Correction, Coordinate Transformation, Data Engineering, Data Mining, Data Munging, Database Management, Exponential Differentiation, Filtering, Format Conversion, Imputation, Normalization/Transformation, Outlier Removal, Ratios, Rule Learning, Sensitivity Analysis, Smoothing, Spatial Interpolation, Time Series, Visualization 
Data Reduction: Aggregation, Anomaly Detection, Cluster Analysis, Data Engineering, Data Fusion, Factor Analysis, Filtering, Neural Networks, Outlier Removal, Ratios, Revised Averaging Scheme, Rule Learning, Time Series, Visualization 
Data Analysis: Anomaly Detection, Bayesian Techniques, Bivariant Regression, Classification, Correlation/Regression Analysis, Factor Analysis, Fourier Analysis, Gaussian Distribution, Graphics Analysis, Imputation, Linear/Nonlinear Regression, Machine Learning/Decision Tree, Mathematics/Calculus, Modeling, Monte Carlo Method, Multivariate Time Series, Normalization, Pattern Recognition, Principal Component Analysis, Revised Averaging Scheme, Rule Learning, Signal Processing, Spectral Analysis, Statistics, Temporal Trend Analysis, Time Series, Visualization 
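As an illustration of how two of the listed techniques might be combined, the sketch below pairs temporal trend analysis with anomaly detection: it fits a least-squares linear trend to an evenly sampled series and flags points whose detrended residual is unusually large. The function names, the two-standard-deviation threshold, and the sample series are assumptions made for this example rather than choices drawn from the surveyed use cases.

```python
from statistics import mean, pstdev


def linear_trend(values):
    """Least-squares slope and intercept of values against their index.

    Assumes evenly spaced samples and at least two values.
    """
    n = len(values)
    t_mean = (n - 1) / 2
    v_mean = mean(values)
    numerator = sum((t - t_mean) * (v - v_mean) for t, v in enumerate(values))
    denominator = sum((t - t_mean) ** 2 for t in range(n))
    slope = numerator / denominator
    return slope, v_mean - slope * t_mean


def detect_anomalies(values, threshold=2.0):
    """Indices whose detrended residual exceeds `threshold` standard deviations."""
    slope, intercept = linear_trend(values)
    residuals = [v - (intercept + slope * t) for t, v in enumerate(values)]
    spread = pstdev(residuals)
    if spread == 0:
        return []  # perfectly linear series: nothing to flag
    return [t for t, r in enumerate(residuals) if abs(r) > threshold * spread]


# Hypothetical series: a steady upward trend with one spike at index 4.
series = [10.0, 10.5, 11.0, 11.5, 18.0, 12.5, 13.0, 13.5, 14.0, 14.5]

slope, intercept = linear_trend(series)
anomalies = detect_anomalies(series)
```

Note that the spike itself biases the fitted trend slightly, which is one reason anomaly detection appears under both data reduction and data analysis: in practice the flagged points might be removed and the trend refitted.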
ESDA skills are considered to be the ability to apply techniques. With regard to Earth science, this refers to applying ESDA techniques to the Earth science domains being studied. Thus, ESDA skills include knowledge of the particular Earth science domains where data analytics can advance the understanding of science.
ESDA skills also refer to the ability to facilitate making data useful. This includes understanding the relevance of: the data lifecycle, data structures, metadata, data integration, and data interpretation.
Table
Earth science data analytics skills (sampling).
Earth Science Data Analytics Skills 

Ability to integrate data across multiple domains 
Support domain scientists with data & computational knowledge 
Communicate across domains 
Knowledge of data cycle 
Software engineering 
Software programming 
Data Engineering 
Decision science 
In short, ESDA techniques and skills need to be interdisciplinary from the start. One needs to know what domain specific information is available, where to get it, how it is generated, as well as statistical, mathematical, and computational methods to manipulate it.
Although data analytics definitions and types that are oriented toward business are well documented, data analytics to facilitate the inter-analysis of large heterogeneous Earth science datasets has only begun to be addressed methodically. The development of a set of definitions, types, goals, techniques, and skills that target ESDA specifically is significant because it gives Earth scientists the opportunity to better articulate the techniques and skills they employ in furthering their science research. In particular, communication can now take place in terms that engage information technologists, who can provide support by implementing responsive tools. With a categorization of known techniques and skills associated with data preparation, data reduction, and data analysis analytics, we know what techniques and skills are presently available, what tools have been implemented that perform these techniques, and what tool gaps need to be filled as science research methodologies evolve.
Next steps include: engaging the Earth science research community to better understand their research methodologies and share information technologies that may be useful; engaging scientists to acquire additional use cases to further validate and update our knowledge of known ESDA techniques and skills; continuing to refine our understanding of the skills needed to perform ESDA; promoting the development of ESDA techniques and skills through university curricula; and addressing the ‘moving’ gap analysis.
The authors would like to thank the ESIP Federation and, in particular, the ESDA Cluster members for their support and insights, as well as Peter Fox and Erin Robinson for their wisdom and encouragement.
The authors have no competing interests to declare.