I. Introduction

Meteorological weather data help fill information needs in academic and industrial settings. The information generated from these data at local levels complements hydrological models (), high-impact weather prediction models (), and simulations of heavy rainfall events (, , ) and heatwaves (). Weather data are also vital for agro-meteorological operations, as well as for the effective planning of construction and recreational activities. Given the substantial need for weather and climatological data in Southern Africa, various institutions and enterprises such as BIUST, SASSCAL and WASCAL have introduced automatic weather stations (AWSs) to monitor weather events at finer intervals.

However, most AWSs installed in developing countries are underutilised. For instance, the mandate of the Botswana Department of Meteorological Services (BDMS) is to provide quality weather and climate information and services that enable informed decision making for sustainable socio-economic development in scenarios related to weather and climate. Yet the BDMS lacks a designated online platform (it currently relies on radio stations, television and a Facebook page) for disseminating weather information to the public.

On a related note, BIUST identified “Climate and Society” as one of its thematic areas of focus, geared towards enhancing services related to climate and impact modeling; early warning; and disaster management for weather and climate change. In 2016, BIUST installed an AWS equipped with a local machine running XConnect to log historical weather data. However, this AWS also lacks a backend service layer for disseminating weather outputs to end users. All of these barriers limit access to the generated weather data. To request data, clients must go through cumbersome processes: in the case of BIUST, clients request data by email or copy it from officers using physical storage devices such as memory cards; in the case of the BDMS, end users download and complete a form and then submit it to the BDMS, with a service time of three days.

The demand for climatological data in Southern Africa invites key stakeholders (i.e., researchers and developers) and organisations to implement platforms that facilitate easy access to, and visualisation of, climate data. To this end, the Southern African Science Service Centre for Climate Change and Adaptive Land Management (SASSCAL) was initiated () to support regional weather monitoring and climate research in Southern Africa (). The SASSCAL Weathernet disseminates near-real-time data from AWSs at hourly intervals, along with aggregated daily and monthly data (see Figure 1).

Figure 1 

Visualisation of AWS data via the SASSCAL Weathernet.

The SASSCAL weather data are reviewed for quality control before dissemination (). These data can also be integrated with data from other sources for research purposes. For instance, Moses et al. () merged them with other meteorological data from the BDMS to analyse the effects of solar radiation, wind speed and humidity on evapotranspiration around the Okavango Delta. Similarly, predictive data analysis and modeling of temperature patterns (, ) is vital to understanding heatwaves (), while rainfall values can help in assessing rainfall erosivity ().

Despite the distinct potential of the SASSCAL weather data, accessing, downloading and using such data in research places a burden on end users (see Figure 2). First, the user has to navigate the SASSCAL Weathernet to identify a country, an AWS of interest, and the temporal resolution of the weather data. The user can then manually copy and paste the data into a storage file for analysis. There is an option to download the SASSCAL weather data, but in Excel format only, and there is no option to select only the desired weather values from the AWSs of interest. Even after downloading the data, end users face the challenge of generating clean data sets containing the desired variables for further use. The situation worsens when extracting finer temporal data from multiple AWSs across the entire region.

Figure 2 

Manually extracting data from the SASSCAL Weathernet. This process is costly, time-consuming and error-prone.

This work presents the SASSCAL Web Scraping Application Programming Interface (WebSAPI). Web scraping () is a data science technique that deploys scripts to extract structured data from websites; a script is a computer program that automates a specific task using a programming language such as R or Python. Thus, a WebSAPI can be seen as an application service that provides access to online data for further use in research projects. By digitising the BDMS’ form in 4 for climate data requests, this work enables end users to efficaciously (1) access and visualise weather data from the SASSCAL Weathernet; and (2) download desired data for use in data driven projects.

The structure of this work is as follows. Section II provides brief background information. Section III presents the approach deployed in developing the SASSCAL WebSAPI. Section IV presents the results and illustrates how the SASSCAL WebSAPI can be used to support the extraction of weather variables, as well as the visualisation and dissemination of the generated outputs. Lastly, Sections V and VI present the discussion and conclusions.

II. Review of Related Literature

Most African countries (), including Botswana (), lag behind in terms of climate informatics () and environmental data science (, ). This can be attributed to a lack of readily available platforms and data, as also pointed out in (, ). These bottlenecks can be addressed by integrating computing technologies such as web scraping and dashboard applications. Web scraping techniques have been widely deployed in projects across disciplines such as economics () and climate science ().

Regardless of the discipline, the general idea is to allow greater visibility, access, extraction and usability of online data. This work contributes by addressing the second “pillar” of the Global Framework for Climate Services (Vaughan et al. 2016) using climate informatics. The WebSAPI is motivated by the authors in (), who presented a free tool for the automated extraction and consolidation of climate data from different online web data banks. A similar work by Yang et al. () presented a system with functionalities for scraping, filtering and visualising climatic data for easy use. This work is related to Ref () regarding the user API for data requests. It is also related to () in that it deconstructs the URL for a given station and then modifies the date range and the desired temporal resolution to extract the desired weather data.

Web scraping is still emerging, with no dominant standards at present. The technology also presents a combination of ethical and legal challenges (, ) that necessitates standards to support data exchange. The ethical issues attached to web scraping can be summed up in four generic groups: property, privacy, accessibility and accuracy ().

  1. The property aspect entails ownership of data and its possible use. In this context, a web scraping algorithm (WSA) can lead to infringement of copyright, especially when end users profit from the data without the consent of the data owners ().
  2. Regarding privacy, web scraping can unintentionally reveal details or flaws within an organisation (). For instance, a web scraper can reveal data structures as well as sensitive data hidden from end users ().
  3. In terms of accessibility (), a WSA can overload a website, which may ultimately damage the organisation’s web server. Moreover, web scraping can result in unintended and unpredictable harmful consequences for the website’s server ().
  4. The accuracy aspect of WSAs mainly concerns the authenticity and fidelity of the generated data (). This is crucial, since erroneous data generated through a WSA may mislead end users or even damage the reputation of a particular organisation’s website.

Web scrapers can also compete with the data provider’s main APIs, which might diminish the value of the organisation’s intended mission (). For instance, if a web scraper attracts more clients than the intended main API, end users might end up neglecting that organisation’s platform. All of these issues invite multi-disciplinary collaboration (i.e., government sectors, academia and industrial practitioners) to establish standards and boundaries for technology usage. This could catalyse the development and adoption of the generated data driven outputs, as also supported in (, ).

III. Methodology: Data, Tools and Methods

A. Data Sources and the SASSCAL WebSAPI

The first task was to identify the data sources; the SASSCAL Weathernet was selected for this purpose. The aim of the SASSCAL WebSAPI is to improve the accessibility and visualisation of the SASSCAL weather data ahead of data analysis and predictive modeling. The target of this work was to develop and implement independent algorithms that can later be consolidated and integrated into a package for data driven projects requiring SASSCAL weather data.

The SASSCAL WebSAPI comprises modularised algorithms packaged into scripts that enable direct control of the weather data provided by the SASSCAL Weathernet. These include, but are not limited to, algorithms targeted at: processing the SASSCAL Weathernet link; determining the pages containing relevant weather data; deconstructing and parsing the contents of the HTML file; extracting the required weather data from selected pages; combining data (i.e., data wrangling) into data frames to generate data sets and visuals; and sharing the generated outputs using interactive dashboards.

B. Analysis of the SASSCAL Weathernet

The SASSCAL Weathernet enables the public to use one domain to access the AWS data. Each SASSCAL member country has various AWSs, each with a unique identifier (ID). Access to the data follows the same abstract pattern: one can query the website’s database for any AWS within the SASSCAL region by providing the corresponding URL. Thus, one can extract the weather data via a tailored API using formats like HTML and XML.

The home page URL for each SASSCAL AWS’s data is defined by x/y?z, where x is the preamble in link 5; y is the weatherstat_α_AO_we.php token that defines the weather statistics for a given resolution α (monthly, daily or hourly); and z is the string describing the logger ID (loggerid_crit = n), where n is the AWS’s unique ID. Tables containing the relevant data are found by trial and error (i.e., by inspecting individual elements of the SASSCAL Weathernet page) or by exploring the source code of the web page.
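To make the pattern concrete, the sketch below assembles such a URL in R. The preamble and the exact shape of the resolution token are assumptions for illustration (the real preamble is the address in link 5); only the loggerid_crit query string follows the documented pattern.

# Minimal sketch of the x/y?z URL construction (preamble and token shape assumed).
build_aws_url <- function(aws_id, resolution = c("monthly", "daily", "hourly")) {
  resolution <- match.arg(resolution)
  x <- "https://www.sasscalweathernet.org"              # preamble x (placeholder for link 5)
  y <- sprintf("weatherstat_%s_AO_we.php", resolution)  # token y; the resolution slot is assumed
  z <- sprintf("loggerid_crit=%s", aws_id)              # token z with the AWS unique ID n
  paste0(x, "/", y, "?", z)
}

build_aws_url("41001", "daily")
# "https://www.sasscalweathernet.org/weatherstat_daily_AO_we.php?loggerid_crit=41001"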

C. Identification of Tools and Methods

This work deploys the workflow depicted in Figure 3, following the data science approach in (, ) and using open-source platforms (i.e., R version 4.0.3 and RStudio 1.1.463). The algorithms are coded in R, and the functions are tested using RMarkdown, which facilitates reproducibility. R has excellent packages for statistical data science and visualisation; Table 1 shows the packages deployed in this work.

Figure 3 

Workflow of the SASSCAL WebSAPI.

Table 1

R packages proposed in this work.


PACKAGE           DESCRIPTION
rvest ()          web scraping
xml2 ()           XML document processing
stringr ()        data cleaning and preparation
ggplot2 ()        visualisation of graphics
shiny ()          dashboard design
leaflet ()        reactive maps
dygraphs ()       time-series data and interactivity
data.table ()     tables and data munging
flexdashboard ()  shiny dashboard design

A helper function (helper.R) is scripted to install and load the packages listed in Table 1. The rvest package () is required for web scraping, while the XML package () handles XML document processing. The ggplot2 package () is used for data visualisation. The shiny () and flexdashboard () packages are used to design the WebSAPI’s dashboard, and the htmlwidgets framework provides high-level R bindings to JavaScript libraries for data visualisation. All these functions are embedded in a reproducible RMarkdown document that implements the proposed SASSCAL WebSAPI. The data driven pipeline used in this work is summarised in Figure 3.
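A minimal sketch of such a helper is shown below; it assumes only the package names in Table 1 and standard base R installation utilities.

# helper.R: install any missing packages from Table 1, then load them all
pkgs <- c("rvest", "xml2", "stringr", "ggplot2", "shiny",
          "leaflet", "dygraphs", "data.table", "flexdashboard")
install_and_load <- function(pkgs) {
  missing <- pkgs[!pkgs %in% rownames(installed.packages())]
  if (length(missing) > 0) install.packages(missing)
  invisible(lapply(pkgs, library, character.only = TRUE))
}
install_and_load(pkgs)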

D. Visualisation of AWSs using Interactive Maps

Algorithm 1 implements an interactive map to visualise where the AWSs are located geographically. Here, w is a vector of AWSs for a given country, x and y are vectors of the latitude and longitude coordinates of the AWSs, and z is a vector describing each AWS. The algorithm also allows users to select specific AWSs, thanks to the leaflet package. In Algorithm 1, the data frame c defining the inputs is piped into the leaflet function to automatically generate an auto-sized map that fits the markers of all AWSs. The function also adds bounds (Line 4) so that the user cannot scroll too far away from the AWS markers. The interactive map pops up the name of an AWS as the user hovers the mouse over its marker. This simple functionality is crucial for end users (i.e., researchers) since it provides spatio-visual exploration of the AWSs supported by the SASSCAL Weathernet.

Algorithm 1

Visualise the AWSs of a given country.


1 c ← dataframe(w, x, y, z)
2 leaflet(data = c) %>%
3   addTiles() %>%
4   setMaxBounds(x1, y1, x2, y2) %>%
5   addMarkers(∼long, ∼lat, label = ∼name)
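A runnable rendering of Algorithm 1 is sketched below. The AWS names, coordinates and bounding box are illustrative placeholders, not actual SASSCAL metadata.

library(leaflet)

# Placeholder AWS metadata: w (names), x (latitudes), y (longitudes), z (descriptions)
c_df <- data.frame(
  name = c("Palapye AWS", "Maun AWS"),
  lat  = c(-22.55, -19.98),
  long = c( 27.13,  23.42),
  desc = c("BIUST campus station", "Okavango Delta station")
)

leaflet(data = c_df) %>%
  addTiles() %>%                                                  # base map tiles
  setMaxBounds(lng1 = 19, lat1 = -27, lng2 = 30, lat2 = -17) %>%  # keep the view near the markers (Line 4)
  addMarkers(~long, ~lat, label = ~name)                          # hover label pops up the AWS name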

E. Web Scraping and Dataset Generation

The web scraping functionality in Algorithm 2 uses the All_AWS_ID.R script to construct vectors storing the names and IDs of the AWSs. The AWS_ID_Getter function assigns an AWS name (i.e., “x”) to its corresponding ID (i.e., “value”) using a hash map (see Lines 7 and 8). Thus, to find the ID for a given AWS of interest, the function looks the name up in the hash map and retrieves that AWS’s ID.

Algorithm 2

Data scraper.


1 AWS_ID_Getter ← function(AWS) {
2   V = c(“x”, “value”); parent = emptyenv()
3   assign_hash ← Vectorize(assign, vectorize.args = V)
4   get_hash ← Vectorize(get, vectorize.args = “x”)
5   exists_hash ← Vectorize(exists, vectorize.args = “x”)
6   source(“All_AWS_ID.R”)
7   hash ← new.env(hash = TRUE, parent, size = 100L)
8   assign_hash(AWS_Name, AWS_ID, hash)
9   ID_Getter ← hash[[AWS]]
10  return(ID_Getter) }
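The sketch below is a plain R rendering of this idea, with placeholder name and ID vectors standing in for those defined in All_AWS_ID.R.

# Placeholder vectors standing in for those sourced from All_AWS_ID.R
AWS_Name <- c("Palapye", "Maun", "Gaborone")
AWS_ID   <- c("41001", "41002", "41003")

AWS_ID_Getter <- function(AWS) {
  assign_hash <- Vectorize(assign, vectorize.args = c("x", "value"))
  hash <- new.env(hash = TRUE, parent = emptyenv(), size = 100L)
  assign_hash(AWS_Name, AWS_ID, envir = hash)  # store each name -> ID pair (Lines 7 and 8)
  hash[[AWS]]                                  # constant-time lookup by AWS name
}

AWS_ID_Getter("Maun")   # "41002"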

The AWS name, ID and date are then used to construct a URL from which the data are fetched by the DataHarvester.R script in Algorithm 3. TheHarvester function takes in the URL for a given AWS. The URL string can be partitioned into tokens (i.e., using just the AWS name and date) to facilitate easy input.

Algorithm 3

Data harvesting.


1 μ ← TheHarvester(AWS_NAME, DATE, ρ)
2 DOM ← readHTMLTable(URL)
3 μ ← DataWrangler(as.data.frame(DOM[β]))
4 datatable(μ, ϕ, ω)

The XML package () was used to parse a given URL and create a Document Object Model (DOM). Its readHTMLTable() function specifies the weather data to select from the HTML tables in the SASSCAL Weathernet. The number of tables in a given DOM was determined using R’s built-in length() function. There are three DOM instances, one per temporal resolution, each with multiple tables. The DOM for the web page with hourly data contains 14 tables, with the values of interest in the 13th table. The DOM for the web page with daily observations has 13 tables, with the daily values of interest in the 12th. The last DOM has 18 tables, with the monthly data contained in the 10th.
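A hedged sketch of this parsing step follows; the URL comes from the earlier hypothetical build_aws_url() helper, and RCurl::getURL() is used here only as one common way to fetch the raw HTML for the XML parser.

library(XML)
library(RCurl)   # getURL() fetches the raw HTML so that XML can parse https pages

url <- build_aws_url("41001", "hourly")   # hypothetical URL from the earlier sketch
dom <- readHTMLTable(getURL(url))         # parse every HTML table into a list of data frames
length(dom)                               # expected: 14 tables for an hourly page
hourly <- as.data.frame(dom[[13]])        # the hourly values of interest sit in the 13th table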

Line 3 in Algorithm 3 facilitates the cleaning and selection of the desired weather tables using the parameter β (i.e., β can be 13, 12 or 10, as discussed above). The parameter ϕ defines the extensions that fix the columns of the table to be visualised, while ω defines extra options for buttons that enable end users to search, scroll, copy and download the weather data visualised in the table. The DataWrangler() function iterates through the table containing the observation dates, using the ρ argument to determine the date range for the data of interest. The extracted weather data are then unified into a single data frame μ to generate data sets for further use, as illustrated in Figures 4 and 5 in Section IV.
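The following simplified stand-in illustrates the date-range filtering performed by DataWrangler(); the column name Date and its format are assumptions for illustration.

# Simplified stand-in for DataWrangler(): filter a table to the date range ρ
DataWrangler <- function(tbl, rho) {
  tbl$Date <- as.Date(tbl$Date, format = "%d.%m.%Y")  # assumed date column and format
  subset(tbl, Date >= rho[1] & Date <= rho[2])
}

mu <- DataWrangler(hourly, as.Date(c("2020-01-01", "2020-03-31")))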

Figure 4 

Visualising Botswana AWSs using Algorithm 1.

Figure 5 

Screenshot of the SASSCAL WebSAPI for capturing user input when requesting weather data. The GUI allows end users to select the geographical location of interest (i.e., Botswana), the temporal resolution and the AWS of interest, and to download the data. The multi-input selection of AWSs provides end users with a feedback mechanism notifying them of the selected AWS, as seen in the tab titled “Currently Selected AWS.” This is useful for quick exploration of geographic locations before downloading data.

F. Dashboard Design: The Graphical User Interface (GUI)

Algorithm 4 implements the functionalities of the dashboard page. These include the dashboardHeader() to define the title, and the dashboardSidebar() to define two functionalities for visualising the tables of numerical weather data from an AWS of a given country. The dashboardBody() facilitates selection of the AWS, the resolution, the date range, the use of the data and the weather values, along with the functionality to export data. Since different end users have different needs, this work does not develop a complete GUI; interested readers should see Ref () on completing a dashboard API.

Algorithm 4

Dashboard design for dissemination.


Input: It requires Algorithm 2.
Result: SASSCAL WebSAPI GUI
1 while (interactive) do
2   gui ← fluidPage(F ← DataScraper(),
3     T ← dashboardHeader(…),
4     SDB ← dashboardSidebar(…),
5     B ← dashboardBody(fluidRow(…)));
6   server ← function(I, O) { Communicator(F) };
7   shinyApp(gui, server);
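A minimal runnable sketch of this design is given below. It uses the shinydashboard package (an assumption; the dashboardHeader(), dashboardSidebar() and dashboardBody() functions named in Algorithm 4 come from it), with placeholder inputs and a stub server in place of the scraping chain.

library(shiny)
library(shinydashboard)   # assumed source of the dashboard*() functions in Algorithm 4

ui <- dashboardPage(
  dashboardHeader(title = "SASSCAL WebSAPI"),
  dashboardSidebar(
    selectInput("aws", "AWS of interest", choices = c("Palapye", "Maun")),  # placeholder stations
    selectInput("res", "Temporal resolution", choices = c("hourly", "daily", "monthly")),
    dateRangeInput("range", "Date range")
  ),
  dashboardBody(fluidRow(tableOutput("weather")))
)

server <- function(input, output) {
  output$weather <- renderTable({
    # In the full WebSAPI this would chain Algorithms 2 and 3 (scrape and harvest);
    # a stub data frame stands in here.
    data.frame(AWS = input$aws, Resolution = input$res)
  })
}

if (interactive()) shinyApp(ui, server)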

IV. Results

This work documents the development process of a lightweight WebSAPI capable of extracting and displaying timely weather data from the SASSCAL Weathernet. The WebSAPI is cost-effective since it is powered by open-source technologies. Besides extracting numerical data, the WebSAPI’s tasks were expanded to include visuals in other formats such as tables, maps and charts. Figure 4 shows an interactive map generated using Algorithm 1; the map pops up the name of an AWS as the user hovers the mouse over its marker.

The algorithms defined in Section III-E scrape data from only one AWS at a time. They can be extended by adding functionality to specify multiple AWSs and then using a loop to scrape the desired weather data, as shown in Figure 6 and sketched below.
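One possible shape of such an extension, reusing the hypothetical helpers from the earlier sketches; the station names and date range are placeholders.

stations <- c("Palapye", "Maun", "Gaborone")   # placeholder AWS names
all_data <- list()
for (s in stations) {
  id  <- AWS_ID_Getter(s)                      # Algorithm 2: map name to ID
  url <- build_aws_url(id, "daily")            # construct the station URL
  dom <- readHTMLTable(getURL(url))            # Algorithm 3: parse the HTML tables
  all_data[[s]] <- DataWrangler(as.data.frame(dom[[12]]),   # 12th table holds daily data
                                as.Date(c("2020-01-01", "2020-03-31")))
}
combined <- do.call(rbind, all_data)           # unify into a single data frame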

Figure 6 

Screenshot of the SASSCAL WebSAPI’s GUI for data request, visualisation and extraction of data. In addition to selecting the desired AWS, temporal resolution, and the date range, the SASSCAL WebSAPI’s GUI allows end users to select the desired variables.

V. Discussions

In this work, a data driven template was developed in the form of a WebSAPI to facilitate efficacious interaction with the outputs generated by the SASSCAL Weathernet. The SASSCAL WebSAPI implements modularised algorithms to collect the SASSCAL weather data and generate high-quality data sets for use in data driven projects. Modularised scripts facilitate an efficient product design process that integrates efforts related to idea generation, concept development and the modification of existing systems and platforms. This section discusses the data quality, legal aspects, limitations and implications of the proposed WebSAPI.

A. Legality and Ethics of the SASSCAL WebSAPI

The SASSCAL Weathernet data are checked for quality control, as mentioned in Ref (). This gives an “assurance” that the SASSCAL WebSAPI will provide quality data that will not mislead end users (i.e., researchers or decision makers). However, users should note that, due to occasional sensor faults, the correctness of data values cannot be fully guaranteed, as also indicated on the SASSCAL Weathernet. The declaration on SASSCAL data use indicates that free use is granted for non-commercial and educational purposes.

Although the SASSCAL Weathernet places no explicit restrictions on data scraping, it is difficult to conclude that SASSCAL encourages end users to automatically scrape and extract data using tailor-made APIs. This can be justified by the note “For data requests regarding specific countries, stations, time periods or specific sensors please contact oadc-datarequest@sasscal.org”. It should be noted that the underlined aspects are the very challenges this work proposes to address. Thus, personal APIs that programmatically extract the weather data by bypassing the designated SASSCAL Weathernet API can be seen as presenting a slight ethical dilemma for developers.

B. Challenges and Limitations

The main hurdle related to identifying and integrating appropriate data driven technologies to facilitate flexible access to, and visualisation of, the SASSCAL weather data. In this regard, a couple of algorithms have been completed and tested to optimise the web scraping task. However, the task of retrieving weather data was tested using a relatively small dataset (94 instances). The small dataset was chosen to ensure that the automatic scraping and retrieval of data does not damage or slow down the SASSCAL website’s servers. Finally, this toolkit is built on top of the SASSCAL Weathernet, so changes in the structural representation of the Weathernet imply modifying the WebSAPI.

C. Lesson Learnt

There is no free lunch in problem solving. The process of web scraping and dashboard design is iterative and evolutionary. The integration of R, flexdashboard and shiny allows the development and deployment of interactive apps. However, before starting a web scraping based data driven project, developers should begin by analysing the associated legality and ethics (, ) to avoid possible bottlenecks.

D. Contribution and Implications

The contribution of this work is pragmatic rather than theoretical. The WebSAPI is flexible and reproducible, with the potential to be scaled up (expanded) to address other functionalities related to the use of SASSCAL weather data. Reproducibility is an important aspect of open science research and API development: it reduces the time taken for data collection, development and testing, since the independent components (algorithms) have already been tried and tested. This approach has the potential to catalyse the development of packages from existing platforms to meet end user requirements. It should be noted that neither the BDMS nor BIUST has an API to disseminate weather information. The WebSAPI is still under development, yet it has the potential to be adapted and incorporated into the portals of weather service providers (BIUST, BDMS, SASSCAL and WASCAL) to bridge gaps in access to weather and climate data.

VI. Conclusion

A. Summary

Developing and implementing a data driven platform to serve end users is a challenging task that requires input from multidisciplinary stakeholders. This work integrated web scraping (), data wrangling and dashboard techniques to develop a lightweight SASSCAL WebSAPI. In comparison to previous web scraping literature, this work takes into consideration that data driven outputs need to be disseminated to end users; to this end, a dashboard prototype was developed in RMarkdown to facilitate reproducibility. The WebSAPI is expected to create new channels that extend the services of the SASSCAL Weathernet. By enabling efficacious and efficient data access, the SASSCAL WebSAPI has the potential to increase the productivity and quality of data driven projects that make use of SASSCAL weather data.

B. Future Work

The SASSCAL WebSAPI should be seen not as a replacement for, but as a complement to, the SASSCAL Weathernet. It does not cover all tasks related to “weather data science”, but it provides the end-user community with the opportunity to reproduce it, develop in-depth product development skills and ultimately add more functionalities to a related API. In terms of extending this work, more end-user driven functionalities will be added to the API to enable data driven operations and services, such as investigating strategies for the imputation of missing data, and modelling.

C. Recommendations

Collaboration with the concerned stakeholders (i.e., SASSCAL, BDMS and BIUST), including end users (researchers, students and farmers), could catalyse the development and deployment process. This would enhance operational productivity while maximising the utilisation of these open-source technologies. Efforts from this work are likely to spawn new projects and collaborations that better inform citizens, help them make use of the generated data, and contribute to the open-data community.

Data Accessibility Statement

This R based toolkit is still under development. Parallel to this manuscript is a reproducible tutorial in RMarkdown, integrating shiny and flexdashboard for the visualisation and dissemination of outputs. The tutorial and code are available at https://github.com/EL-Grande/SASSACL-WebSAPI, and the data are available online 5.