1 Introduction

Accurate data are central to mitigating risk and to preventing widespread panic and sensationalism during a natural disaster. Evidence-based information obtained from accurate data is an asset, and one of the few strategic resources the public has during a crisis. Sharing information with the public about the current state of affairs depends on specific data that have to be captured and shared in a way that is useful, usable and desirable. During a crisis, members of the public need to minimise their exposure to the situation or act to provide support where needed. The contribution of this paper is a framework that substantiates the type of data used to capture and update the information shared with the public during a crisis of a biological nature, such as the COVID-19 pandemic. This paper summarises findings (e.g., demography and neighbourhood) based on the public data repository () and dashboard, to support general understanding and lessons learned from the COVID-19 epidemic.

The disease, COVID-19, is a severe acute respiratory syndrome (SARS) caused by the SARS-CoV-2 virus (). The first reported case of COVID-19 was in December 2019 in Wuhan, China (). The problem with SARS-CoV-2 is the pressure it puts on the healthcare system because of its high infection rate (). Globally, sudden spikes in confirmed COVID-19 cases are severe because current healthcare systems have limited resources to manage and treat patients effectively (). In March 2020, the World Health Organization (WHO) declared COVID-19 a pandemic, one of the first biological threats to modern society in the 21st century (). To reduce the spread of SARS-CoV-2, the South African government placed travel restrictions on high-risk regions and closed border crossings. South Africa confirmed its first COVID-19 case on the 5th of March 2020, and by the 25th of March 2020 the number of confirmed cases had increased to 702, despite efforts to contain the virus through a ban on international travel (). The increase in confirmed cases may cause widespread panic and anxiety, which is why the public relies on good, reliable information and data, now more than ever.

Currently, information regarding the COVID-19 outbreak in South Africa is shared with the public across various platforms, of which two are the most widely used. Firstly, the National Institute for Communicable Diseases (NICD) publishes an infographic that provides a bird’s-eye view of the outbreak. This information is limited to South Africa and only reports the number of tests performed, the number of confirmed cases, the regions affected, and COVID-19 related mortality (). Secondly, the minister of the Department of Health (DoH) in South Africa, Zwelini Mkhize, updates the public daily about cases as they arise. These updates are published as free-text sentences on the DoH website and contain some demographic information about the confirmed cases, including age, gender, travel history and mode of contraction of SARS-CoV-2 (). Although these sources of information are valuable, their usability as a means of sharing information with the public is potentially limited for a variety of reasons. One is the number of different platforms a person has to navigate to gain access to accurate data. Another is that the data are not presented in a machine-readable format and have to be processed before they can be used and stored. This further compromises the legibility, simplicity and accessibility of the information that is shared, a concern about South African government data that was highlighted in prior work (). Not having useful, usable and desirable information directly affects management strategies and the response of the public to the disease ().
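To illustrate the kind of processing that free-text case updates require before they can be stored, the following minimal Python sketch extracts a structured record from a single sentence. The sentence wording, the field names and the `parse_case` helper are assumptions made for illustration; they do not reproduce the actual format of the DoH updates or the processing used by the DSFSI.

```python
import re
from typing import Optional

# Hypothetical pattern for a sentence such as
# "A 38-year-old male who travelled to Italy."
# Real updates vary in wording, so one pattern would not cover every case.
CASE_PATTERN = re.compile(
    r"(?P<age>\d{1,3})-year-old\s+(?P<gender>male|female)"
    r"(?:\s+who\s+travelled\s+to\s+(?P<travel>[^.;]+))?",
    re.IGNORECASE,
)


def parse_case(sentence: str) -> Optional[dict]:
    """Return a structured record, or None if the sentence does not match."""
    match = CASE_PATTERN.search(sentence)
    if match is None:
        return None
    travel = match.group("travel")
    return {
        "age": int(match.group("age")),
        "gender": match.group("gender").lower(),
        # Travel history is optional; keep None when it is not stated.
        "travel_history": travel.strip() if travel else None,
    }
```

Sentences that do not match return `None`, so they can be set aside for manual validation instead of being silently dropped.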

To counter the aforementioned problem, the Data Science for Social Impact (DSFSI) research group at the University of Pretoria, South Africa, has developed an open repository to safeguard the integrity of data on South African COVID-19 cases. DSFSI Lab members and willing volunteers are responsible for mining, validating and storing data related to COVID-19 patients in South Africa. So that collective work on public data can scale, the DSFSI manages and consolidates the available data related to the COVID-19 cases in South Africa. Once consolidated, the data are shared in an open, publicly available repository on GitHub.com and then linked to the dashboard (). On the repository, any member or user is free to critique the data and to propose new features or data to be added to the repository or dashboard. This includes raising data integrity issues as well as contributing information that is otherwise outdated or virtually inaccessible to the public. The workflow of this process is illustrated in Figure 1.

Figure 1 

Data publishing cycle of COVID-19 data.

To link information to the COVID-19 records, data need to be accurate and restructured. Unfortunately, when data are not kept up to date as the underlying facts change, it becomes difficult to track when the changes occurred, especially if they affect the unique ID or name of an item. One example is the publicly available South African hospital data, in which some items were last updated more than two years prior to 2020. These items include coordinates, contact information and the facilities available within each hospital, as well as the population size of the district.

Most of the aforementioned information is not consolidated in one place, which is a great risk at a time when the public needs to know where to seek care and which hospitals are equipped to test and manage COVID-19 patients. Another example arose from a sudden spike in confirmed cases between the 23rd and 25th of March 2020. This increase of 435 cases was not accompanied by any of the aforementioned demographic information, transmission type or travel history, and these details are still pending. The same challenge has persisted in subsequent days, although the latest available data on deaths include demographic information and some travel history.

2 Aims and Methodology

The primary objectives of this study were to determine what data should be included in a public repository amidst the COVID-19 outbreak and how these data should be disseminated within a public dashboard. The data in the public repository are released under a Creative Commons licence and the code under the MIT License, with copyright held by the Data Science for Social Impact research group at the University of Pretoria, South Africa (). All of the data were gathered and consolidated in the public repository, which is hosted on GitHub and uploaded to Zenodo ().

To determine whether the dashboard (Figure 2) and data repository were used, descriptive statistics were computed on the number of views, clicks, comments and recommendations on the public repository and dashboard. Use of the repository extends beyond views: other researchers also use it for analysis.

Figure 2 

Consolidated dashboard using data from the repository. Left: Front page with aggregated national statistics. Right: Aggregated statistics by province.

To measure whether or not the repository and dashboard were useful and desirable, the public repository allowed for the posting of issues, comments and recommendations. These items were categorised according to their submission on the repository. An item could be sorted into more than one category depending on the nature of the problem. The DSFSI and public have the opportunity to choose which problems they would like to work on, and solutions are approved by the DSFSI group.
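The categorisation described above, in which one item can fall under several categories, can be sketched as a simple tally. This is a minimal illustration under stated assumptions: the `Issue` record, its fields and the `tally_by_label` helper are hypothetical and are not the tooling used by the DSFSI.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Dict, Iterable, Tuple


@dataclass(frozen=True)
class Issue:
    """Hypothetical record for an item posted on the repository."""
    title: str
    labels: Tuple[str, ...]
    resolved: bool


def tally_by_label(issues: Iterable[Issue]) -> Dict[str, Tuple[int, int]]:
    """Map each label to (resolved count, unresolved count).

    An issue carrying several labels is counted once under each label,
    so the per-label counts can add up to more than the number of issues.
    """
    resolved, unresolved = Counter(), Counter()
    for issue in issues:
        target = resolved if issue.resolved else unresolved
        for label in set(issue.labels):  # ignore accidental duplicate labels
            target[label] += 1
    labels = sorted(set(resolved) | set(unresolved))
    return {label: (resolved[label], unresolved[label]) for label in labels}
```

Because of the multi-label rule, column totals in a table built this way should not be read as the total number of issues.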

3 Results

In total, 58,169 users accessed the dashboard in the time period of 17th March to 18th April 2020 (Figure 3).

Figure 3 

Data usage information from the website.

As seen in Figure 3, at least 20.4 percent of users returned to the dashboard, with an average session duration of 1.35 minutes. In addition, the repository from which the data are drawn has had over 2,000 clones of the dataset from at least 200 different people. The majority of visitors to the repository came from the GitHub community, but a few unique visitors (meaning they had not visited before) accessed the repository from other platforms (Table 1).

Table 1

Data usage information from the website.

Site              Views    Unique visitors

github.com        2,229    797
twitter.com       675      431
Google.com        647      353
Linkedin.com      212      125
zendo.org         58       26
Bing.com          57       27
facebook.com      43       27
m.facebook.com    39       26
vima.co.za        20       9
Ink.in            17       8
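The claim that most visits came from the GitHub community can be checked with simple arithmetic on Table 1. The sketch below computes each referring site's share of views and the average number of views per unique visitor. The figures are those read from Table 1, while the function names are illustrative assumptions.

```python
# (views, unique visitors) per referring site, as read from Table 1.
REFERRERS = {
    "github.com": (2229, 797),
    "twitter.com": (675, 431),
    "Google.com": (647, 353),
    "Linkedin.com": (212, 125),
    "zendo.org": (58, 26),
    "Bing.com": (57, 27),
    "facebook.com": (43, 27),
    "m.facebook.com": (39, 26),
    "vima.co.za": (20, 9),
    "Ink.in": (17, 8),
}


def view_shares(referrers):
    """Fraction of all listed views that each site accounts for."""
    total = sum(views for views, _ in referrers.values())
    return {site: views / total for site, (views, _) in referrers.items()}


def views_per_visitor(referrers):
    """Average number of views per unique visitor for each site."""
    return {site: views / unique for site, (views, unique) in referrers.items()}
```

With these figures, github.com accounts for a little over half of the listed views. Unique visitors are not summed across sites here, because the same person may arrive through more than one referrer.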

At least 404 unique visitors viewed the issues that were posted on the repository. A further 150 unique visitors viewed the pull requests, through which members of the DSFSI and the public contributed changes to the actual datasets. To manage the posted issues, ten different labels were created to categorise them. To date, only six of the ten labels have been used: bugs, data, enhancements, good first issue, help wanted and questions (Table 2).

Table 2

Categories of lodged issues.

Label               Resolved issues    Unresolved issues

Bugs                7                  2
Data                19                 10
Enhancement         26                 17
Good first issue    2                  2
Help wanted         2                  1
Question            1                  1

Bugs referred to any error in the data or any issue related to a feature of the dashboard or a function within the repository. One unresolved bug concerned a single incorrect data entry, for which finding a source that confirmed the correct value proved challenging. Data covered any inquiry about the data, including differences between datasets, missing data, or additions to datasets. To resolve these issues, the data need to be updated from the source. Enhancement meant improvements to current implementations of either the data in the repository or the dashboard. Most of the unresolved enhancements are pending information, such as additional fields that were not provided in the publicly available data. Good first issues were entry-level problems that could be completed by people from any background and did not require much time or expertise; they were labelled as such for newcomers to the project. Help wanted denoted problems that required additional attention. Resolving the outstanding help-wanted issues requires data that are not currently available to the public. Questions were general requests for clarity or for more information on a particular issue that another person had posted in the repository. Resolving the one unresolved question requires an internal decision about the data.

There were more than 10,000 additions to the repository data and 1,430 deletions, all reviewed by members of the DSFSI and accepted if they were noteworthy contributions to the repository. In addition, 26 different contributors pushed 345 commits across all branches of the repository.

The majority of requests and changes to the repository and dashboard were associated with the data or enhancements to the data. In total, there are fifteen datasets, seven of which relate to information about hospitals in South Africa. Once created, the data were displayed in a dashboard (). The dashboard included information related to COVID-19, a South African helpline, the sources of the information, when the information was last updated, a blog post describing the purpose of the dashboard, links to the open public repository, and general information about the research group. Some of the analysis that can be accomplished with the data is shown in Figure 4.

Figure 4 

Examples of analysis made possible by the data repository. Left: Age distribution of positive cases. Right: Provincial growth.
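The two analyses named in Figure 4 can be approximated from a case line list with a few lines of Python. The sketch below is illustrative only: the record keys (`date`, `province`, `age`), the ISO date strings and the ten-year age bins are assumptions, not the repository's actual schema or the method used to produce the figure.

```python
from collections import Counter, defaultdict


def age_distribution(cases, bin_width=10):
    """Count cases per age band, e.g. "30-39"; cases without an age are skipped."""
    counts = Counter()
    for case in cases:
        age = case.get("age")
        if age is None:
            continue  # demographic details may still be pending for a case
        lower = (age // bin_width) * bin_width
        counts[f"{lower}-{lower + bin_width - 1}"] += 1
    return dict(counts)


def cumulative_by_province(cases):
    """Return, per province, a list of (date, cumulative case count) pairs."""
    daily = defaultdict(Counter)
    for case in cases:
        daily[case["province"]][case["date"]] += 1
    growth = {}
    for province, per_day in daily.items():
        running = 0
        series = []
        # ISO "YYYY-MM-DD" strings sort chronologically as plain strings.
        for date in sorted(per_day):
            running += per_day[date]
            series.append((date, running))
        growth[province] = series
    return growth
```

Skipping cases with a missing age keeps the distribution honest when demographic details are still pending, at the cost of under-counting the total.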

4 Conclusion

Data are one of the most important assets during a crisis. Unfortunately, failing to prioritise this commodity caused complications during the early days of the COVID-19 pandemic in South Africa. To help prevent this from recurring, the DSFSI research group has started collaborating with others and extending this methodology to create a line list for the rest of the African continent. The data from this project led to discussions between the DSFSI, the NICD and the DoH, in an attempt to improve the situation. The COVID-19 pandemic showed the world the value of data, and that people and systems should prepare effectively for a time of crisis. Enabling the public to engage with data in a way that is open and collaborative is an overlooked service that can aid during a calamity. Furthermore, readily available data are useful during an emergency but can seem redundant during peaceful times. Therefore, prioritising data management practices, getting input from the people who use the information, and collaborating across organisations towards a common result should be a proactive, standard approach, not one that is only implemented during a catastrophe.