An Infrastructure for Spatial Linking of Survey Data

Felix Bensmann; Lars Heling; Stefan Jünger; Loren Mucha; Maribel Acosta; Jan Goebel; Gotthard Meinel; Sujit Sikder; York Sure-Vetter; Benjamin Zapilko

1. Introduction

The ongoing increase of environmental challenges and issues, such as urbanisation, accompanied by urban sprawl and soil sealing, loss of biodiversity, climate change and water shortage, raises the pressing need to better understand the correlations between human activities and environmental change (). To address important parts of these challenges, for example, a field of research has evolved focusing on the notion of environmental justice, which comprises health and well-being aspects, as well as topics related to general social participation (). While this research has already been discussed in interplay with politics and civil rights organisations in countries like the USA for a long time, it is only in recent years that it has become visible as part of inequality studies in Germany (). Combining survey data and small-scale geospatial data is an emerging topic for social scientists, which enables answering the outlined research questions from an individual-level perspective of survey respondents.

In the geospatial domain, Keenan and Jankowski () showed the research trend in spatial decision science over the last 30 years, in which the spatial linking of diverse data sources is evident, for example in spatial business intelligence (e.g. ) or dynamic visualization for participatory decision making with a diverse team of stakeholders. Innovation is still necessary, but it is challenging to integrate big geospatial data (), the processing of large volunteered geoinformation (, ) and georeferenced social media information (, ). There are already some well-known infrastructures available from government agencies, non-profits and even commercial entities that give the public access to geospatial data, e.g. OpenGovData, OpenStreetMap, OpenLayer, GoogleEarthEngine (, ).

However, there are a number of reasons why the necessary integrated use of social science survey data and spatial science data is challenging: (1) the interdisciplinary nature of the two different data sources (), (2) legal restrictions, especially regarding data privacy (; ), and (3) methodological challenges regarding the use of geo-information systems (GIS) for the processing and analysis of spatial data, especially by social scientists (). An infrastructure which facilitates such an integrated use of social science survey and spatial science data would be a spatial data infrastructure (SDI), defined as a “framework that consists of institutional arrangements, policies, and technologies that would create a conducive environment for the exchange of geographic information related resources in order to create a better information sharing community” (). As of yet, an SDI targeting social scientists is only described conceptually (). In this article, we investigate and develop solutions for addressing the practical challenges and requirements of developing such an infrastructure from a social science, spatial science and computer science perspective.

Throughout the article, we use the example of environmental inequalities with regard to income and land use hazards in Germany, which motivates one research question for the infrastructure. For data analysis, we use georeferenced survey data of the GESIS Panel () and the German Socio-economic Panel (SOEP) (), and as spatial data, we use data from the Monitor of Settlement and Open Space Development (IOER Monitor) (IOER, 2017). Both the GESIS Panel and the SOEP comprise extensive information from people living in Germany aged between 18 and 70 years. Amongst others, the participants answered questions on their demographic and socio-economic situations (), which makes the GESIS Panel an appropriate data source for our research question. The IOER Monitor contains detailed geographic information on land use in Germany, ranging from indicators about settlement structures up to landscape quality. The infrastructure presented in this article builds the foundation for these kinds of interdisciplinary studies, which are otherwise possible only with large manual effort.

The rest of this article is structured as follows: In Section 2, we provide an introduction to the spatial linking of survey data with spatial data and its use in social science survey research and discuss challenges in Germany. We introduce our spatial data source, the IOER Monitor, its data and documentation in Section 3. In Section 4, we propose a technical infrastructure for spatial linking which addresses challenges like data privacy and data security, among others. As a case study, we present in Section 5 an analysis of the environmental inequalities of land use hazards in Germany, which has been conducted with the proposed infrastructure. We discuss the lessons learnt and conclude in Section 6.

2. Spatial Linking of Survey Data with Geospatial Data

In this section, we provide a brief overview of spatial linking in general, the relevance of spatial linking of survey data with spatial data in social science research and the challenges that currently arise in Germany.

2.1. What is Spatial Linking?

Spatial linking describes the technique of combining two or more geospatial datasets into one, and has been used for a long time in disciplines familiar with the use of spatial datasets (). This technique is also known as spatial join, a term familiar to users of products of the commercial software provider ESRI, or more broadly as spatial overlay when two vector datasets are combined. Geospatial data are data that contain spatial references in the form of coordinates associated with the observations in the data. Each observation contains at least one coordinate, resulting in different geometries such as points (single coordinate), lines and polygons (multiple coordinates), or grids (single coordinates with additional information, e.g., on their geographic resolution). Spatial linking relates the set of coordinates of one dataset to the coordinates of another, resulting in a variety of possible outcomes.

Usually, all spatial linking efforts define one focal dataset into which information from another dataset is merged, which results in a combined dataset with the same geographic properties as the focal dataset. If the focal dataset consists of point data and the other one of polygons, the resulting data are point data as well. Vice versa, if the focal dataset consists of polygons and the other one of points, the resulting data are polygon data. For example, if points are household addresses of people and polygons are neighborhood boundaries, we can assign the membership of each household to its neighborhood through spatial linking. Alternatively, the other way around, we can count the number of households within each neighborhood. Accordingly, the type of spatial linking always depends on the geographic structure of the focal dataset.
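The household and neighborhood example can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: the coordinates, the neighborhood names and the helper functions are hypothetical, polygons are plain vertex lists in a planar coordinate system, and a standard ray casting test stands in for the spatial join that a GIS would perform.

```python
from collections import Counter

def point_in_polygon(point, polygon):
    """Ray casting test: True if the point lies inside the polygon (list of (x, y) vertices)."""
    x, y = point
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Only edges that straddle the horizontal line through the point can be crossed.
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside

def link_points_to_polygons(points, polygons):
    """Focal dataset = points: attach the name of the containing polygon (or None)."""
    return {
        pid: next((name for name, poly in polygons.items()
                   if point_in_polygon(pt, poly)), None)
        for pid, pt in points.items()
    }

def count_points_per_polygon(points, polygons):
    """Focal dataset = polygons: count the points falling into each polygon."""
    linked = link_points_to_polygons(points, polygons)
    tally = Counter(name for name in linked.values() if name is not None)
    return {name: tally.get(name, 0) for name in polygons}

# Hypothetical neighborhoods (polygons) and household addresses (points).
neighborhoods = {
    "north": [(0, 5), (10, 5), (10, 10), (0, 10)],
    "south": [(0, 0), (10, 0), (10, 5), (0, 5)],
}
households = {"h1": (2, 7), "h2": (8, 1), "h3": (5, 3), "h4": (20, 20)}

membership = link_points_to_polygons(households, neighborhoods)
counts = count_points_per_polygon(households, neighborhoods)
```

The first result keeps the geometry of the households (one row per point), the second keeps the geometry of the neighborhoods (one row per polygon), which mirrors the role of the focal dataset described above.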

The algorithms for adding information to the focal dataset are not necessarily limited to simple requests based on, e.g., single points in space. Through the projection into one coordinate space, all geometries can be related to each other, e.g., through distances or intersectional associations. Not only can we assign the membership of each household to a neighborhood, but we can also ask which neighborhoods are within a range of, say, 2000 meters. If our focal data depict streets as line geometries, we can request which neighborhoods these streets intersect. Generally, there is a multitude of further possibilities for formulating such requests on georeferenced data (e.g., ) that also depend on research questions within specific scientific disciplines, as shown below.
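The distance and intersection requests mentioned above can be illustrated with a small, self-contained sketch. All geometries and names here are hypothetical, coordinates are assumed to be planar and given in meters, and the intersection test only detects streets that cross a neighborhood boundary (a street lying entirely inside a polygon would need an additional containment check).

```python
import math

def edges(polygon):
    """Consecutive vertex pairs of a closed polygon."""
    n = len(polygon)
    return [(polygon[i], polygon[(i + 1) % n]) for i in range(n)]

def point_segment_distance(p, a, b):
    """Shortest distance from point p to the segment a-b."""
    (px, py), (ax, ay), (bx, by) = p, a, b
    dx, dy = bx - ax, by - ay
    if dx == 0 and dy == 0:
        return math.hypot(px - ax, py - ay)
    # Projection of p onto the segment, clamped to its end points.
    t = ((px - ax) * dx + (py - ay) * dy) / (dx * dx + dy * dy)
    t = max(0.0, min(1.0, t))
    return math.hypot(px - (ax + t * dx), py - (ay + t * dy))

def distance_to_boundary(p, polygon):
    """Distance from p to the polygon outline (assumes p lies outside the polygon)."""
    return min(point_segment_distance(p, a, b) for a, b in edges(polygon))

def orientation(a, b, c):
    """Sign of the turn a -> b -> c (cross product of the two direction vectors)."""
    return (b[0] - a[0]) * (c[1] - a[1]) - (b[1] - a[1]) * (c[0] - a[0])

def segments_cross(a, b, c, d):
    """True if segments a-b and c-d properly cross each other."""
    return (orientation(a, b, c) * orientation(a, b, d) < 0
            and orientation(c, d, a) * orientation(c, d, b) < 0)

def line_crosses_boundary(line, polygon):
    """True if any segment of the line crosses any edge of the polygon."""
    return any(segments_cross(line[i], line[i + 1], a, b)
               for i in range(len(line) - 1) for a, b in edges(polygon))

# Hypothetical data in meters.
neighborhoods = {
    "a": [(1000, 0), (2000, 0), (2000, 1000), (1000, 1000)],
    "b": [(5000, 0), (6000, 0), (6000, 1000), (5000, 1000)],
}
household = (0.0, 0.0)
street = [(500, 500), (2500, 500)]

within_2000m = [name for name, poly in neighborhoods.items()
                if distance_to_boundary(household, poly) <= 2000]
crossed_by_street = [name for name, poly in neighborhoods.items()
                     if line_crosses_boundary(street, poly)]
```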

Figure 1 depicts an example of spatial linking based on circular buffer areas of sealing of soils data around one specific point. This point represents the focal dataset to which information from the sealing of soils data is added. Sealing of soils refers to the water- and airtight coverage of soils by roads and buildings. The idea is to gather information, e.g., on mean levels of sealing of soils in a specific area and not only at a certain point. Moreover, these buffer areas can be varied and compared regarding their impact on possible analyses with these data.

Figure 1

Circular Buffer Areas of Sealing of Soils in the Size of 500m Around a Coordinate.

Data Source: Monitor of Settlement and Open Space Development (IOER, 2017) (Sealing of Soils) and OpenStreetMap (Roads).
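A buffer-based linking of this kind can be sketched as follows. The grid below is synthetic and purely illustrative (it is not IOER Monitor data), the function name is hypothetical, and the sketch simply averages all grid cells whose centre lies within the buffer radius, which is only one of several plausible ways to aggregate raster values in a buffer.

```python
import math

def buffer_mean(center, cells, radius_m):
    """Mean value of all grid cells whose centre lies within radius_m of center.

    cells is an iterable of (x, y, value) tuples in a planar, metric coordinate system.
    Returns None if no cell centre falls into the buffer.
    """
    cx, cy = center
    values = [v for x, y, v in cells if math.hypot(x - cx, y - cy) <= radius_m]
    return sum(values) / len(values) if values else None

# Synthetic 100 m grid: fully sealed west of the point, unsealed elsewhere.
cells = [(x, y, 100.0 if x < 0 else 0.0)
         for x in range(-1000, 1001, 100)
         for y in range(-1000, 1001, 100)]

point = (0, 0)
# Varying the buffer size changes the value that is attached to the same point.
means_by_buffer = {r: buffer_mean(point, cells, r) for r in (100, 500, 1000)}
```

Comparing the entries of `means_by_buffer` shows how the attached value depends on the chosen buffer size, which is the kind of sensitivity the varied buffers are meant to expose.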

In the social sciences, survey researchers apply such methods of spatial linking to enrich their existing data with information from auxiliary georeferenced data (). For this purpose, they require access to georeferenced survey data and to the auxiliary georeferenced data sources, which is often demanding, however, as such data lack availability or are only available with limited access rights because of data protection considerations, at least in Germany (). Geographic information can increase the risk of re-identifying survey respondents, such that data providers often provide on-site facilities to access the data in order to control the access and research output (). While this is changing, amongst others, with the work we present here, there is also already a variety of research applications that use georeferenced survey data. We have witnessed an increasing number of applications in the subdisciplines of political attitudes research (), health (), education (; ) or social and environmental inequalities ().

Rather prominently, methods of spatial linking are used in research on social trust or political attitudes. Sluiter, Tolsma, & Scheepers (), for example, combined survey data with detailed spatial information on ethnic diversity and investigated whether potential social trust effects vary with the geographic scale. In correspondence with the theory that ethnic diversity negatively affects bonding social capital, they found evidence for such relationships in particularly small geographic neighborhoods. However, the findings are mixed and largely depend on the geographic scale actually chosen. Similarly, Förster () detected different levels of political participation by varying the geographic scale of immigrant rates between one km² neighborhoods up to the governmental district level. Thus, what researchers can predict from social science theory and by having access to geospatial information also depends on the geographic scale of the data at hand.

This notion is not different concerning the application we have chosen for this article in the area of environmental justice. Social scientists have long asked about potential inequalities in exposure to environmental hazards such as air pollution () or traffic noise (). While a considerable share of the evidence on this topic stems from the US societal context, extensive work for the German case is missing. We will elaborate on this in Section 2.3. Moreover, the residential segregation structures differ between those societies (), such that research focusing on particularly small-scale neighborhood variations is needed. We will further investigate this issue below using the example of income and the environmental hazards originating from sealing of soils, road traffic densities as well as industry and trade densities. The study of environmental justice represents yet another prominent application of spatial linking methods in the social sciences.

Besides the potential to formulate new research questions, adding geo information to survey data not only enables researchers to analyze new or simply more information, but it also allows building upon existing findings. Previous research results can be corroborated, rejected or examined in more depth. For example, Downey, Crowder, & Kemp () discovered that single-parent mothers in the US are at a higher risk of being exposed to toxic air pollution than two-parent families or even single-parent fathers. Apart from other disadvantages such as socio-economic disparities of single-parent families, these results add another perspective to the general discussion of family structures and inequalities. Without spatially linking survey data and small-scale spatial information, this would not have been possible.

2.3. Challenges in Germany

As indicated above, using georeferenced data in social science survey research is challenging, starting with the availability of data. The situation is particularly complicated for auxiliary data from official authorities from which researchers can extract information. Because Germany is a country with a federal structure consisting of 16 individual states, multiple statistical offices are in charge of collecting data, e.g., on the sociodemographics of their inhabitants. While this data is interesting to combine with survey data, as a result of the federal structure, researchers face a multitude of data access points. Such a fragmented data landscape often results in data that is unharmonized, erroneous or even missing ().

The same applies to georeferenced survey data. First of all, georeferenced survey data consist mainly of survey data accompanied by the geocoded locations of the participants of the survey. For searching and accessing it, researchers use well-known catalogs for survey data. However, these catalogs often lack detailed documentation on georeferences, so that researchers cannot be sure whether they have found all georeferenced survey data available. Prominent research data centers and surveys with georeferenced survey data indeed exist, such as the SOEP () or the German Family Panel pairfam (), but finding smaller or even more specialized surveys is difficult. Social scientists who aim to use such data for their research may have to invest some time in searching for suitable data.

Another challenge that directly relates to the issue of data availability is data protection. As mentioned above, georeferenced survey data is sensitive data because it contains information on survey participants’ locations which potentially can reveal their identities (). According to German data protection legislation, however, disclosing this information is strictly forbidden (). Therefore, providers of georeferenced survey data usually control the access to this data, e.g., through on-site facilities with specialized and secured work environments.

Moreover, some additional and more specific issues of data protection increase the challenges of using georeferenced survey data. Also according to German data protection legislation, survey information (i.e., the answers to questions in surveys) and personal information (i.e., addresses and coordinates, respectively) must not be stored together. Accordingly, to link survey data and spatial information that was gathered through the use of spatial linking methods, researchers have to follow complicated processes (). First, they have to link coordinates with spatial information; second, the coordinates are deleted, and only the spatial information is transferred to the corresponding rows of the survey data. Because this process can also become technologically challenging, we developed a solution, which we present in Section 4, to navigate these issues seamlessly and easily for end-users.
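The two-step process described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the procedure of any particular data provider: the record layout, the field names and the toy indicator lookup are hypothetical, and in practice the address records and the survey records would live in separate, secured systems.

```python
def link_without_exposing_coordinates(survey_rows, address_rows, indicator_lookup):
    """Attach spatial information to survey rows without ever joining in coordinates."""
    spatial_by_id = {}
    for row in address_rows:
        # Step 1: link the coordinate with spatial information.
        value = indicator_lookup(row["x"], row["y"])
        # Step 2: drop the coordinate and keep only identifier and spatial information.
        spatial_by_id[row["id"]] = value
    # Transfer the spatial information to the corresponding survey rows.
    return [{**row, "indicator": spatial_by_id.get(row["id"])} for row in survey_rows]

# Hypothetical records: survey answers and coordinates are held in separate tables.
survey_rows = [{"id": "p1", "income": 2100}, {"id": "p2", "income": 3400}]
address_rows = [{"id": "p1", "x": 10.0, "y": 20.0}, {"id": "p2", "x": 30.0, "y": 40.0}]

def toy_indicator(x, y):
    """Stand-in for a real spatial lookup; returns an arbitrary deterministic value."""
    return x + y

linked_rows = link_without_exposing_coordinates(survey_rows, address_rows, toy_indicator)
```

The resulting rows contain the survey variables and the linked indicator, but no coordinate fields.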

3. Spatial Data

The Monitor of Settlement and Open Space Development (IOER-Monitor) is a research data infrastructure on the topic of land use/land cover hosted by the Leibniz Institute for Ecological Urban and Regional Development (IOER). This infrastructure provides comprehensive information on the status of land use changes in Germany by means of long-term time series at a high spatial resolution. The time series currently comprises 13 time periods beginning in 2000, then in 2006 and annually from 2008 onwards. The IOER-Monitor thus enables an assessment of the sustainable land use development targets on all administrative levels, including urban districts. Currently, there are 95 indicators under 10 categories: settlement, transport, open space, sustainability, building/material stock, urban sprawl, landscape quality, ecosystem services, renewable energy and risk. Some example indicators are the proportion of built-up areas, land consumption for transportation, settlement density, residential building stock, soil sealing, green infrastructure and ecosystem services of forest land. The input data for extracting indicator values come from diverse sources such as basic official geodata, official statistics and open data services.

The Digital Landscape Model (ATKIS-Basis DLM, scale 1:25,000) is used as one of the important bases in the indicator calculation process, because it provides a detailed geo-topographic description of Germany (e.g. , ). ATKIS-Basis DLM is an object-structured vector database which describes the land use features with high accuracy (redundancy-free and gapless in meshes). It also includes the linearly modelled transport infrastructure (i.e. roads, railways). Further input data of the IOER-Monitor are 2D- and 3D-building models (HU-DE, LoD1-DE), the German Land Cover Model (LBM-DE), digitized old topographic maps (TK25), other geo-data (e.g. protected areas, flood areas, Imperviousness High Resolution Layer of the Copernicus Land Monitoring Service) as well as official statistical information (e.g. population, traffic, economy and finance).

Initially, the ATKIS Basis-DLM has to be transformed into a land use map in which the linearly modelled road and railway features are converted into polygons using a buffering algorithm with the respective width of the linear objects. Afterwards, the land use portions under existing area overlays are dissolved according to a priority scheme. A hierarchically structured land use scheme was developed for carrying out the semantic description of the land use (i.e. ), which differentiates between 35 land use sub-classes under the main classes of settlement, transport and open space. As soon as the annually updated geospatial input data are available, the current land use map is prepared and all indicators are calculated using state-of-the-art GIS technology and applications. Detailed documentation of the methods, data sources and assumptions is available and regularly updated in the monitor metadata sheet for each indicator. An example can be found here.

An interactive, internet-based geoviewer enables the user to obtain digital cartographic visualization, statistics and analyses at a desired spatial scale such as federal state, regional planning jurisdiction, district, municipality, urban district and INSPIRE-compliant raster grids (with raster widths of 100 m, 200 m, 500 m, 1000 m, 5000 m and 10000 m). The selected indicator values are displayed as an interactive map, table and/or diagram. Comparative functionalities are enabled on selected spatial scales. Domain experts can display the maps in their GIS environment via Web Map Service (WMS) geoservices, but can also import data directly via Web Coverage Service (WCS) or Web Feature Service (WFS) geoservices. Currently, the services of the IOER Monitor are used in development policy formulation, public administration, urban and spatial planning, the economic sector, science/education and even in private personal projects. All data and services are freely available under DE-CC-BY-2.0. The IOER-Monitor is registered in the repository re3data.org.
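As a brief illustration of how such OGC geoservices are typically addressed, the following sketch builds GetCapabilities request URLs for the three service types using key-value parameters. The endpoint `https://example.org/ows` is a placeholder and not the address of the IOER Monitor services, and the helper function is hypothetical; only the `SERVICE` and `REQUEST` parameters follow the general OGC convention.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

# Placeholder endpoint; the real service address has to be taken from the provider.
BASE_URL = "https://example.org/ows"

def capabilities_url(service, base_url=BASE_URL):
    """Build a GetCapabilities URL for a WMS, WCS or WFS endpoint."""
    if service not in {"WMS", "WCS", "WFS"}:
        raise ValueError(f"unsupported service type: {service}")
    query = urlencode({"SERVICE": service, "REQUEST": "GetCapabilities"})
    return f"{base_url}?{query}"

urls = {service: capabilities_url(service) for service in ("WMS", "WCS", "WFS")}
```

The capabilities document returned by such a request lists the layers, coverages or feature types a client can subsequently ask for.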

In the context of the presented infrastructure for geospatial linking, IOER-Monitor geoservices were added which can be used to link respondents to their environment with regard to land use by means of a WPS service using on-the-fly calculations on the geodata.

4. Technical Infrastructure for Geospatial Linking

Designing a system taking the interests of users and data providers as well as legal considerations into account is a challenging task. In this section, we first detail the most relevant preconditions and the requirements of a system which allows for spatial linking of survey data. Thereafter, we present our infrastructure design which meets these requirements.

4.1. Background

The presented technical infrastructure has been developed in the project Social Spatial Science Research Data Infrastructure (SoRa), which is funded by the Deutsche Forschungsgemeinschaft (DFG) in Germany and executed by the partners GESIS, IOER, KIT and SOEP. The objective of the project is to build spatial and social science research data infrastructures and to extend them in order to enable interoperability and data integration, as well as conformity to international standards. The initial implementation is guided by the interests of our partner data providers GESIS Panel and SOEP and of their customers, who are social scientists working in Germany or with German-based survey and spatial data. In a follow-up project, we plan to increase the number of data sources from the social sciences and from the spatial sciences. However, technically the infrastructure is not restricted to these domains.

The SoRa project draws upon research on environmental justice for its example use case and to demonstrate the use of the whole infrastructure. It is a suitable example because this research requires access to environmental data, mostly on a rather small scale and for a large proportion of people.

The spatial data provided through the spatial data infrastructure of the IOER Monitor fulfills this demand as it contains land use indicators on a granular grid of 100m by 100m. The IOER-Monitor offers the user more than 85 spatial indicators in different time periods, based on vector and raster data. The underlying geodata is subject to comprehensive pre-processing. Manual random sampling and semi-automatic controls, based on the basic data used, serve to ensure the quality of the results. The aim is to provide a basis for assessing the sustainability of land use development and closely related issues (Meinel & Schumacher, 2011). Survey data is used from the SOEP and the GESIS Panel. Both comprise extensive information from people living in Germany aged between 18 and 70 years. Amongst others, the participants answered questions on their demographic and socio-economic situations (), which makes the GESIS Panel an appropriate data source for our research question.

4.2. Prerequisites

In this section, we will introduce the prerequisites and requirements to implement the spatial linking task for which the proposed infrastructure is designed.

First, we present the properties of the data which is held by the different stakeholders. The main stakeholders are the survey data providers and the spatial data provider, in our use case GESIS, SOEP, and IOER. The dimensions to be considered in the spatial linking task are the temporal and the spatial dimension, i.e. the date when the survey was conducted and the locations the participant is associated with at that date. Analogously, temporal and spatial dimensions are relevant for the spatial data, i.e. the date when the indicator was measured and for which location. Furthermore, the spatial data may be provided on different levels of granularities, i.e. spatial resolution. Considering these dimensions the data integration may be performed along the spatial and the temporal dimension following two main approaches:

Static integration: For each survey participant, all available spatial indicators are determined once and stored along the survey data.
Dynamic integration: Given a request, the spatial indicators are computed on demand for a selection of participants.

Both approaches yield advantages and disadvantages. The main advantage of a static integration approach is its performance and data availability. Once the data integration process is complete, the data can be provided to the researchers and the computation needs to be performed just once. However, its main disadvantage is the static nature of the resulting dataset. Ahead of the integration, it needs to be determined which indicators should be integrated on which level of granularity and for which timeframes.

Dynamic integration refers to the integration upon a request. As a result, the data is not precomputed which alleviates the shortcomings of the static approach. However, the dynamic approach requires the availability of all data sources whenever the data is demanded. In contrast to the static approach, it does not pose any restrictions on the degrees of freedom when it comes to the integration as it allows for any combination of temporal dimension and spatial resolution. Consequently, it provides more possibilities for exploring the data. According to the argumentation outlined above, the proposed infrastructure follows a dynamic approach.
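The difference between the two approaches can be made concrete with a small sketch. The function names, the participant locations and the toy lookup are hypothetical; the lookup merely stands in for an indicator computation and returns an arbitrary deterministic number.

```python
from itertools import product

def toy_lookup(coord, indicator, year, buffer_m):
    """Stand-in for an indicator computation at a coordinate."""
    x, y = coord
    return float(x + y + year % 100 + buffer_m)

def static_integration(locations, lookup, indicators, years, buffers):
    """Precompute every combination once; later requests are plain dictionary reads."""
    return {
        (pid, i, y, b): lookup(coord, i, y, b)
        for pid, coord in locations.items()
        for i, y, b in product(indicators, years, buffers)
    }

def dynamic_integration(locations, lookup, participant_ids, indicator, year, buffer_m):
    """Compute indicator values on request for a selection of participants."""
    return {pid: lookup(locations[pid], indicator, year, buffer_m)
            for pid in participant_ids}

locations = {"p1": (0, 0), "p2": (100, 200)}

# Static: the parameter space has to be fixed ahead of time.
table = static_integration(locations, toy_lookup, ["sealing"], [2010, 2015], [100, 500])

# Dynamic: any year and buffer can be requested, here one that was never precomputed.
on_demand = dynamic_integration(locations, toy_lookup, ["p1"], "sealing", 2012, 250)
```

The static table answers only the combinations chosen in advance, whereas the dynamic call accepts parameters that were never precomputed, at the price of needing the lookup (and thus all data sources) to be available at request time.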

Both types of stakeholders, survey data providers and spatial data providers, are concerned with data privacy issues. Often there exists a trade-off between providing data in an easily accessible way and ensuring responsible use and protection of sensitive information. Conditions on handling and processing the data are typically specified in the data providers’ terms of use. However, it is very cumbersome to compile a common terms of use document that satisfies the requirements of all connected data providers. Users have to sign the terms of use of the data providers individually and comply with their conditions. Our approach to this is to create an environment that applies a level of privacy high enough to satisfy all participants while reducing the cost of creating and maintaining such an environment.

According to the characteristics of the data being processed within the infrastructure, it is necessary to consider any data privacy issues which may arise from combining data of these sources. As the goal is a spatial linking of the data, it is necessary to associate the participants of the survey with coordinates. For instance, these coordinates may represent the location of the participant’s apartment or workplace. A central aspect of anonymizing survey data is preventing data consumers from uniquely identifying the participants. However, when the spatial dimension is added, the chance of uniquely identifying participants increases. Since it is possible to identify persons individually by this information, coordinates are considered to be of a sensitive nature (see, e.g., Article 4 of the European General Data Protection Regulation (European Parliament, Council of the European Union, ())). The infrastructure needs to avoid creating or exposing such sensitive data or, if absolutely needed, handle it in appropriate ways. In particular, it must not leak any such information through user querying, which has implications for the options we have to exploit both survey data and spatial data.

Users cannot be allowed to select or filter participant records based on geographic locations, because they might be able to single out individuals living in areas with unique properties.
Users cannot be allowed to extend records with data that can be connected to a specific location, for the same reason.

For security purposes, data providers handling such sensitive data typically store data and coordinates on different servers. Thus, access to multiple systems is required to resolve the mapping from a participant to a coordinate. As a result, the user of the infrastructure can be served without being granted access to the coordinates directly, but only to spatial information associated with the coordinates of the participant. For instance, when a researcher is looking for data about population density in the area where a given set of participants live, the procedure usually requires getting the coordinates of the participants and then looking up the population density indicator values for each of them. However, under the data provider’s terms and conditions, users must not be able to associate participants’ data with their coordinates. Thus, the coordinates are not handed to the user client (SoRa-App) to allow this lookup. Instead, the provider arranges for the indicator lookup by itself and then returns solely the indicator data. Furthermore, not only does access to the data need to be restricted, but any data transfer that might occur between the components of the infrastructure also needs to be secured. Therefore, any communication via the Internet is performed by default using the Transport Layer Security (TLS) protocol, which ensures secure data transmission.

4.3. Architecture

The main components of the architecture and the interactions between them are shown in Figure 2. The main components comprise

the SoRa-App component, where the survey data is selected and a spatial linking request is sent;
the Survey Data Discovery component, enabling discovery and search of relevant survey data based on its metadata;
the Participant Data Linking component, where the participants’ coordinates can be resolved;
the Spatial Linking component, where the actual spatial linking is conducted.

Figure 2

Architecture of the SoRa-Infrastructure.

In the first step (1) the user can explore the available survey data at the Survey Data Discovery component. Thereafter, the spatial linking request is defined and issued at the Participant Data Linking Component (2). The Location Mapping process retrieves the participants’ location data and requests the corresponding indicator values at the Spatial Linking component (3). The spatial linking request is completed once the Location Mapping process returns the linked data.

Survey Data Selection

The first step for spatial linking of survey data is the survey data selection. The SoRa-App enables the user to search for and discover the survey relevant to the user’s research question. The foundation of the survey data discovery is the metadata about the survey. To facilitate the deployment of this component at various survey data providers, the survey metadata is described using a well-defined vocabulary, namely the DDI-RDF Discovery Vocabulary (DISCO) (). DISCO is a subset of the Data Documentation Initiative () standard (DDI) which is used to document metadata about surveys and other observational methods in the social, behavioral, economic, and health sciences. The subset allows for discovering this type of data as well as its metadata (). At the same time DISCO is based on the Resource Description Framework (RDF) (), a World Wide Web Consortium (W3C) standard for data interchange on the web. It is a schema-less graph-based data model which allows for merging data from various data sources. This allows for providing a common interface to discover relevant data regardless of the data provider. DISCO allows for describing the most fundamental concepts such as studies, variables, questions and representations. As an example, the data for a variable to assess the gender of the participants is shown in Figure 3. The graph shows the RDF representation of the variable “gender”. The variable is associated with a question to measure the variable and representation for the variable. Several answers are associated with the corresponding representation.

Figure 3

An Example on the Metadata of a Variable.

Each provider may extend the core survey metadata with its own vocabulary. For instance, additional terms may be used to represent relationships between variables, such as the alteration of a variable over time. As a result, each survey data provider may extend the survey data discovery functionality. For the concrete use case addressed in this article, survey metadata from the GESIS Panel and SOEP have been modelled using DISCO for their use in the developed infrastructure.
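The graph-based modelling and the merging of metadata from several providers can be illustrated without any RDF library by representing statements as subject-predicate-object triples. The sketch below is deliberately simplified: all identifiers with the `ex:` prefix are hypothetical placeholders and are not actual DISCO terms, and a production system would use a proper RDF store and the published vocabulary.

```python
def objects(graph, subject, predicate):
    """All objects of the triples in graph that match subject and predicate."""
    return {o for s, p, o in graph if s == subject and p == predicate}

# Core metadata of a variable "gender" as a set of triples (hypothetical terms).
provider_a = {
    ("ex:gender", "rdf:type", "ex:Variable"),
    ("ex:gender", "ex:question", "ex:q_gender"),
    ("ex:gender", "ex:representation", "ex:rep_gender"),
    ("ex:rep_gender", "ex:answer", "ex:female"),
    ("ex:rep_gender", "ex:answer", "ex:male"),
}

# A second source adds its own terms, e.g. on the alteration of a variable over time.
provider_b = {
    ("ex:gender", "ex:changedInWave", "ex:wave_3"),
    ("ex:q_gender", "ex:questionText", "What is your gender?"),
}

# Because the model is schema-less, merging two graphs is a plain set union.
merged = provider_a | provider_b

answers = objects(merged, "ex:rep_gender", "ex:answer")
extensions = objects(merged, "ex:gender", "ex:changedInWave")
```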

Indicator Selection

The SoRa-App is also responsible for indicator selection. Metadata describing the indicators is provided by the Spatial Linking component (in RDF and JavaScript Object Notation (JSON) ()), such that the SoRa-App can display this information to support the selection of the best-suited indicator. The metadata includes information on the methodology and interpretation of the indicators. In order to improve findability in the user interface, the indicators are sorted into ten groups ranging from traffic-related indicators to environmental indicators.

Spatial Linking Request

After selecting the survey data relevant to the user’s question, the SoRa-App allows the user to specify how the selected data should be spatially linked. To this end, users can manage their spatial linking requests in the app. This includes creating and executing new requests, viewing the status of requests and eventually downloading the results. A spatial linking request includes the participant identifiers from the selected survey data, a location mapping function and an indicator parameterization. Next, we define the components of such a request.

Definition (Indicator Parameterization)

An indicator parameterization is a 3-tuple ρ = (i, y, b) with

– i a unique identifier
– y ∈ ℕ a year
– b ∈ ℕ a buffer value specified in meters

The corresponding indicator function ι maps a coordinate and a given indicator parameterization to the indicator value.

Definition (Indicator Function)

Given an indicator parameterization ρ, the indicator function ι maps a coordinate c = (x,y)∈R² to an indicator value with ι: ρ × R²→R.

The spatial linking request provides the core functionality of the app, as it allows retrieving the indicator values for a set of participants.

Definition (Spatial Linking Request)

A spatial linking request is a 3-tuple R = (P,α,ρ) with

– P = {p₁,… p_n} a set of participant identifiers
– α: P → R² a mapping from each participant to a corresponding coordinate
– ρ an indicator parameterization

A spatial linking request is issued at the Participant Data Linking component of the survey data provider (cf. (2) in Figure 2). In the request the coordinates are not explicitly stated, as the Survey Data Discovery component merely provides the participant identifiers. The actual coordinates are (typically) not stored with the survey data, i.e. with the participant identifiers, and may not be revealed to the user. The Location Mapping process, as part of the Participant Data Linking component, executes function α for a given participant identifier and thus retrieves the coordinates for a participant. Since there are potentially several locations associated with a participant, such as the household location and the work location, the function α specifies which type of location should be retrieved. The Participant Data Linking component sends the coordinates with the indicator parameterization to the Spatial Linking component (cf. (3) in Figure 2), which returns the indicator values according to the parameters. Thereafter, the Location Mapping step is reversed: the coordinates are replaced by the participant identifiers of the request and merged with the indicator values. The outcome is a Spatial Linking Result R* which combines the request and the resulting spatially linked participant data.

Definition (Spatial Linking Result)

A spatial linking result is a tuple R* = (R, P*) with

– R a spatial linking request, and
– P* = {p₁*, …, p_n*} the spatially linked participant data with p_i* = (p_i, v_i), where p_i is the participant identifier and v_i the indicator value according to the indicator parameterization ρ of R for participant p_i at the location specified by α in R.

The Spatial Linking Result is returned to the SoRa-App and is joined with the previously selected survey data. The resulting data set may be retrieved by the user for further analysis.
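
The request and result definitions above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the SoRa implementation: the class names, the `link` function, the toy household lookup and the toy indicator function are all hypothetical, and the identifiers, coordinates and values are made up. The sketch only shows the order of the steps (location mapping, indicator retrieval, reversal of the mapping) and that the result carries participant identifiers rather than coordinates.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

Coordinate = Tuple[float, float]


@dataclass(frozen=True)
class IndicatorParameterization:
    """rho = (i, y, b): indicator identifier, year, buffer in meters."""
    indicator_id: str
    year: int
    buffer_m: int


@dataclass(frozen=True)
class SpatialLinkingRequest:
    """R = (P, alpha, rho)."""
    participants: List[str]
    location_mapping: Callable[[str], Coordinate]  # alpha: P -> R^2
    parameterization: IndicatorParameterization


def link(request, indicator_function):
    """Return P* as a list of (participant identifier, indicator value) pairs.

    Coordinates are resolved and used only inside this function; the
    returned pairs contain identifiers and values, never coordinates.
    """
    # Location Mapping: identifier -> coordinate.
    coords = [request.location_mapping(p) for p in request.participants]
    # Spatial Linking: coordinate -> indicator value.
    values = [indicator_function(request.parameterization, c) for c in coords]
    # Reverse the mapping: attach the values to the identifiers again.
    return list(zip(request.participants, values))


# Hypothetical toy data, for illustration only.
_households: Dict[str, Coordinate] = {"p1": (10.0, 20.0), "p2": (30.0, 40.0)}
rho = IndicatorParameterization("hypothetical_indicator", 2015, 500)
request = SpatialLinkingRequest(["p1", "p2"], _households.__getitem__, rho)


def toy_indicator(param, coord):
    # Stand-in for the Spatial Linking component; returns a made-up value.
    return (coord[0] + coord[1]) / 100.0


result = link(request, toy_indicator)
```

In the infrastructure described here, the location mapping would run at the survey data provider and the indicator function at the Spatial Linking component; the sketch collapses both into one process purely for readability.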

Spatial Linking Component

The Spatial Linking Component allows for retrieving the high-resolution raster data of the IOER-Monitor indicators (cf. Section 3). Internally, a Web Processing Service (WPS) () is used to access the indicator values for a given coordinate. Such WPSs are generally applied to process geodata or to refine input data based on predefined standards of the Open Geospatial Consortium (OGC) (). In our case, the WPS enables linking of coordinates to the corresponding indicator value.

Interacting with a WPS can be relatively complex. Thus, in order to facilitate access to the indicators, the spatial linking service of the WPS is also made available via a RESTful Application Programming Interface (API) (). The API receives an indicator parameterization and a set of coordinates as an Indicator Request and returns indicator values for the coordinates according to the parameterization.

Definition (Indicator Request)

An indicator request is a tuple W = (C, ρ) with

– C = {c₁, …, c_n } a set of coordinates with c_i = (x_i, y_i, e_i) where x_i and y_i are longitude and latitude and e_i the coordinate reference system
– ρ an indicator parameterization

The indicator values are available as raster data. Depending on the buffer value specified in the indicator parameters, the raster size is chosen and for each coordinate the corresponding raster value is determined. Since all indicator raster maps are available in the EPSG:3035 coordinate reference system, input coordinates given in a different system must be transformed into this base coordinate system. In case a buffer value b > 0 is specified, for each coordinate c all pixels within the circular area with diameter b around c are selected. The indicator values of these pixels are averaged to obtain the buffered indicator value. The result of the spatial linking process is an Indicator Result.
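
The buffering step can be illustrated with a small, self-contained Python sketch. It is not the WPS implementation: the function name `buffered_value`, the raster layout and the rule that a pixel counts as selected when its center lies inside the circle are assumptions made for this example. Following the text, the buffer value is treated as the diameter of the circle, and coordinates are assumed to be already in the raster's coordinate system.

```python
import math
from typing import List


def buffered_value(raster: List[List[float]], cell_size: float,
                   x: float, y: float, buffer_m: float) -> float:
    """Average the raster cells whose centers lie within the buffer around (x, y).

    Assumptions (not from the article): the point (0, 0) is the lower-left
    corner of raster[0][0], rows grow along y and columns along x, and a
    cell is selected when its center falls inside the circle.
    """
    if buffer_m <= 0:
        # No buffer: return the value of the cell that contains the point.
        return raster[int(y // cell_size)][int(x // cell_size)]
    radius = buffer_m / 2.0  # the buffer value is read as a diameter
    selected = []
    for row, line in enumerate(raster):
        for col, value in enumerate(line):
            cx = (col + 0.5) * cell_size
            cy = (row + 0.5) * cell_size
            if math.hypot(cx - x, cy - y) <= radius:
                selected.append(value)
    if not selected:
        raise ValueError("no cell centers fall inside the buffer")
    return sum(selected) / len(selected)


# Made-up 3 x 3 raster with 100 m cells; the point sits in the middle cell.
grid = [
    [0.0, 0.1, 0.0],
    [0.3, 0.5, 0.1],
    [0.0, 0.5, 0.0],
]
point_value = buffered_value(grid, 100.0, 150.0, 150.0, 0)
buffer_mean = buffered_value(grid, 100.0, 150.0, 150.0, 250.0)
```

With these toy numbers the unbuffered value is that of the middle cell, while the 250 m buffer averages the middle cell and its four direct neighbours, which lowers the value. This mirrors the pattern that buffered indicators smooth out local extremes.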

Definition (Indicator Result)

An indicator result is a tuple W* = (C*, ρ) with C* = {c₁*, …, c_n*} a set of linked coordinates with c_i* = (c_i, v_i), where c_i is a coordinate and v_i the indicator value according to the indicator parameterization ρ.

5. Case Study: Income and Environmental Hazards

To answer the exemplary research question about environmental inequalities with regard to income and land use hazards in Germany (see 2.2), we draw on the three aforementioned sources of data (see Section 3 for details). We hypothesize that people with higher incomes experience less exposure to land use hazards. This assumption is based on longstanding research which found that economic resources, such as income, provide the means to afford living in neighborhoods of higher quality (; ). In the analyses, we use three indicators whose use for exploring environmental inequalities we justify below: sealing of soils, which we have already seen above, traffic density, and industry & trade density. We will present them below in more detail.

All data sources were spatially linked with circular buffer methods using the approach described in Section 4.3, with the georeferenced survey data serving as the focal data. Accordingly, the analysis data exist in the same data format as ordinary survey data but are enriched with additional attributes such as sealing of soils. Moreover, the linking was conducted for each survey dataset separately to be able to evaluate potential differences between the two surveys and their impact on the results.

For the purpose of this article, the results are limited to an exemplary demonstration of some first steps of analyzing the data. We are aware, and note below, that further investigations may provide more in-depth analysis of the mechanisms of environmental inequalities, as they are common in the literature (; ; ). However, as far as the German case is concerned, there are not yet many studies that combine individual-level information from survey data with environmental information from geospatial data to investigate environmental inequalities. Nevertheless, we should be cautious of overinterpreting the results of the estimated relationship between income and land use below. Green spaces, for example, can also be part of a debate on gentrification (Pearsall and Eller, 2020), so that further studies should also take factors of urbanization and segregation into account (). The focus of this article, therefore, is to describe the infrastructure facilitating access to such new data to encourage further studies.

5.1. Measures

The following measures have been used in this case study.

5.1.1. Survey Measures

Income The GESIS Panel contains a measure for household income as categorized values with collapsed income ranges (Table 1). In contrast to open questions on income, this measure comprises fewer missing values and its distribution is normal, which is convenient for the estimated models. The same categorization was used for the data of the SOEP in order to create a harmonized measure of income between both survey datasets. Below, we refer to this measure of household income simply as income.

Table 1

Income Ranges.

Class	1	2	3	4	5	6	7	8	9	10	11	12	13	14	15	16	17

Income range	0–299	300–499	500–699	700–899	900–1099	1100–1299	1300–1499	1500–1699	1700–1999	2000–2299	2300–2599	2600–3199	3200–3999	4000–4999	5000–5999	6000–9999	≥10000
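
Harmonizing an open income value to the classes of Table 1 amounts to a lookup over the lower bounds of the ranges. The following Python sketch illustrates this; the function name `income_class` is hypothetical, and the handling of amounts that fall between two listed ranges (for example 1999.50, which is assigned to the lower class) is an assumption, since the table lists integer ranges only and the article does not describe this detail.

```python
from bisect import bisect_right

# Lower bounds of the 17 income classes in Table 1.
LOWER_BOUNDS = [0, 300, 500, 700, 900, 1100, 1300, 1500, 1700,
                2000, 2300, 2600, 3200, 4000, 5000, 6000, 10000]


def income_class(income: float) -> int:
    """Map a non-negative household income to its class (1 to 17) in Table 1."""
    if income < 0:
        raise ValueError("income must be non-negative")
    # The number of lower bounds that are <= income equals the class number.
    return bisect_right(LOWER_BOUNDS, income)


classes = [income_class(v) for v in (0, 299, 300, 2450, 9999, 10000)]
```

Because the classes have unequal widths, the class number is an ordinal measure; treating it as a metric predictor, as in the models below, is a modelling choice rather than a property of the table.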

Sociodemographic and Contextual Controls Apart from the central survey measures, we also control for the gender of the participants (0 = male; 1 = female), their age in years (18–70), and their highest achieved education (low, middle, and high). Moreover, we control for the number of inhabitants in the one km² neighborhood of each participant, as we suspect that the higher the population density, the higher the amount of environmental hazards.

5.1.2. Geospatial Data Measures

We use three different indicators: sealing of soils, traffic density, and industry and trade density. For all indicators, we extracted the value of the 100 m by 100 m grid cell as well as the values for 500 m, 1000 m, and 2000 m buffers. In the following, we describe the content of the three indicators before we proceed to the analysis of the combined data.

Sealing of Soils is the air- and water-tight coverage of land through buildings and roads. A consequence of excessive sealing of soils is that water cannot seep away and no exchange of air between soils and the environment takes place. Moreover, sealing of soils can be considered an inverse measure of green areas such as parks, forests or other recreational areas. As such green areas affect people’s lives positively, e.g., through their impact on stress processing () or health (), a lack of green areas is an environmental hazard and subject of an ongoing debate in the environmental justice literature (; ).

Traffic Density Road traffic and its consequences affect health as well. This effect on health holds true for noise originating from road traffic (; ; ), and for air pollution that vehicles emit (). Therefore, we use an indicator that measures the density of roads because it combines both of these risk factors for health.

Industry & Trade Density We hypothesize that exposure to industry and trade facilities poses a twofold risk. First, noise and air emissions of the facilities and vehicles frequenting the facilities have a direct effect on people living near these facilities (). Second, large areas of industry and trade facilities correlate negatively with free spaces such as green areas (). For this reason, we use an indicator of the IOER monitor that comprises information on areas used, amongst others, for retail and service, docks, waterworks, refinery, waste removal facilities, or waste disposal sites.

By comparing the descriptive statistics of all variables in Table 2, we observe some slight differences between the GESIS Panel and the SOEP after each dataset was linked to geospatial data measures. While it is interesting to see if these differences affect the estimations below, it is also crucial to control for them in the analyses.

Table 2

Descriptive Statistics of all Variables Used in the Analysis.

	GESIS Panel				SOEP

	Mean	SD	Minimum	Maximum	Mean	SD	Minimum	Maximum

Survey Measures
Age	46.69	13.37	18.00	73.00	48.96	16.74	18.00	100.00
Gender	0.52	0.50	0.00	1.00	0.54	0.50	0.00	1.00
Education	2.21	0.77	1.00	3.00	1.90	0.76	1.00	3.00
Income	11.71	2.83	1.00	17.00	11.16	3.15	1.00	17.00
Geospatial Measures
Sealing of Soils 100 m	0.47	0.24	*	*	0.51	0.24	*	*
Sealing of Soils 500 m	0.32	0.20	*	*	0.36	0.20	*	*
Sealing of Soils 1000 m	0.25	0.18	*	*	0.29	0.19	*	*
Sealing of Soils 2000 m	0.19	0.16	*	*	0.22	0.17	*	*
Traffic Density 100 m	15.67	7.53	*	*	15.86	7.33	*	*
Traffic Density 500 m	10.20	3.60	*	*	10.77	3.61	*	*
Traffic Density 1000 m	8.09	3.21	*	*	8.69	3.25	*	*
Traffic Density 2000 m	6.45	2.77	*	*	6.93	2.86	*	*
Industry and Trade Density 100 m	2.04	9.59	*	*	1.86	9.11	*	*
Industry and Trade Density 500 m	4.84	7.73	*	*	5.10	7.47	*	*
Industry and Trade Density 1000 m	5.33	6.22	*	*	6.05	6.72	*	*
Industry and Trade Density 2000 m	5.16	4.93	*	*	6.09	5.51	*	*
Number of Observations	2598				24136

Data Source: Georeferenced GESIS Panel () and Georeferenced Socio-economic Panel ().

* Minimum and maximum values deleted due to data protection.

5.2. Analysis Strategy

Our analysis is based on a model that predicts land use hazards based on the survey participants’ income. Thus, all other variables are held constant, whereas the income of the participants varies. These results are presented graphically as predicted values. In order to prevent predicting values of sealing of soils over 100% or below 0% we use a logit transformation:

\[ y_{\mathrm{logit}} = \ln\frac{y}{1 - y} \]

This transformation bounds the predicted values of the dependent variables to a range that lies between 0% and 100% after re-transforming it back to the original scale with the following formula:

\[ \hat{y} = \frac{e^{\hat{y}_{\mathrm{logit}}}}{1 + e^{\hat{y}_{\mathrm{logit}}}} \]
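
The two transformations above can be written down directly. The Python sketch below is only an illustration of the formulas, not the code used for the analysis (which was carried out in R); the function names are chosen for this example. Note that the logit is undefined for values of exactly 0 or 1; how such observations were treated is not described here, so the sketch simply rejects them.

```python
import math


def logit(y: float) -> float:
    """Logit transform of a proportion y with 0 < y < 1."""
    if not 0.0 < y < 1.0:
        raise ValueError("y must lie strictly between 0 and 1")
    return math.log(y / (1.0 - y))


def inv_logit(y_logit: float) -> float:
    """Map a value on the logit scale back to the original 0 to 1 scale."""
    # Algebraically equal to e^x / (1 + e^x); the case split avoids
    # overflow in exp() for large positive or negative inputs.
    if y_logit >= 0:
        return 1.0 / (1.0 + math.exp(-y_logit))
    e = math.exp(y_logit)
    return e / (1.0 + e)


# Round trip for a made-up proportion of sealed soil.
y = 0.47
y_back = inv_logit(logit(y))
```

Whatever value a linear model predicts on the logit scale, `inv_logit` returns a number between 0 and 1, which is why predictions cannot exceed 100% or fall below 0% after re-transformation.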

Finally, we estimate all models with cluster-robust standard errors. Using them in a survey sampling setting is generally recommended (), but we also argue that different residential policies between municipalities may produce additional clustering among participants who live in the same municipality. Accordingly, the cluster-robust standard errors are based on the participants’ municipality.

5.3. Results

Can we observe land use hazard exposure of survey participants as a function of their income? We use prediction plots to answer this question based on our data. These plots yield the predicted value of each land use hazard indicator on the y-axis in relation to the income of participants on the x-axis. Moreover, within each of these plots different slopes and their corresponding 95% confidence intervals depict the predicted values for different geographic sizes of the indicators (100 m, 500 m, 1000 m, and 2000 m).

Figure 4 shows the results for the estimates based on the GESIS Panel. With regard to sealing of soils in particular, we can observe a distinctly decreasing slope of sealing of soils as participants’ income increases. In comparison to the lowest-income group, the predicted value for sealing of soils of the highest-income group is ~14% smaller. However, the overall exposure and the steepness of the slopes decrease with increasing sizes of geographic neighborhoods. While this pattern remains significant with regard to the other two indicators, the effect of income is only small.

Figure 4

Results of the Linear Predictions as a Function of Income and the Three Land Use Indicators Across Different Geographic Scales for the GESIS Panel (N = 2598).

Data Source: Georeferenced GESIS Panel and Monitor of Settlement and Open Space Development.

These results are corroborated by the data of the SOEP displayed in Figure 5. Moreover, as the sample size of the SOEP is larger than that of the GESIS Panel, the estimates are more precise and, in the case of sealing of soils, more pronounced. In comparison to the lowest-income group, the predicted value for sealing of soils of the highest-income group is ~19% smaller, which differs by about 5 percentage points from the GESIS Panel estimate. The predictions for the other two land use indicators are similar to those of the GESIS Panel analysis.

Figure 5

Results of the Linear Predictions as a Function of Income and the Three Land Use Indicators Across Different Geographic Scales for the SOEP (N = 24136).

Data Source: Georeferenced Socio-economic Panel (doi:10.5684/soep.v33.1) and Monitor of Settlement and Open Space Development.

Overall, the analysis demonstrates the usefulness of combining data across different research data infrastructures. Using similar but slightly different population samples yields similar findings: generally, higher income is associated with lower exposure to land use hazards, mainly with regard to sealing of soils. Such findings can and should be extended, e.g., by taking into account other sociodemographic characteristics such as ethnic minority status (). Environmental justice research is a longstanding research field, considering factors of segregation (), ethnicity (), health (), and political participation (). The results here are a starting point, corroborating existing aggregate-level findings at the individual level while controlling for individual-level attributes. It would be interesting to see how future studies adapt the presented infrastructure and contribute to such research questions.

6. Conclusion

For several investigations in social science research, spatial linking is necessary in order to enrich survey data with information from auxiliary georeferenced data. However, challenges exist regarding data access, data privacy and data security in addition to the spatial linking itself. Hence, there is a need to support researchers from the social sciences and related domains who work with survey data in accessing and linking these data sources. In this article, we presented a spatial social science research infrastructure which addresses these challenges and supports users during the spatial linking process. We used this infrastructure for investigating environmental inequalities of land use hazards in Germany and could show that increasing income of survey respondents is associated with less exposure to environmental hazards originating from land use in Germany.

However, in order to increase the benefit for users, we plan to support them in their statistics tools like R, where they actually analyse the data. Also, the involvement of multiple stakeholders increases the organizational effort, in particular when aspects of data privacy and data security have to be addressed among them, i.e., sensitive data like georeferenced survey data must not be accessible outside of its data provider. In the infrastructure setup at hand, it has to be ensured that processing involving this data is executed on-site. The SoRa-App is currently under development as a prototype in the SoRa project. We plan to make the prototype publicly available in 2020.

Software Information

R version 3.5.2 (2018-12-20) — “Eggshell Igloo”

With packages

“magrittr” 1.5: general piping of procedures
“tidyverse” 1.2.1: data wrangling, e.g., procedures from “dplyr” (0.8.0.1)
“sf” 0.7-2: geospatial data manipulations, e.g., buffers and spatial joins
“raster” 2.8-19: extraction of raster values
“velox” 0.2.0: fast loading of huge raster files
“estimatr” 0.1.4: estimation of regression models with cluster-robust standard errors
“car” 3.0-2: logit function
“boot” 1.3-20: inverse logit function for re-transforming logit scales
“ggplot2” 3.1.0: visualization of results
“tmap” 2.2: Creating the buffer map

QGIS Desktop 3.0.3

https://qgis.org/en/site/forusers/download.html.

Data Accessibility Statement

GESIS. (). GESIS Panel – Extended Edition. GESIS Data Archive, Cologne, ZA5664 Datenfile Version 19.0.0. https://doi.org/10.4232/1.12742 Available on-site
IOER. (2017). Monitor of Settlement and Open Space Development. Leibniz Institute of Ecological Urban and Regional Development. Available for Download https://monitor.ioer.de
SOEP V33.1: doi:10.5684/soep.v33.1, Available as SUF from FDZ SOEP
SOEP Hauskoordinaten, Available on-site at FDZ SOEP

Data Science Journal
