On the Importance of 3D Surface Information for Remote Sensing Classification Tasks

There has been a surge in remote sensing machine learning applications that operate on data from active or passive sensors as well as multi-sensor combinations (Ma et al. (2019)). Despite this surge, however, there has been relatively little study on the comparative value of 3D surface information for machine learning classification tasks. Adding 3D surface information to RGB imagery can provide crucial geometric information for semantic classes such as buildings, and can thus improve out-of-sample predictive performance. In this paper, we examine in-sample and out-of-sample classification performance of Fully Convolutional Neural Networks (FCNNs) and Support Vector Machines (SVMs) trained with and without 3D normalized digital surface model (nDSM) information. We assess classification performance using multispectral imagery from the International Society for Photogrammetry and Remote Sensing (ISPRS) 2D Semantic Labeling contest and the United States Special Operations Command (USSOCOM) Urban 3D Challenge. We find that providing RGB classifiers with additional 3D nDSM information results in little increase in in-sample classification performance, suggesting that spectral information alone may be sufficient for the given classification tasks. However, we observe that providing these RGB classifiers with additional nDSM information leads to significant gains in out-of-sample predictive performance. Specifically, we observe an average improvement in out-of-sample all-class accuracy of 14.4% on the ISPRS dataset and an average improvement in out-of-sample F1 score of 8.6% on the USSOCOM dataset. In addition, the experiments establish that nDSM information is critical in machine learning and classification settings that face training sample scarcity.


INTRODUCTION
Significant resources are required to establish robust classification performance for remote sensing applications, which can range from composite distributed land uses, for example urban versus rural development, to feature-specific mapping, such as roads, buildings, and vehicles. Technical challenges naturally arise from the sheer variety of appearances and viewing angles for objects of interest, as well as the lack of annotations or labels relative to the overall volume of data.
Machine learning and artificial intelligence have made great strides in recent years due to the advent of hardware-accelerated computing and the development of symbolic math libraries. However, a unified framework for remote sensing classification applications is still beyond reach (Ball, Anderson, and Chan, 2017; Zhu et al., 2017). The diversity of data collection methods, e.g. space-based or airborne, and sensor modalities, e.g. lidar, RGB, and hyperspectral imagery, necessitates continuous adaptation of the underlying classification framework based on classification task, desired accuracy, computational bandwidth, and available datasets.
A range of machine learning methodologies are available for executing object classification from remotely sensed data, including convolutional neural networks (CNNs) (Maturana and Scherer, 2015; R. Zhao, Pang, and J. Wang, 2018), decision trees (Blaha et al., 2016), and Support Vector Machines (SVMs) (Lai and Fox, n.d.; D. Wang, Zhang, and Y. Zhao, 2007). Although inherently different, all of these methods have been shown to be effective given specific datasets and class labels for training and cross-validation. Interestingly, many algorithms originate from outside the remote sensing community, e.g. from biomedical image segmentation (Ronneberger, Fischer, and Brox, 2015), and have been modified to ingest mission-specific sensor modalities in order to classify different objects of interest.
The feasibility of applying machine learning concepts has been further studied, and showcased through various challenges, such as the USSOCOM Urban 3D challenge (H. Goldberg, Brown, and S. Wang, 2017; H.R. Goldberg et al., 2018) or the ISPRS 2D and 3D Semantic Labeling contests (Gerke et al., 2014). Several submitted algorithms yielded high accuracy, even without using 3D surface information (Audebert, Le Saux, and Lefèvre, 2018). However, given the recent success, it remains important to formally assess information content of available sensor modalities, and to infer generalizability of the resulting classifier if presented with inherently different, previously unseen test samples. In other words, given a specific object recognition task, we seek to answer the following questions. First, what are the preferred sensor modalities given the quantity and quality of training samples? After all, adding sensor modalities usually incurs cost and logistical considerations. Second, is the resulting classifier useful when applied outside the location or domain in which training data was collected?
This paper expands upon the findings in (Chen, Fu, et al., 2018), in which the authors acknowledge that the majority of ongoing work is focused on developing classification strategies with little regard to the information value provided by the sensor modality of the input data. In (Chen, Fu, et al., 2018), multi-spectral imagery is fused with DSM information in order to study the impact on the resulting classification performance. However, DSM information is pre-processed to extract specific features based on the 3D structure tensor (Weinmann, n.d.) such as linearity, planarity, and sphericity, as well as the Normalized Difference Vegetation Index (NDVI) from (Tucker et al., 2001). This paper expands upon the work in (Chen, Fu, et al., 2018) by considering multiple datasets and by examining training and test performance separately. Additionally, cross-city validation is added to formally assess classification performance when extrapolating the classifier from one city to another.
To partially close this gap between formally assessing information content and achieving high classification performance, this paper examines the importance of 3D surface information when generalizing trained classifiers using two distinct machine learning frameworks, and two different publicly available datasets as test cases. In order to derive results that are agnostic to the chosen classification methodology, we specifically select one classification scheme of high complexity, i.e. a Fully Convolutional Neural Network (FCNN) architecture, and one classification scheme of low complexity, i.e. a Support Vector Machine (SVM).
The novelties and technical contributions of this paper are two-fold. First, we examine generalizability of classification frameworks when trained and tested on different geographic locations by formally evaluating out-of-sample performance of the classifier when trained with and without 3D DSM information. Different geographic locations often imply deviations in the underlying urban infrastructure, architecture, and thus the appearance and structure of objects of interest such as buildings and roads. Second, we assess the same generalizability when facing scarce training data. Therefore, we will formally evaluate out-of-sample performance when trained with and without 3D DSM information while the number of training samples is gradually reduced.
As acknowledged in (Chen, Fu, et al., 2018), the majority of research focuses on improving classification tasks, while "only little attention has been paid to the input data itself". Specifically, information content is often overlooked in light of achievable performance metrics for object classification and segmentation. In addition, the work in (Chen, Fu, et al., 2018) explores the utility of handcrafted, geometric features over machine-derived ones. The information content of specific feature types is studied in (Gevaert et al., 2016), in which the authors distinguish between point-based and segment-based features. Depending on the available sensor modalities, only some feature types may be available in any given scenario. Their relative importance for object classification can then be formally assessed using an SVM classifier for different scenarios and different combinations of feature types. Although classification is tested via cross-validation, training and testing samples are extracted from the same geographical location, which may limit the diversity of building footprints the classifier is exposed to. This paper follows a similar approach by formally evaluating not only classification performance, but also generalizability using two distinct classification frameworks. Therefore, we attempt to establish the notion of information content independent of the underlying classification strategy. In addition, our study focuses on entire sensor modalities instead of individual features that are derived from those modalities.
This paper is organized as follows: Section 2 outlines the technical approach. Experimental validation is presented in Section 3, with results split into cross-city validation in Section 3.1 and validation for varying training sample proportions in Section 3.2. Section 4 concludes the study.

TECHNICAL APPROACH
This section describes the underlying classification approach and performance assessment. Section 2.1 provides an overview of binary and multi-class classification in a semantic segmentation setting. Section 2.2 discusses class balancing for imbalanced classification tasks. Section 2.3 outlines the two classification architectures used for this study, i.e. a Support Vector Machine (SVM) and a Fully Convolutional Neural Network (FCNN). The two publicly available datasets, including objects of interest used for training and testing, are described in Section 2.4. Section 2.5 summarizes the performance assessment strategies as well as the two different validation settings: (1) cross-city validation and (2) reduction of sample proportion for training.

BINARY AND MULTI-CLASS CLASSIFICATION
This work aims to (i) develop robust binary and multi-class semantic segmentation classifiers for remote sensing, and (ii) test the overall generalizability of these classifiers as a function of sensor modality. Semantic segmentation can be conceptualized as a classification problem in which we predict a class label for each pixel. We consider each of these pixel-level classification problems in binary and multi-class classification settings (Janocha and Czarnecki, 2017). One common optimization objective for finding optimal parameter sets in binary and multi-class classification tasks is two-dimensional cross-entropy, which we utilize in this work for our FCNN models. Below, we discuss the objective functions for both of these classification schemes.

Binary Classification
For binary classification, the classifier learns a parameter set θ_b* by minimizing a loss function, in this case the pixel-wise Negative Log Likelihood (NLL) for a single class:

θ_b* = arg min_θ −(1/(mn)) Σ_{i=1}^{m} Σ_{j=1}^{n} [ Y_{i,j} log(P(X; θ)_{i,j}) + (1 − Y_{i,j}) log(1 − P(X; θ)_{i,j}) ]    (2.1)

where Y_{i,j} ∈ {0, 1} and P(X; θ)_{i,j} ∈ [0, 1] denote the ground truth and predicted labels, respectively, at pixel (i, j) for each training image tile, and X denotes the input features (a 1-, 3-, or 4-channel image). Both X and Y are indexed as 2D arrays with (height, width) given as (m, n). This classification methodology is utilized for the USSOCOM Urban3D dataset for building footprint classification.
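As an illustrative check of the pixel-wise binary NLL objective, the following is a minimal PyTorch sketch (our own, not the authors' code), assuming predictions are already probabilities in [0, 1]:

```python
# Sketch: pixel-wise binary Negative Log Likelihood over an m x n tile.
import torch
import torch.nn.functional as F

def binary_pixelwise_nll(probs: torch.Tensor, targets: torch.Tensor) -> torch.Tensor:
    """probs, targets: tensors of shape (m, n); targets in {0, 1}."""
    eps = 1e-7  # numerical guard against log(0)
    probs = probs.clamp(eps, 1.0 - eps)
    loss = -(targets * probs.log() + (1.0 - targets) * (1.0 - probs).log())
    return loss.mean()  # average over all m * n pixels

probs = torch.tensor([[0.9, 0.2], [0.8, 0.1]])
targets = torch.tensor([[1.0, 0.0], [1.0, 0.0]])
# Matches PyTorch's built-in binary cross-entropy (mean reduction):
assert torch.allclose(binary_pixelwise_nll(probs, targets),
                      F.binary_cross_entropy(probs, targets))
```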

Multiclass Classification
A multiclass classifier minimizes a similar objective to find parameters θ_m*, except rather than considering a single class, e.g. a building, this classifier makes pixel-wise predictions over multiple classes. This classifier learns by minimizing the following objective, which corresponds to pixel-wise Negative Log Likelihood (NLL) over N classes:

θ_m* = arg min_θ −(1/(mn)) Σ_{i=1}^{m} Σ_{j=1}^{n} Σ_{k=1}^{N} Y_{i,j,k} log(P(X; θ)_{i,j,k})

where Y_{i,j,k} ∈ {0, 1} indicates whether pixel (i, j) belongs to class k, and P(X; θ)_{i,j,k} ∈ [0, 1] is the corresponding predicted class probability.
To make optimal semantic segmentation predictions, we use gradient descent methods (Ruder, 2016) to iteratively solve for a set of weights θ that minimize the objectives presented above.
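A minimal sketch of this iterative minimization, using a toy one-layer stand-in for an FCNN rather than the paper's actual model:

```python
# Sketch: minimizing a pixel-wise cross-entropy objective with gradient descent.
import torch

torch.manual_seed(0)
X = torch.randn(1, 4, 8, 8)              # one 4-channel (RGB + nDSM) tile
Y = (X[:, :1] > 0).float()               # toy binary pixel labels

model = torch.nn.Conv2d(4, 1, kernel_size=1)  # toy stand-in for an FCNN
opt = torch.optim.SGD(model.parameters(), lr=0.5)
loss_fn = torch.nn.BCEWithLogitsLoss()

losses = []
for step in range(200):
    opt.zero_grad()
    loss = loss_fn(model(X), Y)   # pixel-wise NLL objective
    loss.backward()               # gradients of the objective w.r.t. theta
    opt.step()                    # gradient descent update
    losses.append(loss.item())

assert losses[-1] < losses[0]     # the objective decreases over iterations
```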

CLASS IMBALANCE CORRECTION VIA CLASS WEIGHTING
In addition to computing the element-wise cross-entropy loss for each pixel in the image, we weight each pixel by its inverse class frequency: if class i occurs in a dataset with fraction f_i ∈ (0, 1], then its weight is w_i = 1/f_i. This weighting scheme corrects for class imbalance in our datasets and corresponds to a weighted Negative Log Likelihood (wNLL) or weighted cross-entropy objective (Aurelio et al., 2019). Using a weighted cost function allows for learning a classifier that is not biased toward the most common class. With the incorporation of these weights, our objectives become:

Weighted binary classification:

θ_b* = arg min_θ −(1/(mn)) Σ_{i=1}^{m} Σ_{j=1}^{n} w(Y_{i,j}) [ Y_{i,j} log(P(X; θ)_{i,j}) + (1 − Y_{i,j}) log(1 − P(X; θ)_{i,j}) ]

where w(Y_{i,j}) ∈ ℝ_+ denotes the weight assigned to pixel (i, j) based on its ground-truth class.

Weighted multi-class classification:

θ_m* = arg min_θ −(1/(mn)) Σ_{i=1}^{m} Σ_{j=1}^{n} w(c_{i,j}) log(P(X; θ)_{i,j,c_{i,j}})

where w(c_{i,j}) ∈ ℝ_+ denotes the weight assigned to pixel (i, j) based on its ground-truth class c_{i,j}. Note that here we use c_{i,j} ∈ {0, 1, …, N} to denote the class, rather than Y_{i,j,k} ∈ {0, 1}.
Our FCNN classifiers, SegNet and SegNet Lite, which are described in detail below, each leverage these weighted objectives, and our classification results are reported as balanced, i.e. class-frequency adjusted.
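The inverse-class-frequency weighting can be sketched as follows; the toy label tile is ours, and PyTorch's `CrossEntropyLoss` accepts the per-class weights directly:

```python
# Sketch: inverse-class-frequency weights for a weighted NLL objective.
import torch

labels = torch.tensor([[0, 0, 0, 1],
                       [0, 0, 0, 1],
                       [0, 0, 2, 2]])          # toy 3 x 4 ground-truth tile
num_classes = 3

counts = torch.bincount(labels.flatten(), minlength=num_classes).float()
freqs = counts / counts.sum()                  # f_i in (0, 1]
weights = 1.0 / freqs                          # w_i = 1 / f_i

# Weighted cross-entropy: pass the weights to the loss function
loss_fn = torch.nn.CrossEntropyLoss(weight=weights)
logits = torch.randn(1, num_classes, 3, 4)     # (batch, classes, m, n)
loss = loss_fn(logits, labels.unsqueeze(0))

# Class 0 covers 8 of 12 pixels, so its weight is 12/8 = 1.5
assert float(weights[0]) == 1.5
```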

CLASSIFICATION ARCHITECTURES
To fully evaluate the importance of 3D nDSM information independent of the underlying classification algorithm, two distinct classification architectures have been selected for experimentation. First, we consider a relatively simple, well-known SVM framework. Second, we apply an FCNN architecture. These two classifiers are outlined in detail below.

Architecture 1 - Support Vector Machine (SVM)
An SVM is a discriminative classifier that aims to separate sets of feature vectors based on the assigned labels by finding the maximum margin hyperplane (Cortes and Vapnik, 1995). Separation can be accomplished through linear hyperplanes or nonlinear hypersurfaces, which are constructed in a feature space through numerical optimization with kernel functions. Figure 1 illustrates the application of a 5 × 5 neighborhood in the imagery to construct n-dimensional feature vectors f_i ∈ ℝ^n, one for each pixel (or data sample) i. For case 1 (one channel), we extract features only from the DSM information to obtain a feature vector f_{1,i} ∈ ℝ^25. Similarly, for case 2 (three channels), we utilize only RGB information to obtain f_{2,i} ∈ ℝ^75. For case 3 (four channels), we find f_{3,i} ∈ ℝ^100 by concatenating f_{1,i} and f_{2,i}, thereby constructing the full feature vector fusing DSM and RGB information. Therefore, classification is carried out in 25-, 75-, or 100-dimensional feature spaces.
Based on ground truth information and training data available, labels y i ∈ L can be assigned to each feature vector f i . Here, L denotes the set of all labels with cardinality |L| = N, where N is the number of unique labels contained in the dataset. Once feature vectors have been constructed and labels have been assigned, supervised training via SVM can be accomplished using any available machine learning library. For this study, nonlinear hypersurfaces are constructed using a Gaussian/Radial Basis Function (RBF) kernel, implemented with MATLAB's fitcecoc module (MathWorks, 2018).
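The patch-based feature construction can be sketched as below. Note the paper trains with MATLAB's fitcecoc; here we show an analogous construction with scikit-learn's RBF-kernel SVC on toy data, purely as an illustration:

```python
# Sketch: 5 x 5 neighborhood features per pixel, then an RBF-kernel SVM.
import numpy as np
from sklearn.svm import SVC

def patch_features(img: np.ndarray, k: int = 5) -> np.ndarray:
    """img: (H, W, C) array -> one k*k*C feature vector per interior pixel."""
    h, w, c = img.shape
    feats = []
    for i in range(h - k + 1):
        for j in range(w - k + 1):
            feats.append(img[i:i + k, j:j + k, :].ravel())
    return np.asarray(feats)

rng = np.random.default_rng(0)
rgb = rng.random((12, 12, 3))                  # toy RGB tile
ndsm = rng.random((12, 12, 1))                 # toy nDSM channel
fused = np.concatenate([rgb, ndsm], axis=2)    # 4-channel composite

f_rgb = patch_features(rgb)                    # 75-dimensional features
f_fused = patch_features(fused)                # 100-dimensional features
labels = rng.integers(0, 2, size=len(f_fused)) # toy binary labels

clf = SVC(kernel="rbf").fit(f_fused, labels)   # Gaussian/RBF-kernel SVM
assert f_rgb.shape[1] == 75 and f_fused.shape[1] == 100
```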

Architecture 2 - Fully Convolutional Neural Network (FCNN)
Although originally developed for image segmentation tasks, the Segmentation Network, or SegNet, architecture presented in (Badrinarayanan, Kendall, and Cipolla, 2017) and depicted in Figure 2 has recently gained increased attention for object classification in remote sensing applications (Audebert, Le Saux, and Lefèvre, 2018). SegNet is characterized by a Fully Convolutional Neural Network (FCNN) architecture with an encoder-decoder structure. Image information is initially down-sampled throughout the five encoder blocks (left side of Figure 2) using convolution operations, batch normalization, nonlinear activation functions, and pooling. These encoding blocks create a latent representation of the input image, characterized by spatial features extracted from the convolutional encoding layers. Throughout the five decoding blocks (right side of Figure 2), segmentation information is reconstructed from this latent representation to the full image resolution using convolution operations, batch normalization, nonlinear activation functions, and nonlinear up-sampling blocks. To perform nonlinear upsampling, the SegNet decoder leverages pooling indices computed in the encoder layers of the network and connected from the encoder to the decoder via skip connections (Badrinarayanan, Kendall, and Cipolla, 2017). The input layer can be modified to ingest one channel (DSM only), three channel (RGB-only), or four channel (DSM & RGB) data. Each of the five encoder blocks and five decoder blocks consists of two to three convolutional layers. For an input window size of 256 × 256 pixels, it can be shown that the original SegNet structure from (Badrinarayanan, Kendall, and Cipolla, 2017) consists of roughly 30 million weights. 
By limiting the number of convolutional layers per block to two and reducing the output dimensions (or channels) of each layer by 75 percent, we construct a similar, yet lightweight architecture consisting of only 1.2 million weights, reducing the total number of weights by 96%. For the experiments carried out in Section 3, we refer to the original architecture as SegNet and to the lightweight structure with a reduced number of weights as SegNet Lite. Figure 3 compares our SegNet (red) and SegNet Lite (blue) architectures, listing the number of weights per layer and the total number of weights. Table 1 provides a consolidated view of these two architectures.
Both of these SegNet neural architecture variants were implemented using PyTorch (Paszke et al., 2019), which supports streamlined GPU acceleration. Source code was provided through a git repository shared by the authors of (Audebert, Le Saux, and Lefèvre, 2018). Some modifications were made (i) to ingest composite images combining spectral and DSM information, and (ii) to transform the SegNet architecture into the SegNet Lite architecture.

Figure 2: SegNet (Audebert, Le Saux, and Lefèvre, 2018; Badrinarayanan, Kendall, and Cipolla, 2017), a deep encoder-decoder structure for image segmentation and object classification. To perform nonlinear up-sampling, the SegNet decoder leverages pooling indices computed in the encoder layers of the network and connected from the encoder to the decoder via skip connections (Badrinarayanan, Kendall, and Cipolla, 2017).
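Modification (i), parameterizing the input layer by channel count, can be sketched as below; the layer name and channel width (64, as in standard SegNet's first encoder stage) are illustrative, not taken from the authors' repository:

```python
# Sketch: an input convolution that ingests 1- (nDSM), 3- (RGB),
# or 4-channel (RGB + nDSM) tiles.
import torch
import torch.nn as nn

def make_input_conv(in_channels: int) -> nn.Conv2d:
    # First encoder convolution, parameterized by the number of input channels
    return nn.Conv2d(in_channels, 64, kernel_size=3, padding=1)

for channels in (1, 3, 4):                  # nDSM-only, RGB-only, RGB + nDSM
    conv = make_input_conv(channels)
    x = torch.randn(2, channels, 256, 256)  # batch of 256 x 256 windows
    assert conv(x).shape == (2, 64, 256, 256)  # spatial size preserved
```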

DATASETS AND OBJECTS OF INTEREST
This section describes the two datasets used for performance assessment, including objects of interest. Although there exists a wide variety of available datasets for examining remote sensing applications, this work focuses on (i) high-resolution satellite imagery provided through the USSOCOM Urban 3D Challenge (H. Goldberg, Brown, and S. Wang, 2017; H.R. Goldberg et al., 2018) and (ii) aerial imagery released by the International Society for Photogrammetry and Remote Sensing (ISPRS) in support of their 2D Semantic Labeling Contest (Gerke et al., 2014). In this paper, we refer to these datasets by the originator (e.g. USSOCOM for the Urban 3D Challenge and ISPRS for the ISPRS 2D Semantic Labeling Contest). To establish the core results of this study, we apply both the SVM and SegNet classifiers to the ISPRS and USSOCOM datasets.

USSOCOM Urban3D Dataset
The USSOCOM dataset contains orthorectified red-green-blue (RGB) imagery of three US cities: (i) Jacksonville, FL, (ii) Tampa, FL, and (iii) Richmond, VA, with a resolution of 50-centimeter ground sample distance (GSD). The data was collected via commercial satellites and additionally provides coincident 3D Digital Surface Models (DSMs) as well as Digital Terrain Models (DTMs). These DSMs and DTMs are derived from multi-view, EO satellite imagery at 50-centimeter resolution, rather than through lidar (Light Detection and Ranging) sensors. DSM and DTM information is used to generate normalized DSM (nDSM) information, i.e. nDSM = DSM − DTM. All imagery products were created using the Vricon (now Maxar Technologies) production pipeline using 50-centimeter DigitalGlobe satellite imagery. Buildings are the only objects of interest, i.e. L = {0, 1} and |L| = 2, with roughly 157,000 annotated building footprints contained in the data. These ground truth building footprints are generated through the use of a semi-automated feature extraction tool in tandem with the HSIP 133 cities dataset (H. Goldberg, Brown, and S. Wang, 2017). Figure 4 shows one of the 144 Jacksonville tiles from the USSOCOM dataset with RGB imagery on the left, nDSM information (i.e. the difference between surface and terrain) in the center, and annotated ground truth for building footprints on the right (buildings in yellow with background in blue). Tiles are 2048 × 2048 pixels and span an area of roughly 1 km².

ISPRS Dataset
The ISPRS dataset contains infrared, red, and green (IRRG) bands for two locations: (i) Vaihingen, Germany, and (ii) Potsdam, Germany. The GSDs are 9 centimeters and 5 centimeters, respectively. DSM and nDSM information is generated via dense image matching using Trimble INPHO 5.3 software. In order to avoid areas without data ('holes') in the True Orthophoto (TOP) and DSM, dataset patches were selected only from the central region of the TOP mosaic, i.e. not at the boundaries.
Any remaining (very small) holes in the TOP and the DSM were interpolated. nDSM imagery is produced using a fully automatic filtering workflow without any manual quality control. 32-bit grey levels are used to encode heights for the DSMs and nDSMs in the TIFF format (Gerke et al., 2014). Ground truth annotations for objects of interest are provided for |L| = 6 classes: impervious surfaces (i.e. roads), buildings, low vegetation, trees, cars, and clutter. Figure 6 presents a sample of the ISPRS dataset with IRRG imagery on the left, nDSM information in the center, and ground truth information on the right. Ground truth is color-coded for roads (white), buildings (blue), low vegetation (cyan), trees (green), cars (yellow), and clutter (red).
For nDSM imagery produced in both the ISPRS and USSOCOM datasets, it is unclear if the authors leveraged any techniques to mitigate occlusion due to objects such as trees, which can have a substantial, seasonally-dependent effect on the ground-truth accuracy of the semantic labels (Park and Guldmann, 2019) present in both of these datasets. For the USSOCOM dataset, the authors also note that producing DTM imagery products from overhead imagery remains an open research question (H. Goldberg, Brown, and S. Wang, 2017).
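The nDSM computation used throughout, nDSM = DSM − DTM, can be sketched in a few lines; the clipping of negative heights to zero is a common post-processing convention and is our assumption, not stated by the dataset providers:

```python
# Sketch: normalized DSM as above-ground height after removing the terrain.
import numpy as np

dsm = np.array([[12.0, 15.0], [11.0, 30.0]])  # surface elevations (m)
dtm = np.array([[10.0, 10.0], [11.0, 12.0]])  # bare-earth terrain (m)

ndsm = dsm - dtm                 # nDSM = DSM - DTM: per-pixel object heights
ndsm = np.clip(ndsm, 0.0, None)  # assumption: negative heights clipped to 0

assert ndsm[1, 1] == 18.0        # an 18 m structure above the terrain
```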

VALIDATION SETTINGS
This study provides experimental validation for two scenarios demonstrating the importance of 3D surface information for remote sensing classification tasks.
First, we seek to establish performance metrics for cross-city validation when using classifiers that were trained with and without nDSM information. Previous work in this domain concluded that in-sample performance drops only slightly when depriving the classifier (SegNet in (Audebert, Le Saux, and Lefèvre, 2018)) of nDSM information. However, the impact of nDSM information on out-of-sample performance, i.e. cross-city performance, has not yet been formally assessed; this assessment is one of the major contributions of this work.
In addition to cross-city validation, we study the impact of nDSM information when training the classifier using scarce data. Therefore, we reduce the number of training samples while assessing overall out-of-sample performance of the resulting classifier when trained both with and without nDSM information. Table 2 summarizes the cross-city training and testing methodologies for the USSOCOM dataset, while Table 3 summarizes the cross-city training and testing methodologies for the ISPRS dataset. As noted, we establish the core results of this study by evaluating the importance of DSM information for classification. For this, we use both the SVM and SegNet classifiers for both the USSOCOM and ISPRS datasets.
*Note that the ISPRS-Potsdam data (5-centimeter GSD) is downsampled by a ratio of 9:5 to achieve a GSD of 9 centimeters, matching the ISPRS-Vaihingen data used for training.
In addition to what is summarized in Tables 2 and 3, these training and evaluation procedures also allow for formal comparisons between the SVM and SegNet classifiers on both the USSOCOM and ISPRS datasets.

EXPERIMENTAL VALIDATION
This section summarizes the experimental validation including performance assessment results. The section is divided with Section 3.1 addressing cross-city validation and Section 3.2 describing the impact of reducing available training samples. Classification performance with and without DSM information is the main focus of this study.

CROSS-CITY VALIDATION
Our first experimental study applies the SVM and SegNet classifiers to the ISPRS dataset. Specifically, the classifiers were trained using three cases: (i) IRRG information only, (ii) nDSM information only, and (iii) nDSM & IRRG information combined. Training was conducted using 12 out of 16 tiles from the ISPRS Vaihingen dataset. Training times were approximately 20 hours for SegNet and 3 hours and 20 minutes for SegNet Lite.
In our second experimental study, we apply SVM and SegNet classifiers to the USSOCOM dataset. For SVM, we first train the SVM classifier from Section 2.3.1 on the USSOCOM dataset using all of the available 144 Jacksonville tiles. Training samples were down-selected randomly by a factor of 1,000 in order to circumvent memory limitations, reduce the number of resulting support vectors, and to allow for adequate training and inference times. SVM training then yields three classifiers, one for training with RGB & nDSM information, one for training with RGB information only, and one for training with nDSM information only. For SegNet, we follow a similar training procedure: we train the SegNet classifier from Section 2.3.2 on all 144 available Tampa tiles. Down-selection is not performed for the SegNet classifier, i.e. the classifier is trained on all pixels present in all training tiles. SegNet training similarly yields three classifiers, one for training with RGB & nDSM information, one for training with RGB information only, and one for training with nDSM information only.
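The random down-selection by a factor of 1,000 can be sketched as below; array sizes are toy stand-ins for the roughly 600 million annotated pixels:

```python
# Sketch: randomly down-select training pixels by a fixed factor
# to keep SVM training tractable.
import numpy as np

rng = np.random.default_rng(42)
features = rng.random((100_000, 100))       # stand-in for per-pixel features
labels = rng.integers(0, 2, 100_000)        # stand-in for pixel labels

factor = 1_000                              # down-select by a factor of 1,000
keep = rng.choice(len(features), size=len(features) // factor, replace=False)
train_X, train_y = features[keep], labels[keep]

assert len(train_X) == 100                  # 0.1% of the original samples
```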
Results from these two aforementioned experimental studies are discussed below.

ISPRS Dataset Results
The results outlined in Tables 4 and 5 replicate the findings from (Audebert, Le Saux, and Lefèvre, 2018). These tables summarize the resulting classification accuracy for the SegNet and SegNet Lite architectures, respectively. Individual rows are associated with different objects of interest, while columns cite the corresponding in-sample (Vaihingen) and out-of-sample (Potsdam) class-balanced accuracies for the three training cases. Note that random guessing would yield a total accuracy of 1/6 ≈ 0.1667 for a six-class classification problem. Our results indicate that for SegNet, in-sample classification performance is not impacted significantly when depriving the classifier of nDSM information. In fact, accuracy drops less than 0.5% for either SegNet classifier between the nDSM & IRRG case and the IRRG-only case. For SVM, we observe a more significant drop in in-sample accuracy of 9% between the nDSM & IRRG case and the IRRG-only case. However, unlike in-sample performance, total accuracy for out-of-sample validation drops by 25%, from 65% to 40%, for SegNet, by 5%, from 50% to 45%, for SegNet Lite, and by 13%, from 54% to 41%, for SVM, when excluding nDSM information from training. Performance losses are noticeable across all objects of interest. Although the nDSM-only classifier performs worst in-sample, for SegNet it outperforms the IRRG-only classifier by 8% out-of-sample, and for SegNet Lite it outperforms the IRRG-only classifier by 5% out-of-sample. For comparison, Table 6 lists the performance metrics when using the SVM classifier for the ISPRS datasets. As expected, overall performance drops significantly. Figure 7 shows qualitative results for the SegNet architecture when generating predictions using the three training cases. Ground truth is annotated using color-coding for roads (white), buildings (blue), low vegetation (cyan), trees (green), cars (yellow) and clutter (red).
Again, without nDSM information, misclassifications occur between buildings, low vegetation, trees and roads. Figure 8 presents the corresponding qualitative results for the SegNet Lite architecture.

USSOCOM Dataset Results
Figures 9 and 10 summarize the resulting performance for our SegNet and SVM models using quantitative binary classification metrics: accuracy, precision, recall, F1 score, and false-negative and false-positive rates. Classifiers are color-coded as follows: nDSM & RGB in blue, RGB-only in green, and nDSM-only in yellow.
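These binary metrics follow directly from the pixel-level confusion matrix; a short sketch with toy predictions:

```python
# Sketch: binary building-footprint metrics from a pixel-level confusion matrix.
import numpy as np

y_true = np.array([1, 1, 1, 0, 0, 0, 0, 1])  # toy ground-truth pixels
y_pred = np.array([1, 1, 0, 0, 0, 1, 0, 1])  # toy predicted pixels

tp = np.sum((y_pred == 1) & (y_true == 1))   # true positives
tn = np.sum((y_pred == 0) & (y_true == 0))   # true negatives
fp = np.sum((y_pred == 1) & (y_true == 0))   # false positives
fn = np.sum((y_pred == 0) & (y_true == 1))   # false negatives

accuracy = (tp + tn) / len(y_true)
precision = tp / (tp + fp)
recall = tp / (tp + fn)                      # true-positive rate
f1 = 2 * precision * recall / (precision + recall)
fnr = fn / (fn + tp)                         # false-negative rate
fpr = fp / (fp + tn)                         # false-positive rate

assert accuracy == 0.75 and f1 == 0.75
```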
For SVM (Figure 10), the left three bars show in-sample performance, i.e. testing was performed on the same 144 tiles that the classifiers were trained on, while the right three bars represent out-of-sample performance, i.e. testing was performed using 144 unseen Tampa tiles. Note that JAX corresponds to out-of-sample testing with tiles from Jacksonville, and RIC corresponds to out-of-sample testing with tiles from Richmond.

Figure 10: In-sample and out-of-sample building classification performance for the USSOCOM dataset using SVM classifiers. Classifiers are color-coded: nDSM & RGB in blue, RGB-only in green, and nDSM-only in yellow.
Petrich et al. Data Science Journal DOI: 10.5334/dsj-2021-020

For SegNet (Figure 9), the left three bars show out-of-sample performance for procedure 1 (testing on 144 tiles over Jacksonville), while the right three bars represent out-of-sample performance for procedure 2 (testing on 144 tiles over Richmond). Figure 10 indicates that in-sample performance (left-most three bars in all six subplots) decreases only slightly when using RGB (green) or nDSM (yellow) information only, as compared to the combined nDSM & RGB classifier (blue). Note that the RGB classifier slightly outperforms the nDSM classifier in in-sample performance. However, Figures 9 and 10 indicate that performance differs significantly when testing the trained classifiers on a previously unseen test dataset, here USSOCOM Jacksonville or Richmond tiles (SegNet) and USSOCOM Tampa tiles (SVM). In addition to the overall performance discrepancies between the three classifiers for both SegNet and SVM, it becomes evident that accuracy drops only 10% when using only nDSM data as compared to 15% when using only RGB information (SVM; see upper left plot in Figure 10). For the SegNet classifiers, we observe that classifiers trained with RGB & nDSM information exhibit an average 0.6% higher out-of-sample accuracy than classifiers trained on RGB information alone. These results support the hypothesis that nDSM information facilitates greater classifier generalizability, as compared to RGB information alone, for building classification tasks. Figure 11 presents the qualitative out-of-sample performance for all three SVM classifiers. From left to right, the upper row shows the training data RGB imagery, nDSM information, and ground truth, i.e. annotated building footprints, for Tampa tile #014. From left to right, the lower row shows predicted building footprints when training on (i) RGB information only, (ii) nDSM information only, and (iii) combined RGB & nDSM information.
It is clear that the RGB & nDSM classifier on the lower right provides the best correlation with the actual ground truth (upper right). However, specific misclassifications occur for the other two cases. For example, when using nDSM information only, taller non-building objects such as trees are associated with a higher misclassification rate. In contrast, when using RGB information only, objects such as roads are often misclassified as buildings. However, when combining RGB and nDSM information, the number of misclassifications (both Type I and Type II) is significantly reduced.

VALIDATION USING SMALL SAMPLE PROPORTION
In this section, the importance of 3D surface information is further tested in classification scenarios with scarce training samples. Sufficient data with adequate representation and viewing angles for all objects of interest may not always be available, particularly for remote sensing applications. Therefore, we train and test the two classification architectures from Section 2.3 while successively decreasing the number of training samples.

Figure 11 Qualitative out-of-sample classification performance for SVM classifiers applied to USSOCOM data. From left to right, the upper row shows RGB imagery, nDSM (DSM-DTM) information, and ground truth, i.e. annotated building footprints, for Tampa tile #014. From left to right, the lower row shows predicted building footprints when training on (i) nDSM information only, (ii) RGB imagery only, and (iii) combined RGB & nDSM information.

SVM classification in Section 3.1 was carried out using all 144 Jacksonville tiles from the USSOCOM dataset. The 600 million training samples, i.e. annotated pixels, were randomly down-selected by a factor of 1,000 to 600,000 training samples, which corresponds to a sample proportion for training of 0.1%. For the following analysis, we further decrease the sample proportion to 0.01%, 0.001%, and 0.0001%, thereby reducing the total number of training samples to 60,000, 6,000, and 600, respectively. Table 9 presents the resulting average training times for all three SVM classifiers, highlighting the orders of magnitude separating the different test cases. Clearly, the underlying numerical optimization completes significantly faster when fewer training samples need to be fit. Figure 12 displays the resulting in-sample and out-of-sample classification performance for the three SVM classifiers, RGB-only (red), nDSM-only (blue), and RGB & nDSM (black), as a function of the sample proportion for training.
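The random down-selection of annotated pixels described above can be sketched as follows. Function and variable names are illustrative; beyond random selection, the paper does not specify the exact sampling scheme:

```python
import numpy as np

def downselect(features, labels, proportion, seed=0):
    """Randomly keep `proportion` of all annotated pixels for SVM training.

    features: (N, D) array of per-pixel feature vectors; labels: (N,).
    A fixed seed is used here only for reproducibility of the sketch.
    """
    rng = np.random.default_rng(seed)
    n_keep = max(1, int(round(len(labels) * proportion)))
    idx = rng.choice(len(labels), size=n_keep, replace=False)
    return features[idx], labels[idx]

# e.g. 600 million annotated pixels at proportion 0.001 (0.1%)
# yields 600,000 training samples, as in Section 3.1
```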
Here, performance is measured in accuracy (left plot), F1-score (center plot), and error rate (right plot). All metrics assume class balancing. In-sample performance is plotted as dotted lines, while out-of-sample performance is plotted as solid lines. As training and test samples are selected randomly, five trials were conducted for each test case; vertical bars in Figure 12 indicate the standard deviation of the particular performance metric over those five trials.

As discussed in the previous section, the RGB & nDSM classifier provides the best in-sample performance, at 93% accuracy, when using 0.1% of all training data. In-sample performance for the RGB-only and nDSM-only classifiers is 85% and 83%, respectively. In-sample accuracy (dotted lines) increases for all three classifiers as the sample proportion for training decreases, because a smaller training set is easier to fit. However, out-of-sample performance for all classifiers decreases with decreasing sample proportion for training, indicating that the resulting classifiers lose generalizability when trained on smaller training sets due to overfitting. For out-of-sample performance, the nDSM-only classifier outperforms the RGB-only classifier, which further affirms the findings from Section 3.1. Interestingly, the nDSM-only classifier even outperforms the combined RGB & nDSM classifier in the 0.0001% case. This result may relate to the curse of dimensionality (Keogh and Mueen, 2017), as the nDSM classifier operates in a reduced feature space of 25 dimensions (see Section 2.3.1), while the combined RGB & nDSM classifier operates in 100 dimensions. In general, if training data is scarce, a reduced feature space can improve generalizability by avoiding overfitting.
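As a concrete illustration, the class-balanced metrics reported in Figure 12 could be computed for a binary building mask along the following lines. The exact weighting scheme used in the paper is not stated, so equal per-class weighting (balanced accuracy) is an assumption here:

```python
import numpy as np

def balanced_metrics(y_true, y_pred):
    """Class-balanced accuracy, F1-score, and error rate for binary masks.

    Weights both classes equally regardless of prevalence. This mirrors
    the 'class balancing' assumption, with the exact scheme assumed.
    """
    y_true = np.asarray(y_true, bool)
    y_pred = np.asarray(y_pred, bool)
    tp = np.sum(y_true & y_pred)
    tn = np.sum(~y_true & ~y_pred)
    fp = np.sum(~y_true & y_pred)
    fn = np.sum(y_true & ~y_pred)
    recall = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    accuracy = 0.5 * (recall + specificity)   # balanced accuracy
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return accuracy, f1, 1.0 - accuracy       # error rate = 1 - accuracy
```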
In addition to the SVM classifiers, we conduct the same small-sample-proportion validation analysis for the two SegNet architectures from Section 2.3.2. Figure 13 displays the results when training the SegNet and SegNet Lite classifiers with 15%, 25%, 50%, and 100% of the data. To obtain a subset of the data, we select a random point in each image and crop a region whose width and height equal the desired fraction of the full image; this cropped region is then used for training. As before, training was carried out using 12 ISPRS Vaihingen tiles. Testing was then performed for three cases: (i) in-sample/in-city (using the 12 Vaihingen tiles that were used for training), (ii) out-of-sample/in-city (using the remaining 4 Vaihingen tiles not used for training), and (iii) out-of-sample/out-of-city (using all ISPRS Potsdam tiles). Out-of-sample/cross-city accuracy across the SegNet and SegNet Lite models with and without nDSM generally indicates a mild positive correlation between the portion of data used and accuracy, suggesting that 50% of the data for a given city might be sufficient for full-city classification.
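The cropping-based subsetting described above might look as follows in outline; function names and the handling of window placement are assumptions, not details from the paper:

```python
import numpy as np

def random_crop_fraction(image, mask, fraction, rng=None):
    """Crop a window whose side lengths are `fraction` of the full tile.

    image: (H, W, C) array; mask: (H, W) label array. A random top-left
    corner is chosen so the crop fits entirely inside the tile.
    """
    rng = rng or np.random.default_rng()
    h, w = image.shape[:2]
    ch = max(1, int(h * fraction))
    cw = max(1, int(w * fraction))
    top = rng.integers(0, h - ch + 1)
    left = rng.integers(0, w - cw + 1)
    return (image[top:top + ch, left:left + cw],
            mask[top:top + ch, left:left + cw])
```

Note that with side lengths scaled by the fraction, the cropped area is the fraction squared of the tile area; the sketch follows the per-side interpretation stated in the text.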
In-sample/in-city accuracy across the SegNet and SegNet Lite models with and without nDSM exhibits a negative correlation between the portion of the dataset used and accuracy. As with the SVM classifier, this can be attributed to the network having fewer samples to fit, and therefore being able to overfit to the scarce training set. Lastly, the non-nDSM-trained SegNet model exhibits a negative correlation between accuracy and training proportion with regard to cross-city performance.

Figure 12 Impact of sample proportion on in-sample (dotted lines) and out-of-sample (solid lines) SVM classification performance on the USSOCOM Jacksonville, FL dataset. The study compares three input data scenarios, (a) RGB & nDSM (black), (b) RGB-only (red), and (c) nDSM-only (blue). From left to right, the individual plots show accuracy, F1-score, and error rate as a function of sample proportion.

Figure 13
Impact of sample proportion on classification performance using SegNet (left) and SegNet Lite (right) on ISPRS data.