Additions to the Last Millennium Reanalysis Multi-Proxy Database

Progress in paleoclimatology increasingly occurs via data syntheses. We describe additions to a collection prepared for use in paleoclimate state estimation, specifically the Last Millennium Reanalysis (LMR). The 2290 additional series include 2152 tree ring chronologies and 138 other series. They supplement the collection used previously and together form a database titled LMRdb 1.0.0. The additional data draws from lake core, ice core, coral, speleothem, and tree ring archives, using published data primarily from the NOAA Paleoclimatology archive and a set of tree ring width chronologies standardized from raw International Tree Ring Data Bank ring width series. In contrast to many previous paleo compilations, the data were not selected (screened) on the basis of their environmental correlation, multi-century length, or other attributes. The inclusion of proxies sensitive to moisture and other environmental variables expands their use in data assimilation. A preliminary calibration using linear regression with mean annual temperature reveals characteristics of the proxy series and their relationship to temperature, as well as the noise and error characteristics of the records. The additional records are structured as individual files in the NOAA Paleoclimatology format and archived at NOAA Paleoclimatology (Anderson et al. 2018) and will continue to be improved and expanded as part of the LMR Project. The additions represent a four-fold increase in the number of records available for assimilation, provide expanded geographic coverage, and add additional proxy variables. Applications include data assimilation, proxy system model development, and paleoclimate reconstruction using climate field reconstruction and other methods.


Background
Climate and environmental reconstructions from paleo proxy records often emphasize collections of proxy records. Collections expand the spatial and temporal extent of the data, reduce noise, and can enable the reconstruction of multiple climate variables. Following the pioneering multi-proxy temperature reconstruction by Mann et al. (1998), many different reconstructions, based on unique and overlapping collections of proxy records, have been assembled. Most reconstructions have focused on the Northern Hemisphere land surface where observations are concentrated (Mann, et al. 2008;PAGES 2k Consortium 2013;Masson-Delmotte et al. 2013). Recent compilations have expanded networks in the Southern Hemisphere (Neukom et al. 2014), the oceans (Tierney et al. 2015), the Arctic (McKay and Kaufman, 2014), and the globe (PAGES Consortium, 2017). Compilations of moisture-sensitive proxies are being developed using multiple proxies (Smerdon et al. 2017) and tree ring databases (Cook et al. 2009;Cook et al. 2010). Some of these moisturesensitive chronologies are included here through incorporation of the Breitenmoser et. al. (2014) collection. Previously, the Last Millennium Reanalysis (LMR) (Hakim et al. 2016;Emile-Geay et al. 2018) used the PAGES 2k Consortium (2013) temperature-sensitive multi-proxy database, and subsequently used the PAGES 2k Consortium (2017) update (consisting of 571 records) archived at NOAA (PAGES Consortium (2017a). Additions were sought that would expand the geographic coverage and the sensitivity to different environmental variables. The additional data should be amenable to the development of quantitative proxy system models, including proxy error variance estimates needed for assimilation. Moreover, the data must be structured (i.e. consistently formatted) so that thousands of records, each consisting of tens to hundreds of years of observations, can be parsed by the assimilation algorithm. For these reasons we developed a method (and available code ( Table 1)) to quality-control, standardize, and structure a set of records consisting primarily of existing (published) paleo proxy time series.
Our goal is to produce a collection of paleoclimate proxy time series, in their original data units, with minimal processing, add additional metadata about the series characteristics, and provide structured time series in individual files for reanalysis, reconstruction, proxy system model development, and other applications. The additions described here are merged with a collection of 571 records described previously (PAGES 2k Consortium 2017) in a processing step of the LMR code to create a comprehensive, integrated and assimilation-ready proxy database termed LMRdb v1.0.0. The scope of the additions is broad and, in contrast to the PAGES 2k Consortium (2017) data, minimally screened (screening refers to inclusion or rejection of series based on criteria). We added proxies from five archives (tree rings, corals and sclerosponges, ice cores, lake cores, and speleothems ( Table 2)). The original series are annually, sub-annually, or super-annually sampled. Short records (less than 30 years length) are included and the maximum length spans the Holocene (10,000 years). All of the records are hypothesized to have some relationship with environmental variables that can be quantified as either an empirical linear univariate or bivariate regression or a physically based proxy system model. The LMR framework includes a data pre-processing step (which can include screening or filtering records based on certain criteria), and a proxy system model step. The model step develops a statistical model relating proxy observations to environmental conditions and quantifies uncertainty in observations required during the assimilation step. The code is designed so that this statistical approach can be replaced by more sophisticated proxy system models. In the final assimilation step the proxy observations are compared with estimates derived from the prior state using the proxy system model. The additional records described here are all-encompassing, from many proxy sources and locations, of varying length and resolution, minimallyscreened, for the broadest application. The files are structured and archived as NOAA-formatted files (NOAA Paleoclimatology 2018) to ensure that sufficient metadata accompany the records so that transformations to other formats, such as LiPD (McKay and Emile-Geay 2016), are readily achievable. The original references appear in the header of individual proxy files and should be cited when a record is used.
The selection criteria for the additional records are guided by the needs of the LMR and by the needs of proxy system model development (Dee et al. 2015;Evans et al. 2013). The LMR project aims to make code and data publically available to facilitate use and reuse, thus only publicly accessible or already-archived data are considered. In selecting data to add we targeted data for which forward proxy system models have already been developed, including tree rings (VS-lite; Tolwinski-Ward et al. 2011), corals (Thompson et al. 2011), and ice cores (Dee et al. 2015). In addition to their use in assimilation, availability of proxy data can improve proxy system model development (Dee et al. 2015). Two recent paleo data assimilation efforts (Hakim et al. 2016;Franke et al. 2017) have been designed to use proxy data that are annually or better resolved and that overlap with the instrumental period, over which required characteristics such as the error can be assessed and regression models developed. A third assimilation effort ) used a draft version of the data described here. The additions focus on proxy time series that overlap the instrumental period. All three assimilation efforts are designed to analyze many of the variables within the climate system Table 1: Code used to process and analyze the collection and available on GitHub .

Code Purpose
LMR_proxy_preprocess.py Process NOAA text files, perform QC, standardize on LMR conventions, build data frames and save frames as Python Pickle files.
LMR_PSMbuild.py Build proxy system models using user-specified calibration files and options. Currently linear univariate and bivariate regressions are supported; classes allow for other models to be developed.
summarize_proxy_database.py Calculates summary statistics for the collection organized by archive types listed in Table 2. summarize_linearPSM.py Analyzes linear univariate regression models.
that covary with proxy records, encouraging incorporation of proxies sensitive to moisture availability, sunlight, surface pressure, water isotope ratios, and other aspects of the climate.

Data Processing
The workflow consists of 1) adding metadata and information to candidate records needed to meet NOAA archive requirements and structuring the data time series and metadata in the NOAA text format and, and 2) preprocessing the data to identify duplicates and errors, standardize variable names, and average the data to annual values. The workflow is described in the following steps followed by archive-specific notes.
Step 1. Obtain proxy time series and metadata and format in concordance with the standard developed for NOAA text files (NOAA Paleoclimatology Program 2018). Additional metadata and notes were added but the data were not altered. The Breitenmoser et al. (2014) chronologies and metadata were obtained from the authors as Matlab files. The sensitivity of a series to the environment during a specific season was provided by PAGES 2k Consortium (2017) authors, assigned by us for the Breitenmoser et al. (2014) trees (described below), or obtained if it existed from the original metadata.
Step 2. Process individual files. Identify and eliminate duplicates (see notes below): translate the original archive types and variable names to LMR types and names, compute annual January-December means (optionally April-March), optionally gaussianize the time series, make coarse checks on the time variable, standardize the time variable to Year C.E., read metadata (site name, latitude, longitude, elevation, investigators, variable attributes (including variable name and units)), read the numerical data (age and values), check the earliest year and most recent year metadata fields with the numerical data, arrange series from oldest to youngest, and identify data as missing (e.g. data reported as NaN) for years where valid data are not available. The guassianize option is provided because some paleo proxies, notably lake records, have a baseline zero value with excursions from zero which are not successfully modeled by a linear regression. Ranking the data by point value and transforming so the distribution has a mean equal to zero while preserving the rank can improve the linear regression model. Data frames are then built consisting of 1) metadata and 2) numerical data (i.e. the actual proxy time series). The data frames are available from the LMR project documentation web pages . A critical step in compiling heterogeneous paleo data from multiple sources is to determine which variables represent the same quantity, and can be treated identically by proxy system models. The identification of similar variables is listed in Table 2, and accomplished in the processing step, leaving the variable names in the archive unchanged. In cases where suspect data or metadata were found the data series was rejected.
No data values were changed. A Python program labeled LMR_proxy_preprocess.py performs this step. The processed data are not archived because many different processing options may be selected. Instead, the raw data are archived and the processing code made available ). The approach facilitates adding additional raw data that may be processed in the same way. A working version of the processed data is available for use with the LMR code . The production version of the code, including the preprocessing code, is available from GitHub .
The primary emphasis of the processing step and its description here is to document how the data are transformed, and how heterogenous time conventions and variable names are treated so that disparate data sets produced by different authors can be made amenable to data assimilation, while preserving the original characteristics of the data in the archive. These transformations involve several choices regarding processing that may be changed depending on the application.

Tree Rings
A refined set of International Tree Ring Data Bank (ITRDB) chronologies was produced by Breitenmoser et al. (2014) for use in data assimilation and forward modeling using VS-lite, and these chronologies were incorporated with minor changes and metadata added after receiving the data from the authors as Matlab files. Breitenmoser et al. (2014) developed their chronologies by first editing the raw ring width data and metadata and then applying a uniform and consistent detrending procedure and compositing to produce site chronologies. The editing increased the total number of raw ring width files that could be processed but eliminated a small number of files that contained unresolvable issues. The application of uniform processing to all sites produces a collection that differs from existing ITRDB chronologies which have been developed by different authors using different types of detrending in the standardization. As described in Breitenmoser et al. (2014) the program ARSTAN (Cook 1985) was used for processing. A hierarchical approach was used to select the ARSTAN processing options. The first preference was the negative exponential detrend, and the second preference linear detrending where appropriate. A smoothing spline with a low-frequency cut-off was used in some cases. The chronologies were developed using a biweight robust mean estimate, with the variance stabilized. After applying two quality-control criteria (every point in a chronology must be developed from at least 8 points, and the 1901-1970 period must be fully represented) the processing yielded 2287 chronologies. ARSTAN processing was only performed on the years 1600 and after in Breitenmoser, et al. (2014). Approximately 500 chronologies extend prior to 1600. For these longer series, we processed the raw ring widths obtained from the authors using the 2014 update of the ARSTAN program (44xp), selecting processing options that yielded a close match during the post-1600 interval. In most cases the selected coefficients were the ARSTAN default. For the first detrending, the default negative exponential option was selected (option 1, negative exponential curve (k > 0)). If that created any divide-by-zero issues, a smoothing spline was applied (with a value of −75, denoting the use of a percent smoothing rather than an absolute length). Variance stabilization was also applied (ARSTAN menu option [13][1]). No data transformation was used (ARSTAN menu item [3]). The result yields 2287 consistently processed and minimally quality-screened chronologies. We eliminated the chronologies that appear in the PAGES 2k Consortium (2017) database because for all duplicate proxy records (including ice cores and corals) we retained the PAGES 2k Consortium (2017) version owing to its greater scrutiny and quality control (PAGES 2k Consortium, 2017). Breitenmoser et al. (2014) calculated VS-lite parameters (i.e. M1/M2 and T1/T2 soil moisture and temperature thresholds), as well as VS-Lite-based determinations of each chronologies' sensitivity to climate (precipitation/temperature) and these statistics were transferred from the matlab files to the metadata. Generalized seasonalities are included in the metadata according to the following rule: tropical trees (23.5°S to 23.5°N) were denoted as year-round records, NH trees (23.5°N to 90°N) denoted Boreal summer records (JJA), and SH trees (90°S to 23.5°S) denoted Austral summer records (DJF).

Corals
Twenty-seven coral records were added, including the Palmyra coral oxygen isotope series obtained from Linked Earth (http://wiki.linked.earth/Palmyra.Cobb.2013) (documentation appears in the file metadata). Since the publication of Cobb et al. (2003), many hundreds of measurements have been made at seasonal resolution at the same sites. In parallel, revisions to the U/Th dates have substantially altered the timing of the record. The updated record was published in Emile-Geay et al. (2013), but was not incorporated in the Ocean2k synthesis (Tierney et al. 2015).

Speleothems
Twenty speleothem records identified during discussions at the first LMR workshop were added. In contrast to the PAGES speleothem records the additions were not screened for temperature sensitivity. The speleothem records are from low latitudes including six from the Asian monsoon region.

Ice Cores
We added 77 ice core oxygen isotope records with the ultimate goal of using them with forward models (Dee et al. 2015). In contrast to the 39 PAGES 2k Consortium (2017) records they are not screened for temperature.

Data Description and Linear Regression Summary
The geographical and temporal distribution of the additional data and the complete database are shown in Figure 1. The archive types are unevenly distributed in time and space. Coral records (27) dominate the tropical ocean, ice core records (88) dominate at latitudes greater than 60 degrees, and the middle northern latitudes are dominated by tree rings. The Breitenmoser et al. (2014) chronologies expand the previous number of records four-fold. The additions are dominated by trees from the Northern Hemisphere mid-latitude continental areas. Going backwards in time the data density decreases rapidly from ca. 1875 to a much smaller number of sites prior to 1600, however some of the sites with data at 1600 also have observations back to 1 CE. The addition of the Breitenmoser et al. (2014) chronologies particularly increases the number of observations in the 20th century. Tree ring chronologies decrease in number in the latter part of the 20th century because many of the chronologies were collected prior to 1980. There are less than ten series within each of the following archive variable categories ( Table 3): tree ring isotopes, bivalve δ 18 O, ice core accumulation, ice core melt feature, lake core varve, lake core misc, and marine core δ 18 O.
Within the current LMR framework, statistical relationships between proxies and key climate variables (e.g. temperature) are used as forward models (i.e. proxy system models) within the data assimilation procedure. To provide a preliminary assessment of the sensitivity of the additional data for this report, linear univariate regressions were calculated for individual proxy time series using annually-averaged calibration data (we used mean annual temperature from GISTEMP v4 (GISTEMP Team, 2018)) available over the instrumentalera (1850 to 2015). The regression parameters are similar to the parameters used by Hakim et al. (2016) on PAGES 2k Consortium (2013) proxy time series. The models are formulated as the relationship (slope (m) and intercept) between the expected proxy value (y e ) and the independent environmental variables (x, in this case temperature) (for example, y e = m*x + intercept). Regression statistics are included on the ' dashboard' graphical data summary pages and the archive summaries. Correlations (R 2 ) relate the expected to the observed values. The program labeled LMR_PSMbuild.py performs this step. The linear univariate annual regressions are intended to be a first step in data exploration, extending the approach used by Hakim et al. (2016) to the the additional data. Other models, particularly models including moisture related variables, can be explored and may be more appropriate for the moisture-sensitive proxies that we have added. These comparisons will be reported elsewhere.
The development of series-specific regression models reduces the number of records that can be assimilated from 2290 to 2244. The linear regression to annual mean temperature used by Hakim et al. (2016) is only one of many possible relationships with the environment, and indeed the purpose of structuring this collection and including moisture-sensitive proxies such as the Breitenmoser et al. (2014) chronologies and ice core oxygen isotopes and coral oxygen isotopes is to support exploration of alternative models, including bivariate regression models (temperature and moisture) and other proxy system models (Dee et al. 2015). Cases where a regression was not produced are due to lack of observations, either proxy observations or calibration observations near the proxy location during the target calibration interval . This reduces the ice core records from 77 to 61, corals from 19 to 11, and speleothem records from 20 to 6 ( Table 3). The criteria used here follow Hakim et al. (2016): a linear regression of the annualized proxy values to annual mean temperature of the corresponding grid cell from the GISTEMP v4 temperature product, for the period 1850-2015, with a requirement of annual proxy resolution, and 50% data available over the period 1850-2015. No minimum correlation threshold is set for models. Records with poor fits have less weight in the assimilation.
The regression summary statistics ( Table 4) support consideration of all these archive types in assimilation. Statistics based on fewer than 10 records are struck-through and not described here, however we present the R 2 information on the dashboards for each record. Several measures are useful in assessing how well the regression models fit the observations. The correlation (R 2 ) directly reflects the fit of the regression model and appears for each record in the online supplement. The signal to noise ratio is used in archive summaries to compare different proxies. The coefficient of efficiency used by us (Hakim. et al. 2016) assesses the skill of the assimilation analysis. Here it is applied to the point-based reconstructions. Coral δ 18 O (37.82) has the highest signal-to-noise (SNS) ratio, followed by ice core δ 18 O. These records also have the highest coefficient of efficiency. The mean SNS decreases from 4.3 to 3.5 and mean CE decreases from 0.033 to 0.030 for chronologies when all trees (irrespective of their sensitivity to temperature or moisture) are considered compared to the chronologies screened for the sensitivity to temperature (not shown). This suggests that alternate regressions, involving moisture in addition to temperature, could provide more accurate proxy modeling. Nevertheless, differences remain small and the similar goodness-of-fit measures compared to the extensively screened PAGES 2k Consortium (2017) tree ring width set (not shown) justifies their consideration in the LMR data assimilation framework, particularly given the enhanced geographic coverage these chronologies provide, for example in the Southern Andes and Tasmania (Figure 1). Breitenmoser et al. (2014) note that 7% of the analyzed land grid cells contain only temperature sensitive series, 5% contain only moisture-limited series, and 37% contain both temperature and moisture-limited chronologies. The record-specific regression equations, along with the R 2 values, are reported in the online supplement (data dashboards).   Table 2, and the following time units: ['age', 'age_int', 'year',\'y_ad', 'Age_AD', 'age_AD', 'age_AD_ass', 'age_AD_int', 'Midpt_year', 'AD',\'age_yb1950', 'yb_1950', 'yrb _1950',\'kyb_1950',\'yb_1989', 'age_yb1989',\'yb_2000', 'yr_b2k', 'yb_2k', 'ky_b2k', 'kyb_2000', 'kyb_2k', 'kab2k', 'ka_b2k', 'kyr_b2k',\'ky_BP', 'kyr_BP', 'ka_BP', 'age_kaBP', 'yr_BP', 'calyr_BP', 'Age(yrBP)', 'age_calBP', 'cal yr BP']. Files not using these time and variable names will not be processed. 5. In addition to the code used for data assimilation, LMR code can be used to analyze the collection ( Table 2). Proxy system models can be built, and the characteristics of the data and the models can be analyzed using LMR_PSMbuild.py, summarize_proxy_database.py, and summarize_linearPSM. py codes.

Summary and Application
We describe 2290 additions to 571 previously available proxy records, assembled and structured to facilitate reanalysis of the last millennium. The two data sets are merged by processing software to form the Last Millennium Reanalysis dataset version LMRdb v1.0.0. The additions provide a four-fold increase in the number of sites, improve the data density in the Southern Hemisphere, and massively increase the density of Northern Hemisphere tree ring chronologies spanning the last two centuries. The collection consists of observations in their native units and is all-encompassing, minimally-screened, and minimally-processed. Many of the additional records, notably tree ring widths and ice and cave speleothem oxygen isotopes are sensitive to precipitation and other hydrologic variables, and are expected to enable the reanalysis of variables other than temperature, including precipitation, drought indices, and surface pressure. The intended applications include 1) the development and testing of proxy system models, 2) climate reconstruction using various methods such as climate field reconstructions, and 3) reanalysis combining proxy system models and reconstructions derived from model simulations.