The European Space Agency (ESA) has the mandate to assure the long-term preservation, sharing and exploitation of space data and its associated knowledge. ESA’s aim is to turn space exploration and space-related activities into an overall societal project involving a wide variety of stakeholders. To this end, it brings together and coordinates as many countries as possible under the banner of space missions. It is a basic principle that ESA deals with its stakeholders openly and with real transparency, an approach that has contributed to its long-term success.
The Earth Observation (EO) Data Preservation System has the main objective of providing the required infrastructure and services to assure ESA and Third Party Missions (TPM) EO Data Records and Associated Knowledge preservation and accessibility, and to support the cooperation activities with national and international organizations in the data preservation domain. The generic “EO Missions/Sensors Preserved Dataset” content includes Data Records and Associated Knowledge.
The main components of the Earth Observation (EO) Data Preservation System involved in the preservation process are the Management System for Data and Associated Knowledge (KMS), the Data Information System (DIS), the Master Archive (MAR) and the Cold Back-up Archive (CBA). Figure 1 shows an overview of the EO Data Preservation System.
The Master Archive and the Cold Back-up Archive cover the archiving functionalities, whereas the Data Information System (DIS) provides the information, the history and the provenance of the archived data. The DIS concept arises from the service requirements that ESA included in the Data Service Initiative (DSI) programme, which models systems and processes to enable the management of ESA’s EO data as a standard asset. These requirements ensure that the metadata for each product is collected in a database and that the data element itself is systematically stored, with full tracking of all value added to the data during the service activities. The archived space data are in EO-SIP format, the standard for most ESA Earth Observation missions. An EO-SIP is a package that includes the actual product in its native format, a quality report, a quick-look picture and metadata.
The metadata is an XML file following the OGC standard “Earth Observation profile of Observations & Measurements (OGC EOP O&M)”, OGC 10–157. The underlying database, which stores all this information, is well suited to reporting on the activities within the DSI and other services, and also to presenting a zoomable level of detail for any EO data asset held by ESA, which makes the tool useful to staff involved in operational support to management and decision-making. In addition, there are other sources of operational data and repositories around the data payload ground segment. The value of the information contained in all these systems is enhanced by providing a harmonised, service-independent view and control of the data assets held across systems, enabling end-to-end operational analysis of data assets to pinpoint changes, errors or discrepancies. The crucial factor determining the success of the DIS is its ability to receive the metadata, recognise products across source services, reconcile the information obtained and produce an unequivocal set of attributes for any data asset. This paper describes the features of the system and the relevant preservation processes.
The ESA Master Archive is implemented through a dedicated service (DAS) awarded to industry through an open Invitation To Tender (ITT) in 2016. ESA has outsourced EO data archiving activities to a single provider with the goal of benefiting from economies of scale and from the standardization of data archival and delivery processes and interfaces.
These archiving activities shall cover:
The industrial consortium consists of ACRI-ST (FR), adwäisEO (LU) and KSAT (NO), with the following distribution of roles (Figure 2):
In order to guarantee data safety, the archive is distributed between two locations more than 200 km apart: a master archive in Luxembourg and a backup archive in Sophia Antipolis (France). The operational archival flow between the two facilities is depicted in Figure 1, which also lists the technical solutions for the infrastructure in each data centre. Several checks on the reliability of the data transfer process are performed along the flow, verifying that data are not corrupted during transcription or during the copy to the backup centre and that they remain readable on tape.
The Master Archive infrastructure is mainly based on two similar Quantum iScalar 6000 libraries connected via 10+ GE and 16 Gbps SAN links to DELL M1000e + M6x0 blade enclosure and servers. A StorNext System is used to manage the data in Hierarchical Storage Management (HSM) mode. In this environment, the disk structure containing the data is exported to NFS clients as a classical Linux volume and specific policies determine the way to store the data (keep on disk and/or tape, automatic generation of several copies…). Data are copied to LTO7 tapes in native Quantum format (ANTF):
Finally, the content of the temporary tape is restored in the Backup Centre library.
Figure 3 provides a high-level view of the full process. This technical solution offers full scalability and is expected to cope with ESA requirements in case of unexpected growth in service needs.
One of the most important activities performed by the Master Archive service is the data quality check, which is performed not in terms of scientific content but in terms of reliability of the process during data transfer. This verification includes the points summarized in the following questions:
Figure 4 shows the verifications performed at the various stages of data ingestion into the archive.
Table 1 lists the checks performed during data ingestion.
| # | Archive | Phase | Verification |
|---|---|---|---|
| 1 | Main | Ingestion | The list of files is compared to any delivered inventory. |
| 1’ | Main | Ingestion | The structure and the content (repository, datasets, products, files) is compared with the delivery spreadsheet delivered by ESA. |
| 2 | Main | Ingestion | The total number of products is compared to the delivery information. |
| 3 | Main | Ingestion | The actual file MD5 checksum is compared to the value extracted from the product metadata (manifest file or attached checksum file). |
| 3’ | Main | Ingestion | If data is included in a container (zip, tgz, …), the integrity of the container is verified (i.e. container content can be accessed). |
| 4 | Main | Ingestion | After the generation of the zip container, the files are extracted in a scratch directory and checked with respect to the content of the checksum.txt file. This operation is logged for future use by the global verification of the dataset ingestion. |
| 5 | Main | Ingestion | MD5 hash code computed by Quantum from the products are queried from the StorNext database and compared to the DMPC MD5 hash code before products are set in “TAPE” status. |
| 6 | Both | Ingestion | LTO-7 drives apply an automatic verify-after-write technology to immediately check the data as it is being written. |
| 7 | Both | Backend | Both Quantum libraries include EDLM. The Extended Data Life Management feature ensures that tapes are trouble-free (based on tape scan and tape memory analysis). Tapes scan and analysis is performed following predefined policies (max. 4 tapes per day i.e. 7% of a 10 PBytes archive per month using 2 LTO7 EDLM drives). Suspect tapes are automatically copied to new tapes. |
| 8 | Main | Validation | After the ingestion process where data has been verified at product level, a global ingestion verification is performed using the DPMC database information, the initial media inventory and the ingestion process log files. |
| 9 | Backup | Ingestion | The inventory of the tapes sent to the backup centre is retrieved and used for comparison with the copy process performed to copy the products from temporary tapes to ANTF tapes via disk cache. |
| 10 | Backup | Ingestion | The zip container integrity is used to verify that the transferred products have not been corrupted. |
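Checks 3, 3’ and 10 in Table 1 (MD5 comparison and container integrity) can be illustrated with a minimal Python sketch. This is not the code used by the Master Archive service: the function names, the in-memory container and the shape of the `expected_md5` mapping (file name to checksum, as it might be read from a manifest or checksum file) are assumptions made for illustration only.

```python
import hashlib
import io
import zipfile


def md5_of_bytes(data: bytes) -> str:
    """Return the hexadecimal MD5 digest of a byte string."""
    return hashlib.md5(data).hexdigest()


def verify_container(container: bytes, expected_md5: dict) -> list:
    """Check a zip container and the MD5 checksum of each expected member.

    `expected_md5` is a hypothetical mapping {member name: MD5 hex digest}.
    Returns a list of problems; an empty list means all checks passed.
    """
    try:
        archive = zipfile.ZipFile(io.BytesIO(container))
    except zipfile.BadZipFile:
        return ["container is not a readable zip file"]

    problems = []
    with archive:
        # testzip() reads every member and returns the first one with a bad CRC.
        bad_member = archive.testzip()
        if bad_member is not None:
            return [f"corrupted member: {bad_member}"]

        names = set(archive.namelist())
        for name, expected in expected_md5.items():
            if name not in names:
                problems.append(f"missing file: {name}")
            elif md5_of_bytes(archive.read(name)) != expected.lower():
                problems.append(f"checksum mismatch: {name}")
    return problems
```

An operational implementation would stream large files in chunks rather than read them into memory, and would log each result for the later global verification (check 8).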
The Master Archive (DAS) ingests data from historical missions according to a data ingestion plan constantly maintained by the ESA Data Librarian, as well as data coming from live missions (currently Cryosat-2, SMOS, ADM-Aeolus and OceanSat-2) upon formal request by the relevant ESA Operations Managers.
The EODAS Service Web Portal is updated on an hourly basis with fresh information coming from the core processes that drive the data archiving. Many views and filters are available to select the information of interest, and the results can be exported as PDF files or Excel tables for further analysis and statistics on the archived products.
The Cold Back-up Archive (CBA) was implemented at ESRIN (the ESA premises in Frascati, Italy) in 2014. Over the years, it has been upgraded to enhance performance and allow seamless archiving and extraction capabilities. It currently contains (Figure 9):
In the scope of the Inter-directorate Joint Activities, a dedicated circulator software has been implemented to allow reception of ESA Science data from both the internet and ESA WAN connected centres. The CBA is therefore being populated with:
ESA and Third-Party active mission data are circulated by the Preservation Element Front-End (PE-FE). Once the data have been ingested and validated, confirmation reports are sent to the Mission Payload Data Ground Segment via a standard network transfer protocol. Bulk data coming from reprocessing campaigns are circulated on storage devices. The DAS archive is the Master Archive.
The volume of data ingested by the CBA is 7.9 PB, for a total of 129 million files (Figure 10).
The CBA can handle the following data formats:
Data is handled by ingestion software developed specifically for this purpose, the Next Generation Archive (ngA) (Figure 11), which validates the data and extracts metadata when available. The ngA is used to ingest EO-SIP packages into the Cold Back-up Archive: it inspects each package and populates its database with data extracted from the metadata XML files included in the EO-SIP. This database can then be used for advanced queries.
The ngA replaces the former ingestion and archiving software, which had data compatibility constraints.
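The general idea of extracting metadata from an EO-SIP XML file and loading it into a queryable database can be sketched in Python as follows. This is an illustration under stated assumptions, not the ngA implementation: the sample document is a heavily simplified, hypothetical fragment in the style of OGC EOP O&M, the namespace URIs are assumed (they vary with the version of the standard), and the table layout, function names and product identifier are invented for the example.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Assumed namespace URIs; check them against the actual EO-SIP metadata files.
NS = {
    "eop": "http://www.opengis.net/eop/2.1",
    "om": "http://www.opengis.net/om/2.0",
    "gml": "http://www.opengis.net/gml/3.2",
}

# Hypothetical, heavily simplified metadata document (not a real product).
SAMPLE_METADATA = """
<eop:EarthObservation xmlns:eop="http://www.opengis.net/eop/2.1"
                      xmlns:om="http://www.opengis.net/om/2.0"
                      xmlns:gml="http://www.opengis.net/gml/3.2">
  <om:phenomenonTime>
    <gml:TimePeriod>
      <gml:beginPosition>2011-03-01T10:00:00Z</gml:beginPosition>
      <gml:endPosition>2011-03-01T10:05:00Z</gml:endPosition>
    </gml:TimePeriod>
  </om:phenomenonTime>
  <eop:metaDataProperty>
    <eop:EarthObservationMetaData>
      <eop:identifier>EXAMPLE_PRODUCT_0001</eop:identifier>
      <eop:productType>EXAMPLE_TYPE</eop:productType>
    </eop:EarthObservationMetaData>
  </eop:metaDataProperty>
</eop:EarthObservation>
"""


def extract_metadata(xml_text: str) -> dict:
    """Pull a few key attributes out of the metadata XML."""
    root = ET.fromstring(xml_text.strip())

    def text(path):
        node = root.find(path, NS)
        return node.text.strip() if node is not None and node.text else None

    return {
        "identifier": text(".//eop:identifier"),
        "product_type": text(".//eop:productType"),
        "begin_time": text(".//gml:beginPosition"),
        "end_time": text(".//gml:endPosition"),
    }


def open_inventory() -> sqlite3.Connection:
    """Create an in-memory inventory with a hypothetical table layout."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE products (identifier TEXT PRIMARY KEY, "
        "product_type TEXT, begin_time TEXT, end_time TEXT)"
    )
    return conn


def ingest(conn: sqlite3.Connection, xml_text: str) -> dict:
    """Validate the minimum metadata and store it in the inventory."""
    record = extract_metadata(xml_text)
    if record["identifier"] is None:
        raise ValueError("metadata has no product identifier")
    conn.execute(
        "INSERT OR REPLACE INTO products "
        "VALUES (:identifier, :product_type, :begin_time, :end_time)",
        record,
    )
    return record
```

Once populated, such an inventory can be queried by time range or product type with ordinary SQL, which is the kind of "advanced query" a metadata database makes possible.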
The infrastructure of the CBA consists of two robotic libraries, each capable of storing 24 PB of online data. The Main Library is a 3000-slot STK SL8500 equipped with 4 robotic arms, 8 T10000D drives and 10 T10000B drives. The Disaster Recovery Library is a 3000-slot STK SL3000 with 4 LTO-7 drives (due to be upgraded to LTO-8 in 2019).
Data is initially written to an SSD-based cache and then moved to the different storage tiers by the Oracle Hierarchical Storage Manager (Oracle HSM); the final tier is the tapes of the robotic libraries.
Synchronization between the two libraries is performed through a dedicated 16 Gb/s fibre optic connection.
The Data Information System (DIS) is a management support system based on EO data products. DIS is in charge of managing metadata information for all products generated, used and distributed by any activity (project or service) for all the ESA and Third Party Missions. DIS has been developed within the same Contract, which implements the service for data consolidation and reprocessing (DSI) as an extension of the internal inventory system. DSI is managed by Serco Italy.
The metadata available in DIS is extracted by the Master Archive from the products managed during the ingestion phase, as part of the archiving process that populates the local inventory. The format mainly used by ESA for product distribution is EO-SIP, in which metadata is explicitly provided in the package by dedicated XML files, but metadata is also extracted from all products that use common formats. All the information available is stored, including key attributes for data access and quality information when available.
The original core of the system has been upgraded to manage information about products available in many different systems, the relationships among them and the history of the data. DIS supports awareness and control of the data and of the value added to it, as the system keeps track of the metadata for all products stored at the different processing and archiving sites. DIS consists of a database repository, an Extraction, Transformation and Loading (ETL) module and a business presentation layer. DIS facilitates improved control of the EO data owned by ESA by concentrating in a single place information that is spread over many different external services, by supporting verification that data are aligned across all sites, and by supporting distribution of the right data to the activities requesting them. DIS is fed by different services at ESA; each service provides all the available information about its data, how they are organised and the changes applied to them. DIS collects this information and integrates what it has received to build up the full set of metadata for each product. A major requirement for DIS is to trace the history of the data: the changes applied in the past, the datasets that include a given product, the different versions of those datasets consolidated over time, and the version accessible to users requesting it (Figure 12).
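The integration step described above, in which metadata received from several services is combined into one set of attributes per product, can be sketched as follows. This is a minimal illustration, not the DIS ETL module: the record layout, the attribute names and the first-value-wins rule are assumptions, and a real system would need an explicit reconciliation policy for conflicting values.

```python
from collections import defaultdict


def consolidate(records):
    """Merge per-service metadata into one attribute set per product.

    `records` is an iterable of (service, product_id, attributes) tuples,
    where `attributes` is a dict. The first value seen for an attribute is
    kept (an assumed rule); later disagreeing values are reported as
    conflicts instead of silently overwriting.

    Returns (merged, conflicts), where each conflict is a tuple
    (product_id, attribute, first_service, disagreeing_service).
    """
    merged = defaultdict(dict)
    origin = defaultdict(dict)  # product -> attribute -> service that supplied it
    conflicts = []

    for service, product_id, attributes in records:
        for key, value in attributes.items():
            if key not in merged[product_id]:
                merged[product_id][key] = value
                origin[product_id][key] = service
            elif merged[product_id][key] != value:
                conflicts.append((product_id, key, origin[product_id][key], service))

    return dict(merged), conflicts


# Hypothetical input: two services reporting on the same products.
example_records = [
    ("DSI", "P1", {"size": 100, "md5": "aa"}),
    ("DAS", "P1", {"size": 100, "location": "tape"}),
    ("DAS", "P2", {"size": 5}),
    ("DSI", "P2", {"size": 6}),
]
```

With the example input, `P1` ends up with the union of the attributes reported by both services, while the disagreement on the size of `P2` is returned as a conflict for follow-up.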
DIS initially collected data from the DSI projects, and it is now being integrated with DAS. In the near future, additional services acting as data repositories, specifically the Cold Back-up and Dissemination Services, will be added to complete the picture. Further extensions of DIS coverage will be analysed later. The information in DIS is accessible through the Business Intelligence (BI) layer on top of it, which allows user-defined reports for monitoring and controlling data to be designed and published on the web. A set of predefined reports gives direct access to the most common views of the data, and dedicated reports are being developed for specific tasks. The presentation layer of DIS supports drill-down on the charts presented, allowing users to select the specific data of interest and move from high-level information to the full detail.
For each product, DIS stores and references not only metadata but also cross-references to other products and processes related to the product, such as:
Most of the above information is already available in the presentation layer, or will be added in the near future. Analysing large sets of data at product level cannot easily be managed, and therefore DIS provides full support for the dataset level, where products can be freely aggregated to highlight common characteristics. For each dataset, DIS stores:
DIS has been designed to support requirements from different user categories:
A campaign of interviews with all user profiles of the system allowed the design team to filter driving requirements and usage scenarios for the implementation of a stable core architecture of DIS. The reporting interface is subject to continuous improvements, as new requests for dedicated views on the data result in additional reports taking advantage of the powerful BI tool selected.
Access to the DIS reporting interface is web-based and supports user profiles for access to dedicated or restricted resources. Reports are grouped so that each user category has direct access to the reports of interest.
The provenance view on the DIS website shows graphically the relationships between a dataset and other datasets, together with the processes applied to it. This high-level view is simple, but it is very helpful for understanding the data involved (Figure 13).
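Provenance relationships of this kind can be modelled as a directed graph whose edges link a source dataset to a derived dataset through a process. The sketch below is an assumption about how such a lineage query could look, not the DIS data model; the dataset and process names are hypothetical.

```python
from collections import deque


def lineage(edges, dataset):
    """Return the ancestry of `dataset`, nearest steps first.

    `edges` is an iterable of (source_dataset, process, derived_dataset)
    tuples. The result is a list of the same kind of tuples, found by a
    breadth-first walk from `dataset` back towards its origins.
    """
    parents = {}
    for source, process, derived in edges:
        parents.setdefault(derived, []).append((source, process))

    steps = []
    seen = {dataset}
    queue = deque([dataset])
    while queue:
        current = queue.popleft()
        for source, process in parents.get(current, []):
            steps.append((source, process, current))
            if source not in seen:  # guard against revisiting shared ancestors
                seen.add(source)
                queue.append(source)
    return steps


# Hypothetical provenance records.
example_edges = [
    ("L0_raw", "consolidation", "L0_consolidated"),
    ("L0_consolidated", "reprocessing_v1", "L1_v1"),
    ("L0_consolidated", "reprocessing_v2", "L1_v2"),
]
```

Calling `lineage(example_edges, "L1_v2")` walks back from the reprocessed dataset through the consolidated one to the raw data, which is the information a graphical provenance view would display.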
Verification of the Archiving Policy applied to products is critical, as it ensures correct preservation of the data while avoiding the costs of uncontrolled replication (Figure 14). DIS allows monitoring of how many copies are stored for each product (Figure 15).
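A copy-count check of this kind can be sketched as follows. The required number of copies, the site names and the record layout are assumptions for illustration; the actual Archiving Policy and the DIS reports are not reproduced here.

```python
def check_replication(holdings, required_copies=2):
    """Compare the number of archive sites holding each product with a policy.

    `holdings` is an iterable of (product_id, archive_site) tuples.
    `required_copies` is a hypothetical policy value.
    Returns a dict listing under-replicated and over-replicated products.
    Products held nowhere do not appear in `holdings` and would have to be
    detected against a separate catalogue.
    """
    sites = {}
    for product_id, site in holdings:
        # A set ignores duplicate reports of the same copy at the same site.
        sites.setdefault(product_id, set()).add(site)

    report = {"under": [], "over": []}
    for product_id, where in sorted(sites.items()):
        if len(where) < required_copies:
            report["under"].append(product_id)
        elif len(where) > required_copies:
            report["over"].append(product_id)
    return report


# Hypothetical holdings reported by three archive sites.
example_holdings = [
    ("P1", "master"), ("P1", "backup"),
    ("P2", "master"),
    ("P3", "master"), ("P3", "backup"), ("P3", "cold_backup"),
    ("P1", "master"),  # duplicate report, counted once
]
```

With the example input and a policy of two copies, `P2` is flagged as under-replicated and `P3` as over-replicated, while `P1` complies.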
Events in the lifecycle of each product can be monitored, assessing the “activity” for each product type in terms of the number of projects working on it and the resources allocated (Figure 16).
The DIS repository, which maintains the knowledge of the data available, their attributes and their relationships, is a fundamental tool for the whole organisation. It will drive new reporting features in the future, making it possible to exploit and analyse new and unexpected relations among the data in ways that cannot yet be foreseen.
Since its inception, the EO Data Preservation System has allowed ESA and its Heritage Data Programme (LTDP+) to preserve both the data and the associated knowledge, while remaining compatible with the Open Archival Information System (OAIS) reference model of the Consultative Committee for Space Data Systems (CCSDS) and with the relevant preservation standards and guidelines. Both archives manage the Submission Information Package (SIP), Archival Information Package (AIP) and Dissemination Information Package (DIP) in line with the OAIS functional model.
The Data Preservation System will be continuously supported and enhanced in terms of both scope and functions, in order to ensure the preservation and valorisation of the ESA Earth Observation Assets.
The authors have no competing interests to declare.