Data quality is an important guarantee for scientific research. The Geomagnetic Network of China (GNC) controls data quality for all observatories under the jurisdiction of the China Earthquake Administration. This paper presents the quality control content and methods; quality evaluation and feedback procedures; and the process for publishing data sets via the GNC website. Technical challenges and proposed quality assurance procedures for future GNC data sets are also described.
Geomagnetism is an important branch of geophysics. Geomagnetic observatories are the main sites for obtaining geomagnetic observation data, which can reflect the various physical processes related to solar or Earth activities. Geomagnetic observation data have been widely used in fields such as space environment prediction and oil well drilling, and the reliability and accuracy of the data are critical for scientific research and commercial applications. To improve the quality of observation data, it is necessary to track and monitor data over time. The Geomagnetic Network of China (GNC) is now responsible for data quality control for the more than 40 geomagnetic observatories affiliated with the China Earthquake Administration (CEA).
Each observatory is equipped with at least two triaxial fluxgates to make variation recordings; and one fluxgate theodolite and one proton precession magnetometer to make absolute measurements. The most important observatories are equipped with a Chinese GM-4-type triaxial magnetometer, a Danish suspended FGE magnetometer, a Chinese CTM-DI fluxgate theodolite, a Hungarian MINGEO DIM fluxgate theodolite, a Chinese proton precession magnetometer, and a Canadian GSM-19F Overhauser effect magnetometer.
In the past, observation data were transmitted from observatories to the GNC by email, by paper mail, on computer disks, and so on. On completion of the ‘China Digital Seismic Observation Network’ project in 2007, a large, distributed database was established consisting of four levels of information nodes: observatories, local earthquake administrations (LEAs), the China Earthquake Network Center (CENC), and the GNC. The observation data are first transmitted from the observatories to the LEA, then to the CENC, and ultimately to the GNC. The final data collected by the GNC include the raw data, preprocessed data, logs, product data, etc., from 2007 onward. Observatory data are supplied for distribution within 48 hours of acquisition. The GNC is responsible for the quality control and distribution of observatory data. Many scholars have developed various quality control methods for geomagnetic data (Curto & Marsal 2007; Dawson et al. 2009; Reda et al. 2010; Xin & Zhang 2011; Zhang & Yang 2011; Reay, Clarke & Macmillan 2013). This paper focuses on the content and process of controlling data quality, and the publication of datasets via the GNC website.
The aim of quality control is to check the reliability and accuracy of geomagnetic observation data. Within the GNC, data quality control includes several aspects, primarily: integrity, log, noise level, data processing, baseline, and definitive data.
The integrity of data is essential for scientific research and commercial applications. In the GNC, based on the preprocessed minute data, the integrity of every variometer is calculated monthly or yearly, using the following formula:

$$I = \frac{W_o - W_r - W_p}{W_o} \times 100\%$$
Here, I is the integrity of the preprocessed minute data, Wo is the number of samples in the chosen period, Wr is the number of originally missing samples, and Wp is the number of samples deleted by observatory staff during preprocessing.
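The integrity calculation can be sketched in a few lines. The sketch below assumes that integrity is the percentage of expected samples that remain once both the originally missing samples (Wr) and the samples deleted during preprocessing (Wp) are subtracted from the expected count (Wo); the function and parameter names are hypothetical and the sample counts are invented for illustration.

```python
def integrity_percent(expected: int, originally_missing: int, deleted: int) -> float:
    """Percentage of expected samples that remain after preprocessing.

    Assumed form: I = (Wo - Wr - Wp) / Wo * 100.
    """
    if expected <= 0:
        raise ValueError("expected sample count must be positive")
    if originally_missing < 0 or deleted < 0:
        raise ValueError("missing counts cannot be negative")
    if originally_missing + deleted > expected:
        raise ValueError("more samples missing than expected")
    return 100.0 * (expected - originally_missing - deleted) / expected


# A 30-day month of minute data holds 30 * 1440 = 43200 expected samples.
monthly = integrity_percent(expected=43200, originally_missing=120, deleted=96)
print(f"{monthly:.1f}%")  # prints "99.5%"
```

Keeping Wr and Wp as separate arguments mirrors the log, which distinguishes originally missing data from data deleted by staff.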
The reasons for missing data are traced back to the variation recording log completed by the observatory staff. The reasons can include instrument failure (engine fault, sensor fault, communication unit fault, etc.), utility system failure (switches, router, communications line, power supply system, database server, etc.), and environmental disturbance (vehicle interference, high-voltage DC transmission, metro and light rail, ground resistivity, city construction, etc.). If the log is incomplete, GNC staff ask the observatory staff for the reason, and request that they complete the log as soon as possible. Information on the operational state of instruments and utility systems, and on the operational environment of the observatory, can be obtained by analyzing the integrity of the data, which can provide technical support for optimization of the observatory.
The log is important for enabling users to understand the observation system, observation data, and the environmental conditions during observations. The log consists of a variation recording log and an absolute measurement log.
The variation recording log is used to track information on data missing from the original observations, or deleted by observatory staff when preprocessing. It includes the following information: observatory name, date, instrument, components, start and end times for missing data, duration of missing data, type of missing data (0 for originally missing, 1 for deleted by observatory staff), event types, event description, and the identity of the processing staff. The event types include instrument failure, utility failure, and environmental disturbance, as mentioned above. The first seven items of information are added to the database automatically by the processing software. Event types are chosen by the observatory staff from the list mentioned previously. The event description and the identity of the processing staff fields are completed by the observatory staff.
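The fields listed above map naturally onto a small record type. The following is a minimal sketch, not the schema of the GNC database: the class and field names are hypothetical, the date is derived from the start time, and the example entry (observatory code `XXX`, times, staff identifier) is invented for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from enum import Enum


class MissingType(Enum):
    ORIGINALLY_MISSING = 0   # missing in the original observations
    DELETED_BY_STAFF = 1     # deleted by observatory staff when preprocessing


class EventType(Enum):
    INSTRUMENT_FAILURE = "instrument failure"
    UTILITY_FAILURE = "utility failure"
    ENVIRONMENTAL_DISTURBANCE = "environmental disturbance"


@dataclass(frozen=True)
class VariationLogEntry:
    observatory: str
    instrument: str
    components: tuple[str, ...]
    start: datetime
    end: datetime
    missing_type: MissingType
    event_type: EventType
    description: str
    processed_by: str

    @property
    def duration(self) -> timedelta:
        """Duration of the missing data, derived from start and end times."""
        return self.end - self.start

    def is_complete(self) -> bool:
        """True when the staff-completed free-text fields are filled in."""
        return bool(self.description.strip()) and bool(self.processed_by.strip())


entry = VariationLogEntry(
    observatory="XXX",
    instrument="GM-4",
    components=("Z",),
    start=datetime(2015, 6, 1, 2, 10),
    end=datetime(2015, 6, 1, 2, 25),
    missing_type=MissingType.DELETED_BY_STAFF,
    event_type=EventType.ENVIRONMENTAL_DISTURBANCE,
    description="Spikes from a vehicle moving near the observatory",
    processed_by="staff-01",
)
print(entry.duration, entry.is_complete())
```

Restricting the event type to an enumeration reflects the rule that staff choose it from a fixed list, while the description stays free text.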
GNC staff check the log, including whether: the event type written is reasonable; accompanying event descriptions are consistent; and event descriptions are sufficiently detailed to explain the phenomenon and possible causes.
The absolute measurement log is used to record information related to absolute measurements. The GNC checks whether the observatory staff recorded the various events during the process of absolute measurement, including: jumps in adopted baseline values, complementary measurement or additional measurement, changes made to observation instruments, changes in observation marks, etc. This information is recorded in detail in the absolute measurement log. Usually, the log is checked in combination with a baseline curve. As an example, Figure 1 depicts how the log is marked along the baseline curve. These baseline curves consider three geomagnetic components of the KSH observatory in 2015, clearly showing three jumps in the curves. One jump (top panel) occurs in mid-March on component D, as a result of replacing the fluxgate magnetometer. The other two jumps (central and bottom panels) occur in late August on components H and Z, respectively, due to lightning.
Background noise level (i.e., total noise that is irrelevant to the useful signal) is an important indicator of data quality. Background noise levels are usually measured at times when the geomagnetic field is comparatively quiet. One method of assessing noise levels, which is widely used due to its simplicity, is to analyze the rate of change of the field for periods of very low geomagnetic activity (Love 2006). While analyzing background noise by this method, some magnetic disturbances were found even when the geomagnetic activity indices (such as Kp, C index) were very small. We therefore developed a new version of this method, termed ‘special quiet time noise’, described as follows:
Firstly, five special quiet times are calculated for each month, as follows: five observatories (GZH, CNH, LYH, WMQ, and MCH) are chosen, from southern, northern, eastern, western, and central China, respectively. The standard deviations (over 3-h intervals) of the first differences for declination (D), horizontal component (H), vertical component (Z), and total intensity (F) are calculated. The five 3-h intervals with the minimum standard deviations for all five observatories are chosen as the ‘five special quiet times’ for that month. Secondly, the first differences within the ‘five special quiet times’ are calculated for each element and each observatory, and then the frequency of each absolute value of the first difference is counted. If the frequency of a first-difference value is greater than 80%, then double that value is considered the noise value of this element. Otherwise, the frequency values are sorted in descending order and then summed until the cumulative frequency exceeds 80%. Let the frequencies be Ci and the corresponding absolute values of the first difference be Di; then the noise value of the element (S) is as follows:
Thirdly, the maximum value among the five noise values is removed, and the average of the four remaining values is regarded as the noise value for the D, H, Z, and F elements for that month, recorded as SD, SH, SZ, and SF.
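The first and third steps of the procedure can be sketched as below. This is an illustration under stated assumptions, not the GNC implementation: it treats a single element at a single observatory, uses the population standard deviation (the text does not say which estimator is used), and does not cover the frequency-based second step. All function names are hypothetical.

```python
from statistics import mean, pstdev


def first_differences(values):
    """Differences between consecutive samples."""
    return [b - a for a, b in zip(values, values[1:])]


def interval_spreads(minute_values, interval_len=180):
    """Standard deviation of first differences for each full 3-h (180-min) block."""
    spreads = []
    for start in range(0, len(minute_values) - interval_len + 1, interval_len):
        block = minute_values[start:start + interval_len]
        spreads.append(pstdev(first_differences(block)))
    return spreads


def quietest_intervals(spreads, count=5):
    """Indices of the `count` intervals with the smallest spread, quietest first."""
    return sorted(range(len(spreads)), key=lambda i: spreads[i])[:count]


def monthly_noise(noise_values):
    """Drop the largest of the five quiet-time noise values and average the rest."""
    if len(noise_values) != 5:
        raise ValueError("expected exactly five noise values")
    return mean(sorted(noise_values)[:-1])


# Synthetic data: a flat 3-h block followed by a block that alternates 0, 1.
series = [5.0] * 180 + [float(i % 2) for i in range(180)]
spreads = interval_spreads(series)
print(quietest_intervals(spreads, count=1))          # the flat block is quietest
print(monthly_noise([0.10, 0.12, 0.08, 0.50, 0.10]))
```

Dropping the largest of the five values before averaging keeps a single disturbed interval from inflating the monthly figure.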
The mean noise values for the entire geomagnetic network are calculated after removing 20% of the maximum noise values. If the noise value of a particular observatory greatly exceeds the mean noise value, the GNC will supervise the observatory staff in attempting to identify the underlying reasons (environment, line, instrument, etc.), in order to solve the problem. As an example, Figure 2 plots background noise for elements D, H, and Z at LYH observatory in 2015; the noise values for D and H do not exceed 0.2 nT and that of Z does not exceed 0.1 nT.
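The network-level comparison can be sketched as a trimmed mean followed by a simple flagging rule. The text does not state how the 20% is rounded or what counts as greatly exceeding the mean, so the floor rounding and the `factor` parameter below are assumptions; the observatory codes and noise values are invented.

```python
import math
from statistics import mean


def network_mean_noise(noise_by_observatory, trim_fraction=0.2):
    """Mean noise after discarding the largest `trim_fraction` of values."""
    values = sorted(noise_by_observatory.values())
    n_drop = math.floor(len(values) * trim_fraction)  # assumed rounding rule
    return mean(values[:len(values) - n_drop])


def flag_noisy(noise_by_observatory, factor=2.0):
    """Observatories whose noise exceeds `factor` times the trimmed network mean.

    `factor` is a hypothetical stand-in for 'greatly exceeds'.
    """
    reference = network_mean_noise(noise_by_observatory)
    return sorted(code for code, value in noise_by_observatory.items()
                  if value > factor * reference)


# Invented noise values in nT for five hypothetical observatories.
noise = {"OBS1": 0.10, "OBS2": 0.12, "OBS3": 0.08, "OBS4": 0.10, "OBS5": 0.60}
print(network_mean_noise(noise))
print(flag_noisy(noise))  # prints ['OBS5']
```

Trimming before averaging means a few very noisy observatories do not raise the reference level against which all others are judged.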
The raw geomagnetic records may include artificial disturbances such as calibration signals, spikes, and electromagnetic interference. The main task of data preprocessing is to remove artificial disturbances contained in the daily variations. The observatory staff check the daily variation every morning, remove artificial disturbances from the records, and provide an entry in the log to explain which part of the data was modified, how and why the modification was done, and who is responsible for the work. Two basic principles must be followed in preprocessing. Firstly, the natural geomagnetic field must not be changed. In order to identify the natural geomagnetic field, the data must sometimes be compared with those of adjacent observatories. Secondly, the raw data and the preprocessed data should be stored in different database tables, so that the raw data are preserved as a backup. The preprocessed data are transferred to the upper nodes every day. The data are transmitted to the GNC, where the staff regularly check the daily variations in the preprocessed data, and mark the quality of each record, expressed using different colors and numbers. The checks include the following procedures:
Firstly, the daily variations of the previous week are checked every Tuesday, and their quality is marked: A mark of ‘9’ in deep blue means the record was good when first checked; a mark of ‘4’ in bright red means part of the record is corrupted because of jumps, spikes, drift, and other artificial disturbances that occur in data records, and should be modified; a mark of ‘3’ in dark red means the record is entirely unusable due to severe environmental disturbance or instrument failure, and hence no further modification should be done. This procedure is called the ‘first check’. Observatory staff can see the quality marks on the GNC website as soon as they are saved to the database. If they do not agree with the mark, they will recheck the data and post a query on the website forum. GNC and observatory staff regularly discuss whether the marks are correct. If the GNC mark is deemed correct, the observatory staff reprocess their data. The reprocessed data are re-uploaded automatically and marked ‘1’ in light green, and then checked by GNC staff during the next check. If the GNC mark is deemed incorrect, it is revised by GNC staff. This procedure is called ‘inquiry and question’ (Rasson, Toh & Yang 2010).
Secondly, the daily variations for the 7 days of data recorded two weeks previously are checked every Wednesday to see whether the problematic records found in the ‘first check’ have been corrected, and the quality is marked. A mark of ‘8’ in royal blue means the revised record is good; a mark of ‘5’ in yellow means the revised record is still bad and should be modified further. This procedure is called the ‘second check’. The observatory staff will see the marks and recheck the data, and repeat the process of ‘inquiry and question’.
Thirdly, the daily variations of the previous month are checked in the middle of every month to see whether problematic records found in the first and second checks have been corrected, and the quality is marked. A mark of ‘7’ in cyan means the revised record is good, and a mark of ‘6’ in brown means the record is still not correct. This procedure is called the ‘month check’. The observatory staff will see the marks and recheck the data, and repeat the process of ‘inquiry and question’. The entire checking procedure is shown in Figure 3.
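The mark scheme used across the three checks can be summarized as a lookup table. The numbers and colors below are taken from the description above; the member names and the helper function are hypothetical.

```python
from enum import IntEnum


class QualityMark(IntEnum):
    REPROCESSED = 1          # re-uploaded after reprocessing, awaiting next check
    UNUSABLE = 3             # entirely unusable, no further modification
    NEEDS_MODIFICATION = 4   # first check: partly corrupted, should be modified
    SECOND_CHECK_BAD = 5     # second check: still bad, modify further
    MONTH_CHECK_BAD = 6      # month check: still not correct
    MONTH_CHECK_GOOD = 7     # month check: revised record is good
    SECOND_CHECK_GOOD = 8    # second check: revised record is good
    FIRST_CHECK_GOOD = 9     # first check: record was good


MARK_COLORS = {
    QualityMark.REPROCESSED: "light green",
    QualityMark.UNUSABLE: "dark red",
    QualityMark.NEEDS_MODIFICATION: "bright red",
    QualityMark.SECOND_CHECK_BAD: "yellow",
    QualityMark.MONTH_CHECK_BAD: "brown",
    QualityMark.MONTH_CHECK_GOOD: "cyan",
    QualityMark.SECOND_CHECK_GOOD: "royal blue",
    QualityMark.FIRST_CHECK_GOOD: "deep blue",
}

# Marks that ask the observatory staff to revise the record.
ACTION_REQUIRED = {
    QualityMark.NEEDS_MODIFICATION,
    QualityMark.SECOND_CHECK_BAD,
    QualityMark.MONTH_CHECK_BAD,
}


def needs_reprocessing(mark: QualityMark) -> bool:
    """True when the mark asks observatory staff to modify the record."""
    return mark in ACTION_REQUIRED


print(MARK_COLORS[QualityMark(9)], needs_reprocessing(QualityMark(4)))
```

Note that mark 3 is deliberately excluded from the action set, since an unusable record is not to be modified further.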
The GNC staff check the quality of preprocessed data by intercomparison with other magnetometers at the same observatory, or by contrasting the daily variation plots and difference plots with those of adjacent stations. Figure 4 shows daily variation plots of two GM-4-type magnetometers at LYH observatory, with good consistency between the two plots.
It is easy to detect some variometer problems such as jumps, spikes, drift, and other artificial disturbances that occur in data records. Figure 5 shows raw daily variation plots for D, H, and Z components at observatories LYH, DLG, and TAA. The vertical component Z shows spikes and jumps. At LYH, spikes are caused by the movement of a harvester near the observatory, and the jumps are caused by the failure of a high-voltage direct current (HVDC) transmission system. When an HVDC system is functioning well, equal and opposite current flows in two overhead lines. When a failure occurs, the current will flow in a single overhead line; the resulting unbalanced current will affect data at geomagnetic observatories within a certain radius. Figure 6 plots processed data for the observatories shown in Figure 5, in which the spikes and jumps have been modified by observatory staff.
Baseline stability is an important parameter for evaluating the quality of observatory data. The quality of the absolute measurements and the stability of the variometer are determined by investigating the baseline value. The general form of the equation for computing the baseline value for an arbitrary component ‘W’ is shown below (St-Louis, 2008):

$$W_B(k) = W_O(i, j) - W_R(k)$$

Here, (i, j) is the time interval (generally several minutes) of the measurement, k denotes the k-th time, taken as the average time of the interval (i, j), WB is the computed baseline value (also known as the observed baseline value), WO is the observed absolute field value for the time interval (i, j), and WR is the minute value recorded by the variometer.
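A minimal sketch of the baseline computation follows. It assumes the usual relation in which the observed baseline is the absolute value minus the variometer reading, and it further assumes that the variometer minute values spanning the measurement interval are averaged to represent the reading at the mean time; the function name and the numbers are invented for illustration.

```python
from statistics import mean


def observed_baseline(absolute_value: float, variometer_minutes: list[float]) -> float:
    """Baseline = absolute value minus the variometer mean over the same interval."""
    if not variometer_minutes:
        raise ValueError("need at least one variometer minute value")
    return absolute_value - mean(variometer_minutes)


# Invented example for component H, in nT: the absolute measurement spans
# four minutes, during which the variometer recorded the values below.
h_absolute = 29650.4
h_variometer = [152.1, 152.3, 152.2, 152.2]
print(round(observed_baseline(h_absolute, h_variometer), 1))  # prints 29498.2
```

A stable variometer and careful absolute measurements should yield baseline values that change only slowly from one measurement to the next.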
The frequency of absolute measurement depends on whether the drift of variation recording data can be controlled effectively; this depends on the characteristics of the variometer, the stability of the observation pillar, logistic support, etc. Usually, observatory staff are asked to make absolute measurements each Monday and Thursday (3–5 pm local time), acquiring at least two groups of valid observed baseline values. Observatory staff will add some measurements immediately, or even make continuous measurements over a few days, when the following situations occur: (1) the difference in observed baseline values within one day is > 1 nT, which may be because of unstable absolute observation instruments, an unstable geomagnetic field, or incorrect manipulation; (2) the variometer is restarted; (3) the temperature within the recording room varies widely; (4) the two observed baseline values deviate from the general variation tendency. Figure 7 shows an example baseline for the three components D, H, and Z at MCH observatory in 2015, which includes a complementary measurement because of a magnetic storm the previous day. The variometer is a GM4 triaxial fluxgate magnetometer; absolute measurements are obtained via a Chinese CTM-DI fluxgate theodolite; and absolute total intensity measurements use a Chinese proton precession magnetometer G856. In this example, the annual changes in baseline values for D, H, and Z are only 0.6°, 4 nT, and 6 nT, respectively: the variometer works well, the recording room is effectively insulated, and the observatory staff have provided complementary measurements when required.
The root mean square error (RMSE) between adopted monthly baseline values and observed baseline values provides an index of observation accuracy. The equation for computing the RMSE for an arbitrary component ‘Y’ is shown below:

$$S_E = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(Y_{1i} - Y_{2i}\right)^2}$$
Here, SE is the RMSE, n is the number of absolute measurements, Y1i are the observed baseline values, and Y2i are the adopted baseline values. It is generally considered that the RMSEs of components D, H, and Z should not exceed 0.1°, 1 nT, and 1 nT, respectively. If these values are exceeded, the reasons are analyzed using a combination of baseline graphs and logs. Monthly RMSEs at MCH observatory in 2015 are given in Table 1.
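The RMSE check can be sketched as follows. The sketch assumes the standard definition with divisor n (the number of absolute measurements); the function names are hypothetical, the baseline values are invented, and only the 1 nT limit for H and Z quoted above is encoded.

```python
import math

# Limits for H and Z in nT, as quoted in the text.
RMSE_LIMIT_NT = {"H": 1.0, "Z": 1.0}


def baseline_rmse(observed: list[float], adopted: list[float]) -> float:
    """Root mean square difference between observed and adopted baseline values."""
    if len(observed) != len(adopted) or not observed:
        raise ValueError("need two non-empty sequences of equal length")
    squared = sum((o - a) ** 2 for o, a in zip(observed, adopted))
    return math.sqrt(squared / len(observed))


def within_limit(component: str, rmse: float) -> bool:
    """True when the RMSE does not exceed the limit for the component."""
    return rmse <= RMSE_LIMIT_NT[component]


# Invented H baseline values in nT for one month.
observed_h = [100.2, 100.5, 99.9, 100.4]
adopted_h = [100.0, 100.3, 100.1, 100.2]
rmse_h = baseline_rmse(observed_h, adopted_h)
print(round(rmse_h, 3), within_limit("H", rmse_h))  # prints 0.2 True
```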
Table 1: Monthly RMSEs at MCH observatory in 2015.
| Station Code | Component | Root Mean Square Error |
In Table 1, none of the RMSEs exceed the threshold values (1nT or 0.1°), thereby demonstrating that the baseline values are stable and the measurements have high accuracy.
The quality of a variometer can be evaluated by its baseline plot. The baselines and their differences, determined by the same variometer and different fluxgate theodolites, are compared, thereby identifying potential problems in the absolute measurement. Figure 8 shows the baselines for elements D, H, Z, F, and I, determined by the same variometer (GM4 magnetometer) and by different absolute instruments (CTM-DI and MINGEO DIM fluxgate theodolites) at WMQ observatory in 2015. The plots show good consistency between the two sets of baseline values, except for June and December. In such cases, the observatory staff investigate the underlying reasons and make complementary measurements.
The definitive data represent the flagship product of each geomagnetic observatory, so the utmost accuracy of absolute levels is demanded (Rasson, Toh & Yang 2010). Greater absolute accuracy of definitive data allows greater accuracy of global geomagnetic field models, such as the International Geomagnetic Reference Field (IGRF) and the World Magnetic Model (WMM). The GNC mainly checks the midnight means of the definitive data and the F-P difference.
The midnight means are calculated by averaging the minute mean values at 00:00–03:00 h local time, during which the geomagnetic field is relatively quiet (Zhang 2015). These data are often used for establishing geomagnetic models and for research into seismo-magnetic relationships. Adjacent observatories are grouped together in order to detect spikes, fluctuations, and unnatural jumps. The time series graphs of midnight means and their differences are examined, and marks are assigned to every record. If a bad record is found, the underlying reasons must be identified, possibly including: incorrect preprocessing, inappropriate adopted baseline value, absolute measurement problem, variometer instability, environmental interference, etc. If the error is caused by incorrect preprocessing or an inappropriate adopted baseline value, the GNC will ask the observatory staff to reprocess the data or to re-adopt baseline values. This kind of error is marked ‘E’. If the error is caused by absolute measurement or serious environmental interference, the GNC will assign a mark of ‘W’, meaning that this kind of error cannot be corrected. If the record is good, it will be marked as ‘D’.
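The midnight mean for one day can be sketched as below. The sketch assumes that timestamps are already in local time and that the window is half-open, so 00:00 is included and 03:00 is excluded; the text does not specify the boundary handling. The function name and sample values are invented.

```python
from datetime import datetime
from statistics import mean


def midnight_mean(samples):
    """Average the values whose local timestamp falls in [00:00, 03:00).

    `samples` is an iterable of (local_time, value) pairs for one day.
    Returns None when no sample falls inside the window.
    """
    window = [value for local_time, value in samples if local_time.hour < 3]
    return mean(window) if window else None


# Invented minute values for one element on one day.
day = [
    (datetime(2015, 3, 1, 0, 30), 10.0),
    (datetime(2015, 3, 1, 2, 59), 12.0),
    (datetime(2015, 3, 1, 3, 0), 100.0),   # outside the window
    (datetime(2015, 3, 1, 12, 0), 50.0),   # outside the window
]
print(midnight_mean(day))  # prints 11.0
```

Differences between the midnight means of adjacent observatories can then be formed by subtracting the series day by day.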
Figure 9 shows a group of time series plots of midnight means and their differences for three adjacent observatories (JIH, LH, and CNH). The upper four panels are the time series of midnight means for geomagnetic elements D, H, Z, and F. Each panel has three curves corresponding to the three observatories. The lower four panels are the time series of the differences for geomagnetic elements D, H, Z, and F. Each panel has two curves corresponding to the differences between each of the other two observatories and the reference observatory CNH. The plots show that the midnight means and their differences are consistent, and the records are classified as good.
F-P (total field difference ∆F) inspection is another quality control index for definitive data, where F is the total field computed from component values (baseline adjusted) and P is the total field recorded by a continuous total field magnetometer. It is an effective method that is widely used internationally and is recommended by INTERMAGNET (St-Louis, 2008).
In the GNC, 28 observatories operate a continuous total field magnetometer and a vector magnetometer at the same time. The continuous total field magnetometer records P, and the vector magnetometer records component values from which F is computed, so ∆F values can be acquired by F-P inspection. The GNC can produce time series graphs of ∆F in real time to monitor the performance of those observatories continuously. F and P are produced simultaneously and are corrected to the absolute standard pillar, such that they represent the total field at the same site and time. In the ideal case, the ∆F value should be close to zero; the ∆F curve is then an approximately straight line that changes only slightly near zero, which indicates appropriate variometer behavior and good absolute measurement. Any spikes, jumps, or drift that appear on the ∆F plots indicate baseline problems with the variometer or interference with the instrument. Li et al. (2012) studied the ∆F inspection of several observatories in the GNC over several years. Figure 10 shows an example of minute data curves for F, P, and ∆F at CNH observatory. In this example, F is the total field computed from component values (baseline adjusted) recorded by an FGE magnetometer, and P is the total field recorded by a GSM-19F Overhauser effect magnetometer. F and P show good consistency in the upper panel, and in the lower panel, the maximum ∆F is only 0.2 nT. As described above, this produces an approximately straight line with slight changes near zero.
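The ∆F computation can be sketched as follows, assuming the vector data are available as baseline-adjusted H and Z in nT, so that the total field is the square root of H² + Z². The function names are hypothetical and the values are invented.

```python
import math


def total_field(h: float, z: float) -> float:
    """Total field F computed from horizontal (H) and vertical (Z) components."""
    return math.hypot(h, z)


def delta_f_series(h_values, z_values, p_values):
    """Minute-by-minute F - P, with F computed from H and Z."""
    if not (len(h_values) == len(z_values) == len(p_values)):
        raise ValueError("H, Z, and P series must have equal length")
    return [total_field(h, z) - p for h, z, p in zip(h_values, z_values, p_values)]


def max_abs_delta_f(deltas):
    """Largest absolute F - P value in the series."""
    return max(abs(d) for d in deltas)


# Invented minute values in nT.
h = [30000.0, 30000.0]
z = [40000.0, 40000.0]
p = [49999.9, 50000.2]
deltas = delta_f_series(h, z, p)
print(round(max_abs_delta_f(deltas), 1))  # prints 0.2
```

A series that stays near zero supports both the adopted baselines and the scalar instrument; a step or drift in it points to one of the two.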
In order to provide a human–computer interface for processing the data, and automation of quality control, the GNC developed ‘Geomagnetic data processing’ and ‘Geomagnetic data quality monitoring’ software, which together handle all of the work mentioned above, including data processing, comparative analysis, quality control, and other functions. The software also provides data download (including INTERMAGNET, IAGA 2002 and other formats), data backup, and data query functions, etc.
As soon as the GNC staff complete data quality checks and assign a mark, the results are published on the GNC website. Published results include: integrity, log, noise level, data processing, baselines, midnight means, etc. The observatory staff are able to see the allocated marks and judge whether they should reprocess the data or query the check mark via the website forum, where GNC staff regularly answer queries. The query system thereby functions as an additional feedback mechanism for checking the work of the GNC, further ensuring data quality.
Data integrity, accuracy, and precision are directly related to the quality of subsequent research. It is the responsibility of all staff involved with data output to ensure and improve data quality. To date, GNC data quality operations have mainly concentrated on quality control of minute data and data with lower sampling rates. Quality control of second data is still in the initial exploratory stage; as yet, no efficient and feasible mechanisms have been identified. Future efforts will be directed to resolving these issues.
The identification of bad records when preprocessing data and checking midnight mean values relies predominantly on manual comparison, in addition to the difference method and first-order differences. In the future, we hope to identify alternative methodologies for improving the automation of quality monitoring, such as the use of natural orthogonal components (NOC), data simulation (Yao 2015), and so on.
Supported by major national projects to develop scientific instruments and equipment (2014YQ100817) and The National Natural Science Foundation of China (41504129).
The authors have no competing interests to declare.
Curto, J and Marsal, S (2007). Quality control of Ebro magnetic observatory using momentary values. Earth Planets Space 59: 1187–1196, DOI: https://doi.org/10.1186/BF03352066
Dawson, E, Reay, S, Macmillan, S, Flower, S and Shanahan, T (2009). Quality control procedures at the world data centre for geomagnetism, Edinburgh. Poster presented at: 11th IAGA Scientific Assembly. 23–30 Aug 2009, Sopron, Hungary. Retrieved from: http://nora.nerc.ac.uk/11740 (November 30, 2012).
Rasson, J L, Toh, H and Yang, D M (2010). The global geomagnetic observatory In: Mandea, M and Korte, M eds. Geomagnetic Observations and Models. Volume 5 of the series IAGA Special Sopron Book Series. Springer, pp. 1–27, DOI: https://doi.org/10.1007/978-90-481-9858-0
Reay, S J, Clarke, E and Macmillan, S (2013). Operation of the world data center for geomagnetism, Edinburgh. Data Science Journal 12: WDS47–WDS50, DOI: https://doi.org/10.2481/dsj.WDS-005
Reda, J, Fouassier, D, Anca, I, Linthe, H J, Matzka, J and Turbitt, C W (2010). Improvements in geomagnetic observatory data quality. In: Mandea, M and Korte, M eds. Geomagnetic Observations and Models. Volume 5 of the series IAGA Special Sopron Book Series. Springer, pp. 127–148, DOI: https://doi.org/10.1007/978-90-481-9858-0_6
St-Louis, B (2008). INTERMAGNET technical reference manual: 23–24. Available at: http://www.intermagnet.org/publications/im_manual.pdf (Last accessed 27 April 2016).
Xin, C J and Zhang, S Q (2011). The analysis of baselines for different fluxgate theodolites of geomagnetic observatories. Data Science Journal 10(0): IAGA159–IAGA168, DOI: https://doi.org/10.2481/dsj.IAGA-23
Zhang, S Q and Yang, D M (2011). Study on the stability and accuracy of baseline values measured during the calibrating time intervals. Data Science Journal 10(0): IAGA19–IAGA24, DOI: https://doi.org/10.2481/dsj.IAGA-04