Safeguarding Old and New Journal Tables for the VO: Status for Extragalactic and Radio Data

Independent of established data centers, and partly for my own research, since 1989 I have been collecting the tabular data from over 2600 articles concerned with radio sources and extragalactic objects in general. Optical character recognition (OCR) was used to recover tables from 740 papers. Tables from only 41 percent of the 2600 articles are available in the CDS or CATS catalog collections, and only slightly better coverage is estimated for the NED database. This fraction is no better for articles published electronically since 2001. Both object databases (NED, SIMBAD, LEDA) and catalog browsers (VizieR, CATS) need to be consulted to obtain the most complete information on astronomical objects. More human resources at the data centers and better collaboration between authors, referees, editors, publishers, and data centers are required to improve data coverage and accessibility. The current efforts within the Virtual Observatory (VO) project to provide retrieval and analysis tools for different types of published and archival data stored at various sites should be balanced by an equal effort to recover and include the large amounts of published data not currently available in this way.


INTRODUCTION
During the 1980s, the amount of information on astronomical sources had grown to such an extent that it became necessary to create databases that would not only provide links between the different designations of a given object in different wavebands of the electromagnetic spectrum, from radio through X-rays, but would also include a complete bibliography of each object. As a result, SIMBAD was created at the Centre de Données astronomiques de Strasbourg (CDS), France (Wenger, Ochsenbein, Egret, Dubois, Bonnarel, Borde, et al. 2000), initially for stars in our Galaxy and later expanded to include extragalactic objects, as well as LEDA at Observatoire de Lyon, France (Paturel, Andernach, Bottinelli, Di Nella, Durand, Garnier, et al. 1997) for galaxies of the nearby Universe, and eventually NED at IPAC, USA (Mazzarella & The NED Team, 2007) to include all known extragalactic objects. In parallel to these databases, CDS Strasbourg maintains a growing collection of astronomical catalogs in electronic form, which are individually accessible for download and have become searchable through the VizieR catalog browser (Ochsenbein, Bauer, & Marcout, 2000; see URL vizier.u-strasbg.fr/cgi-bin/VizieR).
In 1989, motivated by a lack of data on radio continuum sources in NED, SIMBAD, and VizieR, I started collecting electronic tables of radio sources and/or extragalactic objects that were not readily available from data centers (see e.g., Andernach, 1992). This has grown into a collection of data tables from currently over 2600 articles. The history and current status of this collection are described in Sect. 2, while in Sect. 3 the CDS catalog collection is analysed. In Sect. 4 the size distribution of catalogs in my collection is studied, and in Sect. 5 the fraction of the literature that is covered by more established catalog collections is estimated. In Sect. 6 the coverage of the purely electronic literature since 2001 is investigated, and in Sect. 7 the most complete database

GROWTH AND STATUS OF AUTHOR'S CATALOG COLLECTION
After having started my collection of electronic catalogs not initially available from data centers in 1989, by 1995 I had secured over 130 radio source lists. Half of the latter, typically the larger ones, had been made available for cone searches in the Einstein On-line Service (Harris, Stern Grant, & Andernach, 1995) until that service was closed. Since 1997, a larger number of radio catalogs from my collection have been integrated into the CATS database (Verkhodanov, Trushkin, Andernach, & Chernenkov 1997) and a smaller number into the VizieR catalog browser. The growth of my collection over time, independent of size and publication year, is plotted in Figure 1. Catalogs were selected if they (a) contained data on an appreciable number (approximately at least 50) of radio continuum sources or other extragalactic objects and (b) were initially unavailable in the CDS catalog collection. Many of these catalogs were later incorporated in the CDS archive and VizieR, but, as I shall show in this contribution, more than half of the tables in my collection are still not accessible from these public catalog browsers. To date, I have collected radio source lists from 1380 articles published since the 1950s and other tables including extragalactic objects from a further 1250 articles. About a third of these were taken from the arXiv preprint server (URL www.arXiv.org), about another third were recovered with optical character recognition (OCR) methods, and the rest were either supplied by authors upon my request (for papers before about 1995) or recovered from electronically published articles (since about 1995).
The initial motivation for this collection was to convince the then established data centers (CDS and NASA's Astronomical Data Center, ADC) to aim for better literature coverage in their catalog collections. The hope that my collection work would soon become redundant did not become a reality. On the contrary, the collection work by established data centers was further reduced with the closure of NASA's ADC in 2002. As Figure 1 shows, my collection has been growing at an increasing speed over the years and is currently growing at 200 items per year, where I define an "item" as one article with at least one table in my collection. In Figures 1 through 4, these items are also denoted as "cats" (short for catalogs).
Figure 1: Cumulative growth of the author's catalog collection. For each year on the horizontal axis, the number of published articles from which tabular data were collected until that year (not the year of the publication of each catalog) is plotted on the vertical axis. Continuous line: all collected catalogs (2650 cats, 11,340,000 records); broken line: radio source lists only (1350 radio cats, 4,750,000 records).
Given the incompleteness of the catalog collections of CDS and CATS, the author is continuing to collect newly published catalogs if they are unlikely to enter these collections and to recover, by OCR methods, older catalogs which have never existed in electronic form. To this end, the OCR results of the scanned paper archive at the SAO/NASA Astrophysics Data System (ADS, adsabs.harvard.edu/cgi-bin/signup ocr) were exploited whenever possible. It turned out that, more often than not, these tables needed either heavy reformatting or a complete rescanning and subsequent OCR treatment. Before being accepted in the author's collection, all OCR results are checked for consistency with the original by overlaying the OCR result, printed on the same scale, on the original, which turned out to be the most efficient method to spot errors in the OCR result. Frequently, errors in the original papers were also detected and reported as notes to the resulting electronic table. These errors reflect a certain lack of attention paid by referees to the data part of papers. Ironically, it is usually the data part that remains valid for many years after publication, while ideas about their interpretation may change. The main bottleneck in making these tables available for catalog browsers is the need to prepare metadata, which implies complete and consistent column descriptions, byte by byte. These can in most cases be taken from the text (column descriptions) contained in the papers, although their preparation requires a person with at least a minimum knowledge of the research field.
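To illustrate why byte-by-byte column descriptions matter, the following minimal Python sketch reads one line of a fixed-width ASCII table using such a description. The column layout, the column names, and the sample source are invented for illustration and do not correspond to any catalog in the collection; the sketch only assumes that each column is given by a 1-indexed, inclusive byte range.

```python
from __future__ import annotations

from typing import NamedTuple


class Column(NamedTuple):
    """One column of a byte-by-byte description (1-indexed, inclusive bytes)."""
    start: int
    end: int
    name: str
    kind: type  # int, float or str


def parse_row(line: str, columns: list[Column]) -> dict:
    """Slice one fixed-width line according to the column byte ranges."""
    record = {}
    for col in columns:
        field = line[col.start - 1:col.end].strip()
        # Blank or missing fields are kept as None rather than guessed.
        record[col.name] = col.kind(field) if field else None
    return record


# Hypothetical layout for a small radio source list (not a real catalog).
COLUMNS = [
    Column(1, 10, "name", str),
    Column(12, 19, "ra_deg", float),
    Column(21, 28, "dec_deg", float),
    Column(30, 35, "flux_mjy", float),
]

# Build a sample line with format specifiers so that it matches the layout.
line = f"{'J0001+0002':<10} {0.25:8.4f} {-0.0333:8.4f} {120.5:6.1f}"
record = parse_row(line, COLUMNS)
```

Without the byte ranges, the same line could only be split on whitespace, which fails as soon as a field is blank or an object name contains a space.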

THE CATALOG COLLECTION AT CDS
The CDS at Strasbourg maintains the most complete collection of astronomical catalogs accessible for public download. As of September 2008, the CDS collection offers catalogs from about 7700 publications, growing now at over 500 items per year. In Figure 2, the solid line shows the total number of items in the CDS collection as a function of publication year. (Note the difference in the meaning of the abscissae in Figures 1 and 2.) About 90 percent of these catalogs are also incorporated in the VizieR catalog browser, which allows a search through all or a subset of these catalogs around a user-specified sky position (dot-dashed line in Figure 2). The sudden increase of the slope of the continuous line in Figure 2 near 1994 is due to the start of the electronic version of the major astronomical journals and specifically due to an agreement between CDS and the journal Astronomy & Astrophysics (A&A) to store its data tables directly in the CDS archive. The 10 percent of CDS catalogs not in VizieR either are still having their metadata prepared for inclusion into VizieR or are unsuitable for cone searches (e.g. for lack of absolute coordinates or suitable documentation).
The dashed line in Figure 2 shows the cumulative distribution of publication years of catalogs in the author's catalog collection. It is noteworthy that for papers published between 1982 and 1992, my collection exceeds in number the catalogs available at CDS, while the overlap between the two collections is small. This is due to my OCR activities on tables from a period that immediately precedes the electronic journal era. But, as I explain below, the more recent catalogs in my collection are by no means a duplication of efforts at CDS since a large fraction of these catalogs have still not entered the CDS archive. For the last few years I have also monitored the catalogs listed as "in preparation" in the CDS archive (dotted line in Figure 2). The currently 300 such catalogs have a median age since publication of nine years, compared to 2.5 years for 426 catalogs in 2003 (Andernach 2003) and five years for 400 of these in 2006 (Andernach 2006). While the decrease in number of these "inprep" catalogs may appear promising, it only implies that CDS has managed to reduce the backlog by including preferentially the more recent "inprep" catalogs, but, as I show below, CDS appears to have stopped listing more recent catalogs as

CATALOG "BIOMETRICS" AND DATA CENTER COVERAGE
In Figures 3 and 4 (continuous lines) I have plotted the cumulative size distribution of the radio and non-radio catalogs in my collection. Here "size" typically means the number of objects a catalog deals with. Sometimes this is the number of flux density measurements, e.g. if cataloged for various observing frequencies or epochs, and can be much larger than the number of objects. Since these plots are on a double-logarithmic scale, it can be seen that the size distribution of both types of catalogs closely follows a power law with similar slopes of −0.69 and −0.61, respectively. Such power laws are known as Zipf's or Lotka's laws in biometrics. The dashed lines in Figures 3 and 4 indicate the size distribution of those catalogs in my collection that are also in the CDS collection. While it can be seen that the CDS collection becomes incomplete for radio catalogs of fewer than about 10,000 records, it is incomplete for the non-radio catalogs in my collection for all sizes. In Figure 3, the dotted line gives the size distribution of those catalogs in my collection that are also in the CATS collection, which becomes incomplete at a much smaller size of about 1000 records. It would have been interesting to create such plots for the entire collection of over 7000 CDS catalogs, but their sizes were not readily available to me.
Obviously the catalogs in all three collections have a typical minimum size of roughly 50 records to be worth including in the collection. The assumption that for all publications containing astronomical data, regardless of their presence in catalog collections, the power law for the number of objects holds down to very few records, would lead, e.g. by extrapolating the continuous line in Figure 3, to an estimate of ∼20,000 papers that have ever dealt with at least one radio source. As the latter number seems reasonable, one may infer from the turnover of the power law (below about 100 records) that the collections become incomplete for smaller sizes. This is a natural "collection bias" since it is more work for less return to transfer into electronic form the very many smaller catalogs, e.g. via OCR methods. Figure 3 includes all radio source lists that exist in electronic form, still excluding many dozens of published lists that have not (yet) been recovered in electronic form. Thus, one may conclude that of all existing electronic radio source catalogs, virtually all the ones larger than 10,000 records are available from CDS, and all the ones larger than 1000 records are included in CATS, but for those larger than 100 records, only 52 percent are in the CDS archive, and only 34 percent are included in the CATS catalog browser.
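The slope of such a cumulative size distribution can be estimated with a straight-line fit in log-log space. The Python sketch below is one simple way to do this (an ordinary least-squares fit to log N(>=size) versus log size above a chosen minimum size); it is not necessarily the procedure used to obtain the slopes quoted above, and the function names and the default cut at 100 records are assumptions made for illustration.

```python
import math


def cumulative_counts(sizes):
    """Return (size, N(>=size)) pairs for each distinct size, ascending."""
    ordered = sorted(sizes)
    n = len(ordered)
    pairs = []
    for i, s in enumerate(ordered):
        # Record each distinct size once, with the number of catalogs
        # that are at least that large.
        if i == 0 or s != ordered[i - 1]:
            pairs.append((s, n - i))
    return pairs


def loglog_slope(pairs, min_size=100):
    """Least-squares slope of log10 N(>=size) versus log10 size.

    Only sizes >= min_size are used, to stay above the turnover where
    a collection becomes incomplete.
    """
    pts = [(math.log10(s), math.log10(c)) for s, c in pairs if s >= min_size]
    if len(pts) < 2:
        raise ValueError("need at least two points above min_size")
    mean_x = sum(x for x, _ in pts) / len(pts)
    mean_y = sum(y for _, y in pts) / len(pts)
    sxx = sum((x - mean_x) ** 2 for x, _ in pts)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pts)
    if sxx == 0:
        raise ValueError("all sizes are identical")
    return sxy / sxx
```

Calling loglog_slope(cumulative_counts(sizes)) on a list of catalog sizes returns the fitted slope. A least-squares fit to cumulative counts is only a quick diagnostic, since neighbouring points of a cumulative distribution are not independent.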
From Figure 4, one sees that for those non-radio catalogs in the author's collection, the CDS archive contains 69 percent of those larger than 1000 records and 56 percent of those larger than 100 records. As opposed to Figure 3, however, Figure 4 only contains a biased subsample of catalogs that were of some interest to the author, and thus the completeness of the CDS archive cannot be assessed as a whole on the basis of these data. Nevertheless, Figure 4 shows that of the 774 catalogs with at least 100 records collected by the author, 344 (44 percent) are missing in the CDS archive. This percentage decreases only slightly when a fraction of items in my collection, which may not be suitable for inclusion in VizieR or archiving at CDS, e.g. for lack of object coordinates or adequate documentation, are discarded.

LITERATURE COVERAGE OF VIZIER AND CATS
Based on a systematic inspection of all major astronomical journals for the period 1987 through 1993, I had found (Andernach, 1994) that there were 374 articles that deal with at least 100 supposedly extragalactic objects, and I showed that in 1994 the tables for only 21 percent of these articles were available from the CDS archive. Today, CDS has 46 percent of these same 374 articles, showing that some of the published data of the past are being recovered, albeit at a very slow pace. Table 1 gives an overview of the coverage by CDS and CATS of the items in my collection. One can see that overall only about 40 percent of my collected items are covered.
Table 1. For each catalog collection in the first column, I list the number of items in my collection of radio and non-radio catalogs, followed by the percentage in number of catalogs, the total number of records of the catalogs and its percentage of the total. The status is as of 1-Sep-2008. HA stands for the author, and the third row refers to catalogs which are in the author's, but not in the CDS or CATS collections.
Radio source tables in HA coll. | Non-radio tables in HA coll.
Thus, for radio sources, CATS slightly "beats" VizieR in number of objects, but VizieR offers more variety, i.e. more of the smaller source lists. This is due to CDS's continuous activity of incorporating tables of all sizes from the major electronic journals in astronomy. Given this difference between the two collections, one needs to search in both to obtain the most complete information. For both radio and non-radio catalogs, more than half of the items I collected are neither in VizieR nor in CATS. For the vast majority of the latter catalogs, metadata would have to be prepared from column descriptions given in the paper text. For a small fraction of these catalogs (e.g. those lacking absolute coordinates for the objects they contain), this would require a further effort of inserting these coordinates, which could partly be achieved with existing name resolvers of NED or SIMBAD. A still smaller fraction of these catalogs, e.g. those that only contain derived parameters, may be unsuitable for inclusion into catalog browsers.

THE ELECTRONIC AGE SINCE 2001: NOT ALL GOLD
The above statistics may be distorted by the fact that catalogs published in the pre-electronic age before about 1995 may be over-represented in my collection and under-represented in the CDS collection. I report in Table 2 the situation for tables with over 50 records in my collection, both radio and non-radio, and published in electronic journals since 2001. Only journals with at least five catalogs in my collection, and published in this period, are included. Percentages in parentheses are for the period from 1998 to 2006 as taken from Andernach (2006).
However, the exact percentages of the coverage by records of course depend on the inclusion or not of a few very large catalogs (e.g. the Sloan Digital Sky Survey, SDSS, www.sdss.org). While the above statistics are biased by my way of selecting radio and extragalactic source tables, the large number of articles they are based on still shows a significant fraction of data missing from the data centers, which has not improved in recent years. Moreover, the following problems are often faced with electronic journal tables.
The journals ApJ and AJ offer most tables as separate files in ASCII format, usually with good metadata provided by the authors, but often the tables are not well aligned and are mixed with HTML or LaTeX symbols; the recent change of publisher of these journals from the University of Chicago Press (USA) to the Institute of Physics (UK) has lowered the percentage of well-formatted and documented tables.
In A&A many tables are offered only in LaTeX or PS, but not in ASCII format, nor do they A&A data tables automatically at CDS, the editors decide which tables flow to CDS, and for the sample in Table 2 the coverage is only 72 percent.
In the UK journal MNRAS, data tables are not usually offered in ASCII, except for articles with supplementary material. The tables are mostly offered in a separate window which does not allow them to be downloaded as files, but only copied in cut-and-paste mode, which becomes difficult for long tables. Metadata are not common. Many tables are not offered as separate files at all and need to be recovered by PDF-to-text conversion. It is often easier to recover these tables in LaTeX format from the public preprint archive at www.arXiv.org, but then there is no guarantee that they are identical with the published version. The publisher of MNRAS declines any responsibility for supplementary material such as data tables and refers to individual authors in case of problems. This does not reflect a commitment to the long-term preservation of published data.
For other journals like ARep, BSAO, CHJAA, PASA, or RMxAA, no ASCII tables are offered at all. They can only be recovered via a PDF-to-text conversion.
A problem that applies to all journal tables not offered as a straight ASCII file is that cut-and-paste copying or PDF-to-text conversion usually does not preserve table alignment and does not correctly interpret special characters such as minus signs.
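The minus-sign problem in particular can be reduced with a small clean-up step after text extraction. The following Python sketch replaces typographic dashes that directly precede a digit or decimal point with an ASCII hyphen-minus, so that the values can be parsed as numbers. The set of characters treated as "minus-like" is an assumption based on common extraction results, not an exhaustive list, and the function name is illustrative.

```python
import re

# Unicode characters that text extraction commonly produces where a
# minus sign was printed: MINUS SIGN, EN DASH, EM DASH, FIGURE DASH, HYPHEN.
MINUS_LIKE = "\u2212\u2013\u2014\u2012\u2010"

# Match a minus-like character only when a digit or decimal point follows,
# so that dashes inside ordinary text are left untouched.
_SIGN_BEFORE_NUMBER = re.compile(f"[{MINUS_LIKE}](?=[0-9.])")


def normalize_minus(text: str) -> str:
    """Replace minus-like characters before numbers with ASCII '-'."""
    return _SIGN_BEFORE_NUMBER.sub("-", text)
```

Note that this also converts dashes in numeric ranges such as page or year ranges, so it is best applied to the table body only and not to running text. Restoring column alignment is a separate problem that this sketch does not address.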

LITERATURE COVERAGE OF DATABASES
The contents of the CDS catalog archive are easy to assess since the URL ftp://vizier.u-strasbg.fr/cats/cats.all offers a full and always up-to-date list (except for their subset in VizieR). However, to obtain the bibliography and published data on an object, most astronomers consult databases like SIMBAD, NED, or LEDA. In the following, I try to obtain an estimate of the fraction of tabular data covered by these databases. It should be stressed that these databases are independent of the catalog collections described above. Databases make use of individual catalogs to link data to certain objects, but this requires fairly sophisticated cross-identification procedures, which need to be controlled by the database managers. Thus, it requires much more effort to include the data of a certain catalog into a database than to include a catalog into a searchable catalog browser. As a result, a user may expect a fairly reliable set of data for a given object put together in a database, while one may obtain more complete data using cone searches around an object in catalog browsers if one is ready to check which of the data actually correspond to the object in question.
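The difference between the two approaches can be made concrete with a sketch of what a cone search does: it returns every catalog row within a given angular radius of a position, without deciding which rows belong to the object of interest. The Python example below is a minimal illustration using the haversine formula; the row format (dictionaries with "ra_deg" and "dec_deg" keys) and the function names are assumptions, and real catalog browsers use indexed searches rather than a loop over all rows.

```python
import math


def angular_separation_deg(ra1, dec1, ra2, dec2):
    """Great-circle separation in degrees (haversine formula).

    All arguments are in degrees.
    """
    ra1, dec1, ra2, dec2 = map(math.radians, (ra1, dec1, ra2, dec2))
    sin_ddec = math.sin((dec2 - dec1) / 2)
    sin_dra = math.sin((ra2 - ra1) / 2)
    a = sin_ddec ** 2 + math.cos(dec1) * math.cos(dec2) * sin_dra ** 2
    # Guard against rounding slightly above 1 before taking the arcsine.
    return math.degrees(2 * math.asin(min(1.0, math.sqrt(a))))


def cone_search(rows, ra0, dec0, radius_deg):
    """Return (separation, row) pairs within radius_deg, nearest first.

    Each row is assumed to be a dict with 'ra_deg' and 'dec_deg' keys.
    """
    hits = []
    for row in rows:
        sep = angular_separation_deg(ra0, dec0, row["ra_deg"], row["dec_deg"])
        if sep <= radius_deg:
            hits.append((sep, row))
    hits.sort(key=lambda hit: hit[0])
    return hits
```

The list of hits is exactly what the user of a catalog browser has to inspect by hand: nearby rows may be the same object, an unrelated neighbour, or a different component of an extended source.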
As SIMBAD puts a weaker emphasis on extragalactic data and LEDA concentrates on galaxies in the nearby Universe, they were not studied here. The literature coverage of NED is offered at the URL nedwww.ipac.caltech.edu/samples/NEDmdb.html and is expressed for each reference with terms such as "incomplete", "partial", "complete", or "entered as found". For 82 percent of the altogether 4637 references listed in the above file, NED claims that the objects were completely entered. For 5 percent of the references, NED explicitly admits incompleteness or a mere lack of the electronic source catalog, and for 7 percent objects were only entered "as needed" or as references to them were found in the literature. For one percent of the references, only the likely extragalactic objects were entered in NED. The disadvantage of the NEDmdb list is that it is apparently not updated frequently, since in late September 2008 that list was dated 27-Mar-2008. Moreover, occasionally objects in NED are attached to bibliographic references for which NED decided to create a specific nomenclature (acronym) for an object. This latter problem does not arise for the kind of study I describe in the following paragraph.
For a more quantitative study, I did a NED search "by refcode" for all articles for which I had collected tables, so as to obtain the number of NED objects per article, and compared this with the full catalog size. (This is not always reasonable, as some tables give several records per source, etc.) Moreover, some of the largest datasets (e.g. NVSS, or the various data releases of SDSS) have a different refcode in NED than their publication refcode. Despite all these limitations, my results are summarized in Table 3. Note that I did not limit the catalogs to a minimum size, as NED should recognize a refcode with any small number of objects, provided they are extragalactic. One should also note that the presence of a certain catalog in NED does not mean that the user is able to retrieve the entire catalog's data content from NED. It basically means that NED has made a link between all catalog objects (or a fraction thereof) and its bibliographic reference. Catalog browsers generally offer the content of all catalog columns.
Given the above-mentioned limitations, these numbers are only indicative but suggest a more significant lack of radio source data. Indeed, a few of the largest recent radio surveys, such as WENSS and WISH, Miyun, 7C, UTR, VLSS, and SUMSS, were not (or almost not) represented in NED. This is where systems such as CATS (Verkhodanov, Trushkin, Andernach, & Chernenkov, 2008) complement the databases, offering more data, albeit leaving the cross-identification work to the user.
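The comparison described above amounts to a per-article ratio of objects found in the database to records in the collected table. A minimal Python sketch of this bookkeeping is given below; the input dictionaries, the function names, and the 90-percent threshold for calling an article "complete" are all assumptions made for illustration and are not taken from the study.

```python
def coverage_by_refcode(catalog_sizes, db_counts):
    """Return {refcode: fraction of table records found in the database}.

    catalog_sizes maps refcode -> number of records in the collected table;
    db_counts maps refcode -> number of database objects linked to it.
    Refcodes absent from db_counts are treated as having no objects.
    """
    coverage = {}
    for refcode, size in catalog_sizes.items():
        if size <= 0:
            continue  # skip empty tables rather than divide by zero
        found = db_counts.get(refcode, 0)
        # Cap at 1.0: the database may hold more objects than table rows.
        coverage[refcode] = min(1.0, found / size)
    return coverage


def summarize(coverage, complete_at=0.9):
    """Count articles that are absent, partially or (nearly) fully covered."""
    summary = {"absent": 0, "partial": 0, "complete": 0}
    for fraction in coverage.values():
        if fraction == 0:
            summary["absent"] += 1
        elif fraction >= complete_at:
            summary["complete"] += 1
        else:
            summary["partial"] += 1
    return summary
```

As noted in the text, such ratios are only indicative: tables with several records per source lower the ratio artificially, and datasets stored under a different refcode appear as absent.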

CONCLUSIONS
The teams of CDS and NED are doing their best to cope with the avalanche of data, but human resources do not suffice to achieve completeness above about the 50-percent level. Bigger catalogs are preferentially incorporated, but it is the medium- and smaller-sized catalogs that offer wavebands or observing epochs different from the bigger ones and are thus important for more complete multi-wavelength research. Based on a systematic search in the major astronomical catalog browsers and databases for the presence of data from 2600 articles containing data on radio sources and extragalactic objects, and for which I collected the electronic data tables, I come to the following conclusions.
Fewer than 70 percent of currently published journal tables with more than 50 entries routinely become accessible through catalog browsers (VizieR, CATS). We are definitely far from an "automatic data flow" from current electronic journals towards the data centers. This should not be left to the willingness of the authors, who tend to be under severe publication pressure and many of whom lack the time to work on making their tables available in an appropriate form.
The completeness of databases such as NED, SIMBAD, and LEDA is more difficult to assess and quantify, but the present study reveals significant shortages, at least for NED and especially for radio source data.
Catalog browsers are easier to maintain or upgrade than databases such as NED since they avoid the tedious task of cross-identifications, and they can thus offer valuable data more rapidly to the interested (and knowledgeable) user.
Current data centers are overloaded with the rate of data published in journals. For the future Virtual Observatory, it is necessary to dedicate more emphasis (a) to recover tables from the "pre-electronic age" before about 1995 and (b) to prepare adequate metadata for these tables, e.g. those collected by the present author, to make them searchable.
We need more collaboration between authors, referees, journals, and the data centers to make more data available immediately after publication and safeguard them for the future.
The author is (and has always been) ready to provide his table collection to data centers and database managers (who would only need to write the metadata). For recovery of further tables the "heritage" of the LEDA team, which is suffering from a lack of long-term support but has scanned a vast amount of the pre-electronic literature on galaxies, should be salvaged.

ACKNOWLEDGMENTS
I am grateful to various students and secretaries who in the course of time have helped in recovering or proofreading journal tables via OCR. Many tables were recovered also from the OCR offered at ADS, albeit with heavy reformatting and correction work. Thanks to F. Ochsenbein for providing me with an up-to-date list of VizieR catalogs and to N. Aguilera Navarrete for converting this paper into a Word document. I acknowledge support from grant 81356 from CONACyT of Mexico and the hospitality of the Emmy-Noether Research Group of T. Reiprich at AIfA, Univ. of Bonn, Germany, where partial support from the Transregional Collaborative Research Centre TRR33 "The Dark Universe" was received. I am grateful to an anonymous referee for comments that helped clarify parts of this paper.