Most of the data from ground-based observatories, and virtually all data from observatories in space, are now carefully collected and preserved for posterity. Most of these archives are at least partly accessible on-line, but the great diversity of user-interfaces and data management systems limits their use, and attempting to use information from two or more archives together can be exceedingly difficult. Many people have realized that some form of integrated access is desirable, and that, if powerful data mining facilities were also provided, the scientific gains could be enormous. The name given to this vision, the virtual observatory, seems to me to be a little too grandiose, but the term seems to have stuck.
A number of collaborations have been formed in support of this aim, the three main ones being:
It is pleasing to report that, at least in Europe, the funding agencies are starting to respond. The UK's Astrogrid project has been approved in principle by the Particle Physics and Astronomy Research Council; we expect to get full approval shortly when a detailed plan of work has been agreed. We in Astrogrid are working closely with the AVO project, which also has been successful in its bid for funding from the European Union; we look forward to similarly fruitful collaborations with the NVO. Many cross-links are already in place.
This paper gives a personal view of some of the problems that need to be solved before the virtual observatory can become a reality, although only a few of these are likely to be tackled in the initial phases of Astrogrid. Since the rest of this paper mainly lists the difficulties we face, it is worth emphasizing a couple of positive factors:
One driving force behind the UK's interest in this area is our involvement in a number of new observatories which are providing, or will soon provide, an unprecedented volume of astronomical data. This includes:
Astronomical archives are also growing at unprecedented rates in many other leading centres, for example the VLT is adding many terabytes a year to the ESO archive.
Sometimes even the most basic questions can be hard to answer, for example: finding an observation of a particular object in a given waveband, or a list of all sources found within a small circle around a certain RA and Dec, or whether a source of interest was observed by anyone between certain dates.
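The second of these questions, listing all sources within a small circle around a given RA and Dec, is often called a cone search. The following is a minimal sketch of the idea, assuming positions are already held in decimal degrees; the three-row catalogue and the function names are invented for illustration and do not correspond to any particular archive.

```python
import math

def angular_separation_deg(ra1, dec1, ra2, dec2):
    """Great-circle separation in degrees between two positions in decimal degrees."""
    ra1, dec1, ra2, dec2 = map(math.radians, (ra1, dec1, ra2, dec2))
    # Haversine formula: well behaved for the small separations typical of cone searches.
    sin_ddec = math.sin((dec2 - dec1) / 2.0)
    sin_dra = math.sin((ra2 - ra1) / 2.0)
    h = sin_ddec ** 2 + math.cos(dec1) * math.cos(dec2) * sin_dra ** 2
    return math.degrees(2.0 * math.asin(min(1.0, math.sqrt(h))))

def cone_search(sources, ra0, dec0, radius_deg):
    """Return the sources lying within radius_deg of the position (ra0, dec0)."""
    return [s for s in sources
            if angular_separation_deg(s["ra"], s["dec"], ra0, dec0) <= radius_deg]

# Hypothetical catalogue, for illustration only.
catalogue = [
    {"name": "A", "ra": 10.000, "dec": 20.000},
    {"name": "B", "ra": 10.010, "dec": 20.005},
    {"name": "C", "ra": 190.000, "dec": -20.000},
]
hits = cone_search(catalogue, 10.0, 20.0, 0.05)
print([s["name"] for s in hits])
```

This version scans every row, which is adequate for a small table; a catalogue of realistic size would presumably need some form of spatial index instead.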
Ideally we would have a distributed database system somewhat along the lines of the Domain Naming System (DNS) which underlies the Internet, but perhaps it is still too early to decide exactly what properties such a system should have. A metadata description system, involving XML, will almost certainly form part of the eventual solution.
One might have thought that the number of astronomical data archive sites in the world was still so small that a list of them all could easily be compiled by hand, but this is already becoming difficult. The ASTROBROWSE facility at GSFC has probably the most comprehensive list of on-line resources, and it can be used to send out a query to over one hundred different source catalogues at a couple of dozen different sites (the number of resources rises to the thousands if you count each separate catalogue in systems like VIZIER).
A number of important data resources are already located at two or more sites, which is a valuable safeguard against data loss, and also improves data accessibility. The GLU[1] system, invented at CDS, is a global naming service which generates URLs on the fly; this avoids a number of problems associated with hard-coded URLs.
Another obstacle to data access is the fact that many data archives, especially many of those holding ground-based observations, still keep most of their data off-line using a medium such as magnetic tape. For these, the normal access method is to search an on-line log of observations, and then e-mail a request for the relevant tapes to be posted to you. You have to hope that the tapes will not get lost in the post, that they will be compatible with your hardware, and that the data format will be compatible with your software.
Technical progress in disc storage, and market forces, may be coming to our aid. The capacities of tape cartridges (such as DLT or 8mm) are struggling to keep up with sizes of hard discs, and prices are only falling slowly: currently it costs nearly $1 per gigabyte for the tape alone, without counting the cost of a tape deck or auto-changer. On the other hand IDE discs already cost less than $5 per gigabyte, with prices continuing to fall by half every 18 months. It is not hard to see that within a couple of years disc will be the medium of choice not only on grounds of vastly greater access speed, but on grounds of cost, simplicity, and reliability as well.
One has to bear in mind, however, the rapid obsolescence of disc drives and interfaces: the hard disc is a good data archive medium only if you are prepared to copy your entire data collection to a new set of discs every few years.
Optical storage has looked attractive for some years. The CD-ROM format was standardized around 1988, and has been used extensively for astronomical data exchange and storage since CD-R blanks reached an affordable price. Their use for archiving has been limited because of doubts as to the longevity of recorded discs, and also because each disc only stores 0.6 GB of data. The recordable DVD stores around ten times as much, but the medium has few users, because there are three competing formats, and prices of data recorders and the blank discs are still rather high. Other forms of optical disc storage such as large-format WORM discs have remained expensive, as none of them has found a mass market.
My conclusion is that there is at present no really satisfactory medium for long-term data archiving.
Another storage problem may arise from the lack of scalability of many current filing systems. Since the early eighties (at least) the default integer has been 32-bits long (usually 31 bits with a sign). Sizes of discs went through the 2 GB (2^31 byte) limit some years ago, but many operating systems still cannot handle files more than 2 GB in size, including some of those in most widespread use such as the common versions of Linux and Windows. Even on 64-bit aware operating systems running on 64-bit cpus, such as Solaris on Sparc, the support for ``long files'' is patchy: one needs to get a complete set of utilities, compilers, and applications which support 64-bit addressing before all the problems of accessing files of more than 2 GB in size go away.
It is worth noting that the CFITSIO library from William Pence and colleagues at GSFC has been successfully compiled in 64-bit mode on Sparc/Solaris, but even here there are limitations: the number of rows in a table cannot exceed 2^31. Several astronomical tables, such as the USNO-A2 catalogue, are already within a factor of four of that limit.
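The origin of these limits can be shown with a short sketch. It uses Python's standard struct module to mimic a signed 32-bit field; the helper name is hypothetical, and the code says nothing about how any particular filing system or library stores its offsets.

```python
import struct

INT32_MAX = 2 ** 31 - 1  # largest value a signed 32-bit integer can hold

def fits_in_int32(value):
    """Return True if value can be stored in a signed 32-bit integer."""
    try:
        struct.pack("<i", value)  # "<i" is a standard-size (4-byte) signed integer
        return True
    except struct.error:
        return False

two_gb = 2 ** 31  # the 2 GB boundary, expressed in bytes

print(INT32_MAX)                  # 2147483647
print(fits_in_int32(two_gb - 1))  # True: the last offset below the limit
print(fits_in_int32(two_gb))      # False: one byte further cannot be addressed
print(two_gb // 4)                # 536870912
```

The last figure shows what "within a factor of four" means for a row count: a table of more than about 537 million rows.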
Some of us are old enough to remember the difficulties of the 16-bit to 32-bit transition nearly 20 years ago, and we can only hope that things are less painful this time.
The FITS format has become the de-facto standard for astronomical archives, because it is reasonably efficient, supports both tables and images, allows an unlimited amount of metadata to be attached to each dataset, and is described to the bit level in the primary literature. But FITS is not without its limitations. For example:
The most obvious alternative to FITS is XML (extensible markup language) which, if all the news stories are to be believed, is being widely adopted in the commercial world. But XML is just a framework, and astronomers need to agree on a common set of metadata tags and attributes before it can be used for data interchange. There are several groups currently working on metadata standards such as document type descriptions (DTDs) for astronomical data.
One serious limitation is that XML is text-based: there is no standard way of encapsulating binary information, whereas all astronomical images and most FITS tables use raw binary numbers. A number of solutions have been proposed, including:
A number of current projects are studying astronomical use of XML in detail, see for example astrores[2], FITSML[3], and XSIL[4] (Extensible Scientific Interchange Language).
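To illustrate the problem of carrying binary data inside a text-based format, the sketch below uses base64 encoding, which is one widely used general technique. Whether it corresponds to any of the proposals mentioned above is an assumption, and the element and attribute names are invented for the example rather than taken from astrores, FITSML, or XSIL.

```python
import base64
import struct
import xml.etree.ElementTree as ET

pixels = [1.5, -2.25, 3.0, 1024.0]  # made-up pixel values

# Pack the values as big-endian 32-bit floats, the byte order used by FITS.
raw = struct.pack(">%df" % len(pixels), *pixels)

# Hypothetical element and attribute names, not part of any standard.
root = ET.Element("image", {"encoding": "base64", "datatype": "float32"})
root.text = base64.b64encode(raw).decode("ascii")
document = ET.tostring(root, encoding="unicode")
print(document)

# Round trip: parse the XML text and recover the original numbers.
parsed = ET.fromstring(document)
decoded = struct.unpack(">%df" % len(pixels), base64.b64decode(parsed.text))
print(decoded)
print(len(raw), len(parsed.text))  # binary size versus encoded size
```

The price of this approach is size: base64 represents every three bytes as four characters, so the encoded data are about a third larger than the raw binary.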
The FITS standard has widespread acceptance, but as far as metadata is concerned it is a rather minimal standard. The high-energy community has made somewhat greater progress with the so-called OGIP recommendations[5], but even here there is some room for differences in interpretation. The XMM-Newton and Chandra projects have both tried to conform closely to OGIP conventions, but anyone trying to analyse data from both missions is bound to find a number of small but annoying incompatibilities.
There seem to be two ways of getting data interchange standards established. The formal way is to set up standard committees and hold international consultations. This usually leads to extremely slow progress, as we see from the standardization of programming languages like Fortran and C; the astronomical community is no faster: the FITS World Coordinates System (WCS) proposal has been under discussion for something like a decade. The alternative is for a single group, or better two collaborating groups, to set up an informal ad-hoc ``standard'', and then start using it. With luck others will join in. The original FITS standard was created this way, and a number of subsequent FITS conventions have followed this route. For a recent example see the Astronomical Server URL convention[6] invented at Strasbourg and now used at several other sites.
If we want the virtual observatory to be set up in just a few years, we have to find a streamlined way of getting widespread agreement on a whole range of data interchange standards.
It makes no sense for us to write new software to handle data archives if there are existing packages, commercial or freeware, which satisfy our needs. This section examines the software categories most likely to be useful; these comments are very preliminary ones, and the Astrogrid Project is planning to assess and evaluate some of these software packages in more detail.
RDBMS have been around long enough to be large, stable, efficient packages. They excel at handling tabular data, which forms the most important part of any astronomical data archive (or is perhaps second only to the image). Unfortunately there are a number of problems:
Several leading RDBMS such as Oracle and DB2 now claim to be ``object-relational'', although the term does not seem to be clearly defined. In practice these systems support complex data structures, and usually have some multi-dimensional indexing, which makes them potentially of greater interest to astronomers. The leading open source package, PostgreSQL, for example, supports R-tree indexing. Experiments with it using an astronomical source catalogue[7] gave promising results, but I have seen other reports of relatively poor performance. More evaluations are obviously needed.
The typical OODBMS usually has a quite different structure: they aim to provide persistent storage to programs written in an object-oriented language such as C++, Smalltalk, or Python. This gives them unrivaled power to handle data structures as complex as astronomers might want. Unfortunately OODBMS have a very limited market share, and are mostly the product of small vulnerable companies. Since there are practically no standards for OODBMS in areas such as schema definition, language interfaces, or query language, users tend to be locked into the product they have chosen, with dire results if that product can no longer be used. We learned this to our cost when, several years ago, the XMM-Newton Survey Science Centre project adopted the O2 database. Unfortunately, O2 was taken over by Unidata, which was taken over by Ardent Software, which was taken over by Informix, most of which has just been taken over by IBM. At some point O2 no longer seemed to be a prime product, and development ceased, and software support will soon end for us as well.
Only a few years ago, the object database was ``the next big thing'', but OODBMS have failed to gain significant market share, and now appear to be used only in small and specialized roles.
A few products have been announced which claim to have XML as their native data format: these must be very similar in structure to object-oriented DBMS. There are obvious difficulties in using an XML file as a database on anything other than a rather small scale, and I have seen no independent assessments of their properties or performance.
There are many of these on the market, some of them dating back to the days of the mainframe[8]. Leading products include: SAS, MINITAB, BMDP, SPSS, S-PLUS, and MATLAB, but there are many others, since the category overlaps somewhat with visualisation packages, and with high-level data manipulation languages such as IDL. Many packages are oriented towards particular branches of science or social science, but only IDL seems to have considered it worth supporting astronomical functionality to a significant extent. And while many of them provide a wide range of functions for data mining, they are designed for datasets of megabyte size, not for the very large tables which astronomers now encounter, and it is not clear that any of them is sufficiently scalable.
The data warehouse is a fairly recent development, aimed at the large company which wants to do data mining on, for example, the preferences of its customers. One might expect that these would be well-matched to the requirements of astronomy, but the few assessments which have been published so far (e.g. [9]) suggest that they are not. One might guess that these packages are very much oriented towards finding weak and hidden correlations, and much less towards finding outliers and rare events.
Almost every package considered above has its own proprietary data format. This means that datasets have to be specially ingested: often a slow and tedious process. It also means that archive data needs to be stored twice, as FITS remains unrivaled as an open data format, described to the bit level in the primary astronomical literature. For a long-term archive, no proprietary format would be suitable.
Current on-line data archives show a wide variety of user interfaces, even those constrained by the limitations of HTML and the Common Gateway Interface (CGI). Even if RAs are nearly all requested in sexagesimal hours-minutes-seconds, sometimes these are separated by space, sometimes by colons, and sometimes by commas. A circle radius may have to be given in degrees or arc-minutes or arc-seconds, and date/time formats have even more varieties. Promising developments include things like Astrobrowse which provide a single user-interface to a wide variety of on-line resources. Most current systems assume the user will only want to specify one query at a time, but users will often want to search each database for a whole list of objects: only a few systems support this.
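A tolerant input layer could hide much of this variety from the user. The following is a minimal sketch, assuming only the three separators and three radius units mentioned above; the function names and unit labels are hypothetical, and a real interface would need to handle many more cases (signs, missing fields, decimal hours, and so on).

```python
import re

def parse_ra_hms(text):
    """Convert a sexagesimal RA string (hours, minutes, seconds) to decimal degrees.

    Accepts spaces, colons, or commas (or a mixture) as field separators.
    """
    fields = [f for f in re.split(r"[\s:,]+", text.strip()) if f]
    if len(fields) != 3:
        raise ValueError("expected hours, minutes and seconds: %r" % text)
    hours, minutes, seconds = (float(f) for f in fields)
    if not (0 <= hours < 24 and 0 <= minutes < 60 and 0 <= seconds < 60):
        raise ValueError("field out of range: %r" % text)
    # One hour of right ascension corresponds to 15 degrees.
    return 15.0 * (hours + minutes / 60.0 + seconds / 3600.0)

def radius_to_degrees(value, unit):
    """Convert a search radius to degrees; the unit labels are illustrative."""
    per_degree = {"deg": 1.0, "arcmin": 60.0, "arcsec": 3600.0}
    return value / per_degree[unit]

for ra_text in ("12 30 00", "12:30:00", "12,30,00"):
    print(ra_text, "->", parse_ra_hms(ra_text))

print(radius_to_degrees(30, "arcmin"))
```

All three spellings of the same right ascension yield the same value in degrees, so the rest of the query system need only deal with one representation.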
Unfortunately it is much harder to get uniformity in the output, and user requirements obviously vary widely. Whether users want their results with celestial positions given as sexagesimal degrees-minutes-seconds or decimal radians depends on whether they want to inspect them on the screen, or download the results for subsequent processing. Ideally there should be a wide range of upload and download formats, for example: HTML tables, FITS tables, LaTeX tables, CSV (character-separated value) files, and perhaps the internal formats used by the more common PC spreadsheet and database packages.
There do not seem to be any comprehensive statements of what is meant by data mining in astronomy, but a recent ADASS paper[10] provides a good summary. It seems to me that the following functionality will be important, nearly all of which have been found valuable in the past:
Examples from the history of astronomy include: the discovery of the HR diagram, and the Hubble recession law.
Examples include the discovery of pulsars while studying interplanetary scintillation; finding quasars as stars with strange spectra.
Examples include pulsar searches; SETI@home.
Examples include finding novae, SNRs, or minor planets using a blink photometer.
Examples include classifying galaxies by morphology.
In the commercial world, it is said that 20 to 30% of the time spent in data mining has to be devoted to understanding the characteristics of the data, and another 50 to 70% is spent on data cleaning. These figures may well apply to astronomical data: especially when searching for rare events or objects, since the presence of a small proportion of spurious values (for example from cosmic rays hitting the CCD) may overwhelm the effect being sought. The need to understand the data characteristics in great detail is the best argument for continuing to site data repositories in close proximity to the experts who built the instrumentation and supervised the collection of the data.
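The effect of a single spurious value, and one simple way of guarding against it, can be sketched as follows. The readings are invented, and the median-based rejection shown here is only one of many possible cleaning techniques, not a method proposed in this paper; the threshold of five robust standard deviations is likewise an arbitrary assumption.

```python
import statistics

def reject_outliers(values, threshold=5.0):
    """Drop values lying more than `threshold` robust sigmas from the median."""
    med = statistics.median(values)
    # Median absolute deviation, scaled to match a Gaussian standard deviation.
    mad = statistics.median(abs(v - med) for v in values)
    sigma = 1.4826 * mad
    if sigma == 0:
        return list(values)  # no spread to judge against; keep everything
    return [v for v in values if abs(v - med) <= threshold * sigma]

# Invented repeat readings of one pixel; one exposure suffered a cosmic-ray hit.
pixel_readings = [100.2, 99.8, 100.1, 99.9, 100.0, 5000.0, 100.3]
clean = reject_outliers(pixel_readings)

print(statistics.mean(pixel_readings))  # dominated by the single spurious value
print(statistics.mean(clean))           # close to the true level of about 100
```

The median and the median absolute deviation are used because, unlike the mean and standard deviation, they are barely affected by the very values one is trying to detect.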
Transactional databases are commonplace in the commercial world, but for data mining it is generally necessary to construct a separate data warehouse. This contains essentially static data, carefully cleaned, all in compatible form, and in a format optimised for reading. It may be that the most demanding forms of astronomical data mining will require the data to be transferred to a separate data warehouse.
Many astronomical data collections have now reached a terabyte or more in size. The typical speed of a local area network connection is 10 MB/sec: reading a terabyte over such a link takes just over one day. There are several ways of obtaining better performance, for example:
Multi-processor systems come in a number of distinct configurations. Symmetric Multi-processor (SMP) systems, as used on many modern supercomputers, may not be the right way to go: the cost of an N-processor SMP system seems to be proportional to N^2, while the processing power only rises as something like N^0.9 because of the overheads of interprocessor communication. A cheaper and more scalable route is to use commodity or Beowulf clusters: a loose assembly of machines, each with its own memory and disc, communicating over fast Ethernet. Astronomers are starting to use such clusters, but more work is needed to develop suitable algorithms. Many data mining operations which involve trawling serially through a large database would seem to be very suitable for commodity clusters. What can be achieved is illustrated by the fact that commodity clusters of this type support well known internet services such as the Google search engine, with some 5000 cpus, and Hotmail, reputed to have 8000 cpus.
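The consequence of the SMP scaling quoted above can be made explicit with a little arithmetic. The sketch below takes the two exponents at face value (cost growing as the square of the number of processors, power as roughly the 0.9 power), which is only a rough rule of thumb, and works in units of a single-processor system.

```python
def smp_cost_per_unit_power(n, cost_exponent=2.0, power_exponent=0.9):
    """Cost per unit of processing power of an n-processor SMP system,
    relative to a single processor, under the assumed power-law scalings."""
    cost = n ** cost_exponent
    power = n ** power_exponent
    return cost / power

for n in (1, 4, 16, 64):
    print(n, round(smp_cost_per_unit_power(n), 1))
```

Under these assumptions the cost of each unit of processing power grows as the 1.1 power of the number of processors, so every increase in the size of an SMP system makes its computing more expensive per unit, which is the motivation for looking at loosely coupled clusters instead.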
In commercial data mining, great care is taken to get all the required datasets on to the same machine. In astronomy we expect some of the most interesting results to come from relating disparate datasets, gathered from widely separated sites. Fortunately those datasets most likely to be involved in the most complex and data-intensive operations are source-lists, which are at present of relatively modest size, just a few gigabytes. These could be copied over the network in bulk fairly cheaply. The datasets occupying terabytes tend to be the collections of image data: in such cases it is usually only necessary to mine rather small parts of the sky, in which case the relatively small image sections are also small enough to be copied across the network so that they can all be available on the same machine. But there will, no doubt, be cases in which it is not easy to copy across to one location all of the datasets that one wishes to mine. In such cases the facilities of the computational grid and data grid, which are under construction by computer scientists and network experts around the world, will surely be invaluable.
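The difference between copying a source-list and copying an image collection is easily quantified. This sketch uses the typical local area network speed of 10 MB/sec quoted earlier, assumes decimal units (1 GB = 10^9 bytes), and takes 3 GB as an illustrative value for "a few gigabytes"; real wide-area transfer rates may of course be quite different.

```python
GB = 1e9   # bytes, decimal units assumed
TB = 1e12

def transfer_time_hours(size_bytes, rate_bytes_per_sec=10e6):
    """Time in hours to move size_bytes at a sustained rate (default 10 MB/sec)."""
    return size_bytes / rate_bytes_per_sec / 3600.0

source_list_minutes = transfer_time_hours(3 * GB) * 60.0
image_archive_hours = transfer_time_hours(1 * TB)

print(round(source_list_minutes, 1))   # 5.0 minutes for a 3 GB source-list
print(round(image_archive_hours, 1))   # 27.8 hours for a terabyte of images
```

A source-list can thus be moved in minutes, whereas a terabyte takes just over a day even if the link is kept fully occupied.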
This paper did not set out to reach any conclusions or find any solutions, but just to present some of the problems. It is clear that there are enough of these that the building of the virtual observatory will be a considerable challenge to all concerned.