Database Technology for AstroGrid:

a Discussion Document

Clive Page

2002 May 9

Note: this is an extensively revised version of the document originally dated 2002 April 12.

This document is also posted on the AstroGrid Wiki site, but the Wiki software mangles certain HTML constructs, so a plain HTML version has been placed beyond the reach of the Wiki on - this may be easier to read.

1. Introduction

The aim of this document is to put the database-related work of the AstroGrid project in context, identifying the main problem areas and reporting what  progress is being made.

It is obviously possible to do this only within the framework of the overall AstroGrid architecture. Unfortunately this is still under active discussion, indeed a new architecture task force was only set up in April. What follows, therefore, is based on my current assumptions and guesses as to the eventual overall design. If these assumptions turn out to be wrong, then some of the proposed work in the database area may have to be re-planned. Another disclaimer needs to be made: this document is based primarily on the needs of extra-solar astronomy, but my discussions so far with the solar physics and STP communities suggests that the database facilities described here will also satisfy their requirements.

The general aims of the AstroGrid Project are, as I understand them:

  1. To federate existing astronomical data archives so that they form a Virtual Observatory (VO).
  2. To provide data mining facilities so that astronomers can extract the full scientific value from archived data resources.
  3. To ensure that the VO infrastructure is scalable to the terabytes or petabytes expected from new projects, especially WFCAM and VISTA.

The Virtual Observatory is obviously a world-wide project, but its importance to the UK is that a number of important astronomical data archives are (or will be) located within AstroGrid consortium institutes. These include archives from the following instruments or projects: MERLIN, FIRST, WFCAM, VISTA, SuperCOSMOS, SDSS(EDR), 6dF, WFS, INT-WFC, 2dF, Chandra, XMM-Newton, SOHO, Yohkoh, Cluster, and EISCAT. Many of those in the UK responsible for producing the new data archives are expecting AstroGrid to give advice on the best technologies to adopt, and some projects centred outside the UK (for example AVO and Astrowise) are looking to AstroGrid to give them a technical lead in certain areas. I think that database management is one of these areas. It does not seem unreasonable to expect that those running some of the of the existing archives to retro-fit with more modern technology, provided that we can demonstrate that this is easy to do, and clearly superior. At the very least we must encourage them to register their resources with the VO Resource Registry (explained later) and to adopt the emerging standards of the VO for their web interfaces (XML, SOAP, WSDL, VOTable, etc.).

2. Functional Requirements

Ideally users will interact with data archives by sending them questions and getting information back, but the best we shall be able to do for the foreseeable future is to allow users to send queries and get back data: the gap between these concepts will continue to require at least a little astronomical expertise. With a good VO design, however, it should be reasonably easy for astronomers to formulate the types of query that the VO will support, and the resulting data should contain information in a form which users find easy to interpret.

Although there are already a good many science-cases and use-cases on the AstroGrid Wiki pages, I feel sure that we are still some way from being able to condense these into a reasonably complete and coherent set of user requirements. Another attempt to get to grips with the variety of user requirements is set out in an appendix to the document entitled Data Mining the SDSS SkyServer Database by Jim Gray et al.: this lists some 35 different queries to the SDSS data archive (but many of these are, of course, rather specific to the data products coming from the Sloan sky survey).

In the following analysis I have tried to generalise the types of query likely to be posed, and them classified them into a number of distinct types. Real astronomical problems, of course, will usually involve queries of a number of different types, used in sequence. From a database perspective, therefore, I think that VO queries will fall into two principal categories:

2.1 Positional queries

Positional queries are a very important special case because many astronomers spend long periods of time studying one celestial object quite intensively, often over a range of wavelengths or frequencies. Such queries form a high proportion of all those sent to astronomical archives at present, but to this slightly over-estimates their importance because most data archives are set up to handle positional queries well, but currently handle other types of query much less efficiently or not at all. As the old saying goes: "If all you have is a hammer, then everything looks like a nail".

Positional queries are also special in database terms because it should be possible to handle them efficiently through the use of an index on celestial position. In theory the user of a database should not need to know which parameters have been indexed. In practice the difference in response time between a sequential scan of a billion-row table and an indexed look-up may be a difference between hours and milli-seconds, so indexing cannot be ignored as a mere implementation detail. Unfortunately constructing an efficient index of positions on a spherical surface is surprisingly difficult: more on this in section 7.

Positional queries can be further subdivided into four principal types, based on the main types of information stored in astronomical data archives:

  1. Queries of source catalogues, for example:
  2. Queries of image repositories, for example:
  3. Queries of observatory archives, for example:
  4. Bibliographical queries. These will be ignored here as they do not form part of the AstroGrid remit, and because services such as Simbad and NED already do the job extremely well.

2.1.1 Source catalogue queries

Source catalogues are essentially tabular datasets listing the positions, fluxes, and other properties of all the astronomical objects detected in some area of sky in some waveband. The larger ones are generally the end product of a systematic survey of large areas of the sky, others are produced by analysis of the sources detected serendipitously in images of the sky obtained for some other prime purpose. At present the several of the largest source catalogues are tables containing between 108 and 109 rows, and tables of over 109 rows can be expected soon. Some, but not all of these, have 50 or more columns of source attributes. This makes them the largest tables encountered in astronomy, and we plan to use them as test cases of the handling of tabular datasets.

2.1.2 Image queries

Collections of images of the sky are another important astronomical resource: the volume of data involved means that images are often searched primarily by machine, using algorithms developed by skilled and experienced astronomers, but the human eye and brain are unrivalled in their ability to detect anomalies in images, so astronomers continue to request images of the sky for their own inspection. The superiority of the eye over the computer algorithm is also seen in projects such as Google's Pigeon-ranking system. The total volume of data in an image archive may be large: for example if you sample the whole sky at 1 arc-second resolution with 4 bytes/pixel you get 2 terabytes. But the sky is typically covered by individual images or tiles, covering many square degrees, and many optical sky surveys used a plate size around 6 x 6 degrees, so only a few thousand of them (allowing for overlaps) are needed to cover the entire sky. Indexing a collection of a few thousand files is a relatively trivial problem compared to indexing source catalogues.

A more difficult decision is whether to store the images as separate files external to the DBMS package and merely store the file names (and perhaps associated metadata) within the database tables, or whether to store images as BLOBs (binary large objects) in the DBMS, since nearly all packages now support them. As far as I can tell, all of the large astronomical image collections have chosen the external storage option. One partial exception is the database set up by the XMM-Newton project for its processing pipeline and internal data archive: here we experimented with storing images and indeed whole FITS files as BLOBS in an object-oriented database (O2), but the overheads of importing and later exporting these images were found to be non-negligible, and the benefits of internal storage were relatively small. The compromise we adopted was to store within the DBMS mainly the metadata, so it was easily searchable, but all the information was also present in the external FITS files.

2.1.3 Observation log queries

Observation logs are kept by practically all observatories, whether on the Earth or in space, and are the main entry point for astronomers seeking to browse through the data products from each observation, such as spectra or light-curves, or those wishing to download the raw or semi-processed data files to do their own data reduction. Many telescopes have a fairly large field of view, up to a few degrees across, so that many objects of no interest to the original proposer are recorded as a by-product of the main investigation. It is very important to allow astronomers to discover such serendipitous sightings, which they can generally do by searching the observation log by reference to the celestial position in which they are interested. The shape of the effective field of view is not necessarily circular, so some complex trigonometry may be necessary to determine exactly which observations covered a given point in the sky. There are, obviously, lots of other selection parameters that users will want to specify, such as the instrument, mode, filters, time-resolution, and so on. But the scale of the sky indexing problem is again relatively modest: most telescopes only point at a handful of distinct celestial positions per 24 hours, so this means that a log file will only accumulate maybe 104 or 105 rows over the course of a decade.

A more difficult problem is designing a way of storing the collection of data files from each observation in a systematic or structured way. These files generally include the raw data (images in the case of many wavebands, complex visibility files from radio interferometers, photon event lists from X-ray telescopes), calibration files, reduced images, and derived files such as spectra and light-curves, as well as a range of auxiliary and telescope house-keeping data of many different forms. The overall structure of the data collection is often roughly hierarchical, with the collection divided into discrete observations by each instrument, each of these consists of separate exposures, perhaps further subdivided into frames, and so on.

Data analysis usually requires the appropriate set of calibration files to be provided, and one additional problem in the data from many instruments (both ground-based and in space) is that a set of calibration files will apply to a number of distinct observations, not necessarily consecutive ones. This is why the overall structure of the file collection is not a strict hierarchy, more of a network.

Fitting a collection of data such as this into the framework of a relational database is not straight-forward, and there have been at least a few attempts to use object-oriented databases, which on the face of it, seem more suitable. This topic is examined further in section 8.

2.2 Non-positional queries

Non-positional queries are those typically involving the study of a class of astronomical objects, perhaps by reference to their properties, i.e. by selecting on some combination of parameters. The more advanced types of query merge into the realm of data mining, some of which may require rather specialised software, but which may operate on datasets held in a DBMS:

3. Architecture in Outline

The architecture of the Virtual Observatory, or at least the AstroGrid part of it, is still under discussion, but the outlines are becoming clearer.

3.1 Web Services and Open Grid Services Architecture

We are working on the assumption is that we shall make use of the Web Services model of internet interactions, using wherever possible existing or emerging standards such as XML, SOAP, WSDL, and perhaps UDDI. Where computational grid facilities are needed, the Open Grid Services Architecture (OGSA) model will also be followed.

3.2 Main VO components

A number of essential elements in the design of the Virtual Observatory can now be identified:

3.3 Typical workflow

The way in which these elements will work together can be illustrated by an example of how a fairly simple query might be handled.

  1. The User specifies the URL of a convenient Data Portal, and uses its query interface to specify the results or processing required. The interface may be a simple forms interface, but (as noted later) we are likely to devise an astronomically-oriented query language to make it easier to specify the more advanced queries.
  2. The Data Portal analyses the query and sends it to the nearest VO Resource Registry. I expect that nearly all queries will result in a request for information from the Registry, but those which are very specific and name the data archive to use can perhaps be handled directly by the portal.
  3. The Resource Registry then replies with a list of the archive sites which may have relevant information: in order to avoid missing information this list will always err on the side of generosity.
  4. The Data Portal then interrogates each of these remote archives to find the protocols they can handle (using WSDL) and subsequently seeks information on the detailed contents of their datasets (table descriptions, column UCDs and other metadata).
  5. The Portal then formulates appropriate queries and send them encapsulated in SOAP where possible, but for compatibility with archives using non-VO interfaces, it may be necessary to handle queries using ASU or raw CGI form parameters for some time.
  6. These results will, if all the archives conform to VO standards for interfaces, be returned in some standard form, for example for small tabular datasets the VOTable format. But the results from different sites may have different formats, in which case the VO Data Portal should be able to assist by carrying out transformations or data reformatting steps to make the results easier for user to absorb.

The stages above assume that the results can be retrieved from queries executed in-place at the archive sites discovered by the Resource Registry. In the case of simple positional queries this may well be true, but operations which require substantial cpu power or I/O transfers, such as a join between two tables or a statistical query which requires a sequential scan of a large table may come up against the limited capabilities of existing sites. In such cases the action might be something like this:

  1. Portal uses registry information to find about all copies (clones, mirrors) of the required datasets, to see if any site holds both of them as well as the required compute resources.  If so it will use this site.
  2. If not, it will search for a site containing one of the required datasets and the computational resources, and see if the other(s) can be copied there.
  3. Failing this, it will be necessary to copy both (or all) datasets to an existing Data Warehouse site.
This is the most likely area in which Computational Grid and OGSA facilities will be required, for example GridFTP for data transfers, and the Grid authentication and authorisation mechanisms to allow remote users to consume expensive resources at data archive or warehouse sites.

The rather complex sequence of choice to be made suggests that somewhere in the system some intelligence needs to be located, perhaps at the VO Portal. Perhaps in the initial implementation of the VO the various choices should be presented to the user, who can exercise the necessary intelligence, and leave automation until later.

3.4 Where database technology is needed

Given the architecture outlined above, database technology is likely to be involved in the following areas

Sections 4 onwards cover a number of design issues and areas which present a particular technical challenge.

3.5 Other Platforms

It is important not to forget that astronomical computing is sharply different from that in most commercial environments: most of the world's computers run some version of Microsoft Windows, while most astronomical data analysis is done on some form of Unix. Linux is clearly the brand leader, mainly because it runs on cheap powerful hardware. However operating system differences are becoming less important, now that a great many packages can run in both worlds (if only by using the Cygwin middle-ware on a Windows box). But there are two trends worth noting: firstly the Java language and the J2EE environment are being promoted for their platform independence by a large number of computer companies outside the Microsoft camp, and secondly Microsoft's .NET environment promises a language-neutral common back-end, and a shareable library and versioning infrastructure which will put an end to DLL-hell. This looks so promising that the Mono project has started up with the aim of getting the most important parts of .NET running under Linux. I think this makes .NET worth watching.

3.6 Commercial versus Open Source Software

Arguments on the pros and cons of commercial and free software are not confined to database management packages, but for several reasons which will be obvious from the following list, are particularly acute here.  Some of the factors to be taken into account are:
Most of these factors suggest the use of open source software wherever possible,  and indeed this has been the general presumption through the VO projects and the UK's e-science programme.  Discussions have mainly been about the particular open source licence model we should adopt for the software that we develop.  On the other hand, for some database operations the functionality and performance of commercial products may be so much better that their use is essential.  To find out we need to evaluate some of these products.

4. Interoperability

Our primary aim of federating existing data archives depends on establishing interoperability standards.  The Opticon Workshop held in Strasbourg at the end of 2002 January covered this topic and was very useful; the wide representation there made it clear that interoperability is seen as a vital area of work within all three major VO projects (AstroGrid, AVO, and US-NVO). This looks very promising, but a great deal of work remains to be done. Members of the AstroGrid database team have participated actively in the continuing discussions.

I think we need to establish standards for interoperability in four main areas:

  1. Standards for queries sent to archive centres,
  2. Standards for sending the results back,
  3. Standards for metadata,
  4. Standards for resource discovery (covered in  section 5 of this document).

The Opticon Workshop concentrated on standards for tabular results, leading to version 1.0 of the VOTable proposal, produced recently. In order to reach agreement, the scope was deliberately limited: VOTable does not describe queries (after an early attempt proved unpopular) not does it attempt to cover images. It also uses a DTD, whereas the use of a schema is now regarded as better practice. Adding images to VOTable should be fairly straight-forward, but XML is a strictly text-based format, so binary data has to be encoded in some form, such as base64. The VOTable, and no doubt allied topics, will be the subject of discussions at the Garching VO Workshop in June.

4.1 XML

Astronomical datasets are usually stored in binary form, but have to be converted to text to be part of an XML document., which is now the basis for everyone's favourite data interchange format. Although it has many excellent qualities, there is no doubt that XML squanders bandwidth. And bandwidth is likely to be the scarcest resource in the Virtual Observatory. Major projects such as VISTA, even as far ahead as 2006, plan to get their data back from Chile not over a data link but by stuffing discs in padded bags and using a courier service.

Those who think that software bloat is the result of a conspiracy of the hardware and software vendors to try to get us to upgrade our computers every year may equally suspect that XML has been designed mainly to further the interests of the telecommunications companies. We need to explore ways of solving this problem.  I have had useful discussions with Martin Westhead of EPCC who is studying the encapsulation of binary data in XML, and Jim Grey of Microsoft has also pointed out work under way in this area. 

4.2 Querying Databases

It is a simple fact of life that all modern DBMS speak SQL (Structured Query Language, pronounced sequel by the database elite) or some variant of it. Alternative query languages such as QUEL and QBE have died out, and research into query languages seems dormant too. This would not matter so much if it were not for the fact that SQL is also the lowest level of access: indeed if you interact with a database from within a programming language using some standard database API you have to format an SQL statement in a text string and send it to DBMS. This is unfortunate because SQL is very poorly matched to the requirements of astronomy.

From these rather severe limitations, may of us have concluded that an Astronomical Query Language (AQL) needs to be devised. Work on this has not yet progressed far.  The Open GIS community is reported to be working a geographically-oriented query language, but the open parts of the website so far do not reveal any information on it.

4.3 Sending Queries over the Web

There is an existing ad hoc standard for sending astronomical queries to CGI-based servers, the Astronomical Server URL (ASU), which was designed by a group led by the CDS in about 1995. For that epoch it was extremely far-sighted, but there are some limitations. It really only covers positional queries, and even then only a few services around the world have implemented it fully. The AstroGLU software package, also invented by CDS, has the main function of coping with the variety of idiosyncratic interfaces based on HTTP forms and CGI parameters which current astronomical archives require: it converts a given query into a number of different forms, one per service. The AstroGLU system can be seen in action by using the Astrobrowse service at GSFC/HEASARC.

Those who attended the Web Services week at the NeSC in March nearly all seemed to agree that the Web Services paradigm had a lot of potential to make astronomical archives interoperable. It is clear that SOAP can be used to encapsulate queries and results, and that WSDL can be used to describe the Web Services available at a site. There are a number of standards and draft standards for specifying queries using XML, for example Xpath, XSLT, and Xquery. The latter two incorporate much of the SQL SELECT statement syntax in their specifications; the differences between them are mainly on how they access components of XML data structures.

The Database Task Force (DBTF) of the UK e-Science programme has been working on the encapsulation of database interfaces using XML and SOAP. The work is actually being carried out at EPCC (for XML-based databases) and IBM Hursley Laboratories (for relational DBMS) and beta releases of software suitable for any DBMS which has a JDBC interface are expected over the next few months.

4.4 Returning Results - VOTable

The VOTable proposal is designed to encapsulate tabular results, similar to those which might be stored in a FITS table, in XML. Version 1.0 has already been published, and work on reference implementations is under way.

A similar proposal will eventually be needed for the encapsulation of images or sections of images; work on this has not yet started, but should be able to follow the VOTable model.

4.5 Metadata

A number of useful standards exist, for example

Although these are a good start, a lot more work is likely to be needed in all these areas. Because this will need international agreement, this is likely to take some time.

4.6 Proposed Actions

Interoperability is clearly a subject area where more work is needed. The database team will try to make some proposals in the current quarter, but we shall be handicapped by the fact that none of us have any significant experience in XML or the query languages based on it. In Phase B it may be that AstroGrid (or some other VO project) needs to develop a standard software package which existing astronomical archives can adopt to make their interfaces compliant with Web Services, by using SOAP and WSDL.

5. Resource Discovery

Many types of query, especially positional ones, require a search of resources which may be located at many different data archive sites. At present such queries involve a considerable amount of expertise on the part of the user, as there is no central store of knowledge as to which types of data archive are stored where. And with new data archives coming on-line all the time, even experts may have difficulty keeping up with the field.

I think it is now generally agreed that some sort of resource discovery mechanism has to form a basic part of the Virtual Observatory, and that this requires some sort of permanent resource registry. The nearest thing that the astronomical community has to an existing resource discovery mechanism is the AstroGLU package which was written by CDS (Strasbourg). This package is used not only by the CDS services, but also by the Astrobrowse facility provided at HEASARC (NASA/GSFC). Given a position in the sky and (optionally) a selection of resources or types of resource to search, Astrobrowse will send out appropriate queries to a number of the most relevant data archives. At present there are few standards for such queries, and a major part of the AstroGLU package is the conversion of a simple celestial position to the set of CGI parameters required by the forms interfaces of each individual data archive service. There are even fewer standards for the responses, and here Astrobrowse simply provides the results in whatever form they turn up.

5.1 UDDI and RDF

The problem of locating relevant information on the almost limitless world wide web has attracted considerable attention in the computer science community, and a couple of proposed standards, both making use of XML, appear to be relevant.

First Universal Description, Discovery, and Integration (UDDI) is a W3C standard for describing web services and registering them in a central (but replicated) database. Unfortunately UDDI appears to be wholly oriented to the world of commerce, and even then some of the details are only relevant to companies registered for business in the USA. Some of us have looked at the specifications and concluded that for the astronomical community to use UDDI we would have to make considerable use of extensions to the standard, which would probably prevent us from using regular software packages designed to support the protocol. This more or less negates any advantages we might get from using a "standard".

The Resource Description Framework (RDF) is still only a W3C Recommendation, but it is designed to promote metadata interoperability, and it looks as if it is flexible enough to be useful. Unfortunately the current proposals only cover the description of metadata, and there seems no notion so far of having all the metadata stored in any kind of resource registry or database which can be searched.

5.2 Requirements for a Resource Registry

The basic requirements, I think are these:

One immediate question is the level of detail that the resource registry should contain, or its granularity. Let me set out first the two extreme cases:

One possibility, which needs to be explored more thoroughly, is that if we can get all the participating archives to use SOAP for their interfaces and WSDL for their service descriptions, then some form of robot could interrogate each archive site at intervals, perhaps each night, to discover the current services and rewrite or update the information held in the registry. Of course this might involve an archive answering more queries per day than if users were left to explore for themselves, so the system need careful design.

One interesting question has been raised by Bob Mann after reading the first version of this document: what about queries to observational archives. If a user wants to know whether a given point in the sky has ever been observed by a large telescope (with large defined in sensitivity terms, perhaps), then at present it is necessary to send a query to each site to search the observation log. If this is to be made efficient by centralising it in the Resource Registry, then it will have to contain all the relevant observation logs, together with a considerable amount of other information, such as the sensitivity as a function of filter and observing duration, the shape of the field of view for each instrument, etc. This makes the centralisation look somewhat intractable.  If a robot could be used, the only manual action would be the rather infrequent addition of the base URL of another astronomical archive site.

My current guesses as to the information which we must, and should store in the Resource Registry are as follows:

5.2.1 Essential registry information

5.2.2 Optional (desirable) registry information

One hopes that when such a worldwide VO registry is made available, existing data archive sites will be keen to register themselves in order to participate fully in the VO movement, which seems to have widespread support. There is clearly a benefit to be gained if a site registers itself with as many details as possible, since this will mean that the queries it gets from users will more often be relevant ones. It is also important to point out that however much detail is put into a central registry, there will always be details which will only be relevant to a particular archive holding, so there is still a need for archive sites to support WSDL or something very similar, so that they can automatically provide information about themselves to users and user agents such as VO portals.

The volume of information to be contained in the VO resource registry is not large, and since we obviously want to use XML-based protocols to get information in and out, an XML-based DBMS would seem to be a good choice. Dozens of them seem to be available now, quite a number of them as freeware. Alternatively many relational DBMS have XML interfaces. The external API of the registry can probably be quite simple: an create/update method is needed to load (new) information into the registry (here some authentication of the user is obviously needed), and a query method is needed to search the database for relevant records and retrieve them. This could include an archive site manager retrieving the whole of his site's registration for verification purposes. The requirement for graceful fail-over to alternative sites for resources with more than one copy on the web may be a difficult design issue: should the resource discovery software check that each URL it provides is a live one, or should the requesting agent have the responsibility for trying the initial URL, and coming back with a request for an alternative if the site initially suggested turns out not to respond? Probably the latter, but this clearly needs some thought, as it may complicate the API.

We need to get a discussion going among the various VO projects to see what consensus exists.

5.3 A speculative idea - the Unified Source Catalogue

Another, more tentative proposal, which may assist in resource discovery is that we could construct a unified source catalogue. This would gather in one narrow but long table information about every celestial object contained in all other source catalogues. (Actually we might want to exclude catalogues with very large error circles, such as some early X-ray catalogues). The unified catalogue would contain only a very few columns, for example:

Since there are already several source catalogues with around 500 million rows, the whole thing might have a few billion rows, but being slim, the total volume would not be huge. It would be useful in a number of types of positional query, especially those covering only a fairly small area of sky, as it would immediately indicate those primary catalogues which have to be searched. In many cases these would be few. It might also be a useful resource in its own right, to study cross-identifications and to plot sources on maps of the sky. This is only a tentative proposal, as I am not sure what the cost/benefit ratio is.

6. I/O-intensive Queries

Nearly all existing data archives are set up just to handle positional queries, as these can be handled relatively efficiently. There are just a few exceptions, for example: VizieR can execute the same selection expression on a number of catalogues; Astrobrowse and Querator can send the same positional query to a number of data archives around the world; while W3browse (at HEASARC) can do cross-matching between tables that it holds.

There are limits as to what can be achieved with remote archives for the majority of non-positional queries, as these will involve scanning a large table sequentially, or the use two or more tables resident at different sites. Many remote servers have limited computing power and they often impose limits on the resources which can be used, in terms of cpu-seconds, or thousands of rows, or megabytes of I/O. The limited network bandwidth also imposes practical limits on the volume of data that can be returned to the user. No doubt these restrictions will gradually be relaxed, assuming Moore's law continues to hold, but the end is nowhere in sight, but it is naive to assume that all remote servers will have effectively infinite capacity any time soon.

The most difficult cases to support will be those which require concurrent access to two (or more) tables, such as joins. These require a lot of traffic between the DBMS and the tables and index files, and are likely to be feasible only if both tables are on discs directly connected to the processor(s) on which the DBMS runs. It is not even currently advisable to attempt joins over a local area network. Wide-area links, even if they have high bandwidth, are likely to be inadequate because of latency, and the finite speed of light makes it hard to get around this limitation. The figures in Jim Gray's paper [op.cit.] are instructive, as they show how vital it is to use fast discs and high-performance disc controllers in an optimised configuration.

6.1 The Astronomical Data Warehouse

My conclusion is that to support a whole range of data-intensive queries astronomers will need machines on which all necessary datasets are held locally. Similar problems have arisen in the commercial world, where the solution has been to set up a data warehouses to hold all their data together with the necessary data mining software. I am convinced that we need to do something similar. The amount of disc storage needed to construct an astronomical data warehouse may not be all that massive: the datasets likely to be mined most intensively are likely to be source catalogues, and at present these only occupy a few tens of gigabytes each, a few terabytes is likely to be adequate for some time. It will be sensible to equip the data warehouse with all the most popular datasets permanently, with scratch areas for temporary importation of the more esoteric ones.

Given the relatively modest cost, I expect that a number of astronomical data warehouses will be set up, many of them are likely to be co-sited with important existing data archives, often with an AstroGrid portal and a resource registry.

The data warehouse user will need an allocation of scratch space, an area which has been termed MySpace. A number of questions arise from this concept, for example who owns the temporary datasets, and for how long are they retained. It is clear that the system needs an authentication and authorisation infrastructure to work (a subject area currently under study mainly under WP-A2).

6.2 Parallel Hardware and Queries

Many types of database operation appear to be easy to execute on loosely-coupled processors such as those in a Beowulf cluster: a large table such as a source catalogue could be split between the discs of different nodes, and each node could search one section of it. This would considerably speed up sequential scans of large datasets. We plan to install a small cluster at Leicester to evaluate this in practice. A cluster of perhaps four nodes should be sufficient for this purpose, but a number of much larger clusters are installed (primarily for other purposes) at a number of AstroGrid sites.

7. Indexing the Sky

The importance of indexing celestial coordinate pairs should be apparent from the arguments above. The area is quite technical, and I have drafted a separate document Indexing the Sky which contains all the gory details; what follows in this sub-section is a summary.

Unfortunately there is no index structure for more than one dimension which approaches the efficiency and stability of the B-tree used almost universally for one-dimensional indexing. The topic has been a very active research area in university computer science departments for many years, and many algorithms have been proposed. These fall into two main categories:

Because of the importance of 2-d indexing in profitable areas of commerce some of the major DBMS have add-ons to support it, some of each category. In very few cases, as far as I have been able to discover, are these indexes fully integrated into the query analyser and optimiser. This greatly limits their use. Another problem we have is that celestial-polar coordinates have unfortunate properties, with singularities at the poles, distorted scales everywhere away from the equator, and a wrap-around at zero RA/longitude. I do not know of any general-purpose multi-dimensional indexing systems which can cope with these.

The Postgres DBMS is unusual in that it has R-tree indexing built in, and perhaps handled sensibly by the query optimiser. Information about it is sketchy, and probably the only way to discover the truth is to install Postgres and try it out with some test datasets.  Oracle and Informix also support R-trees, and the former may have support for spherical-polar coordinates, so deserves to be evaluated.

Mapping functions turn out on more detailed investigation to be rather inefficient at turning range queries into linear queries; I think this is true even for planar indexing, but the are additional problems when used to index a sphere rather than a plane.

I have discovered what seems to be a feasible way of using mapping functions to search or join source catalogues efficiently, which I call the PCODE (pixel code) method. The drawback is that each table holding a source catalogue needs an additional column to hold the PCODE values, and also some additional near-duplicate rows (to hold additional PCODE values when an error-circle overlaps two or more pixels). The advantage is that these modified tables can be joined using a simple integer equi-join, which is handled efficiently by just about all relational DBMS. The details of the method are explained further in Indexing the Sky.

8. Storage of Complex Data Structures

Astronomical data as collected by observatories tends to consist of a large number of files of many different types with a complex relationship between them. It is not straight-forward to represent these complex relationships within a relational database, but object-oriented databases appear to be more suited to the job. In practice, however, things are not as clear cut. OODBMS are not much more than a method of providing persistent storage for an object-oriented programming language, there no effective standard for query language comparable to SQL, and their interfaces and schemas are non-portable. OODBMS have clearly not caught on in a big way in the real world, and the experience of the few projects to use them is also not encouraging. The XMM-Newton Survey Science Centre invested in O2 (a product originally from INRIA, the famous research institute in France) and we are still using it in Leicester, MSSL, and Strasbourg. But after a series of take-overs, O2 is no longer on sale or under active development, and future maintenance is somewhat uncertain. We are currently investigating a number of alternatives, the current leader being PostgreSQL. The Sloan Digital Sky Survey chose to use Objectivity/DB for similar reasons, but found a number of difficulties in practice, and is now in the process of switching to Microsoft SQL Server, a relational DBMS.

This is all rather unfortunate, because in principle object-oriented database management systems should be more readily adaptable to the complexities of data structures than relational ones. The class of DBMS called object-relational may provide an acceptable compromise. The term object-relational is not very well defined: they tend to be basically relational DBMS with the ability to handle user-defined (i.e. structured) data types. Most of the major commercial DBMS now claim to be object-relational, including Oracle and DB2.

Since these problems mainly arise only in designing archives for new projects such as WFCAM and VISTA, this is slightly lower on the priority list of AstroGrid at present. 

9. Interfacing the DBMS to Astronomical Software

Although it should be possible to support a reasonable range of data mining queries on a dedicated astronomical data warehouse, AstroGrid cannot possibly think of all the algorithms which astronomers might wish to carry out. There are two obvious solutions here:

I suspect that in practice there will be a demand both for an API for extracting data from the DBMS, and for a data conversion utility. This topic is also one of the less urgent, but needs to be kept in mind in our overall design.

10. Planned Activities

The planned activities under the database work package were outlined back in April in this WP-A4 description.