Clive Page

2002 May 22

Revision history:Date | Changes |
---|---|

2002 May 9 | Original version |

2002 May 22 | Added URLS for B-tree and R-tree explanations |

Note: a more general discussion of AstroGrid's needs for database technology is provided in this Database Status Report.

The most important feature of a Database Management System (DBMS) is,
arguably, that you can find and extract the record you want in a dataset
by specifying its *contents*, you do not need to know the record number,
disc block, or the order in which the records were inserted. But finding
the required record by sequential scan becomes very time-consuming with large
datasets, for example it may take a day or more to scan a terabyte of data.
An efficient index, however, should be able to find any record in two or
three disc seeks, say 30 ms. The art of using databases is largely the art
of avoiding unnecessary I/O, since a disc seek takes typically 6 to 10ms,
whereas memory can be accessed in less than 50ns, some 100,000 times faster.

In astronomy, the largest datasets are source catalogues, essentially lists of stars and other celestial objects giving their positions, magnitudes, and other properties. Several source catalogues in current use have over 500 million rows, and the largest ones will surely exceed a billion rows very soon. Collections of images and of raw data also occupy lots of disc space, but their indexing requirements are relatively modest: it typically takes only a few tens of thousands of images from a sky survey can cover the entire sky, so the index is quite small. This note will therefore concentrate on the indexing requirements of large source catalogues.

The example of a source catalogue which will be used in this document is the USNO-A2 catalogue, not the largest source catalogue, but fairly typical of the larger ones. USNO-A2 contains 526,280,881 rows, and effectively 6 columns. It was produced by the US Naval Observatory using a precision measuring machine to scan plates from the Palomar Observatory Sky and UK Schmidt sky surveys, with calibrations using the Tycho catalogue. It was distributed in a slightly compressed form on a set of 11 CDs.

Since a source catalogue is essentially just a big table, one might think that a relational DBMS, designed to handle tables, would provide all the necessary facilities. Unfortunately, the most common method of access to a source catalogue is by position on the sky, which for efficiency requires a two-dimensional index. This brings with it the following main problems:

- Spatial indexing is rarely built-in to DBMS, and algorithms for doing this efficiently are still a topic of active research in computer science.
- The sky is not flat but the surface of an imaginary sphere. Astronomers normally use spherical-polar coordinates (Right Ascension or RA corresponding to longitude in geographical terms, and Declination or DEC which is a celestial latitude). These have a distorted scale everywhere away from the celestial equator, with a singularity at each pole. Not only that, there is wrap-around problem since zero RA is adjacent to 360 degrees (24 hours, 2 pi radians, or whatever). Spatial indices designed for flat Cartesian space do not cope well with spherical-polar coordinates.
- An index is no use unless the query is simple enough that the query analyser can work out how to use the index properly, if not the DBMS engine will revert to a sequential scan.

The most common search criterion astronomers use with a source catalogue
is search for sources at or near a given celestial position (RA, DEC). Because
coordinates are always a little imprecise and represented by floating-point
numbers, one cannot normally search for an exact match, what one wants is
all sources with positions lying within a small error region around the specified
position. If, as is normally the case, the errors in RA and DEC are equal
and independent, then the error region is actually an error circle. This
type of search is often called *search in a cone* because if you think
in three dimensions, you have a cone stretching from the eye (or telescope)
out to the error-circle on the imaginary celestial sphere.

Another common requirement is for the cross-matching of sources in one
catalogue with those in another, for example from a telescope in a different
waveband, or at a different epoch. Interpreting the results requires some
astronomical expertise, but the basic database operation is that of a JOIN
between two tables, based on the positions in the two tables being coincident
within the sum of the radii of the error circles (defined to some chosen
probability level). This operation has various names such as great-circle
join (because one needs to use the great-circle distance between positions,
the simple cartesian distance being incorrect away from the equator), but
the term *fuzzy join* is in common use, and seems simplest. Note that
this has nothing to do with dodgy concepts such as fuzzy logic.

The fuzzy join is of considerable importance because cross-matching is the
basic operation which underpins a whole range of more advanced data mining
operations, on which the new science of the Virtual Observatory depends.
But join operations in databases are intrinsically expensive as they require
a large number of search operations, potentially one per row of the first
table of the two tables. In many astronomical applications the fact that
a source in one catalogue has no counterpart in the other is also of interest:
this requires the *left outer join *between tables . This requires the
first table to be scanned sequentially, since the table produced will have
at least one row for each row in the first input table, and an indexed look-up
in the second table for each row in the first. This obviously requires an
efficient spatial index to be available for the second input table, and for
the join operation to be capable of using that index. Because this involves
the sequential scan of one table (usually the smaller) and indexed access
to the other, the fuzzy join is representative of many astronomical queries.

It is surprisingly hard to obtain even basic information on the spatial indexing features supported by commercial DBMS, such as the spatial algorithms they use and the data types and operators supported. It is almost as if they are ashamed at their offerings. I think the following details are accurate, but they are gleaned from Usenet newsgroups and the more remote corners of the web-sites of various vendors (sometimes from disparaging remarks made by their competitors).

IBM's DB2 has an optional Spatial Data Extender produced jointly with a company called ESRI, a supplier of Geographical Information Systems. This is an additional cost option to DB2 and is said to be based on a 3-stage grid; whether this means it is based on the original grid-file of Nievergelt et al, I have been unable to discover. IBM's web-site says there is a white paper on the Spatial Data Extender, but the link was broken. I think I eventually found this paper, but it reveals practically nothing of its algorithm or properties. I do not know of any use in astronomical applications.

The Informix Universal Server has a Spatial Datablade Module, apparently
based on R-trees. A notable feature of Informix Dynamic Server is the *datablade*,
a chunk of low-level code which can implement a user-defined data type, and
indexing functions on it. Sufficiently expert users can write their own datablades
(though I think the code has to be approved by Informix), but some standard
ones are provided.

A project to investigate Informix datablades as a way of handling astronomical
catalogues was undertaken jointly by the University of Maryland Baltimore
County and NASA Goddard Space Flight Center; they used the USNO-A2 catalogue
(or rather a 1/3000th part of it) as their test dataset. Their experiences
are described in a collection
of papers. They compared the performance of a number of operations using
essentially four different options: simple B-trees on either RA or DEC, two
different spatial datablades, and a datablade of their own, called Simpleshape,
which used their own indexing using R-trees.

Based on their small sample, their estimate of the time taken to load and
index the whole USNO-A2 table are rather remarkable:

Schema | Load time (days) | Index time (days) | Table size (GB) | Index size (GB) |
---|---|---|---|---|

Relational+Btree | 3 | 6 | 13 | 56 |

Geodetic blade | 14 | 25 | 190 | 104 |

Shape2 blade | 10 | 15 | 140 | 77 |

Simpleshape | 1 | 1 | 15 | 20 |

These times may appear excessive, but the estimated time taken to create a B-tree index using the Geodetic datablade was no less than 94 days. The reduced load time of Simpleshape was because they developed their their own data loader (to overcome bugs and limitations of the Informix original) as well as their own indexing method. Their measurements of join included only a self-join of a table of 60,000 stars: this took 3072 seconds, i.e. around 20 joins per second. This does not seem especially fast to me.

The history of the product is complicated. The originator of Postgres,
Michael Stonebraker, took his product to market forming a company called
Illustra. This was taken over by Informix, but they apparently found
the product unsatisfactory in some way, and re-implemented the ideas in Informix
Dynamic Server (IDS), which included the datablade concept. Another version
of their product turned into Informix Extended Parallel Server (XPS). They
started project Arrowhead to re-integrate these two products, but this was
still not complete when the company was taken over by IBM. Reports
suggest that IBM expect Informix users to migrate gradually to DB2, but perhaps
only when the best features of both Informix servers are fully integrated
into DB2. This leaves the future of the Informix products somewhat
uncertain.

The Sloan digital sky survey query system, SX, was originally implemented
on Objectivity/DB, but has now been ported to SQL Server. The sky indexing
uses HTM, written by the astronomers involved. I do not know of any other
major astronomical application which has used this DBMS.

Mysql is an open source relational DBMS with a reputation for efficiency.
Its support for transactions has been criticised (though this is gradually
being improved) but for static datasets the product has a good reputation,
and is used by LEDAS and a number of other astronomical data archives. There
is no special support for spatial data.

Oracle8i has a Spatial Data Option: the only references on the open web that I could find suggested that it was based on something called the HHCODE (Helical Hyperspatial Code). It seems to use a space-filling curve (one reference says a Peano curve) used to map two dimensions to one, so that a standard B-tree can be used. The mapping technique is considered in detail section 6, where some fundamental limitations are noted. An Oracle white paper supplied by the company also says that an R-tree index can be used, and that it also supports geodetic coordinates (which seems to mean great-circle distance over spherical-polar coordinates). This suggests that we ought to include it in our evaluation. Although Oracle is used to host several astronomical data archives, I do not know of any using the Spatial Data Option.

The Department of Computer Science at Berkeley, which produced the original
Ingres, went on under the leadership of Michael Stonebraker to produce an
object-relational DBMS called Postgres. Recent versions support SQL and are
known as PostgreSQL. Although Postgres supports both B-tree and R-tree indices,
the latter are not well documented. Many books are available on Postgres,
but few even mention R-trees at all. The documentation which comes with the
software is also distinctly sketchy in this area. As far as I can make out,
the R-trees do support queries using the cartesian distance operator `<->
`(though whether this is the distance between rectangle centres or the
nearest points is not clear), and also an boolean overlap operator` &&`.
This is clearly something to evaluate.

Some tests of R-trees were reported in the paper Object-relational DBMS for Large Astronomical Catalogue Management by Baruffolo and Benacchio of Padova Observatory. They an astronomical catalogue of about one million rows as their test dataset. The search time for a B-tree index was around 7 seconds, but less than one second with an R-tree. A later paper by the same team reports their switch to using Informix Dynamic Server. Meanwhile Postgres is said to have improved in efficiency.

One problem noted by Baruffolo et al is very slow data loading and index generation, even on a mere million rows. Of course the operations need, in principle, only to be performed once. Another area of concern is the disc space overheads.

These overheads can be estimated from published details of Postgres. Again, I shall use USNO-A2 as the example. The Postgres R-tree is based on rectangular boxes (one specifies a coordinate pair for two opposite corners). To store this in a Postgres table with an R-tree index will expand the data from its original 6.3 GB to around 74.3 GB (based on reasonable assumptions on the packing efficiency of the R-tree nodes), an expansion factor of nearly twelve. Of course disc space is cheap, but the expansion factor also affects the data bandwidth when making retrievals from the database.

If the R-tree is not found to be satisfactory, then there is another possibility: the work of the Berkeley CS group has continued with Generalised Search Trees (GiST). These have been interfaced to Postgres, but not yet fully integrated (according to some newsgroup postings). I have been in touch with Oleg Bartunov at the Sternberg Astronomical Institute in Moscow who has made a number of interesting contributions to the Postgres newsgroups on the subject. His code can be downloaded, but it is written in C with hardly any comments, and the only documentation was in Russian. The indexing problem he was trying to solve is in fact different from ours, but he thought that with some work it might well be possible to use a generalised search tree for sky indexing.

Sybase has a spatial option involving external software, SQS from Autometric, a division of Boeing Corporation, but I could not find much information on it. Sybase is used in a number of astronomical archive sites, but as far as I know, none of them use a spatial index.

It occurred to me that vendors of GIS might have already solved some of these problems of indexing the sky, since the surface of the earth is also roughly spherical. I therefore spent a couple of days trying to find out about GIS, but eventually concluded that these systems do not have much to offer the astronomer. There are a couple of reasons for this. firstly most GIS applications only cover a rather limited area of the Earth, so users simply adopt a map projection which removes the worst distortions; in the polar regions for example one would simply use an orthographic projection centred on the pole. And the polar regions are so sparsely populated that they are of little importance in most GIS projects. Secondly, most GIS systems are much more concerned about the representation of linear features such as roads, pipelines, and political boundaries, and in converting between vector and raster representations of these, whereas linear features barely figure in astronomy. While 2-dimensional indexing seems to be a recognised problem in GIS, it did not seem to be of particular importance, and it was not clear to me that there was a standard solution. Most commercial GIS are built on top of some DBMS, and where spatial indexing is required, they just use the spatial data features of this DBMS.

To the best of my knowledge nearly all existing astronomical data archives make do with just a one-dimensional index of the sky, on either RA or DEC. This is probably because two-dimensional indexing is so rarely provided and difficult to use. Before moving on to two-dimensional indexing, therefore, I want to consider indexing by just one coordinate, and its limitations.

Perhaps the first question to consider is whether to index on RA or DEC. One factor to take into account is that if sources are uniformly distributed on the sky they will also be uniformly distributed in RA, whereas in declination the distribution will be far from uniform because the area of sky per declination strip has a cosine function. This would seem to give an RA index better selectivity. A more important factor, however, is the wrap-around at zero RA, which would require special code or duplication or rows to handle properly. The declination axis has no such problem, and for this reason, I think, most astronomical data archives and catalogue-handling systems have chosen to use a declination index.

To get efficient access to a one-dimensional table there are three obvious methods:

**Sorting**- is used by a few catalogue-handling systems invented by astronomers, probably on grounds of simplicity. As far as I know sorting is not used by any commercial DBMS to assist access, perhaps because it is only suitable for completely static tables. It does, of course, save on the disc space usually occupied by an index. But sorting is not actually very fast compared to using a B-tree, as finding a single row in a sorted table by bisection only halves the number of rows under consideration at each stage, so it requires about log_{2}N steps. A B-tree has a typical fan-out factor of 100, so only requires about log_{100}N steps. For USNO-A2, where N=5x10^{8}, this is a ratio of about 29:5, though in practice the first few stages are likely to access cached disc blocks, so it may be more like 24:3 in terms of disc seeks. Thus sorting may be nearly an order of magnitude slower than using a B-tree on a large table.**Hash tables**- can be very efficient, but can only handle exact matches so are generally only used on character string fields (or fields sparsely filled with integers). They cannot be used for range searches. For this reason they are of limited use in scientific applications and will not be considered further.- B-trees - can
handle not only integers and strings but also floating-point data types
and dates/times, and support range queries. The origin of the name is
in doubt: B may stand for balanced, or may come from the surname
of one of the inventors, Bayer and McCreight [
*Acta Informatica*, vol 1, pp173-189 (1972)]. B-trees support two main retrieval functions: one to seek a particular record by reference to its contents, and another to retrieve subsequent records in indexed order, which is what makes range queries feasible, since you just continue reading linked nodes until the upper limit is reached. The B-tree is quite an efficient data structure, with good random insertion and retrieval performance, and a guarantee that nodes below the root are always at least 50% full. For this reason B-trees are implemented by all serious relational DBMS, and indeed one of the principal technologies on which they are built.

Let us suppose we have a catalogue of celestial sources which includes
columns called `RA` and `DEC`, with the coordinates given in
degrees (radians might be more efficient, but the numbers in the examples
would need more decimal places which makes them harder to follow, even though
the RADIANS/DEGREES functions could be omitted).

This SQL query is designed to find all rows with positions within 0.01 degrees of RA 123.45 and DEC -45.67 in a table called CAT:

SELECT * FROM CAT WHERE DEGREES(ACOS(SIN(RADIANS(DEC)) * SIN(RADIANS(-45.67)) + COS(RADIANS(DEC)) * COS(RADIANS(-45.67)) * COS(RADIANS(RA-123.45)))) < 0.01 ;

It is based on the simplest formula for the great-circle distance between two points, where spherical-polar coordinates are used. Unfortunately this expression is complex enough that it is certain to baffle all known query analysers, so that the DBMS will use a sequential scan of the data.

Perhaps I should point out that I realise that this simple formula is not very good if the angles are small (as they will often be here) because at least one cosine will be very close to unity and the resulting accuracy will be very poor. The next simplest formula without this problem involves haversines, which are not in Standard SQL. The haversine and its inverse could be provided as user-defined functions, but these are not Standard SQL either. There may be a portable solution to this, but the question is irrelevant, since it is clear that this type of expression is of no use anyway, if an indexed solution is required.

When an SQL query is sent to a DBMS, it is parsed, analysed, and processed by a query optimiser, before the necessary sequence of operations is scheduled. A complex SQL query can involve quite a number of operations: selection, joining, sorting, grouping, etc. The query optimiser tries to work out the fastest way of getting the result, which will often involve using an index, if a suitable one exists. But for small tables a sequential scan may actually be fastest, and it is also possible to submit a query containing expressions so complex that the optimiser is baffled into doing a sequential scan. DBMS are very clever pieces of software, but they are tuned to the types of query most commonly encountered in the commercial world, which is where nearly all their revenue comes from. The examples above illustrated some of the problems.

Next let us try an alternative approach, searching not in a circle but in a square box enclosing the circle of half-width (and half-height) 0.01 degrees, which simplifies the syntax considerably:

SELECT * FROM CAT WHERE RA BETWEEN (123.45-0.01) AND (123.45+0.01) AND DEC BETWEEN (-45.67-0.01) AND (-45.67+0.01);

If an index is available for either RA or DEC then any competent query
optimiser will use it to speed up a query of this form. If there are indices
for *both* RA and DEC the optimiser has a choice. A good optimiser
will use whichever seems likely to result in faster execution, by estimating
the selectivity of each index from any information it has about the distribution
of values in each column. Suppose it chooses the RA index, it will use this
to select the subset of rows satisfying the condition on RA, and for each
of these also evaluate the condition on DEC. **But, because of the way that
indices work, there is no way to utilise both of them at once** Each index
just returns a list of records selected from the base table. In principle,
one might think, the combination of conditions could be handled by returning
a set of row numbers from the application of each index separately, and then
selecting only those rows which are present in both sets, that is: implement
the `AND` condition by a set-wise intersection. I do not know of any
DBMS which does anything like this at present, although I suspect that search
engines such as Google do it. But many of the details of commercial DBMS
are kept secret, so I could be wrong about this.

The square box obviously returns more rows than required, because its area is larger than the circle. The easiest way to get SQL to do the full selection is first to use an indexed selection on one of the coordinates, and then apply the great-circle formula to the resulting set of rows. Assume that we have an index on DEC, to avoid the complications of wrap-around, and then use the great-circle selection. Here is what result:

SELECT * FROM CAT WHERE (DEC BETWEEN (-45.67-0.01) AND (-45.67+0.01)) AND (DEGREES(ACOS(SIN(RADIANS(DEC)) * SIN(RADIANS(-45.67)) + COS(RADIANS(DEC)) * COS(RADIANS(-45.67)) * COS(RADIANS(RA-123.45)))) < 0.01);

This, I think, selects the required rows, and is simple enough SQL that any reasonable DBMS will know that it can make use of an index on DEC.

The average astronomer, one might guess, will not be very happy if forced
to write queries such as the one shown above. One simplification possible
in many SQL systems to provide user-defined functions, so that one could
use a great-circle distance function, such as `gtcirc(ra1,dec1,ra2,dec2)`,
instead of the rather complicated expression given in the last example. But
user-defined functions are not part of Standard SQL, and ways of defining
them vary considerable in practice.

Even with such measures, SQL is not easy to use, and Jim Gray noted that astronomers using the SLOAN Sky Server used only the most basic types of query. I think that we may have to accept the necessity of designing our own astronomical query language, which can be pre-processed into the required SQL.

The SQL devised so far has, of course, ignored the questions of errors in the RA and DEC columns in the table. For some catalogues this error is small and much the same for all rows, so a fixed value may be appropriate, and even listed in the metadata. In other cases there is a position-error column called something like PERR. This is typically a standard statistical error (sigma) and the prudent user will want to search out to a distance of perhaps 2 or 3 sigma to include all possible matches. If there is a separate position error for each row then it is quite hard to know how to deal with it, since we do not know without looking at each row whether to include it in any given search or not, so it somewhat defeats indexing. In practice, the error-radii do not usually cover a very wide range of values, and a pragmatic solution is to estimate, once and for all, the largest error radius of all rows in the entire table. This value can then be added to the error criterion (0.01 in the previous examples) used in each test.

Errors are sometimes elliptical or even more complex in shape: it is hard to cope with such cases in a simple way. The best solution is probably to use the radius of the exscribed circle, and later filter out the results which are not within the actual error region.

Using a one-dimensional index is better than nothing, but to appreciate the gains from using a two-dimensional index consider the following example.

Suppose we want to search the USNO-A2 source catalogue for the counterpart of an X-ray source which has a positional error of 3 arc-seconds: the area of an error circle of this radius is 28.27 square arc-seconds. The USNO-A2 contains 525 million sources; if these were distributed randomly on the sky there would be one source per 1018 square arc-seconds, so the probability of a source occurring at random in a circle of area 28.27 square arc-seconds would be around 2.78 percent, on average. Thus in a randomly-chosen circle of this size there will usually be no matching USNO-A2 sources, but one will be present about 3 percent of the time, and more than one even more rarely. If we only have an index on one coordinate, let us say declination, the best that we can do is use it to select a strip of sky along the whole RA axis 6 arc-seconds high: this has an area of up to 7,776,000 square arc-seconds (depending on declination, it will be exactly that on the equator). A strip of that area will typically contain 7639 sources. We have to retrieve all these rows, and then use the great-circle distance expression to scan through them all to find whether one or two (but most probably zero) actually lie within the specified circle. This would not necessarily need 7639 separate disc seeks, if the data in the RA strip were optimally striped on the disc, but the query would probably take a few seconds, and might take a bit longer. With a 2-d index, however, there is no reason why the necessary row (or rows) could not be retrieved in under 30 milliseconds.

Obviously, a JOIN operation on a table of more than a few thousands rows will be exceedingly slow if only a one-dimensional index is available.

Multi-dimensional indexing has been a hot topic in computer science departments of universities for many years (although the UK does not seem to have been very active in this field). Not only is the topic inherently challenging, because a good solution to the 1-d problem was found so long ago in the form of the B-tree, but there are many profitable areas of commerce where a good multi-dimensional indexing method would pay considerable rewards, for example mining and oil prospecting. Spatial problems arise in many other areas, of course, for example in micro-chip design, circuit board layout, and in the design of buildings and engineering structures of all types. An efficient multi-dimensional indexing system would also be of use in a wide range of queries on multiple parameters. Consider, for example, querying a vehicle database like this:

SELECT * FROM CARS WHERE COLOUR = "GREEN" AND MODEL = "ASTRA ESTATE" AND YEARLETTER = "G";

Or a star catalogue like this:

SELECT * FROM STARS WHERE SPTYPE = "G4" AND VMINUSB BETWEEN 0.1 AND 0.2 AND KMINUSL > 0.7;

Such queries could be speeded up considerably if a multi-dimensional index were set up on two or more of the fields in this selection. For all these reasons, the computer science literature abounds with multi-dimensional indexing algorithms.

There are in fact two distinct options for spatial indexing

- A true multi-dimensional index: covered immediately below.
- A function which maps 2-dimensional (or more) space to a line, upon which a conventional B-tree index can be constructed. This mapping option is covered further in section 6 below.

The fact that there are so many different algorithms suggests to me immediately that the problem has not yet been cracked to general satisfaction. As far as I know, no multi-dimensional algorithm has yet been proposed which has all of the most desirable features of the B-tree, i.e. fast insertion, deletion, look-up, and sequential retrieval, and a guarantee that it remains a balanced structure, and that all of its nodes below the root will be at least 50% full under all circumstances. .

The only algorithm which appears to have been sufficiently well-regarded to
have been integrated into one or two DBMS is the R-tree, invented by
Antonin Guttman (Proc. *ACM- SIGMOD International Conference on Management
of Data,* pages 47 - 57, 1984). Postgres is one of the few to incorporate
the R-tree, and since it is also an open source product, it is high on our list
of DBMS to evaluate. It could be that one of the more recent algorithms is
better, but at present the R-tree is the market leader (in a very limited
market)

The R-tree is defined not on points, but on rectangular regions of space.
At first sight this might seem like overkill, since many source catalogues
only have lists of what appear to be point sources. In fact, it should be
helpful, as it ought to solve the spherical-polar distortion problem. When
loading the data into the R-tree it is necessary to compute the coordinates
(RA, DEC) of two opposite corners of the rectangle which encloses the error-circle.
Allowance can obviously be made at this stage for the increased RA-range
which an error-circle occupies at declinations away from the equator. Indeed
if the spherical trigonometry is done correctly, it should be possible to
define the enclosing rectangle correctly even for a point exactly at the north
or south pole.

Since one-dimensional B-trees are so efficient, and so well integrated
into query optimisers, whereas indexing in two or more dimensions appears
to be so difficult, other solutions to the problem have been sought. One,
perhaps more of a fudge than a solution, is to devise a mapping from the
two-dimensional surface to a one-dimensional quantity upon which a conventional
B-tree can be built. An important refinement is to fill the space with a space-filling
curve, so that points nearby in spatial terms are also nearby on the line.
This means that **a range query in space can be converted to a linear range,
so permitting the use of an ordinary B-tree.**

Of course, as any topologist will tell you, a 2-d to 1-d mapping cannot be done without losing something. The question is, whether it can be done usefully at all.

One of the simplest mappings using a space-filling curve, and one which seems to have been invented independently a number of times, is variously called the bit-interleaved index, or Z-order index. A simple way to visualise this in an astronomical application is as follows: take the RA and DEC coordinates, scale each to the range 0 to 65535, and convert to an integer. This scaling has been chosen to fit into a 16-bit (short) integer, and gives a cell size which (at the celestial equator) is just under 20 arc-seconds in RA by 10 arc-seconds in declination. Next we interleave the bits, by taking one bit from each alternatively, to make a 32-bit integer. Obviously the method can be extended to smaller cells on the sky by using 32-bit coordinates, producing a Z-order index of 64-bits, upon which an ordinary B-tree index can be constructed.

An important feature is that in any small patch of sky, small offsets in RA or DEC will generally only change a few of the low-order bits in each coordinate, which will cause changes to no more than twice as many low-order bits of the Z-order index. So as a general rule, nearby points have similar Z-order values. This leads to the idea that one can search for all points in a given small patch of sky by specifying only the lowest and highest Z-order values in that region, that is one can convert a spatial range to a linear range in the Z-order index. To illustrate this simply, the table below shows a pair of 4-bit coordinates interleaved to produce an 8-bit Z-order index, with values from 0 to 255. Note that if you try to draw a line through the cells in sequential order, you get a repeated pattern of the letter Z, hence one of its names.

0 | 1 | 4 | 5 | 16 | 17 | 20 | 21 | 64 | 65 | 68 | 69 | 80 | 81 | 84 | 85 |

2 | 3 | 6 | 7 | 18 | 19 | 22 | 23 | 66 | 67 | 70 | 71 | 82 | 83 | 86 | 87 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|

8 | 9 | 12 | 13 | 24 | 25 | 28 | 29 | 72 | 73 | 76 | 77 | 88 | 89 | 92 | 93 |

10 | 11 | 14 | 15 | 26 | 27 | 30 | 31 | 74 | 75 | 78 | 79 | 90 | 91 | 94 | 95 |

32 | 33 | 36 | 37 | 48 | 49 | 52 | 53 | 96 | 97 | 100 | 101 | 112 | 113 | 116 | 117 |

34 | 35 | 38 | 39 | 50 | 51 | 54 | 55 | 98 | 99 | 102 | 103 | 114 | 115 | 118 | 119 |

40 | 41 | 44 | 45 | 56 | 57 | 60 | 61 | 104 | 105 | 108 | 109 | 120 | 121 | 124 | 125 |

42 | 43 | 46 | 47 | 58 | 59 | 62 | 63 | 106 | 107 | 110 | 111 | 122 | 123 | 126 | 127 |

128 | 129 | 132 | 133 | 144 | 145 | 148 | 149 | 192 | 193 | 196 | 197 | 208 | 209 | 212 | 213 |

130 | 131 | 134 | 135 | 146 | 147 | 150 | 151 | 194 | 195 | 198 | 199 | 210 | 211 | 214 | 215 |

136 | 137 | 140 | 141 | 152 | 153 | 156 | 157 | 200 | 201 | 204 | 205 | 216 | 217 | 220 | 221 |

138 | 139 | 142 | 143 | 154 | 155 | 158 | 159 | 202 | 203 | 206 | 207 | 218 | 219 | 222 | 223 |

160 | 161 | 164 | 165 | 176 | 177 | 180 | 181 | 224 | 225 | 228 | 229 | 240 | 241 | 244 | 245 |

162 | 163 | 166 | 167 | 178 | 179 | 182 | 183 | 226 | 227 | 230 | 231 | 242 | 243 | 246 | 247 |

168 | 169 | 172 | 173 | 184 | 185 | 188 | 189 | 232 | 233 | 236 | 237 | 248 | 249 | 252 | 253 |

170 | 171 | 174 | 175 | 186 | 187 | 190 | 191 | 234 | 235 | 238 | 239 | 250 | 251 | 254 | 255 |

To see the locality of reference, a couple of regions of 3 by 3 cells have been shown in bold (and colour in the on-line version). Square regions are shown, which you can imagine might be an approximation to a small error-circle. The 3x3 set (in green) around cell number 25 covers a range of Z-order values from 18 to 30. One could build a B-tree on the Z-order values, and use a simple one-dimensional range search to access cells in this patch of sky, except that it would return 13 cells when only 9 are needed, so a little subsequent weeding is necessary. This works in most parts of the plane, but unfortunately not everywhere.

The set of 3x3 cells (shown in yellow) around cell number 63 shows what can go wrong: here the range of Z-order values is 60 to 192, so that if you were to do a range search with these limits the resulting collection of cells would cover nearly half the total area, when only 9 cells are really wanted. The Z-order index, therefore, works reasonably well in most places, but goes spectacularly wrong in a few places when one of the high-order bits flips.

A 32-bit index is just about good enough for current astronomical tables:
the cell size of around 200 square arc-seconds has on average 0.2 sources
in a catalogue of the size of USNO-A2, so the Z-order index of a row will
mostly be unique, but larger catalogues will need 64-bit integers for their
Z-order values to be unique. Even if the index in unique, and the error-radius
much less than the cell size, there will always be some sources located close
to a cell boundary or corner, which gives them more than one Z-order value
per row.

It is easy to see that the *average *performance of the Z-order index
in range searches may be rather poor. Even if just a small number of places
in the sky are affected, if these cause a sequential scan of most of the
table, they have a huge effect on the average. Estimating this average performance
analytically was beyond my mathematical ability, so I did what all physicists
do when the maths get too hard: cheat. I wrote a small program, a hundred
or so lines of Fortran95, to set up a Monte Carlo simulation of the problem,
with the added but necessary complication of spherical-polar coordinates.
Using 16-bit coordinates for RA and DEC and thus a 32-bit Z-index, which
leads to a 20 arc-second cell, my results were as follows:

Radius of error-circle, arc-seconds | Average No. of cells covered |
---|---|

1 | 3,915 |

3 | 10,135 |

10 | 26,776 |

30 | 115,718 |

100 | 435,365 |

This is a remarkably poor performance, indeed only for the very smallest search radii is the Z-order index better than using a simple one-dimensional index on declination alone.

There are several other space-filling curves with similar properties, the most well-known perhaps being the Peano and Hilbert curves. Some of these have also been proposed as 2-d to 1-d mapping functions. I have not tried to simulate any of these, but all have the same basic drawback, of failing when a high-order bit flips in either coordinate. I would not expect any of them to have a much better performance.

Part of the problem is the distortion around the pole, which means that a small error-circle intersects a lot of cells if these are set up on a simple grid (as in a Mercator projection). This has led astronomers to seek alternative mappings, appropriate for the surface of a sphere, with more nearly equal areas per cell all over the sky. The two best known are HTM and HEALPix.

HEALPix is a Hierarchical
Equal Area iso-Latitude Pixelisation of the sky invented by Krzysztof Górski
and colleague. It is a development of the Quadrilateral Spherical Cube, devised
for processing COBE data. HEALPix starts with a division of the sky into
12 segments, each of which is then subdivided into roughly square pixels;
these can be subdivided into four smaller squares, and so on recursively.
One important design feature of HEALPix is that the pixels are all of approximately
equal area on the sky: this facilitates the use of 2-dimensional Fourier
transforms and similar techniques to investigate the large-scale structure
of the sky, which are not of interest here. There are two ways of numbering
the pixels, the *ring* and the *nested* schemes. The nested numbering
has the desirable property (for our mapping to one dimension) that adjacent
pixels tend to have numbers differing only in low-order bits. A few special
pixels have only seven nearest neighbours, all the others have the usual eight.
The HEALPix package is written in Fortran90, but there is also an IDL interface.
Current implementations appear to be limited to 32-bit pixel numbers, giving
a resolution of around 20 arc-seconds.

The Hierarchical Triangular Mesh (HTM) was devised by Peter Kunszt, Alex Szalay, and Ani Thakar at Johns Hopkins University for use in the Sloan Digital Sky Survey. One might think that dividing the sphere into squares was hard enough, but the idea of using triangles seems to have been proposed first by Paul Barrett (ADASS-IV proceedings, pp492-495) in what he called the Quaternary Triangular Mesh. The HTM tessellation works as follows: the sphere is first divided into eight equal spherical triangles with 90-degree sides; from then on down each triangle is divided into four sub-triangles of approximately equal area by bisecting all three sides and joining these mid-points with great-circles. This sounds as if it involves a lot of messy trigonometry, but in fact can be carried out fairly rapidly if the coordinates are expressed in 3-vector notation rather than in spherical-polars. At each level you subdivide the triangle into four, so two additional bits suffices to name them. A 32-bit integer provides an index with triangles around 20 - 30 arc-seconds in size. Going to 25 levels in the structure fills a 64-bit integer and gives a typical resolution of 9.6 milli-arcseconds. A library of routines written in C is available. One important function converts a given position (RA,DEC) to the HTM pixel number to the desired level of precision, and other routines can list all the HTM pixel numbers covering a given small region of sky. There is also a Java version, but it is reported to use only 32-bit HTM codes. In most cases the triangles in a given small region of sky have HTM pixel numbers which are similar, differing only in low-order bits.

The only comparison of HTM and HEALPix that I have been able to find is this one by Wim O'Mullane et al. It notes that in practice HEALPix appears to generate indices about an order of magnitude faster than HTM, perhaps because the HTM calculation is more complex and requires a recursive descent (or perhaps because HEALPix is written in Fortran90, whereas HTM uses C and Java). This speed advantage may not matter much in practice, since most applications will be limited by I/O transfer rates, not cpu speed.

I have not had time yet to use either HTM or HEALPix in my simulation program, but I would expect that while both of these mappings solve the problem associated with the singularities at the poles, they do not overcome the more fundamental limitation that when a high-order bit flips, the locality of reference is lost, and a B-tree search would have to cover a large number of rows in linear space.

Mapping functions have no problem when mapping a point on a plane to one on a line, it is the idea that a spatial range can be mapped to a reasonably contiguous set of points on a line which is fundamentally flawed. If this concept is abandoned, then an alternative procedure becomes clear.

- First, choose some mapping of the sky into pixels. In order to avoid problems with spherical-polar coordinates, something like HTM or HEALPix is desirable, but either of them would do. Indeed we might be able to devise something simpler and faster with the limited properties required for our purposes. The most important properties are that pixels should not differ too much in area, and that each pixel should not have too many near neighbours: here HEALPix gains in having at most 8 nearest neighbours, whereas HTM triangles are adjacent to 12 others. Then number the pixels in any self-consistent scheme: let us call the pixel number the PCODE value.
- Next choose a level of sky division such that the typical pixel size exceeds the size of most error-circles.
- Then create a copy of the input table, with an additional column for the integer PCODE values (and space for extra rows).
- Then: for each row in the source catalogue where the circle is contained entirely within one pixel, the PCODE value is obvious. Rows in which the error-circle crosses a pixel boundary (or even more than one boundary) will have a list of PCODE values: these are handled by inserting additional rows into the table, one for every extra PCODE value; these have the same contents as the original row, except in the PCODE column.
- Then create a regular one-dimensional B-tree on the PCODE column.

Provided that not too many error circles cross pixel boundaries, the table size does not increase much. For USNO-A2, which has typical error radii under 0.3 arcseconds, the increase in the number of rows would only be 0.1%.

The only functions needed on the mapping are the following:

`pcode_point(ra,dec)`which returns a single integer PCODE given the spherical-polar coordinates of a point, and

`pcode_list(ra,dec,radius)`which returns a list of PCODE values for all pixels at least partially overlapped by the circle with the given centre and radius

Both the HTM and HEALPix libraries contain functions which effectively do this.

The new table can then be used in a *search in a cone* operation
as follows. Find call `pcode_list` to find the set of PCODE values,
then search for all of them in the PCODE B-tree index. There may be a few
duplicates as a result, because the search cone covered two pixels both of
which were covered by the error-circle around the same source in the table,
but it should be easy enough to remove duplicates by standard DBMS techniques
such as `SELECT DISTINCT`(this this may require a unique source-number
column also to be present in the table).

It should be possible also to carry out a *fuzzy join* between two
tables, if they have both been processed by the same procedure. This needs
only a simple equi-join on their PCODE columns, an operation for which most
DBMS are highly optimised. Again there may be duplicates in the output table
caused by overlaps of error-circles which both cover two (or even more) pixels,
and these will need to be removed.

The obvious question is whether the PCODE method is sufficiently scalable. Clearly the pixel size of the grid needs to be reasonably well chosen so that not too many sources lie in a single pixel (or the weeding stage has more to do), but so that the pixels are mostly larger than the error-circle radius. It is a well-established rule of thumb in survey radio astronomy that one does not try to extract from the data more than one source per forty beam-areas (and there are similar principles in other wavebands using different terminology). This should ensure that there is always a pixel-size larger than the beam (which is normally larger than the error-circle radius) but smaller than the typical spacing between sources. It is perhaps fortuitous that a 32-bit integer PCODE results in pixels 20 to 30 arc-seconds in size, which is quite suitable for the current state of astronomy. When we get tables of more than a few tens of billions of rows, there will be rather too many point sources per pixel, and then it will be time to add more bits. No doubt by then 64-bit processing will have entered the mainstream of computing.

It is important to note that with both HTM and the HEALpix nested numbering scheme, the recursive division of the sky produces a cell number which can be truncated to produce a PCODE value suitable for a coarser resolution. This should make it possible to produce a pixel index for each table which is scalable to the resolution required for any given join situation.

The principal disadvantage of the method is that it requires a new version of each source catalogue to be produced, with an additional column for the PCODE, and space for additional rows. In processing which does not involve the PCODE column, these extra rows would have to be ignored.

The great advantage of the PCODE method is that it can be used with *any*
common-or-garden relational DBMS, with no special support for spatial data
or query processing. The equi-join on integer fields is just what they are
all supposed to be good at.

The use of space-filling curves to turn spatial ranges into linear ranges seems intrinsically flawed, and the simulations suggest the performance is totally inadequate.

The R-tree indexing of Postgres, Oracle, and Informix seems worth evaluating; given the uncertain future of Informix, we should perhaps concentrate on the other two. Figures reported for data loading and indexing time for both Postgres and Informix give some concern.

A sky grid based on the PCODE idea (using HTM or HEALPix or something
simpler) could be used as outlined in section (7) above to provide an efficient
search and join on tables. This would require an extra column for the index
code, and additional rows in the tables, but these overheads are relatively
modest. This option also needs to be evaluated with suitable test datasets,
but essentially *any* relational DBMS can be used.