Direct access to datasets on the Guide CD
Some users and programmers may be interested in accessing the data on the Guide CD-ROM from their own programs. In certain cases, this can be easy to do; some of the data (for example, all of the "user datasets") are in their original ASCII form, and are well-documented. In other cases, it can be a difficult undertaking (e.g., the GSC, Tycho, and PPM data). The compression really is not intended to frustrate people; a quick check will show you that the CD-ROM is almost completely full, and had the larger datasets not been highly compressed, something would have had to be omitted.
A great deal of detail about the formats of the datasets in Guide is given in various files in the COMPRESS directory of the CD itself. Starting with Guide 6, so many datasets were provided, and so little space was available, that it was necessary to compress almost every dataset; the only exceptions are a few tiny ones for which compression would have freed up little space.
Since the first Guide 6.0 CDs were made, new sample source code has been written to demonstrate how to extract data from Ax.0/SAx.0 CDs, and the example code for extracting GSC data from the Guide CDs has been vastly improved.
Datasets on the Guide CD in plain text
Several smaller datasets were stored without any compression, since the savings would have been marginal. They are listed below. You may have some or all of these, depending on which version of the CD you have. Some are plain ASCII; a few are in FITS format.
Nebulae (all in the NEBULAE directory, except as indicated):

   HII.FIT            Sharpless (HII) catalog
   LBN.FIT            Lynds' Bright Nebulae
   LDN.FIT            Lynds' Dark Nebulae
   PLN.FIT            PK (Strasbourg) Planetary Nebulae
   REFLECT.FIT        Reflection Nebulae
   SNR.FIT            Supernova Remnants
   TEXT\BARNARD.DOC   Barnard Dark Nebulae

Clusters (all under the CLUSTERS directory):

   OPEN\OPENCLU2.DAT  Lund catalog
   OPEN\SELECTED.FIT  Selected Clusters catalog
   GLOBULAR.DAT       Globular clusters

Stars:

   Hipparcos data (documentation for these is in \HIPP\README.HTM):

      \HIPP\HIP_DM_G  Acceleration ("G") solutions
      \HIPP\HIP_DM_O  Orbital solutions
      \HIPP\HIP_DM_V  Variable doubles
      \HIPP\HIP_DM_X  Stochastic doubles
      \HIPP\HIP_VA_1  Solved variables
      \HIPP\HIP_VA_2  Unsolved variables

   \HD\HDE.DAT          The "extension" to the HD (Henry Draper) catalog;
                        the original specifications for this data are
                        contained in the file \HD\HDE.DOC
   \VARIABLE\GCVS4.DAT  General Catalog of Variable Stars
   \VARIABLE\NSV3.DAT   New Suspected Variables
   \WDS\WDSCAT.DAT      Washington Double Star catalog
Improved GSC extraction software

The file \COMPRESS\UNPACK.CPP has been provided since Guide 1.0 as an example of how to decompress a GSC tile. It has always been a "monolithic" standalone program that extracts a particular GSC tile.
The new source code separates the extraction process from the user interface. Extraction is now encapsulated in the following function:
int gsc_unpack( const int cd_drive_letter, const int tile_no, const int fix_numbers);
If you need a particular tile, let's say tile #239, with your CD-ROM in drive E:, you could call:
err_code = gsc_unpack( 'e', 239, 0);
If this runs correctly (err_code == 0), the function will create a file, 0239.GSC, in the original GSC format. (Parsing the file and cleaning it up are left to you.)
The "fix_numbers" parameter addresses a bug in the original GSC 1.1. If the star numbers in a given tile go past 10000, the ten-thousands digit in the original GSC remains set to zero. (See the tile 3588.GSC for one of the very few examples of this bug.) Usually, there are not enough stars in a tile for this to happen; most tiles have far fewer than 10000 stars.
If the "fix_numbers" parameter is zero, gsc_unpack( ) will follow the GSC 1.1 convention and set the 10000s digit in all star numbers to zero. If "fix_numbers" is non-zero, gsc_unpack( ) will take the more logical course of setting the GSC star numbers correctly.
New Ax.0/SAx.0 extracting software
Despite the name 'A10_EXTR.CPP', this code can actually handle A1.0, A2.0, SA1.0, and SA2.0 disks. The extraction process is pretty simple; it mostly runs through this function:
int extract_ax0_from_multipath( const char *cd_path,
            const double ra_degrees, const double dec_degrees,
            const double width_degrees, const double ht_degrees,
            FILE *ofile);
This gathers data for the requested area and writes it out to the output file as sixteen-byte records. The first four bytes are the star ID number, in a form described below. The remaining twelve are the raw Ax.0 data, as defined in the 'readme' on the Ax.0 disks. The only change made to those bytes is that the long integers are byte-swapped to match the Intel (little-endian) convention.
The "cd_path" tells the function where to look for Ax.0 data, and can specify multiple directories. For example, 'x:\;d:\a2' would tell it to look for data in the root of drive x:, and if that didn't do the job, look in d:\a2.
The function figures out what "extracts" cover the given area, and what CDs will be needed to cover that area. It will recognize SA1.0 CDs and handle them correctly. However, it does not automatically recognize A2.0 and SA2.0 disks. You have to make use of the global integer using_a20 to do that. Set using_a20 to a non-zero value if you're using A2.0 or SA2.0.
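For example, here is a minimal sketch (the output file name is my own choice; the declarations are assumed to come from A10_EXTR.CPP) that pulls A2.0 data for a one-degree-square field:

#include <stdio.h>

extern int using_a20;            /* global from A10_EXTR.CPP */
int extract_ax0_from_multipath( const char *cd_path,
            const double ra_degrees, const double dec_degrees,
            const double width_degrees, const double ht_degrees,
            FILE *ofile);

int main( void)
{
   FILE *ofile = fopen( "extract.ax0", "wb");
   int err_code;

   if( !ofile)
      return( -1);
   using_a20 = 1;          /* we're reading A2.0 (or SA2.0) data */
            /* Look on x:\,  then in d:\a2,  for data covering a     */
            /* 1x1 degree box centered on RA=83.5, dec=-5.4 degrees: */
   err_code = extract_ax0_from_multipath( "x:\\;d:\\a2",
                              83.5, -5.4, 1., 1., ofile);
   fclose( ofile);
   return( err_code);
}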
The star ID number packs in information as to the zone (0-23, from south to north pole), running ID number of the star within the zone, and whether it's an A1.0, A2.0, SA1.0, or SA2.0 star. It is stored as follows:
For A1.0:    id = offset + zone * 50 million
For A2.0:    id = offset + (zone + 24) * 50 million
For SA1.0:   id = offset + zone * 4 million + 1.2 billion
For SA2.0:   id = offset + (zone + 24) * 4 million + 1.2 billion
All this takes advantage of the fact that, for Ax.0, no zone has more than 50 million stars; for SAx.0, no zone has more than four million stars.
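Going the other way is simple if you know which of the four catalogs a record came from. The following helper is my own sketch, not part of the provided source; note that some IDs exceed the range of a signed 32-bit integer, so unsigned longs are used.

/* Split a packed Ax.0/SAx.0 star ID (see the formulae above) back
into a zone and an offset within that zone.  'is_sa' and 'is_20'
indicate which of the four catalogs the ID came from.             */

void unpack_ax0_id( const unsigned long id, const int is_sa,
             const int is_20, int *zone, unsigned long *offset)
{
            /* zones are 50 million 'wide' for Ax.0,  4 million for SAx.0 */
   const unsigned long zone_size = (is_sa ? 4000000UL : 50000000UL);
   const unsigned long remainder = id - (is_sa ? 1200000000UL : 0UL);

   *zone = (int)( remainder / zone_size);
   *offset = remainder % zone_size;
   if( is_20)        /* A2.0/SA2.0 zone numbers are stored offset by 24 */
      *zone -= 24;
}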
Other functions provided in this file are:
int find_needed_a10_cds( const double center_dec_degrees, const double height_degrees);
provides a list of which disks cover a given range in declination. The return value is a bitmask indicating the needed CDs; for example, if the return value were 0x84, you'd need CDs 2 and 7. Again, you have to set using_a20 to a non-zero value if you're using A2.0 or SA2.0.
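Decoding that bitmask is simple. The following loop is mine, not from the provided source; I've assumed the CDs are numbered starting from 1, with bit n of the mask corresponding to CD number n, as the 0x84 example above suggests.

#include <stdio.h>

void show_needed_cds( const int needed_mask)
{
   int cd_no;

   for( cd_no = 1; cd_no < 16; cd_no++)     /* bit n = CD number n */
      if( (needed_mask >> cd_no) & 1)
         printf( "CD %d is needed\n", cd_no);
}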
You may want to prompt the user to (for instance) "Insert disk 4, 5, or 10 into drive D:". In such a case, you could feed the return value of the above function to the following function:
int build_prompt_string( const int needed_mask, const char *prompt_string, const int cd_letter, char *output_prompt)
The 'prompt_string' should be text such as "Please insert CD number %s in drive %c:". (Guide supplies this text because it will vary with the language used; in French, for example, "S.V.P. introduire le CD numéro %s dans le lecteur %c:" is used.)
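Putting the two functions together, a sketch (the buffer size and field values here are my own choices) might run:

#include <stdio.h>

int find_needed_a10_cds( const double center_dec_degrees,
                         const double height_degrees);
int build_prompt_string( const int needed_mask, const char *prompt_string,
                         const int cd_letter, char *output_prompt);

void prompt_for_cds( void)
{
   char prompt[100];       /* assumed to be large enough */
   const int needed_mask = find_needed_a10_cds( -5.4, 1.);

   build_prompt_string( needed_mask,
            "Please insert CD number %s in drive %c:", 'd', prompt);
   puts( prompt);     /* e.g., "Please insert CD number 4, 5, or 10 */
                      /* in drive D:"                               */
}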
Also, some example code to handle the extraction in DOS (or any command-line based system) and Windows is provided toward the end of the file.
Tycho data

Various flavors of Tycho are stored on the Guide CDs, depending on what was available at the time: Guide 8.0 has the full Tycho-2 dataset; Guide 7.0 has ACT ("original" Tycho with better positions and proper motions); and Guide 6.0 has the plain old Tycho data.
You can click here to download source code to access these data (about 4 KBytes). The formats have changed slightly between versions, with some new fields added in (mostly things I found out, after the fact, that I really ought to have included to begin with).
Using this, you can get data for individual Tycho stars (though it should be clear how one gets data for entire "tiles", if desired). The source consists of four files when unZIPped: UNTYCHO.CPP, TYCHO.H, WATDEFS.H, and TYC_EXT.CPP.
UNTYCHO.CPP provides an example "test program"; its use is really not all that complex. In essence, you first examine the main Tycho file LG_TYCHO.LMP, to find out where the desired Tycho tile lies in the file. You allocate memory for that tile, read it in, and call the function parse_tycho_star( ) to get data for each star in that tile.
You'll notice that this function takes a parameter indicating which version of the data is in use. For Guide 8, you'll get error values for the magnitudes. The older CDs lack that data, and the error values are left at zero. For Guide 6 CDs, the error values for the positions and proper motions and the VT and BT magnitudes are stored separately in the \HIPP\TYC_EXT.DAT file, discussed in the following section. (Guides 7 and 8 have no TYC_EXT.DAT files.)
Tycho "Extras" file, TYC_EXT.DAT, format (Guide 6.0 only)
In Guide 6.0, the file \HIPP\TYC_EXT.DAT contains Tycho data such as error values (sigmas) for the positions and proper motions; the VT and BT ("V Tycho" and "B Tycho") magnitudes; and the parallax data. (In Guide 7.0, that system was junked, as described in the preceding section.)
To illustrate how the data is stored (and to provide a pretty simple way to access the data without having to understand the format very well), I have provided C/C++ source code, \HIPP\SOFTWARE\TYC_EXT.CPP, for extracting "Tycho Extras" data from the CD-ROM. A few comments on this source code may be helpful:
Data for each Tycho star is stored in an 18-byte structure (nine short integers). The start of the file contains an index of 9,538 long integers (one for each of the 9,537 tiles, plus a final entry marking the end of the data), used to figure out where in the file the data for a given tile is stored. That's the meaning of the lines:
fseek( ifile, (long)( gsc_zone - 1) * 4L, SEEK_SET);
fread( loc, 2, sizeof( long), ifile);   /* start of this zone & the next */
loc[1] -= loc[0];                 /* loc[1] is now the number of records */
fseek( ifile, 9538L * 4L + loc[0] * (long)sizeof( TYCHO_EXTRAS), SEEK_SET);
With those lines, TYC_EXT.CPP gets the starting record number for the given zone and the starting record number for the next zone; by subtraction, it finds how many records the zone contains. It then fseeks to the start of the data and reads in records until it either finds one with the matching gsc_num or runs out of records.
The main( ) function does little more than to open up TYC_EXT.DAT, call get_tycho_extras_data( ), and display all fields from the TYCHO_EXTRAS structure.
For space reasons, the Guide CD does not store the entire Tycho database. (The full dataset includes additional data such as Durchmusterung cross-references.) If you really feel you need that data, let me know why, and I'll burn a CD-ROM for you. (The version I have has some advantages over the original TYC_MAIN distributed by the European Space Agency and the ADC: I added proper motions from the ACT (Astrographic Catalog/Tycho), which greatly improve proper-motion precision; spectral types and SAO cross-indices from the PPM (the original Tycho inexplicably omits these); and variable star designations for almost all variable stars in Tycho.)
Accessing the variable star data
The variable star data is available, in raw ASCII text form, in the files \VARIABLE\GCVS4.DAT and \VARIABLE\NSV3.DAT. In theory, that ought to be all you need for most purposes. The datasets come with their original Astronomical Data Center documentation and, aside from the addition of plenty of Name List stars to the GCVS, they're unaltered.
However, if you're doing star charting, the GCVS can be quite awkward. Rick Hudson has just inquired about accessing the file \VARIABLE\VARIABLE.LMP, the data file used by Guide in drawing these objects. So here's a description of it.
How VARIABLE.LMP is subdivided spatially
The first problem with such a dataset is splitting it up into manageable chunks of the sky. In doing this, I followed a system similar to the GSC "large area" one. At its highest level, the GSC is split into 24 zones in declination, each 7.5 degrees high. Each zone is then split evenly in right ascension: the zones nearest the poles are split three ways, the 75 to 82.5-degree zones into nine pieces, and so on, until the zones from 0 to 7.5 degrees are split into 48 pieces, each covering 7.5 degrees of right ascension.
The appeal of this system is that you've come up with roughly equal areas of sky coverage. It's easy to write code to figure out which "large areas" cover a desired chunk of the sky. And it's pretty easy to sort out data into such tiles. With the 24 zones of the GSC, you wind up with 732 "large areas". (The GSC is then further split up; each "large area" is split into 2x2, 3x3, or 4x4 "small areas", with the 4x4 scheme used in dense areas and the 2x2 in sparse areas. Personally, I consider it a headache and wouldn't do such things. The result is a total of 9,537 "small areas". You can forget this complication in dealing with VARIABLE.LMP.)
For the data I was dealing with, 732 areas seemed excessive (that would average 50 variables/area). I used 18 zones, each 10 degrees high; that resulted in 412 areas instead. Areas 0, 1, and 2 circle the south celestial pole (-80 to -90 declination); area 0 runs from 0h to 8h, area 1 from 8h to 16h, area 2 from 16h to 24h.
In the next declination zone, we've got nine areas, numbers 3 to 11, circling the pole at -70 to -80 declination. Area 3 runs from 0h to 2h40m, area 4 from 2h40m to 5h20m, and so on... up to area 11 running from 21h20m to 24h.
You can compute the number of areas in a given declination zone as
n_areas_in_zone = floor( 2. * n_zones * cos( center_dec) + .5)
where center_dec is, logically enough, the center of that particular zone (-85 for our first zone, -75 for the second, etc.) This formula can be used for both the GSC and VARIABLE.LMP. If you'd rather not do this math, you can simply store the eighteen "n_areas_in_zone" values as
3, 9, 15, 21, 25, 29, 33, 35, 36, 36, 35, 33, 29, 25, 21, 15, 9, 3
(You'll notice that there is symmetry around the celestial equator and a total of 412 areas, numbered 0 to 411.)
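To make the scheme concrete, here is a small function of my own (not from the Guide source) that maps an RA/dec, in degrees, to the corresponding VARIABLE.LMP area number, using the formula and numbering just described. You can then feed the result to the load_up_variable_tile( ) function shown below.

#include <math.h>

#define PI 3.141592653589793

int variable_lmp_area( const double ra_degrees, const double dec_degrees)
{
   const int n_zones = 18;       /* ten-degree-high declination zones */
   int zone = (int)( (dec_degrees + 90.) / 10.);
   int i, rval = 0;

   if( zone > n_zones - 1)       /* handle dec = exactly +90 */
      zone = n_zones - 1;
   for( i = 0; i <= zone; i++)
      {
      const double center_dec = (double)( i * 10 - 85);
      const int n_areas_in_zone = (int)floor( 2. * n_zones *
                        cos( center_dec * PI / 180.) + .5);

      if( i == zone)    /* our zone:  add the right ascension part */
         rval += (int)( ra_degrees * (double)n_areas_in_zone / 360.);
      else              /* count all areas in the zones south of us */
         rval += n_areas_in_zone;
      }
   return( rval);
}

As a check against the examples above: dec=-85 lands in zone 0 (areas 0 to 2), and dec=-75, RA=3h20m gives area 4.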
Reading the variable star data for a given area
The first part of VARIABLE.LMP contains a set of 412 offsets and sizes, indicating where the data for each area is within the file and how many bytes it consumes. For odd historical reasons, each record of offset/size data consumes 22 bytes, of which eight are used. Here's some source code to load up data for a given tile into a buffer; it should make things pretty clear.
char *load_up_variable_tile( const int tile_no, int *size)
{
   FILE *ifile = fopen( "d:\\variable\\variable.lmp", "rb");
   char *rval;
   long offset_data[2];

   if( !ifile)
      return( NULL);
            /* Each offset/size record is 22 bytes,  with the two */
            /* longs we want starting 16 bytes into the record:   */
   fseek( ifile, 22L * (long)tile_no + 16L, SEEK_SET);
   fread( offset_data, 2, sizeof( long), ifile);
   *size = (int)offset_data[1];
   fseek( ifile, offset_data[0], SEEK_SET);
   rval = (char *)malloc( *size);
   if( !rval)
      {
      fclose( ifile);
      return( NULL);
      }
   fread( rval, *size, 1, ifile);
   fclose( ifile);
   return( rval);
}
Parsing the data for a given area
OK, so the above now hands you the data for a tile in a buffer, also giving you the size of that data buffer. (You may want to modify the above, to avoid opening and closing the file each time and allocating and deallocating memory so freely. It can also pay to preload all the offsets, to evade unnecessary seeks. Guide does all these things; they're omitted here for brevity and clarity.) Now you need to parse the data. The first two bytes of the buffer give you the number of variables in this tile; call this n_var_stars. The variable star data follows immediately, in 12-byte VAR_STAR structures that can be defined in C like this.
struct var_star
   {
   short loc[3];              /* position;  see below */
   short var_no;              /* GCVS variable number (or NSV number) */
   char constell;             /* GCVS constellation number;  90 = NSV */
   unsigned char var_type;    /* line number within VAR_TYPE.DAT */
   unsigned char mag, min_mag;   /* max & min magnitudes,  in tenths */
   };
Given this, one could process a given tile as follows.
int process_variable_tile( const int tile_no)
{
   int tile_size;
   short n_var_stars, i;
   char *variable_data = load_up_variable_tile( tile_no, &tile_size);
   char *tptr;

   if( !variable_data)
      return( -1);
   memcpy( &n_var_stars, variable_data, sizeof( short));
   tptr = variable_data + sizeof( short);
   for( i = 0; i < n_var_stars; i++)
      {
      struct var_star variable;

      memcpy( &variable, tptr, sizeof( struct var_star));
      process_variable( &variable);      /* draw it or whatever */
      tptr += sizeof( struct var_star);
      }
   free( variable_data);
   return( 0);
}
Interpreting the data for a star
Much of the data in the structure is cryptic at best. var_no and constell are simplest. They correspond to the GCVS four-digit variable and two-digit constellation number, respectively. For example, T And would have var_no = 3 (third variable in this constellation) and constell = 1 (first constellation). If you see constell = 90, you've got an NSV star on your hands. In this case, var_no is the NSV number. It's bumped up by one for NSV stars after 10360, because for some odd reason, there's an "NSV 10360" and an "NSV 10360A".
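As a quick illustration (my own function, using the var_star structure shown above):

#include <stdio.h>

void show_var_id( const struct var_star *star)
{
   if( star->constell == 90)           /* it's an NSV star...        */
      printf( "NSV %d\n", star->var_no);  /* ...remembering that the */
                           /* numbers past NSV 10360 are shifted by one */
   else      /* a GCVS star;  e.g., var_no = 3, constell = 1 is T And */
      printf( "Variable %d in constellation %d\n",
                  star->var_no, star->constell);
}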
The loc[3] is a little tricky. It has two different interpretations: one for Guides 1.0 through 6.0, and one for Guide 7.0.
First, the explanation for Guides 1.0 through 6.0: Some jaws will drop in disbelief that I did something this weird. You must remember that I originally created this dataset on a '486 SX-20 (no math chip) and evaded floating-point math whenever possible. The oddity I am about to describe enables me to draw variables on a chart, in stereographic projection, without doing a single floating point operation. Back in 1992, that was an extremely important concern.
These three short integers define a Cartesian-coordinate unit vector. For a given RA/dec, they will be equal to
loc[0] = (short)(  32767. * cos( ra) * cos( dec))
loc[1] = (short)( -32767. * sin( ra) * cos( dec))
loc[2] = (short)(  32767. * sin( dec))
i.e., loc[0] (the "x-axis") points toward 0h RA; loc[1] (the "y-axis") to 18h RA; and loc[2] (the "z-axis") to the North Celestial Pole. Note that I committed an oddity with loc[1]; it's negated relative to what you might naturally expect. You can recover RA/dec as
ra = atan2( -(double)loc[1], (double)loc[0]);
dec = asin( (double)loc[2] / 32767.);
And now for the storage system in Guide 7.0. By late 1998, the fact that the above scheme evaded floating-point math was not so important. On Pentium II (and to some extent, "original" Pentium) processors, the implementation of floating point is so good that the advantages of the scheme were mostly gone. And the above scheme does have one real weakness: the precision is 1/32767 radian, or about 6 arcseconds.
To duck this problem, Guide 7.0 stores the RA in the first three bytes of loc[], and dec in the next three bytes, as "24-bit integers". As a result, you can extract the data as follows:
char *tptr = (char *)loc;
long ra_as_24_bit_int, dec_as_24_bit_int;
double ra_degrees, dec_degrees;

ra_as_24_bit_int = dec_as_24_bit_int = 0L;
memcpy( &ra_as_24_bit_int, tptr, 3);     /* low three bytes,  Intel order */
memcpy( &dec_as_24_bit_int, tptr + 3, 3);
ra_degrees = (double)ra_as_24_bit_int * 360. / (double)( 1L << 24);
dec_degrees = (double)dec_as_24_bit_int * 360. / (double)( 1L << 24);
dec_degrees -= 180.;
The above code pulls out six bytes, three at a time, from loc[] and puts them into two long integers. In this system, one integer "unit" is 2^-24 of a full circle, or about .012 arcsecond... plenty of precision for our purposes. Multiplying by 360 and dividing by 2^24 gets us a result in degrees, ranging from 0 to 360. That range causes some trouble with the declination, of course, so we subtract 180 degrees to get a result between -180 and +180. (Since dec can only range from -90 to +90, I _am_ wasting one bit here; the dec could be stored in 23 bits, not 24. But I didn't really need that bit.)
mag and min_mag are in tenths of a magnitude. The var_type is an index into the text file VAR_TYPE.DAT (found in your Guide directory). For example, the third line of VAR_TYPE.DAT reads
ACVO 9P ^Rapidly oscillating Alpha CVn^
If you came across a variable star with var_type = 3, you could open VAR_TYPE.DAT, read three lines, and you'd have loaded the data to tell you that this star is a rapidly oscillating Alpha CVn type. (The meaning of the '9P' is discussed in the file \COMPRESS\VAR_TYPE.DOC .)
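In code, that lookup might run as follows (a sketch of mine; the path to VAR_TYPE.DAT will depend on where Guide is installed):

#include <stdio.h>

/* Read line number 'var_type' of VAR_TYPE.DAT into 'buff';  returns */
/* 0 on success,  -1 if the file or the line couldn't be read.       */

int get_var_type_line( const int var_type, char *buff, const int buffsize)
{
   FILE *ifile = fopen( "var_type.dat", "rb");
   int lines_read = 0, rval = -1;

   if( ifile)
      {
      while( lines_read < var_type && fgets( buff, buffsize, ifile))
         lines_read++;
      if( lines_read == var_type)   /* buff now holds the desired line */
         rval = 0;
      fclose( ifile);
      }
   return( rval);
}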
Having done all this, you'll note that the entire tile still hasn't been parsed. That's because the name VARIABLE.LMP is slightly misleading: the remainder of the tile contains, first, galaxy data from the PGC, and then nonstellar object data (planetaries, dark nebulae, open and globular clusters, and so on). I'll cover all that later, probably when somebody e-mails me a question about it.
Asteroid orbital elements and other asteroid data
The 'starting place' for orbital data must be the file \ASTEROID\ASTORB. This file is the Lowell Observatory asteroid database; it is updated almost continuously, and you can grab a current version at ftp://ftp.lowell.edu/pub/elgb.
The orbital elements used by Guide, however, come from files such as \ASTEROID\ASTEROID\2449\2449400.AST and similarly numbered files. Each contains orbital elements for a particular epoch, corrected for perturbations; together, they consume a considerable fraction of the CD-ROM. The layout and format of the (binary) data is described on the Guide CD, in the file \COMPRESS\ASTEROID.DOC.
The ASTEROID directory of the Guide CD also contains files such as DISCOVER (asteroid discovery dates, locations, and discoverers); PROVISIO.DAT (a list of provisional designations for each asteroid); and ASTNAMES.DAT (asteroid names and magnitude parameters). There are also a few other files containing data such as rotational periods, pole positions, estimated diameters, and so on; each of these has its own documentation file. They were all copied from ADC (Astronomical Data Center) CD-ROMs.
The VOYAGER directory of the Guide CD contains a few example images from the twelve-CD series "Voyagers to the Outer Planets", provided by NSSDC (National Space Science Data Center). Software to display them is provided in the PDSWIN directory of the Guide CD-ROM; this software is copyrighted freeware distributed courtesy of Steve Green, the author.

PPM (Position and Proper Motion)
As with the SAO and HR catalogs, the PPM is stored in a compressed binary format. A structure definition and some very brief description of the format can be found on the CD-ROM, in the file \COMPRESS\PPM.DOC.
DM (Durchmusterung)

The Durchmusterung data is also stored in a compressed form, in the file \DM\DM.DAT. Some description is provided in the .DOC files in that directory; there is also a short example routine, \DM\DE_DM.CPP, provided on the CD.
SAO (Smithsonian Astrophysical Observatory)
The SAO is now mostly of purely historical interest, and has been for some years. It was used as the basis for star display in Guide 1.0, back in 1992; shortly afterward, it was replaced by the far superior PPM, which in turn has been replaced with Tycho/Hipparcos data. Under almost all circumstances, use of those catalogs is preferable to use of the SAO.
If, however, you wish to access the SAO, you should click here to download the C/C++ code for SAO access (about 3 KBytes). This provides a small 'example' program for unpacking SAO data. The SAO is stored in a form that has not changed since Guide 1.0. Each star consumes 36 bytes in the file SAO.LMP, in the SAO directory of the Guide disk. However, this file is not organized by SAO number. It was originally used for drawing star charts, so the data was sorted into areas in the sky, and by magnitude within those areas.
Therefore, in order to find a given star in the SAO, one must first look through the file INDEX.SAO (also in the SAO directory). The get_sao_info function does this for you, and then uses the resulting offset to get the actual data from SAO.LMP.
Once you have the 36-byte "packed" data for an SAO star, a call to parse_entry "unpacks" it into a suitable structure, and it can be easily used.
Datasets not currently accessible by users
For the following datasets, there is little or no documentation of the format. This is not to conceal information; it's just that no one has presented a good reason to document them yet. If you have a need to use this data, please let me know; I'll consider writing up a data specification.
HR (Yale, or Bright Star) catalog

To conserve space and improve speed, this catalog is stored in a compressed binary format, much like the SAO, and no provisions have been made for its direct use. Some description of the layout of the similarly stored SAO data is provided in the file \COMPRESS\SAO.DOC.