The Gigatrees GEDCOM 5 Validator (VGedX)

The Gigatrees GEDCOM 5 Validator (VGedX) is a standalone Windows-x64 application similar to Gigatrees in configuration and operation, but is limited to the validation of GEDCOM 5 files only. It can be downloaded here.

VGedX supports the following GEDCOM versions:
  • GEDCOM 5.5 Rev. 1 (January 2, 1996)
  • GEDCOM 5.5 Rev. 2 (January 10, 1996)
  • GEDCOM 5.5.1
  • GEDCOM 5.6 (Draft)

Note: The following GEDCOM versions are not supported:
  • GEDCOM 5.5.5 changed the underlying format, making it incompatible with earlier GEDCOM 5 versions.
  • GEDCOM 6.0 XML uses an XML format.
  • GEDCOM 7.0 uses another format entirely.

Description

Beyond parsing a GEDCOM file, GEDCOM validation means checking the file against the specification requirements of both the GEDCOM grammar (line and file syntax) and the GEDCOM dictionary (record format, data types, data formats and data values). In this respect, VGedX is only a partial validator: it does not validate data formats or data values in all cases, and it does not have complete coverage of the validation tests listed in the Validation Details below. This is because, in many cases, the data types and values defined in the specification are incomplete or inconsistent. VGedX will validate each record's defined properties, i.e. minimum and maximum occurrences, id reference types, etc. It will not validate any enumerated values. Data consistency testing is not part of GEDCOM validation.

Command Line Options

VGedX supports several command line options. The input file and output path are optional here and can instead be set via the <Main> configuration options shown below. The log is always optional.

Config File
-c config.xml
Loads an individual configuration file.
Input File
-i family.ged
Loads a GEDCOM file.
Output Path
-o web
Sets the output path.
Log
-l build.log
Generates a log file.
Export Test File
-b[v] test.ged
Creates and exports a GEDCOM test file based on the GEDCOM version specified ( -b55R1, -b55R2, -b551, -b56 ).

The distribution file contains all four GEDCOM test files ( gedcoms/55r1.ged, gedcoms/55r2.ged, gedcoms/551.ged and gedcoms/56.ged ) created using the '-b' option. These test files can be used to test VGedX's importing capabilities. The test files use typical values only, and are therefore not useful for testing boundary use cases. The distribution includes an additional file ( gedcoms/test_grammar.ged ) that can be used to test some boundary conditions.
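As a rough illustration, the options listed above could be assembled from a script. The sketch below only builds the argument list. The helper name build_vgedx_command is hypothetical, the option order is an assumption (the source does not say whether order matters), and it assumes vgedx.exe can be found from the current folder.

```python
def build_vgedx_command(exe="vgedx.exe", config=None, gedcom=None,
                        output=None, log=None):
    """Assemble a VGedX argument list from the documented options."""
    cmd = [exe]
    # Each documented flag is followed by its value as a separate argument.
    for flag, value in (("-c", config), ("-i", gedcom),
                        ("-o", output), ("-l", log)):
        if value is not None:
            cmd += [flag, value]
    return cmd


cmd = build_vgedx_command(config="config.xml", gedcom="family.ged",
                          output="web", log="build.log")
print(" ".join(cmd))
# On a Windows machine with VGedX available, the list could be passed to
# subprocess.run(cmd) from the standard library.
```

Options that are left as None are simply omitted, which matches the note that the input file, output path and log are optional on the command line.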

Configuration

To configure VGedX you must provide the path to your GEDCOM file and the output folder either using the command line options listed above, or here. File and folder paths may be entered as absolute or relative paths. Relative paths are relative to the folder where the application file ( vgedx.exe ) resides, called the working directory.

<Main> Options

<GedcomFile>
[ ]
Expects an absolute or relative file path.
<OutputPath>
[ ]
Expects an absolute or relative folder path.

VGedX categorizes its validation results as Errors, Warnings and Alerts. Errors cannot be hidden.

<Validation> Options

<ShowValidationWarnings>
[ true ]
Shows validation warnings. It is sometimes useful to turn off warnings temporarily to reduce the size of the table used to hold the validation results.
<ShowValidationAlerts>
[ true ]
Shows validation Alerts.
<ShowUnusedRecords>
[ false ]
Shows records that are not referenced elsewhere in your GEDCOM file, making them effectively unused.
<ShowUserDefinedRecords>
[ true ]
Shows records that start with an underscore ( e.g. _PRIV ). These tags are not defined by the GEDCOM standards and come from various vendors. As such, they may not be recognized by an importing application.
<ShowTrailingDelimiters>
[ true ]
Trailing delimiters, such as spaces or tabs, are a violation of the GEDCOM standard, but cause no harm and were prevalent in my export testing of many genealogy applications.
<ShowTagWarningDuplicates>
[ true ]
Often when a particular tag causes a warning, thousands of occurrences of that same tag will as well. If you are only interested in looking at the first occurrence, you can disable this, reducing the validation table size significantly.
<ShowTagLists>
[ true ]
VGedX builds a tag list for each record ( e.g. INDI@.BIRT.@SOUR ), where an ampersand ( @ ) appearing after a tag indicates a level 0 record containing a record id ( INDI, SOUR, OBJE, FAM, etc. ), and an ampersand appearing before a tag indicates that it is a reference to another record and contains an xref_id. When disabled, tag lists are not displayed, making it tricky to understand where an issue occurred without looking up the line number in the file. On the other hand, crazy long tag lists can cause the validation table to display less ... efficiently.
<ValidationDataWidth>
[ 100 ]
When text is displayed in the validation table, it is limited in length, and if it is cropped an ellipsis will be appended to the text. You can control the width of this text here.

Example

In the following example, the GEDCOM file and the output path are both given relative to the working directory.

<Options>

  <Main>
    <GedcomFile> gedcoms/test_grammar.ged </GedcomFile>
    <OutputPath> vgedx                    </OutputPath>
  </Main>
  
  <Validation>
    <ShowValidationWarnings>      true  </ShowValidationWarnings>
    <ShowValidationAlerts>        true  </ShowValidationAlerts>
    <ShowUnusedRecords>          false  </ShowUnusedRecords>
    <ShowUserDefinedRecords>      true  </ShowUserDefinedRecords>
    <ShowTrailingDelimiters>      true  </ShowTrailingDelimiters>
    <ShowTagWarningDuplicates>    true  </ShowTagWarningDuplicates>
    <ShowTagLists>                true  </ShowTagLists>
    <ValidationDataWidth>          100  </ValidationDataWidth>
  </Validation>

</Options>
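For readers who generate or check configuration files from a script, here is a minimal sketch of reading such a file and resolving relative paths against the working directory. It is not VGedX's own loader. The function names and the folder C:/tools/vgedx are hypothetical, and trimming the padding around values is an assumption inferred from the example above.

```python
import xml.etree.ElementTree as ET
from pathlib import Path

SAMPLE = """
<Options>
  <Main>
    <GedcomFile> gedcoms/test_grammar.ged </GedcomFile>
    <OutputPath> vgedx                    </OutputPath>
  </Main>
  <Validation>
    <ShowUnusedRecords>          false  </ShowUnusedRecords>
    <ValidationDataWidth>          100  </ValidationDataWidth>
  </Validation>
</Options>
"""


def read_options(xml_text):
    """Return {section: {option: trimmed text}} for an <Options> document."""
    root = ET.fromstring(xml_text)
    return {section.tag: {opt.tag: (opt.text or "").strip() for opt in section}
            for section in root}


def resolve(path_text, working_dir):
    """Resolve a relative path against the working directory (assumed rule)."""
    path = Path(path_text)
    return path if path.is_absolute() else Path(working_dir) / path


options = read_options(SAMPLE)
print(options["Main"]["GedcomFile"])
# "C:/tools/vgedx" stands in for the folder that holds vgedx.exe.
print(resolve(options["Main"]["OutputPath"], "C:/tools/vgedx"))
```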

Sample

The following image shows a portion of the validation report. The GEDCOM file section lists details about the file itself. It is important to make sure that the File Line Count matches the number of lines in the actual file. In rare cases, when creating continuation tags that require splitting text strings (probably wide characters), some exporting applications create non-printable characters that appear to VGedX as an end-of-file marker, preventing the file from being completely parsed. A GEDCOM revision will be displayed only if detected. In the GEDCOM status section, status types are color coded, and the table can be sorted by clicking on any of the table headings.

VGedX - Report
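To compare the reported File Line Count with the actual file, the lines can be counted independently. The sketch below is one way to do that and to locate stray control bytes. It assumes a single-byte or UTF-8 encoded file, and since the source does not say which character acts as the end-of-file marker, it simply reports every control byte other than tab, line feed and carriage return. The function names are hypothetical.

```python
import re

# Any of the four terminators listed in the grammar: CR+LF, LF+CR, CR or LF.
TERMINATOR = re.compile(rb"\r\n|\n\r|\r|\n")
# Control bytes other than tab, line feed and carriage return.
CONTROL = re.compile(rb"[\x00-\x08\x0b\x0c\x0e-\x1f]")


def count_lines(data):
    """Count lines in raw GEDCOM bytes; a final terminator adds no extra line."""
    pieces = TERMINATOR.split(data)
    if pieces and pieces[-1] == b"":
        pieces.pop()  # the data ended with a terminator (or was empty)
    return len(pieces)


def find_control_bytes(data):
    """Return (line number, byte value) for each suspicious control byte."""
    hits = []
    for number, line in enumerate(TERMINATOR.split(data), start=1):
        hits += [(number, match.group()[0]) for match in CONTROL.finditer(line)]
    return hits


sample = b"0 HEAD\r\n1 NOTE abc\x1adef\r\n0 TRLR\r\n"
print(count_lines(sample))         # 3
print(find_control_bytes(sample))  # [(2, 26)]
```

For a real file, the bytes would come from something like Path("family.ged").read_bytes(). A mismatch between count_lines and the reported File Line Count, together with a hit from find_control_bytes, points to the line where parsing probably stopped.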

Validation Details

Data Formats

Symbols

() parentheses  = grouped components
[] brackets     = optional components
*  asterisk     = multiple occurrences of a component
-  dash         = range of values of a component
|  pipe         = alternative components (or)

Characters

Character               ASCII value
=========               ===========
tab                     = 0x09
line feed               = 0x0A
carriage return         = 0x0D
space                   = 0x20
exclamation point (!)   = 0x21
cross hatch (#)         = 0x23
colon (:)               = 0x3A
ampersand / at sign (@) = 0x40
underscore (_)          = 0x5F

Character Sets

Character set           ASCII range
=============           ===========
number digit (0-9)      = (0x30 - 0x39)
alpha char (a-zA-Z_)    = (0x41 - 0x5A) | (0x61 - 0x7A) | 0x5F
non-alpha char          = (0x21 - 0x2F) | (0x3A - 0x3F) | (0x5B - 0x5E) | (0x7B - 0x7E) | (0x80 - 0xFE) | 0x60

Character Groups

alphanum                = (alpha char | number digit)
printable character     = alphanum | non-alpha char | space | cross hatch

Strings

double-at string (@@)   = ampersand + ampersand
number string           = number digit + [number digit]*
alphanum string         = alphanum + [alphanum]*
pointer id              = (alphanum | exclamation point) + [printable character]*
pointer string          = ampersand + pointer id + ampersand
embedded id string      = ampersand + [pointer id +] exclamation point + pointer id + ampersand
escape string           = ampersand + cross hatch + (printable character | double-at string)* + ampersand + [space] + (printable character)*
value string            = printable character + [printable character]*
data string             = (value string | escape string) [+ (value string | escape string)]*
delimiter               = space
terminator              = carriage return | line feed | (carriage return + line feed) | (line feed + carriage return)
whitespace              = ([tab]* + [space]* + [terminator]*)* 
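Several of these definitions translate almost directly into regular expressions. The sketch below is an illustration under stated assumptions, not VGedX's implementation: the constant names are hypothetical, only four of the string types are covered, and each character is compared by its code point, so a byte-oriented validator would use bytes patterns instead.

```python
import re

DIGIT = "0-9"
ALPHA = "A-Za-z_"
# Ranges copied from the "non-alpha char" definition above.
NON_ALPHA = r"\x21-\x2f\x3a-\x3f\x5b-\x5e\x7b-\x7e\x80-\xfe\x60"
# printable character = alphanum | non-alpha char | space | cross hatch
PRINTABLE = ALPHA + DIGIT + NON_ALPHA + " #"

NUMBER_STRING = re.compile(f"[{DIGIT}]+")
ALPHANUM_STRING = re.compile(f"[{ALPHA}{DIGIT}]+")
# pointer string = ampersand + pointer id + ampersand
POINTER_STRING = re.compile(f"@[{ALPHA}{DIGIT}!][{PRINTABLE}]*@")
VALUE_STRING = re.compile(f"[{PRINTABLE}]+")


def is_pointer_string(text):
    """True if the whole text fits the pointer string definition."""
    return POINTER_STRING.fullmatch(text) is not None


print(is_pointer_string("@I1@"))   # True
print(is_pointer_string("@@"))     # False: the pointer id is empty
print(VALUE_STRING.fullmatch("name@school.edu") is not None)  # False: a bare @ is not printable
```

Because the ampersand (0x40) is absent from the printable set, a value string can never contain a single @, which is why the double-at string exists.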

Validation Tests

GEDCOM validation testing includes two types of tests: tests of the GEDCOM data format (the grammar) and tests of the GEDCOM form (the dictionary).

GEDCOM 5 Line Syntax

All of the supported GEDCOM dictionaries use the same GEDCOM 5.5 data format, which defines a line as having the following syntax:

line = [whitespace +] level + [delimiter + record_id +] delimiter + tag + [delimiter + reference_id +] terminator

or

line = [whitespace +] level + [delimiter + record_id +] delimiter + tag + [delimiter + line_value +] terminator
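The two line forms can be handled by one pattern that captures whatever follows the tag and then decides whether it is a reference_id or a line_value. The following sketch shows the idea. The names LINE, POINTER and parse_line are hypothetical, and the pointer check ( @, anything but @, @ ) is deliberately looser than the pointer string definition.

```python
import re

LINE = re.compile(
    r"[\t \r\n]*"                      # optional leading whitespace
    r"(?P<level>[0-9]+)"               # level
    r"(?: (?P<record_id>@[^@]+@))?"    # optional record_id
    r" (?P<tag>[A-Za-z0-9_]+)"         # tag
    r"(?: (?P<value>.*))?"             # optional reference_id or line_value
)
POINTER = re.compile(r"@[^@]+@")


def parse_line(text):
    """Split one GEDCOM line into its parts, or return None if it does not fit."""
    match = LINE.fullmatch(text.rstrip("\r\n"))
    if match is None:
        return None
    value = match.group("value")
    is_reference = value is not None and POINTER.fullmatch(value) is not None
    return {
        "level": int(match.group("level")),
        "record_id": match.group("record_id"),
        "tag": match.group("tag"),
        "reference_id": value if is_reference else None,
        "line_value": None if is_reference else value,
    }


for sample in ("0 @I1@ INDI\r\n", "1 OBJE @I1!O1@\n", "2 EMAIL name@@school.edu\n"):
    print(parse_line(sample))
```

A value such as name@@school.edu is not mistaken for a reference because the doubled ampersand breaks the simple pointer pattern.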

GEDCOM 5 Grammar Tests

The following is a list of requirements of the GEDCOM 5 grammar. Unsupported tests will be noted. String lengths are measured in characters, not bytes.

  1. The level is a number string.
  2. Level numbers should not contain leading zeroes.
  3. The minimum level number is 0.
  4. The maximum level number is 99.
  5. The maximum level number increment is 1.
  6. The level must be followed by a delimiter.

  7. A record_id can be a pointer string or an embedded id string.
  8. The length of a record_id is between 3 and 22 characters.
  9. The record_id must be followed by a delimiter.
  10. The record_id must be unique to the file.

    for example:

    0 @I1@ INDI
    1 @!O1@ OBJE (I1 is implied)
    1 @I1!O1@ OBJE (duplicates not allowed)

    0 @I1@ INDI (duplicates not allowed)

  11. The tag is an alphanum string.
  12. The length of the tag is between 1 and 31 characters.
  13. The first 15 characters of the tag must be unique.

  14. A reference_id is a pointer string.
  15. The length of a reference_id is between 3 and 22 characters.
  16. The reference_id must be preceded by a delimiter.
  17. The reference_id must be followed by a terminator.
  18. The presence of a reference_id implies that the record_id exists in the file unless a colon is present.
  19. If the reference_id contains an exclamation point, the record_id must exist in an embedded record contained within the same logical record.

    for example:

    0 @I1@ INDI
    1 @I1!O1@ OBJE
    1 OBJE @I1!O1@
    1 OBJE @!O1@ (I1 is implied)

    0 @I2@ INDI
    1 OBJE @I1!O1@ (not allowed)

  20. A line_value is a data string.
  21. The line_value must be preceded by a delimiter.
  22. The line_value must be followed by a terminator.
  23. If an ampersand ( @ ) is desired as part of the line_value, it must be included as a double-at string (e.g. name@@school.edu).

  24. The maximum length of a line is 255 characters.
  25. The maximum length of a logical record is 32 kilobytes (logical records are delineated by level numbers equal to 0 (zero)). [NOT SUPPORTED]
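A few of the grammar requirements above (the level rules, the tag length and the line length) can be checked with very little code. This is a simplified sketch rather than a description of how VGedX works: the function name is hypothetical, lines are assumed to be passed without their terminators, and a record_id is assumed to contain no spaces.

```python
def check_levels_and_lengths(lines):
    """Yield (line number, message) for a few of the grammar rules above."""
    previous_level = None
    for number, line in enumerate(lines, start=1):
        if len(line) > 255:
            yield number, "line is longer than 255 characters"
        parts = line.lstrip().split(" ")
        level_text = parts[0]
        if not (level_text.isascii() and level_text.isdigit()):
            yield number, "level is not a number string"
            continue
        if len(level_text) > 1 and level_text.startswith("0"):
            yield number, "level has leading zeroes"
        level = int(level_text)
        if level > 99:
            yield number, "level is greater than 99"
        if previous_level is not None and level > previous_level + 1:
            yield number, "level increases by more than 1"
        previous_level = level
        rest = parts[1:]
        if rest and rest[0].startswith("@"):
            rest = rest[1:]  # skip the record_id
        if not rest or not 1 <= len(rest[0]) <= 31:
            yield number, "tag length is not between 1 and 31 characters"


sample = ["0 @I1@ INDI", "1 NAME John /Smith/", "3 GIVN John", "01 SEX M", "x BIRT"]
for number, message in check_levels_and_lengths(sample):
    print(number, message)
# 3 level increases by more than 1
# 4 level has leading zeroes
# 5 level is not a number string
```

Lengths are measured with len() on decoded text, in line with the note that string lengths are counted in characters, not bytes.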

GEDCOM Dictionary Tests

To validate the GEDCOM dictionary, VGedX compares the structure of the logical records to the GEDCOM dictionary template associated with its GEDCOM version. It also validates general GEDCOM dictionary constructs common to all supported GEDCOM versions.

  1. The GEDCOM version must be one of "5.5", "5.5.1" or "5.6".
  2. Each line should match the GEDCOM dictionary template unless the line has a user defined tag beginning with an underscore.
  3. Each record_id should be referenced from within the same file.
  4. If the template expects a record_id, then the line must have a record_id of the same type.
  5. If the template expects no record_id, then the line must not have a record_id.
  6. If the template expects a reference_id, then the line must have a reference_id of the same type.
  7. If the template expects no reference_id, then the line must not have a reference_id.
  8. If the template defines a minimum number of record occurrences, then the record should not have fewer.
  9. If the template defines a maximum number of record occurrences, then the record should not have more.
  10. If the template defines a minimum line_value length, then the line_value should not be shorter.
  11. If the template defines a maximum line_value length, then the line_value should not be longer.
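The occurrence tests (minimum and maximum occurrences, plus the exemption for user defined tags) can be pictured as a comparison between the child tags of a record and a template. The sketch below uses a made-up template with illustrative limits that are not copied from any GEDCOM dictionary. The names TEMPLATE and check_occurrences are hypothetical.

```python
from collections import Counter

# Illustrative template: tag -> (minimum, maximum) occurrences under one record.
# None means there is no upper limit.
TEMPLATE = {"NAME": (1, None), "SEX": (0, 1)}


def check_occurrences(child_tags, template):
    """Return messages for tags that occur too rarely, too often, or unexpectedly."""
    counts = Counter(child_tags)
    messages = []
    for tag, (minimum, maximum) in template.items():
        count = counts[tag]
        if count < minimum:
            messages.append(f"{tag}: expected at least {minimum}, found {count}")
        if maximum is not None and count > maximum:
            messages.append(f"{tag}: expected at most {maximum}, found {count}")
    for tag in counts:
        # User defined tags (leading underscore) are exempt from the template.
        if tag not in template and not tag.startswith("_"):
            messages.append(f"{tag}: not in template")
    return messages


print(check_occurrences(["NAME", "SEX", "SEX", "_PRIV", "FOO"], TEMPLATE))
# ['SEX: expected at most 1, found 2', 'FOO: not in template']
```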