XML Schema Clinic -- Normalizing XML, Part 1

XML Schema Clinic — Normalizing XML, Part 1

Will Provost

Originally published at XML.com, November 13, 2002

As you regular readers of the XML Schema Clinic are aware, your faithful correspondent tends to view the world of XML through object-oriented glasses. For this installment, though, we're reaching out to the relational data folks, switching lenses for one eye at least and seeing what RDB concepts we can usefully apply to XML. To wit: can the normal forms that guide database design be applied meaningfully to XML document design?

Note that we're not talking about mapping relational data to XML. Instead, we assume that XML is the native language for data expression, and attempt to apply the concepts of normalization to schema design. The discussion is organized loosely around the progression of normal forms, from first to fifth. As we'll see, these forms won't apply precisely to XML, but we can adhere to the spirit if not the letter of the law. It is possible to develop guidelines for designing W3C XML Schema (WXS) that achieve the goals of normalization:

Eliminate ambiguity in data expression
Minimize redundancy (some would say, "eliminate all redundancy")
Facilitate preservation of data consistency
Allow for rational maintenance of data

In this first of two parts, we'll consider the first through third normal forms, and observe that while there are important differences between the XML and relational models, much of the thinking that commonly goes into RDB design can be applied effectively to WXS design as well.

XML Composition vs. First Normal Form

The first normal form in relational theory is so fundamental it's often forgotten: it states that every record in a table must have the same number of fields. That is, relational data is rectangular, and a table is a matrix of rows and columns.

Table: Car
	Make	VARCHAR(64)
	Model	VARCHAR(64)
	Year	INTEGER
	Color	VARCHAR(64)
PK	VIN	VARCHAR(6)

Of course, XML is hierarchical by nature, not rectangular. It can capture matrices easily, though, making the table a parent of the row and the row a parent of the field, as shown in the schema fragment below. (Fields could also be expressed as attributes of the row element.)

<element name="Car">
  <complexType>
    <sequence>
      <element name="Make" type="string" />
      <element name="Model" type="string" />
      <element name="Year" type="integer" />
      <element name="Color" type="string" />
      <element name="VIN"><simpleType>
        <restriction base="string">
          <pattern value="[A-Z]{2}[0-9]{4}" />
        </restriction>
      </simpleType></element>
    </sequence>
  </complexType>
</element>

The tricky part is that XML can also render non-rectangular "shapes" for data, thus fitting in more naturally with the type systems of most modern programming languages than with relational data. An XML "record" can vary from first normal form in several ways, all of them legitimate means of data modeling:

It can have varying numbers of fields (though this is not much different from allowing NULLs in RDB fields) and default values for attributes.
It can have multiple values for a field, through the maxOccurs attribute for particles.
It can have choices of field types instead of a straight sequence or conjunction.
Fields can be of complex type.

The last feature is the most apparent when looking at an XML document: an XML record is a tree, not a table. However, it's actually the second feature that has great impact when considering the RDB normal forms. It may at first seem that it's all to the good that we can model multiple children (of simple or complex type) directly under a parent, without having to resort to multiple tables and foreign keys just to express a simple one-to-many relationship. This is certainly an excellent feature of XML, but as we look at applying normal forms one through five to XML, this deviation right at the foundation will have consequences all the way up the structure.

What we'll see, specifically, is that second and third normal forms are affected only slightly; this is because they deal explicitly with single-valued facts. Fourth and fifth normal forms address multi-valued facts, and it is here that XML's ability to robustly express such facts in hierarchies will force a re-evaluation of the meaning and usefulness of normal forms for XML.

Primary Keys

WXS and RDB both use keys as means of identifying records. Note the "PK" notation in the Car table shown earlier: this is the field whose value uniquely identifies a given Car record. WXS has a similar feature, the key, and like RDB keys it can be defined to include one or more fields, in the latter case creating a composite key.

One difference -- and again it's a subtle but ultimately important one -- is that WXS keys cannot be defined by the identified type. Instead, they must be defined at some enclosing scope: specifically, as part of an element that contains instances of the identified type. Here's another fragment of the Car schema that shows the corresponding key definition: the Car key is defined as part of the Dealership element.

<element name="Dealership">
  <complexType>
    <sequence>
      <element ref="cars:Car" minOccurs="0" maxOccurs="unbounded" />
    </sequence>
  </complexType>
  <key name="CarKey">
    <selector xpath="./cars:Car" />
    <field xpath="cars:VIN" />
  </key>
</element>

The choice to place the key definition on a parent, grandparent, or other ancestor affects the scope over which key values are expected to be unique. We'll observe the impact of this difference in part two of this article.

XML Association and Second and Third Normal Forms

I'll venture to say that most design thinking for relational databases is concerned with the second and third normal forms. These are closely related and both assert that a fact expressed by a field in a table should relate only to the primary key (and to the whole key, not to one of several key fields). To do otherwise either establishes invalid correlations or results in redundant expressions of related data.

For example, if we record available rental housing, and include contact information, but we know that a realtor's name, phone, address, etc., will be the same every time that realtor occurs, then packing all this information into one table would be a poor choice, and would break (in this case) third normal form:

Table: HousingUnit
PK	UnitID	VARCHAR(6)
	StreetAddress	VARCHAR(64)
	City	VARCHAR(24)
	State	VARCHAR(2)
	ContactName	VARCHAR(32)
	ContactPhone	VARCHAR(16)
	ContactEMail	VARCHAR(64)

The correct -- normalized -- design is shown next. This really amounts to common-sense decomposition, and is very similar in spirit to OO design in that it insists on single points of maintenance.

Table: HousingUnit			Table: ContactInfo
PK	UnitID	VARCHAR(6)	PK	Name	VARCHAR(32)
	StreetAddress	VARCHAR(64)		Phone	VARCHAR(16)
	City	VARCHAR(24)		EMail	VARCHAR(64)
	State	VARCHAR(2)
FK	ContactName	VARCHAR(32)

Second and third normal forms apply very well to XML. The fact that XML records are not constrained to rectangular shape is of little concern here. For instance, the contact/realtor information in our rental-housing example would likely be modeled as a single complex-type child element, but no matter: the point of these forms is to eliminate redundancy, and repeating a complex-type instance is no more healthy than repeating a series of single values.

The WXS technique that allows a schema to observe second and third normal forms is the keyref, which establishes an association between XML nodes. For our purposes a keyref functions very much like a foreign key: it relates to a specific key, and it asserts similar constraints on the values of referencing fields, so that all associations between key-related types can be validated and navigated.

This leads to our first detailed example of XML normalization. The rental-housing example discussed above is expanded to a more complete solution as shown in Housing1.xsd. Note that the HousingUnit type includes the entire contact-information structure by composition:

<complexType name="HousingUnit">
  <sequence>
    <element name="bedrooms" type="integer" />
    <element name="floor" type="integer" />
    <element name="wheelchairAccessible" type="boolean" />
    <element name="rent" type="positiveInteger" />
    <element name="available" type="date" />
    <element name="location" type="h:PhysicalAddress" />
    <element name="amenities" minOccurs="0">
      <complexType>
        <sequence>
          <element name="amenity" type="string" 
            minOccurs="0" maxOccurs="unbounded" />
        </sequence>
      </complexType>
    </element>
    <element name="contact" type="h:BusinessEntity" />
  </sequence>
  <attribute name="unitID" type="NCName" />
</complexType>

The problem can be seen in the sample database Listings1.xml: some contacts are listed for multiple units, and all their information is repeated over and over. This is inefficient, but worse than that it threatens consistency. If Judy Trueblood's phone number changes, how can we be assured that it won't be updated in one place and left stale in another? The failure to normalize the XML document leaves us with multiple points of maintenance.

The correction is shown in Housing2.xsd, and in the diagram below. A separate type is created for realtors, so that each can be recorded just once. The housing unit can then refer to the realtor, rather than duplicating its information. (The introduction of this feature as a choice allows both old and new forms to be used as appropriate; this is peripheral to the matter at hand.)

<complexType name="HousingUnit">
  <sequence>
    <element name="bedrooms" type="integer" />
    <element name="floor" type="integer" />
    <element name="wheelchairAccessible" type="boolean" />
    <element name="rent" type="positiveInteger" />
    <element name="available" type="date" />
    <element name="location" type="h:PhysicalAddress" />
    <element name="amenities" minOccurs="0">
      <complexType>
        <sequence>
          <element name="amenity" type="string" 
            minOccurs="0" maxOccurs="unbounded" />
        </sequence>
      </complexType>
    </element>
    <choice>
      <element name="contact" type="h:BusinessEntity" />
      <element name="realtor" type="string" />
    </choice>
  </sequence>
  <attribute name="unitID" type="NCName" />
</complexType>

The parent element HousingUnitList now keeps a list of housing units as well as realtors. The foreign-key relationship between HousingUnit and Realtor is asserted here, using keyref:

<element name="HousingUnitList">
  <complexType>
    <sequence>
      <element name="HousingUnit" type="h:HousingUnit" 
        minOccurs="0" maxOccurs="unbounded" />
      <element name="RealtorList">
        <complexType>
          <sequence>
            <element name="Realtor" type="h:Realtor"
              minOccurs="0" maxOccurs="unbounded" />
          </sequence>
        </complexType>
        <key name="RealtorKey">
          <selector xpath="./Realtor" />
          <field xpath="name" />
        </key>
      </element>
    </sequence>
  </complexType>
  <key name="UnitIDKey">
    <selector xpath="Unit" />
    <field xpath="@unitID" />
  </key>
  <keyref name="HousingUnitToRealtor" refer="h:RealtorKey">
    <selector xpath="./Unit" />
    <field xpath="realtor" />
  </keyref>
</element>

As a result of this change the redundancy in recording realtor information has been eliminated. Realtors are referenced only by their key values, and any associated information can be updated in one place, so Judy Trueblood can be assured she won't be missing any calls. The revised instance document Listings2.xml shows this. (Incidentally, here is the XSLT that was used to migrate from one schema to the next: Migration.xsl.)

The Plot Thickens

So the key concept of reducing redundancy through key association is alive and well in W3C XML Schema design. While I'd love to finish on this bright note, I must report that there are a few devils lurking in the details. In part two of this article, I'll point out a few of them and discuss resulting design issues, as well as addressing the subtler fourth and fifth normal forms. Stay tuned.