XML Schema Clinic: Normalizing XML, Part 1

Will Provost

Originally published at XML.com, November 13, 2002

As you regular readers of the XML Schema Clinic are aware, your faithful correspondent tends to view the world of XML through object-oriented glasses. For this installment, though, we're reaching out to the relational data folks, switching lenses for one eye at least and seeing what RDB concepts we can usefully apply to XML. To wit: can the normal forms that guide database design be applied meaningfully to XML document design?

Note that we're not talking about mapping relational data to XML. Instead, we assume that XML is the native language for data expression, and attempt to apply the concepts of normalization to schema design.

The discussion is organized loosely around the progression of normal forms, from first to fifth. As we'll see, these forms won't apply precisely to XML, but we can adhere to the spirit if not the letter of the law. It is possible to develop guidelines for designing W3C XML Schema (WXS) that achieve the goals of normalization:
In this first of two parts, we'll consider the first through third normal forms, and observe that while there are important differences between the XML and relational models, much of the thinking that commonly goes into RDB design can be applied effectively to WXS design as well.

XML Composition vs. First Normal Form

The first normal form in relational theory is so fundamental it's often forgotten: it states that every record in a table must have the same number of fields. That is, relational data is rectangular, and a table is a matrix of rows and columns.
Of course, XML is hierarchical by nature, not rectangular. It can capture matrices easily, though, making the table a parent of the row and the row a parent of the field, as shown in the schema fragment below. (Fields could also be expressed as attributes of the row element.)

<element name="Car">
  <complexType>
    <sequence>
      <element name="Make" type="string" />
      <element name="Model" type="string" />
      <element name="Year" type="integer" />
      <element name="Color" type="string" />
      <element name="VIN">
        <simpleType>
          <restriction base="string">
            <pattern value="[A-Z]{2}[0-9]{4}" />
          </restriction>
        </simpleType>
      </element>
    </sequence>
  </complexType>
</element>

The tricky part is that XML can also render non-rectangular "shapes" for data, thus fitting in more naturally with the type systems of most modern programming languages than with relational data. An XML "record" can vary from first normal form in several ways, all of them legitimate means of data modeling:
The last feature is the most apparent when looking at an XML document: an XML record is a tree, not a table. However, it's actually the second feature that has great impact when considering the RDB normal forms. It may at first seem that it's all to the good that we can model multiple children (of simple or complex type) directly under a parent, without having to resort to multiple tables and foreign keys just to express a simple one-to-many relationship. This is certainly an excellent feature of XML, but as we look at applying normal forms one through five to XML, this deviation right at the foundation will have consequences all the way up the structure.

What we'll see, specifically, is that second and third normal forms are affected only slightly; this is because they deal explicitly with single-valued facts. Fourth and fifth normal forms address multi-valued facts, and it is here that XML's ability to robustly express such facts in hierarchies will force a re-evaluation of the meaning and usefulness of normal forms for XML.

Primary Keys

WXS and RDB both use keys as means of identifying records. Note the "PK" notation in the Car table shown earlier: this is the field whose value uniquely identifies a given Car record. WXS has a similar feature, the key element. One difference -- and again it's a subtle but ultimately important one -- is that WXS keys cannot be defined by the identified type. Instead, they must be defined at some enclosing scope: specifically, as part of an element that contains instances of the identified type.
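To make the scoping rule concrete, here is a sketch of an instance document in which two Car records sit inside an enclosing Dealership element, the container used in the schema fragment that follows. The namespace URI and all data values are hypothetical, and the sketch assumes the schema qualifies its local elements, as the cars: prefixes in the key's XPath expressions suggest.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Hypothetical instance: the namespace URI and all values are assumptions. -->
<cars:Dealership xmlns:cars="http://example.com/cars">
  <!-- Each Car is one "record"; its VIN plays the role of the primary key. -->
  <cars:Car>
    <cars:Make>Saab</cars:Make>
    <cars:Model>900</cars:Model>
    <cars:Year>1994</cars:Year>
    <cars:Color>Green</cars:Color>
    <cars:VIN>AB1234</cars:VIN>
  </cars:Car>
  <!-- A second record in the same scope must carry a different VIN. -->
  <cars:Car>
    <cars:Make>Volvo</cars:Make>
    <cars:Model>240</cars:Model>
    <cars:Year>1991</cars:Year>
    <cars:Color>Blue</cars:Color>
    <cars:VIN>CD5678</cars:VIN>
  </cars:Car>
</cars:Dealership>
```

Changing the second VIN to AB1234 would make this document invalid under a key defined on Dealership, because the key is evaluated across the Car children of that one element. The same VIN appearing under a different Dealership element would not conflict.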
Here's another fragment of the Car schema that shows the corresponding key definition:

<element name="Dealership">
  <complexType>
    <sequence>
      <element ref="cars:Car" minOccurs="0" maxOccurs="unbounded" />
    </sequence>
  </complexType>
  <key name="CarKey">
    <selector xpath="./cars:Car" />
    <field xpath="cars:VIN" />
  </key>
</element>

The choice to place the key on the Dealership element means that VIN values must be unique among the cars of each dealership.

XML Association and Second and Third Normal Forms

I'll venture to say that most design thinking for relational databases is concerned with the second and third normal forms. These are closely related and both assert that a fact expressed by a field in a table should relate only to the primary key (and to the whole key, not to one of several key fields). To do otherwise either establishes invalid correlations or results in redundant expressions of related data.

For example, if we record available rental housing, and include contact information, but we know that a realtor's name, phone, address, etc., will be the same every time that realtor occurs, then packing all this information into one table would be a poor choice, and would break (in this case) third normal form:
The correct -- normalized -- design is shown next. This really amounts to common-sense decomposition, and is very similar in spirit to OO design in that it insists on single points of maintenance.
Second and third normal forms apply very well to XML. The fact that XML records are not constrained to rectangular shape is of little concern here. For instance, the contact/realtor information in our rental-housing example would likely be modeled as a single complex-type child element, but no matter: the point of these forms is to eliminate redundancy, and repeating a complex-type instance is no healthier than repeating a series of single values. The WXS technique that allows a schema to observe second and third normal forms is the keyref.

This leads to our first detailed example of XML normalization. The rental-housing example discussed above is expanded to a more complete solution as shown in Housing1.xsd. Note that the contact element embeds complete BusinessEntity information in every HousingUnit:

<complexType name="HousingUnit">
  <sequence>
    <element name="bedrooms" type="integer" />
    <element name="floor" type="integer" />
    <element name="wheelchairAccessible" type="boolean" />
    <element name="rent" type="positiveInteger" />
    <element name="available" type="date" />
    <element name="location" type="h:PhysicalAddress" />
    <element name="amenities" minOccurs="0">
      <complexType>
        <sequence>
          <element name="amenity" type="string" minOccurs="0" maxOccurs="unbounded" />
        </sequence>
      </complexType>
    </element>
    <element name="contact" type="h:BusinessEntity" />
  </sequence>
  <attribute name="unitID" type="NCName" />
</complexType>

The problem can be seen in the sample database Listings1.xml: some contacts are listed for multiple units, and all their information is repeated over and over. This is inefficient, but worse than that it threatens consistency. If Judy Trueblood's phone number changes, how can we be assured that it won't be updated in one place and left stale in another? The failure to normalize the XML document leaves us with multiple points of maintenance.

The correction is shown in Housing2.xsd, and in the diagram below. A separate type is created for realtors, so that each can be recorded just once.
The housing unit can then refer to the realtor, rather than duplicating its information. (This feature is introduced as a choice between the existing contact element and the new realtor element.)

<complexType name="HousingUnit">
  <sequence>
    <element name="bedrooms" type="integer" />
    <element name="floor" type="integer" />
    <element name="wheelchairAccessible" type="boolean" />
    <element name="rent" type="positiveInteger" />
    <element name="available" type="date" />
    <element name="location" type="h:PhysicalAddress" />
    <element name="amenities" minOccurs="0">
      <complexType>
        <sequence>
          <element name="amenity" type="string" minOccurs="0" maxOccurs="unbounded" />
        </sequence>
      </complexType>
    </element>
    <choice>
      <element name="contact" type="h:BusinessEntity" />
      <element name="realtor" type="string" />
    </choice>
  </sequence>
  <attribute name="unitID" type="NCName" />
</complexType>

The parent element HousingUnitList declares the keys and the keyref:

<element name="HousingUnitList">
  <complexType>
    <sequence>
      <element name="HousingUnit" type="h:HousingUnit" minOccurs="0" maxOccurs="unbounded" />
      <element name="RealtorList">
        <complexType>
          <sequence>
            <element name="Realtor" type="h:Realtor" minOccurs="0" maxOccurs="unbounded" />
          </sequence>
        </complexType>
        <key name="RealtorKey">
          <selector xpath="./Realtor" />
          <field xpath="name" />
        </key>
      </element>
    </sequence>
  </complexType>
  <key name="UnitIDKey">
    <selector xpath="./HousingUnit" />
    <field xpath="@unitID" />
  </key>
  <keyref name="HousingUnitToRealtor" refer="h:RealtorKey">
    <selector xpath="./HousingUnit" />
    <field xpath="realtor" />
  </keyref>
</element>

As a result of this change the redundancy in recording realtor information has been eliminated. Realtors are referenced only by their key values, and any associated information can be updated in one place, so Judy Trueblood can be assured she won't be missing any calls. The revised instance document Listings2.xml shows this. (Incidentally, here is the XSLT that was used to migrate from one schema to the next: Migration.xsl.)
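The effect on an instance document can be sketched as follows. This is a hypothetical listing, not the contents of Listings2.xml: the namespace URI, the children of location, the phone element inside Realtor, and all data values other than Judy Trueblood's name are assumptions made for illustration. Local elements are left unqualified, as the unprefixed names in the selector and field expressions suggest.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Hypothetical instance: namespace URI, address and phone elements,
     and all values are assumptions, not taken from Listings2.xml. -->
<h:HousingUnitList xmlns:h="http://example.com/housing">
  <HousingUnit unitID="U101">
    <bedrooms>2</bedrooms>
    <floor>1</floor>
    <wheelchairAccessible>true</wheelchairAccessible>
    <rent>1200</rent>
    <available>2002-12-01</available>
    <location>
      <street>12 Elm Street</street>
      <city>Springfield</city>
    </location>
    <!-- The realtor is named only by key value; no details are repeated. -->
    <realtor>Judy Trueblood</realtor>
  </HousingUnit>
  <HousingUnit unitID="U102">
    <bedrooms>3</bedrooms>
    <floor>2</floor>
    <wheelchairAccessible>false</wheelchairAccessible>
    <rent>1500</rent>
    <available>2003-01-15</available>
    <location>
      <street>48 Oak Avenue</street>
      <city>Springfield</city>
    </location>
    <amenities>
      <amenity>Laundry</amenity>
    </amenities>
    <realtor>Judy Trueblood</realtor>
  </HousingUnit>
  <RealtorList>
    <!-- Single point of maintenance: the phone number lives here only. -->
    <Realtor>
      <name>Judy Trueblood</name>
      <phone>555-0100</phone>
    </Realtor>
  </RealtorList>
</h:HousingUnitList>
```

Both units point at the same Realtor record through its name, so a change of phone number touches one element. A realtor value that matched no Realtor name is what the keyref is meant to reject.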
The Plot Thickens

So the key concept of reducing redundancy through key association is alive and well in W3C XML Schema design. While I'd love to finish on this bright note, I must report that there are a few devils lurking in the details. In part two of this article, I'll point out a few of them and discuss resulting design issues, as well as addressing the subtler fourth and fifth normal forms. Stay tuned.