XML Schema Clinic: Normalizing XML, Part 2
Will Provost
Originally published at XML.com, December 4, 2002

When one of the many father-son carpentry projects of my childhood would make its seemingly inevitable leap from confidence to confusion, the elder Provost's face would acquire a strangely bemused quality as he pronounced the day's lesson: "Nothing is simple." For better or worse, it is on that note that we resume our discussion of how the concepts of data normalization apply to XML document design.

In part one of this article, we observed that while XML's hierarchical model is somewhat at odds with the rectangular structure of relational data, the goals of data normalization as stated in the relational world are certainly worthwhile. We've also seen the usefulness of the normal forms of relational theory -- perhaps not applied literally, but rather posed as challenges to find equally strict guidelines for XML data design. The basic trick of RDB normalization -- the foreign-key relationship -- has been duplicated for W3C XML Schema (WXS) using the key and keyref components.

Ah, but truly, nothing is simple, and so in this second and final part, we'll look at some of the subtler issues of "normalized" XML data design, and complete our run through the normal forms to see how well they apply to XML.

When Not to Normalize

The rental-housing example from part one shows the basic technique of defining an XML association between complex types so that instances of one of those types can be referenced by multiple instances of the other. (See the schema Housing2.xsd and instance document Listings2.xml.) This addresses the goals of the second and third normal forms by eliminating redundant statements of fact in the XML document. It's easy to carry this sort of decomposition too far, however, and it's especially tempting for those with relational backgrounds to overuse XML association.

In RDB design, foreign keys are commonly used for either of two purposes: one-to-many relationships, and many-to-many or many-to-one relationships in which multiple objects share some other value, such that if that referenced value changes, it changes for all referencing objects. Because the statement of a foreign-key relationship alone does not distinguish these cases, SQL includes semantics for "cascading delete" to assert a truly compositional relationship between tables.

In XML, though, multiple cardinality can be managed through simple composition. (Here again is that fundamental difference in data shape we discussed in part one.) So, in XML, the mere fact of a one-to-many relationship does not in itself call for association through keys. A good rule of thumb for RDB designers is this: if you would have applied a cascading delete to a foreign-key relationship, then you're talking about composition in XML.

Here's a quick example: a product-order model including Order records and separate Item records:

The two-table decomposition is made necessary by the multiplicity of Items per Order; this is not a matter of association, but of composition. (The likely association is actually between Item and a third type, Product, using SKU as a key, yielding an attributed, many-to-many relationship from Order to Product.) An XML document would express the items as children of the order, as shown below. This model enforces the compositional relationship in ways that a [...]

    <complexType name="Order">
      <sequence>
        <element name="OrderNumber" type="integer" />
        <element name="CustomerName" type="string" />
        <element name="Date" type="date" />
        <!-- An order may contain any number of items -->
        <element name="Item" maxOccurs="unbounded">
          <complexType>
            <sequence>
              <element name="SKU" type="string" />
              <element name="Name" type="string" />
              <element name="Price" type="decimal" />
              <element name="Quantity" type="integer" />
            </sequence>
          </complexType>
        </element>
      </sequence>
    </complexType>

Scope of Uniqueness

Another important difference between RDB schema and WXS concerns the scope of key uniqueness. A primary key in a relational database must be unique within its table. By contrast, a WXS key need only be unique within the scope of the element on which it is declared.

For example, an airline might record its staffing schedule in a hierarchy from airline through flight and date down to position and employee. If we want to assert uniqueness over position name, we'd have to do so only for a certain flight on a certain date; that is, while we can't have two captains on the plane, we certainly need one for each plane that leaves the ground. So the "path" to a particular staffing fact would be expressed as //airline/flight/date/position/Employee.

If this looks a lot like XPath, no wonder. This path-based addressing fits XML's hierarchical structure, and the ability to define WXS [...]

There is a downside to this facility, however.
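
Before turning to that downside, the scoped uniqueness in the airline example can be sketched as a schema. This is only an illustration under stated assumptions: the element names follow the path given above, but the name attribute on position, the number and day attributes, the PositionKey name, and the absence of a target namespace are all hypothetical choices rather than anything taken from the original schemas.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Hypothetical sketch: no target namespace, so the XPath steps need no prefix -->
<schema xmlns="http://www.w3.org/2001/XMLSchema">
  <element name="airline">
    <complexType>
      <sequence>
        <element name="flight" maxOccurs="unbounded">
          <complexType>
            <sequence>
              <element name="date" maxOccurs="unbounded">
                <complexType>
                  <sequence>
                    <element name="position" maxOccurs="unbounded">
                      <complexType>
                        <sequence>
                          <element name="Employee" type="string" />
                        </sequence>
                        <!-- Assumed: the position name is carried as an attribute -->
                        <attribute name="name" type="string" use="required" />
                      </complexType>
                    </element>
                  </sequence>
                  <attribute name="day" type="date" use="required" />
                </complexType>
                <!-- Declared on date, so uniqueness is checked per date element,
                     that is, per flight on a given date -->
                <key name="PositionKey">
                  <selector xpath="position" />
                  <field xpath="@name" />
                </key>
              </element>
            </sequence>
            <attribute name="number" type="string" use="required" />
          </complexType>
        </element>
      </sequence>
    </complexType>
  </element>
</schema>
```

Because the key sits on the date element, two position elements with the same name under one date would be rejected, while the same position name may recur freely under every other date and flight.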

The trick is that a [...]

Consider this simple workflow model, in which an [...] A [...]

    <element name="process" type="work:Process">
      <!-- Actor names are unique within a process -->
      <key name="ActorKey">
        <selector xpath="./work:actor" />
        <field xpath="work:name" />
      </key>
      <!-- Each source/destination pair identifies one flow within a process -->
      <key name="FlowKey">
        <selector xpath="./work:flow" />
        <field xpath="work:sourceActor" />
        <field xpath="work:destinationActor" />
      </key>
      <!-- A flow's source must name an actor in the same process -->
      <keyref name="FlowSource" refer="work:ActorKey">
        <selector xpath="./work:flow/work:sourceActor" />
        <field xpath="." />
      </keyref>
      <!-- A flow's destination must name an actor in the same process -->
      <keyref name="FlowDestination" refer="work:ActorKey">
        <selector xpath="./work:flow/work:destinationActor" />
        <field xpath="." />
      </keyref>
    </element>

We encounter a problem at the next level of the hierarchy: how can we assert that a [...]

Possible workarounds include:

XML Composition Obviates Fourth Normal Form

Having confronted some of the subtleties of dealing with second and third normal forms, we now proceed to fourth normal form, which is the first on our tour to deal with multi-valued facts. Fourth normal form prohibits overlapping multi-valued facts in one record. Another way to put this is to say that a table that attempts singlehandedly to implement multiple one-to-many relationships breaks fourth normal form and should be decomposed into one table for each such relationship.

For example, if we want to associate multiple phone numbers and multiple e-mail addresses with individual people, we might be tempted to pack this information into one table, such as:
This record structure would capture necessary information, but the table design leaves the relationship between phone number and e-mail address unclear. Even with our knowledge that they are unrelated, we'll have maintenance troubles. If a phone number is removed, the e-mail address in that same row must be preserved, so the PhoneNumber field would have to be left NULL -- even if another record might have a phone number and no e-mail address! Also, what's the proper primary-key definition? The listing above takes a safe approach, but does this reflect the real state of things? How would foreign keys reference this table? All of these problems are manageable, but what looked like an elegant table design is now exposed as a kludgy solution; clearly, the following design is better:

Note that fourth normal form is interesting only if first normal form is strictly observed, which in XML it is not! As we discussed in part one of this article, XML can vary from first normal form in many ways. Most significant here is the fact that an XML record can easily manage independent multi-valued facts, using composition. XML data, once again, is not strictly rectangular, and fourth normal form has no real meaning when applied to an XML tree.

The schema for the contact-info model is therefore a little more natural in WXS -- and might include additional personal information as well, which is presumably a bad idea in the RDB solution above:

    <complexType name="ContactInfo">
      <sequence>
        <element name="Name" type="string" />
        <element name="SSID" type="myNS:MySSIDType" />
        <element name="FavoriteColor" type="myNS:MyColorType" />
        <!-- Two independent multi-valued facts, held side by side by composition -->
        <element name="PhoneNumber" type="string" minOccurs="0" maxOccurs="unbounded" />
        <element name="EMailAddress" type="string" minOccurs="0" maxOccurs="unbounded" />
      </sequence>
    </complexType>

Note that querying on structures such as [...]

Fifth Normal Form Strikes Back!

So if fourth normal form is irrelevant, can we safely ignore fifth normal form? Not really. Where fourth normal form concerns unrelated multi-valued facts, fifth normal form treats the question of how to handle multi-valued facts that are related by some additional rule. In other words, where there are cycles of relationships between more than two record types, fifth normal form enforces complete decomposition of those relationships. The spirit of this rule certainly applies well to XML.

Let's say we need to record information on musicians: what instruments they play and what styles of music they know. If instrument and style are independent, then this is a fourth-normal-form problem, but let's add the rule that only certain instruments are appropriate to certain musical styles. Do we list each pairing of instrument and style in a collection under a musician?

This would be appropriate if the pairings were chosen by individual musicians, but if we're stating a general rule that excludes rock'n'roll clarinet playing, then we should capture this in a separate tree (or really, a matrix), and keep the relationships to instrument and style independent under each musician.

The benefits of fifth normal form in storage efficiency are a little harder to quantify for XML than for RDB, but they are certainly there as well. The broader point of fifth normal form, as with all the others, is to avoid redundant statements of fact, and that is as valid for XML as for relational data.
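
As a rough illustration of the fifth-normal-form advice above, an instance document might state the instrument-and-style rule once and keep each musician's instruments and styles as independent lists. Every element name, attribute name, and sample value here is hypothetical; the article does not give a schema for this case.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Hypothetical sketch: names and values are illustrative only -->
<musicData>
  <!-- The general rule, stated once: which instruments suit which styles.
       No clarinet/rock pairing appears, so that combination is excluded. -->
  <pairings>
    <pairing instrument="guitar" style="rock" />
    <pairing instrument="guitar" style="jazz" />
    <pairing instrument="clarinet" style="jazz" />
    <pairing instrument="clarinet" style="classical" />
  </pairings>

  <!-- Each musician lists instruments and styles independently;
       the valid combinations follow from the pairings above -->
  <musician name="Musician A">
    <instrument>guitar</instrument>
    <instrument>clarinet</instrument>
    <style>rock</style>
    <style>jazz</style>
  </musician>
  <musician name="Musician B">
    <instrument>clarinet</instrument>
    <style>classical</style>
  </musician>
</musicData>
```

If the pairing rule changes, only the pairings section needs editing; no musician record repeats it. Listing instrument-and-style pairs under each musician would instead restate the same general rule in every record.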