Beyond XML Schema — XPath and XSLT for Validation

Will Provost

Originally published at XML.com, April 10, 2002

The XML developer who needs to validate documents as part of application flow may choose to begin by writing XML Schema for those documents. This is natural enough, but XML Schema is only one part of the validation story. In this article, we will discover a multiple-stage validation process that begins with Schema validation, but also uses XPath and XSLT to assert constraints on document content that are too complex or otherwise inappropriate for XML Schema.

We can think of a schema as both expressive and prescriptive: it describes the intended structure and interpretation of a type of document, and in the same breath it spells out constraints on legal content. There is a bias toward the expressive, though: XML Schema emphasizes "content models", which are good at defining document structure but insufficient to describe many constraint patterns.

This is where XPath and XSLT come in: we'll see that a transformation-based approach will let us assert many useful constraints, and is in many ways a better fit to the validation problem. (In fact, one might define Schema validation as no more than a special kind of transformation; see van der Vlist.)

We'll begin by looking at some common constraint patterns that XML Schema does not support very well, and then develop a transformation-based approach to solving them.

Constraints — Common Patterns

We'll look at two examples, each of which is problematic to implement in XML Schema. First, consider the schema shown below, modeling a home stereo system. It requires one of two configurations for sound amplification, and then allows any number of sound sources in sequence. Finally, speakers are listed. (Note that for simplicity in this example we're leaving out data type information and focusing on structure. For more fully-worked examples and downloadable code, see the complete whitepaper.)
    <?xml version="1.0" encoding="UTF-8" ?>
    <xs:schema version="1.0" xmlns:xs="http://www.w3.org/2001/XMLSchema" >
      <xs:element name="Stereo"><xs:complexType>
        <xs:sequence>
          <!-- Amplification: either Amplifier plus Receiver, or a Tuner -->
          <xs:choice>
            <xs:sequence>
              <xs:element name="Amplifier" />
              <xs:element name="Receiver" />
            </xs:sequence>
            <xs:element name="Tuner" />
          </xs:choice>
          <!-- Sound sources: any number of each, in this order -->
          <xs:element name="CDPlayer" minOccurs="0" maxOccurs="unbounded" />
          <xs:element name="Turntable" minOccurs="0" maxOccurs="unbounded" />
          <xs:element name="CassetteDeck" minOccurs="0" maxOccurs="unbounded" />
          <xs:element name="QuadraphonicDiscPlayer" minOccurs="0" maxOccurs="unbounded" />
          <!-- Speakers: between two and six -->
          <xs:element name="Speaker" minOccurs="2" maxOccurs="6" />
        </xs:sequence>
      </xs:complexType></xs:element>
    </xs:schema>

We have occurrence constraints that demand at least two speakers, but let's assume that a system with a quadraphonic sound source must have at least four speakers to be valid. So the following document is valid against the above schema, but for our broader purposes is incorrect:

    <?xml version="1.0" encoding="UTF-8" ?>
    <Stereo
      xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
      xsi:noNamespaceSchemaLocation="Stereo.xsd" >
      <Amplifier>Mondo Electronics</Amplifier>
      <Receiver>Mondo Electronics</Receiver>
      <QuadraphonicDiscPlayer>CSI Labs</QuadraphonicDiscPlayer>
      <Speaker>Moltman</Speaker>
      <Speaker>Moltman</Speaker>
    </Stereo>

Now, we could break the whole content model into an

Thus one common pattern: the need to analyze the document tree as a whole. XML Schema focuses on the immediate relationships between elements and attributes, parents and children. A more direct approach to pure validation is to start from the document scope and make assertions from there, drilling down as far as necessary to express a constraint. As we'll see, XPath is far better suited to abstract tree analysis than is XML Schema.

Secondly, consider weakly-typed designs.
Weak types are generally to be discouraged in XML document design, but this pattern does tend to pop up at some point out of necessity: instead of creating multiple subtypes to express specializations, one complex type is used that includes a "type" attribute, usually an enumeration with one possible value for each pseudo-subtype. Based on the value of this attribute, other attributes and elements may or may not be meaningful. Thus from a Schema perspective all these attributes and elements must be considered optional, and this weakens the prescriptive capability of the schema. This is an especially tough nut for Schema to crack, since nothing in the Schema recommendation allows for validation of structure based on values in the instance document.

An example of a weakly-typed system is in this schema for credit transactions. Various means of authenticating the human actor are defined, and the

As in the Stereo example, this is primarily meant to illustrate what the schema cannot do: how could we express that if

XPath as Constraint Language — Selecting What Shouldn't Exist

In order to develop a comprehensive architecture for XML document validation, it is clear that we will need more than XML Schema is able to provide in the way of specifying content constraints. The need here is twofold: we need a language by which to define constraints (we'll look at this now) and a mechanism by which to assert those constraints for a given XML document.

Generalizing from the examples in the previous section, we can see that our constraint language should allow us to express constraints of any scope (up to document scope at least) and any complexity. It should enable at least basic node selection by tag or attribute name, pattern recognition, existence tests and node counting, and simple numeric, string and boolean expressions for comparing values.

XPath is clearly an excellent fit to this problem. It is expression-based, allowing for arbitrarily complex constraints.
It can do simple math and string manipulation, and with little effort can perform some modestly complicated set arithmetic. Best of all, XPath expressions can evaluate to node sets, allowing for selection of all nodes that meet certain criteria.

The question, then, is what to select. It's intuitive to think in terms of selecting what's valid. Looking a short way downstream, though, it can be seen that validation is really a process of weeding out invalid data. So our aim should be to express constraints as assertions about unacceptable content patterns. The trick, in other words, is to select what shouldn't exist. For instance, the XPath expression

XSLT — Transformation as Validation

XPath never stands alone; it was conceived as a useful common language for various purposes, including transformation, parsing, and even schema design. We need a way to apply this expression-based language to validation. XSLT gives us our solution by framing the process of validation as a transformation whose output will consist of error messages, or will be empty. The structure of the
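We'll now look at XSLT-based solutions to the two problems posed in the previous section. First, let's assert two constraints which are not addressed by the stereo schema: a system with a quadraphonic sound source must have at least four speakers, and every system must have at least one sound source.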
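The instance shown earlier breaks the speaker-count rule for quadraphonic sources. As a complementary sketch, the hypothetical instance below should also pass the stereo schema, since every sound-source element is declared with minOccurs="0", yet it describes a system with nothing to play. The manufacturer names are simply reused from the earlier instance for illustration.

```xml
<?xml version="1.0" encoding="UTF-8" ?>
<!-- Hypothetical instance: expected to be schema-valid, but no sound source
     (CDPlayer, Turntable, CassetteDeck, QuadraphonicDiscPlayer) is present. -->
<Stereo
  xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
  xsi:noNamespaceSchemaLocation="Stereo.xsd" >
  <!-- The Tuner branch of the amplification choice -->
  <Tuner>Mondo Electronics</Tuner>
  <!-- Two speakers satisfy the schema's minOccurs="2" -->
  <Speaker>Moltman</Speaker>
  <Speaker>Moltman</Speaker>
</Stereo>
```

Catching this case means looking at the children of Stereo as a group and finding none of four alternatives, which is the kind of whole-tree assertion that the content model does not state.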
We now introduce a second stage in the validation process: the application of a validating XSLT transform to the instance document. Following the strategy laid out above, the transform defines a template for each of the two constraints, producing the appropriate error message in each case:

    <?xml version="1.0" encoding="UTF-8"?>
    <xsl:transform version="1.0"
      xmlns:xsl="http://www.w3.org/1999/XSL/Transform" >
      <xsl:output method="text" />
      <xsl:strip-space elements="*" />

      <!-- Override the built-in rule so that text nodes are not copied to the output -->
      <xsl:template match="text()" />

      <!-- A quadraphonic source with fewer than four speakers -->
      <xsl:template match="Stereo[QuadraphonicDiscPlayer][count(Speaker) &lt; 4]" >
        <xsl:text>ERROR: Quadraphonic sound source without enough speakers. </xsl:text>
      </xsl:template>

      <!-- A system with no sound source at all -->
      <xsl:template match="Stereo[count(CDPlayer | Turntable | CassetteDeck | QuadraphonicDiscPlayer) = 0]">
        <xsl:text>ERROR: Stereo system must have at least one sound source. </xsl:text>
      </xsl:template>
    </xsl:transform>

Note that the less-than sign in the first match pattern must be written as &lt;, since a literal < is not allowed inside an XML attribute value.

Consider the instance document shown earlier, which should not be considered valid. It still validates against the schema, but is now flunked by the validating transformation. Sample output for transforming the instance document using the above transform is:

    ERROR: Quadraphonic sound source without enough speakers.

Now let's return to our weakly-typed transaction model. We define a validating transform to assert that, for instance, in-person transactions must be verified by checking the signature and visually identifying the other party based on a photo. Then, the candidate document, one of whose sales records doesn't have both required verification steps, fails the validating transformation, producing:

    ERROR: In-person sales must have verified signature and visual ID.

Advantages and Limitations of XPath/XSLT

We've seen that XPath and XSLT can form a second line of defense against invalid data. The value of this second stage in the validation architecture will be judged by what it can do that Schema cannot. Here's a short list of constraint patterns XPath can express well:
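One such pattern is the value-dependent rule from the transaction example, where the children an element must have depend on an attribute value in the instance. The transaction schema is not shown in this text, so the following is only a sketch under assumed names: the Sale element, its type attribute with the value in-person, and the Signature and VisualID children are all hypothetical. Only the error message is taken from the article.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Sketch only: Sale, @type, Signature and VisualID are hypothetical names,
     not taken from the article's transaction schema. -->
<xsl:transform version="1.0"
  xmlns:xsl="http://www.w3.org/1999/XSL/Transform" >
  <xsl:output method="text" />
  <xsl:strip-space elements="*" />

  <!-- Override the built-in rule so that text nodes are not copied to the output -->
  <xsl:template match="text()" />

  <!-- Select what shouldn't exist: an in-person sale that lacks
       either of the two verification steps -->
  <xsl:template match="Sale[@type = 'in-person'][not(Signature and VisualID)]" >
    <xsl:text>ERROR: In-person sales must have verified signature and visual ID.&#10;</xsl:text>
  </xsl:template>
</xsl:transform>
```

The first predicate reads a value from the instance and the second tests structure, which is the combination the schema cannot state. Sales of any other type fall through to the built-in rules and produce no output.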
The third line of defense, if you will, is application code. Clearly, XPath and XSLT cannot do what this code can do; computational ability especially is limited. XPath has some math functions, and XSLT's flow-control constructs and variables can be used to perform simple calculations, such as a sum of products. This only scratches the surface of what a modern programming language can do. Still, XPath/XSLT will do whatever it can do in very few lines of simple code; we're only hoping that this stage can handle enough of the load to make its inclusion in the process worth the trouble. Code-level integration of XPath and XSLT offers great advantages, too, and may blur the line between the second and third stages as described here.

A frustration at the moment is that XPath has yet to catch up with XML Schema's datatypes. It would be nice, for instance, to use XPath to select all flights in an itinerary to ensure that they are indeed sequential. XPath 1.0 doesn't have a

Postscript — Similar Approaches and Tools

We've discovered a multi-stage validation architecture based entirely on W3C-standardized technology. Out in the big bad world, another popular transformation-based approach is Schematron, an open-source tool which specifies constraint definitions in its own language. Its vocabulary simplifies the XSLT structure shown in the previous section, and relies on XPath for its constraint expressions. It also allows for both "positive" and "negative" assertions; the negative sort, which "select what shouldn't exist", is our approach here. The big difference in process is that a Schematron schema must be pre-compiled, or "pre-transformed" if you will, into a validating stylesheet, which once created is the true counterpart to the pure-XSLT transformations used here. (For a primer on Schematron, see Ogbuji.)

Recommended Reading
W3C and Other XML Specifications:

Other XML Validation Articles: