XML Schema Clinic -- Working With Metaschema

Will Provost

Originally published at XML.com, October 2, 2002

If in its entire lifespan W3C XML Schema were used merely to validate document after earnest little document, it would have proven its worth. Happily, schema are useful for much more than validation. XML applications can extrapolate all sorts of functionality from an XML vocabulary: everything from document authoring and GUI-building to marshaling to workflow and process management. This has been the promise of XML all along.

WXS excels at just this sort of modeling. It provides a structural template that describes in detail each type and relationship: just the information an application would need, say, to build a new instance document from a data stream, or to create an intuitive GUI for data entry. Given the tremendous complexity of WXS, however, applications that consume schema face a daunting processing challenge. Often, the full power of the language is neither needed nor wanted, as modeling requirements may be relatively simple and developers don't want to be responsible for every possible wrinkle (and there are ever so many) in a schema. If only we could constrain a candidate schema to use just a subset of the full WXS vocabulary ...

Oh, wait — we can! "WXS vocabulary" is the tip-off: a schema is just an XML document, after all, and it can be validated like any other. What we need, in other words, is a schema for our schema.

In this article we'll investigate uses of metaschema, and techniques for creating them. This will bring us in close contact with the existing WXS metamodel, which is an interesting study in and of itself. We'll consider several strategies for bending this metamodel to our application's purposes, and see which strategies best suit which requirements. (To tip the hand a bit, the prize will go to the WXS redefine component as a way of redefining parts of the WXS metamodel itself.)

Talking the Talk

As soon as the prefix "meta" finds its way into the room, conversation tends to get a bit stilted. After all, models are also known as "metadata," and so terms such as "metamodel" and "meta-metadata" can be slinking around in the same discussion. The OMG offers some useful definitions in the Meta-Object Facility, or MOF. In particular, this specification defines formal levels for various kinds of data, each level shown here describing and governing the previous one:

The information level includes raw data: non-schema XML documents, database rows, or object instances in memory are all examples.
The model level describes the information: examples are XML schema, RDB schema, and UML models.
The metamodel level governs the "shape" of models themselves. Any modeling language has a grammar and structure that prescribes what can be expressed in a model, and how to express it. A metaschema is an expression of a metamodel.
There is a meta-metamodel layer, which is in fact the focus of the MOF specification, and which addresses issues of portability between metamodels such as UML, IDL, and WXS. We'll not try to breathe such thin air today.

So, where most schema discussions focus on the model level, we are going to be more concerned with working at the metamodel level. The descriptors consumed by our application will be called "candidate schema," and live at the model level; they can validate information, and can also be validated as instance documents based on our metaschema.

Discovering the WXS Metamodel

Before we gear up to create our own metamodels, note that WXS already has a standard metamodel. If this is surprising, ask yourself how your favorite parser makes sure that a given schema is itself valid. Many parsers have this metamodel encoded in their own logic, but the metamodel is also expressed in normative "schema for schema," which are ordinary WXS documents. In fact, you can validate a WXS document explicitly if you have these metaschema handy. ("Our schema documents" as shown below are slight modifications of the standard ones, with schema-location links between them so they can be referenced locally by a parser.)

Metamodel	Source	Our Schema Document
WXS Part 1 — Structures	WXS 1.0 Recommendation, Part 1, Appendix A	XMLSchema.xsd
WXS Part 2 — Data Types	WXS 1.0 Recommendation, Part 2, Appendix A	XMLSchema2.xsd
XML+Namespaces	http://www.w3.org/2001/xml.xsd	XML+Namespaces.xsd
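
As a minimal sketch of what explicit validation might look like, a schema document can carry an xsi:schemaLocation hint that pairs the WXS namespace with the local XMLSchema.xsd listed above, so that a validating parser treats the schema as an ordinary instance document. The note element is hypothetical, and whether a given parser honors this hint for the WXS namespace depends on the parser.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- A schema treated as an instance document: the schemaLocation hint maps
     the WXS namespace to a local copy of the schema for schemas. -->
<xs:schema
  xmlns:xs="http://www.w3.org/2001/XMLSchema"
  xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
  xsi:schemaLocation="http://www.w3.org/2001/XMLSchema XMLSchema.xsd">

  <!-- Hypothetical content; any ordinary declarations would do. -->
  <xs:element name="note" type="xs:string"/>

</xs:schema>
```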

The WXS metamodel is dense and brambly, and we won't attempt to map the whole of it here. To illuminate some useful areas, UML diagrams will show what definitions — top-level elements, complex types and groups — depend on what other ones. The notation is crude, focusing entirely on dependencies, and leaving out all cardinality and much other information. Suffice it to say that these had to simmer on the stove a long time before they were edible.

Don't be too spooked by this first overview diagram that shows the first few levels of the model starting from schema. Note that symbols that we recognize from general-purpose schema design are few and far between. There are many intermediates, such as schemaTop and redefinable, that are of no use to a model designer, but that must be understood in order to leverage the WXS metamodel.

Goals

Before we look at specific techniques, let's define more concretely what we're after. For any application that acts dynamically based on schema (the most intuitive example is probably a GUI builder), users of the application are expected to provide schema as a way of defining their requirements: what data-entry forms to build, for example. The application defines its own flavor of schema, if you will, which might enforce rules such as:

Refusal to accept derived complex types
Insistence on sequence-based models for child content (no all models)
Flattening the composition hierarchy to allow only two levels: parent and child element
Insisting on names for certain schema components
Insisting on default values for all optional attributes

Note that some of these rules imply restrictions on the standard metamodel, while some imply extensions. The application places these constraints on candidate schema to express business rules, to facilitate users' understanding by limiting redundant modeling options — or perhaps just to limit the scope of development for a version 1.0.

The goal, then, is to establish a metaschema for the application such that any candidate schema (1) is valid under WXS proper, and (2) observes the application's own rules.

Strategies

Once the standard metaschema are in hand, we can see several techniques for leveraging the WXS metamodel:

We could extend or restrict the metaschema, creating our own types where necessary, for instance a mySchema or myLocalSimpleType. This is a non-starter, though, because it flunks the first of our criteria: models that observe this derived metaschema would not be able to function as ordinary WXS schema, because our derived types would be unknown to generic WXS tools.

We could define metamodel information in our own namespace, and allow candidate schema to use both the WXS namespace and our own. This is simple enough, and is explicitly allowed under WXS by the openAttrs base type, which allows schema components to include attributes from other namespaces. This technique works well enough for extending the metamodel, but doesn't address requirements for restriction.

We could rewrite the WXS metaschema to suit our purposes. That is, we could simply edit XMLSchema.xsd to change component definitions. This feels a bit icky, and certainly poses the risk of breaking compatibility with WXS proper. It is a valid approach, however, so long as one exercises great care in making changes.

WXS provides a means of incrementally changing existing schema, in order to create new versions. This is the redefine component, and it turns out to be a potent means of leveraging the WXS metamodel itself. Only certain components can be redefined (remember the redefinable group from the overview diagram). Still, this is a preferable approach to creating modified metaschema documents, for maintenance as well as esthetic reasons.

Pattern-based validation is always an option, and it's an especially strong one here. In implementing various rules, we'll face the usual limitations of WXS in expressing document-scope constraints, and with redefine as our best option for reuse, we're additionally hobbled. Constraints expressed in XPath and asserted via XSLT or Schematron can establish rules that none of the above techniques can manage.

Examples

We'll now look at a series of simple examples that illustrate most of the techniques described above.

Using `redefine`

Let's say we're building a component that creates a graphical form for entry of application data. We want to use WXS as our type model to define the shape of this data, but the business requirements have been bounded so that the following WXS constructs are unnecessary:

Derived complex types
Nested hierarchies of child elements (we only want a two-level hierarchy of parent-child)
Conjunctions (all) in content models (we only want sequences and choices)

Each of these constraints can be expressed with a separate redefinition of the WXS metamodel. Therefore we build our own metaschema whose target is the standard WXS namespace. It includes a single redefine element:

<?xml version='1.0' encoding='UTF-8'?>

<xs:schema 
  targetNamespace="http://www.w3.org/2001/XMLSchema"
  xmlns:xs="http://www.w3.org/2001/XMLSchema" 
>

  <xs:redefine schemaLocation="XMLSchema.xsd">
  </xs:redefine>

</xs:schema>

A redefine can hold any number of redefinitions, so we'll add each of the three as children of the empty element shown above. First, we'll nix complex-type derivation. Here's another piece of the WXS metamodel, showing how type extension is implemented:

[Figure: Type dependencies: complexType to typeDefParticle]

Our target is the complexTypeModel: we redefine this to exclude complexContent and simpleContent as modeling options:

<xs:group name="complexTypeModel">
  <xs:choice>
    <xs:sequence>
      <xs:group ref="xs:typeDefParticle" minOccurs="0"/>
      <xs:group ref="xs:attrDecls"/>
    </xs:sequence>
  </xs:choice>
</xs:group>
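
To see what this redefinition rules out, here is a hypothetical candidate schema (not the DerivedType.xsd that accompanies the article) whose second type is derived by extension. The type and element names are invented for illustration, and the schemaLocation hint assumes that the metaschema is saved as GUIBuilder.xsd next to the candidate.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xs:schema
  xmlns:xs="http://www.w3.org/2001/XMLSchema"
  xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
  xsi:schemaLocation="http://www.w3.org/2001/XMLSchema GUIBuilder.xsd">

  <!-- Accepted: a plain sequence-based complex type. -->
  <xs:complexType name="Address">
    <xs:sequence>
      <xs:element name="city" type="xs:string"/>
    </xs:sequence>
  </xs:complexType>

  <!-- Expected to be rejected: complexContent is no longer an option
       in the redefined complexTypeModel group. -->
  <xs:complexType name="USAddress">
    <xs:complexContent>
      <xs:extension base="Address">
        <xs:sequence>
          <xs:element name="zip" type="xs:string"/>
        </xs:sequence>
      </xs:extension>
    </xs:complexContent>
  </xs:complexType>

</xs:schema>
```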

Removal of all content models is similarly straightforward. The target now is the typeDefParticle, which as the previous diagram shows is a focal point of the content modeling system.

[Figure: Type dependencies: typeDefParticle to localElement]

Our redefinition simply fails to list all as a member of the group:

<xs:group name="typeDefParticle">
  <xs:choice>
    <xs:element name="group" type="xs:groupRef"/>
    <xs:element ref="xs:choice"/>
    <xs:element ref="xs:sequence"/>
  </xs:choice>
</xs:group>
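
A hypothetical candidate that should trip this second constraint might look like the following sketch. The person element and its children are invented names, and the file name GUIBuilder.xsd in the hint is assumed to point at the metaschema under construction.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xs:schema
  xmlns:xs="http://www.w3.org/2001/XMLSchema"
  xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
  xsi:schemaLocation="http://www.w3.org/2001/XMLSchema GUIBuilder.xsd">

  <xs:element name="person">
    <xs:complexType>
      <!-- Expected to be rejected: all is missing from the redefined
           typeDefParticle group, so only group, choice, or sequence
           may appear here. -->
      <xs:all>
        <xs:element name="first" type="xs:string"/>
        <xs:element name="last" type="xs:string"/>
      </xs:all>
    </xs:complexType>
  </xs:element>

</xs:schema>
```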

Finally we attack the localElement component, forbidding deep hierarchies of complex-type elements by insisting that a local element have only simple type. A redefine of a complex type has the odd appearance of a type deriving from itself, here by restriction:

<xs:complexType name="localElement">
  <xs:complexContent>
    <xs:restriction base="xs:localElement">
      <xs:sequence>
        <xs:element ref="xs:annotation" minOccurs="0"/>
        <xs:choice minOccurs="0">
          <xs:element name="simpleType" type="xs:localSimpleType"/>
        </xs:choice>
        <xs:group ref="xs:identityConstraint" minOccurs="0" maxOccurs="unbounded"/>
      </xs:sequence>
    </xs:restriction>
  </xs:complexContent>
</xs:complexType>

(Note that the distinction in the standard metamodel between topLevelElement and localElement is critical here: without it we'd potentially be prohibiting complex-type elements, period, which would make for some pretty trivial document models!)
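
The effect of the localElement restriction can be sketched with a hypothetical candidate (again, not the article's NestedTypes.xsd) that nests one anonymous complex type inside another. All element names are invented, and GUIBuilder.xsd is assumed to be the metaschema file.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xs:schema
  xmlns:xs="http://www.w3.org/2001/XMLSchema"
  xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
  xsi:schemaLocation="http://www.w3.org/2001/XMLSchema GUIBuilder.xsd">

  <!-- A top-level element may still have a complex type. -->
  <xs:element name="order">
    <xs:complexType>
      <xs:sequence>
        <!-- Accepted: a local element of simple type. -->
        <xs:element name="id" type="xs:string"/>
        <!-- Expected to be rejected: a local element may no longer
             contain an inline complexType child. -->
        <xs:element name="customer">
          <xs:complexType>
            <xs:sequence>
              <xs:element name="name" type="xs:string"/>
            </xs:sequence>
          </xs:complexType>
        </xs:element>
      </xs:sequence>
    </xs:complexType>
  </xs:element>

</xs:schema>
```

As far as the redefinition shown above goes, it removes only the inline complexType child. A local element that points at a named complex type through its type attribute would presumably need a separate check.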

The completed metaschema is GUIBuilder.xsd. See also a valid candidate schema Valid.xsd as well as three that each flunk one of the constraints: DerivedType.xsd, UsesAConjunction.xsd, and NestedTypes.xsd. Note that all of the candidate schema are valid under normal WXS — just change the schemaLocation from "GUIBuilder.xsd" to "XMLSchema.xsd" to prove it.

Namespace Extension

Perhaps our GUI builder can handle choice and union types, although this obviously adds complexity. Two or more parallel interface panels must be presented to the user, which is a solvable problem. What to say about these panels, though? If the application is asked to allow entry of either a city and state or a ZIP code, for instance, how can the application indicate to the user which panel means what? The most intuitive approach would be for the candidate schema to name each of the possible choices in some descriptive way. WXS doesn't allow all possible children of choice to be named, though, and even if it did, component names are not generally meant for end-user consumption.

Here an attribute from a separate namespace is the natural choice for extending the content model. A very simple schema is developed for the new "named choices" namespace, and the candidate schema simply references this along with the normal WXS metaschema.
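
A candidate schema using such an attribute might look like the sketch below. The namespace URI http://example.com/named-choices, the nc prefix, and the label attribute are all assumptions made for illustration; the article does not spell out the actual names.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xs:schema
  xmlns:xs="http://www.w3.org/2001/XMLSchema"
  xmlns:nc="http://example.com/named-choices">

  <xs:element name="location">
    <xs:complexType>
      <xs:choice>
        <!-- Each branch of the choice carries a user-facing label in the
             hypothetical "named choices" namespace. -->
        <xs:sequence nc:label="City and state">
          <xs:element name="city" type="xs:string"/>
          <xs:element name="state" type="xs:string"/>
        </xs:sequence>
        <xs:element name="zip" type="xs:string" nc:label="ZIP code"/>
      </xs:choice>
    </xs:complexType>
  </xs:element>

</xs:schema>
```

Under these assumptions, the schema for the "named choices" namespace itself would need little more than a single global attribute declaration named label, of type xs:string, with the assumed URI as its target namespace. Generic WXS tools should simply ignore the foreign attribute, so the candidate remains an ordinary schema.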

Pattern-Based Validation

Let's say a given processor can't work with missing attribute values. The requirement is set that any attribute in the model must either be required or must provide a default value. This is not so easy to redefine using the attribute component type. We've stumbled across a general weakness of WXS: content model constraints based on instance values cannot be implemented.

In XPath, by contrast, it is dead simple to express this rule, and using XSLT it is just as easy to enforce it. The validating transform below can be applied to the candidate schema (see documents ExplicitAttributes.xsl and MissingDefault.xsd).

<?xml version="1.0" encoding="UTF-8" ?>

<xsl:transform version="1.0"
  xmlns:xsl="http://www.w3.org/1999/XSL/Transform" 
  xmlns:xs="http://www.w3.org/2001/XMLSchema"
>

  <xsl:output method="text" />
  <xsl:strip-space elements="*" />

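  <!-- Suppress the built-in rule that would copy text nodes to the output. -->
  <xsl:template match="text()" />

  <!-- Report any attribute that is optional, whether explicitly or by
       omission of use, and that carries no default value. -->
  <xsl:template match="xs:attribute
      [(@use='optional' or not(@use)) and not(@default)]">
    <xsl:text>ERROR:  Must provide a default value for optional attribute </xsl:text>
    <!-- An attribute reference has a ref rather than a name. -->
    <xsl:value-of select="@name | @ref" />
    <xsl:text>.&#10;</xsl:text>
  </xsl:template>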

</xsl:transform>
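
The same rule could plausibly be stated in Schematron, which was mentioned earlier as an alternative to hand-written XSLT. The sketch below assumes the ISO Schematron namespace, which postdates this article, and is not one of the article's accompanying documents.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<sch:schema xmlns:sch="http://purl.oclc.org/dsdl/schematron">

  <!-- Bind the xs prefix for use in the XPath expressions below. -->
  <sch:ns prefix="xs" uri="http://www.w3.org/2001/XMLSchema"/>

  <sch:pattern>
    <!-- Fires for every attribute that is optional, explicitly or by
         omission of use. -->
    <sch:rule context="xs:attribute[not(@use) or @use='optional']">
      <sch:assert test="@default">
        Must provide a default value for optional attribute
        <sch:value-of select="@name"/>.
      </sch:assert>
    </sch:rule>
  </sch:pattern>

</sch:schema>
```

The assertion states the condition that must hold (a default is present), where the XSLT template matches the condition that must not hold.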

Conclusion

Certainly, the system isn't perfect. There are many ways in which I'd like to leverage the WXS metamodel that are either closed to me or just too darned complicated to be worth the trouble. This isn't a shortcoming in WXS, as I see it; if the type model were as pliable as I'd like it to be, well, it just wouldn't be W3C XML Schema, and wouldn't have the tremendous descriptive power and precision that I also want.

Where they are feasible, redefinitions of schema components offer an elegant way to tailor the WXS model to the needs of an application. XPath/XSLT validation can provide another option, but it's important to see past logistics and remember that the WXS metamodel is as stiff as it is for a reason. If you find yourself demanding features in your application's candidate schema that make them invalid under WXS proper, or changing so many things that the metamodel is unrecognizable, you should probably be building your metamodel from scratch, or working from a different starting point.