Transformation Wrap

Content Integration and Multichannel Delivery with Cocoon

Mark Choate

Originally published by Quoin, Inc. and the Cutter Consortium, May 2005

In Douglas Adams' Hitchhiker's Guide to the Galaxy, a computer called Deep Thought is given the task of finding the answer to life, the universe, and everything. Much to everyone's disappointment, the answer Deep Thought delivers is "Forty-two." In an effort to explain the answer, Deep Thought says, "I think the problem, to be quite honest with you, is that you've never actually known what the question is."

I can relate to that story. In my decade of experience with content management systems (CMSs), I've encountered many applications that provided solutions to problems the designers did not quite understand. Most CMSs take an excessively narrow view of content management. At the center of their world lies a CM repository and a presentation layer focused primarily on displaying HTML. An application built upon the assumption that there is one source of content and one delivery channel greatly underestimates the complexity of the content management challenges many organizations face.

In this article, I will present a content management reference model that provides a more complete understanding of the CM process by placing greater emphasis on the demands associated with content integration and multichannel delivery. This will be followed by a discussion of Cocoon, a CM framework from the Apache Software Foundation that not only knows what the question is but also provides a compelling answer to it.

The Growing Challenge of Content Management

The technology used to solve a problem often changes the fundamental nature of the problem itself. Sometimes content management solutions are really content management problems in disguise.

If we look at the history of content management, from ancient Egyptian hieroglyphics to the latest ideological rant typed into a Weblog, we see an interesting trend. As content gets easier to create, copy, and share, the expected lifespan of that content decreases and the volume produced increases. While it may make perfect sense to write the Ten Commandments in stone, it is decidedly less sensible to publish the current state of my stock portfolio that way.

More importantly, information technology has greatly improved the efficiency of information sharing. As the marginal cost of producing one more copy or one more revision approaches zero, the number of copies and revisions goes up, as does the degree to which content can be personalized and customized according to individual preferences.

Technology has also created an explosion of document formats. Now, in addition to things like books and magazines, content is displayed on Web pages, tiny mobile phone screens, e-books, and PDF files. In a modern publishing environment, organizations have documents and data that reside in different systems, in different formats, and even in different languages, currencies, and scales. They are constantly creating new content internally and acquiring content from third parties. As the volume of data grows, it becomes increasingly difficult to manage.

Since the problem is an increasing number of sources of data and an overall increase in the volume of data, the tempting solution is to try to reduce the volume and limit the number of sources. Instead of multiple systems, organizations want to use just one comprehensive system that addresses all of their content management needs. Typically, this kind of enterprise CM system is centered on a content repository into which all content is entered and from which content in various formats is output. In this view, there is a "document" that can be stored in a centralized repository, from which instances of that document can be produced inexpensively in the form of flyers, brochures, and Web pages. Unfortunately, this Platonic ideal of content management is not a realistic view of the CM process, because data and documents are constantly in a state of flux (which I suppose requires a more existential view of content management).

Having a single, centralized repository simply isn't possible. The reality is that organizations do, and always will, source data from a variety of systems, aggregate the data, transform it, and distribute it in different channels. An ideal development framework must provide tools for content integration, which requires the aggregation of content from multiple sources and the transformation of that content into a common structure. At the same time, the framework should also provide tools to work in the opposite direction by enabling the developer to start with a common source of content and transform it into different formats.

Dynamic Documents

It is highly unlikely that the number of inputs to our CM process is going to decrease. Likewise, the number of outputs is only going to increase. We have a growing number of inputs and a growing number of outputs, and the lifespan of the data we use keeps getting shorter. We need to manage the flow of information through this complex process.

Content management is more than creation, duplication, and distribution. Central to the entire process is the transformation of content, which can involve amending errors, creating multiple versions, updating the document with more current information, even performing calculations and analysis on the source data. With each transformation, the document is substantively changed, and the document will undergo numerous transformations of this sort throughout its lifespan.

A CM solution that effectively mitigates this problem is one that enables the transformation process, streamlines it, and compresses the time required to shuffle the content from one form to another. Cocoon is a framework that can serve as a model for how this problem can be overcome.

Figure 1 serves as a reference model for the CM process, and it illustrates the typical flow of content through an organization.

Figure 1 -- A reference model of the content management process.

In this model, XML content is generated from a variety of sources and is transformed into a standard format in order to be aggregated. In some cases, the content is aggregated into a content repository, where it can be cataloged for improved searchability, and digital rights can be managed. There are some situations, especially when the lifespan of the content is relatively short, in which it is not aggregated permanently into a repository but is aggregated temporarily prior to output.

Finally, after aggregation, the content is transformed again and output. The output may be HTML, but it could also be a PDF document, a graphic, or any other file format. It is important to note that I have included the results of a query as a form of output. Conceptually, there is no difference between clicking on a link to request a page and submitting a query through a form.

Multichannel delivery can involve both a transformation of the file format and a customization of the content. In the case of a query, the file format remains unchanged, but the content itself is modified with each request. Many Web sites offer PDF versions of individual pages or articles. In this instance, the content remains the same, but the format changes.
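The flow in the reference model can be sketched in a few lines of JavaScript. This is an illustration of the model only, not Cocoon code: the two sources, their field names, and the functions fromWire, fromCms, toHtml, toText, and query are all hypothetical.

```javascript
// Illustrative sketch of the reference model; every name here is hypothetical.

// Two sources that describe a story in different shapes.
const wireItem = { headline: "Storm nears coast", body: "Winds are rising." };
const cmsItem = { title: "Harbor reopens", text: "Ferries resume Monday." };

// Step 1: transform each source into one common structure.
const fromWire = (item) => ({ title: item.headline, body: item.body, source: "wire" });
const fromCms = (item) => ({ title: item.title, body: item.text, source: "cms" });

// Step 2: aggregate the normalized content.
const stories = [fromWire(wireItem), fromCms(cmsItem)];

// Step 3: transform again for each delivery channel.
const escapeHtml = (s) =>
  s.replace(/&/g, "&amp;").replace(/</g, "&lt;").replace(/>/g, "&gt;");
const toHtml = (list) =>
  list.map((s) => `<h2>${escapeHtml(s.title)}</h2><p>${escapeHtml(s.body)}</p>`).join("\n");
const toText = (list) => list.map((s) => `${s.title}: ${s.body}`).join("\n");

// A query is just another output: same format, different content per request.
const query = (list, term) =>
  list.filter((s) => s.title.toLowerCase().includes(term.toLowerCase()));

console.log(toHtml(stories));
console.log(toText(query(stories, "harbor")));
```

Swapping toHtml for toText changes the format while the content stays the same; passing a different term to query changes the content while the format stays the same.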

Many CM frameworks have a much smaller footprint than this reference model suggests they should. Most solutions are limited to some form of content repository, plus tools for presenting that information. In contrast, Cocoon also provides tools for drawing content from several sources, transforming it, and aggregating it.

Content Integration and Multichannel Delivery in the Real World

I will now use two examples to illustrate the complexities of content integration and multichannel delivery and then follow that with a discussion of how Cocoon can play a unique role in managing this process. There are three factors that contribute most of the complexity to the process:

  1. Multiple sources of content. This content is stored in different systems and in different formats.
  2. Varied content shelf life. Some information grows stale almost immediately, like homemade bread. Some information is more Twinkie-like.
  3. Multichannel delivery requires customization of content as well as format. Often, the output format dictates customization of the content itself. For example, content prepared for a printed publication may be written differently than that prepared for a Web site. In addition, calculations or other data manipulation may be required.

Cocoon is a remarkably flexible solution that has a role to play in a variety of different CM situations. While it is referred to as a CM framework, don't let that label prevent you from considering Cocoon for projects not traditionally thought of as CM applications. When developing for the Web, the difference between a CM development framework and an application development framework goes away, especially if you broaden your definition of content to include data as well.

Now let's see how this flexibility plays out in the real world. The first example involves a newspaper publishing environment in which there are multiple editorial and CM systems in operation. In this case, Cocoon is used behind the scenes as a kind of "glue" between the systems, managing the flow between them. The second example involves a site that my company, Quoin, Inc., developed for the Five Star Alliance, in which Cocoon manages the entire application, including complex interaction with the user.

All the Content That's Fit to Output

Quoin consultants have worked with several newspaper companies in the past, managing and developing their online publishing initiatives. In each case, the newspapers have had editorial systems for print publications, plus separate CM solutions for internal archives and their Internet publishing initiatives. Since each system is highly specialized and supports an activity that has unique requirements, it is unlikely that newspaper publishers will ever have a single, massive CM solution that solves every problem. Rather, they will take a best-of-breed approach for each area. At the same time, they still need these systems to interoperate effectively, and this means that they need a process for connecting the distinct systems and managing the flow of data between the different elements.

Content flows between these disparate systems in both directions (see Figure 2). Content that is delivered through a wire service is routed to the newsroom editorial system and possibly to the Web publishing system. Stories published in editorial systems as well as on the Web are routed to the news archive, where additional metadata is applied to the content to assist in future research. Sometimes content from the archives is aggregated and placed online as an additional resource for current stories.

Figure 2 -- Two-way content flow in a newspaper environment.

When a newspaper organization takes a news story that was originally written for the printed newspaper and "repurposes" that content to be distributed through the Web, it is not simply a transformation in format or design. There are often substantive, meaningful changes made to the document. The Associated Press recently announced that it will begin to distribute stories to media organizations with two separate beginnings, one in the traditional inverted pyramid form and another in a more engaging, story-telling form. It's the same story, but with different words, not -- as is most often discussed in CM circles -- a story with the same words but displayed in different formats.

As we saw in Figure 1, the output or serialization of content can serve as the source content for another transformation. In this case, content written and produced for the print newspaper is routed to the Web site for publishing online. Between the editorial system and the Web CMS, the story is reformatted, metadata is added, and additional content, such as extra photos, is associated with it.

This is a good example of how a transformation-based framework can serve the CM needs of an organization that requires a heterogeneous assortment of systems. In this instance, the Cocoon solution is to mediate the flow of content between systems by generating and aggregating data from different sources and routing it appropriately.

From a Laptop to the Lap of Luxury

Recently, Quoin developed a site for the Five Star Alliance that features an online service for searching and booking luxury hotel accommodations worldwide. The site offers information to help the user decide where to go and where to stay while there. By providing consumers with destination information accessible through a sophisticated search interface, as well as tools that can complete the transaction by making reservations and processing payments, this site demonstrates the overlapping requirements of a CM solution and what would traditionally be viewed as a Web application. The primary elements are:

  • Static content. The site provides information about the destination itself and articles about available hotels. This information is stored in a database, much as any CMS would do. The articles themselves do not change often and are thus a relatively static source of information for consumers.
  • Dynamic content. When it comes time to actually make a reservation, the user needs to have more up-to-date information about each hotel's current rates and availability. In order to present this information to the traveler, the application has to retrieve the latest data from multiple partners, which means accessing data in several different systems operated by different companies.
  • User interaction. While it is possible to browse through the information about the hotels, the most common sort of user interaction will be an interactive search in which the user defines the search criteria and is then presented with a list of locations that meet those criteria. One feature of the site is a virtual travel agent, which works by asking the user a series of questions and then displaying suggestions based upon the user's answers.

The results of these queries are pages that combine both static and dynamic content (see Figure 3). If you refer back to Figure 1, you will see that in the reference model, query results are treated conceptually as just another form of output. This is important because it shows how the same underlying transformation-based architecture is suitable for more generalized software applications as well.

Figure 3 -- Query results are a combination of static and dynamic information.
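A query result of this kind might be assembled along the following lines. The hotel records, the partner lookups, and the field names are invented for illustration; they are not taken from the Five Star Alliance site, and the partner functions merely stand in for calls to external reservation systems.

```javascript
// Illustrative only: hotel data, partner lookups, and field names are invented.

// Static content: changes rarely, kept in the local database.
const articles = [
  { hotel: "Grand Hotel", city: "Paris", description: "Steps from the river." },
  { hotel: "Palm Court", city: "Miami", description: "Oceanfront suites." },
];

// Dynamic content: stand-ins for calls to partner reservation systems.
const partners = [
  (hotel) => (hotel === "Grand Hotel" ? { rate: 450, available: true } : null),
  (hotel) => (hotel === "Palm Court" ? { rate: 380, available: false } : null),
];

// Ask each partner in turn and use the first answer.
function currentOffer(hotel) {
  for (const lookup of partners) {
    const offer = lookup(hotel);
    if (offer) return offer;
  }
  return null;
}

// A query result merges both kinds of content into a single document.
function search(criteria) {
  return articles
    .filter((a) => a.city === criteria.city)
    .map((a) => ({ ...a, offer: currentOffer(a.hotel) }));
}

console.log(JSON.stringify(search({ city: "Paris" })));
```

The article text is read once and reused, while the offer is looked up on every request, which reflects the different lifespans of the two kinds of content.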

There is a much greater level of interaction required between the user and the underlying system in a Web application such as the Five Star Alliance site. Because of the sophisticated search requirements, this site makes use of Cocoon Flow, another tool provided by the Cocoon framework that serves as the glue between the user's interaction with the site and the underlying business objects.

The Transformation Cycle

Cocoon is a CM framework, but it is much more than that. Cocoon (as the metaphor would imply) wraps the process of the transformation of content. In the natural world, a cocoon isn't a moth's home, to which it returns after a long night of flittering about porch lights. It is a stage that wraps the transformation from caterpillar to moth. In a similar way, Cocoon can be used to wrap the content transformation process, in which content flows in from disparate systems and sources, gets transformed, and flows out again in various formats, versions, and designs.

Cocoon is an ideal implementation of a CM framework in that it understands the dynamic nature of content and the central role that content transformation plays in the process. While the purpose of this article is not to provide a tutorial on how to use Cocoon, it is informative to look at the basic architecture Cocoon employs to see why it is such an effective tool for this kind of work.

Cocoon Pipeline

Cocoon is a CM framework that wraps the transformation process in what it calls a pipeline. A pipeline is a series of steps through which content flows and is transformed. The components are a collection of interfaces, each of which has a distinct role in the CM process. The four most important elements of the pipeline are generators, aggregators, transformers, and serializers.

  • Generators are components that produce XML data in the form of SAX (Simple API for XML) events. Generators wrap the initial step in the transformation process, which includes taking non-XML data and converting it into XML, which is then sent to the next stage in the pipeline.
  • Aggregators produce XML content by combining XML content from multiple sources, and Cocoon provides more than one way to do this. While a generator generates XML in the form of SAX events for one source of data, an aggregator may combine output from multiple generators and then pass it to the next stage in the pipeline.
  • Transformers take SAX events for input and produce SAX events as output, performing some kind of transformation in between. Often, the transformation is handled by XSLT, but that is only one of many options.
  • Serializers are the final stage in the pipeline. Serializers take SAX events and output the content in the appropriate format, which can be XML, HTML, PDF, or any number of other formats.
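The division of labor among these four components can be modeled in plain JavaScript. The sketch below is conceptual: it does not use Cocoon's actual interfaces, the event objects are a simplified stand-in for SAX events, and the names generate, aggregate, rename, and serializeXml are hypothetical.

```javascript
// Conceptual model of a pipeline, not Cocoon's API. All names are hypothetical.

// Generator: turns non-XML data into a stream of SAX-like events.
function* generate(name, record) {
  yield { type: "start", name };
  for (const [key, value] of Object.entries(record)) {
    yield { type: "start", name: key };
    yield { type: "text", value: String(value) };
    yield { type: "end", name: key };
  }
  yield { type: "end", name };
}

// Aggregator: wraps the events of several generators in one root element.
function* aggregate(root, ...streams) {
  yield { type: "start", name: root };
  for (const stream of streams) yield* stream;
  yield { type: "end", name: root };
}

// Transformer: events in, events out (here, renaming one element).
function* rename(events, from, to) {
  for (const e of events) {
    yield e.type !== "text" && e.name === from ? { ...e, name: to } : e;
  }
}

// Serializer: consumes the events and writes the final format.
// Text is not escaped here, to keep the sketch short.
function serializeXml(events) {
  let out = "";
  for (const e of events) {
    if (e.type === "start") out += `<${e.name}>`;
    else if (e.type === "end") out += `</${e.name}>`;
    else out += e.value;
  }
  return out;
}

const article = generate("article", { headline: "Grand Hotel" });
const rates = generate("rates", { nightly: 450 });
const page = serializeXml(rename(aggregate("page", article, rates), "headline", "title"));
console.log(page);
```

Because every stage consumes and produces the same kind of event stream, stages can be added, removed, or reordered without changing the others, and only the final serializer decides the output format.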

In the newspaper example, content enters the pipeline from one system (e.g., the editorial system) and exits the pipeline into a different system. Instead of serializing the transformed content to the file system, the serialized content is captured by another application.

The Five Star Alliance site uses the pipeline as the primary means of aggregating content, both the articles that are in the local database as well as hotel price and availability data dynamically gathered from the partners' computer systems. In this environment, multiple sources of content in different formats and with different lifespans are aggregated and presented to the user as a single document.

Cocoon Flow

HTTP is a stateless protocol. Simplicity is its greatest advantage -- the user makes a request, and the server sends a response. Nevertheless, there are times when information needs to be saved in between requests, and Web developers historically have had to jump through plenty of hoops in order to manage the state of an application in an otherwise stateless environment.

Cocoon Flow manages the ongoing interaction with the user and greatly simplifies the task of maintaining state between requests, thus enabling programmers to develop a Web application in a manner much more like a traditional program. It works by making use of a special version of JavaScript that is able to save the application's state between requests. Cocoon does not keep the connection open between requests. Instead it makes use of continuations, which allow it to save the actual application state between requests, including the stack of function calls and the current values for local and global variables.
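Standard JavaScript has no continuations, but generator functions offer a rough way to see the idea at work. The sketch below is an analogy under that assumption, not a description of how Cocoon is implemented, and bookingFlow, startFlow, and resumeFlow are hypothetical names.

```javascript
// Sketch only: generator functions stand in for Flowscript continuations.
// The names bookingFlow, startFlow, and resumeFlow are hypothetical.

// The flow reads like one ordinary function; each yield marks a point where
// the server sends a page and waits for the next request.
function* bookingFlow() {
  const first = yield "destination.html";
  const second = yield "dates.html";
  return `Booked ${first.city} for ${second.nights} nights`;
}

// Suspended flows, keyed by an id that would travel with each page.
const suspended = new Map();
let nextId = 1;

function startFlow(flow) {
  const run = flow();
  const id = String(nextId++);
  suspended.set(id, run);
  return { id, page: run.next().value };
}

function resumeFlow(id, requestParams) {
  const run = suspended.get(id);
  if (!run) throw new Error(`unknown flow id: ${id}`);
  const step = run.next(requestParams); // picks up right after the last yield
  if (step.done) {
    suspended.delete(id);
    return { id, result: step.value };
  }
  return { id, page: step.value };
}

const started = startFlow(bookingFlow);
const middle = resumeFlow(started.id, { city: "Paris" });
const finished = resumeFlow(started.id, { nights: 3 });
console.log(started.page, middle.page, finished.result);
```

Local variables such as first survive between requests without being stored in a session by hand, which is the convenience the continuation-based approach offers. A generator can only move forward, so it is a looser approximation than a true continuation.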

In practice, it works like this: after receiving a request for a page, the script sends the requested page to the user and awaits the result. In the following example, JavaScript calls the sendPageAndWait function, which sends the page "firstPage.html" to be viewed by the user. Then, after the user takes some action (usually filling out a form or clicking on a link), the script starts up right where it left off. It calls the next line of code, which retrieves the data that results from the user's action.

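    // Send the page and suspend the script until the user responds.
    cocoon.sendPageAndWait("firstPage.html");

    // Execution resumes here; read the value the user submitted.
    var first = cocoon.request.get("first");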

In applications that require a large amount of user interaction, such as the one developed for the Five Star Alliance, the simplicity of this approach can greatly speed up the development process. In that application, Flow was used to mediate the user's interaction with the system in coordination with Cocoon's Form framework, which validates input and internationalizes the content.

Conclusion

As a CM framework, Cocoon has a much broader footprint than most CMSs. Many of these competing solutions are made up of a content repository and presentation tools (i.e., the aggregate and serialize segments of the Cocoon model). The Cocoon pipeline adds to that a layer that facilitates the generation, aggregation, and transformation of content from a variety of sources. It also provides tools through Cocoon Flow for efficiently managing the complex user interactions with Web applications.

Real-life content management is complex. More than likely, both the volume of data and the number of data sources will continue to grow, as will the various channels through which data can be published. Because each new source can feed each new channel, this will result in rapid growth in the number of pathways through which content may travel from the moment it is generated until it is consumed by the end user.

A one-size-fits-all, centralized CM solution is not well suited for this kind of dynamic publishing environment. While it may work well in situations where large volumes of relatively static content are to be published, its value begins to diminish as the data becomes more dynamic. A fundamentally different approach, focused on overcoming the challenges of content integration and multichannel delivery, is required.