Beyond HTML

XML Opportunities Knocking

By Michael Floyd
Web Techniques
December 1998
Volume 3, Issue 12

Just in case the Kenneth Starr report has in some way overshadowed the goings-on over at the World Wide Web Consortium (W3C), I'm here to tell you that seXML sells. Well, OK, not just yet, but it will. Consider that Tim Bray, on XML day at the Seybold Conference held in San Francisco this past September, essentially said that while he didn't expect HTML to go away anytime soon, XML would eventually replace it. With 1.5 million Web pages born daily (according to Alexa Internet), it's easy to understand why HTML is here for a little longer. In fact, grandfathering out HTML will likely take years.

However, the interesting story behind XML is that it will likely take hold in areas that have little to do with the actual presentation of Web documents. And that's where the opportunities for XML developers lie. This month, I continue my discussion with Charles Goldfarb, the father of SGML and coauthor of The XML Handbook (Prentice-Hall), to find where these opportunities are.

What other kinds of applications are we likely to see over the next year or two?
Well, that's a big list. There are a couple of dozen applications already.

I think you break them down in The XML Handbook into four categories. One involves databases.
Well first, it's hard to say what's an application versus what's a technology or a mode of use. For example, the Washington Post has a "help wanted" kind of Web site that dynamically communicates with the Web sites of its advertisers to get the earliest notice of employment openings. They share the data in XML, so the site doesn't have to understand the actual format of the data sources that it's dealing with. Now, when they finally deliver [the data] to a [browser], today they're doing that in HTML because you don't have native browser support for XML. But all that inter-application communication going on at the server is done with XML. So do you call that a data-processing application?

Yes, that's a basic question: "How do you define application?" There's a lot of gray area there.
That's the real point. A lot of our old boundary lines that we were forced into by older paradigms and tools are now breaking down.

Another area would be transaction processing.
Sure. Anything that you'd call message-oriented middleware (MOM). In the book I talk about MOM and POP applications, [POP] being presentation-oriented publishing. [POP applications are] the traditional area that SGML got into first, because those were the people with the biggest need. But the MOM applications are going to drive the use of XML on the Web.

And how will that evolve?
It's already starting to happen. Let's say you're doing online business integration with a supplier. You know that you need to get certain information about his products and inventory status in order for you to do your thing. With XML, you don't need to know the format of his database or the protocols for dealing with it, as long as you have some way of requesting the data. That request could itself be an XML document. Then you get back the data as an XML document which, in fact, might have originated from several different tables on the supplier's database–and you're totally immune to any future changes that he may want to make in his own system. So, in effect, XML provides a system-neutral interface, not in the sense of an API, but in the sense of the data format that's being interchanged as a result of exercising an API. And then the API can get very, very small–just, Send_Document, Receive_Document. You figure out from the document type just what the user is asking or transmitting.
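
[Ed. Note: The following is an editorial sketch, not code from the interview. It illustrates the "very small API" idea in Python, assuming the root element name stands in for the document type. The element names (inventory_request, inventory_status, part), the INVENTORY table, and the receive_document function are all hypothetical.]

```python
import xml.etree.ElementTree as ET

# Hypothetical supplier-side data; in practice it might span several tables.
INVENTORY = {"A-100": 42, "B-200": 0}

def handle_inventory_request(root):
    """Build an inventory_status reply for each requested part number."""
    reply = ET.Element("inventory_status")
    for part in root.findall("part"):
        number = part.get("number")
        item = ET.SubElement(reply, "part", number=number)
        item.text = str(INVENTORY.get(number, 0))
    return reply

# Assumption: the root element name identifies the document type.
HANDLERS = {"inventory_request": handle_inventory_request}

def receive_document(xml_text):
    """Accept a document, dispatch on its type, and return a document."""
    root = ET.fromstring(xml_text)
    handler = HANDLERS.get(root.tag)
    if handler is None:
        error = ET.Element("error")
        error.text = "unknown document type: " + root.tag
        return ET.tostring(error, encoding="unicode")
    return ET.tostring(handler(root), encoding="unicode")

request = (
    '<inventory_request>'
    '<part number="A-100"/><part number="B-200"/>'
    '</inventory_request>'
)
reply = receive_document(request)
print(reply)
```

The requester sees only the two document types. How the supplier stores the quantities never appears in the exchange, so the supplier could change it without the requester noticing.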

Another question I have involves electronic data interchange (EDI). In your book, you state that "EDI will no longer be isolated to certain industries or the largest enterprises, it will become as ubiquitous as email." Can you explain this rather bold assertion?
Before the Web, EDI required value-added networks (VANs)–you actually had to make arrangements for how the information was going to be communicated physically before you could even get started. That goes away with the availability of the Internet. Another problematic aspect of EDI is that the types of transactions all have to be defined in advance and agreed upon between you and your business partners. [Ed. Note: See "Business-to-Business E-Commerce," Web Techniques, November 1998.] Part of that is so that a transaction will have the force of a legal contract. But it also makes the message formats very rigid. You've got to renegotiate if you want to change anything. With XML, there's the potential of defining the document type–meaning the schema of the transactions. Then it's possible for individual companies to add information that's important to them, but isn't legally part of the transaction. That can be done in a way that doesn't compromise anything. By cutting down the barriers to entry and getting everybody onto the same network, you no longer have to be a large, rich company to get into this game.
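
[Ed. Note: An editorial sketch, not from the interview, of how a company might add its own information to a transaction without disturbing the agreed-upon part. The purchase_order vocabulary, the AGREED_FIELDS list, and the warehouse_bin element are hypothetical. A real deployment would validate against the negotiated document type rather than a hard-coded tuple.]

```python
import xml.etree.ElementTree as ET

# Hypothetical fields that the trading partners have formally agreed on.
AGREED_FIELDS = ("buyer", "item", "quantity")

order_xml = """
<purchase_order>
  <buyer>Acme</buyer>
  <item>widget</item>
  <quantity>12</quantity>
  <warehouse_bin>7C</warehouse_bin>
</purchase_order>
"""

def read_order(xml_text):
    """Split an order into the agreed transaction and any extra elements."""
    root = ET.fromstring(xml_text)
    agreed = {name: root.findtext(name) for name in AGREED_FIELDS}
    extras = {
        child.tag: child.text
        for child in root
        if child.tag not in AGREED_FIELDS
    }
    return agreed, extras

agreed, extras = read_order(order_xml)
print(agreed)   # the part that carries the agreed transaction
print(extras)   # company-specific additions, safe for a partner to ignore
```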

In looking for new development opportunities, who in your estimation will develop document type definitions (DTDs) and who will be DTD consumers?
Used properly, a DTD is simply a way for parties to express agreement on the meaning of what they're communicating. It isn't some outside set of imposed rules that forces you to slavishly do things in a certain way. People can't communicate unless they have some agreed-upon vocabulary to communicate with. The DTD is a way of letting people write down what that agreement is.

When you talk about opportunities for developers: In the SGML world, the major users have these enormous document collections, the structure of which in many cases is mandated by law, or by industry regulations that have the effect of a law. Because these are big, complex structures to begin with, expressing them formally as a DTD is a big deal. I don't see that being the case for these message-oriented middleware applications on the Web. You're dealing with smaller, simpler kinds of information structures. Expressing those–even expressing them well–in a DTD shouldn't be a big effort. I don't see that as a development opportunity. I think that the development opportunities are going to come in writing programs that will deliver in XML form information that's stored in some proprietary format; that is, maintaining an interface between you and the world at large in the same way that a proprietary networking system might translate things into TCP/IP.
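
[Ed. Note: An editorial sketch, not from the interview, of the kind of program described above: one that delivers in XML form information stored in some proprietary format. The pipe-delimited record layout, the FIELDS names, and the inventory/part elements are hypothetical stand-ins for whatever a real system would use.]

```python
import xml.etree.ElementTree as ET

# Hypothetical proprietary layout: one pipe-delimited record per line.
FIELDS = ("part_number", "description", "on_hand")

def records_to_xml(raw_text):
    """Wrap each proprietary record in named elements under one root."""
    root = ET.Element("inventory")
    for line in raw_text.strip().splitlines():
        record = ET.SubElement(root, "part")
        for name, value in zip(FIELDS, line.split("|")):
            ET.SubElement(record, name).text = value.strip()
    return ET.tostring(root, encoding="unicode")

raw = "A-100|Left-handed widget|42\nB-200|Right-handed widget|0"
xml_out = records_to_xml(raw)
print(xml_out)
```

Only this translation layer needs to know the internal layout. Everything outside it sees named elements.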

Why are parsers of such great interest right now, and how are they currently being used?
This is the toughest thing for someone who has only been using HTML to grasp. In HTML, the concepts of element and tag and formatting command (or processing command) are all woven together. There's no practical difference, given the way most people use HTML–that is, to get a particular visual result. In actual fact, though, because HTML is also SGML, under the covers these distinctions are being maintained. And they're important, because when the Web developer understands them, everything else unlocks. So, here's the point: You've got to think of what's in that document as being data elements, just as if they were in a database. So, you might have a data element that's a customer number. What the tags are doing is therefore filling the same role as the schema metadata does in the database. The tag says, "the type of data element that this is, is customer_number." Just as in a database, the metadata doesn't say what to do with the data. It just says what it is. It gives meaning to that field. So, the content of the element–the stuff between the start tag and the end tag–is the data. The markup provides the schema information.
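
[Ed. Note: An editorial sketch, not from the interview, of the point that the tag names the data element while the content is the data. The customer document and its values are made up for illustration. Only customer_number comes from the discussion above.]

```python
import xml.etree.ElementTree as ET

# Hypothetical document; the tags say what each value is, not what to do with it.
doc = (
    "<customer>"
    "<customer_number>10442</customer_number>"
    "<name>J. Smith</name>"
    "</customer>"
)

root = ET.fromstring(doc)

# child.tag plays the role of schema metadata; child.text is the data itself.
fields = [(child.tag, child.text) for child in root]
for tag, text in fields:
    print(f"{tag}: {text}")
```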

Now, the way you process it is in a separate thing called a "style sheet." You may have different style sheets for different purposes. I'm using the term "style sheet" very broadly. It may have nothing to do with the presentation style. It may have to do with the arithmetic you perform on a "rate of pay" data element. There's no reason to draw those artificial lines. Any kind of script you can reference from a dynamic HTML page, you can reference from a style sheet that's going to process XML. So we now have a chance to keep those three things separate: the abstract data; the markup that provides the metadata or schema information–that's what's in the tags; and then the processing stuff that's in your scripts that are invoked from the style sheet.
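
[Ed. Note: An editorial sketch, not from the interview, of keeping the three things separate. Two plain Python functions stand in for "style sheets" in the broad sense used above. They are not XSL, and the timesheet vocabulary and function names are hypothetical. The same marked-up data is handed to each one unchanged.]

```python
import xml.etree.ElementTree as ET

# The abstract data plus the markup that names it (hypothetical vocabulary).
doc = (
    "<timesheet>"
    "<employee>J. Smith</employee>"
    "<rate_of_pay>20.00</rate_of_pay>"
    "<hours>35</hours>"
    "</timesheet>"
)

def display_sheet(root):
    """A presentation-style 'style sheet': render every element as text."""
    return ", ".join(f"{child.tag}={child.text}" for child in root)

def payroll_sheet(root):
    """A processing-style 'style sheet': arithmetic on rate_of_pay."""
    return float(root.findtext("rate_of_pay")) * float(root.findtext("hours"))

root = ET.fromstring(doc)
shown = display_sheet(root)
gross = payroll_sheet(root)
print(shown)
print(gross)
```

Neither function changes the document, and neither one's logic is embedded in it.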

That XML document can be caused to use the data just the way you might have used ODBC. Or, it can be presented (or parts of it can be presented) for display at a client browser just the way you use HTML. Or–and possibly the most powerful thing of all–you can just clip the marked-up text and incorporate it in some other XML document.

With a large base of well-tested SGML tools, why aren't we seeing more commercial XML tools such as editors, DTD generators, and authoring tools?
It's early. And I don't think creating DTDs is such a big deal for XML applications.

Because they'll be much smaller.
Yes. They're simple. They're just like laying out a spreadsheet table. If they're more complicated than that, you're probably doing something wrong. Which doesn't mean the tools aren't on the way, but it could mean that the vendors are not judging their market very well. There will be some market for people who want to do SGML-type things with documents and can't justify getting into full-scale SGML. But I don't think that's going to be the dominant market for XML.

There are a lot of XML parsers out there, most of them free–we probably have more than the world will ever need. Whereas getting the first couple of SGML parsers out was like pulling teeth. XML editors? There are a few already. You even mention a couple at your Web site.

One last question. How do you view the state of XML today, and where do you see it going from here?
At this point in time, the world is being exposed to the buzz. Microsoft is out spending big bucks training people and making them aware of XML. But the real action is still with the major vendors and the major industry consortia. They're laying the framework that's going to make it possible, in six to eight months, for Web developers to start making use of XML in a big way. In the area of tools, I think you're going to see the most excitement in data-integration/middle-tier/application-server type tools–the things that make it easier for different Web sites to talk to one another. These tools will, for example, allow your Web site to act as though it were a client browser using some other Web site's data, and then to accumulate the results of that dialog along with other data that you collect, and send that all to [the user's browser] at once. That's what's new about XML, in terms of the Web. I mean totally new–not just differences in degree, the way the presentation-oriented stuff is.

So, what's really interesting and exciting here is that 98 percent of the XML data may turn out to be stuff that's written by a computer and sent to another computer for processing and then disappears. At the other end of the spectrum is the stuff that's been written by humans over long periods of time, and is intended for other humans to read. That's the traditional SGML POP application domain. XML is now scaled down to the point where it becomes efficient and sensible to do this other data-oriented, transaction-processing, MOM end of the application spectrum.

Before XML, data processing involved proprietary data formats that had to be negotiated between programs. These got so out of hand that we wound up making the data part of the programs and calling the result "objects." Now there's another way to keep data neutral and efficient, but at the same time restore its independence from the programming code. That's the promise of XML.

Michael is a Web developer and freelance writer. He is a founder of Web Techniques (now called New Architect).

Copyright © 1998 Charles F. Goldfarb. All rights reserved.
From an article in Web Techniques Magazine