Tools and methods for XML processing.

Abstract

This document is meant to help you set up your system and environment for processing XML effectively: validating documents and transforming them to other formats. It also describes how to use the most important XML tools. It assumes that you have some knowledge of XML, DTDs and so on, but need instructions on how to set up your environment to make your work more effective.

Most of the issues discussed here relate to DocBook or Simplified DocBook XML documents, especially in the XML transformation section. However, most of the solutions presented here are more general and can be applied to other XML DTDs.

You may ask why I created this document. Although all of this information is available on the Internet, I could not find a single source that discusses all the necessary details. I spent a lot of time searching for and discovering obvious things. Now I would like to help people like me who are starting to edit XML documents and looking for ways to improve their work.


XML, DTD and Schema.

Although I assume you know what XML, DTD and so on are, I want to define the terms I will be using to prevent misunderstanding.

XML - Extensible Markup Language.

“A simple, very flexible text format derived from SGML.” For full details, see w3.org. XML itself is a metalanguage for designing markup languages.

DTD - Document Type Definition.

This is a description of the permitted content for a family of XML files. Detailed information is, of course, on the w3.org site.

Schema

Schema is a newer system for document definition. It is an XML-based language, created with the goal of replacing DTDs with a modern language that removes the limitations of DTDs. Details are also on the w3.org site.

Parsing

Parsing checks whether a given file is well-formed XML: each opening tag has a matching closing tag, there is exactly one root element, and so on. No DTD or Schema is needed to parse a document.
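
A well-formedness check of this kind can be sketched in a few lines of Python using only the standard library. This is a minimal illustration, not one of the tools discussed in this document; the function name is_well_formed is my own, and the parser used here does not validate against any DTD.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text):
    """Return True if xml_text parses as well-formed XML, False otherwise.

    Hypothetical helper for illustration: it only parses the text and
    performs no DTD or Schema validation.
    """
    try:
        ET.fromstring(xml_text)
    except ET.ParseError:
        return False
    return True


# One root element with matching tags.
print(is_well_formed("<article><title>Hi</title></article>"))
# The <title> element is never closed.
print(is_well_formed("<article><title>Hi</article>"))
# Two root elements.
print(is_well_formed("<a/><b/>"))
```

Only the first of the three sample strings is accepted; the other two fail for the reasons named in the comments.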

Validation

Validation is the process of checking a given document against the DTD or Schema that applies to it. At validation time both resources are therefore necessary: the XML file and its DTD or Schema.
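
As a rough sketch of how validation can be scripted, the Python fragment below calls xmllint, the command-line tool that ships with libxml2. It assumes xmllint is installed and on your PATH; the function names and the file name in the usage note are hypothetical.

```python
import subprocess


def xmllint_validate_command(xml_path):
    """Build the xmllint command line that validates xml_path against its DTD."""
    # --valid : validate against the DTD named in the DOCTYPE declaration
    # --noout : report problems only, do not print the parsed document
    return ["xmllint", "--valid", "--noout", xml_path]


def validate(xml_path):
    """Run xmllint on xml_path and return (ok, messages).

    Assumption: xmllint is available on PATH. A zero exit status is
    treated as "valid"; error messages are taken from standard error.
    """
    result = subprocess.run(
        xmllint_validate_command(xml_path),
        capture_output=True,
        text=True,
    )
    return result.returncode == 0, result.stderr
```

Calling validate("article.xml") would then return a flag saying whether the document is valid, together with any messages xmllint printed.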

Catalog

A catalog is a local repository for DTDs, Schemas and other entities used in XML documents.

In XML documents, entities are usually declared with a location that points to the entity creator's site. In most cases this means that whenever the entities are used (i.e. when validating an XML document), they must all be downloaded from the source location. That requires an active Internet connection and additional time for downloading.

There are two ways to use a locally stored entity instead of the one on the Internet site. One is to change the entity declaration so that its location is the path to your local file; the other is to use catalogs.

As long as you are the only one working on an XML file, it is acceptable to change the entity location directly in the declaration. However, if several editors work on one file or on a set of files, it is very difficult and sometimes impossible to manage entities this way. Using catalogs solves this problem. A catalog maps each entity, by its identification string, to a file on your local disk. It is a system external to the XML files, so no changes to the XML are necessary. Each editor can have a different mapping definition and can store entities in a convenient place.

XSL - Extensible Stylesheet Language.

“An XSL stylesheet specifies the presentation of a class of XML documents by describing how an instance of the class is transformed into an XML document that uses the formatting vocabulary.” The source documents can be found on the w3.org site.

XSLT - XSL Transformations

XSLT is “a language for transforming XML documents into other XML documents.” The source documents are located on the w3.org site.

DocBook

DocBook is a widely used DTD for creating many different kinds of publications.

SDocBook - Simplified DocBook.

Simplified DocBook is a subset of the DocBook DTD aimed at new users, focusing on articles rather than books.

XML editing tools.

This document was written to help emacs users, so it concentrates on emacs-related issues. However, there are many other good XML editing tools, and most of them are worth considering as alternatives for your XML development. Below I present the XML editors I know of, with pointers to where you can find detailed information about them.

I will add more editors as I find them and learn that they are mature products usable for at least all basic XML editing tasks. There is one more condition: at least a basic version must be available for free, so that you can consider it as an alternative to other programs.

If you have or find information about more editors, please feel free to send it to me. I will add it to the existing list.

emacs

This is not a WYSIWYG or WYSIWYM application; you write the XML code yourself. However, it provides excellent helper libraries for text editing, validation against a DTD as you edit, and much more. It is difficult to point to a single emacs site because, although emacs is developed in one place, many additional libraries are available from many different projects. The first one you should check is of course the emacs home page. More resources and a lot of useful information can be found on the emacs wiki pages. It is fully free and open source software.

xxe - XMLmind XML Editor

This is a WYSIWYM editor. It is provided in two versions: a standard version, free of charge, and a professional version, which is not free. The free version has all the features necessary for basic XML editing. Files, documentation and license details can be found on the company site.

Introduction to XML processing tools.

XML processing includes at least 3 basic tasks:

  1. Parsing XML.

  2. Validating XML.

  3. Transforming XML to other formats.

Please note that I am not going to describe all possible XML tools. First I will cover the programs I currently use. Later I will add descriptions of other applications as I find them or as someone sends me a description.

Editors

Most programs used for editing XML files have built-in parsing functions, and most of them can even validate a document against its DTD. Emacs with psgml, for example, also validates against the DTD, but its validation is not complete. It is enough to find most errors during editing; however, to ensure that a document is fully valid, an external validator must be used.

I am not going to discuss editor parsing functions. I am going to concentrate on external tools.

libxml2 - The XML C parser and toolkit for Gnome.

This is a very fast tool set written in C. Its functionality covers almost all tasks necessary to handle XML documents, such as parsing, validation, resolving XIncludes, transformation, catalog support and so on.

It does not support Schema, and it does not help with publishing your XML documents, apart from transforming your files into HTML format.

It is available on almost all platforms, including Linux, Unix, Windows, CygWin, MacOS, MacOS X, RISC OS, OS/2, VMS, QNX, MVS, ... and I believe it should also work on others not listed here. Since it is written in C and the sources are available, it should be possible to run it on any platform with a working C compiler. And yes, it is open source software. However, you do not need to compile and install it on your own: it is already included in most Linux distributions as well as in CygWin.

xml.apache.org - aims to provide commercial-quality standards for XML-based solutions.

All tools available on this site are based on Java. Some of them are also developed for C, C++, Perl and maybe other languages. They are not as fast as the previous tool set, nor as portable: they work on all platforms where Java is available. However, the XML support provided by Apache is complete.

Below I list the most important packages available on xml.apache.org and describe what they can be used for:

Xerces2

This is an XML parser/validator. It supports Schema, DOM Level 3 and the XML 1.1 Candidate Recommendation, and works in Java, C++ and Perl.

Xalan

This is an XSLT processor for transforming XML to other formats. It works in Java and C++.

FOP

This is a formatting objects processor. It transforms documents from the formatting objects structure to a specified target format. At the time of writing, the following output formats are supported: PDF (primary output target), PCL, PS, SVG, Print, AWT, MIF and TXT.

Cocoon

This is an XML publishing framework. It uses technologies such as XSL and XSP, and it is designed with performance and scalability in mind.

Forrest

This is a project documentation framework based on Apache Cocoon. It provides XSLT stylesheets, Schemas, images and other resources. It also aims to be a Sourceforge-like project management tool.

Validating XML.

Transforming your XML sources to HTML output.

Catalogs - efficient DTD use.

The prolog of an SGML or XML file contains the declaration of the DTD and entities. It is called the DOCTYPE declaration. DTDs and entities are usually stored in external files. The first quoted string after PUBLIC is the DTD's PUBLIC identifier; the second quoted string is the SYSTEM identifier. Usually the SYSTEM identifier is a full URL to the website providing the DTD.

Example 1. Sample DOCTYPE declaration.

<!DOCTYPE section
  PUBLIC "-//OASIS//DTD Simplified DocBook XML V4.1.2.5//EN"
  "http://www.oasis-open.org/docbook/xml/simple/4.1.2.5/sdocbook.dtd">

When a program tries to validate an XML file, it must first load the proper DTD for that file. It knows which DTD to load from the DOCTYPE declaration. If the DTD location is a URL to some website, the file has to be downloaded first. While it is very convenient to have the DTD stored in one network location, it brings some difficulties. First of all, you must be connected to the Internet to validate an XML file. Second, even if you are connected, validation takes much longer than necessary because of the download time. There are also some minor inconveniences.
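
How a tool might read the two identifiers from a DOCTYPE declaration can be sketched in Python as follows. This is a simplified illustration, not how any particular parser does it: the regular expression and the function name doctype_identifiers are my own, and the pattern only handles the PUBLIC form with double-quoted strings.

```python
import re

# Matches: <!DOCTYPE root PUBLIC "public id" "system id"
# Simplification: only the PUBLIC form with double-quoted literals is handled.
DOCTYPE_RE = re.compile(
    r'<!DOCTYPE\s+(?P<root>[^\s>]+)\s+PUBLIC\s+'
    r'"(?P<public>[^"]*)"\s+"(?P<system>[^"]*)"'
)


def doctype_identifiers(text):
    """Return (root element, PUBLIC identifier, SYSTEM identifier) or None."""
    match = DOCTYPE_RE.search(text)
    if match is None:
        return None
    return match.group("root"), match.group("public"), match.group("system")


sample = '''<!DOCTYPE section
  PUBLIC "-//OASIS//DTD Simplified DocBook XML V4.1.2.5//EN"
  "http://www.oasis-open.org/docbook/xml/simple/4.1.2.5/sdocbook.dtd">'''

root, public_id, system_id = doctype_identifiers(sample)
print(root)
print(public_id)
print(system_id)
```

The SYSTEM identifier extracted this way is the URL a validator would have to download, which is exactly the step that costs a network connection and time.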

It therefore makes sense to keep DTD files locally on your hard disk. The SYSTEM identifier can be changed to point to a file stored on your local disk. This is a good solution if you are the only person who edits the file. It is difficult, however, to ensure that many editors store the DTD file in exactly the same directory. They may use different operating systems with incompatible path structures! A simple solution to this problem is to keep the DTD in the “current” directory. But if you work on several XML projects, each located in a different directory... Yes, a different approach must be used.

The solution to all of the above problems is CATALOGS. They provide a mapping from a PUBLIC identifier to a file location. You can therefore keep the URL of the DTD's Internet location in the document, while CATALOGS let you use a local copy of the DTD during processing. Each XML developer can have an independent catalog mapping, so local DTD copies can be placed in different locations. Catalogs are used not only for mapping DTDs but also for mapping entities.

Below I present my own CATALOGS configuration, along with some theory and all the details necessary to set it up and use it.

CATALOGS are files containing mappings from PUBLIC identifiers to file locations. There are two kinds of CATALOGS files. The older standard is a plain text file in a defined format. The newer standard keeps the mapping in XML files. XML catalogs are more flexible and also allow some path translation. However, the older standard is known to all tools, so my examples are based on the older catalog files.

Example 2. Sample CATALOG file content for the above DOCTYPE declaration.

PUBLIC "-//OASIS//DTD Simplified DocBook XML V4.1.2.5//EN" "sdocbook.dtd"

If the path to the DTD file is not absolute, as in the example above, it is relative to the catalog file location. This is a very convenient solution. Catalog files are usually placed together with the DTD set definition, and they are often shipped with DTD sets. You can find a docbook.cat file with the DocBook DTD set. However, a similar file is not available with the Simplified DocBook DTD set.
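
The lookup a tool performs with such a catalog can be sketched in Python. This is only an illustration under simplifying assumptions: it understands one-line PUBLIC entries with double-quoted strings and ignores everything else the catalog format allows. The function name load_catalog and the directory /opt/dtd/sdocbook are hypothetical.

```python
import os
import re

# One catalog entry per line: PUBLIC "public identifier" "file location"
PUBLIC_ENTRY_RE = re.compile(r'^\s*PUBLIC\s+"([^"]*)"\s+"([^"]*)"\s*$')


def load_catalog(catalog_path, catalog_text):
    """Map PUBLIC identifiers to local file paths.

    catalog_text is passed in explicitly so that the sketch runs without
    any files on disk. Relative locations are resolved against the
    directory containing the catalog file; absolute ones are kept as-is.
    """
    base_dir = os.path.dirname(os.path.abspath(catalog_path))
    mapping = {}
    for line in catalog_text.splitlines():
        match = PUBLIC_ENTRY_RE.match(line)
        if match:
            public_id, location = match.groups()
            mapping[public_id] = os.path.normpath(os.path.join(base_dir, location))
    return mapping


catalog_text = 'PUBLIC "-//OASIS//DTD Simplified DocBook XML V4.1.2.5//EN" "sdocbook.dtd"\n'
catalog = load_catalog("/opt/dtd/sdocbook/catalog", catalog_text)

# Resolve the PUBLIC identifier from the DOCTYPE declaration to a local file.
print(catalog["-//OASIS//DTD Simplified DocBook XML V4.1.2.5//EN"])
```

Because the location is resolved against the catalog's own directory, the whole DTD directory can be moved or copied to another machine without editing the catalog file.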

Dividing XML file into parts.