Three SGML metadata formats: TEI, EAD, and CIMI
Work Package 1 of Telematics for Libraries project BIBLINK (LB 4034)
The BIBLINK Project

3 Technical context

This section discusses some technical aspects of SGML when used as a vehicle for metadata standards, specifically the role of a document type definition (DTD), the choice between descriptive and prescriptive styles, and the role played by SGML in quality control and resource discovery.

3.1 The role of a DTD

We noted above that SGML itself is not a convention for representing information, but a way of representing such conventions. SGML has little or nothing to say about how a document should be processed, what an application should do with it, or even what it means. It is not a protocol, in the sense that (say) Z39.50 is, nor is it a program. Different applications may use the information encoded in an SGML description in different ways, depending on their needs. A formatting application, for example, can choose to associate printing styles with particular elements, while a retrieval application can improve precision by searching only elements of a particular type, or within a particular context. What SGML offers is a way for such applications to interact with the same data in a mutually consistent and well-defined manner.
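As a minimal sketch of this point, the following shows a formatter and a retrieval function working on the same encoded record. It assumes a well-formed XML record as a stand-in for SGML (Python's standard library has no SGML parser), and the element names, the style table, and both function names are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical record; the element names are illustrative, not from any DTD.
RECORD = """
<record>
  <title>Hamlet</title>
  <publisher>Oxford Text Archive</publisher>
  <note>A note mentioning Oxford in passing.</note>
</record>
""".strip()

# Formatter: associates a (made-up) print style with each element type.
STYLES = {"title": "bold", "publisher": "italic", "note": "small"}

def format_record(root):
    """Pair each child element's text with the style chosen for its type."""
    return [(STYLES.get(child.tag, "plain"), (child.text or "").strip())
            for child in root]

# Retrieval: searches only inside elements of one type, improving precision.
def search(root, element_name, term):
    """Return the text of matching elements of the named type only."""
    return [el.text.strip() for el in root.iter(element_name)
            if el.text and term.lower() in el.text.lower()]

root = ET.fromstring(RECORD)
formatted = format_record(root)
hits = search(root, "publisher", "oxford")
```

Searching for "oxford" within the publisher element alone ignores the incidental mention in the note, which is the gain in precision described above.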

The part of an SGML system which makes this inter-operability possible is the Document Type Definition or DTD. A DTD defines names, attributes, and co-occurrence restrictions for all the identifiable elements and entities used by a class of documents. It says nothing about their semantics: it is the role of supporting documentation or usage notes to do this.
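The kind of co-occurrence restriction a DTD expresses can be sketched as follows. This is a toy under stated assumptions, not an SGML parser: content models are approximated by regular expressions over child element names, the record is well-formed XML standing in for SGML, and the element names and the `validate` function are hypothetical.

```python
import re
import xml.etree.ElementTree as ET

# Toy stand-in for DTD content models: each element name maps to a regular
# expression over the space-terminated names of its children.
# Roughly equivalent to: <!ELEMENT record - - (title, author*, publisher?)>
CONTENT_MODELS = {
    "record": r"(title )(author )*(publisher )?",
    "title": r"",
    "author": r"",
    "publisher": r"",
}

def validate(element):
    """Return a list of structural errors; an empty list means the tree conforms."""
    errors = []
    model = CONTENT_MODELS.get(element.tag)
    if model is None:
        errors.append(f"undeclared element: {element.tag}")
    else:
        children = "".join(child.tag + " " for child in element)
        if not re.fullmatch(model, children):
            found = children.strip() or "(empty)"
            errors.append(f"bad content in <{element.tag}>: {found}")
    for child in element:
        errors.extend(validate(child))
    return errors

good = ET.fromstring("<record><title>T</title><author>A</author></record>")
bad = ET.fromstring("<record><author>A</author><title>T</title></record>")
```

Note that the check looks only at element names and their order. Nothing in it inspects the text inside an element, which mirrors the point that a DTD says nothing about semantics.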

In particular, a DTD says nothing about how a document should be rendered on paper or on a screen, any more than it does about which elements should be indexed for rapid retrieval. In order to render a document, therefore, an SGML application will need a specification additional to (or as a substitute for) the DTD: this is commonly known as a stylesheet. The recently defined ISO Document Style Semantics and Specification Language (DSSSL, ISO/IEC 10179:1996) provides a standard for the definition of such specifications. Because this standard has only recently been adopted, most current SGML browsers and formatters tend to use their own stylesheet languages, but this is likely to change with the availability of more general-purpose DSSSL-compliant formatters such as JADE.

3.2 Descriptive or prescriptive?

Some DTDs are purely "descriptive": their goal is to specify all the elements that may appear in a large range of not particularly homogeneous materials. Others are more "prescriptive": their goal is to constrain as exactly as possible the contents of documents. Typically, a framework for encoding a range of existing material will aim to be descriptive, so that encoders can mark up "what is there". On the other hand, DTDs designed to hold newly-created information can be as prescriptive as they like, since the information will be added along with the structure. A tightly-defined structure can actually be helpful, by reducing the number of choices that an encoder has to make. One advantage of the SGML approach (which supports unlimited repetition, recursion and field lengths) is that even a tightly-controlled structure can support large and complex documents.

Of the schemes studied here, the TEI is the most general (least prescriptive), in that it is intended to cater for the widest variety of documents. The EAD is more prescriptive, in that it is intended for use with a specific class of documents, all of the members of which share certain elements and are unlikely to include others. The CIMI metadata records are also more prescriptive, in that they represent a customization for a particular set of applications of the general framework defined by the TEI.

In any DTD there will frequently be a choice between an analysed set of elements and free text. For example, the TEI offers three levels of formality for recording bibliographic references, from free text with arbitrary subelements (<bibl>) through to a fully-structured reference (<biblFull>). Within the TEI Header, there is a choice between using analysed subelements and free text paragraphs in areas such as the Publication Statement. This flexibility means that one cannot be certain that (say) a Publisher Name will always be available in analysed form within the metadata since it is an optional sub-element.

Even if <publisher> were made a mandatory element (which could be done with a minor change to the DTD), the SGML standard provides no means of controlling its content. Any syntax conventions or vocabulary control must be supported by additional application-specific software.
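A minimal sketch of such application-specific checking follows. It assumes a well-formed XML fragment as a stand-in for SGML; the authority list and the `check_publishers` function are hypothetical. It reports both problems discussed above: a publication statement with no analysed publisher element, and a publisher whose content is not a recognised form of name.

```python
import xml.etree.ElementTree as ET

# Hypothetical authority list of publisher names.
KNOWN_PUBLISHERS = {"Oxford Text Archive", "Oxford University Press"}

def check_publishers(root, allowed):
    """Report a missing <publisher> and any value outside the authority list."""
    problems = []
    publishers = list(root.iter("publisher"))
    if not publishers:
        problems.append("no <publisher> element present")
    for el in publishers:
        value = (el.text or "").strip()
        if value not in allowed:
            problems.append(f"unrecognised publisher: {value!r}")
    return problems

analysed = ET.fromstring(
    "<publicationStmt><publisher>OTA</publisher></publicationStmt>"
)
free_text = ET.fromstring(
    "<publicationStmt><p>Distributed by the OTA.</p></publicationStmt>"
)
problems = check_publishers(analysed, KNOWN_PUBLISHERS)
```

Both fragments could pass a parser's structural checks; only a check of this kind, external to the DTD, distinguishes them from a record carrying an approved publisher name.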

3.3 Hyperlinks

One feature of SGML that is relevant to the study of metadata is its ability to represent hyperlinks, not just to another information resource but to a specific point within it. This linking can be used to point to non-SGML objects as well as to SGML-encoded documents. For example, an SGML hyperlink could point to an area within a graphic image, or to a range of frames on a video. Thus it is possible to set up SGML-encoded metadata that includes machine-processable links to single points and passages within a wide variety of resources.

Two hyperlinking schemes are deployed by the applications being studied. Both are system- and platform-independent. EAD uses HyTime, an International Standard (ISO 10744) application of SGML. TEI and CIMI use TEI extended pointers, a scheme provided as part of the TEI application. HyTime is larger in scope and correspondingly more complex, while the TEI scheme is both conceptually and computationally simpler. The designers of the two schemes have, however, gone to some lengths to maintain compatibility between them. Software support for both schemes is increasingly being provided within SGML-aware browsers.

3.4 SGML for quality control

What facilities exist for assuring the quality and consistency of SGML-encoded metadata? Conformance of documents to their DTD is checked by an SGML parser, a program which checks that documents match the tree structure defined by the DTD. It may also be configured to produce a normalized form of the document, in which the element structure is represented unambiguously. (An SGML document need not represent explicitly all of its structural markup: various types of minimization, such as the omission of contextually-determined tags, are permitted by the standard.) The Element Structure Information Set (ESIS) output by such an SGML parser is an essential first step in the creation of an efficient general purpose SGML processing tool.
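The idea of an unambiguous, fully explicit element structure can be illustrated with a simple event listing, one line per attribute, start-tag, run of character data, and end-tag. The sketch below is only loosely modelled on the line-oriented output of SGML parsers; it assumes well-formed XML input as a stand-in for SGML, trims whitespace around character data as a simplification, and labels every attribute CDATA without consulting a DTD. The function name `esis_lines` is hypothetical.

```python
import xml.etree.ElementTree as ET

def esis_lines(element):
    """Yield one line per event: attributes, start-tag, data, end-tag."""
    for name, value in element.attrib.items():
        yield f"A{name} CDATA {value}"      # attribute, reported before its element
    yield f"({element.tag}"                  # start of element
    if element.text and element.text.strip():
        yield f"-{element.text.strip()}"     # character data
    for child in element:
        yield from esis_lines(child)
        if child.tail and child.tail.strip():
            yield f"-{child.tail.strip()}"   # data following a child element
    yield f"){element.tag}"                  # end of element

doc = ET.fromstring(
    '<bibl id="b1"><author>Shakespeare</author>, <title>Hamlet</title></bibl>'
)
lines = list(esis_lines(doc))
```

A downstream tool can process such a listing without any knowledge of SGML syntax or minimization rules, which is why output of this kind is a convenient basis for general purpose processing.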

A parser checks only the syntactic validity of a document. As mentioned above, additional software is necessary to check the semantic correctness of the content, for example to check that only terms from a controlled vocabulary are employed. Such checking is inevitably application-specific, requiring the development of application-specific software, either from scratch or by customization of more generic systems. Provided, however, that the DTD accurately reflects the structure of the metadata to be processed, it will be possible to develop more powerful applications than could be developed in the absence of marked-up data. For application areas (such as museum and bibliographic data) where the information is inherently complex with many inter-relationships, the SGML approach is also likely to be simpler to implement than the use of relational databases.

A regularly updated (and expanding) list of SGML tools and products is maintained at http://www.falch.com/SGMLtools, listing several hundred products, both commercial and public domain, categorized by function.

Semantic checking may be carried out during the process of document authoring, as a separate post-editing exercise, or both. SGML-aware application development systems such as sgmlc, Balise, or Omnimark can be used to construct modules for this purpose, either to run stand-alone, or integrated with authoring tools such as Author/Editor or WordPerfect.

Much, if not all, of the functionality of such integrated systems is also available in state-of-the-art object-oriented document management systems such as Astoria, along with many other desirable features. However, object-oriented technology has not yet reached the state of maturity where it can be considered a low-cost or wide-appeal solution.

For the immediate future it seems likely that hybrid document management systems will continue to dominate the market. The hybrid approach enables the system builder to use the known strengths of relational database technology (for example, with respect to document integrity, multiple access, etc.) in combination with the evident superiority of SGML as a document representation scheme, facilitating more sophisticated enquiry and retrieval facilities. This can be achieved, for example, by storing SGML objects as "BLOB"s within a relational database, or by modelling at least some part of the SGML conceptual schema directly in the relational database, with appropriate SGML documents being generated from the repository via SGML translation modules.
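The first of these options can be sketched as follows, using an in-memory SQLite database purely for illustration. The table layout and column names are assumptions, and the stored header is an abbreviated fragment rather than a valid TEI Header. One field (the title) is modelled relationally so that ordinary SQL can query it, while the complete SGML instance is stored opaquely as a BLOB.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE documents (
        id    INTEGER PRIMARY KEY,
        title TEXT NOT NULL,   -- modelled relationally, so SQL can query it
        sgml  BLOB NOT NULL    -- the full SGML instance, stored opaquely
    )
    """
)

# Abbreviated, illustrative fragment only.
header = (
    b"<teiHeader><fileDesc><titleStmt><title>Hamlet</title>"
    b"</titleStmt></fileDesc></teiHeader>"
)

conn.execute(
    "INSERT INTO documents (title, sgml) VALUES (?, ?)", ("Hamlet", header)
)
conn.commit()

# The relational column locates the record; the BLOB is returned unchanged
# for an SGML-aware tool to parse.
row = conn.execute(
    "SELECT sgml FROM documents WHERE title = ?", ("Hamlet",)
).fetchone()
stored = row[0]
```

The database guarantees integrity and concurrent access, but it cannot see inside the BLOB: any enquiry that depends on the SGML structure still needs either an SGML-aware component or further columns derived from the document.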

3.5 SGML in resource discovery

To use SGML metadata documents directly for resource discovery implies some sort of SGML-aware search engine, such as Open Text or BASIS. SGML-encoded metadata can always be converted to some other format, whether "on the fly" or as a batch operation, but without an SGML-aware search engine it will be difficult to take full advantage of the rich structuring inherent in the SGML data.

As noted above, the absence of low-cost full-featured SGML-aware database or document management systems encourages a hybrid approach to document management. Management and control information is stored in a conventional relational DBMS, with all its advantages for integrity control and management, from which complex SGML structured documents are generated for loading into a static SGML-aware document retrieval system, with all its advantages for efficient searching in complex structures. Results obtained from the document retrieval system can then be easily down-translated into an interchange format conforming to some externally agreed protocol such as Z39.50, Dublin Core, or even MARC.

At the Oxford Text Archive, for example, all the information required for a TEI Header is stored locally in a conventional Microsoft Access database, from which TEI Headers are dynamically generated. It is planned to load the headers, along with the texts to which they refer, into a single federated document management system using Open Text software. This database will service all bibliographic enquiries about texts, as well as analyses of the texts themselves, via a single forms-based interface. This architecture will also permit dynamic extraction of metadata information in a variety of different formats for use by remote clients. The TEI Header is certainly rich enough to support clients requiring Dublin Core records (see further section 4.8.1 below); at the OTA, it is hoped to define the headers sufficiently accurately to permit also the automatic generation of basic level MARC catalogue records on demand.
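The general shape of such a pipeline, generating a header from a database row and then down-translating it, might be sketched as follows. This is not the Oxford Text Archive's implementation: the row, its field names, both function names, and the three-element Dublin Core mapping are illustrative assumptions, the generated header is a minimal skeleton, and an XML parser stands in for an SGML one.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

# Hypothetical database row; the field names are assumptions.
row = {
    "title": "Hamlet",
    "author": "Shakespeare, William",
    "publisher": "Oxford Text Archive",
}

def tei_header(row):
    """Build a skeletal TEI-Header-like string from one database row."""
    return (
        "<teiHeader><fileDesc>"
        f"<titleStmt><title>{escape(row['title'])}</title>"
        f"<author>{escape(row['author'])}</author></titleStmt>"
        "<publicationStmt>"
        f"<publisher>{escape(row['publisher'])}</publisher>"
        "</publicationStmt>"
        "<sourceDesc><p>Source not recorded in this sketch.</p></sourceDesc>"
        "</fileDesc></teiHeader>"
    )

# Assumed mapping from Dublin Core element names to paths in the header.
DC_PATHS = {
    "Title": ".//titleStmt/title",
    "Creator": ".//titleStmt/author",
    "Publisher": ".//publicationStmt/publisher",
}

def dublin_core(header):
    """Down-translate a header into a flat Dublin Core-style record."""
    root = ET.fromstring(header)
    return {name: root.findtext(path) for name, path in DC_PATHS.items()}

record = dublin_core(tei_header(row))
```

The down-translation works only because each value is held in an analysed element. A header that recorded the publisher in a free text paragraph would yield no Publisher value, which is why the accuracy with which headers are defined matters for automatic generation of other formats.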

Similarly, the Z39.50 protocol has successfully been used within Project CHIO to carry out searches on CIMI metadata encoded in SGML, although in this case the actual searching was carried out on a database derived from the SGML, not directly on the SGML itself. Also, the system used was unable to support the CIMI concepts of context or inheritance. The use of Z39.50 with SGML documents is still very much at an experimental stage.



Page maintained by: UKOLN Metadata Group