Three SGML metadata formats: TEI, EAD, and CIMI
Work Package 1 of Telematics for Libraries project BIBLINK (LB 4034)
The BIBLINK Project

4 Comparative analysis

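In this section, we compare the three schemes under discussion with respect to the following criteria: user community, control agency, expression of metadata, metadata concepts supported, rules for formulation of content, extensibility, future development path, and relationship to other metadata schemes.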

4.1 User Community

In this section we briefly survey the current state of usage for each of the three formats under study, with a view to giving some indication of how widespread its deployment is at present.

4.1.1 TEI headers

As they are an integral part of the TEI scheme, TEI headers are routinely found in all TEI-encoded documents. TEI encoding is widely accepted in several parts of the research community, in particular amongst those engaged in the creation of electronic libraries and text centres, in electronic publishing (for example of scholarly critical editions), and in the creation of language corpora for use in Natural Language Processing. In the US, leading electronic library projects, such as those at the Universities of Virginia, Michigan, and Indiana, all use TEI headers to document their holdings. Several major text creation projects (e.g. the Women Writers Project at Brown University, the NEH-funded "Model Editions Project", and many others) are already committed to their use. It is hard to think of a major electronic text creation project in the academic context which would not at least start by considering use of the TEI scheme.

In Europe, the TEI has been similarly successful, though the user profile has tended to be slightly different. For example, a number of highly visible commercial electronic publishing ventures (e.g. Chadwyck Healey's English Poetry and Cambridge University Press's Chaucer's Wife of Bath's Tale) have made use of it, and the TEI scheme has been mandated for use in corpus building and language engineering projects by a series of European expert groups.

Details of these and many other TEI applications are available from the TEI applications page, maintained by the project at <URL:http://tei-uic.edu/orgs/tei/apps/>.

For projects using the TEI Header, it is helpful to distinguish between its role as a quality control mechanism during resource creation and management on the one hand, and its role as a source of rich information for use in resource discovery on the other.

4.1.2 EAD

Although the EAD framework is still in beta-test form (with the first "official" release scheduled for early 1997), it has already been widely adopted (at least in principle) within the US archives community. "Within the first few months of alpha testing, scores of archives and libraries marked up selected finding aids."

The EAD format is likely also to be adopted as a standard by the archives community in the United Kingdom and may well emerge as an EU-wide standard. Repositories in the United Kingdom committed to EAD include Liverpool University, Glasgow University, and the University of Durham. The Public Record Office is currently conducting a pilot project with the aim of converting their listings to EAD, and there is growing interest from the British Library and NCA in developing EAD applications.

The Library of Congress has agreed to take on the task of maintaining the EAD. It is anticipated that the Society of American Archivists (SAA) will, at the appropriate time, organize an EAD advisory committee comprising representatives from the archival, library, and museum communities as well as from the maintenance agency.

4.1.3 CIMI records

The CIMI framework has been developed in the context of a recently-completed research project (Project CHIO), which explored the possibilities of using SGML and the Z39.50 search and retrieval protocol for museum information. So far, only CIMI members (mainly North American, but with some European representation) have actively used the framework, although the wider museum community is well aware of CIMI's work through a series of workshops and conference presentations.

Within Europe, the Aquarelle project has joined CIMI, and plans to develop the metadata aspect of its work.

It remains to be seen to what extent the CIMI framework will be adopted by the museum profession as a whole.

4.2 Control agency

In this section we briefly state the body or bodies responsible for the current and future states of the three formats under review.

4.2.1 TEI headers

Future development of the TEI is controlled by an Executive Committee, composed of representatives from the three sponsoring organizations and the two TEI editors. A larger Technical Review Committee was set up in 1996, which will take responsibility for the future development and maintenance of the TEI Guidelines. It is expected that a number of work groups will be set up to deal with specific development issues, the results of whose work will be ratified by the Technical Review Committee. This Committee will also take responsibility for the continued correction and maintenance of the Guidelines, in the light of experience gained during their use over the last couple of years. Membership and other administrative procedures of this Committee are similar to the ISO model, with particular domain-specific experts serving fixed renewable terms. (Further details are given in Procedures for Maintenance and Extension of the TEI Guidelines, available from <URL:http://www-tei.uic.edu/orgs/tei/ed/edw48.tei>.)

The TEI has announced its intentions of setting up work groups to develop proposals on a number of specific topics during 1997. These include:

4.2.2 EAD

The Library of Congress, Network Development/MARC Standards Office (ND/MSO) has formally agreed to serve as the maintenance agency for the EAD. As maintenance agency, LC will make the DTD and support documentation available and act as a clearinghouse for communications on the EAD, chiefly through the establishment of an electronic list and World Wide Web site.

The Society of American Archivists (SAA) will be responsible for ongoing supervision of the standard. It is anticipated that SAA will, at the appropriate time, organize an EAD advisory committee comprising representatives from the archival, library, and museum communities as well as the maintenance agency.

4.2.3 CIMI records

The CIMI Consortium itself acts as the control agency for the CIMI framework. If museums adopt the framework more widely, it is likely that one or more "official" museum bodies such as the Museum Computer Network, the American Association of Museums or the U.K. Museum Documentation Association will become involved, to give the framework a more neutral support platform.

4.3 Expression of metadata

Both the TEI header and the CIMI access points are metadata whose primary purpose is to "add value" to a specific SGML-encoded document. They might be termed closely-coupled metadata, in that the metadata forms part of the document itself. (The TEI header can be "de-coupled" to form an independent header: the CIMI access points cannot.) However, both formats serve as useful examples of metadata techniques that can be applied within an SGML framework.

The EAD scheme is pure metadata. There is no presumption that the archive being described is in any particular format. An EAD description is an artefact that is new-minted with the specific purpose of acting as a finding aid.

4.3.1 TEI headers

The TEI design means that all the metadata is gathered up in the header, and is separate from the document. Such links as are present point to the header, either from the document (e.g. language) or from elsewhere in the header (e.g. classification system). This means that the document relies on the TEI header being present, but the header does not need the document in order to be meaningful.

The TEI scheme allows for classification of the content of a document to be accomplished at any degree of granularity, though it is easiest to do this at the text level using the <textClass> element within the Header's <profileDesc>. Finer-grained characterizations are however possible within the TEI scheme, using the decls attribute mechanism (which allows for any structural element to specify the particular set of declarations applicable to it, including its classification), or more generally by using the generic linking mechanisms. The CIMI scheme, as already noted, has similar flexibility, and was developed specifically to enable multiple levels of description.

4.3.2 EAD

All of the metadata describing an archival resource is stored in the <findaid> element. The EAD header (unlike the TEI Header) is not metadata, in that it describes the finding aid itself. It might be termed meta-metadata!

4.3.3 CIMI records

Metadata is stored partly as per TEI in the TEI header, and partly as <topic> and <context> elements nested just inside the element to which they apply. This second approach is unique to CIMI within the three schemes examined in this paper. It is an approach that might be applied elsewhere, if "self-indexing documents" are required.

Embedding metadata within the body of a document has good and bad points. The positive aspects are:

It should however be noted that SGML-aware conversion software can easily extract this metadata and re-express it as a separate file containing independent links (ilinks) pointing to the correct place within the source document.
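The extraction step just described can be sketched in Python. This is only an illustration under stated assumptions, not CIMI's own tooling: we assume the document has been normalised to well-formed XML (the original is SGML), and the sample markup, the function name `extract_topics`, and the simple path notation used to point back into the source are all hypothetical stand-ins for real ilinks.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: <topic> elements nested just inside the element they describe.
SAMPLE = """
<text>
  <div>
    <topic>Impressionism</topic>
    <p>First section.</p>
  </div>
  <div>
    <topic>Claude Monet</topic>
    <topic>Giverny</topic>
    <p>Second section.</p>
  </div>
</text>
"""

def extract_topics(root):
    """Return (path of described element, topic text) pairs for every <topic>."""
    records = []

    def walk(element, path):
        # Count same-named siblings so each child gets a positional step like div[2].
        seen = {}
        for child in element:
            seen[child.tag] = seen.get(child.tag, 0) + 1
            if child.tag == "topic":
                # A topic applies to its parent, so we record the parent's path.
                records.append((path, (child.text or "").strip()))
            else:
                walk(child, f"{path}/{child.tag}[{seen[child.tag]}]")

    walk(root, f"/{root.tag}[1]")
    return records

records = extract_topics(ET.fromstring(SAMPLE))
for target, topic in records:
    print(target, "->", topic)
```

Each record could then be written to a separate file, leaving the source document untouched while still identifying the element to which each topic applies.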

4.4 Metadata concepts supported

In this section we compare each of the three schemes under review in terms of the features each supports for the encoding of bibliographic description, access terms and conditions, and subject terms or classification.

4.4.1 TEI headers

Full, authoritative information on the TEI header is available in chapter 5 of the TEI Guidelines.

Bibliographical description: As previously noted, the <fileDesc> component of the TEI header is precisely designed to give full bibliographic information, and is "closely modelled on existing standards in library cataloguing" (op. cit., p. 93); "It is the intention of the developers...to ensure that the information required for a catalogue record be retrievable from the TEI header" (op. cit., p. 137). Its component elements are taken more or less unchanged from analogous concepts in established bibliographic standards, chiefly the International Standard Bibliographic Description and the Anglo-American Cataloguing Rules.

Terms and conditions: The <availability> element "supplies information about the availability of a text, for example any restrictions on its use or distribution, its copyright status, etc.". This element can take one of a small set of predefined values for a status attribute; it can also contain a complex set of rights and conditions presented as prose. Concepts such as price and licence information are not held in analysed form, but could be included as prose description or notes. Several elements relating to distribution (for example <availability> and <idno>) are represented within a repeatable <publicationStmt> element within the <fileDesc> element, and can thus be different for different publishers, distributors, etc. of a resource. The <publicationStmt> element also contains information such as the name and address of the distributor, publisher, or release authority, and any associated identifier such as an ISBN or URI.
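As an illustration of how the status attribute and the prose statement might be read together, here is a minimal Python sketch. It assumes the header fragment is available as well-formed XML; the sample values, the function name `availability`, and the choice of "unknown" as the fallback when no status is given are our assumptions, not prescriptions from the Guidelines.

```python
import xml.etree.ElementTree as ET

# Hypothetical publication statement; names and identifiers are invented.
PUBLICATION = """
<publicationStmt>
  <distributor>Example Text Centre</distributor>
  <idno type="local">EX-0001</idno>
  <availability status="restricted">
    <p>Available for non-commercial research use only.</p>
  </availability>
</publicationStmt>
"""

def availability(publication_stmt):
    """Return (status, prose) for the first <availability> element, or None."""
    avail = publication_stmt.find("availability")
    if avail is None:
        return None
    # Rights and conditions are prose, so all we can do is normalise whitespace.
    prose = " ".join(
        " ".join("".join(p.itertext()).split()) for p in avail.findall("p")
    )
    # Assumption: treat a missing status attribute as "unknown".
    return (avail.get("status", "unknown"), prose)

result = availability(ET.fromstring(PUBLICATION))
print(result)
```

Only the status value is machine-actionable here; price or licence details would have to be recovered from the prose by a human reader.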

Subject terms and classification: These are recorded within the <textClass> element, using one or more of three distinct methods. The <keywords> element can be used to supply a list of descriptive keywords, either user-defined, or from a named authority such as Library of Congress Subject Headings. The <classCode> element can be used to specify a value from some pre-existing classification or taxonomy, such as UDC. The <catRef> element can be used to specify (by reference) a value from a classification or taxonomy supplied explicitly as a <classDecl> element elsewhere within the header.

These three methods enable very detailed subject classification information to be added using a combination of currently well understood techniques. There is however, as usual, no recommendation as to how individual projects should choose amongst them.
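The three methods can be seen side by side in the following Python sketch. It is a hedged illustration only: the header fragment is hypothetical sample data written as well-formed XML (with element names in the lower-case `keywords` form), and `subject_terms` is our own helper name rather than part of any TEI software.

```python
import xml.etree.ElementTree as ET

# Hypothetical header fragment using all three classification methods.
HEADER = """
<teiHeader>
  <encodingDesc>
    <classDecl>
      <taxonomy id="genre">
        <category id="g1"><catDesc>Verse</catDesc></category>
        <category id="g2"><catDesc>Prose</catDesc></category>
      </taxonomy>
    </classDecl>
  </encodingDesc>
  <profileDesc>
    <textClass>
      <keywords scheme="LCSH">
        <term>English poetry</term>
        <term>Sonnets</term>
      </keywords>
      <classCode scheme="UDC">821.111</classCode>
      <catRef target="g1"/>
    </textClass>
  </profileDesc>
</teiHeader>
"""

def subject_terms(header):
    """Collect subject information from <textClass> by each of the three methods."""
    text_class = header.find("profileDesc/textClass")
    # Method 1: keywords, tagged with the authority they are drawn from.
    keywords = [
        (kw.get("scheme"), term.text)
        for kw in text_class.findall("keywords")
        for term in kw.findall("term")
    ]
    # Method 2: codes from an external classification scheme.
    codes = [(cc.get("scheme"), cc.text) for cc in text_class.findall("classCode")]
    # Method 3: references resolved against categories declared in the header.
    declared = {cat.get("id"): cat.findtext("catDesc") for cat in header.iter("category")}
    categories = [
        declared[ref_id]
        for ref in text_class.findall("catRef")
        for ref_id in ref.get("target", "").split()
    ]
    return {"keywords": keywords, "codes": codes, "categories": categories}

terms = subject_terms(ET.fromstring(HEADER))
print(terms)
```

A consumer of TEI headers therefore has to be prepared for any combination of the three, since nothing in the scheme tells a project which to prefer.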

4.4.2 EAD

Bibliographic information: This is not a prominent feature of EAD. The "EAD header" provides a brief characterization of the finding aid itself, not the source archive. There is an <ADD> element, defined as follows: "adjunct to descriptive data. Optional. Provides for adjuncts to the descriptive information. This is information that will assist in the use of the archival material, but is not itself archival description.". One of its components is an optional <bibliography> element, which in turn contains a <bibRef> for an actual citation. This element has a loose content model, analogous to the <bibl> element in TEI. It allows for a reasonable degree of bibliographic description, but does not attempt to enforce any particular level of description.

Terms and conditions: information is held within <accessRestrict> and <useRestrict> within the <adminInfo> element. <accessRestrict> contains information on gaining physical access to the material, while <useRestrict> describes what restrictions apply to the allowed use of the material once physical access has been gained. These are optional elements, that can appear within the description of a component of the archive at any level. No guidelines are offered on the preferred structure and content of these elements.

Subject terms and classification: these are comparatively detailed within the EAD scheme. A number of specific elements are defined, grouped within the <controlaccess> element, so named because it contains controlled access terms. Each of these has a specific tag such as:

corpName An organization or group of people that is identified by a particular name and that acts, or may act, as an entity.

geogName A proper noun identifying the name of a geographical place, natural feature, or political jurisdiction.

occupation Occupations (including avocations) that are significantly reflected in the materials being described.

subject Specifies a subject term.

genreForm Types of material distinguished by intellectual content or physical characteristics.

These specific tags can be mixed in with a free text subject description, which rather complicates their usability for automatic topic extraction purposes.
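The complication can be made concrete with a short Python sketch. The fragment below is invented sample data, assumed to be well-formed XML; `access_terms` and `free_text` are hypothetical helper names. Pulling out the tagged terms is straightforward, but the prose that surrounds them is left incomplete once they are removed.

```python
import xml.etree.ElementTree as ET

# Hypothetical <controlaccess> content mixing tagged terms with free text.
CONTROLACCESS = """
<controlaccess>
  Papers relating to <corpName>Acme Shipping Company</corpName>
  and its operations in <geogName>Liverpool</geogName>, including material on
  <subject>Shipping</subject> and <occupation>Clerks</occupation>.
</controlaccess>
"""

ACCESS_TAGS = ("corpName", "geogName", "occupation", "subject", "genreForm")

def access_terms(controlaccess):
    """Collect (tag, term) pairs, ignoring the free text between tagged terms."""
    return [
        (child.tag, " ".join("".join(child.itertext()).split()))
        for child in controlaccess
        if child.tag in ACCESS_TAGS
    ]

def free_text(controlaccess):
    """Return the prose that remains when the tagged terms are left out."""
    parts = [controlaccess.text or ""]
    parts.extend(child.tail or "" for child in controlaccess)
    return " ".join(" ".join(parts).split())

element = ET.fromstring(CONTROLACCESS)
terms = access_terms(element)
leftover = free_text(element)
print(terms)
print(leftover)
```

The extracted terms lose the relationships the prose expressed between them, which is the usability cost noted above.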

4.4.3 CIMI records

Bibliographic information: CIMI records are conformant TEI documents, and can therefore use exactly the same components for detailed bibliographic description as discussed above for TEI.

Terms and conditions: Again, CIMI records can use the same components for this purpose as discussed above for TEI.

Subject terms and classification: Again, CIMI records could use the same components for this purpose as discussed above for TEI. However, the mechanisms available for linking text classification information to particular parts of a document were judged inadequate or too complex for use in CIMI. The CIMI application therefore extended the TEI scheme (using the TEI's built-in customization mechanism) to include two special-purpose metadata elements which can be anchored at any point within a document, with local scope. These elements (<topic> and <context>) were discussed above, in section 2.3.

4.4.4 Summary of feature coverage

This table summarises how some broad metadata features are covered by the three schemes examined:
| Attribute | TEI | EAD | CIMI |
|---|---|---|---|
| Bibliographic | <fileDesc> element | <ADD> . <bibliography> <bibref> | <fileDesc> element |
| Terms and conditions | <publicationStmt> . <availability> | <adminInfo> . <accessRestrict> <useRestrict> | <publicationStmt> <availability> |
| Subject terms, classification | <profileDesc> . <textClass> | <controlAccess> | <topic>, optionally qualified by <context> |

4.5 Rules for formulation of content

A consistent feature of the three SGML-based metadata schemes studied is that they are relatively relaxed about the way content is actually expressed. This can start at the structural level: a typical definition from the EAD Tag Library Description will be like the following, for <accessRestrict>: "...Contains: <head> (optional). followed by zero or as many as needed of the elements found in: Paragraph-level Elements."

In other words, access restrictions are to be described in prose and no specific elements are provided to represent specific concepts relating to access. A similar situation is found in the other two schemes, although the TEI Header does provide more specific elements in some cases as an alternative to running prose.

Beyond this, the three schemes have the following to say about allowed content:

4.5.1 TEI headers

The guidelines for creation of independent headers within the TEI scheme give some indications of parts of the Header to which such constraints are likely to be of importance: "where there is a choice between a prose content model and one that contains a formal series of specialized elements, wherever possible and appropriate the specialized elements should be preferred to unstructured prose". Similarly, in the discussion of the <title> element: "The level attribute must be used to indicate whether this is the title of a book, journal, or series. It is highly recommended that the type attribute be used to distinguish the main title from subordinate, parallel, or other titles". However, even in the case of independent headers there is no indication that the syntax or vocabulary of entries should be constrained in any way.

4.5.2 EAD

EAD makes no recommendations for the actual syntax or vocabulary of any textual element. This is a typical instruction (for the <corpName> element): "This element contains text and may contain any elements found in the Linking and Formatting Elements, as many times as needed."

However, the instructions for entering the attributes which provide metadata about the corporation name are much more specific:

role Used to specify the relation between the name and the item being described. The value supplied should be a word or phrase taken from the USMARC relator code list.

sources Used to indicate the source of the controlled vocabulary term contained in the element. Possible values are:

aat (Art and Architecture Thesaurus)

aacr2 (Anglo-American Cataloging rules, 2d ed., rev.)

dot (Dictionary of Occupational Titles)

4.5.3 CIMI records

CIMI shares the general TEI approach to content within the TEI Header. However, within its <topic> and <context> elements it provides more specific guidance, by proposing a set of specific values for the access-point attribute on the <topic> element, and the CHIO attribute on the <context> element. In both cases, the values are drawn from a closed list, itself compatible with the CIMI Z39.50 profile attribute set. Even here, however, there is an alternative mechanism for <context> which lets you declare other contexts as a value attribute, with a corresponding authority; this extensibility is further discussed in the next section.

The actual value of the <topic> is not constrained.

4.6 Extensibility

4.6.1 TEI headers

As noted above, the TEI scheme is designed to provide a framework which can be customized and extended to suit the user's exact requirements. Users can define their own custom tags, rename TEI elements to a form that is more acceptable within their community, define a new base structure for their information, undefine existing elements, modify content models etc. The published TEI Guidelines include examples of "approved" extensions which were developed along with the Guidelines themselves.

Within this general framework, the TEI Header is rather less easily extended than other parts of the DTD. If additional concepts are required, the existing elements that are to contain them need to be "undeclared", then re-declared with their amended content. This is rather less elegant than the standard methods for adding to a class of existing elements within the document itself, but functional.

In general, modifications to the header are associated with the selection of tagsets from the TEI scheme which imply that these predefined modifications will be needed. For example, one of the effects of selecting the TEI's predefined tag set for language corpora is to extend the TEI Header, by including tags for documenting demographic and other characteristics of the "participants" in a written or spoken text. Another is to add a fourth way of classifying texts, in terms of their situational parameters.

Another example is provided by work currently in progress at the Bodleian library, where a rich set of descriptive tags for components of a traditional manuscript description has been defined, and grafted into the existing TEI Header structure, simply by redefining the <encodingDesc> element to include a new <mssDescription> element.

4.6.2 EAD

We cannot find any suggestion that the EAD has facilities that would allow users to extend the DTD.

4.6.3 CIMI records

Insofar as CIMI records use the TEI header, the above remarks apply. In the specific area of subject classification, both the <topic> and <context> elements have been designed to accommodate an open-ended set of descriptors. For example, <context> has a CHIO attribute which contains a fixed list of the CHIO "context" access points, but it also contains a pair of attributes value and authority, which can be used together to provide a context taken from any authoritative source. This has already been used to encode museum concepts which do not happen to fall within the CHIO scheme:

<context value="measurements" authority="CDWA">

Thus two levels of extensibility are available. An open-ended range of classifications can be encoded using the existing framework. And it would always be possible for CIMI to extend its own fixed list of contexts and access points.
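A consumer of CIMI records has to handle both forms of <context>. The following Python sketch shows one way this might be done; it is not taken from the CIMI documentation. The function name `context_descriptor`, the rule that a CHIO attribute takes precedence when present, and the CHIO value "who" are all assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

def context_descriptor(context):
    """Return (authority, value) for a <context> element.

    Assumption: a CHIO attribute, when present, takes precedence; otherwise
    the value/authority pair is used.
    """
    chio = context.get("CHIO")
    if chio is not None:
        return ("CHIO", chio)
    value, authority = context.get("value"), context.get("authority")
    if value is None or authority is None:
        raise ValueError("context needs either CHIO or both value and authority")
    return (authority, value)

# The extended form quoted above, closed so that it parses as XML.
extended = ET.fromstring('<context value="measurements" authority="CDWA"/>')
# A fixed-list form; "who" is a hypothetical CHIO access point.
fixed = ET.fromstring('<context CHIO="who"/>')

print(context_descriptor(extended))
print(context_descriptor(fixed))
```

Returning the authority alongside the value keeps contexts from different sources distinguishable when they are indexed together.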

4.7 Future development path

4.7.1 TEI headers

As noted above, there have already been proposals for the extension of the TEI Header to handle the specific requirements of manuscript description: a recently-organized conference surveyed a range of activities in this area (see the Studley Manuscript Encoding Meeting). One of the workgroups to be set up by the newly chartered TEI Technical Review Committee will address this and related issues of extending the TEI Header in a controlled manner.

It also seems likely that the definition of a set of Guides to Good Practice in the application of the TEI Header to a range of materials, at least to the kinds of textual material held at electronic text centres, will consolidate existing and newly-emerging consensus on how best to make use of its flexibility.

4.7.2 EAD

The current version of EAD is undergoing beta-testing: presumably all development efforts will go into releasing the first "official" version of EAD. It is probably too early to say what will happen subsequently, but existing use of the beta version (and commitments made to testers) already limit the ability to change the EAD scheme in a non-upwards-compatible manner.

4.7.3 CIMI records

The CIMI framework is less finalised than either TEI or the EAD scheme. CIMI will review the results of Project CHIO, and has an open mind as to how the format might develop. The lack of any significant deployment does give CIMI the flexibility to change its mind. It has a keen desire for interoperability, and plans to talk to both EAD and TEI about this.

4.8 Relationship to other metadata schemes

This section attempts to assess the overall position of these SGML-based metadata formats in the more general scheme of things. In particular, we examine the relationship between these formats and the Dublin Core, IAFA, and MARC.

4.8.1 Dublin Core

The Dublin Core is a currently much discussed set of metadata elements, which is increasingly regarded as providing a useful basis for general purpose resource discovery activities, particularly with networked resources. It has the merit of defining a small number of very generally applicable concepts, into which almost any more elaborated set of metadata concepts can readily be mapped. Examples of mappings between Dublin Core and EAD, and Dublin Core and GILS amongst others are available from Miller 1996; we list a similar "cross-walk" for the schemes discussed here:
| DC heading | TEI | EAD | CIMI |
|---|---|---|---|
| Subject | <textClass> | <controlAccess> | <topic>, <context> |
| Title | <title> | <titleProper> | <title> |
| Author | <author> | <author> | <author> |
| Publisher | <publicationStmt> . <publisher> | <publisher> | <publicationStmt> <publisher> |
| OtherAgent | <sponsor> <funder>, <principal> <respStmt> <resp> <editionStmt> <resp> | | <sponsor> <funder> <principal>, <respStmt> <resp> <editionStmt> <resp> |
| Date | <publicationStmt> <date> | | <publicationStmt> <date> |
| ObjectType | <textClass> <keywords SCHEME="DCOT"> | | <textClass> <keywords SCHEME="DCOT"> |
| Form | [= SGML; implied] | [= SGML; implied] | [= SGML; implied] |
| Identifier | <publicationStmt> . <idno> | | <publicationStmt> . <idno> |
| Relation | | | |
| Source | <sourceDesc> <biblFull> | | <sourceDesc> <biblFull> |
| Language | <langUsage> <language> | | <langUsage> <language> |
| Coverage | <extent> | | <extent> |
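Part of the TEI column of this cross-walk can be applied mechanically, as the following Python sketch suggests. It is an illustration under assumptions: the header is taken to be well-formed XML, the sample record is invented, and the full paths given for each element (for example placing <title> inside <titleStmt>) reflect our reading of the usual TEI header layout rather than anything stated in the table.

```python
import xml.etree.ElementTree as ET

# Subset of the TEI column, written as ElementTree paths relative to <teiHeader>.
# The enclosing elements in each path are assumptions about header layout.
TEI_TO_DC = {
    "Title": "fileDesc/titleStmt/title",
    "Author": "fileDesc/titleStmt/author",
    "Publisher": "fileDesc/publicationStmt/publisher",
    "Date": "fileDesc/publicationStmt/date",
    "Identifier": "fileDesc/publicationStmt/idno",
    "Language": "profileDesc/langUsage/language",
}

# Hypothetical header; every value is invented.
HEADER = """
<teiHeader>
  <fileDesc>
    <titleStmt>
      <title>Sample poems: an electronic edition</title>
      <author>A. N. Author</author>
    </titleStmt>
    <publicationStmt>
      <publisher>Example Text Centre</publisher>
      <date>1996</date>
      <idno>EX-0001</idno>
    </publicationStmt>
    <sourceDesc><p>Born digital.</p></sourceDesc>
  </fileDesc>
  <profileDesc>
    <langUsage><language id="en">English</language></langUsage>
  </profileDesc>
</teiHeader>
"""

def to_dublin_core(header):
    """Map a TEI header to a dict of Dublin Core heading -> list of values."""
    record = {}
    for dc_name, path in TEI_TO_DC.items():
        values = [" ".join("".join(el.itertext()).split()) for el in header.findall(path)]
        values = [v for v in values if v]
        if values:
            record[dc_name] = values
    return record

record = to_dublin_core(ET.fromstring(HEADER))
for name, values in record.items():
    print(name, values)
```

Headings with no TEI equivalent in the table, such as Relation, simply do not appear in the output record.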

4.8.2 MARC

Chapter 24 of the TEI Guidelines addresses specifically the question of mapping the components of the TEI <fileDesc> on to corresponding MARC fields. The mapping defined there implies that automatic conversion would be difficult, even though each data item would be in an appropriate MARC field or subfield. For example, there is no provision for the 'Main Entry' (or USMARC 1XX fields) in the TEI header. The main entry should be manually constructed by the cataloguer, using appropriate name authority control, and human intelligence to select from the information given in a TEI header the agency primarily responsible for the intellectual content of the work. There is an <author> tag, but the form of the name would have to be checked by a cataloguer before the main entry was constructed. Specific sets of values for the TEI defined attributes would need to be enforced before the TEI tags could reliably differentiate between name, conference, or title series; in their absence there is no simple mechanical method for determining which MARC tag (410, 411, etc.) should be used for series <title> and <idno>. Safe practice would be to load any series statements into 490 fields, and then to conduct authority work on those fields.

Since then, however, there has been considerable progress: for example, with the definition of the 720 generic author field, some of the above difficulties are removed. In a report commissioned by the Oxford Text Archive, a detailed mapping between the TEI Header and USMARC is proposed, along with some more tightly specified cataloguing practices which together make automatic loading of TEI Headers to USMARC records feasible. The paper demonstrates that it is possible to create valid MARC records directly from SGML-encoded metadata, by defining a set of local practices and conventions in addition to the constraints enforced by the SGML document structure.
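The conservative practice described in this section (authors to the generic 720 field, series statements to 490 for later authority work) can be sketched in Python as follows. This does not reproduce the mapping proposed in the Oxford Text Archive report: the function name `header_to_marc_fields`, the sample header, the use of 245 for the title, and the representation of a record as a plain list of (tag, value) pairs are all our assumptions, and no real MARC serialisation is attempted.

```python
import xml.etree.ElementTree as ET

# Hypothetical header; every value is invented.
HEADER = """
<teiHeader>
  <fileDesc>
    <titleStmt>
      <title>Sample poems: an electronic edition</title>
      <author>A. N. Author</author>
    </titleStmt>
    <publicationStmt><publisher>Example Text Centre</publisher></publicationStmt>
    <seriesStmt><title>Example Electronic Texts</title></seriesStmt>
    <sourceDesc><p>Born digital.</p></sourceDesc>
  </fileDesc>
</teiHeader>
"""

def header_to_marc_fields(header):
    """Return (MARC tag, value) pairs derived from a TEI <fileDesc>."""
    fields = []
    title = header.findtext("fileDesc/titleStmt/title")
    if title:
        # Assumption: the title statement goes to field 245.
        fields.append(("245", title.strip()))
    for author in header.findall("fileDesc/titleStmt/author"):
        # 720 avoids constructing a controlled 1XX main entry automatically.
        fields.append(("720", (author.text or "").strip()))
    for series_title in header.findall("fileDesc/seriesStmt/title"):
        # Series go to 490, leaving authority work to a cataloguer.
        fields.append(("490", (series_title.text or "").strip()))
    return fields

fields = header_to_marc_fields(ET.fromstring(HEADER))
for tag, value in fields:
    print(tag, value)
```

Anything requiring a cataloguer's judgement, such as the choice of main entry, is deliberately left out of the automatic step.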



Page maintained by: UKOLN Metadata Group