Three SGML metadata formats: TEI, EAD, and CIMI
Work Package 1 of Telematics for Libraries project BIBLINK (LB 4034)
The BIBLINK Project

2 Overview of the schemes studied

This section gives an introduction to the three SGML-based metadata formats described in this report. For each, we provide a brief history, a design overview, and an example of usage.

2.1 The Text Encoding Initiative

The Text Encoding Initiative (TEI) is an international research project, sponsored by three leading professional societies, with substantial international funding from the Mellon Foundation, the US National Endowment for the Humanities, the European Union Language Engineering Programme, and the Canadian Social Science and Humanities Research Council. Its primary goal was to define a set of recommendations for the encoding of literary and linguistic textual materials in electronic form, both to standardize existing work and to facilitate the development of good practice in a rapidly developing field. The project began in the winter of 1987, and the most recent version of its chief deliverable, the two volumes of the TEI Guidelines, was published in May 1994 (). Some indication of the wide range of work carried out within the TEI project is provided by the essays collected in ; for a brief overview of the project's structure and organization, see

The work of defining the TEI recommendations was carried out in a number of working groups and committees, with over a hundred volunteer contributors recruited from the international research community. Partly as a consequence of this large and varied user base, the TEI Guidelines are extremely flexible: the end result of the project was a modular, extensible document type definition, combining a number of sets of element and attribute definitions that can be mixed and matched in a variety of ways according to the needs of particular communities. One of the most significant components of the TEI scheme is the one defining a detailed bibliographic description, known as the TEI Header.

2.1.1 The TEI Header

The TEI Header was defined in the first phase of the project, largely within the TEI working committee on Text Documentation, whose members included professional librarians and archivists as well as experts in markup. It was subsequently revised and expanded, with significant input from several TEI Working Groups, notably those concerned with the encoding of spoken language and with the organization of language corpora. Its primary object is to address "the problems of describing an encoded work so that the text itself, its source, its encoding, and its revisions are all thoroughly documented. Such documentation is equally necessary for scholars using the texts, for software processing them, and for cataloguers in libraries and archives. Together these descriptions and declarations provide an electronic analogue to the title page attached to a printed work. They also constitute an equivalent for the content of the code books or introductory manuals customarily accompanying electronic data sets." (TEI P3, p. 89). It is noteworthy that this "electronic title page" is almost the only feature of the TEI encoding scheme which is mandatory.

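The TEI Header consists of four major sections: a file description (<fileDesc>), an encoding description (<encodingDesc>), a text profile (<profileDesc>), and a revision history (<revisionDesc>).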

The structure of a TEI Header is fully detailed in the Guidelines and contains specific elements for a very wide range of metadata (notably, almost all of the items identified in the Survey of Libraries' Metadata Requirements reported in BIBLINK study D1.1, section 7). It should be stressed that this structure is architectural, rather than legislative: in other words, the TEI proposes a rich collection of metadata components, and a structure within which they can be expressed and expanded. It provides little or no guidance as to the particular selection of such components which should be used by particular projects. Definition of such TEI Applications was consciously left to users of the scheme by its designers. Consequently, headers defined by different projects may vary widely. However, there are increasing signs of convergence in, for example, the practice of the growing number of electronic text centres and archives employing the TEI Header to document their holdings.

2.1.2 Sample TEI Headers

The following header is from the Victorian Women Writers Project at Indiana University:

<TEIHEADER><FILEDESC>
<TITLESTMT><TITLE>Liberty Lyrics (1895):
a machine-readable transcription</TITLE>
<AUTHOR>Bevington, Louisa Sarah (Guggenberger) (1845-?)</AUTHOR>
<RESPSTMT><RESP>Transcribed and encoded by </RESP>
<NAME>Felix Jung</NAME></RESPSTMT>
<RESPSTMT><RESP>Edited by </RESP>
<NAME>Perry Willett</NAME></RESPSTMT></TITLESTMT>
<EXTENT>TEI formatted filesize uncompressed&colon; 1426 bytes</EXTENT>
<PUBLICATIONSTMT>
<PUBLISHER>Library Electronic Text Resource Service (LETRS), Indiana University</PUBLISHER>
<DATE>September 22, 1995</DATE>
<AVAILABILITY><P>&copy; 1995, The Trustees of Indiana University. Indiana University makes a claim of copyright only to original contributions made by the Victorian Women Writers Project participants and other members of the university community. Indiana
University makes no claim of copyright to the original text.
Permission is granted to download, transmit or otherwise reproduce,
distribute or display the contributions to this work claimed by Indiana
University for non&hyphen;profit educational purposes, provided that
this header is included in its entirety. For inquiries about
commercial uses, please contact&colon;
<ADDRESS><ADDRLINE>Library Electronic Text Resource Service</ADDRLINE>
<ADDRLINE>Main Library</ADDRLINE>
<ADDRLINE>Indiana University</ADDRLINE>
<ADDRLINE>Bloomington, IN 47405</ADDRLINE>
<ADDRLINE>United States of America</ADDRLINE>
<ADDRLINE>Email: LETRS@indiana.edu</ADDRLINE></ADDRESS>
</P></AVAILABILITY>
</PUBLICATIONSTMT>
<SERIESSTMT>
<TITLE>Victorian Women Writers Project&colon; an Electronic Collection</TITLE>
<RESPSTMT><NAME>Perry Willett, </NAME>
<RESP>General Editor</RESP></RESPSTMT></SERIESSTMT>
<SOURCEDESC>
<BIBLFULL><TITLESTMT><TITLE>Liberty Lyrics </TITLE>
<RESPSTMT><RESP>by </RESP>
<NAME>L.S. Bevington</NAME></RESPSTMT></TITLESTMT>
<EXTENT>16 p.</EXTENT>
<PUBLICATIONSTMT>
<PUBLISHER>Printed and Published by James Tochatti, </PUBLISHER>
<PUBLISHER>&ldquo;Liberty&rdquo; Press </PUBLISHER>
<PUBPLACE>London </PUBPLACE>
<DATE>1895</DATE>
</PUBLICATIONSTMT></BIBLFULL>
<P>The copy transcribed is from Michigan State University Libraries.</P>
</SOURCEDESC>
</FILEDESC>
<ENCODINGDESC><EDITORIALDECL><P>All poems occur as DIV0. Sonnets are
attributed as "type=sonnets"; the rest are "type=poem". All quotation
marks, hyphens, dashes, apostrophes and colons have been transcribed
as entity references. All < lg > (line groups) are attributed as
cantos, stanzas, couplets, verse paragraphs, etc. All poems with
regularly indented lines use the attribute "rend" in the < l > tag,
with the value "indent1" for one tab stop, "indent2" for two tab
stops, etc. All split lines are attributed as "type=i" for the
initial portion, and "type=f" for the final portion.</P>
<P>All apostrophes and single right quotation marks are encoded as
&rsquo;.</P>
<P>Any hyphens occurring in line breaks have been
removed; all hyphens are encoded as &hyphen; and em dashes as &mdash;.</P> </EDITORIALDECL>
<TAGSDECL>
<TAGUSAGE GI="back" OCCURS="1"></TAGUSAGE>
<TAGUSAGE GI="body" OCCURS="1"></TAGUSAGE>
<TAGUSAGE GI="corr" OCCURS="4"></TAGUSAGE>
<TAGUSAGE GI="div" OCCURS="3"></TAGUSAGE>
<TAGUSAGE GI="div0" OCCURS="15"></TAGUSAGE>
<TAGUSAGE GI="div1" OCCURS="2"></TAGUSAGE>
<TAGUSAGE GI="docauthor" OCCURS="1"></TAGUSAGE>
<TAGUSAGE GI="docdate" OCCURS="1"></TAGUSAGE>
<TAGUSAGE GI="docimprint" OCCURS="1"></TAGUSAGE>
<TAGUSAGE GI="doctitle" OCCURS="1"></TAGUSAGE>
<TAGUSAGE GI="emph" OCCURS="15"></TAGUSAGE>
<TAGUSAGE GI="front" OCCURS="1"></TAGUSAGE>
<TAGUSAGE GI="head" OCCURS="18"></TAGUSAGE>
<TAGUSAGE GI="L" OCCURS="484"></TAGUSAGE>
<TAGUSAGE GI="lg" OCCURS="109"></TAGUSAGE>
<TAGUSAGE GI="p" OCCURS="7"></TAGUSAGE>
<TAGUSAGE GI="pb" OCCURS="14"></TAGUSAGE>
<TAGUSAGE GI="text" OCCURS="1"></TAGUSAGE>
<TAGUSAGE GI="titlepage" OCCURS="1"></TAGUSAGE>
<TAGUSAGE GI="titlepart" OCCURS="1"></TAGUSAGE>
<TAGUSAGE GI="titlestmt" OCCURS="2"></TAGUSAGE>
</TAGSDECL></ENCODINGDESC>
<REVISIONDESC>
<CHANGE><DATE>1995-06-30</DATE>
<RESPSTMT><NAME>Felix Jung, </NAME>
<RESP>editor.</RESP></RESPSTMT>
<ITEM>finished data entry, basic encoding and proofing</ITEM></CHANGE>
<CHANGE><DATE>1995-09-11</DATE>
<RESPSTMT><NAME>Perry Willett, </NAME>
<RESP>general editor.</RESP></RESPSTMT>
<ITEM>finished TEI-conformant encoding and final proofing</ITEM></CHANGE>
</REVISIONDESC>
</TEIHEADER>

This example demonstrates how traditional cataloguing (bibliographical) information, rights and permissions information, specific encoding details, and version information are readily combined in one descriptive framework. The same encoding framework also applies to the text itself, of course, since this is also encoded according to the TEI Guidelines. A computer application capable of handling such a resource is ipso facto capable of handling its associated metadata.

A second example TEI header is taken from one of the 4124 texts making up the British National Corpus (). In this project some of the tags proposed by the TEI have been renamed, and the flexibility of the scheme greatly curtailed. However, the basic structure remains the same.

<bncDoc id=BDHD0 n=ZIT04A>
<header type=text creator='dominic' status=new update=1994-04-19>
<fileDesc>
<titStmt>
<title>Minutes: Juniper Green Village Association -- an electronic version
</title>
<respStmt><resp>Data capture</resp>
<name>W R Chambers</name></respStmt>
<respStmt><resp>Transcription</resp>
<name>Oxford University Press</name>
</respStmt>
<respStmt><resp>Encoding, storage and distribution</resp>
<name>Oxford University Computing Services</name>
</respStmt>
<respStmt><resp>Text enrichment</resp>
<name>Unit for Computer Research into the English Language,
University of Lancaster</name></respStmt>
</titStmt>
<ednStmt n=1>Automatically-generated header
</ednStmt>
<extent kb=188 words=12139></extent>
<pubStmt>
<respStmt><resp>Archive site</resp>
<name>Oxford University Computing Services</name>
</respStmt>
<address>
13 Banbury Road, Oxford OX2 6NN U.K.
Telephone: +44 491 273280
Facsimile: +44 491 273275
Internet mail: natcorp@ox.ac.uk
</address>
<idno type=bnc n=ZIT04A>
<avail region=world status=unknown>
<!-- terms and conditions text summarized here -->
</avail>
</pubStmt>
<srcDesc><biblStr><monogr>
<title>Minutes: Juniper Green Village Association</title>
</monogr></biblStr></srcDesc>
</fileDesc>
<encDesc>
<projDesc>
See project description in corpus header for
information about the British National Corpus
project.</projDesc>
<refsDecl>
Canonical references in the British National Corpus
are to text segment (&lt;s&gt;) elements, and
are constructed by taking the value of the n attribute
of the &lt;cdif&gt; element containing the target text,
and concatenating a dot separator, followed by the value
of the n attribute of the target &lt;s&gt; element.
</refsDecl></encDesc>
<profDesc><creation date='1990/1993'></creation>
<txtClass>
<catref target='wriAD920 wriASe4 wriATy3 wriAud3 wriDom4 wriLev1 wriMed4 wriPP920 wriSta1 wriTAS3 wriTim2'>
<keywords><term>minutes</term></keywords>
</txtClass></profDesc>
<revDesc><change n=1>
<date value=1993-12-22>1993-12-22</date>
<respStmt><resp>Unprocessed text received by OUCS</resp>
<name>fgk</name></respStmt>
</change>
<change n=2><date value=1994-02-07>1994-02-07</date>
<respStmt><resp>Processed text passed to UCREL</resp>
<name>gmb</name></respStmt>
</change>
<change n=3>
<date value=1994-03-25>1994-03-25</date>
<respStmt>
<resp>Segmented text received by OUCS</resp>
<name>bryant</name></respStmt>
</change>
<change n=4>
<date value=1994-04-19>1994-04-19</date>
<respStmt><resp>Initial accession to corpus</resp>
<name>dominic</name></respStmt>
</change></revDesc></header>

This example also demonstrates how the integration of the header and text within a single encoding framework can be beneficial. The <catRef> element in the header above specifies the descriptive (classificatory) categories applicable to the specific text to which it is attached, by reference only. A full definition for each category used in the corpus is supplied in an additional corpus header, which is prefixed to the whole corpus. Each individual text header references the parts of the corpus header which apply to it by means of TEI pointers, as in this case. This kind of linking mechanism is widely used within the TEI scheme, with obvious advantages of consistency and validation. As a further example, a <language> element can be given in the header to define each language used throughout a text. For a multilingual text, each portion in a given language will then reference the appropriate <language> element using its lang attribute. The lang attribute is applicable to any element in the TEI scheme, which makes it possible to indicate changes of language at any desired level of granularity, from sections or subsections down to individual words.
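The pointer mechanism described above can be sketched in a few lines of Python. This is a minimal illustration rather than a description of any BNC software: the `CORPUS_CATEGORIES` table stands in for the category declarations in the corpus header, its descriptions are placeholders (not the real BNC definitions), and `wriXyz9` is a deliberately undeclared identifier invented to show how a dangling pointer would be caught.

```python
# Category identifiers declared once in the corpus header.
# Only a few are listed, and the descriptions are placeholders,
# not the real BNC category definitions.
CORPUS_CATEGORIES = {
    "wriDom4": "placeholder definition for wriDom4",
    "wriMed4": "placeholder definition for wriMed4",
    "wriLev1": "placeholder definition for wriLev1",
}


def resolve_catref(target, categories):
    """Split a space-separated target attribute and look up each pointer.

    Returns the resolved definitions and the list of dangling pointers.
    """
    resolved, dangling = {}, []
    for pointer in target.split():
        if pointer in categories:
            resolved[pointer] = categories[pointer]
        else:
            dangling.append(pointer)
    return resolved, dangling


# "wriXyz9" is hypothetical and intentionally missing from the table.
resolved, dangling = resolve_catref("wriDom4 wriLev1 wriXyz9", CORPUS_CATEGORIES)
print(sorted(resolved), dangling)
```

Because each category is defined in one place only, a check of this kind is enough to confirm that every text header refers to categories that actually exist, which is the consistency and validation benefit noted above.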

2.1.3 Other forms of metadata

The TEI scheme also proposes a number of mechanisms for the embedding of metadata within the body of a text (as distinct from in the header prefixed to one). These mechanisms vary widely in their technical sophistication and expressive power, since they are intended to cater for a wide range of analytic needs. At the simplest end of the scale, an <index> element is provided, which can be placed anywhere within a TEI text to generate an index-entry of some kind for this point in the text (this is functionally equivalent to the CIMI <topic> element discussed below); for more complex interpretative structures, the <interp> element may be used both to define an analysis, and to link it to a span of text; <interp> elements can also be grouped into hierarchically organized <interpGrp> elements.

The TEI also defines a specialized tag set for the encoding of analytic interpretations of any kind, based on the feature structure formalism. This powerful mechanism has great potential for the representation of formal systems of all kinds, but has not yet been widely implemented. (See further )

2.2 EAD

2.2.1 Background

The origins of the Encoded Archival Description (EAD) framework can be traced to a project initiated by the University of California, Berkeley, Library in 1993. The goal of the Berkeley project was to investigate the desirability and feasibility of developing a non-proprietary standard for machine-readable finding aids, that is, the inventories, registers, indexes, and other documents created by archives, libraries, museums, and manuscript repositories to support the use of their holdings. An additional motivation was the growing importance of networks as a means of gaining access to such information about holdings and the desire to extend the scope and richness of the information generally provided by traditional machine-readable cataloging (MARC) records.

The principles underlying the EAD are summarized as follows in an early definition of the project ()

These principles led to a design in which, at the most basic level, a finding aid document consists of two or three segments:

Following the example of the Text Encoding Initiative (TEI), the group designated the segment about the finding aid itself the header, within which two types of information could be presented:

The hierarchy of descriptive information, reflecting archival principles of arrangement, generally begins with a summary of the whole and proceeds to delineation of the parts as a set of contextual views. Descriptions of the parts inherit information from descriptions of the whole.

2.2.2 EAD header

The mandatory EAD header is based on the TEI header. Its function is to provide a descriptive identification of the encoded archival description or finding aid. Its components are:

This structure departs from the TEI Header in two minor respects:

The <eadid> is a formal, machine-processable name or address for a unique, authoritative <ead> instance. Its function as an internal reference identifier would normally be carried out by the global id attribute defined by the TEI. A more general <idno> element is also defined within the body of the TEI Header (see further below).

The <footer> is a note, disclaimer, warning, etc. that should be printed at the bottom of each page, displayed with each screen, etc. Again, there are several possible TEI equivalents, depending on the role or function of this note in a particular EAD application.

There are also, of course, minor differences of detail within the body of EAD header elements.

2.2.3 Sample EAD header

This is an example of an EAD header:

<EADHEADER LANGENCODING="USMARC"
FINDAIDSTATUS="EDITED-FULL-DRAFT">
<EADID SYSTEMID="DLC" AUTHORITY="DLC"
ENCODINGANALOG="856$f">jackson.sgm</EADID>
<FILEDESC>
<TITLESTMT>
<TITLEPROPER>SHIRLEY JACKSON</TITLEPROPER>
<SUBTITLE>A REGISTER OF HER PAPERS IN THE LIBRARY OF
CONGRESS</SUBTITLE>
<AUTHOR>
<EXTPTR DISPLAYTYPE="PRESENT" ENTITYREF="lcseal">
Prepared by Grover Batts
<LB> Revised and expanded by Michael McElderry
<LB>with the assistance of Scott McLemee
</AUTHOR>
</TITLESTMT>
<PUBLICATIONSTMT>
<DATE TYPE="finding aid created">1993</DATE>
<PUBLISHER>Manuscript Division
<LB> Library of Congress</PUBLISHER>
<ADDRESS>
<ADDRESSLINE>Washington, D.C. 20540-4860</ADDRESSLINE>
</ADDRESS>
</PUBLICATIONSTMT>
<SERIESSTMT>
<TITLEPROPER>Registers of Papers in the Manuscript Division of the Library of Congress</TITLEPROPER>
</SERIESSTMT>

<NOTESTMT>
<NOTE><P>Edited Full Draft</P></NOTE>
</NOTESTMT>
</FILEDESC>

<PROFILEDESC>
<CREATION>Finding aid encoded by Mary Lacy, Manuscript
Division, Martha Anderson, National Digital Library, and
others, Library of Congress,
<DATE>1996</DATE>
</CREATION>
<LANGUSAGE>
<LANGUAGE>eng</LANGUAGE>
</LANGUSAGE>
</PROFILEDESC>

</EADHEADER>

2.2.4 EAD finding aid

The finding aid itself contains a mandatory <archdesc> (archival description) and an optional <add> (additional materials) element. The archival description contains a descriptive identification, containing key information such as creator, title and creation date, physical description (extent, object type, etc.), repository name and department, and notes. This descriptive identification may be followed by additional detailed information such as administrative information, biography or history of people or organizations involved, controlled access headings, or scope and content of the described material.

The archival description may also contain any number of descriptions of subordinate components (<dsc>s). These can be full descriptions, like the top-level description, or can take the form of lists or tabular displays of components.

2.3 CIMI records

The Consortium for the Computer Interchange of Museum Information (CIMI) works to promote the standards-based interchange of museum information. It is a membership organisation, supported by individual museums and museum organisations in North America and Europe. European membership includes the U.K. Museum Documentation Association, the Victoria and Albert Museum and the Aquarelle consortium.

CIMI adopted SGML as an interchange format in 1994, and has since used SGML in Project CHIO, an experimental distributed database of heterogeneous Folk Art resources (exhibition catalogues, object records, bibliographic references and authority files).

In the course of Project CHIO, CIMI developed an SGML application for textual museum information resources, which is applied to exhibition catalogues within CHIO. This application is based on the TEI and so shares its use of the TEI Header to describe the electronic text itself. Also, the encoding of the "standard" features of the text (sections, headings, lists, bibliographic citations, etc.) follows normal TEI practice.

2.3.1 A TEI application

As noted above, the CIMI DTD was developed as a domain-specific application of the generic TEI framework. As such, it uses the standard features of the TEI Header to encode core metadata about each document. Particular emphasis is placed on bibliographic information and on access information and conditions (copyright statements, credit lines, etc.).

In addition to the standard TEI Header, the CIMI DTD introduces metadata concepts which apply within the document itself.

2.3.2 CIMI Access Points

A principal aim of Project CHIO is to provide online access to the relevant parts of documents in response to enquiries. These enquiries might come from the general museum-going public, or from museum professionals.

In order to do this, any aspects of relevance to potential queries need to be marked up. Also (less obviously) the scope of each search term needs to be made clear. If a section within a chapter describes a technique of interest (e.g. rug-hooking), then only that section should be returned to the searcher, not the whole chapter (and certainly not the whole book!).

A distinction was made between the main topic of discourse within a piece of prose ("primary" access points), and passing mentions of a topic ("secondary" access points). For example, a section of one book might be a biographical essay on Grandma Moses, whereas another book might have a passing mention of her name. Clearly, the biographical essay is likely to be much more valuable to a searcher, and so the entry "person = Grandma Moses" would be encoded as a primary access point within that section.

The CIMI access points were developed through study of the questions asked of museums by researchers and the general public. The Categories for the Description of Works of Art () were also used as background.

2.3.3 Topics

A particular requirement of the CIMI application was to associate these access points not only with discrete documents or document sections, but also with arbitrarily small chunks of text within the document, for example to answer questions of the type "Show me anything that talks about Grandma Moses". This is achieved by the use of a special purpose <topic> element, whose attributes specify the CIMI access point concerned, and its particular value, and whose location (in SGML terms, its parent) specifies the document fragment concerned.

For example, both a paragraph and an entire article about Grandma Moses would include an element like the following:

<topic access-point="subject" value="Grandma Moses"></topic>

This is the method by which primary access points are encoded. The access-point attribute indicates which CHIO access point is involved. The value attribute contains the actual value of the topic. This is a completely general method, and can thus be used for a variety of designators. For example:

<topic access-point="identity-number" value="1969.11.1"></topic>

indicates a topic of "identity number = 1969.11.1".
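The rule that a <topic>'s parent determines the fragment it applies to can be illustrated with a short Python sketch. It assumes a well-formed XML transliteration of a CIMI-style fragment (the CIMI DTD itself is SGML); the `id` values, the paragraph text, and the function name `find_scopes` are invented for the example.

```python
import xml.etree.ElementTree as ET

# A well-formed XML transliteration of a CIMI-style fragment.
# The id values and paragraph text are invented for illustration.
DOC = """
<text>
  <div1 id="ch1">
    <div2 id="ch1.s1">
      <topic access-point="subject" value="Grandma Moses"></topic>
      <p>Biographical essay ...</p>
    </div2>
    <div2 id="ch1.s2">
      <p>Another section ...</p>
    </div2>
  </div1>
</text>
"""


def find_scopes(root, access_point, value):
    """Return the parent element of every matching <topic>."""
    # ElementTree keeps no parent pointer, so build a child -> parent map.
    parent_of = {child: parent for parent in root.iter() for child in parent}
    return [
        parent_of[topic]
        for topic in root.iter("topic")
        if topic.get("access-point") == access_point
        and topic.get("value") == value
    ]


root = ET.fromstring(DOC)
hits = find_scopes(root, "subject", "Grandma Moses")
print([hit.get("id") for hit in hits])
```

A query for the subject "Grandma Moses" returns only the section that carries the topic, not the enclosing chapter or the whole text, which matches the scoping behaviour described for access points.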

2.3.4 Contexts

For general (public) access the topic mechanism is felt to be sufficient. However, CIMI's Project CHIO also aimed to support a more complex "Museum Point of View", for which <topic>s alone were insufficient. Certain topic designators (date is an obvious example) have a meaning which is quite different in different contexts. There is also a frequent need to organize topics into a hierarchy. To provide this precision of retrieval, topics can be given a context. For example, this <context> element:

<context CHIO="creation"> ... </context>

applies the context of "creation" to anything inside it. So,

<context CHIO="creation">
<topic access-point="date" value="1860">
</topic>
</context>

gives the date "1860" the context that it is a creation date, rather than (say) a date of birth or death.

<context> elements allow the primary access concepts to be qualified more exactly, and also allow them to be grouped together to form meta-records describing objects, people, places, events, etc. As well as <topic>s, a <context> element can contain subordinate <context>s, thus permitting the definition of quite complex structures of metadata.

2.3.5 Sample CIMI meta-record

The CIMI access point mechanism is designed to be very flexible. By providing simple "building blocks", the CIMI framework allows complex statements to be built up as required.

This meta-record describes an object entitled "Storm-tossed Frigate", giving its artist, date of creation and current identity number:

<topic access-point="object.work" value="Storm-tossed Frigate">
<context CHIO="creation">
<context CHIO="creator">
<topic access-point="person" value="Chambers, Thomas" ROLE="artist">
</context>
<topic access-point="date-range" FROM="1825" TO="1874" EXACT="NONE">
</topic>
</context>
<context CHIO="current-location">
<topic access-point="identity-number" value="1969.11.1"></topic>
</context>
</topic>

This set of data would typically be placed just inside a section of the text which describes that object, and so would associate the following index terms with that section:

access point      value                   context
object/work       Storm-tossed Frigate
person            Chambers, Thomas        creation - creator ( + role = "artist")
date range        1825 - 1874             creation [of object]
identity number   1969.11.1               current location [of object]
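The way nested <context> elements qualify the topics inside them can be sketched as a small tree walk. The Python below is an illustration under stated assumptions, not CHIO software: the meta-record is transliterated into well-formed XML (every <topic> explicitly closed), and the function name `index_terms` and the " - " separator for context paths are choices made for this example.

```python
import xml.etree.ElementTree as ET

# Well-formed XML transliteration of the sample meta-record.
RECORD = """
<topic access-point="object.work" value="Storm-tossed Frigate">
  <context CHIO="creation">
    <context CHIO="creator">
      <topic access-point="person" value="Chambers, Thomas" ROLE="artist"></topic>
    </context>
    <topic access-point="date-range" FROM="1825" TO="1874" EXACT="NONE"></topic>
  </context>
  <context CHIO="current-location">
    <topic access-point="identity-number" value="1969.11.1"></topic>
  </context>
</topic>
"""


def index_terms(element, contexts=()):
    """Yield (access point, value, context path) for every <topic>."""
    if element.tag == "topic":
        value = element.get("value")
        if value is None:  # a date range carries FROM/TO instead of value
            value = f'{element.get("FROM")} - {element.get("TO")}'
        yield element.get("access-point"), value, " - ".join(contexts)
    elif element.tag == "context":
        # Each enclosing <context> adds one step to the context path.
        contexts = contexts + (element.get("CHIO"),)
    for child in element:
        yield from index_terms(child, contexts)


terms = list(index_terms(ET.fromstring(RECORD)))
for access_point, value, context in terms:
    print(f"{access_point:16} {value:22} {context}")
```

The walk produces one row per <topic>, with the enclosing contexts accumulated along the way, so the person entry is qualified as "creation - creator" while the identity number is qualified only by "current-location".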

2.3.6 Inheritance

The CIMI approach depends on the concept of inheritance which is inherent to SGML. Each section within the document "inherits" the topics assigned to the larger sections of which it forms a part. Thus if a whole book is "about" Folk Art, then each chapter within it is also "about" Folk Art. If one chapter within that book is "about" weaving, then every section within that chapter is "about" Folk Art and weaving, and any topics that are specific to that section:

<text><topic access-point="subject"
value="Folk Art"> [book-level index term]
...
<div1><topic access-point="process.technique"
value="weaving"> [chapter-level]
...
<div2><topic access-point="person"
value="Moses, Grandma"> [section-level]
...

In this case, the <div2> (section) is "about" Folk Art and weaving and Grandma Moses.
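The inheritance rule can be made concrete with a short Python sketch that computes, for each level of the document, its own topics plus everything inherited from its ancestors. It assumes a well-formed XML version of the fragment above, with each <topic> written as an empty element; the function name `effective_topics` is hypothetical, and results are keyed by tag name only for brevity, which works here because each tag occurs once.

```python
import xml.etree.ElementTree as ET

# Well-formed XML version of the book / chapter / section example.
DOC = """
<text>
  <topic access-point="subject" value="Folk Art"/>
  <div1>
    <topic access-point="process.technique" value="weaving"/>
    <div2>
      <topic access-point="person" value="Moses, Grandma"/>
    </div2>
  </div1>
</text>
"""


def effective_topics(element, inherited=()):
    """Map each structural element to its own topics plus inherited ones."""
    own = tuple(
        (topic.get("access-point"), topic.get("value"))
        for topic in element.findall("topic")  # direct children only
    )
    topics = inherited + own
    result = {element.tag: topics}
    for child in element:
        if child.tag != "topic":
            # Pass the accumulated topics down to each nested division.
            result.update(effective_topics(child, topics))
    return result


by_level = effective_topics(ET.fromstring(DOC))
for tag, topics in by_level.items():
    print(tag, [value for _, value in topics])
```

The entry for `div2` lists Folk Art, weaving, and Grandma Moses in that order, while `div1` lists only the first two, mirroring the book, chapter, and section levels in the example.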

EAD employs a similar convention in which each level of an archival description "inherits" the information provided by its parent, higher-level, descriptions: "The <archDesc> element encompasses an unfolding hierarchy of descriptive information which, reflecting archival principles of arrangement, generally begins with a summary of the whole and proceeds to delineation of the parts. Descriptions of the parts inherit information from descriptions of the whole." If all the text were stripped out of a CIMI-encoded document, the access point information that remained would have a similar structure to an EAD archival description.

2.3.7 Linking and naming

The CIMI DTD is meant to support a distributed resource, with contributing documents, object records and image files physically stored anywhere on the Internet. This requirement called for linking conventions (for example, between documents and their associated images) that are both robust and flexible. TEI extended pointer conventions are used to express complex links (e.g. to a specific passage in another document).

Long-lived links should not, as far as possible, be "hard-wired" to a particular physical location. The location of resources is liable to change over time, as has already happened within the lifetime of Project CHIO. CIMI also wanted to make it possible to create mirror sites whose hyperlinks point to a local copy of images etc.

In order to achieve a degree of insulation from the changes that occur in the siting and naming of Internet resources, CIMI recommends the use of formal public identifiers (FPIs) for external image files, documents, etc. A typical FPI has the form

-//XXX//YYY Name//ZZ

where XXX identifies the naming authority, YYY the kind of entity named (e.g. document type definition, entity set, document etc.), Name is a human-readable long name for the entity, and ZZ is the human language used for its definition. For example, the following FPI is defined for one of the TEI document type definitions:

-//TEI//DTD TEI Lite 1.0//EN
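As a rough illustration of this four-field layout, the following Python sketch splits an FPI into its parts. It covers only the simple form shown here, not every variant the SGML standard permits, and the names `FPI` and `parse_fpi` are invented for the example.

```python
from typing import NamedTuple


class FPI(NamedTuple):
    prefix: str     # the leading field ("-" in the examples), kept verbatim
    authority: str  # XXX: the naming authority
    kind: str       # YYY: the kind of entity named
    name: str       # human-readable long name
    language: str   # ZZ: language used for the definition


def parse_fpi(fpi: str) -> FPI:
    """Split an identifier of the form -//XXX//YYY Name//ZZ into fields."""
    parts = fpi.split("//")
    if len(parts) != 4:
        raise ValueError(f"not in the four-field form shown here: {fpi!r}")
    prefix, authority, text, language = parts
    # The first word of the third field is the entity kind; the rest is the name.
    kind, _, name = text.partition(" ")
    return FPI(prefix, authority, kind, name, language)


tei_lite = parse_fpi("-//TEI//DTD TEI Lite 1.0//EN")
print(tei_lite)
```

For the TEI Lite identifier this yields the authority `TEI`, the kind `DTD`, the name `TEI Lite 1.0`, and the language `EN`.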

SGML's use of entity references to identify system-specific references within a document means that only the entity definition needs to change when a different system identifier is needed. The use of FPIs within such entity definitions adds a further level of indirection, which can greatly increase portability and document independence. When FPIs are in use it is normal to define the mapping between an FPI and a real system identifier within a so-called catalog file (along with some other aspects of importance to an SGML application). The format for such catalogs is not defined by the SGML standard, but is currently in the process of definition by an influential group of SGML vendors and implementors called SGML Open.

Where it is not possible (or convenient) to create PUBLIC Identifiers for entity references, URLs are used as SYSTEM identifiers. This at least gives a form of reference which can be resolved directly by Web-aware software. Where appropriate (for example where referring to an image file), relative rather than absolute URLs are given, again with the intention of improving "portability" of the reference.
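The two-step resolution described in this section (catalog lookup for PUBLIC identifiers, URL resolution for SYSTEM identifiers) might be sketched as follows. This is not CIMI's implementation: the `CATALOG` table is a plain Python dictionary standing in for a catalog file, and every URL and file name below is invented for illustration.

```python
from urllib.parse import urljoin

# Hypothetical catalog: maps public identifiers to system identifiers.
# The URL is invented for illustration.
CATALOG = {
    "-//TEI//DTD TEI Lite 1.0//EN": "http://example.org/dtds/teilite.dtd",
}


def resolve_entity(base_url, public_id=None, system_id=None, catalog=CATALOG):
    """Prefer the catalog entry for a public identifier; otherwise fall back
    to the system identifier, resolved relative to the document's own URL."""
    if public_id is not None and public_id in catalog:
        return catalog[public_id]
    if system_id is not None:
        return urljoin(base_url, system_id)
    raise LookupError("entity has neither a catalogued FPI nor a system identifier")


# Hypothetical document locations: an original site and a mirror.
base = "http://example.org/chio/catalogue/essay.sgm"
mirror = "http://mirror.example.net/chio/catalogue/essay.sgm"

dtd = resolve_entity(base, public_id="-//TEI//DTD TEI Lite 1.0//EN")
image = resolve_entity(base, system_id="images/frigate.gif")
mirrored_image = resolve_entity(mirror, system_id="images/frigate.gif")
print(dtd, image, mirrored_image, sep="\n")
```

Relocating the DTD requires a change to the catalog entry only, and the relative image reference follows the document to the mirror site without any edit to the document itself.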

CIMI is not alone in having to cope with the "link rot" which is endemic to the current generation of distributed information systems. Its use of SGML will enable it to benefit from whatever solution is eventually found to this pervasive problem.

Page maintained by: UKOLN Metadata Group