Metadata Formats
Work Package 1 of Telematics for Libraries project BIBLINK (LB 4034)
The BIBLINK Project

5. Typology of Metadata

There is a wide variety of metadata formats in existence, and awareness of these formats is becoming more widespread across communities. This offers the opportunity to create descriptions in different and more appropriate formats while retaining the possibility of exchanging data across formats. Awareness of the strengths and weaknesses of the various formats allows choice of the most appropriate format for a particular task. The use of different formats admittedly creates additional effort to achieve interoperability; on the other hand, different formats can be justified if they are needed for particular tasks.

It is clear that the various communities involved in creating and using different metadata formats are strongly attached to their own formats. This is understandable if one keeps in mind such factors as the effort involved in reaching consensus on formats, the skill levels required to apply formats in a consistent way, and not least the heavy investment in existing systems. For these reasons alone it is unlikely that any one format will become dominant. It would also seem undesirable as the existence of a variety of formats allows for choice of the optimum format for use in a particular context.

The provision and exchange of trade and bibliographic data relating to publications has been part of the book world for a considerable time, and has increased with the availability of data in electronic form. Records are created at various stages and places in the process of supplying documents to the reader; they serve different requirements but have overlapping functions. The Newbury seminar held in 1987 (Bibliographic records in the book world: needs and capabilities) considered whether there could be improvement in the 'information flow' whereby the most appropriate record content and format could be sustained throughout the process to meet the requirements of the various users (publishers, suppliers, libraries).

One important strand to emerge was the idea of an evolving bibliographic record where through more organised articulation of current record supply, or less likely, through the development of an all through single record system, current requirements might be more efficiently met.

(Bibliographic records: use of data elements in the book world. Lorcan Dempsey, British National Bibliography Research Fund Report, 1989)

In the book world there are still only the beginnings of a more organised evolution of the bibliographic record. The issue of 'a single record system' remains outstanding; at present there is limited integration of records used in the publishing/supplier world and the national bibliography. For electronic publications we need to consider whether the ambition is to integrate record supply more fully, how far various needs and requirements can be met by the single record system, or whether we accept a more evolutionary system with different record formats available to various users to meet their different requirements.

5.1 Continuum of Complexity and Richness

In order to analyse the various metadata formats it is possible to make approximate groupings based on shared characteristics. Of central importance is the underlying complexity of the format and this suggests a typology of metadata along the continuum from simplicity to complexity. This typology was used in the DESIRE report as an initial means of identifying the characteristics of various formats.

A variety of formats have been placed in this table, positioned along a continuum from simple records (Band One) to complex, rich records (Band Four). The variety of record types identified in the bibliographic control process can be placed on this continuum as shown below.

Band One: proprietary simple records (NetFirst, AltaVista, Infoseek)

Band Two: Dublin Core, IAFA, RFC 1807, SOIF

Band Three: MARC, TEI independent headers

Band Four: ICPSR, FGDC, CIMI, EAD

Record types from the bibliographic control process: Publishers' CIP forms, CIP MARC, EDI messages, SGML article headers

It is possible to extend this model to associate other factors with the position of the format on the continuum. The simplest record formats are used to create relatively unstructured indexes for locating items, whereas the most complex records can be used as the basis of sophisticated analysis and navigational tools. Records can be associated with more or less 'rich' retrieval and analysis processes (Z39.50, emerging query routing, text analysis). The bands of records typically have common characteristics in other aspects.

This pattern of association is summarised in the following table:

Band One (simple) | Band Two | Band Three | Band Four (rich)
Location | Selection | Evaluation | Analysis
Robot generated | Robot plus manual input | Manually input | High level of manual input
Unstructured | Attribute value pairs | Subfields, qualifiers | Highly structured mark up
http with CGI form interface | directory service protocols (whois++) with query routing (Common Indexing Protocol) | Z39.50 | Z39.50 (in future with collection navigation)
proprietary | emerging standards | generic standards used in information world | standards used in specialist subject domains

Within the context of BIBLINK we need to consider which Band of record is appropriate for further consideration. There may well be a different answer depending on whether the requirement is for a CIP type record or a more detailed record.

5.2 Further Scoping Required

In order to make this decision there are a variety of issues that need to be addressed. These include:

Recommendation

Project partners need to consider these issues and refine the scoping of the project to allow criteria to be drawn up for decision making.

While accepting the need for consensus from partners on these issues, we will assume that previous practice and discussion of these issues will inform future decisions. So for example we assume that there will be constraints on cost and staff available from national libraries and publishers; that we are attempting to identify standard formats that are controlled by authoritative agencies; and that the level of service provided by the national bibliographic agencies will be comparable to that provided for print material (whether this is at a CIP level or at a level consistent with the full record in the national bibliography).

Given these assumptions, the formats from Band One would be rejected as proprietary solutions. The formats in Band Four would also be rejected as too detailed for the service level required, too specialised in nature for general use, and too costly for system maintenance particularly in terms of specialist staff required with skill levels to manipulate and interpret such records.

Predicated on these assumptions as to scope, it is recommended that BIBLINK should concentrate on formats in Bands Two and Three for the exchange format. This does not preclude the possibility that conversion will be required from formats outside these Bands, e.g. from more complex formats in Band Four into a simpler format, but the formats for data exchange would be located in Bands Two or Three.
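
The scoping rule can be illustrated with a short Python sketch. It assumes the band assignments given in the table in section 5.1. The names `BANDS`, `EXCHANGE_BANDS`, `band_of`, `exchange_candidates` and `needs_conversion` are hypothetical, introduced only for illustration, and are not part of any BIBLINK specification.

```python
# Illustrative sketch only. Band assignments follow the table in section 5.1;
# all names here are hypothetical and not part of a BIBLINK specification.
BANDS = {
    1: ["NetFirst", "AltaVista", "Infoseek"],   # proprietary simple records
    2: ["Dublin Core", "IAFA", "RFC 1807", "SOIF"],
    3: ["MARC", "TEI independent headers"],
    4: ["ICPSR", "FGDC", "CIMI", "EAD"],
}

# Recommended scope for the exchange format (Bands Two and Three).
EXCHANGE_BANDS = {2, 3}


def band_of(fmt):
    """Return the band number of a listed format, or None if it is not listed."""
    for band, formats in BANDS.items():
        if fmt in formats:
            return band
    return None


def exchange_candidates():
    """List the formats that fall within the recommended exchange bands."""
    return [f for band in sorted(EXCHANGE_BANDS) for f in BANDS[band]]


def needs_conversion(fmt):
    """True if a listed format lies outside the exchange bands."""
    band = band_of(fmt)
    return band is not None and band not in EXCHANGE_BANDS
```

Under these assumptions, `needs_conversion("EAD")` is true because EAD sits in Band Four, while `needs_conversion("MARC")` is false.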

5.3 Topology of Metadata

An essential aspect of the level of richness of a format is the extent of the content, both in terms of range and depth. The attempt to describe more or fewer aspects of an object will be reflected in the overall level of complexity, e.g. designation, format rules for content. In order to identify the extent of content, the elements describing an object can be clustered into groups.

One example of this is given by Bearman, who proposes a reference model for business acceptable communication (Bearman, David and Sochats, Ken. Metadata Requirements for evidence. Available at <URL: http://www.lis.pitt.edu/~nhprc/model.htm>). This defines clusters of data elements which would be required to fulfil the range of functions of a record. The functions of records are identified as the provision of access and use rights management, networked information discovery and retrieval, registration of intellectual property, and authenticity. The clusters of data elements are defined in six layers:

Bearman's model looks at the record in a wider context than the bibliographic context alone, and it is particularly relevant to our project as it takes account of the business context in which metadata is used. Bearman includes metadata elements which would be appropriate for metadata in the context of publishing and supply, where for example the 'business function' might include sales promotion, price update etc. In a similar way the 'Use history level' would be appropriate for recording changes in electronic documents and assisting authentication.

Within the UK, the BIC/BNBRF Book Product Information project has had as its main aim the identification of the content required for an EDI message to communicate product information through the book sector supply chain for non-serial items. This work has now developed into compiling exhaustive sets of data elements which might be used in this context.

This outline suggests the general categories required to cover the trade and promotional data associated with publications. It also introduces consideration of the level at which objects are described, which may be different in the book world and the library world. This is particularly the case in multimedia packs where CDs and books, or more traditionally tapes and books, are sold as single line items but may be described as separate items by libraries. Within the BIC study the levels of object are defined as:

Line item: the entity described in the Book Product message, in effect a tradable item

Work: body of literary or intellectual content

Piece: single indivisible physical published item.
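
The difference between the trade view and the library view of a multimedia pack can be sketched as a small data structure. This is a minimal illustration in Python, assuming that a line item may be composed of several pieces. The class names, field names and sample values are hypothetical and are not taken from the BIC Book Product message.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Piece:
    """Single indivisible physical published item (hypothetical fields)."""
    description: str   # e.g. "book", "CD"
    work: str          # title of the work the piece carries


@dataclass
class LineItem:
    """Tradable entity described in a product message (hypothetical fields)."""
    trade_title: str
    pieces: List[Piece]

    def library_descriptions(self) -> List[str]:
        # A library may describe each piece separately, whereas the trade
        # handles the whole pack as one line item.
        return [p.description for p in self.pieces]


# Hypothetical multimedia pack: one line item for the trade, two pieces.
pack = LineItem(
    trade_title="Language course pack",
    pieces=[Piece("book", "Language course"), Piece("CD", "Language course")],
)
```

Here the supplier handles one record for `pack`, while `pack.library_descriptions()` yields one entry per piece that a library might catalogue separately.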

Turning to the library world, IFLA has recently issued a draft for comment of its study of the functional requirements of bibliographic records. This study attempts to identify and define objects of interest (or entities) for users of bibliographic records, i.e. what information the user expects to find in the record, and how that is used. The intellectual content described by a particular bibliographic record can be viewed at different levels, each of which can be related to different information-seeking behaviours on the part of the user. These entities are:

Work: distinct intellectual or artistic creation

Expression: realisation of the work in text, sound, music, image etc

Manifestation: physical embodiment of an expression in book, sound recording etc

Item: single example of a manifestation
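
The four entities form a nested hierarchy, which can be sketched as follows. This is a minimal Python illustration of the levels listed above, assuming simple one-to-many containment between them. The field names and the sample data are hypothetical and are not drawn from the IFLA draft.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Item:
    """Single example of a manifestation."""
    identifier: str  # hypothetical, e.g. a shelf mark or file location


@dataclass
class Manifestation:
    """Physical embodiment of an expression (book, sound recording etc)."""
    form: str
    items: List[Item] = field(default_factory=list)


@dataclass
class Expression:
    """Realisation of the work in text, sound, music, image etc."""
    mode: str
    manifestations: List[Manifestation] = field(default_factory=list)


@dataclass
class Work:
    """Distinct intellectual or artistic creation."""
    title: str
    expressions: List[Expression] = field(default_factory=list)

    def count_items(self) -> int:
        # Walk down the hierarchy: work -> expression -> manifestation -> item.
        return sum(
            len(m.items) for e in self.expressions for m in e.manifestations
        )


# Hypothetical example: one textual expression issued in two manifestations.
work = Work(
    title="Example work",
    expressions=[
        Expression(
            mode="text",
            manifestations=[
                Manifestation("book", [Item("copy-1"), Item("copy-2")]),
                Manifestation("electronic file", [Item("copy-3")]),
            ],
        )
    ],
)
```

Metadata attached at the `Work` level in such a structure can be shared by every manifestation beneath it, which is the kind of re-use the discussion below anticipates for electronic formats.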

Within the context of BIBLINK it would seem there needs to be a shared view among publishers and national libraries as to what 'entities' or 'objects' are of interest. Traditionally libraries have dealt with printed material to a large extent at the manifestation level, e.g. particular editions of a book. As electronic formats offer increasingly varied manifestations, creation and re-use of metadata relating to the 'work' will take on more emphasis. At a different 'entity' level, it may become desirable to describe collections rather than individual items, e.g. for web-sites containing varied and changing information. The problem of the identification of different manifestations and expressions is perhaps more familiar to publishers, who are accustomed to dealing with compilations, re-prints, new editions and other 'bundling' of works. The on-going work by BIC to formulate a non-serial DTD has recognised this in the proposed layering of data elements.

Recommendation

As part of the consensus building process, publishers and national libraries should identify the objects and relationships which need to be represented in metadata describing electronic resources.
