Metadata Formats
Work Package 1 of Telematics for Libraries project BIBLINK (LB 4034)
The strength of the Dublin Core metadata format is that it is built upon international consensus. A range of interested parties from different professional backgrounds and subject disciplines have contributed to the development of the format. There has been strong commitment and involvement from a range of professions (publishers, computer specialists, librarians and information workers) and sectors (library utilities, software producers, service providers, libraries). The motivation for progressing Dublin Core has been to reach a consensus among stakeholders on a minimal resource description which can be used for the benefit of all involved in the creation, search and retrieval of electronic resources.
In the context of BIBLINK, Dublin Core is significant as 'web publishers' have been involved in the consensus building process. In addition library and information personnel from Europe and the US have been involved in the workshops and discussion lists. The Library of Congress are supportive of the consensus building work and have investigated use of the format.
The Dublin Core is positioned as a simple information resource description. Importantly it also aims to provide a basis for semantic interoperability between other, probably more complicated, formats. A third target use is to provide the basis for a format for embedded metadata i.e. metadata contained within the body of the resource. It is in fact in this third role that most progress has been made, with a high level of interest from those involved in automatically harvesting metadata from HTML documents.
8.1.1.2 What is the level of maturity of the format?
A series of workshops mark the significant steps in the history of development of the format:
As yet the Dublin Core format is not stable, but the level of international involvement means that once agreement is reached on the format there are likely to be several implementations. The strong backing of the format by OCLC would imply that they see Dublin Core as the format of choice for provision of 'Internet publication' services i.e. as a format for their NetFirst electronic publication search service, and as a format for record supply and exchange.
8.1.1.3 Controlling Agency
As yet no control agency has been established, but the level of international involvement should assist in agreement on establishing an authoritative agency.
8.1.1.4 How widespread is deployment?
A number of projects and initiatives are in early stages.
In Europe the Nordic Web Index is committed to using Dublin Core. Within the UK there is considerable interest in pilot implementations as part of ROADS and the AHDS service.
There is a joint project between the National Libraries of Australia and New Zealand, the National Document and Information Service (NDIS). In this project Dublin Core is used as a means of achieving semantic interoperability by mapping core elements from disparate complex records onto the core element set. Within this project the Dublin Core elements have been used as the core search attributes for all records, in effect the intersection between the various databases included in the service. This has allowed flexibility in the use of semantics across databases, with mapping of other 'search fields' to the Dublin Core set.
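The mapping approach described above can be illustrated with a small crosswalk. This is a minimal sketch only: the source names, the local field names and the choice of MARC tags are assumptions made for illustration, and are not taken from the NDIS project.

```python
# Hypothetical crosswalks: local field name -> Dublin Core element.
# The sources and field names are illustrative assumptions only.
CROSSWALKS = {
    "marc": {"245": "Title", "100": "Author", "650": "Subject"},
    "local_db": {"name": "Title", "creator": "Author", "topic": "Subject"},
}


def to_dublin_core(record, source):
    """Project a source record onto the core set, dropping unmapped fields."""
    mapping = CROSSWALKS[source]
    core = {}
    for field, value in record.items():
        element = mapping.get(field)
        if element is not None:
            core.setdefault(element, []).append(value)
    return core


def search(records, element, term):
    """Case-insensitive substring search on one core element."""
    term = term.lower()
    return [
        r for r in records
        if any(term in v.lower() for v in r.get(element, []))
    ]


# Two records from different databases, described with different fields.
marc_rec = {"245": "Metadata for libraries", "100": "Smith, J.", "260": "London"}
local_rec = {"name": "Library metadata survey", "creator": "Jones, A.", "size": "12kb"}

index = [to_dublin_core(marc_rec, "marc"), to_dublin_core(local_rec, "local_db")]
hits = search(index, "Title", "metadata")
```

Because both records are reduced to the same core elements, one query on `Title` reaches both databases; fields with no core equivalent (here `260` and `size`) simply fall outside the shared search attributes.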
DSTC in Australia is using the Dublin Core in the Research Data Network Co-operative Research Centre project for resource discovery. In this instance Dublin Core is used as a simple record format for the storage of metadata, and for searching and retrieval purposes.
Within other contexts such as the Archaeology Data Service (ADS) in the UK, the embedding of Dublin Core in documents is seen as an important means of creating metadata and of enabling that metadata to be automatically harvested. The Arts and Humanities Data Service, of which ADS is one of the groups, is investigating the use of Dublin Core to provide a general catalogue to its various sites, with pointers directly to the resources themselves or to richer metadata formats (e.g. TEI).
8.1.1.5 Overview of technical issues
At the present time there are moves to create Z39.50 attributes corresponding to the Dublin Core set. However, at the same time a wider discussion is taking place within the Z39.50 Implementors Group as to how Z39.50 should take attribute sets forward. There is a suggestion that the control of attribute sets should move out of the standard itself and be controlled by agreement within other domains. Cross currents between the two areas of discussion may mean agreement will take a little time.
Dublin Core records could be adapted for use with the directory service whois++.
There are proposed SGML and HTML syntaxes for Dublin Core. However there are suggestions that other syntaxes could be used e.g. PICS.
8.1.1.6 Future path
Dublin Core is designed as a means for 'publishers' and authors to provide metadata at the point of mounting information on the Web. It is in the interest of these publishers to make metadata available which can be harvested by commercial and selective search services, as a means of ensuring their publications are publicised. Similarly search services are likely to promote a standard format for embedded metadata to make the harvesting process more accurate. Whether Dublin Core will be the format favoured, or whether the web browser suppliers will favour their own format, remains to be seen.
The Dublin Core is also being used as a simple format for third party creation and as a basis for semantic interoperability between richer formats. It is therefore situated precisely in the CIP area.
8.1.1.7 Content
Dublin Core is positioned as a simple Internet description, and the intention is to provide limited, minimal information about a resource. The core set agreed at the first Dublin Core workshop in 1995 consists of:
There is an intention to elaborate on the original simple set of elements to allow for a richer element set, using qualifiers to specify the type, scheme and role of elements. Such qualifiers might either refer to external schemes to be applied for processing e.g. Author (scheme USMARC); or they might specify more precise information about the element e.g. OtherAgent (role editor). The attempt to bridge the gap between simple and sophisticated could potentially cause problems with interoperability unless there is close control of the definition of such qualifiers.
8.1.1.8 Rules for formulation of content
No particular rules are specified, but by using the qualifier 'scheme' one can indicate whether the content of a particular element complies with an external scheme e.g.
Author = value (scheme = USMARC)
This does provide the flexibility for different schemes to be applied to different elements.
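As an illustration of how a qualified element of this kind might be handled by software, the following minimal sketch reads and writes the informal notation used in the example above. The notation is not a defined Dublin Core syntax, and the class name `Element` and function `parse` are hypothetical names introduced for this sketch.

```python
import re
from dataclasses import dataclass, field


@dataclass
class Element:
    """One element with optional qualifiers such as scheme, type or role."""
    name: str
    value: str
    qualifiers: dict = field(default_factory=dict)

    def render(self):
        """Write the element in the 'Name = value (qualifier = x)' notation."""
        text = f"{self.name} = {self.value}"
        if self.qualifiers:
            quals = ", ".join(f"{k} = {v}" for k, v in self.qualifiers.items())
            text += f" ({quals})"
        return text


# A trailing parenthesis counts as qualifiers only if it contains '='.
_LINE = re.compile(
    r"^(?P<name>\w+)\s*=\s*(?P<value>.*?)"
    r"(?:\s*\((?P<quals>[^()]*=[^()]*)\))?$"
)


def parse(line):
    """Read one line of the informal notation back into an Element."""
    match = _LINE.match(line.strip())
    if match is None:
        raise ValueError(f"not an element line: {line!r}")
    qualifiers = {}
    if match.group("quals"):
        for part in match.group("quals").split(","):
            key, _, val = part.partition("=")
            qualifiers[key.strip()] = val.strip()
    return Element(match.group("name"), match.group("value"), qualifiers)


author = parse("Author = Smith, John (scheme = USMARC)")
editor = Element("OtherAgent", "Jones, Ann", {"role": "editor"})
```

A processing application could then check `author.qualifiers.get("scheme")` to decide which external rules, if any, to apply to the value. Qualifier values that themselves contain commas are not handled by this sketch.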
8.1.1.9 Extensibility
Dublin Core was originally targeted at describing 'document like objects'. Such objects were not closely defined but it could be argued that the emphasis within discussions was on text based resources. More recent discussion in a wider context has highlighted the need for description at the collection level (whether collections of web pages, or collections such as archives). Dublin Core is not designed to allow for navigation between collection level and individual items.
The latest CNI/OCLC Image metadata Workshop, October 1996, has proposed some amendments to the element set in order to accommodate description of images. This would indicate there may be further amendments required to describe other object types (e.g. sound).
IAFA (Internet Anonymous FTP Archive) templates were originally used to provide some bibliographic control over FTP archives. When the templates were designed many organisations were making information available using FTP archives, and the IAFA template allowed the archive to be described from a number of aspects (documents contained there, logical collections of documents, site and configuration details and so on). The original design came from the IAFA working group of the IETF (Internet Engineering Task Force). With the emergence of web servers and the increasing amount of information being made available on them, the IAFA templates have been used to fulfil a similar function and they are now used to describe a variety of networked resources.
8.1.2.2 In what service areas are the implementations?
At present the templates are used among the higher education community in the UK as the record format in selective search services. As part of the EC project DESIRE, their use is proposed for subject based gateways covering the Nordic countries and the Netherlands.
8.1.2.3 Controlling Agency
It has been proposed that UKOLN act as a registry for the format for UK users. Within the international community Bunyip are at present fulfilling this role. There is an issue as to how the format will be controlled over time, and who will participate in change control.
8.1.2.4 What is the level of maturity of the format?
The format has been in use for approximately two years. It has proved a successful format for creating simple descriptive records, and is the basis of a number of production systems. The format has been proved by use, and various amendments and changes have been introduced as a result of implementation experience.
8.1.2.5 How widespread is deployment?
There are several implementations which use IAFA/whois++ templates. The first implementation was the ALIWEB service which enabled searching of FTP archives, in effect a forerunner of today's Internet search services. Current implementations involving bibliographic descriptions include a number of production services: SOSIG, ADAM, OMNI, EEVL (variant), NetEc, IPCA.
8.1.2.6 Technical considerations
IAFA templates are associated with the whois++ directory service protocol. There is now a lot of activity in the development of directory services based on the whois++ protocol, although this is chiefly concerned with 'white page' applications (i.e. names and addresses).
8.1.2.7 Content
The original IAFA templates contain simple bibliographic descriptive elements, administrative metadata, and the means to describe access and location. Within the ROADS implementation additional elements have been added to enable subject headings and subject scheme to be specified.
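A template of this kind can be thought of as a list of attribute and value lines. The sketch below parses such a record under stated assumptions: the attribute names in the sample (including the subject heading and subject scheme fields) and the use of indented continuation lines are illustrative guesses, and should be checked against the IAFA and ROADS documentation before being relied on.

```python
def parse_template(text):
    """Parse 'Attribute: value' lines; indented lines continue the last value."""
    record = {}
    last = None
    for raw in text.splitlines():
        if not raw.strip():
            continue  # skip blank lines
        if raw[0] in " \t" and last is not None:
            record[last] += " " + raw.strip()
            continue
        name, sep, value = raw.partition(":")
        if not sep:
            raise ValueError(f"not an attribute line: {raw!r}")
        last = name.strip()
        record[last] = value.strip()
    return record


# Illustrative record only: attribute names and values are assumptions.
SAMPLE = """\
Template-Type: DOCUMENT
Title: Metadata formats
URI: http://www.example.org/metadata.html
Description: An overview of metadata
  formats for networked resources.
Subject-Descriptor: 025.3
Subject-Descriptor-Scheme: UDC
"""

record = parse_template(SAMPLE)
```

Splitting on the first colon only means that values containing colons, such as URLs, are kept intact.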
8.1.2.8 Rules for formulation of content
Some rules for content are specified in the original guidelines, but they are rather patchy. These require elaboration. Within the ROADS project there is a commitment to formulate simple cataloguing guidelines as an aid to those creating simple Internet description records.
8.1.2.9 Future path
Among ROADS users there is a requirement to indicate relationships between web pages. One way this might be done is to create template types for different levels of object i.e. web sites, document collections, individual items. Another suggestion is to use the category field within the template to indicate relationships. The next version of ROADS will explore these possibilities.
In addition ROADS will be introducing some internationalisation of the ROADS templates in the next version, to support the use of ROADS in the DESIRE project. The intention is to facilitate multi-lingual descriptions and subject headings.
The SOIF (Summary Object Interchange Format) is a record format used by the Harvest software. Harvest software was developed at the University of Colorado at Boulder, and is distributed by them as shareware. It is documented at http://harvest.cs.colorado.edu/Harvest/. The Harvest architecture includes a Harvest gatherer designed to collect data regarding Internet documents, and a Harvest broker which is designed to enable users to search these records. The Harvest gatherer can generate SOIF records from documents held in a variety of formats (SGML, HTML, PostScript, MIF and RTF).
In order to solve problems of limited resources and seemingly limitless web publications, many organisations and services are considering the benefits of generating records using robots. The Harvest software is one example of such a system. Because of its availability it is being considered for use in a number of projects (ROADS, DESIRE).
There is great flexibility in the data elements which can be used, which means it can be useful for a variety of applications, but there is no guarantee of interoperability. Each Harvest broker can support any attributes that are required by the data which it describes, although a set of common data elements has been defined to promote interoperability.
8.1.3.2 In what service areas are the implementations?
The main area of implementation is for Internet search services: most SOIF records are generated by robots, although as they are based on simple attribute:value pairs they can easily be generated by hand.
SOIF records can also be used as an aid to creation of other metadata formats.
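To show how little is needed to produce such a record by hand or by script, here is a minimal sketch of a SOIF writer. It assumes the layout in which a record opens with a template type and URL, and each attribute name is followed by the byte length of its value in braces, a colon and a tab. That layout should be verified against the Harvest documentation; the function name `to_soif` and the sample values are hypothetical.

```python
def to_soif(url, attributes, template_type="FILE"):
    """Serialise attribute:value pairs as one SOIF-style record (assumed layout)."""
    lines = [f"@{template_type} {{ {url}"]
    for name, value in attributes.items():
        size = len(value.encode("utf-8"))  # length in bytes, not characters
        lines.append(f"{name}{{{size}}}:\t{value}")
    lines.append("}")
    return "\n".join(lines) + "\n"


record = to_soif(
    "http://www.example.org/",
    {"Title": "Hello", "Author": "A. Writer"},
)
```

Stating the byte length in front of each value lets a reader of the record take the value as-is, including any line breaks it contains, without needing an escaping convention.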
8.1.3.3 Controlling Agency
There is no identified controlling agency.
8.1.3.4 Content
A broker can support different attributes, depending on the data it holds. Often brokers will hold the full text of documents as well as metadata. A list of common attributes is provided in the documentation as follows:
Bibliographic type attributes:
8.1.3.5 Rules for formulation of content
Rules for content form are not specified and there is no specified way to indicate whether particular rules or schemes have been applied to content. Any agreement on rules for content would need to be made between co-operating parties.
8.1.3.6 How widespread is deployment?
Harvest has been widely taken up within the academic community and as a basis for search services. Of particular importance has been the recent adoption of Harvest technologies by Netscape. In 1996 Netscape announced they would use SOIF as a basis for their Catalog Server product.
8.1.3.7 Technical issues
In a significant extension to the Harvest architecture, Netscape are working on 'Resource Description Messages' which provide a framework for the creation and communication of metadata. Resource Description Messaging (RDM) is a messaging format which can be used as the basis of a query syntax. It allows for exchange of record descriptions and is particularly designed for use with SOIF records. The client can send an RDM request in order to select resource descriptions. RDM also allows the client to access a schema definition to which resource descriptions conform (data type and format of attributes). In addition, RDM supports access to a taxonomy description for the resource descriptions (classification scheme), and to a server (catalogue service) description.
The combination of SOIF and RDM means that once a repository of SOIF records exists, the server can export it, either as a whole or as a selection, using RDM to retain the structure.
8.1.3.8 Future path
The involvement of Netscape clearly has major potential significance. In addition the widespread use of SOIF within the Internet search services means there is familiarity with the format among an international technical community. It remains to be seen how this work develops.
MAJOUR (Modular Application for Journals) was developed by a European Workgroup on SGML linked to the Scientific, Technical and Medical Publishers group, Amsterdam and published in 1991. The group consisted of Elsevier Science Publishers, Kluwer, Springer, Thieme, Fachinformationszentrum Karlsruhe, Stuertz, MID/Information Logistics Group and Satzrechenzentrum Berlin.
8.2.1.2 Maturity/consensus
No full consensus has been reached among journal publishers on an agreed DTD for article headers. Journal publishers apparently have not been able to agree on MAJOUR as anything more than a basis for individual variants.
8.2.1.3 Controlling Agency
No controlling agency has emerged.
8.2.1.4 How widely deployed
Many major serial publishers, principally in the STM field, use their own individual adaptations of the MAJOUR DTD. The DTD is viewed more as an exchange format than as a way of storing records internally in publishers' databases.
Note that there is an international standard for article headers, ISO 12083, based on work by the Association of American Publishers, but this was considered inadequate by many publishers and was never widely deployed. The Association of American Publishers and the European Physical Society developed this standard method for marking up scientific documents, and it was developed particularly for the mark-up of mathematical information.
8.2.1.5 Future path
The MAJOUR DTD tends to be used as an interchange standard rather than a standard for storage. Publishers tend to use DTDs for their internal databases which can be converted to MAJOUR for interchange purposes.
8.2.1.6 Content
Initially covered the "header" to a journal paper, with a statement of intent to cover the text and end matter in further DTDs. The aim was to provide a common electronic language shared by authoring bodies, publishers, typesetters, and publishing and database access software. The DTD for the article is detailed and aims to be comprehensive.
8.2.1.7 Extensibility
Extensible in principle, but depends on there being an effective development and control agency.
SSSH (Simplified SGML for Serial Headers) is used by journal publishers in all fields, particularly for the purpose of communication from publishers to users. The aim of the SSSH is to harmonise MAJOUR with OASIS' requirement for a simpler set of elements.
8.2.2.2 What is the level of maturity of the format?
The format was developed for BIC by Pira International and was published in 1995/6. It builds on and attempts to unify the MAJOUR approach with different requirements identified by the OASIS (Organisation for Article Standards in Science) group of publishers. OASIS demanded a simpler DTD than MAJOUR and in 1995 published a minimum set of elements consisting of 24 fields. Reaching consensus on a single DTD for article headers is difficult, in that some publishers specifically choose not to supply particular data elements.
A revised edition is in the pipeline. The main purpose of the new version (SSSH2) is to enable the inclusion of alternative article identifiers. The Publisher Item Identifier (PII) is now included. This was developed by Elsevier Science and adopted by other publishers such as the American Chemical Society, the American Institute of Physics, the American Physical Society and the IEEE. It is anticipated that other alternative identifiers will be added in future revisions.
The publication of the revised version included the addition of the ISO special character entity set for Mathematical Script Characters, and minor changes to the DTD to bring it into conformance with the Reference Concrete Syntax of the SGML standard.
8.2.2.3 Controlling Agency
BIC will be the control agency in association with other relevant bodies.
8.2.2.4 How widespread is deployment?
The DTD is being used by some of the UK eLib journals, and some major publishers have indicated they will move to SSSH (Springer, Elsevier, Kluwer).
8.2.2.5 Technical considerations
This DTD offers a fuller character set than MAJOUR including ISO special character entity sets. Version 2 includes:
The aim will be for BIC to build, maintain and develop a comprehensive and consistent set of SGML DTDs for all publication types.
8.2.2.7 Content
Currently consists of article header. It differs from MAJOUR in a number of aspects but where possible attempts compatibility. Compared to MAJOUR it has simplified tags, simplified author affiliations and allows for SICI article identifiers. Parameterisation allows MAJOUR definitions to be restored if desired.
8.2.2.8 Extensibility
Both technically and organisationally capable of extension, through BIC as control agency.
Intended for use throughout the "book" supply chain where detailed bibliographic, trade and promotional information must be communicated. Although this is commonly referred to as a book DTD it is intended to cover a wide range of non-serial electronic and print publications. The intention is for the DTD to accommodate a variety of media, but that the first version should be targeted at printed material.
The aim is to define a DTD suitable for use by a wide range of publishers large and small. It is intended to define a format which can be used for a variety of functions including internal databases, book promotion, transactions in the trade, and provision of bibliographic information. It is acknowledged that although there will be an attempt at exhaustivity in inclusion of data elements, a minimum list will need to be defined for smaller organisations, or those who do not wish to supply detailed information.
The inclusion of several elements relating to trade information means that the resulting records would be useful for resource selection purposes.
8.2.3.2 What is the level of maturity of the format?
The DTD is under development with the aim of completion in 1997. Work on this DTD grew out of a need recognised for some time in the library and book world for a fuller product description which would include more trade information than the traditional bibliographic record, and also from the need for a record which could match the 'line item' object level required by publishers and book suppliers. This was recognised in David Martin's original report which was the basis of the further work on this DTD. This original work included a survey asking 49 BIC member organisations to contribute details of the data elements that they used in their own internal databases.
EDItEUR have agreed to review the DTD when complete and some members of that body will provide input. Some US publishers, particularly of CD-ROMs, support the development (Bowker, Ingram, and Baker & Taylor).
8.2.3.3 Controlling Agency
BIC will be the control agency in association with other relevant bodies.
8.2.3.4 Technical considerations
There is a recognition that the relational database approach will provide advantages for internal databases, but that electronic communication between systems requires a record structure.
The book DTD will certainly have as wide a capability on character sets as SSSH.
8.2.3.5 Future path
Development plans: Initial release to cover at least non-serial publications on paper, in a structure designed to cover all media, including electronic and hybrids. Options for implementation include transmission of records created using the DTD within an EDI envelope, but it is also seen that records could be transmitted in other ways.
It is considered that the data dictionary of elements will be of use throughout the information flow from publisher to library.
8.2.3.6 Content
Within each media type, comprehensive coverage of bibliographic, trade and promotional data. The intention is to design a record that allows for different object levels associated with any line item (part works, bundles of works, multimedia packs etc). The overall structure is outlined in section 4.2.
8.2.3.7 Rules for formulation of content
Typically publishers do not follow cataloguing rules. This issue is not addressed.
8.2.3.8 Extensibility
The DTD is being designed explicitly for extensibility across all publication types, and to provide for unforeseen requirements. Will be backed up by an organisation which represents all interested parties (publishers, wholesalers and retailers, libraries) and which can support development, dissemination and control.
8.2.4.1 Community of use
EDIFACT is the accepted international standard messaging format for trading transactions in all industries. It is appropriate where bibliographic and other product information is communicated in the context of a trading relationship.
8.2.4.2 What is the level of maturity of the format?
EDIFACT is a mature and fully accepted standard, although its practical application in the "book" trade began only in 1995/6. The EDIFACT syntax is maintained and developed by a world-wide process co-ordinated by a UN agency. The "book" trade has chosen to adopt an EDIFACT subset maintained by EAN International (EANCOM).
8.2.4.3 Controlling Agency
EDItEUR is the international group which interprets and extends EDIFACT message standards for "book" and serial applications.
8.2.4.4 How widespread is deployment
Use has begun in national and international trading between wholesalers and retail booksellers, and between library booksellers and libraries (see EU-funded EDILIBE project).
8.2.4.5 Technical considerations
Unlike that of SGML, the scope of the EDIFACT standard extends to a complete definition of a transmission envelope identifying sender and receiver, and the standard is supported by a large number of "off-the-shelf" software packages and international VAN networks.
In theory a wide range of character sets is supported in the EDIFACT standard. In practice, they may not be implemented in current software.
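The transmission envelope described in this section can be illustrated by a minimal sketch that reads the sender and receiver from an interchange. It assumes the default EDIFACT service characters (`'` ends a segment, `+` separates data elements, `:` separates components, `?` is the release character) and an interchange with no UNA service string. The party identifiers and message content in the sample are invented for illustration, and the function names are hypothetical.

```python
def split_escaped(text, separator, release="?"):
    """Split on separator, ignoring separators preceded by the release character."""
    parts, current, i = [], [], 0
    while i < len(text):
        ch = text[i]
        if ch == release and i + 1 < len(text):
            current.append(text[i:i + 2])  # keep the escaped pair intact
            i += 2
            continue
        if ch == separator:
            parts.append("".join(current))
            current = []
        else:
            current.append(ch)
        i += 1
    parts.append("".join(current))
    return parts


def read_envelope(interchange):
    """Extract envelope details from the UNB header and UNZ trailer segments."""
    segments = [s.strip() for s in split_escaped(interchange, "'") if s.strip()]
    unb = split_escaped(segments[0], "+")
    unz = split_escaped(segments[-1], "+")
    if unb[0] != "UNB" or unz[0] != "UNZ":
        raise ValueError("interchange must start with UNB and end with UNZ")
    return {
        "sender": split_escaped(unb[2], ":")[0],
        "recipient": split_escaped(unb[3], ":")[0],
        "control_reference": unb[5],
        "message_count": int(unz[1]),
        "body": segments[1:-1],  # the messages carried inside the envelope
    }


# Invented sample interchange carrying a single short message.
SAMPLE = (
    "UNB+UNOA:2+PUBLISHER01+LIBRARY01+970101:1200+REF001'"
    "UNH+1+ORDERS:D:96A:UN'"
    "BGM+220+ORDER?+1'"
    "UNT+3+1'"
    "UNZ+1+REF001'"
)

envelope = read_envelope(SAMPLE)
```

The envelope segments are read without interpreting the messages they enclose, which is what allows routing software to identify the trading partners independently of the message content.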
8.2.4.6 Future path
EDIFACT continues to be developed to meet new requirements, but the standard is now fairly stable, and changes are generally upwards compatible.
EDItEUR is in the process of confirming a much simpler option, where basic bibliographic data are combined with price and availability information for the continuous updating of product databases used in trading. This may, and probably will, be a subset of one of the same messages which can be used for more extensive bibliographic data.
It can be assumed that both of these options will continue to be supported, maintained and developed by EDItEUR, within the limits of existing EDIFACT syntax. A third option may be available soon, to carry within an EDIFACT envelope an "object" which is not itself in EDIFACT syntax. This could, for example, be a MARC record or an SGML document or another metadata format.
This last approach may be attractive where the parties to the exchange of information are in a trading relationship for which they are already equipped to use EDIFACT messaging. The extent of implementation needs to be considered, and whether these are the parties who will be involved in information flow from publishers to libraries.
It is unlikely that a "native" EDIFACT message will be developed to carry the full range of metadata which might be included in an SGML document, but it is still unclear exactly where a line should be drawn.