Feature

From Aristotle to the 'semantic web'

Most of the content of today's world wide web is designed for humans to read, not computers to manipulate. Alan Gilchrist explores the principles of classification and indexing that underlie the concept of the 'semantic web', which might one day lead to much more accurate and targeted automatic searching and retrieval than is possible at present.

If you have a number of objects, you can divide them into groups according to whichever of their characteristics suit the purpose in hand, and you can attach labels to those groups and, perhaps, to the characteristics. Devising classifications in order to put objects (books, etc) into groups has been the traditional activity of librarians. The labelling aspect, though vital in providing entry points to the classes, was, initially, a secondary consideration.

Another body of people who have been much concerned with grouping and labelling are taxonomists, and here the labelling follows strict procedures so that the labels reflect the grouping that forms the taxonomy.

In the area of information retrieval, it has been the traditional activity of information scientists to create thesauri in order to provide labels for objects (journal articles, etc). The classificatory aspect was secondary, until they realised that concepts were objects too, at which stage a faceted classification and a thesaurus could be merged into a cohesive whole. Almost immediately after this innovation, distributed systems and full-text retrieval became prevalent; classification was largely disregarded and the thesaurus was relegated to a secondary role where, like Roget's, it was used to suggest synonyms in query formulation. From Aristotle to Dewey, the philosophers and epistemologists were dominant, while the later thesaurus builders flirted with linguistics in turning so-called 'natural language' into artificial language (that is, a controlled vocabulary).

The arrival of the mathematicians
With full-text retrieval and the exponential growth in computer power, the mathematicians moved in (though one remembers that Ranganathan was a mathematician, and that post-coordinate systems use Boolean operators). First, there came probabilistic indexing, using the statistics of uncertainty formulated by Bayes. This is applied to the calculation of relevance based on the evidence provided by the occurrence and co-occurrence of words in the text. This is still, 30-odd years since the first algorithm was devised, the paradigm for most current search engines, even if they are increasingly augmented by various devices, largely borrowed from natural language processing.
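The idea of calculating relevance from word occurrence can be illustrated with a minimal sketch. It is not the algorithm used by any particular search engine: it assumes a handful of documents already judged relevant or not, treats each document as a set of words, and gives each word a log-odds weight in the Bayesian manner. The names term_weights and score, the smoothing constants and the sample documents are all illustrative assumptions.

```python
import math

def term_weights(relevant_docs, other_docs):
    """Estimate a log-odds weight per word from two small judged samples.

    Each document is a set of words. A positive weight means the word
    occurs more often in the relevant sample than in the other sample.
    """
    vocabulary = set().union(*relevant_docs, *other_docs)
    weights = {}
    for word in vocabulary:
        # Smoothing (an assumption) keeps both probabilities away from 0 and 1.
        p = (sum(word in d for d in relevant_docs) + 0.5) / (len(relevant_docs) + 1)
        q = (sum(word in d for d in other_docs) + 0.5) / (len(other_docs) + 1)
        weights[word] = math.log(p * (1 - q) / (q * (1 - p)))
    return weights

def score(document, query, weights):
    """Sum the weights of the query words that occur in the document."""
    return sum(weights.get(word, 0.0) for word in query if word in document)

# Hypothetical judged samples.
relevant = [{"topic", "map", "index"}, {"topic", "map"}]
others = [{"football", "map"}, {"football", "score"}]
weights = term_weights(relevant, others)

query = ["topic", "map"]
on_subject = score({"topic", "map", "xml"}, query, weights)
off_subject = score({"football", "map"}, query, weights)
```

A document containing both query words scores higher than one containing only the less distinctive word, which is the sense in which the evidence of occurrence drives the ranking.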

The next mathematical innovation, pioneered by Gerard Salton, is roughly the same age as probabilistic indexing, but had to wait for computers to become both more powerful and cheaper to be a serious contender in the marketplace. This more cerebral mathematics uses vector space modelling, by which distances and directions between words and phrases extracted from the text are measured in multidimensional space. This technique is now being employed by some software houses selling packages that automatically create categories from analysis of text.
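A minimal sketch of the vector space idea follows. It makes simplifying assumptions that are not in the original work: whole texts, rather than words and phrases, are turned into vectors of raw word counts, and the 'direction' between two vectors is measured by the cosine of the angle between them. The function names vectorise and cosine are illustrative.

```python
import math
from collections import Counter

def vectorise(text):
    """Turn a text into a vector of word counts (one dimension per word)."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine of the angle between two count vectors.

    Near 1.0 means the texts point in the same direction;
    0.0 means they share no words at all.
    """
    dot = sum(a[word] * b[word] for word in a.keys() & b.keys())
    length_a = math.sqrt(sum(count * count for count in a.values()))
    length_b = math.sqrt(sum(count * count for count in b.values()))
    if length_a == 0 or length_b == 0:
        return 0.0
    return dot / (length_a * length_b)

# Hypothetical snippets of text.
doc_1 = vectorise("topic maps link topics to documents")
doc_2 = vectorise("topic maps link subjects to resources")
doc_3 = vectorise("football results and league tables")

near = cosine(doc_1, doc_2)
far = cosine(doc_1, doc_3)
```

Texts that lie close together in this space could then be gathered into the same automatically derived category, with a human still needed to label and adjust it.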

This is not the classification 'imposed' by an external authority, but a classification derived from a particular set of documents, and peculiar to that set. If, of course, it is found that another set of documents bears a strong topical similarity to the first, then the categorisation can be automatically applied to the second set.

However, as the more honest software vendors admit (even turning the observation into a positive marketing ploy), human intervention is desirable, even necessary, in the start-up phase, as well as for fine-tuning and maintenance. Many librarians and information scientists are now employed by software houses and information providers, grouping and labelling. (Yahoo! is reputed to employ 400 librarians categorising and tagging websites.) This is a recognition that search engines are still mostly incapable of dealing with the exponential growth of content, and that computers are still not able to fully and accurately automate the processes of grouping and labelling of concepts.

* * *

Glossary
Literal: any value other than a URI, such as a name or a subject.
Ontology: a formal description of objects and their relationships.
Post-coordinate systems: information retrieval systems which allow the user to coordinate search terms, be they index terms which have been assigned, or words in full text.
Probabilistic indexing: information retrieval systems that employ algorithms to assess, in reply to queries, the likely relevance of documents on the basis of the statistical occurrence and co-occurrence of words in the documents.
RDF (Resource Description Framework): a simple data model based on triples (subject, predicate and object), where the subject and predicate are URIs and the object is either a URI or a literal. The context and scope must also be described.
Relevance: calculated on the occurrence and co-occurrence of words in the text (algorithms that calculate relevance in this way are the basis of search engines, augmented by various devices borrowed from natural language processing).
Schema: a set of statements, expressed in a data definition language, that completely describes the structure of a document or database.
Scope: the context in which a name or an occurrence is assigned to a given topic.
Topic maps: 'a formal way to declare a set of topics and then to provide links to documents or sub-document nodes that address the topics.' A topic (in a topic map) has the characteristics: resources (preferably a Uniform Resource Identifier [URI], similar to a Uniform Resource Locator [URL], but at a more granular level); names; and relationships. Where a resource contains information that is specified as relevant to a given subject, an occurrence is identified. One or more interrelated occurrences, employing the grammar of XTM, is called a topic map. Topic maps exist apart from the document (whereas HTML tags are bound to the document).
Vector space modelling: distances and directions between words and phrases extracted from the text are measured in multidimensional space.
Vocabulary: any collection of words, a terminology.
XML: Extensible Mark-up Language, a meta-language used to define customised mark-up languages. Designed to enable computer applications to use and re-use elements of documents. Used to define the structure of the document.

* * *

The re-emergence of the Knowledge Engineers
There has always been a comfortable co-existence between librarians and information scientists who, broadly speaking, understand what the other is talking about. Increasingly the dialogue between these two groups and information technologists and mathematicians has been fruitful. (Indeed, we will soon see the merger between the Library Association and the Institute of Information Scientists, while over a year ago the American Society for Information Science added '& Technology' to its name.)

Now, a third front (or is it a fourth or fifth front?) is opening up, with the continuing advances made by XML developers and the Artificial Intelligence fraternity. XML (Extensible Mark-up Language) is a simplified subset of SGML (Standard Generalised Mark-up Language). In fact, XML is more of a meta-language that can be used to define customised mark-up languages, of which many are being spawned. While XML is designed to describe the structure of a document, rather than its content, it is a key tool in two developments aimed at radically improving information retrieval, and in taming the web. Whereas SGML aspired to be a universal standard, but is immensely complex, and HTML is a universal standard, but extremely and intentionally simple, XML fills the gap between the two. Its main intention is to allow computer applications to use and re-use elements of documents, or any defined information object in the enterprise.

Topic maps
The key word in that definition is 'defined'. One development of XML is XTM, standing for XML Topic Maps; and here the terminology used becomes confusing, especially as it overlaps the traditional areas of librarianship and information science, and uses some of 'our' words in different ways. Work on topic maps started in 1993, when some people calling themselves the Davenport Group provided this insight: 'Indexes, if they have any self-consistency at all, conform to models of the structure of the knowledge available in the materials that they index. But the models are implicit, and they are nowhere to be found! If such models could be captured formally, then they could guide and greatly facilitate the process of merging modelled indexes together.'

The work of this group was continued by the GCA Research Institute (now known as IDEAlliance), and eventually finished up as an International Standard.1 Bill Trippe2 defines a topic map as 'a formal way to declare a set of topics and then to provide links to documents or sub-document nodes that address the topics. In other words, they are a way to declare a set of labels for topics, and then to point to places where those topics are discussed and addressed'. In a topic map, a topic (which is equivalent to a subject of any kind, in any area of discourse) is defined as having the characteristics: resources (preferably a Uniform Resource Identifier (URI), similar to a Uniform Resource Locator (URL), but at a more granular level); names (and here names have a particular meaning accompanied by some rather complex rules); and relationships.

Where a resource contains information that is specified as relevant to a given subject, an occurrence is identified. One or more interrelated occurrences employing the grammar of XTM is called a topic map. It is also important to define the scope - that is, the context in which a name or an occurrence is assigned to a given topic.

A web document describing XTM3 gives an example of a topic map for an electronic encyclopaedia. This multimedia encyclopaedia contains information about the playwrights William Shakespeare and Ben Jonson; their plays, Hamlet and Volpone; and also more general information about London and Stratford. It should be noted that XTM is a vehicle for the exchange of data at the level of subject content, but that the authority for descriptive indexing resides elsewhere. Every location where these subjects are discussed, depicted or mentioned is called an occurrence and, as this is an electronic encyclopaedia, each is an electronic resource with a unique address. But because not all subjects are electronic artefacts, it is necessary to provide an address for those subjects and this is done with the topic, acting as a surrogate. The relationships are such associations as William Shakespeare (is the author of) Hamlet, and William Shakespeare (was born in) Stratford; both of which statements can be reversed.
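The encyclopaedia example can be sketched as a small data structure. This is not XTM syntax, only an illustration of how topics, occurrences and associations relate to one another. The class names, the example.org addresses and the wording of the reversed relationships are assumptions made for the sketch.

```python
from dataclasses import dataclass, field

@dataclass
class Topic:
    """A surrogate for a subject, with the addresses where it is discussed."""
    name: str
    occurrences: list = field(default_factory=list)

@dataclass(frozen=True)
class Association:
    """A relationship between two topics, readable in either direction."""
    source: str
    relation: str
    target: str
    inverse: str  # assumed wording for the reversed statement

# Hypothetical occurrence addresses inside the electronic encyclopaedia.
topics = {
    "shakespeare": Topic("William Shakespeare",
                         ["http://encyclopaedia.example.org/articles/shakespeare"]),
    "hamlet": Topic("Hamlet",
                    ["http://encyclopaedia.example.org/articles/hamlet"]),
    "stratford": Topic("Stratford",
                       ["http://encyclopaedia.example.org/articles/stratford"]),
}

associations = [
    Association("shakespeare", "is the author of", "hamlet", "was written by"),
    Association("shakespeare", "was born in", "stratford", "is the birthplace of"),
]

def statements_about(topic_id):
    """Read every association from the point of view of one topic."""
    statements = []
    for a in associations:
        if a.source == topic_id:
            statements.append(
                f"{topics[a.source].name} {a.relation} {topics[a.target].name}")
        elif a.target == topic_id:
            statements.append(
                f"{topics[a.target].name} {a.inverse} {topics[a.source].name}")
    return statements
```

Calling statements_about("hamlet") reads the authorship association in reverse, while the occurrences list on each topic holds the addresses where that subject is discussed. The structure sits outside the articles themselves, which is the point made in the next paragraph.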

Whereas an HTML metatag (as used in standard web pages) is bound to the document which it is indexing, topic maps exist apart from the individual documents, thus allowing applications or users to understand the topical relationships between documents. It is worth repeating here that XML and XTM define structure, not content. That last job is effected in XTM by the subject indicator, defined as a resource that should provide a positive, unambiguous indication of the identity of a subject, in particular through the use of standardised ontologies. Put simply, an ontology is a formal description of objects and their relationships, and it will turn up again later in this article.

Schemas
XML and XTM are 'schemas', sets of statements expressed in a data definition language that completely describe the structure of a document or a database. A potentially important schema, using the XML language, is RDF (Resource Description Framework). RDF is a very simple data model based on triples, consisting of subject, predicate and object, where the subject and predicate are URIs, and the object is either a URI or a literal. As with XTM, the context and scope must be defined.

A very simple example used in The Semantic Web... 4 describes a contacts database. A subject is linked by one predicate (a URI) to an object that identifies it as a contact in the database, and by three further predicates (name, role and organisation) to three further objects, of which the first two are literals, and the third, the organisation, is expressed as a URI (in this case a URL). Again, as with XTM, there must be an external authority for the semantic description of the objects and their relationships. This is alternatively expressed in The Semantic Web... as a vocabulary or an ontology. Interestingly, the author admits that the hardest problem in this area is not the infrastructure [the RDF schema] but the actual ontologies themselves, and 'until an industry-wide standard exists for, say, vehicle parts, there is a limit to the utility of the semantic web in the auto manufacturing industry'.
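A contact record of this kind can be sketched as four triples. The sketch uses plain Python tuples rather than any RDF library or serialisation. The example.org URIs, the person's name and role, and the use of the rdf:type predicate for the first triple are assumptions, and are not taken from the cited primer.

```python
from typing import NamedTuple

class Triple(NamedTuple):
    subject: str      # always a URI
    predicate: str    # always a URI
    obj: str          # either a URI or a literal
    obj_is_uri: bool  # distinguishes the two kinds of object

# Hypothetical vocabulary and subject; none of these URIs comes from the primer.
EX = "http://example.org/contacts#"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"  # assumed first predicate
person = "http://example.org/people/1"

triples = [
    Triple(person, RDF_TYPE, EX + "Contact", True),
    Triple(person, EX + "name", "A. N. Other", False),       # literal
    Triple(person, EX + "role", "Editor", False),            # literal
    Triple(person, EX + "organisation", "http://www.example.org/", True),
]

def literals_for(subject):
    """Collect the literal-valued properties of one subject, keyed by short name."""
    return {
        t.predicate.rsplit("#", 1)[-1]: t.obj
        for t in triples
        if t.subject == subject and not t.obj_is_uri
    }
```

Nothing in the triples says what 'role' or 'organisation' means. That meaning has to come from the external vocabulary or ontology that the predicate URIs point to.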

The semantic web
The previous quote introduces the 'semantic web', in the context of its being based in part on the use of RDF. The semantic web is the brainchild of the 'father' of the web, Tim Berners-Lee, and, like the web itself, the basic idea is extremely simple but the implications are potentially very significant and far-reaching. Berners-Lee has written, with co-authors, a simple exposition of the idea,5 which has already attracted a lot of interest - and some scepticism (one author referring to it as the 'pedantic web'). Berners-Lee points out that 'most of the web's content today is designed for humans to read, not for computer programs to manipulate'. He goes on to claim that 'the semantic web will bring structure to the meaningful content of web pages, creating an environment where software agents, roaming from page to page, can readily carry out sophisticated tasks for users.'

Others have quickly spotted that one of those tasks could be more effective retrieval. In addition to XML and RDF (though there are other contenders), the semantic web will require collections of information and sets of inference rules, which is where Knowledge Engineers and the Artificial Intelligence (AI) fraternity come on to the scene. Last, but by no means least, will be the requirement for ontologies and, as Berners-Lee says, 'two databases may use different identifiers for what is in fact the same concept...A program that wants to compare or combine information across the two databases has to know that these two terms are being used to mean the same thing. Ideally, the program must have a way to discover such common meanings for whatever databases it encounters'.
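The problem of two databases using different identifiers for the same concept can be illustrated with a very small sketch. It assumes a hand-built table, standing in for a shared ontology, that maps each database's own field name to a common concept. The database names, field names and sample values are hypothetical.

```python
# A stand-in for a shared ontology: (database, local field) -> common concept.
# Every entry here is a hypothetical example compiled by hand.
COMMON_CONCEPTS = {
    ("db_a", "zip"): "postal_code",
    ("db_b", "postcode"): "postal_code",
    ("db_a", "surname"): "family_name",
    ("db_b", "last_name"): "family_name",
}

def to_shared(database, record):
    """Re-key a record by common concept so that records can be compared.

    Fields with no known mapping keep a database-specific key, so that
    they are never mistaken for a field of the same name elsewhere.
    """
    return {
        COMMON_CONCEPTS.get((database, name), f"{database}:{name}"): value
        for name, value in record.items()
    }

record_a = to_shared("db_a", {"surname": "Jonson", "zip": "SW1A 1AA"})
record_b = to_shared("db_b", {"last_name": "Jonson", "postcode": "SW1A 1AA"})
same_person = record_a == record_b
```

The code is trivial. The intellectual work lies in compiling and maintaining the mapping table, which is the grouping and labelling task discussed throughout this article.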

It took a relatively long time for librarians and information scientists to become sufficiently familiar with the jargon of information technologists to be able not only to communicate, but also to provide specifications for IT systems to support their largely word-based activities. This brief article has ventured, with some trepidation, into the relatively new and arcane world of schemas for interoperability between electronic objects, with the single objective of pointing out that the old human and intellectual task of grouping and labelling is still in demand and, even though computers are making remarkable advances in this area, humans are not yet redundant. The range and quantity of grouping and labelling required, if the dreams of Berners-Lee and others are to be realised, are absolutely mind-boggling, and librarians and information scientists should be prepared to work in multi-disciplinary teams on this problem.

There would certainly seem to be room for all, but this does not seem to be the opinion of the author of this last quote: '...one interesting phenomenon is that a lot of AI ends up, after fleeing the Computer Sciences Department, in Information and Library Sciences. And, of course, librarians, even the non-techie ones, are really into cataloguing, searching, sharing, correlating, using metadata, intelligent agents...to wit, all the elements of the semantic web. AI folks don't end up in library departments because librarians are pushovers (as my overdue fines attest) but because there's a pretty good fit between what (some) AI-ers like to do, what the library folks want, and between what the librarians want and what the semantic web requires. So the semantic web is an AI project, and we should be proud of that fact.' 6

References
1 International Organization for Standardization. Information technology - SGML applications - Topic maps. ISO/IEC 13250. Geneva: ISO, 2000.
2 Bill Trippe. Taxonomies and Topic Maps. Categorisation steps forward (www.econtentmag.com).
3 XML Topic Maps (XTM) 1.0 www.topicmaps.org/xtm/1.0/
4 Edd Dumbill. The Semantic Web: a primer (www.xml.com/lpt/2000/11/01/semanticweb/index.html).
5 Tim Berners-Lee, James Hendler and Ora Lassila. 'The Semantic Web.' Scientific American, May 2001. Available at www.sciam.com/2001/0501issue/0501berners-lee.html
6 Bijan Parsia. An Introduction to Prolog and RDF (www.xml.com/pub/a/2001/04/25/prologrdf/index.html).

Alan Gilchrist is a certified management consultant, Associate Consultant for TFPL and Editor of the Journal of Information Science.



© The Library Association
Last updated: 20 December 2001