Feature
From Aristotle to the 'semantic web'
Most of the content of today's world wide web is designed for humans to read, not computers to manipulate. Alan Gilchrist explores the principles of classification and indexing that underlie the concept of the 'semantic web', which might one day lead to much more accurate and targeted automatic searching and retrieval than is possible at present.
If you have a number of objects, you can divide them into groups according to those of their characteristics appropriate to the purpose, and you can attach labels to those groups and, perhaps, to the characteristics. Devising classifications in order to put objects (books, etc) into groups has been the traditional activity of librarians. The labelling aspect, though vital in providing entry points to the classes, was, initially, a secondary consideration.
Another body of people who have been much concerned with grouping and labelling are taxonomists, and here the labelling follows strict procedures so that the labels reflect the grouping that forms the taxonomy.
In the area of information retrieval, it has been the traditional activity of information scientists to create thesauri in order to provide labels for objects (journal articles, etc). The classificatory aspect was secondary, until they realised that concepts were objects too, at which stage a faceted classification and a thesaurus could be merged into a cohesive whole. Almost immediately after this innovation, distributed systems and full-text retrieval became prevalent; classification was largely disregarded and the thesaurus was relegated to a secondary role where, like Roget, it was used to suggest synonyms in query formulation.
From Aristotle to Dewey, the philosophers and epistemologists were dominant, while the later thesaurus builders flirted with linguistics in turning so-called 'natural language' into artificial language (that is, a controlled vocabulary).
The arrival of the mathematicians
The next mathematical innovation, pioneered by Gerard Salton, is roughly the same age as probabilistic indexing, but had to wait for computers to become both more powerful and cheaper to be a serious contender in the marketplace. This more cerebral mathematics uses vector space modelling, by which distances and directions between words and phrases extracted from the text are measured in multidimensional space. This technique is now being employed by some software houses selling packages that automatically create categories from analysis of text. This is not the classification 'imposed' by an external authority, but a classification derived from a particular set of documents, and peculiar to that set. If, of course, it is found that another set of documents bears a strong topical similarity to the first, then the categorisation can be automatically applied to the second set.
However, as the more honest software vendors admit (even turning the observation into a positive marketing ploy), human intervention is desirable, even necessary, in the start-up phase, as well as for fine-tuning and maintenance. Many librarians and information scientists are now employed by software houses and information providers, grouping and labelling. (Yahoo! is reputed to employ 400 librarians categorising and tagging websites.) This is a recognition that search engines are still mostly incapable of dealing with the exponential growth of content, and that computers are still not able to fully and accurately automate the processes of grouping and labelling of concepts.
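To make the vector space idea described earlier in this section a little more concrete, here is a minimal sketch in Python. It is not Salton's system or any vendor's package. It assumes the simplest possible representation (raw word counts) and a three-document collection invented purely for illustration, and it measures 'direction' with the cosine of the angle between two vectors.

```python
import math
from collections import Counter


def term_vector(text):
    """Turn a text into a sparse term-frequency vector (word -> count)."""
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse vectors (1.0 = same direction)."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical mini-collection, invented for illustration only.
docs = {
    "d1": "thesaurus terms for indexing journal articles",
    "d2": "indexing journal articles with thesaurus terms",
    "d3": "plays written by shakespeare in london",
}
vectors = {name: term_vector(text) for name, text in docs.items()}


def most_similar(name):
    """Return the other document whose vector points in the closest direction."""
    others = [n for n in vectors if n != name]
    return max(others, key=lambda n: cosine_similarity(vectors[name], vectors[n]))
```

Here `most_similar("d1")` picks `d2`, because the two share most of their words, while `d3` shares none. Commercial categorisation packages would presumably add term weighting, phrase extraction and clustering on top of this; those details are not covered by the sketch.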
The re-emergence of the Knowledge Engineers
Now, a third front (or is it a fourth or fifth front?) is opening up, with the continuing advances made by XML developers and the Artificial Intelligence fraternity. XML (Extensible Mark-up Language) is a modification of SGML (Standard Generalised Mark-up Language). In fact, XML is more of a meta-language that can be used to define customised mark-up languages, of which many are being spawned. While XML is designed to describe the structure of a document, rather than its content, it is a key tool in two developments aimed at radically improving information retrieval, and in taming the web.
Whereas SGML aspired to be a universal standard, but is immensely complex, and HTML is a universal standard, but extremely and intentionally simple, XML is a mark-up language that fills the gap between the two. Its main intention is to allow computer applications to use and re-use elements of documents, or any defined information object in the enterprise.
Topic maps
The work of this group was continued by the GCA Research Institute (now known as IDEAlliance), and eventually finished up as an International Standard.1 Bill Trippe2 defines a topic map as 'a formal way to declare a set of topics and then to provide links to documents or sub-document nodes that address the topics. In other words, they are a way to declare a set of labels for topics, and then to point to places where those topics are discussed and addressed'.
In a topic map, a topic (which is equivalent to a subject of any kind, in any area of discourse) is defined as having the following characteristics: resources (preferably a Uniform Resource Identifier (URI), similar to a Uniform Resource Locator (URL), but at a more granular level); names (and here names have a particular meaning accompanied by some rather complex rules); and relationships. Where a resource contains information that is specified as relevant to a given subject, an occurrence is identified. A set of one or more interrelated occurrences employing the grammar of XTM (XML Topic Maps) is called a topic map. It is also important to define the scope - that is, the context in which a name or an occurrence is assigned to a given topic.
A web document describing XTM3 gives an example of a topic map for an electronic encyclopaedia. This multimedia encyclopaedia contains information about the playwrights William Shakespeare and Ben Jonson; their plays, Hamlet and Volpone; and also more general information about London and Stratford. It should be noted that XTM is a vehicle for the exchange of data at the level of subject content, but that the authority for descriptive indexing resides elsewhere. Every location where these subjects are discussed, depicted or mentioned is called an occurrence and, as this is an electronic encyclopaedia, each is an electronic resource with a unique address.
But because not all subjects are electronic artefacts, it is necessary to provide an address for those subjects and this is done with the topic, acting as a surrogate. The relationships are such associations as William Shakespeare (is the author of) Hamlet, and William Shakespeare (was born in) Stratford; both of which statements can be reversed.
Whereas an HTML metatag (as used in the web standard protocol) is bound to the document which it is indexing, topic maps exist apart from the individual documents, thus allowing applications or users to understand the topical relationships between documents. It is worth repeating here that XML and XTM define structure, not content. That last job is effected in XTM by the subject indicator, defined as a resource that should provide a positive, unambiguous indication of the identity of a subject, in particular through the use of standardised ontologies. Put simply, an ontology is a formal description of objects and their relationships, and it will turn up again later in this article.
Schemas
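Before moving on, the topic map ideas from the preceding section (topics acting as surrogates, occurrences with addresses, and associations that can be read in both directions) can be illustrated with a small data-structure sketch in Python. This is not XTM syntax and does not follow the rules of the standard. The class names, the `example.org` addresses and the wording of the reversed relationships are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Topic:
    """A surrogate for a subject, which need not be an electronic artefact."""
    topic_id: str
    names: list = field(default_factory=list)
    occurrences: list = field(default_factory=list)  # addresses of resources


@dataclass
class Association:
    """A typed relationship between two topics, readable in both directions."""
    source: str
    relation: str
    inverse: str  # hypothetical wording for the reversed statement
    target: str


# Placeholder addresses: the real encyclopaedia example is not reproduced here.
topics = {
    "shakespeare": Topic("shakespeare", ["William Shakespeare"],
                         ["http://example.org/encyclopaedia/shakespeare"]),
    "hamlet": Topic("hamlet", ["Hamlet"],
                    ["http://example.org/encyclopaedia/hamlet"]),
    "stratford": Topic("stratford", ["Stratford"],
                       ["http://example.org/encyclopaedia/stratford"]),
}

associations = [
    Association("shakespeare", "is the author of", "was written by", "hamlet"),
    Association("shakespeare", "was born in", "is the birthplace of", "stratford"),
]


def statements_about(topic_id):
    """List every association involving a topic, reversing it where needed."""
    result = []
    for a in associations:
        if a.source == topic_id:
            result.append((topics[a.source].names[0], a.relation,
                           topics[a.target].names[0]))
        elif a.target == topic_id:
            result.append((topics[a.target].names[0], a.inverse,
                           topics[a.source].names[0]))
    return result
```

Calling `statements_about("hamlet")` yields the reversed statement (Hamlet, was written by, William Shakespeare). The map lives in its own structures, apart from the documents that the occurrence addresses point to, which is the property that distinguishes a topic map from a tag embedded in a document.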
A very simple example used in The Semantic Web... 4 has a contacts database, in which a subject has a URI (predicate) which relates to a database identifier of contact (object), accompanied by three more predicates (name, role and organisation), each with its own object; the first two objects are literals, and the third, the organisation, is expressed as a URI (in this case a URL). Again, as with XTM, there must be an external authority for the semantic description of the objects and their relationships. This is alternatively expressed in The Semantic Web... as a vocabulary or an ontology. Interestingly, the author admits that the hardest problem in this area is not the infrastructure [the RDF schema] but the actual ontologies themselves, and 'until an industry-wide standard exists for, say, vehicle parts, there is a limit to the utility of the semantic web in the auto manufacturing industry'.
The semantic web
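The contacts example described under Schemas can be written down as subject-predicate-object statements. The Python sketch below uses plain tuples rather than an RDF library, and every URI and literal in it is a made-up placeholder rather than data from the cited example. It covers only the three named predicates (name, role and organisation).

```python
# All identifiers below are hypothetical placeholders for illustration.
CONTACT = "http://example.org/contacts/1234"   # the subject
VOCAB = "http://example.org/vocab#"            # assumed vocabulary namespace

# Each statement is a (subject, predicate, object) triple.
triples = [
    (CONTACT, VOCAB + "name", "A. N. Other"),                # literal object
    (CONTACT, VOCAB + "role", "Information scientist"),      # literal object
    (CONTACT, VOCAB + "organisation", "http://example.org/"),  # URI object
]


def objects_of(subject, predicate):
    """Return every object stated for a given subject and predicate."""
    return [o for s, p, o in triples if s == subject and p == predicate]


def is_uri(value):
    """Crude check separating URI objects from literals in this sketch."""
    return value.startswith(("http://", "https://"))
```

For instance, `objects_of(CONTACT, VOCAB + "organisation")` returns a single value that `is_uri` accepts, whereas the name and role come back as literals. The predicates themselves carry no meaning until some external vocabulary or ontology defines them, which is the point made in the text.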
Others have quickly spotted that one of those tasks could be more effective retrieval. In addition to XML and RDF (though there are other contenders), the semantic web will require collections of information and sets of inference rules, which is where Knowledge Engineers and the Artificial Intelligence (AI) fraternity come on to the scene. Last, but by no means least, will be the requirement for ontologies and, as Berners-Lee says, 'two databases may use different identifiers for what is in fact the same concept...A program that wants to compare or combine information across the two databases has to know that these two terms are being used to mean the same thing. Ideally, the program must have a way to discover such common meanings for whatever databases it encounters'.
It took a relatively long time for librarians and information scientists to become sufficiently familiar with the jargon of information technologists to be able not only to communicate, but also to provide specifications for IT systems to support their largely word-based activities. This brief article has ventured, with some trepidation, into the relatively new and arcane world of schemas for interoperability between electronic objects, with the single objective of pointing out that the old human and intellectual task of grouping and labelling is still in demand and, even though computers are making remarkable advances in this area, humans are not yet redundant. The range and quantity of grouping and labelling required, if the dreams of Berners-Lee and others are to be realised, is absolutely mind-boggling, and librarians and information scientists should be prepared to work in multi-disciplinary teams on this problem.
There would certainly seem to be room for all, but this does not seem to be the opinion of the author of this last quote: '...one interesting phenomenon is that a lot of AI ends up, after fleeing the Computer Sciences Department, in Information and Library Sciences.
And, of course, librarians, even the non-techie ones, are really into cataloguing, searching, sharing, correlating, using metadata, intelligent agents...to wit, all the elements of the semantic web. AI folks don't end up in library departments because librarians are pushovers (as my overdue fines attest) but because there's a pretty good fit between what (some) AI-ers like to do, what the library folks want, and between what the librarians want and what the semantic web requires. So the semantic web is an AI project, and we should be proud of that fact.' 6
References
Alan Gilchrist is a certified management consultant, Associate Consultant for TFPL and Editor of the Journal of Information Science.
Feedback or comments on the contents of the printed Record or on the Record Web pages are welcomed at record@la-hq.org.uk.