By Jeffrey Beall
Word-sense disambiguation is the ability of an online system to distinguish among the senses, or meanings, of words in online searching. Say, for example, that you need information on boxers, so you go to an Internet search engine and enter "boxers" in the search box. The search engine then finds documents that contain the word "boxers" and returns those documents to you as search results.
You probably already see the problem here -- the word "boxers" is a homonym with several different meanings, and the search engine doesn’t know which meaning you want. Boxers are a breed of dog, a category of athlete, and a kind of men’s garment. It’s also the possessive of a surname, as in "Barbara Boxer’s bill …" Finally, boxers were those who participated in the Boxer Rebellion in China from 1899 to 1901. There may be additional meanings.
Information retrieval in libraries has transitioned from the high precision and recall that legacy library systems offered to the probabilistic and linguistic free-for-all that internet search engines now provide. One of the great values of legacy library databases was that they effectively handled polysemy -- the ability of a term to have multiple meanings -- in searching. Because online searching needs word-sense disambiguation to be effective and precise, it’s important for all librarians to understand the problem and its solutions.
Traditional library systems deal with word-sense disambiguation deterministically. Controlled vocabularies, such as the Library of Congress Subject Headings (LCSH), artificially force multiple concepts known by the same word to be expressed differently and unambiguously in metadata.
Take the word "poles," for example. It can mean tall, thin structures, such as light poles, and it can mean people from Poland. To separate these concepts, LCSH creates metadata records that use "Poles (Engineering)" and "Polish people" to name the two concepts, instead of the ambiguous "poles." In this way, it eliminates the ambiguity and creates a unique heading for each concept. In the past this disambiguation has been known as an element of "authority control," but I hesitate to use that term because it makes some people instantly stop reading. The term "word-sense disambiguation" comes from information science and has become the more current term for this aspect of authority control.
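As a rough illustration of this deterministic approach, the Python sketch below models a controlled vocabulary as a simple lookup table. The two headings are the ones named above; the `AUTHORIZED_HEADINGS` table, the sense labels, and both function names are hypothetical, invented for this example, and do not reflect how LCSH data is actually stored.

```python
# Hypothetical miniature controlled vocabulary: each (term, sense) pair
# maps to exactly one unambiguous heading.
AUTHORIZED_HEADINGS = {
    ("poles", "structures"): "Poles (Engineering)",
    ("poles", "people"): "Polish people",
}


def authorized_heading(term: str, sense: str) -> str:
    """Return the single heading an indexer would assign for this sense."""
    key = (term.strip().lower(), sense.strip().lower())
    if key not in AUTHORIZED_HEADINGS:
        raise KeyError(f"no authorized heading for {term!r} in sense {sense!r}")
    return AUTHORIZED_HEADINGS[key]


def senses_of(term: str) -> list[str]:
    """List every heading that shares the ambiguous term, sorted for display."""
    wanted = term.strip().lower()
    return sorted(h for (t, _), h in AUTHORIZED_HEADINGS.items() if t == wanted)


print(authorized_heading("Poles", "people"))
print(senses_of("poles"))
```

Because the mapping is a plain lookup, the same input always yields the same heading; there is no guessing involved, which is the point of the deterministic approach.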
Disambiguation in databases is directly related to the concept of search precision. Precision here is the ratio of the number of relevant items retrieved in a search to the total number of items retrieved in that search. For example, you may search "tanks" to get information about military tanks, but if your search results are mainly about water or fuel tanks, the search suffers from low precision. Ideally, search results should include only relevant hits.
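The definition of precision reduces to a single division. The following Python sketch computes it for the "tanks" example; the document identifiers and the relevance judgments are made up for illustration.

```python
def precision(retrieved: list[str], relevant: set[str]) -> float:
    """Fraction of retrieved items that are relevant (0.0 if nothing retrieved)."""
    if not retrieved:
        return 0.0
    hits = sum(1 for doc in retrieved if doc in relevant)
    return hits / len(retrieved)


# Hypothetical result list for the query "tanks".
retrieved = ["water-tank-1", "fuel-tank-1", "military-tank-1", "water-tank-2"]
# Hypothetical judgment: only documents about military tanks are relevant.
relevant = {"military-tank-1", "military-tank-2"}

print(precision(retrieved, relevant))  # 1 relevant out of 4 retrieved -> 0.25
```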
There are several solutions to the homonym problem, and each has its advantages and weaknesses. First, users themselves often solve the problem by adding words to their search terms. For example, if you are looking for information about cookies, that is, the files that some web sites put on your computer, you might enter "cookies computers" to increase the precision of the search. Similarly, if you are searching for information on edible cookies, you might enter "cookies recipes" to eliminate many of the hits that deal with computer cookies.
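The effect of adding a word can be seen in a toy example. The Python sketch below runs a simple all-words-must-match search over four invented document titles; the `DOCUMENTS` collection and the `search` function are hypothetical and far cruder than a real search engine.

```python
# Hypothetical four-document collection; titles are invented for this example.
DOCUMENTS = {
    1: "how web sites store cookies on your computer",
    2: "chocolate chip cookies recipes for beginners",
    3: "deleting browser cookies from computers",
    4: "holiday cookies recipes and baking tips",
}


def search(query: str) -> list[int]:
    """Return ids of documents containing every query word (simple AND search)."""
    terms = query.lower().split()
    return sorted(
        doc_id
        for doc_id, text in DOCUMENTS.items()
        if all(term in text.split() for term in terms)
    )


print(search("cookies"))            # all four documents match
print(search("cookies recipes"))    # only the baking documents: [2, 4]
print(search("cookies computers"))  # only document 3: [3]
```

Note that "cookies computers" misses document 1, which says "computer" rather than "computers." Adding words narrows the results, but with exact matching it can also drop relevant items.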
-- Vendor Databases
Second, some library vendor databases, such as Ebsco’s Academic Search Premier, attempt to algorithmically separate out search results by concept when the searcher enters a homonym as a search term. Usually this works by the system generating a column with links grouped by subject. For example, a search on boxers might generate a link to "Boxing (Sports)" and "Boxers (Dogs)." The problem with this approach is that it is a probabilistic guess that the search engine makes, and these guesses are often wrong. They only work to a certain level of accuracy.
-- Other Algorithmic Approaches
Third, much research has been carried out on word-sense disambiguation in large textual corpora. This solution is expensive to set up, but once the programming is done it is cheap to reuse. Much of this type of word-sense disambiguation is done using ontologies (mappings of concepts and relationships), so it works best when it’s limited to a specific domain or area of study, such as mathematics. This approach is also probabilistic and is therefore always less than 100% accurate.
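The paragraph above does not name a specific algorithm. As one illustration, a well-known family of techniques compares the words surrounding an ambiguous term with a short word list for each candidate sense and picks the sense with the greatest overlap (a simplified form of the Lesk algorithm). The Python sketch below is a toy version and is not the method of any particular system; the `SENSES` table, its signature words, and the `disambiguate` function are invented for this example.

```python
import re
from typing import Optional

# Hypothetical sense inventory for "tanks": each sense has a set of
# signature words that tend to appear near it. Invented for this example.
SENSES = {
    "military vehicle": {"army", "armored", "battle", "gun", "military"},
    "storage container": {"water", "fuel", "storage", "gallons", "septic"},
}


def disambiguate(context: str) -> tuple[Optional[str], int]:
    """Pick the sense whose signature shares the most words with the context.

    Returns (sense, overlap). The sense is None when no signature word
    appears or when two senses tie, because the guess would be arbitrary.
    """
    words = set(re.findall(r"[a-z]+", context.lower()))
    scores = {sense: len(words & signature) for sense, signature in SENSES.items()}
    best = max(scores.values())
    winners = [sense for sense, score in scores.items() if score == best]
    if best == 0 or len(winners) > 1:
        return None, best
    return winners[0], best


print(disambiguate("The army deployed armored tanks in the battle."))
print(disambiguate("A history of tanks"))
```

The second call returns no sense at all because the context offers no clue, which is one reason approaches like this fall short of 100% accuracy.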
-- Legacy Library Systems
The final approach is the one used in legacy library systems such as online library catalogs. For example, the fields of psychology and chemistry use the term sublimation to mean different things. In psychology, according to WordNetWeb, it means "modifying the natural expression of an impulse or instinct … to one that is socially acceptable." In chemistry, the term refers to a change from solid to gas without passing through a liquid phase. In LCSH, these terms are differentiated using glosses, the parenthetical qualifiers that follow a heading, as in "Poles (Engineering)."
Of course, the weakness of this authority control approach (oops, I said it again) is that it requires humans to perform the indexing, so the process is often too expensive for large databases. Ultimately, the most successful solution may be one that incorporates the best of both manual and algorithmic processes, such as automated processes that use manually created authority records to carry out the disambiguation.
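A hybrid of this kind might look roughly like the following Python sketch: people create the authority records once, and a program applies them, sending uncertain cases back to a human. The record structure, the `context_terms` field, the `min_overlap` threshold, and the cue words are all assumptions made for illustration; they are not a real authority-record format.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class AuthorityRecord:
    """A manually created record: one heading plus human-chosen cue words."""
    heading: str
    context_terms: frozenset


# Hypothetical records; the cue words were picked by hand for this example.
RECORDS = [
    AuthorityRecord("Poles (Engineering)",
                    frozenset({"utility", "steel", "wooden", "light"})),
    AuthorityRecord("Polish people",
                    frozenset({"poland", "warsaw", "immigrants", "polish"})),
]


def assign_heading(text: str, min_overlap: int = 2):
    """Return (heading, needs_review) for a document mentioning "poles".

    The program assigns a heading only when exactly one record reaches
    min_overlap cue words; otherwise it flags the item for a human indexer.
    """
    words = set(re.findall(r"[a-z]+", text.lower()))
    scored = [(len(words & rec.context_terms), rec.heading) for rec in RECORDS]
    confident = [heading for score, heading in scored if score >= min_overlap]
    if len(confident) == 1:
        return confident[0], False
    return None, True


print(assign_heading("Wooden utility poles along the highway"))
print(assign_heading("Poles in Chicago"))
```

The design choice here is that the program never guesses on weak evidence: human effort is spent only on the records themselves and on the items the program flags for review.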
Word-sense disambiguation is a crucial element of online information retrieval because it saves searchers time and increases search precision. As online databases continue to grow in size, word-sense disambiguation will garner more attention among library and information scientists, who will improve existing solutions and develop new ones. Library users and online searchers will come to benefit from the greater search precision that word-sense disambiguation provides.
Jeffrey Beall is Metadata Librarian at the University of Colorado Denver.