GLOSSER as a Practical Application
The European Union is taking an increasing interest in ways of cooperating with Central and Eastern Europe, and real cooperation has begun even before these countries become members of the EU. Estonian linguistics is no exception.
The enormous flow of information that reaches us every day in natural language makes one feel that it would be quite easy to "lose wisdom in a mass of knowledge, lose knowledge in a mass of information, lose information in a mass of data" (a quotation from T. S. Eliot). COPERNICUS 1994, under the Fourth Framework Programme, financed language engineering research, reusable language resources and several pilot applications. One aim of joint research projects of this kind should certainly be to bring together academic scholars and people from industry, to prove that "everything intended to be scientific need not necessarily be slow and unsalable" (the words of a Hungarian colleague). For the first time, Estonia had a chance to participate in five such projects. One of them was GLOSSER, on which people from Bulgaria, Estonia, the Netherlands, France and Hungary worked together for two years.
The result of this language technology project is a program called GLOSSER, designed to support reading, and learning to read, in a foreign language. It is the prototype of a system in which the computer serves as a reference work, not as a language teacher. Assistance is offered to advanced learners who are not afraid of "machines" and who find it exciting and useful to use a computer application alongside, or instead of, the tedious task of thumbing through a dictionary. GLOSSER differs from an ordinary dictionary lookup program in that its analysis is shown directly on the screen. The starting point was a question that arises while reading a text: how can one find the lemma and the right sense in the dictionary when the text contains many different word forms? GLOSSER is meant as a tool for answering this question.
The system architecture connects modules for morphological analysis and disambiguation, dictionary access and corpus search with an output module. Let us take a closer look at each of them.
Morphological Analysis and Part of Speech Disambiguation
Morphological analysis is necessary if one wishes to consult an on-line dictionary. GLOSSER was fortunate in having access to the Xerox POS Disambiguator for English which, drawing on the Morphological Analyser, picks out the correct part of speech from all possible morphological descriptions. The theoretical basis of the disambiguator is English Constraint Grammar, a theory from the late 1980s, which determines the function of a word using special rules that refer to morphological characteristics and context. These rules are the constraints.
Disambiguation means discarding those morphological descriptions that do not fit the specific context of the word; semantics is not taken into account. The disambiguator makes its decision after looking through the whole sentence. Homonymy differs in type and extent from language to language. For example, English is noted for part-of-speech homographs, whereas Estonian is noted for homonymy of morphological forms. (In Estonian the number of word forms is very large: on average there are 33 different forms per word. About 40% of Estonian word forms are ambiguous.)
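The principle of discarding readings that do not fit the context can be illustrated with a small Python sketch. It is a toy only: the tag names, the four-word lexicon and the two constraints are assumptions made for this example, and they are not the rules of the Xerox disambiguator or of English Constraint Grammar.

```python
# Toy constraint-based disambiguation.
# The lexicon, tags and constraints are illustrative assumptions only.

LEXICON = {
    "the": {"DET"},
    "concert": {"NOUN"},
    "must": {"AUX"},
    "start": {"NOUN", "VERB"},  # a part-of-speech homograph
}


def no_verb_after_determiner(readings, i):
    """Directly after an unambiguous determiner, discard verb readings."""
    if i > 0 and readings[i - 1] == {"DET"}:
        return readings[i] - {"VERB", "AUX"}
    return readings[i]


def no_noun_after_auxiliary(readings, i):
    """Directly after an unambiguous auxiliary, discard noun readings."""
    if i > 0 and readings[i - 1] == {"AUX"}:
        return readings[i] - {"NOUN"}
    return readings[i]


CONSTRAINTS = [no_verb_after_determiner, no_noun_after_auxiliary]


def disambiguate(words):
    """Return one set of surviving readings per word of the sentence."""
    readings = [set(LEXICON[w]) for w in words]
    changed = True
    while changed:  # repeat until no constraint removes anything
        changed = False
        for i in range(len(readings)):
            for constraint in CONSTRAINTS:
                reduced = constraint(readings, i)
                # A constraint may never remove the last remaining reading.
                if reduced and reduced != readings[i]:
                    readings[i] = reduced
                    changed = True
    return readings
```

Calling `disambiguate(["the", "concert", "must", "start"])` leaves only the verb reading for start, while `disambiguate(["the", "start"])` leaves only the noun reading. As in the description above, the decision depends on the neighbouring words and not on meaning.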
GLOSSER examines a sentence word by word, and access to the dictionary is word-based only. For example, the input sentence The concert was nothing to write home about is analysed word by word by the disambiguator, giving in Xerox codes: the+AT; concert+NN; be+BEDZ; nothing+PN; to+TO; write+VB; home+NN; about+IN. Yet the sentence contains the multiword expression nothing to write home about, whose dictionary definition comes at the end of the entry for home (after two parts of speech, several derivatives and several other multiword expressions containing the headword). And how should a user know that this expression is located under home and not under nothing? The next stage of the research project should deal specifically with multiword expressions and enable expression-based access to the dictionary. First, the system should check whether a word belongs to an expression; if it does, the system should then display only that part of the dictionary entry, not the whole entry.¹
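The two steps proposed here can be sketched in Python. This is a minimal illustration under stated assumptions, not the GLOSSER implementation: the dictionary structure, the function names `build_expression_index` and `look_up`, and the placeholder definitions are all hypothetical, and matching is done on the lemmas delivered by the disambiguator.

```python
# Hypothetical mini-dictionary. As in the printed book, a multiword
# expression is stored under a single headword. Definitions are placeholders.
ENTRIES = {
    "home": {
        "senses": ["<senses of 'home'>"],
        "expressions": {
            ("nothing", "to", "write", "home", "about"): "<definition of the expression>",
        },
    },
    "concert": {"senses": ["<senses of 'concert'>"], "expressions": {}},
}


def build_expression_index(entries):
    """Index every expression under each of its words, not only its headword."""
    index = {}
    for entry in entries.values():
        for expression, definition in entry["expressions"].items():
            for word in expression:
                index.setdefault(word, []).append((expression, definition))
    return index


def look_up(lemmas, position, entries, index):
    """Step 1: does the selected word belong to an expression in this sentence?
    Step 2: if so, return only that part of the entry; else the plain entry."""
    word = lemmas[position]
    for expression, definition in index.get(word, []):
        n = len(expression)
        # Try every window of n words that covers the selected position.
        first = max(0, position - n + 1)
        last = min(position, len(lemmas) - n)
        for start in range(first, last + 1):
            if tuple(lemmas[start:start + n]) == expression:
                return ("expression", " ".join(expression), definition)
    entry = entries.get(word)
    if entry is None:
        return ("unknown", word, None)
    return ("entry", word, entry["senses"])
```

With the lemmas of the example sentence (`the, concert, be, nothing, to, write, home, about`), selecting either nothing or home returns the expression and its definition alone, so the user no longer needs to know under which headword the expression is filed. Selecting concert falls back to the ordinary entry.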
Reusability of lexicographic resources is a widespread trend in computational lexicography today. The only feasible option is to use an existing dictionary. For GLOSSER, the most suitable candidate was Password, mainly because it is a semi-bilingual dictionary, but also because of its appropriate size (25,000 headwords)². The source language is represented by a headword, grammatical information, sense explanations and illustrative sentences. The target language is represented by brief translations for each meaning of the headword (a total of 37,000) or sub-headword (derivatives, multiword expressions). The semi-bilingual dictionary is new and unique in Estonia, and the GLOSSER team was very happy to find this combination of monolingual and bilingual dictionary in one volume.
We obtained the electronic version of the dictionary text in layout format, the so-called typographic view, which is concerned with the two-dimensional printed page. The layout codes had to be stripped and the text converted into a suitable format. Our task was to analyse the typographic view (the raw text format) fully, so as to be able to transform the text into the lexical view, i.e. lexical data as they might appear in a database, without concern for their exact textual form.
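The step from typographic view to lexical view can be pictured with a short Python sketch. The layout codes used here (`@B{...}` for bold, `@I{...}` for italic, `@T{...}` for the translation) and the sample line are invented for the illustration; the codes in the actual typesetting files are not described in this article and were certainly different.

```python
import re

# Hypothetical layout codes: @B{...} bold, @I{...} italic, @T{...} translation.
FIELD = re.compile(r"@([BIT])\{([^}]*)\}")

# Assumed mapping from typographic appearance to lexical function.
ROLE = {"B": "headword", "I": "pos", "T": "translation"}


def to_lexical_view(raw_line):
    """Turn one typeset line into a record of named lexical fields."""
    record = {ROLE[code]: text for code, text in FIELD.findall(raw_line)}
    # Text that carries no layout code is taken to be the sense explanation.
    record["definition"] = " ".join(FIELD.sub(" ", raw_line).split())
    return record


raw = "@B{concert} @I{noun} a musical performance @T{kontsert}"
record = to_lexical_view(raw)
```

After this step `record` holds the fields `headword`, `pos`, `translation` and `definition` by name, independent of how they were printed. The real difficulty lies in the mapping itself: in a printed dictionary the same typeface often serves several functions, which is why the typographic view had to be analysed fully first.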
The list of headwords was sent to Xerox for testing. Prof. L. Karttunen made a network from the list and checked it against the English transducer that Xerox supplied for GLOSSER. About 400 headwords were not recognized by the analyzer and needed to be added to the system. They included (a) British spellings of words that are in the morphological analyzer only in their American spelling (apologise, ardour, etc.); (b) French words in their original French orthography (café, cortège, etc.); and (c) words that are not found in the American Heritage Dictionary (casuarina, dhoti, kampung, rambutan, etc.). The words in the last group originate from several other local editions. The Estonian version was supplemented with 'kroon', 'sprotid', etc.
Usually, the microstructure of a dictionary is hierarchical and, depending on the type of dictionary, rather complex. The conservative form of a traditional printed dictionary, with its implicit information, is satisfactory for human users, but not for computer systems, which require each type of information to be set out explicitly.
The TEI Guidelines were consulted for encoding the text of the printed dictionary. The encoding format has to respect the rigorous principles of traditional dictionaries and present their content in a way that facilitates dictionary reusability and automatic processing. The Guidelines use the Standard Generalised Markup Language (SGML) to define their encoding scheme. SGML provides for a formal definition in terms of elements and attributes, and rules governing their appearance in a text³. A dictionary is seen as a linear text stream interspersed with markup. The tags indicate the content of the fields they delimit. Each information field has an opening marker <..> and an end marker </..>. One field can embed another.
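A short Python sketch shows what such a stream of embedded fields looks like. The element names (`entry`, `form`, `orth`, `gramGrp`, `pos`, `sense`, `def`, `trans`, `tr`) are modelled on the TEI dictionary tag set but should be read as illustrative; the actual encoding used for GLOSSER is not reproduced here, and the sample entry is invented.

```python
def encode(tag, content):
    """Wrap content in an opening <tag> and an end </tag> marker.

    content is either a string or a list of (tag, content) pairs, so that
    one field can embed another. Assumes the text itself contains no
    markup characters such as '<' or '&'.
    """
    if isinstance(content, str):
        inner = content
    else:
        inner = "".join(encode(child_tag, child) for child_tag, child in content)
    return f"<{tag}>{inner}</{tag}>"


# Invented sample entry; the structure mirrors headword, grammatical
# information, sense explanation and translation.
sample = [
    ("form", [("orth", "concert")]),
    ("gramGrp", [("pos", "noun")]),
    ("sense", [
        ("def", "a musical performance"),
        ("trans", [("tr", "kontsert")]),
    ]),
]

encoded = encode("entry", sample)
```

The result is a single linear string, `<entry><form><orth>concert</orth></form>...</entry>`, in which every piece of information that the printed page conveys implicitly through typeface and position is named explicitly by its tags.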
Nerbonne, J. and P. Smit. GLOSSER-RuG: in Support of Reading. Working paper of Vakgroep Alfa-informatica, Rijksuniversiteit Groningen.
[Figure: Main Window]
Margit Langemets heads the Department
of Lexicology at the Institute of the Estonian Language, and
specializes in monolingual lexicography, dictionary typology
and criticism, and computer applications. She graduated in Estonian
Philology from Tartu University, then taught Computational Lexicography
there. She initiated the publication of the semi-bilingual dictionary
in Estonia, and was among its translators. Her current projects
include completing the editing and computing of the Defining
Dictionary of Literary Estonian, preparing a Lexicographic Text
Corpus of Estonian, and computerizing several Estonian dictionaries.