Stemma

STEMMA, A ROBUST NOUN STEM GENERATOR FOR FINNISH

Kimmo Kettunen

Finnish computational morphology has been studied rather intensively during the last ten years, and there are now many programs which analyze or synthesize Finnish word forms, or do both. The most prominent programs have been those based on the two-level model developed by Kimmo Koskenniemi in the early 1980s. The champion's challenger has been a program named Morfo (Jäppinen & Ylilammi 1986), which has been developed in the Kielikone project. Other programs also exist, and some of them have been made outside the major projects as individual efforts. These include e.g. Eugene Holman's Finnmorf (1988), which is a CALL program for foreign students of Finnish.

I have recently finished a prototype of a Finnish noun stem generator, which I have named Stemma (a Finnish form of the English word stem). The program is able to produce all the differing stems and substems of an input noun. It may not bring anything considerably new to the computational modeling of Finnish morphology, but it is a reliable and robust tool program which can be adapted to many uses. The program has been implemented in the programming language Icon, which has not been used very much in computational linguistics although its characteristics are very well suited at least for prototypes and non-commercial research programs.

THE STRUCTURE OF THE PROGRAM

The program works by string and substring matching on the input word. It analyses the word from its end and, according to what it finds there, passes the word on to the appropriate subprocedures. The most important subprocedures are grade alternation, plural formation and 12 pattern procedures, each of which handles one or several types of nouns (mostly one type). The program also has six small vocabularies, which contain either all the members of small exceptional paradigms (such as the vesi type) or exceptional words which are not affected by the grade alternation rules. Altogether some 600 words are included in the vocabularies or in the rules themselves. Otherwise the program relies only on substring matching and pattern analysis and does not need comprehensive lexicons or paradigm markings.

The flow of the program can be described as follows. The main program reads the input, splits compounds into parts, makes some basic stem and character variable definitions and calls the other procedures. The actual processing of the word is begun by the patterns procedure, which first decides whether the word needs to be handled by the grade alternation procedures. If grade alternation seems necessary, the word is sent to the appropriate alternation procedure, either weakening or strengthening. Otherwise it is matched against different pattern rules, which decide whether any one of the 12 special pattern procedures has to be invoked. After this the word is passed to plural formation, which may call the syllable count procedure and the analysis of the first vowel if needed (only with words ending in a or ä).

After this the processing is finished and the resulting stems can be output. When the program stops, it also prints short statistics: how many words it has processed, how many stems were produced and the arithmetic mean of the number of stems per input word. The processing time used is also given.
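The dispatch described above can be illustrated with a small sketch. It is written in Python rather than Icon, and it is not Stemma's code: the rule table, the one-word exception vocabulary and the function names are assumptions made for illustration. Only two weakening rules are shown (long kk, pp, tt shortening, and single t after a vowel becoming d); strengthening, the 12 pattern procedures and plural formation are left out.

```python
# Toy illustration of end-of-word dispatch; not Stemma's actual rules.

VOWELS = "aeiouyäö"

# Hypothetical exception vocabulary: words exempt from grade alternation.
NO_ALTERNATION = {"auto"}

# Quantitative weakening: a long stop before the final vowel shortens.
GEMINATES = {"kk": "k", "pp": "p", "tt": "t"}


def weaken(word):
    """Return the weak stem if a rule matches the end of the word, else None."""
    if len(word) < 3 or word[-1] not in VOWELS:
        return None
    head, cluster, final = word[:-3], word[-3:-1], word[-1]
    if cluster in GEMINATES:
        return head + GEMINATES[cluster] + final
    # Qualitative weakening, one rule only: vowel + t -> vowel + d.
    if cluster[1] == "t" and cluster[0] in VOWELS:
        return head + cluster[0] + "d" + final
    return None


def stems(word):
    """Return the distinct stems of a noun, basic form first."""
    result = [word]
    if word not in NO_ALTERNATION:
        weak = weaken(word)
        if weak is not None:
            result.append(weak)
    return result
```

Under these assumptions `stems("kukka")` gives the basic form and the weak stem kuka-, `stems("katu")` adds kadu-, and `stems("auto")` returns only the basic form because the word is listed as an exception.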

COVERAGE OF THE PROGRAM

Stemma now covers the generation of Finnish noun stems almost completely. Adjectives are also covered fairly well; only comparison stems and alternations which are specific to some types of multisyllabic adjectives are not handled yet. Pronouns and numerals are also outside the program's present scope.

The only noun phenomena not covered yet are those few compound nouns which have double inflection, i.e. in which both the basic and the modifying part inflect, as for example vanhapoika -> vanhanpoja- ('a bachelor'). This kind of noun is very rare in Finnish. There are fewer than ten of them.

Normal compounds are handled correctly if the last part of the compound is separated from the rest with /, e.g. avio/liitto ('marriage'). If the last part has not been separated, the result may be right or wrong depending on the word. The program does not use comprehensive dictionaries, and no other way of separating the parts of compounds is reliable.
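The slash convention can be sketched as follows, in Python rather than Icon. The function names and the tiny lookup table standing in for the stem generator are hypothetical; the only point shown is that stems are formed from the part after the last / and the modifying part is re-attached unchanged.

```python
def compound_stems(word, stem_fn):
    """Stem only the part after the last '/' and re-attach the modifier."""
    modifier, _, head = word.rpartition("/")
    # Earlier separators, if any, are simply dropped from the modifier.
    modifier = modifier.replace("/", "")
    return [modifier + stem for stem in stem_fn(head)]


# Stand-in for a real stem generator: a hand-written table (assumption).
TOY_STEMS = {"liitto": ["liitto", "liito"]}


def toy_stem_fn(word):
    """Return the stems listed for the word, or the word itself if unknown."""
    return TOY_STEMS.get(word, [word])
```

With this stand-in, `compound_stems("avio/liitto", toy_stem_fn)` yields avioliitto and avioliito-. An unseparated input such as `"avioliitto"` is passed to the stem function as a whole, which is why the result then depends on the word.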

Homographic nouns are not distinguished yet, and thus the program is not able to give all the stems for, say, the noun laki, which is the shared basic form of two nouns that inflect differently (laki, lain 'law' / laki, laen 'crest'). Homographic basic forms are not very common either, and they can easily be covered if necessary.

I have made several test runs on sets of basic noun forms of different sizes. My largest test file includes 3736 nouns, which have been randomly selected from a corpus of some 35 000 non-compound words. When this test file is run, Stemma gives over 12 000 different forms as output (of which the 3736 basic forms are of course the same as the input). The percentage of erroneous forms in this sample is less than one. Other, smaller test corpora (with 1000 and 133 different basic word forms) have given similar results, and it seems that the accuracy and coverage of the program are somewhere between 98 and 99.2 per cent. Stemma is thus a very robust and reliable prototype program for Finnish noun stem generation.
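The figures for the largest test file can be turned into the kind of summary the program prints. The sketch below is in Python, and its inputs are the rounded bounds quoted above rather than exact counts: 12 000 is a lower bound for the number of forms and 120 (one per cent of 12 000) is an assumed upper bound for the erroneous ones.

```python
def run_statistics(words, forms, wrong_forms):
    """Return (mean forms per input word, percentage of erroneous forms)."""
    mean = forms / words
    error_pct = 100.0 * wrong_forms / forms
    return mean, error_pct


# Rounded bounds from the text, not exact counts (assumption).
mean, error_pct = run_statistics(words=3736, forms=12000, wrong_forms=120)
print(f"{mean:.2f} forms per word, at most {error_pct:.1f} per cent erroneous")
```

With these bounds the mean comes out at a little over 3.2 forms per input noun.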

POSSIBLE USES OF STEMMA

A program like Stemma is not of much interest in itself, since noun stems per se are not very useful. But as a tool program for different applications Stemma has some interesting prospects. It has at least four kinds of possible uses: in computer aided language learning, in morphological research, as a part of more sophisticated computational linguistic programs and as a part of an information retrieval system for Finnish. A CALL program could easily be developed from Stemma simply by adding some procedures that take care of concatenating the right suffixes to the stems given by Stemma. This could be done either directly or in such a fashion that the program first shows the stems and then asks the user to produce a certain inflected form. After getting the user's proposal, the program could check whether it is correctly formed.
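The checking step of such a CALL program might look like the following Python sketch. The case table is an assumption made for illustration: it lists only four singular cases, uses back-vowel suffix variants only, and is meant for a word of the kukka type ('flower'), whose strong and weak stems are supplied by hand here instead of coming from a stem generator.

```python
# Hypothetical table: which stem grade and which suffix each case takes.
# Back-vowel variants only; valid for words like kukka, not in general.
CASES = {
    "genitive": ("weak", "n"),
    "inessive": ("weak", "ssa"),
    "partitive": ("strong", "a"),
    "essive": ("strong", "na"),
}


def expected_form(stems, case):
    """Concatenate the stem of the required grade with the case suffix."""
    grade, suffix = CASES[case]
    return stems[grade] + suffix


def check_answer(stems, case, answer):
    """Return True if the learner's answer equals the expected form."""
    return answer.strip().lower() == expected_form(stems, case)


kukka = {"strong": "kukka", "weak": "kuka"}
print(check_answer(kukka, "genitive", "kukan"))   # weak stem + n
print(check_answer(kukka, "genitive", "kukkan"))  # wrong grade
```

Front-vowel suffix variants, plural forms and the remaining cases would need further rules that are not sketched here.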

As a research tool Stemma could serve in large scale quantitative analysis of Finnish noun stem formation, since it is able to produce the stems comparatively fast and reliably. It probably needs some changes for this use depending on the aims of the research, but these should be easy to make.

Because Stemma works only in the generation direction, it could be used as a module in text generation, e.g. in machine translation. As in the CALL application, this would only need the formulation of the rules which take care of producing the right forms.

Information retrieval programs for Finnish texts are especially hampered by the complex stem alternations occurring in Finnish words. For information retrieval purposes a program like Stemma could be used to produce all the alternating forms of the search key. And as information retrieval is done almost totally with content words, i.e. nouns, Stemma's capabilities are readily suitable for the work. The only modification that might be needed is some simple mechanism to check that hits given by the stem forms are really possible inflected forms of the input key and not spurious matches.
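The retrieval idea, including a simple check against spurious hits, can be sketched in Python as follows. The ending list is a small illustrative assumption (a handful of singular case endings and the nominative plural, back-vowel variants only), and the stems are given by hand. The check is deliberately loose: it does not pair each ending with the stem grade it requires, so it would also accept some non-existent forms.

```python
# Illustrative subset of endings; back-vowel variants only (assumption).
ENDINGS = ("", "n", "t", "a", "ssa", "sta", "lla", "lta", "lle", "na", "ksi")


def candidate_forms(stems):
    """Every stem combined with every ending in the illustrative list."""
    return {stem + ending for stem in stems for ending in ENDINGS}


def search(tokens, stems):
    """Split tokens that begin with a stem into plausible hits and rejects."""
    forms = candidate_forms(stems)
    hits, rejected = [], []
    for token in tokens:
        word = token.lower()
        if any(word.startswith(stem) for stem in stems):
            (hits if word in forms else rejected).append(token)
    return hits, rejected


tokens = ["kukan", "kukassa", "kukkaa", "kukkaro", "kukaan", "talo"]
hits, rejected = search(tokens, ["kukka", "kuka"])
print(hits)
print(rejected)
```

Here kukkaro ('purse') begins with the stem kukka- but is rejected because the remainder is not a listed ending, which is the kind of noise a plain prefix search would let through.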

TECHNICAL INFORMATION AND CHOICE OF IMPLEMENTATION LANGUAGE

Stemma has been implemented in the programming language Icon, which has been developed at the University of Arizona in Tucson by Ralph Griswold and his colleagues (Griswold & Griswold 1990). Icon is a kind of successor to Snobol, and has some very good features for linguistic programming. Of particular interest are Icon's ready-made string manipulation functions and its rich repertoire of data structures.

Mats Eeg-Olofson (1987) has considered different criteria for choosing a suitable language to describe morphological phenomena in a computer oriented framework. According to him, a suitable programming language should be on a high level of abstraction, its data structures should help to model the phenomena being described, etc. It seems that Icon fulfils most of these needs and is a good choice for anyone interested in morphological and other linguistic programming. Its string handling and pattern matching capabilities are excellent, it has a rich choice of data structures, which include sets both for characters and as a general structure, and it has many ready-made string functions which make programming in it easier than usual. The untyped variables of the language are also suitable for this kind of loosely structured work, where new variables are often needed in the midst of programming (cf. Holman 1988).

The size of Stemma is now about 30 kilobytes of source code, which makes some 52 kilobytes of compiled i-code. The whole program has about 730 non-empty and non-comment lines, which also include the six built-in vocabularies. The implementation was done first with Icon version 7.5, but it runs equally well on version 8.0 with no modifications in either direction. On an ordinary 12 MHz AT-clone the program handles some 3-5 nouns per second depending on the machine and the complexity of processing caused by the word.

Stemma runs on IBM compatible MS-DOS machines which have at least 512 kilobytes of RAM, but it is also easily portable to any computer which runs Icon, including mainframes and workstations, thanks to Icon's good portability to different computing platforms.

EVALUATION OF THE PROGRAM

Palmer & Finin (1990) have used two different terms for two ways of looking at computational linguistic programs. A program can be evaluated either as a black box model or as a glass box model. When a program is considered as a black box model, the evaluator cannot see how the program is internally organized, and the only evaluation criteria are based on the working of the program. When the internal organization of the program can also be studied, the program is evaluated as a glass box model. It is rather difficult to evaluate one's own programs, but it seems to me that Stemma is working well at least from the black box point of view. It works reliably and has a high percentage of correctness. From a strict linguistic glass box point of view it may have some inadequacies, since certain phenomena are not described in an orthodox linguistic way. But most of its rules are based on standard descriptions of Finnish morphology, and thus it can be seen as a very concrete level implementation of Finnish noun stem synthesis.

CONCLUSION

Computational linguistic programs are often considered difficult, time consuming and laborious to make. It seems that this is only partially true, and there exist certain areas which can be tackled reasonably well with basic knowledge of programming and normal linguistic analysis. Working with Stemma has been a learning process and my experiences are very encouraging, since they show that any linguist who has a working knowledge of a good high level programming language such as Icon can make a useful program of considerable coverage and complexity in just a few months. Virtuosity in programming is not always necessary in order to create something useful.

REFERENCES

Eeg-Olofson, Mats 1987. Software Systems for Computational Morphology - An Overview. Nordic Journal of Linguistics 10: 1-34.

Griswold, Ralph E. & Griswold, Madge T. 1990. The Icon Programming Language. Second edition. Prentice Hall, New Jersey.

Holman, Eugene 1988. Finnmorf: a Computerized Reference Tool for Students of Finnish Morphology. Computers and the Humanities 22: 165-72.

Jäppinen, Harri & Ylilammi, Matti 1986. Associative Model of Morphological Analysis: an Empirical Inquiry. Computational Linguistics 12: 257-72.

Koskenniemi, Kimmo 1983. Two-Level Morphology: A General Computational Model for Word-Form Recognition and Production. Publications of the Department of General Linguistics, University of Helsinki.

Palmer, Martha & Finin, Tim 1990. Workshop on the Evaluation of Natural Language Processing Systems. Computational Linguistics 16: 175-81.