Ch 3: ENGLISH MORPHOLOGY (USING FSAs)

2-level system for building forms of words (e.g. plurals):
  1. morphological rules for the concept
  2. then spelling rules for the details

morphological rules - how to form the plural (or other cases) in English
  noun  --> noun + '-s'
  fish  --> fish
  goose --> geese
  etc.

spelling rules -
  baby --> babies
  (we're still adding -s; only the details of how the output looks have changed)
  spelling rules change only the _surface form_

--
Need for a lexicon (dictionary):
  potato --> potatoes
  tomato --> tomatoes
  photo  --> photos
  mango  --> mangoes, mangos

--
SOME TERMINOLOGY:
  morpheme - minimal unit of meaning
    e.g. 'b' doesn't have a meaning, so it is not a morpheme
  morphemes can be categorized as stems or affixes
  affixes = prefix, suffix, infix (English doesn't have infixes)

--
BUILDING CORRECT MORPHOLOGICAL FORM
  Input: form in lexicon (= database = dictionary)
    - base form
    - features (e.g. +PL = plural)
  Output: correctly spelled word
  Processing: 2 levels
    1. morphological rules - how to form the plural (or other cases) in English
       e.g. noun --> noun + '-s'
    2. spelling rules
       e.g. baby +PL --> babies

--
MORPHOLOGICAL PARSING
  = inverse of building morphological form
  input: correctly spelled word
  output: root form and feature list
    e.g. geese --> goose +PL
  parsing - given an item (word, sentence, etc.), find its structure

--
LANGUAGE STRUCTURE AFFECTS DIFFICULTY OF MORPHOLOGICAL PARSING

ENGLISH IS EASY:
  English only has a few affixes (-s, -ed, -ing)

--
ENGLISH HAS CONCATENATIVE MORPHOLOGY
  some languages have non-concatenative morphology
    e.g. Bahasa Indonesia: orang --> orang-orang ('person' --> 'people', by reduplication)

--
HIGHLY INFLECTED LANGUAGES REQUIRE MORE WORK:
  Other languages have more _case markers_ on nouns,
    e.g. affixes representing +FROM, +TO, etc.
  You might consider 's to be a case marker in English, i.e. +POSS (possessive),
  but there are no others.
  Many languages have 4-5 cases; a few have 10-15.
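
  To make the 2-level process from BUILDING CORRECT MORPHOLOGICAL FORM concrete,
  here is a minimal Python sketch for English +PL and +POSS. It is plain
  dictionary lookup and string rules, not the finite-state machinery the
  chapter uses. The toy lexicon, the '^' boundary marker, and all function
  names are assumptions made for illustration. Parsing is done by brute-force
  generate-and-test, which only works because the lexicon is tiny.

```python
# Hypothetical toy lexicon: irregular plurals and per-word spelling exceptions.
IRREGULAR_PLURALS = {"fish": "fish", "goose": "geese"}
O_TAKES_ES = {"potato", "tomato", "mango"}  # 'photo' is absent, so it gets plain -s
LEXICON = ["baby", "fish", "goose", "photo", "potato"]
FEATURES = ["+PL", "+POSS"]


def morphological_rule(base, feature):
    """Level 1: pick the abstract form, e.g. 'baby' +PL -> 'baby^s'."""
    if feature == "+PL":
        if base in IRREGULAR_PLURALS:
            return IRREGULAR_PLURALS[base]
        return base + "^s"  # '^' marks the morpheme boundary (an assumption)
    if feature == "+POSS":
        return base + "^'s"
    raise ValueError(f"unknown feature: {feature}")


def spelling_rule(intermediate):
    """Level 2: fix up the surface form only."""
    if "^" not in intermediate:
        return intermediate  # irregular form, nothing to adjust
    stem, suffix = intermediate.split("^")
    if suffix == "s":
        # consonant + y: baby^s -> babies
        if stem.endswith("y") and len(stem) > 1 and stem[-2] not in "aeiou":
            return stem[:-1] + "ies"
        # sibilant endings and lexically listed -o words take -es
        if stem.endswith(("s", "x", "z", "ch", "sh")) or stem in O_TAKES_ES:
            return stem + "es"
    return stem + suffix


def generate(base, feature):
    """Build the correctly spelled word from base form + feature."""
    return spelling_rule(morphological_rule(base, feature))


def parse(surface):
    """Inverse direction: try every (base, feature) pair in the toy lexicon."""
    analyses = []
    for base in LEXICON:
        if surface == base:
            analyses.append((base, ""))
        for feature in FEATURES:
            if generate(base, feature) == surface:
                analyses.append((base, feature))
    return analyses


print(generate("baby", "+PL"))   # babies
print(generate("photo", "+PL"))  # photos
print(parse("geese"))            # [('goose', '+PL')]
```

  Note that parse("fish") returns two analyses (bare form and +PL): even in
  English, parsing can be ambiguous where building is not.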
  Highly inflected languages can also have many affixes for verbs:
    In Spanish, +FUTURE, +UNREAL, and +1sing ("I") are all combined to index
    into a table of affixes (e.g. 'amare' = 'that I might love')

--
AGGLUTINATIVE LANGUAGES HAVE DIFFERENT ISSUES:
  Compare the difference between a language like Spanish (fig. 3.1) and
  an agglutinative language like Turkish or Finnish.
  In Turkish, each of these components of meaning will have a separate affix.
  (Might be easier to build, but there are combination rules, such as
  vowel harmony...)

  What about parsing? Here are some data about Turkish:

  Corpus    Word Tokens   Word Types       Distinct   Distinct   Compression
            (instances)   (unique words)   Terms      Stems      (%)
  Turkish   376,187       49,479           41,370     6,363      84.6
  English   567,574       19,044           18,348     11,671     36.4

  (Compression = reduction from distinct terms to distinct stems,
  e.g. (41,370 - 6,363) / 41,370 = 84.6% for Turkish.)

--
ROOT-AND-PATTERN LANGUAGES HAVE THEIR OWN ISSUES:
  Semitic languages (Hebrew, Aramaic, Arabic, etc.) have
  template-based morphology = root and pattern system

  In these languages, roots have only consonants.
  There are 3 levels of derivation:
    1. root --> stem ("word")
    2. stem --> inflected form
    3. inflected form --> final form (spelling/pronunciation rules)

  Derivational affixes are not important in these languages;
  the patterns take their place.

  e.g. Hebrew KTB = 'write' (root)
  Here are a number of verb stems derived from KTB:
    KaTaB   = to write
    KiTTeiB = to inscribe (+INTENSIVE)
    NKTB    = to be written (+PASSIVE)
    HKTiB   = to cause to write (+CAUSATIVE)
    HuKTB   = to cause to be written (+CAUSATIVE +PASSIVE)
    HTKTB   = to write itself (+REFLEXIVE)
  Each of these forms is a verb with inflectional forms (I wrote, you wrote, etc.)

  These patterns are semi-productive.
  Verb patterns are more likely to be productive than noun patterns.
  E.g. some noun patterns for KTB:
    KTiBa = signature
    KTuBa = marriage license
  (Spelling/pron. rules: KTiBa --> k'tiva)

--
WHAT IS THE ROLE OF DERIVATIONAL AFFIXES?
  2 ways to build words - inflection and derivation

  An example of derivational affixes in English:
    antidisestablishmentarianism
  One of the longest English words that is not made up.
    anti-      opposition to
    dis-       the removal of
    establish  (root) to create/maintain something
    -ment      the activity of
    -arian     people who believe in
    -ism       the philosophy of
  So we are talking about the philosophy of people who believe in opposing
  the removal of something
    = people who are in favor of maintaining something.
  (In this case, the 'something' is the Church of England's status as the
  official state church, in the 19th century.)

--
PRODUCTIVE SUFFIXES:
  productive suffix - can apply to anything of the right category, e.g. '-s'
  In general, inflectional suffixes are productive, but derivational
  suffixes aren't fully productive.
    derivational suffixes: -tion, -ee, -er, -ize, etc.
  Derivational suffixes are semi-productive:
    - can't always apply them
    - don't have precise meanings like inflectional suffixes

--
STEMMING
  Why do we want to do stemming?
    - NLU (natural language understanding)
    - machine translation
    - information retrieval?

CAN WE DO STEMMING WITHOUT A LEXICON?
  Why do we want to do stemming without a lexicon?
  Porter (1980): App. B of textbook (pp. 833-836)
  Problems w/ Porter stemmer (textbook, p. 83)

COPING WITH ENGLISH SPELLING/PRONUNCIATION:
  Soundex algorithm (textbook, p. 89):
    words that sound similar --> same code number
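
  As an illustration of "words that sound similar --> same code", here is a
  minimal Python sketch of one common variant of Soundex (American Soundex:
  first letter plus three digits). The textbook's statement of the rules may
  differ in small details, such as how 'h' and 'w' are treated, so take this
  as a sketch under that assumption. The names CODES and soundex are
  illustrative.

```python
# Letter groups that share a digit in American Soundex.
CODES = {}
for letters, digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                       ("l", "4"), ("mn", "5"), ("r", "6")):
    for ch in letters:
        CODES[ch] = digit


def soundex(word):
    """Return a 4-character code: first letter + 3 digits, zero-padded."""
    letters = [c for c in word.lower() if c.isalpha()]
    if not letters:
        return ""
    first = letters[0]
    digits = []
    prev = CODES.get(first, "")  # the first letter's code also blocks repeats
    for ch in letters[1:]:
        code = CODES.get(ch, "")  # vowels, h, w, y have no code
        if code and code != prev:
            digits.append(code)
        # In this variant h and w do not separate equal codes; vowels do.
        if ch not in "hw":
            prev = code
    return (first.upper() + "".join(digits) + "000")[:4]


print(soundex("Robert"))  # R163
print(soundex("Rupert"))  # R163
```

  Both names map to the same code, which is what makes the scheme useful for
  matching words that are spelled differently but sound alike.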