STEMMING = given a word, find its stem

Why do we want to do stemming?
- NLU (natural language understanding)
- machine translation
- information retrieval?

--

CAN WE DO STEMMING WITHOUT A LEXICON?

Why do we want to do stemming without a lexicon?

Porter (1980)
Section 3.4, App. B of textbook (pp. 833-836)

--

In each step, apply only the rule w/ the longest matching LHS (left-hand side)

step 0: Break up words as follows:

    (C+) (V+C+)* (V+)

    C = consonant, V = vowel; the leading (C+) and trailing (V+) are optional

    m = measure (approx. = # of syllables) = # of V+C+

    m=0: tree
    m=1: trouble
    m=2: troubles

--

step 1: -s (nouns & verbs)

    sses --> ss     caresses   caress
    ies  --> i      ponies     poni
    ss   --> ss     caress     caress
    s    --> nil    cats       cat

step 2a: rest of verbal morphology
step 2b: cleanup
step 3: y --> i
step 4: derivational, part 1 (multiple suffixes)
step 5: more of same
step 6: single derivational suffixes
step 7a: cleanup (final e)
step 7b: more cleanup (final double consonants)

--

Problems w/ Porter stemmer (textbook, p. 83)

errors of commission:
    organization --> organ
    doing --> doe
    generalization --> generic

errors of omission:
    explain, explanation
    analysis, analyses
    noise, noisy

In English, stemming doesn't necessarily improve IR (information retrieval) performance anyhow.
Why not? Because the form you want is likely to appear in the document, assuming it is large enough.

What about in a more highly inflected language?
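--

Sketch (not Porter's full algorithm): the measure m from step 0 and the step 1 -s rules, with the longest matching LHS winning. Assumptions: a, e, i, o, u are vowels; y counts as a vowel only after a consonant (Porter's convention); input is lowercase; m is computed on the whole word, as in the tree/trouble/troubles examples. The names is_consonant, measure, STEP1_RULES and step1 are made up for illustration.

```python
def is_consonant(word, i):
    """True if word[i] is a consonant under the assumed Porter convention."""
    ch = word[i]
    if ch in "aeiou":
        return False
    if ch == "y":
        # y is a consonant word-initially or after a vowel,
        # and a vowel when it follows a consonant (e.g. "by").
        return i == 0 or not is_consonant(word, i - 1)
    return True


def measure(word):
    """Count V+C+ groups: each vowel-to-consonant transition closes one group."""
    pattern = "".join(
        "C" if is_consonant(word, i) else "V" for i in range(len(word))
    )
    return sum(1 for a, b in zip(pattern, pattern[1:]) if a == "V" and b == "C")


# Step 1 rules as (LHS suffix, RHS replacement) pairs; "" stands for nil.
STEP1_RULES = [("sses", "ss"), ("ies", "i"), ("ss", "ss"), ("s", "")]


def step1(word):
    """Apply only the step 1 rule whose LHS is the longest matching suffix."""
    matches = [(lhs, rhs) for lhs, rhs in STEP1_RULES if word.endswith(lhs)]
    if not matches:
        return word
    lhs, rhs = max(matches, key=lambda rule: len(rule[0]))
    return word[: len(word) - len(lhs)] + rhs


if __name__ == "__main__":
    for w in ["tree", "trouble", "troubles"]:
        print(w, "m =", measure(w))
    for w in ["caresses", "ponies", "caress", "cats"]:
        print(w, "-->", step1(w))
```

The longest-match choice is what keeps caress intact: both ss and s match, but ss --> ss is longer and wins, so the final s is not stripped.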