STEMMING = given a word, find its stem

Why do we want to do stemming?

- NLU (natural language understanding)
- machine translation
- information retrieval?

--

CAN WE DO STEMMING WITHOUT A LEXICON?

Why do we want to do stemming without a lexicon?

Porter (1980)

Section 3.4, App. B of textbook (pp. 833-836)

--

In each step, apply only the rule w/ the longest matching LHS (left-hand side)

step 0:

Break up words as follows:

(C*) (V+C+)* (V*)

C = consonant, V = vowel; the leading C run and the trailing V run are optional

m = measure (approx. = # of syllables)

= # of V+C+ groups in the pattern above

m=0: tree
m=1: trouble
m=2: troubles
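
A minimal sketch of computing m, assuming Porter's usual convention that y
counts as a vowel when it follows a consonant. The helper names is_consonant
and measure are made up for illustration; they are not from the textbook.

```python
VOWELS = set("aeiou")

def is_consonant(word, i):
    """True if word[i] acts as a consonant (assumed convention for y)."""
    ch = word[i]
    if ch in VOWELS:
        return False
    if ch == "y":
        # y is a consonant word-initially or after a vowel, else a vowel
        return i == 0 or not is_consonant(word, i - 1)
    return True

def measure(word):
    """Count the V+C+ groups, i.e. the places where a vowel run ends."""
    pattern = "".join(
        "C" if is_consonant(word, i) else "V" for i in range(len(word))
    )
    # Each V immediately followed by a C closes one V+C+ group
    return sum(1 for a, b in zip(pattern, pattern[1:]) if a == "V" and b == "C")

for w in ["tree", "trouble", "troubles"]:
    print(w, measure(w))
```

For the three words above this prints 0, 1, and 2.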

--

step 1: -s (nouns & verbs)

sses --> ss  caresses caress
ies --> i    ponies   poni
ss --> ss    caress   caress
s --> nil    cats     cat
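
A small sketch of step 1 with the longest-match rule, using only the four
rewrites in the table above. The names STEP1_RULES and step1 are hypothetical;
this is not the full Porter implementation.

```python
# (suffix, replacement) pairs from the step 1 table
STEP1_RULES = [("sses", "ss"), ("ies", "i"), ("ss", "ss"), ("s", "")]

def step1(word):
    """Apply only the rule whose LHS is the longest matching suffix."""
    by_length = sorted(STEP1_RULES, key=lambda rule: len(rule[0]), reverse=True)
    for suffix, replacement in by_length:
        if word.endswith(suffix):
            return word[: len(word) - len(suffix)] + replacement
    return word  # no rule matched: leave the word unchanged

for w in ["caresses", "ponies", "caress", "cats"]:
    print(w, "-->", step1(w))
```

The "ss --> ss" rule looks like a no-op, but it matters: because it is longer
than "s", it wins for "caress" and blocks the "s --> nil" rule from firing.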


step 2a: rest of verbal morphology

step 2b: cleanup

step 3: y --> i

step 4: derivational, part 1 (multiple suffixes)

step 5: more of same

step 6: single derivational suffixes

step 7a: cleanup (final e)

step 7b: more cleanup (final double consonants)


--

Problems w/ Porter stemmer (textbook, p. 83)

errors of commission (unrelated words wrongly conflated):

organization --> organ
doing --> doe
generalization --> generic


errors of omission (related words not conflated):

explain, explanation
analysis, analyses
noise, noisy


In English, stemming doesn't necessarily improve IR (information retrieval)
performance anyhow.

Why not? Because the form you want is likely to appear in a relevant
document anyway, assuming the document is large enough.

What about in a more highly inflected language?

-