Stemming vs Lemmatization

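Stemming and lemmatization are both text normalization techniques used in natural language processing (NLP) to reduce words to a base or root form. They differ in approach: stemming strips word endings by rule, while lemmatization maps each word to its dictionary form, which is slower but more accurate.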

Key Differences

| Aspect | Stemming | Lemmatization |
|---|---|---|
| Definition | Chops off word endings to get the stem (may not be a real word). | Reduces a word to its dictionary root (lemma) using vocabulary and part of speech (POS). |
| Method | Rule-based or algorithmic cutting of suffixes. | Dictionary-based, using morphological analysis. |
| Output | May not be a valid word (e.g., “studies” → “studi”). | Valid root word with proper meaning (e.g., “studies” → “study”). |
| Speed | Generally faster. | Slower due to more complex processing. |
| Accuracy | Less accurate; may produce stems that aren't actual words. | More accurate and linguistically meaningful. |
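
The contrast in the "Method" row can be sketched with two toy functions. This is only an illustration under stated assumptions: the suffix list and the mini-dictionary are hypothetical, and neither function is the Porter algorithm or the WordNet lemmatizer.

```python
# Toy illustration only: neither function reproduces NLTK's behavior.

# Hypothetical suffix rules, checked in order (longest first).
SUFFIXES = ("ies", "ing", "es", "s")


def toy_stem(word: str) -> str:
    """Strip the first matching suffix without checking the result is a word."""
    for suffix in SUFFIXES:
        # Keep at least three characters so very short words are left alone.
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word


# Hypothetical mini-dictionary keyed by (word, part of speech).
LEMMA_TABLE = {
    ("studies", "n"): "study",
    ("wolves", "n"): "wolf",
    ("geese", "n"): "goose",
    ("was", "v"): "be",
    ("better", "a"): "good",
}


def toy_lemmatize(word: str, pos: str = "n") -> str:
    """Look the word up for the given POS; fall back to the word itself."""
    return LEMMA_TABLE.get((word, pos), word)


for word, pos in [("wolves", "n"), ("geese", "n"), ("better", "a")]:
    print(f"{word:8} stem={toy_stem(word):8} lemma={toy_lemmatize(word, pos)}")
```

The rule-based function never consults a vocabulary, so it can return fragments such as "wolv", while the lookup returns a real word only when it is given the right part of speech.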

Examples Comparing Stemming and Lemmatization

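import nltk
from nltk.stem import PorterStemmer, WordNetLemmatizer

# WordNetLemmatizer needs the WordNet corpus; this downloads it on first use.
nltk.download("wordnet", quiet=True)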

ps = PorterStemmer()
lemmatizer = WordNetLemmatizer()

words = ["running", "runs", "runner", "better", "cats", "studies", "wolves", "was", "geese"]

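# Without a POS argument, lemmatize() treats every word as a noun,
# so "running" and "better" come back unchanged in this loop.
for word in words:
    stem = ps.stem(word)
    lemma = lemmatizer.lemmatize(word)
    print(f"Word: {word:10} | Stem: {stem:10} | Lemma: {lemma}")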

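# With a POS tag, the lemmatizer can resolve forms that the noun default misses.
print("Lemmatize 'better' as adjective:", lemmatizer.lemmatize("better", pos="a"))
print("Lemmatize 'running' as verb:", lemmatizer.lemmatize("running", pos="v"))
print("Stem 'better':", ps.stem("better"))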
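
Because the lemmatizer depends on the POS argument, one common approach is to tag each word first and translate the tag into the single-letter codes that `lemmatize()` accepts. The helper below is a minimal sketch of that translation; the name `penn_to_wordnet_pos` and the sample tags are assumptions for illustration, not part of NLTK.

```python
def penn_to_wordnet_pos(tag: str) -> str:
    """Map a Penn Treebank tag to a WordNet POS letter (hypothetical helper)."""
    if tag.startswith("JJ"):
        return "a"  # adjective: JJ, JJR, JJS
    if tag.startswith("VB"):
        return "v"  # verb: VB, VBD, VBG, VBN, VBP, VBZ
    if tag.startswith("RB"):
        return "r"  # adverb: RB, RBR, RBS
    return "n"      # everything else falls back to noun, the lemmatizer default


# Hand-written (word, tag) pairs standing in for a tagger's output.
tagged = [("geese", "NNS"), ("was", "VBD"), ("running", "VBG"), ("better", "JJR")]

for word, tag in tagged:
    print(f"{word:8} {tag:4} -> pos='{penn_to_wordnet_pos(tag)}'")
```

In practice the pairs would likely come from `nltk.pos_tag`, which needs its own tagger model downloaded, and each word would then be passed as `lemmatizer.lemmatize(word, pos=penn_to_wordnet_pos(tag))`.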

Summary