Stemming vs Lemmatization

Stemming and lemmatization are both text normalization techniques used in natural language processing (NLP) to reduce words to a base or root form. They differ in approach and accuracy: stemming strips affixes by rule, while lemmatization looks up the word's dictionary form.

Key Differences

Aspect	Stemming	Lemmatization
Definition	Chops off word endings to get the stem (may not be a real word).	Reduces a word to its dictionary form (lemma) using a vocabulary and the word's part of speech (POS).
Method	Rule-based or algorithmic cutting of suffixes.	Dictionary-based and uses morphological analysis.
Output	May not be a valid word (e.g., “studies” → “studi”).	Valid dictionary word (e.g., “studies” → “study”).
Speed	Generally faster.	Slower due to more complex processing.
Accuracy	Less accurate; may produce stems that aren't actual words.	More accurate and linguistically meaningful, provided the correct POS is supplied.
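
To make the "Method" row concrete, here is a minimal sketch that contrasts rule-based suffix stripping with dictionary lookup. It is a toy illustration only: `toy_stem`, `toy_lemmatize`, `SUFFIXES`, and `LEMMA_TABLE` are hypothetical names invented for this example, and the rules are far simpler than the Porter algorithm or WordNet.

```python
# Toy illustration only: this is NOT the Porter algorithm and NOT WordNet.
# All names and rules below are assumptions made for this sketch.

SUFFIXES = ("ies", "es", "ing", "s")  # hypothetical rule list, tried in order


def toy_stem(word: str) -> str:
    """Strip the first matching suffix, with no check that the result is a word."""
    for suffix in SUFFIXES:
        # Keep at least three characters so very short words are left alone.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


# Hypothetical lookup table keyed by (word, part of speech).
LEMMA_TABLE = {
    ("studies", "n"): "study",
    ("wolves", "n"): "wolf",
    ("geese", "n"): "goose",
    ("was", "v"): "be",
    ("better", "a"): "good",
    ("running", "v"): "run",
}


def toy_lemmatize(word: str, pos: str = "n") -> str:
    """Look the word up for the given POS; return it unchanged if it is unknown."""
    return LEMMA_TABLE.get((word, pos), word)


for w in ["studies", "wolves", "geese", "running"]:
    print(f"{w:10} stem={toy_stem(w):10} lemma(n)={toy_lemmatize(w)}")
```

The stripping rules turn "studies" into "stud" and "wolves" into "wolv", neither of which is a word, and they cannot handle the irregular plural "geese" at all. The lookup returns "study", "wolf", and "goose", but only for entries it knows and only when the POS matches: "running" is listed as a verb, so looking it up as a noun returns it unchanged.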

Examples Comparing Stemming and Lemmatization

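import nltk
from nltk.stem import PorterStemmer, WordNetLemmatizer

# WordNetLemmatizer needs the WordNet corpus; fetch it once if it is missing.
# (Some NLTK versions may also require the "omw-1.4" resource.)
nltk.download("wordnet", quiet=True)

ps = PorterStemmer()
lemmatizer = WordNetLemmatizer()

words = ["running", "runs", "runner", "better", "cats", "studies", "wolves", "was", "geese"]

# lemmatize() treats every word as a noun (pos="n") unless told otherwise,
# so verbs and adjectives such as "running" or "better" are not reduced here.
for word in words:
    stem = ps.stem(word)
    lemma = lemmatizer.lemmatize(word)
    print(f"Word: {word:10} | Stem: {stem:10} | Lemma: {lemma}")

# Examples with POS for lemmatizer
print("Lemmatize 'better' as adjective:", lemmatizer.lemmatize("better", pos="a"))
print("Lemmatize 'running' as verb:", lemmatizer.lemmatize("running", pos="v"))
print("Stem 'better':", ps.stem("better"))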
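
Because the lemmatizer's result depends on the `pos` argument, a common pattern is to derive that argument from a part-of-speech tagger's output. The sketch below assumes Penn Treebank style tags (the tagset NLTK's default English tagger is generally described as using) and maps them to the single-letter codes that `lemmatize` accepts. The helper name `penn_to_wordnet` and the hand-written tags are hypothetical, chosen for illustration.

```python
def penn_to_wordnet(tag: str) -> str:
    """Map a Penn Treebank tag (e.g. 'VBD') to a WordNet POS letter.

    Simplified assumption: anything that is not an adjective, verb,
    or adverb is treated as a noun, which is also the lemmatizer's default.
    """
    if tag.startswith("JJ"):
        return "a"  # adjective
    if tag.startswith("VB"):
        return "v"  # verb
    if tag.startswith("RB"):
        return "r"  # adverb
    return "n"      # noun (fallback)


# Hand-written (word, tag) pairs standing in for real tagger output.
tagged = [("was", "VBD"), ("better", "JJR"), ("geese", "NNS"), ("quickly", "RB")]

for word, tag in tagged:
    print(f"{word:8} {tag:4} -> pos='{penn_to_wordnet(tag)}'")
```

The returned letter can then be passed as the `pos` argument, for example `lemmatizer.lemmatize(word, pos=penn_to_wordnet(tag))`, so that verbs and adjectives are looked up under the right category instead of being treated as nouns.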

Summary

Use stemming when speed is important and slight inaccuracies are acceptable.
Use lemmatization when accuracy and meaningful roots are required.
NLP pipelines typically choose between the two based on task requirements rather than applying both.