Stopwords in Natural Language Processing

Stopwords are commonly used words in a language that usually add little meaning to a sentence on their own; examples include "is", "and", "the", and "in". Removing stopwords is a common preprocessing step in NLP that lets downstream analysis focus on the words that carry meaning.

Why Remove Stopwords?

Stopwords account for a large share of the tokens in most texts but contribute little to tasks such as text classification, search, or topic modeling. Dropping them shrinks the feature space and reduces noise, so the words that remain carry more signal. (Example 9 below shows a counterexample from sentiment analysis, where removal can backfire.) A short sketch of the effect follows.

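As a rough illustration, the snippet below compares the vocabulary that scikit-learn's CountVectorizer builds with and without its built-in English stopword list:


from sklearn.feature_extraction.text import CountVectorizer

docs = ["This is a sample sentence, showing off the stopwords filtration."]

# Fit one vectorizer that keeps every word and one that drops English stopwords.
keep_all = CountVectorizer().fit(docs)
drop_stop = CountVectorizer(stop_words='english').fit(docs)

print("Features kept:", list(keep_all.get_feature_names_out()))
print("Features after stopword removal:", list(drop_stop.get_feature_names_out()))
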
Common Stopwords List

Each language has its own set of stopwords. For English, the NLTK library ships a widely used list, available through stopwords.words('english').
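
A quick way to inspect the English list (the download call is only needed once):


import nltk
from nltk.corpus import stopwords

nltk.download('stopwords')

# The list is a plain Python list of lowercase words and contractions.
english_stopwords = stopwords.words('english')
print("Number of English stopwords:", len(english_stopwords))
print("First ten:", english_stopwords[:10])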

10 Examples of Working with Stopwords in Python

Example 1: Basic Stopwords Removal


import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

# Download the stopword list and the tokenizer models
# (recent NLTK releases may also need nltk.download('punkt_tab')).
nltk.download('stopwords')
nltk.download('punkt')

stop_words = set(stopwords.words('english'))
text = "This is a sample sentence, showing off the stopwords filtration."

# Tokenize, then keep only the tokens that are not stopwords.
words = word_tokenize(text)
filtered_sentence = [w for w in words if w.lower() not in stop_words]

print("Original:", words)
print("Filtered:", filtered_sentence)

Example 2: Custom Stopwords List


# Extend the stopword set from Example 1 with task-specific words.
custom_stopwords = stop_words.union({"sample", "showing"})

filtered_custom = [w for w in words if w.lower() not in custom_stopwords]
print("Filtered with custom stopwords:", filtered_custom)

Example 3: Counting Stopwords in Text


stopword_count = sum(1 for w in words if w.lower() in stop_words)
print("Number of stopwords:", stopword_count)

Example 4: Removing Stopwords Before TF-IDF


from sklearn.feature_extraction.text import TfidfVectorizer

# stop_words='english' applies scikit-learn's built-in English list,
# which is similar to but not identical to NLTK's.
corpus = ["This is a sample sentence showing off the stopwords filtration."]
vectorizer = TfidfVectorizer(stop_words='english')
X = vectorizer.fit_transform(corpus)

print("TF-IDF feature names:", vectorizer.get_feature_names_out())

Example 5: Stopwords in Other Languages (e.g., Spanish)


spanish_stopwords = set(stopwords.words('spanish'))
print("Spanish stopwords sample:", list(spanish_stopwords)[:10])

Example 6: Checking if a Word is a Stopword


print("'the' is stopword?", 'the' in stop_words)
print("'Python' is stopword?", 'Python' in stop_words)

Example 7: Removing Stopwords from a Corpus of Sentences


large_text = ["This is the first sentence.", "Another sentence with more words."]
filtered_corpus = [
    [word for word in word_tokenize(sent) if word.lower() not in stop_words]
    for sent in large_text
]
print("Filtered corpus:", filtered_corpus)

Example 8: Visualizing Word Frequencies After Stopword Removal with WordCloud


from wordcloud import WordCloud
import matplotlib.pyplot as plt

non_stopwords = [w for w in words if w.lower() not in stop_words]
wordcloud = WordCloud().generate(" ".join(non_stopwords))

plt.imshow(wordcloud, interpolation='bilinear')
plt.axis("off")
plt.show()

Example 9: Stopwords and Sentiment Analysis

NLTK's English stopword list includes negation words such as "not", so stripping stopwords before sentiment analysis can change the polarity a model sees, as the comparison below shows.


from textblob import TextBlob

text_with_stopwords = "I am not happy with this horrible experience"
text_without_stopwords = "happy horrible experience"

print("Sentiment with stopwords:", TextBlob(text_with_stopwords).sentiment)
print("Sentiment without stopwords:", TextBlob(text_without_stopwords).sentiment)

Example 10: Stopwords in Topic Modeling


from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "The quick brown fox jumps over the lazy dog",
    "Never jump over the lazy dog quickly"
]

# Build a bag-of-words matrix with scikit-learn's English stopwords removed.
vectorizer = CountVectorizer(stop_words='english')
X = vectorizer.fit_transform(docs)

lda = LatentDirichletAllocation(n_components=2, random_state=42)
lda.fit(X)

# Show the three highest-weighted words for each topic.
feature_names = vectorizer.get_feature_names_out()
print("Topic word distribution:")
for idx, topic in enumerate(lda.components_):
    print(f"Topic {idx}:", [feature_names[i] for i in topic.argsort()[-3:]])