In the age of Artificial Intelligence (AI), natural language processing (NLP) forms the backbone of many modern technologies, from chatbots and search engines to sentiment analysis and text summarization tools. However, before any machine learning model can interpret text data accurately, it must first be cleaned, normalized, and structured. Two essential techniques that play a crucial role in this process are stemming and lemmatization.
These text preprocessing techniques are fundamental for every data scientist, NLP engineer, or AI practitioner. Whether you’re enrolled in artificial intelligence classes online or exploring machine learning independently, understanding stemming and lemmatization will give you the foundation to build more intelligent and context-aware text models.
1. Introduction to Text Normalization
Text normalization is the process of transforming text into a consistent format so that algorithms can process and analyze it effectively. Raw text data often contains variations such as plural forms, verb tenses, and derivations, so a model might treat words like run, running, and ran as separate entities, even though they share the same root meaning.
To overcome this challenge, NLP engineers apply stemming and lemmatization, which reduce words to their base or root form. This step helps models recognize word relationships, improving the accuracy of downstream tasks such as text classification, keyword extraction, or sentiment analysis.
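As a quick preview of what this looks like in practice, here is a minimal sketch using NLTK; the PorterStemmer and WordNetLemmatizer classes it relies on are explained in detail in the sections that follow.
#a quick preview using NLTK; both classes are covered later in this article
#the lemmatizer needs the WordNet data, downloadable once via nltk.download('wordnet')
from nltk.stem import PorterStemmer, WordNetLemmatizer
stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()
#stemming crudely chops endings: 'running' becomes 'run', but the irregular 'ran' is left alone
print([stemmer.stem(word) for word in ['run', 'running', 'ran']])
#lemmatization (here told the words are verbs) maps all three to the dictionary form 'run'
print([lemmatizer.lemmatize(word, pos='v') for word in ['run', 'running', 'ran']])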
What is Stemming?
Stemming is a text normalization technique that cuts affixes off words to extract the base form, or root word. Stemming is a crude process, and the root word, also called the stem, may not have grammatical meaning. In fact, some NLP libraries, such as spaCy, do not include stemming at all.
There are various programs used to carry out stemming, called stemmers or stemming algorithms. NLTK provides the Porter Stemmer, Lancaster Stemmer, Regular Expression Stemmer, and Snowball Stemmer. The most common is the Porter stemming algorithm.
Porter Stemming Algorithm
The Porter Stemming Algorithm is arguably the most popular stemming algorithm in Natural Language Processing. In NLTK, it can be instantiated using the PorterStemmer class. The algorithm takes tokenized words as input and outputs their stems. Let’s take a simple code example using the PorterStemmer class.
#import the PorterStemmer class from the nltk.stem module
from nltk.stem import PorterStemmer
#instantiate the PorterStemmer class
stemmer = PorterStemmer()
#create a list of tokens
tokens = ['smiling', 'smile', 'smiled', 'smiles']
#create an empty list to take in the stemmed words
stemmed_words = []
#loop over each token in the list
for each_word in tokens:
    #stem each word in the list
    stemmed_word = stemmer.stem(each_word)
    #add the stemmed word to the stemmed words list
    stemmed_words.append(stemmed_word)
#print the stemmed words list
print(stemmed_words)
Output:
['smile', 'smile', 'smile', 'smile']
As seen, all variations of the word have been reduced to the root word ‘smile’. As mentioned earlier, some words may not be stemmed into meaningful root words. If we attempt to stem the words ‘cry’, ‘crying’, ‘cries’ and ‘cried’, each one outputs ‘cri’, which has no grammatical meaning.
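A minimal sketch of that behaviour, reusing the stemmer instance from the example above:
#stem the 'cry' family of words with the same PorterStemmer instance
print([stemmer.stem(word) for word in ['cry', 'crying', 'cries', 'cried']])
#with NLTK's default Porter settings this typically prints ['cri', 'cri', 'cri', 'cri']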
Let’s take another example where we pass sentences as input.
#import the PorterStemmer class from the nltk.stem module
from nltk.stem import PorterStemmer
#import the word_tokenize function from the nltk library
#(word_tokenize requires the 'punkt' tokenizer data, downloadable via nltk.download('punkt'))
from nltk import word_tokenize
#instantiate the PorterStemmer class
stemmer = PorterStemmer()
#define some statements
sentences = 'NLTK is a very interesting subject. I have learnt a lot from this website. I will not stop learning'
#tokenize the sentences into words
each_sentence = word_tokenize(sentences)
#create an empty list to take in the stemmed words
stemmed_words = []
#loop over each token in the list
for word in each_sentence:
    #stem each word in the list
    stemmed_word = stemmer.stem(word)
    #add the stemmed word to the stemmed words list
    stemmed_words.append(stemmed_word)
#print the stemmed words list
print(stemmed_words)
Output:
['nltk', 'is', 'a', 'veri', 'interest', 'subject', '.', 'I', 'have', 'learnt', 'a', 'lot', 'from', 'thi', 'websit', '.', 'I', 'will', 'not', 'stop', 'learn']
Again, some of the stemmed words do not have a dictionary meaning. Mind you, there are other stemming algorithms; the only tweak to the code is importing the new stemming algorithm and instantiating it instead, as sketched below.
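Here is a minimal sketch of those swaps; the 'ing' pattern for the regular expression stemmer and the 'english' language for the Snowball stemmer mirror the outputs listed below.
#import the alternative stemmer classes from nltk.stem
from nltk.stem import LancasterStemmer, RegexpStemmer, SnowballStemmer
#instantiate the Lancaster stemmer
lancaster = LancasterStemmer()
#instantiate the regular expression stemmer with 'ing' as the pattern to strip
regexp = RegexpStemmer('ing')
#instantiate the Snowball stemmer for English
snowball = SnowballStemmer('english')
#each stemmer is used exactly like PorterStemmer, for example:
print(lancaster.stem('learning'), regexp.stem('learning'), snowball.stem('learning'))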
- Using the Lancaster stemmer, via the LancasterStemmer class, outputs:
['nltk', 'is', 'a', 'very', 'interest', 'subject', '.', 'i', 'hav', 'learnt', 'a', 'lot', 'from', 'thi', 'websit', '.', 'i', 'wil', 'not', 'stop', 'learn']
- The regular expression stemmer takes a regular expression and cuts off any suffix or prefix that matches the defined expression. Using the regular expression stemmer via the RegexpStemmer class, with 'ing' defined as the regular expression, outputs:
['NLTK', 'is', 'a', 'very', 'interest', 'subject', '.', 'I', 'have', 'learnt', 'a', 'lot', 'from', 'this', 'website', '.', 'I', 'will', 'not', 'stop', 'learn']
- The Snowball stemmer supports stemming in 15 other languages, including Arabic, French, German, Italian, Portuguese, Russian, Spanish, and Swedish. When using the Snowball stemmer, the language has to be defined. Using the Snowball stemmer via the SnowballStemmer class, with the language defined as 'english', outputs:
['nltk', 'is', 'a', 'veri', 'interest', 'subject', '.', 'i', 'have', 'learnt', 'a', 'lot', 'from', 'this', 'websit', '.', 'i', 'will', 'not', 'stop', 'learn']
Let’s now take a look at lemmatization.
What is Lemmatization?
Lemmatization is similar to stemming in that it reduces words to a base form. The difference is that in lemmatization the root word, called the ‘lemma’, is a word with a dictionary meaning. Lemmatization performs a morphological analysis of words to remove inflectional endings and return base words that have dictionary meaning. Lemmatization in the NLTK library is done using the WordNetLemmatizer class, and the methodology is almost the same as using PorterStemmer. Let’s take a simple coding example.
#import the WordNetLemmatizer class from the nltk.stem module
#(the lemmatizer relies on the WordNet corpus, downloadable via nltk.download('wordnet'))
from nltk.stem import WordNetLemmatizer
#instantiate the WordNetLemmatizer class
lemmatizer = WordNetLemmatizer()
#create a list of tokens
tokens = ['crying', 'cry', 'cried']
#create an empty list to take in the lemmatized words
lemmatized_words = []
#loop over each token in the list
for each_word in tokens:
    #lemmatize each word in the list
    lemmatized_word = lemmatizer.lemmatize(each_word)
    #add the lemmatized word to the lemmatized words list
    lemmatized_words.append(lemmatized_word)
#print the lemmatized words list
print(lemmatized_words)
Output:
['cry', 'cry', 'cried']
WordNetLemmatizer outputs meaningful words – cry and cried – as opposed to the ‘cri’ that PorterStemmer returned.
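One detail worth knowing: lemmatize() treats every word as a noun unless you pass the pos argument, so the result can change when you tell it the part of speech. A minimal sketch, reusing the lemmatizer above:
#pos defaults to 'n' (noun); pass pos='v' to lemmatize the word as a verb
print(lemmatizer.lemmatize('cried'))           #treated as a noun, the word comes back unchanged
print(lemmatizer.lemmatize('cried', pos='v'))  #treated as a verb, this typically returns 'cry'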
Use Case Scenarios
Let’s say we have a text we want a machine learning model to understand. We need to preprocess the text using stemming or lemmatization. Let’s take a code example for each of them starting with stemming.
#import the necessary libraries
from nltk import sent_tokenize, word_tokenize
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
#define the text to be preprocessed
text = '''
A part of speech is a classification system in the English Language that reveals the role a word plays in a context.
There are eight parts of speech in the English Language: nouns, pronouns, verbs, adverbs, adjectives, prepositions, conjunctions, and interjections.
POS tagging in simple terms means allocating every word in a sentence to a part of speech. NLTK has a method called pos_tag that performs POS tagging on a sentence.
The methods apply supervised learning approaches that utilize features such as context, the capitulation of words, punctuations, and so on to determine the part of speech.
POS tagging is a critical procedure to understand the meaning of a sentence and know the relationship between words.
'''
#instantiate the PorterStemmer class
stemmer = PorterStemmer()
#tokenize the text into a list of sentences (requires the 'punkt' tokenizer data)
sentence = sent_tokenize(text)
#loop over each sentence based on its index
for index, _ in enumerate(sentence):
    #tokenize each sentence into a list of words
    tokenized_words = word_tokenize(sentence[index])
    #apply the stemmer if the word is not a stopword
    words = [stemmer.stem(word) for word in tokenized_words if word not in set(stopwords.words('english'))]
    #add the stemmed words back to the sentence variable
    sentence[index] = ' '.join(words)
#print the preprocessed sentences
print(sentence)
Output:
Observe that words such as of, in, the, etc. are completely taken out. These are called stopwords. Stopwords do not add significant meaning to a sentence, so it is good practice to remove them.
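If you are curious which words are treated as stopwords, the list is easy to inspect. A minimal sketch (the 'stopwords' resource must be downloaded once with nltk.download('stopwords')):
#import the stopwords corpus from nltk
from nltk.corpus import stopwords
#build the English stopword set once so it is not rebuilt for every word
stop_words = set(stopwords.words('english'))
#print how many English stopwords there are and a small sample
print(len(stop_words))
print(sorted(stop_words)[:10])
Hoisting the set into a variable like this also avoids recomputing it inside the list comprehension, which the examples above do only for brevity.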
Now, let’s carry out lemmatization on the same text and see the result.
#import the necessary libraries
from nltk import sent_tokenize, word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
#define the text to be preprocessed
text = '''
A part of speech is a classification system in the English Language that reveals the role a word plays in a context.
There are eight parts of speech in the English Language: nouns, pronouns, verbs, adverbs, adjectives, prepositions, conjunctions, and interjections.
POS tagging in simple terms means allocating every word in a sentence to a part of speech. NLTK has a method called pos_tag that performs POS tagging on a sentence.
The methods apply supervised learning approaches that utilize features such as context, the capitulation of words, punctuations, and so on to determine the part of speech.
POS tagging is a critical procedure to understand the meaning of a sentence and know the relationship between words.
'''
#instantiate the WordNetLemmatizer class
lemmatizer = WordNetLemmatizer()
#tokenize the text into a list of sentences
sentence = sent_tokenize(text)
#loop over each sentence based on its index
for index, _ in enumerate(sentence):
    #tokenize each sentence into a list of words
    tokenized_words = word_tokenize(sentence[index])
    #apply the lemmatizer if the word is not a stopword
    words = [lemmatizer.lemmatize(word) for word in tokenized_words if word not in set(stopwords.words('english'))]
    #add the lemmatized words back to the sentence variable
    sentence[index] = ' '.join(words)
#print the preprocessed sentences
print(sentence)
Output:
So rounding off…
Stemming or Lemmatization: Which should you go for?
No doubt, lemmatization is more accurate than stemming, but there are tradeoffs. Lemmatization relies on morphological analysis and dictionary lookups, so it is more computationally intensive. If speed is a priority, you should consider stemming. Likewise, if you are building a sentiment analyzer or an email classifier, the base word is usually sufficient for your model, so stemming is a reasonable choice there as well.
If, however, your model will actively interact with humans – say you are building a chatbot, a language translation system, or something similar – lemmatization is the better option.