In the age of Artificial Intelligence (AI), natural language processing (NLP) forms the backbone of many modern technologies, from chatbots and search engines to sentiment analysis and text summarization tools. However, before any machine learning model can interpret text data accurately, it must first be cleaned, normalized, and structured. Two essential techniques that play a crucial role in this process are stemming and lemmatization.
These text preprocessing techniques are fundamental for every data scientist, NLP engineer, or AI practitioner. Whether you’re enrolled in artificial intelligence classes online or exploring machine learning independently, understanding stemming and lemmatization will give you the foundation to build more intelligent and context-aware text models.
1. Introduction to Text Normalization
Text normalization is the process of transforming text into a consistent format so that algorithms can process and analyze it effectively. Raw text data often contains variations such as plural forms, verb tenses, and derivations, so a model might treat words like run, running, and ran as separate entities, even though they share the same root meaning.
To overcome this challenge, NLP engineers apply stemming and lemmatization, which reduce words to their base or root form. This step helps models recognize word relationships, improving the accuracy of downstream tasks such as text classification, keyword extraction, or sentiment analysis.
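As a quick preview of what this looks like in practice, here is a minimal sketch using NLTK; the PorterStemmer and WordNetLemmatizer classes it relies on are explained in detail in the sections that follow.
#a quick preview using NLTK; both classes are covered later in this article
#the lemmatizer needs the WordNet data, downloadable once via nltk.download('wordnet')
from nltk.stem import PorterStemmer, WordNetLemmatizer
stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()
#stemming crudely chops endings: 'running' becomes 'run', but the irregular 'ran' is left alone
print([stemmer.stem(word) for word in ['run', 'running', 'ran']])
#lemmatization (here told the words are verbs) maps all three to the dictionary form 'run'
print([lemmatizer.lemmatize(word, pos='v') for word in ['run', 'running', 'ran']])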
What is Stemming?
Stemming is a text normalization technique that cuts affixes off words to extract the base form, or root word. Stemming is a crude process, and the root word, also called the stem, may not have grammatical meaning. In fact, some NLP libraries, such as spaCy, do not include stemming at all.
There are various programs used to carry out stemming, called stemmers or stemming algorithms. NLTK provides the Porter Stemmer, Lancaster Stemmer, Regular Expression Stemmer, and Snowball Stemmer. The most common is the Porter stemming algorithm.
Porter Stemming Algorithm
The Porter Stemming Algorithm is arguably the most popular stemming algorithm in Natural Language Processing. In NLTK, it can be instantiated using the PorterStemmer class. The algorithm takes tokenized words as input and outputs their stems. Let’s take a simple code example using the PorterStemmer class.
#import the PorterStemmer class from the nltk.stem module
from nltk.stem import PorterStemmer
#instantiate the PorterStemmer class
stemmer = PorterStemmer()
#create a list of tokens
tokens = ['smiling', 'smile', 'smiled', 'smiles']
#create an empty list to take in the stemmed words
stemmed_words = []
#loop over each token in the list
for each_word in tokens:
    #stem each word in the list
    stemmed_word = stemmer.stem(each_word)
    #add the stemmed word to the stemmed words list
    stemmed_words.append(stemmed_word)
#print the stemmed words list
print(stemmed_words)
Output:
['smile', 'smile', 'smile', 'smile']
As seen, all variations of the word have been reduced to the root word ‘smile’. As mentioned earlier, some words may not be stemmed into meaningful root words. If we attempt to stem the words ‘cry’, ‘crying’, ‘cries’ and ‘cried’, each one outputs ‘cri’, which has no grammatical meaning.
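A minimal sketch of that behaviour, reusing the stemmer instance from the example above:
#stem the 'cry' family of words with the same PorterStemmer instance
print([stemmer.stem(word) for word in ['cry', 'crying', 'cries', 'cried']])
#with NLTK's default Porter settings this typically prints ['cri', 'cri', 'cri', 'cri']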
Let’s take another example where we pass sentences as input.
#import the PorterStemmer class from the nltk.stem module
from nltk.stem import PorterStemmer
#import the word_tokenize function from the nltk library
#(word_tokenize requires the 'punkt' tokenizer data, downloadable via nltk.download('punkt'))
from nltk import word_tokenize
#instantiate the PorterStemmer class
stemmer = PorterStemmer()
#define some statements
sentences = 'NLTK is a very interesting subject. I have learnt a lot from this website. I will not stop learning'
#tokenize the sentences into words
each_sentence = word_tokenize(sentences)
#create an empty list to take in the stemmed words
stemmed_words = []
#loop over each token in the list
for word in each_sentence:
    #stem each word in the list
    stemmed_word = stemmer.stem(word)
    #add the stemmed word to the stemmed words list
    stemmed_words.append(stemmed_word)
#print the stemmed words list
print(stemmed_words)
Output:
['nltk', 'is', 'a', 'veri', 'interest', 'subject', '.', 'I', 'have', 'learnt', 'a', 'lot', 'from', 'thi', 'websit', '.', 'I', 'will', 'not', 'stop', 'learn']
Again, some of the stemmed words do not have a dictionary meaning. Mind you, there are other stemming algorithms; the only tweak to the code is importing the new stemming algorithm and instantiating it instead, as sketched below.
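Here is a minimal sketch of those swaps; the 'ing' pattern for the regular expression stemmer and the 'english' language for the Snowball stemmer mirror the outputs listed below.
#import the alternative stemmer classes from nltk.stem
from nltk.stem import LancasterStemmer, RegexpStemmer, SnowballStemmer
#instantiate the Lancaster stemmer
lancaster = LancasterStemmer()
#instantiate the regular expression stemmer with 'ing' as the pattern to strip
regexp = RegexpStemmer('ing')
#instantiate the Snowball stemmer for English
snowball = SnowballStemmer('english')
#each stemmer is used exactly like PorterStemmer, for example:
print(lancaster.stem('learning'), regexp.stem('learning'), snowball.stem('learning'))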
- Using the Lancaster stemmer, via the LancasterStemmer class, outputs:
['nltk', 'is', 'a', 'very', 'interest', 'subject', '.', 'i', 'hav', 'learnt', 'a', 'lot', 'from', 'thi', 'websit', '.', 'i', 'wil', 'not', 'stop', 'learn']
- The regular expression stemmer takes a regular expression and cuts off any suffix or prefix that matches the defined expression. Using the regular expression stemmer via the RegexpStemmer class, with 'ing' defined as the regular expression, outputs:
['NLTK', 'is', 'a', 'very', 'interest', 'subject', '.', 'I', 'have', 'learnt', 'a', 'lot', 'from', 'this', 'website', '.', 'I', 'will', 'not', 'stop', 'learn']
- The Snowball stemmer supports stemming in 15 other languages, including Arabic, French, German, Italian, Portuguese, Russian, Spanish, and Swedish. When using the Snowball stemmer, the language has to be defined. Using the Snowball stemmer via the SnowballStemmer class, with the language defined as 'english', outputs:
['nltk', 'is', 'a', 'veri', 'interest', 'subject', '.', 'i', 'have', 'learnt', 'a', 'lot', 'from', 'this', 'websit', '.', 'i', 'will', 'not', 'stop', 'learn']
Let’s now take a look at lemmatization.
What is Lemmatization?
Lemmatization is similar to stemming in that it reduces words to a base form. The difference is that in lemmatization the root word, called the ‘lemma’, is a word with a dictionary meaning. Lemmatization performs a morphological analysis of words to remove inflectional endings and return base words that have dictionary meaning. Lemmatization in the NLTK library is done using the WordNetLemmatizer class, and the methodology is almost the same as using PorterStemmer. Let’s take a simple coding example.
#import the WordNetLemmatizer class from the nltk.stem module
#(the lemmatizer relies on the WordNet corpus, downloadable via nltk.download('wordnet'))
from nltk.stem import WordNetLemmatizer
#instantiate the WordNetLemmatizer class
lemmatizer = WordNetLemmatizer()
#create a list of tokens
tokens = ['crying', 'cry', 'cried']
#create an empty list to take in the lemmatized words
lemmatized_words = []
#loop over each token in the list
for each_word in tokens:
    #lemmatize each word in the list
    lemmatized_word = lemmatizer.lemmatize(each_word)
    #add the lemmatized word to the lemmatized words list
    lemmatized_words.append(lemmatized_word)
#print the lemmatized words list
print(lemmatized_words)
Output:
['cry', 'cry', 'cried']
WordNetLemmatizer outputs meaningful words – cry and cried – as opposed to the ‘cri’ that PorterStemmer returned.
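One detail worth knowing: lemmatize() treats every word as a noun unless you pass the pos argument, so the result can change when you tell it the part of speech. A minimal sketch, reusing the lemmatizer above:
#pos defaults to 'n' (noun); pass pos='v' to lemmatize the word as a verb
print(lemmatizer.lemmatize('cried'))           #treated as a noun, the word comes back unchanged
print(lemmatizer.lemmatize('cried', pos='v'))  #treated as a verb, this typically returns 'cry'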
Use Case Scenarios
Let’s say we have a text we want a machine learning model to understand. We need to preprocess the text using stemming or lemmatization. Let’s take a code example for each of them starting with stemming.
#import the necessary libraries
from nltk import sent_tokenize, word_tokenize
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
#define the text to be preprocessed
text = '''
A part of speech is a classification system in the English Language that reveals the role a word plays in a context.
There are eight parts of speech in the English Language: nouns, pronouns, verbs, adverbs, adjectives, prepositions, conjunctions, and interjections.
POS tagging in simple terms means allocating every word in a sentence to a part of speech. NLTK has a method called pos_tag that performs POS tagging on a sentence.
The methods apply supervised learning approaches that utilize features such as context, the capitulation of words, punctuations, and so on to determine the part of speech.
POS tagging is a critical procedure to understand the meaning of a sentence and know the relationship between words.
'''
#instantiate the PorterStemmer class
stemmer = PorterStemmer()
#tokenize the text into a list of sentences (requires the 'punkt' tokenizer data)
sentence = sent_tokenize(text)
#loop over each sentence based on its index
for index, _ in enumerate(sentence):
    #tokenize each sentence into a list of words
    tokenized_words = word_tokenize(sentence[index])
    #apply the stemmer if the word is not a stopword
    words = [stemmer.stem(word) for word in tokenized_words if word not in set(stopwords.words('english'))]
    #add the stemmed words back to the sentence variable
    sentence[index] = ' '.join(words)
#print the preprocessed sentences
print(sentence)
Output:
Observe that words such as of, in, the, etc. are completely taken out. These are called stopwords. Stopwords do not add significant meaning to a sentence, so it is good practice to remove them.
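If you are curious which words are treated as stopwords, the list is easy to inspect. A minimal sketch (the 'stopwords' resource must be downloaded once with nltk.download('stopwords')):
#import the stopwords corpus from nltk
from nltk.corpus import stopwords
#build the English stopword set once so it is not rebuilt for every word
stop_words = set(stopwords.words('english'))
#print how many English stopwords there are and a small sample
print(len(stop_words))
print(sorted(stop_words)[:10])
Hoisting the set into a variable like this also avoids recomputing it inside the list comprehension, which the examples above do only for brevity.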
Now, let’s carry out lemmatization on the same text and see the result.
#import the necessary libraries
from nltk import sent_tokenize, word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
#define the text to be preprocessed
text = '''
A part of speech is a classification system in the English Language that reveals the role a word plays in a context.
There are eight parts of speech in the English Language: nouns, pronouns, verbs, adverbs, adjectives, prepositions, conjunctions, and interjections.
POS tagging in simple terms means allocating every word in a sentence to a part of speech. NLTK has a method called pos_tag that performs POS tagging on a sentence.
The methods apply supervised learning approaches that utilize features such as context, the capitulation of words, punctuations, and so on to determine the part of speech.
POS tagging is a critical procedure to understand the meaning of a sentence and know the relationship between words.
'''
#instantiate the WordNetLemmatizer class
lemmatizer = WordNetLemmatizer()
#tokenize the text into a list of sentences
sentence = sent_tokenize(text)
#loop over each sentence based on its index
for index, _ in enumerate(sentence):
    #tokenize each sentence into a list of words
    tokenized_words = word_tokenize(sentence[index])
    #apply the lemmatizer if the word is not a stopword
    words = [lemmatizer.lemmatize(word) for word in tokenized_words if word not in set(stopwords.words('english'))]
    #add the lemmatized words back to the sentence variable
    sentence[index] = ' '.join(words)
#print the preprocessed sentences
print(sentence)
Output:
So rounding off…
Stemming or Lemmatization: Which should you go for?
No doubt, lemmatization is more accurate than stemming, but there are tradeoffs. Lemmatization relies on morphological analysis and dictionary lookups, so it is more computationally intensive. If speed is a priority, you should consider stemming. Likewise, if you are building a sentiment analyzer or an email classifier, the base word is usually sufficient for your model, so stemming is a reasonable choice there as well.
If, however, your model will actively interact with humans – say you are building a chatbot, a language translation system, or something similar – lemmatization is the better option.