{"id":4664,"date":"2020-09-01T13:41:04","date_gmt":"2020-09-01T08:11:04","guid":{"rendered":"https:\/\/www.h2kinfosys.com\/blog\/?p=4664"},"modified":"2025-10-27T03:30:05","modified_gmt":"2025-10-27T07:30:05","slug":"stemming-and-lemmatization","status":"publish","type":"post","link":"https:\/\/www.h2kinfosys.com\/blog\/stemming-and-lemmatization\/","title":{"rendered":"Stemming and Lemmatization"},"content":{"rendered":"\n<p>In the age of Artificial Intelligence (AI), natural language processing (NLP) forms the backbone of many modern technologies from chatbots and search engines to sentiment analysis and text summarization tools. However, before any machine learning model can interpret text data accurately, it must first be cleaned, normalized, and structured. Two essential techniques that play a crucial role in this process are <strong>stemming and lemmatization<\/strong>.<\/p>\n\n\n\n<p>These text preprocessing techniques are fundamental for every data scientist, NLP engineer, or AI practitioner. Whether you\u2019re enrolled in <strong><a href=\"https:\/\/www.h2kinfosys.com\/courses\/artificial-intelligence-online-training-course-details\/\">Artificial intelligence classes online<\/a><\/strong> or exploring machine learning independently, understanding stemming and lemmatization will give you the foundation to build more intelligent and context-aware text models.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\"><strong>1. Introduction to Text Normalization<\/strong><\/h2>\n\n\n\n<p>Text normalization is a process of transforming text into a consistent format so that algorithms can process and analyze it effectively. 
Since raw text often contains variations such as plural forms, verb tenses, and derivations, models might treat words like <em>run<\/em>, <em>running<\/em>, and <em>ran<\/em> as separate entities, even though they share the same root meaning.<\/p>\n\n\n\n<p>To overcome this challenge, NLP engineers apply <strong>stemming<\/strong> and <strong>lemmatization<\/strong>, which reduce words to their base or root form. This step helps models recognize word relationships, improving the accuracy of downstream tasks such as text classification, keyword extraction, or sentiment analysis.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\"><strong>What is Stemming?<\/strong><\/h2>\n\n\n\n<p>Stemming is a text normalization technique that strips the affixes from a word to extract its base form, called the stem. It is a crude, rule-based process, and the stem may not be a valid dictionary word. In fact, some <a href=\"https:\/\/www.h2kinfosys.com\/blog\/natural-language-processing-nlp-tutorial\/\">NLP libraries<\/a>, such as spaCy, do not include stemming at all.<\/p>\n\n\n\n<p>Several algorithms are available to carry out stemming. These are called stemmers or stemming algorithms. NLTK provides the Porter Stemmer, Lancaster Stemmer, Regular Expression Stemmer, and Snowball Stemmer. The most common is the Porter stemming algorithm.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\"><strong>Porter Stemming Algorithm<\/strong><\/h2>\n\n\n\n<p>The Porter Stemming Algorithm is arguably the most popular stemming algorithm in Natural Language Processing. In NLTK, it can be instantiated using the PorterStemmer class. The algorithm takes tokenized words as input and outputs their stems. 
Let\u2019s take a simple code example using the <strong>PorterStemmer<\/strong> class.<\/p>\n\n\n\n<pre class=\"wp-block-preformatted\"><em>#import the PorterStemmer class from the nltk.stem module<\/em>\n<strong>from<\/strong> <strong>nltk.stem<\/strong> <strong>import<\/strong> PorterStemmer\n<em>#instantiate the PorterStemmer class<\/em>\nstemmer = PorterStemmer()\n<em>#create a list of tokens<\/em>\ntokens = ['smiling', 'smile', 'smiled', 'smiles']\n<em>#create an empty list to take in the stemmed words<\/em>\nstemmed_words = []\n<em>#loop over each token in the list<\/em>\n<strong>for<\/strong> each_word <strong>in<\/strong> tokens:\n    <em>#stem each word in the list<\/em>\n    stemmed_word = stemmer.stem(each_word)\n    <em>#add the stemmed word to the stemmed words list<\/em>\n    stemmed_words.append(stemmed_word)\n<em>#print the stemmed words list<\/em>\n<strong>print<\/strong>(stemmed_words)<\/pre>\n\n\n\n<p><strong>Output:<\/strong><\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>&#91;'smile', 'smile', 'smile', 'smile']<\/code><\/pre>\n\n\n\n<p>As seen, all variations of the word have been reduced to the same root word, \u2018smile\u2019. As mentioned earlier, some words are not stemmed into meaningful root words. 
If we attempt to stem the words \u2018cry\u2019, \u2018crying\u2019, \u2018cries\u2019 and \u2018cried\u2019, the stemmer outputs \u2018cri\u2019 for each of them, which is not a dictionary word.<\/p>\n\n\n\n<p>Let\u2019s take another example where we pass sentences as input.<\/p>\n\n\n\n<pre class=\"wp-block-preformatted\"><em>#import the PorterStemmer class from the nltk.stem module<\/em>\n<strong>from<\/strong> <strong>nltk.stem<\/strong> <strong>import<\/strong> PorterStemmer\n<em>#import the word_tokenize function from the nltk library<\/em>\n<strong>from<\/strong> <strong>nltk<\/strong> <strong>import<\/strong> word_tokenize\n<em>#instantiate the PorterStemmer class<\/em>\nstemmer = PorterStemmer()\n<em>#define some sentences<\/em>\nsentences = 'NLTK is a very interesting subject. I have learnt a lot from this website. I will not stop learning'\n<em>#split the sentences into word tokens<\/em>\neach_sentence = word_tokenize(sentences)\n<em>#create an empty list to take in the stemmed words<\/em>\nstemmed_words = []\n<em>#loop over each token in the list<\/em>\n<strong>for<\/strong> word <strong>in<\/strong> each_sentence:\n    <em>#stem each word in the list<\/em>\n    stemmed_word = stemmer.stem(word)\n    <em>#add the stemmed word to the stemmed words list<\/em>\n    stemmed_words.append(stemmed_word)\n<em>#print the stemmed words list<\/em>\n<strong>print<\/strong>(stemmed_words)<\/pre>\n\n\n\n<p><strong><em>Output:<\/em><\/strong><\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>&#91;'nltk', 'is', 'a', 'veri', 'interest', 'subject', '.', 'I', 'have', 'learnt', 'a', 'lot', 'from', 'thi', 'websit', '.', 'I', 'will', 'not', 'stop', 'learn']<\/code><\/pre>\n\n\n\n<p>Again, some of the stemmed words do not have a dictionary meaning. NLTK also offers other stemming algorithms. 
The only change to the code is to import the new stemmer class and instantiate it.<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Using the Lancaster stemmer, through the LancasterStemmer class, outputs:<\/li>\n<\/ul>\n\n\n\n<pre class=\"wp-block-code\"><code>&#91;'nltk', 'is', 'a', 'very', 'interest', 'subject', '.', 'i', 'hav', 'learnt', 'a', 'lot', 'from', 'thi', 'websit', '.', 'i', 'wil', 'not', 'stop', 'learn']<\/code><\/pre>\n\n\n\n<ul class=\"wp-block-list\">\n<li>The regular expression stemmer takes a regular expression and cuts off any suffix or prefix that matches it. Using the Regular Expression stemmer, through the RegexpStemmer class, with \u2018ing\u2019 as the regular expression outputs:<\/li>\n<\/ul>\n\n\n\n<pre class=\"wp-block-code\"><code>&#91;'NLTK', 'is', 'a', 'very', 'interest', 'subject', '.', 'I', 'have', 'learnt', 'a', 'lot', 'from', 'this', 'website', '.', 'I', 'will', 'not', 'stop', 'learn']<\/code><\/pre>\n\n\n\n<ul class=\"wp-block-list\">\n<li>The Snowball stemmer allows for stemming in 15 other languages, including Arabic, French, German, Italian, Portuguese, Russian, Spanish and Swedish. When using the Snowball stemmer, the language has to be specified. Using the Snowball stemmer, through the SnowballStemmer class, with the language set to \u2018english\u2019 outputs:<\/li>\n<\/ul>\n\n\n\n<pre class=\"wp-block-code\"><code>&#91;'nltk', 'is', 'a', 'veri', 'interest', 'subject', '.', 'i', 'have', 'learnt', 'a', 'lot', 'from', 'this', 'websit', '.', 'i', 'will', 'not', 'stop', 'learn']<\/code><\/pre>\n\n\n\n<p>Let\u2019s now take a look at lemmatization.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\"><strong>What is Lemmatization?<\/strong><\/h2>\n\n\n\n<p>Lemmatization is similar to stemming in that it reduces a word to a base form. The difference is that in lemmatization the base form, called the \u2018lemma\u2019, is always a word with a dictionary meaning. 
The <a href=\"https:\/\/en.wikipedia.org\/wiki\/Morphological_analysis#:~:text=Morphological%20analysis%20is%20the%20analysis,the%20internal%20structure%20of%20words\" rel=\"nofollow noopener\" target=\"_blank\">morphological analysis<\/a> of words is carried out in lemmatization to remove inflectional endings and output base words with a dictionary meaning. Lemmatization with the NLTK library is done using the WordNetLemmatizer class. The workflow is almost the same as with PorterStemmer. Let\u2019s take a simple coding example.<\/p>\n\n\n\n<pre class=\"wp-block-preformatted\"><em>#import the WordNetLemmatizer class from the nltk.stem module<\/em>\n<strong>from<\/strong> <strong>nltk.stem<\/strong> <strong>import<\/strong> WordNetLemmatizer\n<em>#instantiate the WordNetLemmatizer class<\/em>\nlemmatizer = WordNetLemmatizer()\n<em>#create a list of tokens<\/em>\ntokens = ['crying', 'cry', 'cried']\n<em>#create an empty list to take in the lemmatized words<\/em>\nlemmatized_words = []\n<em>#loop over each token in the list<\/em>\n<strong>for<\/strong> each_word <strong>in<\/strong> tokens:\n    <em>#lemmatize each word in the list<\/em>\n    lemmatized_word = lemmatizer.lemmatize(each_word)\n    <em>#add the lemmatized word to the lemmatized words list<\/em>\n    lemmatized_words.append(lemmatized_word)\n<em>#print the lemmatized words list<\/em>\n<strong>print<\/strong>(lemmatized_words)<\/pre>\n\n\n\n<p><em><strong>Output:<\/strong><\/em><\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>&#91;'cry', 'cry', 'cried']<\/code><\/pre>\n\n\n\n<p>WordNetLemmatizer outputs meaningful words, cry and cried, as opposed to the cri that PorterStemmer returned. Note that lemmatize() treats every word as a noun unless a part of speech is passed, which is why \u2018cried\u2019 is left unchanged; calling lemmatizer.lemmatize('cried', pos='v') returns \u2018cry\u2019.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\"><strong>Use Case Scenarios<\/strong><\/h2>\n\n\n\n<p>Let\u2019s say we have a text we want a machine learning model to understand. 
We first need to preprocess the text using stemming or lemmatization. Let\u2019s take a code example for each of them, starting with stemming.<\/p>\n\n\n\n<pre class=\"wp-block-preformatted\"><em>#import the necessary libraries<\/em>\n<strong>from<\/strong> <strong>nltk<\/strong> <strong>import<\/strong> sent_tokenize, word_tokenize\n<strong>from<\/strong> <strong>nltk.corpus<\/strong> <strong>import<\/strong> stopwords\n<strong>from<\/strong> <strong>nltk.stem<\/strong> <strong>import<\/strong> PorterStemmer\n\n<em>#define the text to be preprocessed<\/em>\ntext = '''\nA part of speech is a classification system in the English Language that reveals the role a word plays in a context.&nbsp;\nThere are eight parts of speech in the English Language: nouns, pronouns, verbs, adverbs, adjectives, prepositions, conjunctions, and interjections.\nPOS tagging in simple terms means allocating every word in a sentence to a part of speech. NLTK has a method called pos_tag that performs POS tagging on a sentence.&nbsp;\nThe methods apply supervised learning approaches that utilize features such as context, the capitulation of words, punctuations, and so on to determine the part of speech.&nbsp;\nPOS tagging is a critical procedure to understand the meaning of a sentence and know the relationship between words.&nbsp;\n\n'''\n<em>#instantiate the PorterStemmer class<\/em>\nstemmer = PorterStemmer()\n<em>#tokenize the text into a list of sentences<\/em>\nsentence = sent_tokenize(text)\n\n<em>#loop over each sentence by its index<\/em>\n<strong>for<\/strong> index, _ <strong>in<\/strong> enumerate(sentence):\n    <em>#tokenize each sentence into a list of words<\/em>\n    tokenized_words = word_tokenize(sentence[index])\n    <em>#apply stemmer if the word is not a stopword<\/em>\n    words = [stemmer.stem(word) <strong>for<\/strong> word <strong>in<\/strong> tokenized_words <strong>if<\/strong> word 
<strong>not<\/strong> <strong>in<\/strong> set(stopwords.words('english'))]\n    <em>#store the stemmed words back in the sentence list<\/em>\n    sentence[index] = ' '.join(words)\n<em>#print the stemmed sentences<\/em>\n<strong>print<\/strong>(sentence)<\/pre>\n\n\n\n<p><strong><em>Output:<\/em><\/strong><\/p>\n\n\n\n<p><img fetchpriority=\"high\" decoding=\"async\" width=\"617\" height=\"246\" src=\"https:\/\/lh3.googleusercontent.com\/6nRpWijvzLGXdAlmFhHvPasNndFOiTkwM8lwaOltsvEehbqbaVda4v1m627m9PmnXxag_EBvT5MMXiDh5o1CSCqFpn6EHtN_r4xkFAPG0z-9-Iu3eXdpXnkFJ_ACvyz8pMF-wps\" alt=\"\" title=\"\"><\/p>\n\n\n\n<p>Observe that words such as of, in, and the are taken out completely. They are called stopwords. Stopwords are very common words that add little meaning to a sentence, so it is good practice to remove them.<\/p>\n\n\n\n<p>Now, let\u2019s carry out lemmatization on the same text and see the result.<\/p>\n\n\n\n<pre class=\"wp-block-preformatted\"><em>#import the necessary libraries<\/em>\n<strong>from<\/strong> <strong>nltk<\/strong> <strong>import<\/strong> sent_tokenize, word_tokenize\n<strong>from<\/strong> <strong>nltk.corpus<\/strong> <strong>import<\/strong> stopwords\n<strong>from<\/strong> <strong>nltk.stem<\/strong> <strong>import<\/strong> WordNetLemmatizer\n\n<em>#define the text to be preprocessed<\/em>\ntext = '''\nA part of speech is a classification system in the English Language that reveals the role a word plays in a context.&nbsp;\nThere are eight parts of speech in the English Language: nouns, pronouns, verbs, adverbs, adjectives, prepositions, conjunctions, and interjections.\nPOS tagging in simple terms means allocating every word in a sentence to a part of speech. 
NLTK has a method called pos_tag that performs POS tagging on a sentence.&nbsp;\nThe methods apply supervised learning approaches that utilize features such as context, the capitulation of words, punctuations, and so on to determine the part of speech.&nbsp;\nPOS tagging is a critical procedure to understand the meaning of a sentence and know the relationship between words.&nbsp;\n\n'''\n<em>#instantiate the WordNetLemmatizer class<\/em>\nlemmatizer = WordNetLemmatizer()\n<em>#tokenize the text into a list of sentences<\/em>\nsentence = sent_tokenize(text)\n<em>#build the stopword set once, outside the loop<\/em>\nstop_words = set(stopwords.words('english'))\n\n<em>#loop over each sentence by its index<\/em>\n<strong>for<\/strong> index, _ <strong>in<\/strong> enumerate(sentence):\n    <em>#tokenize each sentence into a list of words<\/em>\n    tokenized_words = word_tokenize(sentence[index])\n    <em>#apply lemmatizer if the word is not a stopword<\/em>\n    words = [lemmatizer.lemmatize(word) <strong>for<\/strong> word <strong>in<\/strong> tokenized_words <strong>if<\/strong> word <strong>not<\/strong> <strong>in<\/strong> stop_words]\n    <em>#store the lemmatized words back in the sentence list<\/em>\n    sentence[index] = ' '.join(words)\n<em>#print the lemmatized sentences<\/em>\n<strong>print<\/strong>(sentence)\n<\/pre>\n\n\n\n<p><strong><em>Output:<\/em><\/strong><\/p>\n\n\n\n<figure class=\"wp-block-image\"><img decoding=\"async\" src=\"https:\/\/lh5.googleusercontent.com\/SLytb1cJoGoE2-eRS2EJ2rDyX6xrJPlqA5jJGoEGKkOWAXQK5rXnO5-GhOlkvaUXuLujXA1KtRXuRemt7jXmFZiekC2mwyg1BoIveqimWWyTqGJQrUOAOxEly6yMHi5lsl_v7oc\" alt=\"\" title=\"\"><\/figure>\n\n\n\n<p>So, rounding off\u2026<\/p>\n\n\n\n<h2 class=\"wp-block-heading\"><strong>Stemming or Lemmatization: Which should you go for?<\/strong><\/h2>\n\n\n\n<p>Lemmatization generally gives more accurate results than stemming, but there are tradeoffs. It relies on dictionary lookups and morphological analysis rather than simple suffix rules, so it is more computationally intensive. 
If speed is a priority, you should consider stemming. If you are building a sentiment analyzer or an email classifier, the stem is usually sufficient to build your model, so stemming is a good choice in this case as well.<\/p>\n\n\n\n<p>If, however, your model will actively interact with humans, say you are building a chatbot or a language translation system, lemmatization would be a better option.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>In the age of Artificial Intelligence (AI), natural language processing (NLP) forms the backbone of many modern technologies from chatbots and search engines to sentiment analysis and text summarization tools. However, before any machine learning model can interpret text data accurately, it must first be cleaned, normalized, and structured. Two essential techniques that play a [&hellip;]<\/p>\n","protected":false},"author":1,"featured_media":4686,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_acf_changed":false,"footnotes":"","_members_access_role":[],"_members_access_error":""},"categories":[498],"tags":[1310,1311,1309],"class_list":["post-4664","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-artificial-intelligence-tutorials","tag-lemmatization","tag-porter-stemming","tag-stemming"],"acf":[],"_links":{"self":[{"href":"https:\/\/www.h2kinfosys.com\/blog\/wp-json\/wp\/v2\/posts\/4664","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.h2kinfosys.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.h2kinfosys.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.h2kinfosys.com\/blog\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/www.h2kinfosys.com\/blog\/wp-json\/wp\/v2\/comments?post=4664"}],"version-history":[{"count":1,"href":"https:\/\/www.h2kinfosys.com\/blog\/wp-json\/wp\/v2\/posts\/466
4\/revisions"}],"predecessor-version":[{"id":31336,"href":"https:\/\/www.h2kinfosys.com\/blog\/wp-json\/wp\/v2\/posts\/4664\/revisions\/31336"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.h2kinfosys.com\/blog\/wp-json\/wp\/v2\/media\/4686"}],"wp:attachment":[{"href":"https:\/\/www.h2kinfosys.com\/blog\/wp-json\/wp\/v2\/media?parent=4664"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.h2kinfosys.com\/blog\/wp-json\/wp\/v2\/categories?post=4664"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.h2kinfosys.com\/blog\/wp-json\/wp\/v2\/tags?post=4664"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}