{"id":5142,"date":"2020-09-30T16:58:05","date_gmt":"2020-09-30T11:28:05","guid":{"rendered":"https:\/\/www.h2kinfosys.com\/blog\/?p=5142"},"modified":"2022-09-11T13:44:26","modified_gmt":"2022-09-11T08:14:26","slug":"simple-statistics-with-nltk-counting-of-pos-tags-and-frequency-distributions","status":"publish","type":"post","link":"https:\/\/www.h2kinfosys.com\/blog\/simple-statistics-with-nltk-counting-of-pos-tags-and-frequency-distributions\/","title":{"rendered":"Simple Statistics with NLTK: Counting of POS Tags and Frequency Distributions"},"content":{"rendered":"\n<p>In the last tutorial, we discussed how to assign POS tags to words in a sentence using the pos_tag method of NLTK. We said that POS tagging is a fundamental step in the preprocessing of textual data and is especially needed when building text classification models. We went further to discuss <a href=\"https:\/\/www.h2kinfosys.com\/blog\/pos-tagging-and-hidden-markov-model\/\">Hidden Markov Models<\/a> (HMMs) and their importance in text analysis. When creating HMMs, we mentioned that you must count the number of each POS tag in the sentence to determine the transition probabilities and emission probabilities. While the counting process may be a non-issue for a small text, it can be daunting for large datasets.\u00a0<\/p>\n\n\n\n<p>In such cases, we may need to rely on automatic methods to count tags and words. In this tutorial, we will be discussing how to count using Python\u2019s native Counter class and the FreqDist class of NLTK. Along the way, you will also learn about collocations, bigrams, and trigrams. 
Let\u2019s begin!<\/p>\n\n\n\n<figure class=\"wp-block-embed is-type-video is-provider-youtube wp-block-embed-youtube wp-embed-aspect-16-9 wp-has-aspect-ratio\"><div class=\"wp-block-embed__wrapper\">\n<iframe title=\"Artificial intelligence Courses Online | Artificial Intelligence Tutorial |  H2kinfosys\" width=\"800\" height=\"450\" src=\"https:\/\/www.youtube.com\/embed\/jm_4OanymJs?feature=oembed\" frameborder=\"0\" allow=\"accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share\" referrerpolicy=\"strict-origin-when-cross-origin\" allowfullscreen><\/iframe>\n<\/div><\/figure>\n\n\n\n<h2 class=\"wp-block-heading\"><strong>Counting the Number of Items in a String\/List<\/strong><\/h2>\n\n\n\n<p>Python\u2019s collections module has a plethora of classes, including the Counter class, ChainMap class, OrderedDict class, and so on. Each of these classes has its own specific capabilities. Here, we will focus on the Counter class, which is used to count the number of items in a list, string, or tuple. It returns a dictionary where the key is the element\/item in the list and the value is the frequency of that element\/item in the list.&nbsp;<\/p>\n\n\n\n<p>Let\u2019s see this simple example below. Say we want to count the number of times each letter appears in a sentence; the Counter class will come in handy. 
We start by importing the class from the collections module.<\/p>\n\n\n\n<pre class=\"wp-block-preformatted\"><em>#import the Counter class<\/em>\n<strong>from<\/strong> <strong>collections<\/strong> <strong>import<\/strong> Counter\n<em>#define some text<\/em>\ntext = \"It is necessary for any Data Scientist to understand Natural Language Processing\"\n<em>#convert all letters to lower case<\/em>\ntext = text.lower()\n<em>#instantiate the Counter class on the text<\/em>\nthe_count = Counter(text)\n<em>#print the count<\/em>\n<strong>print<\/strong>(the_count)&nbsp;\n<\/pre>\n\n\n\n<p><strong>Output:<\/strong><\/p>\n\n\n\n<pre class=\"wp-block-preformatted\">Counter({' ': 11, 'a': 9, 's': 8, 't': 7, 'n': 7, 'i': 6, 'e': 6, 'r': 5, 'c': 3, 'o': 3, 'd': 3, 'u': 3, 'g': 3, 'y': 2, 'l': 2, 'f': 1, 'p': 1})<\/pre>\n\n\n\n<p>As seen, we had 11 whitespaces, 9 a\u2019s, 8 s\u2019s, 7 t\u2019s, 7 n\u2019s, and so on.&nbsp;<\/p>\n\n\n\n<p>We can use this same methodology to count the POS tags in a sentence. 
Take a look.&nbsp;<\/p>\n\n\n\n<pre class=\"wp-block-preformatted\"><em>#import nltk library<\/em>\n<strong>import<\/strong> <strong>nltk<\/strong>\n<em>#import the Counter class<\/em>\n<strong>from<\/strong> <strong>collections<\/strong> <strong>import<\/strong> Counter\n<em>#define some text<\/em>\ntext = \"It is necessary for any Data Scientist to understand Natural Language Processing\"\n<em>#convert all letters to lower case<\/em>\ntext = text.lower()\n<em>#tokenize the words in the text<\/em>\ntokens = nltk.word_tokenize(text)\n<em>#assign a POS tag to each word<\/em>\npos = nltk.pos_tag(tokens)\n<em>#count the POS tags<\/em>\nthe_count = Counter(tag <strong>for<\/strong> _, tag <strong>in<\/strong> pos)\n<em>#print the count<\/em>\n<strong>print<\/strong>(the_count)&nbsp;<\/pre>\n\n\n\n<p><strong>Output:<\/strong><\/p>\n\n\n\n<pre class=\"wp-block-preformatted\">Counter({'NN': 3, 'JJ': 2, 'PRP': 1, 'VBZ': 1, 'IN': 1, 'DT': 1, 'NNS': 1, 'TO': 1, 'VB': 1})<\/pre>\n\n\n\n<p>Let\u2019s do a quick rundown of what each line of code does.&nbsp;<\/p>\n\n\n\n<p>We started off by importing the necessary libraries, after which we defined the text we want to tokenize. It was necessary to convert all the words to lowercase so that two occurrences of the same word are not treated as different words because of case variation. Having done that, we tokenized the words and assigned a POS tag to each word. Bear in mind that the output of the pos_tag() method is a list of tuples. Each tuple pairs an individual word from the text with its corresponding POS tag.&nbsp;<\/p>\n\n\n\n<p>Here\u2019s what happens in the for loop. Recall that pos_tag() returns the words paired with their POS tags. We wish to count only the POS tags, which are the second element of each tuple. 
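As a side note, Counter can also rank what it counts: its most_common() method returns the items sorted by frequency. Here is a minimal stdlib-only sketch, using a hand-written list of (word, tag) pairs as a stand-in for real pos_tag() output so it runs without NLTK:

```python
from collections import Counter

# hand-written stand-in for pos_tag() output: (word, tag) tuples
pos = [("it", "PRP"), ("is", "VBZ"), ("necessary", "JJ"),
       ("data", "NN"), ("scientist", "NN"), ("processing", "NN")]

# keep only the tags, then rank them by frequency
tag_counts = Counter(tag for _, tag in pos)
print(tag_counts.most_common(1))  # → [('NN', 3)]
```

The same most_common() call works on the real tag counts above, since Counter does not care where the items came from.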
We iterate over each (word, tag) pair, keep only the tag, and count the tags with the Counter class.<\/p>\n\n\n\n<p>Let\u2019s go on to see how to count using NLTK\u2019s FreqDist class.&nbsp;<\/p>\n\n\n\n<h2 class=\"wp-block-heading\"><strong>Frequency Distribution with NLTK<\/strong><\/h2>\n\n\n\n<p>Have you ever wondered how the words that provide key information about a topic or book are found? An easy way to go about this is by finding the words that appear the most in the text\/book (excluding stopwords). Counting the number of times a word appears in a document is called a frequency distribution. In other words, a frequency distribution shows how the words are distributed in a document.&nbsp;<\/p>\n\n\n\n<p>You could also say that a frequency distribution records how often each specific outcome occurs in an experiment.&nbsp;The FreqDist class is used to count the number of times each word token appears in the text. Throughout this tutorial, the textual data will be a book from the NLTK corpus, called <em>Moby Dick<\/em>. 
This is one of NLTK\u2019s pre-installed corpora.&nbsp;<\/p>\n\n\n\n<pre class=\"wp-block-preformatted\"><em>#import the necessary libraries<\/em>\n<strong>import<\/strong> <strong>nltk<\/strong>\n<strong>from<\/strong> <strong>nltk.corpus<\/strong> <strong>import<\/strong> gutenberg\n<em>#call the book we intend to use and save as text<\/em>\ntext = gutenberg.words('melville-moby_dick.txt')\n<em>#check number of words in the book<\/em>\n<strong>print<\/strong>(len(text))<\/pre>\n\n\n\n<p><strong>Output:<\/strong> 260819<\/p>\n\n\n\n<p>We can see that it is quite a large book with over 260,000 words.&nbsp; Let\u2019s have a peep into what the book looks like.<\/p>\n\n\n\n<pre class=\"wp-block-preformatted\"><em>#prints the first 100 words in the book<\/em>\n<strong>print<\/strong>(text[:100])\n<\/pre>\n\n\n\n<p><strong>Output:&nbsp;<\/strong><\/p>\n\n\n\n<figure class=\"wp-block-table\"><table><tbody><tr><td><code>['[', 'Moby', 'Dick', 'by', 'Herman', 'Melville', '1851', ']', 'ETYMOLOGY', '.', '(', 'Supplied', 'by', 'a', 'Late', 'Consumptive', 'Usher', 'to', 'a', 'Grammar', 'School', ')', 'The', 'pale', 'Usher', '--', 'threadbare', 'in', 'coat', ',', 'heart', ',', 'body', ',', 'and', 'brain', ';', 'I', 'see', 'him', 'now', '.', 'He', 'was', 'ever', 'dusting', 'his', 'old', 'lexicons', 'and', 'grammars', ',', 'with', 'a', 'queer', 'handkerchief', ',', 'mockingly', 'embellished', 'with', 'all', 'the', 'gay', 'flags', 'of', 'all', 'the', 'known', 'nations', 'of', 'the', 'world', '.', 'He', 'loved', 'to', 'dust', 'his', 'old', 'grammars', ';', 'it', 'somehow', 'mildly', 'reminded', 'him', 'of', 'his', 'mortality', '.', '\"', 'While', 'you', 'take', 'in', 'hand', 'to', 'school', 'others', ',']<\/code><\/td><\/tr><\/tbody><\/table><\/figure>\n\n\n\n<pre class=\"wp-block-preformatted\"><em>#invoke the FreqDist class and pass the text as parameter<\/em>\nfdistribution = nltk.FreqDist(text)<\/pre>\n\n\n\n<p>In the example above, the FreqDist class is instantiated and the text was passed as a parameter. 
This was saved in the \u2018fdistribution\u2019 variable. Observe that the text has already been split into word tokens. If we were dealing with raw text, we would first have to split it into sentences using the sent_tokenize() method, then tokenize each sentence using the word_tokenize() method. This is because FreqDist() takes word tokens as its parameter.<\/p>\n\n\n\n<p>To perform operations on the fdistribution variable, we use the methods the FreqDist class provides. There are various methods that can be used to perform different operations on the instantiated class. Let\u2019s see some of them.<\/p>\n\n\n\n<ol class=\"wp-block-list\"><li>The keys method. This method is called using the keys() statement. It returns the words in the vocabulary, in the order they first appear. If you want to print the first 50 elements, convert the view to a list and slice it. Check the example below.<\/li><\/ol>\n\n\n\n<pre class=\"wp-block-preformatted\"><em>#return the words in the frequency distribution dictionary<\/em>\nvocabulary = list(fdistribution.keys())\n<em>#prints the first 50 keys of the dictionary&nbsp;<\/em>\n<strong>print<\/strong>(vocabulary[:50])<\/pre>\n\n\n\n<p><strong>Output:<\/strong><\/p>\n\n\n\n<figure class=\"wp-block-table\"><table><tbody><tr><td><code>['[', 'Moby', 'Dick', 'by', 'Herman', 'Melville', '1851', ']', 'ETYMOLOGY', '.', '(', 'Supplied', 'a', 'Late', 'Consumptive', 'Usher', 'to', 'Grammar', 'School', ')', 'The', 'pale', '--', 'threadbare', 'in', 'coat', ',', 'heart', 'body', 'and', 'brain', ';', 'I', 'see', 'him', 'now', 'He', 'was', 'ever', 'dusting', 'his', 'old', 'lexicons', 'grammars', 'with', 'queer', 'handkerchief', 'mockingly', 'embellished', 'all']<\/code><\/td><\/tr><\/tbody><\/table><\/figure>\n\n\n\n<ol class=\"wp-block-list\"><li>Plot method. This method is called using the plot() statement. It plots the frequency curve of the words in the vocabulary. 
The plot() method is a common and important method of the FreqDist class, as the pictorial representation gives a solid understanding of how the words are spread across the vocabulary. Check the example below.&nbsp;<\/li><\/ol>\n\n\n\n<pre class=\"wp-block-preformatted\"><em>#plot the frequency distribution curve for the first 50 words<\/em>\nfdistribution.plot(50)<\/pre>\n\n\n\n<p><strong>Output:<\/strong><\/p>\n\n\n\n<figure class=\"wp-block-image\"><img decoding=\"async\" src=\"https:\/\/lh3.googleusercontent.com\/FMs5Bs2zZ9EHjGihzsZR0n-oWxylszPY-nQbvXc0jZPLYZF0IsHdI01lX8DEBPBsTGgGI4T_Kk4ZC2Q-gVnnaufZdbXI3SjTIU5MAnPyHsGc-8oOoTtFbCaCj3BVDB7UKy5nn_cfpIg1JsWuWA\" alt=\"\" title=\"\"><\/figure>\n\n\n\n<p>If we decide to plot the cumulative distribution curve, it is as easy as setting the cumulative argument to \u2018True\u2019.<\/p>\n\n\n\n<pre class=\"wp-block-preformatted\"><em>#plot the cumulative frequency distribution curve for the first 50 words<\/em>\nfdistribution.plot(50, cumulative=True)<\/pre>\n\n\n\n<p><strong>Output:&nbsp;<\/strong><\/p>\n\n\n\n<figure class=\"wp-block-image\"><img decoding=\"async\" src=\"https:\/\/lh4.googleusercontent.com\/Adldavqf-JseMQ0n4zlM76jlbJ88jy3FwEqPasxdTj1oims3lHN9WnW0OGnqq5h0dk1wM20RAzUFOC9FYL9LuluDEBywrznYkdyAZ9eRGdiu5bDqYNCHMFhkqiQjvwICIgtrfK6mwWoGYV-y2w\" alt=\"\" title=\"\"><\/figure>\n\n\n\n<p>Note that you need to have the matplotlib library installed on your machine for the plots to show.&nbsp;<\/p>\n\n\n\n<p>3. Hapaxes method. Sometimes, we may decide to check for words that are less frequent, or even words that appear just once, and remove them. The hapaxes() method returns the list of words that appear only once in the vocabulary. 
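Under the hood, a hapax is simply a token whose count is exactly one, so the same idea can be sketched with plain collections.Counter. The toy token list below is a hypothetical stand-in for a tokenized text; this is an illustration of the idea, not NLTK's implementation:

```python
from collections import Counter

# toy token list standing in for a tokenized text
tokens = ["the", "whale", "the", "sea", "queer", "the", "sea"]
counts = Counter(tokens)

# hapaxes: tokens whose count is exactly one
hapaxes = [word for word, count in counts.items() if count == 1]
print(hapaxes)  # → ['whale', 'queer']
```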
In this text, these are words like cetological, expostulations, contraband, lexicographers, and so on.&nbsp;<\/p>\n\n\n\n<pre class=\"wp-block-preformatted\"><em>#prints the first 50 hapaxes<\/em>\nfdistribution.hapaxes()[:50]<\/pre>\n\n\n\n<p><strong>Output:<\/strong><\/p>\n\n\n\n<pre class=\"wp-block-preformatted\">['Herman', 'Melville', ']', 'ETYMOLOGY', 'Late', 'Consumptive', 'School', 'threadbare', 'lexicons', 'mockingly', 'flags', 'mortality', 'signification', 'HACKLUYT', 'Sw', 'HVAL', 'roundness', 'Dut', 'Ger', 'WALLEN', 'WALW', 'IAN', 'RICHARDSON', 'KETOS', 'GREEK', 'CETUS', 'LATIN', 'WHOEL', 'ANGLO', 'SAXON', 'WAL', 'HWAL', 'SWEDISH', 'ICELANDIC', 'BALEINE', 'BALLENA', 'FEGEE', 'ERROMANGOAN', 'Librarian', 'painstaking', 'burrower', 'grub', 'Vaticans', 'stalls', 'higgledy', 'piggledy', 'gospel', 'promiscuously', 'commentator', 'belongest']<\/pre>\n\n\n\n<p>The uncommon words may be numerous, making it difficult to judge how important they are.&nbsp; Let\u2019s see how many hapaxes our text has.<\/p>\n\n\n\n<pre class=\"wp-block-preformatted\">len(fdistribution.hapaxes())<\/pre>\n\n\n\n<p><strong>Output:<\/strong> 9002<\/p>\n\n\n\n<p>There are over 9,000 words that occur only once in this text. Obviously, these words cannot characterize the book in any way. What do we do, then?<\/p>\n\n\n\n<h2 class=\"wp-block-heading\"><strong>Finer Filtering&nbsp;<\/strong><\/h2>\n\n\n\n<p>What if we turn our attention to longer words? The most common words in the English language are short words such as for, of, is, a, an, but, then, etc.&nbsp;<\/p>\n\n\n\n<p>We could adjust the condition for selection such that only words with more than 15 characters are returned. 
It can be done using the code below.<\/p>\n\n\n\n<pre class=\"wp-block-preformatted\"><em>#remove duplicate words&nbsp;<\/em>\neach_word = set(text)\n<em>#keep only words longer than 15 letters<\/em>\nlengthy_words = [word <strong>for<\/strong> word <strong>in<\/strong> each_word <strong>if<\/strong> len(word) &gt; 15]\n<em>#print the lengthy words<\/em>\n<strong>print<\/strong>(lengthy_words)\n<\/pre>\n\n\n\n<p><strong>Output:<\/strong><\/p>\n\n\n\n<pre class=\"wp-block-preformatted\"><code>['undiscriminating', 'irresistibleness', 'indiscriminately', 'physiognomically', 'subterraneousness', 'uncompromisedness', 'preternaturalness', 'indispensableness', 'circumnavigations', 'characteristically', 'CIRCUMNAVIGATION', 'cannibalistically', 'Physiognomically', 'superstitiousness', 'supernaturalness', 'uncomfortableness', 'hermaphroditical', 'responsibilities', 'comprehensiveness', 'uninterpenetratingly', 'apprehensiveness', 'simultaneousness', 'circumnavigating', 'circumnavigation']<\/code><\/pre>\n\n\n\n<p>While this looks like a breakthrough, informal texts sometimes contain words like harrayyyyyyyy, yeaaaaaaaaaa, waaaaaaaaaaaaat, etc. These are long words, but they certainly do not characterize the text.&nbsp;<\/p>\n\n\n\n<p>To overcome this challenge, we can select words that are both frequently used and long. Most of the informal words occur only once or twice, so they will be filtered out when we keep only long words that occur often. 
The code below shows an example.<\/p>\n\n\n\n<pre class=\"wp-block-preformatted\"><em>#invoke the FreqDist class and pass the text as parameter<\/em>\nfdistribution = nltk.FreqDist(text)\n<em>#remove duplicate words&nbsp;<\/em>\neach_word = set(text)\n<em>#keep only words longer than 10 letters that occur more than 10 times<\/em>\nwords = sorted([word <strong>for<\/strong> word <strong>in<\/strong> each_word <strong>if<\/strong> len(word) &gt; 10 <strong>and<\/strong> fdistribution[word] &gt; 10])\n<em>#print the words<\/em>\n<strong>print<\/strong>(words)\n<\/pre>\n\n\n\n<p><strong>Output:<\/strong><\/p>\n\n\n\n<pre class=\"wp-block-preformatted\"><code>['Nantucketer', 'Nevertheless', 'circumstance', 'circumstances', 'considerable', 'considering', 'continually', 'countenance', 'disappeared', 'encountered', 'exceedingly', 'experienced', 'harpooneers', 'immediately', 'indifferent', 'indispensable', 'involuntarily', 'naturalists', 'nevertheless', 'occasionally', 'peculiarities', 'perpendicular', 'significant', 'simultaneously', 'straightway', 'unaccountable']<\/code>\n<\/pre>\n\n\n\n<p>Let\u2019s take this a little further.&nbsp;<\/p>\n\n\n\n<p>In real-life applications, most keywords are not single words. Rather, they are pairs or combinations of three words. These words are called collocations. Using collocations as features typically makes for a more robust machine learning model, so they are important to our discussion. Let&#8217;s talk about collocations in more detail.&nbsp;<\/p>\n\n\n\n<h2 class=\"wp-block-heading\"><strong>What are Collocations?<\/strong><\/h2>\n\n\n\n<p>As mentioned earlier, collocations are words that mostly appear together in a document. They are a sequence of words that occurs unusually often in a document. Their frequency can be estimated by dividing the number of times two or three words appear together by the total number of words in the document.<\/p>\n\n\n\n<p>Examples of collocations include fast food, early riser, UV rays, etc. 
These words are most likely to follow each other in a document. Moreover, one of the words cannot substitute for the other. In the case of UV rays, while UV is a term on its own, using just UV in a sentence would be less clear than UV rays. Hence, it&#8217;s a collocation.&nbsp;<\/p>\n\n\n\n<p>Collocations can be classified as bigrams and trigrams.&nbsp;<\/p>\n\n\n\n<p>1. Bigrams are a combination of two words. The <a href=\"https:\/\/www.h2kinfosys.com\/blog\/how-to-install-nltk-on-windows-mac-or-linux\/\">NLTK library<\/a> has a built-in method, bigrams(), that can be used to extract bigrams in a document. See the example below.<\/p>\n\n\n\n<pre class=\"wp-block-preformatted\"><em>#prints the bigrams for the first 50 words in the text<\/em>\n<strong>print<\/strong>(list(nltk.bigrams(text[:50])))<\/pre>\n\n\n\n<p><strong>Output:<\/strong><\/p>\n\n\n\n<pre class=\"wp-block-preformatted\">[('[', 'Moby'), ('Moby', 'Dick'), ('Dick', 'by'), ('by', 'Herman'), ('Herman', 'Melville'), ('Melville', '1851'), ('1851', ']'), (']', 'ETYMOLOGY'), ('ETYMOLOGY', '.'), ('.', '('), ('(', 'Supplied'), ('Supplied', 'by'), ('by', 'a'), ('a', 'Late'), ('Late', 'Consumptive'), ('Consumptive', 'Usher'), ('Usher', 'to'), ('to', 'a'), ('a', 'Grammar'), ('Grammar', 'School'), ('School', ')'), (')', 'The'), ('The', 'pale'), ('pale', 'Usher'), ('Usher', '--'), ('--', 'threadbare'), ('threadbare', 'in'), ('in', 'coat'), ('coat', ','), (',', 'heart'), ('heart', ','), (',', 'body'), ('body', ','), (',', 'and'), ('and', 'brain'), ('brain', ';'), (';', 'I'), ('I', 'see'), ('see', 'him'), ('him', 'now'), ('now', '.'), ('.', 'He'), ('He', 'was'), ('was', 'ever'), ('ever', 'dusting'), ('dusting', 'his'), ('his', 'old'), ('old', 'lexicons'), ('lexicons', 'and')]<\/pre>\n\n\n\n<p>As seen, the bigrams() method splits the words into pairs.<\/p>\n\n\n\n<p>2. Trigrams, as the name suggests, are the combination of three words that follow each other. 
To extract the trigrams in a text, the trigrams() method is called, passing the tokens as an argument. Let&#8217;s see the example below.<\/p>\n\n\n\n<pre class=\"wp-block-preformatted\"><em>#prints the trigrams for the first 50 words in the text<\/em>\n<strong>print<\/strong>(list(nltk.trigrams(text[:50])))<\/pre>\n\n\n\n<p><strong>Output:<\/strong><\/p>\n\n\n\n<pre class=\"wp-block-preformatted\">[('[', 'Moby', 'Dick'), ('Moby', 'Dick', 'by'), ('Dick', 'by', 'Herman'), ('by', 'Herman', 'Melville'), ('Herman', 'Melville', '1851'), ('Melville', '1851', ']'), ('1851', ']', 'ETYMOLOGY'), (']', 'ETYMOLOGY', '.'), ('ETYMOLOGY', '.', '('), ('.', '(', 'Supplied'), ('(', 'Supplied', 'by'), ('Supplied', 'by', 'a'), ('by', 'a', 'Late'), ('a', 'Late', 'Consumptive'), ('Late', 'Consumptive', 'Usher'), ('Consumptive', 'Usher', 'to'), ('Usher', 'to', 'a'), ('to', 'a', 'Grammar'), ('a', 'Grammar', 'School'), ('Grammar', 'School', ')'), ('School', ')', 'The'), (')', 'The', 'pale'), ('The', 'pale', 'Usher'), ('pale', 'Usher', '--'), ('Usher', '--', 'threadbare'), ('--', 'threadbare', 'in'), ('threadbare', 'in', 'coat'), ('in', 'coat', ','), ('coat', ',', 'heart'), (',', 'heart', ','), ('heart', ',', 'body'), (',', 'body', ','), ('body', ',', 'and'), (',', 'and', 'brain'), ('and', 'brain', ';'), ('brain', ';', 'I'), (';', 'I', 'see'), ('I', 'see', 'him'), ('see', 'him', 'now'), ('him', 'now', '.'), ('now', '.', 'He'), ('.', 'He', 'was'), ('He', 'was', 'ever'), ('was', 'ever', 'dusting'), ('ever', 'dusting', 'his'), ('dusting', 'his', 'old'), ('his', 'old', 'lexicons'), ('old', 'lexicons', 'and')]<\/pre>\n\n\n\n<p>Here, the words are split into threes.&nbsp;<\/p>\n\n\n\n<p>Bigrams and trigrams are especially useful for extracting features in text-based sentiment analysis. 
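Counting n-grams combines the two ideas in this tutorial. As a stdlib-only sketch (with zip() standing in for nltk.bigrams() and a hypothetical toy token list), bigrams can be built and counted like this:

```python
from collections import Counter

# toy token list standing in for a tokenized text
tokens = ["the", "white", "whale", "chased", "the", "white", "whale"]

# adjacent pairs, equivalent to what nltk.bigrams(tokens) yields
bigrams = list(zip(tokens, tokens[1:]))
bigram_counts = Counter(bigrams)
print(bigram_counts[("white", "whale")])  # → 2
```

Pairs that recur often, like ("white", "whale") here, are exactly the collocation candidates described above.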
A high concentration of bigrams or trigrams in a text is an indication of a keyword or a key feature.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>In the last tutorial, we discussed how to assign POS tags to words in a sentence using the pos_tag method of NLTK. We said that POS tagging is a fundamental step in the preprocessing of textual data and is especially needed when building text classification models. We went further to discuss Hidden Markov Models (HMMs) [&hellip;]<\/p>\n","protected":false},"author":1,"featured_media":5223,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[498],"tags":[],"class_list":["post-5142","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-artificial-intelligence-tutorials"],"_links":{"self":[{"href":"https:\/\/www.h2kinfosys.com\/blog\/wp-json\/wp\/v2\/posts\/5142","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.h2kinfosys.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.h2kinfosys.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.h2kinfosys.com\/blog\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/www.h2kinfosys.com\/blog\/wp-json\/wp\/v2\/comments?post=5142"}],"version-history":[{"count":0,"href":"https:\/\/www.h2kinfosys.com\/blog\/wp-json\/wp\/v2\/posts\/5142\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.h2kinfosys.com\/blog\/wp-json\/wp\/v2\/media\/5223"}],"wp:attachment":[{"href":"https:\/\/www.h2kinfosys.com\/blog\/wp-json\/wp\/v2\/media?parent=5142"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.h2kinfosys.com\/blog\/wp-json\/wp\/v2\/categories?post=5142"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.h2kinfosys.com\/blog\/wp-json\/wp\/v2\/tags?post=5142"}],"curies":[{"name":"wp","href":"https:\/\/api.w.or
g\/{rel}","templated":true}]}}