{"id":4915,"date":"2020-09-21T17:12:33","date_gmt":"2020-09-21T11:42:33","guid":{"rendered":"https:\/\/www.h2kinfosys.com\/blog\/?p=4915"},"modified":"2022-09-11T13:41:39","modified_gmt":"2022-09-11T08:11:39","slug":"nltk-regular-expressions","status":"publish","type":"post","link":"https:\/\/www.h2kinfosys.com\/blog\/nltk-regular-expressions\/","title":{"rendered":"NLTK Regular Expressions"},"content":{"rendered":"\n<p>When working with text, you often need to find or replace words that follow a particular pattern. For instance, you may wish to find words that end with \u201cal\u201d when carrying out data wrangling. Regular expressions are a convenient way to do this in <a href=\"https:\/\/www.h2kinfosys.com\/blog\/natural-language-processing-nlp-tutorial\/\">Natural Language Processing<\/a>. A regular expression is a pattern used to find, split, or replace text. Regular expressions can help you extract key information from dirty data during data analysis, such as dates, prices, customers\u2019 email addresses, or telephone numbers.<\/p>\n\n\n\n<p>Regular expressions are also useful beyond simple matching. You may want to preprocess the format or markup of the text in a document, for example to check that the first word of each sentence begins with a capital letter or that every question ends with a question mark. During web scraping, you may want to extract the text inside a particular tag. You can, for instance, extract the text in the &lt;abrev&gt;&lt;\/abrev&gt; tags and build a list of abbreviations from it.<\/p>\n\n\n\n<p>Regular expressions have become very popular over the years. Today, many programming languages, including Java, Python, C, and Perl, support them. In this tutorial, you will learn how to use regular expressions in Python. We will cover their use cases and work through some examples. 
Without further ado, let&#8217;s jump into it.<\/p>\n\n\n\n<p>In Python, regular expressions are provided by the built-in re module, which you import as follows:<\/p>\n\n\n\n<p><strong>import<\/strong> <strong>re<\/strong><\/p>\n\n\n\n<figure class=\"wp-block-embed is-type-video is-provider-youtube wp-block-embed-youtube wp-embed-aspect-16-9 wp-has-aspect-ratio\"><div class=\"wp-block-embed__wrapper\">\n<iframe title=\"Artificial Intelligence Introduction Class | Artificial Intelligence Tutorial For Beginners |\" width=\"800\" height=\"450\" src=\"https:\/\/www.youtube.com\/embed\/Vd3CLCyjesg?feature=oembed\" frameborder=\"0\" allow=\"accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share\" referrerpolicy=\"strict-origin-when-cross-origin\" allowfullscreen><\/iframe>\n<\/div><\/figure>\n\n\n\n<p><strong>Regular Expression Building Blocks<\/strong><\/p>\n\n\n\n<ol class=\"wp-block-list\"><li>The Wildcard<\/li><\/ol>\n\n\n\n<p>The \u201c.\u201d symbol is referred to as the wildcard because it matches any single character (except a newline, by default). The regular expression \u201cdr.nk\u201d, for instance, matches the words drink, drank, and drunk. Note that the \u201c.\u201d matches exactly one character, so to match two or more characters you repeat it once per character. For example, \u201c..ng\u201d matches any four-letter word that ends in \u201cng\u201d.<\/p>\n\n\n\n<ol class=\"wp-block-list\"><li>Repeatability<\/li><\/ol>\n\n\n\n<p>The \u201c+\u201d sign indicates that the immediately preceding character occurs one or more times. The expression \u201cbrus+h\u201d matches brush, brussh, brusssh, and so on. The + symbol particularly shines when used alongside the \u201c.\u201d symbol. 
The expression \u201cb.+h\u201d matches any word that starts with the letter b and ends with the letter h, with at least one character in between. The expression \u201c.+ing\u201d matches any word that ends with the suffix -ing.<\/p>\n\n\n\n<p>The \u201c*\u201d symbol indicates that the immediately preceding character occurs zero or more times, so it is both optional and repeatable. The expression \u201c.*fit.*\u201d matches all words that contain \u201cfit\u201d, including \u201cfit\u201d itself.<\/p>\n\n\n\n<ol class=\"wp-block-list\"><li>Optionality<\/li><\/ol>\n\n\n\n<p>The \u201c?\u201d symbol indicates that the immediately preceding character is optional. The expression \u201codou?r\u201d matches both \u201codor\u201d and \u201codour\u201d. It works with punctuation such as a hyphen too: the expression \u201ce-?mail\u201d matches both \u201cemail\u201d and \u201ce-mail\u201d.<\/p>\n\n\n\n<ol class=\"wp-block-list\"><li>Choices<\/li><\/ol>\n\n\n\n<p>While the wildcard matches any character, there are situations where you want to limit the choice to a few characters. The \u201c[]\u201d notation serves this purpose. The expression \u201cf[aeiou]n\u201d matches words like fan, fen, fin, and fun. You can add flexibility with the + symbol, which, as explained earlier, lets the preceding item repeat. The expression \u201cp[aeiou]+t\u201d matches words like pout, poet, and peat.<\/p>\n\n\n\n<ol class=\"wp-block-list\"><li>Ranges<\/li><\/ol>\n\n\n\n<p>When using the [] notation, you have to list every allowed character individually. If the characters form a contiguous range, you can instead put a \u201c-\u201d between the first and last characters. The expression [a-z], for instance, matches any single lowercase letter.<\/p>\n\n\n\n<p>When you combine ranges with other symbols, you can do even more powerful things. The expression [A-Z]+ matches words written entirely in capital letters, such as acronyms or abbreviations. 
[a-zA-Z] matches any lowercase or uppercase letter.<\/p>\n\n\n\n<p>There are other important metacharacters such as $, ^, \\w, \\t, etc. The table below lists common metacharacters and what they match.<\/p>\n\n\n\n<figure class=\"wp-block-table\"><table><tbody><tr><td>Notation<\/td><td>Characteristics<\/td><\/tr><tr><td>.<\/td><td>Used to match any single character (except a newline, by default)<\/td><\/tr><tr><td>*<\/td><td>Used to match zero or more of the preceding item<\/td><\/tr><tr><td>+<\/td><td>Used to match one or more of the preceding item<\/td><\/tr><tr><td>?<\/td><td>Used to match zero or one of the preceding item<\/td><\/tr><tr><td>^xyz<\/td><td>Used to match the pattern xyz at the beginning of a string<\/td><\/tr><tr><td>xyz$<\/td><td>Used to match the pattern xyz at the end of a string<\/td><\/tr><tr><td>[xyz]<\/td><td>Used to match any one character from the selection<\/td><\/tr><tr><td>[^xyz]<\/td><td>Used to match any one character not listed in the square brackets<\/td><\/tr><tr><td>[A-Z0-9]<\/td><td>Used to match one character that is an uppercase letter or a digit<\/td><\/tr><tr><td>{n}<\/td><td>Used to match exactly n repeats. Note that n must be a non-negative integer.<\/td><\/tr><tr><td>{n,}<\/td><td>Used to match at least n repeats<\/td><\/tr><tr><td>{,n}<\/td><td>Used to match not more than n repeats<\/td><\/tr><tr><td>{m,n}<\/td><td>Used to match at least m but not more than n repeats<\/td><\/tr><tr><td>\\.<\/td><td>Used to match a literal period; a backslash before any metacharacter matches that character literally<\/td><\/tr><tr><td>\\s<\/td><td>Used to match a whitespace character such as space, newline, tab, etc<\/td><\/tr><tr><td>\\S<\/td><td>Used to match a non-whitespace character<\/td><\/tr><tr><td>\\w<\/td><td>Used to match a word character: a letter, a digit, or an underscore<\/td><\/tr><tr><td>\\W<\/td><td>Used to match a non-word character<\/td><\/tr><tr><td>\\d<\/td><td>Used to specifically match a digit i.e. 
[0-9]<\/td><\/tr><tr><td>\\D<\/td><td>Used to match a non-digit<\/td><\/tr><tr><td>\\b<\/td><td>Used to match a word boundary<\/td><\/tr><tr><td>()<\/td><td>Used to group part of a regular expression and capture the matched text<\/td><\/tr><tr><td>[^\\W\\d_]<\/td><td>Used to match letters alone<\/td><\/tr><\/tbody><\/table><\/figure>\n\n\n\n<p><strong>Regular Expression Functions<\/strong><\/p>\n\n\n\n<p>The regular expression module has several functions used for different purposes. To have a rounded understanding of how to effectively apply the <a href=\"https:\/\/en.wikipedia.org\/wiki\/Regular_expression\" rel=\"nofollow noopener\" target=\"_blank\">Regexp module<\/a>, let\u2019s discuss some of the most useful functions.<\/p>\n\n\n\n<p>re.split(<em>pattern, string, <\/em>[<em>maxsplit=0<\/em>]): This function splits a string wherever the pattern matches and returns the pieces as a list. Let\u2019s see an example.<\/p>\n\n\n\n<pre class=\"wp-block-preformatted\"><em>#import the regular expression library<\/em>\n<strong>import<\/strong> <strong>re<\/strong>\n<em>#splits the string 'Artificial Intelligence' at every lowercase 'i'<\/em>\ntext = re.split(r'i', 'Artificial Intelligence')\n<em>#prints the result<\/em>\n<strong>print<\/strong>(text)<\/pre>\n\n\n\n<p>Output:<\/p>\n\n\n\n<p><code>['Art', 'f', 'c', 'al Intell', 'gence']<\/code><\/p>\n\n\n\n<p>As seen in the result, \u2018Artificial Intelligence\u2019 was split at every lowercase \u2018i\u2019; the capital \u2018I\u2019 is untouched because matching is case-sensitive. The split function accepts a third argument, maxsplit, which sets the maximum number of splits to perform. It defaults to zero, which means there is no limit. When the pattern occurs more than once and you only want the first few splits, set maxsplit explicitly. 
Let\u2019s see an example with maxsplit=2.<\/p>\n\n\n\n<pre class=\"wp-block-preformatted\"><em>#import the regular expression library<\/em>\n<strong>import<\/strong> <strong>re<\/strong>\n<em>#splits 'Artificial Intelligence' at 'i', at most twice<\/em>\ntext = re.split(r'i', 'Artificial Intelligence', maxsplit=2)\n<em>#prints the result<\/em>\n<strong>print<\/strong>(text)<\/pre>\n\n\n\n<p>Output:<\/p>\n\n\n\n<p><code>['Art', 'f', 'cial Intelligence']<\/code><\/p>\n\n\n\n<p>As seen, the text was not split again after the second \u2018i\u2019.<\/p>\n\n\n\n<ol class=\"wp-block-list\"><li>re.match(<em>pattern, string<\/em>): This method checks for a match only at the beginning of the string. Matching \u2018Artificial\u2019 against \u2018Artificial Intelligence\u2019 therefore succeeds. Let\u2019s see an example.<\/li><\/ol>\n\n\n\n<pre class=\"wp-block-preformatted\"><em>#import the regular expression library<\/em>\n<strong>import<\/strong> <strong>re<\/strong>\n<em>#checks if there is a match at the start of the string<\/em>\ntext = re.match(r'Artificial', 'Artificial Intelligence')\n<em>#prints the result<\/em>\n<strong>print<\/strong>(text)<\/pre>\n\n\n\n<p>Output:<\/p>\n\n\n\n<p><code>&lt;re.Match object; span=(0, 10), match='Artificial'&gt;<\/code><\/p>\n\n\n\n<p>The span shows that the match starts at index 0 and ends just before index 10. If, however, we attempt to match \u2018Intelligence\u2019 in \u2018Artificial Intelligence\u2019, match() returns None, indicating that there is no match at the start of the string.<\/p>\n\n\n\n<ol class=\"wp-block-list\" start=\"2\"><li>re.search(<em>pattern, string<\/em>): This method works similarly to the match() method but does not restrict its search to the beginning of the string. It looks for the pattern anywhere in the string but returns only the first occurrence. 
Let\u2019s see an example.<\/li><\/ol>\n\n\n\n<pre class=\"wp-block-preformatted\"><em>#import the regular expression library<\/em>\n<strong>import<\/strong> <strong>re<\/strong>\n<em>#checks whether there is a match anywhere in the string<\/em>\ntext = re.search(r'Intelligence', 'Artificial Intelligence Intelligence')\n<em>#prints the result<\/em>\n<strong>print<\/strong>(text)<\/pre>\n\n\n\n<p>Output:<\/p>\n\n\n\n<p><code>&lt;re.Match object; span=(11, 23), match='Intelligence'&gt;<\/code><\/p>\n\n\n\n<p>The result shows that the match starts at index 11 and ends just before index 23. Observe that even though the word appears a second time, the search() method does not return it.<\/p>\n\n\n\n<ol class=\"wp-block-list\" start=\"3\"><li>re.findall(<em>pattern, string<\/em>): This method returns every non-overlapping match of the pattern as a list. Unlike match() and search(), it does not stop at the beginning of the string or at the first occurrence, which makes it a convenient choice when you need all occurrences rather than just one. Let\u2019s see an example where the findall() method is used.<\/li><\/ol>\n\n\n\n<pre class=\"wp-block-preformatted\"><em>#import the regular expression library<\/em>\n<strong>import<\/strong> <strong>re<\/strong>\n<em>#finds every occurrence of the word 'Intelligence' in the string<\/em>\ntext = re.findall(r'Intelligence', 'Artificial Intelligence Intelligence')\n<em>#prints the result<\/em>\n<strong>print<\/strong>(text)<\/pre>\n\n\n\n<p>Output:<\/p>\n\n\n\n<p><code>['Intelligence', 'Intelligence']<\/code><\/p>\n\n\n\n<p>4. re.sub(<em>pattern, repl, string<\/em>): This method finds every occurrence of a pattern and replaces it with a new string. 
Let\u2019s take an example.<\/p>\n\n\n\n<pre class=\"wp-block-preformatted\"><em>#import the regular expression library<\/em>\n<strong>import<\/strong> <strong>re<\/strong>\n<em>#replaces the word 'Artificial' with 'Emotional'<\/em>\ntext = re.sub(r'Artificial', 'Emotional', 'Artificial Intelligence')\n<em>#prints the result<\/em>\n<strong>print<\/strong>(text)<\/pre>\n\n\n\n<p>Output:<\/p>\n\n\n\n<p><code>Emotional Intelligence<\/code><\/p>\n\n\n\n<p>In cases where the pattern is not found, the returned string remains the same.<\/p>\n\n\n\n<p><strong>Tokenizing Sentences with NLTK\u2019s RegexpTokenizer<\/strong><\/p>\n\n\n\n<p>In earlier tutorials, we used nltk.word_tokenize() to tokenize a piece of text. Regular expressions can be used for tokenization as well, through the RegexpTokenizer class or the regexp_tokenize() helper function. This approach gives you more control over how the text is tokenized. Let\u2019s take some examples.<\/p>\n\n\n\n<pre class=\"wp-block-preformatted\"><em>#import the RegexpTokenizer class<\/em>\n<strong>from<\/strong> <strong>nltk.tokenize<\/strong> <strong>import<\/strong> RegexpTokenizer\n<em>#instantiate the tokenizer with the regular expression pattern as an argument<\/em>\ntokenizer = RegexpTokenizer(r\"[\\w']+\")\n<em>#define a text<\/em>\ntext = \"I won't stop learning about Artificial Intelligence\"\n<em>#tokenize the text and print the tokens<\/em>\n<strong>print<\/strong>(tokenizer.tokenize(text))<\/pre>\n\n\n\n<p>Output:<\/p>\n\n\n\n<p><code>['I', \"won't\", 'stop', 'learning', 'about', 'Artificial', 'Intelligence']<\/code><\/p>\n\n\n\n<p>We can do more interesting things with the RegexpTokenizer class. Say, for instance, we want to extract the domain of an email address, together with the @ sign. 
Only the regular expression pattern changes in the code.<\/p>\n\n\n\n<pre class=\"wp-block-preformatted\"><em>#import the RegexpTokenizer class<\/em>\n<strong>from<\/strong> <strong>nltk.tokenize<\/strong> <strong>import<\/strong> RegexpTokenizer\n<em>#instantiate the tokenizer with the regular expression pattern as an argument<\/em>\ntokenizer = RegexpTokenizer(r\"@\\w+\\.\\w+\")\n<em>#define an email<\/em>\nemail = 'training@h2kinfosys.com'\n<em>#tokenize the text and print the tokens<\/em>\n<strong>print<\/strong>(tokenizer.tokenize(email))<\/pre>\n\n\n\n<p>Output:<\/p>\n\n\n\n<p><code>['@h2kinfosys.com']<\/code><\/p>\n\n\n\n<p>If you do not wish to instantiate the RegexpTokenizer class, there\u2019s also a helper function, regexp_tokenize(), that does the same job in one call. It takes two required parameters: the text to be tokenized and the pattern to tokenize with. Let\u2019s see an example.<\/p>\n\n\n\n<pre class=\"wp-block-preformatted\"><em>#import the regexp_tokenize function<\/em>\n<strong>from<\/strong> <strong>nltk.tokenize<\/strong> <strong>import<\/strong> regexp_tokenize\n<em>#define a text<\/em>\ntext = \"I won't stop learning about Artificial Intelligence\"\n<em>#tokenize the text<\/em>\ntokenized_text = regexp_tokenize(text, r\"[\\w']+\")\n<em>#print the tokens<\/em>\n<strong>print<\/strong>(tokenized_text)<\/pre>\n\n\n\n<p>Output:<\/p>\n\n\n\n<p><code>['I', \"won't\", 'stop', 'learning', 'about', 'Artificial', 'Intelligence']<\/code><\/p>\n\n\n\n<p>As seen, the result is the same as in the earlier example, with less code this time.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>When dealing with textual data, you may be required to find or replace words that follow a particular pattern. For instance, you may wish to find words that end with \u201cal\u201d when carrying out data wrangling. Using regular expressions is an easy way to go about this in Natural Language Processing. 
It is a powerful [&hellip;]<\/p>\n","protected":false},"author":1,"featured_media":4951,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[498],"tags":[1388,1387],"class_list":["post-4915","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-artificial-intelligence-tutorials","tag-nltk-regular-expressions","tag-regular-expression-building-blocks"],"_links":{"self":[{"href":"https:\/\/www.h2kinfosys.com\/blog\/wp-json\/wp\/v2\/posts\/4915","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.h2kinfosys.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.h2kinfosys.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.h2kinfosys.com\/blog\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/www.h2kinfosys.com\/blog\/wp-json\/wp\/v2\/comments?post=4915"}],"version-history":[{"count":0,"href":"https:\/\/www.h2kinfosys.com\/blog\/wp-json\/wp\/v2\/posts\/4915\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.h2kinfosys.com\/blog\/wp-json\/wp\/v2\/media\/4951"}],"wp:attachment":[{"href":"https:\/\/www.h2kinfosys.com\/blog\/wp-json\/wp\/v2\/media?parent=4915"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.h2kinfosys.com\/blog\/wp-json\/wp\/v2\/categories?post=4915"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.h2kinfosys.com\/blog\/wp-json\/wp\/v2\/tags?post=4915"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}