Artificial Intelligence Tutorials

NLTK Regular Expressions

When dealing with textual data, you may need to find or replace words that follow a particular pattern. For instance, you may wish to find words that end with “al” while carrying out data wrangling. Regular expressions are an easy way to do this in Natural Language Processing. They are a powerful tool for finding, splitting, or replacing words according to some pattern. Regular expressions can help you extract key information from dirty data during data analysis: you can quickly pull out dates, the prices of goods, customers’ email addresses, or their telephone numbers. 

You can also go beyond simple pattern matching with regular expressions. You may want to preprocess the format or markup of text in a document. You may want to ensure that the first word in a sentence begins with a capital letter, or that sentences phrased as questions end with a question mark. During web scraping, you may want to extract text with a particular tag. You can, for instance, extract the text in the <abrev></abrev> tag and create a list of abbreviations from the extracted text. 

Regular expressions have become very popular over the years. At the moment, many programming languages such as Java, Python, C, Perl, and many more support regular expressions. In this tutorial, you will learn how to use regular expressions in Python. We’ll go further by treating their use cases and taking some examples. Without further ado, let’s jump into it. 

Let’s start by noting that you make use of regular expressions in Python by importing the re module:

import re  

Regular Expression Building Blocks

  1. The Wildcard 

The “.” symbol is referred to as the wildcard. This is because it is used to match any single character. If we create a regular expression “dr.nk”, for instance, it would match the words drink, drank, and drunk. Note that the “.” matches just one character. This implies that where we want to match two or more characters, the “.” character should be repeated for as many characters. For example, “..ng” matches all four-lettered words that end in “ng”. 
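A quick sketch of the wildcard in action, using the re.findall() function (covered in more detail later in this tutorial) on a made-up sentence:

```python
import re

# sample text with several one-character variations of the same stem
text = "I drink water, he drank juice, they had drunk tea"

# "dr.nk" matches 'd', 'r', any single character, 'n', 'k'
print(re.findall(r"dr.nk", text))   # ['drink', 'drank', 'drunk']

# "..ng" between word boundaries matches four-letter words ending in 'ng'
print(re.findall(r"\b..ng\b", "sing a long song"))  # ['sing', 'long', 'song']
```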

  2. Repeatability

The “+” sign is used to indicate that the immediately preceding character can be repeated one or more times. The expression “brus+h” matches words such as brush, brussh, brusssh, and so on. The + symbol particularly shines when used alongside the “.” symbol. The expression “b.+h” matches any word that starts with the letter b and ends with the letter h. The expression “.+ing” matches any word that ends with the suffix -ing. 

The “*” symbol is used to indicate that the immediately preceding character is optional and repeatable (it may appear zero or more times). The expression “.*fit.*” matches all words that contain “fit”, including “fit” itself.
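The two repetition symbols can be sketched like this (the word lists are purely illustrative):

```python
import re

# + : the preceding character must appear one or more times
print(re.findall(r"brus+h", "brush brussh brusssh brsh"))
# ['brush', 'brussh', 'brusssh']  ('brsh' has no 's', so it is skipped)

# .* before and after 'fit' allows any (possibly empty) surrounding text
words = ["fit", "outfit", "fitness", "benefit", "fan"]
print([w for w in words if re.fullmatch(r".*fit.*", w)])
# ['fit', 'outfit', 'fitness', 'benefit']
```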

  3. Optionality

The “?” symbol is used to indicate that the immediately preceding character is not compulsory. The expression “odou?r” matches both “odor” and “odour”. The symbol can also be used alongside punctuation such as a hyphen. The expression “e-?mail” matches both “email” and “e-mail”.
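A minimal sketch of optionality, using re.fullmatch() to test whole words:

```python
import re

# ? makes the preceding character optional
for word in ["odor", "odour", "odur"]:
    print(word, bool(re.fullmatch(r"odou?r", word)))
# odor True, odour True, odur False

# the same trick works with punctuation such as a hyphen
print(bool(re.fullmatch(r"e-?mail", "e-mail")))  # True
```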

  4. Choices

While the wildcard allows you to select any character, there are situations where you may want to limit the character choices to a few options. The “[]” notation is used for this purpose. The expression “f[aeiou]n” matches words like fan, fen, fin, and fun. You may add a little flexibility with the + symbol. As explained earlier, the + symbol allows the selected character to repeat. The expression “p[aeiou]+t” matches words like pout, poet, and peat.
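The character-class examples above can be tried out directly (the sample strings are made up for illustration):

```python
import re

# only the bracketed vowels are allowed between 'f' and 'n'
print(re.findall(r"f[aeiou]n", "a fan, a fen, a fin, and a fun fxn"))
# ['fan', 'fen', 'fin', 'fun']  ('fxn' is rejected)

# + lets one or more vowels appear between 'p' and 't'
print(re.findall(r"p[aeiou]+t", "pout poet peat pit pt"))
# ['pout', 'poet', 'peat', 'pit']  ('pt' has no vowel at all)
```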

  5. Ranges 

When using the [] notation, you have to list all the characters to choose from individually. But if these characters fall within a range, you can use a “-” between the first and last characters. The expression [a-z], for instance, captures all lowercase letters. 

When you combine ranges with other symbols, you can do even more powerful things. The expression [A-Z]+ matches words written entirely in capital letters, such as acronyms or abbreviations. [a-zA-Z] matches any lowercase or uppercase letter. 
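Ranges combined with + and word boundaries can be sketched as follows (the sentence is a made-up example):

```python
import re

text = "NASA and the FBI both rely on NLP tools"
# [A-Z]+ between word boundaries picks out fully capitalised words
print(re.findall(r"\b[A-Z]+\b", text))  # ['NASA', 'FBI', 'NLP']

# [a-zA-Z]+ matches runs of letters, skipping digits and punctuation
print(re.findall(r"[a-zA-Z]+", "Route 66, exit 4b"))  # ['Route', 'exit', 'b']
```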

There are other important metacharacters such as $, ^, \w, \t, etc. The table below shows the metacharacters and their applications. 

Notation      Characteristics
.             Matches any single character
*             Matches zero or more of the preceding item
+             Matches one or more of the preceding item
?             Matches zero or one of the preceding item
^xyz          Matches the pattern xyz at the beginning of a string
xyz$          Matches the pattern xyz at the end of a string
[xyz]         Matches one character from the selection
[^xyz]        Matches any character not in the square brackets
[A-Z0-9]      Matches a character from a range of uppercase letters or digits
{n}           Matches exactly n repeats; n must be a non-negative integer
{n,}          Matches at least n repeats
{,n}          Matches not more than n repeats
{m,n}         Matches at least m but not more than n repeats
\.            Matches the “.” symbol literally
\s            Matches a whitespace character such as space, newline, tab, etc.
\S            Matches a non-whitespace character
\w            Matches an alphanumeric character (including the underscore)
\W            Matches a non-alphanumeric character
\d            Matches a digit, i.e. [0-9]
\D            Matches a non-digit
\b            Matches a word boundary
()            Groups regular expressions and returns the matched text
[^\W\d_]      Matches letters alone
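A few of the metacharacters from the table can be sketched with re.findall() on a made-up string (the order number and date are purely illustrative):

```python
import re

text = "Order #42 shipped on 2023-05-01 to user_7"

# \d+ grabs every run of digits
print(re.findall(r"\d+", text))                 # ['42', '2023', '05', '01', '7']

# {n} repeat counts let us target an ISO-style date
print(re.findall(r"\d{4}-\d{2}-\d{2}", text))   # ['2023-05-01']

# \w includes the underscore, so 'user_7' stays in one piece
print(re.findall(r"\b\w+\b", "to user_7"))      # ['to', 'user_7']
```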

Regular Expression Functions 

The regular expression module has a couple of functions used for different purposes. To have a rounded understanding of how to effectively apply the re module, let’s discuss some of the most useful functions.

re.split(pattern, string, maxsplit=0): This function splits a string according to some defined pattern. Let’s see an example.

#import the regular expression library
import re   
#splits the string 'Artificial Intelligence' by 'i'
text = re.split(r'i', 'Artificial Intelligence')
#prints the result
print(text)

Output:

['Art', 'f', 'c', 'al Intell', 'gence']

As seen in the result, ‘Artificial Intelligence’ was split by ‘i’. There is a third argument that can be defined when using the split method – maxsplit. Maxsplit indicates the maximum number of splits that can be done and is set to zero by default, which means there is no limit. In cases where the character to split by appears more than once, it is good practice to define maxsplit. Let’s see an example with maxsplit=2.

#import the regular expression library
import re   
#splits the string 'Artificial Intelligence' by 'i', at most twice
text = re.split(r'i', 'Artificial Intelligence', maxsplit=2)
#prints the result
print(text)

Output: 

['Art', 'f', 'cial Intelligence']

As seen, the text was not split after the second ‘i’.

  1. re.match(pattern, string): This method checks for a match in a string. It matches only if the defined pattern occurs at the beginning of the string. Trying to match ‘Artificial’ in ‘Artificial Intelligence’ will therefore succeed. Let’s see an example.
#import the regular expression library
import re   
#checks if there is a match
text = re.match(r'Artificial', 'Artificial Intelligence')
#prints the result
print(text)

Output:

<re.Match object; span=(0, 10), match='Artificial'>

The result indicates that there is a match spanning index 0 to index 10 (the end index is exclusive). If, however, we attempt to match ‘Intelligence’ in ‘Artificial Intelligence’, the program would return a None value, indicating that there is no match.
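The failing case can be demonstrated in one line:

```python
import re

# match() only succeeds at the very beginning of the string,
# so a pattern that starts mid-string yields None
print(re.match(r'Intelligence', 'Artificial Intelligence'))  # None
```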

  2. re.search(pattern, string): This method works similarly to the match() method but does not restrict its search to the beginning of the string. It searches for the pattern anywhere in the string but returns only the first occurrence. Let’s see an example.
#import the regular expression library
import re   
#checks whether there is a match
text = re.search(r'Intelligence', 'Artificial Intelligence Intelligence')
#prints the result
print(text)

Output: 

<re.Match object; span=(11, 23), match='Intelligence'>

The result shows that the match spans from index 11 to index 23 (exclusive). Observe that even though the word appears a second time, the search() method does not pick it up. 

  3. re.findall(pattern, string): This method is used to get all the occurrences that match the pattern. Unlike the match() and search() methods, it is constrained neither to the beginning of the string nor to a single occurrence. The findall() method is the most commonly used, since it can do the work of both match() and search(). Let’s see an example where the findall() method is used.
#import the regular expression library
import re   
#finds the word 'Intelligence' in the string
text = re.findall(r'Intelligence', 'Artificial Intelligence Intelligence')
#prints the result
print(text)

Output:

['Intelligence', 'Intelligence']

4. re.sub(pattern, repl, string): This method is used to find a pattern and replace it with a new string. Let’s take an example.

#import the regular expression library
import re   
#replaces the word 'Artificial' with 'Emotional'
text = re.sub(r'Artificial', 'Emotional', 'Artificial Intelligence')
#prints the result
print(text)

Output:

Emotional Intelligence

In cases where the pattern is not found, the string is returned unchanged.
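A quick sketch of that no-match case, using a pattern (‘Robotic’) that does not occur in the string:

```python
import re

# 'Robotic' never appears, so the original string comes back untouched
print(re.sub(r'Robotic', 'Emotional', 'Artificial Intelligence'))
# Artificial Intelligence
```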

Tokenizing Sentences with NLTK’s RegexpTokenizer

In earlier tutorials, we have used nltk.word_tokenize() to carry out tokenization on a piece of text. It may also interest you to know that regular expressions can as well be used for tokenization. This is done using the RegexpTokenizer class or the regexp_tokenize() helper function. Interestingly, this method gives you more control over how the text will be tokenized. Let’s take some examples. 

#import the RegexpTokenizer library
from nltk.tokenize import RegexpTokenizer
#instantiate the tokenize class with the regular expression rule as an argument
tokenizer = RegexpTokenizer(r"[\w']+")
#define a text
text = "I won't stop learning about Artificial Intelligence"
#tokenize the text
tokenizer.tokenize(text)

Output:

['I', "won't", 'stop', 'learning', 'about', 'Artificial', 'Intelligence']

We can go ahead and do more interesting things with the RegexpTokenizer class. Say, for instance, we want to extract the domain name of an email address. The only thing that changes in the code is the regular expression pattern.

#import the RegexpTokenizer library
from nltk.tokenize import RegexpTokenizer
#instantiate the tokenize class with the regular expression pattern as an argument
tokenizer = RegexpTokenizer(r"@\w+\.\w+")
#define an email
email = 'training@h2kinfosys.com'
#tokenize the text
tokenizer.tokenize(email)

Output:

['@h2kinfosys.com']

Going forward, if you do not wish to instantiate the RegexpTokenizer class, there’s also a helper function, regexp_tokenize(), that can quickly be used. regexp_tokenize() takes two compulsory parameters: the text to be tokenized and the pattern to work with. Let’s see an example. 

#import the regexp_tokenize function
from nltk.tokenize import regexp_tokenize
#define a text
text = "I won't stop learning about Artificial Intelligence"
#tokenize the text
tokenized_text = regexp_tokenize(text, r"[\w']+")
#print the result
print(tokenized_text)

Output:

['I', "won't", 'stop', 'learning', 'about', 'Artificial', 'Intelligence']

As seen, it’s the same result as in the earlier example – with shorter code this time. 
