Tokenization in NLP Tutorial

Tokenization is the process of splitting a chunk of text, a phrase, or a sentence into smaller units called tokens. These smaller units could be individual words or terms. Tokenization is a pivotal step in extracting information from textual data.

To build NLP-driven systems such as sentiment analysis, chatbots, language translation, or voice assistants, patterns need to be learned from text. Tokens are the units from which these patterns are learned. They are also used in other NLP operations such as stemming and lemmatization. Do not be perturbed if those terms are unfamiliar; we shall treat stemming and lemmatization in detail in a later tutorial. Suffice it to say here that stemming and lemmatization are fundamental steps for cleaning textual data in NLP.

Tokenization operations are performed using the tokenize module of the NLTK library. This module provides functions for various tasks, including word_tokenize() and sent_tokenize(). We shall take a look at each of them in this tutorial.

Tokenization of Words

The word_tokenize function is used for splitting a corpus into individual words. The list of words can be converted into a dataframe to allow for further data cleaning before it is fed into a machine learning algorithm for model building.
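
As a minimal sketch of that conversion (assuming pandas is installed; the token list below is only illustrative), the tokens can be placed in a single-column dataframe:

import pandas as pd

# A list of tokens such as word_tokenize would produce (illustrative only)
tokens = ['I', 'love', 'artificial', 'intelligence', '.']

# Wrap the tokens in a one-column dataframe for further cleaning
df = pd.DataFrame({'token': tokens})
print(df)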

Since machine learning algorithms require numeric data to learn from and make predictions, it becomes critical to apply a TfidfVectorizer or CountVectorizer to the tokens. This converts the tokens from strings into a matrix of numbers. You may want to read about vectorizers to get a better understanding.
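
As a rough sketch of that step (assuming scikit-learn is installed), CountVectorizer turns a list of sentences into a matrix of token counts; TfidfVectorizer works the same way but applies TF-IDF weighting:

from sklearn.feature_extraction.text import CountVectorizer

# Example sentences (illustrative only)
sentences = ["I love artificial intelligence.", "So I am reading this tutorial."]

# Learn the vocabulary and transform the sentences into a count matrix
vectorizer = CountVectorizer()
matrix = vectorizer.fit_transform(sentences)

print(vectorizer.get_feature_names_out())
print(matrix.toarray())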

Let’s see a coding example

from nltk.tokenize import word_tokenize
# Running nltk.download('punkt') once may be needed to fetch the tokenizer models
text = "I love artificial intelligence. So I am reading this tutorial. I love it!"
print(word_tokenize(text))

Output: ['I', 'love', 'artificial', 'intelligence', '.', 'So', 'I', 'am', 'reading', 'this', 'tutorial', '.', 'I', 'love', 'it', '!']

Explaining each line of code

We started by importing the word_tokenize function from the tokenize module of NLTK. Afterward, a variable holding the textual data was defined. Upon applying the function with the variable as its argument, the sentences were split into words and punctuation marks, as can be seen in the output.

Tokenization of Sentences 

The sent_tokenize function is used to split a corpus into sentences. This can come in handy when you want to calculate, say, the average number of words in a sentence. You would need both word_tokenize and sent_tokenize for this computation; a short sketch of it follows the example below.

Let’s take a code example

from nltk.tokenize import sent_tokenize
corpus = "I love artificial intelligence. So I am reading this tutorial. I love it!"
print(sent_tokenize(corpus))

Output: ['I love artificial intelligence.', 'So I am reading this tutorial.', 'I love it!']

Explaining each line of code

Here also, we imported the required function, sent_tokenize, and passed the corpus variable as its argument. From the output, we see that the code splits the text into three sentences.
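
As a brief sketch of the computation mentioned earlier, the average number of words per sentence can be calculated by combining both functions (here on the same corpus):

from nltk.tokenize import sent_tokenize, word_tokenize

corpus = "I love artificial intelligence. So I am reading this tutorial. I love it!"

# Split the corpus into sentences, then count the tokens in each sentence
sentences = sent_tokenize(corpus)
words_per_sentence = [len(word_tokenize(sentence)) for sentence in sentences]

# Average number of tokens (words and punctuation marks) per sentence
print(sum(words_per_sentence) / len(sentences))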

Armed with this information, you now have a solid understanding of how tokenization works and what it is used for. In the next tutorial, you will learn about stemming and lemmatization.
