
Understanding NLTK in Python Tutorial

NLTK is a powerful Python library used for Natural Language Processing (NLP). So, what is Natural Language Processing, you may ask? Natural Language Processing deals with the interaction between computers and human language. In NLP, text or speech is manipulated into forms that a machine can understand.

Have you ever wondered how Google Mail classifies your mail as spam or ham? That’s NLP at work. Or perhaps you’ve had a conversation with a chatbot, and it seemed to understand your message and reply appropriately. Congratulations, my friend, you just experienced NLP first hand. Just as humans interact with one another and communicate through text, NLP allows computers to share this experience with humans.

What is NLTK?

NLTK stands for Natural Language Toolkit. The library provides a framework that can be used to build a wide range of NLP programs. NLTK grew out of a computational linguistics course in the Department of Computer and Information Science at the University of Pennsylvania. Since then, it has been developed by thousands of contributors and has become one of the most popular NLP libraries.

Some of the language processing tasks NLTK performs include accessing corpora, processing strings, part-of-speech tagging, classification, chunking, word and character counts, semantic interpretation, and more. You will understand many of these by the end of this tutorial series.
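
To give you a taste before we dive in, here is a minimal sketch of two of these tasks, tokenization and part-of-speech tagging. It assumes NLTK is installed (pip install nltk) and uses a sample sentence of our own; the nltk.download() calls fetch the tokenizer and tagger models on first run.

    import nltk
    from nltk.tokenize import word_tokenize

    # One-time downloads of the tokenizer and POS tagger models
    nltk.download("punkt")
    nltk.download("averaged_perceptron_tagger")

    text = "NLTK makes natural language processing in Python approachable."

    # Split the sentence into individual tokens
    tokens = word_tokenize(text)
    print(tokens)

    # Attach a part-of-speech tag to each token
    print(nltk.pos_tag(tokens))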


In this course, we will discuss the following.

  • Natural Language Processing (NLP) Tutorial: We will discuss, among other things, its history, applications, advantages, and disadvantages.
  • A step-by-step guide on how to install NLTK on your machine: We will guide you through installing NLTK on Windows and Mac, and show you how to run an NLTK script.
  • Tokenization in NLP: You will understand how to tokenize a corpus and what the tokens can be used for.
  • Stemming and Lemmatization: We will explore these two important text processing techniques in NLP and understand their importance. You can find a quick preview in the sketch after this list.
  • Words and Sentences Tokenization with NLTK: We will discuss what tokenization means and briefly touch on its usefulness in NLP.
  • Regular Expressions (regex) for Text Tokenization: We will discuss the concept of regular expressions and how to use NLTK’s regular expression tokenizer.
  • Part-of-speech (POS) tagging and chunking with NLTK: We will discuss what POS tagging and chunking mean, their rules, and their use cases.
  • Exploring WordNet in NLTK: We will discuss how to use the WordNet database to analyze text and find synonyms for words.
  • POS Tags Counting, Frequency Distribution, and Collocations: We will discuss how to use some NLTK modules for POS counting, plot a frequency distribution curve, and touch on the concepts of bigrams and trigrams.
  • Word Embedding with word2vec: We will discuss what word embedding is, the concept of the bag of words, and the connection between word2vec and NLTK.
  • Fake News Classifier using LSTM: We will build a deep learning model that detects fake news. In this project, we will apply the concepts we have learned since the beginning of the tutorial series. 
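
As a quick preview of the stemming and lemmatization tutorial mentioned above, here is a minimal sketch comparing NLTK’s PorterStemmer with its WordNetLemmatizer. It assumes NLTK is installed and uses a few sample words of our own; the nltk.download() call fetches the WordNet database on first run.

    import nltk
    from nltk.stem import PorterStemmer, WordNetLemmatizer

    # One-time download of the WordNet database used by the lemmatizer
    nltk.download("wordnet")

    stemmer = PorterStemmer()
    lemmatizer = WordNetLemmatizer()

    for word in ["studies", "running", "better"]:
        # Stemming chops off affixes; lemmatization maps to a dictionary form
        print(word, "->", stemmer.stem(word), "/", lemmatizer.lemmatize(word, pos="v"))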

Before rounding off, you should be aware that there are other NLP libraries. Let’s briefly talk about some of them.

Some Other NLP Libraries

  • Gensim: Gensim is good for two things: analyzing the structure of large corpora, and scoring documents within them to retrieve similar ones. Gensim’s algorithms are memory-independent with respect to corpus size. Also, its interface is intuitive, computation is easily distributed, and popular algorithms have efficient implementations.
  • SpaCy: For industrial applications, SpaCy is a fantastic option due to its convenience and speed, combining Python’s convenience with Cython’s speed. Its developers argue that its speed, accuracy, and model size can compete with popular NLP libraries such as NLTK. But SpaCy is a newer library and supports only English and a handful of other, mostly European, languages. This may be a huge drawback in projects that involve multiple languages. See the sketch after this list for how quickly you can get started with it.
  • Polyglot: Polyglot, as the name suggests, can deal with a wide range of languages in one sweep. Polyglot performs the general operations found in other libraries – operations such as stemming, lemmatization, POS tagging, entity recognition, etc. – and then provides models for the desired languages.
  • Pattern: Pattern is used for one major assignment – scraping websites and analyzing the retrieved text. With Pattern, you don’t have to worry about scraping web sources such as Wikipedia, Facebook, or Twitter; its modules handle that very quickly, and the common NLP operations can then be carried out with ease.
  • TextBlob: TextBlob sits in a sweet spot between the NLTK and Pattern libraries. Its friendly interface makes it possible for you to spend less time and get more results. With TextBlob, you can get started with its default settings very quickly and tweak its functionality as you upskill.
  • CoreNLP: CoreNLP is a Java-based NLP library developed at Stanford University. It can generate production-ready NLP solutions at scale. Although the library is written in Java, there are various Python packages and APIs that make it easy to use from Python.
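
To illustrate SpaCy’s convenience, here is a minimal sketch of a SpaCy pipeline. It assumes SpaCy is installed and that the small English model has been downloaded (python -m spacy download en_core_web_sm); the sample sentence is our own.

    import spacy

    # Load the small English pipeline (tokenizer, tagger, parser, NER)
    nlp = spacy.load("en_core_web_sm")

    doc = nlp("Google and Facebook both invest heavily in NLP research.")

    # Each token carries its part-of-speech tag and lemma
    for token in doc:
        print(token.text, token.pos_, token.lemma_)

    # Named entities detected by the pipeline
    print([(ent.text, ent.label_) for ent in doc.ents])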

In the next tutorial, we will delve into the NLTK library fully.
