NLTK is a powerful Python library used for Natural Language Processing (NLP). So, what is Natural Language Processing, you may ask? Natural Language Processing involves the interaction between machines and human language. In NLP, text or speech is manipulated into forms that a machine can understand.
Have you thought about how Google Mail classifies your emails as spam or ham? That’s NLP at work. Or you’re having a conversation with a chatbot and it seems to understand your messages and reply appropriately. Congratulations, my friend, you just experienced NLP first hand. Just as humans interact with one another and communicate through text, NLP allows computers to share this experience with humans.
What is NLTK?
NLTK stands for Natural Language Toolkit. The library provides a framework for building a wide range of NLP programs. NLTK grew out of a computational linguistics course in the Department of Computer and Information Science at the University of Pennsylvania. Since then, it has been developed by thousands of contributors and has become one of the most popular NLP libraries.
Some of the language processing tasks NLTK performs include accessing corpora, processing strings, part-of-speech tagging, classification, chunking, word and character counting, semantic interpretation, and more. You will understand many of these by the end of this tutorial series.
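To give you a feel for how little code these tasks take, here is a minimal sketch of tokenization and POS tagging with NLTK. It assumes you have already installed the library (pip install nltk); the sample sentence and the printed results are only illustrative.

```python
import nltk

# One-time downloads of the tokenizer and tagger data
nltk.download("punkt")
nltk.download("averaged_perceptron_tagger")

text = "NLTK makes natural language processing approachable."
tokens = nltk.word_tokenize(text)  # split the sentence into word tokens
tags = nltk.pos_tag(tokens)        # tag each token with its part of speech

print(tokens)  # ['NLTK', 'makes', 'natural', ...]
print(tags)    # [('NLTK', 'NNP'), ('makes', 'VBZ'), ...]
```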
In this course, we will discuss the following.
- Natural Language Processing (NLP) Tutorial: We will discuss, among other things, its history, applications, advantages, and disadvantages.
- A step-by-step guide on how to install NLTK on your machine: We will walk you through installing NLTK on Windows and Mac and show you how to run an NLTK script.
- Tokenization in NLP: You will understand how to tokenize a corpus and what the tokens can be used for.
- Stemming and Lemmatization: We will explore these two essential text-processing techniques in NLP and understand why they matter (a short preview appears right after this list).
- Word and Sentence Tokenization with NLTK: We will discuss what tokenization means and briefly touch on its usefulness in NLP.
- Regular Expressions (regex) for Text Tokenization: We will discuss the concept of regular expressions and how to use NLTK’s regular expression tokenizer.
- Part-of-speech (POS) tagging and chunking with NLTK: We will discuss what POS tagging and chunking mean, their rules, and their use cases.
- Exploring WordNet in NLTK: We will discuss how to use the WordNet database to analyze text and find synonyms of words.
- POS Tags Counting, Frequency Distribution, and Collocations: We will discuss how to use some NLTK modules to count POS tags and plot a frequency distribution curve, and we will touch on the concept of bigrams and trigrams.
- Word Embedding with word2vec: We will discuss what an embedding is, the bag-of-words concept, and the connection between word2vec and NLTK.
- Fake News Classifier using LSTM: We will build a deep learning model that detects fake news. In this project, we will apply the concepts we have learned since the beginning of the tutorial series.
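As a preview of the stemming and lemmatization tutorial, here is a minimal sketch of both techniques in NLTK. It assumes the library is installed; the example words are arbitrary.

```python
import nltk
from nltk.stem import PorterStemmer, WordNetLemmatizer

nltk.download("wordnet")  # one-time download of the WordNet data

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

print(stemmer.stem("running"))        # 'run'   -- rule-based suffix stripping
print(stemmer.stem("studies"))        # 'studi' -- a stem need not be a real word
print(lemmatizer.lemmatize("geese"))  # 'goose' -- dictionary lookup returns real words
```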
Before rounding off, you should be aware that there are other NLP libraries. Let’s briefly talk about some of them.
Some Other NLP Libraries
- Gensim: Gensim is good for two things: analyzing the semantic structure of large corpora and scoring documents to retrieve similarities. In Gensim, the algorithms are memory-independent with respect to the corpus size. Also, its interface is intuitive, computation is well distributed, and popular algorithms are efficiently implemented.
- SpaCy: For industrial applications, SpaCy is a fantastic option due to its convenience and speed. It combines Python’s convenience with Cython’s speed. Its developers argue that its speed, accuracy, and model size can compete with popular NLP libraries such as NLTK. But SpaCy is a newer library and supports only English and a few other, mostly European, languages. This may be a huge drawback in projects that involve multiple languages.
- Polyglot: Polyglot, as the name suggests, can deal with a wide range of languages in one sweep. Polyglot performs the general operations found in other libraries – operations such as stemming, lemmatization, POS tagging, entity recognition, etc. – and provides models for each of the languages it supports.
- Pattern: Pattern is built for one major task – scraping websites and analyzing the retrieved text. With Pattern, you don’t have to worry about scraping web sources such as Wikipedia, Facebook, or Twitter: its modules get that done quickly, and the common NLP operations can then be carried out with ease.
- TextBlob: TextBlob strikes a sweet balance between the NLTK and Pattern libraries. Its friendly interface lets you spend less time and get more results (see the short sketch after this list). With TextBlob, you can get started with its default settings very quickly and tweak its functionality as you upskill.
- CoreNLP: CoreNLP is a Java-written NLP library that was developed at Stanford University. It can deliver production-ready NLP solutions at scale. Although the library is written in Java, various Python packages and APIs make it easy to use from Python.
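To illustrate TextBlob’s friendly interface, here is a minimal sketch. It assumes TextBlob is installed (pip install textblob, followed by python -m textblob.download_corpora to fetch the NLTK data it relies on); the sample text is made up.

```python
from textblob import TextBlob

blob = TextBlob("NLTK is a powerful library. TextBlob wraps it in a friendly API.")

print(blob.sentences)  # split into sentences
print(blob.words)      # tokenize into words
print(blob.tags)       # part-of-speech tags (TextBlob uses NLTK under the hood)
print(blob.sentiment)  # polarity and subjectivity scores
```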
In the next tutorial, we will delve into the NLTK library fully.