Apache Flume Tutorial

What is Apache Flume?

Apache Flume is a tool for ingesting data into HDFS. It collects, aggregates, and transports large amounts of streaming data, such as log files and events from sources like network traffic and email messages, to HDFS. Flume is highly reliable and distributed.

Advantages of Apache Flume

Apache Flume has many benefits that make it a better alternative to other tools. The advantages are:

  • Flume is fault-tolerant, reliable, and scalable.
  • It can store data in centralized stores like HBase and HDFS.
  • It scales horizontally.
  • When incoming data arrives faster than it can be written to the destination, Flume buffers it and maintains a steady flow between the data producers and the store.
  • Message delivery is reliable with Flume. Flume transactions are channel-based; for each message, two transactions are maintained (one for the sender and one for the receiver).
  • It supports a comprehensive set of source and destination types.
  • Data from multiple sources can be ingested into Hadoop.

Flume Architecture

A Flume agent has three components:

  • Flume Source 

A Flume source consumes events generated by external sources such as web servers.

  • Flume Channel

The data received by the Flume source is stored in one or more channels. A channel keeps the data until it is passed to the Flume sink. Depending on its type, a channel buffers the data locally in memory or on the local file system.

  • Flume Sink

The Flume sink then moves this data from the channels to HDFS.
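To make the source, channel, and sink wiring concrete, here is a minimal sketch that writes a single-agent configuration from the shell. The agent name a1, the component names r1, c1, and k1, the netcat source on port 44444, the logger sink, and the output file name are all assumptions chosen for illustration; they are not the setup used later in this tutorial.

```shell
#!/bin/sh
# Sketch: write a minimal Flume agent config that wires one source,
# one channel, and one sink together. All names below are illustrative.
EXAMPLE_CONF="${EXAMPLE_CONF:-${TMPDIR:-/tmp}/flume-example.conf}"

cat > "$EXAMPLE_CONF" <<'EOF'
# Declare the three components of agent "a1"
a1.sources = r1
a1.channels = c1
a1.sinks = k1

# Source: listens for lines of text on a TCP port
a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444

# Channel: buffers events in memory until the sink takes them
a1.channels.c1.type = memory

# Sink: logs each event (useful for testing)
a1.sinks.k1.type = logger

# Wiring: a source may feed several channels, a sink drains exactly one
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
EOF

echo "wrote $EXAMPLE_CONF"
```

Note the property names: a source uses the plural key channels, while a sink uses the singular key channel.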

Let’s first download Apache Flume from the link given below.

https://www.apache.org/dyn/closer.lua/flume/1.9.0/apache-flume-1.9.0-bin.tar.gz

Extract the files using the following command:

sudo tar -xvf apache-flume-1.9.0-bin.tar.gz

After extraction, a folder named apache-flume-1.9.0-bin will be created.
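The extraction step can also be wrapped in a small helper that reports the folder it created, which is convenient for setting a FLUME_HOME variable. This is only a sketch: extract_flume is a hypothetical function name, and it assumes the archive has a single top-level folder, as the Flume binary tarball does.

```shell
#!/bin/sh
# Sketch: extract a Flume tarball and report the directory it created.
# extract_flume is a hypothetical helper, not part of Flume itself.
extract_flume() {
    archive="$1"        # path to apache-flume-<version>-bin.tar.gz
    dest="${2:-.}"      # where to extract (defaults to the current directory)

    [ -f "$archive" ] || { echo "archive not found: $archive" >&2; return 1; }

    # The first entry in the listing gives the top-level folder name.
    top_dir=$(tar -tzf "$archive" | head -n 1 | cut -d/ -f1)

    mkdir -p "$dest" && tar -xzf "$archive" -C "$dest" || return 1

    # Print the extracted folder so callers can use it as FLUME_HOME.
    echo "$dest/$top_dir"
}
```

A possible use is FLUME_HOME=$(extract_flume apache-flume-1.9.0-bin.tar.gz "$HOME"), which leaves the path of the new folder in FLUME_HOME.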

Creating a Twitter Application

To get tweets from Twitter, we need to create a Twitter application.

Go to https://developer.twitter.com/apps and log in with your account. Click on Create an app.

Fill in the respective details.

Open the Keys and Access Tokens tab, where you can see a button named Create my access token. Click on it to generate the access token.

You need to place the consumer key, consumer secret, access token, and access token secret in the configuration file.

Create a new file named twitter.conf in the conf folder of Flume with the following contents:

# Name the components of the agent "MyTwitAgent"
MyTwitAgent.sources = Twitter
MyTwitAgent.channels = MemChannel
MyTwitAgent.sinks = HDFS

# Source: custom Twitter source; fill in the keys from the Twitter app
MyTwitAgent.sources.Twitter.type = flume.mytwittersource.MyTwitterSourceForFlume
MyTwitAgent.sources.Twitter.channels = MemChannel
MyTwitAgent.sources.Twitter.consumerKey = <Copy consumer key value from Twitter App>
MyTwitAgent.sources.Twitter.consumerSecret = <Copy consumer secret value from Twitter App>
MyTwitAgent.sources.Twitter.accessToken = <Copy access token value from Twitter App>
MyTwitAgent.sources.Twitter.accessTokenSecret = <Copy access token secret value from Twitter App>
MyTwitAgent.sources.Twitter.keywords = mrcreamio

# Sink: write events as plain text files to HDFS
MyTwitAgent.sinks.HDFS.channel = MemChannel
MyTwitAgent.sinks.HDFS.type = hdfs
MyTwitAgent.sinks.HDFS.hdfs.path = hdfs://localhost:54310/user/hduser/flume/tweets/
MyTwitAgent.sinks.HDFS.hdfs.fileType = DataStream
MyTwitAgent.sinks.HDFS.hdfs.writeFormat = Text
MyTwitAgent.sinks.HDFS.hdfs.batchSize = 1000
# rollSize = 0 disables size-based rolling; roll to a new file after 1000 events
MyTwitAgent.sinks.HDFS.hdfs.rollSize = 0
MyTwitAgent.sinks.HDFS.hdfs.rollCount = 1000

# Channel: buffer up to 1000 events in memory
MyTwitAgent.channels.MemChannel.type = memory
MyTwitAgent.channels.MemChannel.capacity = 1000
MyTwitAgent.channels.MemChannel.transactionCapacity = 1000
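In a file like this, every name listed under sources, channels, and sinks has to be the same name used in that component's property keys; a mismatch is easy to introduce and typically leaves the component unconfigured. The following sketch shows one way to check this before starting the agent. check_flume_conf is a hypothetical helper, not a Flume tool, and it only checks that each declared component has a type property.

```shell
#!/bin/sh
# Sketch: check that every component declared for an agent has a "type"
# property under the same name. check_flume_conf is a hypothetical helper.
check_flume_conf() {
    conf="$1"      # path to the properties file
    agent="$2"     # agent name, e.g. MyTwitAgent
    status=0

    for kind in sources channels sinks; do
        # Read the declaration line, e.g. "<agent>.sources = <name> [<name> ...]"
        names=$(sed -n "s/^$agent\.$kind[[:space:]]*=[[:space:]]*//p" "$conf")
        if [ -z "$names" ]; then
            echo "no $kind declared for agent $agent"
            status=1
            continue
        fi
        for name in $names; do
            # Each declared component needs a matching "<name>.type" line
            if ! grep -q "^$agent\.$kind\.$name\.type[[:space:]]*=" "$conf"; then
                echo "$kind '$name' is declared but has no type property"
                status=1
            fi
        done
    done
    return $status
}
```

For the file above it could be called as check_flume_conf conf/twitter.conf MyTwitAgent; it prints one line per problem and returns a non-zero status if any were found.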

Start the Hadoop cluster using the commands given below.

$HADOOP_HOME/sbin/start-dfs.sh

$HADOOP_HOME/sbin/start-yarn.sh

Type jps in the terminal to check that all the Hadoop daemons are running.
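Reading the jps listing by eye works, but the check can also be scripted. The sketch below assumes a single-node setup in which start-dfs.sh and start-yarn.sh are expected to bring up NameNode, DataNode, SecondaryNameNode, ResourceManager, and NodeManager; missing_daemons is a hypothetical helper name, and the daemon list should be adjusted for other cluster layouts.

```shell
#!/bin/sh
# Sketch: report which expected Hadoop daemons are absent from jps output.
# missing_daemons is a hypothetical helper; it reads jps output on stdin
# and prints the name of each expected daemon that is not listed.
missing_daemons() {
    awk '
        { running[$2] = 1 }    # jps prints "<pid> <ClassName>" per line
        END {
            n = split("NameNode DataNode SecondaryNameNode ResourceManager NodeManager", want, " ")
            for (i = 1; i <= n; i++)
                if (!(want[i] in running)) print want[i]
        }
    '
}

# Usage on a cluster node:
#   jps | missing_daemons
```

The names are compared exactly, so a running SecondaryNameNode is not mistaken for a NameNode. Empty output means every expected daemon was found.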

Create a directory in HDFS

Create the directory in HDFS using the following command. The path matches the hdfs.path set in twitter.conf.

hdfs dfs -mkdir -p /user/hduser/flume/tweets

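Now start the Flume agent using the following command. Run it from the Flume installation folder so that the relative conf paths resolve. The agent name passed with -n must match the name used in twitter.conf.

/home/supper_user/apache-flume-1.9.0-bin/bin/flume-ng agent --conf ./conf/ -f conf/twitter.conf -Dflume.root.logger=DEBUG,console -n MyTwitAgent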
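Because the agent name, the config file, and the Flume location all have to agree, it can help to assemble the command from variables. This is a sketch under assumptions: the default paths follow this tutorial's layout, the variable names and the flume_cmd function are hypothetical, and the long options --conf-file and --name are the spelled-out forms of -f and -n.

```shell
#!/bin/sh
# Sketch: assemble the flume-ng command line from variables so that the
# agent name, config file, and Flume location are each set in one place.
# The default values are assumptions based on this tutorial's layout.
FLUME_HOME="${FLUME_HOME:-$HOME/apache-flume-1.9.0-bin}"
AGENT_NAME="${AGENT_NAME:-MyTwitAgent}"    # must match the prefix used in the config file
TWITTER_CONF="${TWITTER_CONF:-$FLUME_HOME/conf/twitter.conf}"

flume_cmd() {
    printf '%s agent --conf %s --conf-file %s --name %s -Dflume.root.logger=DEBUG,console\n' \
        "$FLUME_HOME/bin/flume-ng" "$FLUME_HOME/conf" "$TWITTER_CONF" "$AGENT_NAME"
}

# Print the command for review before running it.
flume_cmd
```

Using absolute paths this way also removes the dependency on the current working directory.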

The streaming of tweets into HDFS will start, and the terminal shows the agent's log output while tweets are fetched.
