{"id":4917,"date":"2020-09-21T19:16:29","date_gmt":"2020-09-21T13:46:29","guid":{"rendered":"https:\/\/www.h2kinfosys.com\/blog\/?p=4917"},"modified":"2020-09-21T19:16:30","modified_gmt":"2020-09-21T13:46:30","slug":"apache-flume-tutorial","status":"publish","type":"post","link":"https:\/\/www.h2kinfosys.com\/blog\/apache-flume-tutorial\/","title":{"rendered":"Apache Flume Tutorial"},"content":{"rendered":"\n<h2 class=\"wp-block-heading\">What is Apache Flume?<\/h2>\n\n\n\n<p>Apache Flume is a tool for ingesting data into <a href=\"https:\/\/en.wikipedia.org\/wiki\/Apache_Hadoop\" rel=\"nofollow noopener\" target=\"_blank\">HDFS<\/a>. It collects, aggregates, and transports large amounts of streaming data, such as log files and events from sources like network traffic and email messages, to HDFS. Flume is highly reliable and distributed.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Advantages of Apache Flume<\/h2>\n\n\n\n<p>Apache Flume has several benefits that make it a strong alternative to other tools. The advantages are:<\/p>\n\n\n\n<ul class=\"wp-block-list\"><li>Flume is fault-tolerant, reliable, and scalable.<\/li><li>It can store data in centralized stores such as HBase and HDFS.<\/li><li>Apache Flume scales horizontally.<\/li><li>If the rate of incoming data exceeds the rate at which it can be written to the destination, Flume acts as a buffer and maintains a steady flow of data between the two.<\/li><li>Flume provides reliable message delivery. 
Flume transactions are channel-based; for each message, two transactions are maintained, one on the sender side and one on the receiver side.<\/li><li>It supports a wide range of source and destination types.<\/li><li>Data from multiple sources can be ingested into Hadoop.<\/li><\/ul>\n\n\n\n<h2 class=\"wp-block-heading\">Flume Architecture<\/h2>\n\n\n\n<p>A Flume agent has three components:<\/p>\n\n\n\n<ul class=\"wp-block-list\"><li><strong>Flume Source<\/strong><\/li><\/ul>\n\n\n\n<p>A Flume source consumes events generated by external sources such as web servers.<\/p>\n\n\n\n<ul class=\"wp-block-list\"><li><strong>Flume Channel<\/strong><\/li><\/ul>\n\n\n\n<p>The data received by the Flume source is stored in one or more channels. A channel holds the data until it is passed to the Flume sink, and it uses local storage to do so.<\/p>\n\n\n\n<ul class=\"wp-block-list\"><li><strong>Flume Sink<\/strong><\/li><\/ul>\n\n\n\n<p>The Flume sink then moves this data from the channels to HDFS.<\/p>\n\n\n\n<figure class=\"wp-block-image\"><img decoding=\"async\" src=\"https:\/\/lh6.googleusercontent.com\/ZAUraPOfy_CwYfJa_gtfLC9ck9hS59DykrOj-d0vKePacQrSzIV2lagbLLL9vc_tkxwZaHCyAexkGIsfJjY_h7LP94dctvGRgsrZu8JZqTMHH9ZjqUIjOtltMsJag1z44lQcPKMHBWYFEm8ZTw\" alt=\"Flume Sink\" title=\"\"><\/figure>\n\n\n\n<p>Let\u2019s first download Apache Flume from the link given below.<\/p>\n\n\n\n<p>https:\/\/www.apache.org\/dyn\/closer.lua\/flume\/1.9.0\/apache-flume-1.9.0-bin.tar.gz<\/p>\n\n\n\n<figure class=\"wp-block-image\"><img decoding=\"async\" src=\"https:\/\/lh3.googleusercontent.com\/4XxwJ4loeg5Z8MUhr3VRJmMB6s3b5j_h8HXn_sYJEuk0T6knrPx4RWx20GUxeKJpem2Zppg29t3USCFBma2AXtkSx6ruFlgofmbvaPZJpVwNxW14wndtDfZZLH4BZw_5AWYpDP171qRebBhMjA\" alt=\"Flume Sink\" title=\"\"><\/figure>\n\n\n\n<p>Extract the archive using the following command:<\/p>\n\n\n\n<p><code>sudo tar -xvf apache-flume-1.9.0-bin.tar.gz<\/code><\/p>\n\n\n\n<p>After extraction, the following folder will be created.<\/p>\n\n\n\n<figure class=\"wp-block-image\"><img 
decoding=\"async\" src=\"https:\/\/lh6.googleusercontent.com\/3xmMMjOr852Yiq1vyRxRjnTppJofb_q2ky2NMirCmg-IqYJktEpkx1x3e9dGRV0sRgacS4e1usRuiT62Go1r3TaJMmeUtbzxN_Ny5WUxAKla284J-nfIvza_hhbUbEZSgFGJr06iu5-3zTkVkA\" alt=\"\" title=\"\"><\/figure>\n\n\n\n<h2 class=\"wp-block-heading\">Creating a Twitter Application<\/h2>\n\n\n\n<p>To fetch tweets from Twitter, we first need to create a Twitter application.<\/p>\n\n\n\n<p>Go to <a href=\"https:\/\/developer.twitter.com\/apps\" rel=\"nofollow noopener\" target=\"_blank\">https:\/\/developer.twitter.com\/apps<\/a> and log in with your account. Click on Create an app.<\/p>\n\n\n\n<figure class=\"wp-block-image\"><img decoding=\"async\" src=\"https:\/\/lh4.googleusercontent.com\/cdInDPATyQUC6Dxi30e5IMz8sCdMdgkA8CSamwXXNFEMGuY5CU6qRMyCUV9wUzMZg7lEVz-TWvV15V2db7Tp_dzZQKTjv9hIf8Bf5TMbfdsw6n6oX7PPEWRI3671sXsD1dF9EifyccjTAkxyfQ\" alt=\"Twitter Application\" title=\"\"><\/figure>\n\n\n\n<p>Fill in the required details.<\/p>\n\n\n\n<figure class=\"wp-block-image\"><img decoding=\"async\" src=\"https:\/\/lh4.googleusercontent.com\/wO6t42f1LieVfyesWJ0HS_fGcGR1N7YSRvdK6RF62X0Zu3A4sXoMfKPUcc8D2LRCslCaJek_1KmBoajh9hB893K8S6gyYAT4Q-8YQVro_GkHqBQEMAuq918PRmHwkqwHV6ljeNn8_NqpzKNy1Q\" alt=\"\" title=\"\"><\/figure>\n\n\n\n<p>Open the Keys and Access Tokens tab, where you will find a button named Create my access token. 
Click on it to generate the access token.<\/p>\n\n\n\n<figure class=\"wp-block-image\"><img decoding=\"async\" src=\"https:\/\/lh4.googleusercontent.com\/YUiiQSM9Cnuy4zkg6K76t20IvawouJof626SXNxFa2-0K-aqIqZSrwKCtr31RM-l1GDRz0-wrXC0WzxFBOWdkp0ceeOEUxMhUrGoUPgr4_kV1DVA1ppb8szk2P62O0FiQ4VQe_kHAcY7sS7G3Q\" alt=\"Keys and Access Tokens\" title=\"\"><\/figure>\n\n\n\n<p>You need to place these keys and tokens in the configuration file.<\/p>\n\n\n\n<p>Create a new file named twitter.conf in the conf folder of Flume.<\/p>\n\n\n\n<pre class=\"wp-block-preformatted\"># Name the source, channel, and sink of this agent\nMyTwitAgent.sources = Twitter\nMyTwitAgent.channels = MemChannel\nMyTwitAgent.sinks = HDFS\n# Twitter source: credentials and the keyword to track\nMyTwitAgent.sources.Twitter.type = flume.mytwittersource.MyTwitterSourceForFlume\nMyTwitAgent.sources.Twitter.channels = MemChannel\nMyTwitAgent.sources.Twitter.consumerKey = &lt;Copy consumer key value from Twitter App&gt;\nMyTwitAgent.sources.Twitter.consumerSecret = &lt;Copy consumer secret value from Twitter App&gt;\nMyTwitAgent.sources.Twitter.accessToken = &lt;Copy access token value from Twitter App&gt;\nMyTwitAgent.sources.Twitter.accessTokenSecret = &lt;Copy access token secret value from Twitter App&gt;\nMyTwitAgent.sources.Twitter.keywords = mrcreamio\n# HDFS sink: where and how the tweets are written\nMyTwitAgent.sinks.HDFS.channel = MemChannel\nMyTwitAgent.sinks.HDFS.type = hdfs\nMyTwitAgent.sinks.HDFS.hdfs.path = hdfs:\/\/localhost:54310\/user\/hduser\/flume\/tweets\/\nMyTwitAgent.sinks.HDFS.hdfs.fileType = DataStream\nMyTwitAgent.sinks.HDFS.hdfs.writeFormat = Text\nMyTwitAgent.sinks.HDFS.hdfs.batchSize = 1000\nMyTwitAgent.sinks.HDFS.hdfs.rollSize = 0\nMyTwitAgent.sinks.HDFS.hdfs.rollCount = 1000\n# In-memory channel that buffers events between source and sink\nMyTwitAgent.channels.MemChannel.type = memory\nMyTwitAgent.channels.MemChannel.capacity = 1000\nMyTwitAgent.channels.MemChannel.transactionCapacity = 1000\n<\/pre>\n\n\n\n<p>Start the <a href=\"https:\/\/www.h2kinfosys.com\/blog\/what-is-hadoop-an-introduction\/\">Hadoop Cluster<\/a> using the commands given 
below.<\/p>\n\n\n\n<p><code>$HADOOP_HOME\/sbin\/start-dfs.sh<\/code><\/p>\n\n\n\n<figure class=\"wp-block-image\"><img decoding=\"async\" src=\"https:\/\/lh6.googleusercontent.com\/rCFPdxevAVQPY58k2GXb4RuSoAZxrJPzj-VdJMTBhKagIDZF-jZhEmIbghBhEelIVIA-H9XnmSopIg9u9KRzmr2jxupTiSDYQz2TyEH_GbrDd2wtvl7465DagEGDb0BU3gDGBCRV_FInoflX9g\" alt=\"Hadoop Cluster\" title=\"\"><\/figure>\n\n\n\n<p><code>$HADOOP_HOME\/sbin\/start-yarn.sh<\/code><\/p>\n\n\n\n<figure class=\"wp-block-image\"><img decoding=\"async\" src=\"https:\/\/lh3.googleusercontent.com\/Lcwi7F3ZZDlx1acmuZbWHVSYUjfCKxXpcCrZtIOk3xSoEOXyYEZ8T7cy_sXMnaePO20PUCCmzo9uJdhxCy_AuSmlVpSh8lP-U5t7ZyR8WU1C9tpjEMhA79eVMdXHpj-hHirlriavxa2o_1fNEA\" alt=\"\" title=\"\"><\/figure>\n\n\n\n<p>Type jps in the terminal to check that all the Hadoop daemons are running.<\/p>\n\n\n\n<figure class=\"wp-block-image\"><img decoding=\"async\" src=\"https:\/\/lh4.googleusercontent.com\/yiGZn6XAuQrXNVVR_JEUo8xL_r53_CAc8Czg1js2vn_hmZB_g1wBviBTixaMIXj9qbsw6N69zNaZ2faBX1mG7HXGSM4MibSznBC-7TT-FgQu3XcPizVejHt-gRrL0vn-jwv1_RbxGf3QrnWk0w\" alt=\"Nodes \" title=\"\"><\/figure>\n\n\n\n<h3 class=\"wp-block-heading\">Create a directory in HDFS<\/h3>\n\n\n\n<p>Create the directory that will hold the tweets in HDFS, matching the hdfs.path set in twitter.conf, using the following command.<\/p>\n\n\n\n<p><code>hdfs dfs -mkdir -p \/user\/hduser\/flume\/tweets<\/code><\/p>\n\n\n\n<p>Now run the Flume agent using the following command. The name passed with -n must match the agent name used in twitter.conf.<\/p>\n\n\n\n<pre class=\"wp-block-preformatted\">\/home\/supper_user\/apache-flume-1.9.0-bin\/bin\/flume-ng agent --conf .\/conf\/ -f conf\/twitter.conf -Dflume.root.logger=DEBUG,console -n MyTwitAgent<\/pre>\n\n\n\n<p>The streaming of tweets into HDFS will start. 
The screenshot below shows the command prompt while tweets are being fetched.<\/p>\n\n\n\n<figure class=\"wp-block-image\"><img decoding=\"async\" src=\"https:\/\/lh6.googleusercontent.com\/3y5AMUmxOYW8QE8cUHEfIKHWy1SzkdqzyqlvhSvGqDlOd9SHDDeS5Y8Us5hE07Lt0iuduaG51sHZ8PQaqATmosltDOov_RETPK0khp8kUflb4PkUQjAu7lvvmjt4m3daMxh8E6XRsx69qnfrEg\" alt=\"Apache Flume\" title=\"\"><\/figure>\n","protected":false},"excerpt":{"rendered":"<p>What is Apache Flume? Apache Flume is a tool for ingesting data into HDFS. It collects, aggregates, and transports large amounts of streaming data, such as log files and events from sources like network traffic and email messages, to HDFS. Flume is highly reliable and distributed. Advantages of Apache Flume There are many benefits [&hellip;]<\/p>\n","protected":false},"author":1,"featured_media":4953,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[67],"tags":[449,1389],"class_list":["post-4917","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-hadoop-big-data-skill-test","tag-advantages","tag-apache-flume"],"_links":{"self":[{"href":"https:\/\/www.h2kinfosys.com\/blog\/wp-json\/wp\/v2\/posts\/4917","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.h2kinfosys.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.h2kinfosys.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.h2kinfosys.com\/blog\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/www.h2kinfosys.com\/blog\/wp-json\/wp\/v2\/comments?post=4917"}],"version-history":[{"count":0,"href":"https:\/\/www.h2kinfosys.com\/blog\/wp-json\/wp\/v2\/posts\/4917\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.h2kinfosys.com\/blog\/wp-json\/wp\/v2\/media\/4953"}],"wp:attachment":[{"href":"https:\/\/www.h2kinfosys.com\/blog\/wp-json\/wp\
/v2\/media?parent=4917"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.h2kinfosys.com\/blog\/wp-json\/wp\/v2\/categories?post=4917"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.h2kinfosys.com\/blog\/wp-json\/wp\/v2\/tags?post=4917"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}