{"id":4741,"date":"2020-09-04T15:52:53","date_gmt":"2020-09-04T10:22:53","guid":{"rendered":"https:\/\/www.h2kinfosys.com\/blog\/?p=4741"},"modified":"2020-09-04T15:52:55","modified_gmt":"2020-09-04T10:22:55","slug":"hadoop-mapreduce-join-counter-with-example","status":"publish","type":"post","link":"https:\/\/www.h2kinfosys.com\/blog\/hadoop-mapreduce-join-counter-with-example\/","title":{"rendered":"Hadoop MapReduce Join &#038; Counter with Example"},"content":{"rendered":"\n<p>Sometimes we need to combine two large datasets, and for this purpose MapReduce provides the join operation. Writing the join by hand requires a lot of code, but <a href=\"https:\/\/www.h2kinfosys.com\/blog\/what-is-mapreduce-how-it-works\/\">MapReduce<\/a> makes it easy. First, the two datasets are compared by size, and the smaller dataset is distributed to every DataNode. The Mapper or Reducer then uses the smaller dataset to perform lookups for matching records. Finally, the matching records from the smaller and larger datasets are merged to produce the joined output records.<\/p>\n\n\n\n<p>There are two types of joins.<\/p>\n\n\n\n<ul class=\"wp-block-list\"><li>Map-side Join<\/li><li>Reduce-side Join<\/li><\/ul>\n\n\n\n<h2 class=\"wp-block-heading\">Map-side Join<\/h2>\n\n\n\n<p>In the Map-side Join, the operation is performed by the mapper, and the join happens before the actual map function consumes the data. This type of join requires the input to each map to be partitioned and sorted: both inputs must be divided into an equal number of partitions, each sorted by the join key.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Reduce-side Join<\/h2>\n\n\n\n<p>In the Reduce-side Join, the operation is performed by the reducer. 
Here, the input datasets do not need to be structured, partitioned, or sorted. The map phase emits the join key together with the tuples associated with it from both records. Hence, all the tuples that share the same key are grouped at the same reducer, where they are joined to form the output records.<\/p>\n\n\n\n<p>Let\u2019s start with Hadoop first.<\/p>\n\n\n\n<p>First of all, start the Hadoop cluster using the commands given below.<\/p>\n\n\n\n<figure class=\"wp-block-table\"><table class=\"has-subtle-pale-pink-background-color has-background\"><tbody><tr><td>$HADOOP_HOME\/sbin\/start-dfs.sh<\/td><\/tr><\/tbody><\/table><\/figure>\n\n\n\n<figure class=\"wp-block-image\"><img decoding=\"async\" src=\"https:\/\/lh3.googleusercontent.com\/aWKuLr3LNEpYFyAzDHJdpwpNs2YfmfX_ByKE-k5Q4JUjM8tDe6qUx4wKND0wJ2-bCEEC788JL-ufHuDVWF9z7Ku4PKUf3Q-QGjxALuFD53Bfx_BJC8rtH0jKFBSSNhxwmS6KmjQNTlg8gCYAoA\" alt=\"\" title=\"\"><\/figure>\n\n\n\n<figure class=\"wp-block-table\"><table class=\"has-subtle-pale-pink-background-color has-background\"><tbody><tr><td>$HADOOP_HOME\/sbin\/start-yarn.sh<\/td><\/tr><\/tbody><\/table><\/figure>\n\n\n\n<figure class=\"wp-block-image\"><img decoding=\"async\" src=\"https:\/\/lh4.googleusercontent.com\/KpdjT2S_JwzXPexCz7LLFUALyvS7OCuO-zAjxv2Ncwdsa4dpQW6WDFsv2QaHZKCv8ZlR5wlcvxHEVNeF9utU2m8ePYv7Rsln_CPujJ8DB15m_OWGJ0X41wzYL9-8M1LJEVoyfDhhV_wHZ4iOKw\" alt=\"\" title=\"\"><\/figure>\n\n\n\n<p>Type jps in the terminal to check that all the nodes are running.<\/p>\n\n\n\n<figure class=\"wp-block-image\"><img decoding=\"async\" src=\"https:\/\/lh6.googleusercontent.com\/SIBr9wOwD2wUZxjYV6OII-8KJoWBPnWtMcJ04eOLSC8IMujE-tbs7Mj3KcOwxFEk_ajk-mnVo2wsHtcsmFIkL6XJddoos2NDOHlrbfAP-y4WPbqz3ROOlUg-tLse-fn1sBN12tfAdX2PC7MgqQ\" alt=\"\" title=\"\"><\/figure>\n\n\n\n<p>We have the following data.<\/p>\n\n\n\n<figure class=\"wp-block-image\"><img decoding=\"async\" 
src=\"https:\/\/lh5.googleusercontent.com\/YuALdY4p9jipfOukWc4156u5wvQadKKHogF8GWHbt9TAk6p-j_c-eWoBZxHN07AtBGimNroR60T04UQMvnKZ_2KDjmIGOcLtmlxdU9Ebvx1oPQd37xEAUWjeaGD97I-BqXV3rjxD1KV1j1LOyg\" alt=\"\" title=\"\"><\/figure>\n\n\n\n<figure class=\"wp-block-image\"><img decoding=\"async\" src=\"https:\/\/lh4.googleusercontent.com\/ju13wAO9IzBWbvIFwwmSxh6B5ItR9IPLW2vMW91oY3UpW3JPMnjfbl1Q3Lq-Dh-BunckesIIGurzL0R8e9RE9mnv5db_Zr8wnYN5bL0gVkxRVC5L_RHsZ-KF-wf31OV5ngy64gl5D0A86ZCaEA\" alt=\"\" title=\"\"><\/figure>\n\n\n\n<p>Download the GitHub repo from the link given below; we will be using those files.<\/p>\n\n\n\n<p><a href=\"https:\/\/github.com\/mrcreamio\/Hadoop-tutorials\" rel=\"nofollow noopener\" target=\"_blank\">https:\/\/github.com\/mrcreamio\/Hadoop-tutorials<\/a><\/p>\n\n\n\n<p>Move the downloaded folder to the appropriate directory using the command given below.<\/p>\n\n\n\n<figure class=\"wp-block-table\"><table class=\"has-subtle-pale-pink-background-color has-background\"><tbody><tr><td>sudo cp -r \/home\/ahmed\/Desktop\/MapReduceJoin \/home\/supper_user\/<\/td><\/tr><\/tbody><\/table><\/figure>\n\n\n\n<p>Change into that directory.<\/p>\n\n\n\n<figure class=\"wp-block-table\"><table class=\"has-subtle-pale-pink-background-color has-background\"><tbody><tr><td>cd MapReduceJoin\/<\/td><\/tr><\/tbody><\/table><\/figure>\n\n\n\n<figure class=\"wp-block-image\"><img decoding=\"async\" src=\"https:\/\/lh5.googleusercontent.com\/FbU_I7rM3Y0EFkMPl_1ROl11ClCw5T-GOPx50Jo-jSzfLogvP3rSP1ybSCIzLIYdMHkID0nT8WSLtZt6jaRuW3KSpPhdJUi_CXZuHKtpzCIE5iUCVqqhGe3xPg_2FF-a3yPICiKJI2oKlLaRlg\" alt=\"\" title=\"\"><\/figure>\n\n\n\n<p>Now let\u2019s copy our input files to HDFS.<\/p>\n\n\n\n<figure class=\"wp-block-table\"><table class=\"has-subtle-pale-pink-background-color has-background\"><tbody><tr><td>hdfs dfs -copyFromLocal DeptStrength.txt DeptName.txt \/<\/td><\/tr><\/tbody><\/table><\/figure>\n\n\n\n<figure class=\"wp-block-image\"><img decoding=\"async\" 
src=\"https:\/\/lh6.googleusercontent.com\/h3OAH4qOnmXo_8dFfyYS2L7BjhnFHxudFk6mLNM_LVIpR5vx23u4lg6Dqmqac2Sx1R70yGyg6nvIUfKUdQNY1CS-n2p8TnOOS8fBAHyNthQbjRZtoybv-As_f9cs6G6wkgnKpPGvDVNjYowUFw\" alt=\"\" title=\"\"><\/figure>\n\n\n\n<p>Let\u2019s check that the files have been copied.<\/p>\n\n\n\n<figure class=\"wp-block-table\"><table class=\"has-subtle-pale-pink-background-color has-background\"><tbody><tr><td>hdfs dfs -ls \/<\/td><\/tr><\/tbody><\/table><\/figure>\n\n\n\n<figure class=\"wp-block-image\"><img decoding=\"async\" src=\"https:\/\/lh4.googleusercontent.com\/CS7pvev3qyuV7kUqqmyv_JdDw11TOBmU7ZJ3v7sJzQedXBsg81qB3eB5heVZoiHAAz8Qlsdi4ABRu-BGryn1jM-nnQqaSXUtb9izsak40Ll2dbYyL1bDd6ba3XiLwCPCbbZd_t9cUcA2fgf33A\" alt=\"\" title=\"\"><\/figure>\n\n\n\n<p>Run the program using the command given below.<\/p>\n\n\n\n<figure class=\"wp-block-table\"><table class=\"has-subtle-pale-pink-background-color has-background\"><tbody><tr><td>$HADOOP_HOME\/bin\/hadoop jar MapReduceJoin.jar \/DeptStrength.txt \/DeptName.txt \/output_mapreducejoin<\/td><\/tr><\/tbody><\/table><\/figure>\n\n\n\n<figure class=\"wp-block-image\"><img decoding=\"async\" src=\"https:\/\/lh5.googleusercontent.com\/smCLrf-xrbP2KSl_GblMQG-MAZ3A0r0WpdFUNVg4JoUsSKQ0utC23sAvWFaETO8Z8HcHLUGkXdvm8UXrMy0aMOptIGs7IqXfp0evced4yHjw2Q08sv48o9GNDDseGH4rBMuB-F6TUlXM6laUUg\" alt=\"\" title=\"\"><\/figure>\n\n\n\n<p>Let\u2019s see the output files.<\/p>\n\n\n\n<figure class=\"wp-block-image\"><img decoding=\"async\" src=\"https:\/\/lh4.googleusercontent.com\/GOwhwK_bugeRCf7C-MXdAdDQJ5nZF7xlCAHf0Ff5Y1p6oetLVYk24gPFvHiYn6vftxI3gm7jgwZ0HBAmLIkKG3JOc2TGeY5bTIi2lD7kLBB-rxjCPxyDM0QK_GzPdT-AM3SkPSRnpSCWf4jrmQ\" alt=\"\" title=\"\"><\/figure>\n\n\n\n<p>Print the joined output using the command given below.<\/p>\n\n\n\n<figure class=\"wp-block-table\"><table class=\"has-subtle-pale-pink-background-color has-background\"><tbody><tr><td>hdfs dfs -cat \/output_mapreducejoin\/part-00000<\/td><\/tr><\/tbody><\/table><\/figure>\n\n\n\n<figure 
class=\"wp-block-image\"><img decoding=\"async\" src=\"https:\/\/lh6.googleusercontent.com\/i03DHGPQs3OujUYKtVc_PnUXpfW5mAAm3Ln02-PiT2l_IpCbXjDjMkcQTuaVCr-2g8RfFFcFoyQdB78cP2qeMMc3tX3mKB1NhLuJm00_VIgUIckgdZw90xkfCrKmBKR6DHd1xPRpY2geIg_ZJA\" alt=\"\" title=\"\"><\/figure>\n","protected":false},"excerpt":{"rendered":"<p>Sometimes we need to combine two large datasets, and for this purpose MapReduce provides the join operation. Writing the join by hand requires a lot of code, but MapReduce makes it easy. First, the two datasets are compared by size, and the smaller dataset is distributed to every DataNode. Then, the [&hellip;]<\/p>\n","protected":false},"author":1,"featured_media":4756,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[138],"tags":[1329,1330,1331],"class_list":["post-4741","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-bigdata-hadoop-tutorials","tag-hadoop-mapreduce","tag-map-side-join","tag-reduce-side-join"],"_links":{"self":[{"href":"https:\/\/www.h2kinfosys.com\/blog\/wp-json\/wp\/v2\/posts\/4741","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.h2kinfosys.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.h2kinfosys.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.h2kinfosys.com\/blog\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/www.h2kinfosys.com\/blog\/wp-json\/wp\/v2\/comments?post=4741"}],"version-history":[{"count":0,"href":"https:\/\/www.h2kinfosys.com\/blog\/wp-json\/wp\/v2\/posts\/4741\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.h2kinfosys.com\/blog\/wp-json\/wp\/v2\/media\/4756"}],"wp:attachment":[{"href":"https:\/\/www.h2kinfosys.com\/blog\/wp-json\/wp\/v2\/media?parent=4741"}],"wp:term":[{"taxonomy":"category","embeddable":true,"
href":"https:\/\/www.h2kinfosys.com\/blog\/wp-json\/wp\/v2\/categories?post=4741"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.h2kinfosys.com\/blog\/wp-json\/wp\/v2\/tags?post=4741"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}