{"id":5171,"date":"2020-09-30T17:44:23","date_gmt":"2020-09-30T12:14:23","guid":{"rendered":"https:\/\/www.h2kinfosys.com\/blog\/?p=5171"},"modified":"2020-09-30T17:44:25","modified_gmt":"2020-09-30T12:14:25","slug":"apache-oozie-tutorial","status":"publish","type":"post","link":"https:\/\/www.h2kinfosys.com\/blog\/apache-oozie-tutorial\/","title":{"rendered":"Apache Oozie Tutorial"},"content":{"rendered":"\n<h2 class=\"wp-block-heading\">What is Apache Oozie?<\/h2>\n\n\n\n<p>To manage Hadoop jobs in a distributed environment, Apache Oozie is used. Apache Oozie is a scheduler system to run and manage Hadoop jobs in a distributed environment. Oozie authorizes merging multiple complicated jobs that are run in a sequential order to gain a more critical task. Within a sequence of the task, two or more jobs can also be programmed to run parallel to each other.<\/p>\n\n\n\n<p>Oozie supports three types of jobs.<\/p>\n\n\n\n<ul class=\"wp-block-list\"><li><strong>Workflow engine<\/strong><\/li><\/ul>\n\n\n\n<p>The commitment of a workflow engine is to keep and drive workflows comprised of Hadoop jobs, e.g., MapReduce, Pig, Hive.<\/p>\n\n\n\n<ul class=\"wp-block-list\"><li><strong>Coordinator engine<\/strong>:&nbsp;<\/li><\/ul>\n\n\n\n<p>It drives workflow jobs established on predefined schedules and availability of data.<\/p>\n\n\n\n<ul class=\"wp-block-list\"><li><strong>Bundle<\/strong><\/li><\/ul>\n\n\n\n<p>The bundle is the <a href=\"https:\/\/www.quora.com\/What-is-a-high-level-abstraction-in-the-C-language\" rel=\"nofollow noopener\" target=\"_blank\">high-level abstraction<\/a> that will batch a set of coordinator jobs. 
In other words, a bundle lets a set of coordinator jobs be started, stopped, and rerun together.<\/p>\n\n\n\n<figure class=\"wp-block-image\"><img decoding=\"async\" src=\"https:\/\/lh4.googleusercontent.com\/iVL6WjBwfgyIcER9INzjSMgPqU5ghfVFOqpyb57s7ZuvbGqS8J3bLqGkkllFG_6PxQvMQneoOKDl2Vj_t1j5P-dyeZCakUFwgJwiY5W1NmJpSxqVC3KP1yaTDmRrQ11Qx_5RpHa1n_MQ78Gh0w\" alt=\"\" title=\"\"><\/figure>\n\n\n\n<h2 class=\"wp-block-heading\">Types of Nodes in an Oozie Workflow<\/h2>\n\n\n\n<ul class=\"wp-block-list\"><li><strong>Start and End nodes<\/strong><\/li><\/ul>\n\n\n\n<p>Start and end nodes mark the beginning and the end of the workflow. An optional kill node can also be defined to terminate the workflow when an error occurs.<\/p>\n\n\n\n<ul class=\"wp-block-list\"><li><strong>Action Nodes<\/strong><\/li><\/ul>\n\n\n\n<p>Action nodes define the actual processing tasks. When an action node finishes, the remote system notifies Oozie, and the next node in the workflow is executed. Action nodes can also run HDFS commands.<\/p>\n\n\n\n<ul class=\"wp-block-list\"><li><strong>Fork and Join nodes<\/strong><\/li><\/ul>\n\n\n\n<p>Fork and join nodes enable parallel execution of tasks in the workflow. A fork node splits the execution path so that two or more nodes run at the same time, and a join node waits until all the paths started by the corresponding fork have completed.<\/p>\n\n\n\n<ul class=\"wp-block-list\"><li><strong>Control flow nodes<\/strong><\/li><\/ul>\n\n\n\n<p>Control flow nodes make decisions based on the results of the previous nodes. 
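<\/p>\n\n\n\n<p>As a sketch (the node names and the output directory variable here are hypothetical), a decision node uses a switch\/case construct to pick the next node:<\/p>\n\n\n\n<figure class=\"wp-block-table is-style-stripes\"><table><tbody><tr><td><code>&lt;decision name=\"check-output\"&gt;<br>&nbsp;&nbsp;&lt;switch&gt;<br>&nbsp;&nbsp;&nbsp;&nbsp;&lt;case to=\"next-action\"&gt;${fs:exists(outputDir)}&lt;\/case&gt;<br>&nbsp;&nbsp;&nbsp;&nbsp;&lt;default to=\"fail\"\/&gt;<br>&nbsp;&nbsp;&lt;\/switch&gt;<br>&lt;\/decision&gt;<\/code><\/td><\/tr><\/tbody><\/table><\/figure>\n\n\n\n<p>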
Control flow nodes behave like if-else statements that evaluate to true or false.<\/p>\n\n\n\n<p>Here is an example of a simple Oozie workflow.<\/p>\n\n\n\n<figure class=\"wp-block-image\"><img decoding=\"async\" src=\"https:\/\/lh6.googleusercontent.com\/Z7vHoHJOXrksDL7x45-ZmI_H7Gf-NPAD071SOMXHRMx8mKIEFe5AE6Uk0citgYxJpa3QHlS4G8tAf4ULSQMvPc94mx7vbfBkhRdnOx3dPpnTywoYIXhZHQIgzXSMjksccQ8Gu4dy87sICYY65w\" alt=\"\" title=\"\"><\/figure>\n\n\n\n<h2 class=\"wp-block-heading\">Packaging and deploying an Oozie workflow application<\/h2>\n\n\n\n<p>A workflow application consists of the workflow definition and all the related resources, such as <a href=\"https:\/\/www.h2kinfosys.com\/blog\/hadoop-mapreduce-examples\/\">MapReduce<\/a> JAR files and Pig scripts. Applications must follow a simple directory structure and are deployed to HDFS so that Oozie can access them.<\/p>\n\n\n\n<p>Directory structure:<\/p>\n\n\n\n<figure class=\"wp-block-table is-style-stripes\"><table><tbody><tr><td><code>&lt;name of workflow&gt;\/<br>├── lib\/<br>│&nbsp;&nbsp;&nbsp;└── hadoop-examples.jar<br>└── workflow.xml<\/code><\/td><\/tr><\/tbody><\/table><\/figure>\n\n\n\n<p>The workflow.xml file (the workflow definition) must be kept in the top-level folder.<\/p>\n\n\n\n<p>The lib directory holds the JAR files, including the MapReduce classes. A workflow application conforming to this layout can be built with any build tool.<\/p>\n\n\n\n<p>Now copy this to HDFS using the command given below.<\/p>\n\n\n\n<figure class=\"wp-block-table is-style-stripes\"><table><tbody><tr><td><code>hadoop fs -put hadoop-examples\/target\/&lt;name of workflow dir&gt; &lt;name of workflow&gt;<\/code><\/td><\/tr><\/tbody><\/table><\/figure>\n\n\n\n<p><strong>Steps for Running an Oozie workflow job<\/strong><\/p>\n\n\n\n<p>1. 
Tell Oozie which server to use by exporting the OOZIE_URL environment variable.<\/p>\n\n\n\n<figure class=\"wp-block-table is-style-stripes\"><table><tbody><tr><td><code>export OOZIE_URL=\"http:\/\/localhost:11000\/oozie\"<\/code><\/td><\/tr><\/tbody><\/table><\/figure>\n\n\n\n<p>2. Run the workflow job using the command given below.<\/p>\n\n\n\n<figure class=\"wp-block-table is-style-stripes\"><table><tbody><tr><td><code>oozie job -config &lt;location of properties file&gt; -run<\/code><\/td><\/tr><\/tbody><\/table><\/figure>\n\n\n\n<p>3. Check the status of the workflow job.<\/p>\n\n\n\n<figure class=\"wp-block-table is-style-stripes\"><table><tbody><tr><td><code>oozie job -info &lt;job id&gt;<\/code><\/td><\/tr><\/tbody><\/table><\/figure>\n\n\n\n<p>4. The results of a successful workflow run can be seen using the command below.<\/p>\n\n\n\n<figure class=\"wp-block-table is-style-stripes\"><table><tbody><tr><td><code>hadoop fs -cat &lt;location of result&gt;<\/code><\/td><\/tr><\/tbody><\/table><\/figure>\n","protected":false},"excerpt":{"rendered":"<p>What is Apache Oozie? Apache Oozie is a scheduler system used to run and manage Hadoop jobs in a distributed environment. Oozie allows multiple complicated jobs to be combined and run in a sequential order to accomplish a larger task. 
Within a sequence of [&hellip;]<\/p>\n","protected":false},"author":1,"featured_media":5234,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[138],"tags":[],"class_list":["post-5171","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-bigdata-hadoop-tutorials"],"_links":{"self":[{"href":"https:\/\/www.h2kinfosys.com\/blog\/wp-json\/wp\/v2\/posts\/5171","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.h2kinfosys.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.h2kinfosys.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.h2kinfosys.com\/blog\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/www.h2kinfosys.com\/blog\/wp-json\/wp\/v2\/comments?post=5171"}],"version-history":[{"count":0,"href":"https:\/\/www.h2kinfosys.com\/blog\/wp-json\/wp\/v2\/posts\/5171\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.h2kinfosys.com\/blog\/wp-json\/wp\/v2\/media\/5234"}],"wp:attachment":[{"href":"https:\/\/www.h2kinfosys.com\/blog\/wp-json\/wp\/v2\/media?parent=5171"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.h2kinfosys.com\/blog\/wp-json\/wp\/v2\/categories?post=5171"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.h2kinfosys.com\/blog\/wp-json\/wp\/v2\/tags?post=5171"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}