What is Apache Oozie?
Apache Oozie is a scheduler system for running and managing Hadoop jobs in a distributed environment. Oozie lets you combine multiple complex jobs that run in sequence to accomplish a larger task. Within a sequence, two or more jobs can also be set to run in parallel with each other.
Oozie supports three types of jobs.
- Workflow engine
The workflow engine is responsible for storing and running workflows composed of Hadoop jobs, e.g., MapReduce, Pig, Hive.
- Coordinator engine
It runs workflow jobs based on predefined schedules and the availability of data.
- Bundle
A bundle is a higher-level abstraction that batches a set of coordinator jobs so they can be managed together.
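As a rough illustration of a schedule-driven job, the snippet below writes a minimal coordinator definition to a scratch directory. The coordinator name demo-coord, the dates, the daily frequency, the 0.4 schema version, and the application path are all assumptions made for this sketch, not values from a real deployment.

```shell
#!/bin/sh
# Scratch directory for the example (hypothetical location).
COORD_DIR="$(mktemp -d)"

# The quoted heredoc delimiter keeps ${...} literal, so Oozie resolves it, not the shell.
cat > "$COORD_DIR/coordinator.xml" <<'EOF'
<coordinator-app name="demo-coord" frequency="${coord:days(1)}"
                 start="2024-01-01T00:00Z" end="2024-12-31T00:00Z" timezone="UTC"
                 xmlns="uri:oozie:coordinator:0.4">
    <action>
        <workflow>
            <!-- Hypothetical HDFS path of the workflow application to run once a day -->
            <app-path>${nameNode}/user/${user.name}/demo-wf</app-path>
        </workflow>
    </action>
</coordinator-app>
EOF

echo "Wrote $COORD_DIR/coordinator.xml"
```

This sketch triggers on time only; a coordinator that also waits for data would declare datasets and input events as well.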
Types of Nodes in an Oozie Workflow
- Start and End node
Start and end nodes define the beginning and end of the workflow. A workflow can also include an optional fail (kill) node, which ends the workflow with an error.
- Action Nodes
The actual processing tasks are defined in action nodes. When an action finishes, the remote system notifies Oozie, and the next node in the workflow is executed. Action nodes can also run HDFS commands.
- Fork and Join nodes
Fork and join nodes enable parallel execution of tasks in a workflow. A fork node lets two or more nodes run at the same time. A join node waits until all of the forked tasks have completed before the workflow moves on.
- Control flow nodes
Control flow nodes decide which path the workflow takes next, based on the results of previous nodes. They behave like if-else statements whose conditions evaluate to true or false.
Here is an example of a simple Oozie workflow.
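A minimal sketch of such a workflow, written as a shell snippet that saves a workflow.xml file into a scratch directory. The workflow name demo-wf, the node names, the ${nameNode} property, the output paths, and the 0.5 schema version are illustrative assumptions. The workflow forks into two HDFS (fs) actions, joins them, and ends; any action error is routed to a kill node.

```shell
#!/bin/sh
# Scratch directory for the example (hypothetical location).
WF_DIR="$(mktemp -d)"

# The quoted heredoc delimiter keeps ${...} literal, so Oozie resolves it, not the shell.
cat > "$WF_DIR/workflow.xml" <<'EOF'
<workflow-app name="demo-wf" xmlns="uri:oozie:workflow:0.5">
    <start to="fork-node"/>
    <!-- Fork: both actions below start at the same time -->
    <fork name="fork-node">
        <path start="mkdir-a"/>
        <path start="mkdir-b"/>
    </fork>
    <action name="mkdir-a">
        <fs><mkdir path="${nameNode}/user/${wf:user()}/demo/a"/></fs>
        <ok to="join-node"/>
        <error to="fail"/>
    </action>
    <action name="mkdir-b">
        <fs><mkdir path="${nameNode}/user/${wf:user()}/demo/b"/></fs>
        <ok to="join-node"/>
        <error to="fail"/>
    </action>
    <!-- Join: waits for both forked paths before continuing -->
    <join name="join-node" to="end"/>
    <kill name="fail">
        <message>Workflow failed: ${wf:errorMessage(wf:lastErrorNode())}</message>
    </kill>
    <end name="end"/>
</workflow-app>
EOF

echo "Wrote $WF_DIR/workflow.xml"
```

The fs actions were chosen only because they need no JAR files or scripts; a MapReduce, Pig, or Hive action would slot into the same ok/error structure.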
Packaging and deploying an Oozie workflow application
A workflow application consists of the workflow definition and all the related resources, such as MapReduce JAR files and Pig scripts. Applications need to follow a simple directory structure and are deployed to HDFS so that Oozie can access them.
<name of workflow>/
    lib/
    workflow.xml
The workflow.xml file (the workflow definition file) must be kept in the top-level folder.
The lib directory holds the JAR files, including the MapReduce classes. A workflow application that conforms to this layout can be built with any build tool.
Now copy the application directory to HDFS using the command given below:
hadoop fs -put hadoop-examples/target/<name of workflow dir> <name of workflow>
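The layout and upload step can be sketched as a short shell script. The application name demo-wf, the JAR name demo.jar, and the empty placeholder files are hypothetical; a real application would hold an actual workflow definition and compiled classes. The upload runs only when the hadoop command is available.

```shell
#!/bin/sh
# Hypothetical application name and local build directory.
APP_NAME="demo-wf"
BUILD_DIR="$(mktemp -d)"
APP_DIR="$BUILD_DIR/$APP_NAME"

# workflow.xml sits at the top level; JAR files go under lib/.
mkdir -p "$APP_DIR/lib"
: > "$APP_DIR/workflow.xml"    # empty placeholder for the workflow definition
: > "$APP_DIR/lib/demo.jar"    # empty placeholder for the job JAR

# Show the resulting layout.
(cd "$BUILD_DIR" && find "$APP_NAME" | sort)

# Copy the directory to HDFS only if a Hadoop client is installed.
if command -v hadoop >/dev/null 2>&1; then
    hadoop fs -put "$APP_DIR" "$APP_NAME" || echo "upload failed" >&2
else
    echo "hadoop not found; skipping upload"
fi
```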
Steps for Running an Oozie workflow job
1. Tell the oozie command-line tool which Oozie server to use by exporting the OOZIE_URL environment variable.
2. Run the workflow job using the command given below:
oozie job -config <location of properties file> -run
3. Get the status of the workflow job:
oozie job -info <job id>
4. The results of a successful workflow run can be seen using the command below:
hadoop fs -cat <location of result>
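The first steps above might look like this in practice. The server URL (port 11000 is the default for a local Oozie install), the NameNode address, and the application path in the properties file are assumptions for illustration; replace them with your cluster's values. The oozie command runs only when the client is installed.

```shell
#!/bin/sh
# Step 1: point the oozie client at a server (assumed local install).
export OOZIE_URL="http://localhost:11000/oozie"

# Properties file passed with -config (all values are hypothetical).
PROPS="$(mktemp)"
cat > "$PROPS" <<'EOF'
nameNode=hdfs://localhost:8020
oozie.wf.application.path=${nameNode}/user/${user.name}/demo-wf
EOF

if command -v oozie >/dev/null 2>&1; then
    # Step 2: submit and start the workflow job; the job id is printed on success.
    oozie job -config "$PROPS" -run || echo "submission failed" >&2
    # Step 3: check the status with the printed id:
    #   oozie job -info <job id>
else
    echo "oozie client not found; would run: oozie job -config $PROPS -run"
fi
```

With OOZIE_URL exported, the -oozie option does not have to be repeated on every oozie command.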