Oozie

If you are not yet familiar with Oozie, a workflow engine for Hadoop, please read about it here to get a quick overview of the ideas behind it.

Create a project

To create a project, initialize an empty git repository and add an origin to it. You can take the URL of the origin from the main menu of the datamot UI.

datamot@localhost:~> cd ~
datamot@localhost:~> mkdir happy-project
datamot@localhost:~> cd happy-project
datamot@localhost:~/happy-project> git init
Initialized empty Git repository in /home/datamot/happy-project/.git/

Add remote repository

Go to the datamot UI at http://localhost:2718 and enter any username you want; the login screen is currently just a dummy page. Copy the remote URL from the main menu panel by clicking on it, then add the remote:

datamot@localhost:~/happy-project> git remote add origin datamot@localhost:/home/datamot/.datamot/datamot/babymot
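
To confirm that the remote was registered, you can read it back from the repository configuration. The helper below is only a convenience sketch; the function name origin_url is made up for this example and is not part of datamot.

```shell
# Print the URL configured for the remote named "origin" in the
# repository located at $1. origin_url is a hypothetical helper name.
origin_url() {
  git -C "$1" config --get remote.origin.url
}
```

For the project above, `origin_url ~/happy-project` should print the URL you copied from the datamot UI.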

Project structure

The project structure is not dictated: there are no predefined “special” places for coordinators, datasets, workflows, scripts, and so on. You can place any coordinator in any folder within your project, so you can structure your project any way you want.

The only thing to remember is that “dot” files and “dot” folders are not processed by the generating engine.
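
As a rough way to preview which files fall under that rule, you can list everything in the project while skipping any path that has a component starting with a dot. This is a sketch that approximates the rule stated above, not datamot's own code; the function name list_processed_files is made up for this example.

```shell
# List the files that remain when every "dot" file and "dot" folder is
# skipped. This only approximates the rule described in the text; it is
# an assumption about the behaviour, not datamot's implementation.
list_processed_files() (
  cd "$1" || exit 1
  # Prune any path with a component that starts with a dot, print the rest.
  find . -path '*/.*' -prune -o -type f -print | sort
)
```

Running `list_processed_files ~/happy-project` leaves out .git, any .conf files, and anything stored under a folder whose name begins with a dot.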

Variable substitution and plugins

The real power of datamot is in variable substitution and plugins. At its core, datamot is a rendering engine. This means that you can place all your variables in a “dot” file, namely a “.conf” file, and then use them all over the project. The scope of a “.conf” file is the folder in which it is placed and all the folders inside it. You can redefine variables in any folder.
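
The scoping rule can be pictured as a chain of “.conf” files running from the project root down to a given folder. The sketch below prints that chain, outermost first, so that a later entry can be read as redefining variables from an earlier one. It assumes nothing about the contents or syntax of “.conf” files, and the function name conf_chain is made up for this example; it is not a datamot command.

```shell
# Print every ".conf" that applies to folder $2 (given relative to the
# project root $1), outermost first. conf_chain is a hypothetical helper
# that illustrates the scoping rule; it does not read the files.
conf_chain() (
  root=$1
  dir=$2
  # Visit the parent folder first so that outer files are printed first.
  if [ "$dir" != "." ] && [ "$dir" != "/" ]; then
    conf_chain "$root" "$(dirname "$dir")"
  fi
  if [ -f "$root/$dir/.conf" ]; then
    printf '%s\n' "$dir/.conf"
  fi
)
```

For example, `conf_chain ~/happy-project jobs/import-happiness` lists the root “.conf” before any “.conf” placed in jobs/ or jobs/import-happiness/.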

Consider a simple project:

Happy project
  jobs/
    import-happiness/
    calculate-joy/

<coordinator-app  name="import-happiness"
                  frequency="${coord:days(1)}"
                  start=""
                  end=""
                  timezone="UTC"
                  xmlns="uri:oozie:coordinator:0.2">

  <controls>
    <timeout></timeout>
    <execution>FIFO</execution>
  </controls>

  <datasets>
    <include>/datasets/joyful.xml</include>
  </datasets>

  <output-events>
    <data-out name="happiness" dataset="happiness">
        <instance>${coord:current(0)}</instance>
    </data-out>
  </output-events>

  <action>
    <workflow>
      <app-path>/jobs/import-happiness/workflow.xml</app-path>
      <configuration>
        <property>
          <name> happiness </name>
          <value> ${coord:dataOut('happiness')} </value>
        </property>
      </configuration>
    </workflow>
  </action>

</coordinator-app>
<coordinator-app  name="calculate-joy"
                  frequency="${coord:days(1)}"
                  start=""
                  end=""
                  timezone="UTC"
                  xmlns="uri:oozie:coordinator:0.2">

  <controls>
    <timeout></timeout>
    <execution>FIFO</execution>
  </controls>

  <datasets>
    <include>/datasets/joyful.xml</include>
  </datasets>

  <input-events>
    <data-in name="happiness" dataset="happiness">
      <instance>${coord:current(0)}</instance>
    </data-in>
  </input-events>

  <output-events>
    <data-out name="joy" dataset="joy" >
      <instance>${coord:current(0)}</instance>
    </data-out>
  </output-events>

  <action>
    <workflow>
      <app-path>/jobs/calculate-joy/workflow.xml</app-path>
      <configuration>
        <property>
          <name> happiness </name>
          <value> ${coord:dataIn('happiness')} </value>
        </property>
        <property>
          <name> joy </name>
          <value> ${coord:dataOut('joy')} </value>
        </property>
      </configuration>
    </workflow>
  </action>

</coordinator-app>

This is the familiar Oozie syntax for a coordinator, but being XML it is verbose. It is possible to describe coordinators with much shorter, more concise code. Refer to the docs to see how to use plugins.
