Hands-On Exercise 1: a basic workflow

In this exercise we will prepare a simple workflow, and we will execute a first run using the CIOP tools.

Prerequisites

You have cloned the Hands-On Git repository (see Clone the Hands-On repository).
(Python only) You have installed the required software (see Prerequisites when using python).

Install the Hands-On

The Hands-On is installed with Maven:

cd
cd dcs-hands-on
mvn clean install -D hands.on=1 -P bash

The last command installs the first Hands-On exercise (option -D hands.on=1) using the bash profile (option -P bash). The profile selects the programming language in which the Hands-On run executables are implemented.

Understand the workflow

A workflow must be defined as a directed acyclic graph (DAG) [1]. A special file, named application.xml, defines the workflow. The first step is to locate and inspect this file:

Go to the application’s default location (/application) by typing:

cd $_CIOP_APPLICATION_PATH

Check for a file named application.xml
Open it with a text editor (e.g. vi) and inspect its content. It will be similar to:

<?xml version="1.0" encoding="us-ascii"?>
<application xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xsd="http://www.w3.org/2001/XMLSchema" id="application">
  <jobTemplates>
    <jobTemplate id="my_template">
      <streamingExecutable>/application/my_node/run</streamingExecutable>
    </jobTemplate>
  </jobTemplates>
  <workflow id="hands-on-1" title="Basic Workflow" abstract="Exercise 1, a basic workflow">
    <workflowVersion>1.0</workflowVersion>
    <node id="my_node">
      <job id="my_template"/>
      <sources>
        <source refid="file:urls">/application/inputs/list</source>
      </sources>
    </node>
  </workflow>
</application>
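
To make the structure of this file easier to see, here is a minimal sketch that reads the definition with Python's standard xml.etree.ElementTree module and links each node to its job template and sources. The helper name summarize_workflow is hypothetical and is not part of the CIOP tools; the XML declaration is omitted so that the text can be parsed directly from a Python string.

```python
import xml.etree.ElementTree as ET

# Copy of the workflow definition shown above (XML declaration omitted).
APPLICATION_XML = """
<application xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xsd="http://www.w3.org/2001/XMLSchema" id="application">
  <jobTemplates>
    <jobTemplate id="my_template">
      <streamingExecutable>/application/my_node/run</streamingExecutable>
    </jobTemplate>
  </jobTemplates>
  <workflow id="hands-on-1" title="Basic Workflow" abstract="Exercise 1, a basic workflow">
    <workflowVersion>1.0</workflowVersion>
    <node id="my_node">
      <job id="my_template"/>
      <sources>
        <source refid="file:urls">/application/inputs/list</source>
      </sources>
    </node>
  </workflow>
</application>
"""


def summarize_workflow(xml_text):
    """Return {node id: (run executable, [sources])}. Hypothetical helper."""
    root = ET.fromstring(xml_text.strip())
    # Map each job template id to its streaming executable.
    executables = {
        template.get("id"): template.findtext("streamingExecutable")
        for template in root.iter("jobTemplate")
    }
    summary = {}
    for node in root.iter("node"):
        # A node refers to its job template through the <job id="..."/> element.
        job_id = node.find("job").get("id")
        sources = [source.text for source in node.iter("source")]
        summary[node.get("id")] = (executables[job_id], sources)
    return summary


for node_id, (executable, sources) in summarize_workflow(APPLICATION_XML).items():
    print(node_id, executable, sources)
```

The single entry in the result shows that the node my_node runs /application/my_node/run and reads its inputs from /application/inputs/list.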

Check the inputs

Check for a file named list under the folder inputs:

cat inputs/list

It will be similar to:

input1
input2

Warning

The inputs list should not contain blank lines at the beginning or at the end, and comments are not allowed.
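
The constraints in this warning can be checked before a run. The following is a minimal sketch, not part of the CIOP tools: the function name check_input_list is hypothetical, and treating lines that start with "#" as comments is an assumption, since the warning does not say which comment syntax it has in mind.

```python
def check_input_list(text, comment_prefix="#"):
    """Return a list of problems found in the content of an inputs list.

    comment_prefix is an assumption: the warning does not name a comment syntax.
    """
    problems = []
    lines = text.splitlines()
    if lines and not lines[0].strip():
        problems.append("blank line at the beginning")
    if lines and not lines[-1].strip():
        problems.append("blank line at the end")
    for number, line in enumerate(lines, start=1):
        if line.lstrip().startswith(comment_prefix):
            problems.append(f"comment on line {number}")
    return problems


# The list shown above passes the check; a list ending with a blank line does not.
print(check_input_list("input1\ninput2\n"))
print(check_input_list("input1\ninput2\n\n"))
```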

Check the run executable

A run executable is responsible for the execution of your application (or a step of it) by the Hadoop compute engine. In the application.xml we defined a workflow with a single node and the related run executable:

      <streamingExecutable>/application/my_node/run</streamingExecutable>

Inspect the run executable:

cat my_node/run

Note

Depending on the profile chosen (Maven’s -P option), a run executable can be written in different programming or scripting languages, including Python, R, or bash (the Hands-On exercises are initially available in Python and bash).
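
The run executable shipped with the Hands-On is not reproduced here. As an illustration only, the sketch below shows what a run executable that logs the name of each data input might look like in Python. It assumes that the executable receives one input per line on standard input and that log messages go to standard error; the function name run and the message text are hypothetical.

```python
import io
import sys


def run(stream=sys.stdin, log=sys.stderr):
    """Log every non-empty input line and return the list of input names."""
    processed = []
    for line in stream:
        name = line.strip()
        if not name:
            continue
        # Assumption: logging to standard error keeps messages out of the results.
        print(f"processing input: {name}", file=log)
        processed.append(name)
    return processed


# Simulate the two-line inputs list instead of reading the real standard input.
demo_log = io.StringIO()
result = run(io.StringIO("input1\ninput2\n"), log=demo_log)
print(result)
```

Passing the streams as parameters makes the function easy to try locally without a Hadoop environment.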

Run the node

List the available node(s) with:

ciop-run -n

This returns:

my_node

Execute it by typing:

ciop-run my_node

The output will be similar to:

2016-01-19 12:27:48 [WARN ] -  -- WPS needs at least one input value from your application.xml (source or parameter with scope=runtime);
2016-01-19 12:27:51 [INFO ] - Workflow submitted
2016-01-19 12:27:51 [INFO ] - Closing this program will not stop the job.
2016-01-19 12:27:51 [INFO ] - To kill this job type:
2016-01-19 12:27:51 [INFO ] - ciop-stop 0000000-160119102214227-oozie-oozi-W
2016-01-19 12:27:51 [INFO ] - Tracking URL:
2016-01-19 12:27:51 [INFO ] - http://sb-10-16-10-50.dev.terradue.int:11000/oozie/?job=0000000-160119102214227-oozie-oozi-W

Node Name     :  my_node
Status        :  OK

Publishing results...

2016-01-19 12:28:31 [INFO ] - Workflow completed.
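
The output above contains the workflow identifier needed to stop the job (the argument of ciop-stop). As a small sketch, the identifier can be extracted from captured output with a regular expression; the function name find_workflow_id is hypothetical, and the pattern assumes the log keeps the "ciop-stop <identifier>" form shown above.

```python
import re

# Two lines copied from the sample output above.
SAMPLE_OUTPUT = """\
2016-01-19 12:27:51 [INFO ] - To kill this job type:
2016-01-19 12:27:51 [INFO ] - ciop-stop 0000000-160119102214227-oozie-oozi-W
"""


def find_workflow_id(output):
    """Return the identifier that follows 'ciop-stop', or None if absent."""
    match = re.search(r"ciop-stop\s+(\S+)", output)
    return match.group(1) if match else None


print(find_workflow_id(SAMPLE_OUTPUT))
```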

Note

Since the Hadoop Sandbox mode used here runs on a virtual machine with two cores, and the node ‘my_node’ has only two inputs to process, the input1 and input2 lines were processed in parallel by two simultaneous tasks (each task processing a single entry of the input file). Hadoop deployments in Cluster mode scale your application up to larger volumes of input data and more processing nodes.
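
As a local analogy only, the sketch below mimics this behaviour with two worker threads from Python's standard concurrent.futures module, each handling one entry of the inputs list. It does not reproduce how Hadoop schedules tasks; the names task and run_in_parallel are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def task(entry):
    """Stand-in for one task processing a single entry of the inputs list."""
    return f"processed {entry}"


def run_in_parallel(entries, workers=2):
    """Process the entries with up to `workers` simultaneous tasks."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() returns the results in the same order as the entries.
        return list(pool.map(task, entries))


print(run_in_parallel(["input1", "input2"]))
```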

Recap

We installed a simple workflow with a single node;
We passed to the workflow a list of two data inputs;
We executed a simple run that logs the name of data inputs, running two tasks in parallel.

Footnotes

[1] Directed acyclic graph