In this exercise we prepare a simple workflow and execute a first run using the CIOP tools.
The Hands-On installation is straightforward and is performed with Maven:
cd
cd dcs-hands-on
mvn clean install -D hands.on=1 -P bash
The last command installs the first Hands-On exercise (-D hands.on=1) using the bash profile (-P bash). The profile selects the programming language used to implement the Hands-On run executables.
A workflow must be defined as a directed acyclic graph (DAG) [1]. It is described in a dedicated file named application.xml. The first step is to create this file in the application directory:
cd $_CIOP_APPLICATION_PATH
<?xml version="1.0" encoding="us-ascii"?>
<application xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xsd="http://www.w3.org/2001/XMLSchema" id="application">
<jobTemplates>
<jobTemplate id="my_template">
<streamingExecutable>/application/my_node/run</streamingExecutable>
</jobTemplate>
</jobTemplates>
<workflow id="hands-on-1" title="Basic Workflow" abstract="Exercise 1, a basic workflow">
<workflowVersion>1.0</workflowVersion>
<node id="my_node">
<job id="my_template"/>
<sources>
<source refid="file:urls">/application/inputs/list</source>
</sources>
</node>
</workflow>
</application>
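The single-node workflow above is trivially acyclic. As a minimal sketch of what the DAG requirement means for a workflow with several nodes, the example below orders a set of hypothetical nodes with the standard `tsort` utility. The node names `node_a`, `node_b`, and `node_c` and the helper `execution_order` are illustrative assumptions, not part of the Hands-On, and this is not how the CIOP tools validate a workflow.

```shell
#!/bin/bash
# Hypothetical dependency edges, one "upstream downstream" pair per line.
# These node names are illustrative only.
edges='node_a node_b
node_b node_c'

# execution_order EDGES: print the nodes in an order that respects every edge.
execution_order() {
  printf '%s\n' "$1" | tsort
}

execution_order "$edges"

# A cycle makes a valid order impossible. GNU tsort reports the loop on
# stderr and exits non-zero (assumption: GNU coreutils is installed).
if ! execution_order 'node_a node_b
node_b node_a' >/dev/null 2>&1; then
  echo "cycle detected: not a DAG"
fi
```

For a linear chain such as this one, only one valid order exists, so the nodes are printed from the first upstream node to the last downstream node.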
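The source of the node points to /application/inputs/list, a file that contains one input per line:
cat inputs/list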
input1
input2
Warning
The input list file must not contain blank lines at the beginning or at the end, and comments are not allowed.
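The constraints in this warning can be checked before submitting a run. The following is a minimal sketch, not a CIOP tool: the function name `validate_list` is hypothetical, and treating `#` as the comment marker is an assumption, since the text does not say which comment syntax is meant.

```shell
#!/bin/bash
# validate_list FILE
# Returns 0 when FILE is non-empty, has no blank first or last line,
# and has no line starting with '#'. Treating '#' as the comment marker
# is an assumption made for this sketch.
validate_list() {
  local file="$1"
  if [ ! -s "$file" ]; then
    echo "missing or empty file: $file" >&2
    return 1
  fi
  if head -n 1 "$file" | grep -qE '^[[:space:]]*$'; then
    echo "blank line at the beginning" >&2
    return 1
  fi
  if tail -n 1 "$file" | grep -qE '^[[:space:]]*$'; then
    echo "blank line at the end" >&2
    return 1
  fi
  if grep -qE '^[[:space:]]*#' "$file"; then
    echo "comment line found" >&2
    return 1
  fi
  return 0
}

# Example with a temporary copy of the two-entry list shown above.
list_file=$(mktemp)
printf 'input1\ninput2\n' > "$list_file"
validate_list "$list_file" && echo "list is valid"
rm -f "$list_file"
```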
A run executable is responsible for the execution of your application (or of one step of it) by the Hadoop compute engine. The application.xml above defines a workflow with a single node and its run executable:
<streamingExecutable>/application/my_node/run</streamingExecutable>
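The node refers to its job template by id, and the template in turn names the run executable. As a rough sketch of that link, the script below checks that every `<job id="...">` in a trimmed copy of the application.xml matches a defined `<jobTemplate id="...">`. The helper names are hypothetical, and the pattern matching with `grep` and `sed` is a simplification that assumes the attribute layout shown in this exercise; a real check would use an XML parser.

```shell
#!/bin/bash
# A trimmed copy of the application.xml from this exercise, held in a
# variable so that the sketch is self-contained.
app_xml='<application id="application">
<jobTemplates>
<jobTemplate id="my_template">
<streamingExecutable>/application/my_node/run</streamingExecutable>
</jobTemplate>
</jobTemplates>
<workflow id="hands-on-1">
<node id="my_node">
<job id="my_template"/>
</node>
</workflow>
</application>'

# Print the ids of all defined job templates, one per line.
template_ids() {
  printf '%s\n' "$1" | grep -oE '<jobTemplate id="[^"]+"' | sed -E 's/.*id="([^"]+)"/\1/'
}

# Print the template ids referenced by <job> elements, one per line.
referenced_ids() {
  printf '%s\n' "$1" | grep -oE '<job id="[^"]+"' | sed -E 's/.*id="([^"]+)"/\1/'
}

# Return 0 only if every referenced id is also a defined template id.
check_references() {
  local xml="$1" id defined
  defined=$(template_ids "$xml")
  for id in $(referenced_ids "$xml"); do
    if ! printf '%s\n' "$defined" | grep -qxF "$id"; then
      echo "undefined job template: $id" >&2
      return 1
    fi
  done
  return 0
}

check_references "$app_xml" && echo "all job references resolve"
```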
The content of the run executable can be displayed with:
cat my_node/run
Note
Depending on the profile chosen (Maven's -P option), a run executable can be written in different programming or scripting languages, including Python, R, or bash (the Hands-On exercises are initially available in Python and bash).
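To give an idea of the general shape of a streaming run executable in bash, here is a minimal, hypothetical sketch. It is not the actual content of my_node/run: it only assumes that, as in Hadoop Streaming, the executable receives its input entries on standard input, one per line. The functions `process_input` and `run_main` are illustrative names.

```shell
#!/bin/bash
# Hypothetical sketch of a streaming run executable.
# This is NOT the actual my_node/run shipped with the Hands-On.

# process_input ENTRY: illustrative placeholder for the real processing step.
process_input() {
  local input="$1"
  echo "processed ${input}"
}

# Read input entries from standard input, one per line, and handle each
# in turn. The extra test keeps a last line that lacks a trailing newline.
run_main() {
  local input
  while IFS= read -r input || [ -n "$input" ]; do
    [ -z "$input" ] && continue
    process_input "$input"
  done
}

# Example: feed the two entries of inputs/list through the loop.
printf 'input1\ninput2\n' | run_main
```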
List the nodes of the workflow:
ciop-run -n
This returns:
my_node
Run the node my_node:
ciop-run my_node
The output will be similar to:
2016-01-19 12:27:48 [WARN ] - -- WPS needs at least one input value from your application.xml (source or parameter with scope=runtime);
2016-01-19 12:27:51 [INFO ] - Workflow submitted
2016-01-19 12:27:51 [INFO ] - Closing this program will not stop the job.
2016-01-19 12:27:51 [INFO ] - To kill this job type:
2016-01-19 12:27:51 [INFO ] - ciop-stop 0000000-160119102214227-oozie-oozi-W
2016-01-19 12:27:51 [INFO ] - Tracking URL:
2016-01-19 12:27:51 [INFO ] - http://sb-10-16-10-50.dev.terradue.int:11000/oozie/?job=0000000-160119102214227-oozie-oozi-W
Node Name : my_node
Status : OK
Publishing results...
2016-01-19 12:28:31 [INFO ] - Workflow completed.
Note
The Hadoop Sandbox mode used here runs on a virtual machine with two cores, and the node 'my_node' has only two inputs to process. The input1 and input2 lines are therefore processed in parallel by two simultaneous tasks, each task processing a single entry of the input file. Hadoop deployments in Cluster mode then scale your application up to larger input volumes and more processing nodes.
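The idea of one task per input entry can be imitated locally with background jobs. The sketch below is only an analogy under that assumption: it does not involve Hadoop, and `process_entry` and `run_parallel` are hypothetical names.

```shell
#!/bin/bash
# process_entry ENTRY: illustrative stand-in for the work done by one task.
process_entry() {
  echo "task for $1 done"
}

# Start one background job per input entry, then wait for all of them.
run_parallel() {
  local entry
  while IFS= read -r entry; do
    [ -n "$entry" ] || continue
    process_entry "$entry" &
  done
  wait
}

# The completion order of background jobs is not fixed, so the output is
# sorted to make it stable.
printf 'input1\ninput2\n' | run_parallel | sort
```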
Footnotes
[1] Directed acyclic graph