Hands-On Exercise 1: a basic workflow¶
In this exercise we will prepare a simple workflow, and we will execute a first run using the CIOP tools.
Install the Hands-On¶
The Hands-On installation is quite straightforward, and it is performed with the Maven tool:
cd
cd dcs-hands-on
mvn clean install -D hands.on=1 -P bash
The last command installs the first Hands-On exercise (option -D hands.on=1) using the bash profile (option -P bash). The profile selects the programming language in which the Hands-On run executables are implemented.
Understand the workflow¶
A workflow must be defined as a directed acyclic graph (DAG). A special file, named application.xml, defines the workflow. The first step is to locate and inspect this application.xml:
- Go to the application's default location, /application, by typing cd /application
- Check for a file named application.xml
- Open it with a text editor (e.g. vi) and inspect its content. It will be similar to:
<?xml version="1.0" encoding="us-ascii"?>
<application xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xsd="http://www.w3.org/2001/XMLSchema" id="application">
  <jobTemplates>
    <jobTemplate id="my_template">
      <streamingExecutable>/application/my_node/run</streamingExecutable>
    </jobTemplate>
  </jobTemplates>
  <workflow id="hands-on-1" title="Basic Workflow" abstract="Exercise 1, a basic workflow">
    <workflowVersion>1.0</workflowVersion>
    <node id="my_node">
      <job id="my_template"/>
      <sources>
        <source refid="file:urls">/application/inputs/list</source>
      </sources>
    </node>
  </workflow>
</application>
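To check what a workflow definition declares without reading the XML by eye, a short script can list each node with its executable and sources. The following is a minimal Python sketch using only the standard library. The summarize_workflow helper is hypothetical (it is not part of the CIOP tools), and the XML is the sample from this exercise, embedded so the example is self-contained.

```python
import xml.etree.ElementTree as ET

# Sample application.xml from this exercise, embedded so the sketch is self-contained.
APPLICATION_XML = b"""<?xml version="1.0" encoding="us-ascii"?>
<application xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xsd="http://www.w3.org/2001/XMLSchema" id="application">
  <jobTemplates>
    <jobTemplate id="my_template">
      <streamingExecutable>/application/my_node/run</streamingExecutable>
    </jobTemplate>
  </jobTemplates>
  <workflow id="hands-on-1" title="Basic Workflow" abstract="Exercise 1, a basic workflow">
    <workflowVersion>1.0</workflowVersion>
    <node id="my_node">
      <job id="my_template"/>
      <sources>
        <source refid="file:urls">/application/inputs/list</source>
      </sources>
    </node>
  </workflow>
</application>"""


def summarize_workflow(xml_bytes):
    """Hypothetical helper: map each node id to its executable and sources."""
    root = ET.fromstring(xml_bytes)
    # Index the job templates by id so each node can be resolved to its executable.
    executables = {
        template.get("id"): template.findtext("streamingExecutable")
        for template in root.iter("jobTemplate")
    }
    summary = {}
    for node in root.iter("node"):
        job_id = node.find("job").get("id")
        summary[node.get("id")] = {
            "executable": executables.get(job_id),
            "sources": [(s.get("refid"), s.text) for s in node.iter("source")],
        }
    return summary


if __name__ == "__main__":
    for node_id, info in summarize_workflow(APPLICATION_XML).items():
        print(node_id, info["executable"], info["sources"])
```

To run it against the installed file instead of the embedded sample, read /application/application.xml in binary mode and pass its content to summarize_workflow.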
Check the inputs¶
- Check for a file named list under the folder inputs:
- It lists one input per line; in this exercise it holds two lines, input1 and input2.
The file must not contain blank lines at the beginning or at the end, and comments are not allowed.
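The constraints on the list file can be checked before a run. Below is a minimal Python sketch of such a check. The check_input_list function is hypothetical, and treating lines that start with '#' as comments is an assumption, since the exercise does not say what a comment would look like.

```python
def check_input_list(text, comment_prefix="#"):
    """Return a list of problems found in the content of an input list file.

    The '#' comment prefix is an assumption; the exercise only states that
    comments are not allowed.
    """
    problems = []
    lines = text.splitlines()
    if lines and not lines[0].strip():
        problems.append("blank line at the beginning")
    if lines and not lines[-1].strip():
        problems.append("blank line at the end")
    for number, line in enumerate(lines, start=1):
        if line.lstrip().startswith(comment_prefix):
            problems.append(f"comment on line {number}")
    return problems


if __name__ == "__main__":
    # A well-formed list, then one that breaks every rule.
    print(check_input_list("input1\ninput2\n"))
    print(check_input_list("\n# my inputs\ninput1\n\n"))
```

An empty result means the content satisfies the rules stated above. A single trailing newline after the last entry is not reported as a blank line.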
Check the run executable¶
A run executable is responsible for the execution of your application (or a step of it) by the Hadoop compute engine. In the application.xml we defined a workflow with a single node and the related run executable:
- Inspect the run executable:
Depending on the profile chosen (Maven's -P option), a run executable can be written in different programming or scripting languages, including Python, R, or Bash (the Hands-On exercises are initially available in Python and Bash).
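As an illustration of what a run executable does, here is a minimal Python sketch. It assumes that the compute engine hands the node's inputs to the executable on standard input, one per line, as Hadoop Streaming does. The actual Hands-On executables may rely on CIOP helper functions that are not shown here, and the process_inputs function and its log format are hypothetical.

```python
#!/usr/bin/env python
import sys


def process_inputs(lines):
    """Yield one log message per non-empty input line."""
    for line in lines:
        reference = line.strip()
        if reference:
            # A real executable would fetch and process the input here.
            yield f"processing input: {reference}"


if __name__ == "__main__":
    # Assumption: inputs arrive on stdin; log messages go to stderr.
    for message in process_inputs(sys.stdin):
        print(message, file=sys.stderr)
```

Keeping the per-line logic in a function that accepts any iterable of lines makes it possible to try the executable locally, for example by piping the list file into it, before running it through the workflow.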
Run the node¶
- List the available node(s) with:
- Execute it by typing:
The output will be similar to:
2016-01-19 12:27:48 [WARN ] - -- WPS needs at least one input value from your application.xml (source or parameter with scope=runtime);
2016-01-19 12:27:51 [INFO ] - Workflow submitted
2016-01-19 12:27:51 [INFO ] - Closing this program will not stop the job.
2016-01-19 12:27:51 [INFO ] - To kill this job type:
2016-01-19 12:27:51 [INFO ] - ciop-stop 0000000-160119102214227-oozie-oozi-W
2016-01-19 12:27:51 [INFO ] - Tracking URL:
2016-01-19 12:27:51 [INFO ] - http://sb-10-16-10-50.dev.terradue.int:11000/oozie/?job=0000000-160119102214227-oozie-oozi-W
Node Name : my_node
Status : OK
Publishing results...
2016-01-19 12:28:31 [INFO ] - Workflow completed.
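The output contains two values worth keeping: the workflow identifier (needed to stop the job with ciop-stop) and the tracking URL. The following Python sketch pulls both out of captured output. The extract_job_info function is hypothetical, and the patterns are based only on the sample output shown here, so other versions of the tools may need different ones.

```python
import re

# Excerpt of the sample output from this exercise.
SAMPLE_OUTPUT = """\
2016-01-19 12:27:51 [INFO ] - To kill this job type:
2016-01-19 12:27:51 [INFO ] - ciop-stop 0000000-160119102214227-oozie-oozi-W
2016-01-19 12:27:51 [INFO ] - Tracking URL:
2016-01-19 12:27:51 [INFO ] - http://sb-10-16-10-50.dev.terradue.int:11000/oozie/?job=0000000-160119102214227-oozie-oozi-W
"""


def extract_job_info(output):
    """Return (workflow_id, tracking_url); either is None when not found."""
    job = re.search(r"ciop-stop\s+(\S+)", output)
    url = re.search(r"https?://\S+", output)
    return (job.group(1) if job else None, url.group(0) if url else None)


if __name__ == "__main__":
    workflow_id, tracking_url = extract_job_info(SAMPLE_OUTPUT)
    print("workflow id:", workflow_id)
    print("tracking URL:", tracking_url)
```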
Because the Hadoop Sandbox mode used here runs on a virtual machine with two cores, and the node my_node has only two inputs to process, the input1 and input2 lines were processed in parallel by two simultaneous tasks, each task handling a single entry of the input file. Hadoop deployments in Cluster mode handle scaling your application up to larger volumes of input data and more processing nodes.
The job did not produce anything useful, only two lines of logging. The next exercise explains how to access the logs via a web browser.
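The one-task-per-input behaviour can be pictured with a small local analogy. The Python sketch below is not Hadoop and does not use the CIOP tools; it only mimics two workers each taking one entry of the input list, with run_task and run_node as hypothetical stand-ins.

```python
from concurrent.futures import ThreadPoolExecutor


def run_task(entry):
    """Stand-in for one task processing a single entry of the input file."""
    return f"processed {entry}"


def run_node(entries, cores=2):
    """Process the entries with at most `cores` simultaneous tasks."""
    # One worker per core; each worker handles one entry at a time.
    with ThreadPoolExecutor(max_workers=cores) as pool:
        # map() returns results in the order of the input entries.
        return list(pool.map(run_task, entries))


if __name__ == "__main__":
    print(run_node(["input1", "input2"]))
```

With more entries than workers, the remaining entries wait until a worker is free, which is the situation that a larger deployment addresses by adding processing capacity.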