In this exercise we run an application workflow made of two nodes: the outputs of the first node are passed as inputs to the second, and the final results of the workflow are published on HDFS.
Install the exercise application on your sandbox:
cd
cd dcs-hands-on
mvn clean install -D hands.on=6 -P bash
Note: this installation uses the ImageMagick [1] tools to perform image manipulations.
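Once the build completes, you can verify that the two job directories referenced by the application descriptor were deployed. This is an optional sanity check, assuming the application is installed under $_CIOP_APPLICATION_PATH (typically /application), the same variable used later in this exercise:
ls $_CIOP_APPLICATION_PATH
ls $_CIOP_APPLICATION_PATH/expression $_CIOP_APPLICATION_PATH/binning
The application descriptor deployed for this exercise is shown below: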
<?xml version="1.0" encoding="UTF-8"?>
<application id="beam_arithm">
  <jobTemplates>
    <!-- BEAM BandMaths operator job template -->
    <jobTemplate id="expression">
      <streamingExecutable>/application/expression/run</streamingExecutable>
      <defaultParameters>
        <parameter id="expression">l1_flags.INVALID?0:radiance_13>15?0:100+radiance_9-(radiance_8+(radiance_10-radiance_8)*27.524/72.570)</parameter>
      </defaultParameters>
    </jobTemplate>
    <!-- BEAM Level 3 processor job template -->
    <jobTemplate id="binning">
      <streamingExecutable>/application/binning/run</streamingExecutable>
      <defaultParameters>
        <parameter id="cellsize">9.28</parameter>
        <parameter id="bandname">out</parameter>
        <parameter id="bitmask">l1_flags.INVALID?0:radiance_13>15?0:100+radiance_9-(radiance_8+(radiance_10-radiance_8)*27.524/72.570)</parameter>
        <parameter id="bbox">-180,-90,180,90</parameter>
        <parameter id="algorithm">MIN_MAX</parameter>
        <parameter id="outputname">binned</parameter>
        <parameter id="resampling">binning</parameter>
        <parameter id="palette">#MCI_Palette
color0=0,0,0
color1=0,0,154
color2=54,99,250
color3=110,201,136
color4=166,245,8
color5=222,224,0
color6=234,136,0
color7=245,47,0
color8=255,255,255
numPoints=9
sample0=98.19878118960284
sample1=98.64947122314665
sample2=99.10016125669047
sample3=99.5508512902343
sample4=100.0015413237781
sample5=100.4522313573219
sample6=100.90292139086574
sample7=101.35361142440956
sample8=101.80430145795337</parameter>
        <parameter id="band">1</parameter>
        <parameter id="tailor">true</parameter>
      </defaultParameters>
      <defaultJobconf>
        <property id="ciop.job.max.tasks">1</property>
      </defaultJobconf>
    </jobTemplate>
  </jobTemplates>
  <workflow id="hands-on-6" title="A multi-node workflow" abstract="Exercise 6, a multi-node workflow">
    <workflowVersion>1.0</workflowVersion>
    <node id="node_expression">
      <job id="expression"></job>
      <sources>
        <source refid="file:urls">/application/inputs/list</source>
      </sources>
      <parameters>
      </parameters>
    </node>
    <node id="node_binning">
      <job id="binning"></job>
      <sources>
        <source refid="wf:node">node_expression</source>
      </sources>
      <parameters>
        <parameter id="bitmask"/>
      </parameters>
    </node>
  </workflow>
</application>
We added a second processing node, node_binning, and declared node_expression as its source:
<node id="node_binning">
  <job id="binning"></job>
  <sources>
    <source refid="wf:node">node_expression</source>
  </sources>
  <parameters>
    <parameter id="bitmask"/>
  </parameters>
</node>
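If you edit the descriptor yourself, a quick well-formedness check helps to catch XML mistakes before running the workflow. This is optional, and assumes xmllint is available on the sandbox and that the descriptor is deployed as application.xml under $_CIOP_APPLICATION_PATH:
xmllint --noout $_CIOP_APPLICATION_PATH/application.xml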
Now inspect the run script of the binning node and look for the calls to ciop-publish:
cd $_CIOP_APPLICATION_PATH/binning
more run
grep ciop-publish run
Note that the ciop-publish command is called with the option -m: the files are published as results of the entire workflow. They are not passed to any subsequent job; instead, they are placed in a persistent shared location common to the whole workflow.
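As a sketch of the difference, using hypothetical file names that are not taken from the run scripts of this exercise: a file published without -m becomes an input reference for the next node, while a file published with -m becomes a final result of the workflow.
ciop-publish /tmp/outputs/expression_output.dim      # intermediate result, passed to the next node
ciop-publish -m /tmp/outputs/binned_result.tif       # final result of the entire workflow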
List the workflow nodes:
ciop-run -n
You will see:
node_expression
node_binning
Run the nodes one at a time, first node_expression and then node_binning:
ciop-run node_expression
ciop-run node_binning
The output will be similar to:
2016-01-19 17:01:03 [WARN ] - -- WPS needs at least one input value from your application.xml (source or parameter with scope=runtime);
2016-01-19 17:01:04 [INFO ] - Workflow submitted
2016-01-19 17:01:04 [INFO ] - Closing this program will not stop the job.
2016-01-19 17:01:04 [INFO ] - To kill this job type:
2016-01-19 17:01:04 [INFO ] - ciop-stop 0000025-160119102214227-oozie-oozi-W
2016-01-19 17:01:04 [INFO ] - Tracking URL:
2016-01-19 17:01:04 [INFO ] - http://sb-10-16-10-50.dev.terradue.int:11000/oozie/?job=0000025-160119102214227-oozie-oozi-W
Node Name : node_binning
Status : OK
Publishing results...
2016-01-19 17:02:56 [INFO ] - Workflow completed.
2016-01-19 17:02:56 [INFO ] - Output Metalink: http://sb-10-16-10-50.dev.terradue.int:50070/webhdfs/v1/ciop/run/hands-on-6/0000025-160119102214227-oozie-oozi-W/results.metalink?op=OPEN
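The Output Metalink line points to a metalink document on HDFS, exposed through WebHDFS, that references the files published as workflow results. You can retrieve it with any HTTP client, for example with curl (the host name and workflow identifier below are those of the log above; yours will differ):
curl "http://sb-10-16-10-50.dev.terradue.int:50070/webhdfs/v1/ciop/run/hands-on-6/0000025-160119102214227-oozie-oozi-W/results.metalink?op=OPEN"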
Check in these logs how the job is run as a Hadoop Streaming MapReduce task. A MapReduce job usually splits the input source so that independent data chunks are processed by the map tasks in a completely parallel manner. The Hadoop framework takes care of task scheduling and monitoring, and re-executes failed tasks.
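In this model, the streaming executable of each node receives its inputs line by line on standard input; for node_binning these are the references published by node_expression. A minimal sketch of that pattern, with a placeholder processing step instead of the actual BEAM invocations used in this exercise:
#!/bin/bash
# minimal streaming executable sketch (illustrative only)
while read input
do
  # messages to stderr end up in the task logs
  echo "processing $input" 1>&2
  # ... produce an output file from $input, then publish it, e.g.:
  # ciop-publish /tmp/outputs/my_result
done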
Finally, run the complete workflow with a single command:
ciop-run
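If you need to interrupt a running workflow, use the ciop-stop command with the identifier printed at submission time (the identifier below is the one from the example log above; yours will differ):
ciop-stop 0000025-160119102214227-oozie-oozi-W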
Footnotes
[1] ImageMagick