In this exercise we run an application workflow made of two nodes: the outputs of the first node are passed as inputs to the second, and the final results of the workflow are published on HDFS.
Install the exercise application on your sandbox:
cd
cd dcs-hands-on
mvn clean install -D hands.on=6 -P bash
Note: this installation uses the ImageMagick [1] tools to perform image manipulations.
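Once the build completes, you can verify that the two job directories referenced by the application descriptor were deployed. This is an optional sanity check, assuming the application is installed under $_CIOP_APPLICATION_PATH (typically /application), the same variable used later in this exercise:
ls $_CIOP_APPLICATION_PATH
ls $_CIOP_APPLICATION_PATH/expression $_CIOP_APPLICATION_PATH/binning
The application descriptor deployed for this exercise is shown below: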
<?xml version="1.0" encoding="UTF-8"?>
<application id="beam_arithm">
  <jobTemplates>
    <!-- BEAM BandMaths operator job template -->
    <jobTemplate id="expression">
      <streamingExecutable>/application/expression/run</streamingExecutable>
      <defaultParameters>
        <parameter id="expression">l1_flags.INVALID?0:radiance_13>15?0:100+radiance_9-(radiance_8+(radiance_10-radiance_8)*27.524/72.570)</parameter>
      </defaultParameters>
    </jobTemplate>
    <!-- BEAM Level 3 processor job template -->
    <jobTemplate id="binning">
      <streamingExecutable>/application/binning/run</streamingExecutable>
      <defaultParameters>
        <parameter id="cellsize">9.28</parameter>
        <parameter id="bandname">out</parameter>
        <parameter id="bitmask">l1_flags.INVALID?0:radiance_13>15?0:100+radiance_9-(radiance_8+(radiance_10-radiance_8)*27.524/72.570)</parameter>
        <parameter id="bbox">-180,-90,180,90</parameter>
        <parameter id="algorithm">MIN_MAX</parameter>
        <parameter id="outputname">binned</parameter>
        <parameter id="resampling">binning</parameter>
        <parameter id="palette">#MCI_Palette
color0=0,0,0
color1=0,0,154
color2=54,99,250
color3=110,201,136
color4=166,245,8
color5=222,224,0
color6=234,136,0
color7=245,47,0
color8=255,255,255
numPoints=9
sample0=98.19878118960284
sample1=98.64947122314665
sample2=99.10016125669047
sample3=99.5508512902343
sample4=100.0015413237781
sample5=100.4522313573219
sample6=100.90292139086574
sample7=101.35361142440956
sample8=101.80430145795337</parameter>
        <parameter id="band">1</parameter>
        <parameter id="tailor">true</parameter>
      </defaultParameters>
      <defaultJobconf>
        <property id="ciop.job.max.tasks">1</property>
      </defaultJobconf>
    </jobTemplate>
  </jobTemplates>
  <workflow id="hands-on-6" title="A multi-node workflow" abstract="Exercise 6, a multi-node workflow">
    <workflowVersion>1.0</workflowVersion>
    <node id="node_expression">
      <job id="expression"></job>
      <sources>
        <source refid="file:urls">/application/inputs/list</source>
      </sources>
      <parameters>
      </parameters>
    </node>
    <node id="node_binning">
      <job id="binning"></job>
      <sources>
        <source refid="wf:node">node_expression</source>
      </sources>
      <parameters>
        <parameter id="bitmask"/>
      </parameters>
    </node>
  </workflow>
</application>
We added a second processing node, node_binning, and declared node_expression as its source:
<node id="node_binning">
  <job id="binning"></job>
  <sources>
    <source refid="wf:node">node_expression</source>
  </sources>
  <parameters>
    <parameter id="bitmask"/>
  </parameters>
</node>
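If you edit the descriptor yourself, a quick well-formedness check helps to catch XML mistakes before running the workflow. This is optional, and assumes xmllint is available on the sandbox and that the descriptor is deployed as application.xml under $_CIOP_APPLICATION_PATH:
xmllint --noout $_CIOP_APPLICATION_PATH/application.xml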
Now inspect the run script of the binning node and look for the calls to ciop-publish:
cd $_CIOP_APPLICATION_PATH/binning
more run
grep ciop-publish run
Note that the ciop-publish command is called with the option -m: the files are published as results of the entire workflow. They are not passed to any subsequent job; instead, they are placed in a persistent shared location common to the whole workflow.
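As a sketch of the difference, using hypothetical file names that are not taken from the run scripts of this exercise: a file published without -m becomes an input reference for the next node, while a file published with -m becomes a final result of the workflow.
ciop-publish /tmp/outputs/expression_output.dim      # intermediate result, passed to the next node
ciop-publish -m /tmp/outputs/binned_result.tif       # final result of the entire workflow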
List the workflow nodes:
ciop-run -n
You will see:
node_expression
node_binning
Run the nodes one at a time, first node_expression and then node_binning:
ciop-run node_expression
ciop-run node_binning
The output will be similar to:
2016-01-19 17:01:03 [WARN ] - -- WPS needs at least one input value from your application.xml (source or parameter with scope=runtime);
2016-01-19 17:01:04 [INFO ] - Workflow submitted
2016-01-19 17:01:04 [INFO ] - Closing this program will not stop the job.
2016-01-19 17:01:04 [INFO ] - To kill this job type:
2016-01-19 17:01:04 [INFO ] - ciop-stop 0000025-160119102214227-oozie-oozi-W
2016-01-19 17:01:04 [INFO ] - Tracking URL:
2016-01-19 17:01:04 [INFO ] - http://sb-10-16-10-50.dev.terradue.int:11000/oozie/?job=0000025-160119102214227-oozie-oozi-W
Node Name : node_binning
Status : OK
Publishing results...
2016-01-19 17:02:56 [INFO ] - Workflow completed.
2016-01-19 17:02:56 [INFO ] - Output Metalink: http://sb-10-16-10-50.dev.terradue.int:50070/webhdfs/v1/ciop/run/hands-on-6/0000025-160119102214227-oozie-oozi-W/results.metalink?op=OPEN
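The Output Metalink line points to a metalink document on HDFS, exposed through WebHDFS, that references the files published as workflow results. You can retrieve it with any HTTP client, for example with curl (the host name and workflow identifier below are those of the log above; yours will differ):
curl "http://sb-10-16-10-50.dev.terradue.int:50070/webhdfs/v1/ciop/run/hands-on-6/0000025-160119102214227-oozie-oozi-W/results.metalink?op=OPEN"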
Check in these logs how the job is run as a Hadoop Streaming MapReduce task. A MapReduce job usually splits the input source so that independent data chunks are processed by the map tasks in a completely parallel manner. The Hadoop framework takes care of task scheduling and monitoring, and re-executes failed tasks.
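In this model, the streaming executable of each node receives its inputs line by line on standard input; for node_binning these are the references published by node_expression. A minimal sketch of that pattern, with a placeholder processing step instead of the actual BEAM invocations used in this exercise:
#!/bin/bash
# minimal streaming executable sketch (illustrative only)
while read input
do
  # messages to stderr end up in the task logs
  echo "processing $input" 1>&2
  # ... produce an output file from $input, then publish it, e.g.:
  # ciop-publish /tmp/outputs/my_result
done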
Finally, run the complete workflow with a single command:
ciop-run
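If you need to interrupt a running workflow, use the ciop-stop command with the identifier printed at submission time (the identifier below is the one from the example log above; yours will differ):
ciop-stop 0000025-160119102214227-oozie-oozi-W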
Footnotes
[1] ImageMagick