.. _principles:

Understand the Ellip Workflows development key principles
=========================================================

The Ellip Workflows filesystems
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

In the context of your application development life-cycle, the Ellip Workflows environment provides you with three filesystems (or directories):

* /home/ that we refer to as *HOME*
* /application that we refer to as *APPLICATION*
* /share that we refer to as *SHARE*

The HOME filesystem
-------------------

A user's home directory is intended to contain the user's files: text documents, pictures, videos, etc. It may also include the configuration files holding the preferred settings of any software used there. The user can install executable software in this directory, but it will only be available to users with permission to access this directory. The home directory can be organized further with sub-directories.

As such, *HOME* is used to store the user's files. It can be used to store source files (the compiled programs would then go to *APPLICATION*).

.. NOTE::
   At job or workflow execution time, the Ellip Workflows uses a system user to execute the application. This system user cannot read files in *HOME*. When the application is run on a Production Environment (cluster mode), the *HOME* directory is no longer available on any of the computing nodes.

The APPLICATION filesystem
--------------------------

The *APPLICATION* filesystem contains all the files required to run the application. It is available on the Ellip Workflows machine as /application.

.. note::
   Whenever an application wrapper script needs to refer to the *APPLICATION* value (/application), use the variable $_CIOP_APPLICATION_PATH, for example:

   .. code-block:: bash

      export BEAM_HOME=$_CIOP_APPLICATION_PATH/common/beam-4.11

The *APPLICATION* contains:

* the Application Descriptor File, named *application.xml*
* a folder for each job template

.. seealso::
   The Application Descriptor file is described in :ref:`application_descriptor_reference`

A *job template* folder contains:

* the streaming executable script, a script in your preferred language (e.g. bash, R or Python) that deals with the *stdin* managed by the Ellip Workflows virtual machine (e.g. EO data URLs to be passed to ciop-copy). There is no defined naming convention, although the script is often called *run* with an extension (a minimal sketch of such a script is given after the layout example below):

  * run.sh for a bash streaming executable
  * run.py for a Python streaming executable
  * run.R for an R streaming executable

  .. note::
     The streaming executable script reads its inputs via stdin, managed by the underlying Hadoop MapReduce streaming layer

* a set of folders such as:

  * /application/<job template>/bin, standing for "binaries", containing the fundamental job utilities needed by the job wrapper script
  * /application/<job template>/etc, containing job-wide configuration files
  * /application/<job template>/lib, containing the job libraries
  * ...

  .. note::
     There aren't any particular rules for the folders in the job template folder

The *APPLICATION* of a workflow with two jobs can then be represented as:

.. code-block:: bash

   /application/
     application.xml
     /job_template_1
       run.sh
       /bin
       /etc
     /job_template_2
       run.sh
       /bin
       /lib
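As an illustration of what such a streaming executable could look like, here is a minimal sketch of a bash *run.sh* that reads input references from stdin, stages them locally with ciop-copy and publishes its results with ciop-publish. This is only a sketch: the command-line options used (for instance the output-directory option of ciop-copy), the working directory and the placeholder processing step are assumptions made for the example; check the help of the ciop utilities available on your Ellip Workflows machine for the actual interface.

.. code-block:: bash

   #!/bin/bash

   # Minimal illustrative sketch of a bash streaming executable (run.sh).
   # Each line received on stdin is one input reference (e.g. an EO data URL).
   # The ciop-copy/ciop-publish options used below are assumptions.

   WORKDIR=/tmp/$(basename "$0").$$
   mkdir -p "$WORKDIR"

   while read input
   do
     # Stage the input product locally (assumption: -o sets the output
     # directory and the local path is echoed on stdout)
     local_file=$(ciop-copy -o "$WORKDIR" "$input")

     # Placeholder processing step: replace with the real processor invocation
     md5sum "$local_file" > "$WORKDIR/$(basename "$local_file").md5"

     # Publish the result to the distributed filesystem (SHARE/HDFS)
     ciop-publish "$WORKDIR/$(basename "$local_file").md5"
   done

   exit 0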
The SHARE filesystem
--------------------

The *SHARE* filesystem is the Linux mount point for the Hadoop Distributed File System (HDFS). This HDFS filesystem is used to store the application's Job outputs, generated by the execution of ciop-run. The application's workflow and its node names are defined in the Application Descriptor File of your Ellip Workflows development environment.

The *SHARE* filesystem is available on the Ellip Workflows machine as /share, and the HDFS distributed filesystem access point is /tmp. Therefore, on the Ellip Workflows machine, /share/tmp is the root of the distributed filesystem.

.. WARNING::
   In Cluster mode (production environment), the *SHARE* mount is no longer available. Do not use /share to reference files available on HDFS; use the hdfs:// path instead, as returned by the ciop-publish utility.

When ciop-run is invoked to run the complete application workflow, the outputs are found in a dedicated folder under *SHARE*:

.. code-block:: bash

   /share/ciop/run/<workflow>/<node>/data

and with the hdfs:// URL:

.. code-block:: bash

   hdfs:///ciop/run/<workflow>/<node>/data (without /share)

For example, you can access a data folder with job outputs either through:

.. code-block:: bash

   $ ls /share/ciop/run/beam_arithm/node_expression/data

or

.. code-block:: bash

   $ hadoop dfs -ls /ciop/run/beam_arithm/node_expression/data (without /share)

.. admonition:: Features

   The ciop-run command keeps track of all its workflow execution runs. This feature allows you, for example, to compare the results obtained with different sets of parameters.

.. TO DO::
   Check the Application Descriptor file to define the 'default parameter' values, and learn how to override these in the workflow.

You now have an understanding of the way your PaaS environment deals with datasets and programs, and how it leverages the Hadoop Distributed File System.

.. LEARN MORE::
   You can get a deeper insight with these 5 Common Questions About Apache Hadoop: http://blog.cloudera.com/blog/2009/05/5-common-questions-about-hadoop/

The Application Workflows
^^^^^^^^^^^^^^^^^^^^^^^^^

Role of the Directed Acyclic Graph (DAG)
----------------------------------------

The DAG helps you sequence your Application workflow with simple rules. For the Hadoop Map/Reduce programming framework, a workflow is subject to constraints implying that certain tasks must be performed earlier than others.

The application nodes of the DAG can be Mappers, Reducers or (starting from ciop v1.2) Map/Reduce Hadoop jobs:

* Mappers: if the type of the application node is "Mapper", the number of Hadoop tasks that run that Job in parallel is defined by the number of available slots on the cluster.
* Reducers: if the type of the application node is "Reducer", the number of tasks is fixed to 1, independently of the cluster dimension (see the sketch after this list).
* Map/Reduce: if the type of the application node is "Map/Reduce", each parallel task re-arranges its task outputs according to the program implementing the Reducer.
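To make the "Reducer" case more concrete, the single task of a Reducer node receives on its stdin the references published by all the parallel tasks of the previous node. A minimal, purely illustrative bash sketch of such an aggregating streaming executable is given below; the file names and the placeholder aggregation step are assumptions made for the example.

.. code-block:: bash

   #!/bin/bash

   # Illustrative sketch of a streaming executable for a "Reducer" node.
   # The single Reducer task receives on its stdin the references published
   # by all the parallel tasks of the previous node.

   report=/tmp/aggregation_report.$$.txt
   : > "$report"

   while read reference
   do
     # Collect every incoming reference (e.g. an hdfs:// path returned by
     # ciop-publish in the previous node)
     echo "received: $reference" >> "$report"
   done

   # Placeholder aggregation step: replace with the real merging logic
   echo "total inputs: $(grep -c '^received:' "$report")" >> "$report"

   # Publish the single aggregated output
   ciop-publish "$report"

   exit 0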
Hadoop Streaming
----------------

The Ellip Workflows environment builds on a “shared-nothing” architecture that partitions and distributes each large dataset to the disks attached directly to the worker nodes of the cluster.

Hadoop splits (distributes) the standard input of a Job among the tasks created on the cluster. A task is created from a Job template. The input split depends on the number of available task slots, and the number of task slots depends on the cluster dimension (the number of worker nodes). In the Ellip Workflows environment (pseudo-cluster mode), the cluster dimension is 1 and the number of available task slots is 2 (running on a 2-core CPU). In the IaaS Production environment (cluster mode), the cluster dimension is n (the servers provisioned on the cluster) and the number of available task slots is n x m (with m the number of CPU cores of the provisioned server type).

The Application Descriptor file
-------------------------------

The application descriptor file contains the definition of the application and is composed of two sections (an indicative skeleton is sketched at the end of this section):

* A "jobTemplates" section, describing each Job Template required by the application workflow, with its streaming executable file location, default parameters, and default Job configuration.
* A "workflow" section, describing the sequence of the workflow nodes, with, for each node, its Job template, the source of its inputs (e.g. a file with dataset URLs, a catalogue series, a previous node, or an input string), and the parameter values that may override the default parameters defined in the job template above.

The application descriptor is an XML file managed on the Ellip Workflows APPLICATION filesystem, located at $_CIOP_APPLICATION_PATH/application.xml (the value of $_CIOP_APPLICATION_PATH is "/application").

.. seealso::
   The Application Descriptor file structure is documented in :ref:`application_descriptor_reference`

.. tip::
   Check that your application descriptor file is well formed with the :ref:`ciop_app_check` utility
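To make the two-section structure above more concrete, here is a heavily simplified, indicative skeleton of an application.xml for the two-job layout shown earlier. The element and attribute names, identifiers and values used here are assumptions made for this sketch only; refer to :ref:`application_descriptor_reference` for the actual schema and check the result with the :ref:`ciop_app_check` utility.

.. code-block:: xml

   <?xml version="1.0" encoding="UTF-8"?>
   <!-- Indicative skeleton only: element and attribute names are assumptions,
        see the Application Descriptor reference for the actual schema -->
   <application id="my_application">
     <jobTemplates>
       <jobTemplate id="job_template_1">
         <streamingExecutable>/application/job_template_1/run.sh</streamingExecutable>
         <defaultParameters>
           <parameter id="expression">band_1 + band_2</parameter>
         </defaultParameters>
       </jobTemplate>
       <jobTemplate id="job_template_2">
         <streamingExecutable>/application/job_template_2/run.sh</streamingExecutable>
       </jobTemplate>
     </jobTemplates>
     <workflow id="my_workflow">
       <node id="node_1">
         <job id="job_template_1"/>
         <sources>
           <source>an input reference, e.g. a catalogue series or a file with dataset URLs</source>
         </sources>
       </node>
       <node id="node_2">
         <job id="job_template_2"/>
         <sources>
           <!-- the second node takes the outputs of the first node as its source -->
           <source>node_1</source>
         </sources>
         <parameters>
           <!-- a value here would override the default defined in the job template -->
         </parameters>
       </node>
     </workflow>
   </application>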