Understand the key principles of Ellip Workflows development¶
The Ellip Workflows filesystems¶
In the context of your application development life-cycle, the Ellip Workflows provides you with three filesystems (or directories):
- /home/<user> that we refer to as HOME
- /application that we refer to as APPLICATION
- /share that we refer to as SHARE
The HOME filesystem¶
A user’s home directory is intended to contain the user’s files, possibly including text documents, pictures or videos. It may also include the configuration files of preferred settings for any software used there. The user can install executable software in this directory, but it will only be available to users with permission to access this directory. The home directory can be organized further with the use of sub-directories.
As such, the HOME is used to store the user’s files. It can be used to store source files (the compiled programs would then go in APPLICATION).
At job or workflow execution time, the Ellip Workflows uses a system user to execute the application. This system user cannot read files in HOME. When the application is run on a Production Environment (cluster mode), the HOME directory is no longer available on any of the computing nodes.
The APPLICATION filesystem¶
The APPLICATION filesystem contains all the files required to run the application.
The APPLICATION filesystem is available on the Ellip Workflows as /application.
Whenever an application wrapper script needs to refer to the APPLICATION value (/application), use the variable $_CIOP_APPLICATION_PATH rather than the literal path.
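For a wrapper written in Python, the same idea can be sketched as follows. This is a minimal illustration, assuming that $_CIOP_APPLICATION_PATH is exported to the script's environment; the helper names and the example file name are hypothetical.

```python
import os
from pathlib import Path

# Documented value of APPLICATION; used only if the variable is not set.
DEFAULT_APPLICATION_PATH = "/application"


def application_path(env=os.environ):
    """Return the APPLICATION root, preferring $_CIOP_APPLICATION_PATH.

    Assumption: the variable is visible in the process environment.
    """
    return Path(env.get("_CIOP_APPLICATION_PATH", DEFAULT_APPLICATION_PATH))


def job_resource(job_template, *parts, env=os.environ):
    """Build a path inside a job template folder without hardcoding the root."""
    return application_path(env).joinpath(job_template, *parts)


if __name__ == "__main__":
    # "job_template_1" and "etc/settings.conf" are hypothetical names.
    print(job_resource("job_template_1", "etc", "settings.conf"))
```

Resolving every path through one helper keeps the literal /application out of the rest of the script.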
The APPLICATION contains:
- the Application Descriptor File, named application.xml
- a folder for each job template
The Application Descriptor file is described in the Application descriptor reference.
A job template folder contains:
- the streaming executable script, a script in your preferred language (e.g. bash, R or Python) that deals with the stdin managed by the Ellip Workflows virtual machine (e.g. EO data URLs to be passed to ciop-copy).
There isn’t a defined naming convention, although it is often called run with an extension:
- run.sh for bash scripting streaming executable
- run.py for python streaming executable
- run.R for R streaming executable
The streaming executable script reads its inputs via stdin, which is managed by the underlying Hadoop Map Reduce streaming layer.
- a set of folders such as:
- /application/<job template name>/bin, short for “binaries”, containing fundamental job utilities, some of which are needed by the job wrapper script.
- /application/<job template name>/etc containing job-wide configuration files
- /application/<job template name>/lib containing the job libraries
There aren’t any particular rules for the folders in the job template folder.
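The core of a run.py streaming executable can be sketched as below. It only shows the stdin-reading pattern described above: one input reference per line, processed in turn. The process function is a hypothetical placeholder, and no Ellip-specific library calls are used.

```python
import sys


def read_inputs(stream):
    """Yield one input reference per non-empty line of the stream."""
    for line in stream:
        reference = line.strip()
        if reference:
            yield reference


def process(reference):
    """Hypothetical placeholder for the real work; it only labels the input."""
    return "processed " + reference


def main(stdin=None, stdout=None):
    """Read references from stdin and write one result line per reference."""
    stdin = sys.stdin if stdin is None else stdin
    stdout = sys.stdout if stdout is None else stdout
    for reference in read_inputs(stdin):
        stdout.write(process(reference) + "\n")
    return 0
```

An actual script would end by calling main() so that it consumes the standard input handed to it. Passing the streams as arguments keeps the logic testable with in-memory text.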
The APPLICATION of a workflow with two jobs can then be represented as:
/application/
  application.xml
  /job_template_1
    run.sh
    /bin
    /etc
  /job_template_2
    run.sh
    /bin
    /lib
The Application Workflows¶
Role of the Directed Acyclic Graph (DAG)¶
The DAG helps you to sequence your Application workflow with simple rules. For the Hadoop Map/Reduce programming framework, a workflow is subject to constraints implying that certain tasks must be performed earlier than others.
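The ordering constraint can be illustrated with a topological sort. The sketch below uses Python's standard graphlib module (Python 3.9+) on a hypothetical four-node workflow; it shows the general DAG idea and is not how the Ellip Workflows scheduler is implemented.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical workflow: each node maps to the set of nodes it depends on.
workflow = {
    "node_a": set(),
    "node_b": {"node_a"},
    "node_c": {"node_a"},
    "node_d": {"node_b", "node_c"},
}


def execution_order(dependencies):
    """Return one valid execution order.

    Raises graphlib.CycleError if the graph contains a cycle,
    that is, if it is not a DAG.
    """
    return list(TopologicalSorter(dependencies).static_order())


order = execution_order(workflow)
print(order)
```

Here node_a always comes first and node_d last, while node_b and node_c have no mutual constraint and may appear in either order.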
The application nodes of the DAG can be Mappers, Reducers or (starting from ciop v1.2) Map/Reduce Hadoop jobs.
- Mappers: if the type of the application node is “Mapper”, the number of Hadoop tasks that will run that Job in parallel is defined by the number of available slots on the cluster.
- Reducers: if the type of the application node is “Reducer”, the number of tasks is fixed at 1, independently of the cluster dimension.
- Map/Reduce: if the type of the application node is “Map/Reduce”, each parallel task rearranges its outputs according to the program implementing the Reducer.
The Ellip Workflows environment builds on a “shared-nothing” architecture that partitions and distributes each large dataset to the disks attached directly to the worker nodes of the cluster. Hadoop will split (distribute) the standard input of a Job to each task created on the cluster. A task is created from a Job template. The input split depends on the number of available task slots. The number of task slots depends on the cluster dimension (the number of worker nodes).
In the Ellip Workflows environment (pseudo-cluster mode), the cluster dimension is 1 and the number of available task slots is 2 (running on a 2-core CPU).
In the IaaS Production environment (cluster mode), the cluster dimension is n (the number of servers provisioned on the cluster) and the number of available task slots is n x m (where m is the number of CPU cores of the provisioned server type).
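The slot arithmetic can be worked through in a few lines. In this sketch the 4-node, 8-core cluster and the input names are hypothetical, and the round-robin split is only an illustration of inputs being distributed over tasks; it is not Hadoop's actual splitting algorithm.

```python
def task_slots(worker_nodes, cores_per_node):
    """Number of available task slots: n worker nodes x m cores each."""
    return worker_nodes * cores_per_node


def split_inputs(inputs, slots):
    """Illustrative round-robin split; NOT Hadoop's actual splitting algorithm."""
    groups = [[] for _ in range(slots)]
    for index, item in enumerate(inputs):
        groups[index % slots].append(item)
    # Drop empty groups when there are fewer inputs than slots.
    return [group for group in groups if group]


# Pseudo-cluster mode described above: 1 node with a 2-core CPU.
sandbox_slots = task_slots(1, 2)
# Hypothetical production cluster: 4 nodes with 8 cores each.
cluster_slots = task_slots(4, 8)

inputs = ["url_%d" % i for i in range(5)]  # hypothetical input references
print(sandbox_slots, cluster_slots)
print(split_inputs(inputs, sandbox_slots))
```

With five inputs, the two-slot environment yields two groups, while the 32-slot cluster would yield five single-item groups.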
The Application Descriptor file¶
The application descriptor file contains the definition of the application, and is composed of two sections:
- A “jobTemplates” section, describing each Job Template required by the application workflow, with its streaming executable file location, default parameters, and default Job configuration.
- A “workflow” section, describing the sequence of the workflow nodes. For each node, it gives the Job template, the source of the inputs (e.g. a file with dataset URLs, a catalogue series, a previous node, or an input string), and the parameter values that may override the default parameters defined in the job template.
The application descriptor is an XML file managed on the Ellip Workflows APPLICATION filesystem, and is located at $_CIOP_APPLICATION_PATH/application.xml (the value of $_CIOP_APPLICATION_PATH is “/application”).
The Application Descriptor file structure is documented in the Application descriptor reference.
Check that your application descriptor file is well formed with the ciop-appcheck (7) utility.
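As a complement to ciop-appcheck, and not a replacement for it, a quick well-formedness check can be sketched with Python's standard library. It assumes the two sections are direct children of the root element; the root element name used in the sample is also an assumption, and the authoritative structure is the one given in the Application descriptor reference.

```python
import xml.etree.ElementTree as ET

# The two sections named in the text above.
REQUIRED_SECTIONS = ("jobTemplates", "workflow")


def local_name(tag):
    """Strip an optional '{namespace}' prefix from an element tag."""
    return tag.rsplit("}", 1)[-1]


def missing_sections(xml_text):
    """Parse the descriptor and return the required sections it lacks.

    Raises xml.etree.ElementTree.ParseError if the XML is not well formed.
    Assumption: the sections are direct children of the root element.
    """
    root = ET.fromstring(xml_text)
    present = {local_name(child.tag) for child in root}
    return [name for name in REQUIRED_SECTIONS if name not in present]


# Minimal hypothetical descriptor; the root element name is an assumption.
SAMPLE = """<application>
  <jobTemplates/>
  <workflow/>
</application>"""

print(missing_sections(SAMPLE))
```

An empty list means both sections were found; a ParseError means the file is not well-formed XML.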