Understand the Sandbox key principles¶

The Sandbox filesystems¶

In the context of your application development life-cycle, the Sandbox provides you with three filesystems (or directories):

/home/<user> that we refer to as HOME
/application that we refer to as APPLICATION
/share that we refer to as SHARE

The HOME filesystem¶

A user’s home directory is intended to contain the user’s files, such as text documents, pictures or videos. It may also hold the configuration files of any software used there and tailored to the user’s liking: web browser bookmarks, desktop wallpaper and themes, passwords for external services accessed through that software, and so on. The user can install executable software in this directory, but it will only be available to users with permission to access this directory. The home directory can be organized further with sub-directories.

As such, HOME is used to store the user’s files. It can be used to store source files (the compiled programs would then go in APPLICATION).

Note

At job or workflow execution time, the Sandbox uses a system user to execute the application. This system user cannot read files in HOME. When the application is run on a Production Environment (cluster mode), the HOME directory is no longer available on any of the computing nodes.

The APPLICATION filesystem¶

The APPLICATION filesystem contains all the files required to run the application.

The APPLICATION filesystem is available on the Sandbox as /application.

Note

Whenever an application wrapper script needs to refer to the APPLICATION value (/application), use the variable $_CIOP_APPLICATION_PATH, for example:

export BEAM_HOME=$_CIOP_APPLICATION_PATH/common/beam-4.11
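
For a wrapper script written in Python, the same lookup could be sketched as follows. This is a minimal sketch, not part of the Sandbox tooling: the helper names application_path and beam_home are hypothetical, and the fallback to /application when the variable is unset is an assumption made so the snippet also runs outside the Sandbox.

```python
import os
from pathlib import Path


def application_path() -> Path:
    """Return the APPLICATION root, read from $_CIOP_APPLICATION_PATH."""
    # Assumption: fall back to /application when the variable is not set,
    # for example when the script is tried outside the Sandbox.
    return Path(os.environ.get("_CIOP_APPLICATION_PATH", "/application"))


def beam_home() -> Path:
    """Hypothetical helper mirroring the BEAM_HOME export shown above."""
    return application_path() / "common" / "beam-4.11"


if __name__ == "__main__":
    print(beam_home())
```

Reading the variable at call time, rather than hard-coding /application, keeps the script independent of where the APPLICATION filesystem is mounted.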

The APPLICATION filesystem contains:

the Application Descriptor File, named application.xml
a folder for each job template

See also

The Application Descriptor file is described in Application descriptor reference

A job template folder contains:

the streaming executable script, a script in your preferred language (e.g. bash, R or Python) that deals with the stdin managed by the Sandbox (e.g. EO data URLs to be passed to ciop-copy).

There isn’t a defined naming convention although it is often called run with an extension:

run.sh for bash scripting streaming executable
run.py for python streaming executable
run.R for R streaming executable

Note

The streaming executable script reads its inputs via stdin, which is managed by the underlying Hadoop Map/Reduce streaming layer.

a set of folders such as:
- /application/<job template name>/bin, which stands for “binaries” and contains fundamental job utilities, some of which are needed by the job wrapper script
- /application/<job template name>/etc containing job-wide configuration files
- /application/<job template name>/lib containing the job libraries
- ...

Note

There aren’t any particular rules for the folders in the job template folder.
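
As an illustration of the streaming executable described above, a run.py could be structured along the lines of the sketch below. It only shows the stdin-reading pattern: the function names read_inputs and process are hypothetical, skipping blank lines is an assumption, and the placeholder process function simply returns its input where a real script would fetch and handle the data (for example by passing the URL to ciop-copy).

```python
#!/usr/bin/env python
import sys
from typing import Iterable, Iterator


def read_inputs(stream: Iterable[str]) -> Iterator[str]:
    """Yield one input reference (e.g. an EO data URL) per non-empty line."""
    for line in stream:
        reference = line.strip()
        if reference:  # assumption: blank lines carry no input and are skipped
            yield reference


def process(reference: str) -> str:
    """Placeholder for the real work done on one input reference."""
    return reference


def main() -> int:
    # The inputs arrive on stdin, one per line, as managed by the Sandbox.
    for reference in read_inputs(sys.stdin):
        print(process(reference))
    return 0


if __name__ == "__main__":
    sys.exit(main())
```

Keeping the line parsing in its own function makes the script easy to try locally by piping a small file of URLs into it.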

The APPLICATION of a workflow with two jobs can then be represented as:

/application/
  application.xml
  /job_template_1
    run.sh
    /bin
    /etc
  /job_template_2
    run.sh
    /bin
    /lib

The Application Workflow¶

Role of the Directed Acyclic Graph (DAG)¶

The DAG helps you to sequence your Application workflow with simple rules. For the Hadoop Map/Reduce programming framework, a workflow is subject to constraints implying that certain tasks must be performed earlier than others.

The application nodes of the DAG can be Mappers, Reducers or (starting from ciop v1.2) Map/Reduce Hadoop jobs.

Mappers: if the type of the application node is “Mapper”, the number of Hadoop tasks that will run that Job in parallel is defined by the number of available slots on the cluster.
Reducers: if the type of the application node is “Reducer”, the number of tasks is fixed at 1, regardless of the cluster dimension.
Map/Reduce: if the type of the application node is “Map/Reduce”, each parallel task re-arranges its outputs according to the program implementing the Reducer.
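
The ordering constraint that a DAG imposes can be illustrated with a topological sort. The sketch below uses Python's standard graphlib module (Python 3.9+); it is not how the Sandbox schedules nodes internally, and the node names and the dictionary layout are hypothetical.

```python
from graphlib import CycleError, TopologicalSorter

# Hypothetical workflow: each key is a node, each value is the set of
# nodes whose outputs it needs, so those must run earlier.
workflow = {
    "node_preprocess": set(),
    "node_compute": {"node_preprocess"},
    "node_aggregate": {"node_compute"},
}


def execution_order(dependencies):
    """Return the nodes in an order that respects every dependency."""
    return list(TopologicalSorter(dependencies).static_order())


def is_valid_dag(dependencies) -> bool:
    """Return False when the dependencies contain a cycle."""
    try:
        execution_order(dependencies)
    except CycleError:
        return False
    return True


if __name__ == "__main__":
    print(execution_order(workflow))
```

A workflow whose nodes depend on each other in a loop has no valid execution order, which is why the graph must be acyclic.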

Hadoop Streaming¶

The Developer Cloud Sandbox environment builds on a “shared-nothing” architecture that partitions and distributes each large dataset to the disks attached directly to the worker nodes of the cluster. Hadoop splits (distributes) the standard input of a Job across the tasks created on the cluster. Each task is created from a Job template. The input split depends on the number of available task slots, which in turn depends on the cluster dimension (the number of worker nodes).

In the Developer Cloud Sandbox environment (pseudo-cluster mode), the cluster dimension is 1 and the number of available task slots is 2 (running on a 2-core CPU).

In the IaaS Production environment (cluster mode), the cluster dimension is n (the number of servers provisioned on the cluster) and the number of available task slots is n x m (where m is the number of CPU cores of the provisioned server type).
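
The slot arithmetic above can be written down as a small sketch. The task_slots function follows the n x m rule from the text; split_inputs is only an illustration of inputs being spread over slots, and its round-robin strategy is an assumption, not Hadoop's actual splitting algorithm. Both function names are hypothetical.

```python
from typing import List, Sequence


def task_slots(worker_nodes: int, cores_per_node: int) -> int:
    """Number of available task slots: n worker nodes x m cores each."""
    if worker_nodes < 1 or cores_per_node < 1:
        raise ValueError("worker_nodes and cores_per_node must be >= 1")
    return worker_nodes * cores_per_node


def split_inputs(inputs: Sequence[str], slots: int) -> List[List[str]]:
    """Spread input lines over the slots.

    Assumption: a simple round-robin, used here for illustration only.
    """
    if slots < 1:
        raise ValueError("slots must be >= 1")
    return [list(inputs[i::slots]) for i in range(slots)]


if __name__ == "__main__":
    sandbox_slots = task_slots(worker_nodes=1, cores_per_node=2)
    print(sandbox_slots)
    print(split_inputs(["url1", "url2", "url3", "url4", "url5"], sandbox_slots))
```

With the Sandbox values from the text (one node, two cores) the function gives 2 slots; a hypothetical production cluster of 4 servers with 8 cores each would give 32.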

The Application Descriptor file¶

The application descriptor file contains the definition of the application, and is composed of two sections:

A “jobTemplates” section, describing each Job Template required by the application workflow, with its streaming executable file location, default parameters, and default Job configuration.
A “workflow” section, describing the sequence of the workflow nodes and, for each node, its Job template, its source for the inputs (e.g. a file with dataset URLs, a catalogue series, a previous node, or an input string), and any parameter values that override the default parameters defined in the job template.

The application descriptor is an XML file managed on the Sandbox APPLICATION filesystem, and is located at $_CIOP_APPLICATION_PATH/application.xml (the value of $_CIOP_APPLICATION_PATH is “/application”).
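
To make the two-section structure more concrete, the sketch below parses a small, made-up descriptor with Python's standard xml.etree.ElementTree module. Only the section names jobTemplates and workflow come from the description above; every other element and attribute name (application, jobTemplate, streamingExecutable, node, id) is an illustrative assumption, so refer to the Application descriptor reference for the real schema.

```python
import xml.etree.ElementTree as ET

# Hypothetical descriptor: only "jobTemplates" and "workflow" are taken from
# the text; all other element and attribute names are assumptions.
EXAMPLE_DESCRIPTOR = """\
<application>
  <jobTemplates>
    <jobTemplate id="job_template_1">
      <streamingExecutable>/application/job_template_1/run.sh</streamingExecutable>
    </jobTemplate>
  </jobTemplates>
  <workflow>
    <node id="node_1" jobTemplate="job_template_1"/>
  </workflow>
</application>
"""


def summarize(descriptor_xml: str) -> dict:
    """Map each job template to its executable and list the workflow nodes."""
    root = ET.fromstring(descriptor_xml)
    templates = {
        template.get("id"): template.findtext("streamingExecutable")
        for template in root.findall("./jobTemplates/jobTemplate")
    }
    nodes = [
        (node.get("id"), node.get("jobTemplate"))
        for node in root.findall("./workflow/node")
    ]
    return {"templates": templates, "nodes": nodes}


if __name__ == "__main__":
    print(summarize(EXAMPLE_DESCRIPTOR))
```

The point of the sketch is the relationship between the sections: each workflow node refers back to a job template defined in the first section.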
