Understand the Sandbox key principles

The Sandbox filesystems

In the context of your application development life-cycle, the Sandbox provides you with three filesystems (or directories):

  • /home/<user> that we refer to as HOME
  • /application that we refer to as APPLICATION
  • /share that we refer to as SHARE

The HOME filesystem

A user’s home directory is intended to contain the user’s files: text documents, pictures, videos, etc. It may also include the configuration files holding the preferred settings of any software used there, tailored to your liking: web browser bookmarks, favorite desktop wallpaper and themes, passwords to external services accessed via a given piece of software, and so on. The user can install executable software in this directory, but it will only be available to users with permission to access it. The home directory can be organized further with the use of sub-directories.

As such, HOME is used to store the user’s files. It can be used to store source files (the compiled programs would then go to APPLICATION).

Note

At job or workflow execution time, the Sandbox uses a system user to execute the application. This system user cannot read files in HOME. When the application is run on a Production Environment (cluster mode), the HOME directory is no longer available on any of the computing nodes.

The APPLICATION filesystem

The APPLICATION filesystem contains all the files required to run the application.

The APPLICATION filesystem is available on the Sandbox as /application.

Note

Whenever an application wrapper script needs to refer to the APPLICATION value (/application), use the $_CIOP_APPLICATION_PATH variable, for example:

export BEAM_HOME=$_CIOP_APPLICATION_PATH/common/beam-4.11

The APPLICATION filesystem contains:

  • the Application Descriptor File, named application.xml
  • a folder for each job template

See also

The Application Descriptor file is described in Application descriptor reference

A job template folder contains:

  • the streaming executable script, a script in your preferred language (e.g. bash, R or Python) that deals with the stdin managed by the Sandbox (e.g. EO data URLs to be passed to ciop-copy).

There is no defined naming convention, although the script is often named run with an extension (a minimal sketch of such a script is given after this list):

  • run.sh for a bash streaming executable
  • run.py for a Python streaming executable
  • run.R for an R streaming executable

Note

The streaming executable script reads its inputs via stdin, which is managed by the underlying Hadoop Map/Reduce streaming layer.

  • a set of folders such as:
    • /application/<job template name>/bin standing for “binaries”, containing the executables and utilities needed by the job wrapper script
    • /application/<job template name>/etc containing job-wide configuration files
    • /application/<job template name>/lib containing the job libraries
    • ...

Note

There are no particular rules for the folder structure inside a job template folder.
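To make this concrete, below is a minimal sketch of a bash streaming executable (run.sh). It only illustrates the conventions described above: inputs (e.g. EO data URLs) arrive one per line on stdin, ciop-copy stages each input locally, and ciop-publish pushes the results to the distributed filesystem. The processing command process_product is a hypothetical placeholder, and the ciop-copy -o option (local output directory) is an assumption to be checked against your Sandbox version.

#!/bin/bash

# minimal streaming executable sketch; inputs arrive one per line on stdin
WORKDIR=${TMPDIR:-/tmp}/work
mkdir -p $WORKDIR

while read input
do
  # stage the input product locally (ciop-copy echoes the local path)
  product=$(ciop-copy -o $WORKDIR "$input")

  # placeholder for the actual processing step
  process_product "$product" > $WORKDIR/result.txt

  # publish the result so that it lands under .../<node name>/data of the current run
  ciop-publish $WORKDIR/result.txt
done

Because the framework splits the stdin lines across the available task slots, the same script runs unchanged whether one task is created on the Sandbox or many parallel tasks are created on a production cluster.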

The APPLICATION of a workflow with two jobs can then be represented as:

/application/
  application.xml
  /job_template_1
    run.sh
    /bin
    /etc
  /job_template_2
    run.sh
    /bin
    /lib

The SHARE filesystem

The SHARE filesystem is the Linux mount point for the Hadoop Distributed File System (HDFS). This HDFS filesystem is used to store the application’s Job outputs, generated by the execution of ciop-run. The application’s workflow and its node names are defined in the Application Descriptor File of your Sandbox development environment.

The SHARE filesystem is available on the Sandbox as /share, and the HDFS distributed filesystem access point is /tmp. Therefore, on the Sandbox, /share/tmp is the root of the distributed filesystem.

Warning

In Cluster mode (production environment), the SHARE mount is no longer available. Do not use /share to reference files available on HDFS, but rather use the hdfs:// path, as returned by the ciop-publish utility.

When ciop-run is invoked to run the complete application workflow, the outputs are found in a dedicated folder under SHARE:

/share/ciop/run/<run identifier>/<node name>/data

and with the hdfs:// URL:

hdfs://<sandbox_host>/ciop/run/<run identifier>/<node name>/data (without /share)

For example, you can access a data folder with job outputs either through:

$ ls /share/ciop/run/beam_arithm/node_expression/data

or

$ hadoop dfs -ls /ciop/run/beam_arithm/node_expression/data (without /share)

Features

The ciop-run command keeps track of all its workflow execution runs. This feature allows you, for example, to compare the results obtained with different sets of parameters.
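For instance, you can list the run identifiers accumulated under SHARE and compare the outputs of the same node across two runs (the run identifiers below are placeholders):

$ ls /share/ciop/run/
$ diff -r /share/ciop/run/<run identifier A>/node_expression/data /share/ciop/run/<run identifier B>/node_expression/data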

You now have an understanding of the way your PaaS environment deals with datasets and programs, and of how it leverages the Hadoop Distributed File System.

The Application Workflow

Role of the Directed Acyclic Graph (DAG)

The DAG helps you sequence your Application workflow with simple rules. In the Hadoop Map/Reduce programming framework, a workflow is subject to ordering constraints: certain tasks must be performed before others.

The application nodes of the DAG can be Mappers, Reducers or (starting from ciop v1.2) Map/Reduce Hadoop jobs.

  • Mappers: if the type of the application node is “Mapper”, the number of Hadoop tasks that will run that Job in parallel is defined by the number of available slots on the cluster.
  • Reducers: if the type of the application node is “Reducer”, the number of tasks is fixed at 1, independently of the cluster dimension (a sketch of such a single-task script is given after this list).
  • Map/Reduce: if the type of the application node is “Map/Reduce”, each parallel task re-arranges its outputs according to the program implementing the Reducer.
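As an illustration of the single-task Reducer case, the sketch below gathers on stdin all the references published by the previous node and aggregates them. It relies on the same assumptions as the earlier run.sh sketch, and aggregate_products is again a hypothetical placeholder.

#!/bin/bash

# sketch of a "Reducer" node streaming executable: the single task
# receives on stdin all the references published by the previous node
WORKDIR=${TMPDIR:-/tmp}/aggregate
mkdir -p $WORKDIR

# stage every published reference locally
while read reference
do
  ciop-copy -o $WORKDIR "$reference"
done

# aggregate the staged products into a single result and publish it
aggregate_products $WORKDIR/* > $WORKDIR/summary.txt
ciop-publish $WORKDIR/summary.txt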

Hadoop Streaming

The Developer Cloud Sandbox environment builds on a “shared-nothing” architecture that partitions and distributes each large dataset to the disks attached directly to the worker nodes of the cluster. Hadoop will split (distribute) the standard input of a Job to each task created on the cluster. A task is created from a Job template. The input split depends on the number of available task slots. The number of task slots depends on the cluster dimension (the number of worker nodes).

In the Developer Cloud Sandbox environment (pseudo-cluster mode), the cluster dimension is 1 and the number of available task slots is 2 (running on a 2-core CPU).

In the IaaS Production environment (cluster mode), the cluster dimension is n (the number of servers provisioned on the cluster) and the number of available task slots is n x m (for servers of an m-core CPU type).
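For example, with n = 10 provisioned servers of a 4-core server type (m = 4), a workflow can rely on 10 x 4 = 40 task slots, i.e. up to 40 tasks running in parallel.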

The Application Descriptor file

The application descriptor file contains the definition of the application, and is composed of two sections:

  • A “jobTemplates” section, describing each Job Template required by the application workflow, with its streaming executable file location, its default parameters, and its default Job configuration.
  • A “workflow” section, describing the sequence of the workflow nodes, with, for each node, its Job Template, the source of its inputs (e.g. a file with dataset URLs, a catalogue series, a previous node, or an input string), and its parameter values, which may override the default parameters defined in the corresponding Job Template.

The application descriptor is an XML file managed on the Sandbox APPLICATION filesystem, located at $_CIOP_APPLICATION_PATH/application.xml (the value of $_CIOP_APPLICATION_PATH is /application).

See also

The Application Descriptor file structure is documented in Application descriptor reference

Tip

Check that your application descriptor file is well formed with the ciop-appcheck utility.
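For instance (assuming here that, when invoked without arguments, the utility checks the descriptor at its default location, $_CIOP_APPLICATION_PATH/application.xml):

$ ciop-appcheck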