Data pipeline design

Parallelization Strategy

In this step, the service developer defines the directed acyclic graph (DAG) of the service, showing the jobs, inputs and outputs of the workflow. This graph shows the parallelization strategy to be applied in the Cloud environment.

Description

The goal of this pipeline is to provide WFP (World Food Programme) with data for seasonal monitoring and early warning activities. To that end, the pipeline generates Rainfall Estimates aggregations compared to a reference period.

This pipeline packages an algorithm that processes CHIRPS RFE 5 km (Rainfall Estimates) datasets and produces the following aggregations of the daily data for the area of interest:

- Sum of daily data over the past N days, derived every 10 days (N = 10, 30, 60, 90, 120, 150, 180, 270, 365 days)
- Count of daily values above 1 mm over the past N days, derived every 10 days (N = 30, 60, 90 days)
- Longest sequence of daily values < 2 mm ("dry spell") within the last N days, derived every 10 days (N = 30, 60, 90 days)
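The three aggregations can be illustrated on a single pixel's daily time series. The following is a minimal sketch, not the pipeline's actual implementation (which works on raster data): the function names `rainfall_sum`, `wet_day_count` and `longest_dry_spell` are hypothetical, and the input is assumed to be a 1-D NumPy array of daily rainfall in mm, ordered oldest to newest.

```python
import numpy as np


def rainfall_sum(daily, n):
    """Sum of the daily values over the past n days."""
    return float(np.sum(daily[-n:]))


def wet_day_count(daily, n, threshold=1.0):
    """Number of days strictly above `threshold` (mm) in the past n days."""
    return int(np.count_nonzero(daily[-n:] > threshold))


def longest_dry_spell(daily, n, threshold=2.0):
    """Longest run of consecutive days strictly below `threshold` (mm)
    within the past n days."""
    longest = current = 0
    for value in daily[-n:]:
        # Extend the current run on a dry day, reset it otherwise.
        current = current + 1 if value < threshold else 0
        longest = max(longest, current)
    return longest


# Illustrative values only (not real CHIRPS data), oldest day first.
daily = np.array([0.0, 5.0, 0.5, 1.5, 0.0, 3.0, 1.0, 0.0, 0.0, 12.0])

print(rainfall_sum(daily, 10))       # 23.0
print(wet_day_count(daily, 10))      # 4
print(longest_dry_spell(daily, 10))  # 3
```

In a raster setting, the same logic would be applied per pixel over a stack of daily grids, and re-evaluated every 10 days for each value of N.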

Ellip Workflows archetype instantiated for the wfp-01-03-02 data pipeline

Data Sources

The data requirements are analyzed and their retrieval mechanisms assessed to make sure that the data is available in the system and can be consumed as expected. Because this pipeline performs aggregations, each output is a dataset aggregated over N days of data.

Catalogue endpoint: https://catalog.terradue.com/chirps/description

Repository: https://gitlab.com/ec-better/wfp/applications/ewf-wfp-01-03-02

Tools and Libraries

The tools and libraries needed to execute the application are analyzed, and their compatibility is evaluated taking into consideration the computational resources available. The following libraries were used: osgeo, geopandas, gzip, cioppy, shutil, sys, numpy, pandas, math, re and os.

Trigger

Queue
