Data Pipeline

Introduction

AWS Data Pipeline is a web service that helps you reliably process and move data between different AWS compute and storage services, as well as on-premises data sources, at specified intervals. With AWS Data Pipeline, you can regularly access your data where it’s stored, transform and process it at scale, and efficiently transfer the results to AWS services such as Amazon S3, Amazon RDS, Amazon DynamoDB, and Amazon EMR.

AWS Data Pipeline helps you easily create complex data processing workloads that are fault tolerant, repeatable, and highly available. You don’t have to worry about ensuring resource availability, managing inter-task dependencies, retrying transient failures or timeouts in individual tasks, or creating a failure notification system. AWS Data Pipeline also allows you to move and process data that was previously locked up in on-premises data silos.

Create data pipeline

A data pipeline can be created in any of the following ways:

  • Using DuploCloud UI

  • Using an exported template from AWS console

  • Cloning an existing template

Using DuploCloud UI

Navigate to Cloud Services → Analytics → Data Pipeline, then click the +Add button.

Fill in the form, which includes fields such as name, description, S3 log folder, cron schedule details, EMR resources, and EMR steps, then click the Generate button.

Review the generated JSON and make any further changes as needed.

Using an exported template from AWS console

Navigate to Cloud Services → Analytics → Data Pipeline, click the +Add button, then click 'Import Pipeline Template'.

In the AWS console, navigate to Data Pipeline → choose an existing pipeline → click Edit → click Export. Review the exported JSON, make any further changes, and click Submit.
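As an alternative to exporting through the console UI, the same definition can be fetched with the AWS CLI; the pipeline ID below is a placeholder for your actual pipeline's ID:

```shell
# Fetch the pipeline definition as JSON (pipeline ID is a placeholder).
aws datapipeline get-pipeline-definition \
    --pipeline-id df-EXAMPLE123456 \
    --output json > pipeline-definition.json
```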

Clone existing data pipeline

Copy a previously exported template into the form. Make any additional changes (such as schedule frequency or EMR steps), then click Submit to save the Data Pipeline.

Existing Data Pipelines can be cloned in List View or Details View.

List view

To get JIT (Just In Time) access to the appropriate AWS console, click Data Pipeline, EMR Console, or EMR Jupyter Console. Use the row-level menu actions to manage the Data Pipeline (e.g., Clone, Edit, Export, Delete).

Details view

Use the Details view to update a Data Pipeline, get JIT (Just In Time) access to the AWS console, and check errors and warnings.

Example data pipeline template

There are two types of Data Pipeline templates:

  1. Exported template in AWS console

  2. Exported template in DuploCloud UI

AWS console exported template
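The exact export varies by pipeline, but an AWS-console export generally follows the standard pipeline definition format: an "objects" list plus "parameters" and "values". A minimal illustrative sketch (all names, buckets, schedules, and instance types here are hypothetical):

```json
{
  "objects": [
    {
      "id": "Default",
      "name": "Default",
      "scheduleType": "cron",
      "failureAndRerunMode": "CASCADE",
      "pipelineLogUri": "s3://my-log-bucket/datapipeline/logs/",
      "schedule": { "ref": "DefaultSchedule" }
    },
    {
      "id": "DefaultSchedule",
      "name": "Every 1 day",
      "type": "Schedule",
      "period": "1 days",
      "startAt": "FIRST_ACTIVATION_DATE_TIME"
    },
    {
      "id": "EmrClusterForJob",
      "name": "EmrClusterForJob",
      "type": "EmrCluster",
      "terminateAfter": "2 Hours"
    },
    {
      "id": "EmrActivityStep",
      "name": "EmrActivityStep",
      "type": "EmrActivity",
      "runsOn": { "ref": "EmrClusterForJob" }
    }
  ],
  "parameters": [],
  "values": {}
}
```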

DuploCloud exported template
