Data Pipeline

Introduction

AWS Data Pipeline is a web service that helps you reliably process and move data between different AWS compute and storage services, as well as on-premises data sources, at specified intervals. With AWS Data Pipeline, you can regularly access your data where it’s stored, transform and process it at scale, and efficiently transfer the results to AWS services such as Amazon S3, Amazon RDS, Amazon DynamoDB, and Amazon EMR.

AWS Data Pipeline helps you easily create complex data processing workloads that are fault tolerant, repeatable, and highly available. You don’t have to worry about ensuring resource availability, managing inter-task dependencies, retrying transient failures or timeouts in individual tasks, or creating a failure notification system. AWS Data Pipeline also allows you to move and process data that was previously locked up in on-premises data silos.
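The sections below use the DuploCloud portal, which handles roles, naming, and JIT access for you. For reference, the same service can also be driven directly through the AWS API. The following boto3 sketch is illustrative only; every name, role, and S3 path in it is a placeholder, not part of the DuploCloud workflow:

import boto3

dp = boto3.client("datapipeline", region_name="us-west-2")

# uniqueId makes create_pipeline idempotent across retries.
pipeline_id = dp.create_pipeline(
    name="example-pipeline", uniqueId="example-pipeline-v1"
)["pipelineId"]

# Pipeline objects use the API shape: flat {key, stringValue|refValue} fields.
resp = dp.put_pipeline_definition(
    pipelineId=pipeline_id,
    pipelineObjects=[
        {
            "id": "Default",
            "name": "Default",
            "fields": [
                {"key": "scheduleType", "stringValue": "cron"},
                {"key": "failureAndRerunMode", "stringValue": "CASCADE"},
                {"key": "pipelineLogUri", "stringValue": "s3://example-bucket/logs/"},
                {"key": "role", "stringValue": "ExampleDataPipelineRole"},
                {"key": "resourceRole", "stringValue": "ExampleResourceRole"},
                {"key": "schedule", "refValue": "DefaultSchedule"},
            ],
        },
        {
            "id": "DefaultSchedule",
            "name": "Every 10 hr",
            "fields": [
                {"key": "type", "stringValue": "Schedule"},
                {"key": "period", "stringValue": "10 Hours"},
                {"key": "startDateTime", "stringValue": "2022-02-07T21:29:00"},
            ],
        },
        # A real pipeline also needs at least one activity (e.g. EmrActivity),
        # as in the full templates later on this page.
    ],
)

if resp["errored"]:
    print("Validation errors:", resp["validationErrors"])
else:
    dp.activate_pipeline(pipelineId=pipeline_id)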

Create data pipeline

A data pipeline can be created in any of the following ways:

  • Using the DuploCloud UI

  • Using an exported template from the AWS console

  • Cloning an existing data pipeline

Using DuploCloud UI

In the DuploCloud portal, navigate to Cloud Services → Analytics → Data Pipeline and click the + Add button.

Enter the relevant information in the form, including the pipeline name, description, S3 log folder, cron schedule details, EMR resources, and EMR steps. Click the Generate button.

Review the generated JSON and make any further changes before submitting.
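Because the generated JSON can be hand-edited, a quick local syntax check can catch mistakes before you submit. A minimal Python sketch, assuming you saved the JSON to a local file (the file name is hypothetical):

import json
import sys

# Hypothetical file: the JSON generated by the DuploCloud form, saved locally.
with open("pipeline-definition.json") as f:
    try:
        definition = json.load(f)
    except json.JSONDecodeError as e:
        sys.exit(f"Invalid JSON at line {e.lineno}: {e.msg}")
print("JSON is well-formed")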

Using an exported template from the AWS console

In the DuploCloud portal, navigate to Cloud Services → Analytics → Data Pipeline. Click the + Add button, then click Import Pipeline Template.

In the AWS console, go to Data Pipeline, choose an existing data pipeline, click Edit, and then click Export. Review the generated JSON, make any further changes, and click Submit.
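If you prefer to pull an existing pipeline's definition programmatically rather than exporting it by hand, a boto3 sketch follows; the pipeline ID is a placeholder:

import json
import boto3

dp = boto3.client("datapipeline", region_name="us-west-2")

# "df-0123456789ABC" is a placeholder; use the ID of your existing pipeline.
definition = dp.get_pipeline_definition(pipelineId="df-0123456789ABC", version="latest")

# pipelineObjects, parameterObjects, and parameterValues mirror the
# console-exported template shown later on this page.
print(json.dumps(definition["pipelineObjects"], indent=2))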

Clone existing data pipeline

Start from a previously exported template in the form and make any additional changes (such as schedule frequency or EMR steps). Click Submit to save the data pipeline.

Existing data pipelines can be cloned from either the List view or the Details view.

List view

To get Just-In-Time (JIT) access to the appropriate AWS console, click Data Pipeline, EMR Console, or EMR Jupyter Console. Use the row-level menu actions (Clone, Edit, Export, Delete, and so on) to manage a data pipeline.

Details view

Use the Details view to update a data pipeline, get JIT access to the AWS console, and check errors and warnings.

Example data pipeline template

There are two types of data pipeline templates (a sketch mapping one format to the other follows this list):

  1. A template exported from the AWS console

  2. A template exported from the DuploCloud UI
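The console export nests each object's fields under PipelineObjects → Fields as Key/StringValue/RefValue pairs, while the DuploCloud export flattens them into plain objects, parameters, and values keys. A minimal Python sketch of the mapping, inferred from the two examples below:

def flatten_object(obj):
    # PipelineObjects entry -> flat DuploCloud-style object.
    out = {"id": obj["Id"], "name": obj["Name"]}
    for field in obj.get("Fields", []):
        # References become {"ref": ...}; plain values are copied through.
        out[field["Key"]] = (
            {"ref": field["RefValue"]} if "RefValue" in field else field["StringValue"]
        )
    return out

def flatten_parameter(param):
    # ParameterObjects entry -> flat parameter dict.
    out = {"id": param["Id"]}
    for attr in param.get("Attributes", []):
        out[attr["Key"]] = attr["StringValue"]
    return out

def collect_values(parameter_values):
    # ParameterValues list -> values map; repeated Ids (e.g. two myEmrStep
    # entries) collapse into a list, as in the DuploCloud export below.
    values = {}
    for pv in parameter_values:
        existing = values.get(pv["Id"])
        if existing is None:
            values[pv["Id"]] = pv["StringValue"]
        elif isinstance(existing, list):
            existing.append(pv["StringValue"])
        else:
            values[pv["Id"]] = [existing, pv["StringValue"]]
    return values

def console_to_duplo(template):
    return {
        "objects": [flatten_object(o) for o in template.get("PipelineObjects", [])],
        "parameters": [flatten_parameter(p) for p in template.get("ParameterObjects", [])],
        "values": collect_values(template.get("ParameterValues", [])),
    }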

AWS console exported template

{
    "ParameterValues": [
        {
            "Id": "myEMRReleaseLabel",
            "StringValue": "emr-6.1.0"
        },
        {
            "Id": "myMasterInstanceType",
            "StringValue": "m3.xlarge"
        },
        {
            "Id": "myBootstrapAction",
            "StringValue": "s3://duploservices-pravin-test-del1-128329325849/bootstrap_actions/customactionsstream_back_py_libraries.sh"
        },
        {
            "Id": "myEmrStep",
            "StringValue": "command-runner.jar,spark-submit,--packages,io.delta:delta-core_2.12:0.8.0,--conf,spark.sql.extensions=io.delta.sql.DeltaSparkSessionExtension,--conf,spark.sql.catalog.spark_catalog=org.apache.spark.sql.delta.catalog.DeltaCatalog,--num-executors,2,--executor-cores,2,--executor-memory,2G,--conf,spark.driver.memoryOverhead=4096,--conf,spark.executor.memoryOverhead=4096,--conf,spark.dynamicAllocation.enabled=false,--name,PixelcustomactionsstreamData,--py-files,s3://duploservices-pravin-test-del1-128329325849/libraries/ua_parser.zip,--py-files,s3://duploservices-pravin-test-del1-128329325849/libraries/user_agents.zip,--py-files,s3://duploservices-pravin-test-del1-128329325849/libraries/trmLocation.zip,s3://duploservices-pravin-test-del1-128329325849/duplo_test/transform-pixel-data-to-parquet.py,s3://duploservices-pravin-test-del1-128329325849/duplo_test/sample_raw_data,s3://duploservices-pravin-test-del1-128329325849/duplo_test/output/,true,append,s3://duploservices-pravin-test-del1-128329325849/duplo_test/trmLocation.bin,s3://duploservices-pravin-test-del1-128329325849/duplo_test/customactionsstream_retailer_config.json,hourly,{\"retailer\":\"test\"}"
        },
        {
            "Id": "myEmrStep",
            "StringValue": "command-runner.jar,aws,athena,start-query-execution,--query-string,MSCK REPAIR TABLE backcountry.customactionsstream_hourly_delta,--result-configuration,OutputLocation=s3://duploservices-pravin-test-del1-128329325849/logs/athena/backcountry/customactionsstream_hourly_delta"
        },
        {
            "Id": "myCoreInstanceType",
            "StringValue": "m3.xlarge"
        },
        {
            "Id": "myCoreInstanceCount",
            "StringValue": "1"
        }
    ],
    "ParameterObjects": [
        {
            "Attributes": [
                {
                    "Key": "helpText",
                    "StringValue": "An existing EC2 key pair to SSH into the master node of the EMR cluster as the user \"hadoop\"."
                },
                {
                    "Key": "description",
                    "StringValue": "EC2 key pair"
                },
                {
                    "Key": "optional",
                    "StringValue": "true"
                },
                {
                    "Key": "type",
                    "StringValue": "String"
                }
            ],
            "Id": "myEC2KeyPair"
        },
        {
            "Attributes": [
                {
                    "Key": "helpLink",
                    "StringValue": "https://docs.aws.amazon.com/console/datapipeline/emrsteps"
                },
                {
                    "Key": "watermark",
                    "StringValue": "s3://myBucket/myPath/myStep.jar,firstArg,secondArg"
                },
                {
                    "Key": "helpText",
                    "StringValue": "A step is a unit of work you submit to the cluster. You can specify one or more steps"
                },
                {
                    "Key": "description",
                    "StringValue": "EMR step(s)"
                },
                {
                    "Key": "isArray",
                    "StringValue": "true"
                },
                {
                    "Key": "type",
                    "StringValue": "String"
                }
            ],
            "Id": "myEmrStep"
        },
        {
            "Attributes": [
                {
                    "Key": "helpText",
                    "StringValue": "Task instances run Hadoop tasks."
                },
                {
                    "Key": "description",
                    "StringValue": "Task node instance type"
                },
                {
                    "Key": "optional",
                    "StringValue": "true"
                },
                {
                    "Key": "type",
                    "StringValue": "String"
                }
            ],
            "Id": "myTaskInstanceType"
        },
        {
            "Attributes": [
                {
                    "Key": "default",
                    "StringValue": "m1.medium"
                },
                {
                    "Key": "helpText",
                    "StringValue": "Core instances run Hadoop tasks and store data using the Hadoop Distributed File System (HDFS)."
                },
                {
                    "Key": "description",
                    "StringValue": "Core node instance type"
                },
                {
                    "Key": "type",
                    "StringValue": "String"
                }
            ],
            "Id": "myCoreInstanceType"
        },
        {
            "Attributes": [
                {
                    "Key": "default",
                    "StringValue": "emr-5.13.0"
                },
                {
                    "Key": "helpText",
                    "StringValue": "Determines the base configuration of the instances in your cluster, including the Hadoop version."
                },
                {
                    "Key": "description",
                    "StringValue": "EMR Release Label"
                },
                {
                    "Key": "type",
                    "StringValue": "String"
                }
            ],
            "Id": "myEMRReleaseLabel"
        },
        {
            "Attributes": [
                {
                    "Key": "default",
                    "StringValue": "2"
                },
                {
                    "Key": "description",
                    "StringValue": "Core node instance count"
                },
                {
                    "Key": "type",
                    "StringValue": "Integer"
                }
            ],
            "Id": "myCoreInstanceCount"
        },
        {
            "Attributes": [
                {
                    "Key": "description",
                    "StringValue": "Task node instance count"
                },
                {
                    "Key": "optional",
                    "StringValue": "true"
                },
                {
                    "Key": "type",
                    "StringValue": "Integer"
                }
            ],
            "Id": "myTaskInstanceCount"
        },
        {
            "Attributes": [
                {
                    "Key": "helpLink",
                    "StringValue": "https://docs.aws.amazon.com/console/datapipeline/emr_bootstrap_actions"
                },
                {
                    "Key": "helpText",
                    "StringValue": "Bootstrap actions are scripts that are executed during setup before Hadoop starts on every cluster node."
                },
                {
                    "Key": "description",
                    "StringValue": "Bootstrap action(s)"
                },
                {
                    "Key": "isArray",
                    "StringValue": "true"
                },
                {
                    "Key": "optional",
                    "StringValue": "true"
                },
                {
                    "Key": "type",
                    "StringValue": "String"
                }
            ],
            "Id": "myBootstrapAction"
        },
        {
            "Attributes": [
                {
                    "Key": "default",
                    "StringValue": "m1.medium"
                },
                {
                    "Key": "helpText",
                    "StringValue": "The Master instance assigns Hadoop tasks to core and task nodes, and monitors their status."
                },
                {
                    "Key": "description",
                    "StringValue": "Master node instance type"
                },
                {
                    "Key": "type",
                    "StringValue": "String"
                }
            ],
            "Id": "myMasterInstanceType"
        }
    ],
    "PipelineObjects": [
        {
            "Fields": [
                {
                    "Key": "property",
                    "RefValue": "PropertyId_NA18c"
                },
                {
                    "Key": "classification",
                    "StringValue": "export"
                },
                {
                    "Key": "type",
                    "StringValue": "EmrConfiguration"
                }
            ],
            "Id": "EmrConfigurationId_LFzOl",
            "Name": "DefaultEmrConfiguration2"
        },
        {
            "Fields": [
                {
                    "Key": "type",
                    "StringValue": "Property"
                },
                {
                    "Key": "value",
                    "StringValue": "/usr/bin/python3"
                },
                {
                    "Key": "key",
                    "StringValue": "PYSPARK_PYTHON"
                }
            ],
            "Id": "PropertyId_NA18c",
            "Name": "DefaultProperty1"
        },
        {
            "Fields": [
                {
                    "Key": "taskInstanceType",
                    "StringValue": "#{myTaskInstanceType}"
                },
                {
                    "Key": "subnetId",
                    "StringValue": "subnet-09ce080c2630b9ad7"
                },
                {
                    "Key": "onFail",
                    "RefValue": "ActionId_SUEgm"
                },
                {
                    "Key": "maximumRetries",
                    "StringValue": "1"
                },
                {
                    "Key": "configuration",
                    "RefValue": "EmrConfigurationId_Q9rpL"
                },
                {
                    "Key": "coreInstanceCount",
                    "StringValue": "#{myCoreInstanceCount}"
                },
                {
                    "Key": "masterInstanceType",
                    "StringValue": "#{myMasterInstanceType}"
                },
                {
                    "Key": "releaseLabel",
                    "StringValue": "#{myEMRReleaseLabel}"
                },
                {
                    "Key": "type",
                    "StringValue": "EmrCluster"
                },
                {
                    "Key": "terminateAfter",
                    "StringValue": "3 Hours"
                },
                {
                    "Key": "bootstrapAction",
                    "StringValue": "#{myBootstrapAction}"
                },
                {
                    "Key": "resourceRole",
                    "StringValue": "duploservices-pravin-test"
                },
                {
                    "Key": "taskInstanceCount",
                    "StringValue": "#{myTaskInstanceCount}"
                },
                {
                    "Key": "coreInstanceType",
                    "StringValue": "#{myCoreInstanceType}"
                },
                {
                    "Key": "keyPair",
                    "StringValue": "duploservices-pravin-test"
                },
                {
                    "Key": "region",
                    "StringValue": "us-west-2"
                },
                {
                    "Key": "applications",
                    "StringValue": "spark"
                }
            ],
            "Id": "EmrClusterObj",
            "Name": "EmrClusterObj"
        },
        {
            "Fields": [
                {
                    "Key": "failureAndRerunMode",
                    "StringValue": "CASCADE"
                },
                {
                    "Key": "resourceRole",
                    "StringValue": "duploservices-pravin-test"
                },
                {
                    "Key": "pipelineLogUri",
                    "StringValue": "s3://duploservices-pravin-test-del1-128329325849/logs/data-pipelines/"
                },
                {
                    "Key": "role",
                    "StringValue": "DuploAWSDataPipelineRole"
                },
                {
                    "Key": "scheduleType",
                    "StringValue": "cron"
                }
            ],
            "Id": "Default",
            "Name": "Default"
        },
        {
            "Fields": [
                {
                    "Key": "period",
                    "StringValue": "10 Hour"
                },
                {
                    "Key": "startDateTime",
                    "StringValue": "2022-02-07T21:29:00"
                },
                {
                    "Key": "type",
                    "StringValue": "Schedule"
                }
            ],
            "Id": "ScheduleId_NfOUF",
            "Name": "Every 10 hr"
        },
        {
            "Fields": [
                {
                    "Key": "subject",
                    "StringValue": "Backcountry-customactionsstream-delta-hourly: #{node.@pipelineId} Error: #{node.errorMessage}"
                },
                {
                    "Key": "message",
                    "StringValue": "Backcountry-customactionsstream-delta-hourly failed to run"
                },
                {
                    "Key": "type",
                    "StringValue": "SnsAlarm"
                },
                {
                    "Key": "topicArn",
                    "StringValue": "arn:aws:sns:us-west-2:269378226633:duploservices-pravin-test-del77-128329325849"
                }
            ],
            "Id": "ActionId_SUEgm",
            "Name": "TriggerNotificationOnFail"
        },
        {
            "Fields": [
                {
                    "Key": "schedule",
                    "RefValue": "ScheduleId_NfOUF"
                },
                {
                    "Key": "step",
                    "StringValue": "#{myEmrStep}"
                },
                {
                    "Key": "runsOn",
                    "RefValue": "EmrClusterObj"
                },
                {
                    "Key": "type",
                    "StringValue": "EmrActivity"
                }
            ],
            "Id": "EmrActivityObj",
            "Name": "EmrActivityObj"
        },
        {
            "Fields": [
                {
                    "Key": "configuration",
                    "RefValue": "EmrConfigurationId_LFzOl"
                },
                {
                    "Key": "type",
                    "StringValue": "EmrConfiguration"
                },
                {
                    "Key": "classification",
                    "StringValue": "spark-env"
                }
            ],
            "Id": "EmrConfigurationId_Q9rpL",
            "Name": "DefaultEmrConfiguration1"
        }
    ]
}

DuploCloud exported template

{
  "objects": [
    {
      "subject": "Backcountry-clickstream-delta-hourly: #{node.@pipelineId} Error: #{node.errorMessage}",
      "name": "TriggerNotificationOnFail",
      "id": "ActionId_SUEgm",
      "message": "Backcountry-clickstream-delta-hourly failed to run",
      "type": "SnsAlarm",
      "topicArn": "arn:aws:sns:us-west-2:269378226633:duploservices-pravin-test-del77-128329325849"
    },
    {
      "configuration": {
        "ref": "EmrConfigurationId_LFzOl"
      },
      "name": "DefaultEmrConfiguration1",
      "id": "EmrConfigurationId_Q9rpL",
      "type": "EmrConfiguration",
      "classification": "spark-env"
    },
    {
      "subnetId": "subnet-09ce080c2630b9ad7",
      "taskInstanceType": "#{myTaskInstanceType}",
      "onFail": {
        "ref": "ActionId_SUEgm"
      },
      "maximumRetries": "1",
      "configuration": {
        "ref": "EmrConfigurationId_Q9rpL"
      },
      "coreInstanceCount": "#{myCoreInstanceCount}",
      "masterInstanceType": "#{myMasterInstanceType}",
      "releaseLabel": "#{myEMRReleaseLabel}",
      "type": "EmrCluster",
      "terminateAfter": "3 Hours",
      "bootstrapAction": "#{myBootstrapAction}",
      "resourceRole": "duploservices-pravin-test",
      "taskInstanceCount": "#{myTaskInstanceCount}",
      "name": "EmrClusterObj",
      "coreInstanceType": "#{myCoreInstanceType}",
      "keyPair": "duploservices-pravin-test",
      "id": "EmrClusterObj",
      "region": "us-west-2",
      "applications": "spark"
    },
    {
      "schedule": {
        "ref": "ScheduleId_NfOUF"
      },
      "name": "EmrActivityObj",
      "step": "#{myEmrStep}",
      "runsOn": {
        "ref": "EmrClusterObj"
      },
      "id": "EmrActivityObj",
      "type": "EmrActivity"
    },
    {
      "name": "DefaultEmrConfiguration2",
      "property": {
        "ref": "PropertyId_NA18c"
      },
      "id": "EmrConfigurationId_LFzOl",
      "classification": "export",
      "type": "EmrConfiguration"
    },
    {
      "period": "10 Hour",
      "startDateTime": "2022-02-07T21:18:20.737+00:00",
      "name": "Every 10 hr",
      "id": "ScheduleId_NfOUF",
      "type": "Schedule"
    },
    {
      "failureAndRerunMode": "CASCADE",
      "resourceRole": "duploservices-pravin-test",
      "role": "DuploAWSDataPipelineRole",
      "pipelineLogUri": "s3://duploservices-pravin-test-del1-128329325849/logs/data-pipelines/",
      "scheduleType": "cron",
      "name": "Default",
      "id": "Default"
    },
    {
      "name": "DefaultProperty1",
      "id": "PropertyId_NA18c",
      "type": "Property",
      "value": "/usr/bin/python3",
      "key": "PYSPARK_PYTHON"
    }
  ],
  "parameters": [
    {
      "helpText": "An existing EC2 key pair to SSH into the master node of the EMR cluster as the user \"hadoop\".",
      "description": "EC2 key pair",
      "optional": "true",
      "id": "myEC2KeyPair",
      "type": "String"
    },
    {
      "helpLink": "https://docs.aws.amazon.com/console/datapipeline/emrsteps",
      "watermark": "s3://myBucket/myPath/myStep.jar,firstArg,secondArg",
      "helpText": "A step is a unit of work you submit to the cluster. You can specify one or more steps",
      "description": "EMR step(s)",
      "isArray": "true",
      "id": "myEmrStep",
      "type": "String"
    },
    {
      "helpText": "Task instances run Hadoop tasks.",
      "description": "Task node instance type",
      "optional": "true",
      "id": "myTaskInstanceType",
      "type": "String"
    },
    {
      "default": "m1.medium",
      "helpText": "Core instances run Hadoop tasks and store data using the Hadoop Distributed File System (HDFS).",
      "description": "Core node instance type",
      "id": "myCoreInstanceType",
      "type": "String"
    },
    {
      "default": "emr-5.13.0",
      "helpText": "Determines the base configuration of the instances in your cluster, including the Hadoop version.",
      "description": "EMR Release Label",
      "id": "myEMRReleaseLabel",
      "type": "String"
    },
    {
      "default": "2",
      "description": "Core node instance count",
      "id": "myCoreInstanceCount",
      "type": "Integer"
    },
    {
      "description": "Task node instance count",
      "optional": "true",
      "id": "myTaskInstanceCount",
      "type": "Integer"
    },
    {
      "helpLink": "https://docs.aws.amazon.com/console/datapipeline/emr_bootstrap_actions",
      "helpText": "Bootstrap actions are scripts that are executed during setup before Hadoop starts on every cluster node.",
      "description": "Bootstrap action(s)",
      "isArray": "true",
      "optional": "true",
      "id": "myBootstrapAction",
      "type": "String"
    },
    {
      "default": "m1.medium",
      "helpText": "The Master instance assigns Hadoop tasks to core and task nodes, and monitors their status.",
      "description": "Master node instance type",
      "id": "myMasterInstanceType",
      "type": "String"
    }
  ],
  "values": {
    "myMasterInstanceType": "m3.xlarge",
    "myEMRReleaseLabel": "emr-6.1.0",
    "myBootstrapAction": "s3://duploservices-pravin-test-del1-128329325849/bootstrap_actions/clickstream_backcountry_py_libraries.sh",
    "myEmrStep": [
      "command-runner.jar,spark-submit,--packages,io.delta:delta-core_2.12:0.8.0,--conf,spark.sql.extensions=io.delta.sql.DeltaSparkSessionExtension,--conf,spark.sql.catalog.spark_catalog=org.apache.spark.sql.delta.catalog.DeltaCatalog,--num-executors,2,--executor-cores,2,--executor-memory,2G,--conf,spark.driver.memoryOverhead=4096,--conf,spark.executor.memoryOverhead=4096,--conf,spark.dynamicAllocation.enabled=false,--name,PixelClickstreamData,--py-files,s3://duploservices-pravin-test-del1-128329325849/libraries/ua_parser.zip,--py-files,s3://duploservices-pravin-test-del1-128329325849/libraries/user_agents.zip,--py-files,s3://duploservices-pravin-test-del1-128329325849/libraries/IP2Location.zip,s3://duploservices-pravin-test-del1-128329325849/duplo_test/transform-pixel-data-to-parquet.py,s3://duploservices-pravin-test-del1-128329325849/duplo_test/sample_raw_data,s3://duploservices-pravin-test-del1-128329325849/duplo_test/output/,true,append,s3://duploservices-pravin-test-del1-128329325849/duplo_test/IP2LOCATION-LITE-DB5.IPV6.BIN,s3://duploservices-pravin-test-del1-128329325849/duplo_test/clickstream_retailer_config.json,hourly,{\"retailer\":\"test\"}",
      "command-runner.jar,aws,athena,start-query-execution,--query-string,MSCK REPAIR TABLE backcountry.clickstream_hourly_delta,--result-configuration,OutputLocation=s3://duploservices-pravin-test-del1-128329325849/logs/athena/backcountry/clickstream_hourly_delta"
    ],
    "myCoreInstanceCount": "1",
    "myCoreInstanceType": "m3.xlarge"
  }
}
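The flat objects/parameters/values layout above closely resembles AWS Data Pipeline's standard pipeline definition file syntax. As a usage example, the console_to_duplo sketch from earlier on this page can turn a console export into this layout (file names are hypothetical):

import json

# Hypothetical file names: a console export as input, DuploCloud-style output.
with open("console-export.json") as f:
    duplo_template = console_to_duplo(json.load(f))

with open("duplo-template.json", "w") as f:
    json.dump(duplo_template, f, indent=2)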