LogoLogo
HomePlatformAsk DuploCloudPricing
  • Overview
  • Product Updates
  • Workshops
    • DuploCloud 101 for AWS
      • Create Your Infrastructure and Application
        • 1. Log in to the DuploCloud Portal
        • 2. Create a DuploCloud Infrastructure
        • 3. Create a DuploCloud Tenant
        • 4. Create an EKS Worker Node
        • 5. Deploy an Application
        • 6. Create a Load Balancer
        • 7. Deploy an S3 Bucket
        • 8. Deploy a Database
        • 9. Create an Alarm
      • Daily Operations using DuploCloud
        • 1. Host, Container, and Kubectl Shell
        • 2. Logging
        • 3. Metrics
        • 4. Billing and Cost Management
        • 5. Audit Logs
        • 6 - Tenant and Admin Just-In-Time (JIT) AWS Access
        • 7. CI/CD
        • 8. Security Hub and Dashboard
        • 9. Terraform Mode of Operations
      • Post-workshop Reference Guide
        • Post-Workshop Testing and Documentation Links
        • Connect With Us
        • DuploCloud Whitepapers
        • DuploCloud Terraform Provider
        • DuploCloud AWS Demo Video
  • Getting Started with DuploCloud
    • What DuploCloud Does
    • DuploCloud Onboarding
    • Application Focused Interface: DuploCloud Architecture
      • DuploCloud Tenancy Models
      • DuploCloud Common Components
        • Infrastructure
        • Plan
        • Tenant
        • Hosts
        • Services
        • Diagnostics
      • Management Portal Scope
    • GRC Tools and DuploCloud
    • Public Cloud Tutorials
    • Getting Help with DuploCloud
  • Container Orchestrators
    • Terminologies in Container Orchestration
  • DuploCloud Prerequisites
    • DNS Configuration
  • AWS User Guide
    • Prerequisites
      • Route 53 Hosted Zone
      • ACM Certificate
      • Shell Access for Containers
      • VPN Setup
      • Connect to the VPN
    • AWS Quick Start
      • Step 1: Create Infrastructure and Plan
      • Step 2: Create a Tenant
      • Step 3: Create an RDS Database (Optional)
      • Creating an EKS Service
        • Step 4: Create a Host
        • Step 5: Create a Service
        • Step 6: Create a Load Balancer
        • Step 7: Enable Additional Load Balancer Options (Optional)
        • Step 8: Create a Custom DNS Name (Optional)
        • Step 9: Test the Application
      • Creating an ECS Service
        • Step 4: Create a Task Definition for an Application
        • Step 5: Create the ECS Service and Load Balancer
        • Step 6: Test the Application
      • Creating a Native Docker Service
        • Step 4: Create an EC2 Host
        • Step 5: Create a Service
        • Step 6: Create a Load Balancer
        • Step 7: Test the Application
    • AWS Use Cases
      • Creating an Infrastructure and Plan for AWS
        • EKS Setup
          • Enable EKS endpoints
          • Enable EKS logs
          • Enable Cluster Autoscaler
        • ECS Setup
          • Enable ECS logging
        • Add VPC endpoints
        • Security Group rules
        • Upgrading the EKS version
      • Creating a Tenant (Environment)
        • Setting Tenant session duration
        • Setting Tenant expiration
        • Tenant Config settings
      • Hosts (VMs)
        • Adding Hosts
        • Connect EC2 instance
        • Adding Shared Hosts
        • Adding Dedicated Hosts
        • Autoscaling Hosts
          • Autoscaling Groups (ASG)
            • Launch Templates
            • Instance Refresh for ASG
            • Scale to or from Zero
            • Spot Instances for AWS
          • ECS Autoscaling
          • Autoscaling in Kubernetes
        • Configure Auto-reboot
        • Create Amazon Machine Image (AMI)
        • Hibernate an EC2 Host
        • Snapshots
        • Taints for EKS Nodes
        • Disable Source Destination Check
      • Auditing
      • Logs
        • Enable Default-Tenant logging
        • Enable Non-Default Tenant logging
        • Configure Logging per Tenant
        • Display logs
        • Create custom logs
      • Diagnostics and Metrics
        • Metrics Setup
        • Metrics Dashboard
        • Kubernetes Administrator dashboard
      • Faults and Alerts
        • Alert notifications
        • Automatic alert creation
        • Automatic fault healing
        • SNS Topic Alerts
        • System Settings Flags
      • AWS Console link
      • Just-in-Time (JIT) Access
      • Billing and Cost management
        • Enable billing data
        • View billing data
        • Apply cost allocation tags
        • DuploCloud License Usage
        • Configure Billing Alerts
      • Resource Quotas
      • Big Data and ETL
      • Custom Resource tags
    • AWS Services
      • Containers and Services
        • EKS Containers and Services
          • Allocation Tagging
        • ECS Containers, Task Definitions and Services
        • Passing Configs and Secrets
        • Container Rollback
        • Docker Registry credentials
      • Load Balancers
        • Target Groups
        • EKS Load Balancers
        • ECS Services and Load Balancers
        • Native Docker Load Balancers
      • Storage
        • Storage Class and PVCs
        • GP3 Storage Class
      • API Gateway
      • Batch
      • CloudFront
      • Databases
        • AWS ElastiCache
        • AWS DynamoDB database
        • AWS Timestream database
        • RDS database
          • IAM authentication
          • Backup and restore
          • Sharing encrypted database
          • Manage RDS Snapshots
          • Add and manage RDS read replicas
            • Add Aurora RDS replicas
          • Add monitoring interval
          • Enable or disable RDS logging
          • Restrict RDS instance size
          • Add parameters in Parameter Groups
          • Manage Performance Insights
      • Data Pipeline
      • Elastic Container Registry (ECR)
        • Sharing ECR Repos
      • Elastic File System (EFS)
        • Mount an EFS in an EC2 instance
      • EMR Serverless
      • EventBridge
      • IoT (Internet of Things)
      • Kafka Cluster
      • Kinesis Stream
      • Lambda Functions
        • Configure Lambda with Container Images
        • Lambda Layers
      • Managed Airflow
      • NAT Gateway for HA
      • OpenSearch
      • Probes and Health Check
      • S3 Bucket
      • SNS Topic
      • SQS Queue
      • Virtual Private Cloud (VPC) Peering
      • Web App Firewall (WAF)
    • AWS FAQ
    • AWS Systems Settings
      • AWS Infrastructure Settings
      • AWS Tenant Settings
    • AWS Security Configuration Settings
      • Tenant Security settings
      • Infrastructure Security settings
      • System Security settings
      • AWS Account Security settings
      • Vanta Compliance Controls
  • GCP User Guide
    • Container deployments
      • Container orchestration features
      • Key DuploCloud concepts
    • Prerequisites
      • Docker Registry
      • Service Account Setup
      • Cloud DNS Zone
      • Certificates for Load Balancer and Ingress
      • Initial Infrastructure Setup
      • Tools Tenant
        • Enable Kubectl Shell
      • Docker
        • Docker Registry Credentials (Optional)
        • Shell Access for Docker (Optional)
      • VPN
        • VPN Setup
        • Connect to the VPN
      • Managed SSL Certificates with Certificate Manager (Optional)
    • GCP Quick Start
      • Step 1: Create Infrastructure and Plan
      • Step 2: Create a Tenant
      • Create a Service with GKE Autopilot
        • Step 3: Create a Service
        • Step 4: Create a Load Balancer
        • Step 5: Test the Application
      • Create a Service with GKE Standard
        • Step 3: Create a Node Pool
        • Step 4: Create a Service
        • Step 5: Create a Load Balancer
        • Step 6: Test the Application
    • GCP Use Cases
      • Creating an Infrastructure and Plan for GCP
        • Creating a GKE Autopilot Cluster
        • Creating GKE Standard Cluster
        • Kubectl token and config
        • Upgrading the GKE version
      • Creating a Tenant (Environment)
        • Tenant expiry
        • Tenant Config settings
      • Hosts (VMs)
      • Cost management for billing
        • Export Billing to BigQuery
        • Manage cross project billing in GCP
    • GCP Services
      • Containers and Services
      • GKE Containers and Services
        • Allocation Tagging
        • Docker Registry credentials
        • Container Rollback
        • Passing Config and Secrets
      • GCP Databases
        • Cloud SQL
        • Firestore Database
        • Managed Redis
      • Load Balancers
      • Cloud Armour
      • Cloud Credentials
      • Cloud Functions
      • Cloud Run Service
      • Cloud Scheduler
      • Cloud Storage
      • Node Pools
      • Pub/Sub
      • Virtual Private Cloud (VPC) Peering
      • GCP Security Command Center
    • GCP FAQs
    • GCP Systems Settings
      • GCP Infrastructure Settings
      • GCP Tenant Settings
    • GCP Security Settings
      • Infrastructure Security settings
      • GCP Account Security settings
  • Azure User Guide
    • Container deployments
      • Container orchestration features
      • Key DuploCloud concepts
    • Prerequisites
      • Program DNS Entries
      • Import SSL certificates
      • Provision the VPN
      • Connect to the VPN
      • Managed Identity Setup
    • Azure Quick Start
      • Step 1: Create Infrastructure and Plan
      • Step 2: Create a Tenant
      • Step 3: Create Agent Pools
      • Step 4: Create a Service
      • Step 5: Create a Load Balancer
      • Step 6: Test the Application
    • Azure Use Cases
      • Creating an Infrastructure and Plan for Azure
        • AKS initial setup
        • Kubectl token and config
        • Encrypted storage account
        • Upgrading the AKS version
      • Creating a Tenant (Environment)
        • Tenant expiry
        • Tenant Config settings
      • Hosts (VMs)
        • Autoscaling for Hosts
          • Autoscaling Azure Agent Pools
        • Shared Hosts
        • Availability Sets
        • Snapshots
      • Logs
      • Metrics
      • Faults and alerts
        • Alert notifications
      • Azure Portal link
      • Billing and Cost management
        • Enable billing data
        • Viewing billing data
    • Azure Services
      • Containers and Services
        • AKS Containers and Services
          • Allocation Tagging
        • Docker Registry Credentials
        • Container Rollback
        • Passing Configs and Secrets
      • Agent Pools
        • Spot Instances for AKS Agent Pools
      • Azure Container Registry (ACR)
      • Databases
        • MSSQL Server database
        • PostgreSQL database
        • PostgreSQL Flexible Server
        • MySQL Server database
          • Azure Managed SQL Instances
        • MySQL Flexible Server
        • Redis database
      • Docker Web Application
      • Databricks
      • Data Factory
      • Infra Secrets
      • Key Vault
      • Load Balancers
      • Public IP Address Prefix
      • Serverless
        • App Service Plans and Web Apps
        • Function Apps
      • Service Bus
      • Storage Account
      • Subscription
      • VM Scale Sets
    • Azure FAQ
    • Azure Systems Settings
      • Azure Infrastructure Settings
      • Azure Tenant Settings
    • Azure Security Settings
      • Tenant Security Settings
  • Kubernetes User Guide
    • Kubernetes Quick Start
    • Kubectl
      • Local Kubectl Setup
        • Kubectl Shell
      • Kubectl Shell
        • Enable Kubectl Shell for GKE
        • Enable Kubectl Shell for AKS
      • Kubectl Tokens and Access Management
      • Read-only Access in Kubernetes
      • Mirantis Lens
    • Configs and Secrets
      • Setting Kubernetes Secrets
      • Creating a Kubernetes ConfigMap
      • Setting Environment Variables (EVs) from a ConfigMap or Secret
      • Mounting ConfigMaps and Secrets as files
      • Using Kubernetes Secrets with Azure Storage connection data
      • Creating the SecretProviderClass Custom Resource to mount secrets
      • Managing Secrets and ConfigMaps access for readonly users (AWS and GCP)
    • Jobs
    • CronJobs
    • DaemonSet
    • Helm Charts
    • Ingress Loadbalancer
      • EKS Ingress
      • GKE Ingress
      • AKS Shared Application Gateway
        • Using an Azure Application Gateway SSL policy with Ingress
    • InitContainers and Sidecar Containers
    • HPA
    • Pod Toleration
    • Kubernetes Lifecycle Hooks
    • Kubernetes StorageClass and PVC
      • Native Azure Storage Classes
    • Import an External Kubernetes Cluster
    • Managed Service Accounts (RBAC)
    • Create a Diagnostics Application Service
  • Security and Compliance
    • Control Groups
    • Isolation and Firewall
      • Cloud Account
      • Network Segmentation
      • IAM
      • Security Groups
      • VPN
      • WAF
    • Access Management
      • Authentication Methods
      • Cloud Console, API and CLI
      • VM SSH
      • Container Shell
      • Kubernetes Access
      • Permission Sets
    • Encryption
      • At Rest Encryption
      • In Transit encryption
    • Tags and Label
    • Security Monitoring
      • Agent Management
      • SIEM
      • Vulnerabilities
      • Hardening Standards (CIS)
      • File Integrity Monitoring
      • Access Monitoring
      • HIDS
      • NIDS
      • Inventory Monitoring
        • Inventory Reports
      • Antivirus
      • VAPT (Pen Test)
      • AWS Security HUB
      • Alerting and Event Management
    • Compliance Frameworks
    • Security and Compliance Workflow
  • Terraform User Guide
    • DuploCloud Terraform Provider
    • DuploCloud Terraform Exporter
      • Install Terraform Exporter
      • Generate Terraform
      • Using Generated Code
      • Troubleshooting Guide
    • Terraform FAQ
  • Automation and Tools
    • DuploCtl CLI
    • Supported 3rd Party Tools
    • Automation Stacks
      • Clone from a Tenant
      • Create a deploy template
      • Deploy from a template
      • Customize deploy templates
  • CI/CD Overview
    • Service Accounts
    • GitHub Actions
      • Configure GitHub
      • Build a Docker image
      • Update a Kubernetes Service
      • Update an ECS Service
      • Update a Lambda function
      • Update CloudFront
      • Upload to S3 bucket
      • Execute Terraform
    • CircleCI
      • Configure CircleCI
      • Build and Push Docker Image
      • Update Service
    • GitLab CI/CD
      • Configure Gitlab
      • Build a Docker image
      • Update a service
    • Bitbucket Pipelines
      • Configure Bitbucket
      • Build a Docker image
      • Update the Service with Deploy Pipe
    • Azure Pipelines
      • Configure Azure DevOps
      • Build a Docker image from Azure DevOps
      • Update a Service
      • Troubleshooting
    • Katkit
      • Environments
      • Link repository
      • Phases
      • Katkit config
      • Advanced functions
    • ArgoCD
  • User Administration
    • User Logins
    • User access to DuploCloud
    • User Email Notifications
    • API tokens
    • Session Timeout
    • Tenant Access for Users
      • Add Tenant access over a VPN
      • Read-only access to a Tenant
      • Cross-tenant Access
      • Deleting a Tenant
    • VPN access for users
    • Database access for users
    • SSO Configuration
      • Azure SSO Configuration
      • Okta Identity Management
    • Login Banner/Button Customization
  • AI Suite
    • AI HelpDesk
      • Ticket
      • Out of the Box Agents
    • AI Studio
      • Agent
      • Tools
      • VectorDB
      • Developers
    • FAQ
  • Observability
    • Standard Observability Suite
      • Setup
        • Logging Setup
          • Custom Kibana Logging URL
        • Metrics Setup
        • Auditing
          • Custom Kibana Audit URL
      • Logs
      • Metrics
    • Advanced Observability Suite
      • Architecture
      • Dashboards
        • Administrator Dashboard
        • Tenant Dashboard
        • Customizing Dashboards
      • Logging with Loki
      • Metrics with Mimir
      • Tracing with Tempo
      • Profiles with Pyroscope
      • Alerts with Alert Manager
      • Service Level Objectives (SLOs)
      • OTEL Stack Resource Requirements
      • Application Instrumentation
      • Custom Metrics
      • Terraform
    • Faults and Alerts
      • Alert notifications
      • Automatic alert creation
    • Auditing
    • Web App Firewall (WAF)
  • Runbooks
    • Configuring Egress and Ingress for AKS Ingress Controllers in Private Networks
    • Configuring Retool to SSH into a DuploCloud Host with a Static IP Address for Secure Remote Database
  • FAQs
  • Extras
    • FluxCD
    • Deploying Helm Charts
    • Setting up SCPs (Service Control Policies) for DuploCloud
    • BYOH
    • Delegate Subdomains
    • Video Transcripts
      • DuploCloud AWS Product Demo
      • DuploCloud Azure Product Demo
      • DuploCloud GCP Product Demo
      • DevOps Deep Dive - Abstracting Cloud Complexity
      • DuploCloud Uses Infrastructure-as-Code to Stitch Together DevOps Lifecycle
Powered by GitBook
LogoLogo

Platform

  • Overview
  • Demo Videos
  • Pricing Guide
  • Documentaiton

Solutions

  • DevOps Automation
  • Compliance
  • Platform Engineering
  • Edge Deployments

Resources

  • Blog & News
  • Customer Stories
  • Webinars
  • Privacy Policy

Company

  • Careers
  • Press
  • Events
  • Contact

© DuploCloud, Inc. All rights reserved. DuploCloud trademarks used herein are registered trademarks of DuploCloud and affiliates

On this page
  • Introduction
  • Creating a Spark cluster with a Jupyter notebook
  • Create a host for Spark master
  • Create a host for Spark worker
  • Create a host for Jupyter
  • Create a Spark master Docker service
  • Create a Jupyter service
  • Create Spark workers
  • Create a Docker services shell
  • Open Jupyter Docker shell
  • Get Jupyter URL
  • Open Jupyter
  • Creating a Jupyter notebook with an ETL use-case
  • Scrape data from internet
  • Download data
  • Save to S3
  • Connect to Spark cluster
  • Load data
  • Schema
  • Process
  • SQL select
  • SQL Joins
  • Export reports

Was this helpful?

Edit on GitHub
Export as PDF
  1. AWS User Guide
  2. AWS Use Cases

Big Data and ETL

Introduction

  • Use case:

    • Collection of data from using various methods/sources

      • Web scraping: Selenium using headless chrome/firefox.

      • Web crawling: status website sing crawling

      • API to Data collection: It could be REST or GraphQL API

      • Private internal customer data collected over various transactions

      • Private external customer data collected over secured SFTP

      • The data purchased from 3rd party

      • The data from various social networks

    • Correlate data from various sources

      • Clean up and Process data and apply various statistical methods, create

      • Correlate terabytes of data from various sources and make sense from the data.

      • Detect anomalies, summarize, bucketize, and various aggregations

      • Attach meta-data to enrich data.

      • Create data for NLP and ML models for predictions of future events.

    • AI/ML pipelines and life-cycle management

      • Make data available to data science team

      • Train models and do continuous improvement trials, reinforcement learning.

      • Create anomalies, bucketize data, summarize and do various aggregations.

      • Train NLP and ML models for predictions of future events based on history

      • Create history for models/hyper parameters and data at various stages.

  • Deploying Apache Sparkâ„¢ cluster

Creating a Spark cluster with a Jupyter notebook

In this tutorial we will create a Spark cluster with a Jupyter notebook. A typical use case is ETL jobs, for example reading parquet files from S3, processing and pushing reports to databases. The aim is to process GBs of data in faster and cost-effective way.

The high-level steps are:

  1. Create 3 VMs one for each Spark master, Spark worker and Jupyter notebook.

  2. Deploy Docker images for each of these on these VMs.

Create a host for Spark master

From the DuploCloud portal, navigate to Cloud Services -> Hosts -> EC2. Click +Add and check the Advanced Options box. Change the value of instance type to ‘m4.xlarge‘ and add an allocation tag ‘sparkmaster‘.

Create a host for Spark worker

Create another host for the worker. Change the value of instance type to ‘m4.4xlarge‘ and add an allocation tag ‘sparkworker‘. Click Submit. The number of workers depends on how much load you want to process. You should add one host for each worker. They should all have the same allocation tag ‘sparkworker‘. You can add and remove workers and scale up or down the Spark worker service as many times as you want. We will see in the following steps.

Create a host for Jupyter

Create one more host for Jupyter notebook. Choose the value of instance type to ‘m4.4xlarge‘ and add the allocation tag as ‘jupyter‘.

Create a Spark master Docker service

Navigate to Docker -> Services and click Add. In the Service Name field, enter ‘sparkmaster‘ and in the Docker Image field, enter ‘duplocloud/anyservice:spark_v6', add the allocation tag ‘sparkmaster‘. From the Docker Networks list box, select Host Network. By setting this in Docker Host config you are making the container networking the same as the VM i.e., container IP is same as VM.

Create a Jupyter service

First we need the IP address of Spark master. Click on Spark master service and on the right expand the container details and copy the host IP. Create another service, under name choose ‘jupyter‘, image ‘duplocloud/anyservice:spark_notebook_pyspark_scala_v4‘, add the allocation jupyter and select Host network for Docker Host Config, Add volume mapping “/home/ubuntu/jupyter:/home/jovyan/work“, Also provide the environment variables

{"SPARK_OPTS":" --master spark://<>:7077 --driver-memory 20g --executor-memory 15g --executor-cores 4 "}

Replace the brackets <> with the IP you just got. See figure 5.

Create Spark workers

Create another service named ‘sparkworker1`, image ‘duplocloud/anyservice:spark_v7‘, add the allocation tag ‘sparkworker‘ and select Host Network for Docker Network. Also provide the environment variables

{"node": "worker", "masterip": "<>"}

Replace the brackets <> with the IP you just got. See Figure 5.

Depending on how many worker hosts you have created, use the same number under replicas and that is the way you can scale up and down. At any time, you can add new hosts, set the allocation tag ‘sparkworker‘ and then under services, edit the sparkworker service and update the replicas.

Create a Docker services shell

Add or update shell access by clicking on >_ icon. This gives you easy access into the container shell. You will need to wait for 5 minutes for the shell to be ready. Make sure you are connected to VPN if you choose to launch the shell as internal only

Open Jupyter Docker shell

Select Jupyter service and expand the container. Copy the Host IP and then click on >_ icon.

Get Jupyter URL

Once you are inside the shell. Run command ‘> jupyter notebook list‘ to get the URL along with auth token. Replace the IP with Jupyter IP you copied previously. See Figure 5.

Open Jupyter

In your browser, navigate to the Jupyter URL and you should be able to see the UI.

Now you can use Jupyter to connect to data sources and destinations and do ETL jobs. Sources and destinations can include various SQL and NoSQL databases, S3 and various reporting tools including big data and GPU-based Deep learning.

Creating a Jupyter notebook with an ETL use-case

In this following we will create a Jupyter notebook and show some basic web scraping, using Spark for preprocessing, exporting into schema, do ETLs, join multiple dataframes (parquets), and export reports into MySQL.

Scrape data from internet

Connect to a website and parse html (using jsoup)

Download data

Extract the downloaded zip. This particular file is 8 GB in size and has 9 million records in csv

Save to S3

Upload the data to AWS S3

Connect to Spark cluster

Also Configure session with required settings to read and write from AWS S3

Load data

Load data in Spark cluster

Schema

Define the Spark schema

Process

Do data processing

SQL select

Setup Spark SQL

SQL Joins

Spark SQL joins 20 GB of data from multiple sources

Export reports

Export reports to RDS for UI consumption Generate various charts and graphs

PreviousResource QuotasNextCustom Resource tags

Last updated 2 months ago

Was this helpful?

Figure 5
Figure 6
Figure 7
Figure 8
Figure 9
Figure 10
Figure 11
Figure 12
Figure 13
Figure 14
Figure 14
Figure 15
Figure 16
Figure 17
Figure 18
Figure 19
The EC2 Add Host page
The EC2 Add Host page
The EC2 Add Host page