Big-data and ETL

Introduction

  • Use case:
    • Collection of data using various methods/sources:
      • Web scraping: Selenium with headless Chrome/Firefox.
      • Web crawling: crawling static websites.
      • API-based data collection: REST or GraphQL APIs.
      • Private internal customer data collected over various transactions.
      • Private external customer data collected over secure SFTP.
      • Data purchased from third parties.
      • Data from various social networks.
    • Correlating data from various sources:
      • Clean up and process data, applying various statistical methods.
      • Correlate terabytes of data from various sources and make sense of it.
      • Detect anomalies, summarize, bucketize, and compute various aggregations.
      • Attach metadata to enrich the data.
      • Create data sets for NLP and ML models that predict future events.
    • AI/ML pipelines and life-cycle management:
      • Make data available to the data science team.
      • Train models and run continuous-improvement trials, including reinforcement learning.
      • Detect anomalies, bucketize data, summarize, and compute various aggregations.
      • Train NLP and ML models to predict future events based on history.
      • Keep a history of models, hyperparameters, and data at various stages.
  • Deploying an Apache Spark™ cluster

Creating a Spark cluster with a Jupyter notebook

In this tutorial we will create a Spark cluster with a Jupyter notebook. A typical use case is ETL jobs, for example reading Parquet files from S3, processing them, and pushing reports to databases. The aim is to process GBs of data in a faster and more cost-effective way.
The high-level steps are:
  1. Create 3 VMs, one each for the Spark master, a Spark worker, and the Jupyter notebook.
  2. Deploy Docker images for each of these on the VMs.

Create a host for Spark master

Navigate to DevOps > Hosts > EC2 > +Add to create hosts, and click Show Advanced. Change the instance type to ‘m4.xlarge‘ and add an allocation tag ‘sparkmaster‘.
Figure 1

Create a host for Spark worker

Create another host for the worker. Change the instance type to ‘m4.4xlarge‘, add an allocation tag ‘sparkworker‘, and click Submit. The number of workers depends on how much load you want to process: add one host for each worker, all with the same allocation tag ‘sparkworker‘. You can add and remove workers and scale the Spark worker service up or down as many times as you want, as we will see in the following steps.
Figure 2

Create a host for Jupyter

Create one more host for the Jupyter notebook. Set the instance type to ‘m4.4xlarge‘ and add the allocation tag ‘jupyter‘.
Figure 3

Create a Spark master Docker service

Navigate to Containers > EKS/Native > Services and create a new service. Under name choose ‘sparkmaster‘, image ‘duplocloud/anyservice:spark_v6‘, and add the allocation tag ‘sparkmaster‘. In the Docker host config, select Host network. This makes the container's networking the same as the VM's, i.e., the container IP is the same as the VM IP.
Figure 4

Create a Jupyter service

First we need the IP address of the Spark master. Click on the Spark master service, expand the container details on the right, and copy the host IP. Create another service: under name choose ‘jupyter‘, image ‘duplocloud/anyservice:spark_notebook_pyspark_scala_v4‘, add the allocation tag ‘jupyter‘, and select Host network for Docker Host Config. Add the volume mapping “/home/ubuntu/jupyter:/home/jovyan/work“ and provide the environment variables
{"SPARK_OPTS":" --master spark://<>:7077 --driver-memory 20g --executor-memory 15g --executor-cores 4 "}
Replace the brackets <> with the IP you just copied. See Figure 5.
Figure 5
Figure 6

Create Spark workers

Create another service named ‘sparkworker1‘, image ‘duplocloud/anyservice:spark_v7‘, add the allocation tag ‘sparkworker‘, and select Host network for Docker Host Config. Also provide the environment variables
{"node": "worker", "masterip": "<>"}
Replace the brackets <> with the Spark master IP you copied earlier. See Figure 5.
Set the number of replicas to match the number of worker hosts you created; this is how you scale up and down. At any time you can add new hosts with the allocation tag ‘sparkworker‘ and then, under Services, edit the sparkworker service and update the replica count.
Figure 7

Create a Docker services shell

Add or update shell access by clicking the >_ icon. This gives you easy access to the container shell. Wait about 5 minutes for the shell to be ready. Make sure you are connected to the VPN if you chose to launch the shell as internal only.

Open Jupyter Docker shell

Select the Jupyter service and expand the container. Copy the host IP and then click the >_ icon.
Figure 8

Get Jupyter URL

Once you are inside the shell, run the command ‘jupyter notebook list‘ to get the URL along with the auth token. Replace the IP in the URL with the Jupyter host IP you copied previously. See Figure 5.
Figure 9

Open Jupyter

In your browser, navigate to the Jupyter URL and you should be able to see the UI.
Figure 10
Now you can use Jupyter to connect to data sources and destinations and run ETL jobs. Sources and destinations can include various SQL and NoSQL databases, S3, and various reporting tools, including big-data and GPU-based deep-learning systems.

Creating a Jupyter notebook with an ETL use-case

In the following we will create a Jupyter notebook and show some basic web scraping, preprocessing with Spark, exporting into a schema, running ETL jobs, joining multiple dataframes (Parquet files), and exporting reports into MySQL.

Scrape data from internet

Connect to a website and parse the HTML (using jsoup)
Figure 11
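The notebook's scraping step uses jsoup, a Java/Scala HTML parser. As an illustration of the same idea in Python, the sketch below collects links with the stdlib `html.parser`; the HTML snippet and URLs are hypothetical, and in a real notebook you would fetch the page first (e.g. with `urllib`).

```python
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Collect href attributes from <a> tags, similar to jsoup's select("a[href]")."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

# In the notebook you would fetch the page first, e.g.:
#   import urllib.request
#   html = urllib.request.urlopen("https://example.com/downloads").read().decode()
# Here we parse a static snippet so the sketch is self-contained.
html = '<html><body><a href="/data/records.zip">records.zip</a><a href="/about">about</a></body></html>'

parser = LinkCollector()
parser.feed(html)
zips = [h for h in parser.links if h.endswith(".zip")]  # the download links we care about
print(zips)  # ['/data/records.zip']
```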

Download data

Extract the downloaded zip. This particular file is 8 GB in size and contains 9 million records in CSV format.
Figure 12
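The extraction step might look like the sketch below, using Python's stdlib `zipfile`. To keep it runnable without the real 8 GB download, a tiny archive is built in memory first; the file name is hypothetical.

```python
import io
import tempfile
import zipfile

def extract_zip(data: bytes, dest: str) -> list:
    """Extract every member of the archive into dest and return the member names."""
    with zipfile.ZipFile(io.BytesIO(data)) as zf:
        # For the real 8 GB archive, make sure dest has enough free space.
        zf.extractall(dest)
        return zf.namelist()

# Build a tiny archive in memory so the sketch runs without the real download.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("records.csv", "id,value\n1,a\n2,b\n")

with tempfile.TemporaryDirectory() as dest:
    names = extract_zip(buf.getvalue(), dest)
print(names)  # ['records.csv']
```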

Save to S3

Upload the data to AWS S3
Figure 13
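One common way to do this upload from Python is boto3 (an assumption; the notebook could equally use the Hadoop S3 connector directly). The key-naming scheme below is hypothetical, and the boto3 import is kept inside the function so the sketch runs without it installed.

```python
def s3_key_for(filename: str, prefix: str = "raw") -> str:
    """Build the object key under which the file is stored (naming is our assumption)."""
    return f"{prefix}/{filename}"

def upload_to_s3(path: str, bucket: str, prefix: str = "raw") -> str:
    """Upload a local file to S3 and return its object key."""
    import os
    import boto3  # assumes boto3 is installed and AWS credentials are configured
    key = s3_key_for(os.path.basename(path), prefix)
    # upload_file handles multipart uploads automatically for large files.
    boto3.client("s3").upload_file(path, bucket, key)
    return key

print(s3_key_for("records.csv"))  # raw/records.csv
```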

Connect to Spark cluster

Configure the session with the settings required to read from and write to AWS S3
Figure 14
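A sketch of what that session setup might look like in PySpark. The s3a configuration keys are assumptions that depend on your Spark/Hadoop build, and the master address matches the SPARK_OPTS used earlier; the pyspark import is kept inside the function so the sketch loads without a cluster.

```python
# s3a settings commonly needed to read and write S3 from Spark (the exact keys
# are assumptions and depend on your Hadoop version; fill in real credentials).
S3A_CONF = {
    "spark.hadoop.fs.s3a.impl": "org.apache.hadoop.fs.s3a.S3AFileSystem",
    "spark.hadoop.fs.s3a.access.key": "<AWS_ACCESS_KEY_ID>",
    "spark.hadoop.fs.s3a.secret.key": "<AWS_SECRET_ACCESS_KEY>",
}

def connect(master_ip: str):
    """Create a SparkSession against the cluster deployed above."""
    from pyspark.sql import SparkSession  # assumes pyspark is on the notebook image
    builder = (SparkSession.builder
               .master(f"spark://{master_ip}:7077")
               .appName("etl-tutorial"))
    for key, value in S3A_CONF.items():
        builder = builder.config(key, value)
    return builder.getOrCreate()

print(sorted(S3A_CONF))
```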

Load data

Load the data into the Spark cluster
Figure 14

Schema

Define the Spark schema
Figure 15
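A sketch of defining an explicit schema in PySpark; the column names and types are hypothetical, since the tutorial does not show the real CSV layout. Reading with an explicit schema avoids a costly type-inference pass over the 9 million rows.

```python
# Hypothetical column layout -- the real file's columns are not shown in the tutorial.
FIELDS = [("id", "long"), ("name", "string"), ("amount", "double"), ("ts", "timestamp")]

def build_schema():
    """Turn the (name, type) pairs into a Spark StructType."""
    from pyspark.sql.types import (StructType, StructField, LongType,
                                   StringType, DoubleType, TimestampType)
    types = {"long": LongType(), "string": StringType(),
             "double": DoubleType(), "timestamp": TimestampType()}
    return StructType([StructField(name, types[t], True) for name, t in FIELDS])

# With a session from the previous step, the CSV would load along these lines:
#   df = spark.read.csv("s3a://<bucket>/raw/records.csv", header=True, schema=build_schema())
print([name for name, _ in FIELDS])
```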

Process

Do data processing
Figure 16
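The introduction mentions bucketizing values as one of the processing steps. A minimal sketch of that idea, with hypothetical bucket boundaries; the Spark equivalent is noted in a comment.

```python
import bisect

# Bucket boundaries are an assumption; the tutorial does not show the real ones.
SPLITS = [0.0, 10.0, 100.0, 1000.0]

def bucketize(amount: float) -> int:
    """Assign a value to a bucket index, mirroring pyspark.ml.feature.Bucketizer."""
    return bisect.bisect_right(SPLITS, amount) - 1

# The Spark equivalent would be a column transform, e.g.:
#   from pyspark.ml.feature import Bucketizer
#   Bucketizer(splits=SPLITS + [float("inf")], inputCol="amount", outputCol="bucket")
print([bucketize(v) for v in (5.0, 50.0, 500.0)])  # [0, 1, 2]
```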

SQL select

Set up Spark SQL
Figure 17

SQL Joins

Spark SQL joins 20 GB of data from multiple sources
Figure 18
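A join of this shape registers each dataframe as a temp view and runs SQL against it. The tables and query below are hypothetical; to keep the sketch runnable without a cluster, the same SQL is demonstrated on stdlib sqlite3, which shares this core syntax with Spark SQL.

```python
import sqlite3

# In Spark, each dataframe becomes a temp view and the query runs on the cluster:
#   df_orders.createOrReplaceTempView("orders")
#   df_users.createOrReplaceTempView("users")
#   report = spark.sql(JOIN_SQL)
JOIN_SQL = """
SELECT u.name, COUNT(*) AS n_orders, SUM(o.amount) AS total
FROM orders o JOIN users u ON o.user_id = u.id
GROUP BY u.name
ORDER BY total DESC
"""

# The same join, demonstrated on stdlib sqlite3 so the sketch runs anywhere.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE users (id INTEGER, name TEXT);
CREATE TABLE orders (user_id INTEGER, amount REAL);
INSERT INTO users VALUES (1, 'ada'), (2, 'bob');
INSERT INTO orders VALUES (1, 10.0), (1, 5.0), (2, 2.5);
""")
rows = con.execute(JOIN_SQL).fetchall()
print(rows)  # [('ada', 2, 15.0), ('bob', 1, 2.5)]
```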

Export reports

Export reports to RDS for UI consumption. Generate various charts and graphs.
Figure 19
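Spark can write the report dataframe straight to MySQL over JDBC. The host name and table below are hypothetical, and this assumes the MySQL JDBC driver jar is on the Spark classpath.

```python
def jdbc_url(host: str, db: str, port: int = 3306) -> str:
    """Build the JDBC URL for the RDS MySQL instance (format per MySQL Connector/J)."""
    return f"jdbc:mysql://{host}:{port}/{db}"

def export_report(df, host: str, db: str, table: str, user: str, password: str):
    """Write a Spark dataframe to MySQL over JDBC, replacing the target table."""
    (df.write.format("jdbc")
       .option("url", jdbc_url(host, db))
       .option("dbtable", table)
       .option("user", user)
       .option("password", password)
       .mode("overwrite")
       .save())

# Hypothetical endpoint, for illustration only:
print(jdbc_url("reports.example.rds.amazonaws.com", "etl"))
```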