EMR Serverless

Run big data applications with open-source frameworks without managing clusters and servers
Amazon EMR Serverless is a serverless option in Amazon EMR that makes it easy for data analysts and engineers to run open-source big data analytics frameworks without configuring, managing, and scaling clusters or servers. You get all the features and benefits of Amazon EMR without needing experts to plan and manage clusters.
In this procedure, we create an EMR Studio, create and clone a Spark application, and then create and clone a Spark job to run the application with EMR Serverless.
DuploCloud EMR Serverless supports Hive, Spark, and custom ECR images.

Creating an EMR Studio

To create EMR Serverless applications, you first need to create an EMR Studio.
  1. In the DuploCloud Portal, navigate to Cloud Services -> Analytics.
  2. Click the EMR Serverless tab.
  3. Click EMR Studio.
     Actions menu with EMR Studio option highlighted on EMR Serverless tab
  4. Click Add. The Add EMR Studio pane displays.
  5. Enter a Description of the Studio for reference.
  6. Select an S3 Bucket that you previously defined from the Logs Default S3 Bucket list box.
  7. Optionally, in the Logs Default S3 Folder field, specify the path to which logs are written.
  8. Click Create. The EMR Studio is created and displayed.
  9. Select the EMR Studio name in the Name column. The EMR Studio page displays. View the Details of the EMR Serverless Studio.
     EMR Studio page with Basic and Details tabs.
  10. Navigate to the EMR Serverless tab and click the menu icon in the Actions column. Use the Actions menu to delete the studio if needed, as well as to view the studio in the AWS Console.
     EMR Serverless Studio Actions Menu
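
The log settings in steps 6 and 7 amount to a bucket plus an optional folder. As a rough sketch, assuming the two values are simply joined into one S3 URI (the portal's actual behavior is not documented here), the resulting location could be computed as follows. The function name, bucket, and folder are hypothetical.

```python
def studio_log_location(bucket: str, folder: str = "") -> str:
    """Join a bucket name and an optional folder into an S3 URI.

    Assumption: the Logs Default S3 Bucket and Logs Default S3 Folder
    values are combined as s3://<bucket>/<folder>/.
    """
    prefix = folder.strip("/")  # tolerate leading or trailing slashes
    if prefix:
        return f"s3://{bucket}/{prefix}/"
    return f"s3://{bucket}/"


# Hypothetical bucket and folder names, for illustration only.
print(studio_log_location("my-logs-bucket", "emr/studio-logs"))
```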
Now that the EMR Studio exists, you create an application to run analytics with it. The DuploCloud Portal supports Hive and Spark applications. In this example, we create a Spark Application.

Creating an EMR Serverless application

  1. In the EMR Serverless tab, click Add. A configuration wizard launches with five steps for you to complete.
  2. Enter the EMR Serverless Application Name (app1, in this example) and the EMR Release Label in the Basics step. DuploCloud prepends the string DUPLOSERVICES-TENANT_NAME to your chosen application name, where TENANT_NAME is your Tenant's name. Click Next.
     EMR Serverless configuration wizard Basics step
  3. Accept the defaults for the Capacity, Limits, and Configure pages by clicking Next on each page until you reach the Confirm page.
  4. On the Confirm page, click Submit. Your created application instance (DUPLOSERVICES-DEFAULT-APP1, in this example) is displayed in the EMR Serverless tab with the State of CREATED.
     EMR Serverless tab with CREATED application instance
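
The naming rule from step 2 can be sketched in a few lines of Python. The helper names are hypothetical, and the uppercase form only mirrors how the portal displays names in this example; the name stored in AWS may use a different case. The request dictionary uses the `name`, `releaseLabel`, and `type` parameters of the boto3 `emr-serverless` client's `create_application` call, but whether DuploCloud sends exactly this request is an assumption.

```python
def full_application_name(tenant: str, app_name: str) -> str:
    """Prefix the application name with DUPLOSERVICES-<TENANT_NAME>.

    Uppercased to match the portal display shown in this example.
    """
    return f"DUPLOSERVICES-{tenant}-{app_name}".upper()


def create_application_request(tenant: str, app_name: str,
                               release_label: str,
                               app_type: str = "SPARK") -> dict:
    """Build keyword arguments for an EMR Serverless create_application call.

    Assumption: only Spark and Hive applications are created, as described
    for the DuploCloud Portal.
    """
    if app_type not in ("SPARK", "HIVE"):
        raise ValueError(f"unsupported application type: {app_type}")
    return {
        "name": full_application_name(tenant, app_name),
        "releaseLabel": release_label,
        "type": app_type,
    }


# "emr-6.9.0" is a placeholder release label; use the one you selected.
request = create_application_request("default", "app1", "emr-6.9.0")
print(request["name"])  # DUPLOSERVICES-DEFAULT-APP1
```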
Before you create a job to run the application, clone an instance of the application.

Cloning an application

  1. On the EMR Serverless page, click the menu icon and select Clone.
     Actions menu with Clone option on EMR Serverless tab
  2. Make any desired changes while advancing through the Basics, Capacity, Limits, and Configure steps, clicking Next to advance the wizard to the next page. DuploCloud gives your cloned app a unique generated name by default (app1-c-833, in this example).
  3. On the Confirm page, click Submit. In the EMR Serverless tab, you should now have two application instances in the CREATED State: your original application instance (DUPLOSERVICES-DEFAULT-APP1) and the cloned application instance (DUPLOSERVICES-DEFAULT-APP1-C-833).
     Original application instance and cloned instance in EMR Serverless tab

Creating a job

You have created and cloned the Spark application. Now you must create and clone a job to run it in EMR Serverless. In this example, we create a Spark job.
If you are new to Spark, use the Info Tips (blue icon) when entering data in the EMR Serverless configuration wizard steps below.
  1. Select the application instance that you previously cloned. This instance (DUPLOSERVICES-DEFAULT-APP1-C-833, in this example) has a STATE of CREATED.
  2. Click Add. The configuration wizard launches.
  3. In the Basics step, enter the EMR Serverless RunJob Name (jobfromcloneapp, in this example).
     EMR Serverless configuration wizard Basics step with EMR Serverless RunJob Name field
  4. Click Next.
  5. In the Job details step, select a previously-defined Spark Script S3 Bucket.
  6. In the Spark Script S3 Bucket File field, enter the path where your script is stored.
  7. Optionally, in the Spark Script Arguments field, specify an array of arguments to pass to your JAR or Python script. Separate the arguments with commas (,). In the example below, a single argument of "40000" is entered.
  8. Optionally, in the Spark Submit Parameters field, specify Spark --conf parameters. See the example below.
     EMR Serverless configuration wizard Job details step with Spark Script Arguments and Spark Submit Parameters fields
  9. Click Next.
  10. Make any desired changes in the Configure step and click Next to advance the wizard to the Confirm page.
  11. On the Confirm page, click Submit. In the Run Jobs tab for your cloned application, your job JOBFROMCLONEAPP displays.
     Run Jobs tab for cloned application instance DUPLOSERVICES-DEFAULT-APP1-C-753
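
To make the Job details fields more concrete, the sketch below shows one way the bucket, script path, comma-separated arguments, and Spark submit parameters could map onto the `sparkSubmit` job driver accepted by the EMR Serverless `StartJobRun` API (`entryPoint`, `entryPointArguments`, `sparkSubmitParameters`). The mapping from portal fields to that structure is an assumption, and the helper names, bucket, script path, and `--conf` values are hypothetical.

```python
from typing import List


def parse_script_arguments(raw: str) -> List[str]:
    """Split a comma-separated argument string into a list, dropping blanks."""
    return [part.strip() for part in raw.split(",") if part.strip()]


def build_job_driver(bucket: str, script_path: str,
                     raw_arguments: str = "",
                     submit_parameters: str = "") -> dict:
    """Build a sparkSubmit job driver from portal-style field values.

    Assumption: the script location is s3://<bucket>/<script_path>.
    """
    spark_submit = {"entryPoint": f"s3://{bucket}/{script_path.lstrip('/')}"}
    arguments = parse_script_arguments(raw_arguments)
    if arguments:  # optional field, omitted when empty
        spark_submit["entryPointArguments"] = arguments
    if submit_parameters.strip():  # optional field, omitted when empty
        spark_submit["sparkSubmitParameters"] = submit_parameters.strip()
    return {"sparkSubmit": spark_submit}


# Hypothetical bucket, script, and parameters; "40000" matches the example above.
driver = build_job_driver(
    "my-scripts-bucket",
    "scripts/pi.py",
    "40000",
    "--conf spark.executor.cores=1 --conf spark.executor.memory=4g",
)
print(driver)
```

Both optional fields are left out of the dictionary when empty, which matches their optional status in the wizard.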

Monitoring running jobs

Observe the status of your jobs and make changes, if needed. In this example, we monitor the Spark jobs created and cloned in this procedure.
  1. In the DuploCloud Portal, navigate to Cloud Services -> Analytics.
  2. Click the EMR Serverless tab.
  3. Select the application instance that you want to monitor. The Run Jobs tab displays run jobs connected to the application instance and each job's STATE.
     Run Jobs tab with 2 jobs in various STATEs
  4. Using the Actions menu, you can view a job in the Console, or Start, Stop, Edit, Clone, or Delete it. You can also click the Details tab to view configuration details.
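
If you prefer to watch a job from a script rather than the Run Jobs tab, a polling loop along these lines is one option. This is a minimal sketch, assuming the STATE shown in the portal mirrors the EMR Serverless job run state, for which SUCCESS, FAILED, and CANCELLED are terminal. `wait_for_job` and its parameters are hypothetical names. In real use, `get_state` could wrap the boto3 `emr-serverless` client's `get_job_run(applicationId=..., jobRunId=...)["jobRun"]["state"]`; here it is simulated so the example runs without AWS access.

```python
import time
from typing import Callable

# Job run states after which no further change is expected.
TERMINAL_STATES = {"SUCCESS", "FAILED", "CANCELLED"}


def wait_for_job(get_state: Callable[[], str],
                 poll_seconds: float = 30.0,
                 max_polls: int = 120,
                 sleep: Callable[[float], None] = time.sleep) -> str:
    """Poll get_state until a terminal state or until max_polls is reached.

    Returns the last state observed, which may be non-terminal on timeout.
    """
    state = get_state()
    polls = 1
    while state not in TERMINAL_STATES and polls < max_polls:
        sleep(poll_seconds)
        state = get_state()
        polls += 1
    return state


# Simulated state sequence standing in for real API responses.
simulated = iter(["SUBMITTED", "RUNNING", "SUCCESS"])
print(wait_for_job(lambda: next(simulated), poll_seconds=0))  # SUCCESS
```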