EMR Serverless
Run big data applications with open-source frameworks without managing clusters and servers
Amazon EMR Serverless is a serverless option in Amazon EMR that makes it easy for data analysts and engineers to run open-source big data analytics frameworks without configuring, managing, and scaling clusters or servers. You get all the features and benefits of Amazon EMR without needing experts to plan and manage clusters.
In this procedure, we create an EMR Studio, create and clone a Spark application, then create and clone a Spark job to run the application with EMR Serverless.
DuploCloud EMR Serverless supports Hive, Spark, and custom ECR images.
Creating an EMR Studio
To create EMR Serverless applications, you first need to create an EMR Studio.
In the DuploCloud Portal, navigate to Cloud Services -> Analytics.
Click the EMR Serverless tab.
Click EMR Studio.
Enter a Description of the Studio for reference.
Select an S3 Bucket that you previously defined from the Logs Default S3 Bucket list box.
Optionally, in the Logs Default S3 Folder field, specify the path to which logs are written.
Click Create. The EMR Studio is created and displayed.
Select the EMR Studio name in the Name column. The EMR Studio page displays. View the Details of the EMR Serverless Studio.
Now that the EMR Studio exists, you create an application to run analytics with it.
The DuploCloud Portal supports Hive and Spark applications. In this example, we create a Spark application.
Creating an EMR Serverless application
In the EMR Serverless tab, click Add. A configuration wizard launches with five steps for you to complete.
Enter the EMR Serverless Application Name (app1, in this example) and the EMR Release Label in the Basics step. DuploCloud prepends the string DUPLOSERVICES-TENANT_NAME to your chosen application name, where TENANT_NAME is your Tenant's name. Click Next.
Accept the defaults for the Capacity, Limits, and Configure pages by clicking Next on each page until you reach the Confirm page.
On the Confirm page, click Submit. Your created application instance (DUPLOSERVICES-DEFAULT-APP1, in this example) is displayed in the EMR Serverless tab with the State of CREATED.
Before you begin to create a job to run the application, clone an instance of it to run.
Cloning an application
Make any desired changes while advancing through the Basics, Capacity, Limits, and Configure steps, clicking Next to advance the wizard to the next page. DuploCloud gives your cloned app a unique generated name by default (app1-c-833, in this example).
On the Confirm page, click Submit. In the EMR Serverless tab, you should now have two application instances in the CREATED State: your original application instance (DUPLOSERVICES-DEFAULT-APP1) and the cloned application instance (DUPLOSERVICES-DEFAULT-APP1-C-833).
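The naming convention described above can be sketched in a few lines of Python. This is only an illustration inferred from the examples on this page: the function name duplo_app_name is hypothetical, and the hyphen separator and upper-casing are assumptions based on how app1 and app1-c-833 appear in the EMR Serverless tab, not a documented DuploCloud API.

```python
def duplo_app_name(tenant_name: str, app_name: str) -> str:
    """Return the name the portal is expected to display for an application.

    Assumption: the DUPLOSERVICES-TENANT_NAME prefix and the chosen name are
    joined with hyphens and upper-cased, as the examples on this page suggest.
    """
    return f"DUPLOSERVICES-{tenant_name}-{app_name}".upper()


# Names used in this procedure, for a Tenant named "default".
original = duplo_app_name("default", "app1")
clone = duplo_app_name("default", "app1-c-833")
print(original)  # DUPLOSERVICES-DEFAULT-APP1
print(clone)     # DUPLOSERVICES-DEFAULT-APP1-C-833
```

A helper like this can be handy when you need to look up the application by its full name outside the portal.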
Creating a job
You have created and cloned the Spark application. Now you must create and clone a job to run it in EMR Serverless. In this example, we create a Spark job.
Select the application instance that you previously cloned. This instance (DUPLOSERVICES-DEFAULT-APP1-C-833, in this example) has a STATE of CREATED.
Click Add. The configuration wizard launches.
In the Basics step, enter the EMR Serverless RunJob Name (jobfromcloneapp, in this example).
Click Next.
In the Job details step, select a previously-defined Spark Script S3 Bucket.
In the Spark Script S3 Bucket File field, enter a path to define where your scripts are stored.
Optionally, in the Spark Scripts field, you can specify an array of arguments passed to your JAR or Python script. Separate each argument in the array with a comma (,). In this example, a single argument of "40000" is entered.
Optionally, in the Spark Submit Parameters field, you can specify Spark --conf parameters, each in the form --conf key=value.
Click Next.
Make any desired changes in the Configure step and click Next to advance the wizard to the Confirm page.
On the Confirm page, click Submit. In the Run Jobs tab for your cloned application, your job JOBFROMCLONEAPP displays.
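For readers who want to see how the job fields fit together, the following Python sketch assembles a Spark job driver in the shape used by the AWS EMR Serverless StartJobRun API (a sparkSubmit object with entryPoint, entryPointArguments, and sparkSubmitParameters). How the DuploCloud Portal maps its wizard fields onto that API is an assumption here, and the bucket name, script path, and --conf values are hypothetical placeholders.

```python
import json


def build_spark_job_driver(bucket, script_key, arguments=None, conf=None):
    """Build a jobDriver dictionary for a Spark job.

    bucket / script_key: where the script lives (Spark Script S3 Bucket and file path).
    arguments: list of strings passed to the JAR or Python script.
    conf: mapping of Spark properties, rendered as "--conf key=value" pairs.
    """
    # Render every Spark property as a separate --conf flag.
    params = " ".join(f"--conf {key}={value}" for key, value in (conf or {}).items())
    spark_submit = {
        "entryPoint": f"s3://{bucket}/{script_key}",
        "entryPointArguments": list(arguments or []),
    }
    if params:
        spark_submit["sparkSubmitParameters"] = params
    return {"sparkSubmit": spark_submit}


driver = build_spark_job_driver(
    bucket="my-script-bucket",      # hypothetical bucket name
    script_key="scripts/pi.py",     # hypothetical script path
    arguments=["40000"],            # the single argument used in this example
    conf={"spark.executor.cores": "1", "spark.executor.memory": "4g"},
)
print(json.dumps(driver, indent=2))
```

Keeping the arguments as a list and the Spark properties as a mapping avoids quoting mistakes when the values are later joined into the comma-separated and --conf forms that the wizard expects.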
Monitoring running jobs
Observe the status of your jobs and make changes, if needed. In this example, we monitor the Spark jobs created and cloned in this procedure.
In the DuploCloud Portal, navigate to Cloud Services -> Analytics.
Click the EMR Serverless tab.
Select the application instance that you want to monitor. The Run Jobs tab displays run jobs connected to the application instance and each job's STATE.
Using the Actions menu, you can view the Console, Start, Stop, Edit, Clone or Delete jobs. You can also click the Details tab to view configuration details.
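If you prefer to watch a job from a script rather than from the Run Jobs tab, a polling loop along the following lines may help. It is a minimal sketch, not a DuploCloud feature: wait_for_job is a hypothetical helper, and the terminal states (SUCCESS, FAILED, CANCELLED) are assumed to follow the AWS EMR Serverless job run states. The state lookup is passed in as a function so the loop can be tried without AWS access.

```python
import time
from typing import Callable

# Assumed terminal job run states, following AWS EMR Serverless naming.
TERMINAL_STATES = {"SUCCESS", "FAILED", "CANCELLED"}


def wait_for_job(
    get_state: Callable[[], str],
    poll_seconds: float = 30.0,
    timeout_seconds: float = 3600.0,
) -> str:
    """Call get_state() repeatedly until the job reaches a terminal state."""
    deadline = time.monotonic() + timeout_seconds
    while True:
        state = get_state()
        if state in TERMINAL_STATES:
            return state
        if time.monotonic() >= deadline:
            raise TimeoutError(
                f"job still in state {state!r} after {timeout_seconds} seconds"
            )
        time.sleep(poll_seconds)


# Demonstration with a fake state source instead of a real job.
fake_states = iter(["SUBMITTED", "PENDING", "RUNNING", "SUCCESS"])
final_state = wait_for_job(lambda: next(fake_states), poll_seconds=0.0)
print(final_state)  # SUCCESS
```

In real use, get_state could wrap a call such as the boto3 emr-serverless client's get_job_run and return the job run's state field, assuming your credentials allow access to the application that DuploCloud created.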