Saturday, March 21, 2020


Today's IT is moving towards cloud and serverless technologies, and everyone wants the best possible design when implementing their solutions. In this blog we are going to talk about how to orchestrate your AWS Glue jobs in an event/time-driven, serverless way.

In most business scenarios we run our jobs on either time-based or event-based triggers. An event here can be a file arrival in S3, an API call, an SQS message, or any other event you want to trigger your jobs from; a time-based trigger runs jobs in a particular window or at a given frequency.

The flow below shows in detail how we can build our AWS Glue job workflow from an event trigger.




Below is detailed information for each numbered step in the flow above:

1. This is where we trigger our AWS Lambda function on an event occurrence, and this event can be of many types, such as:
  • File arrival in S3
  • SQS trigger event
  • Time-based event in Amazon CloudWatch
  • API call that invokes the Lambda
  • DynamoDB Stream, or many more depending on the business scenario
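For the most common case above, a file arriving in S3, the entry-point Lambda just needs to read the bucket and object key from the S3 event notification record. A minimal sketch (the handler name and return shape are illustrative):

```python
def lambda_handler(event, context):
    """Entry-point Lambda for an S3 file-arrival trigger: pull the bucket
    and object key out of the standard S3 event notification record."""
    record = event["Records"][0]
    bucket = record["s3"]["bucket"]["name"]
    key = record["s3"]["object"]["key"]
    # From here we would call start_job_run (step 2) using these values.
    return {"bucket": bucket, "key": key}
```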
2. Once the Lambda is invoked by any of the event types above, we can start our first AWS Glue job with the help of the boto3 "start_job_run" API:
import boto3

client = boto3.client('glue')
response = client.start_job_run(
    JobName='MyGlueJob',
    Arguments={'--key1': 'value1', '--key2': 'value2'})
Once the job has started, we can read the response from the "start_job_run" API above, parse it to get the Job Run Id of the Glue job, and make an entry into a DynamoDB table using JobRunId as the partition (hash) key and ProcessName as the sort key. Below is a sample DynamoDB entry:
{
  "JobRunId": "jhjakhsdhj2134324vsdkldscklmsd",
  "ProcessName": "exampleflow",
  "JobName": "MyGlueJob",
  "Status": "InProgress",
  "CreatedDate": "2020-03-21",
  "CreatedBy": "First Lambda",
  "NextGlueJob": "SecondGlueJob",
  "RunId": "100"
}
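Writing that tracking item can be sketched as below. The table name and helper are hypothetical; the function takes the DynamoDB Table resource as a parameter so it can be exercised without a live table:

```python
def record_job_run(table, start_response, process_name, job_name, next_job):
    """Write a tracking item like the sample above into DynamoDB.
    `table` is a boto3 DynamoDB Table resource, e.g.
    boto3.resource('dynamodb').Table('GlueJobTracking')  # name is illustrative
    `start_response` is the dict returned by glue.start_job_run."""
    item = {
        "JobRunId": start_response["JobRunId"],  # partition (hash) key
        "ProcessName": process_name,             # sort key
        "JobName": job_name,
        "Status": "InProgress",
        "NextGlueJob": next_job,
    }
    table.put_item(Item=item)
    return item
```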
3. This step simply shows that the AWS Glue job was started asynchronously and that we captured the Job Run Id, which will let us check the job's status later.
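Checking the status later from the stored Job Run Id can be sketched with the boto3 "get_job_run" API; the client is passed in here so the function can be tested with a stub:

```python
def get_glue_job_status(glue_client, job_name, job_run_id):
    """Return the current state of a Glue job run (e.g. RUNNING, SUCCEEDED,
    FAILED) for a stored JobRunId. `glue_client` is boto3.client('glue')."""
    response = glue_client.get_job_run(JobName=job_name, RunId=job_run_id)
    return response["JobRun"]["JobRunState"]
```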

4. We can make an entry in SQS if we want to trigger the same Glue job concurrently; because of the job's concurrency limit of 3, all other events wait in this queue until a free Glue job slot is available to take them (applicable to concurrent jobs only).
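Parking a request in the waiting queue can be sketched as below; the queue URL and message shape are illustrative, and the SQS client is passed in for testability:

```python
import json

def enqueue_job_request(sqs_client, queue_url, job_name, arguments):
    """Park a Glue job request in the waiting queue when the concurrency
    limit has been reached; the step 7 Lambda drains this queue later."""
    body = json.dumps({"JobName": job_name, "Arguments": arguments})
    return sqs_client.send_message(QueueUrl=queue_url, MessageBody=body)
```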

5. In this step we create a rule in Amazon CloudWatch Events that fires when our first Glue job's status changes from running to success or failure. Below is a sample event pattern:
{
  "detail-type": ["Glue Job State Change"],
  "source": ["aws.glue"],
  "detail": {
    "jobName": ["GlueJob1", "GlueJob2"],
    "state": ["FAILED", "TIMEOUT", "STOPPED", "SUCCEEDED"]
  }
}
6. This Lambda is triggered when the step 5 event fires in AWS CloudWatch, and based on the event payload it will read the Job Run Id.
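Extracting the details from the "Glue Job State Change" payload is plain dictionary access; a minimal sketch:

```python
def parse_glue_state_change(event):
    """Pull the job name, run id and final state out of a 'Glue Job State
    Change' CloudWatch event payload delivered to the Lambda."""
    detail = event["detail"]
    return detail["jobName"], detail["jobRunId"], detail["state"]
```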

7. In this step the Lambda can read a message from the waiting queue and start the same Glue job in place of the one that just completed (applicable to concurrent jobs only).
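Draining one request from the waiting queue and starting it can be sketched as below (clients passed in as stubs; message shape matches the step 4 example). The message is deleted only after start_job_run succeeds, so a failed start leaves the request in the queue for a retry:

```python
import json

def start_next_waiting_job(sqs_client, glue_client, queue_url):
    """Pull one queued request (if any) from the waiting queue and start it
    in the freed concurrency slot; delete the message only after a
    successful start_job_run call."""
    resp = sqs_client.receive_message(QueueUrl=queue_url, MaxNumberOfMessages=1)
    messages = resp.get("Messages", [])
    if not messages:
        return None  # nothing waiting
    msg = messages[0]
    request = json.loads(msg["Body"])
    run = glue_client.start_job_run(JobName=request["JobName"],
                                    Arguments=request.get("Arguments", {}))
    sqs_client.delete_message(QueueUrl=queue_url,
                              ReceiptHandle=msg["ReceiptHandle"])
    return run["JobRunId"]
```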

8. In this step we query the DynamoDB table to get the next Glue job name and continue the process, or end the process based on the entry in DynamoDB, and we also update the status of our completed Glue job.
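The steps above can be sketched together as one function: update the finished run's status, then chain the NextGlueJob recorded in the sample item (or stop when the run failed or no next job exists). Key and attribute names follow the sample entry; clients are passed in as stubs:

```python
def advance_workflow(table, glue_client, job_run_id, process_name, final_state):
    """Mark the finished run's status in DynamoDB and, when the run
    succeeded and a NextGlueJob is recorded, start that next job.
    Returns the new run's JobRunId, or None when the workflow ends here."""
    key = {"JobRunId": job_run_id, "ProcessName": process_name}
    item = table.get_item(Key=key)["Item"]
    table.update_item(
        Key=key,
        UpdateExpression="SET #s = :s",
        ExpressionAttributeNames={"#s": "Status"},  # Status is reserved in DynamoDB
        ExpressionAttributeValues={":s": final_state},
    )
    next_job = item.get("NextGlueJob")
    if final_state == "SUCCEEDED" and next_job:
        return glue_client.start_job_run(JobName=next_job)["JobRunId"]
    return None  # end of the process for this flow
```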


Note: This is one of my first blogs, so please comment with your feedback or any corrections if needed; I appreciate your responses.
