Robust API Retry Mechanism with AWS Step Functions and Lambda


I have been working with external API calls for a while and have noticed that they sometimes fail for various reasons, such as network issues, server downtime, or server-side rate limits. I built this solution to handle those failures reliably.

In this solution, we will use AWS Step Functions and Lambda functions to build a reliable retry mechanism. The state machine consists of a set of Lambda functions that are invoked and stitched together to produce a result. This article walks you through it step by step.

The main problem we are trying to solve:

While Step Functions inherently support retries within tasks, our specific challenge involves handling API rate limits from the server we are communicating with. The server imposes a rate limit and responds with a 429 status code if too many requests are made from the same IP address within a short period.
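For contrast, here is a rough sketch of what the built-in retry looks like on a single Task state, written as a Python dictionary and printed as Amazon States Language JSON. The Lambda ARN and the numeric values are placeholders chosen for illustration. Note that this kind of retry is triggered when the task raises an error and re-invokes the same Lambda function, which does not help when the remote server is rate limiting by IP address.

```python
import json

# Hypothetical Task state using the built-in Retry field.
# The ARN below is a placeholder, not a real function.
builtin_retry_task = {
    "Type": "Task",
    "Resource": "arn:aws:lambda:REGION:ACCOUNT_ID:function:callApi",
    "Retry": [
        {
            "ErrorEquals": ["States.TaskFailed"],  # retry when the task raises an error
            "IntervalSeconds": 2,                  # wait before the first retry
            "MaxAttempts": 2,                      # number of retries after the first try
            "BackoffRate": 2.0,                    # multiply the wait on each retry
        }
    ],
    "End": True,
}

print(json.dumps(builtin_retry_task, indent=2))
```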


Prerequisites

1. Architecture:

Workflow Explanation

  1. User Invokes Step Function State Machine: The process begins when a user initiates the step function state machine. This could be triggered through an API call, a scheduled event, or another AWS service.

  2. Step Function Invokes Lambda (1st Attempt): The step function invokes the first Lambda function (Lambda 1). This Lambda function is responsible for making the API call.

  3. Response: Status: Lambda 1 executes the API call and returns a status response. This response indicates whether the API call was successful (e.g., status code 200) or failed (e.g., any status code other than 200).

  4. If Failure Status ≠ 200 (2nd Attempt): If the response from Lambda 1 indicates a failure (status code not equal to 200), the step function proceeds to the retry mechanism. This could involve retrying the same Lambda function or invoking a different Lambda function (Lambda 2) to handle the retry attempt.

  5. Response: Status: Lambda 2 attempts to execute the API call and returns a status response. As with the first attempt, this response indicates whether the retry was successful.

  6. If Success Status = 200: If either Lambda 1 or Lambda 2 successfully executes the API call and returns a status code of 200, the step function completes successfully, and the user is notified of the success.

  7. If Failure Even After Retries: If every retry also fails, we fail the step function and forward the API error to the user with the appropriate status code.
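As a minimal sketch of the Lambda functions in steps 2 to 5, the handler below makes the API call and always returns a status code and body instead of raising on HTTP errors, so the state machine can branch on the result. The event fields `url` and `payload`, the helper name `call_api`, and the injectable `opener` parameter are assumptions made for this example, not part of the original design.

```python
import json
import urllib.error
import urllib.request


def call_api(url, payload, opener=urllib.request.urlopen):
    """POST payload to url and return {"statusCode", "body"} without raising on HTTP errors."""
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    try:
        with opener(request, timeout=10) as response:
            return {
                "statusCode": response.status,
                "body": response.read().decode("utf-8"),
            }
    except urllib.error.HTTPError as error:
        # Non-2xx responses such as 429 arrive here; report them as data
        return {"statusCode": error.code, "body": error.read().decode("utf-8")}


def lambda_handler(event, context):
    # "url" and "payload" are hypothetical input fields for this sketch
    return call_api(event["url"], event.get("payload", {}))
```

Returning the status as data rather than raising keeps the decision about whether to retry inside the state machine.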

To keep the architecture easy to explain, the diagram above shows only one retry, but we will build the solution with two retries. Below is the state machine diagram.
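One way such a state machine could be defined is sketched here: a small Python helper that builds an Amazon States Language definition with one Task and one Choice state per attempt. The state names, the placeholder Lambda ARNs, and the use of `ResultPath` to keep the original input are assumptions for this example, and it expects the execution input to be a JSON object and each Lambda to return a `statusCode` field.

```python
import json


def build_definition(lambda_arns):
    """Chain one Task + Choice pair per Lambda ARN; the first 200 wins, otherwise fail."""
    states = {}
    for index, arn in enumerate(lambda_arns, start=1):
        is_last = index == len(lambda_arns)
        states[f"Attempt{index}"] = {
            "Type": "Task",
            "Resource": arn,
            # Keep the original input and store the Lambda result beside it
            "ResultPath": "$.result",
            "Next": f"CheckAttempt{index}",
        }
        states[f"CheckAttempt{index}"] = {
            "Type": "Choice",
            "Choices": [
                {
                    "Variable": "$.result.statusCode",
                    "NumericEquals": 200,
                    "Next": "Done",
                }
            ],
            # Any other status moves on to the next attempt, or fails at the end
            "Default": "AllAttemptsFailed" if is_last else f"Attempt{index + 1}",
        }
    # The execution output is the result of the successful attempt only
    states["Done"] = {"Type": "Succeed", "OutputPath": "$.result"}
    states["AllAttemptsFailed"] = {
        "Type": "Fail",
        "Error": "ApiCallFailed",
        "Cause": "No attempt returned status code 200",
    }
    return {"StartAt": "Attempt1", "States": states}


# Placeholder ARNs: one first attempt plus two retries
definition = build_definition([
    "arn:aws:lambda:REGION:ACCOUNT_ID:function:apiCall1",
    "arn:aws:lambda:REGION:ACCOUNT_ID:function:apiCall2",
    "arn:aws:lambda:REGION:ACCOUNT_ID:function:apiCall3",
])
print(json.dumps(definition, indent=2))
```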


2. Step-by-Step Guide

import boto3
import json
import time

def start_state_machine(body):
    # Create a session with AWS credentials
    session = boto3.Session(
        aws_access_key_id='',
        aws_secret_access_key='',
        region_name=''
    )
    
    # Create a client to interact with AWS Step Functions
    step_functions_client = session.client('stepfunctions')
    
    # Define the ARN of the Step Function that you want to start
    state_machine_arn = 'arn:aws:states::stateMachine:apiProxyStateMachine'
    
    # Define the input to pass to the Step Function
    input_data = body
    
    # Start the Step Function with the specified input
    response = step_functions_client.start_execution(
        stateMachineArn=state_machine_arn,
        input=json.dumps(input_data)
    )
    
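    # Poll until the execution reaches a terminal state, pausing between
    # calls to avoid a busy loop against the DescribeExecution API
    while True:
        execution_status = step_functions_client.describe_execution(
            executionArn=response['executionArn']
        )['status']
        if execution_status in ['SUCCEEDED', 'FAILED', 'TIMED_OUT', 'ABORTED']:
            break
        time.sleep(1)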
    
    execution_output = step_functions_client.describe_execution(
        executionArn=response['executionArn']
    )
    
    if execution_output['status'] == 'SUCCEEDED':
        return execution_output['output']
    else:
        return execution_output['status']

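def lambda_handler(event, context):
    data = start_state_machine(event["body"])

    # On failure, start_state_machine returns the bare execution status
    # instead of JSON output, so report it rather than trying to parse it.
    # The 502 status code is an assumption; pick whatever suits your API.
    if data in ('FAILED', 'TIMED_OUT', 'ABORTED'):
        return {
            "statusCode": 502,
            "body": json.dumps({"executionStatus": data})
        }

    response = json.loads(data)
    return {
        "statusCode": response["statusCode"],
        "body": response["body"]
    }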
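Because the calling Lambda function has its own timeout, it may be worth bounding the polling loop. The helper below is a sketch of that idea: it polls `describe_execution` until a terminal status or a deadline, whichever comes first. The function name, the default timings, and the injectable `sleep` and `clock` parameters are assumptions for this example.

```python
import time

# Statuses after which the execution will not change any further
TERMINAL_STATUSES = {"SUCCEEDED", "FAILED", "TIMED_OUT", "ABORTED"}


def wait_for_execution(client, execution_arn, poll_seconds=1.0,
                       timeout_seconds=25.0, sleep=time.sleep,
                       clock=time.monotonic):
    """Poll describe_execution until a terminal status; raise TimeoutError at the deadline."""
    deadline = clock() + timeout_seconds
    while True:
        execution = client.describe_execution(executionArn=execution_arn)
        if execution["status"] in TERMINAL_STATUSES:
            return execution
        if clock() >= deadline:
            raise TimeoutError(
                f"Execution still {execution['status']} after {timeout_seconds}s"
            )
        sleep(poll_seconds)
```

With a boto3 Step Functions client this could be called as `wait_for_execution(step_functions_client, response['executionArn'])`, and the returned dictionary carries both `status` and, on success, `output`.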


3. Testing the State Machine

Trigger the state machine execution using the first Lambda function's URL and monitor it in the AWS Step Functions console. You should see each retry and the final result, whether it succeeds or fails.
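Before deploying, the expected flow can also be checked locally with stand-in functions. The sketch below mirrors the state machine's logic in plain Python: it calls each worker in order and stops at the first status code 200. The names `run_attempts`, `rate_limited`, and `succeeds` are hypothetical and exist only for this dry run.

```python
def run_attempts(workers, event):
    """Call each worker in order and stop at the first 200, like the state machine does."""
    history = []
    for worker in workers:
        result = worker(event, None)
        history.append(result["statusCode"])
        if result["statusCode"] == 200:
            return {"status": "SUCCEEDED", "output": result, "history": history}
    return {"status": "FAILED", "output": None, "history": history}


def rate_limited(event, context):
    # Stand-in for a Lambda whose API call was rate limited
    return {"statusCode": 429, "body": "Too Many Requests"}


def succeeds(event, context):
    # Stand-in for a Lambda whose API call went through
    return {"statusCode": 200, "body": "ok"}


recovered = run_attempts([rate_limited, rate_limited, succeeds], {"payload": {}})
exhausted = run_attempts([rate_limited, rate_limited, rate_limited], {"payload": {}})
print(recovered["status"], recovered["history"])
print(exhausted["status"], exhausted["history"])
```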


Conclusion

Implementing a robust API retry mechanism using AWS Step Functions and Lambda is a powerful way to enhance the reliability of your API integrations. I have worked a lot with vendor APIs, and their reliability is not something you can count on. They have rate limits, IP-based wait times, and so on. Retrying through different Lambda functions means each attempt can come from a different server, which helps avoid IP-based wait times on top of the retry itself.

This solution provides a visual workflow to monitor and debug your API calls. With AWS Step Functions and Lambda, you can build a fault-tolerant API integration with minimal effort.


Thanks for reading the tutorial. I hope you learned something new today. If you want to read more stories like this, I invite you to follow me.

Till then, Sayonara! I wish you the best in your learning journey.

I am a software engineer who enjoys travelling and writing.

© Somil Gupta. Made with ♥ in India.