Improving Developer Productivity at Disney with Serverless and Open Source

November 9, 2022 By Mark Otto

Disney connected devices

Disney+ is the dedicated streaming home for movies and shows from Disney, Pixar, Marvel, Star Wars, and National Geographic, as well as the new general entertainment content brand in select international markets, Star. Disney+ is available either as a standalone service or as part of The Disney Bundle, which gives subscribers access to Disney+, Hulu, and ESPN+.

Software teams at Disney Streaming strive to provide a best-in-class experience to their users. Regardless of the platform customers are using, their expectations are high. Users expect streaming to be quick and smooth, profiles to be synchronized between devices, and notifications to be correct and timely. There is a real need to quickly adapt to new customer and business demands.

The Messaging team at Disney Streaming is responsible for engaging our customers through channels such as email and push notifications. We send millions of messages to our customers for a variety of use cases, including password recovery, account changes, and purchase confirmations. These messages go out across various Disney brands and channels, leveraging a variety of technologies, many of them serverless.

Lambda Performance with Java

We started our serverless journey by writing our Lambda functions in Node.js, since it is well supported by Lambda and its cold starts were negligible. The team, however, had greater expertise in Java and JVM languages, and was tasked with porting the existing functionality to Java. We chose Java knowing that there would be some challenges, especially cold starts.

We started out simply using AWS Lambda functions to handle asynchronous use cases, specifically to listen to events published through Amazon Kinesis. Kinesis is a serverless streaming platform that allows us to collect and process data in real time. Various teams publish events to Kinesis about changes in their systems. These could be billing events, purchase events, or account changes. By publishing events, downstream consumers, including the Messaging team, can analyze and react to them in real time while keeping systems loosely coupled. Our team is responsible for consuming events from Kinesis, analyzing them, and triggering the appropriate messaging to notify our customers in real time.
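As an illustration, a minimal consumer of this shape, using the aws-lambda-java-events library, might look like the following sketch (the class name and the triggerMessaging helper are hypothetical, not taken from our codebase):

import com.amazonaws.services.lambda.runtime.Context;
import com.amazonaws.services.lambda.runtime.RequestHandler;
import com.amazonaws.services.lambda.runtime.events.KinesisEvent;
import java.nio.charset.StandardCharsets;

public class AccountEventConsumer implements RequestHandler<KinesisEvent, Void> {
    @Override
    public Void handleRequest(KinesisEvent event, Context context) {
        for (KinesisEvent.KinesisEventRecord record : event.getRecords()) {
            // The Kinesis record payload arrives as a ByteBuffer
            String payload = StandardCharsets.UTF_8
                    .decode(record.getKinesis().getData())
                    .toString();
            // Analyze the event and trigger the appropriate customer
            // messaging (hypothetical helper)
            triggerMessaging(payload);
        }
        return null;
    }

    private void triggerMessaging(String payload) { /* ... */ }
}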

While there are various ways to consume from a Kinesis stream, Lambda provides a featureful native integration with little operational overhead. The native integration with Lambda automatically scales based on the number of Kinesis shards, polls Kinesis shards for new records, and checkpoints successfully processed records.

We also configured the Kinesis integration to send records we can’t process to a dead-letter Amazon Simple Queue Service (Amazon SQS) queue for additional processing. The native integration with Amazon SQS has its own benefits, such as scaling up to 1,000 concurrent Lambda executions when there is a large volume of messages. By leveraging these serverless systems, we offloaded a lot of operational overhead, allowing our team to focus on deliverables.
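As a sketch of how this can be declared in AWS SAM (the resource names here are illustrative), the Kinesis event source supports an on-failure destination that sends details about batches Lambda could not process to an SQS queue:

Events:
  Kinesis:
    Type: Kinesis
    Properties:
      Stream: !Ref TestStream
      StartingPosition: LATEST
      MaximumRetryAttempts: 2
      # After retries are exhausted, Lambda sends details about the
      # failed batch to this queue for additional processing
      DestinationConfig:
        OnFailure:
          Destination: !GetAtt DeadLetterQueue.Arn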

Disney serverless architecture diagram

We continued to expand our Lambda usage to similar use cases, making use of serverless where it made sense. For example, we leveraged Lambda functions to process Amazon Simple Email Service (SES) events via Amazon Simple Notification Service (SNS) to get email engagement updates, and DynamoDB streams to replicate data to Elasticsearch, allowing us to scale our writes with DynamoDB while providing a way to do full-text searches with Elasticsearch.

As many of our initial systems were asynchronous, cold starts were less of a concern. However, we still wanted to reduce Lambda cold starts. Cold starts happen whenever a new execution environment is created. During this process, Lambda downloads the function code, creates the environment, and then runs any initialization code before running the handler method. After execution completes, this execution environment is reused by subsequent requests. Though infrequent, cold starts can be an area for optimization.
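A common pattern that follows from this lifecycle is to create expensive resources, such as SDK clients, during initialization rather than inside the handler, so that warm invocations can reuse them. A minimal sketch (the class name and client choice are illustrative):

import com.amazonaws.services.lambda.runtime.Context;
import com.amazonaws.services.lambda.runtime.RequestHandler;
import software.amazon.awssdk.services.dynamodb.DynamoDbClient;

public class WarmStartExample implements RequestHandler<String, String> {
    // Created once during the init phase of a cold start; every
    // subsequent invocation in this execution environment reuses it
    private static final DynamoDbClient DDB = DynamoDbClient.create();

    @Override
    public String handleRequest(String input, Context context) {
        // Only the handler body runs on warm invocations
        return DDB.serviceName();
    }
}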

Many things affect the duration of the cold start, including the size of the Lambda function, use of reflection, and the Java virtual machine (JVM).

One way we mitigated cold starts was to avoid dependency injection frameworks that rely on reflection. Reflection allows Java code to examine its own classes, methods, fields, and their properties at run time. This matters when programming for Lambda because a dependency injection framework that uses reflection increases the duration of initialization, the period before the function executes the handler method. We used Dagger 2 for dependency injection, an open source, fully static, compile-time dependency injection framework. Dependency injection is often performed at runtime, but with Dagger it moves to compile time, reducing the cold start.
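To make the idea concrete, here is a minimal Dagger 2 sketch (the module, component, and client choice are illustrative, not our actual wiring). The generated DaggerAppComponent wires everything at compile time, so no reflection runs during initialization:

import dagger.Component;
import dagger.Module;
import dagger.Provides;
import javax.inject.Singleton;
import software.amazon.awssdk.services.dynamodb.DynamoDbClient;

@Module
class AwsModule {
    // Provider methods are resolved at compile time; Dagger generates
    // plain factory classes instead of reflecting at runtime
    @Provides
    @Singleton
    static DynamoDbClient dynamoDbClient() {
        return DynamoDbClient.create();
    }
}

@Singleton
@Component(modules = AwsModule.class)
interface AppComponent {
    DynamoDbClient dynamoDbClient();
}

// Usage, e.g. in a handler constructor:
// DynamoDbClient ddb = DaggerAppComponent.create().dynamoDbClient();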

The size of our Lambda function also affects the cold start duration: the more classes and libraries included, the longer the cold start. By reducing our Lambda functions to include only the libraries they actually need, we were able to reduce the cold start duration. For example, instead of including the entire AWS SDK, we only included the client libraries for the AWS services that a given Lambda function uses.

We also experimented with the memory allocated to our Lambda functions and saw improvements when increasing it beyond 512 MB. The increased memory gave the function more network bandwidth and CPU capacity, leading to faster cold starts and execution. Increasing our memory allocation from 512 MB to 3008 MB reduced our cold start duration from 6.3 seconds to 3.6 seconds running on the JVM.

Cold start duration chart according to memory allocation

As we reaped the benefits of a serverless model, we wanted to see if we could use Lambda and Java for HTTP APIs. However, since HTTP APIs are synchronous, clients calling the APIs would experience the occasional cold start and need to wait up to 3.6 seconds for a response. We needed to further reduce cold starts to meet sub-second service level response times.

We found that GraalVM’s native-image tool had the potential to mitigate our cold starts by reducing our Lambda footprint and bypassing the JVM. The native image builder processes all elements of the application and statically analyzes them to remove classes, methods, and parameters that are unreachable during application execution. This reduces the memory usage and jar size. It then performs an ahead-of-time compilation of the reachable code and data into a native executable. The native executable uses a fraction of the resources the JVM would, with a faster startup.

The same Lambda function with 3008 MB of memory that took 3.6 seconds to start with the JVM started in under 100 milliseconds once compiled to a native executable using GraalVM’s native-image tool. We were able to achieve sub-second response times as shown in the graph below and, as an added bonus, were also able to reduce the memory size and cost.

Cold start duration with native images chart

One caveat to building native images is that reflection is only partially supported. The native-image tool attempts to resolve the elements of the application; elements that it cannot resolve automatically must be configured manually in a reflect-config.json file, which specifies the program elements that will be accessed reflectively.
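For example, a reflect-config.json entry registering a single class for reflective access might look like this (the class name is illustrative):

[
  {
    "name": "com.example.Item",
    "allDeclaredConstructors": true,
    "allDeclaredMethods": true,
    "allDeclaredFields": true
  }
]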

The Micronaut framework provides a @ReflectiveAccess annotation to help ease this process. Micronaut is an open source, modern, full-stack framework for building microservices and serverless applications, designed to be cloud native from the start. Because Micronaut does not rely on reflection or dynamic class loading, it works automatically with GraalVM native-image.

The @ReflectiveAccess annotation can be declared on a specific type, constructor, method or field to enable reflective access just for the annotated element. The annotation processor will then programmatically register a reflection configuration for those annotated elements.
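For instance, annotating a model class (an illustrative one here) registers it in the generated reflection configuration so that native-image keeps its members reachable:

import io.micronaut.core.annotation.ReflectiveAccess;

// All constructors, methods, and fields of Item are registered for
// reflective access in the generated GraalVM configuration
@ReflectiveAccess
public class Item {
    private String itemId;
    private String name;

    public String getItemId() { return itemId; }
    public void setItemId(String itemId) { this.itemId = itemId; }
    public String getName() { return name; }
    public void setName(String name) { this.name = name; }
}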

Micronaut also includes helpful AWS integrations in its Micronaut AWS library, which allows easy integration between our Lambda functions and Amazon API Gateway or an Application Load Balancer (ALB). This integration provides MicronautLambdaHandler, a Lambda handler that routes ALB or API Gateway requests to familiar REST controller classes with annotations similar to those of popular frameworks like Spring.

import io.micronaut.http.annotation.Body;
import io.micronaut.http.annotation.Controller;
import io.micronaut.http.annotation.Get;
import io.micronaut.http.annotation.Post;
import javax.validation.Valid;

@Controller
public class ItemController {
    … // injected dependencies, such as the ddb repository, elided

    // GET /items/{itemId} returns a single item
    @Get("/items/{itemId}")
    public Item getItem(String itemId) {
        return ddb.getItem(itemId);
    }

    // POST /items validates and stores the request body
    @Post("/items")
    public void postItem(@Valid @Body Item item) {
        ddb.saveItem(item);
    }
}

We can then set the handler to MicronautLambdaHandler in our CloudFormation template, and it will invoke our controller methods.

Api:
  Type: AWS::Serverless::Function
  Properties:
    FunctionName: "Api"
    Handler: io.micronaut.function.aws.proxy.MicronautLambdaHandler
    Runtime: provided.al2

Read more about this integration in the Micronaut guide.

Safe Deployments with AWS SAM

Using native AWS integrations and serverless already allowed us to provide business value quickly, but to further improve our velocity, we leveraged the AWS Serverless Application Model (AWS SAM).

AWS SAM is an open source framework for deploying serverless applications as a single versioned entity.

AWS SAM, being an extension of AWS CloudFormation, allows us to treat our infrastructure as code. We can deploy to multiple accounts in multiple regions and environments in a predictable and repeatable manner. AWS SAM provides a shorthand syntax, allowing us to declare serverless functions, APIs, databases, and event source mappings in fewer lines of CloudFormation. There’s less code to write and less code to maintain. In a few lines of AWS SAM, we can define a Lambda function using the java11 runtime and its associated Kinesis event trigger, cutting out the need for a much longer CloudFormation template.

TestConsumer:
    Type: AWS::Serverless::Function
    Properties:
      FunctionName: TestConsumer
      Handler: TestConsumer.Handler
      Runtime: java11
      CodeUri: ./test-consumer/target/test-consumer.jar
      MemorySize: 2048
      Events:
        Kinesis:
          Type: Kinesis
          Properties:
            Stream: !Ref TestStream
            StartingPosition: "LATEST"
            BatchSize: 100
            MaximumRecordAgeInSeconds: 3600 # 1 hour

Being able to produce performant APIs is great, but we also need to be able to iterate quickly and safely. One of the greatest benefits of AWS SAM is that it allows us to deploy quickly while reducing operational risk.

Properties:
  AutoPublishAlias: live
  DeploymentPreference:
    Type: Canary10Percent5Minutes
    Hooks:
      PreTraffic: TestConsumerBeforeTraffic
      PostTraffic: TestConsumerAfterTraffic
    Alarms:
      - !Ref TestConsumerBeforeTrafficAlarm
      - !Ref TestConsumerAfterTrafficAlarm

With a few lines of CloudFormation, AWS SAM allows us to deploy new versions of our Lambda function, and automatically creates aliases that point to the new version.

By specifying a deployment preference in AWS SAM, it integrates directly with AWS CodeDeploy. With this CodeDeploy integration, we can gradually shift traffic from the old version of the Lambda function to the new one. Gradual deployments come in various flavors, including Canary, which sends a small percentage of traffic to the new version and shifts the remainder after a set time. For example, by specifying Canary10Percent5Minutes, 10 percent of traffic is initially sent to the newly deployed version and, after 5 minutes, the remaining traffic is shifted over.

A list of alarms can be configured to notify CodeDeploy that there is an issue. When CodeDeploy is alerted, it will roll back to the previous version of the Lambda function.
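As an illustrative sketch, one of those alarms could be a standard CloudWatch alarm on the new version’s error metric; the metric and thresholds here are assumptions, not our actual configuration:

TestConsumerAfterTrafficAlarm:
  Type: AWS::CloudWatch::Alarm
  Properties:
    AlarmDescription: Errors observed after traffic shifted to the new version
    Namespace: AWS/Lambda
    MetricName: Errors
    Dimensions:
      - Name: FunctionName
        Value: !Ref TestConsumer
    Statistic: Sum
    Period: 60
    EvaluationPeriods: 1
    Threshold: 1
    ComparisonOperator: GreaterThanOrEqualToThreshold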

AWS SAM also allows us to define hooks: Lambda functions that execute before and after traffic is shifted to the new version of the Lambda. The hooks act as tests that exercise the new version and verify that its configuration is functional before and after traffic has been sent to it. If any test fails, the hook can report a CodeDeploy failure or trigger an alarm, causing a rollback to the previous version of the Lambda function. These end-to-end tests allow us to validate not only that the Lambda function code is functional, but that its other integrations and configurations are correct.
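A pre-traffic hook might look like the following sketch, which reports its result back to CodeDeploy. The smoke-test helper is hypothetical; the PutLifecycleEventHookExecutionStatus call is the standard way a hook signals success or failure:

import com.amazonaws.services.lambda.runtime.Context;
import com.amazonaws.services.lambda.runtime.RequestHandler;
import java.util.Map;
import software.amazon.awssdk.services.codedeploy.CodeDeployClient;
import software.amazon.awssdk.services.codedeploy.model.LifecycleEventStatus;

public class TestConsumerBeforeTraffic implements RequestHandler<Map<String, String>, Void> {
    private static final CodeDeployClient CODE_DEPLOY = CodeDeployClient.create();

    @Override
    public Void handleRequest(Map<String, String> event, Context context) {
        // CodeDeploy passes these IDs so the hook can report back
        String deploymentId = event.get("DeploymentId");
        String hookExecutionId = event.get("LifecycleEventHookExecutionId");

        // Exercise the new version here (hypothetical check)
        boolean testsPassed = runSmokeTests();

        // Report the result; FAILED triggers an automatic rollback
        CODE_DEPLOY.putLifecycleEventHookExecutionStatus(r -> r
                .deploymentId(deploymentId)
                .lifecycleEventHookExecutionId(hookExecutionId)
                .status(testsPassed ? LifecycleEventStatus.SUCCEEDED
                                    : LifecycleEventStatus.FAILED));
        return null;
    }

    private boolean runSmokeTests() { /* ... */ return true; }
}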

Leveraging AWS SAM has enabled the team to configure and add needed infrastructure in one place, simplifying both updates and maintenance of our infrastructure. This allows us to release end-user features more quickly and with fewer bugs.

Our use of serverless and open source technologies has improved our ability to deliver business value safely and reliably. Learn more about serverless at serverlessland.com and visit https://medium.com/disney-streaming to learn more about exciting technologies at Disney Streaming.

This article was written by Disney Streaming Services in collaboration with Amazon Web Services, and it does not constitute an endorsement of either party’s services.