Unlocking AWS Console: Diagnosing Errors with Amazon Q Developer

January 11, 2025 By Mark Otto

Introduction

Developers, IT Operators, and in some cases, Site Reliability Engineers (SREs) are responsible for deploying and operating infrastructure and applications, as well as responding to and resolving incidents effectively and in a timely manner. Effective incident management requires quick diagnosis, root cause analysis, and implementation of corrective actions. Diagnosing the root cause can be challenging in the context of modern systems that involve multiple resources deployed across distributed environments. Amazon Q Developer, a generative AI-powered assistant, can help simplify this process by diagnosing errors you receive in the AWS Management Console.

Amazon Q Developer can save you critical time when dealing with production issues by helping to diagnose errors related to your AWS environment. These errors can result from misconfiguration across multiple resources, and they usually require you to navigate between several service consoles to identify the root cause. Amazon Q Developer applies machine learning models to automate the diagnosis of errors that arise in the AWS Management Console. This reduces the mean time to repair (MTTR) and minimizes the impact of incidents on business operations.

This blog post explores the Amazon Q Developer feature that diagnoses errors in the AWS Console while you work with AWS services. We describe how this feature works in order to give you guidance on troubleshooting, and we take a look behind the scenes at the processes that power it.

Diagnose with Amazon Q

The Diagnose with Amazon Q feature is activated when an error occurs in the console for an AWS service that is currently supported by this functionality, and a user with appropriate permissions clicks the Diagnose with Amazon Q button next to the error message. Amazon Q provides a natural language explanation that analyzes the root cause of the error. With a second click on Help me resolve, Amazon Q displays an ordered list of instructions which can be used to resolve the error condition. Once completed, you can provide feedback on whether the resolution provided by Amazon Q was helpful.

To make things concrete, we consider two running examples.

Example 1: Assume that you try to delete an S3 bucket which is not empty. This results in an error message:

This bucket is not empty. Buckets must be empty before they can be deleted. To
delete all objects in the bucket, use the empty bucket configuration.

Example 2: Suppose that you try to list objects in a particular S3 bucket, but lack IAM permissions to do so. This results in an error message:

Insufficient permissions to list objects. After you or your AWS administrator has updated your permissions to allow the s3:ListBucket action, refresh the page. Learn more about Identity and access management in Amazon S3.

The user clicks the “Launch Instances” button in the EC2 service console in the AWS Management Console, enters all the required information, and clicks the “Launch Instance” button. This results in an “Instance launch failed” error appearing in the console, along with a “Diagnose with Amazon Q” button. The user clicks the button, which brings up a new window titled “Diagnose console errors with Amazon Q”. Soon an “Analysis” section appears with a message that describes, in natural language, the issue with IAM permissions to launch new EC2 instances. The user clicks the “Help me resolve” button. After a few seconds, a “Resolution” section appears with the steps to resolve the error.

Diagnose with Amazon Q: IAM permissions error when launching an EC2 instance

Behind the Scenes: How Amazon Q generates a diagnosis

When you click the Diagnose with Amazon Q button next to the error message in the AWS Management Console, Amazon Q generates an Analysis that expresses the root cause of the error in natural language. This step is assisted by Large Language Models (LLMs) and is based on context information only. The context provided to the LLM includes the error message shown in the console, the URL of the triggering action, and the IAM role of the user signed in to the AWS Console. The service always operates within the permissions granted by your role as you operate in the AWS Console, ensuring that privileges are never escalated beyond what is assigned to you.

When you click the Help me resolve button after you have reviewed the analysis, Amazon Q retrieves additional information about the state of the resources in the AWS account where the error occurred. This is accomplished by interrogating the customer account in various ways. In this phase, the system actively decides which information is still missing and issues interrogation requests against internal services to fill that gap. Interrogation is not needed for simple errors, such as Example 1 above, but becomes essential for resolving more complex errors, where information from the context proves insufficient.

Given the context, error analysis, user permissions, and results of account interrogation, Amazon Q generates step-by-step Resolution instructions. This step is assisted by LLMs.

After implementing and validating the steps provided by Amazon Q to resolve the error in the console, you have the option to provide feedback on your experience.

A flow diagram illustrating an error resolution process using Amazon Q. The process begins with an error. The user then diagnoses the issue with Amazon Q, which gets context information from the AWS Console and provides an Analysis. The user requests help to resolve the error. The system enriches the prompt by interrogating the signed-in user's account. The model generates step-by-step resolution instructions. These instructions go through a validation process before being presented to the user for implementation.

Diagram showing Interactions between User, AWS Console and Amazon Q Developer

Context Information

Contextual information helps the LLMs to generate more relevant and informed outputs. Context is provided to Amazon Q as input from the console automatically. As the basis for all further analysis and decisions, it should be as rich as possible. At a minimum, Amazon Q obtains the error message, the URL for the triggering action, and the IAM role that the signed-in user assumes. The system automatically extracts relevant identifiers from the context. In our running Example 1, the URL may be https://s3.console.aws.amazon.com/s3/bucket/my-bucket-123456/delete?region=us-west-2, from which Amazon Q extracts aws_region = "us-west-2" and s3_bucket_name = "my-bucket-123456".
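The identifier extraction described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the way Amazon Q implements it: the function name extract_identifiers and the rule of looking for a "bucket" or "buckets" path segment are hypothetical, chosen only to fit the two example URLs used in this post.

```python
from urllib.parse import parse_qs, urlparse


def extract_identifiers(console_url: str) -> dict:
    """Pull the region and S3 bucket name out of a console URL (illustrative only)."""
    parsed = urlparse(console_url)
    query = parse_qs(parsed.query)
    identifiers = {}

    # The region is carried as a query parameter in the example URLs.
    if "region" in query:
        identifiers["aws_region"] = query["region"][0]

    # Assumption: the bucket name is the path segment that follows
    # "bucket" or "buckets" in the S3 console URL.
    segments = [s for s in parsed.path.split("/") if s]
    for i, segment in enumerate(segments[:-1]):
        if segment in ("bucket", "buckets"):
            identifiers["s3_bucket_name"] = segments[i + 1]
            break

    return identifiers


url = "https://s3.console.aws.amazon.com/s3/bucket/my-bucket-123456/delete?region=us-west-2"
print(extract_identifiers(url))
```

A production system would need per-service URL patterns. The point here is only that the console URL already carries enough structure to recover concrete identifiers.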

Beyond this minimum context, Amazon Q can obtain additional information from the console about what the user sees on the screen when the error happens, such as the content of text fields or widgets in the current UI. Amazon Q can also make use of specific context provided by the underlying service. In the case of Example 2 above, the bucket name is extracted from the URL, the action s3:ListBucket from the error message, and Amazon Q may obtain additional information from IAM about related policies and their Allow or Deny statements.

Interrogating the signed-in user’s Account

Diagnose with Amazon Q is not just a passive receiver of context information; it has built-in capabilities for actively requesting additional information. This includes developing an understanding of the resources in the AWS account and their relationship with the resource experiencing the error. Such interrogation queries are planned by a subsystem based on context information. This subsystem provides a low-latency, deterministic approach to finding resources and their relationships. The relationship context provided to the LLM, such as EBS volumes attached to an EC2 instance or policies included in the attached IAM role, improves the accuracy of root cause analysis for diagnosing the error.

In the simple running Example 1, where the error is due to a non-empty S3 bucket, the error message and the console URL contain all the necessary information to proceed, and active interrogation is not required. On the other hand, for the IAM permission error in Example 2, it’s helpful to understand the permissions on the IAM role associated with the resource experiencing the error. Using internal IAM services, Amazon Q can fetch identity-level policies for the role and resource-level policies for the affected resource, and diagnose the cause of the error based on them. To be concrete, the URL for Example 2 may be https://s3.console.aws.amazon.com/s3/buckets/my-bucket-123456?region=us-west-2&bucketType=general&tab=objects, from which Amazon Q extracts the region and the S3 bucket name. It can also extract the action s3:ListBucket from the error message itself. Based on this information, Amazon Q can fetch bucket policies for my-bucket-123456 and identity-level policies for the role, then scan those for the presence or absence of the s3:ListBucket action, or call internal IAM services to provide additional information about the cause of access being denied.
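To illustrate what scanning policies for the presence or absence of an action might look like, here is a simplified Python sketch. It is not the logic Amazon Q uses, and it is far from a full IAM policy evaluation: it ignores Principal, Condition, NotAction and NotResource, and it approximates IAM wildcards with fnmatch. The function names and the three outcome labels are hypothetical.

```python
from fnmatch import fnmatchcase


def _as_list(value):
    """IAM allows a single string or a list for Action, Resource and Statement."""
    return value if isinstance(value, list) else [value]


def matching_effects(policy: dict, action: str, resource_arn: str) -> list:
    """Return the Effect of each statement that covers the action and resource."""
    effects = []
    for stmt in _as_list(policy.get("Statement", [])):
        actions = _as_list(stmt.get("Action", []))
        resources = _as_list(stmt.get("Resource", []))
        # Action names are compared case-insensitively; "*" and "?" act as wildcards.
        action_hit = any(fnmatchcase(action.lower(), a.lower()) for a in actions)
        resource_hit = any(fnmatchcase(resource_arn, r) for r in resources)
        if action_hit and resource_hit:
            effects.append(stmt.get("Effect"))
    return effects


def classify(policies: list, action: str, resource_arn: str) -> str:
    """Rough root-cause label: an explicit Deny wins, otherwise look for an Allow."""
    effects = [e for p in policies for e in matching_effects(p, action, resource_arn)]
    if "Deny" in effects:
        return "explicit_deny"
    if "Allow" in effects:
        return "allowed"
    return "missing_allow"


identity_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {"Effect": "Allow", "Action": "s3:GetObject",
         "Resource": "arn:aws:s3:::my-bucket-123456/*"}
    ],
}
print(classify([identity_policy], "s3:ListBucket", "arn:aws:s3:::my-bucket-123456"))
```

In this sample the role can read objects but has no statement covering s3:ListBucket on the bucket ARN, so the sketch reports a missing Allow. This is the root cause assumed in Example 2.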

This subsystem uses AWS Cloud Control API (CCAPI) which is called on your behalf by Amazon Q with the permissions granted by your IAM Role. As part of onboarding to Amazon Q, the AmazonQFullAccess managed policy is attached to the Role that can access Amazon Q. This managed policy contains the ListResources and GetResource CCAPI IAM permissions. This ensures all Roles given that managed policy will have access to the CCAPI read and list endpoints. If you do not attach the AmazonQFullAccess managed policy to the required roles, you will need to attach the ListResources and GetResource permission directly to the role.
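If you attach the two permissions directly, the policy document could look roughly like the one built below. This is a sketch under assumptions: Cloud Control API actions are authorized under the cloudformation: IAM prefix, the Sid is a made-up label, and "Resource": "*" is used for brevity. Check the current AmazonQFullAccess managed policy and the Cloud Control API documentation before relying on it.

```python
import json


def ccapi_read_policy() -> dict:
    """Build an IAM policy granting the Cloud Control API read and list actions."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # Hypothetical statement ID, chosen for readability.
                "Sid": "AllowCloudControlReadAndList",
                "Effect": "Allow",
                # Assumption: CCAPI actions use the "cloudformation:" prefix.
                "Action": [
                    "cloudformation:GetResource",
                    "cloudformation:ListResources",
                ],
                "Resource": "*",
            }
        ],
    }


print(json.dumps(ccapi_read_policy(), indent=2))
```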

Generating Step-by-step Resolution Instructions

At this point, all acquired information is synthesized by Amazon Q in order to generate useful and actionable resolution instructions. As an illustration, possible sample instructions for the running examples under consideration are listed below. As the models are updated and improved over time, the responses can change.

For Example 1, sample instructions could look like:

1. Navigate to the S3 console, click “Buckets”, and select the my-bucket-123456 bucket.
2. Click on the “Empty” tab.
3. If your bucket contains a large number of objects, creating a lifecycle rule to delete all objects in the bucket might be a more efficient way of emptying your bucket.
4. Type “permanently delete” in the text input field and confirm that all objects are to be removed.
5. Retry deleting the my-bucket-123456 S3 bucket.

For Example 2, you may obtain:

1. Go to the IAM console. Edit the IAM policy attached to the role ReadOnly.
2. Allow the s3:ListBucket action for the resource with the S3 bucket ARN arn:aws:s3:::my-bucket-123456.
3. Save the updated IAM policy.
4. Refresh the S3 console page to list the objects in the bucket my-bucket-123456.

Note that the instructions contain information inferred from the context, such as the bucket name my-bucket-123456, instead of placeholders. Instructions returned by Diagnose with Amazon Q are meant to be complete and fine-grained enough to be followed without extra effort. While the service uses an LLM to synthesize resolution instructions, Amazon Q also applies post-processing to correct frequently occurring mistakes. For example, in Example 2 above, the LLM may have returned the ARN as arn:aws:s3:<region>::<bucket_name>, which would be corrected to what is shown above.
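A post-processing rule of the kind described above can be sketched as a regular-expression rewrite. This is an illustrative guess at what such a correction might look like, not Amazon Q's actual post-processor. The function name fix_s3_bucket_arns is hypothetical, and the rule is deliberately narrow: it only rewrites S3 ARNs that carry a region but an empty account field, so access point ARNs, which include both, are left alone.

```python
import re

# Matches S3 ARNs with a non-empty region and an empty account field, e.g.
# arn:aws:s3:us-west-2::my-bucket-123456. S3 bucket ARNs leave both fields empty.
_BAD_S3_ARN = re.compile(r"arn:(aws[a-z-]*):s3:[a-z0-9-]+::(\S+)")


def fix_s3_bucket_arns(text: str) -> str:
    """Drop a spurious region from S3 bucket ARNs found in generated text."""
    return _BAD_S3_ARN.sub(r"arn:\1:s3:::\2", text)


step = "Allow s3:ListBucket on arn:aws:s3:us-west-2::my-bucket-123456"
print(fix_s3_bucket_arns(step))
```

Because the pattern requires a non-empty region, an ARN that is already correct passes through unchanged, so the rule can safely be applied to every generated instruction.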

The instructions returned for Example 2 above assume that the user cannot list objects because an Allow statement is missing from the policies attached to the ReadOnly role. Other root causes could be a Deny statement in a policy attached to the S3 bucket or to the ReadOnly role. Diagnose with Amazon Q can use account interrogation to identify the correct root cause and propose the right resolution. In the example above, it can fetch the policies attached to the ReadOnly role and check whether s3:ListBucket is indeed missing, or fetch the policies attached to the bucket my-bucket-123456.

Validation

One goal for Diagnose with Amazon Q is to attain wide coverage of AWS rapidly while keeping the quality bar high, so that you obtain useful, actionable advice wherever you encounter an error. An important prerequisite for this goal is a robust and flexible evaluation system. Evaluating systems based on generative AI is challenging due to the large output space (natural language) and non-deterministic behavior.

In a nutshell, our validation system is based on building a large dataset of errors, where each record has a certain number of annotations. Each record contains the context, namely the templatized error message and console URL, meaning that my-bucket-123456 is replaced by {{s3_bucket_name}} and us-west-2 by {{aws_region}}. Annotations include Infrastructure as Code (CloudFormation) descriptions of the erroneous account state and the triggering action, as well as ground truth responses obtained from expert annotators. These records allow us to simulate the behavior of variants of our system without human interaction and, by way of parallelization, many times faster than real time. We are also developing automated validation metrics for comparing ground truth annotations and system responses, based on which offline evaluations can be run fully automatically.
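As an illustration of the templatization step, the sketch below replaces concrete identifier values with {{name}} placeholders. It is an assumption about how such a step could be written, not a description of the actual validation pipeline, and the function name templatize is hypothetical.

```python
def templatize(text: str, identifiers: dict) -> str:
    """Replace concrete identifier values with {{name}} placeholders."""
    # Replace longer values first so that a short value contained in a
    # longer one does not break the longer match.
    ordered = sorted(identifiers.items(), key=lambda kv: len(kv[1]), reverse=True)
    for name, value in ordered:
        text = text.replace(value, "{{" + name + "}}")
    return text


url = "https://s3.console.aws.amazon.com/s3/bucket/my-bucket-123456/delete?region=us-west-2"
identifiers = {"s3_bucket_name": "my-bucket-123456", "aws_region": "us-west-2"}
print(templatize(url, identifiers))
```

Storing records in this form means that one annotated error can stand for every bucket and region in which the same failure occurs.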

This validation system allows us to rapidly validate new ideas by comparing them against the current state, while also guarding against regressions. While human experts are still needed to provide annotations of error records, we actively innovate to speed up and simplify these tasks by building annotation tools that avoid natural language input, have validations built in, and ask annotators to correct system output rather than provide ground truth annotations from scratch.

Conclusion

The Diagnose with Amazon Q feature of Amazon Q Developer allows you to determine the cause of an error in the AWS Console without needing to navigate to multiple service consoles. By providing tailored, step-by-step instructions specific to your AWS account and error context, Amazon Q Developer empowers you to troubleshoot and resolve issues efficiently. This helps your organization achieve greater operational efficiency, reduce downtime, improve service quality, and free up people to focus on higher-value activities. In this post, we also described how AI and machine learning capabilities work behind the scenes to enable this functionality.