Unlocking AWS Console: Diagnosing Errors with Amazon Q Developer
January 11, 2025
Introduction
Developers, IT Operators, and in some cases, Site Reliability Engineers (SREs) are responsible for deploying and operating infrastructure and applications, as well as responding to and resolving incidents effectively and in a timely manner. Effective incident management requires quick diagnosis, root cause analysis, and implementation of corrective actions. Diagnosing the root cause can be challenging in the context of modern systems that involve multiple resources deployed across distributed environments. Amazon Q Developer, a generative AI-powered assistant, can help simplify this process by diagnosing errors you receive in the AWS Management Console.
Amazon Q Developer can save you critical time when dealing with production issues by helping to diagnose errors related to your AWS environment. These errors could be the result of misconfiguration across multiple resources, and usually require you to navigate between several service consoles to identify the root cause. Amazon Q Developer applies machine learning models to automate the diagnosis of errors that arise in the AWS Console interface. This reduces the mean time to repair (MTTR) and minimizes the impact of incidents on business operations.
This blog post explores the Amazon Q Developer feature for diagnosing errors in the AWS Console while working with AWS services. We describe how this feature works in order to give you guidance on troubleshooting, and we take a look behind the scenes to show the processes that power it.
Diagnose with Amazon Q
The Diagnose with Amazon Q feature is activated when an error occurs in the console for an AWS service that is currently supported by this functionality, and a user with appropriate permissions clicks the Diagnose with Amazon Q button next to the error message. Amazon Q provides a natural language explanation that analyzes the root cause of the error. With a second click on Help me resolve, Amazon Q displays an ordered list of instructions which can be used to resolve the error condition. Once completed, you can provide feedback on whether the resolution provided by Amazon Q was helpful.
To make things concrete, we consider two running examples.
Example 1: Assume that you try to delete an S3 bucket which is not empty. This results in an error message:
This bucket is not empty. Buckets must be empty before they can be deleted. To
delete all objects in the bucket, use the empty bucket configuration.
Example 2: Suppose that you try to list objects in a particular S3 bucket, but lack IAM permissions to do so. This results in an error message:
Insufficient permissions to list objects. After you or your AWS administrator has updated your permissions to allow the s3:ListBucket
action, refresh the page. Learn more about Identity and access management in
Amazon S3.
Behind the Scenes: How Amazon Q generates a diagnosis
When you click the Diagnose with Amazon Q button next to the error message in the AWS Management Console, Amazon Q generates an Analysis that expresses the root cause of the error in natural language. This step is assisted by Large Language Models (LLMs) and is based on context information only. The context provided to the LLM includes the error message shown in the console, the URL of the triggering action, and the IAM role of the user signed in to the AWS Console. The service always operates within the permissions granted by your role as you operate in the AWS Console, ensuring that privileges are never escalated beyond what is assigned to you.
When you click the Help me resolve button after reviewing the analysis, Amazon Q retrieves additional information about the state of the resources in the AWS account where the error occurred. This is accomplished by interrogating the customer account in various ways. In this phase, the system actively decides which information is still missing and issues interrogation requests against internal services to fill the gap. Interrogation is not needed for simple errors, such as Example 1 above, but becomes essential for resolving more complex errors, where information from the context proves insufficient.
Given the context, error analysis, user permissions, and results of account interrogation, Amazon Q generates step-by-step Resolution instructions. This step is assisted by LLMs.
After implementing and validating the steps provided by Amazon Q to resolve the error in the console, you have the option to provide feedback on your experience.
Context Information
Contextual information helps the LLMs to generate more relevant and informed outputs. Context is provided to Amazon Q as input from the console automatically. As the basis for all further analysis and decisions, it should be as rich as possible. At a minimum, Amazon Q obtains the error message, the URL for the triggering action, and the IAM role that the signed-in user assumes. The system automatically extracts relevant identifiers from the context. In our running Example 1, the URL may be https://s3.console.aws.amazon.com/s3/bucket/my-bucket-123456/delete?region=us-west-2, from which Amazon Q extracts aws_region = "us-west-2" and s3_bucket_name = "my-bucket-123456".
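The identifier extraction described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not Amazon Q's actual implementation: the function name `extract_s3_context` is hypothetical, and the sketch assumes that the bucket name is the path segment following `bucket` or `buckets` and that the region is passed in the `region` query parameter, as in the example URLs in this post.

```python
from urllib.parse import urlparse, parse_qs


def extract_s3_context(console_url: str) -> dict:
    """Pull the region and bucket name out of an S3 console URL (illustrative only)."""
    parsed = urlparse(console_url)
    query = parse_qs(parsed.query)
    segments = [s for s in parsed.path.split("/") if s]

    context = {}
    if "region" in query:
        context["aws_region"] = query["region"][0]

    # Assumption: the bucket name directly follows a "bucket" or "buckets" segment.
    for i, segment in enumerate(segments[:-1]):
        if segment in ("bucket", "buckets"):
            context["s3_bucket_name"] = segments[i + 1]
            break
    return context


url = "https://s3.console.aws.amazon.com/s3/bucket/my-bucket-123456/delete?region=us-west-2"
print(extract_s3_context(url))
```

A production system would need per-service URL patterns; the point here is only that the identifiers can be recovered deterministically, without involving an LLM.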
Beyond this minimum context, Amazon Q can obtain additional information from the console about what the user sees on the screen when the error happens, such as the content of text fields or widgets in the current UI. Amazon Q can also make use of specific context provided by the underlying service. In the case of Example 2 above, the bucket name is extracted from the URL and the action s3:ListBucket from the error message, and Amazon Q may obtain additional information from IAM about related policies and their Allow or Deny statements.
Interrogating the signed-in user’s Account
The Diagnose with Amazon Q functionality is not just a passive receiver of context information; it has built-in capabilities for actively requesting additional information. This includes developing an understanding of the resources in the AWS account and their relationship with the resource experiencing the error. Such interrogation queries are planned by a subsystem based on context information. This subsystem provides a low-latency and deterministic approach to finding resources and their relationships. The relationship context provided to the LLM, such as the EBS volumes attached to an EC2 instance or the policies included in the attached IAM role, improves the accuracy of root cause analysis for diagnosing the error.
In the simple running Example 1, where the error is due to a non-empty S3 bucket, the error message and the console URL contain all the information necessary to proceed, and active interrogation is not required. On the other hand, for the IAM permission error in Example 2, it is helpful to understand the permissions on the IAM role associated with the resource experiencing the error. Using internal IAM services, Amazon Q can fetch identity-level policies for the role and resource-level policies for the affected resource, and diagnose the cause of the error based on them. To be concrete, the URL for Example 2 may be https://s3.console.aws.amazon.com/s3/buckets/my-bucket-123456?region=us-west-2&bucketType=general&tab=objects, from which Amazon Q extracts the region and the S3 bucket name. It can also extract the action s3:ListBucket from the error message itself. Based on this information, Amazon Q can fetch the bucket policies for my-bucket-123456 and the identity-level policies for the role, then scan those for the presence or absence of the s3:ListBucket action, or call internal IAM services to provide additional information about the cause of access being denied.
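To illustrate what scanning policies for the presence or absence of an action might look like, here is a minimal Python sketch. It is not how Amazon Q or IAM evaluates policies: the function names are hypothetical, and the sketch deliberately ignores `Resource`, `Principal`, `Condition`, and `NotAction`, looking only at the `Effect` and `Action` elements of each statement.

```python
from fnmatch import fnmatchcase


def _as_list(value):
    """IAM policy JSON allows a single item or a list for Statement and Action."""
    if value is None:
        return []
    return value if isinstance(value, list) else [value]


def matching_effects(policy: dict, action: str) -> set:
    """Collect the Effect of every statement whose Action pattern matches `action`."""
    effects = set()
    for statement in _as_list(policy.get("Statement")):
        patterns = _as_list(statement.get("Action"))
        # Action names are case-insensitive and may contain * and ? wildcards.
        if any(fnmatchcase(action.lower(), p.lower()) for p in patterns):
            effects.add(statement.get("Effect"))
    return effects


def classify_access(identity_policies, resource_policies, action: str) -> str:
    """Coarse classification: an explicit Deny wins, otherwise look for any Allow."""
    effects = set()
    for policy in [*identity_policies, *resource_policies]:
        effects |= matching_effects(policy, action)
    if "Deny" in effects:
        return "explicit_deny"
    if "Allow" not in effects:
        return "missing_allow"
    return "allow_present"


read_only_policy = {
    "Version": "2012-10-17",
    "Statement": [{"Effect": "Allow", "Action": "s3:GetObject", "Resource": "*"}],
}
print(classify_access([read_only_policy], [], "s3:ListBucket"))
```

Distinguishing a missing Allow from an explicit Deny matters because the two lead to different resolution steps: adding a statement in one case, removing or narrowing one in the other.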
This subsystem uses the AWS Cloud Control API (CCAPI), which Amazon Q calls on your behalf with the permissions granted by your IAM role. As part of onboarding to Amazon Q, the AmazonQFullAccess managed policy is attached to the role that can access Amazon Q. This managed policy contains the ListResources and GetResource CCAPI IAM permissions. This ensures that all roles given that managed policy have access to the CCAPI read and list endpoints. If you do not attach the AmazonQFullAccess managed policy to the required roles, you will need to attach the ListResources and GetResource permissions directly to those roles.
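If you attach these permissions directly, the policy document might look like the one built by the sketch below. Treat it as an assumption-laden illustration: the post does not show the policy itself, the `cloudformation:` action prefix reflects how Cloud Control API actions are commonly named in IAM and should be verified against the Service Authorization Reference, and the function name, `Sid`, and `"Resource": "*"` are choices made here for the example.

```python
import json


def ccapi_read_policy() -> dict:
    """Build a policy document granting the two CCAPI read/list actions (illustrative)."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "AllowCloudControlReadAndList",  # hypothetical statement ID
                "Effect": "Allow",
                # Assumption: Cloud Control API actions use the cloudformation: prefix.
                "Action": [
                    "cloudformation:GetResource",
                    "cloudformation:ListResources",
                ],
                "Resource": "*",  # assumption; scope down where your setup allows
            }
        ],
    }


print(json.dumps(ccapi_read_policy(), indent=2))
```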
Generating Step-by-step Resolution Instructions
At this point, all acquired information is synthesized by Amazon Q in order to generate useful and actionable resolution instructions. As an illustration, possible sample instructions for the running examples under consideration are listed below. As the models are updated and improved over time, the responses can change.
For Example 1, sample instructions could look like:
- Navigate to the S3 console, click “Buckets”, and select the my-bucket-123456 bucket
- Click on the “Empty” tab.
- If your bucket contains a large number of objects, creating a lifecycle rule to delete all objects in the bucket might be a more efficient way of emptying your bucket
- Type “permanently delete” in text input field and confirm that all objects are to be removed.
- Retry deleting the my-bucket-123456 S3 bucket.
For Example 2, you may obtain:
- Go to the IAM console. Edit the IAM policy attached to the role ReadOnly
- Allow for the s3:ListBucket action for resource being the S3 bucket ARN arn:aws:s3:::my-bucket-123456.
- Save the updated IAM policy
- Refresh the S3 console page to list the objects in the bucket my-bucket-123456
Note that the instructions contain information inferred from the context, such as the bucket name my-bucket-123456, instead of placeholders. Instructions returned by Diagnose with Amazon Q are complete and fine-grained enough to be followed without any extra effort. In fact, while the service makes use of an LLM to synthesize resolution instructions, Amazon Q uses post-processing to correct frequently occurring mistakes. For example, in Example 2 above, the LLM may have returned the ARN as arn:aws:s3:<region>::<bucket_name>, which would be corrected to what is shown above.
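As an illustration of this kind of post-processing, the sketch below normalizes an S3 bucket ARN by dropping any region or account field, since bucket ARNs have the form arn:aws:s3:::bucket-name. The function name and regular expression are hypothetical and are not taken from Amazon Q; the sketch also assumes that ARNs for other S3 resource types, such as access points, legitimately carry a region and account and should be left alone.

```python
import re

# arn:<partition>:s3:<region>:<account>:<resource>
_S3_ARN = re.compile(
    r"arn:(?P<partition>[^:\s]*):s3:(?P<region>[^:\s]*):(?P<account>[^:\s]*):(?P<resource>\S+)"
)

# Assumption: these resource types keep their region and account fields.
_REGIONAL_PREFIXES = ("accesspoint/", "job/", "storage-lens/")


def normalize_s3_bucket_arn(arn: str) -> str:
    """Blank out the region and account fields of an S3 bucket or object ARN."""
    match = _S3_ARN.fullmatch(arn.strip())
    if match is None or match.group("resource").startswith(_REGIONAL_PREFIXES):
        return arn  # not something this sketch knows how to fix
    partition = match.group("partition") or "aws"
    return f"arn:{partition}:s3:::{match.group('resource')}"


print(normalize_s3_bucket_arn("arn:aws:s3:us-west-2::my-bucket-123456"))
```

A deterministic fix like this is cheap to apply after generation and avoids asking the LLM to regenerate an otherwise correct answer.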
The instructions returned for Example 2 above assume that the reason the user cannot list objects is a missing Allow statement in the policies attached to the ReadOnly role. Other root causes could be a Deny statement in a policy attached to the S3 bucket or to the ReadOnly role. Diagnose with Amazon Q can use account interrogation to identify the correct root cause and propose the right resolution. In the example above, it can fetch the policies attached to the ReadOnly role and check whether s3:ListBucket is indeed missing, or fetch the policies attached to the bucket my-bucket-123456.
Validation
One goal for Diagnose with Amazon Q is to attain wide coverage of AWS rapidly while keeping the quality bar high, so that you obtain useful, actionable advice wherever you encounter an error. An important prerequisite for this goal is a robust and flexible evaluation system. Evaluating systems based on Generative AI is challenging due to the large output space (natural language) and non-deterministic behavior.
In a nutshell, our validation system is based on building a large dataset of errors, where each record has a certain number of annotations. Each record contains the context, that is, the templatized error message and console URL, in which concrete identifiers are replaced by placeholders (for example, my-bucket-123456 is replaced by {{s3_bucket_name}} and us-west-2 by {{aws_region}}). Annotations include Infrastructure as Code (CloudFormation) descriptions of the erroneous account state and the triggering action, as well as ground truth responses obtained from expert annotators. These records allow us to simulate the behaviour of variants of our system without human interaction and many times faster than real time (by way of parallelization). We are also developing automated validation metrics for comparing ground truth annotations and system responses, based on which offline evaluations can be run fully automatically.
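The templatization of a record's context can be pictured with a short Python sketch. The function name `templatize` and the plain string replacement are assumptions made for illustration; the post does not describe how the dataset is actually built.

```python
from typing import Mapping


def templatize(text: str, identifiers: Mapping[str, str]) -> str:
    """Replace concrete identifier values with {{name}} placeholders."""
    # Replace longer values first so a short value cannot clobber part of a longer one.
    ordered = sorted(identifiers.items(), key=lambda item: len(item[1]), reverse=True)
    for name, value in ordered:
        if value:
            text = text.replace(value, "{{" + name + "}}")
    return text


url = "https://s3.console.aws.amazon.com/s3/bucket/my-bucket-123456/delete?region=us-west-2"
identifiers = {"s3_bucket_name": "my-bucket-123456", "aws_region": "us-west-2"}
print(templatize(url, identifiers))
```

Storing the template rather than the concrete values lets the same record be replayed against freshly created test resources with different names and regions.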
This validation system allows us to rapidly validate new ideas by comparing them against the current state, while also guarding against regressions. While human experts are still needed to provide annotations of error records, we actively innovate to speed up and simplify these tasks by building annotation tools that avoid natural language input, have validations built in, and ask annotators to correct system output rather than provide ground truth annotations from scratch.
Conclusion
The Diagnose with Amazon Q feature of Amazon Q Developer allows you to determine the cause of an error in the AWS Console without needing to navigate to multiple service consoles. By providing tailored, step-by-step instructions specific to your AWS account and error context, Amazon Q Developer empowers you to troubleshoot and resolve issues efficiently. This helps your organization achieve greater operational efficiency, reduce downtime, improve service quality, and free up valuable human resources, enabling them to focus on higher-value activities. This post also described how AI and machine learning capabilities work behind the scenes to enable this functionality.