AWS CloudWatch Investigation
AWS CloudWatch Investigation is a code AI skill that provides reusable
patterns for investigating production incidents with CloudWatch Logs, Metrics,
and Alarms. It helps developers triage incidents faster by automating
repetitive query and correlation steps.
mkdir -p ./skills/aws-cloudwatch-investigation && curl -sfL https://raw.githubusercontent.com/github/awesome-copilot/main/skills/aws-cloudwatch-investigation/SKILL.md -o ./skills/aws-cloudwatch-investigation/SKILL.md
Run in terminal / PowerShell. Requires curl (Unix) or PowerShell 5+ (Windows).
Skill Content
# AWS CloudWatch Investigation Skill
Reusable patterns for investigating production incidents using CloudWatch Logs, Metrics, and Alarms. These patterns are designed to be composed together during incident triage.
---
Pattern 1: Logs Insights Query Templates
Error Spike Detection
Find the top errors in a time window, grouped by error type:
fields @timestamp, @message, @logStream
| filter @message like /(?i)(error|exception|fatal|critical)/
| stats count(*) as errorCount by bin(5m), @logStream
| sort errorCount desc
| limit 20
P99 Latency Breakdown by Operation
Identify which operations are driving latency spikes:
fields @timestamp, @duration, operation
| filter ispresent(@duration)
| stats avg(@duration) as avgMs,
    pct(@duration, 50) as p50Ms,
    pct(@duration, 95) as p95Ms,
    pct(@duration, 99) as p99Ms,
    count(*) as invocations
    by operation
| sort p99Ms desc
| limit 15
Lambda Cold Start Detection
Quantify cold start impact during an incident:
fields @timestamp, @duration, @initDuration, @memorySize, @maxMemoryUsed
| filter ispresent(@initDuration)
| stats count(*) as coldStarts,
    avg(@initDuration) as avgInitMs,
    max(@initDuration) as maxInitMs,
    avg(@duration) as avgDurationMs
    by bin(5m)
| sort @timestamp desc
Out-of-Memory (OOM) Detection
Find Lambda functions or containers killed by memory pressure:
fields @timestamp, @message, @logStream, @memorySize, @maxMemoryUsed
| filter @message like /Runtime exited|out of memory|OOMKilled|Cannot allocate memory|MemoryError/
| stats count(*) as oomEvents by @logStream, bin(10m)
| sort oomEvents desc
| limit 10
For memory utilization trending before OOM:
fields @timestamp, @maxMemoryUsed, @memorySize
| filter ispresent(@maxMemoryUsed)
| stats max(@maxMemoryUsed / @memorySize * 100) as peakMemPct,
    avg(@maxMemoryUsed / @memorySize * 100) as avgMemPct
    by bin(5m)
| sort @timestamp desc
Timeout Detection
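Find invocations that hit the configured timeout. The 28000 ms threshold flags invocations that ran longer than 28 seconds; adjust it to sit just under your function's configured timeout: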
fields @timestamp, @duration, @logStream, @requestId
| filter @message like /Task timed out/ or @duration > 28000
| stats count(*) as timeouts by @logStream, bin(5m)
| sort timeouts desc
---
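The templates in this pattern are plain query strings, so they can be parameterized before being submitted. The following is a minimal sketch in Python, not part of the skill itself: `build_timeout_query` and `start_query_params` are hypothetical helper names, and the 2-second default margin is an assumption that reproduces the gap between 28000 ms and a 30-second timeout. The dictionary keys match the request parameters of the CloudWatch Logs `StartQuery` API (times in epoch seconds), but nothing here calls AWS.

```python
from datetime import datetime, timedelta, timezone


def build_timeout_query(timeout_seconds: int, margin_ms: int = 2000) -> str:
    """Return the timeout-detection template with a threshold just under the timeout.

    margin_ms is an assumed safety margin, not a value taken from the skill.
    """
    threshold_ms = timeout_seconds * 1000 - margin_ms
    if threshold_ms <= 0:
        raise ValueError("margin_ms must be smaller than the configured timeout")
    return (
        "fields @timestamp, @duration, @logStream, @requestId\n"
        f"| filter @message like /Task timed out/ or @duration > {threshold_ms}\n"
        "| stats count(*) as timeouts by @logStream, bin(5m)\n"
        "| sort timeouts desc"
    )


def start_query_params(log_group: str, query: str, end: datetime,
                       lookback_minutes: int = 60) -> dict:
    """Build keyword arguments for a StartQuery request over a lookback window."""
    start = end - timedelta(minutes=lookback_minutes)
    return {
        "logGroupName": log_group,
        "startTime": int(start.timestamp()),  # epoch seconds
        "endTime": int(end.timestamp()),      # epoch seconds
        "queryString": query,
    }
```

With boto3 available, the resulting dictionary could be passed as `client.start_query(**params)` on a CloudWatch Logs client; keeping the builder free of AWS calls makes it easy to check the generated query text before running it.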
Pattern 2: Alarm History to Deploy-Event Correlation
Process
1. **Get alarm transition time** — note the exact timestamp when the alarm entered ALARM state.
2. **Query CloudTrail** for deployment-related events in a window of [alarm_time - 30min, alarm_time]:
# CloudTrail Lake query for deployment events
SELECT eventTime, eventName, userIdentity.arn, requestParameters
FROM <event-data-store-id>
WHERE eventTime > '<alarm_time_minus_30m>'
AND eventTime < '<alarm_time>'
AND eventName IN (
'UpdateFunctionCode', 'UpdateFunctionConfiguration',
'UpdateService', 'CreateDeployment', 'RegisterTaskDefinition',
'CreateChangeSet', 'ExecuteChangeSet',
'StartPipelineExecution', 'PutImage'
)
ORDER BY eventTime DESC
3. **Correlation criteria:** a deploy is "correlated" if:
- It targets the same service/resource as the alarm
- It completed within 15 minutes before the alarm transition
- The deployer identity matches a CI/CD role (not a human applying a hotfix)
4. **Strengthening the correlation:**
- Check if the same alarm was healthy in the previous deployment cycle
- Verify no other environmental changes (scaling events, config changes) in the same window
- Look for canary/synthetic monitor failures that started at the same time
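The search window from step 2 and the checks from steps 3 and 4 can be sketched as follows. This is an illustration in Python under stated assumptions, not a prescribed implementation: `DeployEvent`, `search_window`, and `classify_correlation` are hypothetical names, the `ci_role_markers` default is an assumed naming convention for CI/CD roles, and the WEAK and NONE labels are assumptions that extend the STRONG label used in the skill's sample output.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass
class DeployEvent:
    """Fields taken from one CloudTrail deployment event (hypothetical container)."""
    event_name: str
    event_time: datetime
    actor_arn: str
    resource_arn: str


def search_window(alarm_time: datetime, lookback: timedelta = timedelta(minutes=30)):
    """Return (start, end) for the CloudTrail query: [alarm_time - 30min, alarm_time]."""
    return alarm_time - lookback, alarm_time


def classify_correlation(event: DeployEvent, alarm_time: datetime,
                         alarm_resource_arn: str, alarm_ok_prior_cycle: bool,
                         ci_role_markers=("github-actions", "codepipeline", "codebuild")) -> str:
    """Apply the three correlation criteria, then one strengthening check."""
    same_resource = event.resource_arn == alarm_resource_arn
    lead = alarm_time - event.event_time
    within_15_min = timedelta(0) <= lead <= timedelta(minutes=15)
    # Assumed heuristic: a CI/CD role is recognized by a marker in the actor ARN.
    ci_actor = any(marker in event.actor_arn.lower() for marker in ci_role_markers)

    if not (same_resource and within_15_min and ci_actor):
        return "NONE"
    # Strengthening check from step 4: the alarm was healthy in the previous cycle.
    return "STRONG" if alarm_ok_prior_cycle else "WEAK"
```

The other strengthening checks (no competing environmental changes, matching canary failures) need data from additional sources, so they are left out of this sketch.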
Output Format
Deploy Correlation:
Event: UpdateFunctionCode
Time: 2024-03-15T14:23:07Z (12 min before alarm)
Actor: arn:aws:sts::123456789012:assumed-role/github-actions-deploy/session
Resource: arn:aws:lambda:us-east-1:123456789012:function:payment-processor
Correlation: STRONG — same resource, CI/CD actor, alarm was OK prior cycle
---
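A report in the Output Format shown above can be assembled from the fields of the correlated event. The sketch below assumes Python; `format_deploy_correlation` is a hypothetical helper, and truncating the time difference to whole minutes is an assumption about how the "min before alarm" figure is derived.

```python
from datetime import datetime, timezone


def format_deploy_correlation(event_name: str, event_time: datetime, alarm_time: datetime,
                              actor_arn: str, resource_arn: str,
                              strength: str, reasons: list) -> str:
    """Render one deploy-correlation finding as plain text."""
    # Whole minutes between the deploy event and the alarm transition.
    minutes_before = int((alarm_time - event_time).total_seconds() // 60)
    # Timestamps are rendered in UTC with a trailing Z, as in the sample output.
    stamp = event_time.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    separator = " \u2014 "  # same dash separator as the sample output
    return "\n".join([
        "Deploy Correlation:",
        f"Event: {event_name}",
        f"Time: {stamp} ({minutes_before} min before alarm)",
        f"Actor: {actor_arn}",
        f"Resource: {resource_arn}",
        f"Correlation: {strength}{separator}{', '.join(reasons)}",
    ])
```

Both timestamps must be timezone-aware `datetime` values; mixing naive and aware values raises a `TypeError` on subtraction.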
Pattern 3: Narrow the Blast Radius Decision Tree
Use this tree to sy
🎯 Best For
- GitHub Copilot users
- Claude users
- Software engineers
- Development teams
- Tech leads
💡 Use Cases
- Code quality improvement
- Best practice enforcement
📖 How to Use This Skill
1. Install the Skill
   Copy the install command from the Terminal tab and run it. The SKILL.md file downloads to your local skills directory.
2. Load into Your AI Assistant
   Open GitHub Copilot or Claude and reference the skill. Paste the SKILL.md content or use the system prompt tab.
3. Apply AWS CloudWatch Investigation to Your Work
   Open your project in the AI assistant and ask it to apply the skill. Start with a small module to verify the output quality.
4. Review and Refine
   Review AI suggestions before committing. Run tests, check for regressions, and iterate on the skill output.
❓ Frequently Asked Questions
Is AWS CloudWatch Investigation compatible with Cursor and VS Code?
Yes — this skill works with any AI coding assistant including Cursor, VS Code with Copilot, and JetBrains IDEs.
Do I need specific dependencies for AWS CloudWatch Investigation?
The install command needs only curl (Unix) or PowerShell 5+ (Windows). Beyond that, most code skills require only the AI assistant and your codebase.
How do I install AWS CloudWatch Investigation?
Copy the install command from the Terminal tab and run it. The skill downloads to ./skills/aws-cloudwatch-investigation/SKILL.md, ready to use.
Can I customize this skill for my team?
Absolutely. Edit the SKILL.md file to add team-specific instructions, examples, or workflows.
⚠️ Common Mistakes to Avoid
Skipping validation
Always test AI-generated code changes, even for simple refactors.
Missing dependency updates
Check if the skill requires updated dependencies or new packages.