Mayur Rathi
@github
⭐ 34.1k GitHub stars

AWS CloudWatch Investigation

AWS CloudWatch Investigation is a code AI skill that provides reusable patterns for investigating production incidents with CloudWatch Logs, Metrics, and Alarms. It is aimed at developers who need to triage incidents quickly and repeatably.


Last verified on: 2026-06-28
mkdir -p ./skills/aws-cloudwatch-investigation && curl -sfL https://raw.githubusercontent.com/github/awesome-copilot/main/skills/aws-cloudwatch-investigation/SKILL.md -o ./skills/aws-cloudwatch-investigation/SKILL.md

Run in terminal / PowerShell. Requires curl (Unix) or PowerShell 5+ (Windows).

Skill Content

# AWS CloudWatch Investigation Skill


Reusable patterns for investigating production incidents using CloudWatch Logs, Metrics, and Alarms. These patterns are designed to be composed together during incident triage.


---


## Pattern 1: Logs Insights Query Templates


### Error Spike Detection

Find the log streams that emit the most error-like messages in a time window, counted in 5-minute bins:

```text
fields @timestamp, @message, @logStream
| filter @message like /(?i)(error|exception|fatal|critical)/
| stats count(*) as errorCount by bin(5m), @logStream
| sort errorCount desc
| limit 20
```
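To run a template like this outside the console, a script can submit it and poll for the result. The following is a minimal sketch, assuming a boto3 CloudWatch Logs client is passed in as `logs_client`; the helper name `run_insights_query`, the default one-hour window, and the polling limits are illustrative choices, not part of the skill.

```python
import time
from datetime import datetime, timedelta, timezone

# The error spike template from above, unchanged.
ERROR_SPIKE_QUERY = """\
fields @timestamp, @message, @logStream
| filter @message like /(?i)(error|exception|fatal|critical)/
| stats count(*) as errorCount by bin(5m), @logStream
| sort errorCount desc
| limit 20
"""


def run_insights_query(logs_client, log_group, query, end=None,
                       window_minutes=60, poll_seconds=1.0, max_polls=60):
    """Run a Logs Insights query and return each result row as a dict.

    `logs_client` is expected to behave like boto3.client("logs").
    """
    end = end or datetime.now(timezone.utc)
    start = end - timedelta(minutes=window_minutes)

    # start_query takes the time range as epoch seconds.
    query_id = logs_client.start_query(
        logGroupName=log_group,
        startTime=int(start.timestamp()),
        endTime=int(end.timestamp()),
        queryString=query,
    )["queryId"]

    for _ in range(max_polls):
        response = logs_client.get_query_results(queryId=query_id)
        status = response["status"]
        if status == "Complete":
            # Each row is a list of {"field": ..., "value": ...} pairs.
            return [
                {cell["field"]: cell["value"] for cell in row}
                for row in response["results"]
            ]
        if status in ("Failed", "Cancelled", "Timeout"):
            raise RuntimeError(f"query {query_id} ended with status {status}")
        time.sleep(poll_seconds)

    raise TimeoutError(f"query {query_id} still running after {max_polls} polls")
```

A call might look like `run_insights_query(boto3.client("logs"), "/aws/lambda/payment-processor", ERROR_SPIKE_QUERY)`, where the log group name is a hypothetical example. All values come back as strings, so `errorCount` needs an `int()` conversion before any arithmetic.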

### P99 Latency Breakdown by Operation

Identify which operations are driving latency spikes:

```text
fields @timestamp, @duration, operation
| filter ispresent(@duration)
| stats avg(@duration) as avgMs,
        pct(@duration, 50) as p50Ms,
        pct(@duration, 95) as p95Ms,
        pct(@duration, 99) as p99Ms,
        count(*) as invocations
  by operation
| sort p99Ms desc
| limit 15
```

### Lambda Cold Start Detection

Quantify cold start impact during an incident:

```text
fields @timestamp, @duration, @initDuration, @memorySize, @maxMemoryUsed
| filter ispresent(@initDuration)
| stats count(*) as coldStarts,
        avg(@initDuration) as avgInitMs,
        max(@initDuration) as maxInitMs,
        avg(@duration) as avgDurationMs
  by bin(5m)
| sort @timestamp desc
```

### Out-of-Memory (OOM) Detection

Find Lambda functions or containers killed by memory pressure:

```text
fields @timestamp, @message, @logStream, @memorySize, @maxMemoryUsed
| filter @message like /Runtime exited|out of memory|OOMKilled|Cannot allocate memory|MemoryError/
| stats count(*) as oomEvents by @logStream, bin(10m)
| sort oomEvents desc
| limit 10
```

For memory utilization trending before OOM:

```text
fields @timestamp, @maxMemoryUsed, @memorySize
| filter ispresent(@maxMemoryUsed)
| stats max(@maxMemoryUsed / @memorySize * 100) as peakMemPct,
        avg(@maxMemoryUsed / @memorySize * 100) as avgMemPct
  by bin(5m)
| sort @timestamp desc
```
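The `@duration`, `@initDuration`, `@memorySize`, and `@maxMemoryUsed` fields used in these queries are discovered from the `REPORT` line that Lambda writes at the end of each invocation. The sketch below parses one such line by hand to show the same memory-percentage arithmetic and cold-start check on a single invocation. The sample line, its request ID, and the helper name `parse_report_line` are made up for illustration.

```python
import re

# Matches the tab-separated metrics on a Lambda REPORT line.
# "Init Duration" only appears on cold starts, so that group is optional.
REPORT_PATTERN = re.compile(
    r"Duration:\s+(?P<duration_ms>[\d.]+)\s+ms\s+"
    r"Billed Duration:\s+(?P<billed_ms>[\d.]+)\s+ms\s+"
    r"Memory Size:\s+(?P<memory_mb>\d+)\s+MB\s+"
    r"Max Memory Used:\s+(?P<max_used_mb>\d+)\s+MB"
    r"(?:\s+Init Duration:\s+(?P<init_ms>[\d.]+)\s+ms)?"
)

# Illustrative sample, not taken from a real function.
SAMPLE_REPORT = (
    "REPORT RequestId: 00000000-0000-0000-0000-000000000000\t"
    "Duration: 102.25 ms\tBilled Duration: 103 ms\t"
    "Memory Size: 128 MB\tMax Memory Used: 96 MB\t"
    "Init Duration: 250.50 ms"
)


def parse_report_line(line):
    """Return duration, memory percentage, and cold-start info, or None."""
    match = REPORT_PATTERN.search(line)
    if match is None:
        return None
    init_ms = match["init_ms"]
    memory_pct = int(match["max_used_mb"]) / int(match["memory_mb"]) * 100
    return {
        "duration_ms": float(match["duration_ms"]),
        "memory_pct": round(memory_pct, 1),
        "cold_start": init_ms is not None,
        "init_ms": float(init_ms) if init_ms is not None else None,
    }


print(parse_report_line(SAMPLE_REPORT))
```

A function that repeatedly reports a memory percentage close to 100 is a likely candidate for the OOM events the first query counts.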

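### Timeout Detection

Find invocations that hit the configured timeout. The `@duration > 28000` clause appears to assume a 30-second timeout with a 2-second margin; adjust the threshold to match your function's setting:

```text
fields @timestamp, @duration, @logStream, @requestId
| filter @message like /Task timed out/ or @duration > 28000
| stats count(*) as timeouts by @logStream, bin(5m)
| sort timeouts desc
```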
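Because the duration threshold in this query depends on the function's timeout, it can help to generate the query rather than hard-code it. This is a minimal sketch, assuming the threshold should sit a fixed margin below the timeout; the helper name `build_timeout_query` and the 2000 ms default margin are assumptions chosen so that a 30-second timeout gives a threshold of 28000.

```python
def build_timeout_query(timeout_seconds, margin_ms=2000, bin_minutes=5):
    """Build the timeout-detection query for a given configured timeout.

    The threshold is the timeout in milliseconds minus `margin_ms`, so
    invocations that ran close to the limit are also counted.
    """
    threshold_ms = int(timeout_seconds * 1000) - margin_ms
    if threshold_ms <= 0:
        raise ValueError("margin_ms must be smaller than the timeout")
    return "\n".join([
        "fields @timestamp, @duration, @logStream, @requestId",
        f"| filter @message like /Task timed out/ or @duration > {threshold_ms}",
        f"| stats count(*) as timeouts by @logStream, bin({bin_minutes}m)",
        "| sort timeouts desc",
    ])


print(build_timeout_query(30))
```

The configured timeout itself is available in seconds as the `Timeout` field returned by the Lambda `GetFunctionConfiguration` API, so the value does not need to be typed in by hand.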

---


## Pattern 2: Alarm History to Deploy-Event Correlation

### Process


1. **Get alarm transition time** — note the exact timestamp when the alarm entered ALARM state.

2. **Query CloudTrail** for deployment-related events in a window of [alarm_time - 30min, alarm_time]:


   ```text
   -- CloudTrail Lake query for deployment events
   SELECT eventTime, eventName, userIdentity.arn, requestParameters
   FROM <event-data-store-id>
   WHERE eventTime > '<alarm_time_minus_30m>'
     AND eventTime < '<alarm_time>'
     AND eventName IN (
       'UpdateFunctionCode', 'UpdateFunctionConfiguration',
       'UpdateService', 'CreateDeployment', 'RegisterTaskDefinition',
       'CreateChangeSet', 'ExecuteChangeSet',
       'StartPipelineExecution', 'PutImage'
     )
   ORDER BY eventTime DESC
   ```

3. **Correlation criteria** — a deploy is "correlated" if:

- It targets the same service/resource as the alarm

- It completed within 15 minutes before the alarm transition

- The deployer identity matches a CI/CD role (not a human applying a hotfix)


4. **Strengthening the correlation:**

- Check if the same alarm was healthy in the previous deployment cycle

- Verify no other environmental changes (scaling events, config changes) in the same window

- Look for canary/synthetic monitor failures that started at the same time
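The window and criteria in the steps above can be sketched as a small classifier. This is an illustration under stated assumptions, not the skill's own implementation: the `DeployEvent` type, the `CICD_ROLE_MARKERS` substrings, and the `WEAK` and `NONE` labels are hypothetical, CloudTrail's `eventTime` stands in for deploy completion time, and the canary check from step 4 is left out.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone

LOOKBACK = timedelta(minutes=30)           # CloudTrail search window (step 2)
CORRELATED_WITHIN = timedelta(minutes=15)  # correlation threshold (step 3)
# Hypothetical substrings that identify CI/CD roles; replace with your own.
CICD_ROLE_MARKERS = ("github-actions", "codepipeline", "codebuild")


@dataclass(frozen=True)
class DeployEvent:
    event_name: str
    event_time: datetime  # CloudTrail eventTime, used as a proxy for completion
    actor_arn: str
    resource_arn: str


def search_window(alarm_time):
    """Return the [alarm_time - 30min, alarm_time] window from step 2."""
    return alarm_time - LOOKBACK, alarm_time


def is_correlated(event, alarm_time, alarm_resource_arn):
    """Apply the three step-3 criteria: resource, timing, and actor."""
    lead = alarm_time - event.event_time
    return (
        event.resource_arn == alarm_resource_arn
        and timedelta(0) <= lead <= CORRELATED_WITHIN
        and any(marker in event.actor_arn for marker in CICD_ROLE_MARKERS)
    )


def grade(event, alarm_time, alarm_resource_arn,
          healthy_prior_cycle, other_changes_in_window):
    """Combine step 3 with two of the step-4 strengthening signals."""
    if not is_correlated(event, alarm_time, alarm_resource_arn):
        return "NONE"
    if healthy_prior_cycle and not other_changes_in_window:
        return "STRONG"
    return "WEAK"


# Alarm time is assumed to be exactly 12 minutes after the deploy.
alarm_time = datetime(2024, 3, 15, 14, 35, 7, tzinfo=timezone.utc)
deploy = DeployEvent(
    event_name="UpdateFunctionCode",
    event_time=datetime(2024, 3, 15, 14, 23, 7, tzinfo=timezone.utc),
    actor_arn="arn:aws:sts::123456789012:assumed-role/github-actions-deploy/session",
    resource_arn="arn:aws:lambda:us-east-1:123456789012:function:payment-processor",
)
print(grade(deploy, alarm_time, deploy.resource_arn,
            healthy_prior_cycle=True, other_changes_in_window=False))
```

Matching CI/CD roles by substring is deliberately crude. An explicit allow-list of role ARNs would be more reliable in an account where role names do not follow a convention.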


### Output Format

```text
Deploy Correlation:
  Event: UpdateFunctionCode
  Time: 2024-03-15T14:23:07Z (12 min before alarm)
  Actor: arn:aws:sts::123456789012:assumed-role/github-actions-deploy/session
  Resource: arn:aws:lambda:us-east-1:123456789012:function:payment-processor
  Correlation: STRONG — same resource, CI/CD actor, alarm was OK prior cycle
```

---


## Pattern 3: Narrow the Blast Radius Decision Tree


Use this tree to sy

🎯 Best For

  • GitHub Copilot users
  • Claude users
  • Software engineers
  • Development teams
  • Tech leads

💡 Use Cases

  • Code quality improvement
  • Best practice enforcement

📖 How to Use This Skill

  1. Install the Skill

     Copy the install command from the Terminal tab and run it. The SKILL.md file downloads to your local skills directory.

  2. Load into Your AI Assistant

     Open GitHub Copilot or Claude and reference the skill. Paste the SKILL.md content or use the system prompt tab.

  3. Apply AWS CloudWatch Investigation to Your Work

     Open your project in the AI assistant and ask it to apply the skill. Start with a small module to verify the output quality.

  4. Review and Refine

     Review AI suggestions before committing. Run tests, check for regressions, and iterate on the skill output.

❓ Frequently Asked Questions

Is AWS CloudWatch Investigation compatible with Cursor and VS Code?

Yes — this skill works with any AI coding assistant including Cursor, VS Code with Copilot, and JetBrains IDEs.

Do I need specific dependencies for AWS CloudWatch Investigation?

Check the install command and Works With section. Most code skills only require the AI assistant and your codebase.

How do I install AWS CloudWatch Investigation?

Copy the install command from the Terminal tab and run it. The skill downloads to ./skills/aws-cloudwatch-investigation/SKILL.md, ready to use.

Can I customize this skill for my team?

Absolutely. Edit the SKILL.md file to add team-specific instructions, examples, or workflows.

⚠️ Common Mistakes to Avoid

Skipping validation

Always test AI-generated code changes, even for simple refactors.

Missing dependency updates

Check if the skill requires updated dependencies or new packages.
