Friday, 27 February 2026

AWS production log analysis with Claude in Cursor — a step-by-step guide

 Goal: let Claude (Anthropic) help you explore, summarize, triage, and root-cause production logs from AWS while working inside the Cursor IDE (or a Cursor + Claude workflow). This guide assumes you have an AWS production environment that emits logs to CloudWatch / S3 and that you can configure Cursor to use an Anthropic API key (or use a Cursor extension that exposes Claude).


Quick architecture overview (what you build)

  1. Log sources: EC2 / ECS / EKS application logs, Lambda logs, ALB/ELB access logs, RDS logs, CloudTrail, VPC Flow Logs.

  2. Collection / centralization: CloudWatch Logs (native), Kinesis Data Streams / Firehose into S3, or direct delivery (ALB → S3).

  3. Indexing & query layer (optional but recommended): CloudWatch Logs Insights for immediate queries; send long-term logs to S3 + Athena / OpenSearch for powerful searches.

  4. Preprocessing / enrichment: Lambda / Glue jobs to parse JSON, enrich with metadata (service, pod, trace-id), and redact secrets.

  5. Cursor + Claude: connect Cursor to an Anthropic API key or install a Cursor-Claude extension so you can paste query results, open log snippets, or stream structured samples to Claude for summarization and RCA.


Step 1 — Gather logs (fast, low friction)

  1. For application logs already in CloudWatch Logs, open CloudWatch → Log groups.

  2. For access logs that write to S3 (ALB/NLB), ensure the target S3 bucket has lifecycle rules for retention.

  3. If you want a streaming pipeline: configure Amazon Data Firehose (formerly Kinesis Data Firehose) to deliver to S3 as JSON, or as Parquet with record format conversion enabled, and optionally to OpenSearch / Splunk.

Why: CloudWatch Logs gives instant ad-hoc querying; S3 + Athena/Glue is cheaper for long-term analytics.


Step 2 — Prepare a secure sample set to send to Claude

Important security note: Do not send PII, secrets, auth tokens, or production credentials to any external LLM without enterprise agreements and data handling policies. Redact or anonymize values (user IDs, IPs, emails, tokens) before sending. If you must send PII for authorized internal use, ensure your Anthropic contract and Cursor deployment are approved. (I’m assuming you’ll redact locally first.)

Redaction pattern examples (simple):

  • Replace emails: s/[\w.+-]+@[\w-]+\.[\w.-]+/[REDACTED_EMAIL]/g

  • Replace IPs: s/\b\d{1,3}(\.\d{1,3}){3}\b/[REDACTED_IP]/g

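  • Replace UUIDs/IDs: s/\b[0-9a-fA-F]{8}(-[0-9a-fA-F]{4}){3}-[0-9a-fA-F]{12}\b/[REDACTED_ID]/g (matches canonical 8-4-4-4-12 UUIDs; add patterns for any other ID formats you use)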
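
The same idea can be sketched in Python, which Step 6 suggests as one option for the local redaction script. This is a minimal sketch, not a complete PII scrubber: the function name `redact` and the pattern list are illustrative, and the ID rule assumes your identifiers are canonical 8-4-4-4-12 UUIDs.

```python
import re

# (pattern, replacement) pairs, applied in order to every line.
REDACTIONS = [
    (re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"), "[REDACTED_EMAIL]"),
    # Assumption: IDs are canonical UUIDs (8-4-4-4-12 hex groups).
    (re.compile(r"\b[0-9a-fA-F]{8}(?:-[0-9a-fA-F]{4}){3}-[0-9a-fA-F]{12}\b"), "[REDACTED_ID]"),
    (re.compile(r"\b\d{1,3}(?:\.\d{1,3}){3}\b"), "[REDACTED_IP]"),
]


def redact(line: str) -> str:
    """Return the line with emails, UUIDs, and IPv4 addresses masked."""
    for pattern, replacement in REDACTIONS:
        line = pattern.sub(replacement, line)
    return line
```

Run it over every exported line before anything leaves your machine, and extend the list with patterns for tokens or customer identifiers specific to your log format.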


Step 3 — Extract useful slices (what to send)

When you ask an LLM to analyze logs, smaller high-value slices work best. Create extracts like:

  • A timeline: the last N minutes of logs from the affected service (sorted).

  • One example error trace (full stack) with surrounding 50 lines context.

  • Aggregated counts: the top 10 error messages with counts, and the top 10 endpoints with latency above X ms.

  • Correlation keys: logs that share the same trace-id or request-id.
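
If your exported logs are JSON lines, two of these slices (aggregated error counts and per-trace timelines) can be built locally before you paste anything. The sketch below assumes each event has `level`, `message`, `traceId`, and an ISO-8601 `timestamp` field; those field names are hypothetical, so rename them to match your schema.

```python
import json
from collections import Counter, defaultdict


def build_slices(raw_lines, top_n=10):
    """Return (top error messages with counts, events grouped by traceId)."""
    error_counts = Counter()
    by_trace = defaultdict(list)
    for raw in raw_lines:
        try:
            event = json.loads(raw)
        except json.JSONDecodeError:
            continue  # skip non-JSON lines rather than guess their structure
        if not isinstance(event, dict):
            continue
        if event.get("level") == "ERROR":
            error_counts[event.get("message", "")] += 1
        trace_id = event.get("traceId")
        if trace_id:
            by_trace[trace_id].append(event)
    # Assumption: ISO-8601 timestamps, which sort correctly as strings.
    for events in by_trace.values():
        events.sort(key=lambda e: e.get("timestamp", ""))
    return error_counts.most_common(top_n), dict(by_trace)
```

The first return value gives you the "top error messages" table; the second lets you pick one trace and send only its timeline.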

Example CloudWatch Logs Insights queries:

# errors in last 15 minutes
fields @timestamp, @message, service, traceId
| filter @message like /ERROR/ or @message like /Exception/
| sort @timestamp desc
| limit 200

Or aggregated:

filter @message like /ERROR/
| stats count(*) as hits, count_distinct(traceId) as traces by bin(5m)

Run the query, export top results to a file (CSV / JSON), redact, and copy into Cursor.
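
If you would rather script this step than use the console, the CloudWatch Logs API exposes `StartQuery` and `GetQueryResults`. The sketch below assumes you pass in a client created with `boto3.client("logs")`; the function name, polling interval, and poll limit are illustrative choices, not recommendations.

```python
import time


def run_insights_query(logs_client, log_group, query, minutes=15,
                       poll_seconds=1.0, max_polls=60):
    """Run a Logs Insights query and return its rows as a list of dicts."""
    end = int(time.time())
    start = end - minutes * 60
    query_id = logs_client.start_query(
        logGroupName=log_group,
        startTime=start,  # StartQuery takes epoch seconds
        endTime=end,
        queryString=query,
    )["queryId"]
    for _ in range(max_polls):
        response = logs_client.get_query_results(queryId=query_id)
        status = response["status"]
        if status == "Complete":
            # Each row is a list of {"field": ..., "value": ...} cells.
            return [{cell["field"]: cell["value"] for cell in row}
                    for row in response["results"]]
        if status in ("Failed", "Cancelled", "Timeout"):
            raise RuntimeError(f"query ended with status {status}")
        time.sleep(poll_seconds)
    raise TimeoutError("query did not complete within the polling limit")
```

Write the returned rows to a JSON file, redact that file, and only then copy the content into Cursor.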


Step 4 — Configure Cursor to use Claude (quick)

Option A: Cursor built-in model selection. Add your Anthropic API key in Cursor's settings under Models (exact menu names vary between Cursor versions), then choose the Claude model you prefer, for example an Opus or Sonnet variant.

Option B: Cursor extension. Install a community Cursor-Claude extension if your workspace allows it (some companies use internal installs). Community repos and packages document how to install such an extension. Prefer official options where available, and review any third-party extension before giving it an API key.


Step 5 — Prompts & interactions: how to ask Claude to analyze logs

Below are practical, reusable prompt templates. Paste one into Cursor’s Claude chat or the extension, then paste the redacted log snippet.

Template A — Quick summary

I’m pasting a redacted log sample from production for service "payments". Please:
1) Give a short summary of what’s happening (2–3 sentences).
2) List the most likely root causes (ranked).
3) Suggest 3 next troubleshooting steps I should run (commands or queries).
Now here's the redacted log snippet:
-----
<paste logs>
-----
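
If you fill Template A from a script instead of by hand, a small helper can keep the pasted sample within a size limit. This is a sketch under assumptions: `build_summary_prompt` and `max_chars` are hypothetical names, the character cap is arbitrary, and the input lines are assumed to be already redacted and in chronological order.

```python
SUMMARY_TEMPLATE = """I'm pasting a redacted log sample from production for service "{service}". Please:
1) Give a short summary of what's happening (2-3 sentences).
2) List the most likely root causes (ranked).
3) Suggest 3 next troubleshooting steps I should run (commands or queries).
Now here's the redacted log snippet:
-----
{logs}
-----"""


def build_summary_prompt(service, log_lines, max_chars=20000):
    """Fill Template A, keeping the most recent lines that fit in max_chars."""
    kept, used = [], 0
    for line in reversed(log_lines):  # walk newest-first
        if used + len(line) + 1 > max_chars:
            break
        kept.append(line)
        used += len(line) + 1  # +1 for the joining newline
    kept.reverse()  # restore chronological order
    return SUMMARY_TEMPLATE.format(service=service, logs="\n".join(kept))
```

Dropping the oldest lines first keeps the events closest to the error, which is usually the part worth reading.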

Template B — Correlate traces & explain

I have multiple log lines that share traceId = 12345-abc. Summarize the timeline of events for this trace in plain English, highlight errors, and map which service/component likely introduced the error. Provide a one-paragraph RCA hypothesis and 4 tactical next steps.
<redacted trace logs>

Template C — Generate CloudWatch Insights queries

Given these sample logs and the problem (e.g., "intermittent 502s from /api/checkout"), produce a CloudWatch Logs Insights query to:
- show top endpoints returning 5xx in last 30 minutes,
- group by availability zone,
- show counts and 95th percentile latency where present.
Also provide a short explanation for each part of the query.
<sample log schema: timestamp, @message, statusCode, path, latencyMs, az>

Step 6 — Example workflow (hands-on)

  1. Run CloudWatch Insights to get the top 200 ERROR lines for payments in the last 15 minutes. Export JSON.

  2. Run a local redaction script (simple Python or sed) to hide IPs, emails, tokens.

  3. Open Cursor → start a new Claude chat → paste this prompt (Template A) + the redacted sample.

  4. Ask Claude follow-ups: “Which log lines show a latency increase before the error?” or “Write a Bash snippet that fetches full logs for traceId X from CloudWatch via awscli.”

  5. Use Claude’s answer to craft next CloudWatch queries or to produce a short incident summary for Slack / PagerDuty.


Step 7 — Automating parts of the flow

You can automate repeatable steps while keeping human-in-the-loop controls:

  • Lambda / Step Functions: when a CloudWatch alarm fires, a Step Functions workflow extracts a 5-minute log window, runs a redaction Lambda, stores the sample in S3, and notifies a human to paste it into Cursor/Claude.

  • Notebook + Cursor: use a Jupyter notebook (or Cursor code cells) that runs boto3 to fetch logs, runs redaction, and then opens a prompt template prefilled in Cursor.

  • ChatOps: generate an incident summary draft automatically with Claude, then require human approval before sending to Slack.
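
The extract, redact, and store stage of such a flow might look like the sketch below. It assumes boto3-style clients (`boto3.client("logs")` and `boto3.client("s3")`) are passed in, that the caller supplies its own `redact` function, and that the `samples/` key prefix and the `approved` flag are hypothetical conventions for the human review step.

```python
import json


def extract_sample(logs_client, s3_client, log_group, bucket,
                   alarm_time_ms, redact, window_minutes=5):
    """Store a redacted log window ending at the alarm time; return the S3 key."""
    start_ms = alarm_time_ms - window_minutes * 60 * 1000
    response = logs_client.filter_log_events(
        logGroupName=log_group,
        startTime=start_ms,  # FilterLogEvents takes epoch milliseconds
        endTime=alarm_time_ms,
        limit=500,  # arbitrary cap to keep the sample small
    )
    lines = [redact(event["message"]) for event in response["events"]]
    key = f"samples/{alarm_time_ms}.json"
    payload = {
        "log_group": log_group,
        "approved": False,  # a human flips this before anything is sent to Claude
        "lines": lines,
    }
    s3_client.put_object(Bucket=bucket, Key=key,
                         Body=json.dumps(payload).encode("utf-8"))
    return key
```

Nothing here calls an LLM: the function only prepares a sample, so the decision to share it stays with a person.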


Step 8 — Example concrete commands

Fetch logs by traceId with awscli:

# Search every stream in the log group for the trace ID (uses GNU date syntax)
aws logs filter-log-events \
  --log-group-name "/aws/ecs/payments" \
  --start-time $(($(date +%s -d '15 minutes ago')*1000)) \
  --filter-pattern '"12345-abc"'
# For JSON-formatted events, a stricter pattern is: '{ $.traceId = "12345-abc" }'

Download CloudWatch Logs Insights query results from the console, or fetch them with the SDK (StartQuery / GetQueryResults), then redact locally and paste into Cursor.
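
A single `filter-log-events` call can return a partial result with a `nextToken`, so a scripted version should keep paging until the token disappears. The sketch below does that with a boto3-style client passed in by the caller; the function name is illustrative, and the quoted filter pattern assumes the trace ID appears literally in the log text.

```python
def fetch_trace_events(logs_client, log_group, trace_id, start_ms, end_ms=None):
    """Collect all events mentioning trace_id, following nextToken pagination."""
    kwargs = {
        "logGroupName": log_group,
        "startTime": start_ms,  # epoch milliseconds
        "filterPattern": f'"{trace_id}"',  # quoted term matches the literal string
    }
    if end_ms is not None:
        kwargs["endTime"] = end_ms
    events = []
    while True:
        page = logs_client.filter_log_events(**kwargs)
        events.extend(page.get("events", []))
        token = page.get("nextToken")
        if not token:
            break
        kwargs["nextToken"] = token
    # Sort across streams so the result reads as one timeline.
    return sorted(events, key=lambda e: e["timestamp"])
```

Sorting at the end matters because events from different log streams can arrive out of chronological order.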


Step 9 — What Claude is good at here (and what to avoid)

Good at:

  • Summarizing large, messy log snippets into a human-readable timeline.

  • Producing suggested queries, investigative steps, and hypothesis generation.

  • Drafting incident summaries, runbooks, and remediation checklists.

What to avoid:

  • Blindly trusting any LLM RCA — always verify with observability, metrics, and traces.

  • Sending unredacted PII or sensitive logs to a third-party model without approvals.

  • Replacing structured alerting / runbook automation with ad-hoc LLM prompts.


Step 10 — Ops, costs, and governance

  • Cost: Claude API calls are billed per token, for both input and output. Keep samples small and structured (aggregates plus representative examples) instead of sending everything.

  • Retention & compliance: ensure logs sent to Claude comply with your company’s data handling and any regulatory rules (GDPR, PCI, etc.).

  • Access control: only allow approved engineers to use the Anthropic key in Cursor. Rotate keys and audit usage logs.
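
To get a feel for what a sample will cost before sending it, a back-of-the-envelope estimate is often enough. The sketch below is deliberately crude: the 4-characters-per-token ratio is an assumption, not a property of any tokenizer, and the price is a parameter you must fill in from your own Anthropic pricing because no real price is hard-coded here.

```python
import math


def estimate_tokens(text, chars_per_token=4.0):
    """Rough input-token estimate; the ratio is an assumption, not a tokenizer."""
    return math.ceil(len(text) / chars_per_token)


def estimate_input_cost_usd(text, usd_per_million_input_tokens):
    """Estimate input cost for one request at a caller-supplied price."""
    return estimate_tokens(text) * usd_per_million_input_tokens / 1_000_000
```

Treat the result as an order-of-magnitude figure for comparing "send 200 lines" against "send 20,000 lines", and rely on your actual usage reports for real numbers.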


Appendix — Example prompts & followups (copy/paste ready)

Short RCA prompt

Describe the sequence of events in these logs (redacted) and provide a one-sentence root cause hypothesis plus three immediate remediation steps. Only use evidence present in the logs and mark any assumptions.
<redacted logs>

Ask for a query

Write a CloudWatch Logs Insights query that shows the top 10 error messages and the number of unique traces for each in the last 1 hour.

Follow-up to Claude

List the exact awscli commands I should run next to fetch full traces for the top 3 traceIds you identified above.

Final checklist before you send a snippet to Claude

  • Redact PII & secrets.

  • Include minimal context: service name, time window, an example traceId.

  • Attach or paste only focused extracts (timeline + example error).

  • Keep a human reviewer in the loop for any suggested remediation that touches production.
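
The first checklist item can be backed by a last-line check that scans the snippet for anything the redaction step missed. This is a sketch with a handful of illustrative patterns (the check names and the `find_leaks` function are hypothetical); a clean result does not prove the text is safe, only that these specific patterns did not match.

```python
import re

CHECKS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "ipv4": re.compile(r"\b\d{1,3}(?:\.\d{1,3}){3}\b"),
    # AWS access key IDs start with AKIA (long-term) or ASIA (temporary).
    "aws_access_key_id": re.compile(r"\b(?:AKIA|ASIA)[0-9A-Z]{16}\b"),
    "bearer_token": re.compile(r"Bearer\s+[A-Za-z0-9._~+/-]+=*", re.IGNORECASE),
}


def find_leaks(text):
    """Return the sorted names of checks that still match the text."""
    return sorted(name for name, pattern in CHECKS.items() if pattern.search(text))
```

An empty list means none of the patterns matched; a non-empty list names what to fix before you paste.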


Closing notes / recommended next steps

  1. Start by manually pasting 1–2 redacted log snippets into Cursor/Claude to observe quality of answers.

  2. Build a safe redaction pipeline (Lambda or CI script).

  3. If the approach is useful, automate extraction + human approval and add audit logging for compliance.

If you want, I can:

  • provide a ready-to-use redaction script (Python) that matches your log format, or

  • draft a few CloudWatch Insights queries tailored to your service (tell me the field names you have: e.g., statusCode, path, latencyMs, traceId) — I’ll generate them right away.
