Friday, 27 February 2026

GCP Console — Production Log Analysis (step-by-step)

Using Claude.ai Cursor for conversational / LLM-assisted analysis

This article shows a practical, end-to-end workflow for investigating production logs from Google Cloud Console (Cloud Logging / Log Explorer), exporting them, and using Claude.ai Cursor to query, summarize, and produce actionable findings. It’s written as a sequence of clear steps you can follow now.


1) Goal & quick summary

Goal: quickly find, explore, and analyze production issues using GCP Log Explorer, export the logs you need (e.g., to BigQuery or CSV), then use Claude.ai Cursor to ask natural-language questions, detect anomalies, generate summaries, and produce runbook-style recommendations.

High-level flow:

  1. Identify logs in GCP Console → filter with Logging Query Language (LQL).

  2. Export/save relevant log slices (BigQuery sink or CSV).

  3. Use Claude.ai Cursor to load the data (or connect to BigQuery) and interactively analyze it with prompts and code cells.

  4. Produce findings, visualizations, and suggested remediation steps.


2) Prerequisites & access

  • GCP project access with Logging Viewer (or higher) role for the target project. For exports, Logs Configuration Writer or BigQuery Data Editor permissions may be required.

  • Cloud Logging (formerly Stackdriver Logging) is enabled and your services are writing logs.

  • A Claude.ai account with Cursor enabled (ability to connect/upload files or to connect to BigQuery / cloud storage).

  • Optional: BigQuery dataset to receive exported logs, or permission to download CSVs from Log Explorer.


3) Step A — Narrow down logs in GCP Console (Log Explorer)

  1. Open Cloud Console → Navigation menu → Logging → Log Explorer.

  2. Set the project (top-left) to the production project.

  3. Choose a time range (top-right). Start wide (last 24 hrs) then narrow to the window of the incident.

  4. Use the resource and log filters:

    • Resource: e.g., Kubernetes Container, GCE VM Instance, Cloud Run Revision, Cloud Function.

    • Log name: application logs, stdout, stderr, requests, or syslog.

  5. Build an LQL query (examples below). Use field="value" filters and severity:

    • Example — errors for a service:

      resource.type="k8s_container"
      resource.labels.namespace_name="prod"
      logName="projects/PROJECT_ID/logs/stdout"
      severity>=ERROR
    • Example — 500s in an HTTP server (if structured):

      jsonPayload.status>=500
      resource.type="cloud_run_revision"
  6. Run the query, inspect sample log entries on the right. Use the Expand pane to view full JSON payloads.


4) Step B — Refine & extract fields

  • Use field extraction in the Log Explorer: click a JSON payload and copy or add a derived field (e.g., user_id, trace, request_id, latency_ms).

  • Use PARSE functions or REGEXP_EXTRACT in the Logging Query Language to pull structured fields from unstructured text when needed.

  • Example of extracting a numeric latency from jsonPayload:

    jsonPayload.latencyMs = CAST(REGEXP_EXTRACT(textPayload, r"latency=(\d+)") AS INT64)

(Exact functions depend on whether you're exporting to BigQuery or using LQL features.)


5) Step C — Export logs for deeper analysis

You have two main options:

Option 1 — Export to BigQuery (recommended for large-scale analysis)

  1. In Log Explorer, click Create export (or go to Logging → Logs Router).

  2. Create a sink:

    • Sink service: BigQuery dataset.

    • Choose filter: the LQL you refined above (only export relevant logs).

    • Destination dataset: your_project.your_dataset.logs_prod.

  3. Confirm and create the sink. Logs matching the filter will be streamed into the BigQuery table (append).

Advantages: scalable, fast SQL queries, works well with Cursor if Cursor can connect to BigQuery (recommended).
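The same sink can also be created from code with the google-cloud-logging client. The sketch below is illustrative only: the sink name, filter, and project/dataset values are placeholders, not real resources.

```python
# Sketch: create a BigQuery log sink with the google-cloud-logging client.
# Names and the filter below are placeholders; adjust to your refined LQL.

LOG_FILTER = (
    'resource.type="k8s_container" '
    'resource.labels.namespace_name="prod" '
    'severity>=ERROR'
)

def sink_destination(project: str, dataset: str) -> str:
    """Build the destination URI Cloud Logging expects for a BigQuery dataset."""
    return f"bigquery.googleapis.com/projects/{project}/datasets/{dataset}"

def create_sink(project: str, dataset: str, name: str = "prod-errors-sink"):
    # Requires `pip install google-cloud-logging` and the
    # Logs Configuration Writer role on the project.
    from google.cloud import logging as gcl
    client = gcl.Client(project=project)
    sink = client.sink(name, filter_=LOG_FILTER,
                       destination=sink_destination(project, dataset))
    if not sink.exists():
        sink.create()
    return sink
```

Creating the sink in code makes it easy to tear down again once the incident is over.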

Option 2 — Download a CSV / JSON from Log Explorer (ad-hoc)

  1. From Log Explorer results, click Download → CSV or JSON for the current query/time range.

  2. This is suitable for small slices or immediate one-off investigations.


6) Step D — Prepare data for Claude.ai Cursor

  • If you exported to BigQuery, note the table name and ensure Cursor can connect (or you can export a table snapshot to CSV).

  • If using CSV/JSON, upload it into Claude.ai Cursor (Cursor supports file upload and interactive code cells).

  • Clean data as required: convert timestamps, parse fields, remove PII (mask user identifiers), and sample if dataset is huge.
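As a sketch of that cleaning step, here is a standard-library-only pass over a downloaded CSV: normalize timestamps to UTC, mask user identifiers, and optionally downsample. The column names (timestamp, user_id, message) are assumptions about your export.

```python
# Sketch: normalize timestamps, mask user identifiers, and sample rows
# from a downloaded log CSV. Column names are assumptions about the export.
import csv, hashlib, io, random
from datetime import datetime, timezone

def mask(value: str) -> str:
    """Replace a user identifier with a stable, non-reversible token."""
    return "user_" + hashlib.sha256(value.encode()).hexdigest()[:8]

def clean_rows(reader, sample_rate=1.0, seed=0):
    rng = random.Random(seed)
    for row in reader:
        if rng.random() > sample_rate:
            continue  # downsample huge datasets
        ts = datetime.fromisoformat(row["timestamp"])
        row["timestamp"] = ts.astimezone(timezone.utc).isoformat()
        row["user_id"] = mask(row["user_id"])
        yield row

raw = io.StringIO(
    "timestamp,user_id,message\n"
    "2026-02-26T10:15:00+01:00,alice@example.com,DB timeout\n"
)
cleaned = list(clean_rows(csv.DictReader(raw)))
```

Hashing (rather than deleting) the identifier preserves the ability to correlate entries for one user without exposing who they are.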


7) Step E — Use Claude.ai Cursor: practical examples & prompt templates

Below are concrete prompts and examples you can paste into Claude.ai Cursor. Treat Cursor like an analyst: show it the table/CSV or give it a BigQuery connection plus the table name.

A) Quick human-readable summary

Prompt

I uploaded prod_logs_2026-02-26.csv. Give me a short summary of the main error types, top affected services, and any spikes in errors over time. Show counts by error type and by service and produce a 3-line executive summary.

B) Find top offending requests

Prompt

In the dataset, find the top 10 request_ids that produced the most ERROR or CRITICAL entries. For each request_id, list the sequence of log messages ordered by timestamp.

C) Anomaly detection for latency

Prompt

Use the latency_ms field. Detect outliers and periods with sustained latency > 2× median. Provide a time series plot and list time windows with the highest average latency, with candidate root causes from available fields (service, instance, region).
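The check that prompt describes can also be sketched locally before asking Cursor. Assuming a sorted list of (timestamp, latency_ms) samples, flag windows whose average exceeds 2× the overall median:

```python
# Sketch: flag time windows with sustained latency above 2x the median.
# Window size and threshold mirror the prompt; field names are assumptions.
from statistics import mean, median

def slow_windows(samples, window=5):
    """samples: list of (ts, latency_ms) sorted by ts. Returns flagged windows."""
    med = median(lat for _, lat in samples)
    flagged = []
    for i in range(len(samples) - window + 1):
        chunk = samples[i:i + window]
        avg = mean(lat for _, lat in chunk)
        if avg > 2 * med:
            flagged.append((chunk[0][0], chunk[-1][0], avg))
    return flagged

# Ten normal samples followed by a five-sample latency spike.
data = [(t, 100) for t in range(10)] + [(t, 500) for t in range(10, 15)]
spikes = slow_windows(data)
```

This is deliberately crude (fixed window, global median); it is a sanity check on what the LLM reports, not a replacement for proper monitoring.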

D) Create an alerting metric recommendation

Prompt

Based on the error rate and latency patterns, recommend two actionable logs-based metrics and sample alerting thresholds for production. Explain why and include suggested alert descriptions.

E) Build a runbook-style remediation

Prompt

For the most frequent error NullPointerException in PaymentProcessor.process, propose a step-by-step troubleshooting runbook: initial checks, logs to inspect (including exact LQL queries), quick mitigations, and safe rollback steps.

F) BigQuery SQL ask (if Cursor can run SQL or you prefer to run it yourself)

Sample SQL to get error counts per service per hour:

SELECT
service,
TIMESTAMP_TRUNC(timestamp, HOUR) AS hour,
COUNTIF(severity IN ("ERROR", "CRITICAL", "ALERT", "EMERGENCY")) AS errors,
COUNT(*) AS total
FROM `project.dataset.logs_prod`
GROUP BY service, hour
ORDER BY hour DESC
LIMIT 1000;

You can paste this into BigQuery or ask Cursor to run it if it has access.
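If you prefer running it from code, here is a sketch with the google-cloud-bigquery client. The table name and the flat service column are assumptions; tables written by a log sink may nest fields under resource.labels.

```python
# Sketch: run the error-count query with the BigQuery Python client.
# Requires `pip install google-cloud-bigquery` and read access to the dataset.

ERROR_COUNT_SQL = """
SELECT
  service,
  TIMESTAMP_TRUNC(timestamp, HOUR) AS hour,
  COUNTIF(severity IN ('ERROR', 'CRITICAL', 'ALERT', 'EMERGENCY')) AS errors,
  COUNT(*) AS total
FROM `{table}`
GROUP BY service, hour
ORDER BY hour DESC
LIMIT 1000
"""

def error_counts(table: str):
    from google.cloud import bigquery
    client = bigquery.Client()
    # query() returns a job; result() blocks until rows are available.
    return list(client.query(ERROR_COUNT_SQL.format(table=table)).result())
```

The resulting rows can be written to CSV and uploaded to Cursor if a direct BigQuery connection is not available.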


8) Example LQL snippets (to use directly in GCP Log Explorer)

  • Errors for a microservice in prod:

    resource.type="k8s_container"
    resource.labels.namespace_name="prod"
    resource.labels.container_name="payments-service"
    severity>=ERROR
  • HTTP 5xx in Cloud Run (structured JSON):

    resource.type="cloud_run_revision"
    jsonPayload.httpStatus >= 500

9) Putting findings into action

  • Short-term: create logs-based alerting policies or temporary scaling rules; pin a hotfix and monitor behavior post-deploy.

  • Mid-term: export logs to BigQuery and build dashboard queries (error trends, latency percentiles). Use logs-based metrics for SLO-based alerts.

  • Long-term: ensure structured logging across services, consistent correlation IDs / traces, and centralized log retention & sampling policies.


10) Security, cost & best practices

  • Permissions: restrict Log Router and BigQuery sink creation to ops/security engineers.

  • PII: mask or remove PII before exporting to external tools / LLMs. If using Claude.ai, avoid sending raw PII unless you explicitly sanitize.

  • Retention & cost: exporting high-volume logs to BigQuery can be costly. Use filter-based sinks to export only what you need. Consider sampling for debug logs.

  • Structured logging: prefer JSON structured logs (jsonPayload) with request_id, trace, service, region, latency_ms so queries are easier.

  • Trace linkage: capture trace and span_id to tie logs to traces (Cloud Trace) for distributed tracing.
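To make the structured-logging bullet concrete, here is a minimal sketch of emitting one JSON log line per request with the suggested fields (field names follow the list above; everything else is illustrative):

```python
# Sketch: emit one JSON log line per request carrying correlation fields
# (request_id, trace, service, region, latency_ms) so LQL queries stay simple.
import json, sys, time

def log_event(service, request_id, trace, region, latency_ms,
              severity="INFO", **extra):
    entry = {
        "timestamp": time.time(),
        "severity": severity,
        "service": service,
        "request_id": request_id,
        "trace": trace,
        "region": region,
        "latency_ms": latency_ms,
        **extra,
    }
    # Writing JSON to stdout lets Cloud Logging parse it into jsonPayload.
    sys.stdout.write(json.dumps(entry) + "\n")
    return entry

event = log_event("payments-service", "req-1", "trace-1", "us-central1", 42)
```

In practice you would wrap this in your logging framework, but the shape of the payload is what matters for queryability.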


11) Example end-to-end mini playbook (concise)

  1. In Cloud Console → Log Explorer, filter: resource=prod, severity>=ERROR, last 1 hour.

  2. If the volume is manageable, download JSON; otherwise set a BigQuery sink with that filter.

  3. In Claude.ai Cursor: upload the JSON or connect to BigQuery table.

  4. Ask Cursor: “Show me top 5 error messages, top services, and a 10-minute error-rate time series.”

  5. Use Cursor outputs to identify suspect service/instance/time window. Extract the trace or request_id.

  6. Run a targeted LQL to fetch full request lifecycles.

  7. Make a temporary alert (Logs → Metrics → Create Metric → Create Alerting Policy).

  8. Draft a short incident report and runbook using Cursor (ask it to create an incident summary and stepwise mitigation).
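Step 7's temporary metric can also be created programmatically. A sketch with google-cloud-logging (the metric name and filter are placeholders, not existing resources):

```python
# Sketch: create a logs-based counter metric that an alerting policy can
# reference. Requires `pip install google-cloud-logging`; names below are
# placeholders.

METRIC_FILTER = 'resource.labels.namespace_name="prod" severity>=ERROR'

def create_error_metric(project: str, name: str = "prod-error-count"):
    from google.cloud import logging as gcl
    client = gcl.Client(project=project)
    metric = client.metric(name, filter_=METRIC_FILTER,
                           description="Temporary incident error counter")
    if not metric.exists():
        metric.create()
    return metric
```

The alerting policy itself is then attached to this metric in Cloud Monitoring.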


12) Sample prompts you can copy-paste into Cursor

  • “Summarize this table logs_prod with top 10 error messages, counts, and the earliest/latest timestamp for each message.”

  • “For the error ‘DBConnectionTimeout’, list the instance IDs and the average CPU utilization and network I/O in the 5 minutes before the errors.” (If you include those fields or connect Cursor to metrics.)

  • “Draft a one-page incident postmortem with timeline, root cause hypothesis, corrective actions, and owners based on these logs.”


13) Checklist before sharing results externally

  • Remove PII and sensitive tokens.

  • Confirm the timezones used in timestamps (store and present in UTC or local consistently).

  • Attach LQL/SQL queries used to generate findings so others can reproduce.


14) Closing tips

  • Start with small, well-scoped queries. Iteratively expand.

  • Use BigQuery if you plan repeated or complex analyses. BigQuery + Cursor (or Cursor file uploads) is a powerful combo.

  • Use Claude.ai Cursor for natural language exploration, summarization, and to generate runbooks/alerts — but always validate any suggested remediation with engineers before acting.

AWS production log analysis with Claude in Cursor — a step-by-step guide

 Goal: let Claude (Anthropic) help you explore, summarize, triage, and root-cause production logs from AWS while working inside the Cursor IDE (or a Cursor + Claude workflow). This guide assumes you have an AWS production environment that emits logs to CloudWatch / S3 and that you can configure Cursor to use an Anthropic API key (or use a Cursor extension that exposes Claude).


Quick architecture overview (what you build)

  1. Log sources: EC2 / ECS / EKS application logs, Lambda logs, ALB/ELB access logs, RDS logs, CloudTrail, VPC Flow Logs.

  2. Collection / centralization: CloudWatch Logs (native), Kinesis Data Streams / Firehose into S3, or direct delivery (ALB → S3).

  3. Indexing & query layer (optional but recommended): CloudWatch Logs Insights for immediate queries; send long-term logs to S3 + Athena / OpenSearch for powerful searches.

  4. Preprocessing / enrichment: Lambda / Glue jobs to parse JSON, enrich with metadata (service, pod, trace-id), and redact secrets.

  5. Cursor + Claude: connect Cursor to an Anthropic API key or install a Cursor-Claude extension so you can paste query results, open log snippets, or stream structured samples to Claude for summarization and RCA.


Step 1 — Gather logs (fast, low friction)

  1. For application logs already in CloudWatch Logs, open CloudWatch → Log groups.

  2. For access logs that write to S3 (ALB/NLB), ensure the target S3 bucket has lifecycle rules for retention.

  3. If you want a streaming pipeline: configure Kinesis Data Firehose to deliver to S3 (Parquet/JSON) and optionally to OpenSearch / Splunk.

Why: CloudWatch Logs gives instant ad-hoc querying; S3 + Athena/Glue is cheaper for long-term analytics.


Step 2 — Prepare a secure sample set to send to Claude

Important security note: Do not send PII, secrets, auth tokens, or production credentials to any external LLM without enterprise agreements and data handling policies. Redact or anonymize values (user IDs, IPs, emails, tokens) before sending. If you must send PII for authorized internal use, ensure your Anthropic contract and Cursor deployment are approved. (I’m assuming you’ll redact locally first.)

Redaction pattern examples (simple):

  • Replace emails: s/[\w.+-]+@[\w-]+\.[\w.-]+/[REDACTED_EMAIL]/g

  • Replace IPs: s/\b\d{1,3}(\.\d{1,3}){3}\b/[REDACTED_IP]/g

  • Replace UUIDs/IDs: s/[0-9a-fA-F-]{8,36}/[REDACTED_ID]/g
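Those sed-style patterns translate directly to Python. A minimal sketch (note the ID pattern is deliberately broad, as in the original, and may over-match short hex strings):

```python
# Sketch: apply the redaction patterns above before sending logs anywhere.
# Order matters: emails and IPs are redacted before the broad ID pattern runs.
import re

PATTERNS = [
    (re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"), "[REDACTED_EMAIL]"),
    (re.compile(r"\b\d{1,3}(?:\.\d{1,3}){3}\b"), "[REDACTED_IP]"),
    (re.compile(r"[0-9a-fA-F-]{8,36}"), "[REDACTED_ID]"),
]

def redact(line: str) -> str:
    for pattern, replacement in PATTERNS:
        line = pattern.sub(replacement, line)
    return line

sample = ("user=bob@example.com ip=10.0.0.12 "
          "trace=123e4567-e89b-12d3-a456-426614174000")
redacted = redact(sample)
```

Run every line through `redact()` before it leaves your machine; extend PATTERNS with tokens or API-key formats specific to your stack.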


Step 3 — Extract useful slices (what to send)

When you ask an LLM to analyze logs, smaller high-value slices work best. Create extracts like:

  • A timeline: the last N minutes of logs from the affected service (sorted).

  • One example error trace (full stack) with surrounding 50 lines context.

  • Aggregated counts: top 10 error messages with counts, and the top 10 endpoints with latencies > X ms.

  • Correlation keys: logs that share the same trace-id or request-id.
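That last slice is easy to produce locally. A sketch assuming log entries as dicts with traceId and timestamp keys (adapt the key names to your schema):

```python
# Sketch: group log entries by trace id and emit a sorted timeline per trace.
from collections import defaultdict

def timelines(entries):
    by_trace = defaultdict(list)
    for entry in entries:
        by_trace[entry["traceId"]].append(entry)
    for trace_entries in by_trace.values():
        trace_entries.sort(key=lambda e: e["timestamp"])
    return dict(by_trace)

logs = [
    {"traceId": "abc", "timestamp": 3, "message": "ERROR db timeout"},
    {"traceId": "abc", "timestamp": 1, "message": "request received"},
    {"traceId": "def", "timestamp": 2, "message": "ok"},
]
grouped = timelines(logs)
```

Pasting one trace's sorted timeline gives the model far more signal per token than a raw dump.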

Example CloudWatch Logs Insights queries:

# errors in last 15 minutes
fields @timestamp, @message, service, traceId
| filter @message like /ERROR/ or @message like /Exception/
| sort @timestamp desc
| limit 200

Or aggregated:

filter @message like /ERROR/
| stats count(*) as hits, count_distinct(traceId) as traces by bin(5m) as period

Run the query, export top results to a file (CSV / JSON), redact, and copy into Cursor.
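The run-and-export step can be scripted with boto3's Logs Insights API as well. A sketch (the log group name and polling interval are placeholders):

```python
# Sketch: start a Logs Insights query with boto3 and poll for results.
# Requires AWS credentials; the log group passed in is a placeholder.
import time

QUERY = """fields @timestamp, @message, service, traceId
| filter @message like /ERROR/ or @message like /Exception/
| sort @timestamp desc
| limit 200"""

def run_insights(log_group: str, start: int, end: int):
    import boto3
    logs = boto3.client("logs")
    qid = logs.start_query(logGroupName=log_group, startTime=start,
                           endTime=end, queryString=QUERY)["queryId"]
    while True:
        resp = logs.get_query_results(queryId=qid)
        if resp["status"] in ("Complete", "Failed", "Cancelled"):
            return resp
        time.sleep(1)  # poll until the query finishes
```

The returned results can then be dumped to JSON, redacted, and pasted into Cursor.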


Step 4 — Configure Cursor to use Claude (quick)

Option A — Cursor built-in model selection: add your Anthropic API key in Cursor settings → Models → Anthropic / Claude model entry. Choose the model you prefer (Claude Opus/Claude Code variants).

Option B — Cursor extension: install a community Cursor-Claude extension if your workspace allows (some companies use internal installs). Example repos and packages exist that show how to install an Anthropic extension into Cursor. Always prefer official options where available.


Step 5 — Prompts & interactions: how to ask Claude to analyze logs

Below are practical, reusable prompt templates. Paste one into Cursor’s Claude chat or the extension, then paste the redacted log snippet.

Template A — Quick summary

I’m pasting a redacted log sample from production for service "payments". Please:
1) Give a short summary of what’s happening (2–3 sentences).
2) List the most likely root causes (ranked).
3) Suggest 3 next troubleshooting steps I should run (commands or queries).
Now here's the redacted log snippet:
-----
<paste logs>
-----

Template B — Correlate traces & explain

I have multiple log lines that share traceId = 12345-abc. Summarize the timeline of events for this trace in plain English, highlight errors, and map which service/component likely introduced the error. Provide a one-paragraph RCA hypothesis and 4 tactical next steps.
<redacted trace logs>

Template C — Generate CloudWatch Insights queries

Given these sample logs and the problem (e.g., "intermittent 502s from /api/checkout"), produce a CloudWatch Logs Insights query to:
- show top endpoints returning 5xx in last 30 minutes,
- group by availability zone,
- show counts and 95th percentile latency where present.
Also provide a short explanation for each part of the query.
<sample log schema: timestamp, @message, statusCode, path, latencyMs, az>

Step 6 — Example workflow (hands-on)

  1. Run CloudWatch Insights to get the top 200 ERROR lines for payments in the last 15 minutes. Export JSON.

  2. Run a local redaction script (simple Python or sed) to hide IPs, emails, tokens.

  3. Open Cursor → start a new Claude chat → paste this prompt (Template A) + the redacted sample.

  4. Ask Claude follow-ups: “Which log lines show latency increase before the error?” or “write a BASH snippet that fetches full logs for traceId X from CloudWatch via awscli.”

  5. Use Claude’s answer to craft next CloudWatch queries or to produce a short incident summary for Slack / PagerDuty.


Step 7 — Automating parts of the flow

You can automate repeatable steps while keeping human-in-the-loop controls:

  • Lambda / Step Functions: when CloudWatch Alarm fires, a Step Function extracts a 5-minute log window, runs a redaction Lambda, stores the sample in S3, and notifies a human to paste into Cursor/Claude.

  • Notebook + Cursor: use a Jupyter notebook (or Cursor code cells) that runs boto3 to fetch logs, runs redaction, and then opens a prompt template prefilled in Cursor.

  • ChatOps: generate an incident summary draft automatically with Claude, then require human approval before sending to Slack.
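A sketch of the redaction Lambda from the first bullet above. The bucket name, log group, event shape, and the single email regex are all placeholders; a real pipeline would reuse your full redaction pattern set.

```python
# Sketch: Lambda handler that pulls a small log window, redacts it, and
# stages it in S3 for a human to review before any LLM step.
import json, re, time

EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")

def redact(text: str) -> str:
    return EMAIL.sub("[REDACTED_EMAIL]", text)

def window_millis(minutes, now=None):
    """Return (start, end) in epoch milliseconds for the trailing window."""
    end = int((now if now is not None else time.time()) * 1000)
    return end - minutes * 60 * 1000, end

def handler(event, context):
    import boto3
    start, end = window_millis(5)
    logs = boto3.client("logs")
    resp = logs.filter_log_events(logGroupName=event["logGroup"],
                                  startTime=start, endTime=end)
    sample = "\n".join(redact(e["message"]) for e in resp["events"])
    boto3.client("s3").put_object(Bucket=event["bucket"],
                                  Key=f"samples/{end}.log",
                                  Body=sample.encode())
    return {"statusCode": 200, "body": json.dumps({"bytes": len(sample)})}
```

Keeping the S3 staging step between extraction and the LLM is what preserves the human-in-the-loop control.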


Step 8 — Example concrete commands

Fetch logs by traceId with awscli:

# Get log streams for group, then filter for traceId
aws logs filter-log-events \
--log-group-name "/aws/ecs/payments" \
--start-time $(($(date +%s -d '15 minutes ago')*1000)) \
--filter-pattern '"traceId":"12345-abc"'

Export CloudWatch Insights query results to S3 (via console or SDK), then redact locally and paste into Cursor.


Step 9 — What Claude is good at here (and what to avoid)

Good at:

  • Summarizing large, messy log snippets into a human-readable timeline.

  • Producing suggested queries, investigative steps, and hypothesis generation.

  • Drafting incident summaries, runbooks, and remediation checklists.

Not good at / be cautious:

  • Blindly trusting any LLM RCA — always verify with observability, metrics, and traces.

  • Sending unredacted PII or sensitive logs to a third-party model without approvals.

  • Replacing structured alerting / runbook automation with ad-hoc LLM prompts.


Step 10 — Ops, costs, and governance

  • Cost: API calls to Claude have cost per token. Keep samples small and structured (aggregate + representative examples) instead of sending everything.

  • Retention & compliance: ensure logs sent to Claude comply with your company’s data handling and any regulatory rules (GDPR, PCI, etc.).

  • Access control: only allow approved engineers to use the Anthropic key in Cursor. Rotate keys and audit usage logs.


Appendix — Example prompts & followups (copy/paste ready)

Short RCA prompt

Describe the sequence of events in these logs (redacted) and provide a one-sentence root cause hypothesis plus three immediate remediation steps. Only use evidence present in the logs and mark any assumptions.
<redacted logs>

Ask for a query

Write a CloudWatch Logs Insights query that shows the top 10 error messages and the number of unique traces for each in the last 1 hour.

Follow-up to Claude

List the exact awscli commands I should run next to fetch full traces for the top 3 traceIds you identified above.

Final checklist before you send a snippet to Claude

  • Redact PII & secrets.

  • Include minimal context: service name, time window, an example traceId.

  • Attach or paste only focused extracts (timeline + example error).

  • Keep a human reviewer in the loop for any suggested remediation that touches production.


Closing notes / recommended next steps

  1. Start by manually pasting 1–2 redacted log snippets into Cursor/Claude to observe quality of answers.

  2. Build a safe redaction pipeline (Lambda or CI script).

  3. If the approach is useful, automate extraction + human approval and add audit logging for compliance.
