Friday, 27 February 2026

GCP Console — Production Log Analysis (step-by-step)

 


Using Claude.ai Cursor for conversational / LLM-assisted analysis

This article shows a practical, end-to-end workflow for investigating production logs from Google Cloud Console (Cloud Logging / Log Explorer), exporting them, and using Claude.ai Cursor to query, summarize, and produce actionable findings. It’s written as a sequence of clear steps you can follow now.


1) Goal & quick summary

Goal: quickly find, explore, and analyze production issues using GCP Log Explorer, export the logs you need (e.g., to BigQuery or CSV), then use Claude.ai Cursor to ask natural-language questions, detect anomalies, generate summaries, and produce runbook-style recommendations.

High-level flow:

  1. Identify logs in GCP Console → filter with Logging Query Language (LQL).

  2. Export/save relevant log slices (BigQuery sink or CSV).

  3. Use Claude.ai Cursor to load the data (or connect to BigQuery) and interactively analyze it with prompts and code cells.

  4. Produce findings, visualizations, and suggested remediation steps.


2) Prerequisites & access

  • GCP project access with Logging Viewer (or higher) role for the target project. For exports, Logs Configuration Writer or BigQuery Data Editor permissions may be required.

  • Cloud Logging (formerly Stackdriver Logging) is enabled and your services are writing logs.

  • A Claude.ai account with Cursor enabled (ability to connect/upload files or to connect to BigQuery / cloud storage).

  • Optional: BigQuery dataset to receive exported logs, or permission to download CSVs from Log Explorer.


3) Step A — Narrow down logs in GCP Console (Log Explorer)

  1. Open Cloud Console → Navigation menu → Logging → Log Explorer.

  2. Set the project (top-left) to the production project.

  3. Choose a time range (top-right). Start wide (last 24 hrs) then narrow to the window of the incident.

  4. Use the resource and log filters:

    • Resource: e.g., Kubernetes Container, GCE VM Instance, Cloud Run Revision, Cloud Function.

    • Log name: application logs, stdout, stderr, requests, or syslog.

  5. Build an LQL query (examples below). Filter on log fields with field="value" comparisons and on severity:

    • Example — errors for a service:

      resource.type="k8s_container"
      resource.labels.namespace_name="prod"
      logName="projects/PROJECT_ID/logs/stdout"
      severity>=ERROR
    • Example — 500s in an HTTP server (Cloud Run request logs):

      resource.type="cloud_run_revision"
      httpRequest.status>=500
  6. Run the query, inspect sample log entries on the right. Use the Expand pane to view full JSON payloads.


4) Step B — Refine & extract fields

  • Use field extraction in Log Explorer: click a field in the JSON payload to copy it or add it as a derived field (e.g., user_id, trace, request_id, latency_ms).

  • Use REGEXP_EXTRACT (in BigQuery, after export) or LQL regular-expression matching (field =~ "pattern") to pull structured fields from unstructured text when needed.

  • Example — extracting a numeric latency from textPayload (BigQuery SQL, after export):

    CAST(REGEXP_EXTRACT(textPayload, r"latency=(\d+)") AS INT64) AS latency_ms

(Exact functions depend on whether you're exporting to BigQuery or using LQL features.)
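The same extraction logic is easy to replicate client-side on a downloaded CSV/JSON slice. A minimal Python sketch, assuming raw log lines carry a latency=<digits> token as in the example above:

```python
import re

# Matches the same pattern as the REGEXP_EXTRACT example: "latency=<digits>"
LATENCY_RE = re.compile(r"latency=(\d+)")

def extract_latency_ms(text_payload):
    """Return the latency in ms as an int, or None if the token is absent."""
    m = LATENCY_RE.search(text_payload)
    return int(m.group(1)) if m else None

lines = [
    "GET /pay 200 latency=123 upstream=ok",
    "GET /pay 500 latency=2048 upstream=timeout",
    "health check ok",
]
latencies = [extract_latency_ms(line) for line in lines]
print(latencies)  # [123, 2048, None]
```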


5) Step C — Export logs for deeper analysis

You have two main options:

Option 1 — Export to BigQuery (recommended for large-scale analysis)

  1. In Log Explorer, click More actions → Create sink (or go to Logging → Log Router).

  2. Create a sink:

    • Sink service: BigQuery dataset.

    • Choose filter: the LQL you refined above (only export relevant logs).

    • Destination dataset: your_project.your_dataset.logs_prod.

  3. Confirm and create the sink. Logs matching the filter will be streamed into the BigQuery table (append).

Advantages: scalable, fast SQL queries, works well with Cursor if Cursor can connect to BigQuery (recommended).

Option 2 — Download a CSV / JSON from Log Explorer (ad-hoc)

  1. From Log Explorer results, click Download → CSV or JSON for the current query/time range.

  2. This is suitable for small slices or immediate one-off investigations.


6) Step D — Prepare data for Claude.ai Cursor

  • If you exported to BigQuery, note the table name and ensure Cursor can connect (or you can export a table snapshot to CSV).

  • If using CSV/JSON, upload it into Claude.ai Cursor (Cursor supports file upload and interactive code cells).

  • Clean data as required: convert timestamps, parse fields, remove PII (mask user identifiers), and sample if dataset is huge.
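The cleaning step can be scripted with the standard library alone. A sketch — the field names timestamp and user_id are assumptions about your export schema, so adjust to match — that normalizes timestamps to UTC and replaces user identifiers with a stable, non-reversible hash:

```python
import hashlib
import json
from datetime import datetime, timezone

def mask_user_id(user_id):
    """Replace a raw identifier with a stable, non-reversible token."""
    return "user_" + hashlib.sha256(user_id.encode()).hexdigest()[:12]

def clean_entry(entry):
    """Normalize one exported log entry before uploading it to an LLM tool."""
    cleaned = dict(entry)
    # Parse an RFC 3339 timestamp (assumed field name) and re-emit it in UTC.
    ts = datetime.fromisoformat(entry["timestamp"].replace("Z", "+00:00"))
    cleaned["timestamp"] = ts.astimezone(timezone.utc).isoformat()
    if "user_id" in cleaned:
        cleaned["user_id"] = mask_user_id(cleaned["user_id"])
    return cleaned

raw = {"timestamp": "2026-02-26T10:15:00Z", "user_id": "alice@example.com",
       "message": "payment failed"}
print(json.dumps(clean_entry(raw)))
```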


7) Step E — Use Claude.ai Cursor: practical examples & prompt templates

Below are concrete prompts and examples you can paste into Claude.ai Cursor. Treat Cursor like an analyst: show it the table/CSV or give it a BigQuery connection plus the table name.

A) Quick human-readable summary

Prompt

I uploaded prod_logs_2026-02-26.csv. Give me a short summary of the main error types, top affected services, and any spikes in errors over time. Show counts by error type and by service and produce a 3-line executive summary.

B) Find top offending requests

Prompt

In the dataset, find the top 10 request_ids that produced the most ERROR or CRITICAL entries. For each request_id, list the sequence of log messages ordered by timestamp.
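If you want to sanity-check Cursor's answer (or compute this yourself on a CSV slice), the aggregation is a few lines of Python. A sketch assuming each parsed entry is a dict with request_id and severity fields:

```python
from collections import Counter

# Toy parsed log entries; in practice, load these from your CSV/JSON export.
entries = [
    {"request_id": "r1", "severity": "ERROR"},
    {"request_id": "r1", "severity": "CRITICAL"},
    {"request_id": "r2", "severity": "ERROR"},
    {"request_id": "r3", "severity": "INFO"},
]

# Count ERROR/CRITICAL entries per request_id, highest first.
counts = Counter(
    e["request_id"] for e in entries if e["severity"] in ("ERROR", "CRITICAL")
)
top_10 = counts.most_common(10)
print(top_10)  # [('r1', 2), ('r2', 1)]
```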

C) Anomaly detection for latency

Prompt

Use the latency_ms field. Detect outliers and periods with sustained latency > 2× median. Provide a time series plot and list time windows with the highest average latency, with candidate root causes from available fields (service, instance, region).
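The underlying check is simple enough to verify by hand: compute the overall median latency, then flag windows whose average exceeds twice that median. A minimal stdlib sketch with toy data (the window bucketing itself is left to your tool of choice):

```python
from statistics import mean, median

# Per-window latency samples in ms (window label -> samples) — toy data.
windows = {
    "10:00": [100, 110, 95],
    "10:10": [105, 98, 102],
    "10:20": [450, 500, 480],  # sustained spike
}

all_samples = [v for samples in windows.values() for v in samples]
baseline = median(all_samples)

# Flag windows whose average latency exceeds 2x the overall median.
flagged = {w: mean(s) for w, s in windows.items() if mean(s) > 2 * baseline}
print(f"median={baseline}, flagged={sorted(flagged)}")
```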

D) Create an alerting metric recommendation

Prompt

Based on the error rate and latency patterns, recommend two actionable logs-based metrics and sample alerting thresholds for production. Explain why and include suggested alert descriptions.

E) Build a runbook-style remediation

Prompt

For the most frequent error NullPointerException in PaymentProcessor.process, propose a step-by-step troubleshooting runbook: initial checks, logs to inspect (including exact LQL queries), quick mitigations, and safe rollback steps.

F) BigQuery SQL ask (if Cursor can run SQL or you prefer to run it yourself)

Sample SQL to get error counts per service per hour:

SELECT
service,
TIMESTAMP_TRUNC(timestamp, HOUR) AS hour,
COUNTIF(severity IN ("ERROR", "CRITICAL", "ALERT", "EMERGENCY")) AS errors,
COUNT(*) AS total
FROM `project.dataset.logs_prod`
GROUP BY service, hour
ORDER BY hour DESC
LIMIT 1000;

You can paste this into BigQuery or ask Cursor to run it if it has access.


8) Example LQL snippets (to use directly in GCP Log Explorer)

  • Errors for a microservice in prod:

    resource.type="k8s_container"
    resource.labels.namespace_name="prod"
    resource.labels.container_name="payments-service"
    severity>=ERROR
  • HTTP 5xx in Cloud Run (request logs):

    resource.type="cloud_run_revision"
    httpRequest.status >= 500

9) Putting findings into action

  • Short-term: create logs-based alerting policies or temporary scaling rules; pin a hotfix and monitor behavior post-deploy.

  • Mid-term: export logs to BigQuery and build dashboard queries (error trends, latency percentiles). Use logs-based metrics for SLO-based alerts.

  • Long-term: ensure structured logging across services, consistent correlation IDs / traces, and centralized log retention & sampling policies.


10) Security, cost & best practices

  • Permissions: restrict Log Router and BigQuery sink creation to ops/security engineers.

  • PII: mask or remove PII before exporting to external tools / LLMs. If using Claude.ai, avoid sending raw PII unless you explicitly sanitize.

  • Retention & cost: exporting high-volume logs to BigQuery can be costly. Use filter-based sinks to export only what you need. Consider sampling for debug logs.

  • Structured logging: prefer JSON structured logs (jsonPayload) with request_id, trace, service, region, latency_ms so queries are easier.

  • Trace linkage: capture trace and span_id to tie logs to traces (Cloud Trace) for distributed tracing.


11) Example end-to-end mini playbook (concise)

  1. In Cloud Console → Log Explorer, filter: resource=prod, severity>=ERROR, last 1 hour.

  2. If the volume is manageable, download JSON; otherwise set a BigQuery sink with that filter.

  3. In Claude.ai Cursor: upload the JSON or connect to BigQuery table.

  4. Ask Cursor: “Show me top 5 error messages, top services, and a 10-minute error-rate time series.”

  5. Use Cursor outputs to identify suspect service/instance/time window. Extract the trace or request_id.

  6. Run a targeted LQL to fetch full request lifecycles.

  7. Make a temporary alert (Logging → Log-based metrics → Create metric, then create an alerting policy on it).

  8. Draft a short incident report and runbook using Cursor (ask it to create an incident summary and stepwise mitigation).
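Step 4's error-rate time series is easy to reproduce directly: bucket each error timestamp into a 10-minute bin and count per bin. A stdlib sketch:

```python
from collections import Counter
from datetime import datetime

def bucket_10min(ts):
    """Truncate an RFC 3339 timestamp to its 10-minute bucket label."""
    dt = datetime.fromisoformat(ts.replace("Z", "+00:00"))
    return dt.strftime("%H:") + f"{(dt.minute // 10) * 10:02d}"

# Toy error timestamps; in practice, pull these from the exported entries.
errors = [
    "2026-02-26T10:03:12Z",
    "2026-02-26T10:07:40Z",
    "2026-02-26T10:14:05Z",
]
series = Counter(bucket_10min(t) for t in errors)
print(sorted(series.items()))  # [('10:00', 2), ('10:10', 1)]
```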


12) Sample prompts you can copy-paste into Cursor

  • “Summarize this table logs_prod with top 10 error messages, counts, and the earliest/latest timestamp for each message.”

  • “For the error ‘DBConnectionTimeout’, list the instance IDs and the average CPU utilization and network I/O in the 5 minutes before the errors.” (If you include those fields or connect Cursor to metrics.)

  • “Draft a one-page incident postmortem with timeline, root cause hypothesis, corrective actions, and owners based on these logs.”


13) Checklist before sharing results externally

  • Remove PII and sensitive tokens.

  • Confirm the timezones used in timestamps (store and present in UTC or local consistently).

  • Attach LQL/SQL queries used to generate findings so others can reproduce.


14) Closing tips

  • Start with small, well-scoped queries. Iteratively expand.

  • Use BigQuery if you plan repeated or complex analyses. BigQuery + Cursor (or Cursor file uploads) is a powerful combo.

  • Use Claude.ai Cursor for natural language exploration, summarization, and to generate runbooks/alerts — but always validate any suggested remediation with engineers before acting.
