golang-observability

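Author: samber

Everyday observability for Golang: the always-on signals in production. Covers structured logging with slog, Prometheus metrics, OpenTelemetry distributed tracing, continuous profiling with pprof/Pyroscope, server-side RUM event tracking, alerting, and Grafana dashboards. Apply when instrumenting Go services for production monitoring, setting up metrics or alerting, adding OpenTelemetry tracing, correlating logs with traces, or migrating legacy loggers (zap/logrus/zerolog) to slog.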

npx skills add https://github.com/samber/cc-skills-golang --skill golang-observability

Persona: You are a Go observability engineer. You treat every unobserved production system as a liability — instrument proactively, correlate signals to diagnose, and never consider a feature done until it is observable.

Modes:

  • Coding / instrumentation mode (default): Add observability to new or existing code: declare metrics, add spans, set up structured logging, wire pprof toggles. Follow the sequential instrumentation guide.
  • Review mode: Review a PR's instrumentation changes. Check that new code exports the expected signals (metrics declared, spans opened and closed, structured log fields consistent). Runs sequentially.
  • Audit mode: Audit existing observability coverage across a codebase. Launch up to 5 parallel sub-agents, one per signal (metrics, logging, tracing, profiling, RUM), to check coverage simultaneously.

Community default. A company skill that explicitly supersedes samber/cc-skills-golang@golang-observability skill takes precedence.

Go Observability Best Practices

Observability is the ability to understand a system's internal state from its external outputs. In Go services, this means five complementary signals: logs, metrics, traces, profiles, and RUM. Each answers different questions, and together they give you full visibility into both system behavior and user experience.

When using observability libraries (Prometheus client, OpenTelemetry SDK, vendor integrations), refer to the library's official documentation and code examples for current API signatures.

Best Practices Summary

  1. Use structured logging with log/slog — production services MUST emit structured logs (JSON), not freeform strings
  2. Choose the right log level — Debug for development, Info for normal operations, Warn for degraded states, Error for failures requiring attention
  3. Log with context — use slog.InfoContext(ctx, ...) to correlate logs with traces
  4. Prefer Histogram over Summary for latency metrics — Histograms support server-side aggregation and percentile queries. Every HTTP endpoint MUST have latency and error rate metrics.
  5. Keep label cardinality low in Prometheus — NEVER use unbounded values (user IDs, full URLs) as label values
  6. Track percentiles (P50, P90, P99, P99.9) using Histograms + histogram_quantile() in PromQL
  7. Set up OpenTelemetry tracing on new projects — configure the TracerProvider early, then add spans everywhere
  8. Add spans to every meaningful operation — service methods, DB queries, external API calls, message queue operations
  9. Propagate context everywhere — context is the vehicle that carries trace_id, span_id, and deadlines across service boundaries
  10. Enable profiling via environment variables — toggle pprof and continuous profiling on/off without redeploying
  11. Correlate signals — inject trace_id into logs, use exemplars to link metrics to traces
  12. A feature is not done until it is observable — declare metrics, add proper logging, create spans
  13. Start from awesome-prometheus-alerts for infrastructure and dependency monitoring: it provides ~500 ready-to-use alerting rules organized by technology
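
As a minimal sketch of practices 1 to 3, the following stdlib-only program builds a JSON logger whose minimum level is parsed from a string (for example, a value read from configuration) and logs through the context-aware methods. The helper names parseLevel, newJSONLogger, and logOrder are illustrative and not part of any library.

```go
package main

import (
	"bytes"
	"context"
	"encoding/json"
	"fmt"
	"io"
	"log/slog"
)

// parseLevel maps a level name such as "debug" or "warn" to a slog.Level,
// falling back to Info when the name is not recognized.
func parseLevel(name string) slog.Level {
	var level slog.Level
	if err := level.UnmarshalText([]byte(name)); err != nil {
		return slog.LevelInfo
	}
	return level
}

// newJSONLogger returns a logger that writes one JSON object per record.
func newJSONLogger(w io.Writer, level slog.Level) *slog.Logger {
	return slog.New(slog.NewJSONHandler(w, &slog.HandlerOptions{Level: level}))
}

// logOrder emits one Debug and one Info record at the given minimum level
// and returns the records that were actually written, decoded from JSON.
func logOrder(levelName string) []map[string]any {
	var buf bytes.Buffer
	logger := newJSONLogger(&buf, parseLevel(levelName))
	ctx := context.Background()

	logger.DebugContext(ctx, "cache lookup", "key", "order:42")
	logger.InfoContext(ctx, "order created", "order_id", 42, "items", 3)

	var records []map[string]any
	dec := json.NewDecoder(&buf)
	for {
		var rec map[string]any
		if err := dec.Decode(&rec); err != nil {
			break // io.EOF once every record has been read
		}
		records = append(records, rec)
	}
	return records
}

func main() {
	for _, rec := range logOrder("info") {
		fmt.Println(rec["level"], rec["msg"], rec["order_id"])
	}
}
```

Because every record is a JSON object with stable keys, a log aggregator can filter on order_id directly instead of parsing free-form strings.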

Cross-References

See samber/cc-skills-golang@golang-error-handling skill for the single handling rule. See samber/cc-skills-golang@golang-troubleshooting skill for using observability signals to diagnose production issues. See samber/cc-skills-golang@golang-security skill for protecting pprof endpoints and avoiding PII in logs. See samber/cc-skills-golang@golang-context skill for propagating trace context across service boundaries. See samber/cc-skills@promql-cli skill for querying and exploring PromQL expressions against Prometheus from the CLI.

Go 1.26+: slog multi-handler

For simple fan-out to multiple slog handlers, prefer stdlib slog.NewMultiHandler before adding third-party handler-composition dependencies.

logger := slog.New(slog.NewMultiHandler(
    slog.NewJSONHandler(os.Stdout, nil),
    auditHandler,
))

Use third-party slog handler libraries only when the stdlib handler composition is insufficient.

The Five Signals

| Signal | Question it answers | Tool | When to use |
| --- | --- | --- | --- |
| Logs | What happened? | log/slog | Discrete events, errors, audit trails |
| Metrics | How much / how fast? | Prometheus client | Aggregated measurements, alerting, SLOs |
| Traces | Where did time go? | OpenTelemetry | Request flow across services, latency breakdown |
| Profiles | Why is it slow / using memory? | pprof, Pyroscope | CPU hotspots, memory leaks, lock contention |
| RUM | How do users experience it? | PostHog, Segment | Product analytics, funnels, session replay |

Detailed Guides

Each signal has a dedicated guide with full code examples, configuration patterns, and cost analysis:

  • Structured Logging — Why structured logging matters for log aggregation at scale. Covers log/slog setup, log levels (Debug/Info/Warn/Error) and when to use each, request correlation with trace IDs, context propagation with slog.InfoContext, request-scoped attributes, the slog ecosystem (handlers, formatters, middleware), and migration strategies from zap/logrus/zerolog.

  • Metrics Collection: Prometheus client setup and the four metric types (Counter for rate-of-change, Gauge for snapshots, Histogram for latency aggregation, Summary for client-side quantiles). Deep dive: why Histograms beat Summaries (server-side aggregation, supports histogram_quantile PromQL), naming conventions, the PromQL-as-comments convention (write queries above metric declarations for discoverability), production-grade PromQL examples, multi-window SLO burn rate alerting, and the high-cardinality label problem (why unbounded values like user IDs destroy performance).

  • Distributed Tracing — When and how to use OpenTelemetry SDK to trace request flows across services. Covers spans (creating, attributes, status recording), otelhttp middleware for HTTP instrumentation, error recording with span.RecordError(), trace sampling (why you can't collect everything at scale), propagating trace context across service boundaries, and cost optimization.

  • Profiling — On-demand profiling with pprof (CPU, heap, goroutine, mutex, block profiles) — how to enable it in production, secure it with auth, and toggle via environment variables without redeploying. Continuous profiling with Pyroscope for always-on performance visibility. Cost implications of each profiling type and mitigation strategies.

  • Real User Monitoring — Understanding how users actually experience your service. Covers product analytics (event tracking, funnels), Customer Data Platform integration, and critical compliance: GDPR/CCPA consent checks, data subject rights (user deletion endpoints), and privacy checklist for tracking. Server-side event tracking (PostHog, Segment) and identity key best practices.

  • Alerting: Proactive problem detection. Covers the four golden signals (latency, traffic, errors, saturation), the awesome-prometheus-alerts collection of ~500 ready-to-use rules organized by technology, Go runtime alerts (goroutine leaks, GC pressure, OOM risk), severity levels, and common mistakes that break alerting (using irate instead of rate, omitting the for: duration that prevents flapping).

  • Grafana Dashboards — Prebuilt dashboards for Go runtime monitoring (heap allocation, GC pause frequency, goroutine count, CPU). Explains the standard dashboards to install, how to customize them for your service, and when each dashboard answers a different operational question.
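
The environment-variable toggle described in the Profiling guide can be sketched with the standard library alone. This is an illustration under assumptions: the variable name PPROF_ENABLED and the helpers newDebugMux and statusFor are hypothetical, and a real deployment would also bind this mux to an internal listener and put authentication in front of it.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"net/http/pprof"
	"os"
)

// newDebugMux returns a mux intended for an internal-only listener.
// The pprof handlers are registered only when enabled is true.
// Note: importing net/http/pprof also registers its handlers on
// http.DefaultServeMux, so avoid serving the default mux publicly.
func newDebugMux(enabled bool) *http.ServeMux {
	mux := http.NewServeMux()
	mux.HandleFunc("/healthz", func(w http.ResponseWriter, _ *http.Request) {
		w.WriteHeader(http.StatusOK)
	})
	if enabled {
		mux.HandleFunc("/debug/pprof/", pprof.Index)
		mux.HandleFunc("/debug/pprof/cmdline", pprof.Cmdline)
		mux.HandleFunc("/debug/pprof/profile", pprof.Profile)
		mux.HandleFunc("/debug/pprof/symbol", pprof.Symbol)
		mux.HandleFunc("/debug/pprof/trace", pprof.Trace)
	}
	return mux
}

// statusFor issues an in-memory GET against the mux and returns the status code.
func statusFor(mux *http.ServeMux, path string) int {
	rec := httptest.NewRecorder()
	mux.ServeHTTP(rec, httptest.NewRequest(http.MethodGet, path, nil))
	return rec.Code
}

func main() {
	// PPROF_ENABLED is a hypothetical variable name chosen for this sketch.
	enabled := os.Getenv("PPROF_ENABLED") == "true"
	mux := newDebugMux(enabled)
	fmt.Println("pprof enabled:", enabled)
	fmt.Println("GET /debug/pprof/ ->", statusFor(mux, "/debug/pprof/"))
}
```

Reading the variable at startup means a restart with a different environment flips profiling on or off with no new build.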

Correlating Signals

Signals are most powerful when connected. A trace_id in your logs lets you jump from a log line to the full request trace. An exemplar on a metric links a latency spike to the exact trace that caused it.

Logs + Traces: otelslog bridge

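import (
    "log/slog"

    "go.opentelemetry.io/contrib/bridges/otelslog"
)

// Create a handler that forwards slog records to the OpenTelemetry Logs SDK.
// It needs a configured LoggerProvider; records logged with a context carry
// the trace_id and span_id of that context's active span.
handler := otelslog.NewHandler("my-service")
slog.SetDefault(slog.New(handler))

// Now every slog call with context includes trace correlation
slog.InfoContext(ctx, "order created", "order_id", orderID)
// The exported record includes trace_id (e.g. "abc123"), span_id (e.g. "def456"),
// and the message "order created"; the exact shape depends on the log exporter.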
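
To make the correlation mechanism itself visible, here is a stdlib-only sketch of a wrapping slog.Handler that copies a trace ID from the context into each record. It is not how otelslog is implemented; the context key traceIDKey and the helpers logLine and traceIDIn are hypothetical, and with OpenTelemetry the IDs would come from the active span rather than a plain context value.

```go
package main

import (
	"bytes"
	"context"
	"encoding/json"
	"fmt"
	"io"
	"log/slog"
)

// traceIDKey is a hypothetical context key used only in this sketch.
type traceIDKey struct{}

// traceHandler wraps another slog.Handler and adds trace_id when ctx has one.
type traceHandler struct{ slog.Handler }

func (h traceHandler) Handle(ctx context.Context, r slog.Record) error {
	if id, ok := ctx.Value(traceIDKey{}).(string); ok && id != "" {
		r = r.Clone() // avoid mutating state shared with other copies
		r.AddAttrs(slog.String("trace_id", id))
	}
	return h.Handler.Handle(ctx, r)
}

// WithAttrs and WithGroup re-wrap the inner handler so that loggers derived
// with logger.With(...) keep injecting the trace ID.
func (h traceHandler) WithAttrs(attrs []slog.Attr) slog.Handler {
	return traceHandler{Handler: h.Handler.WithAttrs(attrs)}
}

func (h traceHandler) WithGroup(name string) slog.Handler {
	return traceHandler{Handler: h.Handler.WithGroup(name)}
}

func newLogger(w io.Writer) *slog.Logger {
	return slog.New(traceHandler{Handler: slog.NewJSONHandler(w, nil)})
}

// logLine logs one record, optionally with a trace ID in the context,
// and returns the raw JSON line that was written.
func logLine(traceID string) string {
	var buf bytes.Buffer
	ctx := context.Background()
	if traceID != "" {
		ctx = context.WithValue(ctx, traceIDKey{}, traceID)
	}
	newLogger(&buf).InfoContext(ctx, "order created", "order_id", 42)
	return buf.String()
}

// traceIDIn extracts the trace_id field from a JSON log line, or "" if absent.
func traceIDIn(line string) string {
	var rec map[string]any
	if err := json.Unmarshal([]byte(line), &rec); err != nil {
		return ""
	}
	id, _ := rec["trace_id"].(string)
	return id
}

func main() {
	fmt.Print(logLine("abc123"))
	fmt.Print(logLine(""))
}
```

The sketch also shows why the context variants matter: a call made with slog.Info instead of slog.InfoContext gives the handler no context to read, so the record cannot be tied to a trace.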

Metrics + Traces: Exemplars

// When recording a histogram observation, attach the trace_id as an exemplar
// so you can jump from a P99 spike directly to the offending trace
obs := histogram.WithLabelValues("POST", "/orders")
if eo, ok := obs.(prometheus.ExemplarObserver); ok {
    eo.ObserveWithExemplar(duration, prometheus.Labels{"trace_id": traceID})
} else {
    obs.Observe(duration)
}

Migrating Legacy Loggers

If the project currently uses zap, logrus, or zerolog, migrate to log/slog. It is the standard library logger since Go 1.21, has a stable API, and the ecosystem has consolidated around it. Continuing with third-party loggers means maintaining an extra dependency for no benefit.

Migration strategy:

  1. Add slog as the new logger with slog.SetDefault()
  2. During migration, use a bridge handler to route slog output through the existing logger: samber/slog-zap, samber/slog-logrus, samber/slog-zerolog
  3. Gradually replace all zap.L().Info(...) / logrus.Info(...) / log.Info().Msg(...) calls with slog.Info(...)
  4. Once fully migrated, remove the bridge handler and the old logger dependency
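
For call sites that still use the standard library log package, step 1 already helps: slog.SetDefault also redirects the default log logger through the slog handler. The sketch below demonstrates only that stdlib behavior; it does not cover zap, logrus, or zerolog, which need the bridge handlers listed above. The function name messagesAfterSetDefault is illustrative.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"log"
	"log/slog"
)

// messagesAfterSetDefault installs a JSON slog handler as the default, logs
// once through the legacy log package and once through slog, and returns the
// msg field of every JSON record that reached the handler.
func messagesAfterSetDefault() []string {
	var buf bytes.Buffer
	slog.SetDefault(slog.New(slog.NewJSONHandler(&buf, nil)))

	log.Print("legacy call site")                           // not yet migrated
	slog.Info("migrated call site", "component", "billing") // already migrated

	var msgs []string
	dec := json.NewDecoder(&buf)
	for {
		var rec struct {
			Msg string `json:"msg"`
		}
		if err := dec.Decode(&rec); err != nil {
			break // io.EOF once every record has been read
		}
		msgs = append(msgs, rec.Msg)
	}
	return msgs
}

func main() {
	for _, msg := range messagesAfterSetDefault() {
		fmt.Println(msg)
	}
}
```

Both call sites end up in the same structured stream, so old and new code can coexist while calls are replaced gradually.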

Definition of Done for Observability

A feature is not production-ready until it is observable. Before marking a feature as done, verify:

  • Metrics declared — counters for operations/errors, histograms for latencies, gauges for saturation. Each metric var has PromQL queries and alert rules as comments above its declaration.
  • Logging is proper — structured key-value pairs with slog, context variants used (slog.InfoContext), no PII in logs, errors MUST be either logged OR returned (NEVER both).
  • Spans created — every service method, DB query, and external API call has a span with relevant attributes, errors recorded with span.RecordError().
  • Dashboards and alerts exist — the PromQL from your metric comments is wired into Grafana dashboards and Prometheus alerting rules. Ready-to-use alert rules for common infrastructure dependencies are available at awesome-prometheus-alerts.
  • RUM events tracked — key business events tracked server-side (PostHog/Segment), identity key is user_id (not email), consent checked before tracking.
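
The "logged OR returned" rule from the checklist can be sketched as follows: inner layers wrap and return, and only the outermost layer logs. All function names here (findUser, loadProfile, handleRequest, errorLogCount) are hypothetical stand-ins for a repository, a service, and a request handler.

```go
package main

import (
	"bytes"
	"errors"
	"fmt"
	"log/slog"
	"strings"
)

var errNotFound = errors.New("row not found")

// findUser stands in for a database call that fails.
func findUser(id int) error {
	return fmt.Errorf("querying user %d: %w", id, errNotFound)
}

// loadProfile adds context and returns the error; it does not log.
func loadProfile(id int) error {
	if err := findUser(id); err != nil {
		return fmt.Errorf("loading profile: %w", err)
	}
	return nil
}

// handleRequest is the top level: the single place where the error is logged.
func handleRequest(logger *slog.Logger, id int) {
	if err := loadProfile(id); err != nil {
		logger.Error("request failed", "error", err, "user_id", id)
	}
}

// errorLogCount runs one failing request and counts the ERROR records emitted.
func errorLogCount(id int) int {
	var buf bytes.Buffer
	handleRequest(slog.New(slog.NewJSONHandler(&buf, nil)), id)
	return strings.Count(buf.String(), `"level":"ERROR"`)
}

func main() {
	err := loadProfile(7)
	fmt.Println(err)
	fmt.Println("is not-found:", errors.Is(err, errNotFound))
	fmt.Println("error records:", errorLogCount(7))
}
```

The single record still carries the full chain of context because each layer wrapped with %w, and errors.Is continues to match the original cause.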

Common Mistakes

// ✗ Bad: log AND return (error gets logged multiple times up the chain)
if err != nil {
    slog.Error("query failed", "error", err)
    return fmt.Errorf("query: %w", err)
}

// ✓ Good: return with context, log once at the top level
if err != nil {
    return fmt.Errorf("querying users: %w", err)
}

// ✗ Bad: high-cardinality label (unbounded user IDs)
httpRequests.WithLabelValues(r.Method, r.URL.Path, userID).Inc()

// ✓ Good: bounded label values only
httpRequests.WithLabelValues(r.Method, routePattern).Inc()

// ✗ Bad: not passing context (breaks trace propagation)
result, err := db.Query("SELECT ...")

// ✓ Good: context flows through, trace continues
result, err := db.QueryContext(ctx, "SELECT ...")

// ✗ Bad: using Summary for latency (can't aggregate across instances)
prometheus.NewSummary(prometheus.SummaryOpts{
    Name:       "http_request_duration_seconds",
    Objectives: map[float64]float64{0.99: 0.001},
})

// ✓ Good: use Histogram (aggregatable, supports histogram_quantile)
prometheus.NewHistogram(prometheus.HistogramOpts{
    Name:    "http_request_duration_seconds",
    Buckets: prometheus.DefBuckets,
})
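
Why Histograms aggregate across instances while precomputed quantiles do not can be shown with a small worked example. The sketch below is a simplification, not the Prometheus implementation: it sums cumulative bucket counts from two hypothetical instances and estimates a quantile by linear interpolation inside the matching bucket, in the spirit of histogram_quantile(). Bucket bounds and counts are made-up values, and q is assumed to be in (0, 1].

```go
package main

import (
	"fmt"
	"math"
)

// bounds are the finite upper bucket bounds in seconds (made-up values).
var bounds = []float64{0.25, 0.5, 1}

// histogram holds cumulative counts: entry i counts observations <= bounds[i],
// and the final entry is the +Inf bucket, i.e. the total count.
type histogram []float64

// merge sums bucket counts element-wise, which is all that aggregation
// across instances requires.
func merge(hs ...histogram) histogram {
	out := make(histogram, len(bounds)+1)
	for _, h := range hs {
		for i, c := range h {
			out[i] += c
		}
	}
	return out
}

// quantile estimates the q-quantile (0 < q <= 1) by linear interpolation
// within the first bucket whose cumulative count reaches the target rank.
func quantile(q float64, h histogram) float64 {
	total := h[len(h)-1]
	if total == 0 {
		return math.NaN()
	}
	rank := q * total
	for i, upper := range bounds {
		if h[i] >= rank {
			lower, below := 0.0, 0.0
			if i > 0 {
				lower, below = bounds[i-1], h[i-1]
			}
			return lower + (upper-lower)*(rank-below)/(h[i]-below)
		}
	}
	// The rank falls in the +Inf bucket: report the highest finite bound.
	return bounds[len(bounds)-1]
}

func main() {
	fast := histogram{70, 90, 100, 100} // instance A: mostly fast requests
	slow := histogram{10, 30, 90, 100}  // instance B: mostly slow requests

	all := merge(fast, slow)
	fmt.Println("merged buckets:", all)
	fmt.Println("P50 of merged histogram:", quantile(0.5, all))
	fmt.Println("average of per-instance P50s:",
		(quantile(0.5, fast)+quantile(0.5, slow))/2)
}
```

The last two printed values differ: averaging per-instance percentiles, which is all a Summary allows, does not give the fleet-wide percentile, whereas summed bucket counts do.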
