dd-monitors

Author: datadog-labs

Monitor management: creating, updating, muting, and alerting best practices.

npx skills add https://github.com/datadog-labs/pup --skill dd-monitors

Datadog Monitors

Create, manage, and maintain monitors for alerting.

Prerequisites

This skill requires the pup binary on your PATH.

pup - cargo install --git https://github.com/DataDog/pup

Quick Start

pup auth login

Common Operations

List Monitors

pup monitors list
pup monitors list --tags "team:platform"
pup monitors search --query "status:Alert"

Get Monitor

pup monitors get <id>

Create Monitor

pup monitors create --file monitor.json
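
The command above reads the monitor definition from a file whose contents are not shown here. As a minimal sketch, assuming pup accepts the same fields as the Datadog Monitors API body (name, type, query, message, tags, options), such a file could be generated as follows. The monitor name, tags, and notification handle are illustrative placeholders, and `write_monitor_file` is a hypothetical helper.

```python
import json

# Assumption: pup accepts the same fields as the Datadog Monitors API body.
# The name, tags, and @-handle below are illustrative placeholders.
monitor = {
    "name": "High CPU on prod API hosts",
    "type": "metric alert",
    "query": "avg(last_5m):avg:system.cpu.user{env:prod,service:api} by {host} > 80",
    "message": "CPU is above {{threshold}} on {{host.name}}. @slack-ops",
    "tags": ["team:platform", "env:prod"],
    "options": {
        "thresholds": {"critical": 80, "critical_recovery": 70},
    },
}


def write_monitor_file(definition: dict, path: str = "monitor.json") -> str:
    """Serialize a monitor definition to a JSON file and return its path."""
    with open(path, "w", encoding="utf-8") as fh:
        json.dump(definition, fh, indent=2)
    return path


if __name__ == "__main__":
    print(write_monitor_file(monitor))
```

The resulting `monitor.json` can then be passed to `pup monitors create --file monitor.json`.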

Mute/Unmute

# Mute with duration
pup monitors update 12345 --file monitor-muted.json

# Or mute with specific end time
pup monitors update 12345 --file monitor-muted-until.json

# Unmute
pup monitors update 12345 --file monitor-unmuted.json
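
The three JSON files referenced above are not shown in this document. One possible shape, assuming pup forwards the file to the Datadog Monitors API and that the legacy `options.silenced` map is still honored (each scope maps to a POSIX end timestamp, or null for an open-ended mute), is sketched below. The helper names are hypothetical.

```python
import json
import time
from typing import Optional


def mute_payload(end_ts: Optional[int] = None, scope: str = "*") -> dict:
    """Mute `scope` until `end_ts` (POSIX seconds); None means no end time."""
    return {"options": {"silenced": {scope: end_ts}}}


def mute_for(seconds: int, now: Optional[int] = None) -> dict:
    """Mute every scope for `seconds`, counted from `now` (default: current time)."""
    start = int(time.time()) if now is None else now
    return mute_payload(start + seconds)


def unmute_payload() -> dict:
    """An empty `silenced` map is assumed to clear all mutes."""
    return {"options": {"silenced": {}}}


if __name__ == "__main__":
    print(json.dumps(mute_for(3600, now=1_700_000_000)))
    print(json.dumps(unmute_payload()))
```

Whether a partial body like this is enough, or the full monitor definition has to be sent on update, is worth checking against pup's actual behavior before relying on it.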

⚠️ Monitor Creation Best Practices

1. Avoid Alert Fatigue

| Rule | Why |
|------|-----|
| No flapping alerts | Use last_Xm, not last_1m |
| Meaningful thresholds | Based on SLOs, not guesses |
| Actionable alerts | If no action needed, don't alert |
| Include runbook | @runbook-url in message |

# WRONG - will flap constantly
query = "avg(last_1m):avg:system.cpu.user{*} > 50"  # ❌ Too sensitive

# CORRECT - stable alerting
query = "avg(last_5m):avg:system.cpu.user{env:prod} by {host} > 80"  # ✅ Reasonable window

2. Use Proper Scoping

# WRONG - alerts on everything
query = "avg(last_5m):avg:system.cpu.user{*} > 80"  # ❌ No scope

# CORRECT - scoped to what matters
query = "avg(last_5m):avg:system.cpu.user{env:prod,service:api} by {host} > 80"  # ✅
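
The two rules so far (use a longer evaluation window, and scope the query) can be checked mechanically before a monitor is created. The following is a rough sketch of such a check, not part of pup; `lint_query` is a hypothetical helper, and the 5-minute minimum is an assumption taken from the examples above.

```python
import re

# Matches the leading "avg(last_5m):" style evaluation window of a query.
WINDOW_RE = re.compile(r"^\s*\w+\(last_(\d+)([mhd])\):")
UNIT_MINUTES = {"m": 1, "h": 60, "d": 1440}


def lint_query(query: str, min_window_minutes: int = 5) -> list[str]:
    """Return human-readable warnings for a monitor query string."""
    warnings = []
    match = WINDOW_RE.match(query)
    if match is None:
        warnings.append("no evaluation window found")
    else:
        minutes = int(match.group(1)) * UNIT_MINUTES[match.group(2)]
        if minutes < min_window_minutes:
            warnings.append(
                f"window of {minutes}m is shorter than {min_window_minutes}m"
            )
    # A bare {*} scope means the monitor evaluates every reporting source.
    if "{*}" in query:
        warnings.append("query is unscoped ({*})")
    return warnings


if __name__ == "__main__":
    for w in lint_query("avg(last_1m):avg:system.cpu.user{*} > 50"):
        print(w)
```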

3. Set Recovery Thresholds

monitor = {
    "query": "avg(last_5m):avg:system.cpu.user{env:prod} > 80",
    "options": {
        "thresholds": {
            "critical": 80,
            "critical_recovery": 70,  # ✅ Prevents flapping
            "warning": 60,
            "warning_recovery": 50
        }
    }
}
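
A recovery threshold only prevents flapping if it sits on the "safe" side of its trigger threshold. As a sketch, assuming a simple `>` or `<` comparison in the query, the relationship can be validated like this; `check_recovery_thresholds` is a hypothetical helper.

```python
def check_recovery_thresholds(thresholds: dict, comparator: str = ">") -> list[str]:
    """Return problems found in a thresholds dict for a '>' or '<' monitor."""
    if comparator not in (">", "<"):
        raise ValueError("comparator must be '>' or '<'")
    # For '>' monitors a recovery value must sit below its trigger;
    # for '<' monitors it must sit above it.
    sign = 1 if comparator == ">" else -1
    problems = []
    for level in ("critical", "warning"):
        trigger = thresholds.get(level)
        recovery = thresholds.get(f"{level}_recovery")
        if trigger is None:
            continue
        if recovery is None:
            problems.append(f"{level} has no recovery threshold")
        elif sign * (trigger - recovery) <= 0:
            problems.append(f"{level}_recovery must be on the safe side of {level}")
    critical = thresholds.get("critical")
    warning = thresholds.get("warning")
    if critical is not None and warning is not None and sign * (critical - warning) <= 0:
        problems.append("warning must trigger before critical")
    return problems
```

Applied to the thresholds in the example above (80/70 and 60/50 with a `>` query), the function returns an empty list.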

4. Include Context in Messages

message = """
## High CPU Alert

Host: {{host.name}}
Current Value: {{value}}
Threshold: {{threshold}}

### Runbook
1. Check top processes: `ssh {{host.name}} 'top -bn1 | head -20'`
2. Check recent deploys
3. Scale if needed

@slack-ops @pagerduty-oncall
"""

⚠️ NEVER Delete Monitors Directly

Use the safe deletion workflow (the same one used for dashboards): rename the monitor to mark it instead of deleting it.

def safe_mark_monitor_for_deletion(monitor_id: str, client) -> bool:
    """Mark monitor instead of deleting."""
    monitor = client.get_monitor(monitor_id)
    name = monitor.get("name", "")
    
    if "[MARKED FOR DELETION]" in name:
        print(f"Already marked: {name}")
        return False
    
    new_name = f"[MARKED FOR DELETION] {name}"
    client.update_monitor(monitor_id, {"name": new_name})
    print(f"✓ Marked: {new_name}")
    return True
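
Marking by rename is only safe if it can be undone. A possible counterpart that restores the original name is sketched below. It assumes the same hypothetical `client` object as the function above (with `get_monitor` and `update_monitor` methods); `restore_marked_monitor` is not part of pup.

```python
MARK = "[MARKED FOR DELETION] "


def restore_marked_monitor(monitor_id: str, client) -> bool:
    """Strip the deletion marker from a monitor's name, if present."""
    monitor = client.get_monitor(monitor_id)
    name = monitor.get("name", "")

    if not name.startswith(MARK):
        print(f"Not marked: {name}")
        return False

    # Remove only the leading marker so the rest of the name is untouched.
    original = name[len(MARK):]
    client.update_monitor(monitor_id, {"name": original})
    print(f"Restored: {original}")
    return True
```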

Monitor Types

| Type | Use Case |
|------|----------|
| metric alert | CPU, memory, custom metrics |
| query alert | Complex metric queries |
| service check | Agent check status |
| event alert | Event stream patterns |
| log alert | Log pattern matching |
| composite | Combine multiple monitors |
| apm | APM metrics |

Audit Monitors

# Find monitors without owners
pup monitors list | jq '.[] | select(.tags | contains(["team:"]) | not) | {id, name}'

# List the 10 monitors whose state changed most recently (a rough proxy for noisy monitors; this is not an alert count)
pup monitors list | jq 'sort_by(.overall_state_modified) | reverse | .[:10] | .[] | {id, name, status: .overall_state}'

Downtime vs Muting

| Use | When |
|-----|------|
| Mute monitor | Quick one-off, < 1 hour |
| Downtime | Scheduled maintenance, recurring |

# Downtime (preferred)
pup downtime create --file downtime.json
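
The contents of `downtime.json` are not shown in this document. As a sketch, assuming pup passes the file through in the shape of the Datadog Downtimes v2 API request body, a one-time downtime could be built as follows. The scope, tags, dates, and message are placeholders, and `one_time_downtime` is a hypothetical helper.

```python
import json
from datetime import datetime, timedelta, timezone


def one_time_downtime(scope: str, monitor_tags: list[str], start: datetime,
                      duration: timedelta, message: str) -> dict:
    """Build a one-time downtime body in the Datadog Downtimes v2 shape."""
    end = start + duration
    return {
        "data": {
            "type": "downtime",
            "attributes": {
                # Assumption: pup expects the raw API body, unchanged.
                "scope": scope,
                "monitor_identifier": {"monitor_tags": monitor_tags},
                "schedule": {"start": start.isoformat(), "end": end.isoformat()},
                "message": message,
            },
        }
    }


if __name__ == "__main__":
    start = datetime(2030, 1, 5, 2, 0, tzinfo=timezone.utc)
    body = one_time_downtime("env:prod", ["service:api"], start,
                             timedelta(hours=2), "Planned database maintenance")
    print(json.dumps(body, indent=2))
```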

Failure Handling

| Problem | Fix |
|---------|-----|
| Alert not firing | Check query returns data, thresholds |
| Too many alerts | Increase window, add recovery threshold |
| No data alerts | Check agent connectivity, metric exists |
| Auth error | pup auth refresh |
