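Docker Monitoring and Alerting

Container monitoring is key to keeping applications running reliably. This document covers the metrics to watch for Docker containers, the tools that collect them, and how to configure alerts.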

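📊 Monitoring Metrics Overview

Core Monitoring Metrics

| Metric type | Specific metrics | Description |
| --- | --- | --- |
| Resource usage | CPU, memory, disk, network | Container resource consumption |
| Container state | Running state, restart count | Container health |
| Application metrics | Response time, error rate, throughput | Application performance |
| System metrics | Load, file descriptors | System-level indicators |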

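Docker Native Monitoring Commands

bash
# Show a live stream of resource usage for running containers
docker stats

# Show statistics for a specific container
docker stats container-name

# Show all containers (including stopped ones)
docker stats --all

# Print a single snapshot instead of streaming updates
docker stats --no-stream

# Format the output as a table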
docker stats --format "table {{.Container}}\t{{.CPUPerc}}\t{{.MemUsage}}\t{{.NetIO}}"
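
For scripted checks, the same data can be emitted as one JSON object per container with `docker stats --no-stream --format '{{json .}}'`. The sketch below shows one way to turn such a line into numbers. The sample line is made up for illustration, and the helper name `parseStatsLine` is hypothetical.

```javascript
// Parse one line of `docker stats --no-stream --format '{{json .}}'` output.
// Assumption: the object carries string fields such as Name, CPUPerc and
// MemPerc (for example "12.50%"). The sample below is illustrative only.
function parseStatsLine(line) {
  const raw = JSON.parse(line);
  const toNumber = (pct) => Number.parseFloat(String(pct).replace('%', ''));
  return {
    name: raw.Name,
    cpuPercent: toNumber(raw.CPUPerc),
    memPercent: toNumber(raw.MemPerc),
  };
}

const sample = '{"Name":"web","CPUPerc":"12.50%","MemPerc":"3.20%"}';
const stats = parseStatsLine(sample);
console.log(stats);
```

Numeric values make threshold comparisons straightforward, for example `stats.cpuPercent > 80`.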

🔧 Prometheus Monitoring

Prometheus Deployment

yaml
# docker-compose.monitoring.yml
version: '3.8'

services:
  prometheus:
    image: prom/prometheus:latest
    container_name: prometheus
    ports:
      - "9090:9090"
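    volumes:
      - ./prometheus/prometheus.yml:/etc/prometheus/prometheus.yml
      - ./prometheus/rules:/etc/prometheus/rules
      - prometheus-data:/prometheus
      # Needed by the docker_sd_configs job; the Prometheus user must be able to read the socket
      - /var/run/docker.sock:/var/run/docker.sock:ro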
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.path=/prometheus'
      - '--web.console.libraries=/etc/prometheus/console_libraries'
      - '--web.console.templates=/etc/prometheus/consoles'
      - '--storage.tsdb.retention.time=200h'
      - '--web.enable-lifecycle'
      - '--web.enable-admin-api'
    
  node-exporter:
    image: prom/node-exporter:latest
    container_name: node-exporter
    ports:
      - "9100:9100"
    volumes:
      - /proc:/host/proc:ro
      - /sys:/host/sys:ro
      - /:/rootfs:ro
    command:
      - '--path.procfs=/host/proc'
      - '--path.rootfs=/rootfs'
      - '--path.sysfs=/host/sys'
      - '--collector.filesystem.mount-points-exclude=^/(sys|proc|dev|host|etc)($$|/)'
    
  cadvisor:
    image: gcr.io/cadvisor/cadvisor:latest
    container_name: cadvisor
    ports:
      - "8080:8080"
    volumes:
      - /:/rootfs:ro
      - /var/run:/var/run:ro
      - /sys:/sys:ro
      - /var/lib/docker/:/var/lib/docker:ro
      - /dev/disk/:/dev/disk:ro
    privileged: true
    devices:
      - /dev/kmsg

volumes:
  prometheus-data:

Prometheus Configuration

yaml
# prometheus/prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s

rule_files:
  - "rules/*.yml"

scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']
  
  - job_name: 'node-exporter'
    static_configs:
      - targets: ['node-exporter:9100']
  
  - job_name: 'cadvisor'
    static_configs:
      - targets: ['cadvisor:8080']
  
  - job_name: 'docker-containers'
    docker_sd_configs:
      - host: unix:///var/run/docker.sock
        refresh_interval: 5s
    relabel_configs:
      - source_labels: [__meta_docker_container_name]
        target_label: container_name
      - source_labels: [__meta_docker_container_label_monitoring]
        regex: "true"
        action: keep

alerting:
  alertmanagers:
    - static_configs:
        - targets:
          - alertmanager:9093
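
The `docker-containers` job keeps only the discovered containers whose `monitoring` label equals `true`. As a rough model of that `keep` action (Prometheus anchors relabel regexes so they must match the whole value), the sketch below filters a list of label sets. The target data and the helper name `keepTargets` are hypothetical.

```javascript
// Simplified model of a Prometheus `keep` relabel rule with one source label.
// The regex is wrapped in ^(?:...)$ because relabel regexes are fully anchored.
function keepTargets(targets, sourceLabel, pattern) {
  const anchored = new RegExp(`^(?:${pattern})$`);
  return targets.filter((labels) => anchored.test(labels[sourceLabel] ?? ''));
}

// Hypothetical discovery results: one labelled container, one unlabelled,
// and one whose label value only partially matches.
const discovered = [
  { container_name: 'web', __meta_docker_container_label_monitoring: 'true' },
  { container_name: 'db' },
  { container_name: 'cache', __meta_docker_container_label_monitoring: 'truthy' },
];

const kept = keepTargets(discovered, '__meta_docker_container_label_monitoring', 'true');
console.log(kept.map((t) => t.container_name));
```

Only `web` survives: the unlabelled container has an empty value, and `truthy` fails the anchored match.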

Alerting Rules

yaml
# prometheus/rules/docker-alerts.yml
groups:
- name: docker-alerts
  rules:
  - alert: ContainerDown
    # `up` is 0 when Prometheus fails to scrape a target (an exporter or application endpoint)
    expr: up == 0
    for: 1m
    labels:
      severity: critical
    annotations:
      summary: "Container {{ $labels.instance }} is down"
      description: "Container {{ $labels.instance }} has been down for more than 1 minute."
  
  - alert: HighCPUUsage
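    # Sum per container and skip cgroups without a name; 100% equals one full CPU core
    expr: sum by (name) (rate(container_cpu_usage_seconds_total{name!=""}[5m])) * 100 > 80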
    for: 2m
    labels:
      severity: warning
    annotations:
      summary: "High CPU usage on {{ $labels.name }}"
      description: "Container {{ $labels.name }} CPU usage is above 80% for more than 2 minutes."
  
  - alert: HighMemoryUsage
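    # Containers without a memory limit report a limit of 0, so exclude them to avoid dividing by zero
    expr: (container_memory_usage_bytes{name!=""} / (container_spec_memory_limit_bytes{name!=""} > 0)) * 100 > 90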
    for: 2m
    labels:
      severity: critical
    annotations:
      summary: "High memory usage on {{ $labels.name }}"
      description: "Container {{ $labels.name }} memory usage is above 90% for more than 2 minutes."
  
  - alert: ContainerRestartTooOften
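    # container_start_time_seconds is a gauge holding a timestamp, so count how often it changes
    expr: changes(container_start_time_seconds{name!=""}[1h]) > 5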
    for: 0m
    labels:
      severity: warning
    annotations:
      summary: "Container restarting too often"
      description: "Container {{ $labels.name }} has restarted {{ $value }} times in the last hour."
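
The `HighCPUUsage` rule relies on `rate()` over a counter of CPU seconds. As a simplified illustration of that arithmetic, the sketch below computes a percentage from two samples of the counter. It ignores the counter-reset handling and extrapolation that Prometheus performs, and the sample values and the helper name `cpuPercent` are hypothetical.

```javascript
// Simplified model of rate(counter[window]) * 100 using only the first and
// last sample in the window. Real Prometheus also handles counter resets.
function cpuPercent(first, last) {
  const elapsed = last.timestamp - first.timestamp;
  if (elapsed <= 0) {
    throw new RangeError('samples must be in increasing time order');
  }
  const cpuSecondsUsed = last.cpuSeconds - first.cpuSeconds;
  return (100 * cpuSecondsUsed) / elapsed;
}

// Hypothetical samples taken 300 seconds (5 minutes) apart.
const percent = cpuPercent(
  { timestamp: 0, cpuSeconds: 1200 },
  { timestamp: 300, cpuSeconds: 1470 },
);
console.log(percent); // 90
```

A container that consumed 270 CPU seconds in 300 seconds of wall-clock time is at 90% of one core, which would exceed the 80% threshold.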

📈 Grafana Visualization

Grafana Deployment

yaml
# Add under `services:` in docker-compose.monitoring.yml
  grafana:
    image: grafana/grafana:latest
    container_name: grafana
    ports:
      - "3000:3000"
    environment:
      - GF_SECURITY_ADMIN_USER=admin
      # Placeholder password; replace it before exposing Grafana
      - GF_SECURITY_ADMIN_PASSWORD=admin123
      - GF_USERS_ALLOW_SIGN_UP=false
    volumes:
      - grafana-data:/var/lib/grafana
      - ./grafana/provisioning:/etc/grafana/provisioning
      - ./grafana/dashboards:/var/lib/grafana/dashboards
    depends_on:
      - prometheus

# Merge into the existing top-level `volumes:` key instead of declaring it twice
volumes:
  grafana-data:

Grafana Data Source Configuration

yaml
# grafana/provisioning/datasources/prometheus.yml
apiVersion: 1

datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus:9090
    isDefault: true
    editable: true

Docker Monitoring Dashboard

json
{
  "dashboard": {
    "id": null,
    "title": "Docker Container Monitoring",
    "tags": ["docker"],
    "timezone": "browser",
    "panels": [
      {
        "id": 1,
        "title": "Container CPU Usage",
        "type": "graph",
        "targets": [
          {
            "expr": "rate(container_cpu_usage_seconds_total{name!=\"\"}[5m]) * 100",
            "legendFormat": "{{ name }}"
          }
        ],
        "yAxes": [
          {
            "label": "CPU %",
            "max": 100,
            "min": 0
          }
        ]
      },
      {
        "id": 2,
        "title": "Container Memory Usage",
        "type": "graph",
        "targets": [
          {
            "expr": "container_memory_usage_bytes{name!=\"\"}",
            "legendFormat": "{{ name }}"
          }
        ],
        "yAxes": [
          {
            "label": "Bytes"
          }
        ]
      }
    ],
    "time": {
      "from": "now-1h",
      "to": "now"
    },
    "refresh": "5s"
  }
}

🚨 Alertmanager Alerts

Alertmanager Deployment

yaml
# Add under `services:` in docker-compose.monitoring.yml
  alertmanager:
    image: prom/alertmanager:latest
    container_name: alertmanager
    ports:
      - "9093:9093"
    volumes:
      - ./alertmanager/alertmanager.yml:/etc/alertmanager/alertmanager.yml
      - alertmanager-data:/alertmanager
    command:
      - '--config.file=/etc/alertmanager/alertmanager.yml'
      - '--storage.path=/alertmanager'
      - '--web.external-url=http://localhost:9093'

# Merge into the existing top-level `volumes:` key instead of declaring it twice
volumes:
  alertmanager-data:

Alertmanager Configuration

yaml
# alertmanager/alertmanager.yml
global:
  smtp_smarthost: 'smtp.gmail.com:587'
  smtp_from: 'alerts@example.com'
  smtp_auth_username: 'alerts@example.com'
  smtp_auth_password: 'your-app-password'

route:
  group_by: ['alertname']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 1h
  receiver: 'web.hook'
  routes:
  - match:
      severity: critical
    receiver: 'critical-alerts'
  - match:
      severity: warning
    receiver: 'warning-alerts'

receivers:
- name: 'web.hook'
  webhook_configs:
  - url: 'http://webhook-server:5000/alerts'

- name: 'critical-alerts'
  email_configs:
  - to: 'admin@example.com'
    # The subject is set through `headers`, and the plain-text body through `text`
    headers:
      Subject: 'CRITICAL: {{ .GroupLabels.alertname }}'
    text: |
      {{ range .Alerts }}
      Alert: {{ .Annotations.summary }}
      Description: {{ .Annotations.description }}
      {{ end }}
  slack_configs:
  - api_url: 'https://hooks.slack.com/services/YOUR/SLACK/WEBHOOK'
    channel: '#critical-alerts'
    title: 'Critical Alert'
    text: '{{ range .Alerts }}{{ .Annotations.summary }}{{ end }}'

- name: 'warning-alerts'
  email_configs:
  - to: 'team@example.com'
    # The subject is set through `headers`, and the plain-text body through `text`
    headers:
      Subject: 'WARNING: {{ .GroupLabels.alertname }}'
    text: |
      {{ range .Alerts }}
      Alert: {{ .Annotations.summary }}
      Description: {{ .Annotations.description }}
      {{ end }}

inhibit_rules:
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['alertname', 'dev', 'instance']
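
The `route` block sends each alert to the first child route whose labels match, and falls back to the top-level receiver otherwise. The sketch below models that decision for the routes in this configuration. It is a simplification that ignores `continue`, regex matchers, grouping, and inhibition, and the helper name `pickReceiver` is hypothetical.

```javascript
// Routing tree copied from the configuration above (timing options omitted).
const route = {
  receiver: 'web.hook',
  routes: [
    { match: { severity: 'critical' }, receiver: 'critical-alerts' },
    { match: { severity: 'warning' }, receiver: 'warning-alerts' },
  ],
};

// Walk the tree: descend into the first child whose `match` labels all equal
// the alert's labels; if no child matches, use the current node's receiver.
function pickReceiver(node, labels) {
  for (const child of node.routes ?? []) {
    const matches = Object.entries(child.match ?? {}).every(
      ([key, value]) => labels[key] === value,
    );
    if (matches) {
      return pickReceiver(child, labels);
    }
  }
  return node.receiver;
}

console.log(pickReceiver(route, { alertname: 'HighMemoryUsage', severity: 'critical' }));
console.log(pickReceiver(route, { alertname: 'Custom', severity: 'info' }));
```

An alert with `severity: info` matches neither child route, so it lands on the default `web.hook` receiver.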

📱 Application Performance Monitoring (APM)

Jaeger Distributed Tracing

yaml
# docker-compose.apm.yml
services:
  jaeger:
    image: jaegertracing/all-in-one:latest
    container_name: jaeger
    ports:
      - "16686:16686"
      - "14268:14268"
    environment:
      - COLLECTOR_OTLP_ENABLED=true

  app:
    image: my-app
    environment:
      - JAEGER_AGENT_HOST=jaeger
      - JAEGER_AGENT_PORT=6831
    depends_on:
      - jaeger

Application Metrics Collection

javascript
// Expose Prometheus metrics from a Node.js application
const client = require('prom-client');
const express = require('express');

const app = express();

// Define the metrics
const httpRequestDuration = new client.Histogram({
  name: 'http_request_duration_seconds',
  help: 'Duration of HTTP requests in seconds',
  labelNames: ['method', 'route', 'status_code']
});

const httpRequestTotal = new client.Counter({
  name: 'http_requests_total',
  help: 'Total number of HTTP requests',
  labelNames: ['method', 'route', 'status_code']
});

// Middleware that records the duration and count of every request
app.use((req, res, next) => {
  const start = Date.now();
  
  res.on('finish', () => {
    const duration = (Date.now() - start) / 1000;
    httpRequestDuration
      .labels(req.method, req.route?.path || req.path, res.statusCode)
      .observe(duration);
    
    httpRequestTotal
      .labels(req.method, req.route?.path || req.path, res.statusCode)
      .inc();
  });
  
  next();
});

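// Metrics endpoint (register.metrics() returns a Promise in recent prom-client versions)
app.get('/metrics', async (req, res) => {
  res.set('Content-Type', client.register.contentType);
  res.end(await client.register.metrics());
});

// Port 3000 matches the metrics.port label used in the best-practices section
app.listen(3000);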
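
A histogram such as `http_request_duration_seconds` is exported as cumulative bucket counters plus a running sum and count. The dependency-free sketch below illustrates that bookkeeping. It assumes the bucket boundaries that prom-client uses by default, and the helper name `observeAll` and the sample durations are hypothetical.

```javascript
// Bucket upper bounds in seconds (assumed to mirror prom-client's defaults).
const DEFAULT_BUCKETS = [0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10];

// Record durations into cumulative buckets: an observation increments every
// bucket whose upper bound is greater than or equal to the observed value.
function observeAll(durations, bounds = DEFAULT_BUCKETS) {
  const counts = new Map(bounds.map((bound) => [String(bound), 0]));
  counts.set('+Inf', 0);
  let sum = 0;
  for (const duration of durations) {
    sum += duration;
    for (const bound of bounds) {
      if (duration <= bound) {
        counts.set(String(bound), counts.get(String(bound)) + 1);
      }
    }
    counts.set('+Inf', counts.get('+Inf') + 1); // +Inf always counts everything
  }
  return { counts, sum, count: durations.length };
}

const histogram = observeAll([0.02, 0.3, 0.3, 4]);
console.log(Object.fromEntries(histogram.counts));
```

Because the buckets are cumulative, the `0.5` bucket counts three of the four sample requests and `+Inf` counts all of them.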

🔍 Log Monitoring Integration

Loki + Promtail

yaml
# docker-compose.logging.yml
services:
  loki:
    image: grafana/loki:latest
    container_name: loki
    ports:
      - "3100:3100"
    volumes:
      - ./loki/loki-config.yml:/etc/loki/local-config.yaml
    command: -config.file=/etc/loki/local-config.yaml
    
  promtail:
    image: grafana/promtail:latest
    container_name: promtail
    volumes:
      - ./promtail/promtail-config.yml:/etc/promtail/config.yml
      - /var/log:/var/log:ro
      - /var/lib/docker/containers:/var/lib/docker/containers:ro
    command: -config.file=/etc/promtail/config.yml

Promtail Configuration

yaml
# promtail/promtail-config.yml
server:
  http_listen_port: 9080
  grpc_listen_port: 0

positions:
  filename: /tmp/positions.yaml

clients:
  - url: http://loki:3100/loki/api/v1/push

scrape_configs:
- job_name: containers
  static_configs:
  - targets:
      - localhost
    labels:
      job: containerlogs
      __path__: /var/lib/docker/containers/*/*log
  
  pipeline_stages:
  - json:
      expressions:
        output: log
        stream: stream
        # Extract `time` so the timestamp stage below has a source to read
        time: time
        attrs:
  - json:
      source: attrs
      expressions:
        tag:
  - regex:
      source: tag
      expression: (?P<container_name>(?:[^|]*))\|(?P<image_name>(?:[^|]*))\|(?P<image_id>(?:[^|]*))\|(?P<container_id>(?:[^|]*))
  - timestamp:
      format: RFC3339Nano
      source: time
  - labels:
      stream:
      container_name:
      image_name:
  - output:
      source: output
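
The `regex` stage splits a pipe-delimited `tag` attribute into four named fields. The sketch below reproduces that split in JavaScript, whose named-group syntax is `(?<name>...)` rather than `(?P<name>...)`. It adds `^` and `$` anchors that the original expression does not have, and the tag value and helper name `parseTag` are hypothetical; how the tag is attached to log lines is not shown in this document.

```javascript
// Same four fields as the Promtail regex stage, anchored for strictness.
const TAG_PATTERN =
  /^(?<container_name>[^|]*)\|(?<image_name>[^|]*)\|(?<image_id>[^|]*)\|(?<container_id>[^|]*)$/;

// Return the named fields, or null when the tag does not have four parts.
function parseTag(tag) {
  const match = TAG_PATTERN.exec(tag);
  return match ? { ...match.groups } : null;
}

// Illustrative tag value, not taken from real log output.
const tagFields = parseTag('web|nginx:1.25|sha256:abc123|f00dcafe');
console.log(tagFields);
console.log(parseTag('no-delimiters'));
```

In the Promtail pipeline, `container_name` and `image_name` from this split become labels, while the other two fields stay in the extracted data only.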

🛠️ Monitoring Scripts and Tools

Health Check Script

bash
#!/bin/bash
# health-check.sh

CONTAINERS=("web" "api" "db" "redis")
WEBHOOK_URL="https://hooks.slack.com/services/YOUR/SLACK/WEBHOOK"

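for container in "${CONTAINERS[@]}"; do
  # Compare against running Compose service names exactly; a plain `docker ps | grep`
  # would also match substrings, image names, and other columns
  if ! docker-compose ps --services --filter "status=running" | grep -qx -- "$container"; then
    echo "❌ Container $container is not running"
    
    # Send an alert
    curl -X POST -H 'Content-type: application/json' \
      --data "{\"text\":\"🚨 Container $container is down!\"}" \
      "$WEBHOOK_URL"
    
    # Try to restart it
    docker-compose restart "$container"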
  else
    echo "✅ Container $container is healthy"
  fi
done
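
When checking whether a container is running, exact name comparison avoids false positives from substring matches, for example `db` inside `mongodb`. The sketch below shows the difference on made-up data; the helper name `findMissing` and both name lists are hypothetical.

```javascript
// Return the expected names that are absent from the running list,
// using exact comparison rather than substring search.
function findMissing(expected, running) {
  const runningSet = new Set(running);
  return expected.filter((name) => !runningSet.has(name));
}

const expected = ['web', 'api', 'db', 'redis'];
const running = ['web', 'api', 'mongodb', 'redis']; // illustrative data

const missing = findMissing(expected, running);
console.log(missing); // [ 'db' ]

// A substring test would wrongly report `db` as present:
const substringHit = running.some((name) => name.includes('db'));
console.log(substringHit); // true
```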

Resource Monitoring Script

bash
#!/bin/bash
# resource-monitor.sh

# Thresholds (percent)
CPU_THRESHOLD=80
MEMORY_THRESHOLD=90
DISK_THRESHOLD=85  # Defined for completeness; no disk check is implemented below

# Check CPU usage
check_cpu() {
  local container=$1
  local cpu_usage
  cpu_usage=$(docker stats --no-stream --format "{{.CPUPerc}}" "$container" | sed 's/%//')
  
  if (( $(echo "$cpu_usage > $CPU_THRESHOLD" | bc -l) )); then
    echo "⚠️  High CPU usage: $container ($cpu_usage%)"
    return 1
  fi
  return 0
}

# Check memory usage
check_memory() {
  local container=$1
  local mem_usage
  mem_usage=$(docker stats --no-stream --format "{{.MemPerc}}" "$container" | sed 's/%//')
  
  if (( $(echo "$mem_usage > $MEMORY_THRESHOLD" | bc -l) )); then
    echo "⚠️  High memory usage: $container ($mem_usage%)"
    return 1
  fi
  return 0
}

# Check every running container
for container in $(docker ps --format "{{.Names}}"); do
  check_cpu "$container"
  check_memory "$container"
done

🚀 Best Practices

1. Choosing What to Monitor

yaml
# Label the service for discovery and give it a health check
services:
  app:
    image: my-app
    labels:
      - "monitoring=true"
      - "metrics.port=3000"
      - "metrics.path=/metrics"
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:3000/health"]
      interval: 30s
      timeout: 10s
      retries: 3

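2. Alert Severity Levels

yaml
# Alert severity levels
- alert: ServiceDown
  expr: up == 0
  labels:
    severity: critical
    
- alert: HighLatency
  # 95th-percentile latency, computed from the histogram's _bucket series
  expr: histogram_quantile(0.95, sum by (le) (rate(http_request_duration_seconds_bucket[5m]))) > 1
  labels:
    severity: warning
    
- alert: DiskSpaceLow
  # Free space as a percentage, from node-exporter filesystem metrics
  expr: (node_filesystem_avail_bytes / node_filesystem_size_bytes) * 100 < 10
  labels:
    severity: critical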
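
Latency thresholds on a Prometheus histogram are usually expressed through `histogram_quantile`, which estimates a percentile from cumulative bucket counts by interpolating linearly inside the bucket that contains the target rank. The sketch below is a simplified model of that estimate, not the exact Prometheus implementation; the bucket data and the helper name `histogramQuantile` are hypothetical.

```javascript
// Estimate quantile q (0..1) from cumulative buckets sorted by upper bound
// `le`, where the last bucket has le === Infinity. Simplified on purpose.
function histogramQuantile(q, buckets) {
  if (q < 0 || q > 1) throw new RangeError('q must be between 0 and 1');
  if (buckets.length < 2) return NaN;
  const total = buckets[buckets.length - 1].count;
  if (total === 0) return NaN;

  const rank = q * total;
  const i = buckets.findIndex((bucket) => bucket.count >= rank);
  // A rank inside the +Inf bucket is reported as the highest finite bound.
  if (buckets[i].le === Infinity) return buckets[i - 1].le;

  const lowerBound = i === 0 ? 0 : buckets[i - 1].le;
  const countBelow = i === 0 ? 0 : buckets[i - 1].count;
  const countInBucket = buckets[i].count - countBelow;
  if (countInBucket === 0) return lowerBound;
  // Assume observations are spread evenly within the bucket.
  return lowerBound + (buckets[i].le - lowerBound) * ((rank - countBelow) / countInBucket);
}

// Hypothetical cumulative counts for 100 requests.
const latencyBuckets = [
  { le: 0.5, count: 60 },
  { le: 1, count: 90 },
  { le: 2.5, count: 100 },
  { le: Infinity, count: 100 },
];

const p95 = histogramQuantile(0.95, latencyBuckets);
console.log(p95.toFixed(2)); // 1.75
```

With these counts the 95th request falls halfway through the 1 s to 2.5 s bucket, so the estimate is about 1.75 s and a `> 1` threshold would fire.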

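3. Monitoring Data Retention

yaml
# Prometheus data retention policy
# When both limits are set, whichever is reached first triggers cleanup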
command:
  - '--storage.tsdb.retention.time=30d'
  - '--storage.tsdb.retention.size=10GB'
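
To choose a size limit, a rough capacity estimate multiplies the retention period in seconds by the ingested samples per second and by the bytes stored per sample. The sketch below applies that arithmetic. The figure of 2 bytes per sample is a rough rule of thumb, and the series count and the helper name `estimateDiskBytes` are hypothetical.

```javascript
// Rough disk estimate: retention seconds * samples per second * bytes per sample.
// bytesPerSample = 2 is an assumed average, not a measured value.
function estimateDiskBytes({ retentionDays, activeSeries, scrapeIntervalSeconds, bytesPerSample = 2 }) {
  const retentionSeconds = retentionDays * 24 * 60 * 60;
  const samplesPerSecond = activeSeries / scrapeIntervalSeconds;
  return retentionSeconds * samplesPerSecond * bytesPerSample;
}

// Hypothetical workload: 15,000 active series scraped every 15 seconds for 30 days.
const estimatedBytes = estimateDiskBytes({
  retentionDays: 30,
  activeSeries: 15000,
  scrapeIntervalSeconds: 15,
});
console.log(`${(estimatedBytes / 1e9).toFixed(2)} GB`); // 5.18 GB
```

Under these assumptions the 30-day window fits within the 10GB size limit, so the time limit would be the one that triggers first.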

4. Performance Optimization

yaml
# Resource limits for the monitoring components
services:
  prometheus:
    deploy:
      resources:
        limits:
          memory: 2G
          cpus: '1.0'
        reservations:
          memory: 1G
          cpus: '0.5'

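With a well-rounded monitoring and alerting setup, you can detect and resolve problems in containerized applications promptly and keep the system running reliably.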