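Docker Health Checks
Health checks are an important mechanism for confirming that a container is actually working, not merely running. This document covers how to configure Docker health checks, best practices, and troubleshooting.
🏥 Health Check Overview
What Is a Health Check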
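A health check is a mechanism used for:
- Monitoring container state: periodically checking whether the container is working properly
- Automatic recovery: letting an orchestrator or watchdog restart or replace a container once it is reported unhealthy (standalone Docker Engine only records the status and does not restart the container by itself)
- Load balancing: letting a load balancer stop routing traffic to unhealthy containers
- Service discovery: ensuring that only healthy services are discovered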
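Health States
A Docker container with a health check configured is in one of three health states:
- healthy: the most recent health check passed
- unhealthy: the health check failed for the configured number of consecutive retries
- starting: the container is still starting up and no check has passed yet; failures inside the start period do not count toward the retry limit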
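The transitions between these three states can be sketched as a small simulation. This is a simplified model of the behavior described in the Dockerfile reference, not Docker's actual implementation; the names `Probe` and `health_statuses` are made up for this illustration.

```python
from dataclasses import dataclass


@dataclass
class Probe:
    at_s: float   # seconds since container start when the probe began
    passed: bool  # True if the check command exited with code 0


def health_statuses(probes, retries=3, start_period_s=0.0):
    """Return the health status after each probe (simplified model)."""
    status = "starting"
    failing_streak = 0
    history = []
    for probe in probes:
        if probe.passed:
            # Any success resets the streak and marks the container healthy.
            failing_streak = 0
            status = "healthy"
        else:
            # A failure is ignored only while the container has never passed
            # a check AND is still inside the start period.
            in_grace = status == "starting" and probe.at_s < start_period_s
            if not in_grace:
                failing_streak += 1
                if failing_streak >= retries:
                    status = "unhealthy"
        history.append(status)
    return history


if __name__ == "__main__":
    probes = [Probe(30, False), Probe(60, True), Probe(90, False),
              Probe(120, False), Probe(150, False), Probe(180, True)]
    print(health_statuses(probes, retries=3, start_period_s=60))
```

Note that a single passing probe is enough to move an unhealthy container back to healthy, while it takes `retries` consecutive failures to move it the other way.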
🔧 Basic Health Checks
Defining a Health Check in a Dockerfile
dockerfile
# Basic HTTP health check
# (each FROM below starts a separate example Dockerfile)
FROM nginx:alpine
HEALTHCHECK --interval=30s --timeout=3s --start-period=5s --retries=3 \
  CMD curl -f http://localhost/ || exit 1

# Node.js application health check
FROM node:16-alpine
WORKDIR /app
COPY package*.json ./
RUN npm install
COPY . .
HEALTHCHECK --interval=30s --timeout=10s --start-period=60s --retries=3 \
  CMD node healthcheck.js || exit 1
EXPOSE 3000
CMD ["npm", "start"]

# Database health check
FROM postgres:15
HEALTHCHECK --interval=30s --timeout=5s --start-period=30s --retries=3 \
  CMD pg_isready -U postgres || exit 1

# Custom script health check
FROM alpine:latest
# health-check.sh uses #!/bin/bash, which Alpine does not ship by default;
# any other tools the script calls (e.g. pg_isready, redis-cli) must be installed as well
RUN apk add --no-cache bash curl
COPY health-check.sh /usr/local/bin/
RUN chmod +x /usr/local/bin/health-check.sh
HEALTHCHECK --interval=30s --timeout=10s --retries=3 \
  CMD /usr/local/bin/health-check.sh
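Images that ship Python but not curl can run the HTTP probe from a small script instead. The following is a minimal sketch using only the standard library; the file name `healthcheck.py` and the function name `probe` are assumptions for this illustration, not part of the examples above.

```python
import sys
import urllib.request


def probe(url: str, timeout_s: float = 3.0) -> int:
    """Return 0 if the URL answers with a 2xx status, otherwise 1."""
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as resp:
            return 0 if 200 <= resp.status < 300 else 1
    except Exception:
        # Connection refused, timeout, HTTP 4xx/5xx, malformed URL, etc.
        # A health check treats every error as a failed probe.
        return 1


if __name__ == "__main__" and len(sys.argv) > 1:
    # Docker treats exit code 0 as healthy and exit code 1 as unhealthy.
    sys.exit(probe(sys.argv[1]))
```

Under these assumptions, the Dockerfile line would look like `HEALTHCHECK CMD python healthcheck.py http://localhost:3000/health`.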
Runtime Health Checks
bash
# Add a health check when running a container
docker run -d --name web \
  --health-cmd="curl -f http://localhost/ || exit 1" \
  --health-interval=30s \
  --health-timeout=10s \
  --health-retries=3 \
  --health-start-period=60s \
  nginx

# Disable the health check
docker run -d --name app --no-healthcheck my-app

# View container health status
docker ps
docker inspect web | grep -A 10 Health
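When the inspect output is consumed by a script rather than read by eye, parsing the JSON tends to be more robust than grepping it. A minimal sketch follows, assuming the output of `docker inspect web` is passed in as a string; the function name `summarize_health` is made up for this illustration.

```python
import json


def summarize_health(inspect_output: str) -> dict:
    """Summarize the State.Health section of `docker inspect <container>` output."""
    container = json.loads(inspect_output)[0]  # inspect prints a JSON array
    health = container.get("State", {}).get("Health")
    if health is None:
        # No health check is configured for this container.
        return {"status": "none", "failing_streak": 0,
                "last_exit_code": None, "last_output": ""}
    log = health.get("Log") or []
    last = log[-1] if log else {}
    return {
        "status": health.get("Status", "unknown"),
        "failing_streak": health.get("FailingStreak", 0),
        "last_exit_code": last.get("ExitCode"),
        "last_output": (last.get("Output") or "").strip(),
    }
```

The last log entry usually carries the most useful debugging detail: the exit code and the output of the most recent probe command.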
📋 Docker Compose Health Checks
Basic Configuration
yaml
version: '3.8'
services:
  web:
    image: nginx:alpine
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost"]
      interval: 30s
      timeout: 10s
      retries: 3
      start_period: 60s
    ports:
      - "80:80"

  api:
    image: my-api:latest
    healthcheck:
      test: ["CMD", "wget", "--quiet", "--tries=1", "--spider", "http://localhost:3000/health"]
      interval: 30s
      timeout: 5s
      retries: 3
      start_period: 40s
    ports:
      - "3000:3000"

  db:
    image: postgres:15
    environment:
      POSTGRES_PASSWORD: password
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U postgres"]
      interval: 30s
      timeout: 5s
      retries: 5
      start_period: 30s

  redis:
    image: redis:alpine
    healthcheck:
      test: ["CMD", "redis-cli", "ping"]
      interval: 30s
      timeout: 3s
      retries: 3
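The Compose `healthcheck` keys correspond closely to the `docker run --health-*` flags. The sketch below makes that mapping explicit; the helper name `compose_to_run_flags` is made up for this illustration, and because `--health-cmd` always runs through a shell, the exec-form `CMD` list is approximated here by shell-quoting its arguments.

```python
import shlex

# Compose healthcheck key -> docker run flag
_FLAG_FOR_KEY = {
    "interval": "--health-interval",
    "timeout": "--health-timeout",
    "retries": "--health-retries",
    "start_period": "--health-start-period",
}


def compose_to_run_flags(healthcheck: dict) -> list:
    """Translate a Compose healthcheck mapping into docker run arguments."""
    test = healthcheck.get("test")
    if test == ["NONE"] or healthcheck.get("disable"):
        return ["--no-healthcheck"]
    flags = []
    if isinstance(test, str):
        # A plain string is run through the shell, like CMD-SHELL.
        flags.append(f"--health-cmd={test}")
    elif test and test[0] == "CMD-SHELL":
        flags.append(f"--health-cmd={test[1]}")
    elif test and test[0] == "CMD":
        flags.append(f"--health-cmd={shlex.join(test[1:])}")
    for key, flag in _FLAG_FOR_KEY.items():
        if key in healthcheck:
            flags.append(f"{flag}={healthcheck[key]}")
    return flags


if __name__ == "__main__":
    redis_check = {"test": ["CMD", "redis-cli", "ping"],
                   "interval": "30s", "timeout": "3s", "retries": 3}
    print(compose_to_run_flags(redis_check))
```

The result is a list of separate arguments, so it can be passed to a process launcher without further quoting.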
Health-Check-Based Dependencies
yaml
services:
  web:
    image: nginx:alpine
    depends_on:
      api:
        condition: service_healthy
      db:
        condition: service_healthy
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost"]
      interval: 30s
      timeout: 10s
      retries: 3

  api:
    image: my-api:latest
    depends_on:
      db:
        condition: service_healthy
      redis:
        condition: service_healthy
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:3000/health"]
      interval: 30s
      timeout: 10s
      retries: 3

  db:
    image: postgres:15
    environment:
      POSTGRES_PASSWORD: password
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U postgres"]
      interval: 30s
      timeout: 5s
      retries: 5

  redis:
    image: redis:alpine
    healthcheck:
      test: ["CMD", "redis-cli", "ping"]
      interval: 30s
      timeout: 3s
      retries: 3
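The `depends_on` entries above form a dependency graph, and a valid startup order is any topological ordering of it. The sketch below illustrates only that ordering constraint with the standard-library `graphlib` module (Python 3.9+); it does not model how Compose starts independent services in parallel or waits for each one to become healthy.

```python
from graphlib import TopologicalSorter

# service -> services it waits for (taken from the depends_on entries above)
depends_on = {
    "web": {"api", "db"},
    "api": {"db", "redis"},
    "db": set(),
    "redis": set(),
}


def startup_order(deps):
    """Return one order in which every service starts after its dependencies."""
    return list(TopologicalSorter(deps).static_order())


if __name__ == "__main__":
    print(startup_order(depends_on))
```

`db` and `redis` have no dependencies, so their relative order is not fixed; `api` always comes after both, and `web` always comes last. A circular dependency makes `static_order()` raise `graphlib.CycleError`.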
🏗️ Application-Level Health Checks
HTTP Health Check Endpoints
javascript
// Node.js Express health checks
// Assumes `db` and `redis` are already-initialized clients and that
// checkExternalAPI() (not shown) follows the same pattern as the checks below
const express = require('express');
const app = express();

// Basic health check
app.get('/health', (req, res) => {
  res.status(200).json({
    status: 'healthy',
    timestamp: new Date().toISOString(),
    uptime: process.uptime()
  });
});

// Detailed health check
app.get('/health/detailed', async (req, res) => {
  const checks = {
    database: await checkDatabase(),
    redis: await checkRedis(),
    external_api: await checkExternalAPI()
  };
  const isHealthy = Object.values(checks).every(check => check.status === 'healthy');
  res.status(isHealthy ? 200 : 503).json({
    status: isHealthy ? 'healthy' : 'unhealthy',
    checks,
    timestamp: new Date().toISOString()
  });
});

async function checkDatabase() {
  try {
    await db.query('SELECT 1');
    return { status: 'healthy', message: 'Database connection OK' };
  } catch (error) {
    return { status: 'unhealthy', message: error.message };
  }
}

async function checkRedis() {
  try {
    await redis.ping();
    return { status: 'healthy', message: 'Redis connection OK' };
  } catch (error) {
    return { status: 'unhealthy', message: error.message };
  }
}
Python Flask Health Check
python
from flask import Flask, jsonify
import psycopg2
import redis
import requests
from datetime import datetime, timezone

app = Flask(__name__)

# check_external_service() is not shown in this example; it is assumed to
# follow the same pattern as the checks below (for instance using requests)


@app.route('/health')
def health_check():
    return jsonify({
        'status': 'healthy',
        # datetime.utcnow() is deprecated since Python 3.12
        'timestamp': datetime.now(timezone.utc).isoformat(),
        'service': 'my-api'
    })


@app.route('/health/detailed')
def detailed_health_check():
    checks = {
        'database': check_database(),
        'redis': check_redis(),
        'external_service': check_external_service()
    }
    is_healthy = all(check['status'] == 'healthy' for check in checks.values())
    return jsonify({
        'status': 'healthy' if is_healthy else 'unhealthy',
        'checks': checks,
        'timestamp': datetime.now(timezone.utc).isoformat()
    }), 200 if is_healthy else 503


def check_database():
    try:
        conn = psycopg2.connect(
            host="db",
            database="myapp",
            user="postgres",
            password="password"
        )
        conn.close()
        return {'status': 'healthy', 'message': 'Database connection OK'}
    except Exception as e:
        return {'status': 'unhealthy', 'message': str(e)}


def check_redis():
    try:
        r = redis.Redis(host='redis', port=6379, db=0)
        r.ping()
        return {'status': 'healthy', 'message': 'Redis connection OK'}
    except Exception as e:
        return {'status': 'unhealthy', 'message': str(e)}
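A detailed endpoint that runs its checks one after another can take longer than the probe's `timeout` when a single dependency hangs. One possible mitigation, sketched below with only the standard library, is to run the checks concurrently under a shared deadline; `run_checks` and its parameters are made up for this illustration and are not part of the Flask example above.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout


def run_checks(checks, timeout_s=2.0):
    """Run check functions concurrently; report late or failing ones as unhealthy.

    `checks` maps a name to a zero-argument function returning a dict such as
    {'status': 'healthy', 'message': '...'}.
    """
    pool = ThreadPoolExecutor(max_workers=max(1, len(checks)))
    futures = {name: pool.submit(fn) for name, fn in checks.items()}
    deadline = time.monotonic() + timeout_s
    results = {}
    for name, future in futures.items():
        remaining = max(0.0, deadline - time.monotonic())
        try:
            results[name] = future.result(timeout=remaining)
        except FutureTimeout:
            results[name] = {"status": "unhealthy", "message": "check timed out"}
        except Exception as exc:
            results[name] = {"status": "unhealthy", "message": str(exc)}
    # Do not block on checks that are still running past the deadline.
    pool.shutdown(wait=False)
    return results
```

A timed-out check keeps running in its worker thread until it returns, so each check should still carry its own connection or request timeout.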
Go Health Check
go
package main

import (
	"context" // required by checkRedis below
	"database/sql"
	"encoding/json"
	"net/http"
	"time"

	"github.com/go-redis/redis/v8"
	_ "github.com/lib/pq"
)

type HealthCheck struct {
	Status    string           `json:"status"`
	Timestamp time.Time        `json:"timestamp"`
	Checks    map[string]Check `json:"checks,omitempty"`
}

type Check struct {
	Status  string `json:"status"`
	Message string `json:"message"`
}

func healthHandler(w http.ResponseWriter, r *http.Request) {
	health := HealthCheck{
		Status:    "healthy",
		Timestamp: time.Now(),
	}
	w.Header().Set("Content-Type", "application/json")
	json.NewEncoder(w).Encode(health)
}

func detailedHealthHandler(db *sql.DB, rdb *redis.Client) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		checks := map[string]Check{
			"database": checkDatabase(db),
			"redis":    checkRedis(rdb),
		}
		isHealthy := true
		for _, check := range checks {
			if check.Status != "healthy" {
				isHealthy = false
				break
			}
		}
		status := "healthy"
		statusCode := http.StatusOK
		if !isHealthy {
			status = "unhealthy"
			statusCode = http.StatusServiceUnavailable
		}
		health := HealthCheck{
			Status:    status,
			Timestamp: time.Now(),
			Checks:    checks,
		}
		w.Header().Set("Content-Type", "application/json")
		w.WriteHeader(statusCode)
		json.NewEncoder(w).Encode(health)
	}
}

func checkDatabase(db *sql.DB) Check {
	if err := db.Ping(); err != nil {
		return Check{Status: "unhealthy", Message: err.Error()}
	}
	return Check{Status: "healthy", Message: "Database connection OK"}
}

func checkRedis(rdb *redis.Client) Check {
	if err := rdb.Ping(context.Background()).Err(); err != nil {
		return Check{Status: "unhealthy", Message: err.Error()}
	}
	return Check{Status: "healthy", Message: "Redis connection OK"}
}
🔍 Health Check Scripts
Shell Script Health Check
bash
#!/bin/bash
# health-check.sh
set -e

# Configuration
SERVICE_URL="http://localhost:3000"
DB_HOST="db"
DB_USER="postgres"
REDIS_HOST="redis"

# Check the HTTP service
check_http() {
  if curl -f -s --max-time 10 "$SERVICE_URL/health" > /dev/null; then
    echo "✅ HTTP service is healthy"
    return 0
  else
    echo "❌ HTTP service is unhealthy"
    return 1
  fi
}

# Check the database
check_database() {
  if pg_isready -h "$DB_HOST" -U "$DB_USER" > /dev/null 2>&1; then
    echo "✅ Database is healthy"
    return 0
  else
    echo "❌ Database is unhealthy"
    return 1
  fi
}

# Check Redis
check_redis() {
  if redis-cli -h "$REDIS_HOST" ping > /dev/null 2>&1; then
    echo "✅ Redis is healthy"
    return 0
  else
    echo "❌ Redis is unhealthy"
    return 1
  fi
}

# Check disk space
check_disk_space() {
  local usage=$(df / | awk 'NR==2 {print $5}' | sed 's/%//')
  if [ "$usage" -lt 90 ]; then
    echo "✅ Disk space is healthy ($usage%)"
    return 0
  else
    echo "❌ Disk space is critical ($usage%)"
    return 1
  fi
}

# Run all checks
main() {
  local failed=0
  check_http || failed=1
  check_database || failed=1
  check_redis || failed=1
  check_disk_space || failed=1
  if [ $failed -eq 0 ]; then
    echo "🎉 All health checks passed"
    exit 0
  else
    echo "💥 Some health checks failed"
    exit 1
  fi
}

main "$@"
📊 Monitoring Integration
Prometheus Health Check Metrics
javascript
// Node.js Prometheus integration
const client = require('prom-client');

// Health check metrics
const healthCheckGauge = new client.Gauge({
  name: 'health_check_status',
  help: 'Health check status (1 = healthy, 0 = unhealthy)',
  labelNames: ['service', 'check_type']
});

const healthCheckDuration = new client.Histogram({
  name: 'health_check_duration_seconds',
  help: 'Health check duration in seconds',
  labelNames: ['service', 'check_type']
});

// Health check wrapper
// Note: checkFunction must throw (or reject) on failure; a function that
// catches its own errors and returns a status object is always recorded as healthy
async function performHealthCheck(checkName, checkFunction) {
  const start = Date.now();
  try {
    await checkFunction();
    healthCheckGauge.labels('my-service', checkName).set(1);
    return { status: 'healthy' };
  } catch (error) {
    healthCheckGauge.labels('my-service', checkName).set(0);
    return { status: 'unhealthy', error: error.message };
  } finally {
    const duration = (Date.now() - start) / 1000;
    healthCheckDuration.labels('my-service', checkName).observe(duration);
  }
}

// Health check endpoint
app.get('/health', async (req, res) => {
  const checks = {
    database: await performHealthCheck('database', checkDatabase),
    redis: await performHealthCheck('redis', checkRedis)
  };
  const isHealthy = Object.values(checks).every(check => check.status === 'healthy');
  res.status(isHealthy ? 200 : 503).json({
    status: isHealthy ? 'healthy' : 'unhealthy',
    checks
  });
});
🚨 Failure Handling
Automatic Restart Policy
yaml
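services:
  api:
    image: my-api:latest
    # Restart policies react to the container exiting, not to its health status,
    # so an unhealthy container that keeps running is not restarted by this alone
    restart: unless-stopped
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:3000/health"]
      interval: 30s
      timeout: 10s
      retries: 3
      start_period: 60s
    # Under Docker Swarm, a task whose container becomes unhealthy is replaced
    deploy:
      restart_policy:
        condition: on-failure
        delay: 5s
        max_attempts: 3
        window: 120s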
Handling Health Check Failures
bash
#!/bin/bash
# health-monitor.sh
CONTAINER_NAME="my-app"
WEBHOOK_URL="https://hooks.slack.com/services/YOUR/SLACK/WEBHOOK"

while true; do
  # Check the container's health status
  HEALTH_STATUS=$(docker inspect --format='{{.State.Health.Status}}' "$CONTAINER_NAME" 2>/dev/null)
  if [ "$HEALTH_STATUS" = "unhealthy" ]; then
    echo "Container $CONTAINER_NAME is unhealthy, attempting restart..."
    # Send an alert
    curl -X POST -H 'Content-type: application/json' \
      --data "{\"text\":\"🚨 Container $CONTAINER_NAME is unhealthy and being restarted\"}" \
      "$WEBHOOK_URL"
    # Restart the container
    docker restart "$CONTAINER_NAME"
    # Wait for the restart to complete
    sleep 60
  fi
  sleep 30
done
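The monitor script above restarts the container every time it sees an unhealthy status, which can turn a persistent fault into an endless restart loop. One way to cap this, sketched here as a sliding-window limit, is to allow a restart only if few restarts happened recently. The function name and the default values are assumptions for this illustration and do not mirror the semantics of any Docker setting.

```python
def should_restart(restart_times, now_s, max_attempts=3, window_s=120.0):
    """Allow a restart only if fewer than max_attempts occurred in the last window_s seconds.

    restart_times holds the timestamps (in seconds) of earlier restarts.
    """
    recent = [t for t in restart_times if now_s - t < window_s]
    return len(recent) < max_attempts


if __name__ == "__main__":
    history = [0.0, 30.0, 60.0]
    print(should_restart(history, now_s=100.0))  # three restarts inside the window
    print(should_restart(history, now_s=130.0))  # the oldest restart has aged out
```

When the limit is hit, a watchdog would typically stop restarting and escalate the alert instead.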
🚀 Best Practices
1. Health Check Design Principles
dockerfile
# Lightweight check
HEALTHCHECK --interval=30s --timeout=3s --retries=3 \
  CMD curl -f http://localhost/ping || exit 1

# Avoid complex checks
# ❌ Bad example
HEALTHCHECK CMD complex-database-migration-check.sh
# ✅ Good example
HEALTHCHECK CMD curl -f http://localhost/health || exit 1
2. Layered Health Checks
yaml
services:
  app:
    healthcheck:
      # Basic check: is the service responding?
      test: ["CMD", "curl", "-f", "http://localhost:3000/ping"]
      interval: 10s
      timeout: 3s
      retries: 3

  # Detailed checks are performed by the monitoring system
  monitoring:
    image: my-monitoring
    command: ["monitor", "--detailed-checks"]
3. Progressive Health Checks
yaml
services:
  app:
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:3000/health"]
      interval: 30s
      timeout: 10s
      retries: 3
      start_period: 120s  # Give the application enough time to start
4. Environment-Specific Configuration
yaml
# Development: lenient health check
services:
  app:
    healthcheck:
      interval: 60s
      timeout: 30s
      retries: 5

# Production: strict health check (kept in a separate Compose file)
services:
  app:
    healthcheck:
      interval: 15s
      timeout: 5s
      retries: 2
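The practical difference between the two profiles is how long a failure can go unnoticed. Since the next probe is scheduled `interval` after the previous one completes, and a hanging probe only fails once `timeout` expires, a rough upper bound on detection time is `retries × (interval + timeout)`. The sketch below applies this estimate to the values above; the function name is made up, and the bound ignores the start period and scheduling overhead.

```python
def worst_case_detection_s(interval_s, timeout_s, retries):
    """Upper bound on seconds between a failure starting and 'unhealthy' being reported.

    Assumes every failing probe hangs until its timeout and that the next
    probe starts interval_s after the previous one completes.
    """
    return retries * (interval_s + timeout_s)


dev = worst_case_detection_s(interval_s=60, timeout_s=30, retries=5)
prod = worst_case_detection_s(interval_s=15, timeout_s=5, retries=2)
print(f"development: up to {dev}s, production: up to {prod}s")
```

With these numbers the development profile can take up to 450 seconds to flag a failure, while the production profile takes at most 40 seconds, at the cost of running the probe four times as often.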
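With a well-designed health check configuration, you can improve the availability of your containerized applications and give orchestration tooling the signal it needs for automatic recovery.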