🌟 Hi, I'm 摘星!
🌈 In the rainbow-bright world of technology stacks, I'm the color collector who never stops.
🦋 Every optimization is a flower I've cultivated; every feature is a butterfly I've set free.
🔬 Every code review is an observation under my microscope; every refactor is one of my chemistry experiments.
🎵 In the symphony of programming I am both conductor and performer. Together, let's fill the concert hall of technology with a movement worthy of programmers.
As an engineer who has spent years in the cloud-native trenches, I know that CrashLoopBackOff is one of the most frustrating problems in Kubernetes operations. Behind this seemingly simple status often hide intricate technical details and failures at multiple layers. In my day-to-day work I have handled countless cases of Pods restarting over and over, and going from early panic to a calm, methodical routine taught me a great deal about lifecycle management for containerized applications.
CrashLoopBackOff means that a container in a Pod keeps crashing and being restarted; Kubernetes applies an exponential backoff to lengthen the interval between restarts and avoid wasting resources. The triggers, however, vary widely: a problem in the base environment baked into the image, a logic error in the application code, misconfigured resource limits, or a badly tuned health-check probe.
The most complex case I have dealt with involved a dependency chain in a microservice architecture. A seemingly independent service had a misconfigured database connection pool that made startup very slow, while our livenessProbe timeout was too short, creating a vicious cycle of restarts. Working through that incident made me realize that solving CrashLoopBackOff takes more than technical skill; it requires systematic thinking and the ability to analyze the whole request chain.
Based on that hands-on experience, this article starts from the low-level details of image builds and works through the container runtime environment, resource configuration, and health-check mechanisms, building a complete methodology for troubleshooting CrashLoopBackOff. With concrete code samples, configuration files, and diagnostic commands, the goal is to make this notorious headache controllable and solvable.
CrashLoopBackOff is a distinct Pod status in Kubernetes indicating that a container is repeatedly crashing and being restarted. Understanding the state transitions behind it is the foundation of any troubleshooting.
# Example Pod that will enter CrashLoopBackOff
apiVersion: v1
kind: Pod
metadata:
  name: crash-demo
spec:
  containers:
  - name: app
    image: nginx:latest
    command: ["/bin/sh"]
    args: ["-c", "exit 1"]   # deliberately make the container exit
  restartPolicy: Always
# Watch the Pod status change
kubectl get pods -w
# Sample output:
# NAME         READY   STATUS             RESTARTS   AGE
# crash-demo   0/1     CrashLoopBackOff   5          5m
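Before digging further, it usually pays to ask why the last container instance died. A minimal sketch, assuming the crash-demo Pod above in the default namespace, that reads the exit code, termination reason, and restart count straight from the Pod status:

```bash
# Exit code and reason of the last terminated container instance
kubectl get pod crash-demo \
  -o jsonpath='{.status.containerStatuses[0].lastState.terminated.exitCode}{" "}{.status.containerStatuses[0].lastState.terminated.reason}{"\n"}'

# Restart count for the same container
kubectl get pod crash-demo \
  -o jsonpath='{.status.containerStatuses[0].restartCount}{"\n"}'
```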
Kubernetes uses an exponential backoff algorithm to control the restart interval:
// Restart-delay calculation (simplified)
func calculateBackoffDelay(restartCount int) time.Duration {
	// base delay: 10 seconds
	baseDelay := 10 * time.Second
	// maximum delay: 5 minutes
	maxDelay := 300 * time.Second
	// exponential backoff: 10s, 20s, 40s, 80s, 160s, capped at 300s...
	delay := baseDelay * time.Duration(1<<uint(restartCount))
	if delay > maxDelay {
		delay = maxDelay
	}
	return delay
}
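The backoff is visible in the Pod's events. A hedged sketch for observing it (on the kubelet versions I have worked with the event reason is BackOff; adjust the selector if yours differs):

```bash
# Events recorded while the kubelet backs off restarts
kubectl get events --field-selector reason=BackOff --sort-by='.lastTimestamp'

# The same information in context
kubectl describe pod crash-demo | grep -i back-off
```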
Different restart policies change how CrashLoopBackOff behaves:
# Comparing restart policies
apiVersion: v1
kind: Pod
metadata:
  name: restart-policy-demo
spec:
  # Always: always restart (default)
  # OnFailure: restart only on failure
  # Never: never restart
  restartPolicy: Always
  containers:
  - name: app
    image: busybox
    command: ["sh", "-c", "echo 'Starting...' && sleep 30 && exit 1"]
Figure 1: Pod state transition flow diagram
The image build is the foundation everything else runs on; choosing the wrong base image or misconfiguring the build is a frequent root cause of CrashLoopBackOff.
# Problematic example: incompatible base image
FROM alpine:3.18
COPY app /usr/local/bin/app
CMD ["/usr/local/bin/app"]
# Problem: Alpine ships musl libc, while the application may have been compiled against glibc
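A quick way to catch this class of mismatch before it ever reaches the cluster is to inspect the binary's dynamic linkage. A rough sketch, assuming the binary is at /usr/local/bin/app and the image tag myapp:alpine (both placeholders):

```bash
# On the build host: which libc does the binary expect?
file app    # glibc builds typically report an interpreter such as /lib64/ld-linux-x86-64.so.2
ldd app     # lists the shared libraries the binary needs

# Inside the container: a glibc binary on musl usually fails with "not found"
docker run --rm --entrypoint /bin/sh myapp:alpine -c 'ldd /usr/local/bin/app || true'
```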
# Improved Dockerfile
FROM ubuntu:20.04 as builder
# Install build dependencies
RUN apt-get update && apt-get install -y \
    gcc \
    libc6-dev \
    && rm -rf /var/lib/apt/lists/*
# Compile the application
COPY src/ /src/
WORKDIR /src
RUN gcc -o app main.c
# Multi-stage build keeps the runtime image small
FROM ubuntu:20.04
RUN apt-get update && apt-get install -y \
    ca-certificates \
    curl \
    && rm -rf /var/lib/apt/lists/* \
    && groupadd -r appuser \
    && useradd -r -g appuser appuser
COPY --from=builder /src/app /usr/local/bin/app
RUN chmod +x /usr/local/bin/app
# Run as a non-root user
USER appuser
EXPOSE 8080
# Docker-level health check (curl is installed above so it can run; note that the kubelet ignores HEALTHCHECK and relies on Pod probes)
HEALTHCHECK --interval=30s --timeout=3s --start-period=5s --retries=3 \
    CMD curl -f http://localhost:8080/health || exit 1
CMD ["/usr/local/bin/app"]
#!/bin/bash
# build-and-test.sh - image build verification script
set -e

IMAGE_NAME="myapp:latest"
CONTAINER_NAME="test-container"

echo "🔨 Building image..."
docker build -t $IMAGE_NAME .

echo "🧪 Testing basic image functionality..."
# Can the container start at all?
docker run --name $CONTAINER_NAME -d $IMAGE_NAME

# Give the container a moment to start
sleep 5

# Check container status
if docker ps | grep -q $CONTAINER_NAME; then
    echo "✅ Container started successfully"
else
    echo "❌ Container failed to start"
    docker logs $CONTAINER_NAME
    exit 1
fi

# Exercise the health-check endpoint
echo "🏥 Testing health check..."
for i in {1..10}; do
    if docker exec $CONTAINER_NAME curl -f http://localhost:8080/health; then
        echo "✅ Health check passed"
        break
    else
        echo "⏳ Waiting for the application to start... ($i/10)"
        sleep 3
    fi
done

# Clean up the test container
docker stop $CONTAINER_NAME
docker rm $CONTAINER_NAME
echo "🎉 Image build verification complete"
# .github/workflows/image-scan.yml
name: Image Security Scan
on:
push:
branches: [main]
pull_request:
branches: [main]
jobs:
scan:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v3
- name: Build image
run: docker build -t myapp:${{ github.sha }} .
- name: Run Trivy vulnerability scanner
uses: aquasecurity/trivy-action@master
with:
image-ref: 'myapp:${{ github.sha }}'
format: 'sarif'
output: 'trivy-results.sarif'
- name: Upload Trivy scan results
uses: github/codeql-action/upload-sarif@v2
with:
sarif_file: 'trivy-results.sarif'
Figure 2: Image build pipeline architecture diagram
Misconfigured resource limits are a common cause of CrashLoopBackOff, especially a memory limit that is too low or a CPU limit that is too strict.
# Resource configuration example
apiVersion: apps/v1
kind: Deployment
metadata:
  name: resource-demo
spec:
  replicas: 3
  selector:
    matchLabels:
      app: resource-demo
  template:
    metadata:
      labels:
        app: resource-demo
    spec:
      containers:
      - name: app
        image: myapp:latest
        resources:
          requests:
            # requests: the minimum guaranteed at scheduling time
            memory: "128Mi"
            cpu: "100m"
          limits:
            # limits: the maximum allowed at runtime
            memory: "512Mi"
            cpu: "500m"
        env:
        - name: JAVA_OPTS
          value: "-Xmx400m -Xms128m"   # the JVM heap must stay below the container memory limit
# Managing configuration with ConfigMap and Secret
apiVersion: v1
kind: ConfigMap
metadata:
name: app-config
data:
application.properties: |
server.port=8080
spring.datasource.url=jdbc:mysql://mysql:3306/mydb
spring.datasource.username=user
spring.datasource.password=password
logging.level.root=INFO
---
apiVersion: v1
kind: Secret
metadata:
name: app-secrets
type: Opaque
data:
db-password: cGFzc3dvcmQ=
api-key: YWJjZGVmZ2hpams=
apiVersion: apps/v1
kind: Deployment
metadata:
name: app-with-config
spec:
template:
spec:
containers:
- name: app
image: myapp:latest
envFrom:
- configMapRef:
name: app-config
- secretRef:
name: app-secrets
volumeMounts:
- name: config-volume
mountPath: /etc/config
- name: secret-volume
mountPath: /etc/secrets
readOnly: true
volumes:
- name: config-volume
configMap:
name: app-config
- name: secret-volume
secret:
secretName: app-secrets
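Misspelled keys or stale Secret data in this kind of setup only surface when the container starts, so it is worth verifying them directly. A hedged sketch, assuming the app-config/app-secrets objects above and a running Pod called app-with-config-xxxxx (placeholder):

```bash
# Do the expected ConfigMap and Secret keys actually exist?
kubectl get configmap app-config -o json | jq '.data | keys'
kubectl get secret app-secrets -o jsonpath='{.data.db-password}' | base64 -d; echo

# Are the volumes mounted where the application expects them?
POD=app-with-config-xxxxx   # placeholder Pod name
kubectl exec $POD -- ls /etc/config /etc/secrets
```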
# Handling dependencies with init containers
apiVersion: apps/v1
kind: Deployment
metadata:
  name: app-with-deps
spec:
  template:
    spec:
      initContainers:
      # Wait for the database to become reachable
      - name: wait-for-db
        image: busybox:1.35
        command: ['sh', '-c']
        args:
        - |
          echo "Waiting for the database to start..."
          until nc -z mysql 3306; do
            echo "Database not ready, retrying in 5 seconds..."
            sleep 5
          done
          echo "Database is ready"
      # Run database migrations
      - name: db-migration
        image: migrate/migrate:v4.15.2
        command: ["/migrate"]
        args:
        - "-path=/migrations"
        - "-database=mysql://user:password@mysql:3306/mydb"
        - "up"
        volumeMounts:
        - name: migrations
          mountPath: /migrations
      containers:
      - name: app
        image: myapp:latest
        # application container configuration...
      volumes:
      - name: migrations
        configMap:
          name: db-migrations
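If one of these init containers hangs or fails, the Pod sits in an Init:... status and the main container never even gets a chance to crash. A quick sketch for inspecting them (app-with-deps-xxxxx is a placeholder Pod name):

```bash
POD=app-with-deps-xxxxx   # placeholder Pod name

# State of each init container
kubectl get pod $POD -o jsonpath='{range .status.initContainerStatuses[*]}{.name}{": "}{.state}{"\n"}{end}'

# Logs from a specific init container (-c selects the container)
kubectl logs $POD -c wait-for-db
kubectl logs $POD -c db-migration
```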
Figure 3: Container resource usage distribution (pie chart)
Kubernetes provides three types of probes, each with its own purpose and configuration considerations.
# Complete probe configuration example
apiVersion: apps/v1
kind: Deployment
metadata:
  name: probe-demo
spec:
  template:
    spec:
      containers:
      - name: app
        image: myapp:latest
        ports:
        - containerPort: 8080
        # Startup probe: verifies the container started successfully
        startupProbe:
          httpGet:
            path: /actuator/health/readiness
            port: 8080
          initialDelaySeconds: 10   # delay before the first check
          periodSeconds: 5          # interval between checks
          timeoutSeconds: 3         # timeout per check
          failureThreshold: 30      # failure threshold (max startup time = 30 * 5 = 150s)
          successThreshold: 1       # success threshold
        # Liveness probe: decides whether the container must be restarted
        livenessProbe:
          httpGet:
            path: /actuator/health/liveness
            port: 8080
          initialDelaySeconds: 30
          periodSeconds: 10
          timeoutSeconds: 5
          failureThreshold: 3       # restart after 3 consecutive failures
          successThreshold: 1
        # Readiness probe: decides whether the container may receive traffic
        readinessProbe:
          httpGet:
            path: /actuator/health/readiness
            port: 8080
          initialDelaySeconds: 5
          periodSeconds: 5
          timeoutSeconds: 3
          failureThreshold: 3
          successThreshold: 1
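When a probe misbehaves, the kubelet records it as events on the Pod, which is usually the fastest place to see the exact failure message. A sketch for pulling those out (on the clusters I have used the event reason for failed probes is Unhealthy; probe-demo-xxxxx is a placeholder):

```bash
POD=probe-demo-xxxxx   # placeholder Pod name

# Probe failures recorded as events
kubectl get events --field-selector involvedObject.name=$POD,reason=Unhealthy \
  --sort-by='.lastTimestamp'

# The same messages in context
kubectl describe pod $POD | grep -iE 'liveness|readiness|startup'
```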
// Spring Boot health-check implementation
@RestController
@RequestMapping("/actuator/health")
public class HealthController {

    private static final Logger log = LoggerFactory.getLogger(HealthController.class);

    @Autowired
    private DatabaseHealthIndicator databaseHealth;

    @Autowired
    private RedisHealthIndicator redisHealth;

    // Liveness endpoint - checks the application's basic functionality
    @GetMapping("/liveness")
    public ResponseEntity<Map<String, Object>> liveness() {
        Map<String, Object> response = new HashMap<>();
        try {
            // Check the state of critical components
            boolean isHealthy = checkCriticalComponents();
            if (isHealthy) {
                response.put("status", "UP");
                response.put("timestamp", Instant.now());
                return ResponseEntity.ok(response);
            } else {
                response.put("status", "DOWN");
                response.put("reason", "Critical component failure");
                return ResponseEntity.status(503).body(response);
            }
        } catch (Exception e) {
            response.put("status", "DOWN");
            response.put("error", e.getMessage());
            return ResponseEntity.status(503).body(response);
        }
    }

    // Readiness endpoint - checks whether the application is ready to serve requests
    @GetMapping("/readiness")
    public ResponseEntity<Map<String, Object>> readiness() {
        Map<String, Object> response = new HashMap<>();
        Map<String, String> checks = new HashMap<>();

        // Database connectivity
        boolean dbReady = databaseHealth.isHealthy();
        checks.put("database", dbReady ? "UP" : "DOWN");

        // Redis connectivity
        boolean redisReady = redisHealth.isHealthy();
        checks.put("redis", redisReady ? "UP" : "DOWN");

        // External dependencies
        boolean externalReady = checkExternalDependencies();
        checks.put("external", externalReady ? "UP" : "DOWN");

        boolean allReady = dbReady && redisReady && externalReady;
        response.put("status", allReady ? "UP" : "DOWN");
        response.put("checks", checks);
        response.put("timestamp", Instant.now());

        return allReady ?
            ResponseEntity.ok(response) :
            ResponseEntity.status(503).body(response);
    }

    private boolean checkCriticalComponents() {
        // Check JVM heap usage
        MemoryMXBean memoryBean = ManagementFactory.getMemoryMXBean();
        MemoryUsage heapUsage = memoryBean.getHeapMemoryUsage();
        double memoryUsageRatio = (double) heapUsage.getUsed() / heapUsage.getMax();
        if (memoryUsageRatio > 0.9) {
            log.warn("Memory usage too high: {}%", memoryUsageRatio * 100);
            return false;
        }
        return true;
    }

    private boolean checkExternalDependencies() {
        // Check the availability of an external API
        try {
            RestTemplate restTemplate = new RestTemplate();
            ResponseEntity<String> response = restTemplate.getForEntity("http://external-api/health", String.class);
            return response.getStatusCode().is2xxSuccessful();
        } catch (Exception e) {
            log.warn("External API health check failed", e);
            return false;
        }
    }
}
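Before wiring these endpoints into probes it is worth hitting them by hand; a probe pointed at a broken or slow endpoint is itself a classic CrashLoopBackOff trigger. A minimal sketch using port-forward (probe-demo-xxxxx is a placeholder Pod name):

```bash
# Forward the container port to the workstation
kubectl port-forward pod/probe-demo-xxxxx 8080:8080 &

# Check response codes and latency of both endpoints
curl -s -o /dev/null -w 'liveness  -> %{http_code} in %{time_total}s\n' http://localhost:8080/actuator/health/liveness
curl -s -o /dev/null -w 'readiness -> %{http_code} in %{time_total}s\n' http://localhost:8080/actuator/health/readiness

kill %1   # stop the port-forward
```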
#!/bin/bash
# probe-tuning.sh - probe configuration tuning helper

# Collect Pod startup-time statistics
get_startup_stats() {
    local deployment=$1
    echo "📊 Analyzing startup times for $deployment..."
    kubectl get pods -l app=$deployment -o json | jq -r '
        .items[] |
        select(.status.phase == "Running") |
        {
            name: .metadata.name,
            created: .metadata.creationTimestamp,
            started: (.status.containerStatuses[0].state.running.startedAt // "N/A"),
            ready: (.status.conditions[] | select(.type == "Ready") | .lastTransitionTime)
        } |
        "\(.name): created=\(.created), started=\(.started), ready=\(.ready)"
    '
}

# Analyze why probes are failing
analyze_probe_failures() {
    local pod=$1
    echo "🔍 Analyzing probe failures for $pod..."

    # Relevant events
    kubectl describe pod $pod | grep -A 5 -B 5 "probe failed"

    # Container logs
    echo "📋 Container logs:"
    kubectl logs $pod --tail=50

    # Resource usage
    echo "💾 Resource usage:"
    kubectl top pod $pod
}

# Suggested probe configuration per application type
suggest_probe_config() {
    local app_type=$1
    case $app_type in
        "spring-boot")
            cat << EOF
Suggested probe configuration for Spring Boot applications:

startupProbe:
  httpGet:
    path: /actuator/health
    port: 8080
  initialDelaySeconds: 30
  periodSeconds: 10
  timeoutSeconds: 5
  failureThreshold: 18   # allows about 3 minutes of startup time

livenessProbe:
  httpGet:
    path: /actuator/health/liveness
    port: 8080
  initialDelaySeconds: 60
  periodSeconds: 30
  timeoutSeconds: 10
  failureThreshold: 3

readinessProbe:
  httpGet:
    path: /actuator/health/readiness
    port: 8080
  initialDelaySeconds: 10
  periodSeconds: 5
  timeoutSeconds: 3
  failureThreshold: 3
EOF
            ;;
        "nodejs")
            cat << EOF
Suggested probe configuration for Node.js applications:

startupProbe:
  httpGet:
    path: /health
    port: 3000
  initialDelaySeconds: 10
  periodSeconds: 5
  timeoutSeconds: 3
  failureThreshold: 12   # allows about 1 minute of startup time

livenessProbe:
  httpGet:
    path: /health
    port: 3000
  initialDelaySeconds: 30
  periodSeconds: 15
  timeoutSeconds: 5
  failureThreshold: 3

readinessProbe:
  httpGet:
    path: /ready
    port: 3000
  initialDelaySeconds: 5
  periodSeconds: 5
  timeoutSeconds: 3
  failureThreshold: 3
EOF
            ;;
    esac
}

# Entry point
main() {
    case $1 in
        "stats")
            get_startup_stats $2
            ;;
        "analyze")
            analyze_probe_failures $2
            ;;
        "suggest")
            suggest_probe_config $2
            ;;
        *)
            echo "Usage: $0 {stats|analyze|suggest} [argument]"
            echo "  stats <deployment>   - collect startup-time statistics"
            echo "  analyze <pod>        - analyze probe failures"
            echo "  suggest <app-type>   - print a suggested probe configuration"
            ;;
    esac
}

main "$@"
Figure 4: Health-check probe sequence diagram
A systematic troubleshooting methodology is the key to locating CrashLoopBackOff problems quickly.
#!/bin/bash
# crash-diagnosis.sh - systematic CrashLoopBackOff troubleshooting script
set -e

POD_NAME=""
NAMESPACE="default"
VERBOSE=false

# Parse command-line arguments
while [[ $# -gt 0 ]]; do
    case $1 in
        -p|--pod)
            POD_NAME="$2"
            shift 2
            ;;
        -n|--namespace)
            NAMESPACE="$2"
            shift 2
            ;;
        -v|--verbose)
            VERBOSE=true
            shift
            ;;
        *)
            echo "Unknown argument: $1"
            exit 1
            ;;
    esac
done

if [[ -z "$POD_NAME" ]]; then
    echo "Usage: $0 -p <pod-name> [-n <namespace>] [-v]"
    exit 1
fi

echo "🔍 Troubleshooting Pod: $POD_NAME (namespace: $NAMESPACE)"
echo "=================================================="

# 1. Collect basic information
collect_basic_info() {
    echo "📋 1. Collecting basic information"
    echo "-------------------"
    # Pod status
    echo "Pod status:"
    kubectl get pod $POD_NAME -n $NAMESPACE -o wide

    # Pod details
    echo -e "\nPod details:"
    kubectl describe pod $POD_NAME -n $NAMESPACE

    # Restart count
    echo -e "\nRestart statistics:"
    kubectl get pod $POD_NAME -n $NAMESPACE -o jsonpath='{.status.containerStatuses[*].restartCount}' | \
        awk '{print "Restart count: " $1}'
}

# 2. Log analysis
analyze_logs() {
    echo -e "\n📝 2. Log analysis"
    echo "---------------"
    # Current container logs
    echo "Current container logs (last 50 lines):"
    kubectl logs $POD_NAME -n $NAMESPACE --tail=50

    # Previous container logs, if any
    echo -e "\nPrevious container logs (last 50 lines):"
    kubectl logs $POD_NAME -n $NAMESPACE --previous --tail=50 2>/dev/null || \
        echo "No previous container logs"
}

# 3. Resource usage analysis
analyze_resources() {
    echo -e "\n💾 3. Resource usage analysis"
    echo "-------------------"
    # Configured resources
    echo "Resource configuration:"
    kubectl get pod $POD_NAME -n $NAMESPACE -o jsonpath='{.spec.containers[*].resources}' | jq .

    # Actual usage
    echo -e "\nActual resource usage:"
    kubectl top pod $POD_NAME -n $NAMESPACE 2>/dev/null || \
        echo "Resource usage unavailable (metrics-server required)"

    # Node resource state
    NODE=$(kubectl get pod $POD_NAME -n $NAMESPACE -o jsonpath='{.spec.nodeName}')
    if [[ -n "$NODE" ]]; then
        echo -e "\nResource state of node $NODE:"
        kubectl describe node $NODE | grep -A 5 "Allocated resources"
    fi
}

# 4. Network connectivity checks
check_network() {
    echo -e "\n🌐 4. Network connectivity checks"
    echo "-------------------"
    # Related Services
    echo "Related Service:"
    kubectl get svc -n $NAMESPACE --field-selector metadata.name=$POD_NAME 2>/dev/null || \
        echo "No matching Service found"

    # Endpoint state
    echo -e "\nEndpoint state:"
    kubectl get endpoints -n $NAMESPACE 2>/dev/null | grep $POD_NAME || \
        echo "No matching Endpoints found"

    # Network policies
    echo -e "\nNetworkPolicies:"
    kubectl get networkpolicy -n $NAMESPACE 2>/dev/null || \
        echo "No NetworkPolicy configured"
}

# 5. Configuration checks
check_configuration() {
    echo -e "\n⚙️ 5. Configuration checks"
    echo "----------------"
    # ConfigMaps
    echo "ConfigMap:"
    kubectl get configmap -n $NAMESPACE 2>/dev/null | head -10

    # Secrets
    echo -e "\nSecret:"
    kubectl get secret -n $NAMESPACE 2>/dev/null | head -10

    # Environment variables
    echo -e "\nEnvironment variables:"
    kubectl get pod $POD_NAME -n $NAMESPACE -o jsonpath='{.spec.containers[*].env}' | jq .
}

# 6. Event analysis
analyze_events() {
    echo -e "\n📅 6. Event analysis"
    echo "---------------"
    # Pod-related events
    echo "Pod events:"
    kubectl get events -n $NAMESPACE --field-selector involvedObject.name=$POD_NAME \
        --sort-by='.lastTimestamp' | tail -20

    # Node-related events
    NODE=$(kubectl get pod $POD_NAME -n $NAMESPACE -o jsonpath='{.spec.nodeName}')
    if [[ -n "$NODE" ]]; then
        echo -e "\nEvents for node $NODE:"
        kubectl get events --field-selector involvedObject.name=$NODE \
            --sort-by='.lastTimestamp' | tail -10
    fi
}

# 7. Diagnostic report
generate_report() {
    echo -e "\n📊 7. Diagnostic suggestions"
    echo "==============="
    # Restart cause analysis
    RESTART_COUNT=$(kubectl get pod $POD_NAME -n $NAMESPACE -o jsonpath='{.status.containerStatuses[0].restartCount}')
    LAST_STATE=$(kubectl get pod $POD_NAME -n $NAMESPACE -o jsonpath='{.status.containerStatuses[0].lastState}')

    echo "Restart count: $RESTART_COUNT"
    echo "Last state: $LAST_STATE"

    # Common-issue checks
    echo -e "\n🔧 Common-issue checks:"

    # Image pull policy
    IMAGE_PULL_POLICY=$(kubectl get pod $POD_NAME -n $NAMESPACE -o jsonpath='{.spec.containers[0].imagePullPolicy}')
    echo "- Image pull policy: $IMAGE_PULL_POLICY"

    # Probe configuration
    LIVENESS_PROBE=$(kubectl get pod $POD_NAME -n $NAMESPACE -o jsonpath='{.spec.containers[0].livenessProbe}')
    if [[ -n "$LIVENESS_PROBE" ]]; then
        echo "- Liveness probe: configured"
    else
        echo "- Liveness probe: not configured"
    fi

    READINESS_PROBE=$(kubectl get pod $POD_NAME -n $NAMESPACE -o jsonpath='{.spec.containers[0].readinessProbe}')
    if [[ -n "$READINESS_PROBE" ]]; then
        echo "- Readiness probe: configured"
    else
        echo "- Readiness probe: not configured"
    fi

    # Resource limits
    MEMORY_LIMIT=$(kubectl get pod $POD_NAME -n $NAMESPACE -o jsonpath='{.spec.containers[0].resources.limits.memory}')
    CPU_LIMIT=$(kubectl get pod $POD_NAME -n $NAMESPACE -o jsonpath='{.spec.containers[0].resources.limits.cpu}')

    if [[ -n "$MEMORY_LIMIT" ]]; then
        echo "- Memory limit: $MEMORY_LIMIT"
    else
        echo "- Memory limit: not set (possible OOM risk)"
    fi

    if [[ -n "$CPU_LIMIT" ]]; then
        echo "- CPU limit: $CPU_LIMIT"
    else
        echo "- CPU limit: not set"
    fi
}

# Main flow
main() {
    collect_basic_info
    analyze_logs
    analyze_resources
    check_network
    check_configuration
    analyze_events
    generate_report

    echo -e "\n✅ Troubleshooting complete!"
    echo "Use the information above to identify the root cause and apply the matching fix."
}

# Run
main
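A typical run against a crashing Pod looks like this (the Pod and namespace names are placeholders):

```bash
chmod +x crash-diagnosis.sh
./crash-diagnosis.sh -p payment-service-7d9f8b6c5-x2k4j -n production
```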
| Problem type | Symptoms | What to check | Fix |
|---|---|---|---|
| Image issues | ImagePullBackOff → CrashLoopBackOff | Image tag, registry credentials, network access | Fix the image build, update pull secrets |
| Insufficient resources | OOMKilled, CPU throttling | Memory/CPU usage, node capacity | Raise resource limits, scale out nodes |
| Configuration errors | Bad environment variables, failed mounts | ConfigMap, Secret, Volume | Correct configuration files, check mount paths |
| Dependent services | Connection timeouts, unreachable services | Network connectivity, DNS resolution | Fix the dependency, adjust timeouts |
| Probe configuration | Health checks failing | Probe endpoints, timeout settings | Tune probe parameters, fix the health check |
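The lastState of the failing container usually tells you which row of this table you are in. A small triage sketch (the Pod name is a placeholder):

```bash
POD=myapp-5f6d8c7b9-abcde   # placeholder Pod name

# Reason and exit code of the last crash: OOMKilled -> resources, Error plus an exit code -> application/config, and so on
kubectl get pod $POD -o jsonpath='{.status.containerStatuses[0].lastState.terminated.reason}{" (exit "}{.status.containerStatuses[0].lastState.terminated.exitCode}{")"}{"\n"}'

# Waiting reason of the current container: CrashLoopBackOff vs ImagePullBackOff vs CreateContainerConfigError
kubectl get pod $POD -o jsonpath='{.status.containerStatuses[0].state.waiting.reason}{"\n"}'
```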
#!/bin/bash
# auto-fix.sh - automatically apply fixes for common CrashLoopBackOff causes

fix_common_issues() {
    local deployment=$1
    local namespace=${2:-default}

    echo "🔧 Attempting to auto-fix common issues for $deployment..."

    # 1. Raise resource requests and limits
    echo "1. Adjusting resource limits..."
    kubectl patch deployment $deployment -n $namespace -p '{
      "spec": {
        "template": {
          "spec": {
            "containers": [{
              "name": "'$deployment'",
              "resources": {
                "requests": {
                  "memory": "256Mi",
                  "cpu": "200m"
                },
                "limits": {
                  "memory": "512Mi",
                  "cpu": "500m"
                }
              }
            }]
          }
        }
      }
    }'

    # 2. Relax probe timing
    echo "2. Tuning probe configuration..."
    kubectl patch deployment $deployment -n $namespace -p '{
      "spec": {
        "template": {
          "spec": {
            "containers": [{
              "name": "'$deployment'",
              "livenessProbe": {
                "initialDelaySeconds": 60,
                "periodSeconds": 30,
                "timeoutSeconds": 10,
                "failureThreshold": 5
              },
              "readinessProbe": {
                "initialDelaySeconds": 10,
                "periodSeconds": 5,
                "timeoutSeconds": 3,
                "failureThreshold": 3
              }
            }]
          }
        }
      }
    }'

    # 3. Add a startup probe
    echo "3. Adding a startup probe..."
    kubectl patch deployment $deployment -n $namespace -p '{
      "spec": {
        "template": {
          "spec": {
            "containers": [{
              "name": "'$deployment'",
              "startupProbe": {
                "httpGet": {
                  "path": "/actuator/health",
                  "port": 8080
                },
                "initialDelaySeconds": 10,
                "periodSeconds": 5,
                "timeoutSeconds": 3,
                "failureThreshold": 5
              }
            }]
          }
        }
      }
    }'

    echo "✅ Auto-fix complete!"
}

# Entry point
main() {
    if [[ $# -lt 1 ]]; then
        echo "Usage: $0 <deployment-name> [namespace]"
        exit 1
    fi
    fix_common_issues $1 $2
}

main "$@"