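🎯 GPUaaS failure types

Type	Observed symptom
GPU Down	The node does not detect the GPU
GPU overload	Performance degrades during training or inference
GPU out of memory	Model loading fails
Pod Pending	Not enough GPUs, or scheduling failed
Pod Hang	The container is running but does no GPU work
GPU Idle	Wasted resources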

1️⃣ Unified GPUaaS AlertRule (GPU + Pod)

cat <<EOF > gpu-aas-alerts.yaml
groups:
- name: gpu-aas
  rules:

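  # 1. GPU device failures
  # Caveat: a utilization of 0 also matches a healthy but idle GPU, and a GPU that
  # drops off the node usually stops reporting rather than reporting 0.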
  - alert: GPUDown
    expr: nvidia_gpu_utilization == 0
    for: 3m
    labels:
      severity: critical
    annotations:
      summary: "GPU not responding ({{ \$labels.instance }})"

  - alert: GPUHighUtilization
    expr: avg by(instance)(nvidia_gpu_utilization) > 90
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "GPU overload ({{ \$labels.instance }})"

  - alert: GPUMemoryAlmostFull
    expr: avg by(instance)(nvidia_gpu_memory_used_bytes / nvidia_gpu_memory_total_bytes) * 100 > 90
    for: 3m
    labels:
      severity: critical
    annotations:
      summary: "GPU memory almost full ({{ \$labels.instance }})"

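  # 2. Pod Pending (GPU allocation failed)
  # Restricted to Pods that request a GPU. Assumes kube-state-metrics v2
  # exposes kube_pod_container_resource_requests for the GPU resource.
  - alert: GPUPodPending
    expr: |
      kube_pod_status_phase{phase="Pending"} == 1
      and on(namespace, pod)
      kube_pod_container_resource_requests{resource=~"nvidia[._]com[/_]gpu"} > 0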
    for: 3m
    labels:
      severity: critical
    annotations:
      summary: "GPU Pod Pending ({{ \$labels.namespace }}/{{ \$labels.pod }})"

  # 3. Pod Hang (container is running but not using the GPU)
  # Assumes the GPU metric carries pod and namespace labels.
  - alert: GPUPodHang
    expr: |
      kube_pod_container_status_running == 1
      and
      on(pod, namespace)
      sum(nvidia_gpu_utilization) by (pod, namespace) < 1
    for: 10m
    labels:
      severity: warning
    annotations:
      summary: "GPU Pod Hang detected ({{ \$labels.namespace }}/{{ \$labels.pod }})"

  # 4. GPU idle for a long time (wasted cost)
  - alert: GPUIdleTooLong
    expr: avg by(instance)(nvidia_gpu_utilization) < 5
    for: 30m
    labels:
      severity: info
    annotations:
      summary: "GPU idle too long ({{ \$labels.instance }})"

EOF
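
Before loading the file into the cluster, it can help to validate it locally. The following is a minimal sketch, assuming promtool (the CLI shipped with Prometheus) may or may not be installed on the workstation. The validate_rules helper name is hypothetical and not part of the original setup.

```shell
# validate_rules FILE: hypothetical helper that checks a Prometheus rule file.
# Uses "promtool check rules" when promtool is on PATH; otherwise it only
# verifies that the file exists and has a top-level "groups:" key.
validate_rules() {
  file="$1"
  if [ ! -f "$file" ]; then
    echo "rule file not found: $file" >&2
    return 1
  fi
  if command -v promtool >/dev/null 2>&1; then
    promtool check rules "$file"
  else
    echo "promtool not found; doing a basic structure check only" >&2
    grep -q '^groups:' "$file"
  fi
}

# Usage:
#   validate_rules gpu-aas-alerts.yaml && echo "rules look valid"
```

The fallback branch is deliberately weak: it catches an empty or misnamed file but not PromQL errors, so promtool is the check to rely on when it is available.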

2️⃣ Apply the AlertRule to Kubernetes

① Create the ConfigMap

kubectl create configmap gpu-aas-alerts \
--from-file=gpu-aas-alerts.yaml \
-n monitoring

② Add a label so that Prometheus reads this rule

kubectl label configmap gpu-aas-alerts role=gpu-alerts -n monitoring

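In other words: attach the tag "role=gpu-alerts" to the gpu-aas-alerts ConfigMap.

Note: in a Prometheus Operator setup, ruleSelector matches PrometheusRule custom resources, not ConfigMaps, so a labeled ConfigMap is loaded only if the deployment mounts rule ConfigMaps through some other mechanism.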
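
For clusters where rules are loaded through the Prometheus Operator, the rule groups from step 1 can be wrapped in a PrometheusRule resource that carries the same label. This is a minimal sketch, assuming the monitoring.coreos.com/v1 CRD is installed. The to_prometheus_rule helper and the output file name are hypothetical.

```shell
# to_prometheus_rule NAME NAMESPACE LABEL_VALUE RULES_FILE
# Prints a PrometheusRule manifest whose spec is the content of RULES_FILE
# (a file that starts with "groups:"), with non-empty lines indented by two spaces.
to_prometheus_rule() {
  cat <<EOF
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: $1
  namespace: $2
  labels:
    role: $3
spec:
EOF
  sed 's/^./  &/' "$4"
}

# Usage (requires cluster access; not executed here):
#   to_prometheus_rule gpu-aas-alerts monitoring gpu-alerts gpu-aas-alerts.yaml \
#     > gpu-aas-prometheusrule.yaml
#   kubectl apply -f gpu-aas-prometheusrule.yaml
```

Keeping the label role=gpu-alerts on the generated resource means the ruleSelector in step 3 matches it without further changes.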

3️⃣ Connect the ruleSelector in Prometheus

Check the Prometheus CR

kubectl get prometheus -n monitoring

Edit it

kubectl edit prometheus k8s -n monitoring

Check that the following is present:

spec:
  ruleSelector:
    matchLabels:
      role: gpu-alerts

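If it is missing, add it and save. Once matchLabels is set, only rule objects carrying this label are selected, so make sure any existing rules you still need carry it as well.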
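
The same change can be made without an interactive editor. This is a minimal sketch, assuming the Prometheus CR is named k8s as in the commands above. rule_selector_patch is a hypothetical helper that only builds the JSON merge patch; the kubectl calls are shown as comments because they need cluster access.

```shell
# rule_selector_patch VALUE: print a JSON merge patch that sets
# spec.ruleSelector.matchLabels.role to VALUE.
rule_selector_patch() {
  printf '{"spec":{"ruleSelector":{"matchLabels":{"role":"%s"}}}}\n' "$1"
}

# Usage (requires cluster access; not executed here):
#   # Show the current selector first
#   kubectl get prometheus k8s -n monitoring -o jsonpath='{.spec.ruleSelector}'
#   # Apply the patch
#   kubectl patch prometheus k8s -n monitoring --type merge \
#     -p "$(rule_selector_patch gpu-alerts)"
```

Inspecting the current selector before patching shows whether other rules depend on a different label.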

4️⃣ Re-apply Prometheus (reload the rules)

Prometheus normally reloads the rules automatically when they change.
If it does not, force a rolling restart:

kubectl rollout restart statefulset prometheus-k8s -n monitoring

5️⃣ Verify that the rules were applied

kubectl exec -n monitoring -it prometheus-k8s-0 -- \
wget -qO- http://localhost:9090/api/v1/rules | grep GPU

Alternatively, check in Grafana → Alerting → Alert Rules
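
The rules API returns a single line of JSON, so grep GPU prints the whole response. This is a minimal sketch that lists only the alert names, assuming the response contains compact "name":"..." fields. list_gpu_alerts is a hypothetical helper.

```shell
# list_gpu_alerts: read the JSON body of /api/v1/rules on stdin and print
# the unique rule names that start with "GPU", one per line.
list_gpu_alerts() {
  grep -o '"name":"GPU[^"]*"' | sed 's/"name":"\(.*\)"/\1/' | sort -u
}

# Usage (requires cluster access; not executed here):
#   kubectl exec -n monitoring prometheus-k8s-0 -- \
#     wget -qO- http://localhost:9090/api/v1/rules | list_gpu_alerts
```

If all six rules were loaded, the output should contain the six alert names defined in step 1.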

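🎯 What this setup means

Your GPUaaS now detects:

Failure	Detected
GPU driver down	✅
GPU out of memory	✅
Training Pod Pending	✅
Inference Pod hung	✅
GPU sitting idle	✅
Wasted cost	✅

→ Monitoring coverage approaching that of a cloud GPU service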

License: Attribution-NonCommercial-NoDerivs

