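You've chosen the **textbook route** for an NCP environment.
The steps below are meant to be copy-paste ready for production use. The stack

NCP Kubernetes
→ NVIDIA Device Plugin
→ DCGM Exporter
→ Prometheus
→ Grafana
is installed entirely with Helm, and the guide is organized as numbered, command-first steps.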

🧭 Overall architecture (picture it first)

[NCP GPU Node]
├─ (provided by default) NVIDIA Driver
├─ NVIDIA Device Plugin (Helm)
├─ DCGM Exporter (Helm)
├─ Prometheus (Helm)
└─ Grafana (Helm)

0️⃣ Prerequisites (local PC)

kubectl version
helm version

✔️ Both commands must print version information without errors.
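
The prerequisite check can also be scripted so that later steps stop early when a tool is missing. The helper below is a minimal sketch; the function name `require_cmds` is hypothetical and not part of kubectl or Helm.

```shell
# Print each required CLI that is missing from PATH (to stderr).
# Returns 0 when every command is available, 1 otherwise.
require_cmds() {
  local missing
  local cmd
  missing=0
  for cmd in "$@"; do
    if ! command -v "$cmd" >/dev/null 2>&1; then
      echo "missing: $cmd" >&2
      missing=1
    fi
  done
  return "$missing"
}

# Usage (at the top of an install script):
#   require_cmds kubectl helm || exit 1
```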

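1️⃣ Check the GPU nodes (always do this first)

kubectl get nodes

Check the GPU resources on a node:

kubectl describe node <GPU_NODE_NAME> | grep -i nvidia

Expected output (the count is the number of GPUs on the node):

nvidia.com/gpu: 4

If nothing is printed, the node is not advertising nvidia.com/gpu yet. Run the check again after installing the Device Plugin in step 2.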
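
To see every node at once instead of describing them one by one, the allocatable GPU count can be listed per node and filtered. This is a sketch: `filter_gpu_nodes` is a hypothetical helper, and it assumes the two-column layout produced by the `custom-columns` expression shown in the usage comment.

```shell
# Read a two-column "NODE GPU" table on stdin (header on the first line)
# and print only the nodes whose GPU column is a positive integer.
filter_gpu_nodes() {
  awk 'NR > 1 && $2 ~ /^[0-9]+$/ && $2 > 0 { print $1, $2 }'
}

# Usage (needs cluster access; nodes without GPUs show <none> and are dropped):
#   kubectl get nodes \
#     -o custom-columns='NODE:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu' \
#     | filter_gpu_nodes
```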

2️⃣ Install the NVIDIA Device Plugin (Helm)

2-1. Add the Helm repo

helm repo add nvidia https://nvidia.github.io/k8s-device-plugin
helm repo update

2-2. Install the Device Plugin

The chart in this repo is named nvidia-device-plugin:

helm install nvidia-device-plugin nvidia/nvidia-device-plugin \
  --namespace kube-system

2-3. Verify the installation

kubectl get pods -n kube-system | grep nvidia

Expected (the exact Pod name depends on the release name):

nvidia-device-plugin-daemonset-xxxxx 1/1 Running
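
A Running plugin Pod does not by itself prove that workloads can get a GPU. One way to confirm is a throwaway Pod that requests a single GPU and runs nvidia-smi. The sketch below only prints the manifest. `gpu_smoke_pod` and the default Pod name are hypothetical, and the image is deliberately left for the caller to supply (any CUDA-capable image that can run nvidia-smi). The toleration mirrors the nvidia.com/gpu taint assumed elsewhere in this guide.

```shell
# Print a one-shot Pod manifest that requests one GPU and runs nvidia-smi.
# $1: container image able to run nvidia-smi (caller-supplied, no default assumed)
# $2: Pod name (hypothetical default: gpu-smoke-test)
gpu_smoke_pod() {
  local image
  local name
  image=${1:-}
  name=${2:-gpu-smoke-test}
  if [ -z "$image" ]; then
    echo "usage: gpu_smoke_pod IMAGE [NAME]" >&2
    return 2
  fi
  cat <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: ${name}
spec:
  restartPolicy: Never
  tolerations:
    - key: nvidia.com/gpu
      operator: Exists
      effect: NoSchedule
  containers:
    - name: cuda
      image: ${image}
      command: ["nvidia-smi"]
      resources:
        limits:
          nvidia.com/gpu: 1
EOF
}

# Usage:
#   gpu_smoke_pod <CUDA_IMAGE> | kubectl apply -f -
#   kubectl logs gpu-smoke-test      # should show the nvidia-smi table
#   kubectl delete pod gpu-smoke-test
```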

3️⃣ Create the monitoring namespace

kubectl create namespace monitoring

(Skip this step if the namespace already exists.)

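4️⃣ Install DCGM Exporter (Helm)

4-1. Add the Helm repo

The dcgm-exporter chart is published in NVIDIA's own chart repo, not in prometheus-community:

helm repo add gpu-helm-charts https://nvidia.github.io/dcgm-exporter/helm-charts
helm repo update

4-2. Install DCGM Exporter (with the options NCP requires)

helm install dcgm-exporter gpu-helm-charts/dcgm-exporter \
  --namespace monitoring \
  --set serviceMonitor.enabled=false \
  --set securityContext.privileged=true \
  --set 'tolerations[0].key=nvidia.com/gpu' \
  --set 'tolerations[0].operator=Exists' \
  --set 'tolerations[0].effect=NoSchedule'

The ServiceMonitor is switched off here because the chart may enable it by default, and the install fails while the Prometheus Operator CRDs are not yet present (they arrive in step 5). It is turned on in step 6-1. The tolerations[...] arguments are quoted so that shells such as zsh do not treat the brackets as glob patterns.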
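
The privileged and toleration overrides from step 4-2 can also live in a values file, which avoids the bracket syntax and is easier to keep in version control. The sketch below writes such a file. `write_dcgm_values` and the default file name are hypothetical, and the keys are simply the ones used by the --set flags in this guide.

```shell
# Write the DCGM Exporter overrides used in this guide to a Helm values file.
# $1: output path (hypothetical default: dcgm-values.yaml)
write_dcgm_values() {
  local out
  out=${1:-dcgm-values.yaml}
  cat > "$out" <<'EOF'
securityContext:
  privileged: true
tolerations:
  - key: nvidia.com/gpu
    operator: Exists
    effect: NoSchedule
EOF
}

# Usage (replace <REPO> with the alias of the repo that hosts the chart):
#   write_dcgm_values dcgm-values.yaml
#   helm upgrade --install dcgm-exporter <REPO>/dcgm-exporter \
#     --namespace monitoring -f dcgm-values.yaml
```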

4-3. Verify the installation

kubectl get pods -n monitoring -o wide | grep dcgm

The number of dcgm-exporter Pods should equal the number of GPU nodes → healthy
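
Beyond counting Pods, it can help to confirm which metrics the exporter actually serves, since the set of enabled counters varies by configuration. The sketch below extracts the distinct DCGM metric names from the exporter's text output. `dcgm_metric_names` is a hypothetical helper. The Service name `dcgm-exporter` and port 9400 in the usage comment assume the release name used in this guide and the exporter's default port.

```shell
# Read Prometheus text-format metrics on stdin and print the distinct
# DCGM metric names (comment lines starting with '#' are ignored).
dcgm_metric_names() {
  grep -o '^DCGM_[A-Z0-9_]*' | sort -u
}

# Usage (needs cluster access):
#   kubectl port-forward -n monitoring svc/dcgm-exporter 9400 &
#   curl -s http://localhost:9400/metrics | dcgm_metric_names
```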

5️⃣ Install Prometheus (Helm, Operator-based)

👉 The Prometheus Operator is used so that ServiceMonitor resources are picked up automatically.

5-1. Add the kube-prometheus-stack repo

helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update

5-2. Install Prometheus

helm install prometheus prometheus-community/kube-prometheus-stack \
  --namespace monitoring

⏳ Takes roughly 1 to 2 minutes.

5-3. Check the Prometheus Pods

kubectl get pods -n monitoring
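
The stack starts a dozen or so Pods, so scanning the list by eye is error-prone. The sketch below filters the listing down to the Pods that still need attention. `not_ready_pods` is a hypothetical helper that assumes the default `kubectl get pods --no-headers` column order (NAME, READY, STATUS, ...).

```shell
# Read `kubectl get pods --no-headers` output on stdin and print each Pod
# that is not fully ready: STATUS is not Running, or READY is n/m with n != m.
# Pods in Completed state (finished Jobs) are skipped.
not_ready_pods() {
  awk '
    $3 == "Completed" { next }
    { split($2, ready, "/") }
    $3 != "Running" || ready[1] != ready[2] { print $1, $2, $3 }
  '
}

# Usage (empty output means everything is up):
#   kubectl get pods -n monitoring --no-headers | not_ready_pods
```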

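6️⃣ Connect DCGM Exporter ↔ Prometheus

6-1. Upgrade DCGM Exporter (enable the ServiceMonitor)

This uses NVIDIA's dcgm-exporter chart repo. If it is not registered yet, run helm repo add gpu-helm-charts https://nvidia.github.io/dcgm-exporter/helm-charts first.

helm upgrade dcgm-exporter gpu-helm-charts/dcgm-exporter \
  --namespace monitoring \
  --set serviceMonitor.enabled=true \
  --set serviceMonitor.additionalLabels.release=prometheus \
  --set securityContext.privileged=true \
  --set 'tolerations[0].key=nvidia.com/gpu' \
  --set 'tolerations[0].operator=Exists' \
  --set 'tolerations[0].effect=NoSchedule'

The release=prometheus label must match the Helm release name used in step 5-2, because that is the label Prometheus selects ServiceMonitors by. The ServiceMonitor is created in the release namespace (monitoring). Value key names can differ between chart versions, so confirm them with helm show values gpu-helm-charts/dcgm-exporter if the label does not appear.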

6-2. Check the Prometheus targets

Port-forward:

kubectl port-forward -n monitoring svc/prometheus-kube-prometheus-prometheus 9090

In the browser, open http://localhost:9090 and go to:

Status → Targets → dcgm-exporter (UP)
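
If the dcgm-exporter target does not show up, the usual suspect is the ServiceMonitor label. The sketch below lists the ServiceMonitors that carry the expected release label. `servicemonitors_for_release` is a hypothetical wrapper, and the defaults (`prometheus`, `monitoring`) are the release name and namespace used in this guide.

```shell
# List ServiceMonitors that carry the release label Prometheus selects on.
# $1: Helm release name of kube-prometheus-stack (default: prometheus)
# $2: namespace to search (default: monitoring)
servicemonitors_for_release() {
  kubectl get servicemonitors -n "${2:-monitoring}" -l "release=${1:-prometheus}" -o name
}

# Usage:
#   servicemonitors_for_release            # dcgm-exporter should be listed
# The selector Prometheus actually applies can be inspected with:
#   kubectl get prometheus -n monitoring \
#     -o jsonpath='{.items[0].spec.serviceMonitorSelector}'
```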

7️⃣ Grafana: install & access

Grafana is already bundled with kube-prometheus-stack, so no separate install is needed.

7-1. Check the Grafana service

kubectl get svc -n monitoring | grep grafana

7-2. Open Grafana

kubectl port-forward -n monitoring svc/prometheus-grafana 3000:80

In the browser:

http://localhost:3000

7-3. Grafana initial login

Field	Value
ID	admin
PW	prom-operator (chart default)
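
If the default password has been overridden in the chart values, the current one can be read from the Secret that the chart creates. This is a sketch: `grafana_admin_password` is a hypothetical wrapper, and the Secret name `prometheus-grafana` assumes the release name `prometheus` used in this guide.

```shell
# Print the Grafana admin password stored in the chart-managed Secret.
# $1: namespace (default: monitoring)
# $2: Secret name (default: prometheus-grafana, i.e. <release>-grafana)
grafana_admin_password() {
  kubectl get secret -n "${1:-monitoring}" "${2:-prometheus-grafana}" \
    -o jsonpath='{.data.admin-password}' | base64 -d
}

# Usage:
#   grafana_admin_password; echo
```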

8️⃣ Add the DCGM Exporter dashboard to Grafana

8-1. Import the dashboard

Grafana → + → Import
Dashboard ID: 12239
Data source: Prometheus

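9️⃣ Check the GPU metrics (PromQL)

# GPU utilization (%)
DCGM_FI_DEV_GPU_UTIL

# GPU memory utilization (%). The total is derived from used + free,
# because DCGM_FI_DEV_FB_TOTAL may not be in the exporter's default counter set.
DCGM_FI_DEV_FB_USED / (DCGM_FI_DEV_FB_USED + DCGM_FI_DEV_FB_FREE) * 100

# GPU power draw (W)
DCGM_FI_DEV_POWER_USAGE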
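
The same queries can be run from a terminal through the Prometheus HTTP API, which is handy for scripted checks. The sketch below wraps the instant-query endpoint. `prom_query` and the `PROM_URL` variable are hypothetical names, and the default URL assumes the port-forward from step 6-2 is still running.

```shell
# Run an instant PromQL query against the Prometheus HTTP API and print
# the raw JSON response.
# $1: PromQL expression
# PROM_URL: base URL of Prometheus (default: http://localhost:9090)
prom_query() {
  curl -sG "${PROM_URL:-http://localhost:9090}/api/v1/query" \
    --data-urlencode "query=$1"
}

# Usage:
#   prom_query 'DCGM_FI_DEV_GPU_UTIL'
#   prom_query 'avg by (Hostname) (DCGM_FI_DEV_POWER_USAGE)'
```

The label name `Hostname` in the second usage line is an assumption about the exporter's label set. Check the labels on a raw series first and adjust the grouping accordingly.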

🔟 Full installation checklist

kubectl get pods -n kube-system | grep nvidia
kubectl get pods -n monitoring | grep dcgm
kubectl get pods -n monitoring | grep prometheus
kubectl get pods -n monitoring | grep grafana

⚠️ Common failure points on NCP in practice

Symptom	Cause
No GPU metrics	Device Plugin not installed
dcgm Pod stuck in Pending	Missing toleration
Prometheus target Down	ServiceMonitor label mismatch
No data in Grafana	Prometheus data source not selected

🎯 What you can do once you reach this point

✔ Real-time GPU utilization
✔ GPU memory and power monitoring
✔ Per-node GPU status
✔ A working foundation for GPUaaS operations
