Based on the "production-standard label set + TFJob", this post covers GPU-dedicated node affinity plus the final production output PVC + sidecar upload (TFJob YAML).
The YAML below is the production-standard version reflecting the Chief + Worker structure, GPU label-based scheduling, an output PVC with S3 upload, and the GPU MIG profile.
📄 tfjob-train-prod.yaml
apiVersion: kubeflow.org/v1
kind: TFJob
metadata:
  name: tf-mnist-train
  namespace: ml
spec:
  runPolicy:
    cleanPodPolicy: None
  tfReplicaSpecs:
    # =========================
    # Chief (saves results + uploads)
    # =========================
    Chief:
      replicas: 1
      restartPolicy: OnFailure
      template:
        spec:
          nodeSelector:
            gpu.vendor: nvidia
            gpu.model: A100
            gpu.mem: 80gb
            gpu.mig: enabled
            gpu.mig.profile: 1g.10gb
            gpu.role: chief
            gpu.pool: train
          containers:
          # 1️⃣ TensorFlow trainer (uses the GPU)
          - name: trainer
            image: tensorflow/tensorflow:2.14.0-gpu
            command: ["python", "train.py"]
            resources:
              requests:
                nvidia.com/gpu: 1
              limits:
                nvidia.com/gpu: 1
            volumeMounts:
            - name: output
              mountPath: /output
          # 2️⃣ Sidecar uploader (no GPU)
          - name: uploader
            image: amazon/aws-cli
            command: ["/bin/sh", "-c"]
            args:
            - |
              while true; do
                if [ -f /output/DONE ]; then
                  aws s3 sync /output \
                    s3://tf-result-bucket/job-${HOSTNAME} \
                    --endpoint-url $AWS_ENDPOINT_URL
                  exit 0
                fi
                sleep 30
              done
            envFrom:
            - secretRef:
                name: objstore-cred
            resources:
              requests:
                nvidia.com/gpu: 0
              limits:
                nvidia.com/gpu: 0
            volumeMounts:
            - name: output
              mountPath: /output
          volumes:
          - name: output
            persistentVolumeClaim:
              claimName: tf-output-pvc
    # =========================
    # Worker (compute only)
    # =========================
    Worker:
      replicas: 4
      restartPolicy: OnFailure
      template:
        spec:
          nodeSelector:
            gpu.vendor: nvidia
            gpu.model: A100
            gpu.mem: 80gb
            gpu.mig: enabled
            gpu.mig.profile: 1g.10gb
            gpu.role: worker
            gpu.pool: train
          containers:
          - name: trainer
            image: tensorflow/tensorflow:2.14.0-gpu
            command: ["python", "train.py"]
            resources:
              requests:
                nvidia.com/gpu: 1
              limits:
                nvidia.com/gpu: 1
            volumeMounts:
            - name: output
              mountPath: /output  # Workers don't write results, but mounting the PVC enables checkpoint sharing
          volumes:
          - name: output
            persistentVolumeClaim:
              claimName: tf-output-pvc
🔑 Key points applied
1. Node affinity / label criteria
nodeSelector:
  gpu.vendor: nvidia
  gpu.model: A100
  gpu.mem: 80gb
  gpu.mig: enabled
  gpu.mig.profile: 1g.10gb
  gpu.role: chief/worker
  gpu.pool: train
→ the production-standard label set applied as-is
2. Chief
- Uses a GPU + output PVC + sidecar S3 upload
3. Worker
- Uses a GPU, compute only
- Simply mounting the PVC lets Workers share checkpoints with the Chief; if needed, write train.py so that checkpoints are saved under /output (a minimal sketch follows)
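A minimal sketch of what that train.py could look like, assuming MultiWorkerMirroredStrategy and the TF_CONFIG environment the TFJob operator injects; the model and dataset here are placeholders, not the post's actual training code:

# Minimal train.py sketch (assumptions: TF_CONFIG is injected by the TFJob
# operator, /output is the mounted PVC; model/dataset are placeholders).
import json
import os

import tensorflow as tf

strategy = tf.distribute.MultiWorkerMirroredStrategy()

tf_config = json.loads(os.environ.get("TF_CONFIG", "{}"))
task = tf_config.get("task", {})
# This TFJob defines a Chief replica, so only that Pod gets task type "chief".
is_chief = task.get("type") in (None, "chief")

with strategy.scope():
    model = tf.keras.Sequential([
        tf.keras.layers.Flatten(input_shape=(28, 28)),
        tf.keras.layers.Dense(128, activation="relu"),
        tf.keras.layers.Dense(10),
    ])
    model.compile(
        optimizer="adam",
        loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
        metrics=["accuracy"],
    )

(x_train, y_train), _ = tf.keras.datasets.mnist.load_data()
x_train = x_train / 255.0

# Saving is a collective operation under MultiWorkerMirroredStrategy, so every
# replica saves; only the chief writes to the shared /output PVC, while workers
# write to a throwaway local path.
ckpt_dir = "/output/ckpt" if is_chief else "/tmp/ckpt"
model.fit(
    x_train, y_train, epochs=3,
    callbacks=[tf.keras.callbacks.ModelCheckpoint(
        os.path.join(ckpt_dir, "ckpt-{epoch}"), save_weights_only=True)],
)

model.save("/output/model" if is_chief else "/tmp/model")

if is_chief:
    # The DONE marker tells the uploader sidecar to start the S3 sync.
    open("/output/DONE", "w").close()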
4. GPU resources
resources:
  requests:
    nvidia.com/gpu: 1
  limits:
    nvidia.com/gpu: 1
5. PVC
- tf-output-pvc must be created in advance
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: tf-output-pvc
  namespace: ml
spec:
  accessModes:
  - ReadWriteMany  # or ReadWriteOnce
  resources:
    requests:
      storage: 500Gi
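Assuming the manifest above is saved as tf-output-pvc.yaml (the file name is just an example), creating and checking it looks roughly like this:

kubectl apply -f tf-output-pvc.yaml

# The PVC should be Bound (or waiting for its first consumer, depending on the
# StorageClass binding mode) before the TFJob starts.
kubectl get pvc tf-output-pvc -n ml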
💡 Tip
- For Worker Pods to write to the PVC, ReadWriteMany (RWX) is required
- The MIG profile and GPU labels must match exactly for the scheduler to place the Pods correctly
- Only the Chief Pod handles the sidecar upload
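A hedged sketch of applying and checking this label set on a node; the node name gpu-a100-01 is a placeholder:

# Apply the standard label set to a GPU node (placeholder node name).
kubectl label node gpu-a100-01 \
  gpu.vendor=nvidia gpu.model=A100 gpu.mem=80gb \
  gpu.mig=enabled gpu.mig.profile=1g.10gb \
  gpu.role=worker gpu.pool=train --overwrite

# Show the label columns the nodeSelector keys on, to confirm they match.
kubectl get nodes -L gpu.model,gpu.mig.profile,gpu.role,gpu.pool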
======================================================
1️⃣ GPU resources when MIG is enabled
Key points
- On a MIG-enabled GPU, each Pod must specify how many MIG slices it will actually use
- nvidia.com/gpu: 1 means "use one whole GPU" (the wrong expression in MIG mode)
- The correct production standard is to request per MIG slice:
resources:
  requests:
    nvidia.com/mig-1g.10gb: 1
  limits:
    nvidia.com/mig-1g.10gb: 1
In other words, the MIG profile name itself must be used as the resource name for the scheduler to grab exactly one MIG slice.
⚠️ In a MIG environment, nvidia.com/gpu: 1 amounts to reserving the whole GPU, so utilization suffers.
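One way to confirm which slice resource name the node actually advertises (it has to match what the Pod requests) is to inspect the node's allocatable resources; gpu-a100-01 is again a placeholder node name:

# The GPU device plugin exposes MIG slices as extended resources such as
# nvidia.com/mig-1g.10gb (the exact name depends on the MIG strategy in use).
kubectl describe node gpu-a100-01 | grep -i nvidia.com/

# Or pull just the allocatable map:
kubectl get node gpu-a100-01 -o jsonpath='{.status.allocatable}'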
2️⃣ The Worker PVC problem
- Block storage supports only ReadWriteOnce (RWO) → no concurrent attach
- Multiple Worker Pods writing at the same time would conflict → RWX is not an option here
- Therefore, with an RWO PVC the Worker Pods:
  - let only the Chief write, sharing checkpoints through it
  - don't write themselves; they mount read-only at most, or use tmpfs / local SSD (sketch below)
The earlier version I posted with RWX attached was a mistake. For production, the correct structure is RWO with only the Chief writing.
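For Workers that still need local scratch space (temporary checkpoints, for example), one option in line with the tmpfs / local SSD note above is an emptyDir volume instead of the PVC; a minimal sketch:

# Sketch: per-Pod scratch for a Worker instead of the shared PVC.
# medium: Memory makes this a tmpfs; drop it to use node-local disk instead.
containers:
- name: trainer
  volumeMounts:
  - name: scratch
    mountPath: /scratch
volumes:
- name: scratch
  emptyDir:
    medium: Memory
    sizeLimit: 10Gi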
## Final TFJob.yaml
3️⃣ Revised production-standard TFJob (MIG + RWO)
apiVersion: kubeflow.org/v1
kind: TFJob
metadata:
  name: tf-mnist-train
  namespace: ml
spec:
  runPolicy:
    cleanPodPolicy: None
  tfReplicaSpecs:
    Chief:
      replicas: 1
      restartPolicy: OnFailure
      template:
        spec:
          nodeSelector:
            gpu.vendor: nvidia
            gpu.model: A100
            gpu.mem: 80gb
            gpu.mig: enabled
            gpu.mig.profile: 1g.10gb
            gpu.role: chief
            gpu.pool: train
          containers:
          - name: trainer
            image: tensorflow/tensorflow:2.14.0-gpu
            command: ["python", "train.py"]
            resources:
              requests:
                nvidia.com/mig-1g.10gb: 1
              limits:
                nvidia.com/mig-1g.10gb: 1
            volumeMounts:
            - name: output
              mountPath: /output
          - name: uploader
            image: amazon/aws-cli
            command: ["/bin/sh", "-c"]
            args:
            - |
              while true; do
                if [ -f /output/DONE ]; then
                  aws s3 sync /output \
                    s3://tf-result-bucket/job-${HOSTNAME} \
                    --endpoint-url $AWS_ENDPOINT_URL
                  exit 0
                fi
                sleep 30
              done
            envFrom:
            - secretRef:
                name: objstore-cred
            resources:
              requests:
                nvidia.com/gpu: 0
              limits:
                nvidia.com/gpu: 0
            volumeMounts:
            - name: output
              mountPath: /output
          volumes:
          - name: output
            persistentVolumeClaim:
              claimName: tf-output-pvc  # RWO
    Worker:
      replicas: 4
      restartPolicy: OnFailure
      template:
        spec:
          nodeSelector:
            gpu.vendor: nvidia
            gpu.model: A100
            gpu.mem: 80gb
            gpu.mig: enabled
            gpu.mig.profile: 1g.10gb
            gpu.role: worker
            gpu.pool: train
          containers:
          - name: trainer
            image: tensorflow/tensorflow:2.14.0-gpu
            command: ["python", "train.py"]
            resources:
              requests:
                nvidia.com/mig-1g.10gb: 1
              limits:
                nvidia.com/mig-1g.10gb: 1
          # Workers don't use the PVC; GPU compute only
          # volumes removed → avoids the RWO block-storage concurrent-attach problem
🔑 Summary
1. Request resources per MIG slice
- nvidia.com/mig-1g.10gb: 1
- not nvidia.com/gpu: 1
- reserves only a slice, never the whole GPU
2. Worker PVC removed
- With RWO block storage, Workers don't mount the PVC
- Only the Chief saves results and runs the sidecar upload
3. Chief PVC
- RWO is fine → only a single Pod writes
4. NodeSelector
- The production-standard label set applied in full
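To submit and watch the job, assuming the manifest above is saved as tfjob-train-prod.yaml and the Kubeflow Training Operator is installed:

kubectl apply -f tfjob-train-prod.yaml

# Replica status is reported on the TFJob object itself.
kubectl get tfjob tf-mnist-train -n ml
kubectl describe tfjob tf-mnist-train -n ml

# Trainer logs from the chief Pod (the exact Pod name depends on the operator version).
kubectl logs -n ml tf-mnist-train-chief-0 -c trainer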