
[Important 2] Operational Standard - TFJob.yaml (Labels / MIG / RWO / S3 Applied)

by METAVERSE STORY 2026. 1. 28.

 

 

Built around the "operational standard label set + TFJob":
GPU-dedicated node affinity, plus the final production output PVC + sidecar upload, all in one TFJob YAML.

The YAML below is the operational standard version, reflecting
a Chief + Worker topology, scheduling against the GPU label set, an output PVC + S3 upload,
and the GPU MIG profile.

 

 


📄 tfjob-train-prod.yaml

 
apiVersion: kubeflow.org/v1
kind: TFJob
metadata:
  name: tf-mnist-train
  namespace: ml
spec:
  runPolicy:
    cleanPodPolicy: None
  tfReplicaSpecs:

    # =========================
    # Chief (saves results + uploads)
    # =========================
 
    Chief:
      replicas: 1
      restartPolicy: OnFailure
      template:
        spec:
          nodeSelector:
            gpu.vendor: nvidia
            gpu.model: A100
            gpu.mem: 80gb
            gpu.mig: enabled
            gpu.mig.profile: 1g.10gb
            gpu.role: chief
            gpu.pool: train

          containers:
 
            # 1️⃣ TensorFlow trainer (uses the GPU)
            - name: trainer
              image: tensorflow/tensorflow:2.14.0-gpu
              command: ["python", "train.py"]
              resources:
                requests:
                  nvidia.com/gpu: 1
                limits:
                  nvidia.com/gpu: 1
              volumeMounts:
                - name: output
                  mountPath: /output
 
            # 2️⃣ Sidecar uploader (no GPU)
            - name: uploader
              image: amazon/aws-cli
              command: ["/bin/sh", "-c"]
              args:
                - |
                  while true; do
                    if [ -f /output/DONE ]; then
                      aws s3 sync /output \
                        s3://tf-result-bucket/job-${HOSTNAME} \
                        --endpoint-url $AWS_ENDPOINT_URL
                      exit 0
                    fi
                    sleep 30
                  done
              envFrom:
                - secretRef:
                    name: objstore-cred
              resources:
                requests:
                  nvidia.com/gpu: 0
                limits:
                  nvidia.com/gpu: 0
              volumeMounts:
                - name: output
                  mountPath: /output
          volumes:
            - name: output
              persistentVolumeClaim:
                claimName: tf-output-pvc

    # =========================
    # Worker (compute only)
    # =========================
    Worker:
      replicas: 4
      restartPolicy: OnFailure
      template:
        spec:
          nodeSelector:
            gpu.vendor: nvidia
            gpu.model: A100
            gpu.mem: 80gb
            gpu.mig: enabled
            gpu.mig.profile: 1g.10gb
            gpu.role: worker
            gpu.pool: train
          containers:
            - name: trainer
              image: tensorflow/tensorflow:2.14.0-gpu
              command: ["python", "train.py"]
              resources:
                requests:
                  nvidia.com/gpu: 1
                limits:
                  nvidia.com/gpu: 1
 
 
              # ===== Flagged for removal (see the RWO discussion below) =====
              volumeMounts:
                - name: output
                  mountPath: /output  # Worker does not save results, but mounting the PVC would allow checkpoint sharing
          volumes:
            - name: output
              persistentVolumeClaim:
                claimName: tf-output-pvc
              # ==============================================================

 

 

 


🔑 What this version reflects

  1. Node affinity / label criteria
 
nodeSelector:
  gpu.vendor: nvidia
  gpu.model: A100
  gpu.mem: 80gb
  gpu.mig: enabled
  gpu.mig.profile: 1g.10gb
  gpu.role: chief/worker
  gpu.pool: train
 
 
 

→ The operational standard label set, applied as-is (the target nodes must carry the same labels; see the labeling sketch below)
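
These labels only appear in the Pod spec here; for the nodeSelector to match, the nodes themselves must already be labeled. A minimal sketch of labeling a node, where the node name gpu-node-01 is a placeholder:

kubectl label node gpu-node-01 \
  gpu.vendor=nvidia gpu.model=A100 gpu.mem=80gb \
  gpu.mig=enabled gpu.mig.profile=1g.10gb \
  gpu.role=worker gpu.pool=train

# verify the columns the scheduler will match against
kubectl get nodes -L gpu.model,gpu.mig.profile,gpu.role,gpu.pool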

 

2. Chief

  • Uses a GPU + a result-storage PVC + the sidecar S3 upload (see the DONE-marker sketch below for how the upload is triggered)
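
The uploader sidecar waits for /output/DONE before syncing, so whatever runs the training must also create that marker when it finishes. One option, sketched here under the assumption that the trainer command is swapped for a small shell wrapper, is:

              command: ["/bin/sh", "-c"]
              args: ["python train.py && touch /output/DONE"]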




3. Worker
  • Uses a GPU, compute only
  • Mounting the PVC makes it possible to share checkpoints with the Chief
  • If needed, write train.py so that checkpoints go to /output

 

4. GPU resources

resources:
  requests:
    nvidia.com/gpu: 1
  limits:
    nvidia.com/gpu: 1
 
 
 
5. PVC
  • tf-output-pvc must be created in advance (an apply example follows the manifest)
 
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: tf-output-pvc
  namespace: ml
spec:
  accessModes:
    - ReadWriteMany   # corrected to ReadWriteOnce (RWO) in the final version below
  resources:
    requests:
      storage: 500Gi
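
Assuming the manifest above is saved as tf-output-pvc.yaml (the filename is only for illustration), it can be created ahead of the TFJob like this:

kubectl apply -n ml -f tf-output-pvc.yaml
kubectl get pvc tf-output-pvc -n ml   # with a WaitForFirstConsumer StorageClass it binds when the first Pod schedules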

 

 

 


💡 Tip

  • A Worker Pod can only write to the PVC if it is ReadWriteMany (RWX)
  • The MIG profile and GPU labels must match exactly, or the scheduler will not place the Pods correctly
  • The sidecar upload is handled by the Chief Pod only (the objstore-cred Secret it relies on is sketched below)
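
The uploader sidecar gets its credentials through envFrom → secretRef: objstore-cred, which injects every key of the Secret as an environment variable. A minimal sketch of that Secret, assuming the standard AWS CLI variable names and a placeholder endpoint URL:

apiVersion: v1
kind: Secret
metadata:
  name: objstore-cred
  namespace: ml
type: Opaque
stringData:
  AWS_ACCESS_KEY_ID: "REPLACE_ME"
  AWS_SECRET_ACCESS_KEY: "REPLACE_ME"
  AWS_ENDPOINT_URL: "https://objectstore.example.local"   # placeholder S3-compatible endpoint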

 

 

 

 

======================================================

 

 

 

1️⃣ GPU resources when MIG is applied

Key points

  • On a MIG-enabled GPU, each Pod must state how many MIG slices it actually uses
  • nvidia.com/gpu: 1 means "one whole GPU", which is the wrong request in MIG mode
  • The correct operational standard is to request per MIG slice:
 
resources:
  limits:
    nvidia.com/mig-1g.10gb: 1
  requests:
    nvidia.com/mig-1g.10gb: 1
 
 
 

In other words, the MIG profile name must be used verbatim as the resource name so the scheduler grabs exactly that MIG slice.

⚠️ In a MIG environment, nvidia.com/gpu: 1 amounts to reserving the whole GPU and kills efficiency
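
Whether a node actually advertises nvidia.com/mig-1g.10gb depends on how the NVIDIA device plugin exposes MIG devices; the quickest check is the node's allocatable resources (node name is a placeholder):

kubectl get node gpu-node-01 -o jsonpath='{.status.allocatable}'
# look for "nvidia.com/mig-1g.10gb" in the output; if only "nvidia.com/gpu" shows up,
# per-slice resources are not exposed and the MIG requests below will never schedule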

 

 

 


2️⃣ The Worker PVC problem

  • Block storage only supports ReadWriteOnce (RWO) → no concurrent attachment
  • If multiple Worker Pods write at the same time they will conflict → RWX is not an option
  • Therefore, for Worker Pods:
    • the PVC stays RWO
    • checkpoints are shared by letting only the Chief write
    • Workers do not write: read-only access at most, or tmpfs / a local SSD (see the emptyDir sketch below)

Attaching RWX in the earlier version was my mistake. By the operational standard, RWO with only the Chief writing is the correct structure.
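
If a Worker still needs local scratch space for temporary checkpoints, an emptyDir volume sidesteps the RWO contention entirely. A sketch of the fragment that would go into the Worker Pod template, where the mount path and size limit are assumptions:

          containers:
            - name: trainer
              volumeMounts:
                - name: scratch
                  mountPath: /scratch      # node-local scratch, not shared with the Chief
          volumes:
            - name: scratch
              emptyDir:
                medium: Memory             # tmpfs; drop this line to use node-local disk instead
                sizeLimit: 20Gi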

 

 

 


📄 Final TFJob.yaml
3️⃣ Revised operational standard TFJob (MIG + RWO)

 
apiVersion: kubeflow.org/v1
kind: TFJob
metadata:
  name: tf-mnist-train
  namespace: ml
spec:
  runPolicy:
    cleanPodPolicy: None
  tfReplicaSpecs:

    Chief:
      replicas: 1
      restartPolicy: OnFailure
      template:
        spec:
          nodeSelector:
            gpu.vendor: nvidia
            gpu.model: A100
            gpu.mem: 80gb
            gpu.mig: enabled
            gpu.mig.profile: 1g.10gb
            gpu.role: chief
            gpu.pool: train
          containers:
            - name: trainer
              image: tensorflow/tensorflow:2.14.0-gpu
              command: ["python", "train.py"]
              resources:
                requests:
                  nvidia.com/mig-1g.10gb: 1
                limits:
                  nvidia.com/mig-1g.10gb: 1
              volumeMounts:
                - name: output
                  mountPath: /output
            - name: uploader
              image: amazon/aws-cli
              command: ["/bin/sh", "-c"]
              args:
                - |
                  while true; do
                    if [ -f /output/DONE ]; then
                      aws s3 sync /output \
                        s3://tf-result-bucket/job-${HOSTNAME} \
                        --endpoint-url $AWS_ENDPOINT_URL
                      exit 0
                    fi
                    sleep 30
                  done
              envFrom:
                - secretRef:
                    name: objstore-cred
              resources:
                requests:
                  nvidia.com/gpu: 0
                limits:
                  nvidia.com/gpu: 0
              volumeMounts:
                - name: output
                  mountPath: /output
          volumes:
            - name: output
              persistentVolumeClaim:
                claimName: tf-output-pvc   # RWO
                

    Worker:
      replicas: 4
      restartPolicy: OnFailure
      template:
        spec:
          nodeSelector:
            gpu.vendor: nvidia
            gpu.model: A100
            gpu.mem: 80gb
            gpu.mig: enabled
            gpu.mig.profile: 1g.10gb
            gpu.role: worker
            gpu.pool: train
          containers:
            - name: trainer
              image: tensorflow/tensorflow:2.14.0-gpu
              command: ["python", "train.py"]
              resources:
                requests:
                  nvidia.com/mig-1g.10gb: 1
                limits:
                  nvidia.com/mig-1g.10gb: 1
              # Worker does not use the PVC (GPU compute only)
          # volumes removed to avoid the RWO block-storage concurrent-access problem

🔑 Summary

  1. Request resources per MIG slice

nvidia.com/mig-1g.10gb: 1

  • Not nvidia.com/gpu: 1
  • Reserves only the slice, not the whole GPU

  2. Worker PVC removed
  • With RWO block storage, Workers do not mount the PVC
  • Only the Chief saves results and runs the sidecar upload

  3. Chief PVC
  • RWO is fine: only a single Pod writes, so it is OK

  4. NodeSelector
  • The operational standard label set, fully applied
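
Submitting and watching the job (the pod names and the training.kubeflow.org/job-name label follow current Kubeflow training-operator conventions and may differ on older operator versions):

kubectl apply -f tfjob-train-prod.yaml
kubectl get tfjob tf-mnist-train -n ml
kubectl get pods -n ml -l training.kubeflow.org/job-name=tf-mnist-train -o wide
kubectl logs -n ml tf-mnist-train-chief-0 -c uploader -f   # watch for the s3 sync to kick in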

 

 

 

 
