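1️⃣ Why requests = limits is the right choice for GPUs

🔹 GPUs behave differently from CPU / memory

CPU / memory → can be overcommitted (requests may be lower than limits)
GPU → cannot be overcommitted (an integer extended resource, assigned exclusively to one container)

In other words,

requests: 1
limits: 1

is effectively the default rule in the GPU world.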
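
The contrast can be sketched in a single container spec. This is a minimal illustration rather than a recommended sizing: the pod name `gpu-demo` and the CPU / memory figures are assumptions chosen only to show that CPU and memory requests may sit below their limits, while the `nvidia.com/gpu` request has to equal its limit.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-demo                 # hypothetical name, for illustration only
  namespace: ml
spec:
  restartPolicy: Never
  containers:
  - name: trainer
    image: tensorflow/tensorflow:2.14.0-gpu
    command: ["python", "-c", "import tensorflow as tf; print(tf.config.list_physical_devices('GPU'))"]
    resources:
      requests:
        cpu: "2"                 # assumed value; may be lower than the limit
        memory: 8Gi              # assumed value; may be lower than the limit
        nvidia.com/gpu: 1        # must equal the limit
      limits:
        cpu: "4"
        memory: 16Gi
        nvidia.com/gpu: 1        # whole GPUs only, no fractions
```

If the GPU request and limit differ, the API server is expected to reject the pod at validation time, so the mismatch never reaches the scheduler.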

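2️⃣ How behavior differs by case

❌ Only limits specified

limits:
  nvidia.com/gpu: 1

Kubernetes implicitly sets requests to the limit value (requests = 1)
This works in most cases
❗ However:
- The resource accounting is not explicit in the manifest
- It can cause confusion in quota / capacity calculations
- It is a poor fit for documented operating standards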
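
To make the implicit defaulting concrete, the fragment below sketches what is submitted and what one would expect to see when the pod is read back (for example with `kubectl get pod -o yaml`). The exact rendering of the quantities may vary by version, so treat the second document as an approximation.

```yaml
# Submitted manifest fragment: limits only
resources:
  limits:
    nvidia.com/gpu: 1
---
# Expected stored form: requests filled in from limits
resources:
  limits:
    nvidia.com/gpu: "1"
  requests:
    nvidia.com/gpu: "1"
```

The scheduler therefore sees the same numbers either way; the difference is that anyone reading the limits-only manifest has to know about the defaulting rule.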

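✅ Both requests and limits specified (recommended)

requests:
  nvidia.com/gpu: 1
limits:
  nvidia.com/gpu: 1

The value the scheduler uses is stated explicitly in the manifest
GPU bin-packing is accurate
Quota / fair-share accounting is stable
Behavior after failures and rescheduling is more predictable

👉 The larger the GPU cluster, the more essential this becomes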
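
One place where explicit requests pay off is namespace quota. For extended resources such as GPUs, Kubernetes quota is expressed against the `requests.` prefix, so a quota like the sketch below counts exactly the number written under `requests`. The quota name `gpu-quota` and the cap of 8 are assumptions for illustration.

```yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: gpu-quota                  # hypothetical name
  namespace: ml
spec:
  hard:
    requests.nvidia.com/gpu: "8"   # assumed cap: at most 8 GPUs requested in this namespace
```

With this in place, a job that requests 5 GPUs in total (1 Chief + 4 Workers at one GPU each) fits under the cap, and a second identical job would be blocked until the first releases its pods.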

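3️⃣ Final recommendation, judged by "stability"

✔ Stable scheduling
✔ No wasted GPUs
✔ Predictable recovery after failures
✔ Standardized operations

👉 requests = limits = number of GPUs

✅ Final TFJob YAML (stability first)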

apiVersion: kubeflow.org/v1
kind: TFJob
metadata:
  name: tf-mnist-train
  namespace: ml
spec:
  runPolicy:
    cleanPodPolicy: None

  tfReplicaSpecs:

    # =========================
    # Chief (saves and uploads results)
    # =========================
    Chief:
      replicas: 1
      restartPolicy: OnFailure
      template:
        spec:
          containers:

          # 1️⃣ TensorFlow Trainer (uses GPU)
          - name: trainer
            image: tensorflow/tensorflow:2.14.0-gpu
            command: ["python", "train.py"]
            resources:
              requests:
                nvidia.com/gpu: 1
              limits:
                nvidia.com/gpu: 1
            volumeMounts:
            - name: output
              mountPath: /output

          # 2️⃣ Sidecar Uploader (no GPU)
          - name: uploader
            image: amazon/aws-cli
            command: ["/bin/sh", "-c"]
            args:
              - |
                while true; do
                  if [ -f /output/DONE ]; then
                  # Use ${HOSTNAME}: $(HOSTNAME) would be run by the shell as a command
                  aws s3 sync /output \
                    "s3://tf-result-bucket/job-${HOSTNAME}" \
                    --endpoint-url "$AWS_ENDPOINT_URL"
                    exit 0
                  fi
                  sleep 30
                done
            envFrom:
            - secretRef:
                name: objstore-cred
            resources:
              requests:
                nvidia.com/gpu: 0
              limits:
                nvidia.com/gpu: 0
            volumeMounts:
            - name: output
              mountPath: /output

          volumes:
          - name: output
            persistentVolumeClaim:
              claimName: tf-output-pvc

    # =========================
    # Worker (compute only)
    # =========================
    Worker:
      replicas: 4
      restartPolicy: OnFailure
      template:
        spec:
          containers:
          - name: trainer
            image: tensorflow/tensorflow:2.14.0-gpu
            command: ["python", "train.py"]
            resources:
              requests:
                nvidia.com/gpu: 1
              limits:
                nvidia.com/gpu: 1
            # ❌ No PVC (important)
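
Because only the Chief mounts `tf-output-pvc` and the Workers mount nothing, a single-writer claim should be enough. The sketch below is one possible definition; the 20Gi size and the reliance on the cluster's default StorageClass are assumptions, not values from the job above.

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: tf-output-pvc
  namespace: ml
spec:
  accessModes:
  - ReadWriteOnce          # one pod (the Chief) mounts it; both of its containers share the volume
  resources:
    requests:
      storage: 20Gi        # assumed size; adjust to checkpoint and model output volume
```

If Workers ever needed to write to the same volume from other nodes, the claim would have to move to a ReadWriteMany-capable storage backend.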

🔍 One-line summary for operators

For GPU workloads, requests = limits is what "stability" means in practice,
and this TFJob YAML is a final version that can be used as-is even on large GPU clusters.

