
[Important 2] Operations Standard - [Final] Train.py & TFJob.yaml (labels / MIG / RWO applied)

by METAVERSE STORY 2026. 1. 30.

 

 

This post walks through a complete, production-ready TFJob deployment set from start to finish, step by step, in the order of an actual deployment.


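Goal: only the Chief writes checkpoints and a sidecar uploads them to S3; Workers are compute-only; MIG slices and an RWO volume are used; the GPU node labels are respected.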


🚀 Production TFJob Deployment Set: Step by Step


1️⃣ Prepare the Namespace

In production, a separate Namespace per project or team is recommended.

 
kubectl create namespace ml

Verify:

 
kubectl get ns

 

 


2️⃣ Prepare the PVC (RWO, Chief only)

  • With block storage, the access mode is RWO (ReadWriteOnce)
  • Workers do not mount the PVC; they only compute
 
# tf-output-pvc.yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: tf-output-pvc
  namespace: ml
spec:
  accessModes:
    - ReadWriteOnce   # Chief only
  resources:
    requests:
      storage: 500Gi
 
 
 

Deploy:

 
kubectl apply -f tf-output-pvc.yaml

Verify:

 
kubectl get pvc -n ml

 

 

 


3️⃣ Prepare train.py

  • Only the Chief saves checkpoints and triggers the sidecar upload
  • Workers are compute-only
 
# train.py
import json
import os

import tensorflow as tf

# 1️⃣ Read TF_CONFIG (set on every replica by the TFJob operator)
tf_config = os.environ.get("TF_CONFIG", "{}")
tf_config_json = json.loads(tf_config)
task_type = tf_config_json.get("task", {}).get("type", "worker")
is_chief = task_type in ["chief", "master"]

print(f"[INFO] Task type: {task_type}, is_chief: {is_chief}")

# 2️⃣ Distribution strategy
strategy = tf.distribute.MultiWorkerMirroredStrategy()
with strategy.scope():
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(128, activation="relu"),
        tf.keras.layers.Dense(10, activation="softmax")
    ])
    model.compile(
        optimizer=tf.keras.optimizers.Adam(learning_rate=1e-3),
        loss="sparse_categorical_crossentropy",
        metrics=["accuracy"]
    )

# 3️⃣ Dataset (random placeholder; replace with NAS / object storage data in production)
dataset = tf.data.Dataset.from_tensor_slices((
    tf.random.normal([1000, 32]),
    tf.random.uniform([1000], maxval=10, dtype=tf.int32)
)).batch(32)

# 4️⃣ Checkpoint callback (Chief only; /output is the Chief's RWO PVC mount)
# Note: the TensorFlow multi-worker guide recommends running saves on every
# worker (non-chief workers write to a temporary path). If the Chief stalls
# while saving, register this callback on all replicas instead.
checkpoint_dir = "/output/checkpoints"
callbacks = []
if is_chief:
    os.makedirs(checkpoint_dir, exist_ok=True)
    checkpoint_cb = tf.keras.callbacks.ModelCheckpoint(
        filepath=os.path.join(checkpoint_dir, "ckpt-{epoch}"),
        save_weights_only=True,
        save_freq="epoch"
    )
    callbacks.append(checkpoint_cb)

# 5️⃣ Training
model.fit(dataset, epochs=10, callbacks=callbacks)

# 6️⃣ Mark the Chief as finished (the sidecar uploads once this file exists)
if is_chief:
    done_file = "/output/DONE"
    with open(done_file, "w") as f:
        f.write("done")
    print(f"[INFO] Chief training done. Created {done_file}")
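The role detection in step 1️⃣ of train.py can be pulled out into a small helper that is testable without TensorFlow. This is only a sketch: the name `resolve_role` is hypothetical, and the fallback "worker 0 acts as chief when the cluster has no chief replica" is an assumption based on TensorFlow's multi-worker convention, not something the script above does.

```python
import json
import os
from typing import Tuple


def resolve_role(tf_config_raw: str) -> Tuple[str, int, bool]:
    """Return (task_type, task_index, is_chief) from a TF_CONFIG JSON string."""
    config = json.loads(tf_config_raw or "{}")
    task = config.get("task", {})
    task_type = task.get("type", "worker")
    task_index = int(task.get("index", 0))

    cluster = config.get("cluster", {})
    has_dedicated_chief = "chief" in cluster or "master" in cluster

    if task_type in ("chief", "master"):
        is_chief = True
    else:
        # Assumption: without a dedicated chief replica, worker 0 takes the chief role.
        is_chief = (not has_dedicated_chief
                    and task_type == "worker"
                    and task_index == 0)
    return task_type, task_index, is_chief


if __name__ == "__main__":
    # Inside a TFJob pod this prints the role derived from the injected TF_CONFIG.
    print(resolve_role(os.environ.get("TF_CONFIG", "{}")))
```

With the Chief + Worker layout used in this post, only the Chief replica resolves to `is_chief=True`. A local run with no TF_CONFIG also resolves to chief under this sketch, so checkpoints would still be written during a single-process smoke test.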
 
 
 
 
 
 

4️⃣ Prepare the Docker Image

  • Based on the TensorFlow GPU image
  • Includes train.py
 
# Dockerfile
FROM tensorflow/tensorflow:2.14.0-gpu

WORKDIR /app
COPY train.py /app/train.py

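# Optional: install an S3 client library if train.py itself must reach object storage
# (not needed here, because the uploader sidecar already handles the upload)
# RUN pip install boto3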

CMD ["python", "train.py"]
 
 
 
 

Build & push:

 
docker build -t myrepo/tf-mnist-train:latest .
docker push myrepo/tf-mnist-train:latest
 
 
 
 
 

5️⃣ Prepare the Production TFJob YAML

  • Chief + Worker structure
  • MIG slice resources
  • Chief: PVC + sidecar upload
  • Workers: no PVC, compute-only
  • NodeSelector follows the operations-standard labels
 
# tfjob-train-prod.yaml
apiVersion: kubeflow.org/v1
kind: TFJob
metadata:
  name: tf-mnist-train
  namespace: ml
spec:
  runPolicy:
    cleanPodPolicy: None
  tfReplicaSpecs:

    Chief:
      replicas: 1
      restartPolicy: OnFailure
      template:
        spec:
          nodeSelector:
            gpu.vendor: nvidia
            gpu.model: A100
            gpu.mem: 80gb
            gpu.mig: enabled
            gpu.mig.profile: 1g.10gb
            gpu.role: chief
            gpu.pool: train

          containers:
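            # Note: the Kubeflow training operator's TFJob validation expects the
            # main container to be named "tensorflow". If the job is rejected,
            # rename "trainer" here, in the Worker spec, and in `kubectl logs -c`.
            - name: trainer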
              image: myrepo/tf-mnist-train:latest
              command: ["python", "train.py"]
              resources:
                requests:
                  nvidia.com/mig-1g.10gb: 1
                limits:
                  nvidia.com/mig-1g.10gb: 1

              volumeMounts:
                - name: output
                  mountPath: /output

            - name: uploader
              image: amazon/aws-cli
              command: ["/bin/sh", "-c"]
              args:
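                - |
                  # Wait for the Chief's DONE marker, then sync /output to object storage.
                  # ${HOSTNAME} is a shell variable (normally the pod name). The original
                  # $(HOSTNAME) form is not expanded by Kubernetes unless HOSTNAME is
                  # declared under env, and the shell then runs it as a command.
                  while true; do
                    if [ -f /output/DONE ]; then
                      aws s3 sync /output \
                        "s3://tf-result-bucket/job-${HOSTNAME}" \
                        --endpoint-url "$AWS_ENDPOINT_URL"
                      exit 0
                    fi
                    sleep 30
                  done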

              envFrom:
                - secretRef:
                    name: objstore-cred
              resources:
                requests:
                  nvidia.com/gpu: 0
                limits:
                  nvidia.com/gpu: 0
 
              volumeMounts:
                - name: output
                  mountPath: /output

          volumes:
            - name: output
              persistentVolumeClaim:
                claimName: tf-output-pvc

    Worker:
      replicas: 4
      restartPolicy: OnFailure
      template:
        spec:
          nodeSelector:
            gpu.vendor: nvidia
            gpu.model: A100
            gpu.mem: 80gb
            gpu.mig: enabled
            gpu.mig.profile: 1g.10gb
            gpu.role: worker
            gpu.pool: train

          containers:
            - name: trainer
              image: myrepo/tf-mnist-train:latest
              command: ["python", "train.py"]
              resources:
                requests:
                  nvidia.com/mig-1g.10gb: 1
                limits:
                  nvidia.com/mig-1g.10gb: 1

          # Workers mount no PVC, which avoids concurrent-attach problems with RWO block storage
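The uploader sidecar's "poll for /output/DONE, then sync" loop can also be sketched in Python, which makes it easier to add a timeout or unit-test the behavior. The function names `wait_and_upload` and `build_sync_command` and the `timeout_seconds` parameter are hypothetical additions for illustration; the bucket name, the /output path, and the `--endpoint-url` flag come from the manifest above.

```python
import os
import time
from typing import Callable, List, Optional


def build_sync_command(hostname: str, endpoint_url: str) -> List[str]:
    """Build the same `aws s3 sync` invocation the sidecar runs."""
    return [
        "aws", "s3", "sync", "/output",
        f"s3://tf-result-bucket/job-{hostname}",
        "--endpoint-url", endpoint_url,
    ]


def wait_and_upload(marker_path: str,
                    upload: Callable[[], None],
                    poll_seconds: float = 30.0,
                    timeout_seconds: Optional[float] = None) -> bool:
    """Poll for marker_path and call upload() once it exists.

    Returns True if upload() ran, False if timeout_seconds elapsed first.
    With timeout_seconds=None this waits indefinitely, like the shell loop.
    """
    deadline = None
    if timeout_seconds is not None:
        deadline = time.monotonic() + timeout_seconds
    while True:
        if os.path.isfile(marker_path):
            upload()
            return True
        if deadline is not None and time.monotonic() >= deadline:
            return False
        time.sleep(poll_seconds)
```

In a real sidecar, `upload` would run `build_sync_command(os.environ["HOSTNAME"], os.environ["AWS_ENDPOINT_URL"])` through `subprocess.run(..., check=True)`, so that a failed sync makes the container exit non-zero instead of silently succeeding.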
 
 
 
 
 
 
 

6️⃣ Deployment Order

  1. Create the Namespace → create the PVC
  2. Build & push the Docker image
  3. Apply the TFJob YAML
 
kubectl apply -f tfjob-train-prod.yaml
 
 
 
 
 
  4. Check the TFJob status
 
kubectl get tfjob -n ml
kubectl get pods -n ml
 
 
 
 
  5. Check the logs
 
kubectl logs -n ml <pod-name> -c trainer
kubectl logs -n ml <pod-name> -c uploader  # Chief sidecar
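For step 4, the TFJob status can be condensed into a single phase plus per-replica counts. The sketch below assumes the JSON comes from `kubectl get tfjob tf-mnist-train -n ml -o json` and that `status.conditions` and `status.replicaStatuses` follow the usual Kubeflow job status layout (condition types such as Created, Running, Restarting, Succeeded, Failed). The helper name `summarize_tfjob` is hypothetical.

```python
from typing import Any, Dict


def summarize_tfjob(tfjob: Dict[str, Any]) -> Dict[str, Any]:
    """Reduce a TFJob object to its current phase and replica counters."""
    status = tfjob.get("status", {})

    # Keep only the condition types whose status is currently "True".
    active = [c.get("type") for c in status.get("conditions", [])
              if c.get("status") == "True"]

    # Terminal conditions take precedence over Running / Created.
    phase = "Unknown"
    for candidate in ("Failed", "Succeeded", "Restarting", "Running", "Created"):
        if candidate in active:
            phase = candidate
            break

    replicas = {
        name: {k: v for k, v in counts.items()
               if k in ("active", "succeeded", "failed")}
        for name, counts in status.get("replicaStatuses", {}).items()
    }
    return {"phase": phase, "replicas": replicas}
```

Feeding it the parsed output of the `kubectl` command (for example through `json.load(sys.stdin)`) gives a quick view of whether the Chief and all four Workers finished.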
 
 
 
 
 
 
 
 

7️⃣ Operations Checklist


 

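Check                   How to verify
Chief uses the PVC      Confirm the RWO PVC is mounted on the Chief pod
Workers have no PVC     Confirm the Worker spec contains no PVC mount
MIG slice reserved      Check resources.requests / limits
Sidecar upload          /output/DONE is created → /output is synced to S3
TF_CONFIG               kubectl exec -it <pod> -n ml -- env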
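The first three rows of the checklist can be checked mechanically before applying the manifest. The sketch below works on the TFJob manifest already loaded into a Python dict (for example with a YAML parser). It assumes that the first container in each replica is the training container; the helper name `check_tfjob` is hypothetical.

```python
from typing import Any, Dict, List

MIG_RESOURCE = "nvidia.com/mig-1g.10gb"
PVC_NAME = "tf-output-pvc"


def _pod_spec(tfjob: Dict[str, Any], replica: str) -> Dict[str, Any]:
    return tfjob["spec"]["tfReplicaSpecs"][replica]["template"]["spec"]


def _pvc_claims(pod_spec: Dict[str, Any]) -> List[str]:
    """Names of all PVCs referenced by the pod's volumes."""
    return [v["persistentVolumeClaim"]["claimName"]
            for v in pod_spec.get("volumes", [])
            if "persistentVolumeClaim" in v]


def check_tfjob(tfjob: Dict[str, Any]) -> List[str]:
    """Return a list of checklist violations (empty list means all checks pass)."""
    problems = []
    chief = _pod_spec(tfjob, "Chief")
    worker = _pod_spec(tfjob, "Worker")

    if PVC_NAME not in _pvc_claims(chief):
        problems.append("Chief does not mount the RWO PVC")
    if _pvc_claims(worker):
        problems.append("Worker mounts a PVC")

    for replica, spec in (("Chief", chief), ("Worker", worker)):
        # Assumption: the first container is the training container.
        resources = spec["containers"][0].get("resources", {})
        requested = resources.get("requests", {}).get(MIG_RESOURCE)
        limited = resources.get("limits", {}).get(MIG_RESOURCE)
        if requested is None or requested != limited:
            problems.append(f"{replica} MIG requests/limits mismatch")
    return problems
```

An empty result means the manifest matches the Chief-only PVC and MIG slice rules described in this post. The sidecar upload and TF_CONFIG rows still need to be verified on the running pods.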

 


💡 Summary

  • Chief = checkpoints + sidecar S3 upload
  • Worker = compute-only, no PVC
  • MIG slice = nvidia.com/mig-1g.10gb
  • PVC = RWO
  • NodeSelector = follows the operations-standard labels

 

 

 
