Machine Learning Training Pipeline

PR Ready

Data preparation → training → evaluation → A/B test → deployment to the inference endpoint.

6 nodes · 6 edges
Tags: argo, mlops, training, kubernetes, ml-pipeline
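Prepare training data (data)

Extract data from the feature store, split it into training, validation, and test sets, and write them to GCS.

sequential → Train model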
Train model (cli)

Fine-tune a transformer model on GPU nodes.

sequential → Evaluate model
Evaluate model (cicd)

Compare accuracy, latency, and fairness metrics against the baseline.

conditional → Shadow-traffic A/B test
conditional → Rollback
Shadow-traffic A/B test (infra)

Route 10% of shadow traffic to the candidate model for 24 hours.

conditional → Deploy to inference service
conditional → Rollback
Deploy to inference service (infra)

Promote the candidate model to the production inference endpoint.

Rollback (infra)

If the A/B test metrics regress, roll back to the previous model version.
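The branching in the node cards above can be sketched as two small routing functions. This is only an illustration under stated assumptions, not how OSOP or Argo actually evaluates conditions: `next_after_evaluate` and `next_after_ab_test` are hypothetical helper names, while the node ids and the comparisons are copied from the conditional edges in the workflow file.

```python
from typing import Optional


def next_after_evaluate(accuracy: float, baseline_accuracy: float) -> str:
    """Pick the node that follows `evaluate` (hypothetical helper).

    Mirrors the two conditional edges: strictly better accuracy moves on
    to the A/B test, anything else (including a tie) goes to rollback.
    """
    return "ab_test" if accuracy > baseline_accuracy else "rollback"


def next_after_ab_test(ab_result: str) -> Optional[str]:
    """Pick the node that follows `ab_test`, or None if no edge matches."""
    if ab_result == "winner":
        return "deploy_serving"
    if ab_result == "loser":
        return "rollback"
    # Any other value (for example an inconclusive test) matches no edge.
    return None


if __name__ == "__main__":
    print(next_after_evaluate(0.91, 0.89))
    print(next_after_evaluate(0.89, 0.89))
    print(next_after_ab_test("winner"))
```

Two details stand out when the conditions are written this way: a candidate that only ties the baseline is routed to rollback, and an `ab_result` other than `winner` or `loser` has no outgoing edge, so the behavior in that case depends on the runner.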

ex-argo-ml-training.osop.yaml
# Argo Workflows ML Training Pipeline — OSOP Portable Workflow
#
# End-to-end ML pipeline: prepare training data, train a model on GPU,
# evaluate against baseline metrics, run an A/B test on shadow traffic,
# and deploy to a serving endpoint if the new model wins.
#
# Run with Argo or validate: osop validate argo-ml-training.osop.yaml

osop_version: "1.0"
id: "argo-ml-training"
name: "Machine Learning Training Pipeline"
description: "Data preparation → training → evaluation → A/B test → deployment to the inference endpoint."
version: "1.0.0"
tags: [argo, mlops, training, kubernetes, ml-pipeline]

nodes:
  - id: "prepare_data"
    type: "data"
    name: "Prepare training data"
    description: "Extract data from the feature store, split it into training, validation, and test sets, and write them to GCS."
    config:
      source: "feature-store://user-embeddings/v3"
      splits: { train: 0.8, val: 0.1, test: 0.1 }

  - id: "train_model"
    type: "cli"
    subtype: "script"
    name: "Train model"
    description: "Fine-tune a transformer model on GPU nodes."
    config:
      command: "python train.py --config config/prod.yaml"
      resources: { gpu: 4, memory: "64Gi" }

  - id: "evaluate"
    type: "cicd"
    subtype: "test"
    name: "Evaluate model"
    description: "Compare accuracy, latency, and fairness metrics against the baseline."
    config:
      metrics: [accuracy, f1, p99_latency, demographic_parity]
      baseline: "models/production/latest"

  - id: "ab_test"
    type: "infra"
    name: "Shadow-traffic A/B test"
    description: "Route 10% of shadow traffic to the candidate model for 24 hours."
    config:
      traffic_split: 0.1
      duration_hours: 24

  - id: "deploy_serving"
    type: "infra"
    name: "Deploy to inference service"
    description: "Promote the candidate model to the production inference endpoint."
    config:
      endpoint: "models/recommendation/v3"
      canary_percent: 5

  - id: "rollback"
    type: "infra"
    name: "Rollback"
    description: "If the A/B test metrics regress, roll back to the previous model version."

edges:
  - from: "prepare_data"
    to: "train_model"
    mode: "sequential"
  - from: "train_model"
    to: "evaluate"
    mode: "sequential"
  - from: "evaluate"
    to: "ab_test"
    mode: "conditional"
    when: "metrics.accuracy > baseline.accuracy"
    label: "Beats baseline"
  - from: "ab_test"
    to: "deploy_serving"
    mode: "conditional"
    when: "ab_result == 'winner'"
    label: "A/B test passed"
  - from: "ab_test"
    to: "rollback"
    mode: "conditional"
    when: "ab_result == 'loser'"
    label: "A/B test failed"
  - from: "evaluate"
    to: "rollback"
    mode: "conditional"
    when: "metrics.accuracy <= baseline.accuracy"
    label: "Below baseline — abort"
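
As a quick sanity check on the graph above, the sketch below verifies that every edge endpoint names a declared node, that all nodes are reachable from `prepare_data`, and which nodes are terminal. It is not a substitute for `osop validate`. The node ids and edges are transcribed by hand from the file rather than parsed, and the function names (`dangling_endpoints`, `reachable_from`, `terminal_nodes`) are hypothetical.

```python
from collections import deque

# Node ids and (from, to) pairs transcribed by hand from the workflow file.
NODES = ["prepare_data", "train_model", "evaluate",
         "ab_test", "deploy_serving", "rollback"]

EDGES = [
    ("prepare_data", "train_model"),
    ("train_model", "evaluate"),
    ("evaluate", "ab_test"),
    ("ab_test", "deploy_serving"),
    ("ab_test", "rollback"),
    ("evaluate", "rollback"),
]


def dangling_endpoints(nodes, edges):
    """Return edge endpoints that do not match any declared node id."""
    known = set(nodes)
    return sorted({n for edge in edges for n in edge if n not in known})


def reachable_from(start, edges):
    """Breadth-first search over the edges, ignoring `when` conditions."""
    adjacency = {}
    for src, dst in edges:
        adjacency.setdefault(src, []).append(dst)
    seen = {start}
    queue = deque([start])
    while queue:
        current = queue.popleft()
        for nxt in adjacency.get(current, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


def terminal_nodes(nodes, edges):
    """Return nodes with no outgoing edge, in declaration order."""
    sources = {src for src, _ in edges}
    return [n for n in nodes if n not in sources]


if __name__ == "__main__":
    print("dangling:", dangling_endpoints(NODES, EDGES))
    print("unreachable:", sorted(set(NODES) - reachable_from("prepare_data", EDGES)))
    print("terminal:", terminal_nodes(NODES, EDGES))
```

For this workflow the check finds no dangling endpoints and no unreachable nodes, and the only terminal nodes are `deploy_serving` and `rollback`, which matches the two possible outcomes of a run.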