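Machine Learning Training Pipeline
Status: PR Ready
Data preparation → training → evaluation → A/B test → deployment to an inference endpoint.
6 nodes · 6 connections
Tags: argo, mlops, training, kubernetes, ml-pipeline
Visualization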
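Prepare training data (data)
Extracts data from the feature store, splits it into training/validation/test sets, and writes them to GCS.
↓ sequential → Train model
Train model (cli)
Fine-tunes a transformer model on GPU nodes.
↓ sequential → Evaluate model
Evaluate model (cicd)
Compares accuracy, latency, and fairness metrics against the baseline.
↓ conditional → Shadow-traffic A/B test
↓ conditional → Rollback
Shadow-traffic A/B test (infra)
Routes 10% of shadow traffic to the candidate model for 24 hours.
↓ conditional → Deploy to inference service
↓ conditional → Rollback
Deploy to inference service (infra)
Promotes the candidate model to the production inference endpoint.
Rollback (infra)
Rolls back to the previous model version if the A/B test metrics regress.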
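The six connections listed above form a small directed graph with two terminal nodes. As a rough illustration, the sketch below transcribes those connections into an adjacency list (using the node ids from the YAML file) and enumerates every route a run can take. The helper name `enumerate_paths` is hypothetical, and the traversal assumes the graph has no cycles, which holds for this workflow.

```python
# Adjacency list transcribed from the six connections in the node list.
# Keys and values are the node ids used in the workflow definition.
GRAPH = {
    "prepare_data": ["train_model"],
    "train_model": ["evaluate"],
    "evaluate": ["ab_test", "rollback"],
    "ab_test": ["deploy_serving", "rollback"],
    "deploy_serving": [],
    "rollback": [],
}


def enumerate_paths(graph, start):
    """Return every path from `start` to a node with no outgoing connection.

    Hypothetical helper for illustration; assumes the graph is acyclic.
    """
    paths = []
    stack = [[start]]
    while stack:
        path = stack.pop()
        successors = graph[path[-1]]
        if not successors:
            # Terminal node reached: this path is complete.
            paths.append(path)
            continue
        for nxt in successors:
            stack.append(path + [nxt])
    return paths


PATHS = sorted(enumerate_paths(GRAPH, "prepare_data"))
for p in PATHS:
    print(" -> ".join(p))
```

Only one of the resulting routes ends in `deploy_serving`; the others end in `rollback`, either after evaluation or after the A/B test.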
ex-argo-ml-training.osop.yaml
# Argo Workflows ML Training Pipeline — OSOP Portable Workflow
#
# End-to-end ML pipeline: prepare training data, train a model on GPU,
# evaluate against baseline metrics, run an A/B test on shadow traffic,
# and deploy to a serving endpoint if the new model wins.
#
# Run with Argo or validate: osop validate argo-ml-training.osop.yaml
osop_version: "1.0"
id: "argo-ml-training"
name: "Machine Learning Training Pipeline"
description: "Data preparation → training → evaluation → A/B test → deployment to an inference endpoint."
version: "1.0.0"
tags: [argo, mlops, training, kubernetes, ml-pipeline]
nodes:
  - id: "prepare_data"
    type: "data"
    name: "Prepare training data"
    description: "Extracts data from the feature store, splits it into training/validation/test sets, and writes them to GCS."
    config:
      source: "feature-store://user-embeddings/v3"
      splits: { train: 0.8, val: 0.1, test: 0.1 }

  - id: "train_model"
    type: "cli"
    subtype: "script"
    name: "Train model"
    description: "Fine-tunes a transformer model on GPU nodes."
    config:
      command: "python train.py --config config/prod.yaml"
      resources: { gpu: 4, memory: "64Gi" }

  - id: "evaluate"
    type: "cicd"
    subtype: "test"
    name: "Evaluate model"
    description: "Compares accuracy, latency, and fairness metrics against the baseline."
    config:
      metrics: [accuracy, f1, p99_latency, demographic_parity]
      baseline: "models/production/latest"

  - id: "ab_test"
    type: "infra"
    name: "Shadow-traffic A/B test"
    description: "Routes 10% of shadow traffic to the candidate model for 24 hours."
    config:
      traffic_split: 0.1
      duration_hours: 24

  - id: "deploy_serving"
    type: "infra"
    name: "Deploy to inference service"
    description: "Promotes the candidate model to the production inference endpoint."
    config:
      endpoint: "models/recommendation/v3"
      canary_percent: 5

  - id: "rollback"
    type: "infra"
    name: "Rollback"
    description: "Rolls back to the previous model version if the A/B test metrics regress."
edges:
  - from: "prepare_data"
    to: "train_model"
    mode: "sequential"

  - from: "train_model"
    to: "evaluate"
    mode: "sequential"

  - from: "evaluate"
    to: "ab_test"
    mode: "conditional"
    when: "metrics.accuracy > baseline.accuracy"
    label: "Beats baseline"

  - from: "ab_test"
    to: "deploy_serving"
    mode: "conditional"
    when: "ab_result == 'winner'"
    label: "A/B test passed"

  - from: "ab_test"
    to: "rollback"
    mode: "conditional"
    when: "ab_result == 'loser'"
    label: "A/B test failed"

  - from: "evaluate"
    to: "rollback"
    mode: "conditional"
    when: "metrics.accuracy <= baseline.accuracy"
    label: "Below baseline - abort"
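To make the branching in the conditional edges concrete, here is a minimal Python sketch of how a runner might choose the next node. It is an illustration under stated assumptions, not how an OSOP or Argo runtime evaluates `when` strings: the predicates are hand-written Python equivalents of the four `when` expressions, the flat context keys (`accuracy`, `baseline_accuracy`, `ab_result`) are assumed names, and `next_node` is a hypothetical helper.

```python
# Conditional edges as (from, to, predicate) tuples.
# The lambdas are hand-written stand-ins for the `when` strings in the workflow;
# the context key names are assumptions made for this sketch.
CONDITIONAL_EDGES = [
    ("evaluate", "ab_test", lambda c: c["accuracy"] > c["baseline_accuracy"]),
    ("evaluate", "rollback", lambda c: c["accuracy"] <= c["baseline_accuracy"]),
    ("ab_test", "deploy_serving", lambda c: c["ab_result"] == "winner"),
    ("ab_test", "rollback", lambda c: c["ab_result"] == "loser"),
]


def next_node(current, ctx):
    """Return the target of the first edge out of `current` whose predicate holds.

    Returns None when no outgoing edge matches the context.
    """
    for src, dst, predicate in CONDITIONAL_EDGES:
        if src == current and predicate(ctx):
            return dst
    return None


# A candidate that strictly beats the baseline moves on to the A/B test.
after_eval = next_node("evaluate", {"accuracy": 0.91, "baseline_accuracy": 0.90})
# A tie does not count as a win: the `<=` edge routes it to rollback.
after_tie = next_node("evaluate", {"accuracy": 0.90, "baseline_accuracy": 0.90})
# An A/B outcome other than 'winner' or 'loser' matches neither edge.
# 'inconclusive' is a hypothetical value used only to show that case.
after_unclear = next_node("ab_test", {"ab_result": "inconclusive"})

print(after_eval, after_tie, after_unclear)
```

The two edges out of `evaluate` are complementary, so one of them always fires. The two edges out of `ab_test` only cover `'winner'` and `'loser'`; if the A/B step can report any other result, the workflow as written has no edge for it.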