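Production Incident Response
AI → Human
A structured incident-handling workflow that combines AI triage with human oversight.
5 nodes · 5 edges
Tags: devops
Node types: event, agent, human, cli
Visualization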
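Alert Triggered (event)
A PagerDuty or Grafana alert fires.
↓ sequential → AI Triage and Diagnosis
AI Triage and Diagnosis (agent)
The AI analyzes logs, metrics, and recent deployment records.
↓ sequential → Engineer Investigation
↓ timeout → Engineer Investigation
Engineer Investigation (human)
The on-call engineer verifies the AI's triage result.
↓ conditional → Apply Mitigation
Apply Mitigation (cli)
Run a rollback, scale up, or apply a hotfix.
↓ sequential → Generate Postmortem Report
Generate Postmortem Report (agent)
The AI writes the postmortem report from .osoplog data.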
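The three edge modes in the flow above (sequential, conditional, timeout) can be read as a small routing rule. The following is a minimal sketch, not the OSOP runtime: the function `next_node`, the `finished` and `elapsed_sec` arguments, and the interpretation of each mode are assumptions made for illustration. The 300-second threshold is taken from the workflow's YAML definition, and the `when` condition is modeled as a plain Python predicate on a hypothetical `confirmed` flag rather than a parsed expression.

```python
from typing import Optional

# Edges of the incident-response flow, mirrored by hand from the workflow.
# "when" is modeled as a Python callable (an assumption for this sketch).
EDGES = [
    {"from": "alert", "to": "triage", "mode": "sequential"},
    {"from": "triage", "to": "investigate", "mode": "sequential"},
    {"from": "triage", "to": "investigate", "mode": "timeout", "timeout_sec": 300},
    {"from": "investigate", "to": "mitigate", "mode": "conditional",
     "when": lambda ctx: ctx.get("confirmed") is True},
    {"from": "mitigate", "to": "postmortem", "mode": "sequential"},
]


def next_node(current: str, ctx: dict, finished: bool, elapsed_sec: float) -> Optional[str]:
    """Return the id of the next node, or None if no outgoing edge is ready.

    Assumed semantics:
      sequential  -> fires once the current node has finished
      conditional -> fires once the node has finished and its predicate holds
      timeout     -> fires if the node is still running after timeout_sec
    """
    for edge in EDGES:
        if edge["from"] != current:
            continue
        mode = edge["mode"]
        if mode == "sequential" and finished:
            return edge["to"]
        if mode == "conditional" and finished and edge["when"](ctx):
            return edge["to"]
        if mode == "timeout" and not finished and elapsed_sec >= edge["timeout_sec"]:
            return edge["to"]
    return None


# Triage still running after five minutes: escalate to the engineer.
escalated = next_node("triage", {}, finished=False, elapsed_sec=300)
# Investigation finished but the AI's result was not confirmed: no mitigation.
blocked = next_node("investigate", {"confirmed": False}, finished=True, elapsed_sec=0)
```

Under these assumptions, both triage edges lead to the same human step; the timeout edge only guarantees that an engineer is pulled in even when the AI triage stalls.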
uc-incident-response.osop.yaml
osop_version: "1.0"
id: "incident-response"
name: "Production Incident Response"
description: "A structured incident-handling workflow that combines AI triage with human oversight."
nodes:
  - id: "alert"
    type: "event"
    name: "Alert Triggered"
    description: "A PagerDuty or Grafana alert fires."
  - id: "triage"
    type: "agent"
    subtype: "llm"
    name: "AI Triage and Diagnosis"
    description: "The AI analyzes logs, metrics, and recent deployment records."
    security:
      risk_level: "medium"
  - id: "investigate"
    type: "human"
    subtype: "input"
    name: "Engineer Investigation"
    description: "The on-call engineer verifies the AI's triage result."
  - id: "mitigate"
    type: "cli"
    subtype: "script"
    name: "Apply Mitigation"
    description: "Run a rollback, scale up, or apply a hotfix."
    security:
      risk_level: "high"
      approval_gate: true
  - id: "postmortem"
    type: "agent"
    subtype: "llm"
    name: "Generate Postmortem Report"
    description: "The AI writes the postmortem report from .osoplog data."
edges:
  - from: "alert"
    to: "triage"
    mode: "sequential"
  - from: "triage"
    to: "investigate"
    mode: "sequential"
  - from: "investigate"
    to: "mitigate"
    mode: "conditional"
    when: "investigation.confirmed == true"
  - from: "mitigate"
    to: "postmortem"
    mode: "sequential"
  - from: "triage"
    to: "investigate"
    mode: "timeout"
    timeout_sec: 300
    label: "Escalate if >5min"
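A definition like the one above can be sanity-checked before it is run. The sketch below is illustrative only: the helper names `dangling_edges` and `ungated_high_risk` are hypothetical, the node and edge data are copied by hand from the YAML rather than parsed, and the rule that every high-risk node should carry an approval gate is an assumption of this example, not a documented OSOP requirement.

```python
# Node and edge data mirrored by hand from the workflow definition.
NODES = [
    {"id": "alert", "type": "event"},
    {"id": "triage", "type": "agent", "security": {"risk_level": "medium"}},
    {"id": "investigate", "type": "human"},
    {"id": "mitigate", "type": "cli",
     "security": {"risk_level": "high", "approval_gate": True}},
    {"id": "postmortem", "type": "agent"},
]

EDGE_PAIRS = [
    ("alert", "triage"),
    ("triage", "investigate"),
    ("investigate", "mitigate"),
    ("mitigate", "postmortem"),
    ("triage", "investigate"),  # the timeout edge shares endpoints with the sequential one
]


def dangling_edges(nodes, edges):
    """Return edges whose source or target is not a declared node id."""
    known = {node["id"] for node in nodes}
    return [(src, dst) for src, dst in edges if src not in known or dst not in known]


def ungated_high_risk(nodes):
    """Return ids of nodes marked high risk that lack approval_gate: true.

    Treating this as an error is an assumption of the sketch.
    """
    flagged = []
    for node in nodes:
        security = node.get("security", {})
        if security.get("risk_level") == "high" and not security.get("approval_gate", False):
            flagged.append(node["id"])
    return flagged


problems = dangling_edges(NODES, EDGE_PAIRS) + ungated_high_risk(NODES)
```

For this workflow `problems` comes back empty: all five edges connect declared nodes, and the only high-risk node, `mitigate`, sets `approval_gate: true`.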