Multi-Agent Management Architecture
Version: 2.0 Date: 2026-03-06 Maintainer: Eason
Current state: Main (陈医生) is the only active agent. The life agent (张大师) has been removed. All agents are defined in agents.yaml.
For Main Agent (陈医生): You are the Hub Agent. This document is both an architecture reference and your operating manual. When the user asks you to create, maintain, troubleshoot, or remove an agent, jump to the matching Playbook section (11-14) and follow its steps.
1. Hub-and-Spoke Model
Main agent acts as the memory hub -- responsible for publishing shared knowledge, maintaining the project registry, and onboarding new agents. All other agents (local or remote) are spokes that consume shared memory and contribute their own private/project memories.
Main Agent (Hub) - defined in agents.yaml
|-- publish_knowledge() --> Qdrant mem0_v4_shared (visibility=public)
|-- publish_knowledge(project_id=X) --> (visibility=project)
|-- maintain project_registry.yaml
|-- maintain docs & best practices
|
+-- Local Spokes (same server, same Qdrant)
| |-- local-cli: main (openclaw gateway)
| |-- local-systemd: <agent_id> (port 187XX)
|
+-- Remote Spokes (Tailscale VPN -> Qdrant)
+-- remote-http: <agent_id> (health via HTTP)
2. Memory Visibility Model
All agents share one Qdrant collection: mem0_v4_shared.
Isolation is achieved through metadata fields.
| Visibility | Who can read | Metadata filter |
|---|---|---|
| public | All agents | visibility=public |
| project | Same project members | visibility=project, project_id=X |
| private | Only the writing agent | visibility=private, agent_id=X |
Project membership is defined in skills/mem0-integration/project_registry.yaml.
The main agent is registered as a member of all projects for audit access.
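The read rules in the table can be expressed as a small filter builder. This is an illustrative sketch only: `build_visibility_filter` and the flat clause dicts are hypothetical, and the real mem0/Qdrant filter syntax differs, but the OR-of-clauses logic matches the visibility model above.

```python
def build_visibility_filter(agent_id: str, project_ids: list) -> dict:
    """Build a match-any metadata filter describing what one agent may read.

    Mirrors the table: public memories for everyone, project memories for
    members of that project, private memories only for the writing agent.
    """
    clauses = [{"visibility": "public"}]
    for pid in project_ids:
        clauses.append({"visibility": "project", "project_id": pid})
    clauses.append({"visibility": "private", "agent_id": agent_id})
    return {"should": clauses}  # "should" = match any clause

# A spoke in the crypto project reads public + crypto-project + its own private.
print(build_visibility_filter("crypto", ["crypto"]))
```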
3. Agent Registry (agents.yaml)
Path: /root/.openclaw/workspace/agents.yaml
This file is the single source of truth for all agent definitions. All tooling reads from it dynamically:
| Consumer | Purpose |
|---|---|
| `deploy.sh` | Service management (start/stop/debug/fix) |
| `agent-monitor.js` | Health monitoring |
| `local_search.py` | Agent lookup for search |
| `memory_cleanup.py` | Agent-aware cleanup |
| `onboard.sh` / `offboard.sh` | Add/remove agents |
Helper script: scripts/parse_agents.py parses agents.yaml for bash/JS:
python3 scripts/parse_agents.py list # list agent IDs
python3 scripts/parse_agents.py info <id> # get agent info as KEY=VALUE (shell-safe quoted)
python3 scripts/parse_agents.py services # list all agents with service details (tab-separated)
python3 scripts/parse_agents.py ids # space-separated agent IDs (for bash loops)
Note: The `info` subcommand outputs single-quoted values (`KEY='value'`) that are safe for `eval` in bash, even when values contain spaces, CJK characters, or special shell metacharacters. The `services` subcommand uses tab (`\t`) as the delimiter to avoid collisions with `|` or spaces in command strings.
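The quoting guarantee can be reproduced with Python's standard `shlex.quote`, which is presumably what a helper like this relies on; the snippet below is illustrative, not the actual parse_agents.py implementation.

```python
import shlex

# Values with spaces, CJK text, and shell metacharacters stay intact after
# quoting; each line is then safe to feed to bash via: eval "$line"
for value in ["CryptoBot", "加密 分析师", "a;rm -rf $HOME|b"]:
    print(f"AGENT_NAME={shlex.quote(value)}")
```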
Agent types supported:
| Type | Description |
|---|---|
| `local-cli` | Managed via openclaw gateway CLI (main agent) |
| `local-systemd` | Managed via user-level systemd unit |
| `remote-http` | Remote agent checked via HTTP health endpoint |
4. Agent Lifecycle
4.1 Onboard (create)
cd /root/.openclaw/workspace/templates
./onboard.sh <agent_id> <agent_name> <project_id> [qdrant_host]
Fully automated. This script:
- Creates the workspace at `agents/<agent_id>-workspace/` (IDENTITY.md, SOUL.md, mem0 config)
- Registers the agent in `agents.yaml`
- Registers it in `project_registry.yaml`
- For local agents: generates the systemd service + env file, installs, and enables it
- Reloads `openclaw-agent-monitor` so it picks up the new agent
Examples:
./onboard.sh crypto "CryptoBot" crypto # local agent
./onboard.sh remote1 "RemoteBot" advert 100.115.94.1 # remote agent
Remaining manual steps (local-systemd): Edit IDENTITY.md, create ~/.openclaw-<agent_id>/openclaw.json, then start the service.
4.2 Offboard (retire)
cd /root/.openclaw/workspace/templates
./offboard.sh <agent_id> [--keep-data]
Options:
- (default) Full removal: stops the service, removes the agent from `agents.yaml` and `project_registry`, and deletes the workspace, profile, and Qdrant memories
- `--keep-data` Unregister only: keeps workspace and profile files
Examples:
./offboard.sh crypto # full removal
./offboard.sh crypto --keep-data # keep files, just unregister
The main (hub) agent cannot be offboarded.
5. Knowledge Publishing
Main agent can publish best practices and shared knowledge to Qdrant:
Via Python:
from mem0_client import mem0_client
await mem0_client.start()
await mem0_client.publish_knowledge(
content="Always use EnvironmentFile= in systemd services for upgrade safety",
category="knowledge",
visibility="public",
)
Via CLI:
python3 mem0_integration.py publish '{"content":"...", "visibility":"public"}'
Via Node.js plugin (index.js):
The publish action is available through the same spawn interface used by search and add.
Visibility Guidelines
| Content type | Visibility | Example |
|---|---|---|
| System best practices | public | "Use deploy.sh fix-service after upgrades" |
| Project-specific knowledge | project | "{agent_id} uses Google Calendar API" |
| User preferences | private | "User prefers dark mode" |
| API keys, secrets | NEVER store | Use environment variables |
6. Cold Start Preload
When a new session starts, session_init.py calls cold_start_search() which
retrieves memories in three phases:
- Phase 0 (public): Best practices, shared config -- available to all agents
- Phase 1 (project): Project-specific guidelines -- based on agent's project membership
- Phase 2 (private): Agent's own recent context
Results are deduplicated, ordered by phase priority, and injected into the System Prompt.
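The phase ordering and deduplication can be sketched as follows. The function name and memory shape are hypothetical; the real `cold_start_search()` in mem0_client.py may differ, but the merge rule (earlier phases win on duplicates) is the one described above.

```python
def merge_cold_start(public: list, project: list, private: list) -> list:
    """Merge three phase result lists, keeping phase-priority order.

    A memory id seen in an earlier phase suppresses later duplicates,
    so public best practices outrank project and private copies.
    """
    seen, ordered = set(), []
    for phase in (public, project, private):  # phase 0, 1, 2
        for mem in phase:
            if mem["id"] not in seen:
                seen.add(mem["id"])
                ordered.append(mem)
    return ordered
```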
7. Local Agent Configuration
Local agents run on the same server and connect to Qdrant at localhost:6333.
Key configuration points:
- `openclaw.json`: `collection_name: "mem0_v4_shared"` (NOT agent-specific collections)
- `systemd/<agent_id>-gateway.env`: contains `MEM0_DASHSCOPE_API_KEY`
- `EnvironmentFile=` in the service unit references the env file
8. Remote Agent Configuration
Remote agents run on different servers and connect to Qdrant via Tailscale.
Prerequisites
- Tailscale installed and joined to the same tailnet on both servers
- Qdrant accessible at the hub server's Tailscale IP (e.g., `100.115.94.1:6333`)
- Tailscale ACL allows the remote server to access port 6333
Environment File
MEM0_QDRANT_HOST=100.115.94.1
MEM0_DASHSCOPE_API_KEY=sk-...
OPENAI_API_BASE=https://dashscope.aliyuncs.com/compatible-mode/v1
OPENAI_BASE_URL=https://dashscope.aliyuncs.com/compatible-mode/v1
Onboarding
./onboard.sh remote1 "RemoteBot" advert 100.115.94.1
The 4th argument sets MEM0_QDRANT_HOST in the generated env file. The agent is automatically added to agents.yaml and the monitor picks it up on reload.
Monitoring
The monitor reads from agents.yaml dynamically. Remote agents (type remote-http) are checked via their health_url. Remote agents cannot be auto-started from the hub; the monitor will only alert on failure.
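A `remote-http` probe might look like the sketch below (illustrative; the actual agent-monitor.js is Node.js, and `check_remote` is a hypothetical name): poll the `health_url`, treat any 2xx as healthy, and report failures rather than attempting a restart.

```python
import urllib.request

def check_remote(health_url: str, timeout: float = 5.0) -> bool:
    """Poll a remote agent's health endpoint; remote agents are never
    auto-started from the hub, so False only triggers an alert."""
    try:
        with urllib.request.urlopen(health_url, timeout=timeout) as resp:
            return 200 <= resp.status < 300
    except OSError:
        return False  # connection refused, timeout, DNS failure, non-2xx
```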
9. Agent Monitor Service Hardening
The openclaw-agent-monitor.service runs as a system-level systemd service with the following security constraints:
| Directive | Value | Purpose |
|---|---|---|
| `ProtectSystem` | `strict` | Mounts entire filesystem read-only |
| `ProtectHome` | `read-only` | Home directory is read-only |
| `ReadWritePaths` | `/root/.openclaw/workspace/logs /run/user/0` | Whitelist for writes: log output + D-Bus for `systemctl --user` |
| `NoNewPrivileges` | `true` | Cannot gain new privileges |
| `MemoryMax` | `512M` | OOM guard |
| `CPUQuota` | `20%` | Prevent monitor from starving other processes |
Why /run/user/0? The monitor uses systemctl --user start/stop to manage gateway processes, which requires D-Bus access at the user runtime directory. Without this path whitelisted, ProtectSystem=strict would block the D-Bus socket and prevent auto-restart.
Initialization order in agent-monitor.js:
1. `loadConfig()` -- read `openclaw.json`
2. `ensureLogDir()` -- create the log directory (must happen before any `this.log()` calls)
3. `loadMonitoredServices()` -- parse `agents.yaml` (may log errors on failure)
4. Install signal handlers + start the monitoring loop
10. File Reference
| File | Purpose |
|---|---|
| `agents.yaml` | Single source of truth for agent registry |
| `scripts/parse_agents.py` | Parses agents.yaml for bash/JS consumers |
| `skills/mem0-integration/mem0_client.py` | Core client: search, write, publish, cold_start |
| `skills/mem0-integration/mem0_integration.py` | CLI interface: init, search, add, publish, cold_start |
| `skills/mem0-integration/session_init.py` | Three-phase cold start hook |
| `skills/mem0-integration/project_registry.yaml` | Agent-to-project membership |
| `templates/onboard.sh` | Automated agent onboarding (adds to agents.yaml, installs service, reloads monitor) |
| `templates/offboard.sh` | Clean one-command agent removal |
| `templates/agent-workspace/` | Workspace file templates |
| `templates/systemd/` | Service and env file templates |
| `agent-monitor.js` | Config-driven health monitor (reads agents.yaml) |
| `deploy.sh` | Service management (reads agents.yaml) |
| `docs/EXTENSIONS_ARCHITECTURE.md` | Systemd, monitor, upgrade safety |
| `docs/MEMORY_ARCHITECTURE.md` | Four-layer memory system detail |
PART B: Operational Playbooks (the Main Agent's operating manual)
Sections 11-14 below are step-by-step guides for the Main Agent (陈医生) to execute during conversations. When the user says "create a new agent for me", "check agent status", "clean up memories", or "remove an agent", execute the matching section. Each step is marked as either a question to ask the user (🗣️) or an action you perform yourself (🔧).
11. Playbook: Interactive Onboarding (Creating a New Agent)
When the user says "I want to create a new agent" or expresses similar intent, follow this flow.
11.1 Information Gathering (🗣️ ask the user in stages)
Collect the information in the order below. Ask only 1-2 questions per round; do not list everything at once.
Round 1: Basic identity
Collect:
1. agent_id — lowercase English identifier, no spaces (e.g. crypto, hr_bot, advert_pm)
2. agent_name — display name, Chinese allowed (e.g. "加密分析师", "HR助手")
Sample question: "What ID should the new agent use (lowercase English, e.g. crypto)? And what display name?"
Round 2: Role definition
Collect:
3. role — one-sentence role description (e.g. "cryptocurrency market analysis and investment strategy assistant")
4. scope — 2-5 responsibility items (e.g. "market monitoring, strategy analysis, risk alerts")
5. personality — personality / communication style (e.g. "professional and rigorous, data-driven, moderately humorous")
Sample question: "What is this agent's role? What is it responsible for? What communication style should it have?"
Round 3: Project membership
Collect:
6. project_id — owning project (existing: advert, global; or create a new one)
7. new_project — if it is a new project, its name and description
First show existing projects by reading skills/mem0-integration/project_registry.yaml
Sample question: "Which project does this agent belong to? Existing projects: advert (advertising), global. Do you need a new project?"
Round 4: Telegram Bot
Collect:
8. bot_token — Telegram Bot Token
If the user does not have a token yet, provide these instructions:
Steps to create a Telegram Bot:
1. Search for @BotFather in Telegram and send /newbot
2. Enter the bot's display name when prompted (e.g. CryptoBot)
3. Enter the bot's username (must end with Bot, e.g. openclaw_crypto_bot)
4. BotFather returns a token (format: 1234567890:ABCdef...)
5. Send that token to me
Round 5: Deployment type
Collect:
9. deploy_type — local (localhost) or remote (Tailscale IP)
10. qdrant_host — the Tailscale IP address, if remote
Sample question: "Will this agent run on this server or on a remote one? If remote, what is its Tailscale IP?"
11.2 Port Allocation
| Port | Use |
|---|---|
| 18789 | main agent (taken) |
| 18790 | 2nd local agent |
| 18791 | 3rd local agent |
| ... | increments from there |
🔧 Automatic allocation: count the agents already registered in agents.yaml; port = 18789 + count.
Remote agents do not need a port allocated on this server.
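The allocation rule can be sketched as a one-liner (a hypothetical helper; it assumes agents.yaml parses to a list of agent entries with main as the first, as parse_agents.py implies):

```python
def next_gateway_port(agents: list, base: int = 18789) -> int:
    """Port rule from 11.2: base + number of already-registered agents.

    With only main registered (count = 1) the next local agent gets 18790.
    Remote agents still occupy a registry slot but bind no local port.
    """
    return base + len(agents)
```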
11.3 Execution Phase (🔧 run in order)
Once all information is collected, execute the steps below, reporting progress to the user after each one.
Step 1: Run onboard.sh
cd /root/.openclaw/workspace/templates
# local agent:
./onboard.sh <agent_id> "<agent_name>" <project_id>
# remote agent:
./onboard.sh <agent_id> "<agent_name>" <project_id> <qdrant_host>
This automatically: creates the workspace, registers the agent in agents.yaml and project_registry, generates the systemd service/env files, and reloads the monitor.
Step 2: Fill in IDENTITY.md
Write agents/<agent_id>-workspace/IDENTITY.md:
# Agent Identity
- **Name**: <agent_name>
- **Agent ID**: <agent_id>
- **Role**: <role description provided by the user>
- **Project**: <project_id>
- **Created**: <today's date>
## Scope
<responsibility items provided by the user, one per line>
## Communication Style
<personality/communication style described by the user>
Step 3: Fill in SOUL.md
Write agents/<agent_id>-workspace/SOUL.md:
# <agent_name> - Core Personality
## Beliefs
<derive 2-3 core beliefs from the role the user described>
## Behavior Rules
- Follow shared best practices from public memory
- Respect memory visibility boundaries (public/project/private)
- Log important decisions to memory for team awareness
<add 2-3 role-specific behavior rules>
## Communication Style
<expand the user's described style into 2-3 concrete sentences>
Step 4: If it is a new project, register it in project_registry.yaml
If Round 3 produced a new project, edit skills/mem0-integration/project_registry.yaml:
<project_id>:
  name: "<project name>"
  description: "<project description>"
  members:
    - "<agent_id>"
    - "main"
  owner: "main"
Step 5: Create openclaw.json
This is the most critical step. Copy main's configuration and modify it:
cp /root/.openclaw/openclaw.json /root/.openclaw-<agent_id>/openclaw.json
Fields that must be changed (field mapping table):
| JSON path | main's value | change to (new agent) |
|---|---|---|
| `agents.list[0].id` | `"main"` | `"<agent_id>"` |
| `agents.defaults.workspace` | `"/root/.openclaw/workspace"` | `"/root/.openclaw/workspace/agents/<agent_id>-workspace"` |
| `channels.telegram.botToken` | `"7047245486:AAF..."` | `"<token provided by the user>"` |
| `gateway.port` | `18789` | `<allocated port>` |
| `gateway.controlUi.allowedOrigins[2]` | `"http://100.115.94.1:18789"` | `"http://100.115.94.1:<port>"` (must match this agent's `gateway.port`) |
| `gateway.controlUi.dangerouslyDisableDeviceAuth` | `true` | keep `true` (otherwise opening the Control UI from the Tailscale IP prompts "device identity required" and the browser device must be paired first) |
| `gateway.controlUi.allowInsecureAuth` | absent or `true` | recommend `true` (same as main; needed over non-localhost HTTP, where the browser cannot generate a device key — without it "device identity required" persists) |
| `plugins.entries.mem0-integration.config.agent_id` | `"main"` | `"<agent_id>"` |
⚠️ Control UI access: if `allowedOrigins[2]` is not changed to this agent's port, visiting http://100.115.94.1:<port>/ fails with "origin not allowed" and the pairing page will not open. When creating openclaw.json, always change `gateway.port` and `gateway.controlUi.allowedOrigins[2]` together.
Fields to keep unchanged (inherited from main):
- `models` — same model configuration
- `auth` — same authentication
- `memory` — qmd backend
- `skills` — inherit tavily, find-skills-robin, mem0-integration
- `plugins.load.paths` — keep as-is, or point to the agent's own skills path
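The field mapping can be applied mechanically. The sketch below is a hypothetical helper (`patch_profile` is not part of the toolchain) that assumes the JSON layout shown in the mapping table; it bundles the port and `allowedOrigins[2]` changes so they cannot drift apart.

```python
import json

def patch_profile(profile: dict, agent_id: str, port: int, bot_token: str) -> dict:
    """Apply the Step 5 field mapping to a copy of main's openclaw.json."""
    p = json.loads(json.dumps(profile))  # deep copy; leave main's file untouched
    p["agents"]["list"][0]["id"] = agent_id
    p["agents"]["defaults"]["workspace"] = (
        f"/root/.openclaw/workspace/agents/{agent_id}-workspace")
    p["channels"]["telegram"]["botToken"] = bot_token
    p["gateway"]["port"] = port
    # port and allowedOrigins[2] must change together ("origin not allowed")
    p["gateway"]["controlUi"]["allowedOrigins"][2] = f"http://100.115.94.1:{port}"
    p["plugins"]["entries"]["mem0-integration"]["config"]["agent_id"] = agent_id
    return p
```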
Step 6: Start the service
# local agent:
export XDG_RUNTIME_DIR=/run/user/$(id -u)
systemctl --user start openclaw-gateway-<agent_id>.service
# check status:
systemctl --user status openclaw-gateway-<agent_id>.service
Step 7: Verify
./deploy.sh health
11.4 Completion Checklist (🔧 confirm each item, then report to the user)
□ onboard.sh ran successfully
□ agents.yaml registered
□ project_registry.yaml registered (with main as a member)
□ IDENTITY.md filled in with role/scope
□ SOUL.md filled in with personality/behavior rules
□ openclaw.json created, with fields changed:
□ agents.list[0].id = <agent_id>
□ agents.defaults.workspace points to the agent workspace
□ channels.telegram.botToken uses the new token
□ gateway.port does not conflict with other agents
□ gateway.controlUi.allowedOrigins[2] = "http://100.115.94.1:<this agent's port>" (otherwise the Control UI reports "origin not allowed")
□ gateway.controlUi.dangerouslyDisableDeviceAuth = true (otherwise "device identity required")
□ gateway.controlUi.allowInsecureAuth = true (recommended when opening the UI over HTTP from a Tailscale/LAN IP)
□ plugins.entries.mem0-integration.config.agent_id is correct
□ systemd service started
□ deploy.sh health passes everywhere
□ Telegram bot pairing completed (confirmed by the user)
11.5 Telegram Pairing (must be completed by the user)
When a new agent uses dmPolicy: pairing, the pairing must be completed by the user in Telegram and the Control UI; the main agent cannot do it on the user's behalf.
Standard steps (give these to the user):
1. In Telegram, find the agent's bot (e.g. @xxx_bot) and send /start
2. Open the agent's Control UI: http://100.115.94.1:<port>/ (the port is this agent's gateway.port)
3. If "origin not allowed" appears: the agent's openclaw.json has gateway.controlUi.allowedOrigins[2] not set to http://100.115.94.1:<port>; the main agent must fix it and restart the agent's service
4. If "device identity required" appears: when accessed via http://100.115.94.1:<port>, the browser is not in a secure context (only HTTPS or localhost qualify) and cannot generate a device key. Two options:
   - Option A (recommended): open the Control UI with a tokenized URL: http://100.115.94.1:<port>/?token=<this agent's gateway.auth.token>, which bypasses the device check.
   - Option B: in the agent's openclaw.json, under gateway.controlUi set both "dangerouslyDisableDeviceAuth": true and "allowInsecureAuth": true, save, and restart the agent's service (trusted internal networks only).
5. In the Control UI, open the Devices/pairing page and approve the Telegram user's pairing request (this pairs the bot with the user and is unrelated to the browser device in the previous step)
6. Once pairing completes, the bot replies to DMs normally
What the main agent can do: provide the steps above, give the Control UI address, and when an origin error occurs, fix the configuration and ask the user to retry.
11.6 Example Conversation
User: Help me create a new cryptocurrency analysis agent
Eason: Sure. Let's settle the basics first:
1. What agent ID? Lowercase English recommended, e.g. crypto
2. What display name?
User: ID crypto, display name "加密分析师"
Eason: Got it. What exactly is this agent's role, and what is it responsible for?
For communication style, do you prefer professional and rigorous, or relaxed and playful?
User: Market monitoring, strategy analysis, and risk alerts. Professional but concise.
Eason: Understood. For project membership, the existing projects are:
- advert (advertising)
- global (shared)
Should I create a new crypto project?
User: Create a new crypto project
Eason: OK. Next we need a dedicated Telegram Bot Token.
If you don't have one yet:
1. Search @BotFather in Telegram → send /newbot
2. Set the name and username
3. Send me the returned token
User: 1234567890:ABCdef...
Eason: Last thing to confirm: deploy on this server or remotely?
User: This server
Eason: OK, starting deployment...
[Execute Steps 1-7]
Deployment complete! Checklist fully passed.
Please send your bot a message on Telegram to test it.
12. Playbook: Agent Maintenance & Troubleshooting
12.1 Routine Health Check (🔧)
When the user asks "how are the agents doing", or during a Heartbeat:
./deploy.sh health # global health check
./deploy.sh status # detailed service status
Interpret the results and report to the user:
- ✓ healthy
- ✗ service not running → try a restart
- ⚠ resource warning → report the specific numbers
12.2 Troubleshooting an Unresponsive Agent
Step 1: Check whether the service is running
systemctl --user status openclaw-gateway-<agent_id>.service
Step 2: If inactive → check the logs
journalctl --user -u openclaw-gateway-<agent_id> -n 50 --no-pager
Step 3: Common problems and fixes:
- "Address already in use" → port conflict; check gateway.port in openclaw.json
- "Cannot find module" → openclaw version issue; run ./deploy.sh fix-service
- "ECONNREFUSED" → Qdrant not running; check docker ps | grep qdrant
- "API key invalid" → check the API key in systemd/<agent_id>-gateway.env
- **"origin not allowed" (Control UI will not open)** → the agent's openclaw.json must have gateway.controlUi.allowedOrigins[2] set to "http://100.115.94.1:<this agent's port>"; after editing, run systemctl --user restart openclaw-gateway-<agent_id>.service
- **"device identity required" (Control UI demands device pairing)** → over non-localhost HTTP the browser cannot generate a device key. Fix: ① use a tokenized URL: `http://100.115.94.1:<port>/?token=<gateway.auth.token>`; or ② in the agent's openclaw.json, under gateway.controlUi set both `"dangerouslyDisableDeviceAuth": true` and `"allowInsecureAuth": true`, save, and restart the agent's service (trusted internal networks only).
Step 4: Restart
systemctl --user restart openclaw-gateway-<agent_id>.service
Step 5: Still failing → collect logs for the user
journalctl --user -u openclaw-gateway-<agent_id> -n 200 --no-pager > /tmp/agent-debug.log
12.3 Recovery After an OpenClaw Upgrade
When the user upgrades OpenClaw through the UI, custom configuration may be lost:
./deploy.sh fix-service # re-inject EnvironmentFile into the systemd services
./deploy.sh restart # restart all services to apply the config
./deploy.sh health # confirm everything is back to normal
Report the repair result to the user.
12.4 Listing Agents
python3 scripts/parse_agents.py list
Output format: <id>\t<type>\t<name>; format it as a table when presenting to the user.
12.5 Debug Mode
When the user needs to debug a specific agent:
./deploy.sh debug-stop # stop all services (incl. the monitor, preventing auto-restart)
# ... user debugs ...
./deploy.sh debug-start # restore all services
13. Playbook: Memory Management
13.1 Publishing Shared Knowledge (🔧)
When the user says "share this best practice with all agents":
python3 skills/mem0-integration/mem0_integration.py publish \
'{"content":"<knowledge>", "visibility":"public", "category":"knowledge"}'
When the user says "share this information with a project":
python3 skills/mem0-integration/mem0_integration.py publish \
'{"content":"<content>", "visibility":"project", "project_id":"<project>", "category":"knowledge"}'
13.2 Memory Statistics
python3 skills/mem0-integration/memory_cleanup.py --dry-run
Report memory counts per agent, per type, and per visibility to the user.
13.3 Cleaning Up Expired Memories
# dry-run first to review:
python3 skills/mem0-integration/memory_cleanup.py --dry-run --max-age-days 90
# execute after confirmation:
python3 skills/mem0-integration/memory_cleanup.py --max-age-days 90
13.4 Preloading Knowledge for a New Agent (Cold Start)
After a new agent is created, public knowledge can be preloaded for it:
python3 skills/mem0-integration/mem0_integration.py cold_start \
'{"agent_id":"<agent_id>", "user_id":"wang_yuanzhang", "top_k":10}'
13.5 Checking Memory Visibility
When the user asks "can agent X see this memory":
1. Determine the memory's visibility and project_id
2. Read project_registry.yaml to confirm whether the agent is in that project's members list
3. Visibility rules:
   - public → visible to all agents
   - project → visible only to project members
   - private → visible only to the writing agent
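The check can be expressed directly. This is a sketch over the registry's membership map; `can_read` and the memory/registry shapes are illustrative, not the actual mem0 API.

```python
def can_read(memory: dict, agent_id: str, project_members: dict) -> bool:
    """Apply the visibility rules from 13.5 to one memory record.

    project_members maps project_id -> list of member agent ids,
    as loaded from project_registry.yaml.
    """
    vis = memory.get("visibility", "private")
    if vis == "public":
        return True
    if vis == "project":
        return agent_id in project_members.get(memory.get("project_id"), [])
    return memory.get("agent_id") == agent_id  # private: writer only
```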
14. Playbook: Interactive Offboarding (Removing an Agent)
14.1 Information Gathering (🗣️)
Collect:
1. agent_id — the agent to remove
2. keep_data — whether to keep its data (workspace, profile, Qdrant memories)
Sample question: "Which agent should be removed? Do you want to keep its data? (Kept data can be restored later.)"
🔧 First show the current agent list:
python3 scripts/parse_agents.py list
14.2 Safety Checks (🔧)
□ Confirm it is not the main agent (main cannot be removed)
□ Confirm the agent exists in agents.yaml
□ Confirm once more with the user: "Are you sure you want to remove <agent_name> (<agent_id>)? This will stop the service and delete it from the registry."
14.3 Execution (🔧)
cd /root/.openclaw/workspace/templates
# full removal (including data):
./offboard.sh <agent_id>
# unregister only (keep data):
./offboard.sh <agent_id> --keep-data
The script asks for interactive confirmation (y/N); enter y to proceed.
14.4 Post-Removal Report
Report to the user:
Agent <agent_name> (<agent_id>) has been removed:
- Service: stopped and uninstalled
- agents.yaml: entry removed
- project_registry: entry removed
- Workspace: <deleted / kept>
- Qdrant memories: <deleted / kept>
- Monitor: reloaded
Run ./deploy.sh health to confirm the system is healthy.
15. Playbook: Backup & Cleanup
15.1 Backup Commands
| Command | Description |
|---|---|
| `./deploy.sh backup` | Full backup (workspace + Qdrant snapshot + agent profiles + docker-compose) |
| `./deploy.sh backup quick` | Quick backup (workspace files only, no Qdrant) |
| `bash scripts/10-create-backup.sh` | Standalone backup script (includes mem0 config + agents.yaml + Qdrant snapshot) |
Backup retention: the 10 most recent backups are kept automatically; older ones are deleted.
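The keep-the-latest-10 rule can be sketched as follows (a hypothetical helper; it assumes backup directories are named by sortable timestamps, as the layout below suggests):

```python
def prune_backups(timestamps: list, keep: int = 10) -> list:
    """Return the backup directory names to delete, keeping the newest `keep`.

    Relies on timestamp-style names sorting chronologically.
    """
    return sorted(timestamps)[:-keep] if len(timestamps) > keep else []
```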
Backup directory layout:
/root/.openclaw/backups/<TIMESTAMP>/
├── workspace.tar.gz # all Layer 1+2 MD and config files
├── .openclaw__openclaw.json # main agent profile
├── .openclaw-tongge__openclaw.json # secondary agent profiles (if any)
├── docker-compose.yml # Qdrant docker configuration
├── qdrant-mem0_v4_shared.snapshot # Layer 4 vector data (full mode)
├── qdrant-point-count.txt # point count at backup time (for verification)
└── manifest.txt # backup manifest
15.2 Restore Commands
| Command | Description |
|---|---|
| `./deploy.sh restore <backup-dir>` | Restore workspace files + agent profiles |
| `./deploy.sh restore-qdrant <snapshot-file>` | Restore Qdrant vector data |
A quick backup is created automatically before any restore, and interactive confirmation (y/N) is required.
15.3 Memory Cleanup
Cleanup script: skills/mem0-integration/memory_cleanup.py
| Command | Description |
|---|---|
| `python3 memory_cleanup.py --dry-run` | Count memories per dimension + list expired counts (deletes nothing) |
| `python3 memory_cleanup.py --execute --max-age-days 90` | Actually delete expired memories |
Retention policy (aligned with EXPIRATION_MAP in mem0_client.py):
- `session`: expires after 7 days
- `chat_summary`: expires after 30 days
- `preference`: kept forever
- `knowledge`: kept forever
--max-age-days acts as a hard ceiling: session/chat_summary entries older than that many days are deleted regardless of their expiration_date. preference and knowledge are never auto-cleaned.
Audit log: each cleanup run writes logs/security/memory-cleanup-<date>.log.
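The deletion decision can be sketched as below (illustrative; `should_delete` is not the actual memory_cleanup.py function, but the lifetimes match the stated EXPIRATION_MAP policy):

```python
# None = kept forever, per the retention policy above
EXPIRATION_DAYS = {"session": 7, "chat_summary": 30,
                   "preference": None, "knowledge": None}

def should_delete(category: str, age_days: int, max_age_days: int = 90) -> bool:
    """--max-age-days is a hard ceiling for expiring categories only."""
    ttl = EXPIRATION_DAYS.get(category)
    if ttl is None:
        return False  # preference/knowledge are never auto-cleaned
    return age_days > ttl or age_days > max_age_days
```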
15.4 Cron Automation
Install script: scripts/setup-cron.sh
./scripts/setup-cron.sh # install the scheduled jobs
./scripts/setup-cron.sh remove # remove the scheduled jobs
./scripts/setup-cron.sh status # show current jobs
Schedule:
| Time | Job |
|---|---|
| Daily 02:00 | `./deploy.sh backup` — full backup |
| Sunday 03:00 | `memory_cleanup.py --execute --max-age-days 90` — clean expired memories |
Logs go to logs/system/cron-backup.log and logs/system/cron-cleanup.log.
15.5 Interactive Backup/Restore Flow (🗣️)
Conversation flow when the user asks for a backup or restore:
Backup:
陈医生: "What kind of backup do you need?"
1. Full backup (includes Qdrant vector data, recommended)
2. Quick backup (workspace files only)
→ Run the matching command; report the backup path and Qdrant point count
→ Advise: always take a full backup before major changes
Restore:
陈医生: "Which backup should I restore?"
→ List the available backups under /root/.openclaw/backups/
→ Show the manifest.txt contents for the user to confirm
→ Restore the workspace first: ./deploy.sh restore <dir>
→ If a Qdrant snapshot exists and the user confirms: ./deploy.sh restore-qdrant <file>
→ After restoring, run ./deploy.sh restart + ./deploy.sh health
→ Compare qdrant-point-count.txt with the current point count
16. Playbook: Server Migration
16.1 Pre-Migration Preparation (🗣️)
Information to collect:
Confirm:
1. target_server — target server address (IP or Tailscale hostname)
2. target_user — target server username (usually root)
3. keep_source — whether to keep the source server's data after migration
4. tailscale — whether the target server has joined the Tailscale network
Sample question: "Which server are we migrating to? Is Tailscale already installed there? Should the source server's data be kept after migration?"
16.2 Source Server: Full Backup (🔧)
cd /root/.openclaw/workspace
./deploy.sh backup
Verify backup integrity:
ls -la /root/.openclaw/backups/<TIMESTAMP>/
cat /root/.openclaw/backups/<TIMESTAMP>/manifest.txt
cat /root/.openclaw/backups/<TIMESTAMP>/qdrant-point-count.txt
16.3 Transfer to the Target Server (🔧)
BACKUP_DIR="/root/.openclaw/backups/<TIMESTAMP>"
TARGET="root@<target_server>"
rsync -avzP "$BACKUP_DIR" "$TARGET:/root/.openclaw/backups/"
rsync -avzP /root/.openclaw/workspace/ "$TARGET:/root/.openclaw/workspace/" --exclude='.git' --exclude='logs'
rsync -avzP /root/.openclaw/openclaw.json "$TARGET:/root/.openclaw/"
Secondary agent profiles (if any):
for d in /root/.openclaw-*/; do
  agent_name=$(basename "$d")
  rsync -avzP "$d" "$TARGET:/root/$agent_name/"
done
16.4 Target Server: Install Infrastructure (🔧)
# 1. Install Node.js (v24+) and OpenClaw
curl -fsSL https://get.openclaw.com | bash
# 2. Install Docker + Qdrant
mkdir -p /opt/mem0-center && cd /opt/mem0-center
# restore docker-compose.yml from the backup
cp /root/.openclaw/backups/<TIMESTAMP>/docker-compose.yml .
docker compose up -d
# 3. Wait for Qdrant to start
sleep 5
curl -sf http://localhost:6333/collections | python3 -c "import sys,json; print(json.dumps(json.load(sys.stdin),indent=2))"
# 4. Restore the Qdrant data
cd /root/.openclaw/workspace
./deploy.sh restore-qdrant /root/.openclaw/backups/<TIMESTAMP>/qdrant-mem0_v4_shared.snapshot
# 5. Install Python dependencies
pip3 install qdrant-client mem0ai pyyaml
# 6. Install the system services
./deploy.sh install
16.5 Verification (🔧)
# service status
./deploy.sh health
# Qdrant data comparison
curl -sf http://localhost:6333/collections/mem0_v4_shared | python3 -c "import sys,json; d=json.load(sys.stdin); print(f'Points: {d[\"result\"][\"points_count\"]}')"
# compare with qdrant-point-count.txt from the source server
# memory retrieval test
cd /root/.openclaw/workspace/skills/mem0-integration
python3 mem0_integration.py search "测试查询" --agent-id main
# Telegram connectivity
# send the bot a test message on Telegram
16.6 Post-Migration Checklist
□ All agent services running (deploy.sh health all green)
□ Qdrant point count matches the source server
□ Memory search returns results normally
□ Telegram bot replies normally
□ Cron jobs installed (scripts/setup-cron.sh install)
□ Environment variables set (MEM0_DASHSCOPE_API_KEY etc.)
□ Monitor service running (systemctl status openclaw-agent-monitor)
□ Tailscale joined (if remote agents need to connect)
□ Source server data handled (kept/cleaned)
16.7 Rollback Plan
If the migration fails:
1. On the source server, run ./deploy.sh debug-start to restore services
2. On the target server, run ./deploy.sh debug-stop to stop all services
3. Investigate the problem, then retry
17. Skill/Plugin Management SOP
17.1 Choosing Between Skill and Plugin
OpenClaw has two extension loading mechanisms; choose as follows:
| Type | Loading | Config location | Use cases |
|---|---|---|---|
| Built-in Skill | auto-discovered by OpenClaw | `skills.entries.<id>` | built-in Clawhub marketplace skills (e.g. find-skills-robin) |
| Custom Plugin | path specified manually | `plugins.load.paths` + `plugins.entries.<id>` | in-house tools (tavily), lifecycle hooks (mem0), any extension that needs custom code |
Decision rules:
- If you only need to toggle a built-in Clawhub feature -> `skills.entries`
- If you ship your own `openclaw.plugin.json` + `index.js` -> `plugins`
- If you need a lifecycle hook (automatic execution before/after conversations) -> must be `plugins`
- Never enable the same skill in both `skills.entries` and `plugins.entries`
Required plugin files:
/root/.openclaw/workspace/skills/<id>/
├── openclaw.plugin.json # plugin manifest (required)
├── index.js # tool/hook implementation (required)
├── CONFIG_SUMMARY.md # configuration docs (recommended)
└── TEST_REPORT.md # test report (recommended)
17.2 Staged Release Workflow
Every new skill must pass validation on the main agent before being deployed to secondary agents.
Stage 1 -- Install the code
1. Put the skill code under /root/.openclaw/workspace/skills/<id>/
2. Ensure openclaw.plugin.json exists (with id, name, kind, main, tools/configSchema)
3. Ensure index.js exists (exporting register/activate and the tool definitions)
Stage 2 -- Enable and test on main
1. In main's openclaw.json:
   - add "/root/.openclaw/workspace/skills/<id>" to plugins.load.paths
   - set plugins.entries.<id> to { "enabled": true } (include config if the plugin has one)
2. Restart main's gateway: systemctl --user restart openclaw-gateway.service
3. Check the logs to confirm the plugin loaded: journalctl --user -u openclaw-gateway -n 50 | grep -i <id>
4. Message main via Telegram to test the feature
Stage 3 -- Review
Complete a review per templates/SKILL_REVIEW_TEMPLATE.md, covering:
| Dimension | What to check |
|---|---|
| Security | API key handling (env vars vs hardcoding), network request scope, file reads/writes, privilege escalation |
| Functionality | can the agent call it correctly; are results accurate; is error handling reasonable |
| Performance | response time, concurrent calls, impact on overall agent latency |
| Best practices | recommended parameters, applicable scenarios, known limits; record them in CONFIG_SUMMARY.md |
Stage 4 -- Roll Out to Secondary Agents
1. The skill code already lives in the shared workspace; no copying needed
2. In the secondary agent's openclaw.json:
   - add the same path to plugins.load.paths
   - enable plugins.entries.<id> (mind agent-specific config, e.g. mem0's agent_id must be changed to that agent's ID)
3. Restart the secondary agent's gateway
4. Verify the plugin loads and works
17.3 Current Skill Inventory
| Skill ID | Type | Loading | Main | Tongge | Notes |
|---|---|---|---|---|---|
| `find-skills-robin` | built-in | `skills.entries` | enabled | enabled | Clawhub skill discovery |
| `mem0-integration` | lifecycle | `skills.entries` + `plugins` | enabled | enabled | memory system (agent_id must differ) |
| `tavily` | tool | `plugins` | enabled | enabled | AI search (shared API key) |
| `active-learning` | built-in | `skills.entries` | -- | enabled | active learning (tongge only) |
| `memos-cloud-openclaw-plugin` | built-in | `plugins.entries` | enabled | enabled | Memos cloud plugin |
| `qwen-portal-auth` | built-in | `plugins.entries` | enabled | enabled | Qwen Portal OAuth |
Maintenance rule: update this table whenever a skill is added or removed.
17.4 Agent-Specific Configuration Notes
Some plugins need different configuration per agent:
| Plugin | Per-agent key | Main | Tongge |
|---|---|---|---|
| `mem0-integration` | `config.agent_id` | `"main"` | `"tongge"` |
| `mem0-integration` | `config.user_id` | `"wang院长"` | `"wang院长"` |
Always check these keys when deploying to a new agent.
Changelog
| Version | Date | Changes |
|---|---|---|
| 1.0 | 2026-03-06 | Initial version: hub-and-spoke model, templates, remote support |
| 1.1 | 2026-03-06 | Config-driven architecture: agents.yaml as single registry; automated onboard/offboard; parse_agents.py helper; life agent (张大师) removed; main is only active agent |
| 1.2 | 2026-03-06 | Code review + bug fixes (7 items): parse_agents.py output now shell-safe quoted; agent-monitor.js constructor ordering fixed (ensureLogDir before loadMonitoredServices) and fallback uses full openclaw path; deploy.sh switched grep -qP to grep -qE for portability; offboard.sh Qdrant delete uses FilterSelector wrapper; onboard.sh/offboard.sh inline Python rewritten with sys.argv to prevent shell injection; openclaw-agent-monitor.service added /run/user/0 to ReadWritePaths for D-Bus access; removed corrupted trailing bytes in offboard.sh |
| 2.0 | 2026-03-06 | Added operational playbooks (Part B): Interactive Onboarding (Sec 11, with conversation flow, field mapping table, port allocation, checklist, dialog example), Agent Maintenance & Troubleshooting (Sec 12), Memory Management (Sec 13), Interactive Offboarding (Sec 14). Document restructured into Part A (Architecture Reference) and Part B (Operational Playbooks). |
| 2.1 | 2026-03-06 | Added Backup & Cleanup Playbook (Sec 15): backup/restore commands, memory cleanup with retention policy, cron automation, interactive dialogue flow. Added Server Migration Playbook (Sec 16): step-by-step migration with pre/post checklist, Qdrant snapshot recovery, rollback plan. |
| 2.2 | 2026-03-09 | Added Skill/Plugin Management SOP (Sec 17): skill vs plugin selection guide, staged release workflow (main-first), current skill inventory, agent-specific config notes. Unified tavily loading to plugin mode across all agents. |