Multi-Agent Management Architecture

Version: 2.2 Date: 2026-03-09 Maintainer: Eason

Current state: Main (陈医生) is the only active agent. The life agent (张大师) has been removed. All agents are defined in agents.yaml.

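For Main Agent (陈医生): You are the Hub Agent. This document is both an architecture reference and your operations manual. When the user asks you to create, maintain, troubleshoot, or remove an agent, jump to the corresponding Playbook section (11-14) and follow its steps.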


1. Hub-and-Spoke Model

Main agent acts as the memory hub -- responsible for publishing shared knowledge, maintaining the project registry, and onboarding new agents. All other agents (local or remote) are spokes that consume shared memory and contribute their own private/project memories.

Main Agent (Hub) - defined in agents.yaml
  |-- publish_knowledge() --> Qdrant mem0_v4_shared (visibility=public)
  |-- publish_knowledge(project_id=X) --> (visibility=project)
  |-- maintain project_registry.yaml
  |-- maintain docs & best practices
  |
  +-- Local Spokes (same server, same Qdrant)
  |     |-- local-cli: main (openclaw gateway)
  |     |-- local-systemd: <agent_id> (port 187XX)
  |
  +-- Remote Spokes (Tailscale VPN -> Qdrant)
        +-- remote-http: <agent_id> (health via HTTP)

2. Memory Visibility Model

All agents share one Qdrant collection: mem0_v4_shared. Isolation is achieved through metadata fields.

| Visibility | Who can read | Metadata filter |
|---|---|---|
| public | All agents | visibility=public |
| project | Same project members | visibility=project, project_id=X |
| private | Only the writing agent | visibility=private, agent_id=X |

Project membership is defined in skills/mem0-integration/project_registry.yaml. Main agent is registered as member of all projects for audit access.
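
The visibility rules above can be read as a set of OR-ed metadata clauses. The following is a minimal sketch in plain Python, not the actual mem0_client.py logic: the function names `read_scope_clauses` and `matches` are hypothetical, and the real client presumably expresses the same conditions as a Qdrant filter.

```python
from typing import Any


def read_scope_clauses(agent_id: str, project_ids: list[str]) -> list[dict[str, Any]]:
    """Return the OR-ed metadata clauses describing what an agent may read.

    Hypothetical helper: one clause per row of the visibility table.
    """
    clauses: list[dict[str, Any]] = [{"visibility": "public"}]
    for project_id in sorted(project_ids):
        clauses.append({"visibility": "project", "project_id": project_id})
    clauses.append({"visibility": "private", "agent_id": agent_id})
    return clauses


def matches(metadata: dict[str, Any], clauses: list[dict[str, Any]]) -> bool:
    """A memory is readable if every field of at least one clause matches."""
    return any(
        all(metadata.get(key) == value for key, value in clause.items())
        for clause in clauses
    )
```

For example, an agent `crypto` that belongs only to project `crypto` would match public memories, `crypto` project memories, and its own private memories, but not another agent's private memories.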


3. Agent Registry (agents.yaml)

Path: /root/.openclaw/workspace/agents.yaml

This file is the single source of truth for all agent definitions. All tooling reads from it dynamically:

| Consumer | Purpose |
|---|---|
| deploy.sh | Service management (start/stop/debug/fix) |
| agent-monitor.js | Health monitoring |
| local_search.py | Agent lookup for search |
| memory_cleanup.py | Agent-aware cleanup |
| onboard.sh / offboard.sh | Add/remove agents |

Helper script: scripts/parse_agents.py parses agents.yaml for bash/JS:

python3 scripts/parse_agents.py list          # list agent IDs
python3 scripts/parse_agents.py info <id>     # get agent info as KEY=VALUE (shell-safe quoted)
python3 scripts/parse_agents.py services      # list all agents with service details (tab-separated)
python3 scripts/parse_agents.py ids           # space-separated agent IDs (for bash loops)

Note: The info subcommand outputs single-quoted values (KEY='value') that are safe for eval in bash, even when values contain spaces, CJK characters, or special shell metacharacters. The services subcommand uses tab (\t) as the delimiter to avoid collisions with | or spaces in command strings.
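
To illustrate why single-quoted output is safe to eval, here is a minimal sketch of that style of quoting. It is not the code in scripts/parse_agents.py; `shell_quote`, `format_info`, and `roundtrips` are hypothetical names, and the field names in the usage note are made up for the example.

```python
import shlex


def shell_quote(value: str) -> str:
    """Wrap a value in single quotes, escaping any embedded single quote."""
    return "'" + value.replace("'", "'\"'\"'") + "'"


def format_info(fields: dict[str, str]) -> str:
    """Render fields as KEY='value' lines, one per line."""
    return "\n".join(f"{key}={shell_quote(str(val))}" for key, val in fields.items())


def roundtrips(value: str) -> bool:
    """Check that a POSIX shell tokenizer reads the quoted value back unchanged."""
    return shlex.split(shell_quote(value)) == [value]
```

Inside single quotes the shell performs no expansion, so a value such as `$(rm -rf /)` or a name containing spaces or CJK characters is passed through literally. `format_info({"AGENT_NAME": "加密 分析师"})` yields `AGENT_NAME='加密 分析师'`.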

Agent types supported:

| Type | Description |
|---|---|
| local-cli | Managed via openclaw gateway CLI (main agent) |
| local-systemd | Managed via user-level systemd unit |
| remote-http | Remote agent checked via HTTP health endpoint |

4. Agent Lifecycle

4.1 Onboard (create)

cd /root/.openclaw/workspace/templates
./onboard.sh <agent_id> <agent_name> <project_id> [qdrant_host]

Fully automated. This script:

  1. Creates workspace at agents/<agent_id>-workspace/ (IDENTITY.md, SOUL.md, mem0 config)
  2. Registers the agent in agents.yaml
  3. Registers in project_registry.yaml
  4. For local agents: generates systemd service + env file, installs, enables
  5. Reloads openclaw-agent-monitor so it picks up the new agent

Examples:

./onboard.sh crypto "CryptoBot" crypto              # local agent
./onboard.sh remote1 "RemoteBot" advert 100.115.94.1 # remote agent

Remaining manual steps (local-systemd): Edit IDENTITY.md, create ~/.openclaw-<agent_id>/openclaw.json, then start the service.

4.2 Offboard (retire)

cd /root/.openclaw/workspace/templates
./offboard.sh <agent_id> [--keep-data]

Options:

  • (default) Full removal: stops service, removes from agents.yaml and project_registry, deletes workspace, profile, and Qdrant memories
  • --keep-data Unregister only: keeps workspace and profile files

Examples:

./offboard.sh crypto              # full removal
./offboard.sh crypto --keep-data  # keep files, just unregister

The main (hub) agent cannot be offboarded.


5. Knowledge Publishing

Main agent can publish best practices and shared knowledge to Qdrant:

Via Python:

import asyncio
from mem0_client import mem0_client

async def main():
    # start() must complete before publishing
    await mem0_client.start()
    await mem0_client.publish_knowledge(
        content="Always use EnvironmentFile= in systemd services for upgrade safety",
        category="knowledge",
        visibility="public",
    )

asyncio.run(main())

Via CLI:

python3 mem0_integration.py publish '{"content":"...", "visibility":"public"}'

Via Node.js plugin (index.js):

The publish action is available through the same spawn interface used by search and add.

Visibility Guidelines

| Content type | Visibility | Example |
|---|---|---|
| System best practices | public | "Use deploy.sh fix-service after upgrades" |
| Project-specific knowledge | project | "{agent_id} uses Google Calendar API" |
| User preferences | private | "User prefers dark mode" |
| API keys, secrets | NEVER store | Use environment variables |

6. Cold Start Preload

When a new session starts, session_init.py calls cold_start_search() which retrieves memories in three phases:

  1. Phase 0 (public): Best practices, shared config -- available to all agents
  2. Phase 1 (project): Project-specific guidelines -- based on agent's project membership
  3. Phase 2 (private): Agent's own recent context

Results are deduplicated, ordered by phase priority, and injected into the System Prompt.
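
The merge step can be sketched as follows. This is an illustration only, not the code in session_init.py: the function name `merge_cold_start`, the result shape (dicts with `id` and `memory` keys), and the choice of deduplication key are all assumptions.

```python
def merge_cold_start(phases: list[list[dict]], top_k: int = 10) -> list[dict]:
    """Merge per-phase results in phase order, dropping duplicates.

    `phases[0]` holds public results, `phases[1]` project results,
    `phases[2]` private results. An item seen in an earlier phase wins.
    """
    seen = set()
    merged = []
    for phase_index, results in enumerate(phases):
        for item in results:
            # Assumed dedup key: the memory id, falling back to its text.
            key = item.get("id") or item["memory"]
            if key in seen:
                continue
            seen.add(key)
            merged.append({**item, "phase": phase_index})
    return merged[:top_k]
```

Because earlier phases are visited first, a memory that appears in both the public and project results keeps its public (phase 0) position.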


7. Local Agent Configuration

Local agents run on the same server and connect to Qdrant at localhost:6333.

Key configuration points:

  • openclaw.json: collection_name: "mem0_v4_shared" (NOT agent-specific collections)
  • systemd/<agent_id>-gateway.env: contains MEM0_DASHSCOPE_API_KEY
  • EnvironmentFile= in the service unit references the env file

8. Remote Agent Configuration

Remote agents run on different servers and connect to Qdrant via Tailscale.

Prerequisites

  1. Tailscale installed and joined to the same tailnet on both servers
  2. Qdrant accessible at the hub server's Tailscale IP (e.g., 100.115.94.1:6333)
  3. Tailscale ACL allows the remote server to access port 6333

Environment File

MEM0_QDRANT_HOST=100.115.94.1
MEM0_DASHSCOPE_API_KEY=sk-...
OPENAI_API_BASE=https://dashscope.aliyuncs.com/compatible-mode/v1
OPENAI_BASE_URL=https://dashscope.aliyuncs.com/compatible-mode/v1

Onboarding

./onboard.sh remote1 "RemoteBot" advert 100.115.94.1

The 4th argument sets MEM0_QDRANT_HOST in the generated env file. The agent is automatically added to agents.yaml and the monitor picks it up on reload.

Monitoring

The monitor reads from agents.yaml dynamically. Remote agents (type remote-http) are checked via their health_url. Remote agents cannot be auto-started from the hub; the monitor will only alert on failure.


9. Agent Monitor Service Hardening

The openclaw-agent-monitor.service runs as a system-level systemd service with the following security constraints:

| Directive | Value | Purpose |
|---|---|---|
| ProtectSystem | strict | Mounts entire filesystem read-only |
| ProtectHome | read-only | Home directory is read-only |
| ReadWritePaths | /root/.openclaw/workspace/logs /run/user/0 | Whitelist for writes: log output + D-Bus for systemctl --user |
| NoNewPrivileges | true | Cannot gain new privileges |
| MemoryMax | 512M | OOM guard |
| CPUQuota | 20% | Prevent monitor from starving other processes |

Why /run/user/0? The monitor uses systemctl --user start/stop to manage gateway processes, which requires D-Bus access at the user runtime directory. Without this path whitelisted, ProtectSystem=strict would block the D-Bus socket and prevent auto-restart.

Initialization order in agent-monitor.js:

  1. loadConfig() -- read openclaw.json
  2. ensureLogDir() -- create log directory (must happen before any this.log() calls)
  3. loadMonitoredServices() -- parse agents.yaml (may log errors on failure)
  4. Signal handlers + start monitoring loop

10. File Reference

| File | Purpose |
|---|---|
| agents.yaml | Single source of truth for agent registry |
| scripts/parse_agents.py | Parses agents.yaml for bash/JS consumers |
| skills/mem0-integration/mem0_client.py | Core client: search, write, publish, cold_start |
| skills/mem0-integration/mem0_integration.py | CLI interface: init, search, add, publish, cold_start |
| skills/mem0-integration/session_init.py | Three-phase cold start hook |
| skills/mem0-integration/project_registry.yaml | Agent-to-project membership |
| templates/onboard.sh | Automated agent onboarding (adds to agents.yaml, installs service, reloads monitor) |
| templates/offboard.sh | Clean one-command agent removal |
| templates/agent-workspace/ | Workspace file templates |
| templates/systemd/ | Service and env file templates |
| agent-monitor.js | Config-driven health monitor (reads agents.yaml) |
| deploy.sh | Service management (reads agents.yaml) |
| docs/EXTENSIONS_ARCHITECTURE.md | Systemd, monitor, upgrade safety |
| docs/MEMORY_ARCHITECTURE.md | Four-layer memory system detail |



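PART B: Operational Playbooks (operations manual for the Main Agent)

Sections 11-14 below are step-by-step guides for operations that the Main Agent (陈医生) carries out in conversation. When the user says something like "help me create a new agent", "check agent status", "clean up memories", or "remove an agent", follow the corresponding section. Each step is marked as either a question to ask the user (🗣) or an action you perform yourself (🔧).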


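11. Playbook: Interactive Onboarding (Creating a New Agent)

When the user says "I want to create a new agent" or expresses a similar intent, follow the flow below.

11.1 Information Gathering (🗣 ask the user step by step)

Collect the information in the order below. Ask only 1-2 questions per round; do not list everything at once.

Round 1: Basic identity

Collect:
1. agent_id    : lowercase English identifier, no spaces (e.g. crypto, hr_bot, advert_pm)
2. agent_name  : display name, may be in Chinese (e.g. "加密分析师", "HR助手")

Sample question: "What ID should the new agent use (lowercase English, e.g. crypto)? What should its display name be?"

Round 2: Role definition

Collect:
3. role        : one-sentence role description (e.g. "assistant for cryptocurrency market analysis and investment strategy")
4. scope       : 2-5 responsibilities (e.g. "market monitoring, strategy analysis, risk alerts")
5. personality : personality / communication style (e.g. "professional and rigorous, data-driven, mildly humorous")

Sample question: "What is this agent's role? What is it responsible for? What communication style do you want it to have?"

Round 3: Project membership

Collect:
6. project_id  : owning project (existing: advert, global; or create a new one)
7. new_project : for a new project, the project name and description

First show the existing projects by reading skills/mem0-integration/project_registry.yaml.

Sample question: "Which project does this agent belong to? Existing projects: advert (advertising business), global (global). Do you need a new project?"

Round 4: Telegram Bot

Collect:
8. bot_token   : Telegram Bot Token

If the user does not have a token yet, give these instructions:

Steps to create a Telegram bot:
1. Search for @BotFather in Telegram and send /newbot
2. Enter the bot's display name when prompted (e.g. CryptoBot)
3. Enter the bot username (must end in "bot", e.g. openclaw_crypto_bot)
4. BotFather returns a token (format: 1234567890:ABCdef...)
5. Send that token to me

Round 5: Deployment

Collect:
9. deploy_type : local (localhost) or remote (Tailscale IP)
10. qdrant_host : Tailscale IP address, required for remote deployment

Sample question: "Will this agent run on this server or remotely? If remote, what is the Tailscale IP?"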

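11.2 Port Allocation Rules

| Port | Use |
|---|---|
| 18789 | main agent (in use) |
| 18790 | 2nd local agent |
| 18791 | 3rd local agent |
| ... | incremented sequentially |

🔧 Automatic allocation: read the number of agents registered in agents.yaml and set port = 18789 + count. Remote agents do not need a port on this server.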
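
A count-based rule can hand out a port that is already in use if an earlier agent was offboarded or if remote agents are included in the count. One possible refinement, shown here purely as a sketch and not as what onboard.sh is confirmed to do, is to pick the lowest port at or above the base that no registered local agent uses. The function name `next_free_port` and the idea of passing in a set of used ports are assumptions.

```python
BASE_PORT = 18789  # port used by the main agent


def next_free_port(used_ports: set[int], base: int = BASE_PORT) -> int:
    """Return the lowest port >= base that is not in `used_ports`."""
    port = base
    while port in used_ports:
        port += 1
    return port
```

With `used_ports = {18789, 18791}` this returns 18790, filling the gap left by a removed agent instead of colliding with 18791.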

11.3 Execution (🔧 run in order)

Once the information is collected, run the steps below. Report progress to the user after each step.

Step 1: Run onboard.sh

cd /root/.openclaw/workspace/templates
# Local agent:
./onboard.sh <agent_id> "<agent_name>" <project_id>
# Remote agent:
./onboard.sh <agent_id> "<agent_name>" <project_id> <qdrant_host>

This automatically creates the workspace, registers the agent in agents.yaml and project_registry, generates the systemd service/env files, and reloads the monitor.

Step 2: Fill in IDENTITY.md

Write to agents/<agent_id>-workspace/IDENTITY.md:

# Agent Identity

- **Name**: <agent_name>
- **Agent ID**: <agent_id>
- **Role**: <role description provided by the user>
- **Project**: <project_id>
- **Created**: <today's date>

## Scope
<responsibilities provided by the user, one per line>

## Communication Style
<personality / communication style provided by the user>

Step 3: Fill in SOUL.md

Write to agents/<agent_id>-workspace/SOUL.md:

# <agent_name> - Core Personality

## Beliefs
<2-3 core beliefs derived from the role the user described>

## Behavior Rules
- Follow shared best practices from public memory
- Respect memory visibility boundaries (public/project/private)
- Log important decisions to memory for team awareness
<2-3 additional role-specific behavior rules>

## Communication Style
<the user's communication style expanded into 2-3 concrete sentences>

Step 4: For a new project, register it in project_registry.yaml

If Round 3 produced a new project, edit skills/mem0-integration/project_registry.yaml:

  <project_id>:
    name: "<project name>"
    description: "<project description>"
    members:
      - "<agent_id>"
      - "main"
    owner: "main"

Step 5: Create openclaw.json

This is the most critical step. Copy main's configuration and modify it:

cp /root/.openclaw/openclaw.json /root/.openclaw-<agent_id>/openclaw.json

Fields that must be changed (field mapping table):

| JSON path | Value in main | Value for the new agent |
|---|---|---|
| agents.list[0].id | "main" | "<agent_id>" |
| agents.defaults.workspace | "/root/.openclaw/workspace" | "/root/.openclaw/workspace/agents/<agent_id>-workspace" |
| channels.telegram.botToken | "7047245486:AAF..." | "<token provided by the user>" |
| gateway.port | 18789 | <assigned port> |
| gateway.controlUi.allowedOrigins[2] | "http://100.115.94.1:18789" | "http://100.115.94.1:<port>" (must match this agent's gateway.port) |
| gateway.controlUi.dangerouslyDisableDeviceAuth | true | Keep true (otherwise opening the Control UI from the Tailscale IP reports "device identity required" and the browser device has to be paired first) |
| gateway.controlUi.allowInsecureAuth | absent or true | true recommended (same as main; over HTTP from a non-localhost address the browser cannot generate a device key, and without this option "device identity required" is still reported) |
| plugins.entries.mem0-integration.config.agent_id | "main" | "<agent_id>" |

Control UI access: if allowedOrigins[2] is not changed to this agent's port, visiting http://100.115.94.1:<port>/ fails with "origin not allowed" and the pairing page cannot be opened. When creating openclaw.json, always change gateway.port and gateway.controlUi.allowedOrigins[2] together.
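
The field mapping can also be applied programmatically once the copied file has been loaded with `json.load`. The sketch below is only an illustration of the table, assuming the config has exactly the JSON paths listed there; `derive_agent_config` is a hypothetical helper and is not part of onboard.sh.

```python
import copy


def derive_agent_config(main_cfg: dict, agent_id: str, bot_token: str,
                        port: int, host: str = "100.115.94.1") -> dict:
    """Return a modified copy of main's config for a new agent.

    Only the paths from the field mapping table are touched; everything
    else (models, auth, memory, skills) is inherited unchanged.
    """
    cfg = copy.deepcopy(main_cfg)  # never mutate main's config
    cfg["agents"]["list"][0]["id"] = agent_id
    cfg["agents"]["defaults"]["workspace"] = (
        f"/root/.openclaw/workspace/agents/{agent_id}-workspace"
    )
    cfg["channels"]["telegram"]["botToken"] = bot_token
    cfg["gateway"]["port"] = port
    control_ui = cfg["gateway"]["controlUi"]
    # The origin and gateway.port are set from the same value so they cannot drift.
    control_ui["allowedOrigins"][2] = f"http://{host}:{port}"
    control_ui["dangerouslyDisableDeviceAuth"] = True
    control_ui["allowInsecureAuth"] = True
    cfg["plugins"]["entries"]["mem0-integration"]["config"]["agent_id"] = agent_id
    return cfg
```

Deriving both `gateway.port` and `allowedOrigins[2]` from one `port` argument removes the most common cause of the "origin not allowed" error.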

Fields to leave unchanged (inherited from main's configuration):

  • models: same model configuration
  • auth: same authentication
  • memory: uses the qmd backend
  • skills: inherits tavily, find-skills-robin, mem0-integration
  • plugins.load.paths: may be kept, or changed to the agent's own skills path

Step 6: Start the service

# Local agent:
export XDG_RUNTIME_DIR=/run/user/$(id -u)
systemctl --user start openclaw-gateway-<agent_id>.service

# Check status:
systemctl --user status openclaw-gateway-<agent_id>.service

Step 7: Verify

./deploy.sh health

11.4 Completion Checklist (🔧 confirm each item, then tell the user)

□ onboard.sh ran successfully
□ Registered in agents.yaml
□ Registered in project_registry.yaml (with main as a member)
□ IDENTITY.md filled in with role and responsibilities
□ SOUL.md filled in with personality and behavior rules
□ openclaw.json created and fields modified:
  □ agents.list[0].id = <agent_id>
  □ agents.defaults.workspace points to the agent workspace
  □ channels.telegram.botToken uses the new token
  □ gateway.port does not conflict with any other agent
  □ gateway.controlUi.allowedOrigins[2] = "http://100.115.94.1:<this agent's port>" (otherwise the Control UI reports origin not allowed)
  □ gateway.controlUi.dangerouslyDisableDeviceAuth = true (otherwise device identity required is reported)
  □ gateway.controlUi.allowInsecureAuth = true (recommended when opening the UI over HTTP from a Tailscale/LAN IP)
  □ plugins.entries.mem0-integration.config.agent_id is correct
□ systemd service started
□ deploy.sh health passes everywhere
□ Telegram Bot pairing completed (confirmed by the user)

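11.6 Telegram Pairing (must be completed by the user)

When a new agent uses dmPolicy: pairing, pairing must be completed by the user in Telegram and the Control UI. The main agent cannot do it on the user's behalf.

Standard steps (give these to the user):

  1. In Telegram, search for the agent's bot (e.g. @xxx_bot) and send /start
  2. Open the agent's Control UI: http://100.115.94.1:<port>/ (the port is that agent's gateway.port)
  3. If "origin not allowed" appears: gateway.controlUi.allowedOrigins[2] in that agent's openclaw.json is not set to http://100.115.94.1:<port>. The main agent must correct it and restart that agent's service
  4. If "device identity required" appears: when the UI is accessed via http://100.115.94.1:<port>, the browser is not in a secure context (only HTTPS or localhost qualify) and cannot generate a device key. There are two options:
    • Option A (recommended): open the Control UI with a URL that carries the token: http://100.115.94.1:<port>/?token=<that agent's gateway.auth.token>. This bypasses the device check.
    • Option B: under gateway.controlUi in that agent's openclaw.json, set both "dangerouslyDisableDeviceAuth": true and "allowInsecureAuth": true, save, and restart that agent's service (trusted internal networks only).
  5. In the Control UI, go to the Devices/pairing page and approve the Telegram user's pairing request (this pairs the bot with the user and is unrelated to the browser device in the previous step)
  6. Once pairing is complete, the bot can reply to direct messages normally

What the main agent can do: provide the steps above, give the Control UI address, and, when an origin error occurs, fix the configuration and ask the user to retry.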

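11.5 Sample Dialog

User: Help me create a new cryptocurrency analysis agent
陈医生: Sure. Let's settle the basics first:
       1. What should the Agent ID be? Lowercase English is recommended, e.g. crypto
       2. What should the display name be?
User: The ID is crypto, and the name is "加密分析师"
陈医生: Got it. What exactly is this agent's role? What is it responsible for?
       As for communication style, do you want it professional and rigorous, or relaxed and lively?
User: It handles market monitoring, strategy analysis, and risk alerts. Professional but concise.
陈医生: Understood. For project membership, the existing projects are:
       - advert (advertising business)
       - global (shared globally)
       Do you want to create a new crypto project?
User: Create a new crypto project
陈医生: OK. Next I need a dedicated Telegram Bot Token.
       If you do not have one yet, the steps are:
       1. Search for @BotFather in Telegram → send /newbot
       2. Set the name and username
       3. Send me the token it returns
User: 1234567890:ABCdef...
陈医生: One last thing: deploy on this server or remotely?
User: This server
陈医生: OK, starting deployment...
       [runs Step 1-7]
       Deployment complete! Every checklist item passed.
       Please send a message to @your_bot in Telegram to test it.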

12. Playbook: Agent Maintenance & Troubleshooting (Routine Maintenance)

12.1 Routine Health Check (🔧)

Run this when the user asks "how are the agents doing", or as part of a Heartbeat:

./deploy.sh health      # global health check
./deploy.sh status      # detailed service status

Interpret the results and report to the user:

  • ✓ means healthy
  • ✗ means the service is not running → try restarting it
  • ⚠ means a resource warning → report the specific values

12.2 Troubleshooting an Unresponsive Agent

Step 1: Check whether the service is running
  systemctl --user status openclaw-gateway-<agent_id>.service

Step 2: If inactive → check the logs
  journalctl --user -u openclaw-gateway-<agent_id> -n 50 --no-pager

Step 3: Common problems and fixes:
  - "Address already in use" → port conflict; check gateway.port in openclaw.json
  - "Cannot find module" → openclaw version problem; run ./deploy.sh fix-service
  - "ECONNREFUSED" → Qdrant is not running; check docker ps | grep qdrant
  - "API key invalid" → check the API key in systemd/<agent_id>-gateway.env
  - "origin not allowed" (Control UI will not open) → gateway.controlUi.allowedOrigins[2] in that agent's openclaw.json must be "http://100.115.94.1:<this agent's port>"; after changing it, run systemctl --user restart openclaw-gateway-<agent_id>.service
  - "device identity required" (Control UI demands device pairing) → over HTTP from a non-localhost address the browser cannot generate a device key. Fix: (1) use a URL with the token: http://100.115.94.1:<port>/?token=<gateway.auth.token>; or (2) under gateway.controlUi in that agent's openclaw.json set both "dangerouslyDisableDeviceAuth": true and "allowInsecureAuth": true, save, and restart that agent's service (trusted internal networks only).

Step 4: Restart
  systemctl --user restart openclaw-gateway-<agent_id>.service

Step 5: Still failing → collect logs for the user
  journalctl --user -u openclaw-gateway-<agent_id> -n 200 --no-pager > /tmp/agent-debug.log

12.3 Recovery After an OpenClaw Upgrade

After the user upgrades OpenClaw through the UI, custom configuration may be lost:

./deploy.sh fix-service    # re-inject EnvironmentFile into the systemd services
./deploy.sh restart        # restart all services so the configuration takes effect
./deploy.sh health         # confirm everything is back to normal

Report the result of the repair to the user.

12.4 Listing Agents

python3 scripts/parse_agents.py list

Output format: <id>\t<type>\t<name>. Format it as a table when showing it to the user.

12.5 Debug Mode

When the user needs to debug an agent:

./deploy.sh debug-stop     # stop all services (including the monitor, to prevent auto-restart)
# ... user debugs ...
./deploy.sh debug-start    # bring all services back

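13. Playbook: Memory Management

13.1 Publishing Shared Knowledge (🔧)

When the user says "share this best practice with all agents":

python3 skills/mem0-integration/mem0_integration.py publish \
  '{"content":"<knowledge content>", "visibility":"public", "category":"knowledge"}'

When the user says "share this information with a specific project":

python3 skills/mem0-integration/mem0_integration.py publish \
  '{"content":"<content>", "visibility":"project", "project_id":"<project>", "category":"knowledge"}'

13.2 Viewing Memory Statistics

python3 skills/mem0-integration/memory_cleanup.py --dry-run

Report to the user the number of memories per agent, per type, and per visibility level.

13.3 Cleaning Up Expired Memories

# Dry run first to see what would be removed:
python3 skills/mem0-integration/memory_cleanup.py --dry-run --max-age-days 90

# After confirmation, execute (--execute as documented in Section 15.3):
python3 skills/mem0-integration/memory_cleanup.py --execute --max-age-days 90

13.4 Preloading Knowledge for a New Agent (Cold Start)

After a new agent is created, public knowledge can be preloaded for it:

python3 skills/mem0-integration/mem0_integration.py cold_start \
  '{"agent_id":"<agent_id>", "user_id":"wang_yuanzhang", "top_k":10}'

13.5 Checking Memory Visibility

When the user asks "can agent X see this memory?":

  1. Determine the memory's visibility and project_id
  2. In project_registry.yaml, check whether the agent is in that project's members list
  3. Visibility rules:
    • public → visible to all agents
    • project → visible only to project members
    • private → visible only to the agent that wrote it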

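14. Playbook: Interactive Offboarding (Removing an Agent)

14.1 Information Gathering (🗣)

Collect:
1. agent_id  : ID of the agent to remove
2. keep_data : whether to keep its data (workspace, profile, Qdrant memories)

Sample question: "Which agent should be removed? Do you want to keep its data? (Keeping it allows a later restore.)"

🔧 First show the current agent list:

python3 scripts/parse_agents.py list

14.2 Safety Checks (🔧)

□ Confirm it is not the main agent (main cannot be removed)
□ Confirm the agent exists in agents.yaml
□ Confirm once more with the user: "Are you sure you want to remove <agent_name> (<agent_id>)? This stops the service and deletes it from the registry."

14.3 Execution (🔧)

cd /root/.openclaw/workspace/templates

# Full removal (including data):
./offboard.sh <agent_id>

# Unregister only (keep data):
./offboard.sh <agent_id> --keep-data

The script asks for interactive confirmation (y/N); enter y to confirm.

14.4 Report After Completion

Report to the user:

Agent <agent_name> (<agent_id>) has been removed:
  - Service: stopped and uninstalled
  - agents.yaml: removed
  - project_registry: removed
  - Workspace: <deleted / kept>
  - Qdrant memories: <deleted / kept>
  - Monitor: reloaded

Run ./deploy.sh health to confirm the system is healthy.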


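15. Playbook: Backup & Cleanup

15.1 Backup Commands

| Command | Description |
|---|---|
| ./deploy.sh backup | Full backup (workspace + Qdrant snapshot + agent profiles + docker-compose) |
| ./deploy.sh backup quick | Quick backup (workspace files only, no Qdrant) |
| bash scripts/10-create-backup.sh | Standalone backup script (includes mem0 config + agents.yaml + Qdrant snapshot) |

Backup retention policy: the 10 most recent backups are kept automatically; older backups are deleted automatically.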
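
The "keep the latest 10" rule can be sketched as below. This is not the logic in deploy.sh: it assumes backup directory names are timestamps that sort chronologically as strings, and `backups_to_prune` is a hypothetical helper.

```python
def backups_to_prune(names: list[str], keep: int = 10) -> list[str]:
    """Return the backup directory names that fall outside the newest `keep`.

    Assumes names are timestamps whose string order equals time order.
    """
    newest_first = sorted(names, reverse=True)
    return newest_first[keep:]
```

With 12 backups present, the two oldest names are returned for deletion; with 10 or fewer, the result is empty.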

Backup directory layout:

/root/.openclaw/backups/<TIMESTAMP>/
├── workspace.tar.gz                  # Layer 1+2: all MD and config files
├── .openclaw__openclaw.json          # main agent profile
├── .openclaw-tongge__openclaw.json   # secondary agent profiles (if any)
├── docker-compose.yml                # Qdrant docker configuration
├── qdrant-mem0_v4_shared.snapshot    # Layer 4 vector data (full mode)
├── qdrant-point-count.txt            # point count at backup time (for verification)
└── manifest.txt                      # backup manifest

15.2 Restore Commands

| Command | Description |
|---|---|
| ./deploy.sh restore <backup-dir> | Restore workspace files + agent profiles |
| ./deploy.sh restore-qdrant <snapshot-file> | Restore Qdrant vector data |

A quick backup is created automatically before a restore, and interactive confirmation (y/N) is required.

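15.3 Memory Cleanup

Cleanup script: skills/mem0-integration/memory_cleanup.py

| Command | Description |
|---|---|
| python3 memory_cleanup.py --dry-run | Statistics by dimension + count of expired memories (nothing is deleted) |
| python3 memory_cleanup.py --execute --max-age-days 90 | Actually delete expired memories |

Retention policy (aligned with EXPIRATION_MAP in mem0_client.py):

  • session: expires after 7 days
  • chat_summary: expires after 30 days
  • preference: kept permanently
  • knowledge: kept permanently

--max-age-days acts as a hard upper bound: session/chat_summary memories older than that many days are deleted regardless of their expiration_date. preference and knowledge memories are never cleaned up automatically.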
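
The retention rules can be written as a single predicate. The sketch below is a reading of the policy as described here, not the code in memory_cleanup.py; `should_delete`, `PERMANENT_TYPES`, and the argument shapes are assumptions.

```python
from datetime import datetime, timedelta
from typing import Optional

# Types that the policy keeps permanently.
PERMANENT_TYPES = {"preference", "knowledge"}


def should_delete(memory_type: str, created_at: datetime,
                  expiration_date: Optional[datetime], now: datetime,
                  max_age_days: int) -> bool:
    """Decide whether a memory is eligible for cleanup."""
    if memory_type in PERMANENT_TYPES:
        return False  # never auto-cleaned, whatever its age
    if expiration_date is not None and expiration_date <= now:
        return True  # its own expiration date has passed
    # Hard upper bound: too old, even if expiration_date is still in the future.
    return now - created_at > timedelta(days=max_age_days)
```

A 100-day-old chat_summary whose expiration_date is still in the future is therefore deleted under `--max-age-days 90`, while a 400-day-old knowledge entry is kept.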

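Audit log: every cleanup run is written to logs/security/memory-cleanup-<date>.log

15.4 Automation with Cron

Install script: scripts/setup-cron.sh

./scripts/setup-cron.sh          # install the scheduled jobs
./scripts/setup-cron.sh remove   # remove the scheduled jobs
./scripts/setup-cron.sh status   # show the current jobs

Schedule:

| Time | Job |
|---|---|
| Daily 02:00 | ./deploy.sh backup (full backup) |
| Sundays 03:00 | memory_cleanup.py --execute --max-age-days 90 (clean up expired memories) |

Logs are written to logs/system/cron-backup.log and logs/system/cron-cleanup.log

15.5 Interactive Backup and Restore Flow (🗣)

Conversation flow when the user asks for a backup or a restore:

Backup:

陈医生: "What type of backup do you need?"
  1. Full backup (includes Qdrant vector data, recommended)
  2. Quick backup (workspace files only)

→ Run the corresponding command and report the backup path and the Qdrant point count
→ Recommendation: always take a full backup before a major change

Restore:

陈医生: "Which backup should be restored?"
→ List the backups available under /root/.openclaw/backups/
→ Show the contents of manifest.txt for the user to confirm
→ Restore the workspace first: ./deploy.sh restore <dir>
→ If a Qdrant snapshot exists and the user confirms: ./deploy.sh restore-qdrant <file>
→ After restoring, run ./deploy.sh restart + ./deploy.sh health
→ Compare qdrant-point-count.txt with the current point count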

16. Playbook: Server Migration

16.1 Preparation (🗣)

Information gathering:

Confirm:
1. target_server : target server address (IP or Tailscale hostname)
2. target_user   : user name on the target server (usually root)
3. keep_source   : whether to keep the source server's data after migration
4. tailscale     : whether the target server has joined the Tailscale network

Sample question: "Which server are we migrating to? Is Tailscale installed on it? Should the source server's data be kept after migration?"

16.2 Source Server: Full Backup (🔧)

cd /root/.openclaw/workspace
./deploy.sh backup

Verify the backup is complete:

ls -la /root/.openclaw/backups/<TIMESTAMP>/
cat /root/.openclaw/backups/<TIMESTAMP>/manifest.txt
cat /root/.openclaw/backups/<TIMESTAMP>/qdrant-point-count.txt

16.3 Transfer to the Target Server (🔧)

BACKUP_DIR="/root/.openclaw/backups/<TIMESTAMP>"
TARGET="root@<target_server>"

rsync -avzP "$BACKUP_DIR" "$TARGET:/root/.openclaw/backups/"
rsync -avzP /root/.openclaw/workspace/ "$TARGET:/root/.openclaw/workspace/" --exclude='.git' --exclude='logs'
rsync -avzP /root/.openclaw/openclaw.json "$TARGET:/root/.openclaw/"

Secondary agent profiles (if any):

for d in /root/.openclaw-*/; do
    agent_name=$(basename "$d")
    rsync -avzP "$d" "$TARGET:/root/$agent_name/"
done

16.4 Target Server: Install Infrastructure (🔧)

# 1. Install Node.js (v24+) and OpenClaw
curl -fsSL https://get.openclaw.com | bash

# 2. Install Docker + Qdrant
mkdir -p /opt/mem0-center && cd /opt/mem0-center
# Restore docker-compose.yml from the backup
cp /root/.openclaw/backups/<TIMESTAMP>/docker-compose.yml .
docker compose up -d

# 3. Wait for Qdrant to start
sleep 5
curl -sf http://localhost:6333/collections | python3 -c "import sys,json; print(json.dumps(json.load(sys.stdin),indent=2))"

# 4. Restore the Qdrant data
cd /root/.openclaw/workspace
./deploy.sh restore-qdrant /root/.openclaw/backups/<TIMESTAMP>/qdrant-mem0_v4_shared.snapshot

# 5. Install Python dependencies
pip3 install qdrant-client mem0ai pyyaml

# 6. Install system services
./deploy.sh install

16.5 Verification (🔧)

# Service status
./deploy.sh health

# Compare Qdrant data
curl -sf http://localhost:6333/collections/mem0_v4_shared | python3 -c "import sys,json; d=json.load(sys.stdin); print(f'Points: {d[\"result\"][\"points_count\"]}')"
# Compare against the source server's qdrant-point-count.txt

# Memory retrieval test
cd /root/.openclaw/workspace/skills/mem0-integration
python3 mem0_integration.py search "test query" --agent-id main

# Telegram connectivity
# Send a test message to the bot in Telegram

16.6 Post-Migration Checklist

□ All agent services are running (deploy.sh health all green)
□ Qdrant point count matches the source server
□ Memory retrieval returns results
□ Telegram bot replies normally
□ Cron jobs installed (scripts/setup-cron.sh install)
□ Environment variables set (MEM0_DASHSCOPE_API_KEY, etc.)
□ Monitor service running (systemctl status openclaw-agent-monitor)
□ Tailscale joined (if remote agents need to connect)
□ Source server data handled (kept / cleaned up)

16.7 Rollback Plan

If the migration fails:

1. On the source server, run ./deploy.sh debug-start to bring services back
2. On the target server, run ./deploy.sh debug-stop to stop all services
3. Investigate the problem, then try again

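17. Skill/Plugin Management SOP

17.1 Skill vs Plugin Selection Guide

OpenClaw has two extension loading mechanisms. Choose between them as follows:

| Type | Loading | Config location | Use case |
|---|---|---|---|
| Built-in Skill | Discovered automatically by OpenClaw | skills.entries.<id> | Built-in skills from the Clawhub marketplace (e.g. find-skills-robin) |
| Custom Plugin | Path specified manually | plugins.load.paths + plugins.entries.<id> | In-house tools (tavily), lifecycle hooks (mem0), any extension that needs custom code |

Decision rules:

  • If you only need to toggle a built-in Clawhub feature -> skills.entries
  • If you have your own openclaw.plugin.json + index.js -> plugins
  • If you need a lifecycle hook (runs automatically before/after a conversation) -> must be plugins
  • Do not enable the same skill in both skills.entries and plugins.entries

Required plugin files:

/root/.openclaw/workspace/skills/<id>/
├── openclaw.plugin.json   # plugin manifest (required)
├── index.js               # tool/hook implementation (required)
├── CONFIG_SUMMARY.md      # configuration notes (recommended)
└── TEST_REPORT.md         # test report (recommended)

17.2 Staged Release Workflow

Every new skill must be validated on the main agent before it is deployed to secondary agents.

Stage 1 -- Install the code

  1. Place the skill code in /root/.openclaw/workspace/skills/<id>/
  2. Make sure openclaw.plugin.json exists (with id, name, kind, main, tools/configSchema)
  3. Make sure index.js exists (exporting register/activate and the tool definitions)

Stage 2 -- Enable and test on Main

  1. In main's openclaw.json:
    • add "/root/.openclaw/workspace/skills/<id>" to plugins.load.paths
    • set plugins.entries.<id> to { "enabled": true } (fill in config as well, if any)
  2. Restart the main gateway: systemctl --user restart openclaw-gateway.service
  3. Check the logs to confirm the plugin loaded: journalctl --user -u openclaw-gateway -n 50 | grep -i <id>
  4. Test the feature by messaging main through Telegram

Stage 3 -- Review

Complete the review using templates/SKILL_REVIEW_TEMPLATE.md, covering:

| Review dimension | What to check |
|---|---|
| Security | API key management (environment variables vs hard-coded), scope of network requests, file reads/writes, privilege escalation |
| Functionality | Whether the agent can call it correctly, whether results are accurate, whether error handling is reasonable |
| Performance | Response time, concurrent calls, impact on the agent's overall latency |
| Best practices | Recommended parameters, suitable scenarios, known limitations, recorded in CONFIG_SUMMARY.md |

Stage 4 -- Roll out to secondary agents

  1. The skill code lives in the shared workspace, so nothing needs to be copied
  2. In the secondary agent's openclaw.json:
    • add the same path to plugins.load.paths
    • enable plugins.entries.<id> (mind agent-specific config, e.g. mem0's agent_id must be changed to that agent's ID)
  3. Restart the secondary agent's gateway
  4. Verify that the plugin loads and works

17.3 Current Skill Inventory

| Skill ID | Type | Loading | Main | Tongge | Notes |
|---|---|---|---|---|---|
| find-skills-robin | built-in | skills.entries | enabled | enabled | Clawhub skill discovery |
| mem0-integration | lifecycle | skills.entries + plugins | enabled | enabled | Memory system (agent_id must differ per agent) |
| tavily | tool | plugins | enabled | enabled | AI search (shared API key) |
| active-learning | built-in | skills.entries | -- | enabled | Active learning (tongge only) |
| memos-cloud-openclaw-plugin | built-in | plugins.entries | enabled | enabled | Memos cloud plugin |
| qwen-portal-auth | built-in | plugins.entries | enabled | enabled | Qwen Portal OAuth |

Maintenance requirement: update this table whenever a skill is added or removed.

17.4 Agent-Specific Configuration Notes

Some plugins need different configuration on different agents:

| Plugin | Config item that differs | Main | Tongge |
|---|---|---|---|
| mem0-integration | config.agent_id | "main" | "tongge" |
| mem0-integration | config.user_id | "wang院长" | "wang院长" |

When deploying to a new agent, always check the config items above.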


Changelog

| Version | Date | Changes |
|---|---|---|
| 1.0 | 2026-03-06 | Initial version: hub-and-spoke model, templates, remote support |
| 1.1 | 2026-03-06 | Config-driven architecture: agents.yaml as single registry; automated onboard/offboard; parse_agents.py helper; life agent (张大师) removed; main is only active agent |
| 1.2 | 2026-03-06 | Code review + bug fixes (7 items): parse_agents.py output now shell-safe quoted; agent-monitor.js constructor ordering fixed (ensureLogDir before loadMonitoredServices) and fallback uses full openclaw path; deploy.sh switched grep -qP to grep -qE for portability; offboard.sh Qdrant delete uses FilterSelector wrapper; onboard.sh/offboard.sh inline Python rewritten with sys.argv to prevent shell injection; openclaw-agent-monitor.service added /run/user/0 to ReadWritePaths for D-Bus access; removed corrupted trailing bytes in offboard.sh |
| 2.0 | 2026-03-06 | Added operational playbooks (Part B): Interactive Onboarding (Sec 11, with conversation flow, field mapping table, port allocation, checklist, dialog example), Agent Maintenance & Troubleshooting (Sec 12), Memory Management (Sec 13), Interactive Offboarding (Sec 14). Document restructured into Part A (Architecture Reference) and Part B (Operational Playbooks). |
| 2.1 | 2026-03-06 | Added Backup & Cleanup Playbook (Sec 15): backup/restore commands, memory cleanup with retention policy, cron automation, interactive dialogue flow. Added Server Migration Playbook (Sec 16): step-by-step migration with pre/post checklist, Qdrant snapshot recovery, rollback plan. |
| 2.2 | 2026-03-09 | Added Skill/Plugin Management SOP (Sec 17): skill vs plugin selection guide, staged release workflow (main-first), current skill inventory, agent-specific config notes. Unified tavily loading to plugin mode across all agents. |