# Multi-Agent Management Architecture

**Version:** 2.0
**Date:** 2026-03-06
**Maintainer:** Eason
**Current state:** Main (陈医生) is the only active agent. The life agent (张大师) has been removed. All agents are defined in `agents.yaml`.

> **For Main Agent (陈医生):** You are the Hub Agent. This document is both an architecture reference and your operations manual.
> When the user asks you to create, maintain, troubleshoot, or remove an agent, jump to the matching Playbook section (11-14) and follow its steps.

---

## 1. Hub-and-Spoke Model

The main agent acts as the **memory hub**: it publishes shared knowledge, maintains the project registry, and onboards new agents. All other agents (local or remote) are **spokes** that consume shared memory and contribute their own private/project memories.

```
Main Agent (Hub) - defined in agents.yaml
  |-- publish_knowledge()             --> Qdrant mem0_v4_shared (visibility=public)
  |-- publish_knowledge(project_id=X) -->                       (visibility=project)
  |-- maintain project_registry.yaml
  |-- maintain docs & best practices
  |
  +-- Local Spokes (same server, same Qdrant)
  |     |-- local-cli: main (openclaw gateway)
  |     |-- local-systemd: (port 187XX)
  |
  +-- Remote Spokes (Tailscale VPN -> Qdrant)
        +-- remote-http: (health via HTTP)
```

---

## 2. Memory Visibility Model

All agents share one Qdrant collection: `mem0_v4_shared`. Isolation is achieved through metadata fields.

| Visibility | Who can read | Metadata filter |
|-----------|-------------|-----------------|
| public | All agents | `visibility=public` |
| project | Same project members | `visibility=project, project_id=X` |
| private | Only the writing agent | `visibility=private, agent_id=X` |

Project membership is defined in `skills/mem0-integration/project_registry.yaml`.
The main agent is registered as a member of all projects for audit access.

---

## 3. Agent Registry (agents.yaml)

**Path:** `/root/.openclaw/workspace/agents.yaml`

This file is the **single source of truth** for all agent definitions.
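
To make the "single source of truth" idea concrete, the sketch below shows how a consumer might pick a health-check mechanism from an agent's type alone. It is only an illustration: the real `agents.yaml` schema is not reproduced in this document, so the record fields (`id`, `type`, `unit`, `health_url`), the sample values, and the function names are all assumptions.

```python
from typing import Dict, List, Tuple

# Hypothetical records: the field names and values below are assumptions,
# not the actual agents.yaml schema.
AGENTS: List[Dict[str, str]] = [
    {"id": "main", "type": "local-cli"},
    {"id": "crypto", "type": "local-systemd", "unit": "example-crypto.service"},
    {"id": "remote1", "type": "remote-http", "health_url": "http://example.invalid/health"},
]


def check_strategy(agent: Dict[str, str]) -> Tuple[str, str]:
    """Return (mechanism, target) describing how this agent would be checked."""
    kind = agent["type"]
    if kind == "local-cli":
        # The main agent is managed through the `openclaw gateway` CLI.
        return ("cli", "openclaw gateway")
    if kind == "local-systemd":
        # Local spokes run as user-level systemd units.
        return ("systemd-user", agent["unit"])
    if kind == "remote-http":
        # Remote spokes can only be probed over HTTP.
        return ("http", agent["health_url"])
    raise ValueError(f"unknown agent type: {kind!r}")


def plan(agents: List[Dict[str, str]]) -> Dict[str, Tuple[str, str]]:
    """Map each agent ID to its check strategy."""
    return {a["id"]: check_strategy(a) for a in agents}


if __name__ == "__main__":
    for agent_id, (mechanism, target) in plan(AGENTS).items():
        print(f"{agent_id}: {mechanism} -> {target}")
```

Because every decision is derived from the registry record, adding an agent in this sketch means adding one record rather than touching each tool, which is the property the registry is meant to provide.
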
All tooling reads from it dynamically:

| Consumer | Purpose |
|----------|---------|
| `deploy.sh` | Service management (start/stop/debug/fix) |
| `agent-monitor.js` | Health monitoring |
| `local_search.py` | Agent lookup for search |
| `memory_cleanup.py` | Agent-aware cleanup |
| `onboard.sh` / `offboard.sh` | Add/remove agents |

**Helper script:** `scripts/parse_agents.py` parses agents.yaml for bash/JS:

```bash
python3 scripts/parse_agents.py list      # list agent IDs
python3 scripts/parse_agents.py info      # get agent info as KEY=VALUE (shell-safe quoted)
python3 scripts/parse_agents.py services  # list all agents with service details (tab-separated)
python3 scripts/parse_agents.py ids       # space-separated agent IDs (for bash loops)
```

> **Note:** The `info` subcommand outputs single-quoted values (`KEY='value'`) that are safe
> for `eval` in bash, even when values contain spaces, CJK characters, or special shell
> metacharacters. The `services` subcommand uses tab (`\t`) as the delimiter to avoid
> collisions with `|` or spaces in command strings.

**Agent types supported:**

| Type | Description |
|------|-------------|
| `local-cli` | Managed via `openclaw gateway` CLI (main agent) |
| `local-systemd` | Managed via user-level systemd unit |
| `remote-http` | Remote agent checked via HTTP health endpoint |

---

## 4. Agent Lifecycle

### 4.1 Onboard (create)

```bash
cd /root/.openclaw/workspace/templates
./onboard.sh [qdrant_host]
```

**Automated.** This script:

1. Creates the workspace at `agents/-workspace/` (IDENTITY.md, SOUL.md, mem0 config)
2. Registers the agent in `agents.yaml`
3. Registers the agent in `project_registry.yaml`
4. For local agents: generates the systemd service + env file, then installs and enables them
5. Reloads `openclaw-agent-monitor` so it picks up the new agent

**Examples:**

```bash
./onboard.sh crypto "CryptoBot" crypto                  # local agent
./onboard.sh remote1 "RemoteBot" advert 100.115.94.1    # remote agent
```

**Remaining manual steps (local-systemd):** Edit IDENTITY.md, create `~/.openclaw-/openclaw.json`, then start the service.

### 4.2 Offboard (retire)

```bash
cd /root/.openclaw/workspace/templates
./offboard.sh [--keep-data]
```

**Options:**

- (default) Full removal: stops the service, removes the agent from agents.yaml and project_registry, and deletes its workspace, profile, and Qdrant memories
- `--keep-data` Unregister only: keeps workspace and profile files

**Examples:**

```bash
./offboard.sh crypto               # full removal
./offboard.sh crypto --keep-data   # keep files, just unregister
```

The main (hub) agent cannot be offboarded.

---

## 5. Knowledge Publishing

The main agent can publish best practices and shared knowledge to Qdrant:

**Via Python:**

```python
import asyncio

from mem0_client import mem0_client


async def main():
    await mem0_client.start()
    await mem0_client.publish_knowledge(
        content="Always use EnvironmentFile= in systemd services for upgrade safety",
        category="knowledge",
        visibility="public",
    )


asyncio.run(main())
```

**Via CLI:**

```bash
python3 mem0_integration.py publish '{"content":"...", "visibility":"public"}'
```

**Via Node.js plugin (index.js):**
The `publish` action is available through the same spawn interface used by `search` and `add`.

### Visibility Guidelines

| Content type | Visibility | Example |
|-------------|-----------|---------|
| System best practices | public | "Use deploy.sh fix-service after upgrades" |
| Project-specific knowledge | project | "{agent_id} uses Google Calendar API" |
| User preferences | private | "User prefers dark mode" |
| API keys, secrets | NEVER store | Use environment variables |

---

## 6. Cold Start Preload

When a new session starts, `session_init.py` calls `cold_start_search()`, which retrieves memories in three phases:

1. **Phase 0 (public)**: Best practices, shared config -- available to all agents
2. **Phase 1 (project)**: Project-specific guidelines -- based on the agent's project membership
3. **Phase 2 (private)**: The agent's own recent context

Results are deduplicated, ordered by phase priority, and injected into the System Prompt.

---

## 7. Local Agent Configuration

Local agents run on the same server and connect to Qdrant at `localhost:6333`.

Key configuration points:

- `openclaw.json`: `collection_name: "mem0_v4_shared"` (NOT agent-specific collections)
- `systemd/-gateway.env`: contains `MEM0_DASHSCOPE_API_KEY`
- `EnvironmentFile=` in the service unit references the env file

---

## 8. Remote Agent Configuration

Remote agents run on different servers and connect to Qdrant via Tailscale.

### Prerequisites

1. Tailscale installed and joined to the same tailnet on both servers
2. Qdrant accessible at the hub server's Tailscale IP (e.g., `100.115.94.1:6333`)
3. Tailscale ACL allows the remote server to access port 6333

### Environment File

```
MEM0_QDRANT_HOST=100.115.94.1
MEM0_DASHSCOPE_API_KEY=sk-...
OPENAI_API_BASE=https://dashscope.aliyuncs.com/compatible-mode/v1
OPENAI_BASE_URL=https://dashscope.aliyuncs.com/compatible-mode/v1
```

### Onboarding

```bash
./onboard.sh remote1 "RemoteBot" advert 100.115.94.1
```

The 4th argument sets `MEM0_QDRANT_HOST` in the generated env file.
The agent is automatically added to `agents.yaml` and the monitor picks it up on reload.

### Monitoring

The monitor reads from `agents.yaml` dynamically. Remote agents (type `remote-http`) are checked via their `health_url`. Remote agents cannot be auto-started from the hub; the monitor will only alert on failure.

---

## 9. Agent Monitor Service Hardening

The `openclaw-agent-monitor.service` runs as a system-level systemd service with the following security constraints:

| Directive | Value | Purpose |
|-----------|-------|---------|
| `ProtectSystem` | `strict` | Mounts entire filesystem read-only |
| `ProtectHome` | `read-only` | Home directory is read-only |
| `ReadWritePaths` | `/root/.openclaw/workspace/logs /run/user/0` | Whitelist for writes: log output + D-Bus for `systemctl --user` |
| `NoNewPrivileges` | `true` | Cannot gain new privileges |
| `MemoryMax` | `512M` | OOM guard |
| `CPUQuota` | `20%` | Prevent monitor from starving other processes |

**Why `/run/user/0`?** The monitor uses `systemctl --user start/stop` to manage gateway processes, which requires D-Bus access at the user runtime directory. Without this path whitelisted, `ProtectSystem=strict` would block the D-Bus socket and prevent auto-restart.

**Initialization order in `agent-monitor.js`:**

1. `loadConfig()` -- read `openclaw.json`
2. `ensureLogDir()` -- create log directory (must happen before any `this.log()` calls)
3. `loadMonitoredServices()` -- parse `agents.yaml` (may log errors on failure)
4. Signal handlers + start monitoring loop

---

## 10. File Reference

| File | Purpose |
|------|---------|
| `agents.yaml` | Single source of truth for agent registry |
| `scripts/parse_agents.py` | Parses agents.yaml for bash/JS consumers |
| `skills/mem0-integration/mem0_client.py` | Core client: search, write, publish, cold_start |
| `skills/mem0-integration/mem0_integration.py` | CLI interface: init, search, add, publish, cold_start |
| `skills/mem0-integration/session_init.py` | Three-phase cold start hook |
| `skills/mem0-integration/project_registry.yaml` | Agent-to-project membership |
| `templates/onboard.sh` | Automated agent onboarding (adds to agents.yaml, installs service, reloads monitor) |
| `templates/offboard.sh` | Clean one-command agent removal |
| `templates/agent-workspace/` | Workspace file templates |
| `templates/systemd/` | Service and env file templates |
| `agent-monitor.js` | Config-driven health monitor (reads agents.yaml) |
| `deploy.sh` | Service management (reads agents.yaml) |
| `docs/EXTENSIONS_ARCHITECTURE.md` | Systemd, monitor, upgrade safety |
| `docs/MEMORY_ARCHITECTURE.md` | Four-layer memory system detail |

---

# PART B: Operational Playbooks (Operations Manual for the Main Agent)

> **Sections 11-14 below are step-by-step guides for operations that the Main Agent (陈医生) carries out in conversation.**
> When the user says "create a new agent for me", "check agent status", "clean up memories", or "remove an agent",
> follow the matching section. Each step is marked as either a question to ask the user (🗣️) or an action you perform yourself (🔧).

---

## 11. Playbook: Interactive Onboarding (Creating a New Agent)

When the user says "I want to create a new agent" or expresses a similar intent, follow the flow below.

### 11.1 Information Gathering (🗣️ ask the user step by step)

Collect the information in the order below. Ask only 1-2 questions per round; do not list everything at once.

**Round 1: Basic identity**

```
To collect:
1. agent_id: lowercase English identifier, no spaces (e.g. crypto, hr_bot, advert_pm)
2. agent_name: display name, may be Chinese (e.g. "加密分析师", "HR助手")
```

Sample question: "What ID should the new agent use? (lowercase English, e.g. crypto) What is its display name?"

**Round 2: Role definition**

```
To collect:
3. role: one-sentence role description (e.g. "assistant for crypto market analysis and investment strategy")
4. scope: 2-5 responsibilities (e.g. "market monitoring, strategy analysis, risk alerts")
5. personality: personality/communication style (e.g. "professional and rigorous, data-driven, mildly humorous")
```

Sample question: "What is this agent's role? What is it responsible for? What communication style should it have?"

**Round 3: Project membership**

```
To collect:
6. project_id: owning project (existing: advert, global; or a new one)
7. new_project: for a new project, its name and description
```

Show the existing projects first by reading `skills/mem0-integration/project_registry.yaml`.

Sample question: "Which project does this agent belong to? Existing projects: advert (advertising business), global (shared). Do you need a new project?"

**Round 4: Telegram Bot**

```
To collect:
8. bot_token: Telegram Bot Token
```

If the user does not have a token yet, give these instructions:

```
Steps to create a Telegram Bot:
1. Search for @BotFather in Telegram and send /newbot
2. Enter the bot display name when prompted (e.g. CryptoBot)
3. Enter the bot username (must end in "bot", e.g. openclaw_crypto_bot)
4. BotFather returns a token (format: 1234567890:ABCdef...)
5. Send that token to me
```

**Round 5: Deployment mode**

```
To collect:
9. deploy_type: local (localhost) or remote (Tailscale IP)
10. qdrant_host: the Tailscale IP address, required for remote deployment
```

Sample question: "Will this agent run on this server or remotely? If remote, what is the Tailscale IP?"

### 11.2 Port Allocation Rules

| Port | Use |
|------|------|
| 18789 | main agent (taken) |
| 18790 | 2nd local agent |
| 18791 | 3rd local agent |
| ... | incrementing by one |

🔧 Automatic allocation: read the number of agents registered in `agents.yaml`, then port = 18789 + count. Confirm the result does not collide with another agent's `gateway.port`.

Remote agents do not need a port on this server.

### 11.3 Execution (🔧 run in order)

Once the information is collected, run the steps below. **Report progress to the user after each step.**

**Step 1: Run onboard.sh**

```bash
cd /root/.openclaw/workspace/templates
# Local agent:
./onboard.sh ""
# Remote agent:
./onboard.sh ""
```

This automatically creates the workspace, registers the agent in agents.yaml and project_registry, generates the systemd service/env file, and reloads the monitor.

**Step 2: Fill in IDENTITY.md**

Write to `agents/-workspace/IDENTITY.md`:

```markdown
# Agent Identity

- **Name**:
- **Agent ID**:
- **Role**: <role description provided by the user>
- **Project**:
- **Created**: <today's date>

## Scope
<responsibilities provided by the user, one per line>

## Communication Style
<personality/communication style provided by the user>
```

**Step 3: Fill in SOUL.md**

Write to `agents/-workspace/SOUL.md`:

```markdown
# - Core Personality

## Beliefs
<2-3 core beliefs derived from the role the user described>

## Behavior Rules
- Follow shared best practices from public memory
- Respect memory visibility boundaries (public/project/private)
- Log important decisions to memory for team awareness
<2-3 additional role-specific behavior rules>

## Communication Style
<the user's communication style, expanded into 2-3 concrete sentences>
```

**Step 4: For a new project, register it in project_registry.yaml**

If Round 3 produced a new project, edit `skills/mem0-integration/project_registry.yaml`:

```yaml
:
  name: "<project name>"
  description: "<project description>"
  members:
    - ""
    - "main"
  owner: "main"
```

**Step 5: Create openclaw.json**

This is the most critical step. Copy main's configuration and modify it:

```bash
cp /root/.openclaw/openclaw.json /root/.openclaw-/openclaw.json
```

**Fields that must be changed (field mapping table):**

| JSON path | main's value | New agent's value |
|-----------|----------|----------------|
| `agents.list[0].id` | `"main"` | `""` |
| `agents.defaults.workspace` | `"/root/.openclaw/workspace"` | `"/root/.openclaw/workspace/agents/-workspace"` |
| `channels.telegram.botToken` | `"7047245486:AAF..."` | `"<token provided by the user>"` |
| `gateway.port` | `18789` | `<allocated port>` |
| `gateway.controlUi.allowedOrigins[2]` | `"http://100.115.94.1:18789"` | **`"http://100.115.94.1:<port>"` (must match this agent's gateway.port)** |
| `gateway.controlUi.dangerouslyDisableDeviceAuth` | `true` | **Keep `true`** (otherwise opening the Control UI from the Tailscale IP reports "device identity required" and the browser device must be paired first) |
| `gateway.controlUi.allowInsecureAuth` | absent or `true` | **`true` recommended** (same as main; for HTTP access from a non-localhost address this option is needed to work around the browser being unable to generate a device key, otherwise "device identity required" is still reported) |
| `plugins.entries.mem0-integration.config.agent_id` | `"main"` | `""` |

⚠️ **Control UI access**: if `allowedOrigins[2]` is not changed to this agent's port, visiting `http://100.115.94.1:<port>/` reports **"origin not allowed"** and the pairing page cannot be opened. When creating openclaw.json, always change `gateway.port` and `gateway.controlUi.allowedOrigins[2]` together.

**Fields left unchanged (inherited from main):**

- `models`: use the same model configuration
- `auth`: use the same authentication
- `memory`: use the qmd backend
- `skills`: inherit tavily, find-skills-robin, mem0-integration
- `plugins.load.paths`: may be kept, or changed to the agent's own skills path

**Step 6: Start the service**

```bash
# Local agent:
export XDG_RUNTIME_DIR=/run/user/$(id -u)
systemctl --user start openclaw-gateway-.service
# Check status:
systemctl --user status openclaw-gateway-.service
```

**Step 7: Verify**

```bash
cd /root/.openclaw/workspace
./deploy.sh health
```

### 11.4 Completion Checklist (🔧 confirm each item, then inform the user)

```
□ onboard.sh ran successfully
□ Registered in agents.yaml
□ Registered in project_registry.yaml (with main as a member)
□ IDENTITY.md filled in with role/scope
□ SOUL.md filled in with personality/behavior rules
□ openclaw.json created, with fields changed:
  □ agents.list[0].id =
  □ agents.defaults.workspace points to the agent workspace
  □ channels.telegram.botToken uses the new token
  □ gateway.port does not conflict with another agent
  □ gateway.controlUi.allowedOrigins[2] = "http://100.115.94.1:<this agent's port>" (otherwise the Control UI reports origin not allowed)
  □ gateway.controlUi.dangerouslyDisableDeviceAuth = true (otherwise device identity required is reported)
  □ gateway.controlUi.allowInsecureAuth = true (recommended when opening the UI over HTTP from a Tailscale/LAN IP)
  □ plugins.entries.mem0-integration.config.agent_id is correct
□ systemd service started
□ deploy.sh health passes everything
□ Telegram Bot pairing completed (confirmed by the user)
```

### 11.5 Sample Dialogue

```
User: Help me create a new cryptocurrency analysis agent
陈医生: Sure. Let's settle the basics first:
  1. What should the Agent ID be? Lowercase English is recommended, e.g. crypto
  2. What is the display name?
User: The ID is crypto, and the name is "加密分析师"
陈医生: Got it. What exactly is this agent's role? What is it responsible for?
  For communication style, do you want it professional and rigorous, or relaxed and lively?
User: It handles market monitoring, strategy analysis, and risk alerts. Professional but concise.
陈医生: Understood. For project membership, the existing projects are:
  - advert (advertising business)
  - global (shared)
  Do you need a new crypto project?
User: Create a new crypto project
陈医生: OK. Next I need a dedicated Telegram Bot Token.
  If you don't have one yet:
  1. Search for @BotFather in Telegram → send /newbot
  2. Set the name and username
  3. Send me the token it returns
User: 1234567890:ABCdef...
陈医生: One last thing: deploy on this server or remotely?
User: This server
陈医生: OK, starting deployment...
  [runs Steps 1-7]
  Deployment complete! Every checklist item passed.
  Please send a message to @your_bot on Telegram to test it.
```

### 11.6 Telegram Pairing (must be completed by the user)

When a new agent uses `dmPolicy: pairing`, **pairing must be completed by the user in Telegram and the Control UI**; the main agent cannot do it on the user's behalf.

**Standard steps (give these to the user):**

1. In Telegram, search for the agent's Bot (e.g. @xxx_bot) and send `/start`
2. Open the agent's Control UI: `http://100.115.94.1:<port>/` (the port is the agent's gateway.port)
3. If **"origin not allowed"** appears: `gateway.controlUi.allowedOrigins[2]` in the agent's `openclaw.json` is not set to `http://100.115.94.1:<port>`. The main agent must correct it and restart the agent's service
4. If **"device identity required"** appears: when the UI is reached via `http://100.115.94.1:<port>`, the browser is not in a secure context (only HTTPS or localhost qualifies) and cannot generate a device key. There are two options:
   - **Option A (recommended)**: open the Control UI with a **URL that carries the token**: `http://100.115.94.1:<port>/?token=<this agent's gateway.auth.token>`, which bypasses the device check.
   - **Option B**: under `gateway.controlUi` in the agent's `openclaw.json`, set both `"dangerouslyDisableDeviceAuth": true` and `"allowInsecureAuth": true`, save, and restart the agent's service (trusted internal networks only).
5. In the Control UI, open the Devices/pairing page and approve the **Telegram user's** pairing request (this pairs the Bot with the user and is unrelated to the browser device in the previous step)
6. Once pairing is complete, the Bot can reply to direct messages

What the main agent can do: provide the steps above, give the Control UI address, and, when an origin error appears, correct the configuration and ask the user to retry.

---

## 12. Playbook: Agent Maintenance & Troubleshooting (Routine Maintenance)

### 12.1 Routine Health Check (🔧)

When the user asks "how are the agents doing", or during a Heartbeat:

```bash
./deploy.sh health   # global health check
./deploy.sh status   # detailed service status
```

Interpret the results and report to the user:

- ✓ means healthy
- ✗ means the service is not running → try a restart
- ⚠ means a resource warning → report the specific values

### 12.2 Troubleshooting an Unresponsive Agent

```
Step 1: Check whether the service is running
  systemctl --user status openclaw-gateway-.service

Step 2: If inactive → check the logs
  journalctl --user -u openclaw-gateway- -n 50 --no-pager

Step 3: Common problems and fixes:
  - "Address already in use" → port conflict; check gateway.port in openclaw.json
  - "Cannot find module" → openclaw version problem; run ./deploy.sh fix-service
  - "ECONNREFUSED" → Qdrant is not running; check docker ps | grep qdrant
  - "API key invalid" → check the API key in systemd/-gateway.env
  - **"origin not allowed" (Control UI will not open)** → gateway.controlUi.allowedOrigins[2] in the agent's openclaw.json must be "http://100.115.94.1:<this agent's port>"; after changing it run systemctl --user restart openclaw-gateway-.service
  - **"device identity required" (Control UI demands device pairing)** → over HTTP from a non-localhost address the browser cannot generate a device key. Fix: ① use a URL with the token: `http://100.115.94.1:<port>/?token=`; or ② under gateway.controlUi in the agent's openclaw.json set both `"dangerouslyDisableDeviceAuth": true` and `"allowInsecureAuth": true`, save, and restart the agent's service (trusted internal networks only).

Step 4: Restart
  systemctl --user restart openclaw-gateway-.service

Step 5: Still failing → collect logs for the user
  journalctl --user -u openclaw-gateway- -n 200 --no-pager > /tmp/agent-debug.log
```

### 12.3 Recovery After an OpenClaw Upgrade

After the user upgrades OpenClaw through the UI, custom configuration may be lost:

```bash
./deploy.sh fix-service   # re-inject EnvironmentFile into the systemd services
./deploy.sh restart       # restart all services so the configuration takes effect
./deploy.sh health        # confirm everything is back to normal
```

Report the result of the repair to the user.

### 12.4 Listing Agents

```bash
python3 scripts/parse_agents.py list
```

Output format: `\t\t`. Format it as a table when showing it to the user.

### 12.5 Debug Mode

When the user needs to debug an agent:

```bash
./deploy.sh debug-stop    # stop all services (including the monitor, to prevent auto-restart)
# ... user debugs ...
./deploy.sh debug-start   # restore all services
```

---

## 13. Playbook: Memory Management

### 13.1 Publishing Shared Knowledge (🔧)

When the user says "share this best practice with all agents":

```bash
python3 skills/mem0-integration/mem0_integration.py publish \
  '{"content":"<knowledge content>", "visibility":"public", "category":"knowledge"}'
```

When the user says "share this information with a project":

```bash
python3 skills/mem0-integration/mem0_integration.py publish \
  '{"content":"<content>", "visibility":"project", "project_id":"<project>", "category":"knowledge"}'
```

### 13.2 Viewing Memory Statistics

```bash
python3 skills/mem0-integration/memory_cleanup.py --dry-run
```

Report the memory counts per agent, per type, and per visibility to the user.

### 13.3 Cleaning Up Expired Memories

```bash
# Dry run first:
python3 skills/mem0-integration/memory_cleanup.py --dry-run --max-age-days 90
# After confirmation, execute (the --execute flag is documented in Section 15.3):
python3 skills/mem0-integration/memory_cleanup.py --execute --max-age-days 90
```

### 13.4 Preloading Knowledge for a New Agent (Cold Start)

After a new agent is created, public knowledge can be preloaded for it:

```bash
python3 skills/mem0-integration/mem0_integration.py cold_start \
  '{"agent_id":"", "user_id":"wang_yuanzhang", "top_k":10}'
```

### 13.5 Checking Memory Visibility

When the user asks "can a given agent see this memory":

1. Determine the memory's `visibility` and `project_id`
2. Read `project_registry.yaml` to confirm whether the agent is in that project's members list
3. Visibility rules:
   - `public` → visible to all agents
   - `project` → visible only to project members
   - `private` → visible only to the writer

---

## 14. Playbook: Interactive Offboarding (Removing an Agent)

### 14.1 Information Gathering (🗣️)

```
To collect:
1. agent_id: ID of the agent to remove
2. keep_data: whether to keep its data (workspace, profile, Qdrant memories)
```

Sample question: "Which agent should be removed? Do you want to keep its data? (Keeping it allows restoring later.)"

🔧 Show the current agent list first:

```bash
python3 scripts/parse_agents.py list
```

### 14.2 Safety Checks (🔧)

```
□ Confirm it is not the main agent (main cannot be removed)
□ Confirm the agent exists in agents.yaml
□ Confirm again with the user: "Are you sure you want to remove  ()? This stops the service and deletes it from the registry."
```

### 14.3 Execution (🔧)

```bash
cd /root/.openclaw/workspace/templates
# Full removal (including data):
./offboard.sh
# Unregister only (keep data):
./offboard.sh --keep-data
```

The script prompts interactively (y/N); enter y to confirm.

### 14.4 Post-Removal Report

Report to the user:

```
Agent  () has been removed:
- Service: stopped and uninstalled
- agents.yaml: removed
- project_registry: removed
- Workspace: <deleted / kept>
- Qdrant memories: <deleted / kept>
- Monitor: reloaded
```

Run `./deploy.sh health` to confirm the system is healthy.

---

## 15. Playbook: Backup & Cleanup

### 15.1 Backup Commands

| Command | Description |
|------|------|
| `./deploy.sh backup` | Full backup (workspace + Qdrant snapshot + agent profiles + docker-compose) |
| `./deploy.sh backup quick` | Quick backup (workspace files only, no Qdrant) |
| `bash scripts/10-create-backup.sh` | Standalone backup script (includes mem0 config + agents.yaml + Qdrant snapshot) |

**Backup retention**: the 10 most recent backups are kept automatically; older backups are deleted automatically.

**Backup directory layout**:

```
/root/.openclaw/backups//
├── workspace.tar.gz                   # Layer 1+2: all MD and configuration files
├── .openclaw__openclaw.json           # main agent profile
├── .openclaw-tongge__openclaw.json    # secondary agent profiles (if any)
├── docker-compose.yml                 # Qdrant docker configuration
├── qdrant-mem0_v4_shared.snapshot     # Layer 4 vector data (full mode)
├── qdrant-point-count.txt             # point count at backup time (for verification)
└── manifest.txt                       # backup manifest
```

### 15.2 Restore Commands

| Command | Description |
|------|------|
| `./deploy.sh restore ` | Restore workspace files + agent profiles |
| `./deploy.sh restore-qdrant ` | Restore Qdrant vector data |

A quick backup is created automatically before restoring, and interactive confirmation (y/N) is required.

### 15.3 Memory Cleanup

Cleanup script: `skills/mem0-integration/memory_cleanup.py`

| Command | Description |
|------|------|
| `python3 memory_cleanup.py --dry-run` | Report memory statistics by dimension + list the number of expired memories (no deletion) |
| `python3 memory_cleanup.py --execute --max-age-days 90` | Actually delete expired memories |

**Retention policy** (aligned with `EXPIRATION_MAP` in `mem0_client.py`):

- `session`: expires after 7 days
- `chat_summary`: expires after 30 days
- `preference`: kept permanently
- `knowledge`: kept permanently

`--max-age-days` acts as a hard upper bound: session/chat_summary memories older than this many days are deleted regardless of their expiration_date. preference and knowledge memories are never cleaned up automatically.

**Audit log**: every cleanup is written to `logs/security/memory-cleanup-.log`.

### 15.4 Automation via Cron

Install script: `scripts/setup-cron.sh`

```bash
./scripts/setup-cron.sh          # install the scheduled jobs
./scripts/setup-cron.sh remove   # remove the scheduled jobs
./scripts/setup-cron.sh status   # show the current jobs
```

**Schedule**:

| Time | Job |
|------|------|
| Daily 02:00 | `./deploy.sh backup`: full backup |
| Sundays 03:00 | `memory_cleanup.py --execute --max-age-days 90`: clean up expired memories |

Logs are written to `logs/system/cron-backup.log` and `logs/system/cron-cleanup.log`.

### 15.5 Interactive Backup and Restore Flow (🗣️)

Conversation flow when the user asks for a backup or a restore:

**Backup**:

```
陈医生: "What type of backup do you need?"
  1. Full backup (includes Qdrant vector data, recommended)
  2. Quick backup (workspace files only)

→ Run the matching command; report the backup path and the Qdrant point count
→ Advice: always run a full backup before major changes
```

**Restore**:

```
陈医生: "Which backup should I restore?"
→ List the backups available under /root/.openclaw/backups/
→ Show manifest.txt so the user can confirm
→ Restore the workspace first: ./deploy.sh restore
→ If a Qdrant snapshot exists and the user confirms: ./deploy.sh restore-qdrant
→ After restoring, run ./deploy.sh restart + ./deploy.sh health
→ Compare qdrant-point-count.txt with the current point count
```

---

## 16. Playbook: Server Migration

### 16.1 Pre-Migration Preparation (🗣️)

Information gathering:

```
To confirm:
1. target_server: target server address (IP or Tailscale hostname)
2. target_user: user name on the target server (usually root)
3. keep_source: whether to keep the source server's data after migration
4. tailscale: whether the target server has already joined the Tailscale network
```

Sample question: "Which server are we migrating to? Is Tailscale already installed there? Should the source server's data be kept afterwards?"

### 16.2 Source Server: Full Backup (🔧)

```bash
cd /root/.openclaw/workspace
./deploy.sh backup
```

Verify the backup is complete:

```bash
ls -la /root/.openclaw/backups//
cat /root/.openclaw/backups//manifest.txt
cat /root/.openclaw/backups//qdrant-point-count.txt
```

### 16.3 Transfer to the Target Server (🔧)

```bash
BACKUP_DIR="/root/.openclaw/backups/"
TARGET="root@"

rsync -avzP "$BACKUP_DIR" "$TARGET:/root/.openclaw/backups/"
rsync -avzP /root/.openclaw/workspace/ "$TARGET:/root/.openclaw/workspace/" --exclude='.git' --exclude='logs'
rsync -avzP /root/.openclaw/openclaw.json "$TARGET:/root/.openclaw/"
```

Secondary agent profiles (if any):

```bash
for d in /root/.openclaw-*/; do
  agent_name=$(basename "$d")
  rsync -avzP "$d" "$TARGET:/root/$agent_name/"
done
```

### 16.4 Target Server: Install Infrastructure (🔧)

```bash
# 1. Install Node.js (v24+) and OpenClaw
curl -fsSL https://get.openclaw.com | bash

# 2. Install Docker + Qdrant
mkdir -p /opt/mem0-center && cd /opt/mem0-center
# Restore docker-compose.yml from the backup
cp /root/.openclaw/backups//docker-compose.yml .
docker compose up -d

# 3. Wait for Qdrant to start
sleep 5
curl -sf http://localhost:6333/collections | python3 -c "import sys,json; print(json.dumps(json.load(sys.stdin),indent=2))"

# 4. Restore the Qdrant data
cd /root/.openclaw/workspace
./deploy.sh restore-qdrant /root/.openclaw/backups//qdrant-mem0_v4_shared.snapshot

# 5. Install Python dependencies
pip3 install qdrant-client mem0ai pyyaml

# 6. Install the system services
./deploy.sh install
```

### 16.5 Verification (🔧)

```bash
# Service status
./deploy.sh health

# Compare Qdrant data
curl -sf http://localhost:6333/collections/mem0_v4_shared | python3 -c "import sys,json; d=json.load(sys.stdin); print(f'Points: {d[\"result\"][\"points_count\"]}')"
# Compare with the source server's qdrant-point-count.txt

# Memory retrieval test
cd /root/.openclaw/workspace/skills/mem0-integration
python3 mem0_integration.py search "test query" --agent-id main

# Telegram connectivity
# Send a test message to the bot on Telegram
```

### 16.6 Post-Migration Checklist

```
□ All agent services are running (deploy.sh health all green)
□ Qdrant point count matches the source server
□ Memory retrieval returns results
□ Telegram bot replies normally
□ Cron jobs installed (scripts/setup-cron.sh)
□ Environment variables set (MEM0_DASHSCOPE_API_KEY etc.)
□ Monitor service running (systemctl status openclaw-agent-monitor)
□ Tailscale joined (if remote agents need to connect)
□ Source server data handled (kept/cleaned up)
```

### 16.7 Rollback Plan

If the migration fails:

```
1. On the source server, run ./deploy.sh debug-start to restore services
2. On the target server, run ./deploy.sh debug-stop to stop all services
3. Investigate the problem, then try again
```

---

## 17. Skill/Plugin Management SOP

### 17.1 Skill vs Plugin Selection Guide

OpenClaw has two extension loading mechanisms. Choose between them as follows:

| Type | How it loads | Where it is configured | When to use |
|------|----------|----------|----------|
| **Built-in Skill** | Discovered automatically by OpenClaw | `skills.entries.` | Skills built into the Clawhub marketplace (e.g. `find-skills-robin`) |
| **Custom Plugin** | Path specified manually | `plugins.load.paths` + `plugins.entries.` | In-house tools (tavily), lifecycle hooks (mem0), any extension that needs custom code |

**Decision rules:**

- If you only need to toggle a built-in Clawhub feature -> `skills.entries`
- If you have your own `openclaw.plugin.json` + `index.js` -> `plugins`
- If you need a lifecycle hook (runs automatically before/after a conversation) -> must be `plugins`
- **Do not** enable the same skill in both `skills.entries` and `plugins.entries`

**Files a plugin requires:**

```
/root/.openclaw/workspace/skills//
├── openclaw.plugin.json    # plugin manifest (required)
├── index.js                # tool/hook implementation (required)
├── CONFIG_SUMMARY.md       # configuration notes (recommended)
└── TEST_REPORT.md          # test report (recommended)
```

### 17.2 Staged Release Process

Every new skill must be validated on the main agent first, then deployed to secondary agents.

**Stage 1 -- Install the code**

1. Put the skill code in `/root/.openclaw/workspace/skills//`
2. Make sure `openclaw.plugin.json` exists (with id, name, kind, main, tools/configSchema)
3. Make sure `index.js` exists (exporting `register`/`activate` and the tool definitions)

**Stage 2 -- Enable and test on Main**

1. In main's `openclaw.json`:
   - add `"/root/.openclaw/workspace/skills/"` to `plugins.load.paths`
   - set `plugins.entries.` to `{ "enabled": true }` (fill in config as well, if any)
2. Restart the main gateway: `systemctl --user restart openclaw-gateway.service`
3. Check the logs to confirm the plugin loaded: `journalctl --user -u openclaw-gateway -n 50 | grep -i `
4. Test the feature by messaging main over Telegram

**Stage 3 -- Review**

Complete the review using `templates/SKILL_REVIEW_TEMPLATE.md`, covering:

| Review dimension | What to check |
|----------|----------|
| Security | API key handling (environment variables vs hard-coded), scope of network requests, file reads/writes, privilege escalation |
| Functionality | Whether the agent can call it correctly, whether results are accurate, whether error handling is reasonable |
| Performance | Response time, concurrent calls, impact on the agent's overall latency |
| Best practices | Recommended parameters, suitable use cases, known limitations; record them in `CONFIG_SUMMARY.md` |

**Stage 4 -- Roll out to secondary agents**

1. The skill code lives in the shared workspace, so nothing needs to be copied
2. In the secondary agent's `openclaw.json`:
   - add the same path to `plugins.load.paths`
   - enable `plugins.entries.` (mind agent-specific configuration; for example, mem0's `agent_id` must be changed to that agent's ID)
3. Restart the secondary agent's gateway
4. Verify that the plugin loads and works

### 17.3 Current Skill Inventory

| Skill ID | Type | How it loads | Main | Tongge | Notes |
|---------|------|----------|------|--------|------|
| `find-skills-robin` | built-in | `skills.entries` | enabled | enabled | Clawhub skill discovery |
| `mem0-integration` | lifecycle | `skills.entries` + `plugins` | enabled | enabled | Memory system (agent_id must differ per agent) |
| `tavily` | tool | `plugins` | enabled | enabled | AI search (shared API key) |
| `active-learning` | built-in | `skills.entries` | -- | enabled | Active learning (tongge only) |
| `memos-cloud-openclaw-plugin` | built-in | `plugins.entries` | enabled | enabled | Memos cloud plugin |
| `qwen-portal-auth` | built-in | `plugins.entries` | enabled | enabled | Qwen Portal OAuth |

> **Maintenance requirement:** update this table whenever a skill is added or removed.

### 17.4 Agent-Specific Configuration Notes

Some plugins need different configuration on different agents:

| Plugin | Setting that differs | Main | Tongge |
|--------|---------------|------|--------|
| `mem0-integration` | `config.agent_id` | `"main"` | `"tongge"` |
| `mem0-integration` | `config.user_id` | `"wang院长"` | `"wang院长"` |

Always check these settings when deploying to a new agent.

---

## Changelog

| Version | Date | Changes |
|---------|------|---------|
| 1.0 | 2026-03-06 | Initial version: hub-and-spoke model, templates, remote support |
| 1.1 | 2026-03-06 | Config-driven architecture: agents.yaml as single registry; automated onboard/offboard; parse_agents.py helper; life agent (张大师) removed; main is only active agent |
| 1.2 | 2026-03-06 | Code review + bug fixes (7 items): `parse_agents.py` output now shell-safe quoted; `agent-monitor.js` constructor ordering fixed (ensureLogDir before loadMonitoredServices) and fallback uses full `openclaw` path; `deploy.sh` switched `grep -qP` to `grep -qE` for portability; `offboard.sh` Qdrant delete uses `FilterSelector` wrapper; `onboard.sh`/`offboard.sh` inline Python rewritten with `sys.argv` to prevent shell injection; `openclaw-agent-monitor.service` added `/run/user/0` to `ReadWritePaths` for D-Bus access; removed corrupted trailing bytes in `offboard.sh` |
| 2.0 | 2026-03-06 | Added operational playbooks (Part B): Interactive Onboarding (Sec 11, with conversation flow, field mapping table, port allocation, checklist, dialog example), Agent Maintenance & Troubleshooting (Sec 12), Memory Management (Sec 13), Interactive Offboarding (Sec 14). Document restructured into Part A (Architecture Reference) and Part B (Operational Playbooks). |
| 2.1 | 2026-03-06 | Added Backup & Cleanup Playbook (Sec 15): backup/restore commands, memory cleanup with retention policy, cron automation, interactive dialogue flow. Added Server Migration Playbook (Sec 16): step-by-step migration with pre/post checklist, Qdrant snapshot recovery, rollback plan. |
| 2.2 | 2026-03-09 | Added Skill/Plugin Management SOP (Sec 17): skill vs plugin selection guide, staged release workflow (main-first), current skill inventory, agent-specific config notes. Unified tavily loading to plugin mode across all agents. |