# Multi-Agent Management Architecture
**Version:** 2.0
**Date:** 2026-03-06
**Maintainer:** Eason
**Current state:** Main (陈医生) is the only active agent. The life agent (张大师) has been removed. All agents are defined in `agents.yaml`.
> **For Main Agent (陈医生):** You are the Hub Agent. This document is both an architecture reference and your operations manual.
> When the user asks you to create, maintain, troubleshoot, or remove an agent, jump to the corresponding Playbook section (Sections 11-17) and follow the steps.
---
## 1. Hub-and-Spoke Model
Main agent acts as the **memory hub** -- responsible for publishing shared knowledge,
maintaining the project registry, and onboarding new agents. All other agents (local or
remote) are **spokes** that consume shared memory and contribute their own private/project
memories.
```
Main Agent (Hub) - defined in agents.yaml
|-- publish_knowledge() --> Qdrant mem0_v4_shared (visibility=public)
|-- publish_knowledge(project_id=X) --> (visibility=project)
|-- maintain project_registry.yaml
|-- maintain docs & best practices
|
+-- Local Spokes (same server, same Qdrant)
| |-- local-cli: main (openclaw gateway)
| |-- local-systemd: <agent_id> (port 187XX)
|
+-- Remote Spokes (Tailscale VPN -> Qdrant)
+-- remote-http: <agent_id> (health via HTTP)
```
---
## 2. Memory Visibility Model
All agents share one Qdrant collection: `mem0_v4_shared`.
Isolation is achieved through metadata fields.
| Visibility | Who can read | Metadata filter |
|-----------|-------------|-----------------|
| public | All agents | `visibility=public` |
| project | Same project members | `visibility=project, project_id=X` |
| private | Only the writing agent | `visibility=private, agent_id=X` |
Project membership is defined in `skills/mem0-integration/project_registry.yaml`.
Main agent is registered as member of all projects for audit access.
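The visibility rules above can be sketched as a read filter for a querying agent. This is a plain-dict illustration of the filter logic, not the actual qdrant-client API; the field names (`visibility`, `project_id`, `agent_id`) come from the table above.

```python
# Sketch: build the read filter from the Section 2 visibility table.
# Plain dicts for illustration only -- not the qdrant-client filter API.
def visibility_filter(agent_id: str, project_ids: list[str]) -> dict:
    """Match memories the agent may read: public, its projects, or its own."""
    return {
        "should": [
            {"key": "visibility", "match": {"value": "public"}},
            {"must": [
                {"key": "visibility", "match": {"value": "project"}},
                {"key": "project_id", "match": {"any": project_ids}},
            ]},
            {"must": [
                {"key": "visibility", "match": {"value": "private"}},
                {"key": "agent_id", "match": {"value": agent_id}},
            ]},
        ]
    }

f = visibility_filter("crypto", ["crypto", "global"])
print(len(f["should"]))  # three clauses: public / project / private
```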
---
## 3. Agent Registry (agents.yaml)
**Path:** `/root/.openclaw/workspace/agents.yaml`
This file is the **single source of truth** for all agent definitions. All tooling reads from it dynamically:
| Consumer | Purpose |
|----------|---------|
| `deploy.sh` | Service management (start/stop/debug/fix) |
| `agent-monitor.js` | Health monitoring |
| `local_search.py` | Agent lookup for search |
| `memory_cleanup.py` | Agent-aware cleanup |
| `onboard.sh` / `offboard.sh` | Add/remove agents |
**Helper script:** `scripts/parse_agents.py` parses agents.yaml for bash/JS:
```bash
python3 scripts/parse_agents.py list # list agent IDs
python3 scripts/parse_agents.py info <id> # get agent info as KEY=VALUE (shell-safe quoted)
python3 scripts/parse_agents.py services # list all agents with service details (tab-separated)
python3 scripts/parse_agents.py ids # space-separated agent IDs (for bash loops)
```
> **Note:** The `info` subcommand outputs single-quoted values (`KEY='value'`) that are safe
> for `eval` in bash, even when values contain spaces, CJK characters, or special shell
> metacharacters. The `services` subcommand uses tab (`\t`) as the delimiter to avoid
> collisions with `|` or spaces in command strings.
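The shell-safe quoting described in the note can be reproduced with `shlex.quote`; the real `parse_agents.py` may quote differently, so treat this as a minimal sketch.

```python
# Sketch of eval-safe KEY='value' output using shlex.quote.
# The real parse_agents.py implementation may differ.
import shlex

def emit_info(info: dict) -> str:
    """Emit KEY=value lines safe to `eval` in bash, even with spaces/CJK."""
    return "\n".join(f"{k}={shlex.quote(str(v))}" for k, v in info.items())

# Safe values stay unquoted; values with spaces or CJK get single quotes.
print(emit_info({"AGENT_ID": "crypto", "AGENT_NAME": "加密 分析师"}))
```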
**Agent types supported:**
| Type | Description |
|------|-------------|
| `local-cli` | Managed via `openclaw gateway` CLI (main agent) |
| `local-systemd` | Managed via user-level systemd unit |
| `remote-http` | Remote agent checked via HTTP health endpoint |
---
## 4. Agent Lifecycle
### 4.1 Onboard (create)
```bash
cd /root/.openclaw/workspace/templates
./onboard.sh <agent_id> <agent_name> <project_id> [qdrant_host]
```
**Fully automated.** This script:
1. Creates workspace at `agents/<agent_id>-workspace/` (IDENTITY.md, SOUL.md, mem0 config)
2. Registers the agent in `agents.yaml`
3. Registers in `project_registry.yaml`
4. For local agents: generates systemd service + env file, installs, enables
5. Reloads `openclaw-agent-monitor` so it picks up the new agent
**Examples:**
```bash
./onboard.sh crypto "CryptoBot" crypto # local agent
./onboard.sh remote1 "RemoteBot" advert 100.115.94.1 # remote agent
```
**Remaining manual steps (local-systemd):** Edit IDENTITY.md, create `~/.openclaw-<agent_id>/openclaw.json`, then start the service.
### 4.2 Offboard (retire)
```bash
cd /root/.openclaw/workspace/templates
./offboard.sh <agent_id> [--keep-data]
```
**Options:**
- (default) Full removal: stops service, removes from agents.yaml and project_registry, deletes workspace, profile, and Qdrant memories
- `--keep-data` Unregister only: keeps workspace and profile files
**Examples:**
```bash
./offboard.sh crypto # full removal
./offboard.sh crypto --keep-data # keep files, just unregister
```
The main (hub) agent cannot be offboarded.
---
## 5. Knowledge Publishing
Main agent can publish best practices and shared knowledge to Qdrant:
**Via Python:**
```python
from mem0_client import mem0_client
await mem0_client.start()
await mem0_client.publish_knowledge(
content="Always use EnvironmentFile= in systemd services for upgrade safety",
category="knowledge",
visibility="public",
)
```
**Via CLI:**
```bash
python3 mem0_integration.py publish '{"content":"...", "visibility":"public"}'
```
**Via Node.js plugin (index.js):**
The `publish` action is available through the same spawn interface used by `search` and `add`.
### Visibility Guidelines
| Content type | Visibility | Example |
|-------------|-----------|---------|
| System best practices | public | "Use deploy.sh fix-service after upgrades" |
| Project-specific knowledge | project | "{agent_id} uses Google Calendar API" |
| User preferences | private | "User prefers dark mode" |
| API keys, secrets | NEVER store | Use environment variables |
---
## 6. Cold Start Preload
When a new session starts, `session_init.py` calls `cold_start_search()` which
retrieves memories in three phases:
1. **Phase 0 (public)**: Best practices, shared config -- available to all agents
2. **Phase 1 (project)**: Project-specific guidelines -- based on agent's project membership
3. **Phase 2 (private)**: Agent's own recent context
Results are deduplicated, ordered by phase priority, and injected into the System Prompt.
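The dedup-and-order step can be sketched as a phase-ordered merge; the memory strings below are illustrative, not real stored memories.

```python
# Sketch of the three-phase cold-start merge: phase priority is preserved
# (public -> project -> private) and duplicate texts are dropped.
def merge_phases(public: list[str], project: list[str], private: list[str]) -> list[str]:
    seen, ordered = set(), []
    for phase in (public, project, private):  # phase 0 -> 1 -> 2
        for text in phase:
            if text not in seen:
                seen.add(text)
                ordered.append(text)
    return ordered

print(merge_phases(
    ["use EnvironmentFile="],                               # phase 0: public
    ["advert uses Calendar API", "use EnvironmentFile="],   # phase 1: dup dropped
    ["last session: fixed monitor"],                        # phase 2: private
))
# ['use EnvironmentFile=', 'advert uses Calendar API', 'last session: fixed monitor']
```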
---
## 7. Local Agent Configuration
Local agents run on the same server and connect to Qdrant at `localhost:6333`.
Key configuration points:
- `openclaw.json`: `collection_name: "mem0_v4_shared"` (NOT agent-specific collections)
- `systemd/<agent_id>-gateway.env`: contains `MEM0_DASHSCOPE_API_KEY`
- `EnvironmentFile=` in the service unit references the env file
---
## 8. Remote Agent Configuration
Remote agents run on different servers and connect to Qdrant via Tailscale.
### Prerequisites
1. Tailscale installed and joined to the same tailnet on both servers
2. Qdrant accessible at the hub server's Tailscale IP (e.g., `100.115.94.1:6333`)
3. Tailscale ACL allows the remote server to access port 6333
### Environment File
```
MEM0_QDRANT_HOST=100.115.94.1
MEM0_DASHSCOPE_API_KEY=sk-...
OPENAI_API_BASE=https://dashscope.aliyuncs.com/compatible-mode/v1
OPENAI_BASE_URL=https://dashscope.aliyuncs.com/compatible-mode/v1
```
### Onboarding
```bash
./onboard.sh remote1 "RemoteBot" advert 100.115.94.1
```
The 4th argument sets `MEM0_QDRANT_HOST` in the generated env file. The agent is automatically added to `agents.yaml` and the monitor picks it up on reload.
### Monitoring
The monitor reads from `agents.yaml` dynamically. Remote agents (type `remote-http`) are checked via their `health_url`. Remote agents cannot be auto-started from the hub; the monitor will only alert on failure.
---
## 9. Agent Monitor Service Hardening
The `openclaw-agent-monitor.service` runs as a system-level systemd service with the following security constraints:
| Directive | Value | Purpose |
|-----------|-------|---------|
| `ProtectSystem` | `strict` | Mounts entire filesystem read-only |
| `ProtectHome` | `read-only` | Home directory is read-only |
| `ReadWritePaths` | `/root/.openclaw/workspace/logs /run/user/0` | Whitelist for writes: log output + D-Bus for `systemctl --user` |
| `NoNewPrivileges` | `true` | Cannot gain new privileges |
| `MemoryMax` | `512M` | OOM guard |
| `CPUQuota` | `20%` | Prevent monitor from starving other processes |
**Why `/run/user/0`?** The monitor uses `systemctl --user start/stop` to manage gateway processes, which requires D-Bus access at the user runtime directory. Without this path whitelisted, `ProtectSystem=strict` would block the D-Bus socket and prevent auto-restart.
**Initialization order in `agent-monitor.js`:**
1. `loadConfig()` -- read `openclaw.json`
2. `ensureLogDir()` -- create log directory (must happen before any `this.log()` calls)
3. `loadMonitoredServices()` -- parse `agents.yaml` (may log errors on failure)
4. Signal handlers + start monitoring loop
---
## 10. File Reference
| File | Purpose |
|------|---------|
| `agents.yaml` | Single source of truth for agent registry |
| `scripts/parse_agents.py` | Parses agents.yaml for bash/JS consumers |
| `skills/mem0-integration/mem0_client.py` | Core client: search, write, publish, cold_start |
| `skills/mem0-integration/mem0_integration.py` | CLI interface: init, search, add, publish, cold_start |
| `skills/mem0-integration/session_init.py` | Three-phase cold start hook |
| `skills/mem0-integration/project_registry.yaml` | Agent-to-project membership |
| `templates/onboard.sh` | Automated agent onboarding (adds to agents.yaml, installs service, reloads monitor) |
| `templates/offboard.sh` | Clean one-command agent removal |
| `templates/agent-workspace/` | Workspace file templates |
| `templates/systemd/` | Service and env file templates |
| `agent-monitor.js` | Config-driven health monitor (reads agents.yaml) |
| `deploy.sh` | Service management (reads agents.yaml) |
| `docs/EXTENSIONS_ARCHITECTURE.md` | Systemd, monitor, upgrade safety |
| `docs/MEMORY_ARCHITECTURE.md` | Four-layer memory system detail |
---
---
---
# PART B: Operational Playbooks (Main Agent Operations Manual)
> **Sections 11-17 below are step-by-step guides for the Main Agent (陈医生) to execute during conversations.**
> When the user says "create a new agent for me", "check agent status", "clean up memories", or "remove an agent",
> follow the corresponding section. Each step is marked as a question to ask the user (🗣) or an action you execute yourself (🔧).
---
## 11. Playbook: Interactive Onboarding (Creating a New Agent)
When the user says "I want to create a new agent" or expresses a similar intent, follow this flow.
### 11.1 Information Gathering (🗣 ask the user step by step)
Collect the information in the following order. Ask only 1-2 questions per round; do not list everything at once.
**Round 1: Basic identity**
```
To collect:
1. agent_id - lowercase English identifier, no spaces (e.g. crypto, hr_bot, advert_pm)
2. agent_name - display name, may be Chinese (e.g. "加密分析师", "HR助手")
```
Example prompt: "What ID should the new agent use? (lowercase English, e.g. crypto) And what display name?"
**Round 2: Role definition**
```
To collect:
3. role - one-sentence role description (e.g. "cryptocurrency market analysis and investment strategy assistant")
4. scope - responsibilities, 2-5 items (e.g. "market monitoring, strategy analysis, risk alerts")
5. personality - personality/communication style (e.g. "professional, rigorous, data-driven, moderately humorous")
```
Example prompt: "What is this agent's role? What is it responsible for? What communication style should it have?"
**Round 3: Project membership**
```
To collect:
6. project_id - owning project (existing: advert, global; or create new)
7. new_project - if new, the project name and description
```
First show the existing projects: read `skills/mem0-integration/project_registry.yaml`
Example prompt: "Which project does this agent belong to? Existing projects: advert (advertising), global (shared). Should we create a new one?"
**Round 4: Telegram Bot**
```
To collect:
8. bot_token - Telegram Bot Token
```
If the user does not have a token yet, provide these instructions:
```
Steps to create a Telegram Bot:
1. Search for @BotFather in Telegram and send /newbot
2. Enter the bot display name when prompted (e.g. CryptoBot)
3. Enter the bot username (must end with Bot, e.g. openclaw_crypto_bot)
4. BotFather returns a token (format: 1234567890:ABCdef...)
5. Send that token to me
```
**Round 5: Deployment type**
```
To collect:
9. deploy_type - local (localhost) or remote (Tailscale IP)
10. qdrant_host - the Tailscale IP address, if remote
```
Example prompt: "Will this agent run on this server or remotely? If remote, what is the Tailscale IP?"
### 11.2 Port Allocation Rules
| Port | Use |
|------|-----|
| 18789 | main agent (in use) |
| 18790 | 2nd local agent |
| 18791 | 3rd local agent |
| ... | increments by one |
🔧 Automatic allocation: read the count of registered agents from `agents.yaml`; port = 18789 + count.
Remote agents do not need a port allocated on this server.
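The allocation rule above can be sketched as follows; the registry entries are illustrative stand-ins for what `agents.yaml` defines, not its real schema.

```python
# Sketch of the port-allocation rule: port = 18789 + registered-agent count.
# The registry entries below are illustrative, not the real agents.yaml schema.
BASE_PORT = 18789

def next_port(agents: list[dict]) -> int:
    """Return the gateway port for the next local agent."""
    return BASE_PORT + len(agents)

registry = [
    {"id": "main", "type": "local-cli"},       # occupies 18789
    {"id": "crypto", "type": "local-systemd"}, # occupies 18790
]
print(next_port(registry))  # -> 18791
```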
### 11.3 Execution Phase (🔧 run in order)
Once all information is collected, execute the following steps. **Report progress to the user after each step.**
**Step 1: Run onboard.sh**
```bash
cd /root/.openclaw/workspace/templates
# Local agent:
./onboard.sh <agent_id> "<agent_name>" <project_id>
# Remote agent:
./onboard.sh <agent_id> "<agent_name>" <project_id> <qdrant_host>
```
This automatically creates the workspace, registers the agent in agents.yaml and project_registry,
generates the systemd service/env files, and reloads the monitor.
**Step 2: Fill in IDENTITY.md**
Write to `agents/<agent_id>-workspace/IDENTITY.md`:
```markdown
# Agent Identity
- **Name**: <agent_name>
- **Agent ID**: <agent_id>
- **Role**: <role description provided by the user>
- **Project**: <project_id>
- **Created**: <today's date>
## Scope
<one line per responsibility provided by the user>
## Communication Style
<personality/communication style described by the user>
```
**Step 3: Fill in SOUL.md**
Write to `agents/<agent_id>-workspace/SOUL.md`:
```markdown
# <agent_name> - Core Personality
## Beliefs
<derive 2-3 core beliefs from the role the user described>
## Behavior Rules
- Follow shared best practices from public memory
- Respect memory visibility boundaries (public/project/private)
- Log important decisions to memory for team awareness
<add 2-3 role-specific behavior rules>
## Communication Style
<expand the user's style description into 2-3 concrete sentences>
```
**Step 4: If this is a new project, register it in project_registry.yaml**
If Round 3 produced a new project, edit `skills/mem0-integration/project_registry.yaml`:
```yaml
<project_id>:
  name: "<project name>"
  description: "<project description>"
  members:
    - "<agent_id>"
    - "main"
  owner: "main"
```
**Step 5: Create openclaw.json**
This is the most critical step. Copy main's configuration and modify it:
```bash
cp /root/.openclaw/openclaw.json /root/.openclaw-<agent_id>/openclaw.json
```
**Fields that must be changed (field mapping table):**
| JSON path | main's value | New agent's value |
|-----------|--------------|-------------------|
| `agents.list[0].id` | `"main"` | `"<agent_id>"` |
| `agents.defaults.workspace` | `"/root/.openclaw/workspace"` | `"/root/.openclaw/workspace/agents/<agent_id>-workspace"` |
| `channels.telegram.botToken` | `"7047245486:AAF..."` | `"<token provided by the user>"` |
| `gateway.port` | `18789` | `<allocated port>` |
| `gateway.controlUi.allowedOrigins[2]` | `"http://100.115.94.1:18789"` | **`"http://100.115.94.1:<port>"` (must match this agent's gateway.port)** |
| `gateway.controlUi.dangerouslyDisableDeviceAuth` | `true` | **keep `true`** (otherwise opening the Control UI from the Tailscale IP reports "device identity required" and a browser device must be paired first) |
| `gateway.controlUi.allowInsecureAuth` | absent or `true` | **recommended `true`** (same as main; required for non-localhost HTTP access, where the browser cannot generate a device key, otherwise "device identity required" is still reported) |
| `plugins.entries.mem0-integration.config.agent_id` | `"main"` | `"<agent_id>"` |
**Control UI access**: if `allowedOrigins[2]` is not changed to this agent's port, visiting `http://100.115.94.1:<port>/` fails with **"origin not allowed"** and the pairing page cannot open. When creating openclaw.json, always change `gateway.port` and `gateway.controlUi.allowedOrigins[2]` together.
**Fields to keep unchanged (inherited from main's configuration):**
- `models` -- same model configuration
- `auth` -- same authentication
- `memory` -- qmd backend
- `skills` -- inherits tavily, find-skills-robin, mem0-integration
- `plugins.load.paths` -- may be kept, or changed to the agent's own skills path
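The Step 5 field edits can be scripted. The nested keys below mirror the mapping table; treat the exact JSON layout as an assumption about the real file, not a guaranteed schema.

```python
# Sketch: apply the Step 5 field mapping to a copied openclaw.json.
# The nested structure is assumed from the mapping table above.
import json

def patch_config(cfg: dict, agent_id: str, bot_token: str, port: int) -> dict:
    cfg["agents"]["list"][0]["id"] = agent_id
    cfg["agents"]["defaults"]["workspace"] = (
        f"/root/.openclaw/workspace/agents/{agent_id}-workspace"
    )
    cfg["channels"]["telegram"]["botToken"] = bot_token
    cfg["gateway"]["port"] = port
    # Keep allowedOrigins[2] in lockstep with gateway.port,
    # otherwise the Control UI reports "origin not allowed".
    cfg["gateway"]["controlUi"]["allowedOrigins"][2] = f"http://100.115.94.1:{port}"
    cfg["plugins"]["entries"]["mem0-integration"]["config"]["agent_id"] = agent_id
    return cfg

main_cfg = {
    "agents": {"list": [{"id": "main"}], "defaults": {"workspace": "/root/.openclaw/workspace"}},
    "channels": {"telegram": {"botToken": "OLD"}},
    "gateway": {"port": 18789, "controlUi": {"allowedOrigins": ["a", "b", "http://100.115.94.1:18789"]}},
    "plugins": {"entries": {"mem0-integration": {"config": {"agent_id": "main"}}}},
}
# json round-trip deep-copies so main's config dict is left untouched
patched = patch_config(json.loads(json.dumps(main_cfg)), "crypto", "NEW_TOKEN", 18790)
print(patched["gateway"]["controlUi"]["allowedOrigins"][2])  # http://100.115.94.1:18790
```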
**Step 6: Start the service**
```bash
# Local agent:
export XDG_RUNTIME_DIR=/run/user/$(id -u)
systemctl --user start openclaw-gateway-<agent_id>.service
# Check status:
systemctl --user status openclaw-gateway-<agent_id>.service
```
**Step 7: Verify**
```bash
./deploy.sh health
```
### 11.4 Completion Checklist (🔧 confirm each item, then report to the user)
```
□ onboard.sh ran successfully
□ registered in agents.yaml
□ registered in project_registry.yaml (with main as a member)
□ IDENTITY.md filled in with role/responsibilities
□ SOUL.md filled in with personality/behavior rules
□ openclaw.json created, fields modified:
  □ agents.list[0].id = <agent_id>
  □ agents.defaults.workspace points to the agent workspace
  □ channels.telegram.botToken uses the new token
  □ gateway.port does not conflict with other agents
  □ gateway.controlUi.allowedOrigins[2] = "http://100.115.94.1:<agent port>" (otherwise the Control UI reports "origin not allowed")
  □ gateway.controlUi.dangerouslyDisableDeviceAuth = true (otherwise "device identity required" is reported)
  □ gateway.controlUi.allowInsecureAuth = true (recommended when opening the UI over HTTP from a Tailscale/LAN IP)
  □ plugins.entries.mem0-integration.config.agent_id is correct
□ systemd service started
□ deploy.sh health all green
□ Telegram Bot pairing completed (confirmed by the user)
```
### 11.5 Telegram Pairing Notes (must be completed by the user)
When a new agent uses `dmPolicy: pairing`, **pairing must be completed by the user in Telegram and the Control UI**; the main agent cannot do it on the user's behalf.
**Standard steps (give these to the user):**
1. In Telegram, find the agent's bot (e.g. @xxx_bot) and send `/start`
2. Open the agent's Control UI: `http://100.115.94.1:<port>/` (the port is that agent's gateway.port)
3. If **"origin not allowed"** appears: the agent's `openclaw.json` has `gateway.controlUi.allowedOrigins[2]` not set to `http://100.115.94.1:<port>`; the main agent must fix it and restart the agent's service
4. If **"device identity required"** appears: when accessed via `http://100.115.94.1:<port>` the browser is not in a secure context (only HTTPS or localhost qualify) and cannot generate a device key. Two options:
   - **Option A (recommended)**: open the Control UI with a **tokenized URL**: `http://100.115.94.1:<port>/?token=<agent gateway.auth.token>`, which bypasses the device check.
   - **Option B**: in the agent's `openclaw.json`, under `gateway.controlUi`, set both `"dangerouslyDisableDeviceAuth": true` and `"allowInsecureAuth": true`, save, and restart the agent's service (trusted internal networks only).
5. In the Control UI, open the Devices/pairing page and approve the **Telegram user's** pairing request (this pairs the bot with the user, independent of the browser device in the previous step)
6. Once paired, the bot replies to direct messages normally
What the main agent can do: provide the steps above, give the Control UI address, and when an origin error appears, fix the configuration and ask the user to retry.
### 11.6 Example Conversation
```
User: Create a new cryptocurrency analysis agent for me
Eason: Sure. First the basics:
       1. What agent ID? Lowercase English recommended, e.g. crypto
       2. What display name?
User: ID crypto, display name "加密分析师"
Eason: Got it. What is this agent's role, and what is it responsible for?
       For communication style, do you want it rigorous or casual?
User: Market monitoring, strategy analysis, and risk alerts. Professional but concise.
Eason: Understood. For project membership, the existing projects are:
       - advert (advertising)
       - global (shared)
       Should I create a new crypto project?
User: New crypto project
Eason: OK. Next we need a dedicated Telegram Bot Token.
       If you don't have one yet:
       1. Search @BotFather in Telegram -> send /newbot
       2. Set the name and username
       3. Send me the returned token
User: 1234567890:ABCdef...
Eason: Last question: deploy on this server or remotely?
User: This server
Eason: Great, starting deployment...
[Execute Steps 1-7]
Deployment complete! All checklist items passed.
Please send a message to your bot in Telegram to test it.
```
---
## 12. Playbook: Agent Maintenance & Troubleshooting
### 12.1 Routine Health Check (🔧)
When the user asks "how are the agents doing" or during a Heartbeat run:
```bash
./deploy.sh health # global health check
./deploy.sh status # detailed service status
```
Interpret the results and report to the user:
- ✓ healthy
- ✗ service not running -> attempt a restart
- ⚠ resource warning -> report the specific numbers
### 12.2 Unresponsive Agent Triage
```
Step 1: Check whether the service is running
  systemctl --user status openclaw-gateway-<agent_id>.service
Step 2: If inactive -> check the logs
  journalctl --user -u openclaw-gateway-<agent_id> -n 50 --no-pager
Step 3: Common problems and fixes:
- "Address already in use" -> port conflict; check gateway.port in openclaw.json
- "Cannot find module" -> openclaw version problem; run ./deploy.sh fix-service
- "ECONNREFUSED" -> Qdrant not running; check docker ps | grep qdrant
- "API key invalid" -> check the API key in systemd/<agent_id>-gateway.env
- "origin not allowed" (Control UI won't open) -> gateway.controlUi.allowedOrigins[2] in the agent's openclaw.json must be "http://100.115.94.1:<agent port>"; after fixing, run systemctl --user restart openclaw-gateway-<agent_id>.service
- "device identity required" (Control UI demands device pairing) -> over HTTP to a non-localhost address the browser cannot generate a device key. Fix: (1) use a tokenized URL: http://100.115.94.1:<port>/?token=<gateway.auth.token>; or (2) in the agent's openclaw.json, under gateway.controlUi, set both "dangerouslyDisableDeviceAuth": true and "allowInsecureAuth": true, save, and restart the agent's service (trusted internal networks only).
Step 4: Restart
  systemctl --user restart openclaw-gateway-<agent_id>.service
Step 5: Still failing -> collect logs for the user
  journalctl --user -u openclaw-gateway-<agent_id> -n 200 --no-pager > /tmp/agent-debug.log
```
### 12.3 Recovery After an OpenClaw Upgrade
After the user upgrades OpenClaw through the UI, custom configuration may be lost:
```bash
./deploy.sh fix-service # re-inject EnvironmentFile into the systemd services
./deploy.sh restart # restart all services to apply the configuration
./deploy.sh health # confirm recovery
```
Report the repair result to the user.
### 12.4 Listing Agents
```bash
python3 scripts/parse_agents.py list
```
Output format: `<id>\t<type>\t<name>`; format it as a table when showing it to the user.
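The tab-separated output can be turned into a table before showing it to the user; the sample lines below are illustrative, not real `parse_agents.py` output.

```python
# Sketch: format tab-separated `parse_agents.py list` lines as a markdown
# table. The sample input is illustrative only.
def to_table(raw: str) -> str:
    rows = [line.split("\t") for line in raw.strip().splitlines()]
    out = ["| ID | Type | Name |", "|----|------|------|"]
    out += [f"| {i} | {t} | {n} |" for i, t, n in rows]
    return "\n".join(out)

sample = "main\tlocal-cli\t陈医生\ncrypto\tlocal-systemd\tCryptoBot"
print(to_table(sample))
```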
### 12.5 Debug Mode
When the user needs to debug an agent:
```bash
./deploy.sh debug-stop # stop all services (including the monitor, to prevent auto-restart)
# ... user debugs ...
./deploy.sh debug-start # restore all services
```
---
## 13. Playbook: Memory Management
### 13.1 Publishing Shared Knowledge (🔧)
When the user says "share this best practice with all agents":
```bash
python3 skills/mem0-integration/mem0_integration.py publish \
'{"content":"<knowledge>", "visibility":"public", "category":"knowledge"}'
```
When the user says "share this with a specific project":
```bash
python3 skills/mem0-integration/mem0_integration.py publish \
'{"content":"<content>", "visibility":"project", "project_id":"<project>", "category":"knowledge"}'
```
### 13.2 Memory Statistics
```bash
python3 skills/mem0-integration/memory_cleanup.py --dry-run
```
Report memory counts per agent, per type, and per visibility to the user.
### 13.3 Cleaning Up Expired Memories
```bash
# Dry-run first:
python3 skills/mem0-integration/memory_cleanup.py --dry-run --max-age-days 90
# After confirmation:
python3 skills/mem0-integration/memory_cleanup.py --max-age-days 90
```
### 13.4 Preloading Knowledge for a New Agent (Cold Start)
After a new agent is created, public knowledge can be preloaded for it:
```bash
python3 skills/mem0-integration/mem0_integration.py cold_start \
'{"agent_id":"<agent_id>", "user_id":"wang_yuanzhang", "top_k":10}'
```
### 13.5 Checking Memory Visibility
When the user asks "can agent X see this memory":
1. Determine the memory's `visibility` and `project_id`
2. Check `project_registry.yaml` to confirm whether the agent is in that project's members list
3. Visibility rules:
- `public` -> visible to all agents
- `project` -> visible to project members only
- `private` -> visible to the writing agent only
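The three rules above can be sketched as a single check; the registry contents here are illustrative, since real membership lives in `project_registry.yaml`.

```python
# Sketch of the 13.5 visibility rules. The registry dict is illustrative;
# real membership is defined in project_registry.yaml.
def can_see(memory: dict, agent_id: str, registry: dict) -> bool:
    v = memory["visibility"]
    if v == "public":
        return True  # visible to all agents
    if v == "project":
        members = registry.get(memory["project_id"], {}).get("members", [])
        return agent_id in members  # project members only
    if v == "private":
        return memory["agent_id"] == agent_id  # writer only
    return False

registry = {"advert": {"members": ["main", "adbot"]}}
print(can_see({"visibility": "project", "project_id": "advert"}, "adbot", registry))  # True
print(can_see({"visibility": "private", "agent_id": "main"}, "adbot", registry))      # False
```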
---
## 14. Playbook: Interactive Offboarding (Removing an Agent)
### 14.1 Information Gathering (🗣)
```
To collect:
1. agent_id - the agent to remove
2. keep_data - whether to keep its data (workspace, profile, Qdrant memories)
```
Example prompt: "Which agent should be removed? Keep its data? (Kept data allows later recovery.)"
🔧 First show the current agent list:
```bash
python3 scripts/parse_agents.py list
```
### 14.2 Safety Checks (🔧)
```
□ Confirm it is not the main agent (main cannot be removed)
□ Confirm the agent exists in agents.yaml
□ Confirm again with the user: "Really remove <agent_name> (<agent_id>)? This stops the service and deletes it from the registry."
```
### 14.3 Execution (🔧)
```bash
cd /root/.openclaw/workspace/templates
# Full removal (including data):
./offboard.sh <agent_id>
# Unregister only (keep data):
./offboard.sh <agent_id> --keep-data
```
The script asks for interactive confirmation (y/N); enter y to proceed.
### 14.4 Post-Removal Report
Report to the user:
```
Agent <agent_name> (<agent_id>) removed:
- Service: stopped and uninstalled
- agents.yaml: removed
- project_registry: removed
- Workspace: <deleted / kept>
- Qdrant memories: <deleted / kept>
- Monitor: reloaded
```
Run `./deploy.sh health` to confirm the system is healthy.
---
## 15. Playbook: Backup & Cleanup
### 15.1 Backup Commands
| Command | Description |
|---------|-------------|
| `./deploy.sh backup` | Full backup (workspace + Qdrant snapshot + agent profiles + docker-compose) |
| `./deploy.sh backup quick` | Quick backup (workspace files only, no Qdrant) |
| `bash scripts/10-create-backup.sh` | Standalone backup script (includes mem0 config + agents.yaml + Qdrant snapshot) |
**Retention policy**: the 10 most recent backups are kept automatically; older backups are deleted.
**Backup directory layout**:
```
/root/.openclaw/backups/<TIMESTAMP>/
├── workspace.tar.gz # all Layer 1+2 MD and config files
├── .openclaw__openclaw.json # main agent profile
├── .openclaw-tongge__openclaw.json # secondary agent profiles (if any)
├── docker-compose.yml # Qdrant docker configuration
├── qdrant-mem0_v4_shared.snapshot # Layer 4 vector data (full mode)
├── qdrant-point-count.txt # point count at backup time (for verification)
└── manifest.txt # backup manifest
```
### 15.2 Restore Commands
| Command | Description |
|---------|-------------|
| `./deploy.sh restore <backup-dir>` | Restore workspace files + agent profiles |
| `./deploy.sh restore-qdrant <snapshot-file>` | Restore Qdrant vector data |
A quick backup is created automatically before any restore, and interactive confirmation (y/N) is required.
### 15.3 Memory Cleanup
Cleanup script: `skills/mem0-integration/memory_cleanup.py`
| Command | Description |
|---------|-------------|
| `python3 memory_cleanup.py --dry-run` | Count memories by dimension + list the number of expired memories (no deletion) |
| `python3 memory_cleanup.py --execute --max-age-days 90` | Actually delete expired memories |
**Retention policy** (aligned with `EXPIRATION_MAP` in `mem0_client.py`):
- `session`: expires after 7 days
- `chat_summary`: expires after 30 days
- `preference`: kept forever
- `knowledge`: kept forever
`--max-age-days` acts as a hard cap: session/chat_summary memories older than that many days are deleted regardless of their expiration_date. preference and knowledge memories are never cleaned automatically.
**Audit log**: each cleanup run writes to `logs/security/memory-cleanup-<date>.log`
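The per-record cleanup decision can be sketched as follows, matching the retention policy above; the field names are illustrative assumptions about the cleanup script's internals.

```python
# Sketch of the cleanup decision for one memory record, following the
# 15.3 retention policy. Field names are illustrative assumptions.
from datetime import datetime, timedelta

NEVER_EXPIRE = {"preference", "knowledge"}

def should_delete(category: str, created_at: datetime,
                  now: datetime, max_age_days: int) -> bool:
    """Apply the --max-age-days hard cap to expirable categories."""
    if category in NEVER_EXPIRE:
        return False  # preference/knowledge are never auto-cleaned
    return now - created_at > timedelta(days=max_age_days)

now = datetime(2026, 3, 9)
old = now - timedelta(days=120)
print(should_delete("session", old, now, 90))    # True: past the 90-day cap
print(should_delete("knowledge", old, now, 90))  # False: kept forever
```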
### 15.4 Cron Automation
Installer script: `scripts/setup-cron.sh`
```bash
./scripts/setup-cron.sh # install scheduled jobs
./scripts/setup-cron.sh remove # remove scheduled jobs
./scripts/setup-cron.sh status # show current jobs
```
**Schedule**:
| Time | Job |
|------|-----|
| Daily 02:00 | `./deploy.sh backup` (full backup) |
| Sunday 03:00 | `memory_cleanup.py --execute --max-age-days 90` (clean expired memories) |
Logs go to `logs/system/cron-backup.log` and `logs/system/cron-cleanup.log`
### 15.5 Interactive Backup/Restore Flow (🗣)
Conversation flow when the user requests a backup or restore:
**Backup**:
```
陈医生: "What type of backup do you need?"
1. Full backup (includes Qdrant vector data, recommended)
2. Quick backup (workspace files only)
-> run the corresponding command, report the backup path and Qdrant point count
-> Recommendation: always take a full backup before major changes
```
**Restore**:
```
陈医生: "Which backup should be restored?"
-> list the backups available under /root/.openclaw/backups/
-> show the manifest.txt contents for the user to confirm
-> restore the workspace first: ./deploy.sh restore <dir>
-> if a Qdrant snapshot exists and the user confirms: ./deploy.sh restore-qdrant <file>
-> after restoring, run ./deploy.sh restart + ./deploy.sh health
-> compare qdrant-point-count.txt with the current point count
```
---
## 16. Playbook: Server Migration
### 16.1 Pre-Migration Preparation (🗣)
Information to gather:
```
To confirm:
1. target_server - target server address (IP or Tailscale hostname)
2. target_user - target server username (usually root)
3. keep_source - whether to keep the source server's data after migration
4. tailscale - whether the target server has joined the Tailscale network
```
Example prompt: "Which server are we migrating to? Is Tailscale installed there? Should the source server's data be kept afterwards?"
### 16.2 Source Server: Full Backup (🔧)
```bash
cd /root/.openclaw/workspace
./deploy.sh backup
```
Verify backup integrity:
```bash
ls -la /root/.openclaw/backups/<TIMESTAMP>/
cat /root/.openclaw/backups/<TIMESTAMP>/manifest.txt
cat /root/.openclaw/backups/<TIMESTAMP>/qdrant-point-count.txt
```
### 16.3 Transfer to the Target Server (🔧)
```bash
BACKUP_DIR="/root/.openclaw/backups/<TIMESTAMP>"
TARGET="root@<target_server>"
rsync -avzP "$BACKUP_DIR" "$TARGET:/root/.openclaw/backups/"
rsync -avzP /root/.openclaw/workspace/ "$TARGET:/root/.openclaw/workspace/" --exclude='.git' --exclude='logs'
rsync -avzP /root/.openclaw/openclaw.json "$TARGET:/root/.openclaw/"
```
Secondary agent profiles (if any):
```bash
for d in /root/.openclaw-*/; do
agent_name=$(basename "$d")
rsync -avzP "$d" "$TARGET:/root/$agent_name/"
done
```
### 16.4 Target Server: Install Infrastructure (🔧)
```bash
# 1. Install Node.js (v24+) and OpenClaw
curl -fsSL https://get.openclaw.com | bash
# 2. Install Docker + Qdrant
mkdir -p /opt/mem0-center && cd /opt/mem0-center
# Restore docker-compose.yml from the backup
cp /root/.openclaw/backups/<TIMESTAMP>/docker-compose.yml .
docker compose up -d
# 3. Wait for Qdrant to start
sleep 5
curl -sf http://localhost:6333/collections | python3 -c "import sys,json; print(json.dumps(json.load(sys.stdin),indent=2))"
# 4. Restore Qdrant data
cd /root/.openclaw/workspace
./deploy.sh restore-qdrant /root/.openclaw/backups/<TIMESTAMP>/qdrant-mem0_v4_shared.snapshot
# 5. Install Python dependencies
pip3 install qdrant-client mem0ai pyyaml
# 6. Install system services
./deploy.sh install
```
### 16.5 Verification (🔧)
```bash
# Service status
./deploy.sh health
# Qdrant data comparison
curl -sf http://localhost:6333/collections/mem0_v4_shared | python3 -c "import sys,json; d=json.load(sys.stdin); print(f'Points: {d[\"result\"][\"points_count\"]}')"
# Compare with the source server's qdrant-point-count.txt
# Memory retrieval test
cd /root/.openclaw/workspace/skills/mem0-integration
python3 mem0_integration.py search "test query" --agent-id main
# Telegram connectivity
# Send a test message to the bot on Telegram
```
### 16.6 Post-Migration Checklist
```
□ All agent services running (deploy.sh health all green)
□ Qdrant point count matches the source server
□ Memory search returns results normally
□ Telegram bot replies normally
□ Cron jobs installed (scripts/setup-cron.sh)
□ Environment variables set (MEM0_DASHSCOPE_API_KEY etc.)
□ Monitor service running (systemctl status openclaw-agent-monitor)
□ Joined Tailscale (if remote agents need to connect)
□ Source server data handled (kept/cleaned)
```
### 16.7 Rollback Plan
If the migration fails:
```
1. On the source server, run ./deploy.sh debug-start to restore services
2. On the target server, run ./deploy.sh debug-stop to stop all services
3. Investigate the problem, then retry
```
---
## 17. Skill/Plugin Management SOP
### 17.1 Skill vs Plugin Selection Guide
OpenClaw has two extension-loading mechanisms; choose between them as follows:
| Type | Loading | Config location | Use cases |
|------|---------|-----------------|-----------|
| **Built-in Skill** | auto-discovered by OpenClaw | `skills.entries.<id>` | Clawhub marketplace built-in skills (e.g. `find-skills-robin`) |
| **Custom Plugin** | path specified manually | `plugins.load.paths` + `plugins.entries.<id>` | in-house tools (tavily), lifecycle hooks (mem0), any extension needing custom code |
**Decision rules:**
- To simply toggle a built-in Clawhub feature -> `skills.entries`
- If it has its own `openclaw.plugin.json` + `index.js` -> `plugins`
- If it needs lifecycle hooks (automatic execution before/after conversations) -> must be `plugins`
- Do **not** enable the same skill in both `skills.entries` and `plugins.entries`
**Required plugin files:**
```
/root/.openclaw/workspace/skills/<id>/
├── openclaw.plugin.json # plugin manifest (required)
├── index.js # tool/hook implementation (required)
├── CONFIG_SUMMARY.md # configuration docs (recommended)
└── TEST_REPORT.md # test report (recommended)
```
### 17.2 Staged Release Workflow
Every new skill must be verified on the main agent before being deployed to secondary agents.
**Stage 1 -- Install the code**
1. Place the skill code in `/root/.openclaw/workspace/skills/<id>/`
2. Ensure `openclaw.plugin.json` exists (with id, name, kind, main, tools/configSchema)
3. Ensure `index.js` exists (exporting `register`/`activate` and the tool definitions)
**Stage 2 -- Enable and test on main**
1. In main's `openclaw.json`:
- add `"/root/.openclaw/workspace/skills/<id>"` to `plugins.load.paths`
- set `plugins.entries.<id>` to `{ "enabled": true }` (include config if any)
2. Restart the main gateway: `systemctl --user restart openclaw-gateway.service`
3. Check the logs to confirm the plugin loaded: `journalctl --user -u openclaw-gateway -n 50 | grep -i <id>`
4. Message main via Telegram to test the functionality
**Stage 3 -- Review**
Complete a review using `templates/SKILL_REVIEW_TEMPLATE.md`, covering:
| Dimension | Checks |
|-----------|--------|
| Security | API key handling (environment variables vs hard-coded), network request scope, file reads/writes, privilege escalation |
| Functionality | can the agent call it correctly, are results accurate, is error handling reasonable |
| Performance | response time, concurrent calls, impact on overall agent latency |
| Best practices | recommended parameters, applicable scenarios, known limitations; record in `CONFIG_SUMMARY.md` |
**Stage 4 -- Roll out to secondary agents**
1. The skill code lives in the shared workspace; no copying is needed
2. In each secondary agent's `openclaw.json`:
- add the same path to `plugins.load.paths`
- enable `plugins.entries.<id>` (mind agent-specific config; e.g. mem0's `agent_id` must be changed to that agent's ID)
3. Restart the secondary agent's gateway
4. Verify the plugin loads and works correctly
### 17.3 Current Skill Inventory
| Skill ID | Type | Loading | Main | Tongge | Notes |
|----------|------|---------|------|--------|-------|
| `find-skills-robin` | built-in | `skills.entries` | enabled | enabled | Clawhub skill discovery |
| `mem0-integration` | lifecycle | `skills.entries` + `plugins` | enabled | enabled | memory system (agent_id must differ) |
| `tavily` | tool | `plugins` | enabled | enabled | AI search (shared API key) |
| `active-learning` | built-in | `skills.entries` | -- | enabled | active learning (tongge only) |
| `memos-cloud-openclaw-plugin` | built-in | `plugins.entries` | enabled | enabled | Memos cloud plugin |
| `qwen-portal-auth` | built-in | `plugins.entries` | enabled | enabled | Qwen Portal OAuth |
> **Maintenance requirement:** update this table whenever a skill is added or removed.
### 17.4 Agent-Specific Configuration Notes
Some plugins require different configuration per agent:
| Plugin | Config key | Main | Tongge |
|--------|-----------|------|--------|
| `mem0-integration` | `config.agent_id` | `"main"` | `"tongge"` |
| `mem0-integration` | `config.user_id` | `"wang院长"` | `"wang院长"` |
Always check these keys when deploying to a new agent.
---
## Changelog
| Version | Date | Changes |
|---------|------|---------|
| 1.0 | 2026-03-06 | Initial version: hub-and-spoke model, templates, remote support |
| 1.1 | 2026-03-06 | Config-driven architecture: agents.yaml as single registry; automated onboard/offboard; parse_agents.py helper; life agent (张大师) removed; main is only active agent |
| 1.2 | 2026-03-06 | Code review + bug fixes (7 items): `parse_agents.py` output now shell-safe quoted; `agent-monitor.js` constructor ordering fixed (ensureLogDir before loadMonitoredServices) and fallback uses full `openclaw` path; `deploy.sh` switched `grep -qP` to `grep -qE` for portability; `offboard.sh` Qdrant delete uses `FilterSelector` wrapper; `onboard.sh`/`offboard.sh` inline Python rewritten with `sys.argv` to prevent shell injection; `openclaw-agent-monitor.service` added `/run/user/0` to `ReadWritePaths` for D-Bus access; removed corrupted trailing bytes in `offboard.sh` |
| 2.0 | 2026-03-06 | Added operational playbooks (Part B): Interactive Onboarding (Sec 11, with conversation flow, field mapping table, port allocation, checklist, dialog example), Agent Maintenance & Troubleshooting (Sec 12), Memory Management (Sec 13), Interactive Offboarding (Sec 14). Document restructured into Part A (Architecture Reference) and Part B (Operational Playbooks). |
| 2.1 | 2026-03-06 | Added Backup & Cleanup Playbook (Sec 15): backup/restore commands, memory cleanup with retention policy, cron automation, interactive dialogue flow. Added Server Migration Playbook (Sec 16): step-by-step migration with pre/post checklist, Qdrant snapshot recovery, rollback plan. |
| 2.2 | 2026-03-09 | Added Skill/Plugin Management SOP (Sec 17): skill vs plugin selection guide, staged release workflow (main-first), current skill inventory, agent-specific config notes. Unified tavily loading to plugin mode across all agents. |