# Multi-Agent Management Architecture
**Version:** 2.0
**Date:** 2026-03-06
**Maintainer:** Eason
**Current state:** Main (陈医生) is the only active agent. The life agent (张大师) has been removed. All agents are defined in `agents.yaml`.
> **For Main Agent (陈医生):** You are the Hub Agent. This document is both an architecture reference and your operations manual.
> When the user asks you to create, maintain, troubleshoot, or remove an agent, jump to the corresponding Playbook section (Sections 11-17) and follow the steps.
---
## 1. Hub-and-Spoke Model
Main agent acts as the **memory hub** -- responsible for publishing shared knowledge,
maintaining the project registry, and onboarding new agents. All other agents (local or
remote) are **spokes** that consume shared memory and contribute their own private/project
memories.
```
Main Agent (Hub) - defined in agents.yaml
|-- publish_knowledge() --> Qdrant mem0_v4_shared (visibility=public)
|-- publish_knowledge(project_id=X) --> (visibility=project)
|-- maintain project_registry.yaml
|-- maintain docs & best practices
|
+-- Local Spokes (same server, same Qdrant)
| |-- local-cli: main (openclaw gateway)
| |-- local-systemd: <agent_id> (port 187XX)
|
+-- Remote Spokes (Tailscale VPN -> Qdrant)
+-- remote-http: <agent_id> (health via HTTP)
```
---
## 2. Memory Visibility Model
All agents share one Qdrant collection: `mem0_v4_shared`.
Isolation is achieved through metadata fields.
| Visibility | Who can read | Metadata filter |
|-----------|-------------|-----------------|
| public | All agents | `visibility=public` |
| project | Same project members | `visibility=project, project_id=X` |
| private | Only the writing agent | `visibility=private, agent_id=X` |
Project membership is defined in `skills/mem0-integration/project_registry.yaml`.
Main agent is registered as member of all projects for audit access.
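The visibility rules above can be sketched as a read filter for a querying agent. This is a plain-dict illustration of the filter logic, not the actual qdrant-client API; the field names (`visibility`, `project_id`, `agent_id`) come from the table above.

```python
# Sketch: build the read filter from the Section 2 visibility table.
# Plain dicts for illustration only -- not the qdrant-client filter API.
def visibility_filter(agent_id: str, project_ids: list[str]) -> dict:
    """Match memories the agent may read: public, its projects, or its own."""
    return {
        "should": [
            {"key": "visibility", "match": {"value": "public"}},
            {"must": [
                {"key": "visibility", "match": {"value": "project"}},
                {"key": "project_id", "match": {"any": project_ids}},
            ]},
            {"must": [
                {"key": "visibility", "match": {"value": "private"}},
                {"key": "agent_id", "match": {"value": agent_id}},
            ]},
        ]
    }

f = visibility_filter("crypto", ["crypto", "global"])
print(len(f["should"]))  # three clauses: public / project / private
```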
---
## 3. Agent Registry (agents.yaml)
**Path:** `/root/.openclaw/workspace/agents.yaml`
This file is the **single source of truth** for all agent definitions. All tooling reads from it dynamically:
| Consumer | Purpose |
|----------|---------|
| `deploy.sh` | Service management (start/stop/debug/fix) |
| `agent-monitor.js` | Health monitoring |
| `local_search.py` | Agent lookup for search |
| `memory_cleanup.py` | Agent-aware cleanup |
| `onboard.sh` / `offboard.sh` | Add/remove agents |
**Helper script:** `scripts/parse_agents.py` parses agents.yaml for bash/JS:
```bash
python3 scripts/parse_agents.py list # list agent IDs
python3 scripts/parse_agents.py info <id> # get agent info as KEY=VALUE (shell-safe quoted)
python3 scripts/parse_agents.py services # list all agents with service details (tab-separated)
python3 scripts/parse_agents.py ids # space-separated agent IDs (for bash loops)
```
> **Note:** The `info` subcommand outputs single-quoted values (`KEY='value'`) that are safe
> for `eval` in bash, even when values contain spaces, CJK characters, or special shell
> metacharacters. The `services` subcommand uses tab (`\t`) as the delimiter to avoid
> collisions with `|` or spaces in command strings.
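The shell-safe quoting described in the note can be reproduced with `shlex.quote`; the real `parse_agents.py` may quote differently, so treat this as a minimal sketch.

```python
# Sketch of eval-safe KEY='value' output using shlex.quote.
# The real parse_agents.py implementation may differ.
import shlex

def emit_info(info: dict) -> str:
    """Emit KEY=value lines safe to `eval` in bash, even with spaces/CJK."""
    return "\n".join(f"{k}={shlex.quote(str(v))}" for k, v in info.items())

# Safe values stay unquoted; values with spaces or CJK get single quotes.
print(emit_info({"AGENT_ID": "crypto", "AGENT_NAME": "加密 分析师"}))
```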
**Agent types supported:**
| Type | Description |
|------|-------------|
| `local-cli` | Managed via `openclaw gateway` CLI (main agent) |
| `local-systemd` | Managed via user-level systemd unit |
| `remote-http` | Remote agent checked via HTTP health endpoint |
---
## 4. Agent Lifecycle
### 4.1 Onboard (create)
```bash
cd /root/.openclaw/workspace/templates
./onboard.sh <agent_id> <agent_name> <project_id> [qdrant_host]
```
**Fully automated.** This script:
1. Creates workspace at `agents/<agent_id>-workspace/` (IDENTITY.md, SOUL.md, mem0 config)
2. Registers the agent in `agents.yaml`
3. Registers in `project_registry.yaml`
4. For local agents: generates systemd service + env file, installs, enables
5. Reloads `openclaw-agent-monitor` so it picks up the new agent
**Examples:**
```bash
./onboard.sh crypto "CryptoBot" crypto # local agent
./onboard.sh remote1 "RemoteBot" advert 100.115.94.1 # remote agent
```
**Remaining manual steps (local-systemd):** Edit IDENTITY.md, create `~/.openclaw-<agent_id>/openclaw.json`, then start the service.
### 4.2 Offboard (retire)
```bash
cd /root/.openclaw/workspace/templates
./offboard.sh <agent_id> [--keep-data]
```
**Options:**
- (default) Full removal: stops service, removes from agents.yaml and project_registry, deletes workspace, profile, and Qdrant memories
- `--keep-data` Unregister only: keeps workspace and profile files
**Examples:**
```bash
./offboard.sh crypto # full removal
./offboard.sh crypto --keep-data # keep files, just unregister
```
The main (hub) agent cannot be offboarded.
---
## 5. Knowledge Publishing
Main agent can publish best practices and shared knowledge to Qdrant:
**Via Python:**
```python
from mem0_client import mem0_client
await mem0_client.start()
await mem0_client.publish_knowledge(
content="Always use EnvironmentFile= in systemd services for upgrade safety",
category="knowledge",
visibility="public",
)
```
**Via CLI:**
```bash
python3 mem0_integration.py publish '{"content":"...", "visibility":"public"}'
```
**Via Node.js plugin (index.js):**
The `publish` action is available through the same spawn interface used by `search` and `add`.
### Visibility Guidelines
| Content type | Visibility | Example |
|-------------|-----------|---------|
| System best practices | public | "Use deploy.sh fix-service after upgrades" |
| Project-specific knowledge | project | "{agent_id} uses Google Calendar API" |
| User preferences | private | "User prefers dark mode" |
| API keys, secrets | NEVER store | Use environment variables |
---
## 6. Cold Start Preload
When a new session starts, `session_init.py` calls `cold_start_search()` which
retrieves memories in three phases:
1. **Phase 0 (public)**: Best practices, shared config -- available to all agents
2. **Phase 1 (project)**: Project-specific guidelines -- based on agent's project membership
3. **Phase 2 (private)**: Agent's own recent context
Results are deduplicated, ordered by phase priority, and injected into the System Prompt.
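The dedup-and-order step can be sketched as a phase-ordered merge; the memory strings below are illustrative, not real stored memories.

```python
# Sketch of the three-phase cold-start merge: phase priority is preserved
# (public -> project -> private) and duplicate texts are dropped.
def merge_phases(public: list[str], project: list[str], private: list[str]) -> list[str]:
    seen, ordered = set(), []
    for phase in (public, project, private):  # phase 0 -> 1 -> 2
        for text in phase:
            if text not in seen:
                seen.add(text)
                ordered.append(text)
    return ordered

print(merge_phases(
    ["use EnvironmentFile="],                               # phase 0: public
    ["advert uses Calendar API", "use EnvironmentFile="],   # phase 1: dup dropped
    ["last session: fixed monitor"],                        # phase 2: private
))
# ['use EnvironmentFile=', 'advert uses Calendar API', 'last session: fixed monitor']
```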
---
## 7. Local Agent Configuration
Local agents run on the same server and connect to Qdrant at `localhost:6333`.
Key configuration points:
- `openclaw.json`: `collection_name: "mem0_v4_shared"` (NOT agent-specific collections)
- `systemd/<agent_id>-gateway.env`: contains `MEM0_DASHSCOPE_API_KEY`
- `EnvironmentFile=` in the service unit references the env file
---
## 8. Remote Agent Configuration
Remote agents run on different servers and connect to Qdrant via Tailscale.
### Prerequisites
1. Tailscale installed and joined to the same tailnet on both servers
2. Qdrant accessible at the hub server's Tailscale IP (e.g., `100.115.94.1:6333`)
3. Tailscale ACL allows the remote server to access port 6333
### Environment File
```
MEM0_QDRANT_HOST=100.115.94.1
MEM0_DASHSCOPE_API_KEY=sk-...
OPENAI_API_BASE=https://dashscope.aliyuncs.com/compatible-mode/v1
OPENAI_BASE_URL=https://dashscope.aliyuncs.com/compatible-mode/v1
```
### Onboarding
```bash
./onboard.sh remote1 "RemoteBot" advert 100.115.94.1
```
The 4th argument sets `MEM0_QDRANT_HOST` in the generated env file. The agent is automatically added to `agents.yaml` and the monitor picks it up on reload.
### Monitoring
The monitor reads from `agents.yaml` dynamically. Remote agents (type `remote-http`) are checked via their `health_url`. Remote agents cannot be auto-started from the hub; the monitor will only alert on failure.
---
## 9. Agent Monitor Service Hardening
The `openclaw-agent-monitor.service` runs as a system-level systemd service with the following security constraints:
| Directive | Value | Purpose |
|-----------|-------|---------|
| `ProtectSystem` | `strict` | Mounts entire filesystem read-only |
| `ProtectHome` | `read-only` | Home directory is read-only |
| `ReadWritePaths` | `/root/.openclaw/workspace/logs /run/user/0` | Whitelist for writes: log output + D-Bus for `systemctl --user` |
| `NoNewPrivileges` | `true` | Cannot gain new privileges |
| `MemoryMax` | `512M` | OOM guard |
| `CPUQuota` | `20%` | Prevent monitor from starving other processes |
**Why `/run/user/0`?** The monitor uses `systemctl --user start/stop` to manage gateway processes, which requires D-Bus access at the user runtime directory. Without this path whitelisted, `ProtectSystem=strict` would block the D-Bus socket and prevent auto-restart.
**Initialization order in `agent-monitor.js`:**
1. `loadConfig()` -- read `openclaw.json`
2. `ensureLogDir()` -- create log directory (must happen before any `this.log()` calls)
3. `loadMonitoredServices()` -- parse `agents.yaml` (may log errors on failure)
4. Signal handlers + start monitoring loop
---
## 10. File Reference
| File | Purpose |
|------|---------|
| `agents.yaml` | Single source of truth for agent registry |
| `scripts/parse_agents.py` | Parses agents.yaml for bash/JS consumers |
| `skills/mem0-integration/mem0_client.py` | Core client: search, write, publish, cold_start |
| `skills/mem0-integration/mem0_integration.py` | CLI interface: init, search, add, publish, cold_start |
| `skills/mem0-integration/session_init.py` | Three-phase cold start hook |
| `skills/mem0-integration/project_registry.yaml` | Agent-to-project membership |
| `templates/onboard.sh` | Automated agent onboarding (adds to agents.yaml, installs service, reloads monitor) |
| `templates/offboard.sh` | Clean one-command agent removal |
| `templates/agent-workspace/` | Workspace file templates |
| `templates/systemd/` | Service and env file templates |
| `agent-monitor.js` | Config-driven health monitor (reads agents.yaml) |
| `deploy.sh` | Service management (reads agents.yaml) |
| `docs/EXTENSIONS_ARCHITECTURE.md` | Systemd, monitor, upgrade safety |
| `docs/MEMORY_ARCHITECTURE.md` | Four-layer memory system detail |
---
---
---
# PART B: Operational Playbooks (Main Agent Operations Manual)
> **Sections 11-17 below are step-by-step guides for the Main Agent (陈医生) to execute during conversations.**
> When the user says "create a new agent for me", "check agent status", "clean up memories", or "remove an agent",
> follow the corresponding section. Each step is marked as a question to ask the user (🗣) or an action you execute yourself (🔧).
---
## 11. Playbook: Interactive Onboarding (Creating a New Agent)
When the user says "I want to create a new agent" or expresses a similar intent, follow this flow.
### 11.1 Information Gathering (🗣 ask the user step by step)
Collect the information in the following order. Ask only 1-2 questions per round; do not list everything at once.
**Round 1: Basic identity**
```
To collect:
1. agent_id - lowercase English identifier, no spaces (e.g. crypto, hr_bot, advert_pm)
2. agent_name - display name, may be Chinese (e.g. "加密分析师", "HR助手")
```
Example prompt: "What ID should the new agent use? (lowercase English, e.g. crypto) And what display name?"
**Round 2: Role definition**
```
To collect:
3. role - one-sentence role description (e.g. "cryptocurrency market analysis and investment strategy assistant")
4. scope - responsibilities, 2-5 items (e.g. "market monitoring, strategy analysis, risk alerts")
5. personality - personality/communication style (e.g. "professional, rigorous, data-driven, moderately humorous")
```
Example prompt: "What is this agent's role? What is it responsible for? What communication style should it have?"
**Round 3: Project membership**
```
To collect:
6. project_id - owning project (existing: advert, global; or create new)
7. new_project - if new, the project name and description
```
First show the existing projects: read `skills/mem0-integration/project_registry.yaml`
Example prompt: "Which project does this agent belong to? Existing projects: advert (advertising), global (shared). Should we create a new one?"
**Round 4: Telegram Bot**
```
To collect:
8. bot_token - Telegram Bot Token
```
If the user does not have a token yet, provide these instructions:
```
Steps to create a Telegram Bot:
1. Search for @BotFather in Telegram and send /newbot
2. Enter the bot display name when prompted (e.g. CryptoBot)
3. Enter the bot username (must end with Bot, e.g. openclaw_crypto_bot)
4. BotFather returns a token (format: 1234567890:ABCdef...)
5. Send that token to me
```
**Round 5: Deployment type**
```
To collect:
9. deploy_type - local (localhost) or remote (Tailscale IP)
10. qdrant_host - the Tailscale IP address, if remote
```
Example prompt: "Will this agent run on this server or remotely? If remote, what is the Tailscale IP?"
### 11.2 Port Allocation Rules
| Port | Use |
|------|-----|
| 18789 | main agent (in use) |
| 18790 | 2nd local agent |
| 18791 | 3rd local agent |
| ... | increments by one |
🔧 Automatic allocation: read the count of registered agents from `agents.yaml`; port = 18789 + count.
Remote agents do not need a port allocated on this server.
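The allocation rule above can be sketched as follows; the registry entries are illustrative stand-ins for what `agents.yaml` defines, not its real schema.

```python
# Sketch of the port-allocation rule: port = 18789 + registered-agent count.
# The registry entries below are illustrative, not the real agents.yaml schema.
BASE_PORT = 18789

def next_port(agents: list[dict]) -> int:
    """Return the gateway port for the next local agent."""
    return BASE_PORT + len(agents)

registry = [
    {"id": "main", "type": "local-cli"},       # occupies 18789
    {"id": "crypto", "type": "local-systemd"}, # occupies 18790
]
print(next_port(registry))  # -> 18791
```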
### 11.3 Execution Phase (🔧 run in order)
Once all information is collected, execute the following steps. **Report progress to the user after each step.**
**Step 1: Run onboard.sh**
```bash
cd /root/.openclaw/workspace/templates
# Local agent:
./onboard.sh <agent_id> "<agent_name>" <project_id>
# Remote agent:
./onboard.sh <agent_id> "<agent_name>" <project_id> <qdrant_host>
```
This automatically creates the workspace, registers the agent in agents.yaml and project_registry,
generates the systemd service/env files, and reloads the monitor.
**Step 2: Fill in IDENTITY.md**
Write to `agents/<agent_id>-workspace/IDENTITY.md`:
```markdown
# Agent Identity
- **Name**: <agent_name>
- **Agent ID**: <agent_id>
- **Role**: <role description provided by the user>
- **Project**: <project_id>
- **Created**: <today's date>
## Scope
<one line per responsibility provided by the user>
## Communication Style
<personality/communication style described by the user>
```
**Step 3: Fill in SOUL.md**
Write to `agents/<agent_id>-workspace/SOUL.md`:
```markdown
# <agent_name> - Core Personality
## Beliefs
<derive 2-3 core beliefs from the role the user described>
## Behavior Rules
- Follow shared best practices from public memory
- Respect memory visibility boundaries (public/project/private)
- Log important decisions to memory for team awareness
<add 2-3 role-specific behavior rules>
## Communication Style
<expand the user's style description into 2-3 concrete sentences>
```
**Step 4: If this is a new project, register it in project_registry.yaml**
If Round 3 produced a new project, edit `skills/mem0-integration/project_registry.yaml`:
```yaml
<project_id>:
  name: "<project name>"
  description: "<project description>"
  members:
    - "<agent_id>"
    - "main"
  owner: "main"
```
**Step 5: Create openclaw.json**
This is the most critical step. Copy main's configuration and modify it:
```bash
cp /root/.openclaw/openclaw.json /root/.openclaw-<agent_id>/openclaw.json
```
**Fields that must be changed (field mapping table):**
| JSON path | main's value | New agent's value |
|-----------|--------------|-------------------|
| `agents.list[0].id` | `"main"` | `"<agent_id>"` |
| `agents.defaults.workspace` | `"/root/.openclaw/workspace"` | `"/root/.openclaw/workspace/agents/<agent_id>-workspace"` |
| `channels.telegram.botToken` | `"7047245486:AAF..."` | `"<token provided by the user>"` |
| `gateway.port` | `18789` | `<allocated port>` |
| `gateway.controlUi.allowedOrigins[2]` | `"http://100.115.94.1:18789"` | **`"http://100.115.94.1:<port>"` (must match this agent's gateway.port)** |
| `gateway.controlUi.dangerouslyDisableDeviceAuth` | `true` | **keep `true`** (otherwise opening the Control UI from the Tailscale IP reports "device identity required" and a browser device must be paired first) |
| `gateway.controlUi.allowInsecureAuth` | absent or `true` | **recommended `true`** (same as main; required for non-localhost HTTP access, where the browser cannot generate a device key, otherwise "device identity required" is still reported) |
| `plugins.entries.mem0-integration.config.agent_id` | `"main"` | `"<agent_id>"` |
**Control UI access**: if `allowedOrigins[2]` is not changed to this agent's port, visiting `http://100.115.94.1:<port>/` fails with **"origin not allowed"** and the pairing page cannot open. When creating openclaw.json, always change `gateway.port` and `gateway.controlUi.allowedOrigins[2]` together.
**Fields to keep unchanged (inherited from main's configuration):**
- `models` -- same model configuration
- `auth` -- same authentication
- `memory` -- qmd backend
- `skills` -- inherits tavily, find-skills-robin, mem0-integration
- `plugins.load.paths` -- may be kept, or changed to the agent's own skills path
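The Step 5 field edits can be scripted. The nested keys below mirror the mapping table; treat the exact JSON layout as an assumption about the real file, not a guaranteed schema.

```python
# Sketch: apply the Step 5 field mapping to a copied openclaw.json.
# The nested structure is assumed from the mapping table above.
import json

def patch_config(cfg: dict, agent_id: str, bot_token: str, port: int) -> dict:
    cfg["agents"]["list"][0]["id"] = agent_id
    cfg["agents"]["defaults"]["workspace"] = (
        f"/root/.openclaw/workspace/agents/{agent_id}-workspace"
    )
    cfg["channels"]["telegram"]["botToken"] = bot_token
    cfg["gateway"]["port"] = port
    # Keep allowedOrigins[2] in lockstep with gateway.port,
    # otherwise the Control UI reports "origin not allowed".
    cfg["gateway"]["controlUi"]["allowedOrigins"][2] = f"http://100.115.94.1:{port}"
    cfg["plugins"]["entries"]["mem0-integration"]["config"]["agent_id"] = agent_id
    return cfg

main_cfg = {
    "agents": {"list": [{"id": "main"}], "defaults": {"workspace": "/root/.openclaw/workspace"}},
    "channels": {"telegram": {"botToken": "OLD"}},
    "gateway": {"port": 18789, "controlUi": {"allowedOrigins": ["a", "b", "http://100.115.94.1:18789"]}},
    "plugins": {"entries": {"mem0-integration": {"config": {"agent_id": "main"}}}},
}
# json round-trip deep-copies so main's config dict is left untouched
patched = patch_config(json.loads(json.dumps(main_cfg)), "crypto", "NEW_TOKEN", 18790)
print(patched["gateway"]["controlUi"]["allowedOrigins"][2])  # http://100.115.94.1:18790
```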
**Step 6: Start the service**
```bash
# Local agent:
export XDG_RUNTIME_DIR=/run/user/$(id -u)
systemctl --user start openclaw-gateway-<agent_id>.service
# Check status:
systemctl --user status openclaw-gateway-<agent_id>.service
```
**Step 7: Verify**
```bash
./deploy.sh health
```
### 11.4 Completion Checklist (🔧 confirm each item, then report to the user)
```
□ onboard.sh ran successfully
□ registered in agents.yaml
□ registered in project_registry.yaml (with main as a member)
□ IDENTITY.md filled in with role/responsibilities
□ SOUL.md filled in with personality/behavior rules
□ openclaw.json created, fields modified:
  □ agents.list[0].id = <agent_id>
  □ agents.defaults.workspace points to the agent workspace
  □ channels.telegram.botToken uses the new token
  □ gateway.port does not conflict with other agents
  □ gateway.controlUi.allowedOrigins[2] = "http://100.115.94.1:<agent port>" (otherwise the Control UI reports "origin not allowed")
  □ gateway.controlUi.dangerouslyDisableDeviceAuth = true (otherwise "device identity required" is reported)
  □ gateway.controlUi.allowInsecureAuth = true (recommended when opening the UI over HTTP from a Tailscale/LAN IP)
  □ plugins.entries.mem0-integration.config.agent_id is correct
□ systemd service started
□ deploy.sh health all green
□ Telegram Bot pairing completed (confirmed by the user)
```
### 11.5 Telegram Pairing Notes (must be completed by the user)
When a new agent uses `dmPolicy: pairing`, **pairing must be completed by the user in Telegram and the Control UI**; the main agent cannot do it on the user's behalf.
**Standard steps (give these to the user):**
1. In Telegram, find the agent's bot (e.g. @xxx_bot) and send `/start`
2. Open the agent's Control UI: `http://100.115.94.1:<port>/` (the port is that agent's gateway.port)
3. If **"origin not allowed"** appears: the agent's `openclaw.json` has `gateway.controlUi.allowedOrigins[2]` not set to `http://100.115.94.1:<port>`; the main agent must fix it and restart the agent's service
4. If **"device identity required"** appears: when accessed via `http://100.115.94.1:<port>` the browser is not in a secure context (only HTTPS or localhost qualify) and cannot generate a device key. Two options:
   - **Option A (recommended)**: open the Control UI with a **tokenized URL**: `http://100.115.94.1:<port>/?token=<agent gateway.auth.token>`, which bypasses the device check.
   - **Option B**: in the agent's `openclaw.json`, under `gateway.controlUi`, set both `"dangerouslyDisableDeviceAuth": true` and `"allowInsecureAuth": true`, save, and restart the agent's service (trusted internal networks only).
5. In the Control UI, open the Devices/pairing page and approve the **Telegram user's** pairing request (this pairs the bot with the user, independent of the browser device in the previous step)
6. Once paired, the bot replies to direct messages normally
What the main agent can do: provide the steps above, give the Control UI address, and when an origin error appears, fix the configuration and ask the user to retry.
### 11.6 Example Conversation
```
User: Create a new cryptocurrency analysis agent for me
Eason: Sure. First the basics:
       1. What agent ID? Lowercase English recommended, e.g. crypto
       2. What display name?
User: ID crypto, display name "加密分析师"
Eason: Got it. What is this agent's role, and what is it responsible for?
       For communication style, do you want it rigorous or casual?
User: Market monitoring, strategy analysis, and risk alerts. Professional but concise.
Eason: Understood. For project membership, the existing projects are:
       - advert (advertising)
       - global (shared)
       Should I create a new crypto project?
User: New crypto project
Eason: OK. Next we need a dedicated Telegram Bot Token.
       If you don't have one yet:
       1. Search @BotFather in Telegram -> send /newbot
       2. Set the name and username
       3. Send me the returned token
User: 1234567890:ABCdef...
Eason: Last question: deploy on this server or remotely?
User: This server
Eason: Great, starting deployment...
[Execute Steps 1-7]
Deployment complete! All checklist items passed.
Please send a message to your bot in Telegram to test it.
```
---
## 12. Playbook: Agent Maintenance & Troubleshooting
### 12.1 Routine Health Check (🔧)
When the user asks "how are the agents doing" or during a Heartbeat run:
```bash
./deploy.sh health # global health check
./deploy.sh status # detailed service status
```
Interpret the results and report to the user:
- ✓ healthy
- ✗ service not running -> attempt a restart
- ⚠ resource warning -> report the specific numbers
### 12.2 Unresponsive Agent Triage
```
Step 1: Check whether the service is running
  systemctl --user status openclaw-gateway-<agent_id>.service
Step 2: If inactive -> check the logs
  journalctl --user -u openclaw-gateway-<agent_id> -n 50 --no-pager
Step 3: Common problems and fixes:
- "Address already in use" -> port conflict; check gateway.port in openclaw.json
- "Cannot find module" -> openclaw version problem; run ./deploy.sh fix-service
- "ECONNREFUSED" -> Qdrant not running; check docker ps | grep qdrant
- "API key invalid" -> check the API key in systemd/<agent_id>-gateway.env
- "origin not allowed" (Control UI won't open) -> gateway.controlUi.allowedOrigins[2] in the agent's openclaw.json must be "http://100.115.94.1:<agent port>"; after fixing, run systemctl --user restart openclaw-gateway-<agent_id>.service
- "device identity required" (Control UI demands device pairing) -> over HTTP to a non-localhost address the browser cannot generate a device key. Fix: (1) use a tokenized URL: http://100.115.94.1:<port>/?token=<gateway.auth.token>; or (2) in the agent's openclaw.json, under gateway.controlUi, set both "dangerouslyDisableDeviceAuth": true and "allowInsecureAuth": true, save, and restart the agent's service (trusted internal networks only).
Step 4: Restart
  systemctl --user restart openclaw-gateway-<agent_id>.service
Step 5: Still failing -> collect logs for the user
  journalctl --user -u openclaw-gateway-<agent_id> -n 200 --no-pager > /tmp/agent-debug.log
```
### 12.3 Recovery After an OpenClaw Upgrade
After the user upgrades OpenClaw through the UI, custom configuration may be lost:
```bash
./deploy.sh fix-service # re-inject EnvironmentFile into the systemd services
./deploy.sh restart # restart all services to apply the configuration
./deploy.sh health # confirm recovery
```
Report the repair result to the user.
### 12.4 Listing Agents
```bash
python3 scripts/parse_agents.py list
```
Output format: `<id>\t<type>\t<name>`; format it as a table when showing it to the user.
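The tab-separated output can be turned into a table before showing it to the user; the sample lines below are illustrative, not real `parse_agents.py` output.

```python
# Sketch: format tab-separated `parse_agents.py list` lines as a markdown
# table. The sample input is illustrative only.
def to_table(raw: str) -> str:
    rows = [line.split("\t") for line in raw.strip().splitlines()]
    out = ["| ID | Type | Name |", "|----|------|------|"]
    out += [f"| {i} | {t} | {n} |" for i, t, n in rows]
    return "\n".join(out)

sample = "main\tlocal-cli\t陈医生\ncrypto\tlocal-systemd\tCryptoBot"
print(to_table(sample))
```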
### 12.5 Debug Mode
When the user needs to debug an agent:
```bash
./deploy.sh debug-stop # stop all services (including the monitor, to prevent auto-restart)
# ... user debugs ...
./deploy.sh debug-start # restore all services
```
---
## 13. Playbook: Memory Management
### 13.1 Publishing Shared Knowledge (🔧)
When the user says "share this best practice with all agents":
```bash
python3 skills/mem0-integration/mem0_integration.py publish \
'{"content":"<knowledge>", "visibility":"public", "category":"knowledge"}'
```
When the user says "share this with a specific project":
```bash
python3 skills/mem0-integration/mem0_integration.py publish \
'{"content":"<content>", "visibility":"project", "project_id":"<project>", "category":"knowledge"}'
```
### 13.2 Memory Statistics
```bash
python3 skills/mem0-integration/memory_cleanup.py --dry-run
```
Report memory counts per agent, per type, and per visibility to the user.
### 13.3 Cleaning Up Expired Memories
```bash
# Dry-run first:
python3 skills/mem0-integration/memory_cleanup.py --dry-run --max-age-days 90
# After confirmation:
python3 skills/mem0-integration/memory_cleanup.py --max-age-days 90
```
### 13.4 Preloading Knowledge for a New Agent (Cold Start)
After a new agent is created, public knowledge can be preloaded for it:
```bash
python3 skills/mem0-integration/mem0_integration.py cold_start \
'{"agent_id":"<agent_id>", "user_id":"wang_yuanzhang", "top_k":10}'
```
### 13.5 Checking Memory Visibility
When the user asks "can agent X see this memory":
1. Determine the memory's `visibility` and `project_id`
2. Check `project_registry.yaml` to confirm whether the agent is in that project's members list
3. Visibility rules:
- `public` -> visible to all agents
- `project` -> visible to project members only
- `private` -> visible to the writing agent only
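The three rules above can be sketched as a single check; the registry contents here are illustrative, since real membership lives in `project_registry.yaml`.

```python
# Sketch of the 13.5 visibility rules. The registry dict is illustrative;
# real membership is defined in project_registry.yaml.
def can_see(memory: dict, agent_id: str, registry: dict) -> bool:
    v = memory["visibility"]
    if v == "public":
        return True  # visible to all agents
    if v == "project":
        members = registry.get(memory["project_id"], {}).get("members", [])
        return agent_id in members  # project members only
    if v == "private":
        return memory["agent_id"] == agent_id  # writer only
    return False

registry = {"advert": {"members": ["main", "adbot"]}}
print(can_see({"visibility": "project", "project_id": "advert"}, "adbot", registry))  # True
print(can_see({"visibility": "private", "agent_id": "main"}, "adbot", registry))      # False
```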
---
## 14. Playbook: Interactive Offboarding (Removing an Agent)
### 14.1 Information Gathering (🗣)
```
To collect:
1. agent_id - the agent to remove
2. keep_data - whether to keep its data (workspace, profile, Qdrant memories)
```
Example prompt: "Which agent should be removed? Keep its data? (Kept data allows later recovery.)"
🔧 First show the current agent list:
```bash
python3 scripts/parse_agents.py list
```
### 14.2 Safety Checks (🔧)
```
□ Confirm it is not the main agent (main cannot be removed)
□ Confirm the agent exists in agents.yaml
□ Confirm again with the user: "Really remove <agent_name> (<agent_id>)? This stops the service and deletes it from the registry."
```
### 14.3 Execution (🔧)
```bash
cd /root/.openclaw/workspace/templates
# Full removal (including data):
./offboard.sh <agent_id>
# Unregister only (keep data):
./offboard.sh <agent_id> --keep-data
```
The script asks for interactive confirmation (y/N); enter y to proceed.
### 14.4 Post-Removal Report
Report to the user:
```
Agent <agent_name> (<agent_id>) removed:
- Service: stopped and uninstalled
- agents.yaml: removed
- project_registry: removed
- Workspace: <deleted / kept>
- Qdrant memories: <deleted / kept>
- Monitor: reloaded
```
Run `./deploy.sh health` to confirm the system is healthy.
---
## 15. Playbook: Backup & Cleanup
### 15.1 Backup Commands
| Command | Description |
|---------|-------------|
| `./deploy.sh backup` | Full backup (workspace + Qdrant snapshot + agent profiles + docker-compose) |
| `./deploy.sh backup quick` | Quick backup (workspace files only, no Qdrant) |
| `bash scripts/10-create-backup.sh` | Standalone backup script (includes mem0 config + agents.yaml + Qdrant snapshot) |
**Retention policy**: the 10 most recent backups are kept automatically; older backups are deleted.
**Backup directory layout**:
```
/root/.openclaw/backups/<TIMESTAMP>/
├── workspace.tar.gz # all Layer 1+2 MD and config files
├── .openclaw__openclaw.json # main agent profile
├── .openclaw-tongge__openclaw.json # secondary agent profiles (if any)
├── docker-compose.yml # Qdrant docker configuration
├── qdrant-mem0_v4_shared.snapshot # Layer 4 vector data (full mode)
├── qdrant-point-count.txt # point count at backup time (for verification)
└── manifest.txt # backup manifest
```
### 15.2 Restore Commands
| Command | Description |
|---------|-------------|
| `./deploy.sh restore <backup-dir>` | Restore workspace files + agent profiles |
| `./deploy.sh restore-qdrant <snapshot-file>` | Restore Qdrant vector data |
A quick backup is created automatically before any restore, and interactive confirmation (y/N) is required.
### 15.3 Memory Cleanup
Cleanup script: `skills/mem0-integration/memory_cleanup.py`
| Command | Description |
|---------|-------------|
| `python3 memory_cleanup.py --dry-run` | Count memories by dimension + list the number of expired memories (no deletion) |
| `python3 memory_cleanup.py --execute --max-age-days 90` | Actually delete expired memories |
**Retention policy** (aligned with `EXPIRATION_MAP` in `mem0_client.py`):
- `session`: expires after 7 days
- `chat_summary`: expires after 30 days
- `preference`: kept forever
- `knowledge`: kept forever
`--max-age-days` acts as a hard cap: session/chat_summary memories older than that many days are deleted regardless of their expiration_date. preference and knowledge memories are never cleaned automatically.
**Audit log**: each cleanup run writes to `logs/security/memory-cleanup-<date>.log`
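The per-record cleanup decision can be sketched as follows, matching the retention policy above; the field names are illustrative assumptions about the cleanup script's internals.

```python
# Sketch of the cleanup decision for one memory record, following the
# 15.3 retention policy. Field names are illustrative assumptions.
from datetime import datetime, timedelta

NEVER_EXPIRE = {"preference", "knowledge"}

def should_delete(category: str, created_at: datetime,
                  now: datetime, max_age_days: int) -> bool:
    """Apply the --max-age-days hard cap to expirable categories."""
    if category in NEVER_EXPIRE:
        return False  # preference/knowledge are never auto-cleaned
    return now - created_at > timedelta(days=max_age_days)

now = datetime(2026, 3, 9)
old = now - timedelta(days=120)
print(should_delete("session", old, now, 90))    # True: past the 90-day cap
print(should_delete("knowledge", old, now, 90))  # False: kept forever
```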
### 15.4 Cron Automation
Installer script: `scripts/setup-cron.sh`
```bash
./scripts/setup-cron.sh # install scheduled jobs
./scripts/setup-cron.sh remove # remove scheduled jobs
./scripts/setup-cron.sh status # show current jobs
```
**Schedule**:
| Time | Job |
|------|-----|
| Daily 02:00 | `./deploy.sh backup` (full backup) |
| Sunday 03:00 | `memory_cleanup.py --execute --max-age-days 90` (clean expired memories) |
Logs go to `logs/system/cron-backup.log` and `logs/system/cron-cleanup.log`
### 15.5 Interactive Backup/Restore Flow (🗣)
Conversation flow when the user requests a backup or restore:
**Backup**:
```
陈医生: "What type of backup do you need?"
1. Full backup (includes Qdrant vector data, recommended)
2. Quick backup (workspace files only)
-> run the corresponding command, report the backup path and Qdrant point count
-> Recommendation: always take a full backup before major changes
```
**Restore**:
```
陈医生: "Which backup should be restored?"
-> list the backups available under /root/.openclaw/backups/
-> show the manifest.txt contents for the user to confirm
-> restore the workspace first: ./deploy.sh restore <dir>
-> if a Qdrant snapshot exists and the user confirms: ./deploy.sh restore-qdrant <file>
-> after restoring, run ./deploy.sh restart + ./deploy.sh health
-> compare qdrant-point-count.txt with the current point count
```
---
## 16. Playbook: Server Migration
### 16.1 Pre-Migration Preparation (🗣)
Information to gather:
```
To confirm:
1. target_server - target server address (IP or Tailscale hostname)
2. target_user - target server username (usually root)
3. keep_source - whether to keep the source server's data after migration
4. tailscale - whether the target server has joined the Tailscale network
```
Example prompt: "Which server are we migrating to? Is Tailscale installed there? Should the source server's data be kept afterwards?"
### 16.2 Source Server: Full Backup (🔧)
```bash
cd /root/.openclaw/workspace
./deploy.sh backup
```
Verify backup integrity:
```bash
ls -la /root/.openclaw/backups/<TIMESTAMP>/
cat /root/.openclaw/backups/<TIMESTAMP>/manifest.txt
cat /root/.openclaw/backups/<TIMESTAMP>/qdrant-point-count.txt
```
### 16.3 Transfer to the Target Server (🔧)
```bash
BACKUP_DIR="/root/.openclaw/backups/<TIMESTAMP>"
TARGET="root@<target_server>"
rsync -avzP "$BACKUP_DIR" "$TARGET:/root/.openclaw/backups/"
rsync -avzP /root/.openclaw/workspace/ "$TARGET:/root/.openclaw/workspace/" --exclude='.git' --exclude='logs'
rsync -avzP /root/.openclaw/openclaw.json "$TARGET:/root/.openclaw/"
```
Secondary agent profiles (if any):
```bash
for d in /root/.openclaw-*/; do
agent_name=$(basename "$d")
rsync -avzP "$d" "$TARGET:/root/$agent_name/"
done
```
### 16.4 Target Server: Install Infrastructure (🔧)
```bash
# 1. Install Node.js (v24+) and OpenClaw
curl -fsSL https://get.openclaw.com | bash
# 2. Install Docker + Qdrant
mkdir -p /opt/mem0-center && cd /opt/mem0-center
# Restore docker-compose.yml from the backup
cp /root/.openclaw/backups/<TIMESTAMP>/docker-compose.yml .
docker compose up -d
# 3. Wait for Qdrant to start
sleep 5
curl -sf http://localhost:6333/collections | python3 -c "import sys,json; print(json.dumps(json.load(sys.stdin),indent=2))"
# 4. Restore Qdrant data
cd /root/.openclaw/workspace
./deploy.sh restore-qdrant /root/.openclaw/backups/<TIMESTAMP>/qdrant-mem0_v4_shared.snapshot
# 5. Install Python dependencies
pip3 install qdrant-client mem0ai pyyaml
# 6. Install system services
./deploy.sh install
```
### 16.5 Verification (🔧)
```bash
# Service status
./deploy.sh health
# Qdrant data comparison
curl -sf http://localhost:6333/collections/mem0_v4_shared | python3 -c "import sys,json; d=json.load(sys.stdin); print(f'Points: {d[\"result\"][\"points_count\"]}')"
# Compare with the source server's qdrant-point-count.txt
# Memory retrieval test
cd /root/.openclaw/workspace/skills/mem0-integration
python3 mem0_integration.py search "test query" --agent-id main
# Telegram connectivity
# Send a test message to the bot on Telegram
```
### 16.6 Post-Migration Checklist
```
□ All agent services running (deploy.sh health all green)
□ Qdrant point count matches the source server
□ Memory search returns results normally
□ Telegram bot replies normally
□ Cron jobs installed (scripts/setup-cron.sh)
□ Environment variables set (MEM0_DASHSCOPE_API_KEY etc.)
□ Monitor service running (systemctl status openclaw-agent-monitor)
□ Joined Tailscale (if remote agents need to connect)
□ Source server data handled (kept/cleaned)
```
### 16.7 Rollback Plan
If the migration fails:
```
1. On the source server, run ./deploy.sh debug-start to restore services
2. On the target server, run ./deploy.sh debug-stop to stop all services
3. Investigate the problem, then retry
```
---
## 17. Skill/Plugin Management SOP
### 17.1 Skill vs Plugin Selection Guide
OpenClaw has two extension-loading mechanisms; choose between them as follows:
| Type | Loading | Config location | Use cases |
|------|---------|-----------------|-----------|
| **Built-in Skill** | auto-discovered by OpenClaw | `skills.entries.<id>` | Clawhub marketplace built-in skills (e.g. `find-skills-robin`) |
| **Custom Plugin** | path specified manually | `plugins.load.paths` + `plugins.entries.<id>` | in-house tools (tavily), lifecycle hooks (mem0), any extension needing custom code |
**Decision rules:**
- To simply toggle a built-in Clawhub feature -> `skills.entries`
- If it has its own `openclaw.plugin.json` + `index.js` -> `plugins`
- If it needs lifecycle hooks (automatic execution before/after conversations) -> must be `plugins`
- Do **not** enable the same skill in both `skills.entries` and `plugins.entries`
**Required plugin files:**
```
/root/.openclaw/workspace/skills/<id>/
├── openclaw.plugin.json # plugin manifest (required)
├── index.js # tool/hook implementation (required)
├── CONFIG_SUMMARY.md # configuration docs (recommended)
└── TEST_REPORT.md # test report (recommended)
```
### 17.2 Staged Release Workflow
Every new skill must be verified on the main agent before being deployed to secondary agents.
**Stage 1 -- Install the code**
1. Place the skill code in `/root/.openclaw/workspace/skills/<id>/`
2. Ensure `openclaw.plugin.json` exists (with id, name, kind, main, tools/configSchema)
3. Ensure `index.js` exists (exporting `register`/`activate` and the tool definitions)
**Stage 2 -- Enable and test on main**
1. In main's `openclaw.json`:
- add `"/root/.openclaw/workspace/skills/<id>"` to `plugins.load.paths`
- set `plugins.entries.<id>` to `{ "enabled": true }` (include config if any)
2. Restart the main gateway: `systemctl --user restart openclaw-gateway.service`
3. Check the logs to confirm the plugin loaded: `journalctl --user -u openclaw-gateway -n 50 | grep -i <id>`
4. Message main via Telegram to test the functionality
**Stage 3 -- Review**
Complete a review using `templates/SKILL_REVIEW_TEMPLATE.md`, covering:
| Dimension | Checks |
|-----------|--------|
| Security | API key handling (environment variables vs hard-coded), network request scope, file reads/writes, privilege escalation |
| Functionality | can the agent call it correctly, are results accurate, is error handling reasonable |
| Performance | response time, concurrent calls, impact on overall agent latency |
| Best practices | recommended parameters, applicable scenarios, known limitations; record in `CONFIG_SUMMARY.md` |
**Stage 4 -- Roll out to secondary agents**
1. The skill code lives in the shared workspace; no copying is needed
2. In each secondary agent's `openclaw.json`:
- add the same path to `plugins.load.paths`
- enable `plugins.entries.<id>` (mind agent-specific config; e.g. mem0's `agent_id` must be changed to that agent's ID)
3. Restart the secondary agent's gateway
4. Verify the plugin loads and works correctly
### 17.3 Current Skill Inventory
| Skill ID | Type | Loading | Main | Tongge | Notes |
|----------|------|---------|------|--------|-------|
| `find-skills-robin` | built-in | `skills.entries` | enabled | enabled | Clawhub skill discovery |
| `mem0-integration` | lifecycle | `skills.entries` + `plugins` | enabled | enabled | memory system (agent_id must differ) |
| `tavily` | tool | `plugins` | enabled | enabled | AI search (shared API key) |
| `active-learning` | built-in | `skills.entries` | -- | enabled | active learning (tongge only) |
| `memos-cloud-openclaw-plugin` | built-in | `plugins.entries` | enabled | enabled | Memos cloud plugin |
| `qwen-portal-auth` | built-in | `plugins.entries` | enabled | enabled | Qwen Portal OAuth |
> **Maintenance requirement:** update this table whenever a skill is added or removed.
### 17.4 Agent-Specific Configuration Notes
Some plugins require different configuration per agent:
| Plugin | Config key | Main | Tongge |
|--------|-----------|------|--------|
| `mem0-integration` | `config.agent_id` | `"main"` | `"tongge"` |
| `mem0-integration` | `config.user_id` | `"wang院长"` | `"wang院长"` |
Always check these keys when deploying to a new agent.
---
## Changelog
| Version | Date | Changes |
|---------|------|---------|
| 1.0 | 2026-03-06 | Initial version: hub-and-spoke model, templates, remote support |
| 1.1 | 2026-03-06 | Config-driven architecture: agents.yaml as single registry; automated onboard/offboard; parse_agents.py helper; life agent (张大师) removed; main is only active agent |
| 1.2 | 2026-03-06 | Code review + bug fixes (7 items): `parse_agents.py` output now shell-safe quoted; `agent-monitor.js` constructor ordering fixed (ensureLogDir before loadMonitoredServices) and fallback uses full `openclaw` path; `deploy.sh` switched `grep -qP` to `grep -qE` for portability; `offboard.sh` Qdrant delete uses `FilterSelector` wrapper; `onboard.sh`/`offboard.sh` inline Python rewritten with `sys.argv` to prevent shell injection; `openclaw-agent-monitor.service` added `/run/user/0` to `ReadWritePaths` for D-Bus access; removed corrupted trailing bytes in `offboard.sh` |
| 2.0 | 2026-03-06 | Added operational playbooks (Part B): Interactive Onboarding (Sec 11, with conversation flow, field mapping table, port allocation, checklist, dialog example), Agent Maintenance & Troubleshooting (Sec 12), Memory Management (Sec 13), Interactive Offboarding (Sec 14). Document restructured into Part A (Architecture Reference) and Part B (Operational Playbooks). |
| 2.1 | 2026-03-06 | Added Backup & Cleanup Playbook (Sec 15): backup/restore commands, memory cleanup with retention policy, cron automation, interactive dialogue flow. Added Server Migration Playbook (Sec 16): step-by-step migration with pre/post checklist, Qdrant snapshot recovery, rollback plan. |
| 2.2 | 2026-03-09 | Added Skill/Plugin Management SOP (Sec 17): skill vs plugin selection guide, staged release workflow (main-first), current skill inventory, agent-specific config notes. Unified tavily loading to plugin mode across all agents. |