MEMORY.md - Long-term Memory
This file contains curated long-term memories and important context.
Memory Management Strategy
- MEMORY.md: Curated long-term memories, important decisions, security templates, and key configurations
- QMD System: Automated memory backend with semantic search, auto-updates every 5 minutes
- Usage: Write significant learnings to MEMORY.md; rely on QMD for daily context and automation
- Access: MEMORY.md loaded only in main sessions (direct chats) for security
QMD Configuration
- Backend: qmd
- Auto-update: every 5 minutes
- Include default memory: true
- Last verified: 2026-02-20
Server Security Hardening Template (2026-02-20)
Environment
- Server: Ubuntu 24.04 LTS VPS (KVM)
- Panel: 宝塔面板 (BT-Panel) on port 888
- Public IP: 204.12.203.203
Security Configuration Applied
- Port Exposure Minimization:
  - Only ports 80 (HTTP) and 443 (HTTPS) publicly accessible
  - SSH (port 22) restricted to internal/network access only
  - OpenClaw gateway (port 18789) bound to localhost only
  - All other services (MySQL, custom apps) internal-only
- OpenClaw Secure Deployment:
  - Gateway configured with bind: "localhost" instead of "lan"
  - Access exclusively through Nginx reverse proxy with HTTPS
  - Token-based authentication enabled
  - WebSocket support properly configured in Nginx
- Firewall Management:
  - Use 宝塔面板 (BT-Panel) built-in firewall for port management
  - Alternative: system-level firewall (ufw/iptables) if no panel available
  - Regular external port scanning to verify exposure
- Critical Security Principles:
  - Never expose sensitive services directly to public internet
  - Always use reverse proxy with TLS termination for web services
  - Implement defense in depth (firewall + service binding + authentication)
  - Regular security audits using openclaw security audit --deep
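The external port scan mentioned under Firewall Management can be approximated with a short TCP connect check. This is a minimal sketch, not part of the deployed tooling: the function names `open_ports` and `unexpected_ports` are hypothetical, and the check is only meaningful when run from a machine outside the server's network.

```python
import socket


def open_ports(host, ports, timeout=1.0):
    """Return the subset of `ports` that accept a TCP connection on `host`."""
    reachable = []
    for port in ports:
        try:
            # A completed TCP handshake means something is listening and reachable.
            with socket.create_connection((host, port), timeout=timeout):
                reachable.append(port)
        except OSError:
            # Refused, filtered, or timed out: treat the port as not exposed.
            pass
    return reachable


def unexpected_ports(host, ports, allowed=(80, 443), timeout=1.0):
    """Return ports that answered but are not on the public allow-list."""
    return [p for p in open_ports(host, ports, timeout) if p not in allowed]
```

Run from an external host, a call such as `unexpected_ports("204.12.203.203", [22, 80, 443, 888, 3306, 18789])` should return an empty list if the exposure rules above hold; any port it returns deserves a look in the panel firewall.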
Migration Checklist for New Servers
- Install and configure 宝塔面板 or equivalent server management panel
- Set up Nginx reverse proxy with proper WebSocket support
- Configure OpenClaw with localhost binding only
- Restrict public ports to 80/443 only via firewall
- Enable automatic security updates
- Run initial security audit and document baseline
- Schedule periodic security audits via OpenClaw cron
Lessons Learned
- Panel-based firewalls (宝塔/aapanel) must be verified with external port scans
- Direct service exposure (like OpenClaw on 0.0.0.0) creates critical security risks
- Nginx reverse proxy configuration is essential for secure OpenClaw deployment
Agent Operations Logging Practice (2026-02-20)
Log Directory Structure
- /root/.openclaw/workspace/logs/operations/ - Manual operations and important changes
- /root/.openclaw/workspace/logs/system/ - System-generated logs
- /root/.openclaw/workspace/logs/agents/ - Individual agent logs
- /root/.openclaw/workspace/logs/security/ - Security operations and audits
Automatic Logging Triggers
- Configuration Changes: Any modification to config files (.json, .yaml, etc.)
- Security Modifications: Firewall rules, authentication changes, port modifications
- Agent Lifecycle: Deployment, updates, removal of agents
- System Optimizations: Performance tuning, resource allocation changes
- Troubleshooting: Error diagnosis and resolution procedures
- Memory Updates: Significant changes to MEMORY.md or memory management
Log Format Standard
- Filename: YYYY-MM-DD-HH-MM-SS-description.log
- Timestamp: UTC time format
- Content: [TIMESTAMP] [OPERATION_TYPE] [AGENT/USER] Description with before/after state
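The naming and line format above can be generated rather than typed by hand. The following is a sketch under assumptions: the helper names are hypothetical, the ISO-style `Z` timestamp is one possible reading of "UTC time format", and `CONFIG_CHANGE` is an illustrative operation type, not a value defined in this file.

```python
import re
from datetime import datetime, timezone


def log_filename(description, when=None):
    """Build YYYY-MM-DD-HH-MM-SS-description.log using UTC time."""
    when = when or datetime.now(timezone.utc)
    # Lower-case the description and collapse anything non-alphanumeric to "-".
    slug = re.sub(r"[^a-z0-9]+", "-", description.lower()).strip("-")
    return f"{when.astimezone(timezone.utc):%Y-%m-%d-%H-%M-%S}-{slug}.log"


def log_line(operation_type, actor, description, when=None):
    """Build one '[TIMESTAMP] [OPERATION_TYPE] [AGENT/USER] Description' line."""
    when = when or datetime.now(timezone.utc)
    stamp = when.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    return f"[{stamp}] [{operation_type}] [{actor}] {description}"
```

Passing an explicit, timezone-aware `when` keeps the filename and the line timestamp identical for a single operation.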
Implementation Guidelines
- Always log before making changes (capture current state)
- Include rollback instructions when applicable
- Redact sensitive information (passwords, tokens, private keys)
- Reference related MEMORY.md entries for context
- Use QMD for routine operational context, MEMORY.md for strategic decisions
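The redaction guideline can be enforced mechanically before a line reaches a log file. This is a rough sketch: the `redact` helper and its keyword list are assumptions, it only covers single-line `key=value` and `key: value` pairs, and multi-line material such as private keys would need separate handling.

```python
import re

# Matches names containing a sensitive keyword, followed by "=" or ":" and a value.
_SECRET = re.compile(
    r"(?i)([A-Za-z0-9_]*(?:password|passwd|token|api_key|secret)[A-Za-z0-9_]*)"
    r"(\s*[=:]\s*)\S+"
)


def redact(line):
    """Replace the value of any sensitive-looking key with [REDACTED]."""
    return _SECRET.sub(r"\1\2[REDACTED]", line)
```

For example, `redact("API_KEY=sk-example")` keeps the key name and separator but drops the value, while lines without a sensitive keyword pass through unchanged.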
Agent Health Monitoring & Alerting System (2026-02-20)
Features Implemented
- Crash Detection: Monitors uncaught exceptions and unhandled rejections
- Health Checks: Periodic service health verification (every 30 seconds)
- Multi-Channel Notifications: Telegram alerts for critical events
- Automatic Logging: All alerts logged to /logs/agents/health-YYYY-MM-DD.log
- Extensible Design: Easy to add new notification channels
Components Created
- Skill: agent-monitor/SKILL.md - Documentation and usage guide
- Monitor Script: agent-monitor.js - Core monitoring logic
- Startup Script: start-agent-monitor.sh - Easy deployment
- Log Directory: /logs/agents/ - Dedicated logging location
Alert Severity Levels
- CRITICAL: Process crashes, uncaught exceptions
- ERROR: Unhandled rejections, failed operations
- WARNING: Health check failures, performance issues
- INFO: Service status updates, recovery notifications
Integration Points
- Automatically integrated with existing Telegram channel
- Compatible with OpenClaw's agent architecture
- Works alongside existing logging and memory systems
- Can monitor any Node.js-based agent process
Usage Instructions
- Source the startup script: source /root/.openclaw/workspace/start-agent-monitor.sh
- Call startAgentMonitor("agent-name", healthCheckFunction)
- Monitor automatically sends alerts on errors/crashes
- Check logs in /logs/agents/ for detailed information
Complete System Architecture Upgrade (2026-02-20 14:25 UTC)
✅ All 5 Core Requirements Implemented
1. System-Level Persistence ✓
- Systemd Services: openclaw-gateway.service + openclaw-agent-monitor.service
- Auto-start on Boot: Both services enabled in multi-user.target
- Resource Limits: Memory (2G/512M), CPU (80%/20%), watchdog timers
- Status: systemctl status openclaw-gateway / systemctl status openclaw-agent-monitor
2. Auto-Healing ✓
- Crash Detection: Monitors process exits, signals, uncaught exceptions
- Auto-Restart: Systemd Restart=always + monitor script restart logic
- Restart Limits: Max 5 restarts per 5 minutes (prevents restart loops)
- Health Checks: Every 30 seconds, automatic recovery on failure
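The "max 5 restarts per 5 minutes" rule is a sliding-window limit. The class below is a minimal sketch of that idea, not the logic in agent-monitor.js; `RestartLimiter` and its parameters are hypothetical names. systemd expresses a comparable limit with StartLimitIntervalSec= and StartLimitBurst=.

```python
import time
from collections import deque


class RestartLimiter:
    """Allow at most `max_restarts` restarts within any `window_seconds` window."""

    def __init__(self, max_restarts=5, window_seconds=300.0, clock=time.monotonic):
        self.max_restarts = max_restarts
        self.window_seconds = window_seconds
        self._clock = clock          # injectable so the limiter can be tested
        self._history = deque()      # timestamps of restarts still inside the window

    def allow_restart(self):
        now = self._clock()
        # Drop restarts that have aged out of the window.
        while self._history and now - self._history[0] >= self.window_seconds:
            self._history.popleft()
        if len(self._history) >= self.max_restarts:
            return False             # limit reached: stop and alert instead of looping
        self._history.append(now)
        return True
```

A monitor would call `allow_restart()` before each restart attempt and escalate to a CRITICAL alert when it returns `False`.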
3. Multi-Layer Memory Architecture ✓
- Core Memory: CORE_INDEX.md - Identity, structure, file index (always loaded first)
- Long-term Memory: MEMORY.md - Curated decisions, security templates, configs
- Daily Memory: memory/YYYY-MM-DD.md - Raw conversation logs (auto-saved)
memory/YYYY-MM-DD.md- Raw conversation logs (auto-saved) - Passive Archive: On-demand conversion of valuable conversations to skills/notes
- Git Integration: All memory files tracked with version history
4. Git One-Click Rollback ✓
- Repository: /root/.openclaw/workspace (already initialized)
- Deploy Script: ./deploy.sh rollback - Rollback to previous commit
- Specific Rollback: ./deploy.sh rollback-to <commit> - Rollback to specific commit
- Auto-Backup: Backup created before rollback
- Service Restart: Automatic service restart after rollback
5. Telegram Notifications ✓
- Triggers: Service stop, error, crash, restart events
- Channels: Telegram (via bot API) + OpenClaw message tool
- Severity Levels: CRITICAL, ERROR, WARNING, INFO with emoji indicators
- Logging: All notifications logged to /logs/agents/health-YYYY-MM-DD.log
📋 Management Commands (deploy.sh)
./deploy.sh install # Install & start all systemd services
./deploy.sh start # Start all services
./deploy.sh stop # Stop all services
./deploy.sh restart # Restart all services
./deploy.sh status # Show detailed service status
./deploy.sh logs # Show recent logs (last 50 lines)
./deploy.sh health # Run comprehensive health check
./deploy.sh backup # Create timestamped backup
./deploy.sh rollback # Rollback to previous git commit
./deploy.sh rollback-to <commit> # Rollback to specific commit
./deploy.sh help # Show help message
🔧 Systemd Service Details
- Gateway Service: /etc/systemd/system/openclaw-gateway.service
  - Memory limit: 2G, CPU: 80%, Watchdog: 30s
  - Restart: always, RestartSec: 10s
  - Logs: journalctl -u openclaw-gateway -f
- Monitor Service: /etc/systemd/system/openclaw-agent-monitor.service
  - Memory limit: 512M, CPU: 20%
  - Restart: always, RestartSec: 5s
  - Logs: journalctl -u openclaw-agent-monitor -f
📊 Health Check Metrics
- Gateway service status (active/inactive)
- Agent monitor status (active/inactive)
- Disk usage (warning at 80%)
- Memory usage (warning at 80%)
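The two resource thresholds above reduce to a percentage calculation and a comparison. The sketch below is illustrative only: the function names are hypothetical, it is not the check inside deploy.sh, memory is derived from the MemTotal and MemAvailable fields of /proc/meminfo (Linux), and treating exactly 80% as a warning is an assumption.

```python
import shutil


def disk_percent(path="/"):
    """Percentage of the filesystem containing `path` that is in use."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.used / usage.total


def memory_percent(meminfo_text):
    """Used-memory percentage computed from the text of /proc/meminfo."""
    fields = {}
    for line in meminfo_text.splitlines():
        key, _, rest = line.partition(":")
        parts = rest.split()
        if parts:
            fields[key.strip()] = int(parts[0])   # values are reported in kB
    total = fields["MemTotal"]
    return 100.0 * (total - fields["MemAvailable"]) / total


def threshold_warnings(metrics, threshold=80.0):
    """Return one warning string per metric at or above the threshold."""
    return [
        f"{name} usage at {value:.0f}%"
        for name, value in metrics.items()
        if value >= threshold
    ]
```

On the server, `memory_percent(open("/proc/meminfo").read())` and `disk_percent("/")` would feed `threshold_warnings`, and any returned string would map to a WARNING-level alert.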
🎯 Next Steps (Future Enhancements)
- Add Prometheus/Grafana monitoring dashboard
- Implement log rotation and archival
- Add email notifications as backup channel
- Create web-based admin dashboard
- Add automated security scanning in CI/CD
User-Level vs System-Level Systemd Services - Critical Lesson (2026-02-20 14:35 UTC)
Problem Discovered
Initial deployment used system-level systemd services (/etc/systemd/system/) for OpenClaw Gateway, but OpenClaw natively uses user-level systemd (~/.config/systemd/user/). This caused:
- Service restart loops (5 attempts then failure)
- Error: systemctl --user unavailable: Failed to connect to bus: No medium found
- Conflicts between system and user service definitions
Root Cause
OpenClaw Gateway is designed as a user-level service because:
- It runs under the user's context, not root
- It needs access to user-specific config (~/.openclaw/)
- User-level services have different environment requirements
Solution: Hybrid Architecture
User-Level Service (Gateway)
- Location: ~/.config/systemd/user/openclaw-gateway.service
- Required Setup:
  # Enable linger (CRITICAL - allows user services to run without login session)
  loginctl enable-linger $(whoami)
  # Set environment variables
  export XDG_RUNTIME_DIR=/run/user/$(id -u)
  export DBUS_SESSION_BUS_ADDRESS="unix:path=/run/user/$(id -u)/bus"
- Management Commands:
  systemctl --user status openclaw-gateway
  systemctl --user start/stop/restart openclaw-gateway
  journalctl --user -u openclaw-gateway -f
System-Level Service (Agent Monitor)
- Location: /etc/systemd/system/openclaw-agent-monitor.service
- Purpose: Independently monitor the gateway (survives user session issues)
- Management Commands:
  systemctl status openclaw-agent-monitor
  systemctl start/stop/restart openclaw-agent-monitor
  journalctl -u openclaw-agent-monitor -f
Deployment Checklist for New Servers
# 1. Enable user linger (MUST DO FIRST)
loginctl enable-linger $(whoami)
# 2. Create runtime directory if needed
mkdir -p /run/user/$(id -u)
chmod 700 /run/user/$(id -u)
# 3. Export environment (add to ~/.bashrc for persistence)
echo 'export XDG_RUNTIME_DIR=/run/user/$(id -u)' >> ~/.bashrc
echo 'export DBUS_SESSION_BUS_ADDRESS=unix:path=/run/user/$(id -u)/bus' >> ~/.bashrc
# 4. Install services
./deploy.sh install
# 5. Verify
./deploy.sh health
Troubleshooting Guide
Error: "Failed to connect to bus: No medium found"
Cause: User linger not enabled or environment variables not set
Fix:
loginctl enable-linger $(whoami)
export XDG_RUNTIME_DIR=/run/user/$(id -u)
export DBUS_SESSION_BUS_ADDRESS="unix:path=/run/user/$(id -u)/bus"
Error: "Start request repeated too quickly"
Cause: Service crashing due to misconfiguration
Fix: Check logs with journalctl --user -u openclaw-gateway -f
User service not starting after reboot
Cause: Linger not enabled
Fix: loginctl enable-linger $(whoami)
Best Practices for Multi-Agent Deployments
- Always enable linger on first setup - document this in deployment guide
- Use hybrid architecture - user-level for agents, system-level for monitors
- Set environment variables in startup scripts, not just shell config
- Test after reboot - verify services auto-start correctly
- Document in MEMORY.md - share lessons across agent instances
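Setting the bus variables inside the startup code, as recommended above, means a script never depends on ~/.bashrc having been sourced. The sketch below shows one way to do that from Python; `user_bus_env` and `systemctl_user` are hypothetical helpers, and the paths simply mirror the exports listed in this section.

```python
import os
import subprocess


def user_bus_env(uid=None, base=None):
    """Return a copy of the environment with the user-bus variables set."""
    uid = os.getuid() if uid is None else uid
    env = dict(os.environ if base is None else base)   # copy, never mutate the caller's dict
    runtime_dir = f"/run/user/{uid}"
    env["XDG_RUNTIME_DIR"] = runtime_dir
    env["DBUS_SESSION_BUS_ADDRESS"] = f"unix:path={runtime_dir}/bus"
    return env


def systemctl_user(*args, uid=None):
    """Run `systemctl --user <args>` with the bus variables set explicitly."""
    return subprocess.run(
        ["systemctl", "--user", *args],
        env=user_bus_env(uid),
        capture_output=True,
        text=True,
        check=False,
    )
```

For instance, `systemctl_user("is-active", "openclaw-gateway").stdout.strip()` would report the gateway state even when called from cron or another non-login context, provided linger is enabled.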
Updated deploy.sh Features
- Automatically enables linger during install
- Sets up XDG_RUNTIME_DIR and DBUS_SESSION_BUS_ADDRESS
- Uses systemctl --user for gateway, systemctl for monitor
- Health check verifies linger status and runtime directory
- Proper log commands for both service types
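Collection Name Unified to mem0_v4_shared (2026-02-27)
Background
The earlier configuration used inconsistent Collection names:
- Actually used by the code: mem0_global_v4
- Specified by the user and recorded in the docs: mem0_v4_shared (shared by Dr. Chen and Master Zhang)
Decision
Director Wang gave an explicit instruction: every Collection uses mem0_v4_shared, and key configuration must not be changed arbitrarily.
Files Modified
- skills/mem0-integration/mem0_client.py
- skills/mem0-integration/config.yaml
- skills/mem0-integration/skill.json
- skills/mem0-integration/config-life.yaml
- agents/life-agent.json
- agents/life-workspace/skills/mem0-integration/config.yaml
- docs/SYSTEM_ARCHITECTURE.md
Verification Results
- ✅ Gateway restarted successfully (systemd service healthy)
- ✅ Qdrant Collection mem0_v4_shared created
- ✅ Vector dimension: 1024 (text-embedding-v4)
- ✅ Distance metric: Cosine
- ✅ Metadata indexes: user_id, agent_id, actor_id, run_id
- ✅ Embedding billing channel: Bailian standard billing
Operation Log
/root/.openclaw/workspace/logs/operations/2026-02-27-08-55-00-unify-collection-name.log
Key Principles
- Changes to key configuration (Collection name, Embedding model, billing channel) require user confirmation
- All Agents share one Collection; logical isolation is achieved through metadata.agent_id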
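Logical isolation inside one shared Collection comes down to attaching an agent filter to every query. The sketch below builds a filter in the shape Qdrant's filtering API accepts (`must` conditions with `key` and `match`). It assumes `agent_id` is stored as a top-level payload key, as the metadata index list suggests; if the payload nests it under `metadata`, the key would need adjusting. The function names and the agent IDs in the example are hypothetical.

```python
def agent_filter(agent_id, user_id=None):
    """Build a Qdrant-style filter restricting results to one agent (and optionally one user)."""
    must = [{"key": "agent_id", "match": {"value": agent_id}}]
    if user_id is not None:
        must.append({"key": "user_id", "match": {"value": user_id}})
    return {"must": must}


def visible_to(agent_id, points):
    """In-memory equivalent of the filter, handy for unit tests."""
    return [p for p in points if p.get("payload", {}).get("agent_id") == agent_id]
```

Because nothing in the Collection itself enforces the boundary, a query issued without this filter would see every agent's memories, which is why the filter belongs in the shared client rather than in each caller.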
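Eason's Working Principles (2026-03-07)
- Duty to think proactively: as the maintainer of the Agent network, Eason is obliged to proactively identify security risks, optimization opportunities, and best practices, and to propose improvements
- Important changes need approval: anything involving security configuration, architecture changes, or permission changes must be raised with Director Wang first and executed only after confirmation
- Say "we", not "you": we are one team working together. Say "our best practices", not "your best practices"
Boundaries
- ✅ Do: audit proactively, find problems, propose solutions, execute approved operations
- ❌ Do not: change key configuration without authorization, make decisions on the user's behalf, speak in an outsider's tone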
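Agent Deployment Best Practices (added 2026-03-07)
Skill/Plugin File Conventions
Problem: While configuring Tavily for Tongge (桐哥), a skill.json was created but OpenClaw requires openclaw.plugin.json, which made the service crash and restart 38 times.
Lesson:
| File | Purpose | Required | Naming |
|---|---|---|---|
| openclaw.plugin.json | OpenClaw plugin manifest | ✅ Required | Fixed name |
| skill.json | Clawhub skill metadata | ❌ Optional | Fixed name |
| index.js | Plugin/tool implementation | ✅ Required | Fixed name |
| SKILL.md | Skill documentation | ✅ Recommended | Fixed name |
Checklist (when adding an Agent):
- Plugin structure
  - openclaw.plugin.json created (not skill.json)
  - index.js implements the tool/plugin logic
  - Plugin path added to plugins.load.paths
  - Plugin enabled in plugins.entries
- Configuration verification
  - Run openclaw --profile <agent> doctor to validate the configuration
  - Run openclaw --profile <agent> status to check service status
  - Check the logs: journalctl --user -u openclaw-gateway-<agent> -n 20
- Skill enablement
  - skills.entries.<skill>.enabled: true
  - Environment variables configured (e.g. API key)
  - Plugin dependencies loaded
Wrong examples (do not do this):
❌ Creating only skill.json, without openclaw.plugin.json
❌ Restarting the service without validating the configuration
❌ Continuing to change things after a crash without reading the logs
Correct flow:
1. Create the skill files (openclaw.plugin.json + index.js)
2. Configure plugins.load.paths and plugins.entries in openclaw.json
3. Run openclaw doctor to validate the configuration
4. Restart the service and check its status
5. Read the logs to confirm the plugin loaded successfully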
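The file requirements in the table above are easy to check before any restart. The following is a pre-flight sketch, with hypothetical names, that only verifies file presence and that the manifest parses as JSON. It does not know the manifest schema and is not a substitute for openclaw doctor.

```python
import json
from pathlib import Path

REQUIRED = ("openclaw.plugin.json", "index.js")
RECOMMENDED = ("SKILL.md",)


def check_plugin_dir(plugin_dir):
    """Return (errors, warnings) for a plugin directory."""
    root = Path(plugin_dir)
    errors = [
        f"missing required file: {name}"
        for name in REQUIRED
        if not (root / name).is_file()
    ]
    warnings = [
        f"missing recommended file: {name}"
        for name in RECOMMENDED
        if not (root / name).is_file()
    ]
    manifest = root / "openclaw.plugin.json"
    if manifest.is_file():
        try:
            json.loads(manifest.read_text(encoding="utf-8"))
        except json.JSONDecodeError as exc:
            errors.append(f"openclaw.plugin.json is not valid JSON: {exc.msg}")
    return errors, warnings
```

A directory holding only skill.json, the situation behind the 38 restarts, would produce two errors here before the service is ever touched.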
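Configuration Change Principles
- Validate before restarting: validate the configuration with the doctor command; do not restart directly
- Read the logs before fixing: after a crash, check the error with journalctl first, then make a targeted fix
- Iterate in small steps: change one setting at a time and continue only after it validates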
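Timezone Configuration (2026-03-07)
All Agents use the Hong Kong timezone (Asia/Hong_Kong, UTC+8)
- Eason (main Agent): Hong Kong timezone
- Tongge (桐哥): Hong Kong timezone
- Schedule: work 07:00-23:00, rest 23:00-07:00 (Hong Kong time)
- Cron trigger: fires every hour; the script checks the Hong Kong time internally
Conversions:
- Hong Kong 07:00 = UTC 23:00 (previous day)
- Hong Kong 23:00 = UTC 15:00
- Hong Kong 13:00 = UTC 05:00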
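The conversions above, and the in-script work-hours check that the hourly cron relies on, can be sketched as follows. The helper names are hypothetical. A fixed UTC+8 offset is used because Hong Kong does not currently observe daylight saving time; `zoneinfo.ZoneInfo("Asia/Hong_Kong")` would be the alternative where tzdata is available. Treating 23:00 itself as rest time is an assumption about the boundary.

```python
from datetime import datetime, timedelta, timezone

HKT = timezone(timedelta(hours=8), "HKT")


def hk_to_utc(year, month, day, hour, minute=0):
    """Convert a Hong Kong wall-clock time to an aware UTC datetime."""
    return datetime(year, month, day, hour, minute, tzinfo=HKT).astimezone(timezone.utc)


def is_work_hour(now_utc, start=7, end=23):
    """True when `now_utc` falls inside the work window in Hong Kong time."""
    return start <= now_utc.astimezone(HKT).hour < end
```

An hourly job would call `is_work_hour(datetime.now(timezone.utc))` and exit early when it returns `False`.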
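Security Audit False-Positive Analysis (2026-02-26)
Background
Running openclaw security audit --deep reported 4 CRITICAL/WARNING issues; manual review confirmed them as false positives or known trade-offs.
Findings and Reasons
| Audit item | Original rating | Review conclusion | Root cause |
|---|---|---|---|
| Gateway bound to lan | CRITICAL | False positive | The audit tool statically analyzes the config file and cannot detect the runtime binding to Tailscale (100.115.94.1) |
| Device authentication disabled | CRITICAL | Known trade-off | Works around isSecureContext=false under HTTP; protected by both Tailscale and token |
| No plugin allowlist | WARNING | Fix recommended | Confirmed as not fixing for now (low cost but limited benefit) |
| No rate limiting | WARNING | Threat model mismatch | Closed Tailscale network plus a strong 48-character token; brute-force risk is close to zero |
| MemoryLimit deprecated | WARNING | False positive | The audit referenced the workspace template; the actual service file has no such parameter |
Core Lessons
- A security audit is static analysis - it cannot replace human judgment and must be combined with runtime context
- Understand the threat model - the threat scenarios the audit assumes must match the actual deployment environment
- Record known trade-offs - document in MEMORY.md why certain "security issues" are accepted
Detailed Documents
- Audit report: logs/operations/2026-02-26-20-59-30-config-audit-report.md
- Review analysis: logs/operations/2026-02-26-21-05-00-security-audit-review.md
- Fix script: fix-security-config.sh (not executed)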
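Configuration Cleanup and Push (2026-02-26)
Operations
- Deleted the obsolete life/ directory (empty config, not referenced by any file)
- Cleaned up nested git repositories (agents/life-workspace/.git, skills/openclaw-wecom/.git)
- Removed Python caches and runtime state files
- Committed and pushed to the remote repository
Git Commit
- Commit: 378523c chore: 配置审计和清理 - 2026-02-26 (configuration audit and cleanup)
- Remote: gl.tigerone.tech:sw_dm/openClaw_agent_dm.git
- Backup: /root/.openclaw/backups/workspace-20260226-210956.tar.gz
Directories Kept
- agents/life-workspace/ - Test Agent workspace
- skills/openclaw-wecom/ - WeCom (企业微信) skill (TypeScript implementation)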
System Extension Architecture Upgrade Completed (2026-03-03 17:02 UTC)
All 6 Core Tasks Completed
Task 1 - Environment Variable Persistence
- Files: systemd/gateway.env, systemd/life-gateway.env
- Permissions: chmod 600 (readable by root only)
- Notes: independent of the .service files, so OpenClaw UI upgrades do not overwrite them
- Contents:
  MEM0_DASHSCOPE_API_KEY=sk-[REDACTED]
  OPENAI_API_BASE=https://dashscope.aliyuncs.com/compatible-mode/v1
  OPENAI_BASE_URL=https://dashscope.aliyuncs.com/compatible-mode/v1
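Since the env files hold credentials, a health check could confirm that they stay owner-only and still define the expected keys after an upgrade. This is a sketch with hypothetical helper names; the parser handles plain KEY=value lines only and ignores the quoting rules that systemd's EnvironmentFile= also accepts.

```python
import os
import stat


def is_owner_only(path):
    """True when group and others have no permissions on `path` (e.g. mode 600)."""
    mode = stat.S_IMODE(os.stat(path).st_mode)
    return (mode & 0o077) == 0


def parse_env_file(text):
    """Parse simple KEY=value lines, skipping blanks and # comments."""
    values = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")   # split on the first "=" only
        values[key.strip()] = value.strip()
    return values
```

A check such as `is_owner_only("systemd/gateway.env")` returning `False` would be worth logging under logs/security/, with the key names reported but never the values.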
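Task 2 - Agent Monitor Fixes (4 Bugs)
- Restart limit: integrated into monitorOpenClawService() via handleServiceDown(); the infinite restart loop is fixed
- Life Agent monitoring: gateway and openclaw-gateway-life.service are now both checked every 30 seconds
- Heartbeat log: outputs gateway=OK, life=OK every 10 minutes
- Upgrade tolerance: after a service stop is first detected, the monitor waits 60 seconds (grace period) to avoid false alarms during upgrades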
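Task 3 - Systemd Service Upgrade
- Template update: deprecated MemoryLimit= replaced with MemoryMax=
- Monitor sync: template synced to /etc/systemd/system/
- Environment variable injection: EnvironmentFile= added to both user-level service files
- Legacy service: the old system-level openclaw-gateway.service disabled and masked
- Status: all 3 services restarted and confirmed active
Task 4 - deploy.sh Enhancements
- New commands:
  - debug-stop: safely stops the monitor so it does not auto-restart services during debugging
  - debug-start: restores all services after debugging
  - fix-service: re-injects EnvironmentFile= after a UI upgrade
- Life Agent integration: start/stop/restart/status/logs/health/install all support the life agent
Task 5 - Unified Architecture Documentation
- File: docs/EXTENSIONS_ARCHITECTURE.md
- Contents: service architecture, monitoring system, memory system cross-references, environment variables, debugging workflow, upgrade safety checklist
Task 6 - CORE_INDEX.md Update
- File tree: added .env files, .legacy renames, new documents
- Starred reference: EXTENSIONS_ARCHITECTURE.md listed as a key reference
- Upgrade guide: upgrade safety instructions added to the model usage guide
- Management commands: deploy.sh command list updated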
Current System Status (2026-03-04 03:32 UTC)
● openclaw-gateway.service Active: active (running) 10h ago
● openclaw-gateway-life.service Active: active (running) 10h ago
● openclaw-agent-monitor.service Active: active (running) 10h ago
Monitor heartbeat log is normal: outputs gateway=OK, life=OK every 10 minutes
Upgrade Safety Procedure
# Run after an OpenClaw UI upgrade
./deploy.sh fix-service # Re-inject EnvironmentFile=
./deploy.sh restart # Restart all services
./deploy.sh health # Verify health status
Key Documents
- Extensions architecture: docs/EXTENSIONS_ARCHITECTURE.md - required reading before modifying infrastructure
- Memory system: docs/MEMORY_ARCHITECTURE.md - detailed design of the four-layer memory system
- Monitor script: agent-monitor.js - health monitoring and auto-repair logic