MEMORY.md - Long-term Memory

This file contains curated long-term memories and important context.

Memory Management Strategy

MEMORY.md: Curated long-term memories, important decisions, security templates, and key configurations
QMD System: Automated memory backend with semantic search, auto-updates every 5 minutes
Usage: Write significant learnings to MEMORY.md; rely on QMD for daily context and automation
Access: MEMORY.md loaded only in main sessions (direct chats) for security

QMD Configuration

Backend: qmd
Auto-update: every 5 minutes
Include default memory: true
Last verified: 2026-02-20

Server Security Hardening Template (2026-02-20)

Environment

Server: Ubuntu 24.04 LTS VPS (KVM)
Panel: 宝塔面板 (BT-Panel) on port 888
Public IP: 204.12.203.203

Security Configuration Applied

Port Exposure Minimization:
- Only ports 80 (HTTP) and 443 (HTTPS) publicly accessible
- SSH (port 22) restricted to internal network access only
- OpenClaw gateway (port 18789) bound to localhost only
- All other services (MySQL, custom apps) internal-only
OpenClaw Secure Deployment:
- Gateway configured with bind: "localhost" instead of "lan"
- Access exclusively through Nginx reverse proxy with HTTPS
- Token-based authentication enabled
- WebSocket support properly configured in Nginx
Firewall Management:
- Use 宝塔面板 (BT-Panel) built-in firewall for port management
- Alternative: system-level firewall (ufw/iptables) if no panel available
- Regular external port scanning to verify exposure
Critical Security Principles:
- Never expose sensitive services directly to public internet
- Always use reverse proxy with TLS termination for web services
- Implement defense in depth (firewall + service binding + authentication)
- Regular security audits using openclaw security audit --deep
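
The external port scan mentioned under Firewall Management can be approximated with a few lines of Python. This is a minimal sketch, not the tooling actually used on this server: the function names `open_ports` and `unexpected_ports` are hypothetical, and the check only covers TCP. It must be run from a machine outside the server's network, otherwise it tests local reachability rather than public exposure.

```python
import socket


def open_ports(host, ports, timeout=1.0):
    """Return the subset of `ports` on `host` that accept a TCP connection."""
    reachable = []
    for port in ports:
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
            sock.settimeout(timeout)
            # connect_ex returns 0 on success instead of raising an exception
            if sock.connect_ex((host, port)) == 0:
                reachable.append(port)
    return reachable


def unexpected_ports(host, ports, allowed=(80, 443)):
    """Return reachable ports that are not in the allowed public set."""
    return [p for p in open_ports(host, ports) if p not in allowed]
```

A call such as `unexpected_ports("204.12.203.203", [22, 80, 443, 888, 3306, 18789])` should return an empty list if the template above is applied correctly; any port it returns is exposed and needs a firewall or binding fix.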

Migration Checklist for New Servers

Install and configure 宝塔面板 or equivalent server management panel
Set up Nginx reverse proxy with proper WebSocket support
Configure OpenClaw with localhost binding only
Restrict public ports to 80/443 only via firewall
Enable automatic security updates
Run initial security audit and document baseline
Schedule periodic security audits via OpenClaw cron

Lessons Learned

Panel-based firewalls (宝塔/aapanel) must be verified with external port scans
Direct service exposure (like OpenClaw on 0.0.0.0) creates critical security risks
Nginx reverse proxy configuration is essential for secure OpenClaw deployment

Agent Operations Logging Practice (2026-02-20)

Log Directory Structure

/root/.openclaw/workspace/logs/operations/ - Manual operations and important changes
/root/.openclaw/workspace/logs/system/ - System-generated logs
/root/.openclaw/workspace/logs/agents/ - Individual agent logs
/root/.openclaw/workspace/logs/security/ - Security operations and audits

Automatic Logging Triggers

Configuration Changes: Any modification to config files (.json, .yaml, etc.)
Security Modifications: Firewall rules, authentication changes, port modifications
Agent Lifecycle: Deployment, updates, removal of agents
System Optimizations: Performance tuning, resource allocation changes
Troubleshooting: Error diagnosis and resolution procedures
Memory Updates: Significant changes to MEMORY.md or memory management

Log Format Standard

Filename: YYYY-MM-DD-HH-MM-SS-description.log
Timestamp: UTC time format
Content: [TIMESTAMP] [OPERATION_TYPE] [AGENT/USER] Description with before/after state
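
The format standard above can be sketched as two small helpers. The names `log_filename` and `log_line` are hypothetical, and the ISO-style `YYYY-MM-DDTHH:MM:SSZ` rendering of `[TIMESTAMP]` is an assumption, since the standard only says "UTC time format".

```python
from datetime import datetime, timezone


def log_filename(description, now=None):
    """Build a YYYY-MM-DD-HH-MM-SS-description.log filename in UTC."""
    now = (now or datetime.now(timezone.utc)).astimezone(timezone.utc)
    return f"{now:%Y-%m-%d-%H-%M-%S}-{description}.log"


def log_line(operation_type, actor, message, now=None):
    """Build one '[TIMESTAMP] [OPERATION_TYPE] [AGENT/USER] message' line."""
    now = (now or datetime.now(timezone.utc)).astimezone(timezone.utc)
    stamp = now.strftime("%Y-%m-%dT%H:%M:%SZ")  # assumed timestamp style
    return f"[{stamp}] [{operation_type}] [{actor}] {message}"
```

Pass timezone-aware datetimes; a naive value would be interpreted as local time before conversion to UTC.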

Implementation Guidelines

Always log before making changes (capture current state)
Include rollback instructions when applicable
Redact sensitive information (passwords, tokens, private keys)
Reference related MEMORY.md entries for context
Use QMD for routine operational context, MEMORY.md for strategic decisions

Agent Health Monitoring & Alerting System (2026-02-20)

Features Implemented

Crash Detection: Monitors uncaught exceptions and unhandled rejections
Health Checks: Periodic service health verification (every 30 seconds)
Multi-Channel Notifications: Telegram alerts for critical events
Automatic Logging: All alerts logged to /logs/agents/health-YYYY-MM-DD.log
Extensible Design: Easy to add new notification channels

Components Created

Skill: agent-monitor/SKILL.md - Documentation and usage guide
Monitor Script: agent-monitor.js - Core monitoring logic
Startup Script: start-agent-monitor.sh - Easy deployment
Log Directory: /logs/agents/ - Dedicated logging location

Alert Severity Levels

CRITICAL: Process crashes, uncaught exceptions
ERROR: Unhandled rejections, failed operations
WARNING: Health check failures, performance issues
INFO: Service status updates, recovery notifications

Integration Points

Automatically integrated with existing Telegram channel
Compatible with OpenClaw's agent architecture
Works alongside existing logging and memory systems
Can monitor any Node.js-based agent process

Usage Instructions

Source the startup script: source /root/.openclaw/workspace/start-agent-monitor.sh
Call startAgentMonitor("agent-name", healthCheckFunction)
Monitor automatically sends alerts on errors/crashes
Check logs in /logs/agents/ for detailed information

Complete System Architecture Upgrade (2026-02-20 14:25 UTC)

✅ All 5 Core Requirements Implemented

1. System-Level Persistence ✓

Systemd Services: openclaw-gateway.service + openclaw-agent-monitor.service
Auto-start on Boot: Both services enabled in multi-user.target
Resource Limits: Memory (2G/512M), CPU (80%/20%), watchdog timers
Status: systemctl status openclaw-gateway / systemctl status openclaw-agent-monitor

2. Auto-Healing ✓

Crash Detection: Monitors process exits, signals, uncaught exceptions
Auto-Restart: Systemd Restart=always + monitor script restart logic
Restart Limits: Max 5 restarts per 5 minutes (prevents restart loops)
Health Checks: Every 30 seconds, automatic recovery on failure
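
The restart limit (max 5 restarts per 5 minutes) is a sliding-window counter. The sketch below illustrates that idea in Python; it is not the code in agent-monitor.js, and the class name `RestartLimiter` and the injectable `clock` parameter are assumptions made for illustration and testing.

```python
import time
from collections import deque


class RestartLimiter:
    """Allow at most `max_restarts` restarts within `window_seconds`."""

    def __init__(self, max_restarts=5, window_seconds=300, clock=time.monotonic):
        self.max_restarts = max_restarts
        self.window_seconds = window_seconds
        self.clock = clock
        self._stamps = deque()  # times of recent restarts, oldest first

    def allow(self):
        """Record and permit a restart, or return False if the limit is hit."""
        now = self.clock()
        # Drop restarts that have aged out of the window
        while self._stamps and now - self._stamps[0] >= self.window_seconds:
            self._stamps.popleft()
        if len(self._stamps) >= self.max_restarts:
            return False
        self._stamps.append(now)
        return True
```

On the systemd side, the same policy is normally expressed with `StartLimitBurst=` and `StartLimitIntervalSec=`; whether the unit files here set them is not stated in this document.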

3. Multi-Layer Memory Architecture ✓

Core Memory: CORE_INDEX.md - Identity, structure, file index (always loaded first)
Long-term Memory: MEMORY.md - Curated decisions, security templates, configs
Daily Memory: memory/YYYY-MM-DD.md - Raw conversation logs (auto-saved)
Passive Archive: On-demand conversion of valuable conversations to skills/notes
Git Integration: All memory files tracked with version history

4. Git One-Click Rollback ✓

Repository: /root/.openclaw/workspace (already initialized)
Deploy Script: ./deploy.sh rollback - Rollback to previous commit
Specific Rollback: ./deploy.sh rollback-to <commit> - Rollback to specific commit
Auto-Backup: Backup created before rollback
Service Restart: Automatic service restart after rollback

5. Telegram Notifications ✓

Triggers: Service stop, error, crash, restart events
Channels: Telegram (via bot API) + OpenClaw message tool
Severity Levels: CRITICAL, ERROR, WARNING, INFO with emoji indicators
Logging: All notifications logged to /logs/agents/health-YYYY-MM-DD.log

📋 Management Commands (deploy.sh)

./deploy.sh install    # Install & start all systemd services
./deploy.sh start      # Start all services
./deploy.sh stop       # Stop all services
./deploy.sh restart    # Restart all services
./deploy.sh status     # Show detailed service status
./deploy.sh logs       # Show recent logs (last 50 lines)
./deploy.sh health     # Run comprehensive health check
./deploy.sh backup     # Create timestamped backup
./deploy.sh rollback   # Rollback to previous git commit
./deploy.sh rollback-to <commit>  # Rollback to specific commit
./deploy.sh help       # Show help message

🔧 Systemd Service Details

Gateway Service: /etc/systemd/system/openclaw-gateway.service
- Memory limit: 2G, CPU: 80%, Watchdog: 30s
- Restart: always, RestartSec: 10s
- Logs: journalctl -u openclaw-gateway -f
Monitor Service: /etc/systemd/system/openclaw-agent-monitor.service
- Memory limit: 512M, CPU: 20%
- Restart: always, RestartSec: 5s
- Logs: journalctl -u openclaw-agent-monitor -f

📊 Health Check Metrics

Gateway service status (active/inactive)
Agent monitor status (active/inactive)
Disk usage (warning at 80%)
Memory usage (warning at 80%)
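
The 80% warning threshold can be illustrated as follows. This is a sketch under assumptions: the real check lives in deploy.sh, the names `percent_used` and `disk_status` are hypothetical, and treating exactly 80% as a warning is a guess at the boundary.

```python
import shutil

WARN_PERCENT = 80.0  # threshold taken from the metrics list above


def percent_used(used, total):
    """Return usage as a percentage, guarding against a zero total."""
    return 100.0 * used / total if total else 0.0


def disk_status(path="/"):
    """Return ('OK' or 'WARNING', percent used) for the filesystem at `path`."""
    usage = shutil.disk_usage(path)
    pct = percent_used(usage.used, usage.total)
    return ("WARNING" if pct >= WARN_PERCENT else "OK", round(pct, 1))
```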

🎯 Next Steps (Future Enhancements)

Add Prometheus/Grafana monitoring dashboard
Implement log rotation and archival
Add email notifications as backup channel
Create web-based admin dashboard
Add automated security scanning in CI/CD

User-Level vs System-Level Systemd Services - Critical Lesson (2026-02-20 14:35 UTC)

Problem Discovered

Initial deployment used system-level systemd services (/etc/systemd/system/) for OpenClaw Gateway, but OpenClaw natively uses user-level systemd (~/.config/systemd/user/). This caused:

Service restart loops (5 attempts then failure)
Error: systemctl --user unavailable: Failed to connect to bus: No medium found
Conflicts between system and user service definitions

Root Cause

OpenClaw Gateway is designed as a user-level service because:

It runs under the user's context, not root
It needs access to user-specific config (~/.openclaw/)
User-level services have different environment requirements

Solution: Hybrid Architecture

User-Level Service (Gateway)

Location: ~/.config/systemd/user/openclaw-gateway.service

Required Setup:

# Enable linger (CRITICAL - allows user services to run without login session)
loginctl enable-linger $(whoami)

# Set environment variables
export XDG_RUNTIME_DIR=/run/user/$(id -u)
export DBUS_SESSION_BUS_ADDRESS="unix:path=/run/user/$(id -u)/bus"

Management Commands:

systemctl --user status openclaw-gateway
systemctl --user start openclaw-gateway      # also: stop, restart
journalctl --user -u openclaw-gateway -f

System-Level Service (Agent Monitor)

Location: /etc/systemd/system/openclaw-agent-monitor.service
Purpose: Independently monitor the gateway (survives user session issues)

Management Commands:

systemctl status openclaw-agent-monitor
systemctl start openclaw-agent-monitor       # also: stop, restart
journalctl -u openclaw-agent-monitor -f

Deployment Checklist for New Servers

# 1. Enable user linger (MUST DO FIRST)
loginctl enable-linger $(whoami)

# 2. Create runtime directory if needed
mkdir -p /run/user/$(id -u)
chmod 700 /run/user/$(id -u)

# 3. Export environment (add to ~/.bashrc for persistence)
echo 'export XDG_RUNTIME_DIR=/run/user/$(id -u)' >> ~/.bashrc
echo 'export DBUS_SESSION_BUS_ADDRESS=unix:path=/run/user/$(id -u)/bus' >> ~/.bashrc

# 4. Install services
./deploy.sh install

# 5. Verify
./deploy.sh health

Troubleshooting Guide

Error: "Failed to connect to bus: No medium found"

Cause: User linger not enabled or environment variables not set
Fix:

loginctl enable-linger $(whoami)
export XDG_RUNTIME_DIR=/run/user/$(id -u)
export DBUS_SESSION_BUS_ADDRESS="unix:path=/run/user/$(id -u)/bus"

Error: "Start request repeated too quickly"

Cause: Service crashing due to misconfiguration
Fix: Check logs with journalctl --user -u openclaw-gateway -f

User service not starting after reboot

Cause: Linger not enabled
Fix: loginctl enable-linger $(whoami)

Best Practices for Multi-Agent Deployments

Always enable linger on first setup - document this in deployment guide
Use hybrid architecture - user-level for agents, system-level for monitors
Set environment variables in startup scripts, not just shell config
Test after reboot - verify services auto-start correctly
Document in MEMORY.md - share lessons across agent instances

Updated deploy.sh Features

Automatically enables linger during install
Sets up XDG_RUNTIME_DIR and DBUS_SESSION_BUS_ADDRESS
Uses systemctl --user for gateway, systemctl for monitor
Health check verifies linger status and runtime directory
Proper log commands for both service types
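
The linger and runtime-directory checks described above can be sketched in Python. This is an illustration only, not the logic in deploy.sh; the function names are hypothetical. It relies on `loginctl show-user <user> --property=Linger`, which prints `Linger=yes` or `Linger=no`.

```python
import os
import subprocess


def parse_linger(output):
    """Parse `loginctl show-user --property=Linger` output such as 'Linger=yes'."""
    for line in output.splitlines():
        key, _, value = line.partition("=")
        if key.strip() == "Linger":
            return value.strip() == "yes"
    return False


def linger_enabled(user):
    """True when loginctl reports Linger=yes for `user`."""
    result = subprocess.run(
        ["loginctl", "show-user", user, "--property=Linger"],
        capture_output=True, text=True, check=False,
    )
    # loginctl exits non-zero for a user who is neither logged in nor lingering
    return result.returncode == 0 and parse_linger(result.stdout)


def runtime_dir_ok(uid=None):
    """True when /run/user/<uid> exists, which systemctl --user needs."""
    uid = os.getuid() if uid is None else uid
    return os.path.isdir(f"/run/user/{uid}")
```

If either check fails, the "Failed to connect to bus" error from the troubleshooting guide is the likely symptom.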

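Collection Name Unified to mem0_v4_shared (2026-02-27)

Background

The configuration previously used inconsistent Collection names:

Actually used in code: mem0_global_v4
Specified by the user and recorded in docs: mem0_v4_shared (shared by 陈医生 and 张大师)

Decision

王院长 gave explicit instructions: all Collections use mem0_v4_shared, and key configuration must not be changed arbitrarily.

Files Modified

skills/mem0-integration/mem0_client.py
skills/mem0-integration/config.yaml
skills/mem0-integration/skill.json
skills/mem0-integration/config-life.yaml
agents/life-agent.json
agents/life-workspace/skills/mem0-integration/config.yaml
docs/SYSTEM_ARCHITECTURE.md

Verification Results

✅ Gateway restarted successfully (systemd service healthy)
✅ Qdrant Collection mem0_v4_shared created
✅ Vector dimension: 1024 (text-embedding-v4)
✅ Distance metric: Cosine
✅ Metadata indexes: user_id, agent_id, actor_id, run_id
✅ Embedding billing channel: Bailian standard billing

Operation Log

/root/.openclaw/workspace/logs/operations/2026-02-27-08-55-00-unify-collection-name.log

Key Principles

Changes to key configuration (Collection name, Embedding model, billing channel) must be confirmed by the user
All Agents share one Collection; logical isolation is implemented through metadata.agent_id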
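
Logical isolation inside one shared Collection comes down to always attaching an `agent_id` condition to every query. The sketch below builds a filter in Qdrant's JSON filter shape using plain dictionaries. It is not taken from mem0_client.py: the helper names are hypothetical, and it assumes `agent_id` and `user_id` are stored as top-level payload keys, matching the indexed fields listed above.

```python
COLLECTION = "mem0_v4_shared"
VECTOR_SIZE = 1024  # dimension reported for text-embedding-v4 above


def agent_filter(agent_id, user_id=None):
    """Build a Qdrant-style filter restricting results to one agent."""
    must = [{"key": "agent_id", "match": {"value": agent_id}}]
    if user_id is not None:
        must.append({"key": "user_id", "match": {"value": user_id}})
    return {"must": must}


def search_body(vector, agent_id, limit=5):
    """Build a search request body that can never omit the agent filter."""
    if len(vector) != VECTOR_SIZE:
        raise ValueError(f"expected {VECTOR_SIZE} dimensions, got {len(vector)}")
    return {
        "vector": vector,
        "filter": agent_filter(agent_id),
        "limit": limit,
        "with_payload": True,
    }
```

Routing every query through one builder like this keeps one agent from reading another agent's memories by accident, since the filter cannot be left out.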

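Eason's Working Principles (2026-03-07)

Duty to think proactively: as maintainer of the Agent network, Eason is obliged to look for security risks, optimization opportunities, and best practices, and to propose improvements
Important changes need approval: anything involving security configuration, architecture changes, or permission changes must first be raised with 王院长 and executed only after confirmation
Say "we", not "you": we are one team working together. Say "our best practices", not "your best practices"

Boundaries

✅ Do: audit proactively, find problems, propose solutions, execute approved operations
❌ Don't: change key configuration without authorization, make decisions for the user, speak in an outsider's tone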

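Agent Deployment Best Practices (added 2026-03-07)

Skill/Plugin File Conventions

Problem: While configuring Tavily for 桐哥, a skill.json was created, but OpenClaw requires openclaw.plugin.json. The service crashed and restarted 38 times.

Lessons:

File type	Purpose	Required	Naming
`openclaw.plugin.json`	OpenClaw plugin manifest	✅ Required	Fixed name
`skill.json`	Clawhub skill metadata	❌ Optional	Fixed name
`index.js`	Plugin/tool implementation	✅ Required	Fixed name
`SKILL.md`	Skill documentation	✅ Recommended	Fixed name

Checklist (when adding an Agent):

Plugin structure
- openclaw.plugin.json created (not skill.json)
- index.js implements the tool/plugin logic
- Plugin path added to plugins.load.paths
- Plugin enabled in plugins.entries
Configuration validation
- Run openclaw --profile <agent> doctor to validate the configuration
- Run openclaw --profile <agent> status to check service status
- Check logs with journalctl --user -u openclaw-gateway-<agent> -n 20
Skill enablement
- skills.entries.<skill>.enabled: true
- Environment variables configured (e.g. API Key)
- Plugin dependencies loaded

Anti-patterns (do not do this):

❌ Creating only skill.json without openclaw.plugin.json
❌ Restarting the service without validating the configuration
❌ Continuing to edit after a crash without checking the logs

Correct workflow:

1. Create the skill files (openclaw.plugin.json + index.js)
2. Configure plugins.load.paths and plugins.entries in openclaw.json
3. Run openclaw doctor to validate the configuration
4. Restart the service and check its status
5. Check the logs to confirm the plugin loaded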
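
The file-level part of the checklist can be verified before any restart. The following is a minimal pre-flight sketch, assuming only what the table above states (required file names); the function `plugin_problems` is hypothetical and does not replace `openclaw doctor`, which validates the configuration itself.

```python
import json
from pathlib import Path

REQUIRED_FILES = ("openclaw.plugin.json", "index.js")


def plugin_problems(plugin_dir):
    """Return human-readable problems found in a plugin directory."""
    root = Path(plugin_dir)
    problems = [
        f"missing {name}" for name in REQUIRED_FILES if not (root / name).is_file()
    ]
    manifest = root / "openclaw.plugin.json"
    if manifest.is_file():
        try:
            json.loads(manifest.read_text(encoding="utf-8"))
        except json.JSONDecodeError as exc:
            problems.append(f"openclaw.plugin.json is not valid JSON: {exc.msg}")
    elif (root / "skill.json").is_file():
        # The exact mistake that caused the 38 restarts
        problems.append("skill.json is present but does not replace openclaw.plugin.json")
    return problems
```

An empty list means the directory passes this file-level check; the doctor, status, and log steps still apply afterwards.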

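Configuration Change Principles

Validate before restarting: use the doctor command to validate the configuration; do not restart directly
Read logs before fixing: after a crash, check the error with journalctl first, then apply a targeted fix
Iterate in small steps: change one setting at a time and continue only after validation passes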

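Timezone Configuration (2026-03-07)

All Agents use the Hong Kong timezone (Asia/Hong_Kong, UTC+8)

Eason (primary Agent): Hong Kong timezone
桐哥: Hong Kong timezone
Schedule: work 07:00-23:00, rest 23:00-07:00 (Hong Kong time)
Cron trigger: fires hourly; the script checks the Hong Kong time internally

Conversions:

Hong Kong 07:00 = UTC 23:00 (previous day)
Hong Kong 23:00 = UTC 15:00
Hong Kong 13:00 = UTC 05:00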
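
The in-script check that an hourly cron trigger would perform can be sketched as follows. This is an illustration, not the actual script: `is_work_hour` is a hypothetical name, the fixed UTC+8 offset stands in for Asia/Hong_Kong (which currently observes no daylight saving time), and treating 07:00 as inclusive and 23:00 as exclusive is an assumption.

```python
from datetime import datetime, timedelta, timezone

HKT = timezone(timedelta(hours=8), "HKT")  # fixed offset, assumed equal to Asia/Hong_Kong
WORK_START, WORK_END = 7, 23               # working hours in Hong Kong time


def is_work_hour(now_utc):
    """True when the timezone-aware `now_utc` falls inside 07:00-23:00 Hong Kong time."""
    local = now_utc.astimezone(HKT)
    return WORK_START <= local.hour < WORK_END
```

For example, 23:00 UTC maps to 07:00 Hong Kong time on the next day and counts as working time, while 15:00 UTC maps to 23:00 and does not.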

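Security Audit False-Positive Analysis (2026-02-26)

Background

Running openclaw security audit --deep reported 4 CRITICAL/WARNING issues; manual review confirmed them as false positives or known trade-offs.

False Positives and Causes

Audit item	Original rating	Review conclusion	Root cause
Gateway bound to `lan`	CRITICAL	False positive	The audit tool statically analyzes config files and cannot see that the runtime binding is to Tailscale (100.115.94.1)
Device authentication disabled	CRITICAL	Known trade-off	Works around the `isSecureContext=false` problem under HTTP; protected by both Tailscale and token
No plugin allowlist	WARNING	Fix recommended	Confirmed as not to be fixed for now (low cost but limited benefit)
No rate limiting	WARNING	Threat model mismatch	Closed Tailscale network plus a strong 48-character token; brute-force risk is close to zero
MemoryLimit deprecated	WARNING	False positive	The audit referenced the workspace template; the actual service file has no such parameter

Key Lessons

Security audits are static analysis: they cannot replace human judgment and must be combined with runtime context
Understand the threat model: the threat scenarios an audit assumes must match the actual deployment environment
Record known trade-offs: document in MEMORY.md why certain "security issues" are accepted

Detailed Documents

Audit report: logs/operations/2026-02-26-20-59-30-config-audit-report.md
Review analysis: logs/operations/2026-02-26-21-05-00-security-audit-review.md
Fix script: fix-security-config.sh (not executed)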

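Configuration Cleanup and Push (2026-02-26)

Actions

Deleted the obsolete life/ directory (empty config, not referenced by any file)
Cleaned up nested git repositories (agents/life-workspace/.git, skills/openclaw-wecom/.git)
Removed Python caches and runtime state files
Committed and pushed to the remote repository

Git Commit

Commit: 378523c chore: 配置审计和清理 - 2026-02-26
Remote: gl.tigerone.tech:sw_dm/openClaw_agent_dm.git
Backup: /root/.openclaw/backups/workspace-20260226-210956.tar.gz

Directories Kept

agents/life-workspace/ - Agent workspace used for testing
skills/openclaw-wecom/ - WeCom (企业微信) skill (TypeScript implementation)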

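System Extension Architecture Upgrade Completed (2026-03-03 17:02 UTC)

All 6 Core Tasks Completed

Task 1 - Environment Variable Persistence

Files: systemd/gateway.env, systemd/life-gateway.env
Permissions: chmod 600 (readable by root only)
Notes: kept separate from the .service files, so OpenClaw UI upgrades do not overwrite them

Contents: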

MEM0_DASHSCOPE_API_KEY=sk-4111c9dba5334510968f9ae72728944e
OPENAI_API_BASE=https://dashscope.aliyuncs.com/compatible-mode/v1
OPENAI_BASE_URL=https://dashscope.aliyuncs.com/compatible-mode/v1
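
Because these files hold credentials, the chmod 600 requirement is worth checking automatically. The sketch below is an assumption-laden illustration (the helper names are hypothetical, and no such check is confirmed to exist in deploy.sh). It reports variable names only, so secrets never reach logs.

```python
import os
import stat


def is_owner_only(path):
    """True when group and others have no permission bits (e.g. mode 600)."""
    mode = stat.S_IMODE(os.stat(path).st_mode)
    return (mode & 0o077) == 0


def env_keys(path):
    """Return variable names from a KEY=value env file, never the values."""
    keys = []
    with open(path, encoding="utf-8") as handle:
        for raw in handle:
            line = raw.strip()
            if line and not line.startswith("#") and "=" in line:
                keys.append(line.split("=", 1)[0].strip())
    return keys
```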

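Task 2 - Agent Monitor Fixes (4 Bugs)

Restart limit: integrated into monitorOpenClawService() via handleServiceDown(); the infinite restart loop is fixed
Life Agent monitoring: now checks both the gateway and openclaw-gateway-life.service every 30 seconds
Heartbeat log: outputs gateway=OK, life=OK every 10 minutes
Upgrade tolerance: waits 60 seconds (grace period) after first detecting a stopped service, avoiding false alarms during upgrades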
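
The 60-second grace period is a small state machine: remember when the service was first seen down, and only escalate once it has stayed down for the whole period. The Python sketch below illustrates the idea; it is not the JavaScript in agent-monitor.js, and `DownDetector` with its injectable `clock` is a hypothetical construction.

```python
import time


class DownDetector:
    """Report a service as down only after it stays down for a grace period."""

    def __init__(self, grace_seconds=60, clock=time.monotonic):
        self.grace_seconds = grace_seconds
        self.clock = clock
        self._down_since = None  # time of the first failed check, if any

    def observe(self, is_active):
        """Feed one health-check result; return True when action is due."""
        if is_active:
            self._down_since = None  # recovery resets the grace period
            return False
        now = self.clock()
        if self._down_since is None:
            self._down_since = now   # first failure: start waiting
            return False
        return now - self._down_since >= self.grace_seconds
```

With checks every 30 seconds, a service stopped for an upgrade triggers no alert on the first or second failed check and is only handled on the third, once a full 60 seconds have passed.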

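Task 3 - Systemd Service Upgrade

Template update: deprecated MemoryLimit= replaced with MemoryMax=
Monitor sync: template synced to /etc/systemd/system/
Environment injection: EnvironmentFile= added to both user-level service files
Legacy service: the old system-level openclaw-gateway.service is disabled and masked
Status: all 3 services restarted and confirmed active

Task 4 - deploy.sh Enhancements

New commands:
- debug-stop: safely stops the monitor to prevent automatic restarts during debugging
- debug-start: restores all services once debugging is finished
- fix-service: re-injects EnvironmentFile= after a UI upgrade
Life Agent integration: start/stop/restart/status/logs/health/install all support the life agent

Task 5 - Unified Architecture Documentation

File: docs/EXTENSIONS_ARCHITECTURE.md
Contents: service architecture, monitoring system, memory system cross-references, environment variables, debugging workflow, upgrade safety checklist

Task 6 - CORE_INDEX.md Update

File tree: added .env files, .legacy renames, new documents
Starred references: EXTENSIONS_ARCHITECTURE.md listed as a key reference
Upgrade guide: upgrade safety instructions added to the model usage guide
Management commands: deploy.sh command list updated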

Current System Status (2026-03-04 03:32 UTC)

● openclaw-gateway.service        Active: active (running) 10h ago
● openclaw-gateway-life.service   Active: active (running) 10h ago
● openclaw-agent-monitor.service  Active: active (running) 10h ago

Monitor heartbeat log is normal: outputs gateway=OK, life=OK every 10 minutes

Upgrade Safety Procedure

# Run after an OpenClaw UI upgrade
./deploy.sh fix-service   # Re-inject EnvironmentFile=
./deploy.sh restart       # Restart all services
./deploy.sh health        # Verify health status

Key Documents

Extension architecture: docs/EXTENSIONS_ARCHITECTURE.md (required reading before modifying infrastructure)
Memory system: docs/MEMORY_ARCHITECTURE.md (detailed design of the four-layer memory system)
Monitor script: agent-monitor.js (health monitoring and auto-repair logic)
