MEMORY.md - Long-term Memory

This file contains curated long-term memories and important context.

Memory Management Strategy

  • MEMORY.md: Curated long-term memories, important decisions, security templates, and key configurations
  • QMD System: Automated memory backend with semantic search, auto-updates every 5 minutes
  • Usage: Write significant learnings to MEMORY.md; rely on QMD for daily context and automation
  • Access: MEMORY.md is loaded only in main sessions (direct chats), for security

QMD Configuration

  • Backend: qmd
  • Auto-update: every 5 minutes
  • Include default memory: true
  • Last verified: 2026-02-20

Server Security Hardening Template (2026-02-20)

Environment

  • Server: Ubuntu 24.04 LTS VPS (KVM)
  • Panel: 宝塔面板 (BT-Panel) on port 888
  • Public IP: 204.12.203.203

Security Configuration Applied

  1. Port Exposure Minimization:

    • Only ports 80 (HTTP) and 443 (HTTPS) publicly accessible
    • SSH (port 22) restricted to internal network access only
    • OpenClaw gateway (port 18789) bound to localhost only
    • All other services (MySQL, custom apps) internal-only
  2. OpenClaw Secure Deployment:

    • Gateway configured with bind: "localhost" instead of "lan"
    • Access exclusively through Nginx reverse proxy with HTTPS
    • Token-based authentication enabled
    • WebSocket support properly configured in Nginx
  3. Firewall Management:

    • Use 宝塔面板 (BT-Panel) built-in firewall for port management
    • Alternative: system-level firewall (ufw/iptables) if no panel available
    • Regular external port scanning to verify exposure
  4. Critical Security Principles:

    • Never expose sensitive services directly to public internet
    • Always use reverse proxy with TLS termination for web services
    • Implement defense in depth (firewall + service binding + authentication)
    • Regular security audits using openclaw security audit --deep
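
The port-exposure rule above can be checked mechanically once an external scan has produced a list of open ports. A minimal sketch, assuming the scan result is already available as a list of integers; `ALLOWED_PUBLIC_PORTS` and `unexpected_ports` are illustrative names and not part of OpenClaw or BT-Panel:

```python
# Hypothetical helper: compare externally visible ports with the policy above.
ALLOWED_PUBLIC_PORTS = frozenset({80, 443})  # HTTP and HTTPS only


def unexpected_ports(open_ports):
    """Return the sorted ports that are open but must not be public."""
    return sorted(set(open_ports) - ALLOWED_PUBLIC_PORTS)


if __name__ == "__main__":
    # Port 18789 (OpenClaw gateway) should never appear in an external scan.
    print(unexpected_ports([80, 443, 18789]))
```

An empty result means the scan agrees with the intended exposure; anything else is worth an entry under logs/security/.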

Migration Checklist for New Servers

  • Install and configure 宝塔面板 or equivalent server management panel
  • Set up Nginx reverse proxy with proper WebSocket support
  • Configure OpenClaw with localhost binding only
  • Restrict public ports to 80/443 only via firewall
  • Enable automatic security updates
  • Run initial security audit and document baseline
  • Schedule periodic security audits via OpenClaw cron

Lessons Learned

  • Panel-based firewalls (宝塔/aapanel) must be verified with external port scans
  • Direct service exposure (like OpenClaw on 0.0.0.0) creates critical security risks
  • Nginx reverse proxy configuration is essential for secure OpenClaw deployment

Agent Operations Logging Practice (2026-02-20)

Log Directory Structure

  • /root/.openclaw/workspace/logs/operations/ - Manual operations and important changes
  • /root/.openclaw/workspace/logs/system/ - System-generated logs
  • /root/.openclaw/workspace/logs/agents/ - Individual agent logs
  • /root/.openclaw/workspace/logs/security/ - Security operations and audits

Automatic Logging Triggers

  1. Configuration Changes: Any modification to config files (.json, .yaml, etc.)
  2. Security Modifications: Firewall rules, authentication changes, port modifications
  3. Agent Lifecycle: Deployment, updates, removal of agents
  4. System Optimizations: Performance tuning, resource allocation changes
  5. Troubleshooting: Error diagnosis and resolution procedures
  6. Memory Updates: Significant changes to MEMORY.md or memory management

Log Format Standard

  • Filename: YYYY-MM-DD-HH-MM-SS-description.log
  • Timestamp: UTC time format
  • Content: [TIMESTAMP] [OPERATION_TYPE] [AGENT/USER] Description with before/after state
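
The format standard above could be produced by a helper like the following. This is a sketch under assumptions: the ISO 8601 rendering of the timestamp and the slug rule for the description are choices made here (the standard only says "UTC time format"), and the function names are hypothetical.

```python
import re
from datetime import datetime, timezone


def log_filename(description, now=None):
    """Build YYYY-MM-DD-HH-MM-SS-description.log from a UTC timestamp."""
    now = (now or datetime.now(timezone.utc)).astimezone(timezone.utc)
    # Assumption: lowercase the description and join words with dashes.
    slug = re.sub(r"[^a-z0-9]+", "-", description.lower()).strip("-")
    return f"{now.strftime('%Y-%m-%d-%H-%M-%S')}-{slug}.log"


def log_entry(operation_type, actor, description, now=None):
    """Format one line as [TIMESTAMP] [OPERATION_TYPE] [AGENT/USER] Description."""
    now = (now or datetime.now(timezone.utc)).astimezone(timezone.utc)
    stamp = now.strftime("%Y-%m-%dT%H:%M:%SZ")  # assumed timestamp layout
    return f"[{stamp}] [{operation_type}] [{actor}] {description}"
```

Passing `now` explicitly keeps the filename and the entry on the same timestamp when both are written for one operation.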

Implementation Guidelines

  • Always log before making changes (capture current state)
  • Include rollback instructions when applicable
  • Redact sensitive information (passwords, tokens, private keys)
  • Reference related MEMORY.md entries for context
  • Use QMD for routine operational context, MEMORY.md for strategic decisions
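
The redaction guideline can be applied before a line is written to a log. A minimal sketch, assuming secrets appear as `KEY=value` or `KEY: value` pairs whose key name contains a sensitive word; the pattern and the `redact` name are illustrative, and multi-line material such as private keys is not covered:

```python
import re

# Assumed convention: the key name contains password, token, secret or api_key.
_SECRET = re.compile(
    r"(?i)\b([\w-]*(?:password|token|secret|api_key)[\w-]*)(\s*[=:]\s*)(\S+)"
)


def redact(line):
    """Replace the value of any sensitive-looking key with [REDACTED]."""
    return _SECRET.sub(r"\1\2[REDACTED]", line)


if __name__ == "__main__":
    print(redact("GATEWAY_TOKEN=abc123 port=18789"))
```

Only the value is replaced, so the log still shows which setting changed.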

Agent Health Monitoring & Alerting System (2026-02-20)

Features Implemented

  1. Crash Detection: Monitors uncaught exceptions and unhandled rejections
  2. Health Checks: Periodic service health verification (every 30 seconds)
  3. Multi-Channel Notifications: Telegram alerts for critical events
  4. Automatic Logging: All alerts logged to /logs/agents/health-YYYY-MM-DD.log
  5. Extensible Design: Easy to add new notification channels

Components Created

  • Skill: agent-monitor/SKILL.md - Documentation and usage guide
  • Monitor Script: agent-monitor.js - Core monitoring logic
  • Startup Script: start-agent-monitor.sh - Easy deployment
  • Log Directory: /logs/agents/ - Dedicated logging location

Alert Severity Levels

  • CRITICAL: Process crashes, uncaught exceptions
  • ERROR: Unhandled rejections, failed operations
  • WARNING: Health check failures, performance issues
  • INFO: Service status updates, recovery notifications
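
The severity levels above amount to a lookup from event type to level. The sketch below is a Python illustration only, not the code in agent-monitor.js; the event names are hypothetical, and treating unknown events as WARNING is an assumption.

```python
from enum import Enum


class Severity(Enum):
    CRITICAL = "CRITICAL"
    ERROR = "ERROR"
    WARNING = "WARNING"
    INFO = "INFO"


# Hypothetical event names, grouped as in the list above.
EVENT_SEVERITY = {
    "process_crash": Severity.CRITICAL,
    "uncaught_exception": Severity.CRITICAL,
    "unhandled_rejection": Severity.ERROR,
    "failed_operation": Severity.ERROR,
    "health_check_failure": Severity.WARNING,
    "performance_issue": Severity.WARNING,
    "status_update": Severity.INFO,
    "recovery": Severity.INFO,
}


def classify(event):
    """Map an event name to a severity; unknown events default to WARNING."""
    return EVENT_SEVERITY.get(event, Severity.WARNING)
```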

Integration Points

  • Automatically integrated with existing Telegram channel
  • Compatible with OpenClaw's agent architecture
  • Works alongside existing logging and memory systems
  • Can monitor any Node.js-based agent process

Usage Instructions

  1. Source the startup script: source /root/.openclaw/workspace/start-agent-monitor.sh
  2. Call startAgentMonitor("agent-name", healthCheckFunction)
  3. Monitor automatically sends alerts on errors/crashes
  4. Check logs in /logs/agents/ for detailed information

Complete System Architecture Upgrade (2026-02-20 14:25 UTC)

All 5 Core Requirements Implemented

1. System-Level Persistence ✓

  • Systemd Services: openclaw-gateway.service + openclaw-agent-monitor.service
  • Auto-start on Boot: Both services enabled in multi-user.target
  • Resource Limits: Memory 2G (gateway) / 512M (monitor), CPU 80% (gateway) / 20% (monitor), watchdog timers
  • Status: systemctl status openclaw-gateway / systemctl status openclaw-agent-monitor

2. Auto-Healing ✓

  • Crash Detection: Monitors process exits, signals, uncaught exceptions
  • Auto-Restart: Systemd Restart=always + monitor script restart logic
  • Restart Limits: Max 5 restarts per 5 minutes (prevents restart loops)
  • Health Checks: Every 30 seconds, automatic recovery on failure
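
The restart limit (max 5 restarts per 5 minutes) can be modelled as a sliding window. This is a sketch of the idea in Python, not the logic in the monitor script; the class name is hypothetical, and times are plain seconds so the behaviour is easy to test.

```python
from collections import deque


class RestartLimiter:
    """Allow at most max_restarts within any window_seconds interval."""

    def __init__(self, max_restarts=5, window_seconds=300):
        self.max_restarts = max_restarts
        self.window_seconds = window_seconds
        self._stamps = deque()  # times of restarts still inside the window

    def allow(self, now):
        """Record a restart at time `now` if the budget permits it."""
        # Drop restarts that have aged out of the window.
        while self._stamps and now - self._stamps[0] >= self.window_seconds:
            self._stamps.popleft()
        if len(self._stamps) >= self.max_restarts:
            return False  # budget exhausted: refuse, to avoid a restart loop
        self._stamps.append(now)
        return True
```

systemd expresses a comparable budget with StartLimitIntervalSec= and StartLimitBurst=; whether the units here set those directives is not recorded in this file.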

3. Multi-Layer Memory Architecture ✓

  • Core Memory: CORE_INDEX.md - Identity, structure, file index (always loaded first)
  • Long-term Memory: MEMORY.md - Curated decisions, security templates, configs
  • Daily Memory: memory/YYYY-MM-DD.md - Raw conversation logs (auto-saved)
  • Passive Archive: On-demand conversion of valuable conversations to skills/notes
  • Git Integration: All memory files tracked with version history

4. Git One-Click Rollback ✓

  • Repository: /root/.openclaw/workspace (already initialized)
  • Deploy Script: ./deploy.sh rollback - Rollback to previous commit
  • Specific Rollback: ./deploy.sh rollback-to <commit> - Rollback to specific commit
  • Auto-Backup: Backup created before rollback
  • Service Restart: Automatic service restart after rollback

5. Telegram Notifications ✓

  • Triggers: Service stop, error, crash, restart events
  • Channels: Telegram (via bot API) + OpenClaw message tool
  • Severity Levels: CRITICAL, ERROR, WARNING, INFO with emoji indicators
  • Logging: All notifications logged to /logs/agents/health-YYYY-MM-DD.log

📋 Management Commands (deploy.sh)

./deploy.sh install    # Install & start all systemd services
./deploy.sh start      # Start all services
./deploy.sh stop       # Stop all services
./deploy.sh restart    # Restart all services
./deploy.sh status     # Show detailed service status
./deploy.sh logs       # Show recent logs (last 50 lines)
./deploy.sh health     # Run comprehensive health check
./deploy.sh backup     # Create timestamped backup
./deploy.sh rollback   # Rollback to previous git commit
./deploy.sh rollback-to <commit>  # Rollback to specific commit
./deploy.sh help       # Show help message

🔧 Systemd Service Details

  • Gateway Service: /etc/systemd/system/openclaw-gateway.service

    • Memory limit: 2G, CPU: 80%, Watchdog: 30s
    • Restart: always, RestartSec: 10s
    • Logs: journalctl -u openclaw-gateway -f
  • Monitor Service: /etc/systemd/system/openclaw-agent-monitor.service

    • Memory limit: 512M, CPU: 20%
    • Restart: always, RestartSec: 5s
    • Logs: journalctl -u openclaw-agent-monitor -f

📊 Health Check Metrics

  • Gateway service status (active/inactive)
  • Agent monitor status (active/inactive)
  • Disk usage (warning at 80%)
  • Memory usage (warning at 80%)

🎯 Next Steps (Future Enhancements)

  • Add Prometheus/Grafana monitoring dashboard
  • Implement log rotation and archival
  • Add email notifications as backup channel
  • Create web-based admin dashboard
  • Add automated security scanning in CI/CD

User-Level vs System-Level Systemd Services - Critical Lesson (2026-02-20 14:35 UTC)

Problem Discovered

Initial deployment used system-level systemd services (/etc/systemd/system/) for OpenClaw Gateway, but OpenClaw natively uses user-level systemd (~/.config/systemd/user/). This caused:

  • Service restart loops (5 attempts then failure)
  • Error: systemctl --user unavailable: Failed to connect to bus: No medium found
  • Conflicts between system and user service definitions

Root Cause

OpenClaw Gateway is designed as a user-level service because:

  1. It runs under the user's context, not root
  2. It needs access to user-specific config (~/.openclaw/)
  3. User-level services have different environment requirements

Solution: Hybrid Architecture

User-Level Service (Gateway)

  • Location: ~/.config/systemd/user/openclaw-gateway.service
  • Required Setup:
    # Enable linger (CRITICAL - allows user services to run without login session)
    loginctl enable-linger $(whoami)
    
    # Set environment variables
    export XDG_RUNTIME_DIR=/run/user/$(id -u)
    export DBUS_SESSION_BUS_ADDRESS="unix:path=/run/user/$(id -u)/bus"
    
  • Management Commands:
    systemctl --user status openclaw-gateway
    systemctl --user restart openclaw-gateway    # likewise: start, stop
    journalctl --user -u openclaw-gateway -f
    

System-Level Service (Agent Monitor)

  • Location: /etc/systemd/system/openclaw-agent-monitor.service
  • Purpose: Independently monitor the gateway (survives user session issues)
  • Management Commands:
    systemctl status openclaw-agent-monitor
    systemctl restart openclaw-agent-monitor    # likewise: start, stop
    journalctl -u openclaw-agent-monitor -f
    

Deployment Checklist for New Servers

# 1. Enable user linger (MUST DO FIRST)
loginctl enable-linger $(whoami)

# 2. Create runtime directory if needed
mkdir -p /run/user/$(id -u)
chmod 700 /run/user/$(id -u)

# 3. Export environment (add to ~/.bashrc for persistence)
echo 'export XDG_RUNTIME_DIR=/run/user/$(id -u)' >> ~/.bashrc
echo 'export DBUS_SESSION_BUS_ADDRESS=unix:path=/run/user/$(id -u)/bus' >> ~/.bashrc

# 4. Install services
./deploy.sh install

# 5. Verify
./deploy.sh health

Troubleshooting Guide

Error: "Failed to connect to bus: No medium found"

Cause: User linger not enabled or environment variables not set

Fix:

loginctl enable-linger $(whoami)
export XDG_RUNTIME_DIR=/run/user/$(id -u)
export DBUS_SESSION_BUS_ADDRESS="unix:path=/run/user/$(id -u)/bus"

Error: "Start request repeated too quickly"

Cause: Service crashing due to misconfiguration

Fix: Check logs with journalctl --user -u openclaw-gateway -f

User service not starting after reboot

Cause: Linger not enabled

Fix: loginctl enable-linger $(whoami)

Best Practices for Multi-Agent Deployments

  1. Always enable linger on first setup - document this in deployment guide
  2. Use hybrid architecture - user-level for agents, system-level for monitors
  3. Set environment variables in startup scripts, not just shell config
  4. Test after reboot - verify services auto-start correctly
  5. Document in MEMORY.md - share lessons across agent instances
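
Practice 3 (set the environment in startup scripts, not only in shell config) can be sketched as a helper that builds the two variables from the user ID and passes them explicitly to a child process. The function name is hypothetical; the paths follow the exports shown earlier in this section.

```python
import os


def user_bus_env(uid=None):
    """Return the variables systemctl --user needs to reach the user bus."""
    uid = os.getuid() if uid is None else uid
    runtime_dir = f"/run/user/{uid}"
    return {
        "XDG_RUNTIME_DIR": runtime_dir,
        "DBUS_SESSION_BUS_ADDRESS": f"unix:path={runtime_dir}/bus",
    }


# Example use from a startup script (not executed here):
#   import subprocess
#   subprocess.run(
#       ["systemctl", "--user", "status", "openclaw-gateway"],
#       env={**os.environ, **user_bus_env()},
#   )
```

Passing `env=` explicitly means the call does not depend on ~/.bashrc having been sourced, which is the usual gap for non-interactive sessions.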

Updated deploy.sh Features

  • Automatically enables linger during install
  • Sets up XDG_RUNTIME_DIR and DBUS_SESSION_BUS_ADDRESS
  • Uses systemctl --user for gateway, systemctl for monitor
  • Health check verifies linger status and runtime directory
  • Proper log commands for both service types

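Collection Name Unified to mem0_v4_shared (2026-02-27)

Background

The Collection name was previously inconsistent across the configuration:

  • Actually used by the code: mem0_global_v4
  • Specified by the user and recorded in the docs: mem0_v4_shared (shared by Dr. Chen and Master Zhang)

Decision

Director Wang gave explicit instructions: all Collections are to use mem0_v4_shared, and key configuration must not be changed arbitrarily.

Modified Files

  1. skills/mem0-integration/mem0_client.py
  2. skills/mem0-integration/config.yaml
  3. skills/mem0-integration/skill.json
  4. skills/mem0-integration/config-life.yaml
  5. agents/life-agent.json
  6. agents/life-workspace/skills/mem0-integration/config.yaml
  7. docs/SYSTEM_ARCHITECTURE.md

Verification Results

  • Gateway restarted successfully (systemd service healthy)
  • Qdrant Collection mem0_v4_shared created
  • Vector dimension: 1024 (text-embedding-v4)
  • Distance metric: Cosine
  • Metadata indexes: user_id, agent_id, actor_id, run_id
  • Embedding billing channel: Bailian standard billing

Operation Log

/root/.openclaw/workspace/logs/operations/2026-02-27-08-55-00-unify-collection-name.log

Key Principles

  • Changes to key configuration (Collection name, Embedding model, billing channel) require user confirmation
  • All Agents share one Collection; logical isolation is achieved through metadata.agent_id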
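
Logical isolation inside a shared Collection means every query carries a filter on the agent's ID. The sketch below only builds such a filter as a plain dictionary, in the general shape of a Qdrant payload filter; it does not call Qdrant or mem0. The payload key names (`agent_id`, `user_id`) are assumptions taken from the metadata index list in this section, and `agent_filter` is a hypothetical helper.

```python
COLLECTION = "mem0_v4_shared"  # the unified Collection name


def agent_filter(agent_id, user_id=None):
    """Build a filter that restricts results to one agent (and optionally one user)."""
    must = [{"key": "agent_id", "match": {"value": agent_id}}]
    if user_id is not None:
        must.append({"key": "user_id", "match": {"value": user_id}})
    return {"must": must}


if __name__ == "__main__":
    # "life" is used here purely as an example agent ID.
    print(COLLECTION, agent_filter("life"))
```

Because the isolation is only logical, a query sent without this filter would see every agent's memories.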

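Security Audit False-Positive Analysis (2026-02-26)

Background

Running openclaw security audit --deep reported 4 CRITICAL/WARNING issues; manual review confirmed them to be false positives or known trade-offs.

False Positives and Causes

| Audit item | Original rating | Review conclusion | Root cause |
| --- | --- | --- | --- |
| Gateway bound to lan | CRITICAL | False positive | The audit tool statically analyzes the config file and cannot see that the runtime binding is to Tailscale (100.115.94.1) |
| Device authentication disabled | CRITICAL | Known trade-off | Works around isSecureContext=false under HTTP; protected by both Tailscale and a token |
| No plugin allowlist | WARNING | Fix recommended | Confirmed as deferred for now (low cost but limited benefit) |
| No rate limiting | WARNING | Threat model mismatch | Closed Tailscale network plus a strong 48-character token; brute-force risk is close to zero |
| MemoryLimit deprecated | WARNING | False positive | The audit referenced the workspace template; the actual service file has no such parameter |

Key Lessons

  1. A security audit is static analysis - it cannot replace human judgment and must be read together with runtime context
  2. Understand the threat model - the threat scenarios the audit assumes must match the actual deployment environment
  3. Record known trade-offs - note in MEMORY.md why certain "security issues" were accepted

Detailed Documents

  • Audit report: logs/operations/2026-02-26-20-59-30-config-audit-report.md
  • Review analysis: logs/operations/2026-02-26-21-05-00-security-audit-review.md
  • Fix script: fix-security-config.sh (not executed)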

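Configuration Cleanup and Push (2026-02-26)

Operations

  • Deleted the obsolete life/ directory (empty configuration, not referenced by any file)
  • Cleaned up nested git repositories (agents/life-workspace/.git, skills/openclaw-wecom/.git)
  • Removed Python caches and runtime state files
  • Committed and pushed to the remote repository

Git Commit

  • Commit: 378523c chore: 配置审计和清理 - 2026-02-26 (config audit and cleanup)
  • Remote: gl.tigerone.tech:sw_dm/openClaw_agent_dm.git
  • Backup: /root/.openclaw/backups/workspace-20260226-210956.tar.gz

Directories Kept

  • agents/life-workspace/ - Agent workspace used for testing
  • skills/openclaw-wecom/ - WeCom (企业微信) skill, implemented in TypeScript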

System Extension Architecture Upgrade Complete (2026-03-03 17:02 UTC)

All 6 Core Tasks Completed

Task 1 - Environment Variable Persistence

  • Files: systemd/gateway.env, systemd/life-gateway.env
  • Permissions: chmod 600 (readable only by root)
  • Property: kept separate from the .service files, so OpenClaw UI upgrades do not overwrite them
  • Contents:
    MEM0_DASHSCOPE_API_KEY=sk-[REDACTED]
    OPENAI_API_BASE=https://dashscope.aliyuncs.com/compatible-mode/v1
    OPENAI_BASE_URL=https://dashscope.aliyuncs.com/compatible-mode/v1
    

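Task 2 - Agent Monitor Fixes (4 Bugs)

  • Restart limit: integrated into monitorOpenClawService() via handleServiceDown(); the infinite restart loop is fixed
  • Life Agent monitoring: gateway and openclaw-gateway-life.service are now both checked every 30 seconds
  • Heartbeat log: outputs gateway=OK, life=OK every 10 minutes
  • Upgrade tolerance: after first detecting a stopped service, the monitor waits 60 seconds (grace period) to avoid false alarms during upgrades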
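
The 60-second grace period can be sketched as a small state holder that remembers when a service was first seen down. This is a Python illustration of the idea, not the code in agent-monitor.js; the class and method names are hypothetical, and times are plain seconds.

```python
class DownGrace:
    """Suppress alerts until a service has been down for a full grace period."""

    def __init__(self, grace_seconds=60):
        self.grace_seconds = grace_seconds
        self._down_since = None  # time of the first failed check, if any

    def should_alert(self, is_up, now):
        """Return True only when the service has stayed down past the grace period."""
        if is_up:
            self._down_since = None  # recovery resets the timer
            return False
        if self._down_since is None:
            self._down_since = now  # first detection: start the grace period
        return now - self._down_since >= self.grace_seconds
```

With checks every 30 seconds, a service that is down at 0 s and 30 s and back by 60 s never raises an alert, which covers a short restart during an upgrade.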

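Task 3 - Systemd Service Upgrade

  • Template update: the deprecated MemoryLimit= was replaced with MemoryMax=
  • Monitor sync: template synced to /etc/systemd/system/
  • Environment injection: EnvironmentFile= added to both user-level service files
  • Legacy service: the old system-level openclaw-gateway.service was disabled and masked
  • Status: all 3 services restarted and confirmed active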
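
The MemoryLimit= to MemoryMax= change is a one-line rewrite per unit file. A minimal sketch that operates on the unit text only; reading and writing the file, and running systemctl daemon-reload afterwards, are left to the caller. The function name is hypothetical.

```python
import re


def migrate_memory_limit(unit_text):
    """Rename the deprecated MemoryLimit= directive to MemoryMax=."""
    # Only touch real directives at the start of a line, not commented-out ones.
    return re.sub(r"(?m)^([ \t]*)MemoryLimit=", r"\1MemoryMax=", unit_text)


if __name__ == "__main__":
    print(migrate_memory_limit("[Service]\nMemoryLimit=2G\nRestart=always\n"))
```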

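Task 4 - deploy.sh Enhancements

  • New commands:
    • debug-stop - safely stops the monitor to prevent automatic restarts while debugging
    • debug-start - restores all services once debugging is finished
    • fix-service - re-injects EnvironmentFile= after a UI upgrade
  • Life Agent integration: start/stop/restart/status/logs/health/install all support the life agent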
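
What fix-service needs to do can be sketched as an idempotent edit of the unit text: add the EnvironmentFile= line under [Service] only if it is missing. This is an assumption about the behaviour, not the contents of deploy.sh; the function name and the example path are hypothetical.

```python
def ensure_environment_file(unit_text, env_path):
    """Add EnvironmentFile=env_path under [Service] unless already present."""
    directive = f"EnvironmentFile={env_path}"
    lines = unit_text.splitlines()
    if any(line.strip() == directive for line in lines):
        return unit_text  # already injected: leave the file untouched
    out, inserted = [], False
    for line in lines:
        out.append(line)
        if not inserted and line.strip() == "[Service]":
            out.append(directive)
            inserted = True
    if not inserted:
        raise ValueError("unit file has no [Service] section")
    return "\n".join(out) + "\n"


if __name__ == "__main__":
    unit = "[Service]\nExecStart=/bin/true\n"
    print(ensure_environment_file(unit, "/path/to/systemd/gateway.env"))
```

Running it twice gives the same result as running it once, so it is safe to call after every UI upgrade.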

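Task 5 - Unified Architecture Document

  • File: docs/EXTENSIONS_ARCHITECTURE.md
  • Contents: service architecture, monitoring system, memory system cross-references, environment variables, debugging workflow, upgrade safety checklist

Task 6 - CORE_INDEX.md Update

  • File tree: added .env files, .legacy renames, and new documents
  • Starred reference: EXTENSIONS_ARCHITECTURE.md listed as a key reference
  • Upgrade guide: upgrade-safety instructions added to the model usage guide
  • Management commands: deploy.sh command list updated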

Current System Status (2026-03-04 03:32 UTC)

● openclaw-gateway.service        Active: active (running) 10h ago
● openclaw-gateway-life.service   Active: active (running) 10h ago
● openclaw-agent-monitor.service  Active: active (running) 10h ago

Monitor heartbeat log is normal: gateway=OK, life=OK is output every 10 minutes

Safe Upgrade Procedure

# Run after an OpenClaw UI upgrade
./deploy.sh fix-service   # Re-inject EnvironmentFile=
./deploy.sh restart       # Restart all services
./deploy.sh health        # Verify health status

Key Documents

  • Extension architecture: docs/EXTENSIONS_ARCHITECTURE.md - required reading before modifying infrastructure
  • Memory system: docs/MEMORY_ARCHITECTURE.md - detailed design of the four-layer memory system
  • Monitor script: agent-monitor.js - health monitoring and auto-repair logic