# MEMORY.md - Long-term Memory |
|
|
|
This file contains curated long-term memories and important context. |
|
|
|
## Memory Management Strategy |
|
- **MEMORY.md**: Curated long-term memories, important decisions, security templates, and key configurations |
|
- **QMD System**: Automated memory backend with semantic search, auto-updates every 5 minutes |
|
- **Usage**: Write significant learnings to MEMORY.md; rely on QMD for daily context and automation |
|
- **Access**: MEMORY.md is loaded only in main sessions (direct chats) for security
|
|
|
## QMD Configuration |
|
- Backend: qmd |
|
- Auto-update: every 5 minutes |
|
- Include default memory: true |
|
- Last verified: 2026-02-20 |
|
|
|
## Server Security Hardening Template (2026-02-20) |
|
|
|
### Environment |
|
- **Server**: Ubuntu 24.04 LTS VPS (KVM) |
|
- **Panel**: 宝塔面板 (BT-Panel) on port 888 |
|
- **Public IP**: 204.12.203.203 |
|
|
|
### Security Configuration Applied |
|
1. **Port Exposure Minimization**: |
|
- Only ports 80 (HTTP) and 443 (HTTPS) publicly accessible |
|
- SSH (port 22) restricted to internal network access only
|
- OpenClaw gateway (port 18789) bound to localhost only |
|
- All other services (MySQL, custom apps) internal-only |
|
|
|
2. **OpenClaw Secure Deployment**: |
|
- Gateway configured with `bind: "localhost"` instead of `"lan"` |
|
- Access exclusively through Nginx reverse proxy with HTTPS |
|
- Token-based authentication enabled |
|
- WebSocket support properly configured in Nginx |
|
|
|
3. **Firewall Management**: |
|
- Use 宝塔面板 (BT-Panel) built-in firewall for port management |
|
- Alternative: system-level firewall (ufw/iptables) if no panel available |
|
- Regular external port scanning to verify exposure |
|
|
|
4. **Critical Security Principles**: |
|
- Never expose sensitive services directly to public internet |
|
- Always use reverse proxy with TLS termination for web services |
|
- Implement defense in depth (firewall + service binding + authentication) |
|
- Regular security audits using `openclaw security audit --deep` |
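
Local spot-check sketch for the port rules above. It complements an external port scan and does not replace one. Assumptions: input is `ss -tlnH` output (column 4 = local address:port); `ALLOWED_PORTS` and `find_exposed_listeners` are hypothetical names, not part of OpenClaw or deploy.sh.

```bash
# Ports that may listen on non-loopback addresses (assumed default: 80 and 443).
ALLOWED_PORTS="${ALLOWED_PORTS:-80 443}"

# Read `ss -tlnH` lines on stdin and print every listener that is
# neither loopback-only nor on an allowed port.
find_exposed_listeners() {
  awk -v allowed="$ALLOWED_PORTS" '
    BEGIN { n = split(allowed, a, " "); for (i = 1; i <= n; i++) ok[a[i]] = 1 }
    {
      addr = $4
      port = addr; sub(/.*:/, "", port)     # text after the last colon
      host = substr(addr, 1, length(addr) - length(port) - 1)
      if (host ~ /^127\./ || host == "[::1]") next   # loopback-only listener
      if (port in ok) next                           # explicitly allowed port
      print addr
    }'
}

# Usage on the server:
#   ss -tlnH | find_exposed_listeners
```

Empty output means every non-loopback listener is on an allowed port. A gateway bound to `127.0.0.1:18789` is skipped; SSH on `0.0.0.0:22` is reported.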
|
|
|
### Migration Checklist for New Servers |
|
- [ ] Install and configure 宝塔面板 or equivalent server management panel |
|
- [ ] Set up Nginx reverse proxy with proper WebSocket support |
|
- [ ] Configure OpenClaw with localhost binding only |
|
- [ ] Restrict public ports to 80/443 only via firewall |
|
- [ ] Enable automatic security updates |
|
- [ ] Run initial security audit and document baseline |
|
- [ ] Schedule periodic security audits via OpenClaw cron |
|
|
|
### Lessons Learned |
|
- Panel-based firewalls (宝塔/aapanel) must be verified with external port scans |
|
- Direct service exposure (like OpenClaw on 0.0.0.0) creates critical security risks |
|
- Nginx reverse proxy configuration is essential for secure OpenClaw deployment |
|
|
|
## Agent Operations Logging Practice (2026-02-20) |
|
|
|
### Log Directory Structure |
|
- `/root/.openclaw/workspace/logs/operations/` - Manual operations and important changes |
|
- `/root/.openclaw/workspace/logs/system/` - System-generated logs |
|
- `/root/.openclaw/workspace/logs/agents/` - Individual agent logs |
|
- `/root/.openclaw/workspace/logs/security/` - Security operations and audits |
|
|
|
### Automatic Logging Triggers |
|
1. **Configuration Changes**: Any modification to config files (.json, .yaml, etc.) |
|
2. **Security Modifications**: Firewall rules, authentication changes, port modifications |
|
3. **Agent Lifecycle**: Deployment, updates, removal of agents |
|
4. **System Optimizations**: Performance tuning, resource allocation changes |
|
5. **Troubleshooting**: Error diagnosis and resolution procedures |
|
6. **Memory Updates**: Significant changes to MEMORY.md or memory management |
|
|
|
### Log Format Standard |
|
- **Filename**: `YYYY-MM-DD-HH-MM-SS-description.log` |
|
- **Timestamp**: UTC time format |
|
- **Content**: `[TIMESTAMP] [OPERATION_TYPE] [AGENT/USER] Description with before/after state` |
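
Sketch of helpers that produce the filename and line format above. Assumptions: the ISO 8601 UTC timestamp inside the brackets (the standard only says "UTC time format"), and the function names `log_filename` / `log_entry`, which are hypothetical.

```bash
# Build a filename of the form YYYY-MM-DD-HH-MM-SS-description.log (UTC).
log_filename() {
  printf '%s-%s.log\n' "$(date -u +%Y-%m-%d-%H-%M-%S)" "$1"
}

# Build one line: [TIMESTAMP] [OPERATION_TYPE] [AGENT/USER] Description
# Arguments: operation type, agent or user, free-text description.
log_entry() {
  printf '[%s] [%s] [%s] %s\n' "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "$1" "$2" "$3"
}

# Usage:
#   log_entry CONFIG root "gateway bind: lan -> localhost" \
#     >> "/root/.openclaw/workspace/logs/operations/$(log_filename gateway-bind)"
```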
|
|
|
### Implementation Guidelines |
|
- Always log before making changes (capture current state) |
|
- Include rollback instructions when applicable |
|
- Redact sensitive information (passwords, tokens, private keys) |
|
- Reference related MEMORY.md entries for context |
|
- Use QMD for routine operational context, MEMORY.md for strategic decisions |
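
Sketch of the redaction guideline above as a stdin filter. Assumptions: the key list (`password`, `passwd`, `token`, `secret`, `api_key`) and the name `redact_secrets` are illustrative only; matching is case-sensitive and will not catch private key blocks or secrets under other key names.

```bash
# Replace the value after a secret-looking key with [REDACTED].
# Handles key=value and "key": "value" forms; reads stdin, writes stdout.
redact_secrets() {
  sed -E 's/((password|passwd|token|secret|api_key)"?[[:space:]]*[:=][[:space:]]*)("[^"]*"|[^[:space:],;]+)/\1[REDACTED]/g'
}

# Usage:
#   cat config-before.json | redact_secrets >> "$logfile"
```

Review the output before committing logs to git; a pattern filter like this is a safety net, not a guarantee.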
|
|
|
## Agent Health Monitoring & Alerting System (2026-02-20) |
|
|
|
### Features Implemented |
|
1. **Crash Detection**: Monitors uncaught exceptions and unhandled rejections |
|
2. **Health Checks**: Periodic service health verification (every 30 seconds) |
|
3. **Multi-Channel Notifications**: Telegram alerts for critical events |
|
4. **Automatic Logging**: All alerts logged to `/logs/agents/health-YYYY-MM-DD.log` |
|
5. **Extensible Design**: Easy to add new notification channels |
|
|
|
### Components Created |
|
- **Skill**: `agent-monitor/SKILL.md` - Documentation and usage guide |
|
- **Monitor Script**: `agent-monitor.js` - Core monitoring logic |
|
- **Startup Script**: `start-agent-monitor.sh` - Easy deployment |
|
- **Log Directory**: `/logs/agents/` - Dedicated logging location |
|
|
|
### Alert Severity Levels |
|
- **CRITICAL**: Process crashes, uncaught exceptions |
|
- **ERROR**: Unhandled rejections, failed operations |
|
- **WARNING**: Health check failures, performance issues |
|
- **INFO**: Service status updates, recovery notifications |
|
|
|
### Integration Points |
|
- Automatically integrated with existing Telegram channel |
|
- Compatible with OpenClaw's agent architecture |
|
- Works alongside existing logging and memory systems |
|
- Can monitor any Node.js-based agent process |
|
|
|
### Usage Instructions |
|
1. Source the startup script: `source /root/.openclaw/workspace/start-agent-monitor.sh` |
|
2. Call `startAgentMonitor("agent-name", healthCheckFunction)` |
|
3. Monitor automatically sends alerts on errors/crashes |
|
4. Check logs in `/logs/agents/` for detailed information |
|
|
|
--- |
|
|
|
## Complete System Architecture Upgrade (2026-02-20 14:25 UTC) |
|
|
|
### ✅ All 5 Core Requirements Implemented |
|
|
|
#### 1. System-Level Persistence ✓ |
|
- **Systemd Services**: `openclaw-gateway.service` + `openclaw-agent-monitor.service` |
|
- **Auto-start on Boot**: Both services enabled in multi-user.target |
|
- **Resource Limits**: Memory (2G/512M), CPU (80%/20%), watchdog timers |
|
- **Status**: `systemctl status openclaw-gateway` / `systemctl status openclaw-agent-monitor` |
|
|
|
#### 2. Auto-Healing ✓ |
|
- **Crash Detection**: Monitors process exits, signals, uncaught exceptions |
|
- **Auto-Restart**: Systemd Restart=always + monitor script restart logic |
|
- **Restart Limits**: Max 5 restarts per 5 minutes (prevents restart loops) |
|
- **Health Checks**: Every 30 seconds, automatic recovery on failure |
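
Sketch of the "max 5 restarts per 5 minutes" rule as a sliding-window check. This is an assumption about how the monitor script could count restarts, not its actual code; `restart_allowed`, `MAX_RESTARTS` and `WINDOW_SECONDS` are hypothetical names. On the systemd side the same limit is normally expressed with `StartLimitBurst=` and `StartLimitIntervalSec=` in the unit file.

```bash
MAX_RESTARTS=5       # restarts tolerated inside one window
WINDOW_SECONDS=300   # window length: 5 minutes

# Arguments: current epoch time, then epoch times of previous restarts.
# Succeeds (exit 0) when fewer than MAX_RESTARTS fall inside the window.
restart_allowed() {
  now="$1"; shift
  recent=0
  for ts in "$@"; do
    if [ $((now - ts)) -lt "$WINDOW_SECONDS" ]; then
      recent=$((recent + 1))
    fi
  done
  [ "$recent" -lt "$MAX_RESTARTS" ]
}

# Usage:
#   if restart_allowed "$(date +%s)" $restart_history; then restart_service; fi
```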
|
|
|
#### 3. Multi-Layer Memory Architecture ✓ |
|
- **Core Memory**: `CORE_INDEX.md` - Identity, structure, file index (always loaded first) |
|
- **Long-term Memory**: `MEMORY.md` - Curated decisions, security templates, configs |
|
- **Daily Memory**: `memory/YYYY-MM-DD.md` - Raw conversation logs (auto-saved) |
|
- **Passive Archive**: On-demand conversion of valuable conversations to skills/notes |
|
- **Git Integration**: All memory files tracked with version history |
|
|
|
#### 4. Git One-Click Rollback ✓ |
|
- **Repository**: `/root/.openclaw/workspace` (already initialized) |
|
- **Deploy Script**: `./deploy.sh rollback` - Rollback to previous commit |
|
- **Specific Rollback**: `./deploy.sh rollback-to <commit>` - Rollback to specific commit |
|
- **Auto-Backup**: Backup created before rollback |
|
- **Service Restart**: Automatic service restart after rollback |
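
Sketch of a backup-then-rollback step. Assumptions: the real `deploy.sh` may use a different git command; this version uses `git reset --hard HEAD~1`, which discards uncommitted changes, and a tarball name modelled on the `workspace-YYYYMMDD-HHMMSS.tar.gz` pattern. `rollback_previous` is a hypothetical name, and the backup directory must live outside the repository.

```bash
# Arguments: repository path, backup directory (outside the repository).
# Creates a timestamped tarball first, then moves the repo back one commit.
rollback_previous() {
  repo="$1"; backup_dir="$2"
  stamp="$(date -u +%Y%m%d-%H%M%S)"
  mkdir -p "$backup_dir" &&
  tar -czf "$backup_dir/workspace-$stamp.tar.gz" -C "$repo" . &&
  git -C "$repo" reset --hard HEAD~1
}

# Usage:
#   rollback_previous /root/.openclaw/workspace /root/.openclaw/backups
#   # then restart the affected services
```

The `&&` chain stops before the reset if the backup fails, so a rollback never runs without a restorable copy.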
|
|
|
#### 5. Telegram Notifications ✓ |
|
- **Triggers**: Service stop, error, crash, restart events |
|
- **Channels**: Telegram (via bot API) + OpenClaw message tool |
|
- **Severity Levels**: CRITICAL, ERROR, WARNING, INFO with emoji indicators |
|
- **Logging**: All notifications logged to `/logs/agents/health-YYYY-MM-DD.log` |
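
Sketch of a severity-tagged Telegram alert using the Bot API `sendMessage` method. Assumptions: the emoji choices, the message layout, the function names, and the environment variable names `TELEGRAM_BOT_TOKEN` / `TELEGRAM_CHAT_ID` are all illustrative; the real monitor is `agent-monitor.js` and may differ.

```bash
# Map a severity level to an emoji prefix (emoji choices are assumptions).
severity_emoji() {
  case "$1" in
    CRITICAL) echo "🔴" ;;
    ERROR)    echo "🟠" ;;
    WARNING)  echo "🟡" ;;
    INFO)     echo "🟢" ;;
    *)        echo "⚪" ;;
  esac
}

# Arguments: severity, service name, message text.
format_alert() {
  printf '%s [%s] %s: %s\n' "$(severity_emoji "$1")" "$1" "$2" "$3"
}

# Send the formatted alert; needs curl plus the two variables named above.
send_alert() {
  curl -fsS -X POST "https://api.telegram.org/bot${TELEGRAM_BOT_TOKEN}/sendMessage" \
    --data-urlencode "chat_id=${TELEGRAM_CHAT_ID}" \
    --data-urlencode "text=$(format_alert "$@")" >/dev/null
}

# Usage:
#   send_alert CRITICAL openclaw-gateway "process exited with code 1"
```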
|
|
|
### 📋 Management Commands (deploy.sh) |
|
```bash
./deploy.sh install               # Install & start all systemd services
./deploy.sh start                 # Start all services
./deploy.sh stop                  # Stop all services
./deploy.sh restart               # Restart all services
./deploy.sh status                # Show detailed service status
./deploy.sh logs                  # Show recent logs (last 50 lines)
./deploy.sh health                # Run comprehensive health check
./deploy.sh backup                # Create timestamped backup
./deploy.sh rollback              # Rollback to previous git commit
./deploy.sh rollback-to <commit>  # Rollback to specific commit
./deploy.sh help                  # Show help message
```
|
|
|
### 🔧 Systemd Service Details |
|
- **Gateway Service**: `/etc/systemd/system/openclaw-gateway.service`
  - Memory limit: 2G, CPU: 80%, Watchdog: 30s
  - Restart: always, RestartSec: 10s
  - Logs: `journalctl -u openclaw-gateway -f`

- **Monitor Service**: `/etc/systemd/system/openclaw-agent-monitor.service`
  - Memory limit: 512M, CPU: 20%
  - Restart: always, RestartSec: 5s
  - Logs: `journalctl -u openclaw-agent-monitor -f`
|
|
|
### 📊 Health Check Metrics |
|
- Gateway service status (active/inactive) |
|
- Agent monitor status (active/inactive) |
|
- Disk usage (warning at 80%) |
|
- Memory usage (warning at 80%) |
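
Sketch of the disk and memory checks with the 80% warning threshold. Assumptions: "warning at 80%" is read as "at or above 80%"; memory usage is derived from `MemTotal` and `MemAvailable` in `/proc/meminfo` (Linux only); all function names are hypothetical and the real `deploy.sh health` may compute these differently.

```bash
WARN_PERCENT=80

# Print filesystem usage of a mount point (default /) as an integer percentage.
disk_usage_percent() {
  df -P "${1:-/}" | awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

# Print memory usage as an integer percentage.
# Optional argument: path to a meminfo-style file (default /proc/meminfo).
memory_usage_percent() {
  awk '/^MemTotal:/     { t = $2 }
       /^MemAvailable:/ { a = $2 }
       END { if (t > 0) printf "%d\n", (t - a) * 100 / t }' "${1:-/proc/meminfo}"
}

# Classify an integer percentage against the warning threshold.
usage_status() {
  if [ "$1" -ge "$WARN_PERCENT" ]; then echo "WARNING"; else echo "OK"; fi
}

# Usage:
#   echo "disk: $(usage_status "$(disk_usage_percent /)")"
#   echo "memory: $(usage_status "$(memory_usage_percent)")"
```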
|
|
|
### 🎯 Next Steps (Future Enhancements) |
|
- [ ] Add Prometheus/Grafana monitoring dashboard |
|
- [ ] Implement log rotation and archival |
|
- [ ] Add email notifications as backup channel |
|
- [ ] Create web-based admin dashboard |
|
- [ ] Add automated security scanning in CI/CD |
|
|
|
--- |
|
|
|
## User-Level vs System-Level Systemd Services - Critical Lesson (2026-02-20 14:35 UTC) |
|
|
|
### Problem Discovered |
|
Initial deployment used system-level systemd services (`/etc/systemd/system/`) for OpenClaw Gateway, but OpenClaw natively uses **user-level systemd** (`~/.config/systemd/user/`). This caused: |
|
- Service restart loops (5 attempts then failure) |
|
- Error: `systemctl --user unavailable: Failed to connect to bus: No medium found` |
|
- Conflicts between system and user service definitions |
|
|
|
### Root Cause |
|
OpenClaw Gateway is designed as a user-level service because: |
|
1. It runs under the user's context, not root |
|
2. It needs access to user-specific config (`~/.openclaw/`) |
|
3. User-level services have different environment requirements |
|
|
|
### Solution: Hybrid Architecture |
|
|
|
#### User-Level Service (Gateway) |
|
- **Location**: `~/.config/systemd/user/openclaw-gateway.service` |
|
- **Required Setup**: |
|
```bash
# Enable linger (CRITICAL - allows user services to run without login session)
loginctl enable-linger "$(whoami)"

# Set environment variables
export XDG_RUNTIME_DIR="/run/user/$(id -u)"
export DBUS_SESSION_BUS_ADDRESS="unix:path=/run/user/$(id -u)/bus"
```
|
- **Management Commands**: |
|
```bash
systemctl --user status openclaw-gateway
systemctl --user restart openclaw-gateway   # or: start / stop
journalctl --user -u openclaw-gateway -f
```
|
|
|
#### System-Level Service (Agent Monitor) |
|
- **Location**: `/etc/systemd/system/openclaw-agent-monitor.service` |
|
- **Purpose**: Independently monitor the gateway (survives user session issues) |
|
- **Management Commands**: |
|
```bash
systemctl status openclaw-agent-monitor
systemctl restart openclaw-agent-monitor   # or: start / stop
journalctl -u openclaw-agent-monitor -f
```
|
|
|
### Deployment Checklist for New Servers |
|
```bash
# 1. Enable user linger (MUST DO FIRST)
loginctl enable-linger "$(whoami)"

# 2. Create runtime directory if needed
#    (systemd-logind normally creates it once linger is enabled)
mkdir -p "/run/user/$(id -u)"
chmod 700 "/run/user/$(id -u)"

# 3. Export environment (add to ~/.bashrc for persistence)
echo 'export XDG_RUNTIME_DIR=/run/user/$(id -u)' >> ~/.bashrc
echo 'export DBUS_SESSION_BUS_ADDRESS=unix:path=/run/user/$(id -u)/bus' >> ~/.bashrc

# 4. Install services
./deploy.sh install

# 5. Verify
./deploy.sh health
```
|
|
|
### Troubleshooting Guide |
|
|
|
#### Error: "Failed to connect to bus: No medium found" |
|
**Cause**: User linger not enabled or environment variables not set |
|
**Fix**: |
|
```bash
loginctl enable-linger "$(whoami)"
export XDG_RUNTIME_DIR="/run/user/$(id -u)"
export DBUS_SESSION_BUS_ADDRESS="unix:path=/run/user/$(id -u)/bus"
```
|
|
|
#### Error: "Start request repeated too quickly" |
|
**Cause**: Service crashing due to misconfiguration |
|
**Fix**: Check logs with `journalctl --user -u openclaw-gateway -f` |
|
|
|
#### User service not starting after reboot |
|
**Cause**: Linger not enabled |
|
**Fix**: `loginctl enable-linger $(whoami)` |
|
|
|
### Best Practices for Multi-Agent Deployments |
|
1. **Always enable linger** on first setup - document this in deployment guide |
|
2. **Use hybrid architecture** - user-level for agents, system-level for monitors |
|
3. **Set environment variables** in startup scripts, not just shell config |
|
4. **Test after reboot** - verify services auto-start correctly |
|
5. **Document in MEMORY.md** - share lessons across agent instances |
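
Sketch for practice 3: a function a startup script can call so `systemctl --user` works even when `~/.bashrc` was never read (cron, non-login shells). Assumption: `ensure_user_bus_env` is a hypothetical name; the paths are the same ones used in the setup commands above.

```bash
# Set XDG_RUNTIME_DIR and DBUS_SESSION_BUS_ADDRESS only when they are unset
# or empty, then export them for child processes such as `systemctl --user`.
ensure_user_bus_env() {
  uid="$(id -u)"
  : "${XDG_RUNTIME_DIR:=/run/user/$uid}"
  : "${DBUS_SESSION_BUS_ADDRESS:=unix:path=$XDG_RUNTIME_DIR/bus}"
  export XDG_RUNTIME_DIR DBUS_SESSION_BUS_ADDRESS
}

# Usage at the top of a startup script:
#   ensure_user_bus_env
#   systemctl --user restart openclaw-gateway
```

Values that are already set are left alone, so an environment provided by a real login session is never overwritten.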
|
|
|
### Updated deploy.sh Features |
|
- Automatically enables linger during install |
|
- Sets up XDG_RUNTIME_DIR and DBUS_SESSION_BUS_ADDRESS |
|
- Uses `systemctl --user` for gateway, `systemctl` for monitor |
|
- Health check verifies linger status and runtime directory |
|
- Proper log commands for both service types |
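
Sketch of the linger and runtime-directory verification. Assumptions: this is not the actual `deploy.sh health` code; it relies on `loginctl show-user <user> --property=Linger` printing `Linger=yes` or `Linger=no`, and both function names are hypothetical.

```bash
# Succeeds when the given `loginctl show-user ... --property=Linger`
# output line reports linger as enabled.
linger_enabled() {
  [ "$1" = "Linger=yes" ]
}

# Print the linger and runtime-directory status for a user (default: current user).
check_user_service_prereqs() {
  user="${1:-$(id -un)}"
  uid="$(id -u "$user")"
  if linger_enabled "$(loginctl show-user "$user" --property=Linger 2>/dev/null)"; then
    echo "linger: enabled"
  else
    echo "linger: DISABLED (run: loginctl enable-linger $user)"
  fi
  if [ -d "/run/user/$uid" ]; then
    echo "runtime dir: present"
  else
    echo "runtime dir: MISSING (/run/user/$uid)"
  fi
}

# Usage:
#   check_user_service_prereqs root
```

If `loginctl` fails or the user is unknown to logind, the output is empty and the check reports linger as disabled, which is the safe default.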
|
|
|
--- |
|
## Security Audit False-Positive Analysis (2026-02-26)

### Background

Running `openclaw security audit --deep` reported 4 CRITICAL/WARNING issues. Manual review confirmed that they are false positives or known trade-offs.
|
|
|
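### False Positives and Causes

| Audit item | Original rating | Review conclusion | Root cause |
|------------|-----------------|-------------------|------------|
| Gateway bound to `lan` | CRITICAL | False positive | The audit tool statically analyzes the config file and cannot see that the runtime binding is to Tailscale (100.115.94.1) |
| Device authentication disabled | CRITICAL | Known trade-off | Works around the `isSecureContext=false` problem over HTTP; protected by both Tailscale and a token |
| No plugin allowlist | WARNING | Fix recommended | Confirmed as deferred for now (low cost, but limited benefit) |
| No rate limiting | WARNING | Threat model mismatch | Closed Tailscale network plus a strong 48-character token; brute-force risk is close to zero |
| `MemoryLimit` deprecated | WARNING | False positive | The audit referenced the workspace template; the actual service file does not contain this parameter |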
|
|
|
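### Key Lessons

1. **Security audits are static analysis** - they cannot replace human judgment and must be combined with runtime context

2. **Understand the threat model** - the threat scenarios an audit assumes must match the actual deployment environment

3. **Record known trade-offs** - document in MEMORY.md why certain "security issues" were accepted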
|
|
|
### Detailed Documentation

- Audit report: `logs/operations/2026-02-26-20-59-30-config-audit-report.md`

- Review analysis: `logs/operations/2026-02-26-21-05-00-security-audit-review.md`

- Fix script: `fix-security-config.sh` (not executed)
|
|
|
--- |
|
|
|
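## Configuration Cleanup and Push (2026-02-26)

### Operations

- Deleted the obsolete `life/` directory (empty config, not referenced by any file)

- Cleaned up nested git repositories (`agents/life-workspace/.git`, `skills/openclaw-wecom/.git`)

- Removed Python caches and runtime state files

- Committed and pushed to the remote repository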
|
|
|
### Git Commit

- Commit: `378523c chore: 配置审计和清理 - 2026-02-26`

- Remote: `gl.tigerone.tech:sw_dm/openClaw_agent_dm.git`

- Backup: `/root/.openclaw/backups/workspace-20260226-210956.tar.gz`
|
|
|
### Retained Directories

- `agents/life-workspace/` - Agent workspace used for testing

- `skills/openclaw-wecom/` - WeCom (企业微信) skill (TypeScript implementation)
|
|
|
|