# MEMORY.md - Long-term Memory

This file contains curated long-term memories and important context.

## Memory Management Strategy

- **MEMORY.md**: Curated long-term memories, important decisions, security templates, and key configurations
- **QMD System**: Automated memory backend with semantic search; auto-updates every 5 minutes
- **Usage**: Write significant learnings to MEMORY.md; rely on QMD for daily context and automation
- **Access**: MEMORY.md is loaded only in main sessions (direct chats) for security

## QMD Configuration

- Backend: qmd
- Auto-update: every 5 minutes
- Include default memory: true
- Last verified: 2026-02-20

## Server Security Hardening Template (2026-02-20)

### Environment

- **Server**: Ubuntu 24.04 LTS VPS (KVM)
- **Panel**: 宝塔面板 (BT-Panel) on port 888
- **Public IP**: 204.12.203.203

### Security Configuration Applied

1. **Port Exposure Minimization**:
   - Only ports 80 (HTTP) and 443 (HTTPS) publicly accessible
   - SSH (port 22) restricted to internal network access only
   - OpenClaw gateway (port 18789) bound to localhost only
   - All other services (MySQL, custom apps) internal-only
2. **OpenClaw Secure Deployment**:
   - Gateway configured with `bind: "localhost"` instead of `"lan"`
   - Access exclusively through Nginx reverse proxy with HTTPS
   - Token-based authentication enabled
   - WebSocket support properly configured in Nginx
3. **Firewall Management**:
   - Use 宝塔面板 (BT-Panel) built-in firewall for port management
   - Alternative: system-level firewall (ufw/iptables) if no panel available
   - Regular external port scanning to verify exposure
4. **Critical Security Principles**:
   - Never expose sensitive services directly to the public internet
   - Always use a reverse proxy with TLS termination for web services
   - Implement defense in depth (firewall + service binding + authentication)
   - Run regular security audits using `openclaw security audit --deep`

### Migration Checklist for New Servers

- [ ] Install and configure 宝塔面板 or equivalent server management panel
- [ ] Set up Nginx reverse proxy with proper WebSocket support
- [ ] Configure OpenClaw with localhost binding only
- [ ] Restrict public ports to 80/443 only via firewall
- [ ] Enable automatic security updates
- [ ] Run initial security audit and document baseline
- [ ] Schedule periodic security audits via OpenClaw cron

### Lessons Learned

- Panel-based firewalls (宝塔/aaPanel) must be verified with external port scans
- Direct service exposure (like OpenClaw on 0.0.0.0) creates critical security risks
- Nginx reverse proxy configuration is essential for secure OpenClaw deployment

## Agent Operations Logging Practice (2026-02-20)

### Log Directory Structure

- `/root/.openclaw/workspace/logs/operations/` - Manual operations and important changes
- `/root/.openclaw/workspace/logs/system/` - System-generated logs
- `/root/.openclaw/workspace/logs/agents/` - Individual agent logs
- `/root/.openclaw/workspace/logs/security/` - Security operations and audits

### Automatic Logging Triggers

1. **Configuration Changes**: Any modification to config files (.json, .yaml, etc.)
2. **Security Modifications**: Firewall rules, authentication changes, port modifications
3. **Agent Lifecycle**: Deployment, updates, removal of agents
4. **System Optimizations**: Performance tuning, resource allocation changes
5. **Troubleshooting**: Error diagnosis and resolution procedures
6. **Memory Updates**: Significant changes to MEMORY.md or memory management

### Log Format Standard

- **Filename**: `YYYY-MM-DD-HH-MM-SS-description.log`
- **Timestamp**: UTC time format
- **Content**: `[TIMESTAMP] [OPERATION_TYPE] [AGENT/USER] Description with before/after state`

### Implementation Guidelines

- Always log before making changes (capture current state)
- Include rollback instructions when applicable
- Redact sensitive information (passwords, tokens, private keys)
- Reference related MEMORY.md entries for context
- Use QMD for routine operational context, MEMORY.md for strategic decisions

## Agent Health Monitoring & Alerting System (2026-02-20)

### Features Implemented

1. **Crash Detection**: Monitors uncaught exceptions and unhandled rejections
2. **Health Checks**: Periodic service health verification (every 30 seconds)
3. **Multi-Channel Notifications**: Telegram alerts for critical events
4. **Automatic Logging**: All alerts logged to `/logs/agents/health-YYYY-MM-DD.log`
5. **Extensible Design**: Easy to add new notification channels

### Components Created

- **Skill**: `agent-monitor/SKILL.md` - Documentation and usage guide
- **Monitor Script**: `agent-monitor.js` - Core monitoring logic
- **Startup Script**: `start-agent-monitor.sh` - Easy deployment
- **Log Directory**: `/logs/agents/` - Dedicated logging location

### Alert Severity Levels

- **CRITICAL**: Process crashes, uncaught exceptions
- **ERROR**: Unhandled rejections, failed operations
- **WARNING**: Health check failures, performance issues
- **INFO**: Service status updates, recovery notifications

### Integration Points

- Automatically integrated with existing Telegram channel
- Compatible with OpenClaw's agent architecture
- Works alongside existing logging and memory systems
- Can monitor any Node.js-based agent process

### Usage Instructions

1. Source the startup script: `source /root/.openclaw/workspace/start-agent-monitor.sh`
2. Call `startAgentMonitor("agent-name", healthCheckFunction)`
3. The monitor automatically sends alerts on errors/crashes
4. Check logs in `/logs/agents/` for detailed information

---

## Complete System Architecture Upgrade (2026-02-20 14:25 UTC)

### ✅ All 5 Core Requirements Implemented

#### 1. System-Level Persistence ✓

- **Systemd Services**: `openclaw-gateway.service` + `openclaw-agent-monitor.service`
- **Auto-start on Boot**: Both services enabled in multi-user.target
- **Resource Limits**: Memory (2G/512M), CPU (80%/20%), watchdog timers
- **Status**: `systemctl status openclaw-gateway` / `systemctl status openclaw-agent-monitor`

#### 2. Auto-Healing ✓

- **Crash Detection**: Monitors process exits, signals, uncaught exceptions
- **Auto-Restart**: Systemd Restart=always + monitor script restart logic
- **Restart Limits**: Max 5 restarts per 5 minutes (prevents restart loops)
- **Health Checks**: Every 30 seconds, automatic recovery on failure

#### 3. Multi-Layer Memory Architecture ✓

- **Core Memory**: `CORE_INDEX.md` - Identity, structure, file index (always loaded first)
- **Long-term Memory**: `MEMORY.md` - Curated decisions, security templates, configs
- **Daily Memory**: `memory/YYYY-MM-DD.md` - Raw conversation logs (auto-saved)
- **Passive Archive**: On-demand conversion of valuable conversations to skills/notes
- **Git Integration**: All memory files tracked with version history

#### 4. Git One-Click Rollback ✓

- **Repository**: `/root/.openclaw/workspace` (already initialized)
- **Deploy Script**: `./deploy.sh rollback` - Rollback to previous commit
- **Specific Rollback**: `./deploy.sh rollback-to ` - Rollback to specific commit
- **Auto-Backup**: Backup created before rollback
- **Service Restart**: Automatic service restart after rollback

#### 5. Telegram Notifications ✓

- **Triggers**: Service stop, error, crash, restart events
- **Channels**: Telegram (via bot API) + OpenClaw message tool
- **Severity Levels**: CRITICAL, ERROR, WARNING, INFO with emoji indicators
- **Logging**: All notifications logged to `/logs/agents/health-YYYY-MM-DD.log`

### 📋 Management Commands (deploy.sh)

```bash
./deploy.sh install      # Install & start all systemd services
./deploy.sh start        # Start all services
./deploy.sh stop         # Stop all services
./deploy.sh restart      # Restart all services
./deploy.sh status       # Show detailed service status
./deploy.sh logs         # Show recent logs (last 50 lines)
./deploy.sh health       # Run comprehensive health check
./deploy.sh backup       # Create timestamped backup
./deploy.sh rollback     # Rollback to previous git commit
./deploy.sh rollback-to  # Rollback to specific commit
./deploy.sh help         # Show help message
```

### 🔧 Systemd Service Details

- **Gateway Service**: `/etc/systemd/system/openclaw-gateway.service`
  - Memory limit: 2G, CPU: 80%, Watchdog: 30s
  - Restart: always, RestartSec: 10s
  - Logs: `journalctl -u openclaw-gateway -f`
- **Monitor Service**: `/etc/systemd/system/openclaw-agent-monitor.service`
  - Memory limit: 512M, CPU: 20%
  - Restart: always, RestartSec: 5s
  - Logs: `journalctl -u openclaw-agent-monitor -f`

### 📊 Health Check Metrics

- Gateway service status (active/inactive)
- Agent monitor status (active/inactive)
- Disk usage (warning at 80%)
- Memory usage (warning at 80%)

### 🎯 Next Steps (Future Enhancements)

- [ ] Add Prometheus/Grafana monitoring dashboard
- [ ] Implement log rotation and archival
- [ ] Add email notifications as backup channel
- [ ] Create web-based admin dashboard
- [ ] Add automated security scanning in CI/CD

---

## User-Level vs System-Level Systemd Services - Critical Lesson (2026-02-20 14:35 UTC)

### Problem Discovered

Initial deployment used system-level systemd services (`/etc/systemd/system/`) for OpenClaw Gateway, but OpenClaw natively uses **user-level systemd** (`~/.config/systemd/user/`). This caused:

- Service restart loops (5 attempts, then failure)
- Error: `systemctl --user unavailable: Failed to connect to bus: No medium found`
- Conflicts between system and user service definitions

### Root Cause

OpenClaw Gateway is designed as a user-level service because:

1. It runs under the user's context, not root
2. It needs access to user-specific config (`~/.openclaw/`)
3. User-level services have different environment requirements

### Solution: Hybrid Architecture

#### User-Level Service (Gateway)

- **Location**: `~/.config/systemd/user/openclaw-gateway.service`
- **Required Setup**:
  ```bash
  # Enable linger (CRITICAL - allows user services to run without login session)
  loginctl enable-linger $(whoami)

  # Set environment variables
  export XDG_RUNTIME_DIR=/run/user/$(id -u)
  export DBUS_SESSION_BUS_ADDRESS="unix:path=/run/user/$(id -u)/bus"
  ```
- **Management Commands**:
  ```bash
  systemctl --user status openclaw-gateway
  systemctl --user restart openclaw-gateway   # or: start / stop
  journalctl --user -u openclaw-gateway -f
  ```

#### System-Level Service (Agent Monitor)

- **Location**: `/etc/systemd/system/openclaw-agent-monitor.service`
- **Purpose**: Independently monitor the gateway (survives user session issues)
- **Management Commands**:
  ```bash
  systemctl status openclaw-agent-monitor
  systemctl restart openclaw-agent-monitor   # or: start / stop
  journalctl -u openclaw-agent-monitor -f
  ```

### Deployment Checklist for New Servers

```bash
# 1. Enable user linger (MUST DO FIRST)
loginctl enable-linger $(whoami)

# 2. Create runtime directory if needed
#    (systemd-logind normally creates it once linger is enabled)
mkdir -p /run/user/$(id -u)
chmod 700 /run/user/$(id -u)

# 3. Export environment (add to ~/.bashrc for persistence)
echo 'export XDG_RUNTIME_DIR=/run/user/$(id -u)' >> ~/.bashrc
echo 'export DBUS_SESSION_BUS_ADDRESS=unix:path=/run/user/$(id -u)/bus' >> ~/.bashrc

# 4. Install services
./deploy.sh install

# 5. Verify
./deploy.sh health
```

### Troubleshooting Guide

#### Error: "Failed to connect to bus: No medium found"

**Cause**: User linger not enabled or environment variables not set

**Fix**:
```bash
loginctl enable-linger $(whoami)
export XDG_RUNTIME_DIR=/run/user/$(id -u)
export DBUS_SESSION_BUS_ADDRESS="unix:path=/run/user/$(id -u)/bus"
```

#### Error: "Start request repeated too quickly"

**Cause**: Service crashing due to misconfiguration

**Fix**: Check logs with `journalctl --user -u openclaw-gateway -f`

#### User service not starting after reboot

**Cause**: Linger not enabled

**Fix**: `loginctl enable-linger $(whoami)`

### Best Practices for Multi-Agent Deployments

1. **Always enable linger** on first setup - document this in the deployment guide
2. **Use hybrid architecture** - user-level for agents, system-level for monitors
3. **Set environment variables** in startup scripts, not just shell config
4. **Test after reboot** - verify services auto-start correctly
5. **Document in MEMORY.md** - share lessons across agent instances

### Updated deploy.sh Features

- Automatically enables linger during install
- Sets up XDG_RUNTIME_DIR and DBUS_SESSION_BUS_ADDRESS
- Uses `systemctl --user` for gateway, `systemctl` for monitor
- Health check verifies linger status and runtime directory
- Proper log commands for both service types

---
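
### Sketch: Linger and Runtime Directory Check

The linger and runtime-directory verification described for the `deploy.sh` health check could look roughly like the following. This is a minimal sketch under stated assumptions, not the actual contents of `deploy.sh`: every function name here is hypothetical, and the only external commands assumed are `loginctl show-user` and `id`.

```bash
#!/usr/bin/env bash
# Sketch only: these helper names are hypothetical and are not taken from deploy.sh.

# Runtime directory path used for a given numeric UID.
runtime_dir_for() {
  printf '/run/user/%s\n' "$1"
}

# D-Bus session bus address that matches that runtime directory.
bus_address_for() {
  printf 'unix:path=%s/bus\n' "$(runtime_dir_for "$1")"
}

# Succeeds if loginctl reports Linger=yes for the user.
# If logind does not know the user, the output is empty and the check fails.
linger_enabled() {
  [ "$(loginctl show-user "$1" --property=Linger 2>/dev/null)" = "Linger=yes" ]
}

# Succeeds if the runtime directory for the UID exists.
runtime_dir_present() {
  [ -d "$(runtime_dir_for "$1")" ]
}

# Print one OK/FAIL line per check for the given user name.
check_user_service_env() {
  local user="$1" uid
  uid="$(id -u "$user")" || return 1

  if linger_enabled "$user"; then
    echo "OK   linger enabled for $user"
  else
    echo "FAIL linger not enabled for $user"
  fi

  if runtime_dir_present "$uid"; then
    echo "OK   $(runtime_dir_for "$uid") exists"
  else
    echo "FAIL $(runtime_dir_for "$uid") missing"
  fi
}
```

Calling `check_user_service_env "$(whoami)"` prints one line per check. A `FAIL` on the linger line points back to the `loginctl enable-linger` fix in the troubleshooting guide, and `bus_address_for "$(id -u)"` yields the value expected in `DBUS_SESSION_BUS_ADDRESS`.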