It's 9am Monday. Your team has sent 12 messages to the Slack bot since Friday. None got a response. The bot has been down since Saturday afternoon.
You find out when someone asks you directly: "Is the AI assistant broken?"
This is the most common complaint in the OpenClaw community. Bots go offline. Nobody notices immediately. By the time someone does, hours or days have passed.
Here are the six most common causes — and what to do about each one.
Cause 1: RAM Exhaustion
OpenClaw's memory footprint grows over time, especially with active conversations and skills that cache data. On a 1GB RAM VPS (the cheapest option), this is a near-certainty.
How it happens: OpenClaw starts fine. After a few days or a week of conversations, memory usage crosses the limit and the kernel's OOM killer terminates the process. Nobody notices.
How to diagnose:
# Check if OpenClaw is running
ps aux | grep openclaw

# Check memory pressure when it was killed
journalctl -u openclaw --since "2 days ago" | grep -i "killed\|oom"
How to fix: Either upgrade to a 2GB+ VPS, or configure a swap file. Swap isn't ideal for performance but prevents crashes:
fallocate -l 2G /swapfile
chmod 600 /swapfile
mkswap /swapfile
swapon /swapfile

# Make the swap file survive reboots
echo '/swapfile none swap sw 0 0' >> /etc/fstab
The memory leak problem
Some community skills have memory leaks. If your instance restarts daily at roughly the same time, you likely have a leaky skill. Remove skills one at a time to identify the culprit.
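Before removing skills blindly, it helps to have a memory timeline to compare against each skill's activity. A minimal sampling sketch you can run from cron — the script name, log path, and interval are illustrative, not OpenClaw defaults:

```shell
#!/bin/sh
# Print one timestamped RSS sample (in KB) for the OpenClaw process.
# Schedule via cron with output redirected, e.g.:
#   */5 * * * * /usr/local/bin/openclaw-mem.sh >> /var/log/openclaw-mem.log
pid=$(pgrep -f openclaw | head -n 1)
if [ -n "$pid" ]; then
    printf '%s %s\n' "$(date -Is)" "$(ps -o rss= -p "$pid" | tr -d ' ')"
fi
```

If the log shows memory climbing in step with a particular skill's activity, that skill is the likely culprit.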
Cause 2: Uncaught Exceptions in Skills
A skill crashes. OpenClaw's error handling doesn't catch it cleanly. The whole process exits.
This is especially common with:
- Skills that make external API calls without timeout handling
- Skills that parse user input without input validation
- Skills using dynamic code execution patterns
How to diagnose:
# Check the last few minutes before crash time
journalctl -u openclaw --since "1 hour ago" -p err
How to fix: The proper solution is better skill error isolation (which OpenClaw's core team is working on). The practical fix: use a process manager.
# Install PM2
npm install -g pm2

# Start OpenClaw with PM2 (auto-restarts on crash)
pm2 start "node openclaw/index.js" --name openclaw
pm2 save
pm2 startup   # then run the command it prints to enable startup on boot
Cause 3: API Key Expiration or Rate Limits
Your AI provider's API key hits a rate limit, spends over a quota, or gets rotated — and OpenClaw doesn't fail gracefully. It just stops.
How to diagnose: Check your AI provider dashboard for the time the bot went offline. Look for:
- Rate limit errors
- Spending limit hits
- Key rotation events
How to fix: Set spending alerts (not limits — limits cut off the service) at 80% of your budget. Get notified before the problem, not after.
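Dashboard alerts only help if someone is watching the dashboard. A periodic probe from the server itself can catch a dead key within minutes. A sketch, assuming a generic bearer-token provider — the endpoint URL and the `OPENCLAW_API_KEY` variable are placeholders to replace with your provider's values:

```shell
#!/bin/sh
# classify_status: map an HTTP status code to an actionable label.
classify_status() {
    case "$1" in
        401|403) echo "key-rejected" ;;
        429)     echo "rate-limited" ;;
        2*)      echo "ok" ;;
        *)       echo "unknown" ;;
    esac
}

# probe: hit the provider's cheapest authenticated endpoint and log failures.
# Run from cron every few minutes; wire the output into whatever alerts you.
probe() {
    API_URL="https://api.example.com/v1/models"   # placeholder endpoint
    status=$(curl -s -o /dev/null -w '%{http_code}' \
        -H "Authorization: Bearer $OPENCLAW_API_KEY" "$API_URL")
    result=$(classify_status "$status")
    [ "$result" = "ok" ] || echo "$(date -Is) API check failed: $result (HTTP $status)" >&2
    [ "$result" = "ok" ]
}
```

A 401/403 here means the key itself is dead (rotation or revocation); a 429 points at rate or spending limits instead.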
Cause 4: Telegram/Slack/Discord API Issues
The channel integration breaks without OpenClaw failing. This happens when:
- Telegram bot token gets revoked (rare, but happens after policy changes)
- Slack app permissions get modified by an admin
- Discord rate limits hit during high-activity periods
How to diagnose: The bot process is running, but messages aren't being processed. Check the channel-specific logs:
journalctl -u openclaw | grep -i "telegram\|webhook\|401\|403"
How to fix: Reconnecting the channel integration usually resolves this. The harder problem: how do you know it happened?
The visibility problem
A channel disconnect looks identical to a complete crash from the outside — the bot just stops responding. But the fix is completely different. Without proper logging, you're guessing.
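One way to close that gap is a cron check that distinguishes "process down" from "process up but silent". A sketch — the grep pattern is an assumption, so match it to whatever your instance actually logs when it processes a message:

```shell
#!/bin/sh
# count_recent: count message-processing lines on stdin.
count_recent() {
    grep -ci "received message" || true   # "|| true": grep exits 1 on zero matches
}

# check_channel_activity: return 1 when the process is alive but no
# channel traffic has been logged in the last 15 minutes.
check_channel_activity() {
    pgrep -f openclaw > /dev/null || return 0   # full crash: process-level checks catch this
    n=$(journalctl -u openclaw --since "15 min ago" | count_recent)
    if [ "$n" -eq 0 ]; then
        echo "process up but no messages in 15 min - likely channel disconnect" >&2
        return 1
    fi
}
```

On a low-traffic bot, widen the window so quiet weekends don't page you.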
Cause 5: VPS Maintenance Windows
Your hosting provider takes the server down for maintenance. This is usually brief (5–30 minutes), but if OpenClaw doesn't auto-start on reboot, it stays down indefinitely.
How to fix:
# Enable OpenClaw to start on boot (systemd)
sudo systemctl enable openclaw

# Or with PM2
pm2 save
pm2 startup
Test this by rebooting your VPS and verifying the bot comes back up without manual intervention.
Cause 6: Disk Full
OpenClaw generates logs. Logs fill disks. Full disks crash everything.
How to diagnose:
df -h
du -sh /var/log/openclaw/ 2>/dev/null
How to fix: Configure log rotation:
# /etc/logrotate.d/openclaw
/var/log/openclaw/*.log {
weekly
rotate 4
compress
missingok
notifempty
}
The Monitoring Gap
Here's the core problem: basic uptime monitoring (is the process running?) doesn't catch half these failure modes.
The bot process can be running while:
- Memory is full and new requests are being dropped
- The Telegram integration is disconnected
- API calls are failing and being swallowed by bad error handling
- The disk is 99% full and new writes are failing
Real reliability requires monitoring at multiple layers, not just process-level health checks.
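If you are staying self-hosted, the layers above can be folded into a single cron-driven check. A sketch, not a complete monitor — the unit name, log pattern, and 90% disk threshold are assumptions to tune for your setup:

```shell
#!/bin/sh
# Multi-layer health check: process, disk, and channel activity.

disk_pct() {  # percent used on the filesystem holding $1
    df --output=pcent "$1" | tail -n 1 | tr -dc '0-9'
}

health_check() {
    fail=0
    # Layer 1: process - is OpenClaw running at all?
    pgrep -f openclaw > /dev/null || { echo "FAIL: process down"; fail=1; }
    # Layer 2: disk - writes start failing well before 100%
    [ "$(disk_pct /)" -lt 90 ] || { echo "FAIL: disk nearly full ($(disk_pct /)%)"; fail=1; }
    # Layer 3: channel - the process can be up while the integration is dead
    n=$(journalctl -u openclaw --since "15 min ago" 2>/dev/null | grep -ci "message" || true)
    [ "$n" -gt 0 ] || { echo "FAIL: no channel activity in 15 min"; fail=1; }
    return $fail
}
```

Run it every few minutes and pipe non-zero exits into whatever already alerts you (email, a webhook, a different chat channel than the bot's own).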
What Clawfleet Does Instead
Rather than giving you a monitoring configuration problem to solve, Clawfleet handles each failure mode:
- RAM exhaustion: Managed infrastructure with automatic scaling
- Skill crashes: Skill-level sandboxing, not process-level
- API issues: Real-time cost and error dashboard with alerts
- Channel disconnects: Detected and alerted within 60 seconds
- VPS maintenance: Zero-downtime infrastructure management
- Disk full: Managed logging with automatic rotation and archival
The on-call burden of a self-hosted OpenClaw instance is real. Most people don't want it — they just want the bot to work.
99.9% uptime SLA, included
Auto-restart, health monitoring, and incident alerts on every Clawfleet plan.
