DevOps Engineer
Render Backend Monitoring — The Agent Caught 2 Incidents Before I Woke Up
Key Takeaway
An AI agent monitors our Django backend on Render.com 24/7. It caught a server crash at 3am and a memory leak at 5am — diagnosing both and suggesting fixes before I opened my eyes.
The Problem
Backend monitoring is a solved problem if you're a Series B company with a dedicated SRE team and a $2,000/month Datadog bill. We're not that.
We run eskimoai-api on Render. Django, Python, PostgreSQL. It serves our production users. It needs to stay up. But Render's built-in monitoring is basic — you get CPU/memory graphs and deploy logs. No intelligent alerting. No root cause analysis. No "hey, this endpoint is leaking memory."
The alternative: set up Prometheus + Grafana + Alertmanager + PagerDuty. That's a week of infrastructure work and ongoing maintenance. For a team where the infrastructure engineer is also the product engineer is also me.
I needed monitoring that's smart enough to diagnose problems, not just report them. And it had to cost nothing in setup or maintenance time.
The Solution
Thom's DevOps sub-agent monitors our Render services continuously. Health checks every 5 minutes. Log analysis on anomalies. Incident detection, diagnosis, and response — all delivered to Telegram with suggested fixes.
Two real incidents. Both caught while I was sleeping.
The Process
The monitoring loop runs every 5 minutes:
```bash
# Health check: capture status code and total time; body saved for later analysis
response=$(curl -s -o /tmp/health.json -w "%{http_code}|%{time_total}" \
  https://eskimoai-api.onrender.com/health/)
status_code=$(echo "$response" | cut -d'|' -f1)
response_time=$(echo "$response" | cut -d'|' -f2)

# Alert thresholds: any non-200 status, or a response slower than 2 seconds
if [ "$status_code" != "200" ] || [ "$(echo "$response_time > 2.0" | bc)" -eq 1 ]; then
  trigger_incident "$status_code" "$response_time"
fi
```
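The bash loop hands off to a `trigger_incident` helper that isn't shown. A minimal Python sketch of what such a helper might do — format the alert and post it via the Telegram Bot API's `sendMessage` method. The environment variable names and message layout are assumptions, not the agent's actual code:

```python
import os
import urllib.parse
import urllib.request


def format_incident(status_code: str, response_time: str) -> str:
    """Build the alert text for a failed health check."""
    return (
        "🚨 INCIDENT — eskimoai-api health check failed\n"
        f"Status: {status_code}\n"
        f"Response time: {response_time}s"
    )


def trigger_incident(status_code: str, response_time: str) -> None:
    """Post the alert to a Telegram chat via the Bot API."""
    token = os.environ["TELEGRAM_BOT_TOKEN"]    # assumed env var
    chat_id = os.environ["TELEGRAM_CHAT_ID"]    # assumed env var
    data = urllib.parse.urlencode({
        "chat_id": chat_id,
        "text": format_incident(status_code, response_time),
    }).encode()
    urllib.request.urlopen(
        f"https://api.telegram.org/bot{token}/sendMessage", data=data
    )
```

Anything richer (log excerpts, diagnosis) gets appended to the same message before sending.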
The agent also pulls Render service metrics via the API:
```bash
# Fetch service metrics from the Render API
curl -s -H "Authorization: Bearer $RENDER_API_KEY" \
  "https://api.render.com/v1/services/$SERVICE_ID/metrics?period=1h" \
  | jq '{cpu: .cpu, memory: .memory, requests: .requests}'
```
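Once the metrics are in hand, the threshold logic is simple. A sketch, assuming a flat dict with `cpu` and `memory` as percentages of the instance limit, matching the `jq` filter above — the real Render metrics payload may be shaped differently:

```python
def check_metrics(metrics: dict,
                  memory_warn: float = 75.0,
                  cpu_warn: float = 80.0) -> list[str]:
    """Return a human-readable warning for each breached threshold."""
    warnings = []
    if metrics.get("memory", 0) >= memory_warn:
        warnings.append(f"memory at {metrics['memory']}% of limit")
    if metrics.get("cpu", 0) >= cpu_warn:
        warnings.append(f"cpu at {metrics['cpu']}%")
    return warnings


# Example: the 5:04 AM reading would have tripped the memory threshold
print(check_metrics({"cpu": 12, "memory": 78, "requests": 47302}))
# → ['memory at 78% of limit']
```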
Incident 1: Server Crash at 3:17 AM
The health check returned a 502. The agent escalated immediately:
```
🚨 INCIDENT — eskimoai-api DOWN
Time: 03:17 UTC
Status: 502 Bad Gateway
Duration: Health check failed (was healthy 5 min ago)

📋 Render Logs (last 10 lines):
> [ERROR] Worker pid:12 exited with code 137 (OOM killed)
> [ERROR] Worker pid:14 exited with code 137 (OOM killed)
> [INFO] No workers available. Returning 502.

🔍 Diagnosis: Out-of-Memory kill. Both Gunicorn workers
exceeded the 512MB memory limit simultaneously.
Recent deploys: v2.1.8 deployed 6 hours ago — includes
new PDF export endpoint.

💡 Suggested fix:
1. Immediate: Restart service (clears memory)
2. Investigate: PDF export likely loads full documents into
   memory. Add streaming/chunked response.
3. Consider: Upgrade from 512MB to 1GB if PDF exports
   are expected usage.

🔄 Auto-restarting service now...
```
The agent triggered a service restart via the Render API:
```bash
# Trigger a service restart via the Render API
curl -s -X POST \
  -H "Authorization: Bearer $RENDER_API_KEY" \
  "https://api.render.com/v1/services/$SERVICE_ID/restart"
```
Then confirmed recovery:
```
✅ eskimoai-api recovered
Time: 03:19 UTC
Downtime: 2 minutes
Health check: 200 OK (response: 340ms)

⚠️ Root cause unresolved. PDF export endpoint needs memory fix.
Created issue #67: "PDF export OOM — add streaming response"
```
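The fix behind issue #67 — stream the PDF instead of materializing it in memory — might look like this in a Django view. A sketch only: the view name, the `build_report_pdf` helper, and the chunk size are assumptions; `StreamingHttpResponse` is Django's standard API for exactly this:

```python
CHUNK_SIZE = 64 * 1024  # 64 KB chunks keep peak RSS flat


def stream_pdf(file_obj, chunk_size: int = CHUNK_SIZE):
    """Yield the PDF in chunks and close the handle when done."""
    try:
        while chunk := file_obj.read(chunk_size):
            yield chunk
    finally:
        file_obj.close()  # released even if the client disconnects


def export_report(request):
    # Django import kept local so the streaming helper stays framework-free
    from django.http import StreamingHttpResponse

    pdf = build_report_pdf(request)  # hypothetical helper returning an open file object
    response = StreamingHttpResponse(stream_pdf(pdf),
                                     content_type="application/pdf")
    response["Content-Disposition"] = 'attachment; filename="report.pdf"'
    return response
```

The point is that at no time does the whole document sit in the worker's memory.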
Incident 2: Memory Leak at 5:04 AM
Two hours later, the agent detected a pattern:
```
⚠️ WARNING — eskimoai-api memory trending up
Current: 78% of 512MB limit (398MB)
30 min ago: 62% (317MB)
1 hour ago: 45% (230MB)

📈 Growth rate: 2.8MB/minute — will OOM in ~40 minutes

🔍 Analysis: Access log shows /api/v1/export/report
called 47 times in the last hour (automated client).
Each call increases RSS by ~3.5MB without release.

💡 Diagnosis: Memory not released after PDF generation.
Likely cause: file handles or BytesIO buffers not closed.
Suggested:
1. Restart now (preventive, before OOM)
2. Rate-limit /export/report to 10 req/min
3. Fix: ensure `with` context managers on all file
   operations in export views

🔄 Restarting preemptively...
```
The agent identified the leaking endpoint by correlating memory growth with access patterns — no APM tool required. Just log analysis and basic math.
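The "basic math" is linear extrapolation over the sampled memory readings. A sketch that reproduces the 5:04 AM numbers (timestamps in minutes, RSS in MB, against the 512MB limit):

```python
def oom_eta_minutes(samples, limit_mb):
    """Least-squares slope over (minute, rss_mb) samples, then minutes
    until the limit is hit. Returns None if memory is flat or shrinking."""
    n = len(samples)
    mean_t = sum(t for t, _ in samples) / n
    mean_m = sum(m for _, m in samples) / n
    num = sum((t - mean_t) * (m - mean_m) for t, m in samples)
    den = sum((t - mean_t) ** 2 for t, _ in samples)
    slope = num / den  # MB per minute
    if slope <= 0:
        return None
    _, latest_m = samples[-1]
    return (limit_mb - latest_m) / slope


# The readings from the alert: 1 hour ago, 30 min ago, now
samples = [(0, 230), (30, 317), (60, 398)]
print(oom_eta_minutes(samples, 512))  # slope 2.8 MB/min → ~40.7 minutes
```

Run on the alert's own readings, this recovers both the 2.8MB/minute growth rate and the ~40-minute ETA.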
Weekly Report:
```
📊 eskimoai-api Weekly Health (Mar 2-8)
Uptime: 99.7% (2 incidents, 5 min total downtime)
Avg response time: 245ms (P95: 890ms)
Total requests: 47,302
Error rate: 0.3%
Memory: Avg 45% | Peak 78% (during incident)
CPU: Avg 12% | Peak 34%

📈 Traffic up 40% week-over-week.
💡 Recommendation: Consider upgrading from Starter ($7/mo)
to Standard ($25/mo) if growth continues. Current headroom
is thin for memory-intensive endpoints.
```
The Results
| Metric | Before (Manual) | After (Agent) | Improvement |
|---|---|---|---|
| Incident detection time | When users complained | <5 minutes | Proactive |
| Overnight incidents caught | 0% (sleeping) | 100% | 24/7 coverage |
| Mean time to diagnosis | 15-30 min (manual log reading) | Instant (auto-analyzed) | 95% faster |
| Mean time to recovery | 20-60 min | 2-5 min (auto-restart) | 90% faster |
| Monthly monitoring cost | $0 (no monitoring) or $50+ (Datadog) | $0 (agent + Render API) | Free |
| Memory leaks caught proactively | 0 | 2 this month | Preventive |
Try It Yourself
You need: Render API key, a health endpoint, and a cron loop. The intelligence is in correlating metrics with access logs β when memory spikes, check which endpoints were hit. When response times degrade, check recent deploys. Most incidents follow patterns. Teach the agent the patterns.
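Glued together, the whole loop fits in a short script. A sketch in Python rather than bash, assuming the same health URL and thresholds as above; `send_alert` is a hypothetical stand-in for whatever notification hook you use:

```python
import time
import urllib.request

HEALTH_URL = "https://eskimoai-api.onrender.com/health/"


def classify(status_code: int, elapsed_s: float,
             slow_threshold_s: float = 2.0):
    """Same thresholds as the bash loop: non-200 or >2s is an incident."""
    if status_code != 200:
        return False, f"status {status_code}"
    if elapsed_s > slow_threshold_s:
        return False, f"slow response: {elapsed_s:.2f}s"
    return True, f"{elapsed_s * 1000:.0f}ms"


def check_once(url: str = HEALTH_URL):
    """One health probe: return (ok, detail)."""
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=10) as resp:
            return classify(resp.status, time.monotonic() - start)
    except OSError as exc:
        return False, f"unreachable: {exc}"


def main():
    while True:  # or run once per invocation from cron
        ok, detail = check_once()
        if not ok:
            send_alert(detail)  # hypothetical notification hook
        time.sleep(300)  # every 5 minutes
```

From there, layering on metrics pulls and log correlation is incremental work, not an infrastructure project.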
The agent caught two incidents before I woke up. The old me would have woken up to angry user emails.
Related case studies
Software Engineer
Debugging a Production Error — Agent Found the Bug From Logs Alone
A production 500 error affected 23 users. An AI agent diagnosed a race condition from logs alone and shipped a fix PR in 35 minutes. Full incident timeline inside.
DevOps Engineer
Blueprint Deploys on Render — Entire Backend Stack in One YAML
Deploy an entire backend stack on Render.com in 5 minutes using a single YAML blueprint. API server, worker, database, Redis, and cron — all wired up automatically by an AI agent.
Full-Stack Developer
Auto-Generated API Documentation — From Code to Docs in 3 Minutes
How an AI agent reads Django views, serializers, and models to generate complete OpenAPI specs and markdown API docs in 3 minutes. Auto-updates on every merge.