DevOps Engineer

Render Backend Monitoring β€” The Agent Caught 2 Incidents Before I Woke Up

2 incidents caught overnight, MTTR 2-5 min · Engineering & DevOps · 5 min read

Key Takeaway

An AI agent monitors our Django backend on Render.com 24/7. It caught a server crash at 3am and a memory leak at 5am β€” diagnosing both and suggesting fixes before I opened my eyes.

The Problem

Backend monitoring is a solved problem if you're a Series B company with a dedicated SRE team and a $2,000/month Datadog bill. We're not that.

We run eskimoai-api on Render. Django, Python, PostgreSQL. It serves our production users. It needs to stay up. But Render's built-in monitoring is basic β€” you get CPU/memory graphs and deploy logs. No intelligent alerting. No root cause analysis. No "hey, this endpoint is leaking memory."

The alternative: set up Prometheus + Grafana + Alertmanager + PagerDuty. That's a week of infrastructure work plus ongoing maintenance — for a team where the infrastructure engineer is also the product engineer, who is also me.

I needed monitoring that's smart enough to diagnose problems, not just report them. And it had to cost nothing in setup or maintenance time.

The Solution

Thom's DevOps sub-agent monitors our Render services continuously. Health checks every 5 minutes. Log analysis on anomalies. Incident detection, diagnosis, and response β€” all delivered to Telegram with suggested fixes.

Two real incidents. Both caught while I was sleeping.

The Process

The monitoring loop runs every 5 minutes:

```bash
# Health check: capture status code and total time, pipe-delimited
response=$(curl -s -o /tmp/health.json -w "%{http_code}|%{time_total}" \
  https://eskimoai-api.onrender.com/health/)

status_code=$(echo "$response" | cut -d'|' -f1)
response_time=$(echo "$response" | cut -d'|' -f2)

# Alert thresholds: non-200 status, or a response slower than 2 seconds
if [ "$status_code" != "200" ] || [ "$(echo "$response_time > 2.0" | bc)" -eq 1 ]; then
  trigger_incident "$status_code" "$response_time"
fi
```
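The two thresholds can be factored into one small predicate, which keeps the loop readable and makes the logic testable on its own (a sketch; `is_unhealthy` is an illustrative name, and `bc` handles the float comparison that `[` can't):

```bash
# Sketch: health predicate — status must be 200 and latency under 2s.
is_unhealthy() {
  local status_code="$1" response_time="$2"
  [ "$status_code" != "200" ] && return 0
  # bc prints 1 when the response time exceeds the 2.0s threshold
  [ "$(echo "$response_time > 2.0" | bc)" -eq 1 ] && return 0
  return 1
}

is_unhealthy 502 0.3 && echo "incident"   # down
is_unhealthy 200 3.1 && echo "incident"   # slow
is_unhealthy 200 0.3 || echo "healthy"
```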

The agent also pulls Render service metrics via the API:

```bash
# Fetch service metrics from the Render API
curl -s -H "Authorization: Bearer $RENDER_API_KEY" \
  "https://api.render.com/v1/services/$SERVICE_ID/metrics?period=1h" \
  | jq '{cpu: .cpu, memory: .memory, requests: .requests}'
```

Incident 1: Server Crash at 3:17 AM

The health check returned a 502. The agent escalated immediately:

🚨 INCIDENT β€” eskimoai-api DOWN
  Time: 03:17 UTC
  Status: 502 Bad Gateway
  Duration: Health check failed (was healthy 5 min ago)

  πŸ“‹ Render Logs (last 10 lines):
  > [ERROR] Worker pid:12 exited with code 137 (OOM killed)
  > [ERROR] Worker pid:14 exited with code 137 (OOM killed)
  > [INFO] No workers available. Returning 502.

  πŸ” Diagnosis: Out-of-Memory kill. Both Gunicorn workers
  exceeded the 512MB memory limit simultaneously.

  Recent deploys: v2.1.8 deployed 6 hours ago β€” includes
  new PDF export endpoint.

  πŸ’‘ Suggested fix:
  1. Immediate: Restart service (clears memory)
  2. Investigate: PDF export likely loads full documents into
     memory. Add streaming/chunked response.
  3. Consider: Upgrade from 512MB to 1GB if PDF exports
     are expected usage.

  πŸ”„ Auto-restarting service now...

The agent triggered a service restart via the Render API:

```bash
# Restart the service via the Render API
curl -s -X POST \
  -H "Authorization: Bearer $RENDER_API_KEY" \
  "https://api.render.com/v1/services/$SERVICE_ID/restart"
```

Then confirmed recovery:

βœ… eskimoai-api recovered
  Time: 03:19 UTC
  Downtime: 2 minutes
  Health check: 200 OK (response: 340ms)

  ⚠️ Root cause unresolved. PDF export endpoint needs memory fix.
  Created issue #67: "PDF export OOM β€” add streaming response"

Incident 2: Memory Leak at 5:04 AM

Two hours later, the agent detected a pattern:

⚠️ WARNING β€” eskimoai-api memory trending up
  Current: 78% of 512MB limit (398MB)
  30 min ago: 62% (317MB)
  1 hour ago: 45% (230MB)

  πŸ“ˆ Growth rate: 2.8MB/minute β€” will OOM in ~40 minutes

  πŸ” Analysis: Access log shows /api/v1/export/report
  called 47 times in the last hour (automated client).
  Each call increases RSS by ~3.5MB without release.

  πŸ’‘ Diagnosis: Memory not released after PDF generation.
  Likely cause: file handles or BytesIO buffers not closed.

  Suggested:
  1. Restart now (preventive, before OOM)
  2. Rate-limit /export/report to 10 req/min
  3. Fix: ensure `with` context managers on all file
     operations in export views

  πŸ”„ Restarting preemptively...

The agent identified the leaking endpoint by correlating memory growth with access patterns β€” no APM tool required. Just log analysis and basic math.
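That correlation needs nothing fancier than counting hits per path in the access log over the same window (a sketch; the field position assumes a common/combined log format, and the sample log lines here are fabricated for illustration):

```bash
# Sketch: rank endpoints by hit count from an access log.
# In common log format the request path is the 7th whitespace field.
log=/tmp/access.log
{
  echo 'h - - [01/Mar +0000] "GET /api/v1/export/report HTTP/1.1" 200 1'
  echo 'h - - [01/Mar +0000] "GET /api/v1/export/report HTTP/1.1" 200 1'
  echo 'h - - [01/Mar +0000] "GET /health/ HTTP/1.1" 200 1'
} > "$log"

awk '{print $7}' "$log" | sort | uniq -c | sort -rn | head -5
```

Overlay that ranking on the memory timeline and the leaking endpoint usually identifies itself.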

The Weekly Report

πŸ“Š eskimoai-api Weekly Health (Mar 2-8)

  Uptime: 99.7% (2 incidents, 5 min total downtime)
  Avg response time: 245ms (P95: 890ms)
  Total requests: 47,302
  Error rate: 0.3%

  Memory: Avg 45% | Peak 78% (during incident)
  CPU: Avg 12% | Peak 34%

  πŸ“ˆ Traffic up 40% week-over-week.
  πŸ’‘ Recommendation: Consider upgrading from Starter ($7/mo)
  to Standard ($25/mo) if growth continues. Current headroom
  is thin for memory-intensive endpoints.

The Results

| Metric | Before (Manual) | After (Agent) | Improvement |
|---|---|---|---|
| Incident detection time | When users complained | <5 minutes | Proactive |
| Overnight incidents caught | 0% (sleeping) | 100% | 24/7 coverage |
| Mean time to diagnosis | 15-30 min (manual log reading) | Instant (auto-analyzed) | 95% faster |
| Mean time to recovery | 20-60 min | 2-5 min (auto-restart) | 90% faster |
| Monthly monitoring cost | $0 (no monitoring) or $50+ (Datadog) | $0 (agent + Render API) | Free |
| Memory leaks caught proactively | 0 | 2 this month | Preventive |

Try It Yourself

You need: Render API key, a health endpoint, and a cron loop. The intelligence is in correlating metrics with access logs β€” when memory spikes, check which endpoints were hit. When response times degrade, check recent deploys. Most incidents follow patterns. Teach the agent the patterns.
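A minimal version of that loop is one script plus one crontab line (a sketch; `check_url`, `tick`, and the script path are placeholders to fill in — the real agent sends alerts to Telegram rather than stdout):

```bash
# Sketch: one monitoring tick, designed to run from cron every 5 minutes.
check_url() {
  # Print the HTTP status for a URL ("000" on connection failure).
  curl -s -o /dev/null -w "%{http_code}" --max-time 10 "$1" || true
}

tick() {
  local code
  code=$(check_url "$1")
  if [ "$code" != "200" ]; then
    echo "ALERT: $1 returned HTTP $code"  # replace with a Telegram notify
  fi
}

# crontab entry:
#   */5 * * * * /opt/agent/monitor.sh >> /var/log/agent-monitor.log 2>&1
```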


The agent caught two incidents before I woke up. The old me would have woken up to angry user emails.

Render · Monitoring · Django · Incident Response · DevOps

Want results like these?

Start free with your own AI team. No credit card required.

— Mr.Chief