DevOps Engineer

Render Backend Monitoring β€” The Agent Caught 2 Incidents Before I Woke Up

2 incidents caught overnight, MTTR 2-5 min · Engineering & DevOps · 5 min read

Key Takeaway

An AI agent monitors our Django backend on Render.com 24/7. It caught a server crash at 3am and a memory leak at 5am β€” diagnosing both and suggesting fixes before I opened my eyes.

The Problem

Backend monitoring is a solved problem if you're a Series B company with a dedicated SRE team and a $2,000/month Datadog bill. We're not that.

We run eskimoai-api on Render. Django, Python, PostgreSQL. It serves our production users. It needs to stay up. But Render's built-in monitoring is basic β€” you get CPU/memory graphs and deploy logs. No intelligent alerting. No root cause analysis. No "hey, this endpoint is leaking memory."

The alternative: set up Prometheus + Grafana + Alertmanager + PagerDuty. That's a week of infrastructure work plus ongoing maintenance — for a team where the infrastructure engineer is also the product engineer, who is also me.

I needed monitoring that's smart enough to diagnose problems, not just report them. And it had to cost nothing in setup or maintenance time.

The Solution

Thom's DevOps sub-agent monitors our Render services continuously. Health checks every 5 minutes. Log analysis on anomalies. Incident detection, diagnosis, and response β€” all delivered to Telegram with suggested fixes.

Two real incidents. Both caught while I was sleeping.

The Process

The monitoring loop runs every 5 minutes:

```bash
# Health check: capture status code and total time, pipe-delimited
response=$(curl -s -o /tmp/health.json -w "%{http_code}|%{time_total}" \
  https://eskimoai-api.onrender.com/health/)

status_code=$(echo "$response" | cut -d'|' -f1)
response_time=$(echo "$response" | cut -d'|' -f2)

# Alert thresholds: non-200 status, or a response slower than 2 seconds
if [ "$status_code" != "200" ] || [ "$(echo "$response_time > 2.0" | bc)" -eq 1 ]; then
  trigger_incident "$status_code" "$response_time"
fi
```
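The two thresholds can be factored into one small predicate, which keeps the loop readable and makes the logic testable on its own (a sketch; `is_unhealthy` is an illustrative name, and `bc` handles the float comparison that `[` can't):

```bash
# Sketch: health predicate — status must be 200 and latency under 2s.
is_unhealthy() {
  local status_code="$1" response_time="$2"
  [ "$status_code" != "200" ] && return 0
  # bc prints 1 when the response time exceeds the 2.0s threshold
  [ "$(echo "$response_time > 2.0" | bc)" -eq 1 ] && return 0
  return 1
}

is_unhealthy 502 0.3 && echo "incident"   # down
is_unhealthy 200 3.1 && echo "incident"   # slow
is_unhealthy 200 0.3 || echo "healthy"
```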

The agent also pulls Render service metrics via the API:

```bash
# Fetch service metrics from the Render API
curl -s -H "Authorization: Bearer $RENDER_API_KEY" \
  "https://api.render.com/v1/services/$SERVICE_ID/metrics?period=1h" \
  | jq '{cpu: .cpu, memory: .memory, requests: .requests}'
```

Incident 1: Server Crash at 3:17 AM

The health check returned a 502. The agent escalated immediately:

🚨 INCIDENT β€” eskimoai-api DOWN
  Time: 03:17 UTC
  Status: 502 Bad Gateway
  Duration: Health check failed (was healthy 5 min ago)

  πŸ“‹ Render Logs (last 10 lines):
  > [ERROR] Worker pid:12 exited with code 137 (OOM killed)
  > [ERROR] Worker pid:14 exited with code 137 (OOM killed)
  > [INFO] No workers available. Returning 502.

  πŸ” Diagnosis: Out-of-Memory kill. Both Gunicorn workers
  exceeded the 512MB memory limit simultaneously.

  Recent deploys: v2.1.8 deployed 6 hours ago β€” includes
  new PDF export endpoint.

  πŸ’‘ Suggested fix:
  1. Immediate: Restart service (clears memory)
  2. Investigate: PDF export likely loads full documents into
     memory. Add streaming/chunked response.
  3. Consider: Upgrade from 512MB to 1GB if PDF exports
     are expected usage.

  πŸ”„ Auto-restarting service now...

The agent triggered a service restart via the Render API:

```bash
# Restart the service via the Render API
curl -s -X POST \
  -H "Authorization: Bearer $RENDER_API_KEY" \
  "https://api.render.com/v1/services/$SERVICE_ID/restart"
```

Then confirmed recovery:

βœ… eskimoai-api recovered
  Time: 03:19 UTC
  Downtime: 2 minutes
  Health check: 200 OK (response: 340ms)

  ⚠️ Root cause unresolved. PDF export endpoint needs memory fix.
  Created issue #67: "PDF export OOM β€” add streaming response"

Incident 2: Memory Leak at 5:04 AM

Two hours later, the agent detected a pattern:

⚠️ WARNING β€” eskimoai-api memory trending up
  Current: 78% of 512MB limit (398MB)
  30 min ago: 62% (317MB)
  1 hour ago: 45% (230MB)

  πŸ“ˆ Growth rate: 2.8MB/minute β€” will OOM in ~40 minutes

  πŸ” Analysis: Access log shows /api/v1/export/report
  called 47 times in the last hour (automated client).
  Each call increases RSS by ~3.5MB without release.

  πŸ’‘ Diagnosis: Memory not released after PDF generation.
  Likely cause: file handles or BytesIO buffers not closed.

  Suggested:
  1. Restart now (preventive, before OOM)
  2. Rate-limit /export/report to 10 req/min
  3. Fix: ensure `with` context managers on all file
     operations in export views

  πŸ”„ Restarting preemptively...

The agent identified the leaking endpoint by correlating memory growth with access patterns β€” no APM tool required. Just log analysis and basic math.
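That correlation needs nothing fancier than counting hits per path in the access log over the same window (a sketch; the field position assumes a common/combined log format, and the sample log lines here are fabricated for illustration):

```bash
# Sketch: rank endpoints by hit count from an access log.
# In common log format the request path is the 7th whitespace field.
log=/tmp/access.log
{
  echo 'h - - [01/Mar +0000] "GET /api/v1/export/report HTTP/1.1" 200 1'
  echo 'h - - [01/Mar +0000] "GET /api/v1/export/report HTTP/1.1" 200 1'
  echo 'h - - [01/Mar +0000] "GET /health/ HTTP/1.1" 200 1'
} > "$log"

awk '{print $7}' "$log" | sort | uniq -c | sort -rn | head -5
```

Overlay that ranking on the memory timeline and the leaking endpoint usually identifies itself.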

The Weekly Report

πŸ“Š eskimoai-api Weekly Health (Mar 2-8)

  Uptime: 99.7% (2 incidents, 5 min total downtime)
  Avg response time: 245ms (P95: 890ms)
  Total requests: 47,302
  Error rate: 0.3%

  Memory: Avg 45% | Peak 78% (during incident)
  CPU: Avg 12% | Peak 34%

  πŸ“ˆ Traffic up 40% week-over-week.
  πŸ’‘ Recommendation: Consider upgrading from Starter ($7/mo)
  to Standard ($25/mo) if growth continues. Current headroom
  is thin for memory-intensive endpoints.

The Results

| Metric | Before (Manual) | After (Agent) | Improvement |
|---|---|---|---|
| Incident detection time | When users complained | <5 minutes | Proactive |
| Overnight incidents caught | 0% (sleeping) | 100% | 24/7 coverage |
| Mean time to diagnosis | 15-30 min (manual log reading) | Instant (auto-analyzed) | 95% faster |
| Mean time to recovery | 20-60 min | 2-5 min (auto-restart) | 90% faster |
| Monthly monitoring cost | $0 (no monitoring) or $50+ (Datadog) | $0 (agent + Render API) | Free |
| Memory leaks caught proactively | 0 | 2 this month | Preventive |

Try It Yourself

You need: Render API key, a health endpoint, and a cron loop. The intelligence is in correlating metrics with access logs β€” when memory spikes, check which endpoints were hit. When response times degrade, check recent deploys. Most incidents follow patterns. Teach the agent the patterns.
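A minimal version of that loop is one script plus one crontab line (a sketch; `check_url`, `tick`, and the script path are placeholders to fill in — the real agent sends alerts to Telegram rather than stdout):

```bash
# Sketch: one monitoring tick, designed to run from cron every 5 minutes.
check_url() {
  # Print the HTTP status for a URL ("000" on connection failure).
  curl -s -o /dev/null -w "%{http_code}" --max-time 10 "$1" || true
}

tick() {
  local code
  code=$(check_url "$1")
  if [ "$code" != "200" ]; then
    echo "ALERT: $1 returned HTTP $code"  # replace with a Telegram notify
  fi
}

# crontab entry:
#   */5 * * * * /opt/agent/monitor.sh >> /var/log/agent-monitor.log 2>&1
```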


The agent caught two incidents before I woke up. The old me would have woken up to angry user emails.

Render · Monitoring · Django · Incident Response · DevOps

Want results like these?

Start free with your own AI team. No credit card required.

— Mr.Chief