Debugging a Production Error β€” Agent Found the Bug From Logs Alone

Race condition diagnosed and fixed in 35 minutes · Engineering & DevOps · 4 min read

Key Takeaway

A 500 error in production affected 23 users. Our agent diagnosed a race condition from error logs and stack traces alone, generated a fix with a reproducing test, and shipped a merged PR β€” all in 35 minutes.

The Problem

Thursday, 3:47 PM. The monitoring agent pings me: "eskimoai-api error rate spiked from 0.2% to 15.3%. 23 unique users affected. 500 errors on POST /api/jobs/complete."

This is the worst kind of bug. It's intermittent. It doesn't happen on every request. The same endpoint works 85% of the time. In the manual world, this means: reproduce it locally (can't β€” it's a concurrency issue), stare at logs, add more logging, deploy the logging, wait for it to happen again, stare at the new logs, form a hypothesis, test it, iterate. Half a day minimum. A full day if you're unlucky.

I gave the agent everything: error logs, the stack trace, the list of recent deploys, and the request pattern data.

The Solution

Our coding agent ingested the production context β€” logs, traces, recent git diffs, and request patterns β€” then diagnosed the root cause, generated a fix PR, and included a test that reproduced the race condition deterministically.

The Process

Step 1: Log Analysis

The agent received the raw Render logs:

2026-03-11T15:47:23Z ERROR django.request Internal Server Error: /api/jobs/complete
Traceback (most recent call last):
  File "/app/jobs/views.py", line 142, in complete_job
    job.status = "completed"
    job.completed_at = timezone.now()
    job.save()
  File "/app/venv/lib/python3.12/django/db/models/base.py", line 822, in save
    self._perform_update(...)
django.db.utils.IntegrityError: duplicate key value violates unique constraint "jobs_job_completion_unique"
DETAIL: Key (job_id, status)=(4521, completed) already exists.

Step 2: Pattern Recognition

The agent correlated timestamps with request logs:

15:47:23.101 POST /api/jobs/complete job_id=4521 user=sarah_k
15:47:23.108 POST /api/jobs/complete job_id=4521 user=sarah_k
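The correlation step itself is simple enough to sketch in a few lines of standalone Python: parse timestamps out of the request log, group by job_id, and flag any pair that lands inside a small concurrency window. This is a simplified stand-in for what the agent did, run against the two log lines above:

```python
# Sketch of the timestamp correlation step (simplified, standalone):
# parse the request log, group by job_id, flag pairs inside a 50 ms window.
from collections import defaultdict
from datetime import datetime, timedelta

LINES = [
    "15:47:23.101 POST /api/jobs/complete job_id=4521 user=sarah_k",
    "15:47:23.108 POST /api/jobs/complete job_id=4521 user=sarah_k",
]

by_job = defaultdict(list)
for line in LINES:
    ts, _method, _path, job, _user = line.split()
    by_job[job].append(datetime.strptime(ts, "%H:%M:%S.%f"))

suspects = []
for job, stamps in by_job.items():
    stamps.sort()
    for a, b in zip(stamps, stamps[1:]):
        if b - a < timedelta(milliseconds=50):
            suspects.append((job, (b - a) // timedelta(milliseconds=1)))

print(suspects)  # [('job_id=4521', 7)]
```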

Two requests for the same job, 7 milliseconds apart. The agent's diagnosis:

"Race condition: two concurrent requests for the same job reach the complete_job handler simultaneously. Both read the job status as 'in_progress', both attempt to set it to 'completed', second write hits the unique constraint. This is a classic TOCTOU (time-of-check-to-time-of-use) bug."

Step 3: Root Cause in Code

The agent identified the vulnerable code path:

# BEFORE: No locking β€” vulnerable to concurrent writes
def complete_job(request, job_id):
    job = Job.objects.get(id=job_id)  # Both requests read same state
    if job.status != "in_progress":
        return Response({"error": "Job not in progress"}, status=400)

    job.status = "completed"
    job.completed_at = timezone.now()
    job.result_data = request.data.get("result")
    job.save()  # Second request explodes here

Step 4: The Fix

# AFTER: select_for_update() acquires a row-level lock
from django.db import transaction

def complete_job(request, job_id):
    with transaction.atomic():
        job = (
            Job.objects
            .select_for_update()  # Row-level lock β€” second request waits
            .get(id=job_id)
        )
        if job.status != "in_progress":
            return Response({"error": "Job not in progress"}, status=409)

        job.status = "completed"
        job.completed_at = timezone.now()
        job.result_data = request.data.get("result")
        job.save()

    return Response({"status": "completed"})

Step 5: Reproducing Test

The agent wrote a test that deterministically reproduced the race. It subclasses TransactionTestCase rather than TestCase: TestCase wraps each test in a transaction on a single connection, so worker threads would never see the row at all, while TransactionTestCase commits for real and lets the two threads actually contend.

from concurrent.futures import ThreadPoolExecutor

from django.test import RequestFactory, TransactionTestCase

from jobs.models import Job
from jobs.views import complete_job

class TestJobCompletionRace(TransactionTestCase):
    def test_concurrent_completion_does_not_duplicate(self):
        job = Job.objects.create(status="in_progress")

        def complete_request():
            request = RequestFactory().post(
                f"/api/jobs/{job.id}/complete",
                data={"result": "done"},
                content_type="application/json",
            )
            return complete_job(request, job.id)

        with ThreadPoolExecutor(max_workers=2) as executor:
            futures = [executor.submit(complete_request) for _ in range(2)]
            results = [f.result() for f in futures]

        status_codes = [r.status_code for r in results]
        self.assertIn(200, status_codes)  # One succeeds
        self.assertIn(409, status_codes)  # One gets conflict
        self.assertEqual(Job.objects.filter(id=job.id, status="completed").count(), 1)
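The test's assertion pattern (one 200, one 409, exactly one completed row) can be mirrored in a framework-free sketch, with a plain threading.Lock standing in for the row lock that select_for_update() takes:

```python
# Lock-based counterpart: check and write happen under one lock (a stand-in
# for select_for_update's row lock), so exactly one thread completes the job
# and the other observes "completed" and returns a conflict.
import threading

job = {"status": "in_progress"}
results = []
row_lock = threading.Lock()

def complete():
    with row_lock:                        # second thread waits here
        if job["status"] != "in_progress":
            results.append(409)           # loser: conflict
            return
        job["status"] = "completed"
        results.append(200)               # winner

threads = [threading.Thread(target=complete) for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(sorted(results))  # [200, 409]
```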

The PR was opened, Nico reviewed it (approved with one minor comment about the 409 vs 400 status code), and it was merged.

The Results

Time to detection: 0 min (auto-alert)

Time to diagnosis: 12 min

Time to fix PR: 18 min

Time to merge: 35 min total

Users affected: 23

Recurrence after fix: 0

Manual debug estimate: 4–8 hours

Try It Yourself

The key is giving the agent enough context: raw logs, stack traces, recent deploy diffs, and request patterns. Don't pre-filter β€” let the agent do the pattern matching. Race conditions are notoriously hard for humans to spot in logs because we read linearly. The agent sees all 23 error instances simultaneously and correlates timestamps in milliseconds.


The best debugger is the one that doesn't get frustrated at 3 AM.

Debugging · Django · Race Conditions · Incident Response · Production

Want results like these?

Start free with your own AI team. No credit card required.