DevOps Engineer

CI/CD Pipeline Monitoring — The Agent That Never Sleeps

Failure detection: <60 sec vs 10-120 min · Engineering & DevOps · 4 min read

Key Takeaway

An AI agent monitors every CI/CD pipeline across all our repos 24/7, auto-retries flaky jobs, and sends failure alerts to Telegram — so I never check a pipeline dashboard again.

The Problem

We run 5+ active repositories at PyratzLabs. learn-my-way (the frontend), eskimoai-api (the backend), three Artificial-Lab blockchain projects, plus internal tools. Each repo has CI pipelines. Each pipeline has 4-6 stages. Each stage can fail.

That's 20-30 pipeline runs per day. Someone needs to watch them.

Before the agent, that someone was me. I'd push code, tab over to GitLab, wait. Get distracted. Come back 20 minutes later. Pipeline failed? Cool — now I need to find which stage, which job, read the logs, figure out if it's a real failure or a flaky Docker pull timeout.

The worst part: flaky failures. Network timeouts on npm install. Docker Hub rate limits. Transient DNS errors. They fail the pipeline, and the fix is literally "run it again." But someone has to click retry. At 3am, nobody does. So the pipeline stays red until morning.

Context switching kills deep work. Every time I check a pipeline, I lose 10-15 minutes of focus. Five pipeline checks a day β€” that's over an hour of fractured attention.

The Solution

Thom's DevOps sub-agent monitors every pipeline across every repo, every minute. Failures get instant Telegram alerts with the exact error. Flaky jobs get auto-retried. Weekly reports show pipeline health across the entire fleet.

I haven't opened the GitLab CI dashboard in three months.

The Process

The monitoring runs on a cron loop. Every 60 seconds, the agent checks all active pipelines:

```yaml
# mrchief heartbeat config for pipeline monitoring
heartbeat:
  interval: 60s
  tasks:
    - name: pipeline-monitor
      repos:
        - pyratzlabs/learn-my-way
        - pyratzlabs/eskimoai-api
        - pyratzlabs/artificial-lab-contracts
        - pyratzlabs/artificial-lab-sdk
        - pyratzlabs/internal-tools
      check: active_pipelines
```
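One heartbeat tick is just "poll each repo, collect anything failed." Here's a minimal sketch of that tick in Python — the `glab ci list` JSON flags and the `poll_once` helper are my assumptions, not Mr.Chief internals, and the fetcher is injectable so the decision logic stays testable without GitLab access:

```python
import json
import subprocess

REPOS = [
    "pyratzlabs/learn-my-way",
    "pyratzlabs/eskimoai-api",
]

def fetch_pipelines(repo: str) -> list[dict]:
    """Shell out to glab for a repo's recent pipelines.

    Assumes `glab ci list` can emit JSON; adjust flags to your glab version.
    """
    out = subprocess.run(
        ["glab", "ci", "list", "-R", repo, "--output", "json"],
        capture_output=True, text=True, check=True,
    ).stdout
    return json.loads(out)

def poll_once(fetch=fetch_pipelines, repos=REPOS) -> list[tuple[str, dict]]:
    """One heartbeat tick: return (repo, pipeline) pairs that failed."""
    failures = []
    for repo in repos:
        for pipeline in fetch(repo):
            if pipeline.get("status") == "failed":
                failures.append((repo, pipeline))
    return failures
```

Running `poll_once` on a 60-second loop (cron, systemd timer, or the heartbeat above) is the entire monitoring core; everything else is reacting to what it returns.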

When a pipeline fails, the agent doesn't just say "it broke." It digs in:

```bash
# Get failed job details
failed_jobs=$(glab ci list-jobs --status failed -p "$project" --pipeline "$pid" -o json)

# Extract the relevant log tail, one file per failed job
# (a single shared file would be overwritten on each loop iteration)
for job_id in $(echo "$failed_jobs" | jq -r '.[].id'); do
  glab ci trace "$job_id" | tail -50 > "/tmp/job_${job_id}.log"
done
```

The Telegram alert is structured:

```
🚨 Pipeline Failed: eskimoai-api #1247
  Branch: feature/payment-webhooks
  Stage: test
  Job: test-integration (failed)

  Error (last 5 lines):
  > ConnectionRefusedError: [Errno 111] Connection refused
  > Failed to connect to test database on port 5433

  🔍 Analysis: Test DB container didn't start.
     Docker Compose health check timed out.
  💡 This is a known flaky pattern. Auto-retrying (1/2)...
```

The auto-retry logic is conservative:

```python
import re

FLAKY_PATTERNS = [
    r"ConnectionRefusedError.*5433",        # Test DB slow start
    r"npm ERR! network timeout",            # NPM registry hiccup
    r"error pulling image",                 # Docker Hub rate limit
    r"ETIMEDOUT",                           # Generic network timeout
    r"503 Service Temporarily Unavailable", # GitLab runner overload
]

MAX_RETRIES = 2

def should_retry(job_log: str, retry_count: int) -> bool:
    if retry_count >= MAX_RETRIES:
        return False
    return any(re.search(p, job_log) for p in FLAKY_PATTERNS)
```
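Acting on that decision needs one more piece: per-job retry bookkeeping, so a job is never retried past its budget. A sketch under my own assumptions — `maybe_retry` is a hypothetical helper, `glab ci retry JOB_ID` is assumed to restart a single job, and the flakiness check and command runner are injectable:

```python
import subprocess

MAX_RETRIES = 2

def maybe_retry(job_id: int, job_log: str, retry_counts: dict[int, int],
                is_flaky, run=subprocess.run) -> bool:
    """Retry a failed job via glab if its log looks flaky and budget remains.

    `is_flaky` stands in for a pattern check like should_retry above;
    `retry_counts` persists across heartbeat ticks.
    Returns True if a retry was issued, False if the failure should escalate.
    """
    count = retry_counts.get(job_id, 0)
    if count >= MAX_RETRIES or not is_flaky(job_log):
        return False
    retry_counts[job_id] = count + 1
    run(["glab", "ci", "retry", str(job_id)], check=True)
    return True
```

Anything `maybe_retry` declines — an unrecognized error, or a job that already burned both retries — falls through to a Telegram alert for a human.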

Every Sunday, the agent compiles a weekly health report:

```
📊 Weekly Pipeline Report (Mar 2-8)

| Repo              | Runs | Pass | Fail | Rate  | Avg Duration |
|-------------------|------|------|------|-------|--------------|
| learn-my-way      | 47   | 44   | 3    | 93.6% | 4m 12s       |
| eskimoai-api      | 31   | 28   | 3    | 90.3% | 6m 45s       |
| al-contracts      | 12   | 12   | 0    | 100%  | 2m 30s       |

Top Flaky Jobs:
  1. eskimoai-api/test-integration — 3 flaky failures (DB startup)
  2. learn-my-way/build — 1 flaky failure (npm timeout)

Recommendation: Pin test DB image version to reduce startup variance.
```
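The aggregation behind that report is plain counting over the week's run records. A minimal sketch — `weekly_report` and the run-record shape are my assumptions, not Mr.Chief's actual schema:

```python
from collections import Counter

def weekly_report(runs: list[dict]) -> dict[str, dict]:
    """Roll raw pipeline runs up into per-repo pass/fail stats.

    Each run record is assumed to look like {"repo": ..., "status": ...}.
    """
    per_repo: dict[str, Counter] = {}
    for run in runs:
        c = per_repo.setdefault(run["repo"], Counter())
        c["runs"] += 1
        c["pass" if run["status"] == "success" else "fail"] += 1
    return {
        repo: {
            "runs": c["runs"],
            "pass": c["pass"],
            "fail": c["fail"],
            "rate": f"{100 * c['pass'] / c['runs']:.1f}%",
        }
        for repo, c in per_repo.items()
    }
```

Rendering the stats into the markdown table and spotting repeat offenders (the "Top Flaky Jobs" list) is the same counting trick keyed on job name instead of repo.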

The Results

| Metric                           | Before (Manual)         | After (Agent)     | Improvement     |
|----------------------------------|-------------------------|-------------------|-----------------|
| Failure detection time           | 10 min - 2 hours        | <60 seconds       | 99% faster      |
| Flaky job resolution             | 5-30 min (manual retry) | Automatic (<2 min)| Zero human time |
| Pipeline dashboard visits/day    | 5-8                     | 0                 | Eliminated      |
| Context switches from CI         | ~1 hour/day             | 0                 | Reclaimed       |
| Weekend pipeline failures caught | ~40% (Monday morning)   | 100% (real-time)  | 24/7 coverage   |

Try It Yourself

You need glab CLI access and a cron or heartbeat loop. The pattern: poll pipelines, match failure logs against known flaky patterns, auto-retry what's safe, alert on everything else. The weekly report is just aggregation. Mr.Chief handles the scheduling and Telegram delivery natively.


The agent that never sleeps doesn't need coffee. Just API access.

CI/CD · GitLab · Monitoring · Automation · Telegram

Want results like these?

Start free with your own AI team. No credit card required.
