DevOps Engineer

CI/CD Pipeline Monitoring — The Agent That Never Sleeps

Failure detection: <60 sec vs 10-120 min · Engineering & DevOps · 4 min read

Key Takeaway

An AI agent monitors every CI/CD pipeline across all our repos 24/7, auto-retries flaky jobs, and sends failure alerts to Telegram — so I never check a pipeline dashboard again.

The Problem

We run 5+ active repositories at PyratzLabs. learn-my-way (the frontend), eskimoai-api (the backend), three Artificial-Lab blockchain projects, plus internal tools. Each repo has CI pipelines. Each pipeline has 4-6 stages. Each stage can fail.

That's 20-30 pipeline runs per day. Someone needs to watch them.

Before the agent, that someone was me. I'd push code, tab over to GitLab, wait. Get distracted. Come back 20 minutes later. Pipeline failed? Cool — now I need to find which stage, which job, read the logs, figure out if it's a real failure or a flaky Docker pull timeout.

The worst part: flaky failures. Network timeouts on npm install. Docker Hub rate limits. Transient DNS errors. They fail the pipeline, and the fix is literally "run it again." But someone has to click retry. At 3am, nobody does. So the pipeline stays red until morning.

Context switching kills deep work. Every time I check a pipeline, I lose 10-15 minutes of focus. Five pipeline checks a day β€” that's over an hour of fractured attention.

The Solution

Thom's DevOps sub-agent monitors every pipeline across every repo, every minute. Failures get instant Telegram alerts with the exact error. Flaky jobs get auto-retried. Weekly reports show pipeline health across the entire fleet.

I haven't opened the GitLab CI dashboard in three months.

The Process

The monitoring runs on a cron loop. Every 60 seconds, the agent checks all active pipelines:

```yaml
# mrchief heartbeat config for pipeline monitoring
heartbeat:
  interval: 60s
  tasks:
    - name: pipeline-monitor
      repos:
        - pyratzlabs/learn-my-way
        - pyratzlabs/eskimoai-api
        - pyratzlabs/artificial-lab-contracts
        - pyratzlabs/artificial-lab-sdk
        - pyratzlabs/internal-tools
      check: active_pipelines
```
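One heartbeat tick is just "poll each repo, collect anything failed." Here's a minimal sketch of that tick in Python — the `glab ci list` JSON flags and the `poll_once` helper are my assumptions, not Mr.Chief internals, and the fetcher is injectable so the decision logic stays testable without GitLab access:

```python
import json
import subprocess

REPOS = [
    "pyratzlabs/learn-my-way",
    "pyratzlabs/eskimoai-api",
]

def fetch_pipelines(repo: str) -> list[dict]:
    """Shell out to glab for a repo's recent pipelines.

    Assumes `glab ci list` can emit JSON; adjust flags to your glab version.
    """
    out = subprocess.run(
        ["glab", "ci", "list", "-R", repo, "--output", "json"],
        capture_output=True, text=True, check=True,
    ).stdout
    return json.loads(out)

def poll_once(fetch=fetch_pipelines, repos=REPOS) -> list[tuple[str, dict]]:
    """One heartbeat tick: return (repo, pipeline) pairs that failed."""
    failures = []
    for repo in repos:
        for pipeline in fetch(repo):
            if pipeline.get("status") == "failed":
                failures.append((repo, pipeline))
    return failures
```

Running `poll_once` on a 60-second loop (cron, systemd timer, or the heartbeat above) is the entire monitoring core; everything else is reacting to what it returns.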

When a pipeline fails, the agent doesn't just say "it broke." It digs in:

```bash
# Get failed job details
failed_jobs=$(glab ci list-jobs --status failed -p "$project" --pipeline "$pid" -o json)

# Extract the relevant log tail, one file per failed job
# (a single shared file would be overwritten on each loop iteration)
for job_id in $(echo "$failed_jobs" | jq -r '.[].id'); do
  glab ci trace "$job_id" | tail -50 > "/tmp/job_${job_id}.log"
done
```

The Telegram alert is structured:

```
🚨 Pipeline Failed: eskimoai-api #1247
  Branch: feature/payment-webhooks
  Stage: test
  Job: test-integration (failed)

  Error (last 5 lines):
  > ConnectionRefusedError: [Errno 111] Connection refused
  > Failed to connect to test database on port 5433

  🔍 Analysis: Test DB container didn't start.
     Docker Compose health check timed out.
  💡 This is a known flaky pattern. Auto-retrying (1/2)...
```

The auto-retry logic is conservative:

```python
import re

FLAKY_PATTERNS = [
    r"ConnectionRefusedError.*5433",        # Test DB slow start
    r"npm ERR! network timeout",            # NPM registry hiccup
    r"error pulling image",                 # Docker Hub rate limit
    r"ETIMEDOUT",                           # Generic network timeout
    r"503 Service Temporarily Unavailable", # GitLab runner overload
]

MAX_RETRIES = 2

def should_retry(job_log: str, retry_count: int) -> bool:
    if retry_count >= MAX_RETRIES:
        return False
    return any(re.search(p, job_log) for p in FLAKY_PATTERNS)
```
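Acting on that decision needs one more piece: per-job retry bookkeeping, so a job is never retried past its budget. A sketch under my own assumptions — `maybe_retry` is a hypothetical helper, `glab ci retry JOB_ID` is assumed to restart a single job, and the flakiness check and command runner are injectable:

```python
import subprocess

MAX_RETRIES = 2

def maybe_retry(job_id: int, job_log: str, retry_counts: dict[int, int],
                is_flaky, run=subprocess.run) -> bool:
    """Retry a failed job via glab if its log looks flaky and budget remains.

    `is_flaky` stands in for a pattern check like should_retry above;
    `retry_counts` persists across heartbeat ticks.
    Returns True if a retry was issued, False if the failure should escalate.
    """
    count = retry_counts.get(job_id, 0)
    if count >= MAX_RETRIES or not is_flaky(job_log):
        return False
    retry_counts[job_id] = count + 1
    run(["glab", "ci", "retry", str(job_id)], check=True)
    return True
```

Anything `maybe_retry` declines — an unrecognized error, or a job that already burned both retries — falls through to a Telegram alert for a human.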

Every Sunday, the agent compiles a weekly health report:

```
📊 Weekly Pipeline Report (Mar 2-8)

| Repo              | Runs | Pass | Fail | Rate  | Avg Duration |
|-------------------|------|------|------|-------|--------------|
| learn-my-way      | 47   | 44   | 3    | 93.6% | 4m 12s       |
| eskimoai-api      | 31   | 28   | 3    | 90.3% | 6m 45s       |
| al-contracts      | 12   | 12   | 0    | 100%  | 2m 30s       |

Top Flaky Jobs:
  1. eskimoai-api/test-integration — 3 flaky failures (DB startup)
  2. learn-my-way/build — 1 flaky failure (npm timeout)

Recommendation: Pin test DB image version to reduce startup variance.
```
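The aggregation behind that report is plain counting over the week's run records. A minimal sketch — `weekly_report` and the run-record shape are my assumptions, not Mr.Chief's actual schema:

```python
from collections import Counter

def weekly_report(runs: list[dict]) -> dict[str, dict]:
    """Roll raw pipeline runs up into per-repo pass/fail stats.

    Each run record is assumed to look like {"repo": ..., "status": ...}.
    """
    per_repo: dict[str, Counter] = {}
    for run in runs:
        c = per_repo.setdefault(run["repo"], Counter())
        c["runs"] += 1
        c["pass" if run["status"] == "success" else "fail"] += 1
    return {
        repo: {
            "runs": c["runs"],
            "pass": c["pass"],
            "fail": c["fail"],
            "rate": f"{100 * c['pass'] / c['runs']:.1f}%",
        }
        for repo, c in per_repo.items()
    }
```

Rendering the stats into the markdown table and spotting repeat offenders (the "Top Flaky Jobs" list) is the same counting trick keyed on job name instead of repo.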

The Results

| Metric                           | Before (Manual)         | After (Agent)     | Improvement     |
|----------------------------------|-------------------------|-------------------|-----------------|
| Failure detection time           | 10 min - 2 hours        | <60 seconds       | 99% faster      |
| Flaky job resolution             | 5-30 min (manual retry) | Automatic (<2 min)| Zero human time |
| Pipeline dashboard visits/day    | 5-8                     | 0                 | Eliminated      |
| Context switches from CI         | ~1 hour/day             | 0                 | Reclaimed       |
| Weekend pipeline failures caught | ~40% (Monday morning)   | 100% (real-time)  | 24/7 coverage   |

Try It Yourself

You need glab CLI access and a cron or heartbeat loop. The pattern: poll pipelines, match failure logs against known flaky patterns, auto-retry what's safe, alert on everything else. The weekly report is just aggregation. Mr.Chief handles the scheduling and Telegram delivery natively.


The agent that never sleeps doesn't need coffee. Just API access.

CI/CD · GitLab · Monitoring · Automation · Telegram

Want results like these?

Start free with your own AI team. No credit card required.
