DevOps Engineer
CI/CD Pipeline Monitoring: The Agent That Never Sleeps
Key Takeaway
An AI agent monitors every CI/CD pipeline across all our repos 24/7, auto-retries flaky jobs, and sends failure alerts to Telegram, so I never check a pipeline dashboard again.
The Problem
We run 5+ active repositories at PyratzLabs. learn-my-way (the frontend), eskimoai-api (the backend), three Artificial-Lab blockchain projects, plus internal tools. Each repo has CI pipelines. Each pipeline has 4-6 stages. Each stage can fail.
That's 20-30 pipeline runs per day. Someone needs to watch them.
Before the agent, that someone was me. I'd push code, tab over to GitLab, wait. Get distracted. Come back 20 minutes later. Pipeline failed? Cool: now I need to find which stage, which job, read the logs, and figure out whether it's a real failure or a flaky Docker pull timeout.
The worst part: flaky failures. Network timeouts on npm install. Docker Hub rate limits. Transient DNS errors. They fail the pipeline, and the fix is literally "run it again." But someone has to click retry. At 3am, nobody does. So the pipeline stays red until morning.
Context switching kills deep work. Every time I check a pipeline, I lose 10-15 minutes of focus. Five pipeline checks a day adds up to over an hour of fractured attention.
The Solution
Thom's DevOps sub-agent monitors every pipeline across every repo, once a minute. Failures get instant Telegram alerts with the exact error. Flaky jobs get auto-retried. Weekly reports show pipeline health across the entire fleet.
I haven't opened the GitLab CI dashboard in three months.
The Process
The monitoring runs on a cron loop. Every 60 seconds, the agent checks all active pipelines:
```yaml
# mrchief heartbeat config for pipeline monitoring
heartbeat:
  interval: 60s
  tasks:
    - name: pipeline-monitor
      repos:
        - pyratzlabs/learn-my-way
        - pyratzlabs/eskimoai-api
        - pyratzlabs/artificial-lab-contracts
        - pyratzlabs/artificial-lab-sdk
        - pyratzlabs/internal-tools
      check: active_pipelines
```
When a pipeline fails, the agent doesn't just say "it broke." It digs in:
```bash
# Get failed job details for the pipeline
failed_jobs=$(glab ci list-jobs --status failed -p "$project" --pipeline "$pid" -o json)

# Extract the relevant log tail from each failed job
# (one file per job, so parallel failures don't overwrite each other)
for job_id in $(echo "$failed_jobs" | jq -r '.[].id'); do
  glab ci trace "$job_id" | tail -50 > "/tmp/job_log_${job_id}.txt"
done
```
The Telegram alert is structured:
```
🚨 Pipeline Failed: eskimoai-api #1247
Branch: feature/payment-webhooks
Stage: test
Job: test-integration (failed)

Error (last 5 lines):
> ConnectionRefusedError: [Errno 111] Connection refused
> Failed to connect to test database on port 5433

🔍 Analysis: Test DB container didn't start.
Docker Compose health check timed out.

💡 This is a known flaky pattern. Auto-retrying (1/2)...
```
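The alert text itself is plain string assembly. A sketch of a formatter producing that shape (the field names here are illustrative, not Mr.Chief's actual schema):

```python
def format_alert(repo: str, pipeline_id: int, branch: str, stage: str,
                 job: str, log_tail: list, analysis: str) -> str:
    """Assemble the structured Telegram alert from failure details."""
    lines = [
        f"🚨 Pipeline Failed: {repo} #{pipeline_id}",
        f"Branch: {branch}",
        f"Stage: {stage}",
        f"Job: {job} (failed)",
        f"Error (last {len(log_tail)} lines):",
    ]
    # Quote each captured log line so it stands out in Telegram
    lines += [f"> {entry}" for entry in log_tail]
    lines.append(f"🔍 Analysis: {analysis}")
    return "\n".join(lines)
```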
The auto-retry logic is conservative:
```python
import re

FLAKY_PATTERNS = [
    r"ConnectionRefusedError.*5433",         # Test DB slow start
    r"npm ERR! network timeout",             # npm registry hiccup
    r"error pulling image",                  # Docker Hub rate limit
    r"ETIMEDOUT",                            # Generic network timeout
    r"503 Service Temporarily Unavailable",  # GitLab runner overload
]

MAX_RETRIES = 2

def should_retry(job_log: str, retry_count: int) -> bool:
    if retry_count >= MAX_RETRIES:
        return False
    return any(re.search(p, job_log) for p in FLAKY_PATTERNS)
```
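Wiring that decision to an actual retry is one subprocess call. A self-contained sketch, assuming glab's `ci retry` subcommand, with the pattern list abbreviated and a simple in-memory dict tracking retries per job:

```python
import re
import subprocess

FLAKY_PATTERNS = [
    r"ConnectionRefusedError.*5433",  # Test DB slow start
    r"npm ERR! network timeout",      # npm registry hiccup
]
MAX_RETRIES = 2
retry_counts = {}  # job_id -> retries attempted so far

def should_retry(job_log: str, retry_count: int) -> bool:
    if retry_count >= MAX_RETRIES:
        return False
    return any(re.search(p, job_log) for p in FLAKY_PATTERNS)

def handle_failed_job(job_id: int, job_log: str, run=subprocess.run) -> bool:
    """Retry a job via glab when its log matches a flaky pattern.
    Returns True if a retry was issued, False if the failure looks
    real or the retry budget is spent."""
    attempted = retry_counts.get(job_id, 0)
    if not should_retry(job_log, attempted):
        return False
    retry_counts[job_id] = attempted + 1
    run(["glab", "ci", "retry", str(job_id)], check=True)
    return True
```

The `run` parameter is injectable only so the sketch can be tested without a GitLab instance.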
Every Sunday, the agent compiles a weekly health report:
📊 Weekly Pipeline Report (Mar 2-8)
| Repo | Runs | Pass | Fail | Rate | Avg Duration |
|-------------------|------|------|------|-------|-------------|
| learn-my-way | 47 | 44 | 3 | 93.6% | 4m 12s |
| eskimoai-api | 31 | 28 | 3 | 90.3% | 6m 45s |
| al-contracts | 12 | 12 | 0 | 100% | 2m 30s |
Top Flaky Jobs:
1. eskimoai-api/test-integration β 3 flaky failures (DB startup)
2. learn-my-way/build β 1 flaky failure (npm timeout)
Recommendation: Pin test DB image version to reduce startup variance.
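The numbers in that table are straightforward aggregation. A sketch over a hypothetical list of run records (each with `repo`, `status`, and `duration_s` fields; the shape is illustrative, not the agent's actual storage format):

```python
def summarize(runs: list) -> dict:
    """Roll pipeline run records up into per-repo totals,
    pass rate (%), and average duration (seconds)."""
    summary = {}
    for r in runs:
        s = summary.setdefault(r["repo"],
                               {"runs": 0, "pass": 0, "fail": 0, "total_s": 0})
        s["runs"] += 1
        s["pass" if r["status"] == "success" else "fail"] += 1
        s["total_s"] += r["duration_s"]
    for s in summary.values():
        s["rate"] = round(100 * s["pass"] / s["runs"], 1)
        s["avg_duration_s"] = s["total_s"] // s["runs"]
    return summary
```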
The Results
| Metric | Before (Manual) | After (Agent) | Improvement |
|---|---|---|---|
| Failure detection time | 10 min - 2 hours | <60 seconds | 99% faster |
| Flaky job resolution | 5-30 min (manual retry) | Automatic (<2 min) | Zero human time |
| Pipeline dashboard visits/day | 5-8 | 0 | Eliminated |
| Context switches from CI | ~1 hour/day | 0 | Reclaimed |
| Weekend pipeline failures caught | ~40% (Monday morning) | 100% (real-time) | 24/7 coverage |
Try It Yourself
You need glab CLI access and a cron or heartbeat loop. The pattern: poll pipelines, match failure logs against known flaky patterns, auto-retry what's safe, alert on everything else. The weekly report is just aggregation. Mr.Chief handles the scheduling and Telegram delivery natively.
The agent that never sleeps doesn't need coffee. Just API access.
Related case studies
Software Engineer
Creating Merge Requests From Telegram: One Message, Full CI Pipeline
Learn how PyratzLabs creates GitLab merge requests from Telegram in 15 seconds using AI agents. One message triggers MR creation, CI pipeline, and code review, replacing 10 minutes of manual work.
Software Engineer
Monitoring the Mr.Chief Repo: Every Issue and PR in My Telegram
How PyratzLabs monitors the Mr.Chief GitHub repo with an AI agent that summarizes issues, PRs, and releases, filtering noise and delivering only signal straight to Telegram.
DevOps Engineer
Release Automation: From Tag to Changelog to GitLab Release in 60 Seconds
PyratzLabs automated the entire release process with an AI agent. One Telegram command creates a tag, generates a categorized changelog, and publishes a GitLab release in 60 seconds flat.
Want results like these?
Start free with your own AI team. No credit card required.