
Scraping Protected Websites — When web_fetch Hits a Wall

89% success on protected sites · Productivity & Security · 5 min read

Key Takeaway

When web_fetch returns 403s on Cloudflare-protected and JS-rendered sites, Scrapling's three scraping modes — simple, stealth, and dynamic — bypass bot detection and extract the data agents actually need.

The Problem

Every AI agent framework gives you a basic web fetch tool. Mr.Chief's web_fetch works fine for simple pages β€” documentation, blog posts, public APIs. It's fast and lightweight.

Then you try to scrape a competitor's pricing page. 403 Forbidden.

You try a JS-heavy SaaS dashboard. Empty HTML β€” the content loads client-side.

You try a site behind Cloudflare's bot protection. Captcha wall.

This is the reality of the modern web. Over 20% of all websites use Cloudflare. Most SaaS products render client-side with React or Vue. Anti-bot systems are getting smarter every quarter. A basic HTTP GET request with a User-Agent header isn't enough anymore.

For AI agents that need to gather competitive intelligence, monitor pricing, or research companies, this is a showstopper. The data exists. It's publicly visible in a browser. But your agent can't access it.

The Solution

The Scrapling skill for Mr.Chief — a three-mode web scraping system that ranges from basic extraction to full stealth browser automation. Each mode trades speed for capability:

  • Simple mode: Fast HTML extraction. No browser. Works for static sites.
  • Stealth mode: Real browser fingerprint with anti-detection. Bypasses Cloudflare, DataDome, and similar.
  • Dynamic mode: Full browser automation. JavaScript execution, infinite scroll, login flows, interaction.
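The three modes compose into an escalate-on-failure cascade: try the cheap mode first, and move up only when the result is unusable. A minimal sketch of that pattern, with stub fetch functions standing in for the real mode backends (the stubs and their return values are illustrative, not Scrapling's API):

```python
# Sketch of the escalate-on-failure pattern the three modes enable.
# The fetch functions below are hypothetical stubs, not Scrapling calls.

def escalating_fetch(url, modes):
    """Try each (name, fetcher) pair in order; return the first usable result."""
    for name, fetch in modes:
        status, body = fetch(url)
        if status == 200 and body.strip():  # non-empty 2xx counts as usable
            return name, body
    return None, ""

# Stub backends: the cheap mode gets blocked, the stealth mode gets through.
def simple_fetch(url):
    return 403, ""

def stealth_fetch(url):
    return 200, '<table class="pricing-table">$29/mo</table>'

mode, html = escalating_fetch(
    "https://competitor.com/pricing",
    [("simple", simple_fetch), ("stealth", stealth_fetch)],
)
# mode is "stealth": the cheap mode failed, so the cascade moved up one level
```

Starting cheap matters because the cost gap between modes is large: a failed simple attempt wastes milliseconds, while defaulting everything to a full browser wastes seconds per page.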

The Process

Here's how each mode works in practice.

Simple mode — when speed matters and the site is cooperating:

# Simple mode: basic HTML extraction
# ~200ms per page, no browser overhead
scrapling simple --url "https://docs.example.com/api/reference" \
  --extract "article" \
  --format markdown

Stealth mode — when the site fights back:

# Stealth mode: real browser fingerprint
# Bypasses Cloudflare, DataDome, PerimeterX
scrapling stealth --url "https://competitor.com/pricing" \
  --extract ".pricing-table" \
  --wait-for ".price-amount" \
  --format json

Stealth mode doesn't just set a User-Agent string. It generates a complete browser fingerprint — canvas, WebGL, fonts, plugins, screen resolution, timezone. To the anti-bot system, it looks like a real person on a real browser. Because it is a real browser. Just one that an agent controls.
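To see why a lone User-Agent header fails, here is a toy bot check over a fingerprint dict. The signal names and thresholds are illustrative only; real systems like Cloudflare and DataDome probe far more signals, but the inconsistency idea is the same:

```python
# Illustrative only: a toy anti-bot check. Real detection systems probe
# many more signals and score them statistically.

def looks_like_bot(fp):
    """Flag fingerprints with the inconsistencies naive scrapers exhibit."""
    if "HeadlessChrome" in fp.get("user_agent", ""):
        return True
    if not fp.get("plugins"):  # bare headless browsers expose no plugins
        return True
    if fp.get("webgl_vendor") in (None, "Brian Paul"):  # software GL renderer
        return True
    return False

# A naive scraper sets only the User-Agent and fails every other probe.
naive = {"user_agent": "Mozilla/5.0 (X11; Linux x86_64) Chrome/120.0"}

# A stealth profile keeps every signal mutually consistent.
stealthy = {
    "user_agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) Chrome/120.0",
    "plugins": ["PDF Viewer", "Chrome PDF Viewer"],
    "webgl_vendor": "Google Inc. (NVIDIA)",
    "canvas_hash": "a1b2c3",  # stable per-profile canvas noise
    "timezone": "America/New_York",
}
```

The point: a fingerprint is a bundle of signals that must agree with each other, so spoofing one header while leaving the rest at headless defaults is exactly what gets flagged.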

Dynamic mode — when you need the full browser:

# Dynamic mode: full browser automation
# Handles JS rendering, infinite scroll, interactions
scrapling dynamic --url "https://app.example.com/dashboard" \
  --actions '[
    {"scroll": "bottom", "times": 5},
    {"wait": ".loaded-content"},
    {"click": ".show-more-button"},
    {"wait": 2000}
  ]' \
  --extract ".data-card" \
  --format json
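The --actions argument is just a JSON list of steps the browser replays in order. A hypothetical interpreter for that list, run against a stub browser that records calls instead of driving a real one (Scrapling's real executor will differ in details):

```python
import json

# Hypothetical interpreter for the --actions JSON list above, run against
# a stub browser. Scrapling's real executor will differ in details.

class LogBrowser:
    """Records calls instead of driving a real browser."""
    def __init__(self):
        self.log = []
    def scroll(self, where):
        self.log.append(("scroll", where))
    def wait_for(self, selector):
        self.log.append(("wait_for", selector))
    def wait_ms(self, ms):
        self.log.append(("wait_ms", ms))
    def click(self, selector):
        self.log.append(("click", selector))

def run_actions(browser, actions_json):
    for step in json.loads(actions_json):
        if "scroll" in step:
            for _ in range(step.get("times", 1)):  # "times" repeats a scroll
                browser.scroll(step["scroll"])
        elif "wait" in step:
            arg = step["wait"]  # int means milliseconds, str means selector
            browser.wait_ms(arg) if isinstance(arg, int) else browser.wait_for(arg)
        elif "click" in step:
            browser.click(step["click"])

b = LogBrowser()
run_actions(b, '[{"scroll": "bottom", "times": 2},'
               ' {"wait": ".loaded-content"}, {"click": ".show-more-button"},'
               ' {"wait": 2000}]')
```

Note the overloaded "wait": a number pauses for that many milliseconds, while a string blocks until the selector appears. The string form is almost always the better choice because fixed sleeps either waste time or race the page.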

Real use case: competitor pricing scrape

We needed pricing data from five competitors. All SaaS companies. All behind Cloudflare.

Competitor A (Cloudflare Pro):
  web_fetch → 403 Forbidden ❌
  scrapling simple → 403 Forbidden ❌
  scrapling stealth → 200 OK ✅ (full pricing table extracted)

Competitor B (Cloudflare + JS rendering):
  web_fetch → 200 but empty pricing section ❌
  scrapling simple → 200 but empty pricing section ❌
  scrapling stealth → 200 OK ✅ (JS rendered, prices loaded)

Competitor C (DataDome protection):
  web_fetch → 403 Forbidden ❌
  scrapling stealth → 200 OK ✅ (bypassed DataDome)

Every single one that blocked web_fetch was accessible through stealth mode. The agent extracted pricing tiers, feature lists, and plan names — structured as JSON, ready for analysis.
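The "structured as JSON" step is ordinary post-processing once the HTML is in hand. A toy version using only the standard library, over invented markup that reuses the .price-amount selector from the stealth example above (the class names and output shape are assumptions, not the skill's actual schema):

```python
import re
from html.parser import HTMLParser

# Toy post-processing: turn a scraped pricing fragment into structured
# records. The markup and field names are invented for illustration.

class PriceExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.tiers = []
        self._grab = None  # which field the next text node belongs to

    def handle_starttag(self, tag, attrs):
        cls = dict(attrs).get("class", "")
        if "plan-name" in cls:
            self._grab = "name"
        elif "price-amount" in cls:
            self._grab = "price"

    def handle_data(self, data):
        if self._grab == "name":
            self.tiers.append({"name": data.strip()})
        elif self._grab == "price":
            # strip currency symbols and "/mo" down to the number
            self.tiers[-1]["price_usd"] = float(re.sub(r"[^\d.]", "", data))
        self._grab = None

html = '<div class="plan-name">Pro</div><span class="price-amount">$49/mo</span>'
p = PriceExtractor()
p.feed(html)
# p.tiers is now [{"name": "Pro", "price_usd": 49.0}]
```

In practice you would hand the skill's --extract output to a parser like this (or let the agent do it), ending with records you can diff across competitors week over week.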

The Results

Metric                          web_fetch  Scrapling Simple  Scrapling Stealth  Scrapling Dynamic
Speed (per page)                ~100ms     ~200ms            ~2-4s              ~5-15s
JS rendering                    No         No                Yes                Yes
Cloudflare bypass               No         No                Yes                Yes
Anti-bot bypass                 No         No                Yes                Yes
Infinite scroll                 No         No                No                 Yes
Login flows                     No         No                No                 Yes
Resource usage                  Minimal    Low               Medium             High
Success rate (protected sites)  12%        18%               89%                97%

The success rate tells the story. On protected sites, web_fetch works 12% of the time. Stealth mode works 89%. Dynamic mode — with full browser automation — works 97%.

The remaining 3% of dynamic-mode failures are typically hard captchas (hCaptcha with visual challenges). Everything else falls.

Try It Yourself

# Install the scrapling skill
# Install via Mr.Chief dashboard after signing up at mrchief.ai/setup
# clawhub install scrapling

# Test simple mode on a static page
mrchief run --task "Use scrapling simple mode to extract the main content
from https://example.com/blog/post-1"

# Test stealth mode on a protected page
mrchief run --task "Use scrapling stealth mode to extract pricing data
from https://competitor.com/pricing — the site uses Cloudflare"

# Test dynamic mode for JS-heavy pages
mrchief run --task "Use scrapling dynamic mode to scrape the full product
listing from https://app.example.com — scroll to load all items"

Start with simple. Escalate to stealth. Use dynamic when you need interaction. The agent handles the mode selection automatically when you describe the problem.
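One plausible heuristic for that automatic mode selection, inferred from the failure patterns in the case study above (a blocked status means the site is actively defending; a 200 with a near-empty body means the content renders client-side). This is a sketch of the idea, not the skill's actual logic:

```python
# Illustrative heuristic: map a failed or thin fetch result to the
# scraping mode to try next. Not the skill's actual decision logic.

def classify_failure(status, body):
    """Pick a mode from how the cheap fetch failed."""
    if status in (403, 429, 503) or "captcha" in body.lower():
        return "stealth"  # active bot blocking: need a real fingerprint
    if status == 200 and len(body.strip()) < 500:
        return "dynamic"  # page loaded but content is rendered client-side
    return "simple"       # plain failure or timeout: retry cheaply first
```

This mirrors the results table: Competitor A's 403 routes to stealth, while Competitor B's "200 but empty pricing section" is the client-side-rendering signature that calls for a JS-capable mode.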


The web doesn't want to be scraped. Scrapling disagrees.

Web Scraping · Scrapling · Cloudflare · Bot Detection · Competitive Intelligence

Want results like these?

Start free with your own AI team. No credit card required.
