Web Scraping Fallbacks for MCP Servers: A Practical Design Guide

MCP web scraping looks simple until the first real user hands your tool a URL that returns 403, renders nothing without JavaScript, rate-limits your honest crawler user-agent, or produces 180 characters of useless navigation text and calls it content.

Then the clean diagram turns into a small haunted forest.

If you are building an MCP server, agent tool, or web-data integration, the answer is not “just add Firecrawl” or “just rotate user agents” or “just use a browser”. Those can all be useful. They can also create privacy surprises, nondeterministic tests, credit leaks, and fake-success responses that make the agent trust garbage.

The better move is to design the fallback chain explicitly.

This guide gives you a practical model for handling web scraping fallbacks in MCP tools: what to retry, what to hand to a provider, what to expose in configuration, and when to fail honestly.

The real problem: failures are not all the same

A web scraping failure can mean at least seven different things:

- the server blocked the request outright (403)
- the server rate-limited it (429)
- the server was temporarily unavailable (503)
- the page does not exist (404 or 410)
- the page renders nothing useful without JavaScript
- the fetch succeeded but returned thin navigation text instead of content
- the content was fine, but the requested fields were not on the page

Those are different branches. Treating them all as “scrape failed, try provider B” is how you get expensive nonsense with nice logs.

For MCP servers, this matters because the agent will often treat tool output as evidence. A bad fallback does not just annoy the user. It can poison the next reasoning step.

Split transport failure from extraction-quality failure

The first design rule is simple:

Transport failure is not extraction-quality failure.

Transport failure means the tool could not fetch an acceptable source page. Examples:

- HTTP 403 blocks
- HTTP 429 rate limits
- HTTP 503 outages
- request timeouts and connection failures

Extraction-quality failure means the tool fetched something, but the content was not good enough for the requested task. Examples:

- extracted text below a sane minimum length
- navigation text instead of the main content
- a page that rendered empty without JavaScript
- requested fields missing from otherwise readable content

These branches need different fallback policies.

A clean shape looks like this:

# Transport-level branches: the fetch itself was blocked or throttled.
if response.status_code in {403, 429, 503}:
    return handle_transport_block(url, status=response.status_code)

# Hard failures: the page does not exist, so no fallback should fire.
if response.status_code in {404, 410}:
    return hard_failure("not_found", status=response.status_code)

# From here on, failures are about extraction quality, not transport.
content = extract_readable_content(response.html)

if len(content.text) < MIN_CONTENT_LENGTH:
    return handle_low_content_quality(url, html=response.html)

result = extract_requested_fields(content, prompt)

if not result.found:
    return handle_extraction_miss(url, content, prompt)

return success(result, provenance="static_http")

That split sounds basic. It is also where many scraping tools get messy, because fallbacks are added after the fact instead of designed as a policy.

A sane fallback chain

A useful fallback chain is capability-based, not brand-based.

Bad shape:

{
  "fallbacks": ["firecrawl", "browser", "other_api"]
}

Better shape:

{
  "fallbacks": [
    { "type": "static_http", "purpose": "fast_fetch" },
    { "type": "readability", "purpose": "clean_main_content" },
    { "type": "browser_render", "purpose": "javascript_content" },
    { "type": "provider_markdown", "purpose": "blocked_or_complex_page" },
    { "type": "structured_extraction", "purpose": "typed_fields_from_known_url" }
  ]
}

The second version lets maintainers ask the real question:

What capability do we need next?

Not “which vendor do we throw at the corpse?”
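As a sketch of that framing, a capability-based chain can be walked generically. The registry below mirrors the "better shape" config above; the helper name and shape are illustrative assumptions, not a real MCP server API:

```python
# Hypothetical sketch of a capability-based fallback chain.
# The registry mirrors the capability-based config shown above.
FALLBACKS = [
    {"type": "static_http", "purpose": "fast_fetch"},
    {"type": "readability", "purpose": "clean_main_content"},
    {"type": "browser_render", "purpose": "javascript_content"},
    {"type": "provider_markdown", "purpose": "blocked_or_complex_page"},
    {"type": "structured_extraction", "purpose": "typed_fields_from_known_url"},
]

def next_capability(attempted):
    """Return the next capability type to try, or None when the chain is spent."""
    for step in FALLBACKS:
        if step["type"] not in attempted:
            return step["type"]
    return None
```

A dispatcher then maps each capability type to whichever vendor currently provides it, so swapping one provider for another changes one mapping, not the chain.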

Capability table

| Capability | Best for | Weak spot | Typical provenance |
| --- | --- | --- | --- |
| Static HTTP fetch | Fast public pages, docs, simple blogs | Blocks, JS-heavy pages, thin content | static_http |
| Readability/trafilatura-style parsing | Main article/content extraction | Product pages, dashboards, weird layouts | readability_extract |
| Browser rendering | JS-rendered pages and dynamic content | Slower, heavier, harder to host | browser_render |
| Firecrawl or similar web-data provider | Markdown, crawling, search, broad scrape workflows | External dependency and provider-specific output shape | provider_markdown / provider_crawl |
| Structured extraction API | Known URL + specific fields needed as JSON | Not a crawler; needs a clear prompt/schema | structured_extraction |

Firecrawl is strong when you need a broader web-data platform: scrape formats, Markdown, HTML, screenshots, links, JSON extraction, crawling/search workflows, and MCP integration. The official docs and MCP server show that surface area clearly.

A structured extractor like Haunt fits a narrower slot: the agent already has the URL and needs specific typed fields. That is a different job from crawling a site or building a Markdown corpus.

The fallback policy maintainers should expose

Do not hide all fallback behaviour behind magic. Give users a few explicit knobs.

Useful options:

{
  "politeMode": true,
  "fallbackOnTransportError": false,
  "fallbackOnLowContentQuality": true,
  "maxStaticRetries": 1,
  "respectRetryAfter": true,
  "allowExternalProviders": false,
  "structuredExtractionProvider": null
}

What those mean:

- politeMode: back off on blocks instead of retrying aggressively
- fallbackOnTransportError: whether a blocked or failed fetch may escalate to the next capability
- fallbackOnLowContentQuality: whether thin or navigation-only content may escalate
- maxStaticRetries: how many times the static fetch is retried before escalating
- respectRetryAfter: honour Retry-After headers on 429 and 503 responses
- allowExternalProviders: whether the URL may leave your infrastructure at all
- structuredExtractionProvider: which provider, if any, handles typed-field extraction

Default conservative policy:

{
  "politeMode": true,
  "fallbackOnTransportError": false,
  "fallbackOnLowContentQuality": true,
  "maxStaticRetries": 0,
  "respectRetryAfter": true,
  "allowExternalProviders": false
}

Best-effort policy:

{
  "politeMode": false,
  "fallbackOnTransportError": true,
  "fallbackOnLowContentQuality": true,
  "maxStaticRetries": 1,
  "respectRetryAfter": true,
  "allowExternalProviders": true
}

The point is not that these exact names are sacred. The point is that users should know whether a URL stayed local, went through a browser, hit Firecrawl, hit another provider, or failed.
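Enforcing the policy can be a single guard consulted before each escalation. This is a sketch that assumes the knob names above and a simple transport-versus-quality failure classification:

```python
# Illustrative policy guard; knob names match the config sketch above.
def may_escalate(policy, failure_kind, next_step_is_external):
    """Decide whether a failed attempt may escalate to the next capability."""
    # External providers are an explicit opt-in, regardless of failure kind.
    if next_step_is_external and not policy.get("allowExternalProviders", False):
        return False
    if failure_kind == "transport":
        return policy.get("fallbackOnTransportError", False)
    if failure_kind == "low_content_quality":
        return policy.get("fallbackOnLowContentQuality", False)
    # Unknown failure kinds do not escalate silently.
    return False

conservative = {
    "fallbackOnTransportError": False,
    "fallbackOnLowContentQuality": True,
    "allowExternalProviders": False,
}
```

Under the conservative policy, thin content may escalate to a local capability, but a 403 stops the chain and nothing ever leaves your infrastructure.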

Response provenance is not optional

Every fallback result should carry provenance.

Example response shape:

{
  "success": true,
  "mode": "structured_extraction",
  "provenance": {
    "source_url": "https://example.com/product",
    "final_url": "https://example.com/product",
    "attempts": [
      { "mode": "static_http", "status": 403, "outcome": "transport_blocked" },
      { "mode": "provider_markdown", "status": 200, "outcome": "low_field_coverage" },
      { "mode": "structured_extraction", "status": 200, "outcome": "fields_found" }
    ]
  },
  "data": {
    "product_name": "Example Widget",
    "price": "unknown",
    "availability": "in stock"
  },
  "warnings": [
    "price was requested but not found on the page"
  ]
}

Agents need this because downstream reasoning depends on confidence. A result from static_http with clean content is not the same as a result after three fallbacks and a missing field.

If the field is not found, say so. Do not hallucinate a price because the JSON schema wanted one. The model is already enough of a menace without your scraper handing it a fake receipt.
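One way to keep provenance honest is to record every attempt as it happens rather than reconstructing the story afterwards. The helper below is a sketch of that pattern, not a fixed schema:

```python
# Sketch: accumulate attempts so the final response carries full provenance.
def record_attempt(attempts, mode, status, outcome):
    attempts.append({"mode": mode, "status": status, "outcome": outcome})
    return attempts

attempts = []
record_attempt(attempts, "static_http", 403, "transport_blocked")
record_attempt(attempts, "structured_extraction", 200, "fields_found")
```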

When to fail honestly

Some cases should not fall back forever.

Fail honestly when:

- the URL returns a hard 404 or 410
- every capability in the chain has been attempted
- the only remaining fallback is an external provider the user has disabled
- the requested fields are genuinely absent from the page

A good failure response is still useful:

{
  "success": false,
  "mode": "not_found",
  "error": "The URL returned HTTP 404. No extraction was attempted from the error page.",
  "provenance": {
    "attempts": [
      { "mode": "static_http", "status": 404, "outcome": "not_found" }
    ]
  }
}

That is better than a fake success with the page title "404 Not Found" extracted as if it were the answer.
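A cheap guard catches most of these fake successes before extraction even starts. The title patterns below are a heuristic assumption, not an exhaustive list:

```python
import re

# Heuristic sketch: refuse to extract from pages that are really error pages.
ERROR_TITLE = re.compile(r"\b(404|403|not found|access denied)\b", re.IGNORECASE)

def looks_like_error_page(status, title):
    """True when the response is an error, or the title smells like one."""
    if status >= 400:
        return True
    return bool(ERROR_TITLE.search(title or ""))
```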

Where Haunt fits

Haunt API is useful when the agent already knows the URL and the next step needs structured JSON.

Example MCP provider boundary:

type StructuredExtractionInput = {
  url: string;
  prompt: string;
};

type StructuredExtractionResult = {
  success: boolean;
  data: Record<string, unknown>;
  confidence?: number;
  provenance?: Record<string, unknown>;
};

Example request:

curl -X POST https://hauntapi.com/v1/extract \
  -H "X-API-Key: YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example.com/product",
    "prompt": "Extract the product name, current price, availability, and main image URL as JSON"
  }'

This should be a leaf extractor, not the entire crawler brain.

Good Haunt fit:

- the agent already has the exact URL
- the next step needs specific typed fields as JSON
- a clear prompt can describe those fields

Bad Haunt fit:

- crawling an entire site
- search and URL discovery
- building a broad Markdown corpus for context

If your MCP server needs search, crawling, mapping, and broad Markdown generation, start with a crawler/web-data provider. If it needs one page turned into typed fields, a structured extractor is the cleaner boundary.

Example MCP tool design

Here is a simplified tool contract:

const extractWebDataTool = {
  name: "extract_web_data",
  description: "Fetch a URL and return either readable content or structured fields, with fallback provenance.",
  inputSchema: {
    type: "object",
    properties: {
      url: { type: "string" },
      prompt: { type: "string" },
      outputMode: {
        type: "string",
        enum: ["markdown", "structured_json"]
      },
      fallbackPolicy: {
        type: "string",
        enum: ["conservative", "best_effort"]
      }
    },
    required: ["url", "prompt"]
  }
};

Then route by intent:

if (outputMode === "markdown") {
  return runMarkdownFallbackChain(url, fallbackPolicy);
}

if (outputMode === "structured_json") {
  return runStructuredExtractionChain(url, prompt, fallbackPolicy);
}

Keep these separate. Markdown context and structured JSON extraction are cousins, not twins.
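The difference shows up in the chains' tails. A sketch, assuming the capability names from earlier; the exact ordering is a design choice, not a fixed rule:

```python
# Sketch: the two intents share early capabilities but end differently.
MARKDOWN_CHAIN = ["static_http", "readability", "browser_render", "provider_markdown"]
STRUCTURED_CHAIN = ["static_http", "browser_render", "structured_extraction"]

def chain_for(output_mode):
    """Route each output mode to its own fallback chain."""
    return MARKDOWN_CHAIN if output_mode == "markdown" else STRUCTURED_CHAIN
```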

Which design should you choose?

Choose a conservative default if your users are self-hosting, privacy-sensitive, or likely to be surprised by external providers.

Choose a best-effort fallback if your users explicitly want reliability over locality and have configured provider keys.

Choose a structured extraction fallback when:

- the agent already has the exact URL
- the output must be specific typed fields as JSON, not a Markdown corpus
- a clear prompt or schema can describe the fields

Do not choose silent magic. Silent magic is how support tickets reproduce.

Final checklist

Before shipping MCP web scraping fallbacks, check:

- Do transport failures and extraction-quality failures take different branches?
- Does every result, including failures, carry provenance for each attempt?
- Can users see and control whether URLs leave local infrastructure?
- Do hard failures like 404 fail honestly instead of extracting from the error page?
- Are missing fields reported as missing rather than hallucinated?

If the answer is yes, your MCP server will be much easier to trust.

If the answer is no, your agent may still work in the demo. It will just wait until production to become performance art.

If your MCP tool already knows the URL and needs structured JSON from the page, try Haunt API as a scoped extraction provider.

Read the Haunt API docs