Web Scraping Fallbacks for MCP Servers: A Practical Design Guide
MCP web scraping looks simple until the first real user hands your tool a URL that returns 403, renders nothing without JavaScript, rate-limits your honest crawler user-agent, or produces 180 characters of useless navigation text and calls it content.
Then the clean diagram turns into a small haunted forest.
If you are building an MCP server, agent tool, or web-data integration, the answer is not “just add Firecrawl” or “just rotate user agents” or “just use a browser”. Those can all be useful. They can also create privacy surprises, nondeterministic tests, credit leaks, and fake-success responses that make the agent trust garbage.
The better move is to design the fallback chain explicitly.
This guide gives you a practical model for handling web scraping fallbacks in MCP tools: what to retry, what to hand to a provider, what to expose in configuration, and when to fail honestly.
The real problem: failures are not all the same
A web scraping failure can mean at least seven different things:
- The server blocked the request: 403, 429, 503, a bot wall, a challenge page.
- The URL does not exist or should not be scraped: 404, 410, a hard failure.
- The page needs JavaScript rendering before useful content appears.
- The page returns HTML, but the readable content is too thin.
- The parser extracted content, but it extracted the wrong thing.
- The user asked for structured fields, but the page only produced generic text.
- The tool “succeeded” technically, but the result is useless to the agent.
Those are different branches. Treating them all as “scrape failed, try provider B” is how you get expensive nonsense with nice logs.
For MCP servers, this matters because the agent will often treat tool output as evidence. A bad fallback does not just annoy the user. It can poison the next reasoning step.
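One way to keep those branches from collapsing into a single generic error is to model them explicitly in the tool's result type. A minimal TypeScript sketch; the names are illustrative, not from any MCP SDK:
// Each failure mode gets its own branch instead of one generic "scrape failed".
type ScrapeOutcome =
  | { kind: "transport_blocked"; status: 403 | 429 | 503; retryAfterSeconds?: number }
  | { kind: "not_found"; status: 404 | 410 }
  | { kind: "needs_javascript"; url: string }
  | { kind: "low_content_quality"; textLength: number }
  | { kind: "wrong_content_extracted"; note?: string }
  | { kind: "fields_missing"; missingFields: string[] }
  | { kind: "success"; data: Record<string, unknown>; provenance: string };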
Split transport failure from extraction-quality failure
The first design rule is simple:
Transport failure is not extraction-quality failure.
Transport failure means the tool could not fetch an acceptable source page. Examples:
- 403 Forbidden
- 429 Too Many Requests
- 503 Service Unavailable
- DNS failure
- TLS failure
- timeout
- redirect to a blocked/private/internal target
Extraction-quality failure means the tool fetched something, but the content was not good enough for the requested task. Examples:
- readable text under your minimum threshold
- boilerplate-only content
- cookie banner extracted as the main article
- product page returned but price field missing
- LLM or parser could not find requested fields
These branches need different fallback policies.
A clean shape looks like this:
# Transport failure: the fetch itself was blocked or rate-limited.
if response.status_code in {403, 429, 503}:
    return handle_transport_block(url, status=response.status_code)
# Hard failure: the resource does not exist, so no fallback is attempted.
if response.status_code in {404, 410}:
    return hard_failure("not_found", status=response.status_code)
# Extraction-quality checks: the fetch worked, now judge the content.
content = extract_readable_content(response.html)
if len(content.text) < MIN_CONTENT_LENGTH:
    return handle_low_content_quality(url, html=response.html)
result = extract_requested_fields(content, prompt)
if not result.found:
    return handle_extraction_miss(url, content, prompt)
return success(result, provenance="static_http")
That split sounds basic. It is also where many scraping tools get messy, because fallbacks are added after the fact instead of designed as a policy.
A sane fallback chain
A useful fallback chain is capability-based, not brand-based.
Bad shape:
{
"fallbacks": ["firecrawl", "browser", "other_api"]
}
Better shape:
{
"fallbacks": [
{ "type": "static_http", "purpose": "fast_fetch" },
{ "type": "readability", "purpose": "clean_main_content" },
{ "type": "browser_render", "purpose": "javascript_content" },
{ "type": "provider_markdown", "purpose": "blocked_or_complex_page" },
{ "type": "structured_extraction", "purpose": "typed_fields_from_known_url" }
]
}
The second version lets maintainers ask the real question:
What capability do we need next?
Not “which vendor do we throw at the corpse?”
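A minimal sketch of that capability-based dispatch, assuming each capability type is registered as a handler; the registry shape and handler signature are illustrative, not a real SDK:
// Hypothetical capability registry: try each capability in order, record every attempt.
type Capability = "static_http" | "readability" | "browser_render" | "provider_markdown" | "structured_extraction";
type Attempt = { mode: Capability; status?: number; outcome: string };
type Handler = (url: string) => Promise<Attempt & { data?: unknown }>;

async function runFallbackChain(url: string, chain: Capability[], handlers: Record<Capability, Handler>) {
  const attempts: Attempt[] = [];
  for (const mode of chain) {
    const result = await handlers[mode](url);
    attempts.push({ mode, status: result.status, outcome: result.outcome });
    if (result.outcome === "success") {
      return { success: true, mode, data: result.data, provenance: { attempts } };
    }
  }
  return { success: false, provenance: { attempts } };
}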
Capability table
| Capability | Best for | Weak spot | Typical provenance |
|---|---|---|---|
| Static HTTP fetch | Fast public pages, docs, simple blogs | Blocks, JS-heavy pages, thin content | static_http |
| Readability/trafilatura-style parsing | Main article/content extraction | Product pages, dashboards, weird layouts | readability_extract |
| Browser rendering | JS-rendered pages and dynamic content | Slower, heavier, harder to host | browser_render |
| Firecrawl or similar web-data provider | Markdown, crawling, search, broad scrape workflows | External dependency and provider-specific output shape | provider_markdown / provider_crawl |
| Structured extraction API | Known URL + specific fields needed as JSON | Not a crawler; needs a clear prompt/schema | structured_extraction |
Firecrawl is strong when you need a broader web-data platform: scrape formats, Markdown, HTML, screenshots, links, JSON extraction, crawling/search workflows, and MCP integration. The official docs and MCP server show that surface area clearly.
A structured extractor like Haunt fits a narrower slot: the agent already has the URL and needs specific typed fields. That is a different job from crawling a site or building a Markdown corpus.
The fallback policy maintainers should expose
Do not hide all fallback behaviour behind magic. Give users a few explicit knobs.
Useful options:
{
"politeMode": true,
"fallbackOnTransportError": false,
"fallbackOnLowContentQuality": true,
"maxStaticRetries": 1,
"respectRetryAfter": true,
"allowExternalProviders": false,
"structuredExtractionProvider": null
}
What those mean:
- politeMode: respect Retry-After, avoid aggressive retries, avoid user-agent games by default.
- fallbackOnTransportError: allow provider fallback when status codes like 403, 429, or 503 block the static path.
- fallbackOnLowContentQuality: use a fallback when the page fetched but the extracted content is too thin.
- maxStaticRetries: keep retries bounded. Infinite retry loops are just denial-of-service with optimism.
- respectRetryAfter: if the origin says wait, wait or fail with the retry metadata.
- allowExternalProviders: make privacy/cost boundaries explicit.
- structuredExtractionProvider: optional provider for url + prompt -> JSON jobs.
Default conservative policy:
{
"politeMode": true,
"fallbackOnTransportError": false,
"fallbackOnLowContentQuality": true,
"maxStaticRetries": 0,
"respectRetryAfter": true,
"allowExternalProviders": false
}
Best-effort policy:
{
"politeMode": false,
"fallbackOnTransportError": true,
"fallbackOnLowContentQuality": true,
"maxStaticRetries": 1,
"respectRetryAfter": true,
"allowExternalProviders": true
}
The point is not that these exact names are sacred. The point is that users should know whether a URL stayed local, went through a browser, hit Firecrawl, hit another provider, or failed.
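As a sketch of what respectRetryAfter and fallbackOnTransportError could mean in practice when a 429 or 503 arrives; the policy shape mirrors the JSON above, and the function name is illustrative:
// Decide what a transport block means under the configured policy.
type FallbackPolicy = { respectRetryAfter: boolean; fallbackOnTransportError: boolean };

function handleTransportBlock(status: number, retryAfterHeader: string | null, policy: FallbackPolicy) {
  // Retry-After may also be an HTTP date; only the seconds form is handled here for brevity.
  const retryAfterSeconds = retryAfterHeader ? Number(retryAfterHeader) : undefined;
  if (policy.respectRetryAfter && retryAfterSeconds) {
    // Polite path: surface the wait to the caller instead of hammering the origin.
    return { success: false, mode: "transport_blocked", status, retryAfterSeconds };
  }
  if (policy.fallbackOnTransportError) {
    // Hand the URL to the next capability in the chain.
    return { success: false, mode: "transport_blocked", status, next: "provider_fallback" };
  }
  return { success: false, mode: "transport_blocked", status };
}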
Response provenance is not optional
Every fallback result should carry provenance.
Example response shape:
{
"success": true,
"mode": "structured_extraction",
"provenance": {
"source_url": "https://example.com/product",
"final_url": "https://example.com/product",
"attempts": [
{ "mode": "static_http", "status": 403, "outcome": "transport_blocked" },
{ "mode": "provider_markdown", "status": 200, "outcome": "low_field_coverage" },
{ "mode": "structured_extraction", "status": 200, "outcome": "fields_found" }
]
},
"data": {
"product_name": "Example Widget",
"price": "unknown",
"availability": "in stock"
},
"warnings": [
"price was requested but not found on the page"
]
}
Agents need this because downstream reasoning depends on confidence. A result from static_http with clean content is not the same as a result after three fallbacks and a missing field.
If the field is not found, say so. Do not hallucinate a price because the JSON schema wanted one. The model is already enough of a menace without your scraper handing it a fake receipt.
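One way to enforce that at the response boundary is to fill requested-but-missing fields with an explicit placeholder and a warning, rather than letting the output schema invent values. A sketch, with illustrative names:
// Requested fields that were not found become "unknown" plus a warning, never a guess.
function buildStructuredData(requestedFields: string[], extracted: Record<string, unknown>) {
  const data: Record<string, unknown> = {};
  const warnings: string[] = [];
  for (const field of requestedFields) {
    const value = extracted[field];
    if (value === undefined || value === null || value === "") {
      data[field] = "unknown";
      warnings.push(`${field} was requested but not found on the page`);
    } else {
      data[field] = value;
    }
  }
  return { data, warnings };
}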
When to fail honestly
Some cases should not fall back forever.
Fail honestly when:
- the origin returns 404 or 410;
- the URL resolves to a private/internal address or unsafe network target;
- the user asks for data that is not present in fetched content;
- the only available content is an error page;
- the requested page requires authentication and no authorised auth path was configured;
- the fallback provider also returns thin or irrelevant content;
- the prompt asks for specific fields and the fallback only has generic title/metadata.
A good failure response is still useful:
{
"success": false,
"mode": "not_found",
"error": "The URL returned HTTP 404. No extraction was attempted from the error page.",
"provenance": {
"attempts": [
{ "mode": "static_http", "status": 404, "outcome": "not_found" }
]
}
}
That is better than a fake success with the page title "404 Not Found" extracted as if it were the answer.
Where Haunt fits
Haunt API is useful when the agent already knows the URL and the next step needs structured JSON.
Example MCP provider boundary:
type StructuredExtractionInput = {
url: string;
prompt: string;
};
type StructuredExtractionResult = {
success: boolean;
data: Record<string, unknown>;
confidence?: number;
provenance?: Record<string, unknown>;
};
Example request:
curl -X POST https://hauntapi.com/v1/extract \
-H "X-API-Key: YOUR_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"url": "https://example.com/product",
"prompt": "Extract the product name, current price, availability, and main image URL as JSON"
}'
This should be a leaf extractor, not the entire crawler brain.
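A minimal sketch of that leaf boundary, reusing the endpoint and headers from the curl example above and the StructuredExtractionResult type defined earlier. The exact response body shape is an assumption, so treat this as illustrative rather than official client code:
// Leaf extractor: one known URL plus one prompt in, typed JSON out.
async function extractStructuredFields(
  url: string,
  prompt: string,
  apiKey: string
): Promise<StructuredExtractionResult> {
  const response = await fetch("https://hauntapi.com/v1/extract", {
    method: "POST",
    headers: { "X-API-Key": apiKey, "Content-Type": "application/json" },
    body: JSON.stringify({ url, prompt }),
  });
  if (!response.ok) {
    // Fail honestly; do not synthesise fields the provider never returned.
    return { success: false, data: {}, provenance: { status: response.status } };
  }
  return (await response.json()) as StructuredExtractionResult;
}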
Good Haunt fit:
- known URL;
- specific fields;
- JSON output required;
- selectors would be brittle;
- agent workflow needs evidence it can pass to another tool.
Bad Haunt fit:
- crawl every page on a domain;
- build a full RAG corpus;
- search the web broadly;
- bypass paywalls or private authenticated areas without an explicit authorised path;
- scrape at massive scale with no regard for source behaviour.
If your MCP server needs search, crawling, mapping, and broad Markdown generation, start with a crawler/web-data provider. If it needs one page turned into typed fields, a structured extractor is the cleaner boundary.
Example MCP tool design
Here is a simplified tool contract:
const extractWebDataTool = {
name: "extract_web_data",
description: "Fetch a URL and return either readable content or structured fields, with fallback provenance.",
inputSchema: {
type: "object",
properties: {
url: { type: "string" },
prompt: { type: "string" },
outputMode: {
type: "string",
enum: ["markdown", "structured_json"]
},
fallbackPolicy: {
type: "string",
enum: ["conservative", "best_effort"]
}
},
required: ["url", "prompt"]
}
};
Then route by intent:
if (outputMode === "markdown") {
return runMarkdownFallbackChain(url, fallbackPolicy);
}
if (outputMode === "structured_json") {
return runStructuredExtractionChain(url, prompt, fallbackPolicy);
}
Keep these separate. Markdown context and structured JSON extraction are cousins, not twins.
Which design should you choose?
Choose a conservative default if your users are self-hosting, privacy-sensitive, or likely to be surprised by external providers.
Choose a best-effort fallback if your users explicitly want reliability over locality and have configured provider keys.
Choose a structured extraction fallback when:
- the URL is already selected;
- the agent needs fields, not a page dump;
- the failure is “we cannot get the data into the shape the workflow needs”;
- the response can carry provenance and warnings.
Do not choose silent magic. Silent magic is how support tickets reproduce.
Final checklist
Before shipping MCP web scraping fallbacks, check:
- Are transport failures and extraction-quality failures separate branches?
- Are 403, 429, and 503 handled intentionally?
- Do you respect Retry-After?
- Are 404 and hard errors terminal?
- Can users disable external providers?
- Does every result include provenance?
- Do structured prompts fail honestly when fields are missing?
- Are fallback modes testable and deterministic?
- Is the provider boundary based on capability, not brand worship?
If the answer is yes, your MCP server will be much easier to trust.
If the answer is no, your agent may still work in the demo. It will just wait until production to become performance art.
If your MCP tool already knows the URL and needs structured JSON from the page, try Haunt API as a scoped extraction provider.
Read the Haunt API docs