Extract Text From Any Webpage Using an API

What you'll learn
  1. Why extract text via API instead of parsing HTML?
  2. Python: extract text from a URL in 3 lines
  3. JavaScript / Node.js example
  4. Get structured data, not just raw text
  5. Batch processing: extract text from multiple pages
  6. Comparison: API vs BeautifulSoup vs Readability
  7. Cost breakdown

Why extract text via API instead of parsing HTML?

Traditional web scraping for text extraction follows a tedious pattern: fetch the HTML, parse it with BeautifulSoup, strip out navigation and footers, remove scripts and styles, then hope what's left is the actual article content.

It works — until it doesn't. Modern websites use dynamic rendering, shadow DOMs, and client-side frameworks that make simple HTML parsing unreliable. You end up with navigation text mixed into your content, or worse, empty results because the content loaded via JavaScript after your scraper already finished.

A text extraction API handles all of this for you: it fetches the page, renders any JavaScript, strips out navigation and boilerplate, and returns the main content as clean text.

Python: extract text from a URL in 3 lines

Here's the simplest way to extract readable text from any webpage using Python:

import requests

response = requests.post("https://hauntapi.com/v1/extract", headers={
    "X-API-Key": "your-api-key",
    "Content-Type": "application/json"
}, json={
    "url": "https://en.wikipedia.org/wiki/Web_scraping"
})

data = response.json()
print(data["content"])  # Clean, extracted text

That's it. The API returns the page's main content as clean text — no HTML tags, no navigation clutter, no boilerplate.

With the Haunt API Python SDK

Install the SDK first:

pip install hauntapi

Then:
from hauntapi import HauntClient

client = HauntClient(api_key="your-api-key")
result = client.extract("https://en.wikipedia.org/wiki/Web_scraping")
print(result.content)

JavaScript / Node.js example

Same thing in Node.js using the Fetch API:

const response = await fetch("https://hauntapi.com/v1/extract", {
  method: "POST",
  headers: {
    "X-API-Key": "your-api-key",
    "Content-Type": "application/json"
  },
  body: JSON.stringify({
    url: "https://en.wikipedia.org/wiki/Web_scraping"
  })
});

const data = await response.json();
console.log(data.content);

Get structured data, not just raw text

Sometimes you don't want all the text — you want specific data points. Haunt API lets you provide an extraction prompt to pull exactly what you need:

response = requests.post("https://hauntapi.com/v1/extract", headers={
    "X-API-Key": "your-api-key",
    "Content-Type": "application/json"
}, json={
    "url": "https://news.ycombinator.com",
    "prompt": "Extract the top 5 stories with their titles, points, and URLs as JSON"
})

print(response.json()["data"])
# Returns structured JSON with exactly what you asked for

This is where traditional scrapers fall short. With BeautifulSoup, you'd need to inspect the HTML, find the right CSS selectors, handle edge cases, and write brittle parsing code. With an extraction API, you just describe what you want in plain English.
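For a sense of that contrast, here's a sketch of what hand-rolled extraction looks like even against a simplified, Hacker News-style page (stdlib only; the markup below is invented for illustration, and real pages are messier and change without notice):

```python
from html.parser import HTMLParser

# A toy page shaped loosely like a news listing. Real markup is deeper,
# noisier, and liable to change, breaking parsers like this one.
SAMPLE = """
<html><body>
  <nav><a href="/">Home</a> <a href="/new">New</a></nav>
  <span class="titleline"><a href="https://example.com/a">Story A</a></span>
  <span class="titleline"><a href="https://example.com/b">Story B</a></span>
</body></html>
"""

class StoryParser(HTMLParser):
    """Collects (title, url) pairs from anchors inside 'titleline' spans."""
    def __init__(self):
        super().__init__()
        self.in_title = False   # currently inside a titleline span?
        self.in_link = False    # currently inside an <a> within that span?
        self.href = None
        self.stories = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "span" and attrs.get("class") == "titleline":
            self.in_title = True
        elif tag == "a" and self.in_title:
            self.in_link = True
            self.href = attrs.get("href")

    def handle_endtag(self, tag):
        if tag == "span" and self.in_title:
            self.in_title = False
        elif tag == "a" and self.in_link:
            self.in_link = False

    def handle_data(self, data):
        if self.in_link:
            self.stories.append({"title": data, "url": self.href})

parser = StoryParser()
parser.feed(SAMPLE)
print(parser.stories)
```

All of that state tracking exists just to tell "story link" apart from "navigation link" — and it's coupled to one site's class names. The prompt-based call above replaces the whole parser with a sentence.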

Batch processing: extract text from multiple pages

Need to extract text from hundreds or thousands of pages? Here's a production-ready batch script with error handling:

import requests
import json
from concurrent.futures import ThreadPoolExecutor, as_completed

API_KEY = "your-api-key"
URLS = [
    "https://example.com/page-1",
    "https://example.com/page-2",
    "https://example.com/page-3",
    # ... hundreds more
]

def extract_text(url):
    try:
        resp = requests.post(
            "https://hauntapi.com/v1/extract",
            headers={"X-API-Key": API_KEY, "Content-Type": "application/json"},
            json={"url": url},
            timeout=30
        )
        resp.raise_for_status()
        return {"url": url, "text": resp.json().get("content", ""), "status": "ok"}
    except Exception as e:
        return {"url": url, "text": "", "status": "error", "error": str(e)}

results = []
with ThreadPoolExecutor(max_workers=5) as pool:
    futures = {pool.submit(extract_text, url): url for url in URLS}
    for future in as_completed(futures):
        results.append(future.result())

# Save to file
with open("extracted.json", "w") as f:
    json.dump(results, f, indent=2)

print(f"Extracted {len([r for r in results if r['status'] == 'ok'])}/{len(URLS)} pages")
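One thing the script above doesn't handle is rate limiting. If the API starts rejecting requests under load (commonly an HTTP 429, though the exact behavior is something to confirm in Haunt's docs), a small retry-with-backoff wrapper keeps the batch moving. This is a generic sketch, not part of the Haunt SDK:

```python
import time
import random

def with_retries(fn, max_attempts=4, base_delay=1.0):
    """Call fn(); on failure, wait with exponential backoff and retry."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the error
            # waits ~1s, 2s, 4s... plus jitter so workers don't retry in lockstep
            time.sleep(base_delay * (2 ** attempt) + random.uniform(0, 0.25))

# Usage inside the batch, e.g.:
#   result = with_retries(lambda: extract_text(url))
```

Note that extract_text above traps its own exceptions and returns a status field, so for retries to kick in you'd either re-raise when result["status"] == "error" or wrap the raw requests.post call instead.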

Comparison: API vs BeautifulSoup vs Readability

Here's how the approaches stack up for text extraction. BeautifulSoup parses static HTML, so it needs hand-written, per-site selectors and fails outright on JavaScript-rendered pages. Readability-style libraries guess at the main content heuristically, which works well on clean article pages but still leaves fetching and rendering to you. An extraction API handles fetching, rendering, and cleaning in one call, at the cost of a per-request fee.

Cost breakdown

Haunt API gives you 100 free requests per month on the Basic plan. No credit card needed. For production use, the Pro plan is $0.01 per request — that's $10 for 1,000 extractions. Compare that to the developer time you'd spend writing and maintaining BeautifulSoup scrapers.
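At the Pro rate quoted above, monthly cost is simply volume times $0.01, which makes it quick to sanity-check against your own workload:

```python
PRICE_PER_REQUEST = 0.01  # Pro plan rate from the pricing above

for monthly_requests in (1_000, 10_000, 100_000):
    cost = monthly_requests * PRICE_PER_REQUEST
    print(f"{monthly_requests:>7,} requests/month -> ${cost:,.2f}")
```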

Start extracting text from webpages in minutes.

Get Your Free API Key →

The key insight: if you're still manually parsing HTML to get text from webpages, you're spending time on the wrong problem. Let the API handle rendering, parsing, and cleaning — focus on actually using the data.

Ready to extract clean text from any webpage?

Try Haunt API Free →