Extract Text From Any Webpage Using an API

What you'll learn
  1. Why extract text via API instead of parsing HTML?
  2. Python: extract text from a URL in 3 lines
  3. JavaScript / Node.js example
  4. Get structured data, not just raw text
  5. Batch processing: extract text from multiple pages
  6. Comparison: API vs BeautifulSoup vs Readability
  7. Cost breakdown

Why extract text via API instead of parsing HTML?

Traditional web scraping for text extraction follows a tedious pattern: fetch the HTML, parse it with BeautifulSoup, strip out navigation and footers, remove scripts and styles, then hope what's left is the actual article content.

It works — until it doesn't. Modern websites use dynamic rendering, shadow DOMs, and client-side frameworks that make simple HTML parsing unreliable. You end up with navigation text mixed into your content, or worse, empty results because the content loaded via JavaScript after your scraper already finished.

A text extraction API handles all of this for you: it fetches the page, renders any JavaScript, strips out navigation and boilerplate, and returns the main content as clean text.

Python: extract text from a URL in 3 lines

Here's the simplest way to extract readable text from any webpage using Python:

import requests

response = requests.post("https://hauntapi.com/v1/extract", headers={
    "X-API-Key": "your-api-key",
    "Content-Type": "application/json"
}, json={
    "url": "https://en.wikipedia.org/wiki/Web_scraping"
})

data = response.json()
print(data["content"])  # Clean, extracted text

That's it. The API returns the page's main content as clean text — no HTML tags, no navigation clutter, no boilerplate.

With the Haunt API Python SDK

Install the SDK first:

pip install hauntapi

Then:
from hauntapi import HauntClient

client = HauntClient(api_key="your-api-key")
result = client.extract("https://en.wikipedia.org/wiki/Web_scraping")
print(result.content)

JavaScript / Node.js example

Same thing in Node.js using the Fetch API:

const response = await fetch("https://hauntapi.com/v1/extract", {
  method: "POST",
  headers: {
    "X-API-Key": "your-api-key",
    "Content-Type": "application/json"
  },
  body: JSON.stringify({
    url: "https://en.wikipedia.org/wiki/Web_scraping"
  })
});

const data = await response.json();
console.log(data.content);

Get structured data, not just raw text

Sometimes you don't want all the text — you want specific data points. Haunt API lets you provide an extraction prompt to pull exactly what you need:

response = requests.post("https://hauntapi.com/v1/extract", headers={
    "X-API-Key": "your-api-key",
    "Content-Type": "application/json"
}, json={
    "url": "https://news.ycombinator.com",
    "prompt": "Extract the top 5 stories with their titles, points, and URLs as JSON"
})

print(response.json()["data"])
# Returns structured JSON with exactly what you asked for

This is where traditional scrapers fall short. With BeautifulSoup, you'd need to inspect the HTML, find the right CSS selectors, handle edge cases, and write brittle parsing code. With an extraction API, you just describe what you want in plain English.
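For a sense of that contrast, here's a sketch of what hand-rolled extraction looks like even against a simplified, Hacker News-style page (stdlib only; the markup below is invented for illustration, and real pages are messier and change without notice):

```python
from html.parser import HTMLParser

# A toy page shaped loosely like a news listing. Real markup is deeper,
# noisier, and liable to change, breaking parsers like this one.
SAMPLE = """
<html><body>
  <nav><a href="/">Home</a> <a href="/new">New</a></nav>
  <span class="titleline"><a href="https://example.com/a">Story A</a></span>
  <span class="titleline"><a href="https://example.com/b">Story B</a></span>
</body></html>
"""

class StoryParser(HTMLParser):
    """Collects (title, url) pairs from anchors inside 'titleline' spans."""
    def __init__(self):
        super().__init__()
        self.in_title = False   # currently inside a titleline span?
        self.in_link = False    # currently inside an <a> within that span?
        self.href = None
        self.stories = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "span" and attrs.get("class") == "titleline":
            self.in_title = True
        elif tag == "a" and self.in_title:
            self.in_link = True
            self.href = attrs.get("href")

    def handle_endtag(self, tag):
        if tag == "span" and self.in_title:
            self.in_title = False
        elif tag == "a" and self.in_link:
            self.in_link = False

    def handle_data(self, data):
        if self.in_link:
            self.stories.append({"title": data, "url": self.href})

parser = StoryParser()
parser.feed(SAMPLE)
print(parser.stories)
```

All of that state tracking exists just to tell "story link" apart from "navigation link" — and it's coupled to one site's class names. The prompt-based call above replaces the whole parser with a sentence.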

Batch processing: extract text from multiple pages

Need to extract text from hundreds or thousands of pages? Here's a production-ready batch script with error handling:

import requests
import json
from concurrent.futures import ThreadPoolExecutor, as_completed

API_KEY = "your-api-key"
URLS = [
    "https://example.com/page-1",
    "https://example.com/page-2",
    "https://example.com/page-3",
    # ... hundreds more
]

def extract_text(url):
    try:
        resp = requests.post(
            "https://hauntapi.com/v1/extract",
            headers={"X-API-Key": API_KEY, "Content-Type": "application/json"},
            json={"url": url},
            timeout=30
        )
        resp.raise_for_status()
        return {"url": url, "text": resp.json().get("content", ""), "status": "ok"}
    except Exception as e:
        return {"url": url, "text": "", "status": "error", "error": str(e)}

results = []
with ThreadPoolExecutor(max_workers=5) as pool:
    futures = {pool.submit(extract_text, url): url for url in URLS}
    for future in as_completed(futures):
        results.append(future.result())

# Save to file
with open("extracted.json", "w") as f:
    json.dump(results, f, indent=2)

print(f"Extracted {len([r for r in results if r['status'] == 'ok'])}/{len(URLS)} pages")
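One thing the script above doesn't handle is rate limiting. If the API starts rejecting requests under load (commonly an HTTP 429, though the exact behavior is something to confirm in Haunt's docs), a small retry-with-backoff wrapper keeps the batch moving. This is a generic sketch, not part of the Haunt SDK:

```python
import time
import random

def with_retries(fn, max_attempts=4, base_delay=1.0):
    """Call fn(); on failure, wait with exponential backoff and retry."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the error
            # waits ~1s, 2s, 4s... plus jitter so workers don't retry in lockstep
            time.sleep(base_delay * (2 ** attempt) + random.uniform(0, 0.25))

# Usage inside the batch, e.g.:
#   result = with_retries(lambda: extract_text(url))
```

Note that extract_text above traps its own exceptions and returns a status field, so for retries to kick in you'd either re-raise when result["status"] == "error" or wrap the raw requests.post call instead.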

Comparison: API vs BeautifulSoup vs Readability

Here's how the approaches stack up for text extraction. BeautifulSoup parses static HTML, so it needs hand-written, per-site selectors and fails outright on JavaScript-rendered pages. Readability-style libraries guess at the main content heuristically, which works well on clean article pages but still leaves fetching and rendering to you. An extraction API handles fetching, rendering, and cleaning in one call, at the cost of a per-request fee.

Cost breakdown

Haunt API gives you 100 free requests per month on the Basic plan. No credit card needed. For production use, the Pro plan is $0.01 per request — that's $10 for 1,000 extractions. Compare that to the developer time you'd spend writing and maintaining BeautifulSoup scrapers.
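At the Pro rate quoted above, monthly cost is simply volume times $0.01, which makes it quick to sanity-check against your own workload:

```python
PRICE_PER_REQUEST = 0.01  # Pro plan rate from the pricing above

for monthly_requests in (1_000, 10_000, 100_000):
    cost = monthly_requests * PRICE_PER_REQUEST
    print(f"{monthly_requests:>7,} requests/month -> ${cost:,.2f}")
```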

Start extracting text from webpages in minutes.

Get Your Free API Key →

The key insight: if you're still manually parsing HTML to get text from webpages, you're spending time on the wrong problem. Let the API handle rendering, parsing, and cleaning — focus on actually using the data.

Ready to extract clean text from any webpage?

Try Haunt API Free →