This tutorial shows you how to extract structured data from websites using Python and a web scraping API. You'll have working code in under 5 minutes — no BeautifulSoup, no Selenium, no proxy management. Just requests and a URL.
BeautifulSoup is great for simple, static pages. But production scraping hits walls fast: JavaScript-rendered content that never shows up in the raw HTML, bot detection and CAPTCHAs, and IP bans that force you into proxy rotation.
A scraping API handles all of this. You send a URL, you get data back. The API deals with rendering, bot detection, and proxies. You write business logic, not infrastructure.
For this tutorial, we'll use Haunt API. It's free for 100 requests/month — enough to follow along and build something real.
Install the only dependency you need:

```bash
pip install requests
```

Set your API key as an environment variable:

```bash
export RAPIDAPI_KEY="your_key_here"
```
⚠ Never hardcode API keys in source code. Use environment variables or a .env file with python-dotenv.
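If you prefer the .env route, python-dotenv's `load_dotenv()` handles the loading for you. To show what that loading amounts to, here is a minimal stdlib-only sketch (the `load_env` helper is hypothetical, not part of any library):

```python
import os

def load_env(path=".env"):
    """Minimal .env parser: reads KEY=value lines into os.environ.
    python-dotenv's load_dotenv() does this (and more); this sketch
    just shows the mechanics."""
    if not os.path.exists(path):
        return
    with open(path) as f:
        for line in f:
            line = line.strip()
            # Skip blanks, comments, and lines without an assignment
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, _, value = line.partition("=")
            # Don't clobber variables already set in the real environment
            os.environ.setdefault(key.strip(), value.strip().strip('"'))

load_env()  # after this, read os.environ["RAPIDAPI_KEY"] as usual
```

Either way, the key lives outside your source tree and never lands in version control.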
Here's the simplest possible scraper: one helper function and a single call.
```python
import requests
import os

API_KEY = os.environ["RAPIDAPI_KEY"]
API_HOST = "haunt-web-extractor.p.rapidapi.com"

def extract(url, prompt):
    """Extract data from any URL using a natural language prompt."""
    response = requests.post(
        f"https://{API_HOST}/extract",
        headers={
            "x-rapidapi-key": API_KEY,
            "x-rapidapi-host": API_HOST,
            "Content-Type": "application/json"
        },
        json={"url": url, "prompt": prompt}
    )
    response.raise_for_status()
    return response.json()

# Extract product info from any e-commerce page
result = extract(
    "https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html",
    "Get the book title, price, availability, and rating"
)
print(result)
```
The response comes back as structured JSON:
```json
{
  "data": {
    "title": "A Light in the Attic",
    "price": "£51.77",
    "availability": "In stock (22 available)",
    "rating": "Three"
  }
}
```
No CSS selectors. No XPath. No inspecting the DOM. You describe what you want in plain English and get structured data back.
The prompt is your control surface. Be specific about what you want and how you want it:
```python
# Extract a list of items from a page
result = extract(
    "https://news.ycombinator.com",
    "Get the top 10 stories with their titles, scores, and URLs. Return as a list."
)
# Returns: {"data": [{"title": "...", "score": 342, "url": "..."}, ...]}

# Extract specific fields from a company page
result = extract(
    "https://example-startup.com/about",
    "Extract: company name, founding year, total funding amount, CEO name, and headquarters location"
)
# Returns: {"data": {"company_name": "...", "founded": 2020, ...}}

# Extract tabular data
result = extract(
    "https://en.wikipedia.org/wiki/Python_(programming_language)",
    "Get the main programming paradigm, designer, first appeared year, and current stable version as key-value pairs"
)
```
The key insight: the same code works on every website. Change the URL, keep the prompt. Or change the prompt, keep the URL. No rewriting selectors per site.
Real projects involve scraping hundreds of pages. Here's a production-ready batch scraper with concurrency:
```python
import requests
import os
from concurrent.futures import ThreadPoolExecutor, as_completed

API_KEY = os.environ["RAPIDAPI_KEY"]
API_HOST = "haunt-web-extractor.p.rapidapi.com"

def extract_one(url, prompt):
    """Extract data from a single URL. Returns (url, data) or (url, error)."""
    try:
        response = requests.post(
            f"https://{API_HOST}/extract",
            headers={
                "x-rapidapi-key": API_KEY,
                "x-rapidapi-host": API_HOST,
                "Content-Type": "application/json"
            },
            json={"url": url, "prompt": prompt},
            timeout=30
        )
        response.raise_for_status()
        return (url, response.json()["data"])
    except Exception as e:
        return (url, str(e))

def batch_extract(urls, prompt, max_workers=3):
    """Scrape multiple URLs concurrently, max_workers at a time."""
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(extract_one, url, prompt): url for url in urls}
        for future in as_completed(futures):
            url, data = future.result()
            results[url] = data
            print(f"  ✓ {url[:60]}...")
    return results

# Example: scrape multiple product pages
urls = [
    "https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html",
    "https://books.toscrape.com/catalogue/tipping-the-velvet_999/index.html",
    "https://books.toscrape.com/catalogue/soumission_998/index.html",
]

products = batch_extract(urls, "Get the book title, price, and availability")

for url, data in products.items():
    # extract_one returns an error string on failure, a dict on success
    if isinstance(data, dict):
        print(f"{data.get('title', 'N/A')}: {data.get('price', 'N/A')}")
    else:
        print(f"ERROR for {url}: {data}")
```
Three concurrent workers is a good default. Most scraping APIs rate-limit around 5-10 requests/second on free tiers. Increase max_workers as you scale up.
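If you hit per-second limits even with few workers, a client-side rate limiter keeps you under the cap without trial-and-error sleeps. A minimal thread-safe sketch (the 5-requests-per-second figure is an assumption; check your plan's actual limit):

```python
import threading
import time

class RateLimiter:
    """Thread-safe limiter: spaces calls at least `interval` seconds apart."""
    def __init__(self, per_second=5):
        self.interval = 1.0 / per_second
        self.lock = threading.Lock()
        self.next_allowed = 0.0

    def wait(self):
        """Block until this thread's reserved time slot arrives."""
        with self.lock:
            now = time.monotonic()
            # Compute our delay, then reserve the next slot for the next caller
            delay = max(0.0, self.next_allowed - now)
            self.next_allowed = max(now, self.next_allowed) + self.interval
        if delay:
            time.sleep(delay)

limiter = RateLimiter(per_second=5)
# Call limiter.wait() at the top of extract_one(), before requests.post(...)
```

Because the sleep happens outside the lock, each worker thread reserves its slot and waits independently, so the limiter never serializes the actual HTTP requests.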
Production scrapers need retry logic. Websites timeout. APIs hiccup. Here's a robust wrapper:
```python
import time
import requests

def extract_with_retry(url, prompt, max_retries=3, backoff=2):
    """Extract with exponential backoff retry."""
    for attempt in range(max_retries):
        try:
            response = requests.post(
                f"https://{API_HOST}/extract",
                headers={
                    "x-rapidapi-key": API_KEY,
                    "x-rapidapi-host": API_HOST,
                    "Content-Type": "application/json"
                },
                json={"url": url, "prompt": prompt},
                timeout=30
            )
            if response.status_code == 429:
                # Rate limited — wait and retry
                wait = backoff ** attempt
                print(f"  Rate limited. Waiting {wait}s...")
                time.sleep(wait)
                continue
            response.raise_for_status()
            return response.json()["data"]
        except requests.exceptions.Timeout:
            print(f"  Timeout on attempt {attempt + 1}/{max_retries}")
            if attempt < max_retries - 1:
                time.sleep(backoff ** attempt)
                continue
        except requests.exceptions.HTTPError:
            if response.status_code >= 500:
                # Server error — retry
                print(f"  Server error {response.status_code}, retrying...")
                time.sleep(backoff ** attempt)
                continue
            # Client error (4xx) — don't retry
            raise
    raise Exception(f"Failed after {max_retries} attempts: {url}")
```
This handles the three most common failure modes: rate limits (429), timeouts, and server errors (5xx). Client errors like 400 (bad request) or 401 (bad key) fail immediately — retrying won't help.
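The retry decision reduces to one question per response: is the failure transient? That test is easy to factor out into its own function (the `is_retryable` helper here is hypothetical, not part of the API):

```python
def is_retryable(status_code):
    """Transient failures worth retrying: rate limits and server errors.
    Client errors (bad request, bad key, not found) fail the same way
    every time, so retrying them just burns quota."""
    return status_code == 429 or 500 <= status_code <= 599

# Retry these:
assert is_retryable(429) and is_retryable(503)
# Fail fast on these:
assert not is_retryable(400) and not is_retryable(401)
```

Factoring the check out keeps the retry loop readable and gives you one place to adjust the policy later.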
Extracted data is only useful if you save it. Here's a clean pattern for both formats:
```python
import json
import csv

def save_json(data, filename="output.json"):
    """Save extracted data as JSON."""
    with open(filename, "w", encoding="utf-8") as f:
        json.dump(data, f, indent=2, ensure_ascii=False)
    print(f"Saved {len(data)} records to {filename}")

def save_csv(data, filename="output.csv"):
    """Save list of dicts as CSV. Auto-detects columns."""
    if not data:
        return
    # If data is a dict of results per URL, flatten it
    if isinstance(data, dict):
        rows = [{"url": k, **v} if isinstance(v, dict) else {"url": k, "value": v}
                for k, v in data.items()]
    else:
        rows = data
    # Collect the union of keys so rows with extra fields don't break DictWriter
    fieldnames = []
    for row in rows:
        for key in row:
            if key not in fieldnames:
                fieldnames.append(key)
    with open(filename, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)
    print(f"Saved {len(rows)} rows to {filename}")

# Usage
results = batch_extract(urls, "Get the book title, price, and availability")
save_json(results, "products.json")
save_csv(results, "products.csv")
```
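It's worth seeing what the flattening step inside save_csv actually produces: a batch_extract result (a dict keyed by URL) becomes one row per URL, with the URL as its own column and failed URLs collapsed into a single "value" field. A quick demonstration with stand-in data (the URLs and values below are fabricated for illustration):

```python
# Fake batch results: dict keyed by URL, as batch_extract returns
results = {
    "https://example.com/a": {"title": "Book A", "price": "£10.00"},
    "https://example.com/b": "HTTPError: 500",  # a failed URL
}

# The same flattening expression used in save_csv
rows = [{"url": k, **v} if isinstance(v, dict) else {"url": k, "value": v}
        for k, v in results.items()]

print(rows[0])  # {'url': 'https://example.com/a', 'title': 'Book A', 'price': '£10.00'}
print(rows[1])  # {'url': 'https://example.com/b', 'value': 'HTTPError: 500'}
```

Keeping failures in the output (rather than dropping them) means the CSV doubles as a scrape log: you can filter the "value" column to see exactly which URLs need a re-run.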
You now have everything you need to extract data from any website with Python: a one-function extractor, concurrent batching, retries with backoff, and export to JSON or CSV.
The entire approach works because the API handles the hard parts — JavaScript rendering, bot detection, proxies — and you handle the logic that matters to your project.
Start scraping in 30 seconds. 100 free requests/month, no credit card.
Get Free API Key →