This tutorial shows you how to extract structured data from websites using Python and a web scraping API. You'll have working code in under 5 minutes — no BeautifulSoup, no Selenium, no proxy management. Just requests and a URL.
BeautifulSoup is great for simple, static pages. But production scraping hits walls fast: JavaScript-rendered content that never shows up in the raw HTML, bot detection and CAPTCHAs, and IP bans that force you into proxy rotation.
A scraping API handles all of this. You send a URL, you get data back. The API deals with rendering, bot detection, and proxies. You write business logic, not infrastructure.
For this tutorial, we'll use Haunt API. It's free for 100 requests/month — enough to follow along and build something real.
Install the only dependency you need:

```bash
pip install requests
```

Set your API key as an environment variable:

```bash
export RAPIDAPI_KEY="your_key_here"
```
⚠ Never hardcode API keys in source code. Use environment variables or a .env file with python-dotenv.
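If you prefer the .env route, python-dotenv's `load_dotenv()` handles the loading for you. To show what that loading amounts to, here is a minimal stdlib-only sketch (the `load_env` helper is hypothetical, not part of any library):

```python
import os

def load_env(path=".env"):
    """Minimal .env parser: reads KEY=value lines into os.environ.
    python-dotenv's load_dotenv() does this (and more); this sketch
    just shows the mechanics."""
    if not os.path.exists(path):
        return
    with open(path) as f:
        for line in f:
            line = line.strip()
            # Skip blanks, comments, and lines without an assignment
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, _, value = line.partition("=")
            # Don't clobber variables already set in the real environment
            os.environ.setdefault(key.strip(), value.strip().strip('"'))

load_env()  # after this, read os.environ["RAPIDAPI_KEY"] as usual
```

Either way, the key lives outside your source tree and never lands in version control.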
Here's the simplest possible scraper: one helper function and a single call.
```python
import requests
import os

API_KEY = os.environ["RAPIDAPI_KEY"]
API_HOST = "haunt-web-extractor.p.rapidapi.com"

def extract(url, prompt):
    """Extract data from any URL using a natural language prompt."""
    response = requests.post(
        f"https://{API_HOST}/extract",
        headers={
            "x-rapidapi-key": API_KEY,
            "x-rapidapi-host": API_HOST,
            "Content-Type": "application/json"
        },
        json={"url": url, "prompt": prompt}
    )
    response.raise_for_status()
    return response.json()

# Extract product info from any e-commerce page
result = extract(
    "https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html",
    "Get the book title, price, availability, and rating"
)
print(result)
```
The response comes back as structured JSON:
```json
{
  "data": {
    "title": "A Light in the Attic",
    "price": "£51.77",
    "availability": "In stock (22 available)",
    "rating": "Three"
  }
}
```
No CSS selectors. No XPath. No inspecting the DOM. You describe what you want in plain English and get structured data back.
The prompt is your control surface. Be specific about what you want and how you want it:
```python
# Extract a list of items from a page
result = extract(
    "https://news.ycombinator.com",
    "Get the top 10 stories with their titles, scores, and URLs. Return as a list."
)
# Returns: {"data": [{"title": "...", "score": 342, "url": "..."}, ...]}

# Extract specific fields from a company page
result = extract(
    "https://example-startup.com/about",
    "Extract: company name, founding year, total funding amount, CEO name, and headquarters location"
)
# Returns: {"data": {"company_name": "...", "founded": 2020, ...}}

# Extract tabular data
result = extract(
    "https://en.wikipedia.org/wiki/Python_(programming_language)",
    "Get the main programming paradigm, designer, first appeared year, and current stable version as key-value pairs"
)
```
The key insight: the same code works on every website. Change the URL, keep the prompt. Or change the prompt, keep the URL. No rewriting selectors per site.
Real projects involve scraping hundreds of pages. Here's a production-ready batch scraper with concurrency:
```python
import requests
import os
from concurrent.futures import ThreadPoolExecutor, as_completed

API_KEY = os.environ["RAPIDAPI_KEY"]
API_HOST = "haunt-web-extractor.p.rapidapi.com"

def extract_one(url, prompt):
    """Extract data from a single URL. Returns (url, data) or (url, error)."""
    try:
        response = requests.post(
            f"https://{API_HOST}/extract",
            headers={
                "x-rapidapi-key": API_KEY,
                "x-rapidapi-host": API_HOST,
                "Content-Type": "application/json"
            },
            json={"url": url, "prompt": prompt},
            timeout=30
        )
        response.raise_for_status()
        return (url, response.json()["data"])
    except Exception as e:
        return (url, str(e))

def batch_extract(urls, prompt, max_workers=3):
    """Scrape multiple URLs concurrently, max_workers at a time."""
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(extract_one, url, prompt): url for url in urls}
        for future in as_completed(futures):
            url, data = future.result()
            results[url] = data
            print(f"  ✓ {url[:60]}...")
    return results

# Example: scrape multiple product pages
urls = [
    "https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html",
    "https://books.toscrape.com/catalogue/tipping-the-velvet_999/index.html",
    "https://books.toscrape.com/catalogue/soumission_998/index.html",
]

products = batch_extract(urls, "Get the book title, price, and availability")

for url, data in products.items():
    # extract_one returns an error string on failure, a dict on success
    if isinstance(data, dict):
        print(f"{data.get('title', 'N/A')}: {data.get('price', 'N/A')}")
    else:
        print(f"ERROR for {url}: {data}")
```
Three concurrent workers is a good default. Most scraping APIs rate-limit around 5-10 requests/second on free tiers. Increase max_workers as you scale up.
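If you hit per-second limits even with few workers, a client-side rate limiter keeps you under the cap without trial-and-error sleeps. A minimal thread-safe sketch (the 5-requests-per-second figure is an assumption; check your plan's actual limit):

```python
import threading
import time

class RateLimiter:
    """Thread-safe limiter: spaces calls at least `interval` seconds apart."""
    def __init__(self, per_second=5):
        self.interval = 1.0 / per_second
        self.lock = threading.Lock()
        self.next_allowed = 0.0

    def wait(self):
        """Block until this thread's reserved time slot arrives."""
        with self.lock:
            now = time.monotonic()
            # Compute our delay, then reserve the next slot for the next caller
            delay = max(0.0, self.next_allowed - now)
            self.next_allowed = max(now, self.next_allowed) + self.interval
        if delay:
            time.sleep(delay)

limiter = RateLimiter(per_second=5)
# Call limiter.wait() at the top of extract_one(), before requests.post(...)
```

Because the sleep happens outside the lock, each worker thread reserves its slot and waits independently, so the limiter never serializes the actual HTTP requests.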
Production scrapers need retry logic. Websites timeout. APIs hiccup. Here's a robust wrapper:
```python
import time
import requests

def extract_with_retry(url, prompt, max_retries=3, backoff=2):
    """Extract with exponential backoff retry."""
    for attempt in range(max_retries):
        try:
            response = requests.post(
                f"https://{API_HOST}/extract",
                headers={
                    "x-rapidapi-key": API_KEY,
                    "x-rapidapi-host": API_HOST,
                    "Content-Type": "application/json"
                },
                json={"url": url, "prompt": prompt},
                timeout=30
            )
            if response.status_code == 429:
                # Rate limited — wait and retry
                wait = backoff ** attempt
                print(f"  Rate limited. Waiting {wait}s...")
                time.sleep(wait)
                continue
            response.raise_for_status()
            return response.json()["data"]
        except requests.exceptions.Timeout:
            print(f"  Timeout on attempt {attempt + 1}/{max_retries}")
            if attempt < max_retries - 1:
                time.sleep(backoff ** attempt)
                continue
        except requests.exceptions.HTTPError:
            if response.status_code >= 500:
                # Server error — retry
                print(f"  Server error {response.status_code}, retrying...")
                time.sleep(backoff ** attempt)
                continue
            # Client error (4xx) — don't retry
            raise
    raise Exception(f"Failed after {max_retries} attempts: {url}")
```
This handles the three most common failure modes: rate limits (429), timeouts, and server errors (5xx). Client errors like 400 (bad request) or 401 (bad key) fail immediately — retrying won't help.
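The retry decision reduces to one question per response: is the failure transient? That test is easy to factor out into its own function (the `is_retryable` helper here is hypothetical, not part of the API):

```python
def is_retryable(status_code):
    """Transient failures worth retrying: rate limits and server errors.
    Client errors (bad request, bad key, not found) fail the same way
    every time, so retrying them just burns quota."""
    return status_code == 429 or 500 <= status_code <= 599

# Retry these:
assert is_retryable(429) and is_retryable(503)
# Fail fast on these:
assert not is_retryable(400) and not is_retryable(401)
```

Factoring the check out keeps the retry loop readable and gives you one place to adjust the policy later.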
Extracted data is only useful if you save it. Here's a clean pattern for both formats:
```python
import json
import csv

def save_json(data, filename="output.json"):
    """Save extracted data as JSON."""
    with open(filename, "w", encoding="utf-8") as f:
        json.dump(data, f, indent=2, ensure_ascii=False)
    print(f"Saved {len(data)} records to {filename}")

def save_csv(data, filename="output.csv"):
    """Save list of dicts as CSV. Auto-detects columns."""
    if not data:
        return
    # If data is a dict of results per URL, flatten it
    if isinstance(data, dict):
        rows = [{"url": k, **v} if isinstance(v, dict) else {"url": k, "value": v}
                for k, v in data.items()]
    else:
        rows = data
    # Collect the union of keys so rows with extra fields don't break DictWriter
    fieldnames = []
    for row in rows:
        for key in row:
            if key not in fieldnames:
                fieldnames.append(key)
    with open(filename, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)
    print(f"Saved {len(rows)} rows to {filename}")

# Usage
results = batch_extract(urls, "Get the book title, price, and availability")
save_json(results, "products.json")
save_csv(results, "products.csv")
```
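It's worth seeing what the flattening step inside save_csv actually produces: a batch_extract result (a dict keyed by URL) becomes one row per URL, with the URL as its own column and failed URLs collapsed into a single "value" field. A quick demonstration with stand-in data (the URLs and values below are fabricated for illustration):

```python
# Fake batch results: dict keyed by URL, as batch_extract returns
results = {
    "https://example.com/a": {"title": "Book A", "price": "£10.00"},
    "https://example.com/b": "HTTPError: 500",  # a failed URL
}

# The same flattening expression used in save_csv
rows = [{"url": k, **v} if isinstance(v, dict) else {"url": k, "value": v}
        for k, v in results.items()]

print(rows[0])  # {'url': 'https://example.com/a', 'title': 'Book A', 'price': '£10.00'}
print(rows[1])  # {'url': 'https://example.com/b', 'value': 'HTTPError: 500'}
```

Keeping failures in the output (rather than dropping them) means the CSV doubles as a scrape log: you can filter the "value" column to see exactly which URLs need a re-run.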
You now have everything you need to extract data from any website with Python: a one-function extractor, concurrent batching, retries with backoff, and export to JSON or CSV.
The entire approach works because the API handles the hard parts — JavaScript rendering, bot detection, proxies — and you handle the logic that matters to your project.
Start scraping in 30 seconds. 100 free requests/month, no credit card.
Get Free API Key →