
How to Scrape JavaScript Rendered Pages with Python

You write the scraper. You run it. requests.get() returns the HTML. You pass it to BeautifulSoup. And... nothing. Empty divs. No product data. No prices. Just a skeleton page with a bundle.js script tag and a <div id="root"></div>.

Sound familiar? You've hit a JavaScript-rendered page. The HTML you're seeing is a shell — the actual content is loaded by JavaScript after the initial page load. And your HTTP client never runs JavaScript.

This is one of the most common walls developers hit when web scraping. Here's how to get past it.

Why Some Pages Return Empty HTML

Modern web apps — React, Vue, Next.js (client-side), Angular — render content in the browser. The server sends a minimal HTML document with a JavaScript bundle. The browser downloads the bundle, executes it, and the content appears.

Your HTTP client (requests, httpx, aiohttp) doesn't execute JavaScript. It gets the raw HTML and stops. No rendering. No DOM construction. No data.

Clues you're dealing with a JS-rendered page:

- The response contains a near-empty mount point like <div id="root"></div> or <div id="app"></div>
- A large script bundle (bundle.js, main.[hash].js) is referenced, but the page has little visible text
- "View Source" shows a skeleton, while the Elements panel in DevTools shows the full content
- The data you want appears in the browser but is nowhere in response.text
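You can turn that check into code. A minimal heuristic sketch, operating on the raw HTML string (the marker strings are typical SPA mount points, not an exhaustive list):

```python
def looks_js_rendered(html: str, content_marker: str) -> bool:
    """Heuristic check: a page is likely JS-rendered when the content
    you expect is missing but a bare SPA mount point is present."""
    page = html.lower()
    has_content = content_marker.lower() in page
    has_mount = any(m in page for m in ('id="root"', 'id="app"', 'id="__next"'))
    return has_mount and not has_content

# The skeleton page from the intro: a root div plus a script bundle
shell = '<html><body><div id="root"></div><script src="bundle.js"></script></body></html>'
print(looks_js_rendered(shell, "product-card"))  # True
```

Run this on the body you get from requests before building anything else; it tells you in one line whether plain HTTP will ever be enough.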

Approach 1: Find the Underlying API (Best Case)

Before reaching for a headless browser, check if the page loads data via an API call. Open DevTools → Network tab → filter by XHR or Fetch. Reload the page. You'll often see a JSON endpoint that returns exactly the data you want.

import requests

# Instead of scraping the rendered HTML, hit the API directly
response = requests.get(
    "https://example.com/api/products",
    headers={"Accept": "application/json"}
)

data = response.json()
for product in data["products"]:
    print(f"{product['name']}: ${product['price']}")

This is faster, more reliable, and uses less bandwidth than any browser-based approach. But it only works if the site exposes a clean API — and many don't, or they protect it with auth tokens that expire.
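When the API does exist but rejects bare requests, you can often replicate the headers the browser sent, copied straight from the Network tab. A sketch; the endpoint, token value, and header set below are illustrative placeholders, not a real site's requirements:

```python
import requests

# Header values copied from the DevTools Network tab for the same XHR.
# The endpoint and token here are illustrative placeholders.
session = requests.Session()
session.headers.update({
    "Accept": "application/json",
    "Referer": "https://example.com/products",
    "Authorization": "Bearer SHORT_LIVED_TOKEN",  # tokens like this often expire in minutes
})

# response = session.get("https://example.com/api/products")
# A 401/403 usually means the token expired; grab a fresh one from the
# page (or a login flow) before retrying.
print(session.headers["Authorization"])
```

A Session keeps those headers (and any cookies the site sets) across requests, which is usually what the API expects.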

Approach 2: Playwright (The Gold Standard)

When there's no clean API to hit, you need a real browser. Playwright is the best headless browser automation library for Python in 2026. It renders pages exactly like Chrome does.

pip install playwright
playwright install chromium

from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()
    page.goto("https://example.com/products")

    # Wait for the content to actually render
    page.wait_for_selector(".product-card")

    # Extract data from the fully rendered DOM
    products = page.query_selector_all(".product-card")
    for product in products:
        name = product.query_selector(".name").inner_text()
        price = product.query_selector(".price").inner_text()
        print(f"{name}: {price}")

    browser.close()

Key things to know about Playwright:

- It ships both a sync API (shown above) and an async API for concurrent scraping
- playwright install downloads its own browser builds, so there's no separate driver binary to manage
- wait_for_selector and wait_until options replace brittle fixed sleeps
- Each browser context is an isolated session (own cookies and storage), cheap to create and throw away
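The async API is worth seeing once. A minimal sketch; it assumes Playwright is installed with Chromium downloaded, and defers the import so the function definition itself needs neither:

```python
import asyncio

async def render_page(url: str) -> str:
    """Render `url` in headless Chromium and return the final HTML.
    Requires `pip install playwright` and `playwright install chromium`;
    the import is deferred so defining this function needs neither."""
    from playwright.async_api import async_playwright

    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=True)
        try:
            page = await browser.new_page()
            await page.goto(url)
            await page.wait_for_selector(".product-card")
            return await page.content()
        finally:
            await browser.close()

# html = asyncio.run(render_page("https://example.com/products"))
```

With the async API you can render several pages concurrently with asyncio.gather, which matters once page loads cost seconds each.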

Approach 3: Selenium (Still Works, But Aging)

Selenium has been around forever. It works, but Playwright is faster and more reliable for scraping specifically.

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

options = Options()
options.add_argument("--headless=new")
options.add_argument("--disable-gpu")

driver = webdriver.Chrome(options=options)
driver.get("https://example.com/products")

# Wait for JS to render
WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.CSS_SELECTOR, ".product-card"))
)

cards = driver.find_elements(By.CSS_SELECTOR, ".product-card")
for card in cards:
    print(card.text)

driver.quit()

Selenium's main advantage is ecosystem maturity — if you need to interact with complex UI elements (dropdowns, iframes, file uploads), the documentation and Stack Overflow answers are extensive. For pure extraction though, Playwright is the better choice.
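As an example of that UI-interaction strength, here's a helper using Selenium's Select wrapper and frame switching. The selectors are hypothetical, and `driver` is assumed to be a live WebDriver already on the target page:

```python
def sort_and_read_review(driver):
    """Pick an option from a native <select> dropdown, then read text
    from inside an iframe. Selectors here are illustrative placeholders;
    `driver` is a live Selenium WebDriver already on the target page."""
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support.ui import Select

    # Native dropdowns get a dedicated wrapper in Selenium
    menu = Select(driver.find_element(By.CSS_SELECTOR, "select.sort-by"))
    menu.select_by_visible_text("Price: Low to High")

    # Elements inside an iframe are invisible until you switch into it
    driver.switch_to.frame(driver.find_element(By.CSS_SELECTOR, "iframe.reviews"))
    text = driver.find_element(By.CSS_SELECTOR, ".review").text
    driver.switch_to.default_content()  # switch back out when done
    return text
```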

Approach 4: Use a Scraping API (When You Don't Want to Manage Browsers)

⚠️ Running headless browsers at scale is expensive. Each page load takes 2-5 seconds and eats RAM. If you're scraping hundreds or thousands of pages, managing browser pools, proxies, and retries becomes a full infrastructure problem.

Scraping APIs solve this by handling the browser rendering, proxy rotation, and anti-bot bypass on their infrastructure. You send a URL, you get back the fully rendered HTML (or structured data).

Here's how to scrape a JavaScript-rendered page with Haunt API:

import requests

response = requests.get(
    "https://hauntapi.com/scrape",
    params={
        "url": "https://example.com/products",
        "render_js": "true"  # Enables full JS rendering
    },
    headers={"Authorization": "Bearer YOUR_API_KEY"}
)

# You get back the fully rendered HTML — JS has already executed
html = response.json()["content"]

Or if you want structured data without parsing HTML yourself:

response = requests.get(
    "https://hauntapi.com/extract",
    params={
        "url": "https://example.com/products",
        "prompt": "Extract product names and prices as a JSON array"
    },
    headers={"Authorization": "Bearer YOUR_API_KEY"}
)

products = response.json()["data"]
# Returns: [{"name": "Widget A", "price": "$29.99"}, ...]

The AI extraction endpoint handles JavaScript rendering and parsing in one call. No CSS selectors, no HTML parsing, no broken scrapers when the site redesigns.

Performance Comparison

| Approach        | Speed                 | Cost         | Maintenance                               |
|-----------------|-----------------------|--------------|-------------------------------------------|
| Direct API call | Fast (~100ms)         | Free         | Low — until they change the API           |
| Playwright      | Slow (~3-5s per page) | Server costs | High — browser updates, proxies, anti-bot |
| Selenium        | Slow (~4-6s per page) | Server costs | High — same issues as Playwright          |
| Scraping API    | Medium (~2-3s)        | Per-request  | None — they handle infra                  |

Common Pitfalls When Scraping JS Pages

1. Not Waiting Long Enough

The most common mistake. The page loads, your script grabs the DOM immediately, and gets empty content. Always use explicit waits — wait for the specific element you need, not arbitrary time.sleep() calls.

# Bad — race condition
page.goto("https://example.com/products")
products = page.query_selector_all(".product-card")  # Might be empty!

# Good — wait for content
page.goto("https://example.com/products")
page.wait_for_selector(".product-card", timeout=10000)
products = page.query_selector_all(".product-card")

2. Ignoring Pagination

JS-rendered sites often use infinite scroll or "Load More" buttons. You need to handle these explicitly:

from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()
    page.goto("https://example.com/products")
    page.wait_for_selector(".product-card")

    # Scroll to load more content
    for _ in range(5):  # Load 5 pages worth
        page.evaluate("window.scrollTo(0, document.body.scrollHeight)")
        page.wait_for_timeout(1500)  # Wait for new content to load

    products = page.query_selector_all(".product-card")
    print(f"Loaded {len(products)} products")
    browser.close()
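The "Load More" variant works the same way, but clicks instead of scrolls. A sketch; `page` is a live Playwright page, and the button selector is a placeholder:

```python
def click_load_more(page, max_clicks: int = 5) -> None:
    """Click a 'Load More' button until it disappears or the click
    budget runs out. `page` is a live Playwright page; the selector
    is an illustrative placeholder."""
    for _ in range(max_clicks):
        button = page.query_selector("button.load-more")
        if button is None or not button.is_visible():
            break  # nothing left to load
        button.click()
        page.wait_for_timeout(1500)  # let the next batch render
```

Capping the loop matters: an infinite-scroll page with no natural end will otherwise run forever.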

3. Getting Blocked by Anti-Bot Systems

Headless browsers leave fingerprints. Sites use Cloudflare, Datadome, PerimeterX, and similar services to detect and block automated browsers. Symptoms include CAPTCHA pages, 403 errors, or being served fake data.

Mitigation strategies:

- Use a realistic fingerprint: real user agent, common viewport, matching locale and timezone
- Rotate residential or mobile proxies rather than datacenter IPs
- Slow down: randomize delays between requests instead of hammering at a fixed rate
- Try stealth plugins (e.g. playwright-stealth) that patch common headless tells
- For heavily protected sites, offload to a scraping API that maintains the bypass for you
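The fingerprint and pacing points can be sketched in a few lines. The specific values below are typical examples, not magic numbers, and they reduce detection rather than eliminate it:

```python
import random
import time

def realistic_context_args() -> dict:
    """Playwright context options that make headless Chromium look more
    like an ordinary desktop browser. Values are typical examples."""
    return {
        "user_agent": (
            "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
            "AppleWebKit/537.36 (KHTML, like Gecko) "
            "Chrome/120.0.0.0 Safari/537.36"
        ),
        "viewport": {"width": 1366, "height": 768},
        "locale": "en-US",
        "timezone_id": "America/New_York",
    }

def human_delay(low: float = 1.0, high: float = 3.0) -> None:
    """Sleep a randomized interval so requests don't arrive at a
    machine-perfect rate."""
    time.sleep(random.uniform(low, high))

# Usage with Playwright:
# context = browser.new_context(**realistic_context_args())
```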

4. Memory Leaks with Browser Pools

If you're running Playwright at scale, you need to manage browser contexts carefully. Each page that isn't properly closed leaks memory. Use context managers and always close browsers in a finally block:

from playwright.sync_api import sync_playwright

def scrape(url):
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        try:
            page = browser.new_page()
            page.goto(url, wait_until="networkidle")
            return page.content()
        finally:
            browser.close()  # Always clean up
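At scale, launching a fresh browser per URL is wasteful; one browser process can serve many isolated contexts instead. A sketch of that pattern, with the Playwright import deferred so the definition stands alone:

```python
def scrape_many(urls):
    """Render a batch of URLs with a single browser process, one fresh
    context per URL so cookies and storage never leak between pages.
    Requires Playwright with Chromium installed; the import is deferred
    so defining this function needs neither."""
    from playwright.sync_api import sync_playwright

    results = {}
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        try:
            for url in urls:
                context = browser.new_context()  # isolated, cheap session
                try:
                    page = context.new_page()
                    page.goto(url, wait_until="networkidle")
                    results[url] = page.content()
                finally:
                    context.close()  # releases the page's memory
        finally:
            browser.close()
    return results
```

Contexts are much cheaper to create than browsers, so this amortizes the expensive launch across the whole batch while keeping every page's state isolated.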

When to Use Which Approach

Use the direct API approach when the site exposes clean JSON endpoints. This is always the fastest and cheapest option.

Use Playwright when you need fine-grained control over browser interactions (clicking, scrolling, filling forms) and you're running a small to medium number of requests.

Use a scraping API when you're extracting data at scale, don't want to maintain browser infrastructure, or need to bypass anti-bot protections without building that expertise yourself.

The honest truth: most developers start with Playwright, spend two weeks fighting anti-bot detection and proxy rotation, then switch to an API. There's no shame in it. Browser automation is a solved problem — your time is better spent on the data itself.

Need to scrape JavaScript-rendered pages without the headache?

Haunt API handles rendering, anti-bot bypass, and data extraction. Free tier included.

Try Haunt API Free →