
How to Scrape JavaScript Rendered Pages with Python

You write the scraper. You run it. requests.get() returns the HTML. You pass it to BeautifulSoup. And... nothing. Empty divs. No product data. No prices. Just a skeleton page with a bundle.js script tag and a <div id="root"></div>.

Sound familiar? You've hit a JavaScript-rendered page. The HTML you're seeing is a shell — the actual content is loaded by JavaScript after the initial page load. And your HTTP client never runs JavaScript.

This is one of the most common walls developers hit when web scraping. Here's how to get past it.

Why Some Pages Return Empty HTML

Modern web apps — React, Vue, Next.js (client-side), Angular — render content in the browser. The server sends a minimal HTML document with a JavaScript bundle. The browser downloads the bundle, executes it, and the content appears.

Your HTTP client (requests, httpx, aiohttp) doesn't execute JavaScript. It gets the raw HTML and stops. No rendering. No DOM construction. No data.

Clues you're dealing with a JS-rendered page:

- The response contains a near-empty mount point like <div id="root"></div> or <div id="app"></div>
- A large script bundle (bundle.js, main.[hash].js) is referenced, but the page has little visible text
- "View Source" shows a skeleton, while the Elements panel in DevTools shows the full content
- The data you want appears in the browser but is nowhere in response.text
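You can turn that check into code. A minimal heuristic sketch, operating on the raw HTML string (the marker strings are typical SPA mount points, not an exhaustive list):

```python
def looks_js_rendered(html: str, content_marker: str) -> bool:
    """Heuristic check: a page is likely JS-rendered when the content
    you expect is missing but a bare SPA mount point is present."""
    page = html.lower()
    has_content = content_marker.lower() in page
    has_mount = any(m in page for m in ('id="root"', 'id="app"', 'id="__next"'))
    return has_mount and not has_content

# The skeleton page from the intro: a root div plus a script bundle
shell = '<html><body><div id="root"></div><script src="bundle.js"></script></body></html>'
print(looks_js_rendered(shell, "product-card"))  # True
```

Run this on the body you get from requests before building anything else; it tells you in one line whether plain HTTP will ever be enough.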

Approach 1: Find the Underlying API (Best Case)

Before reaching for a headless browser, check if the page loads data via an API call. Open DevTools → Network tab → filter by XHR or Fetch. Reload the page. You'll often see a JSON endpoint that returns exactly the data you want.

import requests

# Instead of scraping the rendered HTML, hit the API directly
response = requests.get(
    "https://example.com/api/products",
    headers={"Accept": "application/json"}
)

data = response.json()
for product in data["products"]:
    print(f"{product['name']}: ${product['price']}")

This is faster, more reliable, and uses less bandwidth than any browser-based approach. But it only works if the site exposes a clean API — and many don't, or they protect it with auth tokens that expire.
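When the API does exist but rejects bare requests, you can often replicate the headers the browser sent, copied straight from the Network tab. A sketch; the endpoint, token value, and header set below are illustrative placeholders, not a real site's requirements:

```python
import requests

# Header values copied from the DevTools Network tab for the same XHR.
# The endpoint and token here are illustrative placeholders.
session = requests.Session()
session.headers.update({
    "Accept": "application/json",
    "Referer": "https://example.com/products",
    "Authorization": "Bearer SHORT_LIVED_TOKEN",  # tokens like this often expire in minutes
})

# response = session.get("https://example.com/api/products")
# A 401/403 usually means the token expired; grab a fresh one from the
# page (or a login flow) before retrying.
print(session.headers["Authorization"])
```

A Session keeps those headers (and any cookies the site sets) across requests, which is usually what the API expects.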

Approach 2: Playwright (The Gold Standard)

When there's no clean API to hit, you need a real browser. Playwright is the best headless browser automation library for Python in 2026. It renders pages exactly like Chrome does.

pip install playwright
playwright install chromium

from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()
    page.goto("https://example.com/products")

    # Wait for the content to actually render
    page.wait_for_selector(".product-card")

    # Extract data from the fully rendered DOM
    products = page.query_selector_all(".product-card")
    for product in products:
        name = product.query_selector(".name").inner_text()
        price = product.query_selector(".price").inner_text()
        print(f"{name}: {price}")

    browser.close()

Key things to know about Playwright:

- It ships both a sync API (shown above) and an async API for concurrent scraping
- playwright install downloads its own browser builds, so there's no separate driver binary to manage
- wait_for_selector and wait_until options replace brittle fixed sleeps
- Each browser context is an isolated session (own cookies and storage), cheap to create and throw away
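The async API is worth seeing once. A minimal sketch; it assumes Playwright is installed with Chromium downloaded, and defers the import so the function definition itself needs neither:

```python
import asyncio

async def render_page(url: str) -> str:
    """Render `url` in headless Chromium and return the final HTML.
    Requires `pip install playwright` and `playwright install chromium`;
    the import is deferred so defining this function needs neither."""
    from playwright.async_api import async_playwright

    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=True)
        try:
            page = await browser.new_page()
            await page.goto(url)
            await page.wait_for_selector(".product-card")
            return await page.content()
        finally:
            await browser.close()

# html = asyncio.run(render_page("https://example.com/products"))
```

With the async API you can render several pages concurrently with asyncio.gather, which matters once page loads cost seconds each.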

Approach 3: Selenium (Still Works, But Aging)

Selenium has been around forever. It works, but Playwright is faster and more reliable for scraping specifically.

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

options = Options()
options.add_argument("--headless=new")
options.add_argument("--disable-gpu")

driver = webdriver.Chrome(options=options)
driver.get("https://example.com/products")

# Wait for JS to render
WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.CSS_SELECTOR, ".product-card"))
)

cards = driver.find_elements(By.CSS_SELECTOR, ".product-card")
for card in cards:
    print(card.text)

driver.quit()

Selenium's main advantage is ecosystem maturity — if you need to interact with complex UI elements (dropdowns, iframes, file uploads), the documentation and Stack Overflow answers are extensive. For pure extraction though, Playwright is the better choice.
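As an example of that UI-interaction strength, here's a helper using Selenium's Select wrapper and frame switching. The selectors are hypothetical, and `driver` is assumed to be a live WebDriver already on the target page:

```python
def sort_and_read_review(driver):
    """Pick an option from a native <select> dropdown, then read text
    from inside an iframe. Selectors here are illustrative placeholders;
    `driver` is a live Selenium WebDriver already on the target page."""
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support.ui import Select

    # Native dropdowns get a dedicated wrapper in Selenium
    menu = Select(driver.find_element(By.CSS_SELECTOR, "select.sort-by"))
    menu.select_by_visible_text("Price: Low to High")

    # Elements inside an iframe are invisible until you switch into it
    driver.switch_to.frame(driver.find_element(By.CSS_SELECTOR, "iframe.reviews"))
    text = driver.find_element(By.CSS_SELECTOR, ".review").text
    driver.switch_to.default_content()  # switch back out when done
    return text
```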

Approach 4: Use a Scraping API (When You Don't Want to Manage Browsers)

⚠️ Running headless browsers at scale is expensive. Each page load takes 2-5 seconds and eats RAM. If you're scraping hundreds or thousands of pages, managing browser pools, proxies, and retries becomes a full infrastructure problem.

Scraping APIs solve this by handling the browser rendering, proxy rotation, and anti-bot bypass on their infrastructure. You send a URL, you get back the fully rendered HTML (or structured data).

Here's how to scrape a JavaScript-rendered page with Haunt API:

import requests

response = requests.get(
    "https://hauntapi.com/scrape",
    params={
        "url": "https://example.com/products",
        "render_js": "true"  # Enables full JS rendering
    },
    headers={"Authorization": "Bearer YOUR_API_KEY"}
)

# You get back the fully rendered HTML — JS has already executed
html = response.json()["content"]

Or if you want structured data without parsing HTML yourself:

response = requests.get(
    "https://hauntapi.com/extract",
    params={
        "url": "https://example.com/products",
        "prompt": "Extract product names and prices as a JSON array"
    },
    headers={"Authorization": "Bearer YOUR_API_KEY"}
)

products = response.json()["data"]
# Returns: [{"name": "Widget A", "price": "$29.99"}, ...]

The AI extraction endpoint handles JavaScript rendering and parsing in one call. No CSS selectors, no HTML parsing, no broken scrapers when the site redesigns.

Performance Comparison

| Approach        | Speed                 | Cost         | Maintenance                               |
|-----------------|-----------------------|--------------|-------------------------------------------|
| Direct API call | Fast (~100ms)         | Free         | Low — until they change the API           |
| Playwright      | Slow (~3-5s per page) | Server costs | High — browser updates, proxies, anti-bot |
| Selenium        | Slow (~4-6s per page) | Server costs | High — same issues as Playwright          |
| Scraping API    | Medium (~2-3s)        | Per-request  | None — they handle infra                  |

Common Pitfalls When Scraping JS Pages

1. Not Waiting Long Enough

The most common mistake. The page loads, your script grabs the DOM immediately, and gets empty content. Always use explicit waits — wait for the specific element you need, not arbitrary time.sleep() calls.

# Bad — race condition
page.goto("https://example.com/products")
products = page.query_selector_all(".product-card")  # Might be empty!

# Good — wait for content
page.goto("https://example.com/products")
page.wait_for_selector(".product-card", timeout=10000)
products = page.query_selector_all(".product-card")

2. Ignoring Pagination

JS-rendered sites often use infinite scroll or "Load More" buttons. You need to handle these explicitly:

from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()
    page.goto("https://example.com/products")
    page.wait_for_selector(".product-card")

    # Scroll to load more content
    for _ in range(5):  # Load 5 pages worth
        page.evaluate("window.scrollTo(0, document.body.scrollHeight)")
        page.wait_for_timeout(1500)  # Wait for new content to load

    products = page.query_selector_all(".product-card")
    print(f"Loaded {len(products)} products")
    browser.close()
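The "Load More" variant works the same way, but clicks instead of scrolls. A sketch; `page` is a live Playwright page, and the button selector is a placeholder:

```python
def click_load_more(page, max_clicks: int = 5) -> None:
    """Click a 'Load More' button until it disappears or the click
    budget runs out. `page` is a live Playwright page; the selector
    is an illustrative placeholder."""
    for _ in range(max_clicks):
        button = page.query_selector("button.load-more")
        if button is None or not button.is_visible():
            break  # nothing left to load
        button.click()
        page.wait_for_timeout(1500)  # let the next batch render
```

Capping the loop matters: an infinite-scroll page with no natural end will otherwise run forever.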

3. Getting Blocked by Anti-Bot Systems

Headless browsers leave fingerprints. Sites use Cloudflare, Datadome, PerimeterX, and similar services to detect and block automated browsers. Symptoms include CAPTCHA pages, 403 errors, or being served fake data.

Mitigation strategies:

- Use a realistic fingerprint: real user agent, common viewport, matching locale and timezone
- Rotate residential or mobile proxies rather than datacenter IPs
- Slow down: randomize delays between requests instead of hammering at a fixed rate
- Try stealth plugins (e.g. playwright-stealth) that patch common headless tells
- For heavily protected sites, offload to a scraping API that maintains the bypass for you
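The fingerprint and pacing points can be sketched in a few lines. The specific values below are typical examples, not magic numbers, and they reduce detection rather than eliminate it:

```python
import random
import time

def realistic_context_args() -> dict:
    """Playwright context options that make headless Chromium look more
    like an ordinary desktop browser. Values are typical examples."""
    return {
        "user_agent": (
            "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
            "AppleWebKit/537.36 (KHTML, like Gecko) "
            "Chrome/120.0.0.0 Safari/537.36"
        ),
        "viewport": {"width": 1366, "height": 768},
        "locale": "en-US",
        "timezone_id": "America/New_York",
    }

def human_delay(low: float = 1.0, high: float = 3.0) -> None:
    """Sleep a randomized interval so requests don't arrive at a
    machine-perfect rate."""
    time.sleep(random.uniform(low, high))

# Usage with Playwright:
# context = browser.new_context(**realistic_context_args())
```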

4. Memory Leaks with Browser Pools

If you're running Playwright at scale, you need to manage browser contexts carefully. Each page that isn't properly closed leaks memory. Use context managers and always close browsers in a finally block:

from playwright.sync_api import sync_playwright

def scrape(url):
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        try:
            page = browser.new_page()
            page.goto(url, wait_until="networkidle")
            return page.content()
        finally:
            browser.close()  # Always clean up
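At scale, launching a fresh browser per URL is wasteful; one browser process can serve many isolated contexts instead. A sketch of that pattern, with the Playwright import deferred so the definition stands alone:

```python
def scrape_many(urls):
    """Render a batch of URLs with a single browser process, one fresh
    context per URL so cookies and storage never leak between pages.
    Requires Playwright with Chromium installed; the import is deferred
    so defining this function needs neither."""
    from playwright.sync_api import sync_playwright

    results = {}
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        try:
            for url in urls:
                context = browser.new_context()  # isolated, cheap session
                try:
                    page = context.new_page()
                    page.goto(url, wait_until="networkidle")
                    results[url] = page.content()
                finally:
                    context.close()  # releases the page's memory
        finally:
            browser.close()
    return results
```

Contexts are much cheaper to create than browsers, so this amortizes the expensive launch across the whole batch while keeping every page's state isolated.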

When to Use Which Approach

Use the direct API approach when the site exposes clean JSON endpoints. This is always the fastest and cheapest option.

Use Playwright when you need fine-grained control over browser interactions (clicking, scrolling, filling forms) and you're running a small to medium number of requests.

Use a scraping API when you're extracting data at scale, don't want to maintain browser infrastructure, or need to bypass anti-bot protections without building that expertise yourself.

The honest truth: most developers start with Playwright, spend two weeks fighting anti-bot detection and proxy rotation, then switch to an API. There's no shame in it. Browser automation is a solved problem — your time is better spent on the data itself.

Need to scrape JavaScript-rendered pages without the headache?

Haunt API handles rendering, anti-bot bypass, and data extraction. Free tier included.

Try Haunt API Free →