You write the scraper. You run it. requests.get() returns the HTML. You pass it to BeautifulSoup. And... nothing. Empty divs. No product data. No prices. Just a skeleton page with a bundle.js script tag and a <div id="root"></div>.
Sound familiar? You've hit a JavaScript-rendered page. The HTML you're seeing is a shell — the actual content is loaded by JavaScript after the initial page load. And your HTTP client never runs JavaScript.
This is one of the most common walls developers hit when web scraping. Here's how to get past it.
Modern web apps — React, Vue, Next.js (client-side), Angular — render content in the browser. The server sends a minimal HTML document with a JavaScript bundle. The browser downloads the bundle, executes it, and the content appears.
Your HTTP client (requests, httpx, aiohttp) doesn't execute JavaScript. It gets the raw HTML and stops. No rendering. No DOM construction. No data.
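A quick sanity check before you start debugging selectors: fetch the page and look for shell markers. This is a rough heuristic, not a guarantee; the mount-point IDs and the word-count threshold below are guesses you should tune for your targets.

```python
import re

def looks_js_rendered(html):
    """Rough heuristic: an empty mount point plus almost no visible text."""
    markers = ('id="root"', 'id="app"', 'id="__next"')
    has_mount = any(m in html for m in markers)
    # Crudely strip scripts and tags, then count what a reader would see
    text = re.sub(r"<script[\s\S]*?</script>", "", html)
    text = re.sub(r"<[^>]+>", " ", text)
    return has_mount and len(text.split()) < 50
```

If this returns `True` for the HTML `requests.get()` gave you, the content is almost certainly rendered client-side.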
Clues you're dealing with a JS-rendered page:
- `<div id="app"></div>` or `<div id="root"></div>` with nothing inside
- `<script>` tags pointing to large `.js` bundles
- `requests.get()` returns markup with none of the content you see in the browser

Before reaching for a headless browser, check if the page loads data via an API call. Open DevTools → Network tab → filter by XHR or Fetch. Reload the page. You'll often see a JSON endpoint that returns exactly the data you want.
```python
import requests

# Instead of scraping the rendered HTML, hit the API directly
response = requests.get(
    "https://example.com/api/products",
    headers={"Accept": "application/json"},
)
data = response.json()

for product in data["products"]:
    print(f"{product['name']}: ${product['price']}")
```
This is faster, more reliable, and uses less bandwidth than any browser-based approach. But it only works if the site exposes a clean API — and many don't, or they protect it with auth tokens that expire.
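When the endpoint exists but is guarded, you can often get surprisingly far by replaying the same headers the site's own frontend sends. A sketch; every header value and URL here is a placeholder, so copy the real ones from the request DevTools shows (Network tab → the XHR call → Headers):

```python
import requests

def api_session(token):
    """Session preloaded with the headers the site's frontend sends.

    Header names and values are assumptions; copy the real ones
    from the XHR request visible in DevTools.
    """
    s = requests.Session()
    s.headers.update({
        "Accept": "application/json",
        "X-Requested-With": "XMLHttpRequest",  # some backends gate on this
        "Referer": "https://example.com/products",
        "Authorization": f"Bearer {token}",    # short-lived: refresh when it 401s
    })
    return s

# Usage (hypothetical endpoint):
# data = api_session("YOUR_TOKEN").get("https://example.com/api/products").json()
```

When the token expires you'll start seeing 401s; at that point you either automate the token refresh or fall back to a browser.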
When there's no clean API to hit, you need a real browser. Playwright is the best headless browser automation library for Python in 2026. It renders pages exactly like Chrome does.
```bash
pip install playwright
playwright install chromium
```
```python
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()
    page.goto("https://example.com/products")

    # Wait for the content to actually render
    page.wait_for_selector(".product-card")

    # Extract data from the fully rendered DOM
    products = page.query_selector_all(".product-card")
    for product in products:
        name = product.query_selector(".name").inner_text()
        price = product.query_selector(".price").inner_text()
        print(f"{name}: {price}")

    browser.close()
```
`wait_for_selector()` is critical. Without it, you'll extract content before JS finishes rendering. Always wait for the specific element you need. If the page makes multiple sequential requests, use `page.wait_for_load_state("networkidle")`, which waits until the network has gone quiet.

Selenium has been around forever. It works, but Playwright is faster and more reliable for scraping specifically.
```python
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

options = Options()
options.add_argument("--headless=new")
options.add_argument("--disable-gpu")
driver = webdriver.Chrome(options=options)

driver.get("https://example.com/products")

# Wait for JS to render
WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.CSS_SELECTOR, ".product-card"))
)

cards = driver.find_elements(By.CSS_SELECTOR, ".product-card")
for card in cards:
    print(card.text)

driver.quit()
```
Selenium's main advantage is ecosystem maturity — if you need to interact with complex UI elements (dropdowns, iframes, file uploads), the documentation and Stack Overflow answers are extensive. For pure extraction though, Playwright is the better choice.
⚠️ Running headless browsers at scale is expensive. Each page load takes 2-5 seconds and eats RAM. If you're scraping hundreds or thousands of pages, managing browser pools, proxies, and retries becomes a full infrastructure problem.
Scraping APIs solve this by handling the browser rendering, proxy rotation, and anti-bot bypass on their infrastructure. You send a URL, you get back the fully rendered HTML (or structured data).
Here's how to scrape a JavaScript-rendered page with Haunt API:
```python
import requests

response = requests.get(
    "https://hauntapi.com/scrape",
    params={
        "url": "https://example.com/products",
        "render_js": "true",  # Enables full JS rendering
    },
    headers={"Authorization": "Bearer YOUR_API_KEY"},
)

# You get back the fully rendered HTML — JS has already executed
html = response.json()["content"]
```
Or if you want structured data without parsing HTML yourself:
```python
response = requests.get(
    "https://hauntapi.com/extract",
    params={
        "url": "https://example.com/products",
        "prompt": "Extract product names and prices as a JSON array",
    },
    headers={"Authorization": "Bearer YOUR_API_KEY"},
)

products = response.json()["data"]
# Returns: [{"name": "Widget A", "price": "$29.99"}, ...]
```
The AI extraction endpoint handles JavaScript rendering and parsing in one call. No CSS selectors, no HTML parsing, no broken scrapers when the site redesigns.
| Approach | Speed | Cost | Maintenance |
|---|---|---|---|
| Direct API call | Fast (~100ms) | Free | Low — until they change the API |
| Playwright | Slow (~3-5s per page) | Server costs | High — browser updates, proxies, anti-bot |
| Selenium | Slow (~4-6s per page) | Server costs | High — same issues as Playwright |
| Scraping API | Medium (~2-3s) | Per-request | None — they handle infra |
The most common mistake: the page loads, your script grabs the DOM immediately, and gets empty content. Always use explicit waits — wait for the specific element you need, not arbitrary `time.sleep()` calls.
```python
# Bad — race condition
page.goto("https://example.com/products")
products = page.query_selector_all(".product-card")  # Might be empty!

# Good — wait for content
page.goto("https://example.com/products")
page.wait_for_selector(".product-card", timeout=10000)
products = page.query_selector_all(".product-card")
```
JS-rendered sites often use infinite scroll or "Load More" buttons. You need to handle these explicitly:
```python
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()
    page.goto("https://example.com/products")
    page.wait_for_selector(".product-card")

    # Scroll to load more content
    for _ in range(5):  # Load 5 pages' worth
        page.evaluate("window.scrollTo(0, document.body.scrollHeight)")
        page.wait_for_timeout(1500)  # Wait for new content to load

    products = page.query_selector_all(".product-card")
    print(f"Loaded {len(products)} products")

    browser.close()
```
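For "Load More" buttons, the same idea applies: click until the button disappears. A sketch with a hypothetical helper (the selector, cap, and pause are assumptions) that works with any Playwright-style page object:

```python
def click_load_more(page, selector=".load-more", max_clicks=10, pause_ms=1500):
    """Click a 'Load More' button until it disappears or a cap is hit.

    Returns how many times the button was clicked.
    """
    clicks = 0
    while clicks < max_clicks:
        button = page.query_selector(selector)
        if button is None or not button.is_visible():
            break  # Button gone: everything is loaded
        button.click()
        page.wait_for_timeout(pause_ms)  # Let the new batch render
        clicks += 1
    return clicks
```

After the loop, `page.query_selector_all(".product-card")` picks up everything that was appended.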
Headless browsers leave fingerprints. Sites use Cloudflare, Datadome, PerimeterX, and similar services to detect and block automated browsers. Symptoms include CAPTCHA pages, 403 errors, or being served fake data.
Mitigation strategies:
- `playwright-stealth` or equivalent patches

If you're running Playwright at scale, you need to manage browser contexts carefully. Each page that isn't properly closed leaks memory. Use context managers and always close browsers in a `finally` block:
```python
from playwright.sync_api import sync_playwright

def scrape(url):
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        try:
            page = browser.new_page()
            page.goto(url, wait_until="networkidle")
            return page.content()
        finally:
            browser.close()  # Always clean up
```
Use the direct API approach when the site exposes clean JSON endpoints. This is always the fastest and cheapest option.
Use Playwright when you need fine-grained control over browser interactions (clicking, scrolling, filling forms) and you're running a small to medium number of requests.
Use a scraping API when you're extracting data at scale, don't want to maintain browser infrastructure, or need to bypass anti-bot protections without building that expertise yourself.
The honest truth: most developers start with Playwright, spend two weeks fighting anti-bot detection and proxy rotation, then switch to an API. There's no shame in it. Browser automation is a solved problem — your time is better spent on the data itself.
Need to scrape JavaScript-rendered pages without the headache?
Haunt API handles rendering, anti-bot bypass, and data extraction. Free tier included.
Try Haunt API Free →