
How to Scrape Websites Without Getting Blocked in Python (2026 Guide)

April 11, 2026 · 8 min read

Every web scraper has been there: you write a beautiful Python script, fire it off, and within minutes your IP is banned, you're staring at CAPTCHAs, or the site is returning empty responses. Anti-bot systems have become remarkably sophisticated in 2026.

Here are 7 techniques that actually work right now, ranked from simplest to most robust.

1. Rotate Your User-Agent

The User-Agent header is the most basic check sites perform. If your scraper identifies itself as python-requests/2.31, you're telling them exactly what you are.

import requests
from random import choice

USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/122.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 14_3) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.2 Safari/605.1.15",
    "Mozilla/5.0 (X11; Linux x86_64; rv:123.0) Gecko/20100101 Firefox/123.0",
]

headers = {"User-Agent": choice(USER_AGENTS)}
response = requests.get("https://example.com", headers=headers)

This alone gets you past a surprising share of naive blocks, but it won't fool anyone serious.

2. Add Random Delays

Humans don't request 50 pages per second. Neither should your scraper.

import time, random

for url in urls:
    response = requests.get(url, headers=headers)
    time.sleep(random.uniform(1.5, 4.0))  # Random delay between requests

Simple but effective against rate-limiting. The tradeoff? Your scrape takes way longer.
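
A common refinement is to back off exponentially when the site starts returning HTTP 429 (Too Many Requests) instead of hammering it at a fixed pace. Here's a minimal sketch (the function name and retry parameters are illustrative, not from any particular library):

```python
import random
import time

import requests

def get_with_backoff(url, headers=None, max_retries=4):
    """Fetch a URL, doubling the wait after each HTTP 429 response."""
    delay = 2.0
    for attempt in range(max_retries):
        response = requests.get(url, headers=headers)
        if response.status_code != 429:
            return response
        # Rate-limited: wait longer each time, with jitter so the
        # retry pattern itself doesn't look robotic
        time.sleep(delay + random.uniform(0, 1))
        delay *= 2
    return response
```

Jitter matters here: perfectly regular retry intervals are themselves a bot signal.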

3. Use Rotating Proxies

When a site blocks your IP, rotating through different IPs keeps you going. Residential proxies work best because they look like real users.

proxies = {
    "http": "http://user:pass@proxy.example.com:8080",
    "https": "http://user:pass@proxy.example.com:8080",
}
response = requests.get(url, headers=headers, proxies=proxies)

The problem: decent residential proxies cost $5-15/GB. That adds up fast at scale.
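
To actually rotate, you cycle through a pool rather than pinning every request to one endpoint. A minimal sketch — the proxy hostnames and credentials below are placeholders for whatever your provider gives you:

```python
from itertools import cycle

import requests

# Placeholder endpoints -- substitute your proxy provider's credentials
PROXY_POOL = cycle([
    "http://user:pass@proxy1.example.com:8080",
    "http://user:pass@proxy2.example.com:8080",
    "http://user:pass@proxy3.example.com:8080",
])

def get_via_proxy(url, headers=None):
    """Route each request through the next proxy in the pool."""
    proxy = next(PROXY_POOL)
    return requests.get(
        url,
        headers=headers,
        proxies={"http": proxy, "https": proxy},
        timeout=15,
    )
```

`itertools.cycle` loops over the pool forever, so each call transparently picks up where the last one left off.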

4. Use Sessions and Cookies

Some sites track session state. requests.Session() persists cookies across requests, making your traffic look like a continuous browsing session rather than a series of cold, unrelated hits.

session = requests.Session()
session.headers.update({"User-Agent": choice(USER_AGENTS)})

# First request (might set cookies)
session.get("https://example.com")
# Subsequent requests carry those cookies
response = session.get("https://example.com/data")
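
A session is also the natural place to hang automatic retries. requests exposes urllib3's Retry through HTTPAdapter, so transient failures and rate limits get retried with backoff without any extra code in your loop — a sketch of the wiring (the specific retry counts are arbitrary):

```python
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

session = requests.Session()

# Retry transient failures up to 3 times, backing off between attempts
retry = Retry(
    total=3,
    backoff_factor=1,  # waits roughly 1s, 2s, 4s between retries
    status_forcelist=[429, 500, 502, 503],
)
adapter = HTTPAdapter(max_retries=retry)
session.mount("http://", adapter)
session.mount("https://", adapter)
```

After mounting, every `session.get(...)` through either scheme inherits the retry policy.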

5. Headless Browsers (Playwright/Selenium)

When sites use JavaScript rendering, simple HTTP requests won't cut it. Playwright renders the full page like a real browser.

from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()
    page.goto("https://example.com")
    content = page.content()
    browser.close()

Powerful but slow and resource-heavy. Each browser instance uses 100-300MB RAM. Not great for scraping at scale.

6. Respect robots.txt and Scrape During Off-Peak Hours

Basic etiquette that also reduces your chance of getting caught. Check robots.txt and schedule heavy scrapes for off-peak hours (typically 2-6 AM in the target site's timezone).
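
The standard library can do the robots.txt check for you via urllib.robotparser. The sketch below parses an inline example body for clarity; in practice you'd fetch `https://<site>/robots.txt` first and feed its text in:

```python
from urllib.robotparser import RobotFileParser

def allowed_to_fetch(robots_txt, user_agent, page_url):
    """Return True if robots.txt permits user_agent to fetch page_url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, page_url)

# Example robots.txt body: everything allowed except /admin/
robots = """\
User-agent: *
Disallow: /admin/
"""
```

Checking before each crawl target costs almost nothing and keeps you off the paths the site has explicitly asked bots to avoid.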

7. Use a Web Extraction API

The most reliable approach for 2026: let a dedicated service handle the anti-bot arms race for you. Services like Haunt API manage proxy rotation, JavaScript rendering, and Cloudflare bypass automatically.

import requests

response = requests.post(
    "https://hauntapi.com/v1/extract",
    headers={"X-API-Key": "your-api-key"},
    json={"url": "https://example.com"}
)
data = response.json()

One API call. No proxy management, no headless browsers to maintain, no blocks. The free tier gives you 100 requests/month to test it out.

Stop Fighting Blocks. Start Extracting Data.

Haunt API handles Cloudflare bypass, JavaScript rendering, and proxy rotation automatically. Get 100 free requests/month.

Try Haunt API Free →

Quick Comparison

The right approach depends on your scale and budget. For quick scripts, rotating user-agents and delays work fine. For production scraping at scale, an API-based approach saves hours of maintenance.

Got questions about web scraping? Check out Haunt API or find us on RapidAPI.