Every web scraper has been there. You write a beautiful Python script, fire it off, and within minutes your IP is banned, you're staring at CAPTCHAs, or the site returns empty responses. Anti-bot systems have gotten really good in 2026.
Here are 7 techniques that actually work right now, ranked from simplest to most robust.
1. Rotate your user agent. The User-Agent header is the most basic check sites perform. If your scraper identifies itself as python-requests/2.31, you're telling them exactly what you are.
import requests
from random import choice

USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 Chrome/122.0.0.0",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 14_3) AppleWebKit/605.1.15 Safari/17.2",
    "Mozilla/5.0 (X11; Linux x86_64; rv:123.0) Gecko/20100101 Firefox/123.0",
]

headers = {"User-Agent": choice(USER_AGENTS)}
response = requests.get("https://example.com", headers=headers)
This alone gets you past ~30% of basic blocks. But it won't fool anyone serious.
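Going a step further, real browsers send more than just a User-Agent. A minimal sketch of a fuller browser-like header set (the specific values here are plausible examples, not tied to any particular browser build):

```python
import requests
from random import choice

def browser_headers(user_agents):
    # Build a header set that resembles what a real browser sends.
    # Values below are illustrative examples, not exact browser output.
    return {
        "User-Agent": choice(user_agents),
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
        "Accept-Language": "en-US,en;q=0.9",
        "Accept-Encoding": "gzip, deflate, br",
        "Referer": "https://www.google.com/",
        "Connection": "keep-alive",
    }

# response = requests.get("https://example.com", headers=browser_headers(USER_AGENTS))
```

A bare User-Agent with no Accept-Language or Referer is itself a tell, so sending the full set costs nothing and closes an easy fingerprinting gap.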
2. Add random delays. Humans don't request 50 pages per second. Neither should your scraper.
import random
import time

for url in urls:
    response = requests.get(url, headers=headers)
    time.sleep(random.uniform(1.5, 4.0))  # random delay between requests
Simple but effective against rate-limiting. The tradeoff? Your scrape takes way longer.
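If you still hit a rate limit, backing off exponentially on 429 responses is a common refinement. A sketch under that assumption (the helper names here are my own, not from any library):

```python
import random
import time
import requests

def backoff_delays(max_retries=4, base=1.0, cap=30.0):
    # Exponential backoff schedule with jitter: roughly 1s, 2s, 4s, 8s,
    # each plus up to 1s of random noise, capped at `cap` seconds.
    return [min(cap, base * (2 ** i)) + random.uniform(0, 1) for i in range(max_retries)]

def get_with_backoff(url, headers=None, max_retries=4):
    response = requests.get(url, headers=headers)
    for delay in backoff_delays(max_retries):
        if response.status_code != 429:  # not rate-limited: done
            return response
        time.sleep(delay)  # wait before retrying
        response = requests.get(url, headers=headers)
    return response
```

The jitter matters: retries at exact power-of-two intervals are themselves a bot signature.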
3. Rotate proxies. When a site blocks your IP, rotating through different IPs keeps you going. Residential proxies work best because they look like real users.
proxies = {
    "http": "http://user:pass@proxy.example.com:8080",
    "https": "http://user:pass@proxy.example.com:8080",
}
response = requests.get(url, headers=headers, proxies=proxies)
The problem: decent residential proxies cost $5-15/GB. That adds up fast at scale.
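A single proxy isn't rotation. To actually rotate, keep a pool and cycle through it per request. A minimal sketch, assuming you substitute your provider's real endpoints for the placeholder hostnames below:

```python
import itertools
import requests

# Placeholder endpoints -- swap in your proxy provider's credentials.
PROXY_POOL = [
    "http://user:pass@proxy1.example.com:8080",
    "http://user:pass@proxy2.example.com:8080",
    "http://user:pass@proxy3.example.com:8080",
]
proxy_cycle = itertools.cycle(PROXY_POOL)

def next_proxies():
    # Hand back the next proxy in round-robin order, wrapping at the end.
    proxy = next(proxy_cycle)
    return {"http": proxy, "https": proxy}

# response = requests.get(url, headers=headers, proxies=next_proxies())
```

Round-robin is the simplest policy; many scrapers instead pick randomly or retire a proxy from the pool after it gets blocked.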
4. Maintain sessions. Some sites track session state. Using requests.Session() maintains cookies across requests and makes you look like a real browsing session.
session = requests.Session()
session.headers.update({"User-Agent": choice(USER_AGENTS)})
# First request (might set cookies)
session.get("https://example.com")
# Subsequent requests carry those cookies
response = session.get("https://example.com/data")
5. Render JavaScript with a headless browser. When sites rely on JavaScript rendering, simple HTTP requests won't cut it. Playwright renders the full page like a real browser.
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()
    page.goto("https://example.com")
    content = page.content()
    browser.close()
Powerful but slow and resource-heavy. Each browser instance uses 100-300MB RAM. Not great for scraping at scale.
6. Scrape politely. Basic etiquette also reduces your chance of getting caught. Check robots.txt before crawling, and schedule heavy scrapes for off-peak hours (typically 2-6 AM in the target site's timezone).
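Python's standard library can do the robots.txt check for you. A minimal sketch using urllib.robotparser (the helper function name is mine):

```python
from urllib.robotparser import RobotFileParser

def allowed(robots_txt, user_agent, url):
    # Parse an already-fetched robots.txt body and check one URL against it.
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)

rules = """User-agent: *
Disallow: /private/
"""

# allowed(rules, "my-scraper", "https://example.com/private/page")  -> False
# allowed(rules, "my-scraper", "https://example.com/public")        -> True
```

RobotFileParser can also fetch robots.txt itself via set_url() and read(), but parsing a body you fetched yourself lets you reuse your existing headers, delays, and proxies for that request too.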
7. Use a scraping API. The most reliable approach for 2026 is to let a dedicated service handle the anti-bot arms race for you. Services like Haunt API manage proxy rotation, JavaScript rendering, and Cloudflare bypass automatically.
import requests

response = requests.post(
    "https://hauntapi.com/v1/extract",
    headers={"X-API-Key": "your-api-key"},
    json={"url": "https://example.com"},
)
data = response.json()
One API call. No proxy management, no headless browsers to maintain, no blocks. The free tier gives you 100 requests/month to test it out.
The right approach depends on your scale and budget. For quick scripts, rotating user agents and adding delays work fine. For production scraping at scale, an API-based approach saves hours of maintenance.
Got questions about web scraping? Check out Haunt API or find us on RapidAPI.