Web Scraping with Node.js in 2026: The Complete Practical Guide
Node.js is still one of the most popular choices for web scraping. The ecosystem is deep, the async model handles concurrency well, and JavaScript is the language of the web — so parsing HTML feels native.
But the scraping landscape in 2026 looks different than it did even two years ago. Cloudflare is more aggressive. Sites render everything client-side. CSS selectors break constantly. Here's an honest breakdown of every approach available to Node developers, with working code and real trade-offs.
The four approaches
| Method | Best for | Setup effort | Maintenance |
|---|---|---|---|
| HTTP + Cheerio | Static HTML, simple pages | Low | High (selectors break) |
| Puppeteer / Playwright | Dynamic SPAs, JS-rendered content | Medium | High (selectors + browser changes) |
| Scraping libraries (e.g. Crawlee) | Large-scale crawl jobs | High | Medium |
| Extraction API (e.g. Haunt) | Structured data from any site | Low | Low |
Let's walk through each with code.
1. Fetch + Cheerio: the lightweight option
If the target site serves static HTML, node-fetch (or native fetch in Node 18+) paired with cheerio is still the fastest way to get going.
```javascript
import { load } from 'cheerio';

const res = await fetch('https://example.com/products');
const html = await res.text();
const $ = load(html);

const products = [];
$('.product-card').each((_, el) => {
  products.push({
    name: $(el).find('.title').text().trim(),
    price: $(el).find('.price').text().trim(),
  });
});

console.log(products);
```
This works great, until the site redesigns its HTML, changes a class name, or moves to client-side rendering. Then every selector breaks and you're back in the DOM inspector rewriting queries.
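One cheap mitigation is to fail loudly when selectors go stale instead of quietly writing empty rows downstream. The helper below is a hypothetical sketch (`assertScrape` is not a cheerio API); it just checks that the scrape produced something and that the fields you care about are populated.

```javascript
// Hypothetical sanity check: fail fast when selectors silently stop matching.
function assertScrape(products, requiredFields = ['name', 'price']) {
  if (products.length === 0) {
    throw new Error('Scrape returned 0 items: selectors may be stale');
  }
  const incomplete = products.filter(p =>
    requiredFields.some(f => !p[f] || p[f].length === 0)
  );
  if (incomplete.length > 0) {
    throw new Error(`${incomplete.length} items missing required fields`);
  }
  return products;
}

// Usage: wrap the scraped array before persisting it.
// assertScrape(products);
```

A check like this turns a silent data-quality failure into an alert you can act on the day the markup changes, rather than weeks later.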
2. Puppeteer: when you need a real browser
Puppeteer launches a headless Chrome instance, so you get full JS execution, cookie handling, and DOM access. It's the go-to for dynamic sites.
```javascript
import puppeteer from 'puppeteer';

const browser = await puppeteer.launch(); // headless by default in recent versions
const page = await browser.newPage();
await page.goto('https://example.com/products', {
  waitUntil: 'networkidle2', // wait until network activity has (mostly) settled
});

const products = await page.evaluate(() => {
  return [...document.querySelectorAll('.product-card')].map(el => ({
    name: el.querySelector('.title')?.textContent?.trim(),
    price: el.querySelector('.price')?.textContent?.trim(),
  }));
});

console.log(products);
await browser.close();
```
The problem? You're still writing CSS selectors. Puppeteer solves the rendering problem but not the maintenance problem. When the site changes its markup, your scraper dies.
Also: Puppeteer is slow and heavy on resources. Each browser instance uses 50-150 MB of RAM. Scale that to 100 concurrent pages and you're burning through server resources fast.
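A common way to contain that cost is to launch one browser and bound how many pages are alive at once. Below is a minimal hand-rolled concurrency limiter, a stand-in for a library like p-limit; the Puppeteer usage at the end is an untested sketch, not a verified recipe.

```javascript
// Run `worker` over `items` with at most `limit` tasks in flight at a time.
async function mapLimit(items, limit, worker) {
  const results = new Array(items.length);
  let next = 0;
  async function lane() {
    while (next < items.length) {
      const i = next++; // claim the next index (single-threaded JS, no race)
      results[i] = await worker(items[i], i);
    }
  }
  await Promise.all(
    Array.from({ length: Math.min(limit, items.length) }, lane)
  );
  return results;
}

// Sketch: one shared browser, at most 5 pages open at once.
// const browser = await puppeteer.launch();
// const titles = await mapLimit(urls, 5, async url => {
//   const page = await browser.newPage();
//   try {
//     await page.goto(url, { waitUntil: 'networkidle2' });
//     return await page.evaluate(() => document.title);
//   } finally {
//     await page.close(); // always release the page, even on failure
//   }
// });
```

Pages within one browser share a process pool, so this keeps memory roughly flat instead of scaling linearly with URL count.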
Playwright vs Puppeteer
Playwright (by Microsoft) is the newer alternative. It supports Chromium, Firefox, and WebKit. For scraping specifically, the main differences:
- Auto-waiting: Playwright waits for elements more intelligently out of the box
- Multi-browser: Useful when sites fingerprint Chrome specifically
- Network interception: both support it, but Playwright's API is slightly cleaner
For scraping, either works. Playwright has more momentum in 2026. The selector maintenance problem is identical.
3. Crawlee: for serious crawl jobs
If you need to crawl thousands of pages, handle rate limiting, manage queues, and retry failures, crawlee (from the Apify team) is purpose-built for this.
```javascript
import { CheerioCrawler } from 'crawlee';

const crawler = new CheerioCrawler({
  async requestHandler({ $, request, enqueueLinks, log }) {
    const title = $('h1').text();
    log.info(`${request.url}: ${title}`);

    // Enqueue more product pages for crawling
    await enqueueLinks({
      globs: ['https://example.com/products/*'],
    });
  },
  maxRequestsPerMinute: 60, // rate-limit yourself
});

await crawler.run(['https://example.com/products']);
```
Crawlee handles the plumbing well. But it's overkill for extracting data from a handful of pages, and you're still writing selectors.
4. Extraction APIs: skip the plumbing entirely
This is where the scraping world is heading in 2026. Instead of maintaining selectors, you send a URL and describe what you want in plain English. The API handles rendering, Cloudflare bypass, and returns structured JSON.
With Haunt API:
```javascript
const response = await fetch('https://hauntapi.com/v1/extract', {
  method: 'POST',
  headers: {
    'Content-Type': 'application/json',
    'X-RapidAPI-Key': 'YOUR_KEY',
    'X-RapidAPI-Host': 'haunt-web-extractor.p.rapidapi.com',
  },
  body: JSON.stringify({
    url: 'https://example.com/products',
    prompt: 'Extract the product name, price, and availability for each product listed on this page.',
  }),
});

const data = await response.json();
console.log(data.result);
```
No selectors. No browser management. No Cloudflare handling. You describe the data you want, you get it back as structured JSON.
This approach trades fine-grained control for reliability. If you need pixel-perfect scraping of a single site you own, Puppeteer gives you more control. If you need to extract data from sites you don't control, an extraction API is more maintainable.
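In production you'd also want a thin retry layer around the call: extraction APIs return 429s and transient 5xxs under load like anything else. The wrapper below is a generic sketch, not part of any Haunt SDK; `doFetch` is injectable so the policy can be exercised without touching the network.

```javascript
// Retry on 429 and 5xx with exponential backoff; fail fast on other errors.
async function withRetry(doFetch, { retries = 3, baseDelayMs = 500 } = {}) {
  for (let attempt = 0; ; attempt++) {
    const res = await doFetch();
    if (res.ok) return res;
    const retryable = res.status === 429 || res.status >= 500;
    if (attempt >= retries || !retryable) {
      throw new Error(`Extraction failed: HTTP ${res.status}`);
    }
    // Back off: 500ms, 1s, 2s, ...
    await new Promise(r => setTimeout(r, baseDelayMs * 2 ** attempt));
  }
}

// Usage sketch (requestOptions = the method/headers/body object shown earlier):
// const response = await withRetry(() =>
//   fetch('https://hauntapi.com/v1/extract', requestOptions)
// );
```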
Anti-blocking strategies for Node.js
Regardless of which approach you use, you'll eventually hit blocks. Here's what works in 2026:
- Rotate User-Agent strings. Sites check these. Use a library like user-agents on npm.
- Respect rate limits. Even aggressive scrapers benefit from 1-2 second delays between requests.
- Use residential proxies for anything serious. Datacenter IPs get flagged fast.
- Handle JavaScript challenges. Cloudflare's "Checking your browser" page requires a real browser execution environment.
- Mimic human behaviour. Random delays, scrolling patterns, mouse movement: Puppeteer-extra has plugins for this.
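The first two items are easy to wire up yourself. The helpers below are illustrative: the hardcoded UA strings are a tiny stand-in pool (a library like user-agents gives you thousands), and the fetch at the bottom is an untested sketch.

```javascript
// A small hardcoded pool of example User-Agent strings.
const USER_AGENTS = [
  'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0 Safari/537.36',
  'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0 Safari/537.36',
];

// Pick a random UA from the pool for each request.
function randomUserAgent() {
  return USER_AGENTS[Math.floor(Math.random() * USER_AGENTS.length)];
}

// Jittered delay in the 1-2 second range suggested above.
function jitterMs(min = 1000, max = 2000) {
  return min + Math.floor(Math.random() * (max - min));
}

const sleep = ms => new Promise(r => setTimeout(r, ms));

// Usage sketch between requests:
// await sleep(jitterMs());
// const res = await fetch(url, { headers: { 'User-Agent': randomUserAgent() } });
```

Randomizing both the header and the pacing matters: a fixed UA with perfectly regular intervals is one of the easiest bot signatures to detect.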
For more on this, see our deeper guide: How to Scrape Websites Without Getting Blocked.
Which approach should you pick?
| Your situation | Use this |
|---|---|
| Quick one-off scrape, static HTML | fetch + cheerio |
| Dynamic site, need JS rendering | Puppeteer or Playwright |
| Large-scale crawl (1000+ pages) | Crawlee |
| Data from sites you don't control | Extraction API (Haunt) |
| Don't want to maintain selectors ever | Extraction API (Haunt) |
The maintenance problem nobody talks about
Here's the thing about web scraping that tutorials don't mention: the code you write today will break. Maybe in a week, maybe in three months. Sites change their HTML. They add CAPTCHAs. They move to SPAs. Your selectors stop matching.
I've seen teams spend more time maintaining scrapers than building features. The selector treadmill is real. Every time the target site updates, someone has to open DevTools, find the new class names, update the code, deploy, test.
Extraction APIs exist to solve exactly this problem. By describing what you want instead of where to find it, you decouple your code from the site's HTML structure. When the site redesigns, the API adapts. Your integration doesn't break.
Stop writing selectors. Start describing data.
Haunt API extracts structured JSON from any URL. No browser management, no CSS selectors, no maintenance treadmill. Start free with 100 requests on RapidAPI.
Try Haunt Free on RapidAPI