
Web Scraping with Node.js in 2026: The Complete Practical Guide

April 2026 · 7 min read

Node.js is still one of the most popular choices for web scraping. The ecosystem is deep, the async model handles concurrency well, and JavaScript is the language of the web — so parsing HTML feels native.

But the scraping landscape in 2026 looks different than it did even two years ago. Cloudflare is more aggressive. Sites render everything client-side. CSS selectors break constantly. Here's an honest breakdown of every approach available to Node developers, with working code and real trade-offs.

The four approaches

Method | Best for | Setup effort | Maintenance
HTTP + Cheerio | Static HTML, simple pages | Low | High (selectors break)
Puppeteer / Playwright | Dynamic SPAs, JS-rendered content | Medium | High (selectors + browser changes)
Scraping libraries (e.g. Crawlee) | Large-scale crawl jobs | High | Medium
Extraction API (e.g. Haunt) | Structured data from any site | Low | Low

Let's walk through each with code.

1. Fetch + Cheerio: the lightweight option

If the target site serves static HTML, node-fetch (or native fetch in Node 18+) paired with cheerio is still the fastest way to get going.

import { load } from 'cheerio';

const res = await fetch('https://example.com/products');
const html = await res.text();
const $ = load(html);

const products = [];
$('.product-card').each((_, el) => {
  products.push({
    name: $(el).find('.title').text().trim(),
    price: $(el).find('.price').text().trim(),
  });
});

console.log(products);

This works great — until the site redesigns their HTML, changes a class name, or moves to client-side rendering. Then every selector breaks and you're back in the DOM inspector rewriting queries.

When it breaks: Cloudflare-protected sites, React/Next.js SPAs, pagination via JS, login walls. You'll get empty responses or CAPTCHA pages.
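One cheap defense is to check the response before handing it to cheerio. The helper below is a rough heuristic sketch: the marker strings are illustrative examples of challenge-page text, not an exhaustive or authoritative list.

```javascript
// Heuristic block detection: flag a 403/429 status, or common
// challenge/CAPTCHA text in the body, before parsing the HTML.
function looksBlocked(res, html) {
  if (res.status === 403 || res.status === 429) return true;
  const markers = ['cf-challenge', 'just a moment', 'captcha', 'access denied'];
  const lower = html.toLowerCase();
  return markers.some((m) => lower.includes(m));
}
```

Bail out (or switch to a browser-based approach) when this returns true instead of silently pushing empty products.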

2. Puppeteer: when you need a real browser

Puppeteer launches a headless Chrome instance, so you get full JS execution, cookie handling, and DOM access. It's the go-to for dynamic sites.

import puppeteer from 'puppeteer';

const browser = await puppeteer.launch({ headless: true }); // new headless mode is the default in recent Puppeteer
const page = await browser.newPage();

await page.goto('https://example.com/products', {
  waitUntil: 'networkidle2'
});

const products = await page.evaluate(() => {
  return [...document.querySelectorAll('.product-card')].map(el => ({
    name: el.querySelector('.title')?.textContent?.trim(),
    price: el.querySelector('.price')?.textContent?.trim(),
  }));
});

console.log(products);
await browser.close();

The problem? You're still writing CSS selectors. Puppeteer solves the rendering problem but not the maintenance problem. When the site changes their markup, your scraper dies.

Also: Puppeteer is slow and expensive on resources. Each browser instance uses 50-150MB of RAM. Scale that to 100 concurrent pages and you're burning through server resources fast.
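If you do scale Puppeteer, reuse a single browser instance and cap how many pages are open at once. Here's a minimal, dependency-free concurrency limiter, a sketch of what libraries like p-limit provide:

```javascript
// Run `task` over `items` with at most `limit` tasks in flight at once.
// Results come back in input order.
async function mapLimit(items, limit, task) {
  const results = new Array(items.length);
  let next = 0;
  async function worker() {
    while (next < items.length) {
      const i = next++; // safe: JS is single-threaded between awaits
      results[i] = await task(items[i], i);
    }
  }
  const workers = Array.from({ length: Math.min(limit, items.length) }, worker);
  await Promise.all(workers);
  return results;
}
```

With this, `mapLimit(urls, 5, (url) => scrapePage(browser, url))` keeps you at five open pages instead of a hundred, trading throughput for a bounded memory footprint.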

Playwright vs Puppeteer

Playwright (by Microsoft) is the newer alternative. It supports Chromium, Firefox, and WebKit from a single API, ships its own browser binaries, and builds auto-waiting into its locators, while Puppeteer remains primarily Chromium-focused with a smaller API surface.

For scraping, either works. Playwright has more momentum in 2026, but the selector maintenance problem is identical.

3. Crawlee: for serious crawl jobs

If you need to crawl thousands of pages, handle rate limiting, manage queues, and retry failures, crawlee (from the Apify team) is purpose-built for this.

import { CheerioCrawler } from 'crawlee';

const crawler = new CheerioCrawler({
  async requestHandler({ $, request, log, enqueueLinks }) {
    const title = $('h1').text();
    log.info(`${request.url}: ${title}`);

    // Enqueue more product pages matching the glob
    await enqueueLinks({
      globs: ['https://example.com/products/*'],
    });
  },
  maxRequestsPerMinute: 60, // rate limit yourself
});

await crawler.run(['https://example.com/products']);

Crawlee handles the plumbing well. But it's overkill for extracting data from a handful of pages, and you're still writing selectors.

4. Extraction APIs: skip the plumbing entirely

This is where the scraping world is heading in 2026. Instead of maintaining selectors, you send a URL and describe what you want in plain English. The API handles rendering, Cloudflare bypass, and returns structured JSON.

With Haunt API:

const response = await fetch('https://hauntapi.com/v1/extract', {
  method: 'POST',
  headers: {
    'Content-Type': 'application/json',
    'X-RapidAPI-Key': 'YOUR_KEY',
    'X-RapidAPI-Host': 'haunt-web-extractor.p.rapidapi.com'
  },
  body: JSON.stringify({
    url: 'https://example.com/products',
    prompt: 'Extract the product name, price, and availability for each product listed on this page.'
  })
});

const data = await response.json();
console.log(data.result);

No selectors. No browser management. No Cloudflare handling. You describe the data you want, you get it back as structured JSON.

This approach trades fine-grained control for reliability. If you need pixel-perfect scraping of a single site you own, Puppeteer gives you more control. If you need to extract data from sites you don't control, an extraction API is more maintainable.
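Whichever approach you choose, wrap remote calls in a retry with exponential backoff: 429s and transient 5xxs are routine in scraping. A minimal sketch follows; the retry count and delay are arbitrary defaults, not guidance specific to any particular API.

```javascript
// Retry an async request function on 429/5xx responses, doubling the
// delay between attempts. `fn` must resolve to a Response-like object
// with a numeric `status`.
async function withRetry(fn, retries = 3, baseDelayMs = 500) {
  for (let attempt = 0; ; attempt++) {
    const res = await fn();
    const retryable = res.status === 429 || res.status >= 500;
    if (!retryable || attempt >= retries) return res;
    await new Promise((r) => setTimeout(r, baseDelayMs * 2 ** attempt));
  }
}
```

Usage: `const response = await withRetry(() => fetch(url, options));` — the happy path costs nothing extra, and transient failures stop waking you up.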

Anti-blocking strategies for Node.js

Regardless of which approach you use, you'll eventually hit blocks. The standard defenses still apply in 2026: rotate proxies, send realistic browser headers, throttle and randomize request timing, reuse sessions and cookies where you can, and fall back to a real browser or an extraction API when a site fingerprints headless clients.
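On the headers front, a common baseline is rotating a small pool of realistic browser headers so consecutive requests don't share an identical fingerprint. The user-agent strings below are illustrative placeholders; real ones go stale and should be refreshed periodically.

```javascript
// Pool of example browser user-agent strings (placeholders; keep current).
const USER_AGENTS = [
  'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/131.0.0.0 Safari/537.36',
  'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.4 Safari/605.1.15',
];

// Build a plausible header set with a randomly chosen user agent.
function randomHeaders() {
  const ua = USER_AGENTS[Math.floor(Math.random() * USER_AGENTS.length)];
  return {
    'User-Agent': ua,
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en-US,en;q=0.9',
  };
}
```

Pass the result straight to fetch: `await fetch(url, { headers: randomHeaders() })`. Headers alone won't beat serious bot detection, but mismatched or missing ones will get you flagged early.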

For more on this, see our deeper guide: How to Scrape Websites Without Getting Blocked.

Which approach should you pick?

Your situation | Use this
Quick one-off scrape, static HTML | fetch + cheerio
Dynamic site, need JS rendering | Puppeteer or Playwright
Large-scale crawl (1000+ pages) | Crawlee
Data from sites you don't control | Extraction API (Haunt)
Don't want to maintain selectors ever | Extraction API (Haunt)

The maintenance problem nobody talks about

Here's the thing about web scraping that tutorials don't mention: the code you write today will break. Maybe in a week, maybe in three months. Sites change their HTML. They add CAPTCHAs. They move to SPAs. Your selectors stop matching.

I've seen teams spend more time maintaining scrapers than building features. The selector treadmill is real. Every time the target site updates, someone has to open DevTools, find the new class names, update the code, deploy, test.

Extraction APIs exist to solve exactly this problem. By describing what you want instead of where to find it, you decouple your code from the site's HTML structure. When the site redesigns, the API adapts. Your integration doesn't break.

Stop writing selectors. Start describing data.

Haunt API extracts structured JSON from any URL. No browser management, no CSS selectors, no maintenance treadmill. Start free with 100 requests on RapidAPI.

Try Haunt Free on RapidAPI