Web Scraping with Node.js in 2026: The Complete Practical Guide
Node.js is still one of the most popular choices for web scraping. The ecosystem is deep, the async model handles concurrency well, and JavaScript is the language of the web — so parsing HTML feels native.
But the scraping landscape in 2026 looks different than it did even two years ago. Cloudflare is more aggressive. Sites render everything client-side. CSS selectors break constantly. Here's an honest breakdown of every approach available to Node developers, with working code and real trade-offs.
The four approaches
| Method | Best for | Setup effort | Maintenance |
|---|---|---|---|
| HTTP + Cheerio | Static HTML, simple pages | Low | High (selectors break) |
| Puppeteer / Playwright | Dynamic SPAs, JS-rendered content | Medium | High (selectors + browser changes) |
| Scraping libraries (e.g. Crawlee) | Large-scale crawl jobs | High | Medium |
| Extraction API (e.g. Haunt) | Structured data from any site | Low | Low |
Let's walk through each with code.
1. Fetch + Cheerio: the lightweight option
If the target site serves static HTML, node-fetch (or native fetch in Node 18+) paired with cheerio is still the fastest way to get going.
```javascript
import { load } from 'cheerio';

const res = await fetch('https://example.com/products');
const html = await res.text();
const $ = load(html);

const products = [];
$('.product-card').each((_, el) => {
  products.push({
    name: $(el).find('.title').text().trim(),
    price: $(el).find('.price').text().trim(),
  });
});

console.log(products);
```
This works great, until the site redesigns its HTML, changes a class name, or moves to client-side rendering. Then every selector breaks and you're back in the DOM inspector rewriting queries.
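One cheap mitigation is to fail loudly when selectors go stale instead of quietly writing empty rows downstream. The helper below is a hypothetical sketch (`assertScrape` is not a cheerio API); it just checks that the scrape produced something and that the fields you care about are populated.

```javascript
// Hypothetical sanity check: fail fast when selectors silently stop matching.
function assertScrape(products, requiredFields = ['name', 'price']) {
  if (products.length === 0) {
    throw new Error('Scrape returned 0 items: selectors may be stale');
  }
  const incomplete = products.filter(p =>
    requiredFields.some(f => !p[f] || p[f].length === 0)
  );
  if (incomplete.length > 0) {
    throw new Error(`${incomplete.length} items missing required fields`);
  }
  return products;
}

// Usage: wrap the scraped array before persisting it.
// assertScrape(products);
```

A check like this turns a silent data-quality failure into an alert you can act on the day the markup changes, rather than weeks later.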
2. Puppeteer: when you need a real browser
Puppeteer launches a headless Chrome instance, so you get full JS execution, cookie handling, and DOM access. It's the go-to for dynamic sites.
```javascript
import puppeteer from 'puppeteer';

const browser = await puppeteer.launch(); // headless by default in recent versions
const page = await browser.newPage();
await page.goto('https://example.com/products', {
  waitUntil: 'networkidle2', // wait until network activity has (mostly) settled
});

const products = await page.evaluate(() => {
  return [...document.querySelectorAll('.product-card')].map(el => ({
    name: el.querySelector('.title')?.textContent?.trim(),
    price: el.querySelector('.price')?.textContent?.trim(),
  }));
});

console.log(products);
await browser.close();
```
The problem? You're still writing CSS selectors. Puppeteer solves the rendering problem but not the maintenance problem. When the site changes its markup, your scraper dies.
Also: Puppeteer is slow and heavy on resources. Each browser instance uses 50-150 MB of RAM. Scale that to 100 concurrent pages and you're burning through server resources fast.
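A common way to contain that cost is to launch one browser and bound how many pages are alive at once. Below is a minimal hand-rolled concurrency limiter, a stand-in for a library like p-limit; the Puppeteer usage at the end is an untested sketch, not a verified recipe.

```javascript
// Run `worker` over `items` with at most `limit` tasks in flight at a time.
async function mapLimit(items, limit, worker) {
  const results = new Array(items.length);
  let next = 0;
  async function lane() {
    while (next < items.length) {
      const i = next++; // claim the next index (single-threaded JS, no race)
      results[i] = await worker(items[i], i);
    }
  }
  await Promise.all(
    Array.from({ length: Math.min(limit, items.length) }, lane)
  );
  return results;
}

// Sketch: one shared browser, at most 5 pages open at once.
// const browser = await puppeteer.launch();
// const titles = await mapLimit(urls, 5, async url => {
//   const page = await browser.newPage();
//   try {
//     await page.goto(url, { waitUntil: 'networkidle2' });
//     return await page.evaluate(() => document.title);
//   } finally {
//     await page.close(); // always release the page, even on failure
//   }
// });
```

Pages within one browser share a process pool, so this keeps memory roughly flat instead of scaling linearly with URL count.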
Playwright vs Puppeteer
Playwright (by Microsoft) is the newer alternative. It supports Chromium, Firefox, and WebKit. For scraping specifically, the main differences:
- Auto-waiting: Playwright waits for elements more intelligently out of the box
- Multi-browser: Useful when sites fingerprint Chrome specifically
- Network interception: both support it, but Playwright's API is slightly cleaner
For scraping, either works. Playwright has more momentum in 2026. The selector maintenance problem is identical.
3. Crawlee: for serious crawl jobs
If you need to crawl thousands of pages, handle rate limiting, manage queues, and retry failures, crawlee (from the Apify team) is purpose-built for this.
```javascript
import { CheerioCrawler } from 'crawlee';

const crawler = new CheerioCrawler({
  async requestHandler({ $, request, enqueueLinks, log }) {
    const title = $('h1').text();
    log.info(`${request.url}: ${title}`);

    // Enqueue more product pages for crawling
    await enqueueLinks({
      globs: ['https://example.com/products/*'],
    });
  },
  maxRequestsPerMinute: 60, // rate-limit yourself
});

await crawler.run(['https://example.com/products']);
```
Crawlee handles the plumbing well. But it's overkill for extracting data from a handful of pages, and you're still writing selectors.
4. Extraction APIs: skip the plumbing entirely
This is where the scraping world is heading in 2026. Instead of maintaining selectors, you send a URL and describe what you want in plain English. The API handles rendering, Cloudflare bypass, and returns structured JSON.
With Haunt API:
```javascript
const response = await fetch('https://hauntapi.com/v1/extract', {
  method: 'POST',
  headers: {
    'Content-Type': 'application/json',
    'X-RapidAPI-Key': 'YOUR_KEY',
    'X-RapidAPI-Host': 'haunt-web-extractor.p.rapidapi.com',
  },
  body: JSON.stringify({
    url: 'https://example.com/products',
    prompt: 'Extract the product name, price, and availability for each product listed on this page.',
  }),
});

const data = await response.json();
console.log(data.result);
```
No selectors. No browser management. No Cloudflare handling. You describe the data you want, you get it back as structured JSON.
This approach trades fine-grained control for reliability. If you need pixel-perfect scraping of a single site you own, Puppeteer gives you more control. If you need to extract data from sites you don't control, an extraction API is more maintainable.
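In production you'd also want a thin retry layer around the call: extraction APIs return 429s and transient 5xxs under load like anything else. The wrapper below is a generic sketch, not part of any Haunt SDK; `doFetch` is injectable so the policy can be exercised without touching the network.

```javascript
// Retry on 429 and 5xx with exponential backoff; fail fast on other errors.
async function withRetry(doFetch, { retries = 3, baseDelayMs = 500 } = {}) {
  for (let attempt = 0; ; attempt++) {
    const res = await doFetch();
    if (res.ok) return res;
    const retryable = res.status === 429 || res.status >= 500;
    if (attempt >= retries || !retryable) {
      throw new Error(`Extraction failed: HTTP ${res.status}`);
    }
    // Back off: 500ms, 1s, 2s, ...
    await new Promise(r => setTimeout(r, baseDelayMs * 2 ** attempt));
  }
}

// Usage sketch (requestOptions = the method/headers/body object shown earlier):
// const response = await withRetry(() =>
//   fetch('https://hauntapi.com/v1/extract', requestOptions)
// );
```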
Anti-blocking strategies for Node.js
Regardless of which approach you use, you'll eventually hit blocks. Here's what works in 2026:
- Rotate User-Agent strings. Sites check these. Use a library like user-agents on npm.
- Respect rate limits. Even aggressive scrapers benefit from 1-2 second delays between requests.
- Use residential proxies for anything serious. Datacenter IPs get flagged fast.
- Handle JavaScript challenges. Cloudflare's "Checking your browser" page requires a real browser execution environment.
- Mimic human behaviour. Random delays, scrolling patterns, mouse movement: Puppeteer-extra has plugins for this.
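The first two items are easy to wire up yourself. The helpers below are illustrative: the hardcoded UA strings are a tiny stand-in pool (a library like user-agents gives you thousands), and the fetch at the bottom is an untested sketch.

```javascript
// A small hardcoded pool of example User-Agent strings.
const USER_AGENTS = [
  'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0 Safari/537.36',
  'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0 Safari/537.36',
];

// Pick a random UA from the pool for each request.
function randomUserAgent() {
  return USER_AGENTS[Math.floor(Math.random() * USER_AGENTS.length)];
}

// Jittered delay in the 1-2 second range suggested above.
function jitterMs(min = 1000, max = 2000) {
  return min + Math.floor(Math.random() * (max - min));
}

const sleep = ms => new Promise(r => setTimeout(r, ms));

// Usage sketch between requests:
// await sleep(jitterMs());
// const res = await fetch(url, { headers: { 'User-Agent': randomUserAgent() } });
```

Randomizing both the header and the pacing matters: a fixed UA with perfectly regular intervals is one of the easiest bot signatures to detect.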
For more on this, see our deeper guide: How to Scrape Websites Without Getting Blocked.
Which approach should you pick?
| Your situation | Use this |
|---|---|
| Quick one-off scrape, static HTML | fetch + cheerio |
| Dynamic site, need JS rendering | Puppeteer or Playwright |
| Large-scale crawl (1000+ pages) | Crawlee |
| Data from sites you don't control | Extraction API (Haunt) |
| Don't want to maintain selectors ever | Extraction API (Haunt) |
The maintenance problem nobody talks about
Here's the thing about web scraping that tutorials don't mention: the code you write today will break. Maybe in a week, maybe in three months. Sites change their HTML. They add CAPTCHAs. They move to SPAs. Your selectors stop matching.
I've seen teams spend more time maintaining scrapers than building features. The selector treadmill is real. Every time the target site updates, someone has to open DevTools, find the new class names, update the code, deploy, test.
Extraction APIs exist to solve exactly this problem. By describing what you want instead of where to find it, you decouple your code from the site's HTML structure. When the site redesigns, the API adapts. Your integration doesn't break.
Stop writing selectors. Start describing data.
Haunt API extracts structured JSON from any URL. No browser management, no CSS selectors, no maintenance treadmill. Start free with 100 requests on RapidAPI.
Try Haunt Free on RapidAPI