Company Website Data Extraction API: Turn Business Pages into JSON
Most company websites are written for humans, not databases. That is fine until you need to enrich leads, compare competitors, monitor partner pages, or feed clean company facts into an internal tool.
You can scrape the HTML yourself, maintain selectors for every layout, and spend the afternoon arguing with cookie banners. Or you can send Haunt a URL and a plain-English extraction prompt, then get structured JSON back.
Use case: extract useful business facts from company websites — name, category, services, pricing hints, location, contact routes, proof points, and calls to action — without building a custom parser for every site.
What company website extraction is useful for
This pattern is useful when you already have URLs and need the page turned into structured records:
- Lead enrichment: add company category, offer, audience, pricing hints, and contact routes to a lead list.
- Competitor monitoring: track changes to pricing pages, feature pages, testimonials, or positioning.
- Partner/vendor research: extract what a service does, who it is for, and whether it has docs, pricing, or proof.
- Directory building: convert messy service pages into consistent profiles for review.
- Sales research: summarize what a company sells before writing a human, relevant note.
The data you can ask for
Because Haunt uses a natural-language prompt, you are not limited to a fixed schema. You describe the fields you want.
```json
{
  "company_name": "...",
  "website": "...",
  "category": "...",
  "one_sentence_summary": "...",
  "target_customers": ["..."],
  "products_or_services": ["..."],
  "pricing_signals": ["..."],
  "contact_routes": ["..."],
  "trust_signals": ["..."],
  "primary_call_to_action": "..."
}
```
The important bit: Haunt should return what the page supports. If the page does not mention pricing, the result should say that clearly instead of inventing a number. Fake confidence is worse than missing data.
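Downstream, it helps to check which requested fields actually came back empty so you can treat them as gaps rather than facts. A minimal sketch, where the helper name and the sample record are illustrative, not part of the Haunt API:

```python
# Flag which requested fields came back absent, null, or empty, so missing
# data stays visibly missing. Field names match the example schema above.
EXPECTED_FIELDS = [
    "company_name", "one_sentence_summary", "category",
    "target_customers", "products_or_services", "pricing_signals",
    "contact_routes", "trust_signals", "primary_call_to_action",
]

def missing_fields(record: dict) -> list:
    """Return the expected fields that are absent, null, or empty."""
    missing = []
    for field in EXPECTED_FIELDS:
        value = record.get(field)
        if value is None or value == "" or value == []:
            missing.append(field)
    return missing

# Illustrative record for a page that showed no pricing.
record = {
    "company_name": "Example Co",
    "category": "B2B SaaS",
    "pricing_signals": [],
}
print(missing_fields(record))
```

A record that comes back with `pricing_signals` in the missing list is a page that genuinely does not show pricing, which is exactly the signal you want preserved.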
Example API request
Send a URL plus a prompt describing the company facts you want:
```bash
curl -X POST https://hauntapi.com/v1/extract \
  -H "Content-Type: application/json" \
  -H "X-API-Key: YOUR_HAUNT_API_KEY" \
  -d '{
    "url": "https://example-company.com",
    "prompt": "Extract company facts for lead enrichment. Return JSON with company_name, one_sentence_summary, category, target_customers, products_or_services, pricing_signals, contact_routes, trust_signals, and primary_call_to_action. If a field is not visible on the page, return null or an empty list rather than guessing."
  }'
```
Example Python workflow
For a small batch, keep the prompt stable and swap the URL:
```python
import requests

headers = {
    "Content-Type": "application/json",
    "X-API-Key": "YOUR_HAUNT_API_KEY",
}

prompt = """
Extract company facts for lead enrichment.
Return JSON with company_name, one_sentence_summary, category,
target_customers, products_or_services, pricing_signals,
contact_routes, trust_signals, and primary_call_to_action.
Do not guess missing fields.
"""

urls = [
    "https://example-company.com",
    # ...the rest of your lead list
]

for url in urls:
    response = requests.post(
        "https://hauntapi.com/v1/extract",
        headers=headers,
        json={"url": url, "prompt": prompt},
        timeout=60,
    )
    response.raise_for_status()
    print(response.json())
```
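On a real lead list, some sites will be slow or down. A sketch of a more tolerant loop body: it skips and logs failures instead of aborting the batch. The skip-and-log policy is our choice here, not documented API behavior.

```python
import requests

def extract(url, prompt, headers):
    """POST one URL to the extract endpoint; return parsed JSON, or
    None if the request fails for any reason (timeout, DNS, HTTP error)."""
    try:
        response = requests.post(
            "https://hauntapi.com/v1/extract",
            headers=headers,
            json={"url": url, "prompt": prompt},
            timeout=60,
        )
        response.raise_for_status()
        return response.json()
    except requests.RequestException as exc:
        # One broken site should not kill the other 499 lookups.
        print(f"skipped {url}: {exc}")
        return None
```

Call `extract(url, prompt, headers)` inside the loop and collect only non-`None` results.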
Why not just parse HTML?
HTML parsing is fine when the layout is stable and you know exactly where the data lives. Company sites are usually not like that. One uses a pricing table. One hides pricing behind copy. One has the offer in the hero. One has the contact route in the footer. One ships half the page through JavaScript.
| Approach | Best for | Weak spot |
|---|---|---|
| CSS selectors | Known layout, repeated pages | Breaks when the site changes |
| Raw scraping API | Fetching HTML at scale | You still need parsing and cleanup |
| Haunt extraction | Turning varied pages into structured JSON | Best results need clear prompts and public page content |
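The brittleness in the first row is easy to demonstrate: a pattern written against one layout finds nothing in another, even when both pages state the same fact. Both HTML snippets below are invented examples.

```python
import re

# The same pricing fact under two layouts.
table_html = '<td class="price">$49/mo</td>'
copy_html = '<p>Plans start at $49 per month for small teams.</p>'

# A pattern written against the first layout, selector-style.
pattern = re.compile(r'class="price">([^<]+)<')

print(pattern.search(table_html).group(1))  # finds "$49/mo"
print(pattern.search(copy_html))            # finds nothing: None
```

Multiply that by every site on a lead list and per-site parsers stop being a quick afternoon project.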
Prompt tips for better lead-enrichment JSON
- Say exactly which fields you want.
- Tell the model not to guess missing values.
- Ask for arrays when a page may contain multiple services, audiences, or contact routes.
- Keep source-specific notes in separate fields, so your app can review them later.
- Store the URL and timestamp next to the extracted JSON. Websites change. Annoying little beasts.
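The last tip, storing the URL and timestamp next to the extracted JSON, can be as simple as an append-only JSONL file. A sketch; the file name and envelope keys are illustrative choices:

```python
import json
import time

def to_record(url, extracted):
    """Wrap an extraction result with its source URL and a UTC timestamp,
    so the record can be re-checked when the site changes."""
    return {
        "source_url": url,
        "extracted_at": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
        "data": extracted,
    }

def append_record(path, record):
    # One JSON object per line keeps appends cheap and re-reads simple.
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")

record = to_record("https://example-company.com", {"company_name": "Example Co"})
append_record("extractions.jsonl", record)
```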
When this is not the right tool
Do not use company website extraction as a magic database. If you need private CRM data, logged-in dashboards, guaranteed fresh legal filings, or verified phone numbers, use the proper source. Haunt is for extracting visible web page content into useful structure.
It is also not a permission slip to spam people. Use extracted lead context to be more relevant, not louder.
Try company website extraction
Start with one URL and one plain-English prompt. If the output is useful, batch the same prompt across your lead list.
Get a Haunt API key · Try on RapidAPI