Raw HTML is noise for an LLM. Denoize fixes that.
One API call. Point us at a page, a PDF, a doc — get back clean Markdown, structured metadata, and chunks already sized for your RAG pipeline.
no credit card · 1,000 req/mo free
The HTML goes in. The noise goes away.
<!DOCTYPE html>
<html>
<head>
  <script src="gtag.js"></script>
  <script>dataLayer.push({...})</script>
  <meta property="og:title" content="...">
  ... 180 more lines ...
</head>
<body>
  <nav class="site-nav">...</nav>
  <div class="cookie-banner">We value your privacy...</div>
  <aside class="sidebar-ads">...</aside>
  <article>
    <h1>RAG for Large Language Models: A Survey</h1>
    <p>Large Language Models showcase impressive...</p>
  </article>
  <footer>... share buttons, comments widget ...</footer>
  <script>/* 30 more trackers */</script>
</body>
</html>
{
"url": "https://arxiv.org/abs/2312.10997",
"kind": "html",
"contentType": "text/html; charset=utf-8",
"rendered": false,
"metadata": {
"title": "RAG for Large Language Models: A Survey",
"description": "A survey of retrieval-augmented generation...",
"author": "Gao, Xiong, Gao, et al.",
"siteName": "arXiv",
"image": null,
"publishedAt": "2023-12-18",
"language": "en",
"canonical": "https://arxiv.org/abs/2312.10997"
},
"markdown": "# RAG for Large Language Models: A Survey\n\nLarge Language Models showcase impressive capabilities but encounter challenges like hallucination...",
"chunks": [
{ "index": 0, "text": "# RAG for Large Language Models...", "charCount": 2043, "estimatedTokens": 509 },
{ "index": 1, "text": "Retrieval-Augmented Generation...", "charCount": 1954, "estimatedTokens": 487 },
… 8 more
],
"stats": { "chars": 1847, "estimatedTokens": 412, "chunkCount": 10 },
"cached": false
}

Three steps. That's the whole thing.
1. Any http(s) URL. No schema, no scraping config.
2. Plain fetch first, headless Chromium fallback for JS-heavy pages.
3. Boilerplate stripped, metadata normalized, chunks sized for your embedding model.
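Put together, that's one request. A minimal sketch in TypeScript; the endpoint URL, header names, and request body field are placeholders rather than the documented API, but the response fields used below match the sample response shown above:

// Sketch only: endpoint and auth header are assumptions, not documented names.
const res = await fetch("https://api.denoize.dev/v1/extract", {
  method: "POST",
  headers: {
    Authorization: `Bearer ${process.env.DENOIZE_API_KEY}`,
    "Content-Type": "application/json",
  },
  body: JSON.stringify({ url: "https://arxiv.org/abs/2312.10997" }),
});
const doc = await res.json();

console.log(doc.metadata.title);   // "RAG for Large Language Models: A Survey"
console.log(doc.chunks.length);    // 10 chunks, already sized for embedding
console.log(doc.markdown.slice(0, 80));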
Pay for what you use. No subscription.
Credits never expire. Each API call uses 1 credit.
Free · forever
- 25 free credits at signup
- HTML + PDF + text
- Headless browser fallback
- REST API access
Starter · 100 credits · $0.05 / request
- Credits never expire
- Everything in Free
- MCP server access
- Cache hits free (24h)
- Email support
Standard · 500 credits · $0.03 / request
- Credits never expire
- Everything in Starter
- Priority rendering
Obvious questions.
How is this different from a scraper I could build?
We do the unglamorous parts: Readability-class content extraction, paragraph-aware Markdown conversion, automatic browser fallback only when needed, and a Redis cache shared across customers. Fifteen minutes of work, folded into one fetch() call.
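For comparison, here is roughly what the first two of those parts look like if you build them yourself with @mozilla/readability and turndown. This is a sketch of the DIY route, not Denoize's internals; the browser fallback, shared cache, and chunking are still on you:

import { JSDOM } from "jsdom";
import { Readability } from "@mozilla/readability";
import TurndownService from "turndown";

// DIY content extraction + Markdown conversion for a single page.
const url = "https://arxiv.org/abs/2312.10997";
const html = await (await fetch(url)).text();
const dom = new JSDOM(html, { url });
const article = new Readability(dom.window.document).parse();
const markdown = new TurndownService().turndown(article?.content ?? "");
console.log(markdown.slice(0, 200));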
Does it handle JavaScript-rendered pages?
Yes. We try a plain fetch first (fast, cheap) and spin up a headless Chromium only when the Markdown comes back thin or the page looks like an empty SPA shell. You don't pick — we do.
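You can see which path a request took after the fact: the rendered flag in the sample response above is false for a plain fetch and true when the browser stepped in (the flag's meaning is inferred from this answer). Reusing the doc object from the earlier sketch:

// `rendered` comes straight from the response shown at the top of the page.
if (doc.rendered) {
  console.log(`Chromium fallback was used for ${doc.url}`);
} else {
  console.log(`Plain fetch was enough for ${doc.url}`);
}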
What about PDFs?
Parsed with unpdf (pdf.js under the hood). Per-page Markdown plus document metadata: author, title, language, page count. OCR for scanned PDFs is on the roadmap.
What's the latency?
500–2000 ms for a fresh plain fetch. 2–5 s when Chromium is involved. Paid plans get cached responses (under 50 ms, zero credits) for any URL extracted in the past 24 hours.
What chunk size should I use?
Default is 512 tokens with a 50-token overlap, a good match for OpenAI and Cohere embeddings. Override chunk.size and chunk.overlap per request if your retriever expects something different.
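A sketch of that per-request override. The parameter names chunk.size and chunk.overlap come from this answer; nesting them under a chunk object (and the endpoint URL) is an assumption for illustration:

// Same call as the earlier sketch, but asking for smaller chunks with less overlap.
const res = await fetch("https://api.denoize.dev/v1/extract", {
  method: "POST",
  headers: {
    Authorization: `Bearer ${process.env.DENOIZE_API_KEY}`,
    "Content-Type": "application/json",
  },
  body: JSON.stringify({
    url: "https://arxiv.org/abs/2312.10997",
    chunk: { size: 256, overlap: 32 },  // tune to your embedding model
  }),
});
const { chunks } = await res.json();
// Each chunk carries text, charCount and estimatedTokens (see the sample
// response above), so the array can go straight into an embedding batch.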
Can I plug this into Claude Desktop?
Yes — MCP is included in Starter and Standard. Drop one block into your Claude Desktop / Cursor / Windsurf config with your API key as a Bearer token — no local install, no npm package to maintain. Free accounts can still use the REST API; MCP unlocks the moment you buy any credit pack.
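One plausible shape for that config block, written in the mcpServers format Claude Desktop, Cursor, and Windsurf read. The server URL is a placeholder and the exact keys for a remote server vary by client, so treat this as a template rather than the documented config:

{
  "mcpServers": {
    "denoize": {
      "url": "https://mcp.denoize.dev",
      "headers": {
        "Authorization": "Bearer YOUR_DENOIZE_API_KEY"
      }
    }
  }
}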