denoize v1 · beta

Raw HTML is noise for an LLM. Denoize fixes that.

One API call. Point us at a page, a PDF, a doc — get back clean Markdown, structured metadata, and chunks already sized for your RAG pipeline.

no credit card · 25 free credits at signup

before / after

The HTML goes in. The noise goes away.

raw.html · 48 KB
<!DOCTYPE html>
<html><head><script src="gtag.js"></script>
<script>dataLayer.push({...})</script>
<meta property="og:title" content="...">
... 180 more lines ...
</head><body>
<nav class="site-nav">...</nav>
<div class="cookie-banner">We value your privacy...</div>
<aside class="sidebar-ads">...</aside>
<article>
  <h1>RAG for Large Language Models: A Survey</h1>
  <p>Large Language Models showcase impressive...</p>
</article>
<footer>... share buttons, comments widget ...</footer>
<script>/* 30 more trackers */</script>
</body></html>
↓ extract ↓
response.json · 200 · 1.8 KB · 412 tokens
{
  "url":         "https://arxiv.org/abs/2312.10997",
  "kind":        "html",
  "contentType": "text/html; charset=utf-8",
  "rendered":    false,
  "metadata": {
    "title":       "RAG for Large Language Models: A Survey",
    "description": "A survey of retrieval-augmented generation...",
    "author":      "Gao, Xiong, Gao, et al.",
    "siteName":    "arXiv",
    "image":       null,
    "publishedAt": "2023-12-18",
    "language":    "en",
    "canonical":   "https://arxiv.org/abs/2312.10997"
  },
  "markdown": "# RAG for Large Language Models: A Survey\n\nLarge Language Models showcase impressive capabilities but encounter challenges like hallucination...",
  "chunks": [
    { "index": 0, "text": "# RAG for Large Language Models...", "charCount": 2043, "estimatedTokens": 509 },
    { "index": 1, "text": "Retrieval-Augmented Generation...", "charCount": 1954, "estimatedTokens": 487 },
    … 8 more
  ],
  "stats": { "chars": 1847, "estimatedTokens": 412, "chunkCount": 10 },
  "cached":      false
}
how it works

Three steps. That's the whole thing.

01
POST a URL

Any http(s) URL. No schema, no scraping config.

02
We fetch and render

Plain fetch first, headless Chromium fallback for JS-heavy pages.

03
Get Markdown + chunks

Boilerplate stripped, metadata normalized, chunks sized for your embedding model.
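The three steps above fit in a few lines of client code. A minimal Python sketch: the endpoint URL, payload shape, and `chunk` parameters here are assumptions inferred from this page, not documented API.

```python
import json
from urllib import request

API_URL = "https://api.denoize.example/v1/extract"  # hypothetical endpoint

def build_payload(url: str, chunk_size: int = 512, chunk_overlap: int = 50) -> dict:
    """Step 1: only the target URL is required; chunk sizing is optional."""
    return {"url": url, "chunk": {"size": chunk_size, "overlap": chunk_overlap}}

def extract(url: str, api_key: str) -> dict:
    """Steps 2-3: POST the URL, get back metadata + Markdown + chunks."""
    req = request.Request(
        API_URL,
        data=json.dumps(build_payload(url)).encode(),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
    )
    with request.urlopen(req) as resp:
        return json.load(resp)

def chunk_texts(response: dict) -> list[str]:
    """Pull just the chunk texts, ready to hand to an embedding model."""
    return [c["text"] for c in response.get("chunks", [])]
```

From there, `chunk_texts(extract(url, key))` is everything your embedding step needs.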

pricing

Pay for what you use. No subscription.

Credits never expire. Each API call uses 1 credit.

Free
$0

forever

  • 25 free credits at signup
  • HTML + PDF + text
  • Headless browser fallback
  • REST API access
Start free
Starter
$4.99

100 credits

$0.05 / request

  • Credits never expire
  • Everything in Free
  • MCP server access
  • Cache hits free (24h)
  • Email support
Buy Starter
popular
Standard
$14.99

500 credits

$0.03 / request

  • Credits never expire
  • Everything in Starter
  • Priority rendering
Buy Standard
faq

Obvious questions.

How is this different from a scraper I could build?

We do the unglamorous parts: Readability-class content extraction, paragraph-aware Markdown conversion, automatic browser fallback only when needed, and a Redis cache shared across customers. All of it behind one fetch() call.

Does it handle JavaScript-rendered pages?

Yes. We try a plain fetch first (fast, cheap) and spin up a headless Chromium only when the Markdown comes back thin or the page looks like an empty SPA shell. You don't pick — we do.
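The fallback decision can be pictured with a toy sketch. The threshold and heuristic below are illustrative assumptions, not the service's actual logic:

```python
THIN_TOKEN_THRESHOLD = 64  # illustrative cutoff, not the real value

def estimated_tokens(markdown: str) -> int:
    """Rough heuristic: roughly 4 characters per token."""
    return len(markdown) // 4

def should_render(markdown: str) -> bool:
    """Fall back to headless Chromium when the plain fetch yields
    almost no content, the signature of an empty SPA shell."""
    return estimated_tokens(markdown) < THIN_TOKEN_THRESHOLD
```

The point of the two-stage design is cost: the plain fetch succeeds on most pages, so the expensive browser only spins up for the minority that need it.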

What about PDFs?

Parsed with unpdf (pdf.js under the hood). Per-page Markdown plus document metadata: author, title, language, page count. OCR for scanned PDFs is on the roadmap.

What's the latency?

500–2000 ms for a fresh plain fetch. 2–5 s when Chromium is involved. Paid plans get cached responses (under 50 ms, zero credits) for any URL extracted in the past 24 hours.

What chunk size should I use?

Default is 512 tokens with 50 overlap — a good match for OpenAI and Cohere embeddings. Override chunk.size and chunk.overlap per request if your retriever expects different dimensions.
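Chunking happens server-side, but to make the defaults concrete, here is a character-based approximation of what 512-token chunks with a 50-token overlap look like. The 4-characters-per-token ratio is an assumption; real tokenizers vary:

```python
CHARS_PER_TOKEN = 4  # rough heuristic, varies by tokenizer

def chunk_text(text: str, size: int = 512, overlap: int = 50) -> list[str]:
    """Split text into ~size-token chunks, each sharing ~overlap tokens
    with the previous chunk."""
    size_chars = size * CHARS_PER_TOKEN
    step_chars = (size - overlap) * CHARS_PER_TOKEN
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + size_chars])
        start += step_chars
    return chunks
```

With the defaults, consecutive chunks share roughly 200 characters, so a sentence cut at a boundary still appears whole in at least one chunk.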

Can I plug this into Claude Desktop?

Yes — MCP is included in Starter and Standard. Drop one block into your Claude Desktop / Cursor / Windsurf config with your API key as a Bearer token — no local install, no npm package to maintain. Free accounts can still use the REST API; MCP unlocks the moment you buy any credit pack.
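As a reference shape only (the exact block varies by client, and some clients route remote servers through a local shim), a remote-MCP entry might look like the following. The server URL and entry name are placeholders, not documented values; check your dashboard for the real ones:

```json
{
  "mcpServers": {
    "denoize": {
      "url": "https://mcp.denoize.example/sse",
      "headers": {
        "Authorization": "Bearer YOUR_API_KEY"
      }
    }
  }
}
```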

Think of a URL you want in your RAG pipeline. Now try it.

Get an API key