Show HN: Robust LLM Extractor for Websites in TypeScript

Hacker News · March 26, 2026

We've been building data pipelines that scrape websites and extract structured data for a while now. If you've done this, you know the drill: you write CSS selectors, the site changes its layout, everything breaks at 2am, and you spend your morning rewriting parsers.

LLMs seemed like the obvious fix: just throw the HTML at GPT and ask for JSON. Except in practice, it's more painful than that:

- Raw HTML is full of nav bars, footers, and tracking junk that eats your token budget. A typical product page is 80% noise.
- LLMs return malformed JSON more often than you'd expect, especially with nested arrays and complex schemas. One bad bracket and your pipeline crashes.
- Relative URLs, markdown-escaped links, tracking parameters: the "small" URL issues compound fast when you're processing thousands of pages.
- You end up writing the same boilerplate: HTML cleanup → markdown conversion → LLM call → JSON parsing → error recovery → schema validation.

O
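To make the boilerplate above concrete, here is a minimal sketch of the non-LLM steps in plain TypeScript. It is not the library's API: every name here (`stripNoise`, `parseLooseJson`, `normalizeUrl`, `isProduct`, `Product`, `TRACKING_PARAMS`) is hypothetical. The regex-based cleanup is a crude stand-in for a real HTML parser, and the trailing-comma repair is naive enough to touch commas inside string values.

```typescript
// Hypothetical helpers for illustration only; not taken from the library.

/** Drop comments and obviously non-content elements. Crude: nested tags of the same kind will confuse it. */
function stripNoise(html: string): string {
  return html
    .replace(/<!--[\s\S]*?-->/g, "")
    .replace(/<(script|style|nav|footer|noscript)\b[\s\S]*?<\/\1\s*>/gi, "")
    .replace(/\s+/g, " ")
    .trim();
}

/** Best-effort parse of model output: unwrap a code fence, isolate the JSON value, retry without trailing commas. */
function parseLooseJson(output: string): unknown {
  // Models often wrap JSON in a triple-backtick fence; take the fenced body if present.
  const fenced = output.match(/`{3}(?:json)?\s*([\s\S]*?)`{3}/);
  let text = (fenced?.[1] ?? output).trim();
  // Keep only the span from the first opening bracket to the last closing one.
  const start = text.search(/[{[]/);
  const end = Math.max(text.lastIndexOf("}"), text.lastIndexOf("]"));
  if (start === -1 || end < start) throw new SyntaxError("no JSON value found");
  text = text.slice(start, end + 1);
  try {
    return JSON.parse(text);
  } catch {
    // Common failure mode: a trailing comma before a closing bracket.
    return JSON.parse(text.replace(/,\s*([}\]])/g, "$1"));
  }
}

/** Illustrative list of tracking parameters; an assumption, not exhaustive. */
const TRACKING_PARAMS = ["utm_source", "utm_medium", "utm_campaign", "utm_term", "utm_content", "fbclid", "gclid"];

/** Resolve a possibly relative URL against the page URL and strip tracking parameters. */
function normalizeUrl(raw: string, pageUrl: string): string | null {
  try {
    // Undo markdown escaping such as "\_" that can leak into extracted links.
    const unescaped = raw.trim().replace(/\\([_&=~\-.?#])/g, "$1");
    const url = new URL(unescaped, pageUrl);
    for (const key of TRACKING_PARAMS) url.searchParams.delete(key);
    return url.toString();
  } catch {
    return null; // Unparseable; the caller decides whether to drop or flag it.
  }
}

/** Hypothetical target shape, checked with a hand-written type guard. */
interface Product {
  name: string;
  url: string;
}

function isProduct(value: unknown): value is Product {
  if (typeof value !== "object" || value === null) return false;
  const v = value as Record<string, unknown>;
  return typeof v.name === "string" && typeof v.url === "string";
}
```

In a pipeline these would run in order: `stripNoise` before the LLM call, `parseLooseJson` on the response, `isProduct` on the parsed value, then `normalizeUrl` on each link field. A schema library would normally replace the hand-written guard; it is written out here only to keep the sketch dependency-free.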

