Launch HN: Parsewise (YC P25) – Reason Across Documents with an API

34 points - today at 1:48 PM

Hi all, it’s Greg and Max, founders of Parsewise here (https://www.parsewise.ai/api). Parsewise transforms a bucket of unstructured data into schema compliant data, retaining lineage for values resolved across documents.

Imagine giving Claude a bunch of files and asking for a CSV or JSON output. If you have tried this, you know both the system limitations (number of files, type of inputs, cost, latency) but also the human-facing challenge of having no way to validate the results quickly. We solve both. We help tech teams simplify their unstructured data ETL, and loop in business experts for the definitions and for instant validation.

Here is a video with a few use cases: https://www.youtube.com/watch?v=dbRllnnh47w

Parsewise in the words of someone coming to us: ”I need to extract information from insurance policy PDFs, phone calls that have been transcribed, emails, etc. I am NOT looking for something that would just extract data point by data point, page by page into a structured well-defined schema but more something more agentic that can understand that information might be across documents and that it should reason over what to extract.”

We started the company based on a decade of experience (and pain) in complex data transformation and data analysis / synthesis. Greg was building both classical ETL and implemented AI workflows at Palantir. At Bain, Max did highly complex data analysis in the financial sector, similar to many of our customers.

Parsewise works by taking in a bucket of data (think hundreds or thousands of pdfs, excels etc.), and outputting schema compliant data where every single value is traceable down to word level citations across multiple documents in the bucket. We provide API customers with ways to show the lineage in their own applications, or they can use our platform for internal operations. At the core of the data processing we have self-improving agent definitions. They define the acceptable sources, the logic for resolving or combining values, and the rule for highlighting uncertainty to the end user.

The underlying tech is model and cloud agnostic and can be deployed in private networks. We have seen the best results with Gemini models for visual reasoning, achieving SOTA (beating Claude Fable) on the strongest grounded reasoning benchmark we have found (Databricks OfficeQA). Notably, we focused more on the “human harness” rather than the model harness, leaning into the actual friction we saw in uptake, which is around verifiability. That means optimizing the time and clicks required to trust the outcomes. We use vLLMs for parsing, and then we use small models for efficient large scale exhaustive search. Unlike RAG, we do not sample; instead, we exhaustively find all relevant values for a given query. We use larger models for decision making around resolutions and flagging inconsistencies to users.

This exhaustiveness and explicit value sourcing is unique to our platform, and it goes beyond the first step of data parsing that many existing providers cover.

We would love to welcome builders and tinkerers to try Parsewise on your complex document challenges. We have a ton of ideas on how we can expand the product and make it better, but would appreciate feedback and ideas from the community!

Comments

capevace today at 5:51 PM

I built a similar tool some time ago called Struktur (https://struktur.sh).

It’s much more limited in scope but fully open source and highly customisable. In fact it’s made for people to build their own pipelines on top of, providing the scaffolding needed to do so in a reliable way.

During development I’ve found it to be hard to truly generalise agent/llm-based data extraction, especially around the unlimited number of input types without task specific instructions (many files of the same kind, single large files, mixed kinds, bad quality files, docx/pdf/png/… the list goes on). Users sadly wanna upload all of these, and developers want a „one size fits all“ solution.

I am interested in how your solution deals with this. I came up with a strategy based approach so every task can be customised if needed, but I’d be delighted to see a technical writeup of how you deal with this endless variety of input + extraction task combos! :)

whinvik today at 5:58 PM

Document parsing is top of my mind lately because in some of the areas we work on the bottleneck is starting to become being able to query documents the same way one queries an api.

I keep thinking the most obvious analogue is we need some way to represent documents the same way we can represent structured data in parquet. Parquet allows easy range bases queries and there is so much tooling built around Arrow.

But for documents I keep hitting a wall to figure out what the right abstractions are. Parquet allows filterable metadata. But what such metadata is there for documents. Then there is the arbitrrariness of chunking, vectorization.

If we could just do this in a 2 step process where every document to process can be represented in a parquet like data format then I think we will atleast have the semblance of a solution.

chaitralikakde today at 4:31 PM

How portable are your agent definitions? If I build one for insurance documents, how much work is needed to adapt it to a completely different domain like legal contracts or healthcare?

vinaigrette today at 5:21 PM

This looks great for digital humanities, specifically archival work. Would love to try it.

rdksu today at 5:28 PM

Hey ! Is this kind of like structured output over a large scale document corpora ?

gorgmah today at 2:52 PM

I worked recently on an internal tool to achieve this kind of things, mostly plugging mistral OCR to gemini to extract structured data from documents. We then perform automated diffs too.

There seems to be an insane amount of competition in the "Intelligent Document Processing" market, like for instance parseur, whose founder is often on HN himself.

What do you think sets you apart from competition like : 1) Mistral document AI : depending on the model, it looks way cheaper than yours, OCR model pricing ranges from 0.001 to 0.004 EUR / page and they have structured output wired in the OCR API if needed (things then get fed to one of their LLMs) + EU-based and GDPR ready 2) parseur / rossum / docsumo / nanonets (which is YC 2017) ?

red_hare today at 3:31 PM

I say this with a lot of love: The vibecoded applications in your demo reek of AI slop design.

This isn't a critique of your product. It's just that the a beige-orange theme, the pill components, and the left-border highlight give me that visceral reaction as reading a paragraph littered with em dashes and "not X but Y." It makes me take you less seriously.

Cool demo otherwise.

mauryaudayan today at 3:34 PM

llamaparse also do it, what is different here?

gergelycsegzi today at 2:02 PM

Ah probably should add a link to our website: https://www.parsewise.ai/api

gnerd00 today at 2:44 PM

[flagged]