# Document Parser

Document Parser lets your AI workers fetch a file from a URL and read it as clean, structured Markdown. It works like [File Download](/toolhouse/capabilites/file-download.md), but instead of handing the raw file to your worker, Toolhouse converts it first, which makes the contents far easier for your worker to interpret and act on. Toolhouse adds Document Parser automatically when your worker needs it, or you can add it manually.

***

### How Document Parser works

When your worker needs to read a document — whether you provided the URL or the worker found it through Web Search — it passes the URL to Document Parser. Toolhouse fetches the file and converts its contents to Markdown on a best-effort basis before passing the result to your worker.

This means your worker receives clean, readable text with structure preserved — headings, lists, tables, and paragraphs — rather than raw bytes or noisy file output.

**Example prompt for your worker:**

> "Parse the product spec sheet at this URL and list all technical requirements in a table."
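
To make the conversion step described above more concrete, here is a minimal Python sketch of turning HTML into Markdown. It is not Toolhouse's implementation: the class `MarkdownSketch` and the function `html_to_markdown` are hypothetical names, and the sketch handles only a handful of tags (`h1` to `h3`, `p`, `li`) using the standard library.

```python
from html.parser import HTMLParser


class MarkdownSketch(HTMLParser):
    """Convert a small subset of HTML (h1-h3, p, li) to Markdown.

    Illustrative only; text outside these tags (for example <nav>) is dropped.
    """

    PREFIXES = {"h1": "# ", "h2": "## ", "h3": "### ", "li": "- ", "p": ""}

    def __init__(self):
        super().__init__()
        self.blocks = []     # finished Markdown blocks
        self.current = None  # tag of the block currently being collected
        self.buffer = []     # text fragments of the current block

    def handle_starttag(self, tag, attrs):
        if tag in self.PREFIXES:
            self.current = tag
            self.buffer = []

    def handle_data(self, data):
        # Inline tags such as <b> do not reset the block, so their text is kept.
        if self.current:
            self.buffer.append(data)

    def handle_endtag(self, tag):
        if tag == self.current:
            # Collapse runs of whitespace into single spaces.
            text = " ".join("".join(self.buffer).split())
            if text:
                self.blocks.append(self.PREFIXES[tag] + text)
            self.current = None


def html_to_markdown(html: str) -> str:
    """Return the Markdown blocks found in `html`, separated by blank lines."""
    parser = MarkdownSketch()
    parser.feed(html)
    parser.close()
    return "\n\n".join(parser.blocks)


if __name__ == "__main__":
    sample = "<nav>Home</nav><h1>Spec</h1><p>Max load: <b>5 kg</b></p>"
    print(html_to_markdown(sample))
```

The point of the sketch is the shape of the result: your worker sees headings, paragraphs, and list items as plain Markdown rather than markup or raw bytes.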

***

### When to use Document Parser vs. File Download

Document Parser is the right choice whenever the file you need to read has structure or formatting that matters — or whenever the file format is not natively readable as plain text.

| Situation                            | Use                                 |
| ------------------------------------ | ----------------------------------- |
| PDF reports, whitepapers, or manuals | **Document Parser**                 |
| Scanned or image-heavy documents     | **Document Parser**                 |
| HTML pages with complex layout       | **Document Parser**                 |
| Plain text, CSV, JSON, XML           | **File Download**                   |
| Binary files, archives, media        | Neither — consider Virtual Computer |

If you're unsure, prefer Document Parser. The Markdown conversion step is low-cost and makes the output significantly more useful to your worker.
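
If you route URLs in your own code before handing them to a worker, the table above can be approximated with a small lookup. This is a sketch under assumptions: `recommend_tool` is a hypothetical helper, the extension lists are illustrative rather than an official list of supported formats, and the decision is based only on the file extension in the URL.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse

# Illustrative extension groups derived from the table; not an official list.
DOWNLOAD_EXTENSIONS = {".txt", ".csv", ".json", ".xml"}
UNSUPPORTED_EXTENSIONS = {
    ".zip", ".tar", ".gz", ".exe", ".mp3", ".mp4", ".png", ".jpg",
}


def recommend_tool(url: str) -> str:
    """Return the capability the table suggests for the file at `url`."""
    # Use only the path component so query strings do not hide the extension.
    suffix = PurePosixPath(urlparse(url).path).suffix.lower()
    if suffix in DOWNLOAD_EXTENSIONS:
        return "File Download"
    if suffix in UNSUPPORTED_EXTENSIONS:
        return "Neither (consider Virtual Computer)"
    # PDFs, HTML pages, and anything unrecognized default to Document Parser,
    # following the "if you're unsure" guidance.
    return "Document Parser"


if __name__ == "__main__":
    for link in (
        "https://example.com/report.pdf",
        "https://example.com/export.csv",
        "https://example.com/backup.zip",
    ):
        print(link, "->", recommend_tool(link))
```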

***

### Best-effort conversion

Document Parser converts files to Markdown on a best-effort basis. Most well-structured documents — PDFs with selectable text, HTML pages, Word documents — convert reliably. However, some files may not convert cleanly:

* **Scanned documents** with no embedded text layer may produce poor output or none at all
* **Complex layouts** such as multi-column documents or heavily formatted slides may lose some structure during conversion
* **Non-document files** such as images or executables cannot be meaningfully converted

When conversion quality is critical, test with a sample document first. You can prompt your worker with:

> "Parse this document and tell me if the output looks complete and well-structured before proceeding."
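
If you also collect the converted Markdown yourself, a rough automated check can complement the prompt above. The following Python sketch is a heuristic, not a Toolhouse feature: `conversion_warnings` is a hypothetical function, and the thresholds (200 characters, 70% readable characters) are arbitrary assumptions you should tune on your own sample documents.

```python
def conversion_warnings(markdown: str, min_chars: int = 200) -> list[str]:
    """Return human-readable warnings about a converted document."""
    text = markdown.strip()
    if not text:
        return ["output is empty (possibly a scan with no text layer)"]

    warnings = []
    if len(text) < min_chars:
        warnings.append("output is very short")

    # A low share of letters, digits, and whitespace often signals garbled text.
    readable = sum(ch.isalnum() or ch.isspace() for ch in text)
    if readable / len(text) < 0.7:
        warnings.append("output contains many non-text characters")

    # No line starting with a heading, list, or table marker suggests that
    # structure was lost during conversion.
    markers = ("#", "-", "*", "|")
    if not any(line.lstrip().startswith(markers) for line in text.splitlines()):
        warnings.append("no headings, lists, or tables detected")

    return warnings
```

An empty list means none of the heuristics fired; it does not prove that the conversion is complete.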

***

### Context window limits

Like File Download, Document Parser limits the size of the converted output to fit within your worker's context window. The Markdown conversion often reduces file size compared to the raw original, which means Document Parser can sometimes surface more usable content than File Download for the same file.

If the document is very long and gets truncated, consider splitting the task — for example, asking your worker to parse and summarize one section at a time.
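
If you have the Markdown available outside the worker, one way to split the task is to cut the document at its headings and process one section at a time. The sketch below is a hedged illustration: `split_by_heading` is a hypothetical helper, it assumes ATX-style headings (`## Title`), and it does not account for heading-like lines inside fenced code blocks.

```python
import re


def split_by_heading(markdown: str, level: int = 2) -> list[tuple[str, str]]:
    """Split Markdown into (heading, body) pairs at headings of `level`.

    Text before the first matching heading is returned with an empty heading.
    Deeper headings (for example ### when level is 2) stay inside the body.
    """
    pattern = re.compile(rf"^#{{{level}}}[ \t]+(.+)$", re.MULTILINE)
    matches = list(pattern.finditer(markdown))

    sections = []
    preamble = markdown[: matches[0].start()] if matches else markdown
    if preamble.strip():
        sections.append(("", preamble.strip()))

    for i, match in enumerate(matches):
        # A section runs until the next heading of the same level, or the end.
        end = matches[i + 1].start() if i + 1 < len(matches) else len(markdown)
        body = markdown[match.end():end].strip()
        sections.append((match.group(1).strip(), body))
    return sections


if __name__ == "__main__":
    doc = "Intro\n\n## Scope\nWhat is covered.\n\n## Results\nWhat we found.\n"
    for heading, body in split_by_heading(doc):
        print(repr(heading), "->", len(body), "characters")
```

Each section can then be passed to your worker in its own request, which keeps any single request well below the truncation point.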

***

### Using Document Parser with Virtual Computer

Document Parser and Virtual Computer complement each other well for document-heavy workflows. Document Parser converts the file into readable Markdown; Virtual Computer can then process that content programmatically — extracting structured data, running calculations, or transforming the output.

To configure this in Agent Editor, describe both steps in your worker's prompt. For example:

> "Parse the financial report at this URL using Document Parser, then use the virtual computer to extract all numeric values from the tables and compute year-over-year growth."
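
For a sense of what the second step might look like, here is a Python sketch of the kind of script a worker could run in Virtual Computer on the parsed Markdown. Everything in it is an assumption for illustration: the helper names are hypothetical, the table in the usage section is invented sample data, and the parser handles only simple pipe tables with a header and a separator row.

```python
def parse_markdown_table(markdown: str) -> list[dict[str, str]]:
    """Parse the first pipe table in `markdown` into a list of row dicts."""
    rows = []
    for line in markdown.splitlines():
        line = line.strip()
        if line.startswith("|") and line.endswith("|"):
            rows.append([cell.strip() for cell in line.strip("|").split("|")])
        elif rows:
            break  # the first table has ended
    if len(rows) < 3:
        return []  # need a header, a separator, and at least one data row
    header, body = rows[0], rows[2:]  # rows[1] is the | --- | --- | separator
    return [dict(zip(header, row)) for row in body]


def to_number(text: str) -> float:
    """Convert strings such as '$1,250.50' to a float."""
    return float(text.replace(",", "").replace("$", "").replace(" ", ""))


def year_over_year(values: list[float]) -> list[float]:
    """Return growth between consecutive values, in percent.

    Raises ZeroDivisionError if an earlier value is zero.
    """
    return [
        round((curr - prev) / prev * 100, 2)
        for prev, curr in zip(values, values[1:])
    ]


if __name__ == "__main__":
    # Invented sample data, standing in for Document Parser output.
    report = """
| Year | Revenue |
| ---- | ------- |
| 2021 | 1,000   |
| 2022 | 1,250   |
| 2023 | 1,500   |
"""
    rows = parse_markdown_table(report)
    revenue = [to_number(row["Revenue"]) for row in rows]
    print(year_over_year(revenue))
```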

***

### Adding Document Parser manually

* Go to **Agents** in your Toolhouse dashboard
* Click on your worker to edit it
* Select **Integrations**, then click **Add Integration**
* Choose **Document Parser**
* Click **Save changes**

***

### Limitations and gotchas

| Constraint                 | Detail                                                                                                                                    |
| -------------------------- | ----------------------------------------------------------------------------------------------------------------------------------------- |
| **Best-effort conversion** | Markdown conversion is not guaranteed for all file types. Scanned documents, image-heavy files, and unusual formats may not convert well. |
| **Output size cap**        | Converted output is truncated to fit within your worker's context window. Very long documents may be cut off.                             |
| **No binary support**      | Files that cannot be meaningfully converted to text — images, executables, archives — are not suited for Document Parser.                 |
| **URL must be accessible** | The file URL must be publicly reachable. Files behind authentication or private networks cannot be fetched.                               |
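
Some unreachable URLs can be caught before they are ever passed to a worker. The Python sketch below is a static pre-check only: `url_precheck` is a hypothetical helper, the restriction to `http` and `https` is an assumption rather than a documented rule, and the function does not resolve DNS or detect authentication walls, so a passing result does not guarantee that the fetch will succeed.

```python
import ipaddress
from typing import Optional
from urllib.parse import urlparse


def url_precheck(url: str) -> Optional[str]:
    """Return a reason the URL is likely unreachable, or None if none is found."""
    parts = urlparse(url)
    if parts.scheme not in ("http", "https"):
        return "URL should use http or https"  # assumption, see lead-in
    host = parts.hostname
    if not host:
        return "URL has no host"
    if host == "localhost":
        return "host is a private or loopback address"
    try:
        address = ipaddress.ip_address(host)
    except ValueError:
        return None  # a regular hostname; it is not resolved here
    if address.is_private or address.is_loopback or address.is_link_local:
        return "host is a private or loopback address"
    return None
```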

***

### Frequently asked questions

**How is Document Parser different from File Download?** File Download gives your worker the raw file contents as-is. Document Parser adds a conversion step that transforms the file into Markdown, making it far easier for your worker to read and reason about structured documents like PDFs or HTML pages.

**Does Document Parser work on web pages, not just files?** Yes. If the URL points to an HTML page, Document Parser will convert the page content to Markdown — stripping navigation, ads, and other boilerplate to surface the core content.

**What if the conversion produces poor output?** You can prompt your worker to flag quality issues before proceeding, or fall back to File Download for simple text files. For scanned documents with no text layer, Document Parser may not be the right tool — consider pre-processing the file with an OCR service before passing it to your worker.

**Can my worker use Document Parser on a file it found itself?** Yes. If your worker discovers a document URL through Web Search or another integration, it can pass that URL to Document Parser without any additional input from you.

