Vision

Vision lets your worker read and understand images, and capture and interpret the contents of webpages. Toolhouse adds Vision automatically when your worker needs it to complete its task. You can also add it manually.

How Vision works

Some workers need to see and interpret visual content. For example, you might create a customer support worker that monitors your website, detects changes in your UI, and summarizes what it sees on each page.

In this case, Toolhouse detects that the worker needs to read images and adds Vision to it.

Setting up how your worker interprets images

To specify how the worker should describe or analyze images, edit your agent in Agent Editor and tell the editor something like: "I want my worker to always describe images in detail, focusing on any text, charts, or key visual elements it finds." You can use natural language and Toolhouse will translate it into proper prompt instructions for you.

Read an image and act on it

Your worker can use Vision alongside other integrations. For example, pair it with Image Generation to first read an existing image and then produce a modified version of it or a new image inspired by it.

Adding Vision manually

  • Go to Agents in Toolhouse

  • Click on your worker to edit it

  • Select Integrations, then click Add Integration

  • Choose Describe Image to read the contents of an image

  • Choose Page Screenshot to take a screenshot of a webpage and read its contents

  • Click Save changes
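
As a rough mental model of the two options in the steps above, the sketch below picks an integration name from a URL. The function `pick_vision_integration` and the extension list are hypothetical and are not part of Toolhouse. They only illustrate when each option applies. In practice you select the integration in the editor as described above.

```python
from urllib.parse import urlparse

# Hypothetical helper, not part of Toolhouse: it only mirrors the choice
# between the two Vision integrations named in the steps above.
IMAGE_EXTENSIONS = (".png", ".jpg", ".jpeg", ".gif", ".webp")  # assumed list


def pick_vision_integration(url: str) -> str:
    """Return the integration name that matches the kind of content at `url`."""
    # Look only at the path so query strings do not hide the file extension.
    path = urlparse(url).path.lower()
    if path.endswith(IMAGE_EXTENSIONS):
        return "Describe Image"  # the URL points at an image file
    return "Page Screenshot"  # anything else is treated as a webpage


if __name__ == "__main__":
    print(pick_vision_integration("https://example.com/chart.png"))
    print(pick_vision_integration("https://example.com/pricing"))
```

A worker that handles both image files and live webpages would need both integrations added.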
