Building Towards Computer Use with Anthropic

In my last blog post, I wrote about how AI agents function. Essentially, they are groups of LLMs whose outputs become each other's inputs in a manner that enables them to solve tasks; a single LLM can also serve as an agent that utilizes tools (such as computer apps or functions in code) to complete tasks. For example, AI agents can perform tasks on computers that we would normally associate with developers or other human workers, such as writing code or working with software applications. DeepLearning.AI's recent short course on Building Towards Computer Use with Anthropic provides a window into those possibilities; I am using this blog post to take notes on it.

Introduction:

Anthropic uses LLMs to issue mouse clicks and keystrokes in agentic workflows that accomplish computer tasks, and it provides an API for building such agents. Anthropic's models accept up to 200,000 input tokens, or roughly 500 pages of text, which enables long-form prompts and prompt caching. Loosely, one might analogize this to giving a human worker 500 pages of notes to draw on in the problem-solving process (in tandem with the reservoir of knowledge stored in their brain); pushing the analogy further, in agentic workflows those notes would be used to track and coordinate group work.

Overview:






Working with the API:
The notion of "role" is important: it tells the model who each message in the conversation comes from (for example, the user or the assistant), so that it can frame and create its replies in an interactive exchange with users or other agents.
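
For concreteness, here is a minimal sketch of how roles appear in the Anthropic Messages API. This is my own illustration rather than the course's code; the model name and prompts are placeholders, and it assumes the official anthropic Python SDK with an ANTHROPIC_API_KEY set in the environment.

```python
# A minimal sketch, assuming the anthropic Python SDK and an ANTHROPIC_API_KEY
# environment variable; the model name and prompts are illustrative.
import anthropic

client = anthropic.Anthropic()

# Each message carries a "role" so the model knows who said what.
conversation = [
    {"role": "user", "content": "What is an AI agent, in one sentence?"},
]

first = client.messages.create(
    model="claude-3-5-sonnet-20241022",
    max_tokens=300,
    messages=conversation,
)

# Feed the model's reply back with the "assistant" role to continue the exchange.
conversation.append({"role": "assistant", "content": first.content[0].text})
conversation.append({"role": "user", "content": "Give a concrete example of one."})

second = client.messages.create(
    model="claude-3-5-sonnet-20241022",
    max_tokens=300,
    messages=conversation,
)
print(second.content[0].text)
```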

Multimodal Requests:




The Claude models are multimodal, so the user can prompt them with image data as well as text data.
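
Below is a minimal sketch of a multimodal request, again my own illustration rather than the course's code. Images are passed as base64-encoded content blocks alongside text; the file name "chart.png" is a placeholder.

```python
# A minimal sketch of a multimodal request; "chart.png" is a placeholder
# for any local PNG file.
import base64
import anthropic

client = anthropic.Anthropic()

with open("chart.png", "rb") as f:
    image_data = base64.standard_b64encode(f.read()).decode("utf-8")

response = client.messages.create(
    model="claude-3-5-sonnet-20241022",
    max_tokens=500,
    messages=[
        {
            "role": "user",
            "content": [
                # An image block and a text block can travel in the same message.
                {
                    "type": "image",
                    "source": {
                        "type": "base64",
                        "media_type": "image/png",
                        "data": image_data,
                    },
                },
                {"type": "text", "text": "Describe what this image shows."},
            ],
        }
    ],
)
print(response.content[0].text)
```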




Real World Prompting:



Everyday consumer use of LLMs like ChatGPT in the browser suggests that LLMs can be prompted in a casual and loose manner. This works for one-off consumer use cases, but there is also a place for more systematic and scalable enterprise-level prompting, which involves much longer, structured prompts. Examples include prompting an LLM to write up meeting/call notes or to summarize the sentiment of product reviews.
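
As a sketch of what a more structured prompt might look like (my own example, not the course's), here is a reusable call-notes template using XML-style tags, a convention Anthropic recommends for delimiting prompt sections; the tag names and transcript are illustrative.

```python
# A minimal sketch of a structured, reusable "enterprise-style" prompt template;
# the tag names and transcript text are illustrative.
import anthropic

client = anthropic.Anthropic()

PROMPT_TEMPLATE = """You are an assistant that writes concise call notes.

<instructions>
- Summarize the call in 3-5 bullet points.
- List any action items with owners.
- Do not invent details that are not in the transcript.
</instructions>

<transcript>
{transcript}
</transcript>

Return the notes in plain text."""

transcript = "Alice: Let's ship the beta Friday. Bob: I'll update the docs by Thursday."

response = client.messages.create(
    model="claude-3-5-sonnet-20241022",
    max_tokens=500,
    messages=[{"role": "user", "content": PROMPT_TEMPLATE.format(transcript=transcript)}],
)
print(response.content[0].text)
```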

Prompt Caching:

Prompt caching can be used to make the process of prompting an LLM more efficient. Essentially, it amounts to storing the content of a prior prompt so that it can augment subsequent prompts rather than be reprocessed anew each time. For example, if a series of prompts inquires into the contents of a book, it would be inefficient to reprocess the book's full text in every prompt. To me, this is very reminiscent of Retrieval Augmented Generation (RAG), where the book in this example would be stored in a backend vector database rather than fed and cached via the prompt. Perhaps caching would cut out the step of building a vector database, but it would be a less scalable approach.
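
A minimal sketch of how this looks in the Anthropic API follows (my own illustration, not the course's code). A content block is marked cacheable with a cache_control field; "book.txt" is a placeholder for a long document, and depending on SDK version a beta flag may be required.

```python
# A minimal sketch of prompt caching via the cache_control field;
# "book.txt" stands in for a long document.
import anthropic

client = anthropic.Anthropic()

with open("book.txt") as f:
    book_text = f.read()

response = client.messages.create(
    model="claude-3-5-sonnet-20241022",
    max_tokens=500,
    system=[
        {"type": "text", "text": "You answer questions about the attached book."},
        {
            "type": "text",
            "text": book_text,
            # Marks this block as cacheable so later calls that reuse the same
            # prefix read it from the cache instead of reprocessing it.
            "cache_control": {"type": "ephemeral"},
        },
    ],
    messages=[{"role": "user", "content": "Who is the protagonist?"}],
)
print(response.content[0].text)
```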






Tool Use (Function Calling):
Tool use gives an LLM such as Anthropic's Claude the ability to call external tools (or functions).





For example, an LLM can be connected to literal function calls and data sources that it can query to fulfill an agentic role such as a sales assistant. In the course's example, Claude has access to a purchases database and is equipped with functions such as "get_user", "get_order_by_id", "get_customer_orders", and "cancel_orders". Such tooling allows Claude to play the role of a sales assistant that can look up customer order information and modify it based on input from a customer, yielding a useful agentic chatbot.
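
Here is a rough sketch of how such a tool loop might be wired up. This is my own approximation in the spirit of the course's example, not its actual code; the get_order_by_id schema and the fake orders "database" are illustrative stand-ins.

```python
# A minimal sketch of Anthropic tool use in a sales-assistant style loop;
# the tool schema and fake database are illustrative.
import anthropic

client = anthropic.Anthropic()

FAKE_ORDERS = {"A123": {"item": "headphones", "status": "shipped"}}

def get_order_by_id(order_id: str) -> dict:
    """Look up an order in the (fake) purchases database."""
    return FAKE_ORDERS.get(order_id, {"error": "order not found"})

tools = [
    {
        "name": "get_order_by_id",
        "description": "Retrieve a customer's order by its order id.",
        "input_schema": {
            "type": "object",
            "properties": {"order_id": {"type": "string"}},
            "required": ["order_id"],
        },
    }
]

messages = [{"role": "user", "content": "What's the status of order A123?"}]

response = client.messages.create(
    model="claude-3-5-sonnet-20241022",
    max_tokens=500,
    tools=tools,
    messages=messages,
)

# If Claude decides to call the tool, run it and send the result back.
if response.stop_reason == "tool_use":
    tool_call = next(b for b in response.content if b.type == "tool_use")
    result = get_order_by_id(**tool_call.input)
    messages.append({"role": "assistant", "content": response.content})
    messages.append(
        {
            "role": "user",
            "content": [
                {
                    "type": "tool_result",
                    "tool_use_id": tool_call.id,
                    "content": str(result),
                }
            ],
        }
    )
    final = client.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=500,
        tools=tools,
        messages=messages,
    )
    print(final.content[0].text)
```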

Many of the ideas here rhyme with RAG, and one can see an overarching motif: LLMs are being augmented with external data, external tools, and external agents (other LLMs).








Computer Use:
