Since Anthropic released the "Computer Use" feature for Claude in October, there has been a lot of excitement about what AI agents can do when they have the power to mimic human interactions. A new study by Show Lab at the National University of Singapore provides insight into what we can expect from the current generation of graphical user interface (GUI) agents.
Claude is the first frontier model capable of operating as a GUI agent, interacting with a device through the same interfaces humans use. The model accesses only desktop screenshots and acts by triggering keyboard and mouse actions. The feature promises to let users automate tasks through simple instructions, without the need for API access to applications.
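Anthropic exposes this capability as a beta tool in its Messages API. The sketch below shows roughly what a request looks like via the official Python SDK; the prompt and display dimensions are illustrative, and the exact tool version and beta flag may have changed since the October release.

```python
# Minimal sketch of invoking Claude's computer-use beta tool with the
# Anthropic Python SDK. Prompt and display size are illustrative.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

response = client.beta.messages.create(
    model="claude-3-5-sonnet-20241022",
    max_tokens=1024,
    tools=[{
        "type": "computer_20241022",   # the computer-use tool definition
        "name": "computer",
        "display_width_px": 1024,      # resolution of screenshots sent to the model
        "display_height_px": 768,
    }],
    messages=[{"role": "user", "content": "Open the browser and check today's weather."}],
    betas=["computer-use-2024-10-22"],  # opt in to the beta feature
)

# The response contains tool_use blocks describing mouse/keyboard actions;
# a client-side harness must execute them, capture a fresh screenshot,
# and send it back to the model in a loop until the task is done.
print(response.content)
```

Note that the model never touches the machine directly: the developer's harness performs each suggested click or keystroke and reports the resulting screen state back.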
Researchers tested Claude on a variety of tasks spanning web search, workflows, office productivity, and video games. Web search tasks involve browsing and interacting with websites, such as finding and purchasing products or subscribing to news services. Workflow tasks involve cross-application interactions, such as extracting information from a website and inserting it into a spreadsheet. Office productivity tasks test the agent's ability to perform common operations such as formatting documents, sending emails, and creating presentations. Video game tasks assess the agent's ability to perform multi-step tasks that require understanding game logic and planning actions.
Each task tests the model's capability along three dimensions: planning, action, and criticism. First, the model must produce a coherent plan for accomplishing the task. It must then carry out the plan by translating each step into an action, such as opening a browser, clicking on elements, and entering text. Finally, the criticism component determines whether the model can evaluate its progress and success in completing the task. The model must be able to recognize when it has made mistakes along the way and correct its trajectory, and if the task is not feasible, it must give a logical explanation. The researchers built an evaluation framework around these three components and had human evaluators review and score every test run.
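The control loop implied by this framework can be sketched as follows. This is purely illustrative: the `agent` and `screen` objects are hypothetical placeholders wrapping the model and the desktop, and the paper describes an evaluation rubric, not an implementation.

```python
# Hypothetical plan -> act -> criticize loop matching the study's three
# dimensions. Agent, screen, Verdict and all method names are assumed
# placeholders, not part of any published framework.
from dataclasses import dataclass

@dataclass
class Verdict:
    done: bool        # critic judges the goal achieved; stop
    off_track: bool   # critic detects a mistake; replan
    reason: str       # explanation (e.g. why a task is infeasible)

def run_task(agent, screen, goal: str, max_steps: int = 20) -> Verdict:
    """Drive one task through the planning, action, and criticism stages."""
    plan = agent.plan(goal, screen.capture())              # 1. planning
    verdict = Verdict(done=False, off_track=False, reason="not started")
    for _ in range(max_steps):
        if not plan:
            break
        step = plan.pop(0)
        action = agent.to_action(step, screen.capture())   # 2. action: translate a
        screen.execute(action)                             #    step into click/type
        verdict = agent.criticize(goal, screen.capture())  # 3. criticism: self-check
        if verdict.done:
            break
        if verdict.off_track:                              # correct the trajectory
            plan = agent.replan(goal, screen.capture())
    return verdict
```

The study's findings below map directly onto these stages: Claude performs well at planning and action, while the criticism stage is where it most often falls short.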
Overall, Claude did an excellent job completing complex tasks. It was able to reason about and plan the multiple steps necessary to complete a task, carry out the actions, and evaluate its progress at each stage of the process. It could also coordinate different applications, such as copying information from web pages and pasting it into spreadsheets. Additionally, in some cases, it revisited the results at the end of the task to ensure everything was aligned with the goal. The model's reasoning traces show that it has a general understanding of how different tools and applications work and can coordinate them effectively.
However, it also tends to make trivial errors that an average human user would easily avoid. For example, in one task, the model failed to complete a subscription because it did not scroll down the web page to find the relevant button. In other cases, it failed at very simple and clearly defined tasks, such as selecting and replacing text or changing bullet points to numbers. In these situations, the model either did not realize its mistake or made false assumptions about why it was unable to achieve the desired goal.
According to the researchers, the model's errors in assessing its progress highlight "a gap in the model's self-assessment mechanisms" and suggest that a comprehensive solution to this problem "may still require improvements to the GUI agent framework, such as an internalized strict criticism module." The results also clearly show that GUI agents cannot yet reproduce all the nuances of how humans use computers.
What does this mean for businesses?
The promise of using basic text descriptions to automate tasks is very appealing. But at least for now, the technology is not ready for mass deployment. Model behavior is unstable and can lead to unpredictable results, which can have adverse consequences in sensitive applications. Performing actions through interfaces designed for humans is also not the fastest way to accomplish tasks that can be done through APIs.
And we still have a lot to learn about the security risks of large language models (LLMs) controlling the mouse and keyboard. For example, one study shows that web agents can easily fall victim to adversarial attacks that humans would readily spot and ignore.
Automating tasks at scale still requires robust infrastructure, including APIs and microservices that can be connected securely and served reliably. However, tools like Claude's Computer Use can help product teams explore ideas and test different solutions to a problem without investing time and money in developing new features or services to automate tasks. Once a viable solution is discovered, the team can focus on developing the code and components needed to deliver it efficiently and reliably.