Why we are abandoning pixel-based processing in favor of structural semantic mapping for the next generation of macOS agents.
The current paradigm of AI agents (sometimes called "Action Models") treats the computer screen as a series of images. By taking continuous screenshots and feeding them into models like GPT-4o or Claude 3.5 Sonnet, these agents attempt to "reason" about where to click and what to type.
While this approach is impressive in its generality, it is fundamentally flawed for production-grade operating system automation. It introduces a layer of abstraction that is both computationally expensive and unnecessarily slow.
We argue that the future of desktop AI lies not in Computer Vision, but in deep structural integration with the host operating system's Accessibility APIs.
Sending screenshots to the cloud for every single action is the status quo, but it is fundamentally unsustainable. We call this the "Vision Tax" — a combination of three critical failures that hold back real-time automation.
A round-trip for a 4K screenshot upload and inference typically exceeds 3,000ms. In a workflow requiring 10 steps, the user waits 30 seconds for a task that should take 5. Local indexing reduces this to ~50ms.
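The arithmetic behind that comparison can be sketched directly; the per-step figures below are the estimates quoted above, not measured benchmarks:

```python
# Rough end-to-end latency for a 10-step workflow, using the
# estimates quoted above: ~3,000 ms per cloud round-trip
# (4K screenshot upload + remote inference) vs ~50 ms per
# local accessibility-index lookup.

STEPS = 10
CLOUD_MS_PER_STEP = 3_000
LOCAL_MS_PER_STEP = 50

cloud_total_s = STEPS * CLOUD_MS_PER_STEP / 1000
local_total_s = STEPS * LOCAL_MS_PER_STEP / 1000

print(f"cloud: {cloud_total_s:.1f}s, local: {local_total_s:.1f}s")
# cloud: 30.0s, local: 0.5s
```

At ten steps the cloud path already costs half a minute of wall-clock waiting; the gap widens linearly with workflow length.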
According to recent benchmarks, processing raw pixels consumes 90% more tokens than processing structured accessibility data. This makes vision-based agents 10x more expensive to run at scale.
Beyond costs, there is the issue of Probabilistic Failure. Vision models can misinterpret a slight change in UI skin or transparency, whereas structural trees provide a deterministic path to the element.
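A minimal sketch of what "a deterministic path to the element" means in practice. The tree shape and node names here are hypothetical stand-ins for an accessibility hierarchy, not the actual macOS AX API:

```python
# Resolve a structural path like "Window > Toolbar > Add Button"
# against a UI tree. Unlike pixel matching, the lookup either
# succeeds exactly or fails loudly (KeyError) -- there is no
# probabilistic mis-click when a theme or transparency changes.
# (The tree below is an illustrative stand-in, not real AX data.)

ui_tree = {
    "Window": {
        "Toolbar": {
            "Add Button": {"role": "button", "hotkey": "Cmd+T"},
        },
    },
}

def resolve(tree: dict, path: str) -> dict:
    node = tree
    for segment in path.split(" > "):
        node = node[segment]  # missing segment raises, rather than guessing
    return node

element = resolve(ui_tree, "Window > Toolbar > Add Button")
print(element["hotkey"])  # Cmd+T
```

The failure mode is the point: a renamed or removed element surfaces as an explicit error to re-index on, not a silently wrong click.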
Current vision-based agents require a constant stream of your screen recording to be sent to third-party servers. This is a non-starter for privacy-conscious users and enterprise environments.
"When an agent 'sees' your screen, it isn't just seeing the button it needs to click. It sees your private emails, Slack messages, and banking tabs."
Even with zero-retention policies, the data egress itself is a significant vulnerability. For AI to truly become part of our operating systems, it must respect the boundary of the local machine. Remote vision is a compromise we are no longer willing to make.
By utilizing on-device local models like Llama 3 or Phi-3, Lazzy ensures that zero bytes of your visual data ever leave your Mac.
Our breakthrough comes from shifting the bottleneck from Vision to Indexing. Instead of asking the AI to "look" at the screen repeatedly, we teach it the structure of the application once.
During the Discovery Phase, our specialized Indexing Agent explores the application structure — identifying buttons, text fields, menu items, and their associated hotkeys. This creates a persistent App Map.
{
  "app": "Safari",
  "bundleId": "com.apple.Safari",
  "elements": [
    {
      "id": "btn_new_tab",
      "role": "button",
      "label": "New Tab",
      "hotkey": "Cmd+T",
      "path": "Window > Toolbar > Add Button"
    },
    {
      "id": "input_url",
      "role": "textfield",
      "label": "Address Bar",
      "hint": "Type URL here"
    }
  ]
}

Once an app is indexed, an SLM (Small Language Model) can interpret a user request like "Summarise my current tab in Safari" and execute it by looking up the "Address Bar" ID, reading its text, and clicking "New Tab" to paste the summary.
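Consuming the App Map is then a plain lookup rather than an inference call. The sketch below loads the same JSON and resolves elements by label; the helper function is illustrative, not part of any published API:

```python
import json

# The App Map produced by the Discovery Phase (same structure as above).
APP_MAP = json.loads("""
{
  "app": "Safari",
  "bundleId": "com.apple.Safari",
  "elements": [
    {"id": "btn_new_tab", "role": "button", "label": "New Tab",
     "hotkey": "Cmd+T", "path": "Window > Toolbar > Add Button"},
    {"id": "input_url", "role": "textfield", "label": "Address Bar",
     "hint": "Type URL here"}
  ]
}
""")

def find_element(app_map: dict, label: str):
    """Look up an indexed element by its human-readable label."""
    for element in app_map["elements"]:
        if element["label"] == label:
            return element
    return None

# An SLM planning "open a new tab" only needs the stored hotkey;
# no screenshot and no vision model are involved at execution time.
new_tab = find_element(APP_MAP, "New Tab")
print(new_tab["hotkey"])  # Cmd+T
```

Because the map persists on disk, every subsequent request against Safari skips the Discovery Phase entirely and resolves in memory.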
By removing the need for cloud vision, we enable a revolution in personal computing. This architecture is heavily inspired by Google's ScreenAI research, but pushes it closer to the edge.
Structural indexing yields near-100% action reliability, compared with the roughly 85% reported for vision-based models.
Work on an airplane or in a subway — your agent doesn't need a 5G connection to click a button.
Your desktop data stays on your NVMe. Not in a training set.
We are currently onboarding researchers and privacy enthusiasts to our early access program.