I'd been putting it off since the news broke. Last weekend, I dug out my old MacBook with an outdated OS, wiped it, hardened it, and got down to business. I finally caved and installed OpenClaw.
But OpenClaw was just the latest step in a progression that echoes what many practitioners are experiencing at the enterprise level right now.
The Progression
I started where most did: ChatGPT for fast, general-purpose queries. Then Claude for deeper reasoning and writing. Then Notion AI (Claude/OpenAI wrapper) for working inside my existing knowledge base. On the development side, V0 to Cursor to Claude Code and Codex.
Claude plus Notion was a first-mover integration that became the strongest hybrid pattern I found: reasoning, planning, and writing grounded in persistent context without manual export. Claude Cowork, with local file access, expanded that capability considerably (and the plug-ins and integrations keep coming).
And then OpenClaw: the full-control, open architecture agent that you build programmatically. My hardened setup required exporting documents from Notion, which immediately forked my context (more on that below). But the raw capability is real.
Each layer added capability and redundancy. The question is which tradeoffs I’d make (and at what cost).
That's a microcosm of the enterprise problem. One person, multiple architectural approaches, no coordination framework. Scale that to an organization with dozens of teams and hundreds of workflows, and the structural gaps compound quickly.
How Things Compare
Context: The Part That Worked
I'll give myself credit where it’s due. Feeding the tools the context they needed was the closest thing to a solved problem in my setup.
Claude reading Notion directly was the strongest pattern. Reasoning and writing grounded in a single source of truth. I structured my knowledge base, maintained persistent context, and the agents used it well.
The one failure point was any workflow requiring content export. OpenClaw's setup required exporting documents from Notion; instant desync. Edits in Notion didn't propagate, and two sources of truth are zero sources of truth. To solve that, I gave it read-only access to Notion almost immediately (weakening my security posture). The same failure appears in writing workflows: draft in Notion, edit in Word, finalize the formatting. Word edits never flow back to the source, so context is lost immediately.
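The desync problem above can be made concrete: once a document is exported, nothing ties the copy to its source. A minimal sketch of a drift check, comparing content hashes before an agent is allowed to read the export (the function names and hashing approach are my own illustration, not part of Notion, OpenClaw, or any other tool):

```python
import hashlib
from pathlib import Path

def content_hash(text: str) -> str:
    """Stable fingerprint of a document's content."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def is_stale(exported: Path, source_text: str) -> bool:
    """True if the exported copy no longer matches the source of truth."""
    return content_hash(exported.read_text(encoding="utf-8")) != content_hash(source_text)
```

A check like this only detects the fork; it doesn't heal it. That is why I ended up granting read-only access to the source instead.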
Context only worked for me because of scale: a small, well-defined set of tasks, a single knowledge base, and one person maintaining everything. It's obvious where this breaks. At enterprise scale, it becomes the core infrastructure question: which systems feed agents the data, metadata, and semantic models they need, and how do you keep all of that synchronized across platforms that were never built to talk to each other?
Execution: The Part That Didn't
Almost every multi-step task that required handoffs, reformatting, or chaining outputs to inputs broke at some point. No architecture I tested fully owned execution natively. I was the orchestrator: the LLMs could reason about what needed to happen, but they couldn't make it happen across systems. Claude's integrations and OpenClaw's open architecture brought a new level of flexibility, but design and governance became far more important.
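"Chaining outputs to inputs" is trivial to express in code, which is exactly what made doing it by hand so galling. A sketch of the orchestration I was performing manually, with each step's output feeding the next step's input (the step names are hypothetical stand-ins for real tools):

```python
from typing import Callable

def run_pipeline(steps: list[Callable[[str], str]], initial: str) -> str:
    """Chain each step's output into the next step's input."""
    data = initial
    for step in steps:
        data = step(data)
    return data

# Hypothetical stand-ins for "draft in one tool, reformat, finalize in another".
draft = lambda s: s + " -> drafted"
reformat = lambda s: s + " -> reformatted"
result = run_pipeline([draft, reformat], "brief")
```

The loop is the easy part. The hard part, and the part no architecture I tested owned, is making each `step` a secure, authenticated call into a different vendor's system.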
At enterprise scale, the "agent execution layer" still doesn't exist in most organizations. Emerging protocols like Model Context Protocol (MCP) and Agent2Agent (A2A) are trying to solve this, but the reason I was manually reformatting and re-prompting is that there is no secure, standardized way for these architectures to coordinate work. Enterprises still have almost no evidence base for assessing whether these protocols will hold up in production.
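For a sense of what a standardized coordination layer looks like on the wire: MCP messages are JSON-RPC 2.0, and a host invokes a server's tool with a `tools/call` request. A minimal sketch of building one (the tool name `search_notes` and its arguments are hypothetical examples, not any real server's interface):

```python
import json

def mcp_tool_call(request_id: int, tool: str, arguments: dict) -> str:
    """Serialize an MCP `tools/call` request (MCP messages are JSON-RPC 2.0)."""
    return json.dumps({
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": tool, "arguments": arguments},
    })

# A host would send this to an MCP server over stdio or HTTP; "search_notes"
# is a made-up tool name for illustration.
msg = mcp_tool_call(1, "search_notes", {"query": "quarterly plan"})
```

The protocol standardizes the envelope, not the trust model, which is where the enterprise evidence gap sits.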
Observability: When Execution Fails, You Need to See Where
When a multi-step task fails, you need to trace where it went wrong.
OpenClaw offered more visibility into agent behavior but required technical skill to interpret. Claude and ChatGPT reduced friction but hid decision logic. The only oversight in my setup was me watching the terminal output and checking whether the results looked right.
EMA's research found that 63% of organizations will only enable AI-driven automated actions with human oversight. But watching a terminal and hoping you catch the failure isn't governance. Agent observability is just emerging as a discipline, and it needs to be structural: traceable actions, auditable decisions, reversible outputs. And it must come before deployment, before agents are making consequential decisions autonomously.
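To make "traceable actions, auditable decisions" concrete, here is a minimal sketch of structural observability: every agent action is recorded whether it succeeds or fails, so a broken multi-step task can be traced after the fact. The wrapper and field names are my own illustration, not any vendor's API:

```python
import time
from typing import Any, Callable

AUDIT_LOG: list[dict] = []  # in production this would be durable, append-only storage

def audited(action: str, fn: Callable[..., Any], *args, **kwargs) -> Any:
    """Run an agent action and record a traceable audit entry either way."""
    entry = {"ts": time.time(), "action": action, "args": repr((args, kwargs))}
    try:
        result = fn(*args, **kwargs)
        entry["status"] = "ok"
        return result
    except Exception as exc:
        entry["status"] = "error"
        entry["error"] = repr(exc)
        raise
    finally:
        AUDIT_LOG.append(entry)
```

Watching terminal output gives you none of this; the point is that the record exists independently of whether a human happened to be looking.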
Pilot to Production: Why One Person's Stack Won't Scale
Each architecture had a different complexity floor. ChatGPT was immediate. Claude with Notion required some configuration. OpenClaw required real technical effort just to start (old MacBook, outdated OS, terminal-only, no onboarding designed for humans).
I burned through $20 in API credits via OpenClaw before finishing initial setup, on top of subscriptions to ChatGPT, Claude, and Notion AI (offset only partly by free trials and promotional credits). Running parallel architectures (which evaluation requires) multiplies cost surfaces fast. And because these architectures overlap, you pay for redundant capability until you commit to a primary approach.
My stack is an experiment with one person doing one project. Add a second person or a second project, and coordination burden doubles, cost surfaces multiply, and context forks proliferate. That's the pilot-to-production wall. Scalability is an architectural constraint that shapes every other decision.
The Agentic Architecture Gap
Agentic architecture centers on context and execution: how agents consume data, metadata, and semantic models, and how they coordinate work across systems that were never designed to interoperate at the agent layer. The organizations treating agentic AI as a deployment problem (pick a vendor, plug it in) will hit the same walls I did.
My progression surfaced four evaluation criteria that map directly to what EMA will be investigating across enterprise IT:
Context. Can agents access persistent, structured context without export or drift? This was the closest to solved in my setup, but only when the architecture supported direct access. Export kills it.
Execution. Can agents hand off work and chain tasks programmatically, or does a human stitch every transition? This was the primary failure point. Architectures that require manual orchestration won't survive the pilot phase.
Observability. Are agent actions traceable, auditable, and reversible? Oversight can't be optional when agents operate autonomously.
Scalability. What's the total cost across overlapping tools, and how does complexity scale from one workflow to ten?
From Weekend Evaluations to Enterprise Strategy
Every tradeoff I navigated is one that enterprises are navigating as well. The tools work. The architectures run. What's missing is the infrastructure between having the right context and getting the right outcome.
I'm still in evaluation mode: comparing complexity and costs, and testing where flexibility matters and where it's overhead. Many enterprises are in the same place. The ones that treat agentic architecture as an engineering discipline will build something that works.
EMA will be surveying data, platform, and IT operations leaders about architectures, protocol strategies, observability models, and the vendor approaches emerging around agent infrastructure. The goal: an evidence base for decisions that currently run on instinct and marketing.