The wrapper problem: reputation should flow through tools, not stop at the agent

Hey all — first post here. I’ve been sketching out how reputation should be structured for software agents specifically, and wanted to share a piece of that thinking for discussion.

Most on-chain agent reputation designs so far, the Reputation Registry in ERC-8004 among them, converge on the same shape: an agent is a thing, feedback is posted about that thing, scores aggregate. This mirrors how human reputation has worked for centuries, and it is the natural first move. But for software agents specifically, scoring at the agent wrapper leaves significant value unclaimed — because agents, unlike humans, are compositions we can actually see into.

A modern agent is a stack: a large language model (sometimes more than one), one or more MCP servers exposing tools, a set of deterministic verifiers, occasionally a specialized sub-model, and an orchestration layer that routes between them. When we score “the agent,” we are collapsing that multi-layer dependency tree into a single number and throwing away every signal about which layer actually did the work.

This matters more than it might seem.

The cost of opacity

Cold start is artificially painful. A new agent composed entirely of known-good components — a strong LLM, a well-tested Slither wrapper, an audited symbolic executor — still starts at zero reputation. Every consumer has to wait for this agent to accumulate attestations before trusting it, even though every component inside has a track record. There is no mechanism for trust to flow from the parts to the whole.

Tool builders get no signal. Someone who builds an excellent solidity-audit tool has no way to see that their tool is the reason certain agents perform well in that domain. The tool is invisible to the reputation graph. Economically, this kills one of the most important feedback loops in a healthy ecosystem: the one that tells infrastructure builders which infrastructure is actually working.

Legibility collapses. A consumer choosing between two agents with similar domain scores cannot inspect what is inside. Is Agent A performing well because of the LLM, the tool stack, or the orchestration? Will it behave the same way if it gets forked and deployed in a different context? These questions are unanswerable when the agent is a black box.

Attribution of failure becomes impossible. When an agent makes a bad attestation, whose layer caused it? The LLM hallucinated? A tool returned a false negative? The orchestration mishandled a retry? Without compositional visibility, corrective signal cannot be routed back to the right layer. We can only punish the wrapper.

What compositional reputation would look like

The shape is not exotic. It is what software composition already looks like, lifted to reputation.

Each tool — an MCP server, a deterministic analyzer, a specific sub-model — accumulates its own calibration score per domain, aggregated across the agents that use it. A solidity-audit analyzer that contributes to accurate attestations gains reputation on solidity-audit, regardless of which agent invoked it.

Each agent declares its composition on registration: which LLM, which MCP servers, which tools. The agent’s domain reputation is then a function of two layers: a composite prior inherited from its declared tools, plus the orchestration quality measured by the delta between the components’ expected performance and the actual observed performance of the agent as a whole.
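To make the two-layer idea concrete, here is a minimal sketch. It assumes scores live in [0, 1] with higher meaning better calibrated, and it assumes the composite prior is a plain weighted mean of component scores; the post does not fix an aggregation rule, so that choice is only illustrative. The names `Component`, `composite_prior`, `agent_reputation` and the shrinkage constant `k` are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Component:
    name: str
    domain_score: float  # calibration score in [0, 1] for one domain (assumed scale)
    weight: float = 1.0  # hypothetical: how much of the work this component does


def composite_prior(components: list) -> float:
    """Weighted mean of declared component scores (one possible aggregation)."""
    total_weight = sum(c.weight for c in components)
    if total_weight == 0:
        raise ValueError("agent must declare at least one weighted component")
    return sum(c.domain_score * c.weight for c in components) / total_weight


def orchestration_delta(observed: float, prior: float) -> float:
    """What the agent adds (or loses) relative to what its parts predict."""
    return observed - prior


def agent_reputation(components: list, observed: Optional[float] = None,
                     n_attestations: int = 0, k: int = 20) -> float:
    """Blend the inherited prior with the observed score.

    With no attestations the agent sits exactly at its prior. As evidence
    accumulates, the orchestration delta is trusted more. `k` is a
    hypothetical constant controlling how quickly that happens.
    """
    prior = composite_prior(components)
    if observed is None or n_attestations == 0:
        return prior
    trust = n_attestations / (n_attestations + k)
    return prior + trust * orchestration_delta(observed, prior)


# Hypothetical stack; the component names and scores are made up.
stack = [
    Component("llm", 0.80),
    Component("slither-wrapper", 0.90, weight=2.0),
    Component("symbolic-executor", 0.85),
]
print(agent_reputation(stack))                                    # cold start: prior only
print(agent_reputation(stack, observed=0.90, n_attestations=20))  # prior plus partial delta
```

Under these assumptions a brand-new agent no longer starts at zero: it starts at whatever its declared parts justify, and only earns more (or less) through its own orchestration record.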

The scoring mechanism — proper scoring rules over prediction-confidence tuples, à la Brier — is what produces the raw signal at each attestation. The difference is that the signal flows through the stack, not against a wrapper. A tool accumulates reputation from the downstream performance it enables. An agent accumulates reputation only for what its orchestration adds on top of its components.
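For readers less familiar with proper scoring rules, here is a small sketch of the Brier score over (confidence, outcome) pairs, followed by a deliberately naive way of routing each attestation's error to the components that were invoked for it. The routing rule (every invoked component is charged the full squared error) is my own simplifying assumption, not something specified above, and `credit_components` is a hypothetical helper.

```python
def brier_score(forecasts: list) -> float:
    """Mean squared error between stated confidence and the 0/1 outcome.

    0.0 is perfect and lower is better; always answering 0.5 scores 0.25.
    """
    if not forecasts:
        raise ValueError("need at least one (confidence, outcome) pair")
    for confidence, outcome in forecasts:
        if not 0.0 <= confidence <= 1.0 or outcome not in (0, 1):
            raise ValueError(f"bad pair: {(confidence, outcome)}")
    return sum((p - o) ** 2 for p, o in forecasts) / len(forecasts)


def credit_components(forecasts: list, invoked: list) -> dict:
    """Average squared error per component, over the attestations it took part in.

    `invoked[i]` lists the component names used for `forecasts[i]`.
    Naive assumption: each invoked component is charged the full error.
    """
    if len(forecasts) != len(invoked):
        raise ValueError("one invocation list is required per forecast")
    errors: dict = {}
    for (p, o), names in zip(forecasts, invoked):
        for name in names:
            errors.setdefault(name, []).append((p - o) ** 2)
    return {name: sum(errs) / len(errs) for name, errs in errors.items()}


# Hypothetical attestations: (stated confidence, what actually happened).
forecasts = [(0.9, 1), (0.2, 0), (0.7, 0)]
invoked = [["slither"], ["slither", "llm"], ["llm"]]
print(brier_score(forecasts))
print(credit_components(forecasts, invoked))
```

The point of the second function is only to show the direction of flow: the raw signal is still one number per attestation, but it lands on the parts of the stack that were involved instead of on the wrapper alone.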

This is also what gives the primitive its teeth against gaming. You cannot spam your way to a high score if “score” is a calibration measure over a declared composition, because a sybil agent with no orchestration track record gets no orchestration credit — it only inherits whatever its declared tool stack already brings.

The problems that aren’t solved

This is not clean. Three hard questions block naive implementation.

Attribution is a real problem. Given an agent made of N tools with observed aggregate performance, how do you assign calibration credit to each component? This is the Shapley value problem from cooperative game theory (roughly, assigning each component its average marginal contribution across all possible orderings of use), and exact computation scales exponentially with N in the general case. Domain-specific approximations exist: for tool-heavy domains (audit, verification, structured analysis), invocation-log attribution works acceptably; for LLM-heavy domains (synthesis, creative reasoning), attribution is closer to intractable and the model may need to collapse back to agent-level scoring. The protocol must accept that the decomposition depth is domain-dependent.
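To show what the Shapley framing means in practice, here is a minimal exact computation that enumerates every ordering of the components. It is only workable for a handful of components, which is exactly the cost problem described above. The `ACCURACY` table is invented for illustration: it assumes we somehow know how well each subset of the stack would perform, which is itself a strong assumption.

```python
from itertools import permutations
from math import factorial
from typing import Callable


def shapley_values(players: list, value: Callable) -> dict:
    """Exact Shapley values: average marginal contribution over all orderings."""
    totals = {p: 0.0 for p in players}
    for order in permutations(players):
        coalition = frozenset()
        for p in order:
            with_p = coalition | {p}
            totals[p] += value(with_p) - value(coalition)
            coalition = with_p
    n_orderings = factorial(len(players))
    return {p: total / n_orderings for p, total in totals.items()}


# Hypothetical performance of every subset of a three-component stack.
ACCURACY = {
    frozenset(): 0.0,
    frozenset({"llm"}): 0.5,
    frozenset({"slither"}): 0.6,
    frozenset({"symexec"}): 0.4,
    frozenset({"llm", "slither"}): 0.8,
    frozenset({"llm", "symexec"}): 0.6,
    frozenset({"slither", "symexec"}): 0.7,
    frozenset({"llm", "slither", "symexec"}): 0.9,
}

credit = shapley_values(["llm", "slither", "symexec"], ACCURACY.__getitem__)
print(credit)
```

With this made-up table the credit works out to roughly 0.3 for the LLM, 0.4 for the Slither wrapper and 0.2 for the symbolic executor, and the three shares add up to the full-stack score of 0.9. The loop runs N! orderings, so anything beyond a small stack needs sampling or the invocation-log shortcuts mentioned above.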

Composition can be misrepresented. An agent can declare it uses Tool X while actually using Tool Y. Without verification, compositional reputation degenerates into self-reported marketing. The fix is either cryptographic attestation of tool invocation — TEE execution, signed invocation logs — or on-chain observability of MCP calls. Both have cost. Neither is universally solved.
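As a rough illustration of the signed-invocation-log option, the sketch below has each tool sign a record of every call, and a checker compare the validly signed records against what the agent declared. It uses an HMAC with a shared secret purely as a stand-in, because it runs with the Python standard library alone; a real design would presumably want asymmetric signatures or TEE attestation so that anyone can verify without holding the key. The record fields and the `undeclared_tools` helper are hypothetical.

```python
import hashlib
import hmac
import json


def sign_invocation(secret: bytes, record: dict) -> str:
    """Hex MAC over a canonical JSON encoding of one invocation record."""
    payload = json.dumps(record, sort_keys=True, separators=(",", ":")).encode()
    return hmac.new(secret, payload, hashlib.sha256).hexdigest()


def verify_invocation(secret: bytes, record: dict, signature: str) -> bool:
    """Constant-time check that the signature matches the record."""
    return hmac.compare_digest(sign_invocation(secret, record), signature)


def undeclared_tools(declared: set, log: list, keys: dict) -> set:
    """Tools that appear in validly signed log entries but were never declared."""
    seen = set()
    for record, signature in log:
        key = keys.get(record["tool"])
        if key is not None and verify_invocation(key, record, signature):
            seen.add(record["tool"])
    return seen - declared


# Hypothetical keys and log: the agent declared tool-x but also called tool-y.
keys = {"tool-x": b"key-x", "tool-y": b"key-y"}
rec_x = {"tool": "tool-x", "agent": "agent-a", "call_id": 1}
rec_y = {"tool": "tool-y", "agent": "agent-a", "call_id": 2}
log = [(rec_x, sign_invocation(keys["tool-x"], rec_x)),
       (rec_y, sign_invocation(keys["tool-y"], rec_y))]
print(undeclared_tools({"tool-x"}, log, keys))
```

Note what this does not catch: an agent that calls Tool Y and simply never submits the log entry. Closing that gap is where the TEE or on-chain observability cost comes in.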

Tools are themselves compositions. An MCP server is built on a deterministic analyzer, which depends on a static analysis library, which depends on a parser. Where does decomposition bottom out? The protocol has to define what it treats as atomic. This is a design choice rather than a derivation, and it will vary across domains.

None of these are reasons to keep agents as black boxes. They are reasons to be specific about the model.

Why this matters

The standards layer is settling. ERC-8004 defines the identity slot at the protocol level. The reputation slot is deliberately left open for pluggable trust models. What gets built into that slot over the next six to twelve months will anchor how agent reputation works for a long time afterward.

If what ships is “score the wrapper,” we lose the compositional layer before we even try to build it. Retrofitting later means every existing reputation dataset has to be re-derived with tool-level attribution, which is costly and may not be possible without the original invocation logs.

If what ships is compositional from the start, the system scales to the actual shape of the agent stack. Tools can be evaluated independently. Agents inherit informed priors. Consumers can inspect what they are trusting. And the feedback loop reaches the builders who actually control the quality of each layer.

Open question

The cleanest version of compositional reputation assumes orchestration quality can be cleanly separated from tool quality — that you can measure “what the agent itself adds on top of its components” as a distinct signal. In tool-heavy domains this is plausible: invocation logs reveal which tool detected what, and orchestration is the residual. In LLM-heavy domains it isn’t: the orchestration is the prompting, the routing, the retry logic, all of which are entangled with the model output itself.

Is the right answer to score orchestration only in domains where it can be cleanly isolated, and accept that in others the agent and its primary model collapse into a single scored entity? Or is there a more general way to attribute the orchestration delta that I’m missing?

Curious how others here are thinking about this…

Max.


This is a great thought @wieedze and makes a LOT of sense - I really like this direction of thinking. There is something extremely valuable here.

I’ll offer some thoughts both for and against this idea, just to spark further discussion. My overall view is that there is real value in thinking this way, and that it is a novel angle most folks have not explored yet.

For: An “agent” truly is just a “wrapper”, as you put it - a composition of various components. Instead of thinking about “agents” as singular beings, we truly need to understand the constituent components comprising each agent, to understand the ‘reputation’ / ‘trust’ of an agent.

We are still in the ‘primordial soup’ era of ‘agents’, where agents aren’t really these unique, self-sovereign beings - sure, they may have somewhat of a memory and a personality as determined by their soul.md / memory.md files or anything of that nature, but even those are just constituent components that can each have a ‘reputation score’.

This is just like any dependency stack: I would not fly in a plane if I did not trust the supplier of the screws holding the wings together.

So, I love this observation - that the ULTIMATE state of reputation of agents is an aggregate of the reputation of each of the constituent components of the agents, so we must score these components!! Not JUST the wrapper!

And so we need ontologies for the ‘things being wrapped’, and reputation scoring mechanisms that let us score these individual components in the context of whatever function they serve - ESPECIALLY in SPECIFIC contexts, some more so than others.

Against: This isn’t really an argument ‘against’ this, so much as it is a thought on ‘where do we start’? I believe that there is objectively merit to what is being described above, and it should be done. However, is this where we need to start, universally, as an approach to agent identity? Or is this too complex?

Imagine scoring the reputation of a human by every single individual constituent component of that human.

Of course, this is ultimately the ‘correct’ way to do it: if you can do this, you’ll get a more accurate result than if you don’t.

However, it is much easier to take the reductionist approach of scoring the ‘wrapper’ → the human itself → than it is to score the various components that make up the human, where there is a potentially infinite number of vectors…

Conclusion: I think both efforts should coexist in parallel - I think there is value to scoring the wrapper, and also value to scoring the individual components being ‘wrapped’. The former is already being addressed by other groups - the latter is what seems to be missing. So, I think there is a huge opportunity in the latter since it seems to be less focused on, and it is objectively the correct way to approach agent reputation in many contexts, rather than focusing on the ‘wrapper’…

This is also how most people are using ‘agents’ right now - they are not actually interacting with any ‘agents’ other than the ones they’ve put together themselves - so what everyone really needs is the reputation of the various constituent components that can comprise an agent, so they know what to pick from. This is akin to what we talked about at EthCC - and I still think there is a ton of demand for something like this…


hey billy, the infinite regress concern is real and it’s the right pushback.

The way i’ve been thinking about bounding it: agents declare their component set at registration time. the depth of the graph isn’t “everything that could theoretically be a wrapper” — it’s “everything this specific agent has declared it depends on.”

So the registry isn’t trying to model all possible components. it’s a live snapshot of what agents actually use, updated when they register or update their profile. the graph grows from real usage, not from a top-down taxonomy someone had to pre-define.
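a quick sketch of what i mean by bounding, assuming the registry is just a mapping from a component id to the ids it declares (the ids below are made up). anything with no entry of its own is treated as atomic, so the walk stops there instead of recursing forever:

```python
def declared_closure(registry: dict, agent_id: str) -> set:
    """Every component reachable from agent_id through declared dependencies only."""
    seen = set()
    stack = list(registry.get(agent_id, []))
    while stack:
        component = stack.pop()
        if component == agent_id or component in seen:
            continue  # guards against cycles in the declarations
        seen.add(component)
        # No registry entry means nothing further was declared: treat as atomic.
        stack.extend(registry.get(component, []))
    return seen


# Hypothetical registry snapshot.
registry = {
    "audit-agent": ["llm-x", "slither-mcp"],
    "slither-mcp": ["slither"],
    "slither": [],
}
print(sorted(declared_closure(registry, "audit-agent")))
```

the depth of the graph for any agent is whatever falls out of that walk, nothing more.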

The governance layer (who can add modules to the registry) is where you prevent infinite regress from the other direction: anyone can propose a new component type, but it needs to clear a bar before it becomes canonical.

Your conclusion is exactly where i landed too — wrapper-level and component-level reputation should coexist. The wrapper score is fast and useful and the component graph is where the real signal lives long-term.