Hey all — first post here. I’ve been sketching out how reputation should be structured for software agents specifically, and wanted to share a piece of that thinking for discussion.
Most on-chain agent reputation designs so far, the Reputation Registry in ERC-8004 among them, converge on the same shape: an agent is a thing, feedback is posted about that thing, scores aggregate. This mirrors how human reputation has worked for centuries, and it is the natural first move. But for software agents specifically, scoring at the agent wrapper leaves significant value unclaimed — because agents, unlike humans, are compositions we can actually see into.
A modern agent is a stack: a large language model (sometimes more than one), one or more MCP servers exposing tools, a set of deterministic verifiers, occasionally a specialized sub-model, and an orchestration layer that routes between them. When we score “the agent,” we are collapsing that multi-layer dependency tree into a single number and throwing away every signal about which layer actually did the work.
This matters more than it might seem.
The cost of opacity
Cold start is artificially painful. A new agent composed entirely of known-good components — a strong LLM, a well-tested Slither wrapper, an audited symbolic executor — still starts at zero reputation. Every consumer has to wait for this agent to accumulate attestations before trusting it, even though every component inside has a track record. There is no mechanism for trust to flow from the parts to the whole.
Tool builders get no signal. Someone who builds an excellent solidity-audit tool has no way to see that their tool is the reason certain agents perform well in that domain. The tool is invisible to the reputation graph. Economically, this kills one of the most important feedback loops in a healthy ecosystem: the one that tells infrastructure builders which infrastructure is actually working.
Legibility collapses. A consumer choosing between two agents with similar domain scores cannot inspect what is inside. Is Agent A performing well because of the LLM, the tool stack, or the orchestration? Will it behave the same way if it gets forked and deployed in a different context? These questions are unanswerable when the agent is a black box.
Attribution of failure becomes impossible. When an agent makes a bad attestation, which layer caused it? Did the LLM hallucinate? Did a tool return a false negative? Did the orchestration mishandle a retry? Without compositional visibility, corrective signal cannot be routed back to the right layer. We can only punish the wrapper.
What compositional reputation would look like
The shape is not exotic. It is what software composition already looks like, lifted to reputation.
Each tool — an MCP server, a deterministic analyzer, a specific sub-model — accumulates its own calibration score per domain, aggregated across the agents that use it. A solidity-audit analyzer that contributes to accurate attestations gains reputation on solidity-audit, regardless of which agent invoked it.
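To make the per-tool aggregation concrete, here is a minimal sketch. The `ToolLedger` class, the tool and agent names, and the choice of a plain mean over scores in [0, 1] are all my own assumptions for illustration, not part of any existing registry.

```python
from collections import defaultdict
from typing import Optional


class ToolLedger:
    """Hypothetical ledger keyed by (tool, domain), not by agent."""

    def __init__(self) -> None:
        self._sum = defaultdict(float)
        self._count = defaultdict(int)

    def record(self, tool: str, domain: str, agent: str, score: float) -> None:
        # The invoking agent is accepted for context but does not affect
        # the key: credit accrues to the tool in that domain.
        key = (tool, domain)
        self._sum[key] += score
        self._count[key] += 1

    def reputation(self, tool: str, domain: str) -> Optional[float]:
        # Assumed aggregate: plain mean of calibration scores (higher is better).
        key = (tool, domain)
        n = self._count.get(key, 0)
        if n == 0:
            return None  # no track record in this domain yet
        return self._sum[key] / n


ledger = ToolLedger()
ledger.record("slither-wrapper", "solidity-audit", "agent-a", 0.9)
ledger.record("slither-wrapper", "solidity-audit", "agent-b", 0.7)
print(ledger.reputation("slither-wrapper", "solidity-audit"))
```

Two different agents contribute to the same tool score here, which is the point: the tool's track record outlives any single wrapper.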
Each agent declares its composition on registration: which LLM, which MCP servers, which tools. The agent’s domain reputation is then a function of two layers: a composite prior inherited from its declared tools, plus the orchestration quality measured by the delta between the components’ expected performance and the actual observed performance of the agent as a whole.
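A rough sketch of that two-part score, under assumptions I am adding for illustration: the composite prior is an unweighted mean of the declared tools' domain scores, and the orchestration delta is shrunk toward zero when the agent has few observations (the `shrinkage` parameter is hypothetical).

```python
from typing import Dict, List, Tuple


def composite_prior(tool_scores: Dict[str, float]) -> float:
    """Assumed combiner: unweighted mean of the declared tools' domain scores."""
    if not tool_scores:
        raise ValueError("an agent must declare at least one component")
    return sum(tool_scores.values()) / len(tool_scores)


def agent_reputation(
    tool_scores: Dict[str, float],
    observed_scores: List[float],
    shrinkage: float = 5.0,
) -> Tuple[float, float]:
    """Return (domain reputation, orchestration credit).

    Orchestration credit is the gap between the agent's observed mean score
    and its inherited prior, discounted while the track record is short.
    """
    prior = composite_prior(tool_scores)
    n = len(observed_scores)
    if n == 0:
        # A brand-new (or sybil) agent inherits the prior and nothing more.
        return prior, 0.0
    observed = sum(observed_scores) / n
    trust = n / (n + shrinkage)  # hypothetical shrinkage toward the prior
    credit = trust * (observed - prior)
    return prior + credit, credit


declared = {"strong-llm": 0.8, "slither-wrapper": 0.9}
print(agent_reputation(declared, []))
print(agent_reputation(declared, [0.95] * 5))
```

With no observations the agent starts at the prior of its parts rather than at zero, and it earns orchestration credit only as its own attestations accumulate.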
The scoring mechanism — proper scoring rules over prediction-confidence tuples, à la Brier — is what produces the raw signal at each attestation. The difference is that the signal flows through the stack, not against a wrapper. A tool accumulates reputation from the downstream performance it enables. An agent accumulates reputation only for what its orchestration adds on top of its components.
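For readers who have not met it, the Brier score for a binary outcome is just the squared gap between stated confidence and what happened. A minimal sketch, where the function names are mine and attestations are assumed to resolve to true or false:

```python
from typing import Iterable, Tuple


def brier(confidence: float, outcome: bool) -> float:
    """Squared error between stated confidence and the realized outcome (lower is better)."""
    return (confidence - (1.0 if outcome else 0.0)) ** 2


def mean_brier(attestations: Iterable[Tuple[float, bool]]) -> float:
    """Average Brier score over (confidence, outcome) tuples."""
    items = list(attestations)
    if not items:
        raise ValueError("need at least one resolved attestation")
    return sum(brier(c, o) for c, o in items) / len(items)


def expected_brier(reported: float, true_p: float) -> float:
    """Expected score when the event truly occurs with probability true_p."""
    return true_p * brier(reported, True) + (1.0 - true_p) * brier(reported, False)


print(mean_brier([(0.9, True), (0.6, False)]))
# Honest reporting beats overclaiming in expectation: that is what "proper" means.
print(expected_brier(0.7, 0.7) < expected_brier(0.99, 0.7))
```

Because the rule is proper, a component that inflates its confidence does worse in expectation than one that reports what it actually believes.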
This is also what gives the primitive its teeth against gaming. You cannot spam your way to a high score if “score” is a calibration measure over a declared composition, because a sybil agent with no orchestration track record gets no orchestration credit — it only inherits whatever its declared tool stack already brings.
The problems that aren’t solved
This is not clean. Three hard questions block naive implementation.
Attribution is a real problem. Given an agent made of N tools with observed aggregate performance, how do you assign calibration credit to each component? This is the Shapley value problem from cooperative game theory: each component is assigned its average marginal contribution across all possible orderings of use. Computed exactly, that requires a performance figure for every subset of components, so the cost grows exponentially with N. Domain-specific approximations exist: for tool-heavy domains (audit, verification, structured analysis), invocation-log attribution works acceptably; for LLM-heavy domains (synthesis, creative reasoning), attribution is closer to intractable and the model may need to collapse back to agent-level scoring. The protocol must accept that the decomposition depth is domain-dependent.
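To show what exact Shapley attribution involves, here is a small sketch that enumerates every ordering. The coalition scores in `performance` are invented numbers for three hypothetical components; in practice most of those subset scores are not even observable, which is part of the difficulty.

```python
from itertools import permutations
from typing import Callable, Dict, FrozenSet, List


def shapley_values(
    components: List[str],
    value: Callable[[FrozenSet[str]], float],
) -> Dict[str, float]:
    """Exact Shapley values by averaging marginal contributions over all orderings.

    This walks N! orderings, so it is only usable for very small N.
    """
    totals = {c: 0.0 for c in components}
    orderings = list(permutations(components))
    for order in orderings:
        coalition: FrozenSet[str] = frozenset()
        for c in order:
            with_c = coalition | {c}
            totals[c] += value(with_c) - value(coalition)
            coalition = with_c
    return {c: t / len(orderings) for c, t in totals.items()}


# Invented calibration scores for every subset of a three-component stack.
performance = {
    frozenset(): 0.0,
    frozenset({"llm"}): 0.5,
    frozenset({"slither"}): 0.3,
    frozenset({"symexec"}): 0.2,
    frozenset({"llm", "slither"}): 0.8,
    frozenset({"llm", "symexec"}): 0.6,
    frozenset({"slither", "symexec"}): 0.5,
    frozenset({"llm", "slither", "symexec"}): 0.9,
}

credit = shapley_values(["llm", "slither", "symexec"], performance.__getitem__)
print(credit)
```

On these made-up numbers the credit comes out to roughly 0.45 for the LLM, 0.30 for the Slither wrapper, and 0.15 for the symbolic executor, which sums to the full stack's 0.9. Three components already need eight subset scores and six orderings.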
Composition can be misrepresented. An agent can declare it uses Tool X while actually using Tool Y. Without verification, compositional reputation degenerates into self-reported marketing. The fix is either cryptographic attestation of tool invocation — TEE execution, signed invocation logs — or on-chain observability of MCP calls. Both have cost. Neither is universally solved.
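As a toy illustration of the signed-invocation-log idea: the sketch below uses an HMAC with a per-tool shared key purely as a stand-in. A real design would need asymmetric signatures or TEE attestation so the verifier cannot forge entries; the record fields and function names are assumptions of mine.

```python
import hashlib
import hmac
import json
from typing import Dict, List, Set


def sign_invocation(tool_key: bytes, record: dict) -> str:
    """Tag an invocation record with the tool's key (HMAC as a stand-in for a signature)."""
    payload = json.dumps(record, sort_keys=True).encode()
    return hmac.new(tool_key, payload, hashlib.sha256).hexdigest()


def log_matches_declaration(
    declared: Set[str],
    log: List[dict],
    tool_keys: Dict[str, bytes],
) -> bool:
    """True only if every logged call is validly tagged by a declared tool."""
    for entry in log:
        record, tag = entry["record"], entry["tag"]
        tool = record["tool"]
        if tool not in declared or tool not in tool_keys:
            return False  # agent used something it did not declare
        expected = sign_invocation(tool_keys[tool], record)
        if not hmac.compare_digest(expected, tag):
            return False  # record was altered or tagged with the wrong key
    return True


keys = {"tool-x": b"key-for-x", "tool-y": b"key-for-y"}
call = {"tool": "tool-x", "input_hash": "abc123", "nonce": 1}
log = [{"record": call, "tag": sign_invocation(keys["tool-x"], call)}]
print(log_matches_declaration({"tool-x"}, log, keys))
```

An agent that declares Tool X but actually calls Tool Y cannot produce a valid Tool X tag for those calls, so the mismatch shows up at verification time.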
Tools are themselves compositions. An MCP server is built on a deterministic analyzer, which depends on a static analysis library, which depends on a parser. Where does decomposition bottom out? The protocol has to define what it treats as atomic. This is a design choice rather than a derivation, and it will vary across domains.
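One way to picture the "what is atomic" choice is as a per-domain set that stops the walk down the dependency tree. A minimal sketch, with a made-up dependency graph and component names:

```python
from typing import Dict, List, Set


def scored_units(root: str, deps: Dict[str, List[str]], atomic: Set[str]) -> Set[str]:
    """Walk the dependency tree from root and stop at anything declared atomic.

    Components with no listed dependencies are treated as atomic by default.
    """
    units: Set[str] = set()
    seen: Set[str] = set()
    stack = [root]
    while stack:
        node = stack.pop()
        if node in seen:
            continue
        seen.add(node)
        children = deps.get(node, [])
        if node in atomic or not children:
            units.add(node)  # decomposition bottoms out here
        else:
            stack.extend(children)
    return units


# Hypothetical stack: agent -> MCP server -> analyzer -> library -> parser.
deps = {
    "agent": ["llm", "audit-mcp-server"],
    "audit-mcp-server": ["analyzer"],
    "analyzer": ["static-analysis-lib"],
    "static-analysis-lib": ["parser"],
}

print(scored_units("agent", deps, atomic={"analyzer"}))
print(scored_units("agent", deps, atomic=set()))
```

With the analyzer declared atomic, reputation is tracked for the LLM and the analyzer; with nothing declared, the walk runs all the way down to the parser. The graph is the same in both cases; only the protocol's choice differs.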
None of these are reasons to keep agents as black boxes. They are reasons to be specific about the model.
Why this matters
The standards layer is settling. ERC-8004 defines the identity slot at the protocol level. Its Reputation Registry records feedback, but how that feedback becomes a trust score is deliberately left open for pluggable trust models. What gets built into that slot over the next six to twelve months will anchor how agent reputation works for a long time afterward.
If what ships is “score the wrapper,” we lose the compositional layer before we even try to build it. Retrofitting later means every existing reputation dataset has to be re-derived with tool-level attribution, which is costly and may not be possible without the original invocation logs.
If what ships is compositional from the start, the system scales to the actual shape of the agent stack. Tools can be evaluated independently. Agents inherit informed priors. Consumers can inspect what they are trusting. And the feedback loop reaches the builders who actually control the quality of each layer.
Open question
The cleanest version of compositional reputation assumes orchestration quality can be cleanly separated from tool quality — that you can measure “what the agent itself adds on top of its components” as a distinct signal. In tool-heavy domains this is plausible: invocation logs reveal which tool detected what, and orchestration is the residual. In LLM-heavy domains it isn’t: the orchestration is the prompting, the routing, the retry logic, all of which are entangled with the model output itself.
Is the right answer to score orchestration only in domains where it can be cleanly isolated, and accept that in others the agent and its primary model collapse into a single scored entity? Or is there a more general way to attribute the orchestration delta that I’m missing?
Curious how others here are thinking about this…
Max.