Why Evaluator Accuracy Changes Everything About On-Chain Reputation
Category: Ideas & Brainstorming · Reputation Computation
Most reputation systems treat every signal equally.
Stake 1 TRUST to support an agent. Someone else stakes 1 TRUST to oppose. The system counts: 50/50. Done.
But here’s what’s missing: who staked matters as much as how much they staked.
A staker who backed 10 agents that all maintained trust above 50% has proven judgment. A staker who backed 10 agents that all crashed has proven the opposite. Yet their 1 TRUST carries the same weight.
That’s the gap we’ve been working on with AgentScore. And I think the approach could be useful for the broader Intuition ecosystem.
The Problem With Equal Weighting
Today, on-chain reputation is essentially a popularity contest weighted by capital. More TRUST staked = higher score. This works at scale but has clear failure modes:
- A whale can dominate any vault with raw capital
- A coordinated group of low-quality stakers can inflate scores
- There’s no incentive to evaluate carefully — just to evaluate
The result: plenty of noise, and no easy way to isolate the signal.
You can partially fix this with whale detection (we do — diversity-weighted ratios that penalize any single staker holding more than 25% of a side). But that only handles the capital dimension. It doesn’t address judgment quality.
Introducing Evaluator Accuracy
What if the system tracked how good each staker is at evaluating?
The concept is simple:
- Look at every position a staker has across all agents
- Check: did the agents they supported actually maintain trust? Did the agents they opposed actually lose trust?
- Turn that track record into a weight multiplier
The formula:

```
rawAccuracy = correctPicks / totalPicks
confidence  = 1 - e^(-totalPicks / 5)
adjusted    = 0.5 + (rawAccuracy - 0.5) × confidence
weight      = 0.5 + adjusted
```
This gives a range of 0.5x (consistently wrong) to 1.5x (consistently right).
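A minimal sketch of the formula in TypeScript. The function name and the zero-pick guard are illustrative; τ = 5 is the value we discuss in the open questions below.

```typescript
const TAU = 5; // evaluations needed to reach ~63% confidence

// Weight multiplier from a staker's track record, per the formula above.
function evaluatorWeight(correctPicks: number, totalPicks: number): number {
  if (totalPicks === 0) return 1.0; // new staker: neutral weight
  const rawAccuracy = correctPicks / totalPicks;
  const confidence = 1 - Math.exp(-totalPicks / TAU);
  const adjusted = 0.5 + (rawAccuracy - 0.5) * confidence;
  return 0.5 + adjusted; // ranges from 0.5x to 1.5x
}

console.log(evaluatorWeight(0, 0));   // neutral: 1.0
console.log(evaluatorWeight(5, 10));  // 50% accuracy stays neutral: 1.0
console.log(evaluatorWeight(20, 20)); // consistently right: ~1.49
```

Note that the weight converges toward 1.5x but never quite reaches it, since confidence only asymptotically approaches 1.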
Key properties:
- New staker = 1.0x. No penalty for being new. Neutral until you prove yourself.
- Random staker = 1.0x. If you just guess, your accuracy hovers around 50%, and confidence anchoring keeps you at neutral weight. The system doesn’t punish randomness — it rewards consistent accuracy.
- Confidence scales with sample size. One lucky pick doesn’t make you an Oracle. You need sustained accuracy across many positions.
- Self-staking excluded. You can’t build evaluator reputation by staking on agents you registered yourself.
Three Multipliers Per Position
Each staking position now carries three multipliers:
```
effectiveStake = amount × diversityWeight × evaluatorWeight
```

- Amount — how much you staked (economics)
- Diversity weight — are you a whale? (anti-manipulation)
- Evaluator weight — are you any good at evaluating? (signal quality)
This means two stakers putting in the same 1 TRUST can have different effective impact:
- Sage evaluator (85%+ accuracy): effective = 1 × 1.0 × 1.4 = 1.4 TRUST
- Scout evaluator (45% accuracy): effective = 1 × 1.0 × 0.95 = 0.95 TRUST
Same capital. Different influence. Because one has demonstrated better judgment.
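As a sketch (the `Position` shape is illustrative, and the weights mirror the example figures above), the combination looks like:

```typescript
interface Position {
  amount: number;          // TRUST staked (economics)
  diversityWeight: number; // anti-whale multiplier
  evaluatorWeight: number; // track-record multiplier
}

// Multiply the three factors into a single effective stake.
function effectiveStake(p: Position): number {
  return p.amount * p.diversityWeight * p.evaluatorWeight;
}

const sage:  Position = { amount: 1, diversityWeight: 1.0, evaluatorWeight: 1.4 };
const scout: Position = { amount: 1, diversityWeight: 1.0, evaluatorWeight: 0.95 };

console.log(effectiveStake(sage));  // 1.4
console.log(effectiveStake(scout)); // 0.95
```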
Evaluator Tiers
We introduced tiers to make track records visible:
| Tier | Requirements | Weight Range |
|---|---|---|
|  | < 3 evaluations | ~1.0x |
| Scout | 3+ evals, accuracy < 60% | 0.9–1.1x |
|  | 5+ evals, accuracy ≥ 60% | 1.1–1.2x |
|  | 10+ evals, accuracy ≥ 75% | 1.2–1.35x |
| Sage | 20+ evals, accuracy ≥ 85% | 1.35–1.5x |
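The tier thresholds can be encoded as a simple top-down check. This is an illustrative sketch, not the AgentScore implementation; "Scout" and "Sage" come from the examples in this post, and the other labels are placeholders.

```typescript
// Hypothetical tier classifier matching the table's thresholds.
// "tier-1", "tier-3", "tier-4" are placeholder names for unnamed tiers.
function evaluatorTier(evals: number, accuracy: number): string {
  if (evals >= 20 && accuracy >= 0.85) return "Sage";
  if (evals >= 10 && accuracy >= 0.75) return "tier-4";
  if (evals >= 5 && accuracy >= 0.60) return "tier-3";
  if (evals >= 3) return "Scout";  // 3+ evals that don't clear a higher bar
  return "tier-1";                 // fewer than 3 evaluations
}

console.log(evaluatorTier(25, 0.90)); // "Sage"
console.log(evaluatorTier(10, 0.45)); // "Scout"
```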
This creates a natural gamification: stakers compete not just to stake more, but to stake better. The evaluator leaderboard ranks the best evaluators in the ecosystem — a meta-reputation layer on top of agent reputation.
How This Handles Cold Start
This came up in a recent conversation with @repboiz — how do you prevent early bad actors from shaping initial weights?
Four properties of the system handle it:
- Confidence anchoring — new agents start at 50, and the score only deviates as real stake accumulates. Early bad actors with low total stake barely move the needle.
- Evaluator weight = 1.0x at start — you can’t walk in with outsized influence. You earn it across many evaluations.
- Soft gate — agents with a low support ratio get proportionally reduced scores. 30% support = score × 0.6. One honest oppose signal caps the damage.
- Min stake threshold — dust wallets don’t count toward diversity metrics. Spinning up 50 sybil wallets with 0.001 TRUST each does nothing.
The system is designed to be boring at cold start. Score sits near 50, nothing dramatic happens, and influence must be earned slowly. Early bad actors get a muted, expensive, temporary effect that decays as real participants arrive.
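One plausible reading of the soft gate: the score scales linearly with support ratio up to 50%, with no penalty above that. This is an assumption derived only from the 30% support = score × 0.6 example, not a description of the actual code.

```typescript
// Hypothetical soft gate: linear penalty below 50% support.
// 30% support -> multiplier 0.6, matching the example in the post.
function softGate(score: number, supportRatio: number): number {
  const multiplier = Math.min(1, supportRatio / 0.5);
  return score * multiplier;
}

console.log(softGate(80, 0.3)); // 48 (80 × 0.6)
console.log(softGate(80, 0.7)); // 80 (no penalty above 50% support)
```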
Why This Matters Beyond Agent Reputation
The evaluator accuracy pattern isn’t specific to AI agents. It could apply to any Intuition use case where staking quality matters:
- Topic trust circles — a staker who consistently backs credible people in crypto should have more weight than one who backs rugged projects
- Content curation — an evaluator who flags high-quality content early should gain influence
- Prediction markets — an expert who consistently predicts correctly should have more weight (this is what we’re building with SENSE)
The core primitive is: track record → weight → influence. It’s a meritocratic layer that sits on top of economic staking.
What’s Next
We’re exploring a few extensions:
- Domain-specific evaluator accuracy — you could be a great evaluator for DeFi agents but terrible at evaluating healthcare agents. Per-domain track records would reflect that.
- Evaluator decay — should old evaluations lose weight over time? If you were accurate 6 months ago but haven’t evaluated recently, should your weight persist?
- Cross-app evaluator portability — if your evaluator reputation is built on Intuition triples, other dApps could consume it. An evaluator with Sage status on AgentScore could carry that credibility into other reputation contexts.
- Integration with trust graph traversal — Sofia MCP’s EigenTrust computes trust propagation through the graph. Adding evaluator weights as a quality filter on those paths could produce higher-signal trust propagation. A path through a Sage evaluator is worth more than a path through a random staker.
Open Questions
I’d love the community’s perspective on:
- Should evaluator weight affect both support AND oppose sides equally? Currently it does. But maybe opposing is harder and should be weighted differently?
- What’s the right tau (τ) for confidence scaling? We use τ=5 (63% confidence at 5 evaluations). Should it be slower or faster?
- Should there be a cap on evaluator weight? We cap at 1.5x. Too low? Too high?
- How do you prevent evaluators from only staking on “safe” agents to farm accuracy? (e.g. only backing already-high-trust agents to inflate a track record)
If you’re building on Intuition and thinking about reputation computation, I think evaluator accuracy is worth considering. It’s the difference between “how much was staked” and “how much was staked by people who know what they’re talking about.”
Built on Intuition Protocol. Exploring programmable trust one layer at a time.