Why Evaluator Accuracy Changes Everything About On-Chain Reputation
Category: Ideas & Brainstorming · Reputation Computation
Most reputation systems treat every signal equally.
Stake 1 TRUST to support an agent. Someone else stakes 1 TRUST to oppose. The system counts: 50/50. Done.
But here’s what’s missing: who staked matters as much as how much they staked.
A staker who backed 10 agents that all maintained trust above 50% has proven judgment. A staker who backed 10 agents that all crashed has proven the opposite. Yet their 1 TRUST carries the same weight.
That’s the gap we’ve been working on with AgentScore. And I think the approach could be useful for the broader Intuition ecosystem.
The Problem With Equal Weighting
Today, on-chain reputation is essentially a popularity contest weighted by capital. More TRUST staked = higher score. This works at scale but has clear failure modes:
- A whale can dominate any vault with raw capital
- A coordinated group of low-quality stakers can inflate scores
- There’s no incentive to evaluate carefully — just to evaluate
The result: plenty of noise, and no easy way to isolate the signal.
You can partially fix this with whale detection (we do — diversity-weighted ratios that penalize any single staker holding more than 25% of a side). But that only handles the capital dimension. It doesn’t address judgment quality.
Introducing Evaluator Accuracy
What if the system tracked how good each staker is at evaluating?
The concept is simple:
- Look at every position a staker has across all agents
- Check: did the agents they supported actually maintain trust? Did the agents they opposed actually lose trust?
- Turn that track record into a weight multiplier
The formula:

```
rawAccuracy = correctPicks / totalPicks
confidence  = 1 - e^(-totalPicks / 5)
adjusted    = 0.5 + (rawAccuracy - 0.5) × confidence
weight      = 0.5 + adjusted
```
This gives a range of 0.5x (consistently wrong) to 1.5x (consistently right).
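A minimal sketch of the formula in TypeScript. The function name and the zero-pick guard are illustrative; τ = 5 is the value we discuss in the open questions below.

```typescript
const TAU = 5; // evaluations needed to reach ~63% confidence

// Weight multiplier from a staker's track record, per the formula above.
function evaluatorWeight(correctPicks: number, totalPicks: number): number {
  if (totalPicks === 0) return 1.0; // new staker: neutral weight
  const rawAccuracy = correctPicks / totalPicks;
  const confidence = 1 - Math.exp(-totalPicks / TAU);
  const adjusted = 0.5 + (rawAccuracy - 0.5) * confidence;
  return 0.5 + adjusted; // ranges from 0.5x to 1.5x
}

console.log(evaluatorWeight(0, 0));   // neutral: 1.0
console.log(evaluatorWeight(5, 10));  // 50% accuracy stays neutral: 1.0
console.log(evaluatorWeight(20, 20)); // consistently right: ~1.49
```

Note that the weight converges toward 1.5x but never quite reaches it, since confidence only asymptotically approaches 1.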
Key properties:
- New staker = 1.0x. No penalty for being new. Neutral until you prove yourself.
- Random staker = 1.0x. If you just guess, your accuracy hovers around 50%, and confidence anchoring keeps you at neutral weight. The system doesn’t punish randomness — it rewards consistent accuracy.
- Confidence scales with sample size. One lucky pick doesn’t make you an Oracle. You need sustained accuracy across many positions.
- Self-staking excluded. You can’t build evaluator reputation by staking on agents you registered yourself.
Three Multipliers Per Position
Each staking position now carries three multipliers:
```
effectiveStake = amount × diversityWeight × evaluatorWeight
```

- Amount — how much you staked (economics)
- Diversity weight — are you a whale? (anti-manipulation)
- Evaluator weight — are you any good at evaluating? (signal quality)
This means two stakers putting in the same 1 TRUST can have different effective impact:
- Sage evaluator (85%+ accuracy): effective = 1 × 1.0 × 1.4 = 1.4 TRUST
- Scout evaluator (45% accuracy): effective = 1 × 1.0 × 0.95 = 0.95 TRUST
Same capital. Different influence. Because one has demonstrated better judgment.
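As a sketch (the `Position` shape is illustrative, and the weights mirror the example figures above), the combination looks like:

```typescript
interface Position {
  amount: number;          // TRUST staked (economics)
  diversityWeight: number; // anti-whale multiplier
  evaluatorWeight: number; // track-record multiplier
}

// Multiply the three factors into a single effective stake.
function effectiveStake(p: Position): number {
  return p.amount * p.diversityWeight * p.evaluatorWeight;
}

const sage:  Position = { amount: 1, diversityWeight: 1.0, evaluatorWeight: 1.4 };
const scout: Position = { amount: 1, diversityWeight: 1.0, evaluatorWeight: 0.95 };

console.log(effectiveStake(sage));  // 1.4
console.log(effectiveStake(scout)); // 0.95
```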
Evaluator Tiers
We introduced tiers to make track records visible:
| Tier | Requirements | Weight Range |
|---|---|---|
|  | < 3 evaluations | ~1.0x |
| Scout | 3+ evals, accuracy < 60% | 0.9–1.1x |
|  | 5+ evals, accuracy ≥ 60% | 1.1–1.2x |
|  | 10+ evals, accuracy ≥ 75% | 1.2–1.35x |
| Sage | 20+ evals, accuracy ≥ 85% | 1.35–1.5x |
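The tier thresholds can be encoded as a simple top-down check. This is an illustrative sketch, not the AgentScore implementation; "Scout" and "Sage" come from the examples in this post, and the other labels are placeholders.

```typescript
// Hypothetical tier classifier matching the table's thresholds.
// "tier-1", "tier-3", "tier-4" are placeholder names for unnamed tiers.
function evaluatorTier(evals: number, accuracy: number): string {
  if (evals >= 20 && accuracy >= 0.85) return "Sage";
  if (evals >= 10 && accuracy >= 0.75) return "tier-4";
  if (evals >= 5 && accuracy >= 0.60) return "tier-3";
  if (evals >= 3) return "Scout";  // 3+ evals that don't clear a higher bar
  return "tier-1";                 // fewer than 3 evaluations
}

console.log(evaluatorTier(25, 0.90)); // "Sage"
console.log(evaluatorTier(10, 0.45)); // "Scout"
```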
This creates a natural gamification: stakers compete not just to stake more, but to stake better. The evaluator leaderboard ranks the best evaluators in the ecosystem — a meta-reputation layer on top of agent reputation.
How This Handles Cold Start
This came up in a recent conversation with @repboiz — how do you prevent early bad actors from shaping initial weights?
Four properties of the system handle it:
- Confidence anchoring — new agents start at 50, and the score only deviates as real stake accumulates. Early bad actors with low total stake barely move the needle.
- Evaluator weight = 1.0x at start — you can’t walk in with outsized influence. You earn it across many evaluations.
- Soft gate — agents with a low support ratio get proportionally reduced scores. 30% support = score × 0.6. One honest oppose signal caps the damage.
- Min stake threshold — dust wallets don’t count toward diversity metrics. Spinning up 50 sybil wallets with 0.001 TRUST each does nothing.
The system is designed to be boring at cold start. Score sits near 50, nothing dramatic happens, and influence must be earned slowly. Early bad actors get a muted, expensive, temporary effect that decays as real participants arrive.
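One plausible reading of the soft gate: the score scales linearly with support ratio up to 50%, with no penalty above that. This is an assumption derived only from the 30% support = score × 0.6 example, not a description of the actual code.

```typescript
// Hypothetical soft gate: linear penalty below 50% support.
// 30% support -> multiplier 0.6, matching the example in the post.
function softGate(score: number, supportRatio: number): number {
  const multiplier = Math.min(1, supportRatio / 0.5);
  return score * multiplier;
}

console.log(softGate(80, 0.3)); // 48 (80 × 0.6)
console.log(softGate(80, 0.7)); // 80 (no penalty above 50% support)
```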
Why This Matters Beyond Agent Reputation
The evaluator accuracy pattern isn’t specific to AI agents. It could apply to any Intuition use case where staking quality matters:
- Topic trust circles — a staker who consistently backs credible people in crypto should have more weight than one who backs rugged projects
- Content curation — an evaluator who flags high-quality content early should gain influence
- Prediction markets — an expert who consistently predicts correctly should have more weight (this is what we’re building with SENSE)
The core primitive is: track record → weight → influence. It’s a meritocratic layer that sits on top of economic staking.
What’s Next
We’re exploring a few extensions:
- Domain-specific evaluator accuracy — you could be a great evaluator for DeFi agents but terrible at evaluating healthcare agents. Per-domain track records would reflect that.
- Evaluator decay — should old evaluations lose weight over time? If you were accurate 6 months ago but haven’t evaluated recently, should your weight persist?
- Cross-app evaluator portability — if your evaluator reputation is built on Intuition triples, other dApps could consume it. An evaluator with Sage status on AgentScore could carry that credibility into other reputation contexts.
- Integration with trust graph traversal — Sofia MCP’s EigenTrust computes trust propagation through the graph. Adding evaluator weights as a quality filter on those paths could produce higher-signal trust propagation. A path through a Sage evaluator is worth more than a path through a random staker.
Open Questions
I’d love the community’s perspective on:
- Should evaluator weight affect both support AND oppose sides equally? Currently it does. But maybe opposing is harder and should be weighted differently?
- What’s the right tau (τ) for confidence scaling? We use τ=5 (63% confidence at 5 evaluations). Should it be slower or faster?
- Should there be a cap on evaluator weight? We cap at 1.5x. Too low? Too high?
- How do you prevent evaluators from only staking on “safe” agents to farm accuracy? (e.g. only backing already-high-trust agents to inflate a track record)
If you’re building on Intuition and thinking about reputation computation, I think evaluator accuracy is worth considering. It’s the difference between “how much was staked” and “how much was staked by people who know what they’re talking about.”
Built on Intuition Protocol. Exploring programmable trust one layer at a time.