The ATO Framework

Agent Tool Optimization (ATO) is the practice of optimizing your tools, APIs, and services so that AI agents can autonomously discover, select, and execute them. ATO is to the agent economy what SEO was to the search economy.

Why ATO matters now

The MCP ecosystem now has over 10,000 servers with 97 million monthly SDK downloads. AI agents are autonomously selecting tools — choosing which APIs to call, which services to use, which databases to query. When an agent needs a CRM tool, it picks from dozens of candidates. Research shows that 97.1% of tool descriptions contain quality defects, and tools with optimized descriptions are selected 3.6x more often.

This is the same dynamic that created the SEO industry: when search engines became the primary way humans found websites, optimizing for search became essential. Now that agents are becoming the primary way AI systems find tools, optimizing for agent selection is the next frontier.

Three stages of agent selection

Stage 1: Be recognized

AI systems know your tool exists. Your documentation is in LLM training data. Your site has llms.txt. Structured data is in place. This is where LLMO, GEO, and AEO operate — and where they stop. Being recognized is necessary but not sufficient.

Equivalent: appearing in search results (impression)

Stage 2: Be selected

When an agent searches for a tool, yours appears in the candidate list and gets chosen over competitors. Your tool name is searchable via BM25 and regex. Your description clearly states purpose, usage context, and return values. Your schema is precisely defined. This is ATO's core contribution — the layer LLMO does not cover.

Equivalent: getting the click (conversion)
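
To make the searchability point concrete, here is a minimal sketch of keyword-based retrieval over tool names using the open-source rank_bm25 package. The tool names and query are hypothetical, and real agent runtimes implement their own ranking, but the principle holds: descriptive, keyword-bearing names rank higher than generic ones.

# Sketch: how a BM25 index might rank tool names against an agent's query.
# Tool names and query are hypothetical; real runtimes use their own ranking.
from rank_bm25 import BM25Okapi

tool_names = ["get", "search_repositories", "fetch_user_profile"]
corpus = [name.replace("_", " ").split() for name in tool_names]

bm25 = BM25Okapi(corpus)
query = "search github repositories".split()

# "search_repositories" scores highest; the generic "get" scores zero.
for name, score in zip(tool_names, bm25.get_scores(query)):
    print(f"{name}: {score:.2f}")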

Stage 3: Be used reliably

Once selected, your tool executes successfully. Errors are handled gracefully. Responses are structured and useful. The agent comes back next time. This is where per-call revenue, retention, and long-term value are built. No other optimization framework addresses this stage.

Equivalent: repeat purchase (retention)

Four dimensions of ToolRank Score

ToolRank Score evaluates every tool across four dimensions. Each dimension is scored independently, so you know exactly what to fix and in what order.

Findability (25%)

Can agents discover your tool? Covers registry presence (Smithery, MCP Registry, npm), category tagging, verified/deployed status, llms.txt, and LLM training data inclusion. A tool that isn't registered anywhere has zero chance of being selected.

Clarity (35%) — highest weight

Can agents understand what your tool does? Evaluates six description components: purpose statement, usage examples, error handling, parameter descriptions, return descriptions, and constraints. Research has identified 18 "smell" categories — quality defects that mislead agents. Clarity has the highest impact on selection: improving the functionality description alone yields an 11.6% gain in selection rate.
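
For illustration, here is a hypothetical tool whose description covers all six components, sketched as a Python dict so each component can be annotated. The tool itself is invented.

# Hypothetical tool definition annotated with the six Clarity components.
tool = {
    "name": "convert_currency",
    "description": (
        "Converts an amount between two currencies using daily reference rates. "  # purpose
        "Example: convert 100 USD to EUR. "  # usage example
        "Returns the converted amount and the rate used. "  # return description
        "Rates refresh once per day, so intraday precision is not guaranteed. "  # constraint
        "Unknown currency codes raise INVALID_CURRENCY with the supported list."  # error handling
    ),
    "inputSchema": {
        "type": "object",
        "properties": {  # parameter descriptions
            "amount": {"type": "number", "description": "Amount to convert"},
            "from_currency": {"type": "string", "description": "ISO 4217 code, e.g. USD"},
            "to_currency": {"type": "string", "description": "ISO 4217 code, e.g. EUR"},
        },
        "required": ["amount", "from_currency", "to_currency"],
    },
}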

Precision (25%)

Is your interface precisely defined? Checks type definitions for every parameter, enum constraints for limited-value fields, default values for optional parameters, required field declarations, and error response structures. A vague schema leads to incorrect parameter filling and failed executions.
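
To see why error response structure matters, compare a vague failure payload with a structured, actionable one. Both payloads are hypothetical; the field names are illustrative, not a ToolRank requirement.

# Vague: the agent learns nothing and cannot recover.
vague_error = {"error": "request failed"}

# Structured: the agent can correct the parameter and retry.
structured_error = {
    "error": {
        "code": "INVALID_ENUM_VALUE",
        "message": "sort must be one of: stars, forks, updated (got 'popularity')",
        "parameter": "sort",
        "allowed_values": ["stars", "forks", "updated"],
        "retryable": True,
    }
}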

Efficiency (15%)

Are you token-efficient? Measures the estimated token cost of your tool definitions, total tool count (5-15 is optimal; accuracy degrades past 20), naming conventions for search compatibility, and execution speed. GitHub cut its MCP server from 40 tools to 13, and its benchmark results improved.
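
Combining the weights above, the composite is a plain weighted sum. A minimal sketch follows; the dimension scores are made-up inputs, and the checks that produce them live in the open-source scoring logic.

# Weighted composite over the four dimensions, using the weights above.
WEIGHTS = {"findability": 0.25, "clarity": 0.35, "precision": 0.25, "efficiency": 0.15}

def toolrank_score(scores: dict[str, float]) -> float:
    """Each dimension is scored 0-100 independently, then weighted and summed."""
    return sum(WEIGHTS[dim] * scores[dim] for dim in WEIGHTS)

# Made-up example: strong clarity, weak efficiency.
print(toolrank_score({"findability": 80, "clarity": 90, "precision": 75, "efficiency": 40}))
# -> 76.25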

ATO maps directly to SEO

If you understand SEO, you already understand ATO. The concepts are parallel — only the target has changed from search engines to AI agents.

SEO concept → ATO equivalent
Technical SEO → MCP tool definitions, API schemas, Agent Cards
Content SEO → Documentation AI-readability, llms.txt (= LLMO)
Backlinks → Registry presence, cross-tool references, ecosystem trust
PageRank / Domain Authority → ToolRank Score
Google Search Console → ToolRank Monitor
Algorithm updates → Agent selection algorithm changes (Tool Search, registry ranking)
Core Web Vitals → Execution success rate, response latency

Before & after: a real example

The same tool, two definitions. One scores 52/100. The other scores 96/100. The difference is a few minutes of work.

Before — 52/100

{
  "name": "get",
  "description": "gets data from the api"
}

Generic name. Vague description. No schema. No context.

After — 96/100

{
  "name": "search_repositories",
  "description": "Searches for GitHub repositories
    matching a query. Useful for finding
    open-source projects or checking if a
    repo exists. Returns name, description,
    stars, language, and URL.",
  "inputSchema": {
    "type": "object",
    "properties": {
      "query": {
        "type": "string",
        "description": "Search query"
      },
      "sort": {
        "type": "string",
        "enum": ["stars", "forks", "updated"],
        "default": "stars"
      }
    },
    "required": ["query"]
  }
}

Specific name. Clear purpose + context + return value. Typed schema with enums and defaults.

Evidence: Score predicts selection

We ran 500 simulated tool selection rounds with three quality levels. An agent was asked to choose between tools of varying description quality for the same task. The results:

Low quality (59/100): 1.8% selection rate
Medium quality (83/100): 12.8% selection rate
High quality (95/100): 85.4% selection rate

Pearson correlation: r = 0.828 (score vs selection probability). 500 rounds, 3 categories, local simulation.

This aligns with the academic finding (arXiv 2602.18914) that quality-compliant descriptions achieve 72% selection probability versus 20% baseline — a 3.6x advantage.

Methodology: selection_sim.py (open source). This is a local heuristic-based simulation, not an LLM selection test. It models the signals agents use (description quality, schema completeness) but is not a substitute for real-world selection data. Claude API validation is planned to strengthen external reproducibility.
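
For intuition, here is a minimal sketch of the kind of round such a simulation might run, assuming softmax choice over quality scores. The softmax assumption is for illustration only; the actual selection_sim.py heuristic may differ.

import math
import random

def pick_tool(scores: list[float], temperature: float = 8.0) -> int:
    # Softmax choice over scores is an illustrative assumption,
    # not the exact heuristic selection_sim.py implements.
    weights = [math.exp(s / temperature) for s in scores]
    return random.choices(range(len(scores)), weights=weights)[0]

# 500 rounds over the three quality tiers above (59, 83, 95).
wins = [0, 0, 0]
for _ in range(500):
    wins[pick_tool([59, 83, 95])] += 1
print(wins)  # the high-scoring tool wins the large majority of rounds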

Score architecture

ToolRank Score evaluates tools across three progressive layers. Each layer adds depth while keeping the previous layer's score stable and reproducible.

Layer 1: Spec Quality

Live — v1.0.0

Rule-based scoring across 4 dimensions (Findability, Clarity, Precision, Efficiency). 14 checks, deterministic, zero cost. Free for everyone. Weights auto-calibrated weekly.
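
Here is a flavor of what one deterministic check might look like, sketched in Python. The rule is illustrative only; the 14 production checks are in the open-source scoring logic.

def check_typed_parameters(tool: dict) -> bool:
    # Illustrative rule: pass only if a schema exists and every declared
    # parameter has an explicit type. Deterministic, so the same input
    # always yields the same result, at zero cost.
    props = tool.get("inputSchema", {}).get("properties", {})
    if not props:
        return False  # no schema at all, like the 52/100 "get" example above
    return all("type" in param for param in props.values())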

Layer 2: Selection Performance

Live — v2.0.0

LLM-based selection testing using Claude Sonnet API. 100-round tournaments measure real agent preference. Results stored in trust_tiers with Selection Verified badge at ≥70% win rate.

Measurement specification:

  • Model: Claude Sonnet 4 (fixed version for reproducibility)
  • Method: 100-round tournament per tool. Each round presents 4 competing tools for a realistic task.
  • Metric: Selection rate (% of rounds tool is chosen). ≥70% = Selection Verified.
  • Frequency: Monthly automated via GitHub Actions
  • Reproducibility: Full evaluation scripts and prompts open-sourced
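
A condensed sketch of one tournament round using the Anthropic Python SDK. The prompt wording and pinned model ID here are stand-ins; the actual prompts ship with the open-sourced evaluation scripts.

import json
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

def run_round(task: str, candidates: list[dict]) -> str:
    # Present 4 competing tool definitions and ask the model to pick one.
    prompt = (
        f"Task: {task}\n\nAvailable tools:\n{json.dumps(candidates, indent=2)}\n\n"
        "Reply with only the name of the single best tool for this task."
    )
    response = client.messages.create(
        model="claude-sonnet-4-20250514",  # assumed ID; pin one version for reproducibility
        max_tokens=16,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.content[0].text.strip()

# Selection rate = wins / 100 rounds; ≥70% earns Selection Verified.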

Layer 3: Execution Reliability

Live — v2.0.0

Runtime testing of deployed MCP servers. Health checks, latency, MCP handshake, tool listing, and error quality. ≥90% success rate = Runtime Verified badge.

Measurement specification:

  • Health check: HTTP connectivity and response code
  • Latency: Response time in milliseconds
  • MCP handshake: JSON-RPC initialize protocol compliance
  • Tool listing: Ability to enumerate available tools
  • Error quality: Actionable error messages on invalid input
  • Threshold: ≥90% tests passed = Runtime Verified
  • Script: open-sourced
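
A minimal sketch of the handshake portion, assuming the server speaks MCP's streamable HTTP transport at a hypothetical URL. The open-sourced script covers the remaining checks.

import time
import requests

URL = "https://example-mcp-server.dev/mcp"  # hypothetical endpoint

# JSON-RPC initialize request per the MCP specification.
init_request = {
    "jsonrpc": "2.0",
    "id": 1,
    "method": "initialize",
    "params": {
        "protocolVersion": "2025-03-26",  # assumed protocol revision
        "capabilities": {},
        "clientInfo": {"name": "toolrank-probe", "version": "0.1"},
    },
}

start = time.monotonic()
resp = requests.post(
    URL,
    json=init_request,
    headers={"Accept": "application/json, text/event-stream"},
    timeout=10,
)
latency_ms = (time.monotonic() - start) * 1000

# Health check: HTTP status. Latency check: elapsed milliseconds.
# Handshake check: a JSON-RPC result carrying the server's info.
print(resp.status_code, f"{latency_ms:.0f} ms")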

Integrate into your workflow

Score in browser

Paste JSON or enter a Smithery server name.

toolrank.dev/score →

MCP Server

npx @toolrank/mcp-server

Smithery →

GitHub Actions CI

Score on every PR. Fail below threshold.

Setup guide →

Badge

Show your score in your README.

Get badge →

Transparency

Score version: v1.0.0 (2026-03-29). Changelog

Scoring logic: Open source. 14 rule-based checks, deterministic, auditable.

Weight calibration: Auto-calibrated weekly against real-world usage data. Current weights

Data sources: Smithery Registry + Official MCP Registry (multi-registry scanning).

Governance: GOVERNANCE.md. Dispute resolution via GitHub Issues.

Ready to optimize?

Start with a free ToolRank Score. See exactly where your tools stand and what to fix.

Score your tools →

View scoring logic →