Agentifact

Primary Sources

Canonical references for the autonomous agent ecosystem. Protocol specs, benchmarks, foundational papers, and official registries — the ground truth, not interpretations.

Protocol Spec

Agent-to-Agent Protocol Specification ↗Verified 2026-03-01

Google's A2A protocol specification — defines how autonomous agents discover, communicate with, and delegate tasks to other agents. Covers Agent Cards (/.well-known/agent.json), task lifecycle, SSE streaming, and multi-turn agent conversations.
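To make the discovery mechanism concrete, here is a minimal sketch of an Agent Card as a Python dict. The `/.well-known/agent.json` path comes from the spec summary above; the field names and values are illustrative assumptions, not a verbatim copy of the specification.

```python
import json

# Illustrative Agent Card: the discovery document an A2A agent serves at
# /.well-known/agent.json. Field names here are assumptions based on the
# summary above; consult the spec for the exact schema.
agent_card = {
    "name": "invoice-processor",
    "description": "Extracts line items from invoices and files them.",
    "url": "https://agents.example.com/invoice-processor",
    "capabilities": {"streaming": True},  # SSE streaming support
    "skills": [
        {"id": "extract", "description": "Parse an invoice into line items"}
    ],
}

# A client discovering this agent fetches the card and inspects it
# before delegating a task.
print(json.dumps(agent_card, indent=2))
```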

Model Context Protocol Specification ↗Verified 2026-03-01

The official MCP specification — defines the protocol for connecting AI models to external tools and data sources. Covers transport layers (stdio, HTTP+SSE), capability negotiation, tool schemas, resource access, and sampling. The authoritative source for any MCP implementation.
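As a sketch of the capability negotiation the spec describes: an MCP session opens with a JSON-RPC 2.0 `initialize` exchange in which client and server advertise capabilities. The version string and capability keys below are illustrative; check the spec for current values.

```python
import json

# Sketch of the JSON-RPC 2.0 request that opens an MCP session. The
# client advertises its capabilities; the server replies with its own.
initialize_request = {
    "jsonrpc": "2.0",
    "id": 1,
    "method": "initialize",
    "params": {
        "protocolVersion": "2024-11-05",
        "capabilities": {"sampling": {}},
        "clientInfo": {"name": "example-client", "version": "0.1.0"},
    },
}

# Over the stdio transport, each message travels as one line of JSON.
wire_message = json.dumps(initialize_request)
print(wire_message)
```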

OpenAI Function Calling Reference ↗Verified 2026-03-01

OpenAI's function calling specification — the interface that triggered widespread adoption of structured tool use in LLMs. Defines the JSON schema format for tool definitions and the tool_calls response format. Most LLMs now implement a compatible interface.
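The tool-definition shape the reference describes looks roughly like this; the enclosing request fields vary by provider, and the tool name and parameters below are invented for illustration.

```python
import json

# A tool definition in the JSON Schema format described above.
get_weather_tool = {
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {
                "city": {"type": "string", "description": "City name"},
                "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]},
            },
            "required": ["city"],
        },
    },
}

# In a tool_calls response, the arguments arrive as a JSON-encoded
# string that the caller must parse before dispatching the call.
tool_call_arguments = json.loads('{"city": "Oslo", "unit": "celsius"}')
print(tool_call_arguments["city"])  # → Oslo
```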

Benchmark

GAIA: General AI Assistants Benchmark ↗Verified 2026-03-01

Benchmark testing real-world assistant capabilities requiring multi-step reasoning, tool use, and web access. Questions require agents to browse the web, run code, and synthesize information across multiple steps. A strong proxy for general agent capability.

SWE-bench: Software Engineering Benchmark ↗Verified 2026-03-01

The primary benchmark for evaluating AI systems on real-world software engineering tasks. Consists of GitHub issues from popular Python repositories — agents must fix bugs by reading issue descriptions, navigating codebases, and writing patches. SWE-bench Verified is the curated subset with human-validated solutions.

τ-bench: Agent Benchmark for Real-World Tasks ↗Verified 2026-03-01

Benchmarks agents on realistic retail and airline customer service tasks requiring multi-turn tool use, policy adherence, and complex decision-making. Tests agents in environments where actions have real consequences and policies must be followed.

Paper

A Design System Governance Process (Brad Frost) ↗Verified 2026-03-09

Brad Frost's 10-step design system contribution process — the standard governance framework for maintaining consistency across design system changes. Steps from 'try existing component options first' through UX testing, final review, documentation, versioned release, and QA. Each contribution tracked through: submitted → under review → needs more information → approved for design → approved for development → scheduled for release → declined. Component acceptance criteria: usefulness, uniqueness, usability, consistency, versatility.
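The status flow above can be sketched as a toy state machine. The transition map is an assumption for illustration: Frost's process lists the statuses but does not define them as a formal state machine.

```python
# Hypothetical transition map over the contribution statuses listed above.
TRANSITIONS = {
    "submitted": {"under review"},
    "under review": {"needs more information", "approved for design", "declined"},
    "needs more information": {"under review"},
    "approved for design": {"approved for development", "declined"},
    "approved for development": {"scheduled for release", "declined"},
    "scheduled for release": set(),  # terminal
    "declined": set(),               # terminal
}

def advance(status, next_status):
    """Move a contribution to its next status, rejecting illegal jumps."""
    if next_status not in TRANSITIONS[status]:
        raise ValueError(f"cannot move from {status!r} to {next_status!r}")
    return next_status

print(advance("submitted", "under review"))  # → under review
```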

AI Code Technical Debt Crisis 2026-2027 (Pixelmojo) ↗Verified 2026-03-09

Comprehensive data compilation on AI-generated code quality: 84% of developers use AI tools, yet trust has dropped from 43% to 29%; code duplication is up 48%, refactoring down 60%, and code churn up 41%; 45% of AI-generated code contains security vulnerabilities, and Fortune 50 security findings are up 10x. Includes the Pixelmojo Mitigation Framework: short-term (governance + CI/CD gates), medium-term (hybrid workflows + upskilling), long-term (modular systems + 20-30% debt remediation budget). The data every builder needs to see.

Chain-of-Thought Prompting Elicits Reasoning in Large Language Models ↗Verified 2026-03-01

Wei et al.'s foundational paper showing that prompting models to produce intermediate reasoning steps (chain-of-thought) dramatically improves performance on complex tasks. The basis for all modern agent reasoning approaches including ReAct, extended thinking, and o1-style models.

Cognitive Debt (Margaret-Anne Storey) ↗Verified 2026-03-09

Introduces 'cognitive debt' — the gap between what code does and what developers understand about it. Unlike technical debt, it lives in the developer's head rather than in the codebase. Even if AI agents produce code that could be easy to understand, humans may have lost the plot: they no longer know what the program is supposed to do, how their intentions were implemented, or how to change it. Published February 2026, part of a convergent moment in which five independent groups identified the same structural problem in one week.

Design Token-Based UI Architecture (Martin Fowler / Andreas Kutschmann) ↗Verified 2026-03-09

The canonical reference for token architecture. Defines the three-layer hierarchy: Option Tokens (WHAT — available design options), Decision Tokens (HOW — contextual application), Component Tokens (WHERE — specific UI component mappings). Establishes the deployment pipeline: Check → Build (Style Dictionary) → Test (visual regression + Storybook) → Publish (npm with semver) → Notify teams. Git as source of truth, Figma as editing interface.
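The three-layer hierarchy can be sketched as chained references resolved from the component layer down to raw option values. Token names and values below are invented; real pipelines resolve these with tools like Style Dictionary.

```python
# WHAT: raw design options
option_tokens = {"color.blue.500": "#2563eb"}
# HOW: contextual decisions referencing options
decision_tokens = {"color.action.primary": "{color.blue.500}"}
# WHERE: component-level mappings referencing decisions
component_tokens = {"button.background": "{color.action.primary}"}

def resolve(name, layers):
    """Follow {reference} chains down through the token layers."""
    merged = {}
    for layer in layers:
        merged.update(layer)
    value = merged[name]
    while isinstance(value, str) and value.startswith("{"):
        value = merged[value.strip("{}")]
    return value

print(resolve("button.background",
              [option_tokens, decision_tokens, component_tokens]))
# → #2563eb
```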

From Prototype to Product: How Design Systems Prevent Vibe Coding Pitfalls (Supernova) ↗Verified 2026-03-09

Supernova's analysis of vibe coding's five critical problems at scale: architecture by accumulation, inconsistent UI and behavior, security and quality risks, documentation debt, and false confidence. Includes their practical 8-step playbook: publish naming/scoping contracts, ban raw values and require CI checks, document top 10 components with states and accessibility, sync Figma to code, establish versioned releases, track token coverage, provide approved prompts for AI tools, make system primitives the default.

How We Use AI to Turn Figma Designs into Production Code (monday.com) ↗Verified 2026-03-09

The most detailed public case study of design-system-enforced AI code generation (February 2026). Documents monday.com's 11-node LangGraph agentic architecture: Design System MCP derived from real sources of truth, translation detector, layout analyzer, token fetcher, component identifier — several running in parallel. Key insight: 'The difference between naive and agent-processed code is not visual polish. The difference is whether the code already conforms to the design system or whether that work is left to the developer.' The gold standard for production agentic design enforcement.

LLM Powered Autonomous Agents ↗Verified 2026-03-01

Lilian Weng's canonical overview of LLM agent architectures — covers planning, memory, tool use, and agent design patterns. The most widely cited non-paper reference in the agent design space. Still relevant despite being from 2023.

Machine-Readable Design Systems: Designing for AI as a User (Diana Wolosin, Indeed) ↗Verified 2026-03-09

Defines the AIX (AI Experience) framework — the principle that design systems must be structured for AI consumers, not just human consumers. Three-layer metadata architecture: WHAT (raw assets → structured metadata), HOW (implementation rules, prop types, accessibility), WHY (strategic intent, usage guidelines). Three-layer MCP configuration: Visual (Figma MCP), Implementation (Design System MCP), Bridge (Code Connect). Key quote: 'Just as UX shapes how humans behave in a system, the structure of a design system shapes how AI behaves when generating interfaces.'

ReAct: Synergizing Reasoning and Acting in Language Models ↗Verified 2026-03-01

The paper that introduced the ReAct (Reasoning + Acting) pattern — the foundation of most modern agent architectures. Shows how interleaving reasoning traces with action execution dramatically improves agent performance over pure chain-of-thought or pure tool use.

Spec-Driven Development (InfoQ) ↗Verified 2026-03-09

The definitive article on Spec-Driven Development (SDD) — the 2026 industry consensus approach to AI-assisted coding. SDD uses well-crafted software requirement specifications as prompts for AI coding agents: collaborate with AI to create clear specifications BEFORE coding, break work into small, testable increments, and have the AI generate unit tests for its own code. Also references GitHub's Spec Kit — an open-source toolkit for SDD that integrates with GitHub Copilot, Claude Code, and Gemini CLI. Thoughtworks and InfoQ both promote this as the 2026 standard.

Supercharge Your Design System with LLMs and Storybook MCP (Codrops) ↗Verified 2026-03-09

The primary technical guide to Storybook MCP — how the Component Manifest works, setup instructions, how agents use it for autonomous correction loops. Covers: structured JSON metadata exposing component lists, descriptions, props with types/defaults, example code from stories. Shows how agents run component tests, see what fails, and fix their own bugs. Dramatically reduces token consumption vs. loading entire codebases.

Verification Debt (Lars Janssen) ↗Verified 2026-03-09

Defines 'verification debt' — the growing gap between how fast code can be generated and how fast it can be validated. Unlike technical debt, which usually announces itself through mounting friction, verification debt breeds false confidence. Survey finding: 96% of developers don't fully trust AI-generated code to be functionally correct, but only 48% say they always check it before committing. Published March 2026.

Vibe Coding Is Not the Same as AI Engineering (Addy Osmani) ↗Verified 2026-03-09

Addy Osmani's influential piece distinguishing 'vibe coding' (undirected AI code generation) from 'AI engineering' (structured, specification-driven AI-assisted development). Key quote: 'AI tools are copilots, not autopilots.' Directly influenced the Spec-Driven Development movement. Written by Google Chrome's engineering manager — carries significant industry weight.

Writing a Good CLAUDE.md (HumanLayer) ↗Verified 2026-03-09

The canonical guide to writing effective CLAUDE.md files for AI coding agents. Key principles: three components (WHAT, WHY, HOW), keep under 300 lines (frontier LLMs reliably follow 150-200 instructions), progressive disclosure via referenced files, don't use Claude as a linter (use hooks), don't auto-generate (craft manually), point to file:line sources not code snippets. HumanLayer maintains their own CLAUDE.md at fewer than 60 lines. Essential reading for any agentic codebase.

Registry

Anthropic MCP Servers — Official Repository ↗Verified 2026-03-01

Anthropic's official collection of reference MCP server implementations. Includes servers for Brave Search, Google Drive, GitHub, Postgres, Slack, and more. The reference implementations to study when building new MCP servers.

Smithery — MCP Server Registry ↗Verified 2026-03-01

The primary public registry for MCP servers. Catalog of community and official MCP servers with installation instructions, capability descriptions, and usage counts. The fastest way to discover what MCP servers exist for a given integration.

Model Card

Claude Model Card & Safety Report ↗Verified 2026-03-01

Anthropic's official model card for Claude — capabilities, limitations, safety properties, and evaluation results. Includes information on tool use, extended thinking, and responsible deployment. Required reading for understanding Claude's agent-relevant capabilities.

GPT-4o System Card ↗Verified 2026-03-01

OpenAI's system card for GPT-4o — the multimodal model underlying most OpenAI API agents. Covers capabilities, safety evaluations, and deployment guidelines. The authoritative reference for understanding GPT-4o's tool-use and reasoning properties.

Standard

JSON Schema Specification ↗Verified 2026-03-01

The specification for JSON Schema — used to define the parameter schemas for LLM function calling. All major LLM providers (Anthropic, OpenAI, Google) use JSON Schema to describe tool inputs. Understanding JSON Schema is required for building well-defined agent tools.
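A minimal illustration of the role JSON Schema plays here: a parameter schema for a hypothetical tool, plus a toy check of the two rules agent tool calls hit most often (required keys and primitive types). Production code should use a full validator such as the jsonschema package; this sketch only demonstrates the idea.

```python
# Parameter schema for a hypothetical search tool.
schema = {
    "type": "object",
    "properties": {
        "query": {"type": "string"},
        "limit": {"type": "integer"},
    },
    "required": ["query"],
}

# Toy mapping from JSON Schema type names to Python types.
TYPES = {"string": str, "integer": int, "object": dict}

def check(args, schema):
    """Return True if args satisfies required keys and primitive types."""
    missing = [k for k in schema.get("required", []) if k not in args]
    wrong = [k for k, v in args.items()
             if k in schema["properties"]
             and not isinstance(v, TYPES[schema["properties"][k]["type"]])]
    return not missing and not wrong

print(check({"query": "mcp servers", "limit": 5}, schema))  # → True
print(check({"limit": "five"}, schema))                     # → False
```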

OpenAPI Specification ↗Verified 2026-03-01

The standard for describing RESTful APIs — widely used to define tool schemas for LLM function calling. When agents need to call REST APIs, OpenAPI schemas are the most common machine-readable format. Understanding OpenAPI is prerequisite for building API-calling agents.

W3C Design Tokens Specification (2025.10) ↗Verified 2026-03-09

The first stable version of the W3C Design Tokens specification, announced October 28, 2025. The production-ready, vendor-neutral format for sharing design decisions across tools and platforms. JSON format with application/design-tokens+json media type. Supports Display P3, Oklch, and all CSS Color Module 4 color spaces. Backed by Adobe, Amazon, Google, Microsoft, Meta, Figma, Shopify, Salesforce, and 20+ other organizations. The foundational standard for any machine-readable design system.
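As a sketch of the format: token files are nested JSON groups in which a token is any object carrying a `$value` (with an optional `$type`, inheritable from the group). The token names and values below are invented for illustration.

```python
import json

# A token file in the W3C Design Tokens JSON shape: groups nest freely,
# "$"-prefixed keys carry metadata, and "$value" marks a token.
tokens_json = """
{
  "color": {
    "$type": "color",
    "brand": { "$value": "oklch(0.65 0.2 250)" },
    "surface": { "$value": "#ffffff" }
  }
}
"""

def flatten(node, path=()):
    """Walk the group tree and yield (dotted-name, $value) pairs."""
    for key, child in node.items():
        if key.startswith("$"):
            continue  # group metadata such as $type
        if isinstance(child, dict) and "$value" in child:
            yield ".".join(path + (key,)), child["$value"]
        elif isinstance(child, dict):
            yield from flatten(child, path + (key,))

print(dict(flatten(json.loads(tokens_json))))
# → {'color.brand': 'oklch(0.65 0.2 250)', 'color.surface': '#ffffff'}
```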

© 2026 Agentifact. Independent editorial. Scores verified against live infrastructure.