AEC-Bench: A Multimodal Benchmark for Agentic Systems in Architecture, Engineering, and Construction


NEWS · April 2, 2026

If you work at an AEC firm and want to understand what these results mean for your practice, read our companion post: What Does AEC-Bench Mean for AEC Firms?

Architecture, engineering, and construction is a $13 trillion global industry. While AI agents have reshaped software engineering workflows (with coding agents now augmenting tens of millions of developers), the built world represents the next great frontier for AI — one that has yet to be unlocked. Not because of a lack of ambition, but because of a fundamental mismatch between how AI agents perceive information and how the built world communicates.

Today, we are releasing AEC-Bench — the first multimodal benchmark for evaluating AI agents on real-world architecture, engineering, and construction tasks. With 196 task instances across 9 task families, AEC-Bench provides the industry's first rigorous measurement framework for AI agent capability in construction coordination workflows. Our results reveal that domain-specific agent design — not just larger models — is the key to unlocking AI in the built world.

We openly release our benchmark dataset, agent harness, and evaluation code at github.com/nomic-ai/aec-bench under an Apache 2.0 license for full replicability.

AEC-Bench Results

[Figure: Mean score by agent harness (Pass@1) for the Nomic Agent, OpenAI Codex, and Anthropic Claude across three scopes: Intra-Sheet (single-page understanding), Intra-Drawing (multi-sheet reasoning), and Intra-Project (cross-document coordination). Best configuration per harness family shown; higher is better.]

Why the Built World Breaks AI Agents

Construction documents are among the most information-dense artifacts produced in any industry. A single drawing set can span hundreds of pages of tightly packed annotations, callouts, linework, and cross-references that require sophisticated visual reasoning to interpret. These are not digital-native documents like source code — they are 2D views exported from 3D modeling software, where meaning is conveyed through structured visual and textual elements including plans, details, callouts, notes, and title blocks.

[Figure: Construction drawing example]

Coordination failures between architects, engineers, and construction teams are the main drivers of scheduling delays and budget overruns in construction projects. During pre-construction, design, engineering, and handoff, many delays arise from inconsistencies introduced while authoring and revising drawing sets and project documents. In response, industry teams rely on standardized review and coordination workflows that demand deep professional experience and multimodal reasoning — exactly the kind of work AI agents should help with, but currently cannot reliably perform.

Standard coding agents — systems like Claude Code and Codex that excel at navigating source code repositories — approach construction documents with the same strategies they use for code: text extraction, keyword search, and image rendering. But construction documents are fundamentally different. Text extraction tools like pdftotext collapse spatial layout and geometric relationships into linear text, discarding the visual structure that carries critical meaning. Vision-based tools lack the precision needed for reliable geometric reasoning. The result is a systematic failure mode: agents retrieve incomplete or incorrect context, leading to compounding errors.
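To make this failure mode concrete, here is a minimal toy sketch of how left-to-right, top-to-bottom linearization separates a callout from the label it annotates, while even simple spatial grouping keeps them together. The spans and coordinates are hypothetical, not output from any real parser.

```python
# Toy illustration: why linear text extraction loses drawing structure.
# Each span is (text, x, y) in page coordinates -- hypothetical data.
spans = [
    ("DETAIL 3/A-501", 100, 700),  # callout above its own label
    ("WALL SECTION",   110, 680),
    ("DETAIL 7/A-502", 400, 700),  # unrelated callout on the same row
    ("ROOF EDGE",      410, 680),
]

# pdftotext-style linearization: read top-to-bottom, left-to-right.
# The two callouts end up adjacent, detached from their own labels.
linear = " ".join(t for t, _, _ in sorted(spans, key=lambda s: (-s[2], s[1])))

# Spatial grouping: cluster spans by horizontal proximity first, so each
# callout stays attached to the label directly beneath it.
def group_by_column(spans, tol=50):
    cols = {}
    for text, x, y in spans:
        key = min(cols, key=lambda k: abs(k - x), default=None)
        if key is None or abs(key - x) > tol:
            key = x
            cols[key] = []
        cols[key].append((y, text))
    return [" ".join(t for _, t in sorted(col, reverse=True))
            for col in cols.values()]

grouped = group_by_column(spans)
print(linear)   # callouts interleaved with the wrong labels
print(grouped)  # each callout paired with its own label
```

The point of the sketch is not the clustering heuristic itself, but that any faithful representation of a drawing has to preserve 2D position, which plain text extraction throws away.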

Our evaluation reveals just how ingrained this behavior is: across all models evaluated, 77% of agent trajectories invoked pdftotext as their primary information extraction strategy, effectively treating complex multimodal construction documents as flat text files. Codex-based agents relied entirely on Bash (100% of interactions), executing every action as a shell command. This coding-oriented tool repertoire is fundamentally mismatched with the demands of multimodal AEC document coordination.

Introducing AEC-Bench

AEC-Bench is a multimodal benchmark grounded in real-world construction coordination workflows, developed in collaboration with domain experts including practicing architects and engineers. We curated 196 task instances across 9 task families, organized into three scopes of increasing complexity based on how much document context an agent needs to complete the coordination task:

Intra-Sheet — Tasks solvable from a single page. Checking whether callouts match referenced elements, verifying detail titles, or reviewing a local assembly. The focus is on understanding relationships between text and multimodal drawing elements on one page.

Intra-Drawing — Tasks requiring reasoning across multiple sheets within the same drawing set. Validating cross-references, comparing sheet indices against title blocks, and tracing details across views. These demand navigating between pages and tracking related information across the set.

Intra-Project — Tasks involving multiple documents: drawings, specifications, and submittals. Identifying conflicts between specs and drawings, or evaluating submittals for compliance. These reflect real project-level coordination where relevant information is distributed across different documents.

Each task instance consists of a natural-language instruction, a sandboxed execution environment with real construction documents sourced from public-sector projects, and an automated verifier graded against ground truth established by professional engineers and architects.
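A task instance as described above can be sketched as a simple record. This is a hypothetical schema for illustration only; the field names and the example values (including the task-family name) are assumptions, not the released dataset format.

```python
from dataclasses import dataclass

# Hypothetical sketch of an AEC-Bench task instance. The actual schema
# in the released dataset may differ; field names are assumptions.
@dataclass
class TaskInstance:
    task_family: str      # one of the 9 task families
    scope: str            # "intra-sheet" | "intra-drawing" | "intra-project"
    instruction: str      # natural-language coordination task
    documents: list[str]  # paths to real construction PDFs in the sandbox
    ground_truth: dict    # expert-established answer used by the verifier

example = TaskInstance(
    task_family="sheet-index-consistency",  # hypothetical family name
    scope="intra-drawing",
    instruction="Find all sheet index entries that don't match the "
                "actual title block on that sheet.",
    documents=["CivicCenter_ArchDrawings_PermitSet.pdf"],
    ground_truth={"mismatched_sheets": ["A-201"]},
)
print(example.scope)
```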

AEC-Bench Benchmark Composition

[Figure: Benchmark composition: 196 instances, 9 task families, 3 scopes. Intra-Sheet (single drawing sheet), Intra-Drawing (multiple sheets, one set), and Intra-Project (drawings, specs & submittals).]
The Nomic Agent: Leading Performance Across AEC Tasks

We evaluated agent performance across two general-purpose coding-agent harness families — Codex (GPT-5.2, GPT-5.4) and Claude Code (Opus 4.6, Sonnet 4.6) — in two configurations: H (base harness with standard tools) and H+N (base harness augmented with Nomic tools including Nomic Parse and Nomic Embeddings). We also evaluated the fully integrated Nomic Agent, our domain-specific agent harness built from the ground up for built world workflows.

The Nomic Agent achieves the highest overall performance, establishing itself as the leading agent harness for AEC tasks:

  • Intra-Sheet: 70.6 mean reward — 8.1 points ahead of the next best configuration, demonstrating superior single-page multimodal understanding
  • Intra-Drawing: 88.3 mean reward — leading all harness configurations on multi-sheet reasoning
  • Intra-Project: 62.0 mean reward — outperforming all other configurations on the most complex cross-document coordination tasks

AEC-Bench Task Simulation (Intra-Drawing)

The agent receives CivicCenter_ArchDrawings_PermitSet.pdf (47 pp.), which contains, among other elements, a sheet index and per-sheet title blocks:

Sheet Index
  No.     Description
  G-001   General Notes & Abbreviations
  A-101   Site Plan
  A-201   Floor Plan — Level 1
  A-202   Floor Plan — Level 2
  A-301   Exterior Elevations

Title Block (Sheet A-201)
  Project:      Civic Center Renovation, Portland, OR · Permit Set 2024-03
  Sheet Title:  First Floor Plan
  Sheet No.:    A-201
  Drawn by:     KL
  Date:         2024-03-15
  Scale:        1/8"=1'-0"

Task: Find all sheet index entries that don't match the actual title block on that sheet.
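The comparison at the heart of this task can be sketched as a small verifier. The normalization rule here (lowercase, punctuation-insensitive exact match) is an assumption for illustration; the benchmark's actual verifiers are graded against expert ground truth and may match differently.

```python
import re

def normalize(title: str) -> str:
    """Lowercase and strip punctuation so trivial formatting
    differences don't count as mismatches (an assumed rule)."""
    return re.sub(r"[^a-z0-9 ]", "", title.lower()).strip()

def find_mismatches(sheet_index: dict, title_blocks: dict) -> list:
    """Return sheet numbers whose index description differs from the
    title printed in that sheet's own title block."""
    return [
        no for no, desc in sheet_index.items()
        if no in title_blocks
        and normalize(desc) != normalize(title_blocks[no])
    ]

# Data from the example above (abridged).
sheet_index = {
    "A-201": "Floor Plan — Level 1",
    "A-202": "Floor Plan — Level 2",
}
title_blocks = {
    "A-201": "First Floor Plan",      # disagrees with the index entry
    "A-202": "Floor Plan — Level 2",
}
print(find_mismatches(sheet_index, title_blocks))
```

In the simulation above, sheet A-201 is indexed as "Floor Plan — Level 1" but titled "First Floor Plan" in its title block, so a correct agent must flag it. The hard part for an agent is not this comparison but retrieving both fields reliably from the drawing set in the first place.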

Retrieval Is the Bottleneck, Not Reasoning

The most impactful finding from AEC-Bench is that retrieval is the primary bottleneck for AI agents in AEC. Agents frequently fail before reaching the core reasoning step because they cannot reliably locate the relevant sheet, detail, or document. Once the correct context is retrieved, performance improves substantially.

Augmenting base harnesses with Nomic tools — structured representations of drawings generated using domain-specific models like Nomic Parse and Nomic Embeddings — produced dramatic improvements on retrieval-sensitive tasks across all foundation model families:

  • Detail-technical-review: +32.2 points average improvement across models
  • Spec-drawing-sync: +20.8 points average improvement
  • Drawing-navigation: +18.75 points average improvement, with multiple models achieving a perfect 100% score

These gains held consistently across foundation model families — from GPT-5.4 to Sonnet 4.6 — demonstrating that structured, domain-specific document representation uniformly improves agent performance regardless of the underlying model.
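The retrieval step these tools improve can be sketched minimally as embedding-based sheet lookup. A toy bag-of-words vector stands in for a real embedding model (we do not reproduce Nomic Embeddings here), and the sheet descriptions are hypothetical.

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    """Toy bag-of-words 'embedding'; a real system would use a
    learned embedding model instead."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# Hypothetical per-sheet text summaries produced by a document parser.
sheets = {
    "A-201": "first floor plan partitions door schedule",
    "A-301": "exterior elevations north south facade",
    "S-101": "foundation plan footing details rebar",
}

def retrieve(query: str, k: int = 1) -> list:
    """Return the k sheets most similar to the query."""
    q = embed(query)
    ranked = sorted(sheets, key=lambda s: cosine(q, embed(sheets[s])),
                    reverse=True)
    return ranked[:k]

print(retrieve("footing reinforcement details"))
```

The design point is that the agent queries a structured index of the drawing set rather than grepping flattened text, which is what makes the downstream reasoning step tractable.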

However, tasks requiring precise visual-spatial grounding — such as note-callout-accuracy and cross-reference-tracing — showed that document parsing alone is not sufficient. These tasks require tracing leader lines and interpreting geometric relationships purely visually, which remains a concrete failure mode for current foundation model harnesses. This finding points toward the need for deeper visual grounding capabilities in next-generation agent systems.

Open Source for the Industry

We believe the AEC industry deserves rigorous, open benchmarks to drive progress in AI adoption. As agent benchmarks evolve, incorporating realistic document structures and workflows will be critical for understanding agent behavior in applied settings. AEC-Bench is our contribution toward that goal.

The full benchmark, agent harnesses, and evaluation code are available at github.com/nomic-ai/aec-bench under an Apache 2.0 license. We encourage researchers and practitioners to build upon this work and push the boundaries of what AI agents can achieve in the built world.

Future work will expand the benchmark to include larger drawing sets, more AEC disciplines, and additional task families. We are developing improved verifiers that assess evidence grounding and reasoning steps, and designing agentic systems built specifically for document navigation — systems that iteratively explore drawing sets, maintain spatial memory on sheets, and retrieve evidence regions before reasoning.

To learn more about how Nomic is building the AI platform for the built world, visit our Platform or book a demo.

Harsh Mankodiya, Chase Gallik, Theodoros Galanos (Aurecon), Andriy Mulyar
Nomic

