What Does AEC-Bench Mean for AEC Firms?

BLOG · April 2, 2026

This week, we released AEC-Bench — the first open-source benchmark for evaluating AI agents on real architecture, engineering, and construction tasks. The research paper details model architectures, scoring methodologies, and agent trajectories. But if you run an AEC firm, manage a project team, or lead a design practice, the question that matters is simpler: what does this mean for my work?

The answer is significant. AEC-Bench proves — with measurable, reproducible evidence — that domain-specific AI dramatically outperforms general-purpose tools on the tasks your teams do every day. And the performance gap is not small.

The Core Finding: General-Purpose AI Fails on AEC Workflows

AEC-Bench evaluated the leading AI agents — Claude Code, Codex, and others — on 196 real-world construction coordination tasks: reviewing drawing details, tracing cross-references, checking submittals against specifications, verifying code compliance. The same work your project teams do every day.

The results were stark. General-purpose agents treated construction drawings like source code — relying on text extraction and keyword search on documents whose meaning is fundamentally visual and spatial. These tools collapse the rich visual structure of construction documents into flat text, losing the spatial relationships that carry critical engineering meaning. Annotations detached from the elements they describe. Cross-references stripped of their geometric context. Drawing details reduced to character strings.

The Nomic Agent — purpose-built for AEC workflows — outperformed every general-purpose configuration across all three task complexity levels. On detail technical review tasks, adding Nomic's domain-specific tools improved performance by an average of 32 points. On specification-drawing synchronization, the improvement averaged 21 points.

[Chart: AEC-Bench Results — Mean Score by Agent Harness (Pass @ 1). Compares the Nomic Agent, OpenAI Codex, and Anthropic Claude across three task complexity levels: Intra-Sheet (single-page understanding), Intra-Drawing (multi-sheet reasoning), and Intra-Project (cross-document coordination). Best configuration per harness family shown. Higher is better.]

Drawing Reviews: Catching What Humans Spend Days Finding

Drawing review is the single most time-intensive QA/QC workflow in most design firms. A senior architect or engineer opens each sheet, scans for inconsistencies, cross-references against previous submissions, and marks up issues. For a 200-sheet drawing set, this process takes days — sometimes weeks.

AEC-Bench tested exactly these workflows. Tasks like detail-title-accuracy — verifying that drawing detail titles match the content actually shown — and cross-reference-resolution — identifying references that point to non-existent targets — are the bread and butter of drawing review. These are the checks that consume the bulk of reviewer time and are most prone to human fatigue errors.

The Nomic Agent scored 70.6 on intra-sheet tasks and 88.3 on intra-drawing tasks, leading all configurations. For firms, this translates directly to practical value: an AI first-pass review that catches the routine issues — missing dimensions, broken cross-references, inconsistent annotations — before a senior reviewer ever opens the set. Your most experienced people spend their time on design judgment, not hunting for typos in title blocks.

Submittal Reviews: From Weeks to Hours

Submittal review is one of the most consequential yet repetitive workflows in construction. For every major building component, a contractor submits product data for the design team's review. On a large project, this means thousands of individual submittals, each requiring comparison against drawings and specifications.

AEC-Bench includes submittal-review as a dedicated task family — the largest in the benchmark with 36 task instances. These tasks require the agent to evaluate submitted product data against project specifications and drawings, identifying deviations and assessing compliance — exactly the workflow that bogs down project teams and creates bottlenecks in construction schedules.

The Nomic Agent achieved the highest score on these intra-project tasks at 62.0, demonstrating the ability to coordinate information across multiple document types — the specs, the drawings, and the submittals themselves. For firms, this means AI that can perform the first-pass comparison that currently consumes the majority of reviewer time, flagging deviations and surfacing the specification requirements that matter for each submitted product.

Specification-Drawing Coordination: Eliminating Costly Conflicts

One of the most expensive failure modes in construction is a conflict between the specifications and the drawings. The spec calls for one product; the drawing shows another. The spec requires a certain performance rating; the drawing detail does not meet it. These conflicts, when caught late, generate RFIs, change orders, and delays that can cost orders of magnitude more than catching them during design.

Spec-drawing-sync tasks in AEC-Bench test exactly this — identifying conflicts between specification documents and drawing sets. Adding Nomic's tools improved performance by an average of 20.8 points across all foundation models tested. The Nomic Agent's ability to parse both text-heavy specifications and visually dense drawings, then reason about their consistency, directly addresses one of the highest-cost coordination failures in the industry.
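To make the failure mode concrete, here is a toy sketch of the kind of spec-drawing consistency check this task family exercises. The data model, section numbers, and field names below are invented for illustration; they are not AEC-Bench's actual task format.

```python
# Illustrative only: a minimal spec-vs-drawing conflict check.
# Section numbers, products, and fields are hypothetical examples.

# What the specification requires, keyed by spec section.
spec = {
    "08 11 13": {"product": "hollow metal door", "fire_rating_min": 90},
}

# What the drawings actually show, keyed by element id.
drawing = {
    "door-D12": {"spec_section": "08 11 13", "fire_rating": 60},
}

def conflicts(spec, drawing):
    """Return (element, shown rating, required rating) for each element
    whose drawn fire rating falls below the spec minimum."""
    out = []
    for el_id, el in drawing.items():
        req = spec.get(el["spec_section"])
        if req and el["fire_rating"] < req["fire_rating_min"]:
            out.append((el_id, el["fire_rating"], req["fire_rating_min"]))
    return out

print(conflicts(spec, drawing))
# -> [('door-D12', 60, 90)]
```

Caught during design, a mismatch like this is a markup; caught during construction, it is an RFI and a change order.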

Cross-Reference Verification: Trust Your Drawing Set

Modern drawing sets are networks of cross-references. A floor plan references a wall section. That wall section references a detail. That detail references a specification section. When any link in this chain is broken — a detail was renumbered but the reference was not updated, a sheet was removed but callouts still point to it — the drawing set loses integrity.

AEC-Bench tested cross-reference-resolution and cross-reference-tracing across 75 task instances. These tasks require navigating multi-sheet drawing sets, following reference chains, and identifying where they break. The Nomic Agent led all configurations on these intra-drawing tasks with an 88.3 mean score. For firms, this is the difference between a drawing set you can trust and one that generates a stream of RFIs during construction.
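The integrity check itself is conceptually simple once the references have been extracted; the hard part, as the benchmark shows, is extracting them reliably from visual drawings. A minimal illustrative sketch, with invented sheet and detail identifiers (not the benchmark's actual representation):

```python
# Illustrative only: a toy cross-reference integrity check.
# Identifiers and structures are hypothetical examples.

# Each callout points from a source sheet or detail to a target detail.
callouts = {
    "A-101": ["A-501/3", "A-502/1"],   # floor plan -> wall section, detail
    "A-501/3": ["A-601/2"],            # wall section -> detail
}

# Details that actually exist in the current drawing set.
# Note: A-601/2 was renumbered, but the callout was never updated.
existing = {"A-101", "A-501/3", "A-502/1"}

def broken_references(callouts, existing):
    """Return (source, target) pairs whose target no longer exists."""
    return [(src, tgt)
            for src, targets in callouts.items()
            for tgt in targets
            if tgt not in existing]

print(broken_references(callouts, existing))
# -> [('A-501/3', 'A-601/2')]
```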

Why Retrieval Matters More Than Reasoning

Perhaps the most practically important finding from AEC-Bench is that retrieval — not reasoning — is the primary bottleneck for AI in AEC. Agents frequently failed before reaching the reasoning step because they could not locate the right sheet, detail, or document.

This finding has direct implications for how firms should evaluate AI tools. A tool built on the most powerful language model in the world will still fail if it cannot find the right information in your drawing set. Domain-specific parsing and embeddings — the kind Nomic builds — are what close this gap. When agents were given structured, parsed representations of construction documents instead of raw PDFs, performance improved dramatically across every foundation model tested.

For firms evaluating AI solutions, this means the question is not just "which language model does it use?" but "can it actually read and navigate our documents?"
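The difference between flat text extraction and structured parsing can be sketched in a few lines. The schema below is a hypothetical illustration, not Nomic's actual representation:

```python
# Illustrative only: why structure matters more than raw model power.

# Flat text extraction: annotations, sizes, and references all land in
# one undifferentiated string, detached from the elements they describe.
flat_text = 'TYP. W12x26 SEE 5/S-301 1/2" PLATE'

# A structured representation keeps each annotation attached to its
# element and each cross-reference tied to a location on the sheet.
structured_sheet = {
    "sheet": "S-201",
    "elements": [
        {"id": "beam-07", "type": "beam", "size": "W12x26",
         "annotations": ['1/2" PLATE', "TYP."],
         "references": [{"target": "5/S-301",
                         "bbox": [412, 208, 446, 224]}]},
    ],
}

# With structure, "which detail covers beam-07?" is a direct lookup,
# not a keyword search over a flat string.
refs = [ref["target"]
        for el in structured_sheet["elements"] if el["id"] == "beam-07"
        for ref in el["references"]]
print(refs)
# -> ['5/S-301']
```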

[Chart: AEC-Bench Results — Impact of Domain-Specific Tools on Key AEC Tasks. Shows general-purpose agent scores without and with Nomic tools on Detail Technical Review, Spec-Drawing Sync, Drawing Navigation, and Submittal Review. Average improvement shown across foundation model families. Higher is better.]

What This Means for Your Firm

AEC-Bench is an open-source benchmark — anyone can reproduce the results and test their own tools against it. But the practical implications go beyond benchmarking:

Drawing review cycles can shrink. An AI first-pass that catches cross-reference errors, title block inconsistencies, and annotation issues means your senior reviewers start their work with a cleaner set and spend their expertise on design and engineering judgment.

Submittal review bottlenecks can break. Automated comparison of submitted products against specifications reduces the weeks-long review cycles that hold up construction schedules.

Coordination errors get caught earlier. Spec-drawing conflicts identified during design cost a fraction of what they cost when discovered during construction.

Your institutional knowledge becomes searchable. The same domain-specific parsing that powers these workflows also makes your firm's historical drawings and project data searchable — enabling your teams to find and reuse proven details instead of recreating them from scratch.

Generic AI tools weren't built for this. The workflows that define this industry — the drawing reviews, the submittal checks, the code compliance verifications, the cross-document coordination — require AI built specifically for how the built world communicates.

AEC-Bench puts a number on that gap, and the Nomic Agent is built to close it.

Book a demo to see how Nomic can transform your firm's review and coordination workflows.
