April 2, 2025

Nomic Embed Multimodal: Open Source Multimodal Embedding Models for Text, Images, PDFs, and Charts

We're excited to announce the release of Nomic Embed Multimodal, a suite of models that achieve state-of-the-art performance in embedding PDFs, images, papers, and charts.

This release includes four models, available in two sizes (3B and 7B parameters) and two variants:

  • ColNomic Embed Multimodal (3B and 7B): Multi-vector late interaction multimodal embedding models (more powerful)
  • Nomic Embed Multimodal (3B and 7B): Single-vector multimodal embedding models (faster & use less memory/storage)

Our best model, ColNomic Embed Multimodal 7B, achieves 62.7 NDCG@5 on Vidore-v2, a visual document retrieval benchmark focused on page-level retrieval. That is a 2.8-point improvement over the previous state of the art. Nomic Embed Multimodal 7B also outperforms all other single-vector models on the benchmark.

Challenges with PDFs, Papers, and Charts

Documents are visually rich structures that convey information not just through text, but through figures, page layouts, tables, and even fonts. Traditional retrieval systems primarily rely on extracted text, missing these crucial visual elements and often requiring complex, error-prone processing pipelines. Nomic Embed Multimodal, inspired by ColPali and DSE, solves this problem by supporting interleaved text and image inputs, making it ideal for:

  • PDF documents and research papers
  • Screenshots of applications and websites
  • Visually rich content where layout matters
  • Multilingual documents where visual context is important
VLMs for Multimodal Embeddings

Before multimodal embedding models, representing multimodal data for retrieval required:

  • Separate encoders for visual and text inputs
  • Complex preprocessing pipelines to extract data from images

In contrast, VLMs provide a simple and accurate way to embed image and text data with a single model, eliminating these complexities while improving performance. This approach delivers superior accuracy compared to text-only approaches that require OCR, while being faster than complex pipelines with multiple processing steps. It also provides more comprehensive capture of visual information by directly processing images alongside related text.

Figure 1: Multimodal embedding architecture (courtesy of DSE)

Both the ColPali and DSE approaches significantly outperform multimodal models with CLIP-style architectures by addressing the modality gap through interleaved processing of text and images.

The ColNomic Embed Multimodal models use a multi-vector late interaction mechanism. Instead of creating one embedding per document or query, ColNomic creates multiple embeddings. This allows for more precise matching during retrieval and leads to better performance.
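To make the late-interaction mechanism concrete, the sketch below shows the standard ColBERT-style MaxSim scoring that multi-vector retrievers use: each query token embedding is matched against its best document token embedding, and the per-token maxima are summed. This is a minimal illustration of the mechanism rather than the exact scoring code shipped with the models.

```python
import torch

def maxsim_score(query_vecs: torch.Tensor, doc_vecs: torch.Tensor) -> torch.Tensor:
    """Late-interaction (MaxSim) relevance score.

    query_vecs: (num_query_tokens, dim) L2-normalized query token embeddings
    doc_vecs:   (num_doc_tokens, dim)   L2-normalized document token embeddings
    """
    sim = query_vecs @ doc_vecs.T          # (num_query_tokens, num_doc_tokens) cosine similarities
    return sim.max(dim=1).values.sum()     # best document token per query token, summed

# Toy example: a 4-token query scored against a 10-token document page
q = torch.nn.functional.normalize(torch.randn(4, 128), dim=-1)
d = torch.nn.functional.normalize(torch.randn(10, 128), dim=-1)
print(maxsim_score(q, d))
```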

How We Improved Multimodal Embeddings

Building on these advances, we applied our learnings from training high-performance text embeddings to create even better multimodal embeddings. Starting with Qwen2.5-VL 3B Instruct as our baseline, we implemented several key improvements:

1. Sampling From the Same Source

We discovered that naive sampling across dataset sources allows models to learn shortcuts rather than semantic relationships. By forcing sampling from the same source, we create harder in-batch negatives that prevent the model from "cheating" and improve its understanding of content relationships.

Result: +2.9 point improvement on Vidore-v2 NDCG@5
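As a minimal sketch of the idea, the snippet below builds contrastive-training batches that each draw from a single dataset source, so every in-batch negative comes from the same distribution as the positive. The `source` field name and batching details are illustrative rather than the exact training code.

```python
import random
from collections import defaultdict

def same_source_batches(examples, batch_size, seed=0):
    """Yield batches whose examples all come from one dataset source,
    making in-batch negatives topically similar and therefore harder."""
    rng = random.Random(seed)
    by_source = defaultdict(list)
    for example in examples:
        by_source[example["source"]].append(example)  # assumed per-example source label

    batches = []
    for source_examples in by_source.values():
        rng.shuffle(source_examples)
        # drop the ragged tail so every batch stays single-source and full-size
        for i in range(0, len(source_examples) - batch_size + 1, batch_size):
            batches.append(source_examples[i : i + batch_size])

    rng.shuffle(batches)  # mix sources across training steps, never within a batch
    return batches
```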

2. Hard Negative Mining

We trained an initial dense model on the ColPali training dataset and VDR Multilingual Train Dataset, then used it to retrieve the top-k nearest neighbors for each query.

Additionally, we reduced false negatives using positive-aware hard negative mining, a technique first introduced in NV-Retriever.

Results:

  • 1 Hard Negative: +3.5 points
  • 4 Hard Negatives: +4.7 points
  • 6 Hard Negatives: +5.2 points
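The sketch below illustrates positive-aware filtering in the spirit of NV-Retriever: candidates retrieved by the initial dense model are kept as hard negatives only if their similarity to the query stays below a fraction of the positive's similarity, which filters out likely false negatives. Embeddings are assumed L2-normalized, and the margin value is illustrative rather than the exact recipe used in training.

```python
import numpy as np

def mine_hard_negatives(query_emb, positive_emb, candidate_embs, k=6, margin=0.95):
    """Select up to k hard negatives for one query.

    query_emb:      (dim,)   L2-normalized query embedding
    positive_emb:   (dim,)   L2-normalized embedding of the known positive
    candidate_embs: (n, dim) L2-normalized embeddings of retrieved candidates
    """
    positive_score = float(query_emb @ positive_emb)
    scores = candidate_embs @ query_emb                    # cosine similarity to the query
    keep = np.where(scores < margin * positive_score)[0]   # drop likely false negatives
    ranked = keep[np.argsort(-scores[keep])]               # hardest (most similar) first
    return ranked[:k]                                      # indices into candidate_embs
```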
Integration with Real World RAG Workflows

VLMs like Nomic Embed Multimodal simplify how RAG systems handle documents with rich visual content, where equations, diagrams, charts, and tables provide essential context alongside the text.

Technical documentation presents similar challenges - code blocks, flowcharts, and screenshots need to be understood together with their surrounding text. The same applies to product catalogs with specifications and images, or financial reports containing charts and numerical data.

By embedding visual and textual content together, retrieval becomes more accurate and integrations into real systems become much easier to implement and experiment with. Removing preprocessing steps often makes indexing faster and reduces complexity, and the single API for both images and text keeps implementations straightforward.

Conclusion

Nomic Embed Multimodal offers state-of-the-art performance while substantially simplifying the retrieval pipeline. As part of the broader Nomic Embed Ecosystem, this release reflects our commitment to pushing the boundaries of embedding capabilities. To learn more about the complete ecosystem, see our detailed blog post.

Get Started With Nomic Embed Multimodal

You can get started with our new model collection on Hugging Face.
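For example, the multi-vector models follow the usual colpali-engine usage pattern: embed document pages as images, embed the query, and score them with late interaction. The sketch below assumes the standard ColQwen2.5 class names and the colnomic-embed-multimodal-7b repo ID, so consult the model card for the exact requirements before running it.

```python
import torch
from PIL import Image
from colpali_engine.models import ColQwen2_5, ColQwen2_5_Processor

model_name = "nomic-ai/colnomic-embed-multimodal-7b"  # see the Hugging Face collection

model = ColQwen2_5.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map="cuda:0",
).eval()
processor = ColQwen2_5_Processor.from_pretrained(model_name)

# Document pages rendered as images (e.g. PDF pages exported to PNG)
images = [Image.open("page_1.png"), Image.open("page_2.png")]
queries = ["What does the revenue chart show for Q3?"]

batch_images = processor.process_images(images).to(model.device)
batch_queries = processor.process_queries(queries).to(model.device)

with torch.no_grad():
    image_embeddings = model(**batch_images)
    query_embeddings = model(**batch_queries)

# Multi-vector late-interaction scores: one row per query, one column per page
scores = processor.score_multi_vector(query_embeddings, image_embeddings)
print(scores)
```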


And here are some guides demonstrating how to use the new models as retrievers for RAG workflows:

Nomic Documentation

RAG Over PDFs with Nomic Embed Multimodal
RAG Over PDFs with ColNomic Embed Multimodal

Google Colab Tutorial Notebooks

Nomic Embed Multimodal Tutorial in Google Colab
ColNomic Embed Multimodal Tutorial in Google Colab
