April 2, 2025

Nomic Embed Multimodal: Open Source Multimodal Embedding Models for Text, Images, PDFs, and Charts

We're excited to announce the release of Nomic Embed Multimodal, a suite of models that achieve state-of-the-art performance in embedding PDFs, images, papers, and charts.

This release includes four models, available in two sizes (3B and 7B parameters) and two variants:

  • ColNomic Embed Multimodal (3B and 7B): Multi-vector late interaction multimodal embedding models (more powerful)
  • Nomic Embed Multimodal (3B and 7B): Single-vector multimodal embedding models (faster & use less memory/storage)

Our best model, ColNomic Embed Multimodal 7B, achieves 62.7 NDCG@5 on Vidore-v2, a visual document retrieval benchmark focused on page-level retrieval. That is a 2.8-point improvement over the previous state of the art. Nomic Embed Multimodal 7B also outperforms all other single-vector models on the benchmark.

Challenges with PDFs, Papers, and Charts

Documents are visually rich structures that convey information not just through text, but through figures, page layouts, tables, and even fonts. Traditional retrieval systems primarily rely on extracted text, missing these crucial visual elements and often requiring complex, error-prone processing pipelines. Nomic Embed Multimodal, inspired by ColPali and DSE, solves this problem by supporting interleaved text and image inputs, making it ideal for:

  • PDF documents and research papers
  • Screenshots of applications and websites
  • Visually rich content where layout matters
  • Multilingual documents where visual context is important
VLMs for Multimodal Embeddings

Before multimodal embedding models, representing multimodal data for retrieval required:

  • Separate encoders for visual and text inputs
  • Complex preprocessing pipelines to extract data from images

In contrast, VLMs provide a simple and accurate way to embed image and text data with a single model, eliminating these complexities while improving performance. This approach delivers superior accuracy compared to text-only approaches that require OCR, while being faster than complex pipelines with multiple processing steps. It also provides more comprehensive capture of visual information by directly processing images alongside related text.

Figure 1: Multimodal embedding architecture (courtesy of DSE)

Both the ColPali and DSE approaches significantly outperform multimodal models with CLIP-style architectures by addressing the modality gap through interleaved processing of text and images.

The ColNomic Embed Multimodal models use a multi-vector late interaction mechanism. Instead of creating one embedding per document or query, ColNomic creates multiple embeddings. This allows for more precise matching during retrieval and leads to better performance.
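To make the late-interaction mechanism concrete, the sketch below shows the standard ColBERT-style MaxSim scoring that multi-vector retrievers use: each query token embedding is matched against its best document token embedding, and the per-token maxima are summed. This is a minimal illustration of the mechanism rather than the exact scoring code shipped with the models.

```python
import torch

def maxsim_score(query_vecs: torch.Tensor, doc_vecs: torch.Tensor) -> torch.Tensor:
    """Late-interaction (MaxSim) relevance score.

    query_vecs: (num_query_tokens, dim) L2-normalized query token embeddings
    doc_vecs:   (num_doc_tokens, dim)   L2-normalized document token embeddings
    """
    sim = query_vecs @ doc_vecs.T          # (num_query_tokens, num_doc_tokens) cosine similarities
    return sim.max(dim=1).values.sum()     # best document token per query token, summed

# Toy example: a 4-token query scored against a 10-token document page
q = torch.nn.functional.normalize(torch.randn(4, 128), dim=-1)
d = torch.nn.functional.normalize(torch.randn(10, 128), dim=-1)
print(maxsim_score(q, d))
```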

How We Improved Multimodal Embeddings

Building on these advances, we applied our learnings from training high-performance text embeddings to create even better multimodal embeddings. Starting with Qwen2.5-VL 3B Instruct as our baseline, we implemented several key improvements:

1. Sampling From the Same Source

We discovered that naive sampling across dataset sources allows models to learn shortcuts rather than semantic relationships. By forcing sampling from the same source, we create harder in-batch negatives that prevent the model from "cheating" and improve its understanding of content relationships.

Result: +2.9 point improvement on Vidore-v2 NDCG@5
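As a minimal sketch of the idea, the snippet below builds contrastive-training batches that each draw from a single dataset source, so every in-batch negative comes from the same distribution as the positive. The `source` field name and batching details are illustrative rather than the exact training code.

```python
import random
from collections import defaultdict

def same_source_batches(examples, batch_size, seed=0):
    """Yield batches whose examples all come from one dataset source,
    making in-batch negatives topically similar and therefore harder."""
    rng = random.Random(seed)
    by_source = defaultdict(list)
    for example in examples:
        by_source[example["source"]].append(example)  # assumed per-example source label

    batches = []
    for source_examples in by_source.values():
        rng.shuffle(source_examples)
        # drop the ragged tail so every batch stays single-source and full-size
        for i in range(0, len(source_examples) - batch_size + 1, batch_size):
            batches.append(source_examples[i : i + batch_size])

    rng.shuffle(batches)  # mix sources across training steps, never within a batch
    return batches
```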

2. Hard Negative Mining

We trained an initial dense model on the ColPali training dataset and VDR Multilingual Train Dataset, then used it to retrieve the top-k nearest neighbors for each query.

Additionally, we reduced false negatives using positive-aware hard negative mining, a technique first introduced in NV-Retriever.

Results:

  • 1 Hard Negative: +3.5 points
  • 4 Hard Negatives: +4.7 points
  • 6 Hard Negatives: +5.2 points
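The sketch below illustrates positive-aware filtering in the spirit of NV-Retriever: candidates retrieved by the initial dense model are kept as hard negatives only if their similarity to the query stays below a fraction of the positive's similarity, which filters out likely false negatives. Embeddings are assumed L2-normalized, and the margin value is illustrative rather than the exact recipe used in training.

```python
import numpy as np

def mine_hard_negatives(query_emb, positive_emb, candidate_embs, k=6, margin=0.95):
    """Select up to k hard negatives for one query.

    query_emb:      (dim,)   L2-normalized query embedding
    positive_emb:   (dim,)   L2-normalized embedding of the known positive
    candidate_embs: (n, dim) L2-normalized embeddings of retrieved candidates
    """
    positive_score = float(query_emb @ positive_emb)
    scores = candidate_embs @ query_emb                    # cosine similarity to the query
    keep = np.where(scores < margin * positive_score)[0]   # drop likely false negatives
    ranked = keep[np.argsort(-scores[keep])]               # hardest (most similar) first
    return ranked[:k]                                      # indices into candidate_embs
```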
Integration with Real World RAG Workflows

VLMs like Nomic Embed Multimodal simplify how RAG systems handle documents with rich visual content, where equations, diagrams, charts, and tables provide essential context alongside the text.

Technical documentation presents similar challenges - code blocks, flowcharts, and screenshots need to be understood together with their surrounding text. The same applies to product catalogs with specifications and images, or financial reports containing charts and numerical data.

By embedding visual and textual content together, retrieval becomes more accurate and integrations into real systems become much easier to implement and experiment with. Removing preprocessing steps often makes indexing faster and reduces complexity, and the single API for both images and text keeps implementations straightforward.

Conclusion

Nomic Embed Multimodal offers state-of-the-art performance while substantially simplifying the retrieval pipeline. As part of the broader Nomic Embed Ecosystem, this release reflects our commitment to pushing the boundaries of embedding capabilities. To learn more about the complete ecosystem, see our detailed blog post.

Get Started With Nomic Embed Multimodal

You can get started with our new model collection on Hugging Face.
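For example, the multi-vector models follow the usual colpali-engine usage pattern: embed document pages as images, embed the query, and score them with late interaction. The sketch below assumes the standard ColQwen2.5 class names and the colnomic-embed-multimodal-7b repo ID, so consult the model card for the exact requirements before running it.

```python
import torch
from PIL import Image
from colpali_engine.models import ColQwen2_5, ColQwen2_5_Processor

model_name = "nomic-ai/colnomic-embed-multimodal-7b"  # see the Hugging Face collection

model = ColQwen2_5.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map="cuda:0",
).eval()
processor = ColQwen2_5_Processor.from_pretrained(model_name)

# Document pages rendered as images (e.g. PDF pages exported to PNG)
images = [Image.open("page_1.png"), Image.open("page_2.png")]
queries = ["What does the revenue chart show for Q3?"]

batch_images = processor.process_images(images).to(model.device)
batch_queries = processor.process_queries(queries).to(model.device)

with torch.no_grad():
    image_embeddings = model(**batch_images)
    query_embeddings = model(**batch_queries)

# Multi-vector late-interaction scores: one row per query, one column per page
scores = processor.score_multi_vector(query_embeddings, image_embeddings)
print(scores)
```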


And here are some guides demonstrating how to use the new models as retrievers for RAG workflows:

Nomic Documentation

RAG Over PDFs with Nomic Embed Multimodal
RAG Over PDFs with ColNomic Embed Multimodal

Google Colab Tutorial Notebooks

Nomic Embed Multimodal Tutorial in Google Colab
ColNomic Embed Multimodal Tutorial in Google Colab
