
Author: Nomic Team

Nomic Atlas Data Mapping Series




This is our Data Mapping Series, where we do a deep dive into the technologies and tools powering the unstructured data visualization and exploration capabilities of the Nomic Atlas platform. In this series, you'll learn how machine learning concepts like embeddings and dimensionality reduction combine with scalable web graphics to let anyone explore and work with massive datasets in a web browser.


If you have ever visited a well-maintained bookstore or library and taken the time to wander around and explore, you probably found that you could browse an enormous amount of information without feeling overwhelmed.

Data mapping is like exploring an interactive, well-organized library customized to your data (image from Princeton University)

This is because all of the "data" - the books - has been organized to facilitate browsing: systems like the Dewey Decimal System or the Library of Congress Classification System use relative positioning to arrange books hierarchically into categories, sub-categories, sub-sub-categories, and so on.

Tools like Nomic Atlas bring this library-like browsing experience to any dataset by creating data maps that organize information based on semantic relationships. These maps use AI models that output embeddings to encode the meaning of each data point, effectively creating a custom, interactive library-like browsing experience specialized to your data.

How Data Maps Arrange Your Data Automatically

It's useful to have a quick mental picture of what is going on behind the scenes to arrange data for a data map:

Each data point is encoded as a high-dimensional vector, and what the data means is represented by where that vector lands in the vector space: semantically similar data points end up near one another.

Here's an illustration. When these three strings are embedded into vectors with Nomic Embed, the two fruit strings end up closer together in vector space, because they are more semantically related to each other than either is to "cars are fast":

"cars are fast" → [0.06599499, 0.03031864, 0.00064119, -0.01795955, -0.04433974, -0.01212416, ...]
"oranges are delicious" → [0.02279393, 0.09127039, 0.00110198, 0.01061488, -0.0569771, -0.04645369, ...]
"apples are tasty" → [0.0408471, 0.02852942, 0.01011915, -0.02003389, -0.04360955, -0.04741682, ...]

Ultimately, this is what makes embeddings so useful: they are a general purpose method for measuring the semantic relatedness of different pieces of data. The problem is that these embedding vectors are high dimensional, so for large datasets it can be unwieldy to get a picture of what the embeddings look like.
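To make "measuring semantic relatedness" concrete, here is a minimal sketch of cosine similarity, a standard way to compare embedding vectors. The four-dimensional vectors below are made up for illustration; real embedding models output hundreds of dimensions:

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: closer to 1.0 means more similar direction."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical toy embeddings, not real model output.
cars = np.array([0.9, 0.1, -0.2, 0.05])
oranges = np.array([0.1, 0.8, 0.5, -0.1])
apples = np.array([0.15, 0.75, 0.55, -0.05])

print(cosine_similarity(oranges, apples))  # high: related meanings
print(cosine_similarity(oranges, cars))    # low: unrelated meanings
```

The same comparison works unchanged on real embedding vectors, whatever their dimensionality.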

This is where dimensionality reduction becomes crucial: we can change the embeddings from high dimensional vectors to low dimensional 2D vectors, making large collections of embeddings much easier to visualize.

The tradeoff is that we lose some information about the relationships between the original high-dimensional data points when we compress down to only two dimensions. But the benefit is that we gain much more of an ability to visualize and explore the data. Each embedding gets represented as a single point on a map, and the map as a whole ends up showing neighborhoods of points that represent different semantic groupings of the data.
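As a rough sketch of what dimensionality reduction does, the snippet below uses plain PCA (via NumPy's SVD) to project synthetic high-dimensional "embeddings" down to 2D. PCA is a simple stand-in here, not the algorithm Atlas actually uses, and the data is randomly generated:

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for real embeddings: 200 points in 64 dimensions, drawn as two
# loose clusters so the 2D projection has visible structure.
cluster_a = rng.normal(loc=0.0, scale=0.5, size=(100, 64))
cluster_b = rng.normal(loc=2.0, scale=0.5, size=(100, 64))
embeddings = np.vstack([cluster_a, cluster_b])

# Classic PCA: center the data, then project onto the top-2 right singular
# vectors (the two directions of greatest variance).
centered = embeddings - embeddings.mean(axis=0)
_, _, vt = np.linalg.svd(centered, full_matrices=False)
points_2d = centered @ vt[:2].T

print(points_2d.shape)  # (200, 2): one (x, y) map position per embedding
```

Plotting `points_2d` as a scatter plot would show the two clusters as two distinct neighborhoods, which is exactly the structure a data map surfaces at scale.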

For our previous example, this would situate the two fruit-related sentences in a neighborhood of points clearly distinct from the neighborhood that contains the car sentence:

(Figure: the same three strings plotted as points on a 2D map, with "oranges are delicious" and "apples are tasty" landing in one neighborhood and "cars are fast" in another.)

When you explore your data in an interactive environment that lets you freely zoom and pan around (as we'll demonstrate with Atlas below), you'll discover how powerful this neighborhood-based view can be. Seeing your data organized into semantic clusters reveals patterns and insights at both macro and micro scales that would be difficult to uncover otherwise.

Let's see this in action with a rich, real-world dataset: a collection of over 1 million biographical summaries from Wikipedia. In the interactive data map below, you can zoom into specific clusters and explore different neighborhoods that emerge from the data.

This is a data map hosted in Atlas - we're visualizing over 1 million biographies from highly visited Wikipedia pages.

Scroll on to continue reading, or click the small map icon in the bottom-left corner of this box to explore the dataset in Atlas.

To read a data map like this, zoom in and scroll around to explore different regions of the dataset. You can hover over individual points to read the text they contain.

Each point in this map represents a person, with their position determined by the semantic similarities in the texts: similar descriptions of careers and life experiences in the dataset naturally cluster together as neighborhoods of points in the map.

This makes it easy to spot the prevalence of abstract concepts in a dataset - like "athletics" & "politics" - at a glance.

The biographies of people in sports are mostly grouped together at the bottom of this map, since the models powering Atlas pick up on what all these biographies have in common relative to the rest of the dataset.

Then, when you want to explore the data in more detail, you can color the data by sub-topics to see more nuanced differences between regions of your dataset.

We're building Atlas because we believe all the right technologies are now coming together to make data maps accessible to everyone - not just machine learning experts. When you upload data to Atlas, our infrastructure automatically handles the technicalities of the data mapping algorithms behind the scenes, getting you close to your data faster.

Atlas integrates three core technologies in our software to turn unstructured data into interactive data maps:

Embeddings: Nomic Embed

We trained our own text embedder, Nomic Embed Text, a 137M-parameter model that packs a punch and is used throughout the Atlas platform to encode semantic information. We then released a compatible image embedder, Nomic Embed Vision, aligned to the text embedder to enable multimodal use cases like semantic search over images. The models are open source & available for anyone to download to embed text and images - with further modalities like audio & video on the roadmap for our vision of a universal embedding space for any kind of data. We'll discuss how embedding models work in more detail in Part 2 of this series.

Dimensionality Reduction: Nomic Project

We developed our own dimensionality reduction algorithm Nomic Project that scales efficiently and currently runs in production to compute the 2D layout for every dataset uploaded to Atlas. In Part 3 of this series, we'll share more about Nomic Project and how it builds on past research in this space to be more scalable and efficient.

Web Graphics: Deepscatter

We maintain Deepscatter as an open source JavaScript library for fast data operations & graphics powering Atlas, making high-speed data interaction a reality for anyone in their browser. Deepscatter is also responsible for the interactive scrollytelling view of data rendered above, and it makes it possible for anyone to tell amazing stories with their data. With Deepscatter making graphics fast & beautiful, you can get closer to the rich and important stories that live within your data. We'll explain what makes it so fast in Part 4 of this series.


“Henceforth, it is the map that precedes the territory” – Jean Baudrillard