If you have ever visited a well-maintained bookstore or library, and taken the time to wander around and explore, you probably found that you could navigate quite a lot of data without feeling information overload.
Data maps are like interactive libraries of your data organized into categories and sub-categories (image from Princeton University)
This is because all of the "data" - the books - has been organized to facilitate browsing using relative positioning: systems like the Dewey Decimal System or the Library of Congress Classification arrange books hierarchically into categories, sub-categories, sub-sub-categories, and so on, so that related books end up shelved near one another.
Tools like Nomic Atlas bring this library-like browsing experience to any dataset by creating data maps that organize information based on semantic relationships. These maps use AI models that output embeddings to encode the meaning of each data point, effectively building a custom, interactive library specialized to your data.
It's useful to have a quick mental picture of what is going on behind the scenes to arrange data for a data map:
Each data point is encoded as a high-dimensional vector; what the data means is represented by where it gets located in the vector space.
Here's an illustration: when three different strings get embedded into vectors with Nomic Embed, the two fruit strings - being more semantically related to each other than either is to "cars are fast" - end up closer together in vector space.
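To make this concrete, here's a minimal sketch of measuring that relatedness with cosine similarity. It assumes the `nomic` Python client is installed and authenticated, and the two fruit sentences are hypothetical stand-ins for the strings in the illustration:

```python
import numpy as np
from nomic import embed

# Hypothetical stand-ins for the three strings in the illustration
texts = ["apples are sweet", "oranges are juicy", "cars are fast"]

# Embed the strings with Nomic Embed Text
output = embed.text(texts=texts, model="nomic-embed-text-v1.5")
vectors = np.array(output["embeddings"])

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: higher = more related."""
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# The two fruit sentences should score higher with each other
# than either does with the car sentence
print(cosine_similarity(vectors[0], vectors[1]))  # fruit vs. fruit
print(cosine_similarity(vectors[0], vectors[2]))  # fruit vs. car
```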
Ultimately, this is what makes embeddings so useful: they are a general-purpose method for measuring the semantic relatedness of different pieces of data. The problem is that these embedding vectors are high-dimensional, so for large datasets it can be unwieldy to get a picture of what the embeddings look like.
This is where dimensionality reduction becomes crucial: we can compress the high-dimensional embeddings down to two dimensions, making large collections of embeddings much easier to visualize.
The tradeoff is that we lose some information about the relationships between the original high-dimensional data points when we compress down to only two dimensions. The benefit is that the data becomes far easier to visualize and explore: each embedding gets represented as a single point on a map, and the map as a whole ends up showing neighborhoods of points that represent different semantic groupings of the data.
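As a rough sketch of the idea - using scikit-learn's t-SNE as a generic stand-in here, not the algorithm Atlas actually uses (more on that in Part 3) - the reduction step looks like this:

```python
import numpy as np
from sklearn.manifold import TSNE

# Stand-in for a real dataset: 5,000 embeddings of 768 dimensions each
high_dim_embeddings = np.random.rand(5_000, 768)

# Compress to 2 dimensions; points that were close in 768-D space
# should stay close in 2-D, preserving the semantic neighborhoods
reducer = TSNE(n_components=2, metric="cosine")
positions_2d = reducer.fit_transform(high_dim_embeddings)

print(positions_2d.shape)  # (5000, 2) -> one (x, y) map position per point
```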
For our previous example, this would situate the two fruit-related sentences in a neighborhood of points clearly distinct from the neighborhood that contains the car sentence:
When you explore your data in an interactive environment that lets you freely zoom and pan around (as we'll demonstrate with Atlas below), you'll discover how powerful this neighborhood-based view can be. Seeing your data organized into semantic clusters reveals patterns and insights at both macro and micro scales that would be difficult to uncover otherwise.
Let's see this in action with a rich, real-world dataset: a collection of over 1 million biographical summaries from Wikipedia. In the interactive data map below, you can zoom into specific clusters and explore different neighborhoods that emerge from the data.
We're building Atlas because we believe that all the right technologies are now coming together to make data maps accessible to everyone - not just machine learning experts. When you upload data to Atlas, our infrastructure automatically handles the technicalities of the data mapping algorithms behind the scenes to get you close to your data faster.
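As a sketch of what that looks like in practice - assuming the `nomic` Python client and a hypothetical dataset name; consult the Atlas docs for the current API - uploading text data can be as simple as:

```python
from nomic import atlas

# A few example documents; Atlas handles embedding, dimensionality
# reduction, and map rendering automatically after upload
documents = [
    {"text": "apples are sweet"},
    {"text": "oranges are juicy"},
    {"text": "cars are fast"},
]

# `indexed_field` tells Atlas which field to embed and organize the map by;
# the identifier is a hypothetical name for this dataset
dataset = atlas.map_data(
    data=documents,
    indexed_field="text",
    identifier="my-first-data-map",
)
print(dataset.maps)  # link(s) to the interactive map in your browser
```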
Atlas integrates three core technologies to turn unstructured data into interactive data maps:
We trained our own text embedder, Nomic Embed Text, a 137M-parameter model that packs a punch and is used throughout the Atlas platform to encode semantic information. We then released a compatible image embedder, Nomic Embed Vision, aligned to the text embedder to enable multimodal use cases like semantic search over images. Both models are open source & available for anyone to download to embed text and images - with further modalities like audio & video on the roadmap toward our vision of a universal embedding space for any kind of data. We'll discuss how embedding models work in more detail in Part 2 of this series.
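Because the two models are aligned, text and image embeddings live in one shared space and can be compared directly. Here's a hedged sketch, assuming the `nomic` client's image embedding helper and a hypothetical local image file:

```python
import numpy as np
from nomic import embed

# Embed a caption with the text model...
text_out = embed.text(
    texts=["a photo of a red apple"],
    model="nomic-embed-text-v1.5",
)

# ...and an image with the aligned vision model
image_out = embed.image(
    images=["apple.jpg"],  # hypothetical local image path
    model="nomic-embed-vision-v1.5",
)

text_vec = np.array(text_out["embeddings"][0])
image_vec = np.array(image_out["embeddings"][0])

# One shared embedding space means a single cosine similarity
# scores how well the caption matches the image
score = np.dot(text_vec, image_vec) / (
    np.linalg.norm(text_vec) * np.linalg.norm(image_vec)
)
print(score)
```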
We developed our own dimensionality reduction algorithm, Nomic Project, which scales efficiently and currently runs in production to compute the 2D layout for every dataset uploaded to Atlas. In Part 3 of this series, we'll share more about Nomic Project and how it builds on past research in this space to be more scalable and efficient.
We maintain Deepscatter, an open source Javascript library for the fast data operations & graphics powering Atlas, making high-speed data interaction a reality for anyone in their browser. Deepscatter is also responsible for the interactive scrollytelling view of the data we rendered above, and it will make it possible for anyone to tell amazing data stories with their data. With Deepscatter making graphics fast & beautiful, you can get closer to the rich and important stories that live within your data. We'll explain what makes it so fast in Part 4 of this series.