
AI dominates every digital space. More specifically, OpenAI with ChatGPT and GPT-4, which clearly outperform (at least in public perception) the usual suspects like Google, a company that has called itself "AI first" since 2016. The irony is that OpenAI's star products, ChatGPT and friends, are built on Transformers, a Google invention from 2017. There are many reports that Google sounded the alarm internally to catch up, and yesterday's Google I/O shows they are all in. But comparing the two, in my opinion, reveals starkly contrasting strategies: OpenAI goes loudly and boldly all in on generative text, while Google (more quietly) bets on embeddings.
What are Embeddings?
In the realm of AI, embeddings are vector representations of data, be it text, images, or audio. They capture the meaning behind the data in a mathematical format, allowing the machine to process and understand it. For example, in the context of text, words with similar meanings are mapped close to each other in a high-dimensional vector space. You can think of it as the ability to do "king" - "man" + "woman" = "queen". This forms the backbone of natural language understanding in models like GPT-4.
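For intuition, here is a minimal sketch of that vector arithmetic using pretrained GloVe vectors loaded through gensim. The specific model name, and the exact neighbours it returns, are illustrative assumptions of this example rather than anything the argument depends on:

import gensim.downloader as api

# Downloads a set of pretrained word vectors on first use (name is one of
# several pretrained options gensim can fetch; any word-vector set works).
vectors = api.load("glove-wiki-gigaword-100")

# most_similar does the arithmetic: positive terms are added, negative terms
# subtracted, then the nearest neighbours in the vector space are returned.
result = vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=3)
for word, score in result:
    print(f"{word}: {score:.3f}")  # "queen" is typically the top hit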
Mathematically, embeddings start as random vectors assigned to tokens (words or parts of words). These vectors live in a high-dimensional space, often with thousands of dimensions. During training, gradient descent on the loss function nudges the vectors with slight adjustments each time the associated token is processed. The adjustments are made in such a way that the vectors of tokens appearing in similar contexts move closer together in this high-dimensional space, while those of tokens appearing in different contexts move apart. Over time, this iterative process forms clusters of vectors, where each cluster corresponds to a particular semantic meaning. Words with similar meanings or associations end up close together in the vector space, and words with different meanings end up far apart.
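To make those "slight adjustments" concrete, here is a toy PyTorch sketch in the spirit of skip-gram with negative sampling: vectors of tokens seen together are pulled closer, vectors of random tokens are pushed apart. The vocabulary size, dimensions, and token IDs are invented for illustration; this is not how GPT-4 or any particular production model is trained.

import torch
import torch.nn.functional as F

vocab_size, dim = 1000, 64
emb = torch.nn.Embedding(vocab_size, dim)   # starts as random vectors
ctx = torch.nn.Embedding(vocab_size, dim)   # separate context vectors
opt = torch.optim.SGD(list(emb.parameters()) + list(ctx.parameters()), lr=0.05)

def train_step(center_ids, context_ids):
    """One gradient step on observed (center, context) token pairs."""
    center = emb(center_ids)                                    # (batch, dim)
    pos = ctx(context_ids)                                      # real neighbours
    neg = ctx(torch.randint(0, vocab_size, context_ids.shape))  # random tokens

    pos_score = (center * pos).sum(dim=-1)   # high when vectors are close
    neg_score = (center * neg).sum(dim=-1)   # should stay low

    loss = F.binary_cross_entropy_with_logits(pos_score, torch.ones_like(pos_score)) \
         + F.binary_cross_entropy_with_logits(neg_score, torch.zeros_like(neg_score))

    opt.zero_grad()
    loss.backward()   # the "wiggle": a small adjustment to the involved vectors
    opt.step()
    return loss.item()

# Toy usage: pretend tokens 3 and 7 often appear in the same context,
# so their vectors are gradually pulled together.
for _ in range(100):
    train_step(torch.tensor([3]), torch.tensor([7]))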
This clustering of semantic meaning through embeddings forms the cornerstone of natural language understanding in AI models. By capturing the intricate relationships between words, phrases, and contexts, embeddings enable AI models to grasp the nuances of human language, thereby improving their ability to generate contextually relevant responses.
Crucially, embeddings are created to support the generation of sensible text, and in doing so they capture an incredible amount of value in themselves, value that I believe is often overlooked in favor of the gimmick of generating that text.
The Power of Embeddings
The beauty of embeddings lies in their ability to capture real, highly abstracted meaning. An embedding is not generated text, it is a representation of real text, so using embeddings greatly reduces the risk of the model making things up. This not only ensures the authenticity of the information but also allows referencing back to the original content. Less generation of text, more finding the most relevant snippets of real text.
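A minimal sketch of that retrieval idea, using the sentence-transformers library; the model name, corpus, and query are illustrative assumptions:

import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

corpus = [
    "The transformer architecture was introduced by Google in 2017.",
    "Word2vec, released in 2013, popularised word embeddings.",
    "Embeddings map text to vectors in a high-dimensional space.",
]
corpus_emb = model.encode(corpus, normalize_embeddings=True)   # (n_docs, dim)

query = "Who invented transformers?"
query_emb = model.encode([query], normalize_embeddings=True)

# Cosine similarity reduces to a dot product on normalized vectors.
scores = corpus_emb @ query_emb[0]
best = int(np.argmax(scores))
print(corpus[best], scores[best])  # returns a real snippet we can cite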
Moreover, one can take new text and create embeddings for it without retraining the entire model (as long as the new text doesn't introduce wildly different meanings). This allows seamless integration of new data, keeping the model up to date with the latest information. You can even embed ads or sponsored content, which allows you to monetize the model in a transparent and explicit way that doesn't bias or pollute the training itself.
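As a sketch of what that could look like, assuming a simple in-memory index and the same illustrative sentence-transformers model as above (the add_documents helper, the sponsored flag, and all sample texts are hypothetical, not any product's API):

import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")     # frozen, never retrained here
dim = model.get_sentence_embedding_dimension()

index_vectors = np.empty((0, dim))                  # existing embedding matrix
index_meta = []                                     # text + flags per snippet

def add_documents(texts, sponsored=False):
    """Embed new texts and append them to the index; no retraining needed."""
    global index_vectors
    vecs = model.encode(texts, normalize_embeddings=True)
    index_vectors = np.vstack([index_vectors, vecs])
    index_meta.extend({"text": t, "sponsored": sponsored} for t in texts)

add_documents(["Google I/O 2023 announced PaLM 2."])
add_documents(["Acme Cloud: GPUs for your embedding workloads."], sponsored=True)

query = model.encode(["latest Google AI announcements"], normalize_embeddings=True)[0]
ranked = np.argsort(index_vectors @ query)[::-1]
for i in ranked[:2]:
    tag = " [sponsored]" if index_meta[i]["sponsored"] else ""
    print(index_meta[i]["text"] + tag)   # sponsored results stay explicitly labeled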
While OpenAI is making significant strides in refining generative text through human feedback loops to minimize fabricated content, I believe Google is silently pushing the envelope with embeddings. The latter strategy is not only technologically superior but also makes more business sense. It facilitates source referencing, content maintenance, language agnosticism, and straightforward ad placements.
The Future of Embeddings
Our understanding of embeddings has come a long way since word2vec in 2013 (also from Google), and transformers have taken this to a new level. I believe the true potential of embeddings is still being explored, and there is vast room for extremely disruptive innovation. In fact, it is clear to me that embeddings are core to the enterprise value of domain-specific applications like BloombergGPT. Embeddings may soon become the lever for regulatory guardrails in high-risk AI applications such as medicine or law enforcement, thanks to their transparency and testability.
Earth Transformers
One particular area that piques my interest is geospatial remote sensing embeddings, or "Earth Transformers". I'm positive this has the potential to disrupt the Earth Observation space in ways we can't even begin to comprehend yet. I might have more to share soon, but if you're interested, or an expert in this field, please ping me.
To echo Steve Ballmer, and as I've emphasized to multiple clients working on generative AI:
"Embeddings, Embeddings, Embeddings!"
"Embeddings, Embeddings, Embeddings!"