Serverless RAG Time

Week of May 27, 2024

May 30, 2024

Today we are going to talk about rags. Personally, I prefer terry cloth, as I find it picks up spills more effectively. Okay, I am kidding, but we will be talking about another type of RAG: retrieval-augmented generation.

What is RAG?

Before we dive into the role serverless has in the world of RAG, it is important to level set on what RAG is.

NVIDIA describes RAG as a “technique for enhancing the accuracy and reliability of generative AI models with facts fetched from external sources.”

Let’s look at it another way. One of the core primitives of Generative AI (GenAI) is the Large Language Model (LLM). As the name suggests, LLMs are massive machine learning models trained on massive amounts of data. Your GenAI tool uses the LLM as a source to generate text, audio, video, etc.

Now, there is a caveat. Building an LLM is not a trivial task. Most of the major LLMs are backed by a large organization or corporation. Google has Gemini, OpenAI has the GPT models behind ChatGPT, Meta has Llama, Anthropic has Claude, and so on. GPT-4 is estimated to have cost around $100M to train. That's not something most of us have lying around. While you could train your own, it is easier to just use one that already exists.

But those models weren’t trained on YOUR data. What if I have documentation that I want to train the model on? Or maybe I want to train the model on my proprietary code? I don’t necessarily want to provide this proprietary data to these organizations to train their LLMs on. So surely there must be a middle ground there.

RAG pairs the LLM with an external source (typically a database) that holds your extra data. At query time, the relevant pieces of that data are retrieved and passed to the LLM along with the question, which lets you “customize” the LLM for your chatbot or whatever GenAI tool you are trying to create, without retraining it. This is a very basic definition; for more information, I suggest reviewing the aforementioned NVIDIA blog post or this Medium post.
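To make the idea concrete, here is a minimal sketch of the retrieve-then-augment flow in Python. Everything in it is an assumption for illustration: the DOCUMENTS list stands in for your external data, retrieve uses simple word overlap rather than a real vector search, and build_prompt is a hypothetical helper. No actual LLM is called; the final prompt is what you would hand to one.

```python
# Toy RAG flow: retrieve relevant text, then augment the prompt with it.
# DOCUMENTS, tokenize, retrieve, and build_prompt are hypothetical names.

DOCUMENTS = [
    "Refunds are processed within 5 business days.",
    "Our API rate limit is 100 requests per minute.",
    "Support is available Monday through Friday.",
]


def tokenize(text: str) -> set[str]:
    """Lowercase the text and split it into a set of words without punctuation."""
    return {word.strip(".,?!").lower() for word in text.split()}


def retrieve(question: str, documents: list[str], top_k: int = 1) -> list[str]:
    """Rank documents by how many words they share with the question."""
    question_words = tokenize(question)
    ranked = sorted(
        documents,
        key=lambda doc: len(question_words & tokenize(doc)),
        reverse=True,
    )
    return ranked[:top_k]


def build_prompt(question: str, context: list[str]) -> str:
    """Place the retrieved text ahead of the question so the LLM can use it."""
    context_block = "\n".join(f"- {passage}" for passage in context)
    return (
        "Answer using only the context below.\n"
        f"Context:\n{context_block}\n"
        f"Question: {question}"
    )


question = "What is the API rate limit?"
context = retrieve(question, DOCUMENTS)
prompt = build_prompt(question, context)
print(prompt)  # this string is what you would send to the LLM of your choice
```

The important part is that the model itself never changes. Your data only shows up in the prompt, at the moment a question is asked.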

Where Does Serverless Come into Play?

Let’s talk about some serverless RAGs in the news.

Let’s first talk about the concept of a vector database. Vector databases, in short, are databases that store unstructured data in a format called a “vector” (a fixed-length list of numbers) and can search for the vectors most similar to a given one. Prior to the rise of GenAI, they were seen as a bit of a niche technology with a niche use case. Since GenAI, people have found a lot of uses for them.
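Here is a rough sketch of what "searching vectors" means, using cosine similarity in plain Python. The three-number vectors and their labels are made up for illustration; real embeddings come from an embedding model and have hundreds or thousands of dimensions, and a real vector database uses indexes rather than comparing against every row.

```python
import math

# Hypothetical 3-dimensional "embeddings", invented for this example.
VECTORS = {
    "dish towel": [0.9, 0.1, 0.0],
    "terry cloth": [0.8, 0.2, 0.1],
    "language model": [0.0, 0.1, 0.9],
}


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def nearest(query: list[float], vectors: dict[str, list[float]], top_k: int = 2) -> list[str]:
    """Return the names of the top_k stored vectors most similar to the query."""
    scored = sorted(
        vectors.items(),
        key=lambda item: cosine_similarity(query, item[1]),
        reverse=True,
    )
    return [name for name, _ in scored[:top_k]]


# Pretend this is the embedding of a question about cleaning rags.
query = [0.85, 0.15, 0.05]
print(nearest(query, VECTORS))
```

The two cleaning-related entries rank above "language model" because their vectors point in nearly the same direction as the query.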

One company that specializes in this is Pinecone, whose vector database of the same name skyrocketed in popularity once GenAI took off. Pinecone offers a serverless vector database, and they recently announced that it is now generally available on AWS.

While Pinecone is arguably the premier solution, there are other contenders. Upstash comes to mind. There is also pgvector, an open-source PostgreSQL extension that essentially turns PostgreSQL into a vector database. Neon.tech, who we’ve talked about before, offers serverless pgvector.

As time progresses, I anticipate more startups will offer serverless databases as GenAI and RAG grow.
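For a sense of what pgvector looks like in practice, here is a small sketch that prepares the SQL from Python. The vector column type and the <=> cosine distance operator come from pgvector; the documents table, its columns, and the to_vector_literal helper are hypothetical names I made up. The %s placeholders assume a psycopg-style driver, and nothing here actually connects to a database.

```python
# SQL using pgvector's syntax. Table and column names are hypothetical.
SETUP_SQL = """
CREATE EXTENSION IF NOT EXISTS vector;
CREATE TABLE IF NOT EXISTS documents (
    id bigserial PRIMARY KEY,
    body text,
    embedding vector(3)  -- 3 dimensions only to keep the example small
);
"""

# <=> is pgvector's cosine distance operator; smaller means more similar.
SEARCH_SQL = """
SELECT body
FROM documents
ORDER BY embedding <=> %s::vector
LIMIT %s;
"""


def to_vector_literal(values: list[float]) -> str:
    """Format a list of numbers in pgvector's text form, e.g. '[0.1,0.2,0.3]'."""
    return "[" + ",".join(str(v) for v in values) + "]"


# Parameters for the search: the query embedding and how many rows to return.
params = (to_vector_literal([0.85, 0.15, 0.05]), 2)
print(SEARCH_SQL.strip())
print(params)
```

With a driver connected to a serverless Postgres instance, you would pass SEARCH_SQL and params to the cursor's execute call. That step is left out because it needs a live database.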

Serverless Apps + Serverless Data = Serverless GenAI?

So where does Serverless fit in the GenAI world? Well, the LLMs themselves will almost certainly be trained and hosted on traditional infrastructure, be it VMs or Kubernetes. Kubernetes is actually becoming a popular tool for AI/ML. Both OpenAI and Anthropic train on Kubernetes. KubeRay is becoming a popular method for distributing ML workloads on Kubernetes.

So that part of the equation will largely require infra. However, there are ways to abstract this away, such as with KubeRay, as mentioned earlier, and Kubeflow. This can all be operationalized with a tool like Pulumi, allowing data scientists to use Python to deploy infrastructure. While not exactly serverless, it’s getting close.

However, we can run inference against the LLM using serverless technology. The chatbots, search bars, etc. can all be hosted on serverless runtimes. We now have serverless databases for RAG. We can also use serverless technology to help ingest data into a pipeline for training.
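As a rough sketch of how those pieces could fit together, here is an AWS Lambda-style handler in Python that sits behind an HTTP endpoint. The handler(event, context) shape and the statusCode/body response follow the Lambda proxy convention; retrieve_context and call_llm are hypothetical stand-ins for a serverless vector database query and a hosted LLM API call, stubbed out so the example runs on its own.

```python
import json


def retrieve_context(question: str) -> list[str]:
    """Hypothetical stand-in for a query against a serverless vector database."""
    return ["Our API rate limit is 100 requests per minute."]


def call_llm(prompt: str) -> str:
    """Hypothetical stand-in for a request to a hosted LLM API."""
    return f"(model response to a {len(prompt)}-character prompt)"


def handler(event: dict, context: object) -> dict:
    """Lambda-style entry point: parse the question, retrieve, then generate."""
    body = json.loads(event.get("body") or "{}")
    question = body.get("question", "").strip()
    if not question:
        return {
            "statusCode": 400,
            "body": json.dumps({"error": "question is required"}),
        }

    passages = retrieve_context(question)
    prompt = "Context:\n" + "\n".join(passages) + f"\n\nQuestion: {question}"
    answer = call_llm(prompt)
    return {
        "statusCode": 200,
        "body": json.dumps({"answer": answer, "sources": passages}),
    }


# Local invocation with a fake event, the way you might test before deploying.
event = {"body": json.dumps({"question": "What is the API rate limit?"})}
response = handler(event, None)
print(response["statusCode"], response["body"])
```

The function holds no state between requests, which is what makes it a good fit for a serverless runtime. The data lives in the vector database and the model lives behind its API.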

What’s Next?

As mentioned earlier, I foresee a lot of serverless vector database startups taking off. Pinecone had big news this week, and Neon.tech did earlier as well. Last year Chroma raised $18M and Weaviate raised $50M. My friends at TheNile raised $11.2M. GenAI and RAG have been the tipping point for serverless data.

However, I don’t believe that it will stop with just serverless Postgres running pgvector or serverless vector databases. We will see other forms of serverless data retrieval become more popular. When that happens, I will write about it here!

-Photo by Karolina Grabowska https://www.pexels.com/@karolina-grabowska/

The Cloud Is Serverless
