Is a platform valid just because big tech embraces it? Not necessarily, but when a technology moves from startups to the major players, that adoption helps validate it. Serverless is not new by any stretch of the imagination, but it is now being treated as mainstream, and these stories help show that.
Forbes wrote another article stating that “Serverless architecture is no longer the future of computing—it is the present.” Forrester also wrote a little about serverless, calling out trends such as WASM, AI, and the evolution of serverless past FaaS (all things I have covered). Clearly serverless is here to stay, and recently we heard about some major steps.
The past few weeks brought a few massive stories around serverless, and this time they come from major companies. So let’s see what our hyperscalers are up to!
Wait, Serverless GPUs?! No way!
Yes, you heard me right. Google Cloud just recently introduced Cloud Run GPU support. It is no secret that I am a major Cloud Run fan. Yes, I work for Google Cloud, but even if I didn’t, I would still think Cloud Run is an awesome product, because it is.
One of the biggest problems with FaaS platforms is that they are opinionated and bespoke. You are running your code on someone else’s runtime so you need to package the code in a way that they deem acceptable AND you have to use the language version and libraries that they support. Containers solve those issues because, as I have said before, you bundle your code with the runtime.
Naturally, serverless containers are an amazing offering in my eyes, and it just so happens that Google invented this space. Even before Cloud Run’s launch in 2019, Google open-sourced Knative back in 2018, and Knative Serving’s API is the basis of Cloud Run.
Back to the story: Cloud Run now supports NVIDIA L4 GPUs. What does this mean? It is something people ask me over and over: “Serverless GPUs, so what?” Well, imagine this.
You have an LLM like gemma2 or Mixtral (or any of a myriad of LLMs), or even an image model like Stable Diffusion. You want to serve that model in a container, so you use Ollama or Hugging Face. The use case: you have created a chatbot and you want to run inference against the LLM. In the past, you might have deployed the container on Kubernetes and run inference that way.
Now that Cloud Run supports GPUs, you can attach a GPU to a Cloud Run instance, load the LLM in the container, and deploy. In under a minute (under 30 seconds in most cases) you can have the LLM deployed and ready to use. Oh ya, and it scales down to zero!
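To make that concrete, here is a minimal sketch of what calling such a service could look like. It assumes you deployed a container running Ollama with gemma2 baked into the image; the service URL is a hypothetical placeholder.

```python
# Minimal sketch: calling an Ollama server deployed on Cloud Run with a GPU.
# The service URL is hypothetical, and "gemma2" assumes that model was
# pulled into the container image at build time.
import requests

SERVICE_URL = "https://ollama-abc123-uc.a.run.app"  # hypothetical Cloud Run URL

resp = requests.post(
    f"{SERVICE_URL}/api/generate",  # Ollama's generate endpoint
    json={
        "model": "gemma2",
        "prompt": "In one sentence, why do serverless GPUs matter?",
        "stream": False,  # return one JSON response instead of a stream
    },
    timeout=120,
)
resp.raise_for_status()
print(resp.json()["response"])
```

Because the service scales to zero, you only pay for the GPU while requests are actually being served.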
This will be revolutionary, as people can now host LLMs in a serverless manner, something that was largely unheard of. It doesn’t have to be just LLMs, either; you can host other kinds of models on Cloud Run with GPUs.
I have talked about inferencing-as-a-service and Serverless AI in the past. I think this is the next big leap: we can now directly serve LLMs in a serverless manner, not just host serverless apps that call them.
Microsoft Invests Heavily in Serverless Vector Databases
Yahoo! Finance recently covered a $25M investment from Microsoft in Neon, a serverless Postgres provider. With that, Neon is now coming to Azure. We covered Neon and its growth in a previous post.
Due to the rise in Generative AI and LLMs, we are starting to see more and more people adopt Retrieval-Augmented Generation (or RAG). This is a way to customize the results of an LLM without having to retrain the foundational model.
The most common way to accomplish this is with a vector database. With the pgvector extension, you can give your Postgres database vector functionality, and we are now seeing the rise of serverless Postgres with pgvector installed for RAG purposes.
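For a sense of what this looks like in practice, here is a minimal sketch of pgvector-backed retrieval from Python. The connection string and the embedding helper are hypothetical stand-ins; in a real app the embedding would come from an embedding model.

```python
# Minimal sketch of RAG-style retrieval with pgvector. The DSN is a
# hypothetical placeholder, and get_embedding() stands in for whatever
# embedding model you actually use.
import psycopg2

def get_embedding(text: str) -> list[float]:
    # Stand-in for a real embedding model/API call; returns a dummy vector.
    return [0.0] * 1536

conn = psycopg2.connect("postgresql://user:pass@my-serverless-host/db")
cur = conn.cursor()

# Enable pgvector and store documents alongside their embeddings.
cur.execute("CREATE EXTENSION IF NOT EXISTS vector;")
cur.execute("""
    CREATE TABLE IF NOT EXISTS docs (
        id bigserial PRIMARY KEY,
        content text,
        embedding vector(1536)
    );
""")

# At query time: embed the question, then fetch the closest documents.
# <=> is pgvector's cosine-distance operator.
question = "How can a customer pay their bill?"
cur.execute(
    "SELECT content FROM docs ORDER BY embedding <=> %s::vector LIMIT 3;",
    (str(get_embedding(question)),),
)
context = [row[0] for row in cur.fetchall()]
# The retrieved context then gets prepended to the LLM prompt.
```

The retrieval step is what lets the LLM answer from your data without retraining anything.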
Microsoft making a major investment here AND working with the company to integrate their offering into Azure shows that Microsoft sees a future in this setup.
Microsoft is already heavily invested in OpenAI for the foundational LLMs, and now, by integrating a serverless vector database into its offering, it is starting to show how serious it is about generative AI.
After all, LLMs are trained on very specific data that likely doesn’t include your proprietary information (or at least I hope it doesn’t). If the GenAI solution doesn’t include your custom data, it may not be useful. Imagine a chatbot that can’t answer a question about how a customer can pay their bill. Fine-tuning and grounding are solutions, but RAG is a best practice.
BTW, if you want a quick rundown on RAG and its role with LLMs, check out a YouTube video that a friend made!
AWS Lambda Recursive Loop Detection
This may not seem like a big story, but to me, it is. AWS Lambda recently introduced recursive loop detection APIs. That may sound like a lot of tech jargon, so let me explain.
A recursive loop is when a function calls itself again and again until it reaches a stopping condition. In and of itself, this isn’t a bad thing. You might use such a loop when parsing a large JSON file with a lot of nested information, as in the sketch below.
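A quick illustration of that kind of benign recursion: a function that walks a nested JSON document by calling itself on every nested object or array.

```python
# Benign recursion: walking a nested JSON document. The function calls
# itself for every nested object or array and stops at leaf values.
import json

def collect_leaves(node, out):
    if isinstance(node, dict):
        for value in node.values():
            collect_leaves(value, out)  # recurse into nested objects
    elif isinstance(node, list):
        for item in node:
            collect_leaves(item, out)   # recurse into nested arrays
    else:
        out.append(node)                # base case: a plain value

doc = json.loads('{"order": {"items": [{"sku": "A1"}, {"sku": "B2"}], "total": 42}}')
leaves = []
collect_leaves(doc, leaves)
print(leaves)  # ['A1', 'B2', 42]
```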
When used improperly, though, you could find yourself dealing with a runaway workload. Over time, this can cause performance issues and/or a runaway bill. Now you can turn on detection so that AWS can warn you if such a loop exists.
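As a rough sketch, configuring this per function might look something like the following with boto3. The function name is hypothetical, and I am going off the API names in the AWS announcement (PutFunctionRecursionConfig / GetFunctionRecursionConfig), so check the docs for the exact shape.

```python
# Sketch: configuring Lambda's recursive loop behavior for one function,
# assuming boto3 exposes the PutFunctionRecursionConfig API from the
# announcement. The function name is hypothetical.
import boto3

client = boto3.client("lambda")

# "Terminate" stops a detected loop; "Allow" lets intentional recursion run.
client.put_function_recursion_config(
    FunctionName="my-json-parser",  # hypothetical function name
    RecursiveLoop="Terminate",
)

# Read the setting back to confirm it.
config = client.get_function_recursion_config(FunctionName="my-json-parser")
print(config["RecursiveLoop"])
```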
Why does this matter? Not too long ago I talked about runaway bills with serverless. While serverless is supposed to be cheaper, if you don’t put up guardrails, you can easily see that bill skyrocket. Unmonitored recursive loops can cause exactly that scenario.
In addition, as mentioned before, you may see performance issues. What if the loop gets stuck and your service starts failing? That gives your users a bad experience, and no one wants that.
As serverless apps begin to scale out, we need guardrails that keep the app functioning the way we want while also keeping the bill manageable.
Closing Thoughts
These are by no means the only serverless stories as of late, but they are some really great ones. They demonstrate how serverless compute is being taken seriously by the big three hyperscalers, which are not just investing in serverless but making big bets on their products’ futures and their businesses by expanding what serverless can do.
What’s next for serverless? Only time will tell!
—Photo courtesy Aleksandar Pasaric on Pexels—