Bridging the Gap Between Infrastructure and Artificial Intelligence 

Elizabeth Whitney
April 3, 2024

I sat down with Converge’s Vice President of Engineered Solutions Darren Livingston and Senior Director of AI and AppDev Dr. Jonathan Gough to discuss how they’ve managed to bridge the gap between infrastructure and artificial intelligence (AI) when designing and implementing AI solutions for Converge’s clients. The wide-ranging discussion begins with a brief back-and-forth on running AI workloads in the cloud versus on premises and then moves into a description of how they’ve managed to bring together infrastructure and data science perspectives.  

What are some reasons to run AI workloads on premises or in the cloud? 

Darren: From an infrastructure perspective, the cost to consume GPU workloads drives a lot of the decisions on cloud vs. on prem. If the environment is being consistently utilized and/or growing at a rapid pace, it makes financial sense to build some GPU capacity on premises. As with many workloads, the answer is ultimately hybrid; inferencing, for example, could be on prem, in the cloud, on a laptop, anywhere. Jonathan, from a workload perspective, what are your thoughts? 

Jonathan: From a workload perspective, I think what really matters is the volume and the priority. Where we’re seeing the greatest value with AI is when there’s a lot of high-value, high-volume compute being leveraged, and running on prem is the most cost-effective way to do that. If you’re running something once a month or once a quarter or once a day, using some sort of spot instance in the cloud is a great idea, because if you’re a small organization, you don’t have big workloads. There are great ways to leverage cloud compute that don’t break the bank, and you don’t have to monitor it, watch it, or take care of it. 
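To make the utilization argument concrete, here is a rough back-of-the-envelope sketch in Python. The server cost, overhead multiplier, and cloud rates are all illustrative assumptions rather than real quotes; the point is only that the amortized on-prem cost per GPU-hour drops below cloud rates once utilization is consistently high.

```python
# Back-of-the-envelope cloud vs. on-prem GPU cost comparison.
# Every figure below is an illustrative assumption, not a real quote.

ONPREM_SERVER_COST = 250_000           # assumed 8-GPU server, fully loaded (USD)
ONPREM_LIFETIME_HOURS = 3 * 365 * 24   # amortized over an assumed 3-year life
ONPREM_OVERHEAD = 1.4                  # assumed power/cooling/ops multiplier
GPUS = 8

CLOUD_ON_DEMAND_PER_GPU_HR = 4.00      # assumed on-demand rate (USD per GPU-hour)
CLOUD_SPOT_PER_GPU_HR = 1.50           # assumed spot rate (USD per GPU-hour)

def onprem_cost_per_gpu_hour(utilization: float) -> float:
    """Amortized cost per *useful* GPU-hour at a given utilization (0-1]."""
    total_cost = ONPREM_SERVER_COST * ONPREM_OVERHEAD
    useful_gpu_hours = ONPREM_LIFETIME_HOURS * GPUS * utilization
    return total_cost / useful_gpu_hours

for util in (0.05, 0.25, 0.50, 0.90):
    print(f"utilization {util:>4.0%}: on-prem ${onprem_cost_per_gpu_hour(util):6.2f}/GPU-hr "
          f"vs cloud ${CLOUD_ON_DEMAND_PER_GPU_HR:.2f} on-demand / "
          f"${CLOUD_SPOT_PER_GPU_HR:.2f} spot")
```

Under these assumed numbers, a lightly used server (5% utilization) costs over $30 per useful GPU-hour, while a consistently busy one (90%) lands under $2, which is the crossover both speakers are describing.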

But many organizations are running high-value workloads, and the way things are moving, AI is going to be embedded in literally everything. A lot of that is more cost-effective to run on prem. There are tighter controls, and security is also key. A lot of organizations are concerned about their data. We’re not sure where AI is going in terms of legislation and governance, so having tighter control over things, knowing what’s happening, and being able to monitor it on prem is the way many organizations are going. Darren, would you agree? 

Darren: I would agree with that. Absolutely, security and cost control are big for on-prem deployments. I mean, here’s the thing: lots of clients start in the cloud. They play in the cloud, run a POC, and then they get that bill in six months. They like it, but they have to find a balance between cost-effectiveness and minimizing security risk. The other thing that comes up a lot around legislation is data sovereignty. That’s just one more element to consider: the data, how good it is, where it lives, and where you’re running models. 

Jonathan: I think that’s a great point, Darren. If you’re going to have a couple of data scientists fooling around with things, they only need a GPU once a week. They can run it for an hour or two and turn it off. That works, but once you move towards production, that’s where you try before you buy, explore the value, and decide what you actually need. Measure twice, cut once, right? It’s important to be doing that measuring. 

Explain the technology stack and architecture of Converge’s AI solution, highlighting any innovative or noteworthy components. 

Darren: How about this, Jonathan? I’ll start and then you finish.  

Jonathan: You start at the bottom, and I’ll finish at the top. 

Darren: There we go. I like that – the whole racecar / racecar driver thing or dance floor / dancer thing. I’m going to talk about the on-prem piece, and I’ll throw in the cloud piece along the way. It comes down to four basic things: compute, storage, network, and GPUs. And this is all sitting in a client’s data center or a colo. They might start off in their own data center with a GPU, and then they start using the cloud for inferencing. To Jonathan’s point earlier, there’s so much data; the more data you can process and ingest into models, the better the outcomes. From an OEM perspective, we work with all of them: NVIDIA, Intel, Dell, Lenovo, Nutanix, HPE, DDN, and VAST Data, to name a few. I’ll let you finish, Jonathan. 

Jonathan: I think you should talk about how you set up things in VMware. It makes it, from an administrative standpoint, easier for the team to set up workloads. 

Darren: Oh, sure. So, we’ve done some things in our lab over the last four years. We started with bare metal, predominantly Red Hat, and we have some Ubuntu. We’ve also used VMware, and, to Jonathan’s point, this was in the early days when I needed clarification on what Jonathan and his team required, and server updates would blow stuff up. So, we started using VMware to virtualize these servers, and we’ve optimized it so you couldn’t even tell these things were virtual. We’re now able to recover within minutes of installing bad open-source code, which makes day-two operations much easier. We’re able to do some really cool stuff. Now we’re going to start working with OpenShift, which is Red Hat’s equivalent to VMware. That’s coming; it’s a little too early to say how far it will go, but in the end, we’ve done both, virtual and bare metal, in any combination from an infrastructure perspective. We can definitely talk about your day-two operations experiences and how to maintain and upgrade it. So, go for it, Jonathan. 
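As a rough illustration of the day-two recovery pattern Darren describes, here is a minimal sketch using the open-source pyVmomi SDK: snapshot a GPU server before a risky install, and revert in minutes if it breaks. The vCenter host, credentials, and VM name are placeholders assumed for the example, not details of Converge’s lab.

```python
# Minimal sketch: snapshot a VM before a risky install, revert if it breaks.
# Requires `pip install pyvmomi`. Host, credentials, and VM name are placeholders.
import ssl
from pyVim.connect import SmartConnect, Disconnect
from pyVmomi import vim

ctx = ssl._create_unverified_context()  # lab shortcut; use real certs in production
si = SmartConnect(host="vcenter.example.com", user="admin",
                  pwd="changeme", sslContext=ctx)

def find_vm(si, name):
    """Walk the inventory and return the first VM with a matching name."""
    content = si.RetrieveContent()
    view = content.viewManager.CreateContainerView(
        content.rootFolder, [vim.VirtualMachine], True)
    try:
        return next(vm for vm in view.view if vm.name == name)
    finally:
        view.Destroy()

vm = find_vm(si, "gpu-node-01")

# Snapshot before installing new open-source packages on the GPU node.
vm.CreateSnapshot_Task(name="pre-install",
                       description="before new open-source stack",
                       memory=False, quiesce=True)

# ...if the install blows the server up, roll back in minutes:
# vm.snapshot.currentSnapshot.RevertToSnapshot_Task()

Disconnect(si)
```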

Jonathan: So, I mean, Darren basically exposes to the team what they need, and based on the business use case and the workload, we may use a single machine, or we may use a hyperconverged infrastructure like Nutanix or OpenShift to deploy these solutions and serve the endpoints for production workloads from there. We like to containerize our applications and create immutable images, which aligns with software development lifecycle (SDLC) best practices, DevOps, continuous integration, and continuous updates, at the system level, at the application level, and for the models inside those containers. We leverage a lot of the NVIDIA SDKs and their images, because they’re optimized. We’re big fans of all the stuff that NVIDIA is putting out there. The Triton Inference Server is extremely powerful. They make it really easy to leverage an inference server and inference image that’s already optimized: bring your own models and deploy them. I think that’s what’s noteworthy, because a lot of organizations are just rolling their own, and they don’t get the full value of the NVIDIA GPUs, because they’re not leveraging those SDKs properly. 
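For a sense of what “bring your own models, deploy them” looks like from the client side, here is a minimal sketch that sends an inference request to a Triton Inference Server using NVIDIA’s tritonclient library. The server URL, model name, and tensor names are assumptions for illustration; they depend on the model repository actually deployed.

```python
# Minimal Triton Inference Server client sketch (pip install tritonclient[http]).
# Server URL, model name, and tensor names are assumed for illustration.
import numpy as np
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000")

# Build a request for a hypothetical image-classification model.
batch = np.random.rand(1, 3, 224, 224).astype(np.float32)
infer_input = httpclient.InferInput("input__0", list(batch.shape), "FP32")
infer_input.set_data_from_numpy(batch)

# Triton loads the model from its repository; the client just names it.
result = client.infer(model_name="resnet50", inputs=[infer_input])
scores = result.as_numpy("output__0")
print("top class:", int(scores.argmax()))
```

The design point Jonathan is making is that the server side (batching, GPU scheduling, model versioning) comes pre-optimized, so teams that roll their own serving stack are re-solving problems the SDK already handles.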

I think what is also noteworthy is that we’re leveraging VMware and different hyperconverged infrastructures, so we use only the right amount of hardware to get the job done. It’s about running workloads that are sustainable, cost-effective, and provide business value. If your workload costs more than the dollars it saves or the dollars it makes, then you’re doing it wrong, you know? 

Darren: Well, it’s like spending 100 bucks to save 5 bucks. 

Jonathan: Yeah. I mean, people ask, what is AI? Well, it’s technology that either saves money or makes money. And if you’re not doing one of those two things, you have a problem. 

What key challenges did you encounter while designing and implementing this solution, and how did you overcome them? 

Darren: I will talk personally on this one. I learned how to speak data science from an infrastructure perspective. It’s like speaking two languages. In the beginning, talking with Jonathan and several of his team members, I’d have no idea what they were asking for. Jonathan would laugh because I’d have to use a safe word, like “that’s interesting,” and then Jonathan would know I had no idea. But this is going back to the early days, Jonathan. Would you not agree? 

Jonathan: Yeah, 100%. 

Darren: We learned to work with each other and got to an understanding of what he wants and what I want. You know, everything changes by the hour, right? Probably even by the minute, people are trying out new code, and they’re breaking the server. If we didn’t have an easy way to recover, we’d spend more time on recovering than actually doing anything good with it.   

Our first step was to understand each other’s needs and find a common solution. For me, this meant learning the language of data science and finding ways to optimize performance, whether on-prem or in the cloud. The decisions we make, be it for performance, security, or ease of operations, are influenced by a multitude of factors. Without a collaborative approach, it’s like speaking two different languages. 

Jonathan: Or even French to a French Canadian. If somebody from France talks to somebody from Quebec, they have a hard time communicating. Technically it’s the same language, but they’ve shifted into different dialects, and I think in many ways that’s like the challenge Darren and I have faced: the cloud spoiled data scientists, because it let them speak data science quickly and put a product out there that people could use easily. 

And the way data scientists and machine learning experts use compute is different than the way traditional business applications use compute. This is a very rapidly evolving landscape, and it really does take a team of people working together – the data scientist, the machine learning engineers, the digital infrastructure team partnering together, listening, and working together to put together the right solution for the problem that needs to be solved. 

You know, the analogy I use is: somebody says, dig a hole. I can do it, but do I need a shovel? Do I need a spade, or do I need a backhoe? How much do I need, and how long do I need it? Doing that effectively is critical to having your infrastructure team be able to carve up and serve up what you need so that the team who’s using it can use it appropriately. It takes work and effort for those two teams to learn how to speak the same language. 
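One common mechanism for that kind of carving up is GPU scheduling in Kubernetes, where each workload declares exactly the resources it needs. The sketch below uses the official Kubernetes Python client; the namespace, container image, and resource figures are assumptions for illustration, and this is one possible approach rather than a description of Converge’s setup.

```python
# Minimal sketch: request exactly one GPU for a training pod via the
# Kubernetes Python client (pip install kubernetes). Namespace, image,
# and resource figures are illustrative assumptions.
from kubernetes import client, config

config.load_kube_config()  # or load_incluster_config() when running in-cluster

pod = client.V1Pod(
    metadata=client.V1ObjectMeta(name="train-job", namespace="ml-team"),
    spec=client.V1PodSpec(
        restart_policy="Never",
        containers=[
            client.V1Container(
                name="trainer",
                image="nvcr.io/nvidia/pytorch:24.01-py3",  # assumed NGC image tag
                command=["python", "train.py"],
                resources=client.V1ResourceRequirements(
                    # The "shovel vs. backhoe" decision, made explicit:
                    # ask for exactly what the job needs, not the whole node.
                    limits={"nvidia.com/gpu": "1", "cpu": "8", "memory": "32Gi"},
                ),
            )
        ],
    ),
)

client.CoreV1Api().create_namespaced_pod(namespace="ml-team", body=pod)
```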

You know, on my birthday a year and a half ago, ChatGPT was released, and other open-source models of similar size and compute needs came out. It was very difficult and took a lot of power to run those models, but things changed over the course of six weeks. It took four GPUs to run a single model, but four to six weeks later you could run the same model on one or two GPUs, because people got inventive and creative when they saw it was a problem. That’s not the way traditional infrastructure works. You don’t see changes happening month over month, week over week. Changes happen quarter over quarter or year over year, and even that’s huge in the infrastructure world, right? But here you’ve got things changing weekly and monthly, so it’s critical to be able to partner together, design the solution that’s needed, and implement it effectively. 
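One example of the inventiveness Jonathan is describing is weight quantization, which shrinks a model’s memory footprint so it fits on fewer GPUs. Below is a minimal sketch using Hugging Face Transformers with 4-bit bitsandbytes quantization; the model ID is a placeholder, and quantization is just one of several techniques (tensor parallelism and distillation are others) behind that four-GPUs-to-one shift.

```python
# Minimal sketch: load a large model in 4-bit so it fits on fewer GPUs.
# Requires `pip install transformers accelerate bitsandbytes`.
# The model ID is a placeholder; use any causal LM you have access to.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "some-org/large-model"  # placeholder

quant_config = BitsAndBytesConfig(
    load_in_4bit=True,                     # roughly 4x smaller than fp16 weights
    bnb_4bit_compute_dtype=torch.float16,  # compute in fp16 for speed
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=quant_config,
    device_map="auto",  # spread layers across whatever GPUs are available
)

inputs = tokenizer("Hello", return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=20)[0]))
```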

Darren: That’s a great point.  

Jonathan: Yeah, I think the biggest challenge going forward is that infrastructure teams are used to just thinking about compute, and now they’re going to be pulled in to help think about business use cases. And so the business of solving business problems is going to start trickling down to those technology teams more and more. 

Darren: I would even go one step further. I think traditional infrastructure people have always thought: if I buy some hardware, I actually have AI, or I’ve automatically fixed an issue, right? 

Jonathan: Right. 

Darren: I think that in this world, it won’t work like that. Just because you buy a server packed full of GPUs doesn’t mean you’re now running AI workloads; you’re not. That’s not how it is. People fail to realize the infrastructure is there to support the workload. 

Jonathan: Right. 

Darren: There must be some thought put into this: you can’t throw servers at something when you don’t know what you’re trying to accomplish.  

Jonathan: Right. And the machine learning and data science teams, they like toys. You have to restructure teams. 

Darren: Yeah. 

Jonathan: And it’s a partnership too. You have to find that balance. 

Darren: Last thoughts: I’ve been a customer, I’ve been a reseller, and I’m an integrator; I’ve run the whole gamut, including working for OEMs. I mean, in the beginning, four years ago, it was humbling. You know, it’s not like your traditional databases or traditional infrastructure. But you must remain open-minded, right? You do a lot more listening than talking, right? I’ve learned a lot from Jonathan. His team is absolutely fantastic to work with and learn from. 

Jonathan: Yeah, there are a lot of ways to solve these problems. And doing it well and correctly is a partnership. You know, there isn’t just one tool in the toolbox. There’s a lot of them, whether it’s on prem or in the cloud. It’s a partnership to figure out the best way to do this in such a rapidly evolving landscape. And, I’ll say from a development standpoint, from my team’s perspective, it’s critical that we build these things so that they’re flexible and so they scale, so it’s reproducible and it can be moved from one environment to the next seamlessly. 
