DDN on files, objects and AI training and inference

Interview. DDN has been supplying high-speed file access with its EXAScaler Lustre parallel file system software for many years, and this supports Nvidia’s GPUDirect protocol. It has just announced its Infinia 2.0 fast-access object storage, which will support GPUDirect for objects.

Led by VAST Data and Glenn Lockwood, a Microsoft Azure AI supercomputer architect, there is a move to consign file-based AI training on Nvidia GPUs to history and use object storage in future. How does DDN view this idea? We discussed it with James Coomer, SVP for Products at DDN, and the following Q&A has been edited for brevity.

Blocks & Files: Where would you start discussing file and object storage use and AI?

James Coomer

James Coomer: There are three main points. Firstly, it’s not really the case that all large-scale AI systems can, out of the box, use S3 if we try it, right? So we’ve got the fastest S3 on the planet with Infinia. The [general object storage challenge] is that it’s really not mature. So the fact is, just ask any one of these SuperPODs, ask Nvidia, ask the largest language model developers on the planet. They’re using file systems and they’re using parallel file systems. So are they all wrong? I’m not sure they are. So firstly, the current state of play isn’t what they’re stating. If you were starting from scratch today, could you design an IO model which could avoid the use of a parallel filesystem and try to do things in parallel?

Yes, you could. And I think gradually things are moving that way and that’s why we built Infinia.

That’s why we built the lowest-latency S3 platform, with far faster data discovery than any of our competitors, because we see that scalability is a very important aspect of object stores. They’ve got great benefits; you can scale them more easily, you can handle distributed data more easily. It’s great, but they’re just not great at moving large amounts of data to small numbers of threads.

Blocks & Files: Talk about checkpointing

James Coomer: Our competitors tend to dumb down the challenges of checkpointing because their writes are slow, so they can’t write fast. They say, “oh, it doesn’t matter.” And I’ve seen now four articles over the past three years where first they said, “oh, checkpoints don’t matter.” And then it became obvious that they did matter and they said, “it’s okay, our writes are faster now,” but they still weren’t that fast.

And now they say, “oh, you don’t need it at all.” It’s just like any excuse to kind of pretend that checkpoints aren’t a big problem and parallel filesystems are the fastest way of doing it. The challenge is this; it’s not the box – as in the box could say, “I’m doing a terabyte a second, 10 terabytes a second.” It can say whatever it likes. The challenge is how do you push the data into the application threads?

Blocks & Files: What do you mean?

James Coomer: Application threads are running up there across maybe one GPU, maybe a thousand GPUs, maybe 10,000 GPUs, maybe 100,000 GPUs. But there’ll be a subset of ranks which are requesting IO. And the cool thing about a parallel filesystem is that we can really pack a thread full of data when it makes a read request; we can pack the single thread. Anybody can create a million threads and get great aggregate performance, but that’s not what these applications are doing.

They are spawning a few IO threads. A few ranks out of the parallel programme are making data requests. And the parallel file system magic is in pushing massive amounts of data to that limited number of threads. When you’re talking about performance it’s such a big spectrum, of course, but it’s certainly not an aggregate performance question. It’s about application performance.
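To make the per-thread point concrete, here is a rough Python sketch of the kind of comparison Coomer is describing: the same checkpoint file read by a handful of IO threads versus many, with per-thread bandwidth rather than the aggregate as the figure that matters. The file path, block size and thread counts are illustrative assumptions, not a DDN benchmark.

```python
# Rough sketch: per-thread versus aggregate read bandwidth from one file.
# CHECKPOINT path, block size and thread counts are illustrative assumptions.
import os
import time
from concurrent.futures import ThreadPoolExecutor

CHECKPOINT = "/mnt/pfs/checkpoint.bin"   # assumed file on a parallel file system
BLOCK = 16 * 1024 * 1024                 # 16 MiB per read request

def read_slice(offset: int, length: int) -> int:
    """One IO 'rank': read a contiguous slice of the checkpoint, return bytes read."""
    done = 0
    with open(CHECKPOINT, "rb", buffering=0) as f:
        f.seek(offset)
        while done < length:
            chunk = f.read(min(BLOCK, length - done))
            if not chunk:
                break
            done += len(chunk)
    return done

def run(io_threads: int) -> None:
    size = os.path.getsize(CHECKPOINT)
    slice_len = size // io_threads
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=io_threads) as pool:
        totals = list(pool.map(lambda i: read_slice(i * slice_len, slice_len),
                               range(io_threads)))
    elapsed = time.perf_counter() - start
    agg_gbps = sum(totals) / elapsed / 1e9
    print(f"{io_threads:>5} IO threads: {agg_gbps:6.2f} GB/s aggregate, "
          f"{agg_gbps / io_threads:6.2f} GB/s per thread")

# A huge thread count makes the aggregate number look good on almost any store;
# the question raised here is what a handful of threads can pull on their own.
for n in (4, 64, 1024):
    run(n)
```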

How is the application behaving? And, over time, applications change their behaviour for the better. But as we saw in HPC over the past, what, 30 years, have they changed? Think of all these applications. No, they still all use parallel file systems. All of them, like all big HPC sites are using parallel file systems because the applications take ages to change.

Now in AI, things are definitely changing faster, definitely much better, and we’re right in the heart of that. We’re working with Nvidia on NeMo, enhancing its IO tools to parallelize the S3 stuff. We’re right in the middle of it because we’ve built the fastest S3.

But still, the fact is parallel file systems are currently, for the broad range of AI frameworks, by far the fastest way of moving data to those GPUs and getting the most out of your 10 million, hundred million, or billion dollar spend on AI infrastructure, data, data scientists, et cetera.

Blocks & Files: You said you had three points to make. What are the others?

James Coomer: The first one is the world isn’t ready yet. Secondly, the people who implement these systems aren’t the ones installing the AI frameworks and running them. That’s the data scientists. Often they don’t even know who the infrastructure people are. So they’ll come along and they’ll do what they’re going to do. You’ve got no control over it. How are they going to push their IO and stuff like that? So the fact is the majority of these AI frameworks run really, really fast on parallel file systems and don’t run very fast in S3 or don’t work at all.

The third one is, if it [object] was ready, we’d be the best, because Infinia is like a hundred times faster than the object storage we’ve compared with in data discovery, what we call bucket listing. We’re twenty to twenty-five times faster in time to first byte than the fastest object store competitor we could find on the same hardware system. And we provide around five times the puts and gets per second of anything else we’ve seen. So we think, if it was the case that object was the new utopia for AI, then we are down there ready, and we can compete with everybody on that plane.


Blocks & Files: Is this much, much more than just satisfying a GPUDirect for object protocol?

James Coomer: Oh yes, yes. I’m glad you said that because, otherwise, DDN will be like the others who are going to be supporting that protocol.

Most of the latency in the object data path happens at the storage end. HTTP’s non-RDMA-accelerated stack for S3 is not exactly fast and it’s not very well parallelized. So by using GPUDirect for objects, we’re going to accelerate the RDMA network component, but we’re not accelerating what the storage is doing to make that data available. Once the server receives the object, how does it make it safe and then respond with the acknowledgement?

That is actually, in our tests, where the bulk of the latency resides. It’s actually the back-ends. And that makes sense historically, because until now nobody’s really cared that much about object store latency and straight performance.

Blocks & Files: No, they haven’t. They haven’t at all, because it’s been basically disk-based arrays for objects, being cheap and very, very scalable data storage; not for real time data. And you’ve been writing Infinia for five years or more? 

James Coomer: Yes, more than five years. 

Blocks & Files: You’ve been putting in a lot of effort to do this, and what you’re saying makes sense. You’re not just sticking a fast GPUDirect funnel on top of your object store. You’ve got the tentacles going down deep inside it so you can feed the funnel fast.

James Coomer: We’ve been really paranoid about it during the development of our underlying data structures. So the key-value store we use is designed to provide a very, very low latency data path for small writes and small reads, and a completely different, also optimised, path for large writes and reads, all in software. This does contrast with the rest.

Definitely, let’s say the first generation of object stores and the second generation were archives. They’re supposed to be low cost and supposed to be disk-based. And if they’re talking about performance, they tend to talk about aggregate performance, which anybody can do. You have a million requesters and a million disks and you can get a great number. That’s never a performance challenge. Anybody can do that because there’s no contention in the IO requests. 

The challenge is, from a single server, from a small amount of infrastructure, from a small number of clients, how much data can you push to them? What’s the latency for an individual put and its response? And we’ve got that probably sub-millisecond, which is less than a thousandth of a second. If you look at other object stores, you’ll see the best of the best that we’ve seen out there is ten times that. And then if you go to cloud-based object stores, historically they’re often going to be close to a second in response time, especially when the load is high.
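For context, put latency and get time-to-first-byte can be probed against any S3-compatible endpoint along roughly these lines. The endpoint, bucket and credentials below are placeholders, and this sketch is not DDN’s test harness.

```python
# Placeholder sketch: single-request put latency and get time-to-first-byte
# against an S3-compatible endpoint. Endpoint, bucket and keys are assumptions.
import time
import boto3

s3 = boto3.client(
    "s3",
    endpoint_url="https://objectstore.example.com",  # hypothetical endpoint
    aws_access_key_id="ACCESS_KEY",
    aws_secret_access_key="SECRET_KEY",
)

BUCKET, KEY = "bench", "probe-4k"
payload = b"x" * 4096  # small object, so latency dominates over transfer time

# Put latency: time from issuing the request to receiving the acknowledgement,
# which includes whatever the back-end does to make the data safe.
t0 = time.perf_counter()
s3.put_object(Bucket=BUCKET, Key=KEY, Body=payload)
put_ms = (time.perf_counter() - t0) * 1000

# Get time-to-first-byte: time until the first byte of the body arrives.
t0 = time.perf_counter()
resp = s3.get_object(Bucket=BUCKET, Key=KEY)
resp["Body"].read(1)
ttfb_ms = (time.perf_counter() - t0) * 1000

print(f"put latency {put_ms:.2f} ms, get TTFB {ttfb_ms:.2f} ms")
```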

So that’s one aspect, the latency of the puts and gets. But actually it’s not the biggest complaint we see from consumers of object stores who are interested in performance. The biggest complaint is that the first thing they’ve got to do in data preparation is find the bits of data they’re interested in preparing.

Blocks & Files: Can you give me an example?

James Coomer: So imagine you’re an autonomous driving company. And you’ve got a million videos from a million cars, and each video’s got a label. Well, actually, it’s got a thousand labels on each video, and for every frame of each video there’s maybe a hundred labels, because it’s been through unsupervised learning.

It’s going to label all the contents of that frame and say “there’s a cat, there’s a traffic light, there’s a car, there’s a building.” It’s all there in metadata. And the data scientist, who wants to work with these million files, each one with tens of thousands of labels, is going to say: I am interested in finding all the video frames with a ginger cat. That is data discovery. It is object listing. You’re not even getting the objects, you’re just finding out where they are. And that’s really the biggest issue, the biggest bottleneck right now for data preparation, which is about 30 percent of the overall wall-clock time of AI.

And the biggest part of doing that bit, the data preparation part, is finding these pieces of data. And we’ve done that. We do that really, really fast, like a hundred times faster than anything else we’ve tested. This object listing sounds like it’s really deep in the weeds, but that’s how fast we find the data. And data scientists are doing it all the time. 
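For readers wondering why listing is such a bottleneck, the naive approach over a plain S3 API looks something like the sketch below: page through the bucket a thousand keys at a time and filter client-side. The bucket name and the idea of storing frame labels as object tags are assumptions made purely for illustration.

```python
# Naive data discovery over plain S3: list everything, then filter client-side.
# Bucket name and the label-as-object-tag convention are assumptions.
import boto3

s3 = boto3.client("s3")
paginator = s3.get_paginator("list_objects_v2")

matches = []
# list_objects_v2 returns at most 1,000 keys per page and the pages come back
# sequentially, so a million-object bucket means roughly a thousand round trips
# before a single byte of video has been read.
for page in paginator.paginate(Bucket="drive-videos"):
    for obj in page.get("Contents", []):
        tags = s3.get_object_tagging(Bucket="drive-videos", Key=obj["Key"])
        labels = {t["Key"]: t["Value"] for t in tags["TagSet"]}
        if labels.get("animal") == "ginger_cat":
            matches.append(obj["Key"])

print(f"{len(matches)} matching frames")
```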

So I think when people focus on checkpointing, it’s for a marketing reason. And actually what we’re trying to do is work out, firstly, where people are spending their time and where storage and data are the bottleneck. We think of the cycle as ingest, preparation, training and inference.

It’s actually in the data preparation phase where object storage is mostly used, because you tend to get images and similar data coming into object stores; satellite data and microscope data tend to go to an object store.

The challenge there for the data scientist is munging that data at massive scale using things like Spark, like Hadoop. They’re basically preparing their data, creating these datasets ready for training. And the biggest problem there is finding that stuff, and that’s object listing. 

The second one is time to first byte. Third one is puts and gets per second, fourth one is actually throughput. So fourth in the list of importance is the throughput of the objects in this case. 

Blocks & Files: Regarding the object listing; I’m moderately familiar with trying to find a particular file in a billion-file/folder infrastructure, with tree walks being involved and so on and so forth. But you obviously don’t have a file-folder structure with object, yet you’ve got this absolutely massive metadata problem. Is it just a simple flat space or is it partitioned up somehow? Do you parallelize the search of that space?

James Coomer: It is massively parallelized, but also, it’s a multi-protocol data plane. So even though we are exposing an object store first, we’re not an object store.

It’s actually a key-value (KV) store underneath. It’s a KV store with a certain data structure. And we chose this particular KV store because it’s really fast at managing what we call commits and renders, or writes and reads.

So it’s writes and reads, but small things and big things, and we do a lot of this stuff in a very intelligent way. The log structuring is done very intelligently, so we don’t need the write buffer that other people need, and those kinds of things. But the point is, think about your knowledge of file systems and blocks and read-modify-writes. That’s nothing to do with what we’re doing. Nothing. Even though we are going to present a file system and an object store and an SQL database and a block store on top of this key-value store, there’s none of that traditional file system construct built into the backend of what we’re doing and how we’re laying this data out.

So we can index data and create parallel indexes of the metadata of all the objects, and we can basically cache these indexes and then query them in parallel like you would a parallel database, like something like Cassandra. The same kind of architecture that Cassandra uses is a similar approach to how we’re allowing people to access and search the metadata associated with the objects.
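A hedged sketch of that access pattern, with a hypothetical partitioned label index queried in parallel and the results merged, might look like this; the data structures and function names are invented for illustration and are not Infinia’s internals.

```python
# Toy model of a partitioned metadata index queried in parallel and merged.
# The partition layout and query function are hypothetical.
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, List, Set

# Each partition maps object key -> set of labels; in a real system the
# partitions would live on different servers and be cached.
Partition = Dict[str, Set[str]]

def query_partition(part: Partition, label: str) -> List[str]:
    """Scan one index partition for objects carrying the requested label."""
    return [key for key, labels in part.items() if label in labels]

def parallel_listing(partitions: List[Partition], label: str) -> List[str]:
    """Fan the query out across all partitions and merge the results."""
    with ThreadPoolExecutor(max_workers=len(partitions)) as pool:
        results = pool.map(lambda p: query_partition(p, label), partitions)
    return [key for chunk in results for key in chunk]

# Toy data: two partitions of the label index.
parts = [
    {"video1/frame17": {"ginger_cat", "traffic_light"}},
    {"video9/frame03": {"building"}, "video9/frame04": {"ginger_cat"}},
]
print(parallel_listing(parts, "ginger_cat"))  # ['video1/frame17', 'video9/frame04']
```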

Blocks & Files: You were talking about an underlying key-value store with object on top of it, and I believe you just mentioned file systems on top of it as well. Then you could stick a block system on top of it if you wanted.

James Coomer: Exactly. Completely. So the KV store is here and we just expose it through different protocols. We expose it now through a containerized set of services, exposing it as S3. It’s a really thin layer. The S3 is really thin.

Then we can do the same. We expose the same KV pairs, which are basically hosting the data and metadata, as file, as block, as an SQL database. All of these different things can be exposed very efficiently from a KV store. And they’re all peers, right? There’s nothing static. They’re all peers to each other, these different services, and they’re all just exposing the same underlying scalable KV store, which can hold tiny things and big things, millions and billions of pieces of tiny metadata or a small number of terabyte-size objects, equally efficiently.
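As a rough illustration of how thin such protocol layers can be, here is a toy sketch in which the same key-value primitives back both an S3-style object interface and a block-style interface. The class and method names are invented; this is not Infinia’s API.

```python
# Toy sketch: one KV store, two thin protocol layers on top of it.
# Class and method names are invented for illustration only.
class KVStore:
    """Flat key-value store standing in for the scalable data plane."""
    def __init__(self) -> None:
        self._data: dict[bytes, bytes] = {}

    def put(self, key: bytes, value: bytes) -> None:
        self._data[key] = value

    def get(self, key: bytes) -> bytes:
        return self._data[key]


class S3Layer:
    """Object protocol: bucket/key maps straight onto a KV key."""
    def __init__(self, kv: KVStore) -> None:
        self.kv = kv

    def put_object(self, bucket: str, key: str, body: bytes) -> None:
        self.kv.put(f"s3/{bucket}/{key}".encode(), body)

    def get_object(self, bucket: str, key: str) -> bytes:
        return self.kv.get(f"s3/{bucket}/{key}".encode())


class BlockLayer:
    """Block protocol: each logical block address is its own KV entry."""
    def __init__(self, kv: KVStore, block_size: int = 4096) -> None:
        self.kv, self.block_size = kv, block_size

    def write_block(self, volume: str, lba: int, data: bytes) -> None:
        self.kv.put(f"blk/{volume}/{lba}".encode(), data)

    def read_block(self, volume: str, lba: int) -> bytes:
        return self.kv.get(f"blk/{volume}/{lba}".encode())


# Both layers are peers over the same store.
kv = KVStore()
S3Layer(kv).put_object("bench", "hello.txt", b"hi")
BlockLayer(kv).write_block("vol0", 0, b"\x00" * 4096)
print(S3Layer(kv).get_object("bench", "hello.txt"))
```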

We do it all in software. You don’t need any special storage class memory, anything like that. It’s all software, all scalable, super fast and very, very simple. We’ve done a better architectural job than others in building this underlying scalable data plane with this KV store to suit the different characteristics of block versus SQL versus S3 versus file.