Data orchestrator Hammerspace is challenging the conventional wisdom that object storage is the optimal solution for AI training and inference, arguing that universal, protocol-agnostic data access is far more crucial.
In a sense, that stance is natural: Hammerspace has AI model training customers such as Meta. Its technology is based on parallel NFS, and it supports Nvidia’s GPUDirect fast file access protocol. However, Hammerspace supports S3 data access as well as file access. It has a partnership with object storage supplier Cloudian so that Cloudian’s HyperStore object storage repository can be used by Hammerspace’s Global Data Platform software. HyperStore supports Nvidia’s GPUDirect for object storage, which is designed to provide faster object access.
Molly Presley, Hammerspace SVP for marketing, discussed the file-versus-object AI topic with Blocks and Files, then moved on to making data suitable for AI processing: vectorization, and how data should be organized for the LLM/agent era.

Blocks & Files: Why is Hammerspace focused on a hybrid data platform instead of just file or object storage?
Molly Presley: In Glenn Lockwood’s article, he calls out the pain points of parallel file systems: their proprietary nature and their need for specialized headcount. This is a huge reason why Hammerspace, with over 2,400 contributions to the Linux kernel, is so focused on a standards-based data platform. Customers who need standards-based access without proprietary clients and silos are not limited to object storage.
It’s not about choosing between file systems and object storage interfaces; the conversation is also about scalability, efficiency at scale, understanding data sources, and seamlessly orchestrating data regardless of its format.
Focusing solely on storage interfaces and file vs object storage trivializes the complexity of today’s AI demands. Each workload has different performance requirements, is connected to different applications with different storage interface requirements, and may use data sources from a wide variety of locations. The optimal platform delivers performance through orchestration, scalability, and intelligent workload-specific optimizations.
Blocks & Files: Are AI infrastructure purchase decisions primarily being made around training workloads?
Molly Presley: No. As organizations assess their AI investments, they are thinking about more than just training. Data architecture investments for most organizations need to accommodate far more than training: they need to span inference, RAG, real-time analytics, and more. Each requires specific optimizations that go beyond generic, one-size-fits-all storage systems. A data platform is needed, and it must adapt to each phase of AI workloads, not force them into outdated storage paradigms.
A data platform must provide real-time data ingestion (aka data assimilation), intelligent metadata management, security, and resilience. Storage interfaces alone don’t solve the full challenge – data must be fluid, orchestrated, and dynamically placed for optimal performance across workloads.
Blocks & Files: We have been concerned about the spread of LLMs because it implies, in principle, that LLMs need access to an organization’s entire data estate. Will an organization’s entire data estate need to be vectorized? If not all of it, which parts? Mission-critical, near-time, archival?
Molly Presley: At Hammerspace, we don’t see vectorization as the immediate challenge or top-of-mind concern for buyers and architects – it’s global access and orchestration. Organizing data sets, ensuring clean data, and moving data to available compute are much more urgent in today’s training, RAG, and iteration workloads.
The need to vectorize an organization’s entire data estate is highly use-case and industry-specific. While the answer varies, full vectorization is typically unnecessary. Mission-critical and near-time data are the primary candidates, while archival data can be selectively sampled to identify relevance or patterns that justify further vectorization.
The key to effective implementation is enabling applications to access all data across storage types at a metadata control plane level – without requiring migrations or centralization. This ensures scalability and efficiency.
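As a minimal sketch of that selective approach – assuming the estate can be walked as file paths, and with the tier rule, sampling rate, and embed() stub all hypothetical rather than anything Hammerspace prescribes – the selection logic might look like this:

```python
import os
import random
import time

NEAR_TIME_WINDOW = 90 * 24 * 3600   # assumed cutoff for "near-time": 90 days
ARCHIVE_SAMPLE_RATE = 0.05          # assumed: sample 5% of archival files

def embed(path: str) -> list[float]:
    """Stand-in for a real embedding-model call."""
    return [0.0, 0.0, 0.0]

def tier_of(path: str, now: float) -> str:
    if "/critical/" in path:        # placeholder rule for mission-critical data
        return "mission-critical"
    age = now - os.path.getmtime(path)
    return "near-time" if age < NEAR_TIME_WINDOW else "archival"

def select_for_vectorization(paths: list[str]) -> dict[str, list[float]]:
    now = time.time()
    index = {}
    for p in paths:
        tier = tier_of(p, now)
        if tier in ("mission-critical", "near-time"):
            index[p] = embed(p)                      # vectorize in full
        elif random.random() < ARCHIVE_SAMPLE_RATE:  # sample the archive to
            index[p] = embed(p)                      # test for relevance
    return index
```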
Blocks & Files: Will an organization’s chatbots/AI agents need, collectively and in principle, access to its entire data estate? How do they get it?
Molly Presley: Chatbots and AI agents typically don’t need access to an organization’s entire data estate – only a curated subset relevant to their function. Security and compliance concerns make unrestricted access impractical. Instead, leveraging global data access with intelligent orchestration ensures AI tools can access the right data without uncontrolled sprawl.
Even if an organization vectorized everything, the resulting data store would be near-real-time, not truly real-time. Performance is constrained by update latency – vector representations are only as current as their latest refresh. API integration and fast indexing can help, but real-time responsiveness depends on continuous updates. Hammerspace’s relevant angle remains metadata-driven, automated orchestration rather than full-scale vectorization.
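A small sketch makes both points concrete: curation scopes each agent to a subset, and every answer is only as fresh as its last refresh. The index structure, scope tags, and timestamps below are invented for illustration, not the schema of any particular vector database:

```python
import time
import numpy as np

# Each entry carries a vector plus metadata: a scope tag for curation and a
# refresh timestamp. All names and fields here are illustrative.
INDEX = [
    {"doc": "pricing.pdf", "vec": np.array([0.9, 0.1]),
     "scope": "sales", "refreshed": time.time() - 60},
    {"doc": "payroll.csv", "vec": np.array([0.2, 0.8]),
     "scope": "hr", "refreshed": time.time() - 86400},
]

def agent_query(agent_scope: str, qvec: np.ndarray, k: int = 5):
    # Curation: the agent sees only its scoped subset, never the whole estate.
    allowed = [e for e in INDEX if e["scope"] == agent_scope]
    # Rank by cosine similarity.
    ranked = sorted(
        allowed,
        key=lambda e: float(qvec @ e["vec"]) /
                      (np.linalg.norm(qvec) * np.linalg.norm(e["vec"])),
        reverse=True,
    )
    # Near-real-time: each hit is only as current as its last re-embedding.
    return [(e["doc"], time.time() - e["refreshed"]) for e in ranked[:k]]
```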
Blocks & Files: Will the prime interface to data become LLMs for users in an organization that adopts LLM agents?
Molly Presley: Good question. LLMs are rapidly becoming an important interface for data in organizations adopting AI agents. Their ability to process natural language and provide contextual insights makes them a powerful tool for accessibility and decision-making. However, they won’t replace traditional BI and analytics tools – rather, they will integrate with them. Enterprises require structured reporting, governance, and auditability, which remain best served by established standards. The near-term (next few years at least) future lies in a hybrid approach: LLMs will enhance data interaction and discovery, while enterprise-grade analytics tools ensure precision, compliance, and operational control.
Blocks & Files: In a vector data space, do the concepts of file storage and object storage lose their meaning?
Molly Presley: File and object storage don’t disappear; they evolve. In a vector data space, data is accessed by semantic relationships, not file paths or object keys. However, storage type still matters in terms of performance, cost, and scale.
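As a rough illustration of the shift – with the embeddings and names invented for the example – the difference between identifier-based access and access by meaning looks like this:

```python
import numpy as np

# Path/key access: the caller must already know the exact identifier, e.g.
#   open("/projects/q3/report.txt")             # file path
#   s3.get_object(Bucket="docs", Key="q3/rpt")  # object key
# Semantic access: the caller describes meaning; the nearest vectors answer.

docs = {  # roughly unit-length embeddings, values purely illustrative
    "q3-report": np.array([0.97, 0.24]),
    "invoice-1042": np.array([0.20, 0.98]),
}

def semantic_lookup(query_vec: np.ndarray) -> str:
    # With unit vectors, cosine similarity reduces to a dot product.
    return max(docs, key=lambda name: float(query_vec @ docs[name]))
```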
Blocks & Files: Will we see a VQL, Vector Query Language, emerge like SQL?
Molly Presley: Yes, a Vector Query Language will emerge, though it may not take the exact form of SQL. Standardization is critical. Just as SQL became the universal language for structured data, vector search will need a standardized query language to make it more accessible and interoperable across tools and platforms.
APIs and embeddings aren’t enough. Right now, vector databases rely on APIs and embedding models for similarity search, but businesses will demand more intuitive, high-level query capabilities as adoption grows. Hybrid queries will be key. Future AI-driven analytics will need queries that blend structured (SQL) and unstructured (VQL) data, allowing users to seamlessly pull insights from both.
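Something close to that hybrid already exists in pockets. Here is a hedged sketch using PostgreSQL with the pgvector extension as a stand-in for a future VQL; the table, columns, and connection string are hypothetical, though the <-> distance operator is genuine pgvector syntax:

```python
import psycopg2  # assumes a reachable PostgreSQL instance with pgvector installed

conn = psycopg2.connect("dbname=analytics")  # hypothetical DSN
with conn.cursor() as cur:
    cur.execute(
        """
        SELECT doc_id, title
        FROM documents                                 -- hypothetical table
        WHERE region = %s
          AND created_at > now() - interval '90 days'  -- structured filter (SQL)
        ORDER BY embedding <-> %s::vector              -- semantic ranking (vector)
        LIMIT 10
        """,
        ("EMEA", "[0.12,0.37,0.88]"),  # query embedding, truncated for illustration
    )
    rows = cur.fetchall()
```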
Blocks & Files: Can a storage supplier provide a data space abstraction covering block, file, and object data?
Molly Presley: Some storage vendors can abstract storage types across file and object, and some offer block as well – but that’s not a true global data space. They create global namespaces within their own ecosystem but fail to unify data across vendors, clouds, and diverse formats (structured, unstructured, vectorized).
Standards are a critical part of this conversation as well. Organizations are typically unwilling to add software to their GPU servers or change their approved IT build environments. Building the data layer client interface into Linux as the most adopted OS is critical, and using interfaces like pNFS, NFS, and S3, which applications natively write to, is often mandated.
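To illustrate what “natively write to” means in practice, here is a minimal sketch assuming the same dataset is exposed both through an NFS/pNFS mount at /mnt/data and through an S3 endpoint; the paths, bucket, and endpoint URL are invented:

```python
import boto3

# File path (NFS/pNFS): served by the standard Linux client, no extra software.
with open("/mnt/data/training/sample.json", "rb") as f:
    via_file = f.read()

# S3: the same bytes through the object interface applications already target.
s3 = boto3.client("s3", endpoint_url="https://s3.example.internal")
via_s3 = s3.get_object(Bucket="training", Key="sample.json")["Body"].read()

assert via_file == via_s3  # one dataset, two standards-based access paths
```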
A global data space is about universal access, not just storage abstraction. It must integrate rich metadata, enable advanced analytics, and orchestrate data dynamically – without migrations, duplication, or vendor lock-in.
Bottom line: storage type is irrelevant. Without true global orchestration, data stays siloed, infrastructure-bound, and inefficient.
Blocks & Files: How do we organize an organization’s data estate and its storage in a world adopting LLM-based agents?
Molly Presley: We need a tiered approach to data, organized not in traditional HSM (Hierarchical Storage Management) terms of time, but with rich contextual relevance to automate orchestration of curated subsets of data non-disruptively from anywhere to anywhere when needed.
Focus on the data, not the storage. Especially in LLM-based ecosystems, the storage type is opportunistic and workflow-driven. All storage types have their uses, from flash to tape to cloud. When the type of storage is abstracted with intelligent, non-disruptive orchestration, then the storage decisions can be made tactically based on cost, performance, location, preferred hardware vendor, etc.
Unified access comes via standard protocols and APIs that bridge all storage types and locations. This provides direct data access, regardless of where the data is today or moves to tomorrow. In this way, data is curated in place so that applications can access the relevant subset of the data estate without requiring disruptive and costly migrations.
There is rich metadata in files and objects that typically goes unused in traditional storage environments. Custom metadata, semantic tagging, and other rich metadata can be used to drive more granularity in the curation of datasets. Combining this metadata in the global file system to trigger automated data orchestration minimizes unnecessary data movement, reduces underutilized storage costs, and improves accuracy and contextual insights for LLM-based use cases.
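Purely as an illustration of metadata-triggered orchestration, the sketch below keys a toy placement policy off Linux extended attributes; the attribute names and the policy are hypothetical, not Hammerspace’s actual mechanism:

```python
import os

def tag(path: str, key: str, value: str) -> None:
    # Store custom metadata in the Linux "user." extended-attribute namespace.
    os.setxattr(path, f"user.{key}", value.encode())

def placement_for(path: str) -> str:
    """Toy policy: route tagged datasets to the tier their workload needs."""
    try:
        workload = os.getxattr(path, "user.workload").decode()
    except OSError:
        return "capacity-tier"  # untagged data stays on cheap storage
    return "gpu-flash-tier" if workload in ("training", "rag") else "capacity-tier"

# tag("/mnt/data/corpus.parquet", "workload", "rag")
# placement_for("/mnt/data/corpus.parquet")  # -> "gpu-flash-tier"
```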
Data mobility and the ability to scale linearly are essential. LLM workflows inevitably result in data growth but, more importantly, may require cloud-based compute resources when local GPUs are unavailable. Modern organizations must put their data in motion without the complexity and limitations of traditional siloed and vendor-locked storage infrastructures.