All data storage companies are having to respond to the wave of generative AI and intelligent agents interrogating an organization’s data estate, be it block, file, or object, and structured, semi-structured, or unstructured. VAST Data has built an AI-focused software infrastructure stack on its storage base. We asked the company about its approach to the issues we can see.
Blocks & Files: How large is the vector data store for a vectorized multi-format data estate? What are the considerations in making such a judgment?
VAST Data: The overall “cost” or overhead for vectorization is determined by the specific details outlined during the AI design specification phase. Typically, this overhead ranges between 5 and 15 percent for a standard embedding. The exact percentage depends on several factors, including the use cases being addressed, the types of data involved, and the specificity of unique data elements required to effectively meet the needs of those use cases. These considerations ensure the vectorization process is both efficient and tailored to the enterprise’s requirements.
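As a rough illustration of where a 5-15 percent figure can come from, here is a back-of-envelope sketch; the chunk size, embedding dimensions, and precision are our illustrative assumptions, not VAST parameters:

```python
# Back-of-envelope estimate of vector-store overhead for a text corpus.
# Chunk size, dimensions, and dtype are illustrative assumptions, not VAST defaults.

def vector_overhead_pct(corpus_bytes, chunk_bytes=16384, dims=768, dtype_bytes=2):
    """Percent overhead of storing one embedding per chunk (float16 here)."""
    n_chunks = corpus_bytes / chunk_bytes
    vector_bytes = n_chunks * dims * dtype_bytes
    return 100 * vector_bytes / corpus_bytes

# 1 TiB corpus, 16 KiB chunks, 768-dim float16 embeddings:
pct = vector_overhead_pct(1 << 40)
print(f"{pct:.1f}%")  # 9.4% - inside the quoted 5-15 percent band
```

Smaller chunks or higher-precision embeddings push the ratio up quickly, which is why the chunking strategy chosen during the design phase matters.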
Blocks & Files: Will an organization’s entire data estate need to be vectorized? If not all, which parts? Mission-critical, near-real-time, archival?
VAST Data: While any given AI project may only require a subset of data, over time, as AI penetrates all functions of a business (marketing, HR, support, sales, finance, etc.), the answer is simple: every piece of data should be cataloged and vectorized to unlock its potential for modern applications. Without vectors or proper labeling, data is a liability, not an asset – like a book in a box without a label. Mission-critical and near-real-time data will naturally take priority for vectorization, but even archival data can yield value when cataloged.
Challenges arise from diverse data sources – files, databases, or SaaS platforms like social media. The VAST Data Platform uniquely supports all data types from TB to EB scale and bridges these gaps with file triggers and real-time monitoring, ensuring data changes trigger immediate vectorization. For external sources, event-based or batch processing delivers adaptability for varying latencies.
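The trigger-then-vectorize pattern described above can be sketched with an in-memory queue standing in for real file triggers; all names and the toy embedder are illustrative, not VAST's API:

```python
# Minimal sketch of change-triggered vectorization (illustrative, not VAST's API).
# A file trigger or change monitor would feed the event queue; a worker drains it.
from queue import Queue

def fake_embed(text):
    """Stand-in embedder: real pipelines would call an embedding model."""
    return [float(sum(map(ord, text)) % 97)]  # toy 1-dimensional vector

vector_store = {}   # path -> embedding
events = Queue()    # change notifications (file triggers would feed this)

def on_change(path, text):
    """Called whenever data changes; external sources could batch into this."""
    events.put((path, text))

def drain_events():
    """Vectorize every pending change, as a trigger handler would."""
    while not events.empty():
        path, text = events.get()
        vector_store[path] = fake_embed(text)

on_change("/data/report.txt", "q3 revenue numbers")
drain_events()
print(sorted(vector_store))  # ['/data/report.txt']
```

External sources with higher latency would simply call `on_change` from a batch job rather than a real-time trigger; the downstream vectorization path is the same.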
Blocks & Files: Will an organization’s AI agents need, collectively and in principle, access to its entire data estate? How do they get it?
VAST Data: A key component of the design of the VAST Insight Engine is the preservation of each individual’s data access rights. An AI interface to data, like a chatbot, must respect the users’ assigned data rights and then continue adhering to the data governance rules of the organization. So while agents in aggregate may access all of the data, any given agent will only be able to access data it has specifically been granted permission to use.
Many AI solutions built from non-integrated third-party components have a hard time achieving this, because once the existing classifications on data are removed, rebuilding them from context can have unpredictable consequences.
The VAST Data Insight Engine attaches the AI data to the file as an attribute, meaning that the respective ACLs and control metadata are never lost. When consuming from VAST Data, the user interacts with the chat box, which sends the question to the Insight Engine; the engine then returns only the files, or chunks of files, the user has rights to. It’s seamless to the user and the only reliable way to ensure adherence to governance rules. Of course, all of this user access is logged to the VAST DB for compliance as needed.
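The flow just described, ACL filtering at query time plus access logging, can be sketched as follows; the data structures and brute-force nearest-neighbor search are our illustration, not the Insight Engine's implementation:

```python
# Sketch of permission-aware retrieval: the ACL travels with each chunk, the
# filter runs at query time, and every query is logged. Names are illustrative.
from math import dist

chunks = [  # (embedding, text, allowed_groups)
    ([0.1, 0.9], "salary bands",    {"hr"}),
    ([0.2, 0.8], "q3 forecast",     {"finance", "exec"}),
    ([0.9, 0.1], "support runbook", {"everyone"}),
]
audit_log = []  # every query is recorded for compliance

def query(embedding, user_groups, k=2):
    audit_log.append((tuple(embedding), frozenset(user_groups)))
    visible = [c for c in chunks if c[2] & user_groups]   # ACL filter first
    visible.sort(key=lambda c: dist(c[0], embedding))     # then nearest-neighbor
    return [c[1] for c in visible[:k]]

print(query([0.15, 0.85], {"finance", "everyone"}))
# ['q3 forecast', 'support runbook'] - the HR-only chunk is never returned
```

Filtering before ranking means a user can never see, or even infer the existence of, chunks their groups cannot read.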
Blocks & Files: Does all the information in a vector data store become effectively real-time? Is a vector data store a single real-time response resource?
VAST Data: Think of the VAST Platform Insight Engine as the beating heart of your data ecosystem, delivering near-real-time responses within its environment. While internal data written to the VAST DataStore pulses instantly through AI pipelines, external sources bring a natural delay based on their rhythm, creating a dynamic yet highly responsive system for enterprise use. Regardless of the data’s source, once the chunks and vectors are saved to the VAST DB, they are instantly available for inference operations.
Blocks & Files: Will the prime interface to data become LLMs for users in an organization that adopts LLM agents?
VAST Data: AI-powered interfaces are the inevitable future of data access. Users seek simplicity – asking questions and receiving precise answers, without navigating complex systems. As LLM agents mature, they’ll transform how we interact with data, replacing traditional CRUD applications with intuitive, conversational experiences that make data truly accessible to everyone.
Blocks & Files: Must a vector data store, all of it, be held on flash drives for response time purposes? Is there any role for disk and tape storage in the vectorized data environment?
VAST Data: In the AI era, the speed of insight defines competitiveness. Traditional vector databases are loaded into RAM and often sharded across multiple servers to scale, adding complexity and losing performance as they grow. Even so, this is preferred over swapping to HDDs, which would significantly hurt query latency. VAST enables exabyte scale while delivering linearly scaling, low-latency performance by distributing the chunks and indexes across NVMe flash storage for instantaneous response times, aligning with business-critical, real-time needs.
Blocks & Files: In a vector data space, do the concepts of file storage and object storage lose their meaning?
VAST Data: The shift to vectorized data reframes how we think about storage entirely. File and object storage, once foundational concepts, lose their meaning in the eyes of users. What matters now is data accessibility and performance, with storage evolving to support these priorities invisibly in the background.
Blocks & Files: Can you vectorize structured data? If not, why not?
VAST Data: Yes, structured data can and should be vectorized when possible and often delivers better results than text-to-SQL queries. Vectorization better captures complex, non-linear relationships, while text-to-SQL is limited to relationships defined in the schema. While its organization in rows and columns serves traditional applications, vectorization prepares it for the future of AI and machine learning. By converting structured data into numerical vectors, organizations can unlock advanced analytics, cross-domain integration, and more powerful AI-driven insights.
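One common way to vectorize a structured row is to serialize it to text before embedding, which can be sketched like this; the serialization template and the stand-in embedder are our assumptions, and a production system would use a trained embedding model:

```python
# Sketch of vectorizing a structured row: serialize columns to text, then embed.
# The template and toy embedder are illustrative stand-ins.

def row_to_text(row):
    """Flatten a record into a sentence an embedding model can consume."""
    return "; ".join(f"{col} is {val}" for col, val in row.items())

def toy_embed(text, dims=8):
    """Stand-in for a real model: deterministic, normalized character histogram."""
    vec = [0.0] * dims
    for ch in text:
        vec[ord(ch) % dims] += 1.0
    norm = sum(v * v for v in vec) ** 0.5
    return [v / norm for v in vec]

row = {"customer": "Acme", "region": "EMEA", "churn_risk": "high"}
text = row_to_text(row)
print(text)  # "customer is Acme; region is EMEA; churn_risk is high"
```

Unlike a SQL query, the resulting vector lets "customers like this one" be answered by nearest-neighbor search, with no join path defined in the schema.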
Blocks & Files: Can you vectorize knowledge graphs? If not, why not?
VAST Data: Yes, vectorizing knowledge graphs is not only possible but essential for AI applications. By embedding nodes, edges, and their relationships into vector space, organizations can unlock advanced analytics, enabling their knowledge graphs to power recommendation systems, semantic search, and reasoning tasks in a scalable, AI-ready format.
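A minimal sketch of the idea, using simple adjacency vectors in place of trained graph embeddings such as node2vec or TransE; the tiny graph and the embedding scheme are illustrative only:

```python
# Sketch of embedding a tiny knowledge graph: each node becomes a vector of its
# connections, so structurally similar nodes land close together in vector space.

edges = [("alice", "works_at", "acme"),
         ("alice", "lives_in", "berlin"),
         ("bob",   "works_at", "acme"),
         ("acme",  "based_in", "berlin")]

nodes = sorted({n for s, _, o in edges for n in (s, o)})
index = {n: i for i, n in enumerate(nodes)}

def embed(node):
    """Adjacency-based vector: 1.0 for every node this one touches."""
    vec = [0.0] * len(nodes)
    for s, _, o in edges:
        if s == node: vec[index[o]] = 1.0
        if o == node: vec[index[s]] = 1.0
    return vec

def cosine(a, b):
    num = sum(x * y for x, y in zip(a, b))
    den = (sum(x * x for x in a) * sum(y * y for y in b)) ** 0.5
    return num / den if den else 0.0

# alice and bob share an employer edge, so their embeddings align:
print(round(cosine(embed("alice"), embed("bob")), 2))  # 0.71
```

Trained graph embeddings generalize this: they learn dense vectors in which edge relationships become geometric relationships, which is what makes recommendation and reasoning queries answerable by nearest-neighbor search.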
Blocks & Files: Will we see a VQL, Vector Query Language, emerge like SQL?
VAST Data: The creation of a Vector Query Language would signify a pivotal moment for vector databases, akin to SQL’s role in the relational era. However, such a standard would require both a unifying purpose and cooperation among vendors – a challenge in today’s competitive, rapidly evolving market. If history teaches us anything, it’s that demand for simplicity and interoperability often drives innovation.
Blocks & Files: Will high-capacity flash drives, ones over 60 TB, need to be multi-port, not just single or dual-port, to get their I/O density down to acceptable levels in a real-time data access environment?
VAST Data: The short answer is no – high-capacity flash drives won’t need more than dual ports. Modern SSDs have already outgrown the limitations of HDDs by scaling PCIe lane bandwidth alongside capacity, maintaining a consistent bandwidth-per-GB ratio.
While concepts like Ethernet-enabled drives are interesting for expanding connectivity, most real-time access environments don’t require more than two logical connections. This design simplicity ensures performance, reliability, and scalability without unnecessary complexity.
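A rough sanity check of the bandwidth-per-capacity argument; the drive capacities and nominal PCIe x4 throughput figures below are illustrative assumptions, not vendor specifications:

```python
# Rough I/O-density check: bandwidth per TB as SSD capacity and link speed scale
# together. Figures are approximate PCIe x4 interface ceilings, for illustration.

drives = {  # name: (capacity_tb, sequential GB/s)
    "15 TB @ PCIe4 x4": (15, 7.0),
    "30 TB @ PCIe5 x4": (30, 14.0),
    "60 TB @ PCIe6 x4": (60, 28.0),
}
for name, (tb, gbps) in drives.items():
    print(f"{name}: {gbps / tb * 1000:.0f} MB/s per TB")
```

Under these assumptions each generation roughly doubles both capacity and link bandwidth, so MB/s per TB stays flat, which is the opposite of the HDD trajectory, where capacity grows while per-drive bandwidth barely moves.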
Blocks & Files: Can a storage supplier provide a data space abstraction covering block, file, and object data?
VAST Data: Yes, a storage supplier can provide a unified data space abstraction across block, file, and object data. The VAST Data Platform does precisely that, creating a seamless environment that integrates all data types into a single namespace.
Blocks & Files: How does a data space differ from a global namespace?
VAST Data: We have in the past used VAST DataSpace to mean the VAST global namespace. However, we are redefining DataSpace to cover all the features that connect multiple VAST clusters, along with the VAST on Cloud clusters and instances that are most useful when participating in the DataSpace with other VAST clusters:
- Snap-to-Object – A VAST feature that replicates data to S3-compatible object storage.
- Global Clone – A clone made on one VAST cluster based on a snapshot of a folder on a different VAST cluster. Global Clones can be full, transferring the full contents of the folder, or lazy, where the remote cluster fetches data from the snapshot on the original cluster only when that data is read. Writes to global clones are always local.
- Asynchronous replication – Replication from a source VAST cluster to one or more (1:1 and 1:many) other VAST clusters based on snapshots. Frequently called native replication: Snap-to-Object was developed first, so this was the native VAST-to-VAST method.
- Synchronous Replication – Active-Active VAST clusters with synchronous replication.
- Global Access – The feature name/GUI menu that manages the VAST global namespace making Global Folders available on multiple VAST clusters.
- Global Folder – A folder that’s made available on multiple VAST clusters via Global Access, the VAST global namespace.
- Origin – A VAST cluster holding a full copy of a Global Folder. VAST 5.2 supports one origin per Global Folder but future releases will support multiple origins with replication.
- Satellite – A VAST cluster that presents a Global Folder for local access caching the folder’s contents on its local storage.
- Write Lease – A write lease grants a VAST cluster the right to write to an element, or byte-range within an element. In VAST 5.2, the Origin cluster holds the Write Lease to the entire contents of a Global Folder and so all writes are proxied to the Origin which can apply them to the data.
- Read Lease – A read lease is a guarantee of cache currency. When a satellite cluster fetches data from a Global Folder, it takes out a read lease on that data and registers that lease with the write lease holder. If the data should change, the read lease is invalidated and satellites will have to fetch the new version.
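The write-lease/read-lease interaction above can be sketched as a toy cache-coherence protocol; this is our illustration of the described semantics, not VAST's implementation:

```python
# Toy sketch of the lease scheme described above: the Origin holds the write
# lease, Satellites cache under read leases, and a write invalidates every
# registered read lease so stale copies are refetched.

class Origin:
    def __init__(self):
        self.data = {}          # element -> value
        self.read_leases = {}   # element -> satellites holding a read lease

    def read(self, element, satellite):
        self.read_leases.setdefault(element, set()).add(satellite)
        return self.data.get(element)

    def write(self, element, value):
        self.data[element] = value               # Origin holds the write lease
        for sat in self.read_leases.pop(element, set()):
            sat.invalidate(element)              # cached copies are now stale

class Satellite:
    def __init__(self, origin):
        self.origin, self.cache = origin, {}

    def get(self, element):
        if element not in self.cache:            # miss: fetch, take a read lease
            self.cache[element] = self.origin.read(element, self)
        return self.cache[element]

    def invalidate(self, element):
        self.cache.pop(element, None)

origin = Origin()
origin.write("/global/report", "v1")
sat = Satellite(origin)
print(sat.get("/global/report"))   # v1 (now cached under a read lease)
origin.write("/global/report", "v2")
print(sat.get("/global/report"))   # v2 (lease invalidated, refetched)
```

Because the read lease is a guarantee of cache currency, a satellite never serves stale data: any write at the origin reaches every lease holder before the new value is considered committed to the cached view.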
Blocks & Files: In other words, how do we organize an organization’s data estate and its storage in a world adopting LLM-based chatbots/agents?
VAST Data: In the era of LLM-based chatbots and agents, the VAST Insight Engine on the VAST Data Platform offers a transformative way to organize an organization’s data estate. It seamlessly vectorizes data for AI-driven workflows while preserving user attributes, file permissions, and rights. This ensures secure, role-based access to insights, enabling real-time responsiveness without compromising compliance or data governance.