Ingestion and indexing

This page highlights key concepts related to ingesting and indexing new data in Aerospike Vector Search (AVS). AVS updates vector records in real time, but index records are updated asynchronously. This means that records may be available for retrieval right away, but the same records may not appear immediately in search results.

Record data updates

As you insert and update data in AVS, each write passes through several processing steps before it is considered complete. First, the record data, including the vector, is written to the Aerospike Database, where you can see it immediately. The record is then indexed asynchronously. To be indexed, each record must contain at least one vector field that is mapped to an index.
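
The indexing requirement above can be illustrated with a small check. This is a minimal sketch that models records as plain Python dictionaries; the function name `is_indexable`, the field name `embedding`, and the record layout are hypothetical and are not part of the AVS client API.

```python
from typing import Any, Mapping, Sequence


def is_indexable(record: Mapping[str, Any], indexed_fields: Sequence[str]) -> bool:
    """Return True if the record holds a non-empty numeric vector in at
    least one field that is mapped to an index (hypothetical helper)."""
    for field in indexed_fields:
        value = record.get(field)
        if (
            isinstance(value, (list, tuple))
            and len(value) > 0
            and all(
                isinstance(x, (int, float)) and not isinstance(x, bool)
                for x in value
            )
        ):
            return True
    return False


# Toy records: only doc-1 carries a vector in the mapped field.
records = {
    "doc-1": {"title": "intro", "embedding": [0.1, 0.2, 0.3]},
    "doc-2": {"title": "no vector yet"},
}
indexable = {key for key, rec in records.items() if is_indexable(rec, ["embedding"])}
```

In this sketch both records would be stored, but only `doc-1` would be a candidate for indexing.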

HNSW index construction

AVS supports Hierarchical Navigable Small World (HNSW) indexing, which constructs a multi-layer graph where nodes represent data points and edges connect each node to its nearest neighbors. Index construction can be handled in two ways: a distributed approach that spreads index construction across your cluster, and a standalone process that builds the index in memory on a specific node.
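
To make the multi-layer structure concrete, the following sketch assigns each point a random top layer using the level draw from the HNSW algorithm, then links every node to its nearest neighbors within each layer. It is a brute-force illustration of the graph shape only, not the incremental insertion procedure that HNSW uses and not how AVS builds its index; the names `assign_level` and `build_layers` and the parameter `m` are assumptions made for this example.

```python
import math
import random


def assign_level(rng: random.Random, m: int = 4) -> int:
    """Draw a top layer as floor(-ln(U) / ln(m)); requires m > 1.
    Most nodes land on layer 0, and few reach the upper layers."""
    u = 1.0 - rng.random()  # in (0, 1], which avoids log(0)
    return int(-math.log(u) / math.log(m))


def build_layers(points, m: int = 4, seed: int = 0):
    """Return one adjacency dict per layer, from layer 0 upward.
    Each node is linked to at most m nearest neighbors on that layer."""
    rng = random.Random(seed)
    levels = {i: assign_level(rng, m) for i in range(len(points))}
    top = max(levels.values(), default=0)
    layers = []
    for layer in range(top + 1):
        # A node appears on every layer up to and including its own level.
        members = [i for i, lvl in levels.items() if lvl >= layer]
        graph = {}
        for i in members:
            others = sorted(
                (j for j in members if j != i),
                key=lambda j: math.dist(points[i], points[j]),
            )
            graph[i] = others[:m]
        layers.append(graph)
    return layers
```

Layer 0 contains every point, and each higher layer contains a subset of the layer below it, which is what lets a search start with coarse hops at the top and refine as it descends.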

Standalone index construction

In scenarios where you are building a new index, or rebuilding an index for a new embedding model, you may want to build the index using a standalone indexing node. Standalone indexing is done completely in memory on an isolated node of your cluster. This allows the index to be built much faster, but with the tradeoff of not having results available during indexing.

The following actions happen automatically when you create a standalone index:

  1. AVS scans all of your vector records in the specified namespace and set.
  2. The vector from each record is used to generate the HNSW index in memory.
  3. Upon completion, the index is written to the Aerospike Database and made available for searching and streaming updates.
tip

When batch processing index updates, you can monitor standalone indexing with the vector_records metric.
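
The three standalone steps listed above can be sketched as a single function. This is a toy model under stated assumptions: the database is a Python dictionary keyed by `(namespace, set, key)`, the index is a brute-force nearest-neighbor graph rather than a real HNSW graph, and the names `build_standalone` and `index-demo` are hypothetical.

```python
import math


def build_standalone(store, namespace, set_name, vector_field, m=2):
    """Scan, build in memory, then persist (toy model, not the AVS API)."""
    # Step 1: scan every record in the namespace and set that has a vector.
    vectors = {
        key: rec[vector_field]
        for (ns, s, key), rec in store.items()
        if ns == namespace and s == set_name and vector_field in rec
    }

    # Step 2: build the whole graph in memory before anything is written.
    index = {}
    for key, vec in vectors.items():
        nearest = sorted(
            (k for k in vectors if k != key),
            key=lambda k: math.dist(vec, vectors[k]),
        )
        index[key] = nearest[:m]

    # Step 3: write the finished index back, after which it can be searched.
    store[(namespace, "index-demo", "graph")] = index
    return index
```

Because nothing is written until step 3, no search results are available while the build is running, which matches the tradeoff described for standalone indexing.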

Distributed index construction

AVS manages index construction across nodes while concurrently making that index available for search. This means AVS can be scaled out to handle streaming updates to an index. The index updates are processed asynchronously, and over time, the entire index reflects the changes that are streamed in. This makes AVS ideal for scenarios where your vector embeddings are changing regularly and you want to stream in those updates as they occur, while simultaneously providing search results for the index.
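
The gap between writing a record and seeing it in search results can be shown with a small simulation. The class below is a minimal sketch, not the AVS client: `ToyVectorStore`, `upsert`, `index_step`, and `search` are hypothetical names, and the background indexer is modeled as an explicit step so that the behavior is deterministic.

```python
import math
from collections import deque


class ToyVectorStore:
    """Writes land immediately; indexing drains a queue later."""

    def __init__(self):
        self.records = {}       # record data, visible as soon as it is written
        self.indexed = set()    # keys that search can currently see
        self.pending = deque()  # keys waiting to be indexed

    def upsert(self, key, vector):
        self.records[key] = vector
        self.pending.append(key)

    def get(self, key):
        return self.records.get(key)

    def index_step(self, batch=1):
        # Stand-in for the asynchronous indexer processing queued updates.
        for _ in range(min(batch, len(self.pending))):
            self.indexed.add(self.pending.popleft())

    def search(self, query, k=1):
        ranked = sorted(
            self.indexed,
            key=lambda key: math.dist(query, self.records[key]),
        )
        return ranked[:k]


store = ToyVectorStore()
store.upsert("a", [0.0, 0.0])
before = store.search([0.1, 0.0])  # record exists but is not yet indexed
store.index_step()
after = store.search([0.1, 0.0])   # record now appears in results
```

Here `store.get("a")` succeeds right after the write, while `search` only returns the record once the queued update has been processed.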

tip

You can monitor streaming index construction with the indexing_queue_size metric, which tracks your ingest queue, and the requests_metric, which tracks your total indexed records.
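
One way to act on a queue-size metric is to poll it until the queue drains. The helper below is a generic sketch: `wait_until_drained` is a hypothetical name, and `read_queue_size` is a callable you would supply to fetch the current indexing_queue_size value from your own monitoring setup, since how the metric is read is not specified here.

```python
import time
from typing import Callable


def wait_until_drained(
    read_queue_size: Callable[[], float],
    poll_seconds: float = 1.0,
    timeout_seconds: float = 60.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Poll until the queue size reaches zero. Return True if it drained,
    or False if the timeout elapsed first."""
    deadline = clock() + timeout_seconds
    while True:
        if read_queue_size() <= 0:
            return True
        if clock() >= deadline:
            return False
        sleep(poll_seconds)


# Simulated readings stand in for real metric values.
readings = iter([3, 1, 0])
drained = wait_until_drained(lambda: next(readings), poll_seconds=0.0)
```

A drained queue means the pending updates have been processed, so a search issued afterward should reflect the writes made before the wait began.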

Index healing

Distributed indexer nodes regularly revisit live indexes to ensure all data is properly indexed. During HNSW index construction, it is important to rebalance the neighborhood graphs regularly. The index healing process runs periodically in the background and performs these key functions:

  • Account for AVS node outages: Each node holds an in-memory queue of records waiting to be indexed, and that queue can be lost if the node goes down. The healer ensures that those records are still indexed, even if an outage occurs.

  • Graph rebalance: To maintain index quality, it is necessary to rebalance or rebuild parts of the graph periodically, especially if the addition of new items or loss of a node significantly alters the data distribution. This process can be resource-intensive. Make sure your deployment is able to withstand the performance impact on the system before rebalancing.

  • Index garbage collection: The healer is responsible for deleting index records in the Aerospike Database to free up storage.

tip

You can configure defaults for the healer at the index level, and you can monitor healer performance using the healer_cycle metric.
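
The healer functions described above can be sketched as one pass over a toy index. This is an illustration under stated assumptions, not the AVS healer: `heal` is a hypothetical function, records and the index are plain dictionaries, stale entries are assumed to be index entries whose record no longer exists, and the rebalance step simply recomputes every neighbor list by brute force rather than rebuilding only parts of the graph.

```python
import math


def heal(records, index, m=2):
    """Run one toy healing cycle and report what it did."""
    # Outage recovery: find records that were stored but never indexed,
    # for example because an in-memory queue was lost.
    missing = [key for key in records if key not in index]

    # Garbage collection: drop index entries whose record is gone.
    stale = [key for key in index if key not in records]
    for key in stale:
        del index[key]

    # Rebalance: recompute neighbor lists so they reflect the live data.
    for key, vec in records.items():
        nearest = sorted(
            (k for k in records if k != key),
            key=lambda k: math.dist(vec, records[k]),
        )
        index[key] = nearest[:m]

    return {"indexed": len(missing), "collected": len(stale)}
```

Running the pass repeatedly is harmless in this model: once every record is indexed and no stale entries remain, a further cycle reports zero work in both counters.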