Scaling LanceDB: Running 700 million vectors in production
❗ NOTE: The context manager approach (opening and closing the connection/table on each use, e.g. in ConnectionManager) works for lancedb==0.21.2. However, as of lancedb==0.25.0 (update on 09.09.2025), the issues with connect_async have been fixed. Using the old context manager pattern will now cause memory leaks. To avoid this, use a singleton pattern (see the example at the end) for managing connections and tables.
Introduction
In this article, I will walk through the process of migrating an existing vector database from Milvus DB to LanceDB. The primary motivation for this move was the excessive running costs associated with Milvus, which led to the exploration of more cost-effective alternatives. LanceDB emerged as a promising solution due to its potential for lower operational costs while maintaining the necessary performance for handling large-scale vector data.
The migration process involved transferring a massive dataset of 700 million vectors and successfully running LanceDB in a production environment. However, during this process, I encountered several challenges that complicated the migration. In the following sections, I will delve into these issues and share the lessons learned while overcoming them.
Problems
Indexing
The dataset migration was successful, with 700 million vectors transferred from Milvus DB to LanceDB without major issues. However, the indexing process for such a large amount of data did not go as smoothly.
My focus was on two types of indexes: a scalar BTree index and a vector IvfPq index.
The BTree index was created without problems; it took several hours.
>>> from lancedb import connect_async
>>> from lancedb.index import BTree
>>> conn = await connect_async(uri="/storage/lancedb")
>>> table = await conn.open_table("documents")
>>> await table.create_index(
column="id",
config=BTree(),
replace=False
)
>>> await table.list_indices()
[Index(BTree, columns=["id"], name="id_idx")]
>>> await table.index_stats("id_idx")
IndexStatistics(
num_indexed_rows=700_000_000,
num_unindexed_rows=0,
index_type='BTREE',
distance_type=None,
num_indices=1,
)
Things got more interesting with the IvfPq index because, during the creation process, LanceDB temporarily stores intermediate data in the /tmp directory. However, this volume didn't have enough disk space for the gigabytes of data required. Since we had a separate volume for LanceDB's data storage, I worked around this limitation by redirecting the default /tmp location using an environment variable.
$ export TMPDIR=/storage/lancedb
>>> from lancedb import connect_async
>>> from lancedb.index import IvfPq
>>> conn = await connect_async(uri="/storage/lancedb")
>>> table = await conn.open_table("documents")
>>> await table.create_index(
column="vector",
config=IvfPq(
distance_type="l2",
num_partitions=26_457,
num_sub_vectors=64
),
replace=False
)
According to the docs, the values for num_partitions and num_sub_vectors
are calculated as follows:
- $ \text{number of partitions} = \sqrt{\text{number of rows}} $
- $ \text{number of subvectors} = \frac{\text{vector dimension}}{16} $
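For reference, here is a quick sanity check (not part of the original migration code) of the partition counts used in the snippets above and below:
# number_of_partitions = sqrt(number_of_rows)
import math
print(round(math.sqrt(700_000_000)))  # ~26_458, matching the 26_457 used above
print(round(math.sqrt(50_000_000)))   # ~7_071, used below for a 50-million-row chunk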
The creation of the vector index ran for about a day before ultimately failing due to insufficient RAM, despite the machine having 128GB of memory. This prompted me to investigate the issue, but after some time I decided to abandon the initial approach and explore an alternative strategy. Instead of indexing the entire dataset at once, I batched the data into chunks of 50 million documents. This approach proved successful, as I was able to create the vector index for the first 50 million records without any issues.
>>> from lancedb import connect_async
>>> from lancedb.index import IvfPq
>>> conn = await connect_async(uri="/storage/lancedb")
>>> table = await conn.open_table("documents")
>>> await table.create_index(
column="vector",
config=IvfPq(
distance_type="l2",
num_partitions=7071,
num_sub_vectors=64
),
replace=False
)
>>> await table.list_indices()
[
Index(BTree, columns=["id"], name="id_idx"),
Index(
IvfPq, columns=["vector"], name="vector_idx"
)
]
>>> await table.index_stats("vector_idx")
IndexStatistics(
num_indexed_rows=50_000_000,
num_unindexed_rows=0,
index_type='IVF_PQ',
distance_type='l2',
num_indices=1,
)
After each 50 million chunk, an optimize step was executed. This step is crucial because LanceDB builds an index only over the data present at creation time and doesn't automatically index newly added data, so explicit optimisation is required after each addition.
>>> from datetime import timedelta
>>> from lancedb import connect_async
>>> conn = await connect_async(uri="/storage/lancedb")
>>> table = await conn.open_table("documents")
>>> await table.optimize(
cleanup_older_than=timedelta(seconds=0),
delete_unverified=True
)
With this batching and optimisation approach, I successfully created both the vector and scalar indexes for the entire dataset.
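Put together, the ingestion loop looked roughly like the sketch below. This is an illustration rather than the exact migration code; the chunks argument stands for an iterable of document batches (about 50 million each) exported from the old database.
# Rough sketch of the chunked ingest + optimise loop (illustrative only).
from datetime import timedelta
from lancedb import connect_async

async def ingest_in_chunks(chunks) -> None:
    conn = await connect_async(uri="/storage/lancedb")
    table = await conn.open_table("documents")
    for chunk in chunks:
        # Append the next batch of documents.
        await table.add(chunk)
        # Fold the new rows into the existing scalar and vector
        # indexes and drop stale table versions.
        await table.optimize(
            cleanup_older_than=timedelta(seconds=0),
            delete_unverified=True,
        )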
Memory leaks
Indexing was sorted out; however, when I started running LanceDB in production, memory leaks quickly became apparent. Although the database performed well under normal conditions, it consumed increasing amounts of memory when running under Uvicorn, where it was deployed as an API.
Connections and tables should be closed after each operation, but they kept piling up, leading to excessive memory consumption, so a connection manager was introduced to address this problem.
To get hands-on experience, set up LanceDB with a dataset, experiment, and monitor memory usage. Below are the requirements.txt and a script with a connection manager.
# Ran on Python 3.11.11
# requirements.txt
annotated-types==0.7.0
deprecation==2.1.0
lancedb==0.21.2
loguru==0.7.3
numpy==2.2.4
overrides==7.7.0
packaging==24.2
pandas==2.2.3
pyarrow==19.0.1
pydantic==2.11.1
pydantic-core==2.33.0
python-dateutil==2.9.0.post0
pytz==2025.2
six==1.17.0
tqdm==4.67.1
typing-extensions==4.13.0
typing-inspection==0.4.0
tzdata==2025.2
# This script creates a lance table with
# 10k documents/vectors via connection manager.
import asyncio
import numpy as np
from lancedb.db import AsyncConnection, AsyncTable
from lancedb.pydantic import LanceModel, Vector
from loguru import logger
from pydantic import Field
from lancedb import connect_async
def generate_normalised_vectors(
*,
num_vectors: int,
dimension: int = 512,
max_distance: float = 1.0,
):
# Generate random vectors
vectors = np.random.randn(num_vectors, dimension)
# Normalize vectors to unit length (L2 norm = 1)
norms = np.linalg.norm(vectors, axis=1)
norm_vecs = vectors / norms[:, np.newaxis]
# Scale each vector by a random factor between 0
# and max_distance
scales = (
np.random.rand(num_vectors) * max_distance
)
scaled_vectors = norm_vecs * scales[:, np.newaxis]
return scaled_vectors
class ConnectionManager:
table: AsyncTable
def __init__(
self,
*,
conn: AsyncConnection,
table_name: str
) -> None:
self.conn = conn
self.table_name = table_name
async def __aenter__(self):
self.table = await self.conn.open_table(
self.table_name
)
logger.debug(
f"Opened LanceDB table {self.table_name}"
)
return self
async def __aexit__(self, exc_type, exc, tb):
self.table.close()
self.conn.close()
logger.debug(
"Closed LanceDB table "
f"{self.table_name} & its connection."
)
class Document(LanceModel):
id: int
title: str
content: str
distance: float = Field(
alias="_distance", default=0.0
)
vector: Vector(512)
async def main():
# Establish a connection with a LanceDB.
conn = await connect_async(uri="/storage/lancedb")
# Create a lance table based on pydantic
# `Document` schema.
await conn.create_table(
name="documents",
schema=Document,
mode="create",
)
# Open documents table and close its connection &
# table upon exit from an async context manager.
async with ConnectionManager(
conn=conn,
table_name="documents"
) as manager:
# Generate 10k pydantic `Document` instances
# with normalised vectors.
vectors = generate_normalised_vectors(
num_vectors=10_000
)
documents = [
Document(
id=i,
title=f"Document {i}",
content=f"Content of document {i}",
vector=vectors[i],
)
for i in range(10_000)
]
await manager.table.add(documents)
# Get first 100 documents from documents table
# and validate them with a pydantic model.
raw_documents = (
await manager.table.query()
.limit(100)
.to_list()
)
documents = [
Document.model_validate(doc)
for doc in raw_documents
]
        # Iterate through 100 documents and find the 5
        # most similar vectors based on the base vector
        # from each document.
for doc in documents:
raw_docs = await (
manager.table.vector_search(
query_vector=doc.vector
)
# don't include itself in a search
.where(f"id != {doc.id}")
.limit(5)
.to_list()
)
docs = [
Document.model_validate(doc)
for doc in raw_docs
]
# Show only vectors with a close distance.
for doc in docs:
if doc.distance > 0.8:
logger.info(
f"Doc id: {doc.id}, "
f"distance {doc.distance}."
)
asyncio.run(main())
Disk storage
Versioning
While dealing with indexing and memory leaks, I also encountered excessive disk usage, which grew into terabytes.
The main reason for this excessive disk usage was LanceDB’s versioning system (see this), which stored multiple versions of the data, rapidly increasing storage consumption. Each operation on the table generates new files in directories located at paths like:
/storage/lancedb/documents.lance/_transactions
/storage/lancedb/documents.lance/_versions
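A small helper like the one below (plain pathlib, not a LanceDB API) can show how much of the table directory is taken up by data, _versions, and _transactions; the path assumes the layout above.
# Report per-subdirectory disk usage inside a Lance table directory.
from pathlib import Path

table_dir = Path("/storage/lancedb/documents.lance")
for sub in sorted(table_dir.iterdir()):
    if sub.is_dir():
        size = sum(f.stat().st_size for f in sub.rglob("*") if f.is_file())
        print(f"{sub.name}: {size / 1024 ** 3:.1f} GiB")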
To manage this, the optimize operation was invoked every hour, indexing new data and removing all previous versions of the table. This operation doesn't block the table, so vector search remains unaffected and continues to work without interruption.
# hourly script
import asyncio
from datetime import timedelta

from lancedb import connect_async


async def main() -> None:
    conn = await connect_async(uri="/storage/lancedb")
    table = await conn.open_table("documents")
    await table.optimize(
        # these parameters remove all previous versions
        cleanup_older_than=timedelta(seconds=0),
        delete_unverified=True,
    )


asyncio.run(main())
Indexes
LanceDB’s search mechanism isn’t blocked by the optimisation process because it leaves the old indexes untouched. It creates new indexes based on the existing ones, adding the unindexed data, and then switches to the new indexes. The old indexes can be safely deleted afterward, helping to reduce disk storage usage.
$ ls -lt /storage/lancedb/documents.lance/_indices
drwxrwxr-x 2 vald vald 4096 Mar 30 17:06 aa9325d8-8cd2-4373-870b-94e26c1ad732
drwxrwxr-x 2 vald vald 4096 Mar 30 17:06 9b93c185-52dd-4e54-8cae-c4ba1b0d9f66
# NOTE: the directories below can be deleted, assuming
# you have only two indexes: a scalar and a vector one
drwxrwxr-x 2 vald vald 4096 Mar 30 17:01 5484f99e-dda2-4caa-bce4-8ea7987fff61
drwxrwxr-x 2 vald vald 4096 Mar 30 17:00 125fee6d-6170-4600-b084-3680a7b35e59
I automated this process right after the optimize step because it produces new indexes. Using a CRON job, I set up a script that ensures only the two latest directories, representing the scalar and vector indexes, are retained, automatically deleting any older, unused directories.
import asyncio
import argparse
import shutil
from datetime import timedelta
from pathlib import Path
from lancedb import connect_async
from loguru import logger
async def main(
lancedb_path: str,
table_names: list[str]
) -> None:
lancedb_volume = Path(lancedb_path)
lock_file = lancedb_volume / ".lock-file"
if lock_file.exists():
logger.warning("LanceDB optimisation, cleanup process is in progress. Exiting...")
return
lock_file.touch()
db = await connect_async(uri=lancedb_path)
logger.debug(
f"LanceDB volume path: {lancedb_path}"
)
logger.debug(
"Going to process "
"the following tables: {table_names}"
)
print()
for table_name in table_names:
table = await db.open_table(table_name)
# Clean up files in dirs _versions,
# _transactions, merge data in data dir, add
# new vectors into scalar and vector indexes.
logger.warning(
f"Optimizing {table_name} table..."
)
await table.optimize(
cleanup_older_than=timedelta(seconds=0),
delete_unverified=True,
)
logger.info(
f"Finished optimizing {table_name} table."
)
print()
# Remove .tmp* directories in LanceDB data volume
# after `optimize` runs.
for item in lancedb_volume.rglob(".tmp*"):
if item.is_dir():
logger.warning(
"Deleting temporary "
f"directory: {item}"
)
shutil.rmtree(item)
print()
# Delete indexes duplicates in each directory of
# table and leave only the latest scalar and
# vector ones.
for table_name in table_names:
table_dir = Path(
f"{lancedb_path}/{table_name}"
".lance/_indices"
)
index_dirs = sorted(
(
item
for item in table_dir.iterdir()
if item.is_dir()
),
key=lambda d: d.stat().st_mtime,
reverse=True,
)
for index_dir in index_dirs[2:]:
logger.warning(
"Deleting index "
f"directory: {index_dir}"
)
shutil.rmtree(index_dir)
lock_file.unlink()
if __name__ == "__main__":
parser = argparse.ArgumentParser()
parser.add_argument(
"--lancedb-path",
type=str,
default="/storage/lancedb",
help="Path where LanceDB data lives",
)
parser.add_argument(
"--table-names",
type=str,
default="documents",
help="Comma separated list of tables",
)
args = parser.parse_args()
table_names = args.table_names.split(",")
asyncio.run(
main(
lancedb_path=args.lancedb_path,
table_names=table_names,
)
)
Final thoughts
Eventually, after all these manipulations, the 700 million documents take up 1.6TB of disk space, compared with the previous $ \infty \text{TB} $, on a 32-CPU, 128GB RAM machine that serves searches against the dataset.
Migrating from Milvus to LanceDB reduced our monthly costs from 30,000 USD to 7,000 USD in our setup, making it definitely worth the effort, despite requiring more manual management and support.
Update (09.09.2025)
LanceDB singleton connector
With newer LanceDB versions, a singleton ensures that the connection is created only once and safely reused across the application. This avoids repeatedly opening and closing resources, which is what causes memory leaks in the updated implementation.
Tables, however, should be re-opened with open_table() for every request. During the optimize() process, a cached table handle keeps pointing to the old indices while new ones are created and the old ones are deleted, which causes a LanceError(IO) stating that the index does not exist.
from typing import Self
from lancedb import (
    AsyncConnection,
    AsyncTable,
    connect_async,
)
class LanceDBConnector:
"""Singleton class for LanceDB connection
& tables management"""
instance: Self | None = None
conn: AsyncConnection | None = None
tables: dict[str, AsyncTable] = {}
def __new__(cls) -> "LanceDBConnector":
if cls.instance is None:
cls.instance = super().__new__(cls)
return cls.instance
async def get_connection(self) -> AsyncConnection:
if self.conn is None:
self.conn = await connect_async(
uri="/storage/lancedb"
)
return self.conn
async def close(self) -> None:
for table in self.tables.values():
table.close()
self.tables.clear()
if self.conn:
self.conn.close()
self.conn = None
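Building on the note above, table access can still go through the shared connection while calling open_table() on every request; the helper below is a sketch of that pattern (the function name and query parameters are illustrative, not from the original code).
# Sketch: re-open the table on each request via the singleton
# connection, so the handle never points at indices that a
# concurrent optimize() run has already replaced.
async def search_documents(query_vector, limit: int = 5) -> list[dict]:
    conn = await LanceDBConnector().get_connection()
    table = await conn.open_table("documents")
    return await (
        table.vector_search(query_vector=query_vector)
        .limit(limit)
        .to_list()
    )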
Clean up script
LanceDB fixed the issues with stale indices, so the script for the CRON job should be updated as well:
import asyncio
import argparse
import shutil
from datetime import timedelta
from pathlib import Path
from lancedb import connect_async
from loguru import logger
async def main(
lancedb_path: str, table_names: list[str]
) -> None:
lancedb_volume = Path(lancedb_path)
lock_file = lancedb_volume / ".lock-file"
if lock_file.exists():
logger.warning(
"LanceDB optimisation, cleanup process is in progress. Exiting..."
)
return
lock_file.touch()
db = await connect_async(uri=lancedb_path)
logger.debug(
f"LanceDB volume path: {lancedb_path}"
)
logger.debug(
f"Going to process the following tables: {table_names}"
)
print()
for table_name in table_names:
table = await db.open_table(table_name)
# Clean up files in dirs _versions, _transactions, merge data in data
# dir, add new vectors into scalar and vector indices.
logger.warning(
f"Optimizing {table_name} table..."
)
await table.optimize(
cleanup_older_than=timedelta(seconds=0),
delete_unverified=True,
)
logger.info(
f"Finished optimizing {table_name} table."
)
print()
# Remove .tmp* directories in LanceDB data after `optimize` runs.
for item in lancedb_volume.rglob(".tmp*"):
if item.is_dir():
logger.warning(
f"Deleting temporary directory: {item}"
)
shutil.rmtree(item)
print()
# Delete empty index directories inside each table
for table_name in table_names:
table_dir = Path(
f"{lancedb_path}/{table_name}.lance/_indices"
)
if not table_dir.exists():
continue
for index_dir in table_dir.iterdir():
if index_dir.is_dir() and not any(
index_dir.iterdir()
):
logger.warning(
f"Deleting empty index directory: {index_dir}"
)
shutil.rmtree(index_dir)
lock_file.unlink()
if __name__ == "__main__":
parser = argparse.ArgumentParser()
parser.add_argument(
"--lancedb-path",
type=str,
default="/storage/lancedb",
help="Path where LanceDB data lives",
)
parser.add_argument(
"--table-names",
type=str,
default="documents",
help="Comma separated list of tables",
)
args = parser.parse_args()
table_names = args.table_names.split(",")
asyncio.run(
main(
lancedb_path=args.lancedb_path,
table_names=table_names,
)
)