Scaling LanceDB: Running 700 million vectors in production

Header image: Library Database by Jan Antonin Kolar

❗ NOTE: The context manager approach (opening and closing the connection/table on each use, e.g. in ConnectionManager) works for lancedb==0.21.2.

However, as of lancedb==0.25.0 (update on 09.09.2025), the issues with connect_async have been fixed. Using the old context manager pattern will now cause memory leaks.

To avoid this, use a singleton pattern (see the example at the end) for managing connections and tables.

Introduction

In this article, I will walk through the process of migrating an existing vector database from Milvus DB to LanceDB. The primary motivation for this move was the excessive running costs associated with Milvus, which led to the exploration of more cost-effective alternatives. LanceDB emerged as a promising solution due to its potential for lower operational costs while maintaining the necessary performance for handling large-scale vector data.

The migration process involved transferring a massive dataset of 700 million vectors and successfully running LanceDB in a production environment. However, during this process, I encountered several challenges that complicated the migration. In the following sections, I will delve into these issues and share the lessons learned while overcoming them.

Problems

Indexing

The dataset migration was successful, with 700 million vectors transferred from Milvus DB to LanceDB without major issues. However, the indexing process for such a large amount of data did not go as smoothly.

My focus was on two types of indexes: the scalar BTree index and the vector IvfPq index.

The indexing process for the BTree index went smoothly; it took several hours.

>>> from lancedb import connect_async
>>> from lancedb.index import BTree

>>> conn = await connect_async(uri="/storage/lancedb")
>>> table = await conn.open_table("documents")

>>> await table.create_index(
    column="id",
    config=BTree(),
    replace=False
)

>>> await table.list_indices()
[Index(BTree, columns=["id"], name="id_idx")]

>>> await table.index_stats("id_idx")
IndexStatistics(
    num_indexed_rows=700_000_000,
    num_unindexed_rows=0,
    index_type='BTREE',
    distance_type=None,
    num_indices=1,
)

Things got more interesting with the IvfPq index because, during the creation process, LanceDB temporarily stores intermediate data in the /tmp directory. However, this volume didn’t have enough disk space for the gigabytes of data required. Since we had a separate volume for LanceDB’s data storage, I worked around this limitation by redirecting the default /tmp location using an environment variable.

$ export TMPDIR=/storage/lancedb
>>> from lancedb import connect_async
>>> from lancedb.index import IvfPq

>>> conn = await connect_async(uri="/storage/lancedb")
>>> table = await conn.open_table("documents")

>>> await table.create_index(
    column="vector",
    config=IvfPq(
        distance_type="l2",
        num_partitions=26_457,
        num_sub_vectors=64
    ),
    replace=False
)

According to the docs, the values for num_partitions and num_sub_vectors are calculated as follows (the numbers used above are sanity-checked right after the list):

  • $ \text{number of partitions} = \sqrt{\text{number of rows}} $
  • $ \text{number of subvectors} = \frac{\text{vector dimension}}{16} $
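As a rough sanity check, these formulas line up with the partition counts used in this article: 26,457 for the full dataset and 7,071 for the 50-million-record batches described below. The 1024-dimensional example for the sub-vector count is purely illustrative, since the production vector dimension isn’t stated here.

>>> # num_partitions ~ sqrt(number of rows)
>>> int(700_000_000 ** 0.5)  # full dataset
26457
>>> int(50_000_000 ** 0.5)   # one 50M batch, used later
7071
>>> # num_sub_vectors = vector dimension / 16; e.g. a
>>> # 1024-dimensional vector (illustrative) gives 64.
>>> 1024 // 16
64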

The creation of the vector index ran for about a day before ultimately failing due to insufficient RAM, despite the machine having 128GB of memory. This prompted me to investigate the issue, but after some time I decided to abandon the initial approach and explore an alternative strategy. Instead of indexing the entire dataset at once, I batched the data into chunks of 50 million documents. This approach proved successful, as I was able to create the vector index for the first 50 million records without any issues.

>>> from lancedb import connect_async
>>> from lancedb.index import IvfPq

>>> conn = await connect_async(uri="/storage/lancedb")
>>> table = await conn.open_table("documents")

>>> await table.create_index(
    column="vector",
    config=IvfPq(
        distance_type="l2",
        num_partitions=7071,
        num_sub_vectors=64
    ),
    replace=False
)

>>> await table.list_indices()
[
    Index(BTree, columns=["id"], name="id_idx"),
    Index(
        IvfPq, columns=["vector"], name="vector_idx"
    )
]
>>> await table.index_stats("vector_idx")
IndexStatistics(
    num_indexed_rows=50_000_000,
    num_unindexed_rows=0,
    index_type='IVF_PQ',
    distance_type='l2',
    num_indices=1,
)

After each 50-million chunk, an optimize step was executed. This step is crucial because LanceDB creates indexes only once and doesn’t automatically index newly added data, so explicit optimisation is required after each addition.

>>> from datetime import timedelta
>>> from lancedb import connect_async

>>> conn = await connect_async(uri="/storage/lancedb")
>>> table = await conn.open_table("documents")

>>> await table.optimize(
    cleanup_older_than=timedelta(seconds=0),
    delete_unverified=True
)

With this batching and optimisation approach, I successfully created both the scalar and the full vector indexes for the entire dataset.

Memory leaks

Indexing was sorted out; however, when I started running LanceDB in production, memory leaks quickly became apparent. Although the database performed well under normal conditions, it consumed ever-increasing amounts of memory when running under Uvicorn, where it was deployed as an API.

Connections and tables should be closed after each operation, but they kept piling up, leading to excessive memory consumption, so a connection manager was introduced to solve this problem.

To get hands-on experience, set up LanceDB with a dataset, experiment, and monitor memory usage. Below are the requirements.txt and a script with a connection manager.

# Ran on Python 3.11.11
# requirements.txt 
annotated-types==0.7.0
deprecation==2.1.0
lancedb==0.21.2
loguru==0.7.3
numpy==2.2.4
overrides==7.7.0
packaging==24.2
pandas==2.2.3
pyarrow==19.0.1
pydantic==2.11.1
pydantic-core==2.33.0
python-dateutil==2.9.0.post0
pytz==2025.2
six==1.17.0
tqdm==4.67.1
typing-extensions==4.13.0
typing-inspection==0.4.0
tzdata==2025.2
# This script creates a lance table with
# 10k documents/vectors via connection manager.
import asyncio

import numpy as np
from lancedb.db import AsyncConnection, AsyncTable
from lancedb.pydantic import LanceModel, Vector
from loguru import logger
from pydantic import Field

from lancedb import connect_async


def generate_normalised_vectors(
    *,
    num_vectors: int,
    dimension: int = 512,
    max_distance: float = 1.0,
):
    # Generate random vectors
    vectors = np.random.randn(num_vectors, dimension)

    # Normalize vectors to unit length (L2 norm = 1)
    norms = np.linalg.norm(vectors, axis=1)
    norm_vecs = vectors / norms[:, np.newaxis]

    # Scale each vector by a random factor between 0
    # and max_distance
    scales = (
        np.random.rand(num_vectors) * max_distance
    )
    scaled_vectors = norm_vecs * scales[:, np.newaxis]

    return scaled_vectors


class ConnectionManager:
    table: AsyncTable

    def __init__(
        self,
        *,
        conn: AsyncConnection,
        table_name: str
    ) -> None:
        self.conn = conn
        self.table_name = table_name

    async def __aenter__(self):
        self.table = await self.conn.open_table(
            self.table_name
        )
        logger.debug(
            f"Opened LanceDB table {self.table_name}"
        )

        return self

    async def __aexit__(self, exc_type, exc, tb):
        self.table.close()
        self.conn.close()
        logger.debug(
            "Closed LanceDB table "
            f"{self.table_name} & its connection."
        )


class Document(LanceModel):
    id: int
    title: str
    content: str
    distance: float = Field(
        alias="_distance", default=0.0
    )
    vector: Vector(512)


async def main():
    # Establish a connection with a LanceDB.
    conn = await connect_async(uri="/storage/lancedb")

    # Create a lance table based on pydantic
    # `Document` schema.
    await conn.create_table(
        name="documents",
        schema=Document,
        mode="create",
    )

    # Open documents table and close its connection &
    # table upon exit from an async context manager.
    async with ConnectionManager(
        conn=conn,
        table_name="documents"
    ) as manager:
        # Generate 10k pydantic `Document` instances
        # with normalised vectors.
        vectors = generate_normalised_vectors(
            num_vectors=10_000
        )
        documents = [
            Document(
                id=i,
                title=f"Document {i}",
                content=f"Content of document {i}",
                vector=vectors[i],
            )
            for i in range(10_000)
        ]
        await manager.table.add(documents)

        # Get first 100 documents from documents table
        # and validate them with a pydantic model.
        raw_documents = (
            await manager.table.query()
            .limit(100)
            .to_list()
        )
        documents = [
            Document.model_validate(doc)
            for doc in raw_documents
        ]

        # Iterate through the 100 documents and find the
        # 5 most similar vectors based on each document's
        # base vector.
        for doc in documents:
            raw_docs = await (
                manager.table.vector_search(
                    query_vector=doc.vector
                )
                # don't include itself in a search
                .where(f"id != {doc.id}")
                .limit(5)
                .to_list()
            )
            docs = [
                Document.model_validate(doc)
                for doc in raw_docs
            ]

            # Show only vectors with a close distance.
            for doc in docs:
                if doc.distance > 0.8:
                    logger.info(
                        f"Doc id: {doc.id}, "
                        f"distance {doc.distance}."
                    )


asyncio.run(main())
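For reference, here is roughly how such a manager can be wired into a per-request handler under Uvicorn. This is a hypothetical sketch: FastAPI is not in the requirements above, the /search route is made up, and it assumes the ConnectionManager from the script above is importable. It only illustrates the open-use-close pattern that applies to lancedb==0.21.2.

# Hypothetical per-request usage under Uvicorn/FastAPI (neither is in
# requirements.txt above). ConnectionManager is the class defined in
# the script above. Applies to the lancedb==0.21.2 pattern only.
from fastapi import FastAPI

from lancedb import connect_async

app = FastAPI()


@app.post("/search")
async def search(vector: list[float]) -> list[dict]:
    # Open the connection & table for this request only; the context
    # manager closes both on exit, so nothing piles up between requests.
    conn = await connect_async(uri="/storage/lancedb")
    async with ConnectionManager(
        conn=conn, table_name="documents"
    ) as manager:
        return await (
            manager.table.vector_search(query_vector=vector)
            .limit(5)
            .to_list()
        )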

Disk storage

Versioning

While dealing with indexing and memory leaks, I also encountered excessive disk usage, which grew into terabytes.

The main reason for this excessive disk usage was LanceDB’s versioning system (see this), which stored multiple versions of the data, rapidly increasing storage consumption. Each operation on the table generates new files in directories located at paths like:

  • /storage/lancedb/documents.lance/_transactions
  • /storage/lancedb/documents.lance/_versions

To manage this, the optimize operation was invoked every hour, indexing new data and removing all previous versions of the table. This operation doesn’t block the table, so vector search remains unaffected and continues to work without interruption.

# hourly script
import asyncio
from datetime import timedelta

from lancedb import connect_async


async def main() -> None:
    conn = await connect_async(uri="/storage/lancedb")
    table = await conn.open_table("documents")

    await table.optimize(
        # such params delete previous versions
        cleanup_older_than=timedelta(seconds=0),
        delete_unverified=True,
    )


asyncio.run(main())

Indexes

LanceDB’s search mechanism isn’t blocked by the optimisation process because it leaves the old indexes untouched. It creates new indexes based on the existing ones, adding the unindexed data, and then switches to the new indexes. The old indexes can be safely deleted afterward, helping to reduce disk storage usage.

$ ls -lt /storage/lancedb/documents.lance/_indices
drwxrwxr-x 2 vald vald 4096 Mar 30 17:06 aa9325d8-8cd2-4373-870b-94e26c1ad732
drwxrwxr-x 2 vald vald 4096 Mar 30 17:06 9b93c185-52dd-4e54-8cae-c4ba1b0d9f66
# NOTE: the directories below can be deleted, assuming
# you have only two indexes: a scalar and a vector one
drwxrwxr-x 2 vald vald 4096 Mar 30 17:01 5484f99e-dda2-4caa-bce4-8ea7987fff61
drwxrwxr-x 2 vald vald 4096 Mar 30 17:00 125fee6d-6170-4600-b084-3680a7b35e59

I automated this process after the optimize step because it produces new indexes. Using a CRON job, I set up a script that ensures only the two latest directories, representing the scalar and vector indexes, are retained, automatically deleting any older, unused directories.

import asyncio
import argparse
import shutil
from datetime import timedelta
from pathlib import Path

from lancedb import connect_async
from loguru import logger


async def main(
    lancedb_path: str,
    table_names: list[str]
) -> None:
    lancedb_volume = Path(lancedb_path)

    lock_file = lancedb_volume / ".lock-file"
    if lock_file.exists():
        logger.warning("LanceDB optimisation, cleanup process is in progress. Exiting...")
        return

    lock_file.touch()
    db = await connect_async(uri=lancedb_path)

    logger.debug(
        f"LanceDB volume path: {lancedb_path}"
    )
    logger.debug(
        "Going to process "
        "the following tables: {table_names}"
    )
    print()

    for table_name in table_names:
        table = await db.open_table(table_name)

        # Clean up files in dirs _versions,
        # _transactions, merge data in data dir, add
        # new vectors into scalar and vector indexes.
        logger.warning(
            f"Optimizing {table_name} table..."
        )
        await table.optimize(
            cleanup_older_than=timedelta(seconds=0),
            delete_unverified=True,
        )
        logger.info(
            f"Finished optimizing {table_name} table."
        )

    print()

    # Remove .tmp* directories in LanceDB data volume
    # after `optimize` runs.
    for item in lancedb_volume.rglob(".tmp*"):
        if item.is_dir():
            logger.warning(
                "Deleting temporary "
                f"directory: {item}"
            )
            shutil.rmtree(item)

    print()

    # Delete indexes duplicates in each directory of
    # table and leave only the latest scalar and
    # vector ones.
    for table_name in table_names:
        table_dir = Path(
            f"{lancedb_path}/{table_name}"
            ".lance/_indices"
        )
        index_dirs = sorted(
            (
                item
                for item in table_dir.iterdir()
                if item.is_dir()
            ),
            key=lambda d: d.stat().st_mtime,
            reverse=True,
        )
        for index_dir in index_dirs[2:]:
            logger.warning(
                "Deleting index "
                f"directory: {index_dir}"
            )
            shutil.rmtree(index_dir)

    lock_file.unlink()



if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--lancedb-path",
        type=str,
        default="/storage/lancedb",
        help="Path where LanceDB data lives",
    )
    parser.add_argument(
        "--table-names",
        type=str,
        default="documents",
        help="Comma separated list of tables",
    )
    args = parser.parse_args()
    table_names = args.table_names.split(",")

    asyncio.run(
        main(
            lancedb_path=args.lancedb_path,
            table_names=table_names,
        )
    )
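To run this hourly, a crontab entry along these lines can be used; the interpreter path, script path, and any log redirection are placeholders to adapt to your environment.

# Hypothetical crontab entry: run the optimisation/cleanup script at
# the top of every hour (paths are placeholders).
0 * * * * /usr/bin/python3 /opt/scripts/lancedb_cleanup.py --lancedb-path /storage/lancedb --table-names documents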

Final thoughts

Eventually, after all these manipulations, the 700 million documents took up 1.6TB of disk space, compared with the previous, ever-growing $ \infty \text{TB} $, and a 32-CPU, 128GB RAM machine was enough to serve searches against the dataset.

Migrating from Milvus to LanceDB reduced our monthly costs from 30,000 USD to 7,000 USD in our setup, making it well worth the effort, despite requiring more manual management and support.

Update (09.09.2025)

LanceDB singleton connector

With newer LanceDB versions, a singleton ensures that the connection is created only once and safely reused across your application. This avoids repeatedly opening and closing resources, which is what now causes memory leaks with the updated library.

Tables, on the other hand, should be opened with open_table() on every request: during the optimize() process, a cached table handle keeps pointing to old indices while new ones are created and the old ones are deleted, which causes a LanceError(IO) stating that the index does not exist. A short usage sketch follows the class below.

from typing import Self

from lancedb import (
    AsyncConnection,
    connect_async
)
from lancedb.db import AsyncTable


class LanceDBConnector:
    """Singleton class for LanceDB connection
    & tables management"""

    instance: Self | None = None
    conn: AsyncConnection | None = None
    tables: dict[str, AsyncTable] = {}

    def __new__(cls) -> "LanceDBConnector":
        if cls.instance is None:
            cls.instance = super().__new__(cls)

        return cls.instance

    async def get_connection(self) -> AsyncConnection:
        if self.conn is None:
            self.conn = await connect_async(
                uri="/storage/lancedb"
            )

        return self.conn

    async def close(self) -> None:
        for table in self.tables.values():
            table.close()

        self.tables.clear()

        if self.conn:
            self.conn.close()
            self.conn = None
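
A minimal usage sketch for the connector above (the search_documents function and its parameters are illustrative, not part of LanceDB; the table name reuses the examples from this article): keep the connection in the singleton, but call open_table() on every request so queries always see the current indices.

# Illustrative only: reuse the singleton connection, but open the table
# on every request so it never points at indices that optimize() has
# already replaced and deleted.
connector = LanceDBConnector()


async def search_documents(vector: list[float]) -> list[dict]:
    conn = await connector.get_connection()
    table = await conn.open_table("documents")

    return await (
        table.vector_search(query_vector=vector)
        .limit(5)
        .to_list()
    )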

Clean up script

LanceDB fixed the issues with stale indices, so the script for the CRON job should be updated as well:

import asyncio
import argparse
import shutil
from datetime import timedelta
from pathlib import Path

from lancedb import connect_async
from loguru import logger


async def main(
    lancedb_path: str, table_names: list[str]
) -> None:
    lancedb_volume = Path(lancedb_path)

    lock_file = lancedb_volume / ".lock-file"
    if lock_file.exists():
        logger.warning(
            "LanceDB optimisation, cleanup process is in progress. Exiting..."
        )
        return

    lock_file.touch()
    db = await connect_async(uri=lancedb_path)

    logger.debug(
        f"LanceDB volume path: {lancedb_path}"
    )
    logger.debug(
        f"Going to process the following tables: {table_names}"
    )
    print()

    for table_name in table_names:
        table = await db.open_table(table_name)

        # Clean up files in dirs _versions, _transactions, merge data in data
        # dir, add new vectors into scalar and vector indices.
        logger.warning(
            f"Optimizing {table_name} table..."
        )
        await table.optimize(
            cleanup_older_than=timedelta(seconds=0),
            delete_unverified=True,
        )
        logger.info(
            f"Finished optimizing {table_name} table."
        )

    print()

    # Remove .tmp* directories in LanceDB data after `optimize` runs.
    for item in lancedb_volume.rglob(".tmp*"):
        if item.is_dir():
            logger.warning(
                f"Deleting temporary directory: {item}"
            )
            shutil.rmtree(item)

    print()

    # Delete empty index directories inside each table
    for table_name in table_names:
        table_dir = Path(
            f"{lancedb_path}/{table_name}.lance/_indices"
        )
        if not table_dir.exists():
            continue

        for index_dir in table_dir.iterdir():
            if index_dir.is_dir() and not any(
                index_dir.iterdir()
            ):
                logger.warning(
                    f"Deleting empty index directory: {index_dir}"
                )
                shutil.rmtree(index_dir)

    lock_file.unlink()


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--lancedb-path",
        type=str,
        default="/storage/lancedb",
        help="Path where LanceDB data lives",
    )
    parser.add_argument(
        "--table-names",
        type=str,
        default="documents",
        help="Comma separated list of tables",
    )
    args = parser.parse_args()
    table_names = args.table_names.split(",")

    asyncio.run(
        main(
            lancedb_path=args.lancedb_path,
            table_names=table_names,
        )
    )