embedding_reader
Embedding reader is a module that makes it easy to efficiently read a large collection of embeddings stored in any file system.
pip install embedding_reader
Check out these examples to use it as a library:
from embedding_reader import EmbeddingReader
embedding_reader = EmbeddingReader(embeddings_folder="embedding_folder", file_format="npy")
print("embedding count", embedding_reader.count)
print("dimension", embedding_reader.dimension)
print("total size", embedding_reader.total_size)
print("byte per item", embedding_reader.byte_per_item)
for emb, meta in embedding_reader(batch_size=10 ** 6, start=0, end=embedding_reader.count):
    print(emb.shape)
In laion5B you can find 5B ViT-L/14 image embeddings; you can read them with this code:
from embedding_reader import EmbeddingReader
embedding_reader = EmbeddingReader(embeddings_folder="https://mystic.the-eye.eu/public/AI/cah/laion5b/embeddings/laion2B-en/img_emb/", file_format="npy")
print("embedding count", embedding_reader.count)
print("dimension", embedding_reader.dimension)
print("total size", embedding_reader.total_size)
print("byte per item", embedding_reader.byte_per_item)
for emb, meta in embedding_reader(batch_size=10 ** 6, start=0, end=embedding_reader.count):
    print(emb.shape)
It takes about 3h to read the laion2B-en embeddings at 300MB/s.
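That figure can be sanity-checked with a rough back-of-the-envelope estimate (the embedding count used below is an approximate assumption; ViT-L/14 embeddings have dimension 768 and are stored as float16):

```python
# Rough arithmetic check of the ~3h figure (assumed approximate values)
n_embeddings = 2.3e9      # approximate laion2B-en count (assumption)
dim = 768                 # ViT-L/14 embedding dimension
bytes_per_value = 2       # float16

total_bytes = n_embeddings * dim * bytes_per_value
hours = total_bytes / (300 * 2**20) / 3600  # at 300MB/s
print(round(hours, 1))  # → 3.1
```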
The parquet_npy format supports reading from both a .npy collection and a .parquet collection that are in the same order. Here is an example of usage:
from embedding_reader import EmbeddingReader
embedding_reader = EmbeddingReader(
embeddings_folder="embedding_folder",
metadata_folder="metadata_folder",
meta_columns=['image_path', 'caption'],
file_format="parquet_npy"
)
for emb, meta in embedding_reader(batch_size=10 ** 6, start=0, end=embedding_reader.count):
    print(emb.shape)
    print(meta["image_path"], meta["caption"])
emb is a numpy array like in the previous examples, while meta is a pandas dataframe with the columns requested in meta_columns.
Embedding reader supports many use cases.
Embeddings are a powerful concept: they allow turning highly complex data into points in a linearly separable space. Embeddings are also much smaller and more efficient to manipulate than the original data (images, audio, video, text, interaction items, ...).
To learn more about embeddings, read the Semantic search blog post.
Thanks to fsspec, embedding_reader supports reading and writing files in many file systems.
To use it, simply put the prefix of your filesystem before the path, for example hdfs://, s3://, http://, or gcs://.
Some of these file systems require installing an additional package (for example s3fs for s3, gcsfs for gcs).
See the fsspec documentation for all the details.
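As an illustration, fsspec resolves the filesystem from the URL prefix. Here is a minimal sketch using the local filesystem (the path is hypothetical; with s3fs installed, an s3:// path would work the same way):

```python
import fsspec

# "file://" selects the local filesystem; "s3://" would select s3fs, etc.
path = "file:///tmp/embedding_reader_demo.txt"  # hypothetical path
with fsspec.open(path, "w") as f:
    f.write("hello")
with fsspec.open(path, "r") as f:
    print(f.read())  # → hello
```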
This module exposes one class, EmbeddingReader.
The constructor initializes the reader by listing all files in the embedding folder and retrieving their metadata.
The reader then exposes these properties:
count: total number of embeddings in this folder
dimension: dimension of one embedding
byte_per_item: size of one embedding in bytes
total_size: size in bytes of the collection
It also reports the total number of embedding files in this folder and the max size in bytes of the embedding files of the collection.
Calling the reader produces an iterator that yields tuples (data, meta) with the given batch_size.
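The (start, end, batch_size) slicing behaves like the following sketch (a hypothetical helper for illustration, not the library's actual code):

```python
def batch_ranges(start, end, batch_size):
    """Yield (batch_start, batch_end) pairs covering [start, end)."""
    for i in range(start, end, batch_size):
        yield i, min(i + batch_size, end)

# The last batch is truncated to the requested end.
print(list(batch_ranges(0, 10, 4)))  # → [(0, 4), (4, 8), (8, 10)]
```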
The main architecture choice of this lib is the build_pieces function, which initially builds decently sized pieces of embedding files (typically 50MB).
The metadata of these pieces can then be used to fetch them in parallel; the pieces are then used to build the embedding batches provided to the user.
To reach maximal speed, it is better to read files of equal size. The number of threads used is constrained by the maximum size of your embedding files: the lower the size, the more threads are used (you can also set a custom number of threads, but RAM consumption will be higher).
In practice, speeds of up to 100MB/s have been observed when fetching embeddings from s3, and 1GB/s when fetching from an nvme drive. That means reading 400GB of embeddings (400M embeddings in float16 with dimension 512) in 8 minutes. The memory usage stays low and flat thanks to the absence of copies. Decreasing the batch size decreases the amount of memory consumed; you can also set max_ram_usage_in_bytes for better control over RAM usage.
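The piece-building idea can be sketched as follows (a simplified illustration under assumed inputs, not the library's actual build_pieces implementation):

```python
def build_pieces(file_sizes, piece_size=50 * 2**20):
    """Split each file into byte ranges of at most piece_size.

    file_sizes: dict mapping file name -> size in bytes (hypothetical input).
    Returns a list of (file_name, start_byte, end_byte) pieces that can be
    fetched in parallel and reassembled into batches.
    """
    pieces = []
    for name, size in file_sizes.items():
        start = 0
        while start < size:
            end = min(start + piece_size, size)
            pieces.append((name, start, end))
            start = end
    return pieces

# A 120MB file yields three pieces (50 + 50 + 20MB); a 30MB file yields one.
pieces = build_pieces({"emb_0.npy": 120 * 2**20, "emb_1.npy": 30 * 2**20})
print(len(pieces))  # → 4
```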
Either locally, or in gitpod (do export PIP_USER=false there).
Set up a virtualenv:
python3 -m venv .env
source .env/bin/activate
pip install -e .
To run tests:
pip install -r requirements-test.txt
then
make lint
make test
You can use make black to reformat the code.
Use python -m pytest -x -s -v tests -k "dummy" to run a specific test.