FAISSDB: Documentation

The FAISSDB class is a highly customizable wrapper for the FAISS (Facebook AI Similarity Search) library, designed for efficient similarity search and clustering of dense vectors. This class facilitates the creation of a Retrieval-Augmented Generation (RAG) system by providing methods to add documents to a FAISS index and query the index for similar documents. It supports custom embedding models, preprocessing functions, and other customizations to fit various use cases.

Parameters

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| dimension | int | 768 | Dimension of the document embeddings. |
| index_type | str | 'Flat' | Type of FAISS index to use ('Flat' or 'IVF'). |
| embedding_model | Optional[Any] | None | Custom embedding model. |
| embedding_function | Optional[Callable[[str], List[float]]] | None | Custom function to generate embeddings from text. |
| preprocess_function | Optional[Callable[[str], str]] | None | Custom function to preprocess text before embedding. |
| postprocess_function | Optional[Callable[[List[Dict[str, Any]]], List[Dict[str, Any]]]] | None | Custom function to postprocess the results. |
| metric | str | 'cosine' | Distance metric for FAISS index ('cosine' or 'l2'). |
| logger_config | Optional[Dict[str, Any]] | None | Configuration for the logger. |
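The two `metric` options rank neighbours differently. The sketch below uses plain Python with no FAISS dependency to compute both for a pair of vectors. It also shows a common trick: once vectors are scaled to unit length, a plain inner product equals the cosine similarity. Whether FAISSDB normalizes vectors internally is an assumption, not something this page states, and the helper names are illustrative only.

```python
import math
from typing import List


def l2_distance(a: List[float], b: List[float]) -> float:
    """Euclidean distance: smaller means more similar."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Cosine of the angle between a and b: larger means more similar."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def normalize(v: List[float]) -> List[float]:
    """Scale v to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


a = [3.0, 4.0]
b = [6.0, 8.0]  # same direction as a, twice the length

print(l2_distance(a, b))        # non-zero: the vectors differ in length
print(cosine_similarity(a, b))  # close to 1: the vectors point the same way

# After normalization, a plain dot product gives the cosine similarity.
unit_dot = sum(x * y for x, y in zip(normalize(a), normalize(b)))
```

In practice, 'cosine' suits embeddings whose direction carries the meaning, while 'l2' also takes vector length into account.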

Methods

__init__

Initializes the FAISSDB instance, setting up the logger, creating the FAISS index, and configuring custom functions if provided.

add

Adds a document to the FAISS index.

Parameters

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| doc | str | None | The document to be added. |
| metadata | Optional[Dict[str, Any]] | None | Additional metadata for the document. |

Example Usage

db = FAISSDB(dimension=768)
db.add("This is a sample document.", {"category": "sample"})

query

Queries the FAISS index for similar documents.

Parameters

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| query | str | None | The query string. |
| top_k | int | 5 | The number of top results to return. |

Returns

| Type | Description |
| --- | --- |
| List[Dict[str, Any]] | A list of dictionaries containing the top_k most similar documents. |

Example Usage

results = db.query("What is artificial intelligence?")
for result in results:
    print(f"Score: {result['score']}, Text: {result['metadata']['text']}")
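The loop above reads `result['score']` and `result['metadata']['text']`, which suggests that the document text is stored alongside the user-supplied metadata. As a rough mental model only, the following stand-in mimics that add/query flow in pure Python. `ToyVectorStore` and `toy_embed` are hypothetical names, the letter-count embedding is a placeholder for a real model, and none of this is FAISSDB's actual implementation.

```python
import math
from typing import Any, Callable, Dict, List, Optional


def toy_embed(text: str) -> List[float]:
    """Placeholder embedding: counts of each letter a-z (26 dimensions)."""
    vec = [0.0] * 26
    for ch in text.lower():
        if "a" <= ch <= "z":
            vec[ord(ch) - ord("a")] += 1.0
    return vec


class ToyVectorStore:
    """Hypothetical stand-in that mirrors the documented add/query shape."""

    def __init__(self, embedding_function: Callable[[str], List[float]]):
        self.embed = embedding_function
        self.vectors: List[List[float]] = []
        self.metadata: List[Dict[str, Any]] = []

    def add(self, doc: str, metadata: Optional[Dict[str, Any]] = None) -> None:
        entry = dict(metadata or {})
        entry["text"] = doc  # keep the text next to the user metadata
        self.vectors.append(self.embed(doc))
        self.metadata.append(entry)

    def query(self, query: str, top_k: int = 5) -> List[Dict[str, Any]]:
        q = self.embed(query)
        scored = [
            {"score": self._cosine(q, v), "metadata": m}
            for v, m in zip(self.vectors, self.metadata)
        ]
        # Highest cosine similarity first, then keep the top_k hits
        scored.sort(key=lambda r: r["score"], reverse=True)
        return scored[:top_k]

    @staticmethod
    def _cosine(a: List[float], b: List[float]) -> float:
        dot = sum(x * y for x, y in zip(a, b))
        norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
        return dot / norm if norm else 0.0


store = ToyVectorStore(toy_embed)
store.add("Artificial intelligence studies intelligent agents.", {"category": "AI"})
store.add("Bread needs flour, water, and yeast.", {"category": "Food"})

results = store.query("What is artificial intelligence?", top_k=1)
for result in results:
    print(f"Score: {result['score']:.3f}, Text: {result['metadata']['text']}")
```

The real class delegates storage and search to a FAISS index; the toy version scans a Python list, which is only practical for a handful of documents.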

Internal Methods

_setup_logger

Sets up the logger with the given configuration.

Parameters

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| config | Optional[Dict[str, Any]] | None | Configuration for the logger. |

_create_index

Creates and returns a FAISS index based on the specified type and metric.

Parameters

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| index_type | str | 'Flat' | Type of FAISS index to use. |
| metric | str | 'cosine' | Distance metric for FAISS index. |

Returns

| Type | Description |
| --- | --- |
| faiss.Index | FAISS index instance. |
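To make the difference between the two index types concrete, here is a pure-Python sketch of the underlying ideas. A flat index compares the query with every stored vector. An IVF (inverted file) index first files vectors into cells around centroids and then scans only the `nprobe` cells nearest to the query. In the FAISS library these ideas correspond to classes such as `faiss.IndexFlatL2` and `faiss.IndexIVFFlat`; which ones FAISSDB instantiates is not stated here. The function names and the fixed centroids below are illustrative assumptions.

```python
from typing import Dict, List, Tuple

Vector = Tuple[float, float]


def sq_dist(a: Vector, b: Vector) -> float:
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def flat_search(vectors: List[Vector], query: Vector, k: int) -> List[int]:
    """'Flat' idea: compare the query with every stored vector."""
    order = sorted(range(len(vectors)), key=lambda i: sq_dist(vectors[i], query))
    return order[:k]


def build_cells(vectors: List[Vector], centroids: List[Vector]) -> Dict[int, List[int]]:
    """'IVF' idea, step 1: file each vector under its nearest centroid."""
    cells: Dict[int, List[int]] = {c: [] for c in range(len(centroids))}
    for i, v in enumerate(vectors):
        nearest = min(range(len(centroids)), key=lambda c: sq_dist(v, centroids[c]))
        cells[nearest].append(i)
    return cells


def ivf_search(
    vectors: List[Vector],
    centroids: List[Vector],
    cells: Dict[int, List[int]],
    query: Vector,
    k: int,
    nprobe: int = 1,
) -> List[int]:
    """'IVF' idea, step 2: scan only the nprobe cells closest to the query."""
    probe = sorted(range(len(centroids)), key=lambda c: sq_dist(query, centroids[c]))[:nprobe]
    candidates = [i for c in probe for i in cells[c]]
    candidates.sort(key=lambda i: sq_dist(vectors[i], query))
    return candidates[:k]


vectors = [(0.0, 0.0), (0.2, 0.1), (4.0, 4.0), (5.0, 5.0), (2.4, 2.4)]
centroids = [(0.0, 0.0), (5.0, 5.0)]  # fixed here; FAISS learns them during training
cells = build_cells(vectors, centroids)

query = (2.6, 2.6)
exact = flat_search(vectors, query, k=2)                              # scans all 5 vectors
approx = ivf_search(vectors, centroids, cells, query, k=2, nprobe=1)  # scans one cell
both = ivf_search(vectors, centroids, cells, query, k=2, nprobe=2)    # scans both cells
```

With `nprobe=1` the search misses the true nearest neighbour, because that vector was filed in the other cell. Probing both cells recovers the exact result at the cost of scanning more vectors. That is the speed versus accuracy trade-off of IVF.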

_default_embedding_function

Default embedding function using the SentenceTransformer model.

Parameters

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| text | str | None | The input text to embed. |

Returns

| Type | Description |
| --- | --- |
| List[float] | Embedding vector for the input text. |

_default_preprocess_function

Default preprocessing function.

Parameters

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| text | str | None | The input text to preprocess. |

Returns

| Type | Description |
| --- | --- |
| str | Preprocessed text. |

_default_postprocess_function

Default postprocessing function.

Parameters

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| results | List[Dict[str, Any]] | None | The results to postprocess. |

Returns

| Type | Description |
| --- | --- |
| List[Dict[str, Any]] | Postprocessed results. |
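A postprocessing hook receives the result list and returns a list of the same shape, so it is a natural place for filtering. The sketch below drops low-scoring hits and repeated texts. It assumes each result carries `score` and `metadata['text']` keys, as the usage examples on this page do, and that a higher score means a closer match. `filter_and_dedupe` and `MIN_SCORE` are illustrative names, not part of FAISSDB.

```python
from typing import Any, Dict, List

MIN_SCORE = 0.5  # illustrative threshold, tune for your embedding model


def filter_and_dedupe(results: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Drop low-scoring hits and repeated texts, keeping the original order."""
    seen = set()
    kept = []
    for result in results:
        text = result["metadata"]["text"]
        if result["score"] < MIN_SCORE or text in seen:
            continue
        seen.add(text)
        kept.append(result)
    return kept


sample = [
    {"score": 0.91, "metadata": {"text": "AI overview"}},
    {"score": 0.88, "metadata": {"text": "AI overview"}},   # duplicate text
    {"score": 0.12, "metadata": {"text": "Bread recipe"}},  # below threshold
]
cleaned = filter_and_dedupe(sample)
```

A function like this matches the `postprocess_function` type in the parameter table and could be passed to the constructor under that keyword.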

Usage Examples

Example 1: Basic Usage

# Initialize the FAISSDB instance
db = FAISSDB(dimension=768, index_type="Flat")

# Add documents to the FAISS index
db.add("This is a document about AI.", {"category": "AI"})
db.add("Python is great for data science.", {"category": "Programming"})

# Query the FAISS index
results = db.query("Tell me about AI")
for result in results:
    print(f"Score: {result['score']}, Text: {result['metadata']['text']}")

Example 2: Custom Functions

from typing import Any, Dict, List

import torch
from transformers import AutoModel, AutoTokenizer

# Load the tokenizer and model once, not on every call
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

# Custom embedding function using a HuggingFace model
def custom_embedding_function(text: str) -> List[float]:
    inputs = tokenizer(text, return_tensors="pt", padding=True, truncation=True, max_length=512)
    with torch.no_grad():
        outputs = model(**inputs)
    # Mean-pool the token embeddings into one 768-dimensional vector
    embeddings = outputs.last_hidden_state.mean(dim=1).squeeze().tolist()
    return embeddings

# Custom preprocessing function
def custom_preprocess(text: str) -> str:
    return text.lower().strip()

# Custom postprocessing function
def custom_postprocess(results: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    for result in results:
        result["custom_score"] = result["score"] * 2  # Example modification
    return results

# Initialize the FAISSDB instance with custom functions
db = FAISSDB(
    dimension=768,
    index_type="Flat",
    embedding_function=custom_embedding_function,
    preprocess_function=custom_preprocess,
    postprocess_function=custom_postprocess,
    metric="cosine",
    logger_config={
        "handlers": [
            {"sink": "custom_faiss_rag_wrapper.log", "rotation": "1 GB"},
            {"sink": lambda msg: print(f"Custom log: {msg}", end="")}
        ],
    },
)

# Add documents to the FAISS index
db.add("This is a document about machine learning.", {"category": "ML"})
db.add("Python is a versatile programming language.", {"category": "Programming"})

# Query the FAISS index
results = db.query("Explain machine learning")
for result in results:
    print(f"Score: {result['score']}, Custom Score: {result['custom_score']}, Text: {result['metadata']['text']}")

Additional Information and Tips

  • Ensure that the dimension of the document embeddings matches the dimension specified during the initialization of the FAISSDB instance.
  • Use custom embedding functions to leverage domain-specific models for generating embeddings.
  • Custom preprocessing and postprocessing functions can help tailor the text processing and result formatting to specific needs.
  • FAISS supports various types of indices; choose the one that best fits the application requirements (e.g., Flat for brute-force search, IVF for faster search with some accuracy trade-off).
  • Properly configure the logger to monitor and debug the operations of the FAISSDB instance.
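The first tip, matching embedding size to the configured `dimension`, can be enforced before any vector reaches the index. The following sketch wraps an embedding function with a length check. `checked` and `fake_embed` are hypothetical helpers written for this illustration, and how FAISSDB itself reacts to a mismatch is not covered here.

```python
from typing import Callable, List


def checked(
    embedding_function: Callable[[str], List[float]], dimension: int
) -> Callable[[str], List[float]]:
    """Wrap an embedding function so a size mismatch fails with a clear message."""

    def wrapper(text: str) -> List[float]:
        vector = embedding_function(text)
        if len(vector) != dimension:
            raise ValueError(
                f"embedding has {len(vector)} values, expected {dimension}"
            )
        return vector

    return wrapper


def fake_embed(text: str) -> List[float]:
    """Stand-in for a real model: always returns 4 values."""
    return [float(len(text)), 0.0, 0.0, 1.0]


# Matching dimension: the vector passes through unchanged.
ok = checked(fake_embed, dimension=4)("hello")

# Mismatched dimension: the wrapper raises before the vector is used.
try:
    checked(fake_embed, dimension=768)("hello")
    mismatch_detected = False
except ValueError:
    mismatch_detected = True
```

The wrapped function keeps the `Callable[[str], List[float]]` shape, so it can stand in wherever the unwrapped embedding function was used.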

By following this documentation, users can effectively utilize the FAISSDB class for various similarity search and document retrieval tasks, customizing it to their specific needs through the provided hooks and functions.