The FAISSDB class is a customizable wrapper around FAISS (Facebook AI Similarity Search), a library for efficient similarity search and clustering of dense vectors. It supports building a Retrieval-Augmented Generation (RAG) system by providing methods to add documents to a FAISS index and to query the index for similar documents. Custom embedding models, preprocessing and postprocessing functions, the index type, the distance metric, and the logger configuration can all be supplied at initialization.
# Initialize the FAISSDB instance
db = FAISSDB(dimension=768, index_type="Flat")

# Add documents to the FAISS index
db.add("This is a document about AI.", {"category": "AI"})
db.add("Python is great for data science.", {"category": "Programming"})

# Query the FAISS index
results = db.query("Tell me about AI")
for result in results:
    print(f"Score: {result['score']}, Text: {result['metadata']['text']}")
from typing import Any, Dict, List

import torch
from transformers import AutoModel, AutoTokenizer

# Load the tokenizer and model once, rather than on every embedding call
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")


# Custom embedding function using a HuggingFace model
def custom_embedding_function(text: str) -> List[float]:
    inputs = tokenizer(
        text, return_tensors="pt", padding=True, truncation=True, max_length=512
    )
    with torch.no_grad():
        outputs = model(**inputs)
    # Mean-pool the token embeddings into a single 768-dimensional vector
    embeddings = outputs.last_hidden_state.mean(dim=1).squeeze().tolist()
    return embeddings


# Custom preprocessing function
def custom_preprocess(text: str) -> str:
    return text.lower().strip()


# Custom postprocessing function
def custom_postprocess(results: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    for result in results:
        result["custom_score"] = result["score"] * 2  # Example modification
    return results


# Initialize the FAISSDB instance with custom functions
db = FAISSDB(
    dimension=768,
    index_type="Flat",
    embedding_function=custom_embedding_function,
    preprocess_function=custom_preprocess,
    postprocess_function=custom_postprocess,
    metric="cosine",
    logger_config={
        "handlers": [
            {"sink": "custom_faiss_rag_wrapper.log", "rotation": "1 GB"},
            {"sink": lambda msg: print(f"Custom log: {msg}", end="")},
        ],
    },
)

# Add documents to the FAISS index
db.add("This is a document about machine learning.", {"category": "ML"})
db.add("Python is a versatile programming language.", {"category": "Programming"})

# Query the FAISS index
results = db.query("Explain machine learning")
for result in results:
    print(
        f"Score: {result['score']}, Custom Score: {result['custom_score']}, "
        f"Text: {result['metadata']['text']}"
    )
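The `metric="cosine"` option in the example above is not described further in this section. As background, cosine similarity equals the inner product of two L2-normalized vectors, which is why cosine search is commonly set up as an inner-product search over normalized embeddings. The sketch below illustrates that identity using only the standard library. Whether FAISSDB normalizes vectors internally is an assumption, and the helper names `l2_normalize`, `inner_product`, and `cosine_similarity` are hypothetical and not part of the FAISSDB API.

```python
import math
from typing import List


def l2_normalize(vec: List[float]) -> List[float]:
    """Scale a vector to unit length (hypothetical helper, not part of FAISSDB)."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def inner_product(a: List[float], b: List[float]) -> float:
    """Plain dot product of two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("dimension mismatch")
    return sum(x * y for x, y in zip(a, b))


def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Cosine similarity computed as the inner product of normalized vectors."""
    return inner_product(l2_normalize(a), l2_normalize(b))


# Vectors pointing in the same direction score close to 1,
# orthogonal vectors score close to 0, regardless of magnitude.
same_direction = cosine_similarity([3.0, 4.0], [6.0, 8.0])
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 5.0])
```

One consequence of this identity is that vector magnitude has no effect on a cosine score, so embeddings of very different norms remain directly comparable.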
- Ensure that the dimension of the document embeddings matches the dimension specified during the initialization of the FAISSDB instance.
- Use custom embedding functions to leverage domain-specific models for generating embeddings.
- Custom preprocessing and postprocessing functions can help tailor the text processing and result formatting to specific needs.
- FAISS supports various types of indices; choose the one that best fits the application requirements (e.g., Flat for brute-force search, IVF for faster search with some accuracy trade-off).
- Properly configure the logger to monitor and debug the operations of the FAISSDB instance.
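
The first note above, about matching embedding dimensions, can be enforced before a vector reaches the index. The following is a minimal sketch, assuming the embedding function has the `str -> List[float]` shape used in the custom example. `checked_embedding` and `toy_embedding` are hypothetical names introduced for illustration and are not part of FAISSDB.

```python
from typing import Callable, List


def checked_embedding(
    embed: Callable[[str], List[float]], dimension: int
) -> Callable[[str], List[float]]:
    """Wrap an embedding function so a wrong-sized vector fails early.

    Hypothetical helper, not part of the FAISSDB API.
    """

    def wrapper(text: str) -> List[float]:
        vector = embed(text)
        if len(vector) != dimension:
            raise ValueError(
                f"embedding has {len(vector)} dimensions, expected {dimension}"
            )
        return vector

    return wrapper


def toy_embedding(text: str) -> List[float]:
    """Stand-in embedding for demonstration only: a 4-dimensional dummy vector."""
    return [float(len(text)), 0.0, 0.0, 0.0]


# The wrapper passes through vectors of the expected size unchanged.
safe_embed = checked_embedding(toy_embedding, dimension=4)
vector = safe_embed("hello")

# A mismatched dimension raises a descriptive error instead of failing later.
mismatched = checked_embedding(toy_embedding, dimension=768)
try:
    mismatched("hello")
    error_message = ""
except ValueError as exc:
    error_message = str(exc)
```

A function wrapped this way could be passed as `embedding_function`, using the same `dimension` value given to the FAISSDB constructor, so that a mismatch surfaces with a clear message.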
By following this documentation, users can effectively utilize the FAISSDB class for various similarity search and document retrieval tasks, customizing it to their specific needs through the provided hooks and functions.