Vector Search Using OpenAI Embeddings With Qdrant

Created using Kandinsky

In my experience, Qdrant is one of the simplest vector databases to work with as a developer.

And it's open source!

In this article, I will show how to perform basic CRUD operations with the Qdrant vector database.

Let's get started!

Installing Qdrant locally requires Docker and is fairly straightforward.

Installation:

docker pull qdrant/qdrant
docker run -p 6333:6333 \
    -v $(pwd)/qdrant_storage:/qdrant/storage \
    qdrant/qdrant

Once Qdrant is up and running, we need to install the Qdrant Python client.

pip install qdrant-client

That's it!

We're all set to use Qdrant.

One last thing we will need is an OpenAI developer account. We will be using the OpenAI embeddings endpoint to create the vector embeddings that we will store in Qdrant.

Make sure you have your OpenAI API key handy.
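
If you would rather not paste the key into your source file, one option is to read it from an environment variable. The snippet below is a minimal sketch of that idea; the helper name load_openai_api_key and the variable name OPENAI_API_KEY are my own choices for illustration, not something Qdrant or this article requires.

```python
import os


def load_openai_api_key(env_var: str = "OPENAI_API_KEY") -> str:
    """Return the API key stored in the given environment variable.

    The helper name and the default variable name are assumptions
    made for this example.
    """
    api_key = os.environ.get(env_var, "").strip()
    if not api_key:
        # Fail early with a clear message instead of a confusing API error later.
        raise RuntimeError(f"Set the {env_var} environment variable first.")
    return api_key
```

With this in place, the line that sets openai.api_key could read openai.api_key = load_openai_api_key() instead of containing the key itself.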


Code

Importing packages:

import uuid
from qdrant_client import QdrantClient
from qdrant_client.http.models import Distance, VectorParams
from qdrant_client.http.models import PointStruct, CollectionStatus, UpdateStatus
from qdrant_client.http.models import Filter, FieldCondition, MatchValue
from qdrant_client.http import models
from typing import List

import openai
# Note: openai.embeddings_utils ships only with pre-1.0 releases of the openai package.
from openai.embeddings_utils import get_embedding

openai.api_key = "YOUR-OPENAI-API-KEY"

This might seem like a lot of imports, but we're going to need them for performing CRUD operations with the database.

To keep things organized, we'll create a Python class to hold all of our functions related to Qdrant.

class QdrantVectorStore:

    def __init__(self):
        pass

Step 1: Connecting to Qdrant

Let's add some code to our init method that allows us to connect to Qdrant and set up a collection.

class QdrantVectorStore:

    def __init__(self,
                 host: str = "localhost",
                 port: int = 6333,
                 collection_name: str = "test_collection",
                 vector_size: int = 1536,
                 vector_distance: Distance = Distance.COSINE
                 ):

        # Connect to the Qdrant server we started with Docker.
        self.client = QdrantClient(host=host, port=port)
        self.collection_name = collection_name

        try:
            self.client.get_collection(collection_name=collection_name)
        except Exception:
            print("Collection does not exist, creating collection now")
            self.set_up_collection(collection_name, vector_size, vector_distance)

    def set_up_collection(self, collection_name: str, vector_size: int, vector_distance: Distance):

        # Create an empty collection with the given vector size and distance metric.
        self.client.create_collection(
            collection_name=collection_name,
            vectors_config=VectorParams(size=vector_size, distance=vector_distance)
        )

The host and port you see above are the default values Qdrant launches with.

Note that the client is not given a path to the data. If you go to the directory where you ran the docker run command earlier, you should see a directory called qdrant_storage, which was created automatically. The server persists your data there through the volume mount, so the client only needs the host and port. QdrantClient also accepts a path argument, but that selects a local mode that runs without a server, and it should not be combined with a host or URL.

The collection_name can be any string value you like; by default it's set to test_collection.

The vector_size and vector_distance parameters are required by Qdrant when creating a collection. OpenAI's text-embedding-ada-002 model, which we use below, returns embeddings with 1536 dimensions, so vector_size is 1536. The vector_distance parameter tells Qdrant which distance metric to use when comparing two vectors; here we use cosine similarity.
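
To get a feel for what a distance metric actually measures, here is a small pure-Python sketch. It is not how Qdrant computes distances internally; the function names and the toy three-dimensional vectors are made up for illustration. It compares cosine similarity (which corresponds to Distance.COSINE) with Euclidean distance (Distance.EUCLID).

```python
import math
from typing import Sequence


def dot_product(a: Sequence[float], b: Sequence[float]) -> float:
    """Sum of element-wise products of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between a and b; ignores vector length."""
    norm_a = math.sqrt(dot_product(a, a))
    norm_b = math.sqrt(dot_product(b, b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot_product(a, b) / (norm_a * norm_b)


def euclidean_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Straight-line distance between a and b; sensitive to vector length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


a = [1.0, 2.0, 3.0]
b = [2.0, 4.0, 6.0]   # same direction as a, twice as long
c = [-3.0, 0.0, 1.0]  # perpendicular to a (dot product is 0)

print(cosine_similarity(a, b))   # close to 1: same direction
print(cosine_similarity(a, c))   # close to 0: unrelated directions
print(euclidean_distance(a, b))  # non-zero, because the lengths differ
```

Cosine similarity treats a and b as identical in direction even though b is longer, which is why it is a common choice for comparing text embeddings.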

Once we've successfully connected to Qdrant, we need to create a collection if it does not already exist. A collection essentially works as a group to hold our vectors. The set_up_collection function is a helper function that creates the collection for us.

If you want to run the code above, add the following below the class definition.

class QdrantVectorStore:
    # __init__ and set_up_collection as defined above
    ...


if __name__ == "__main__":
    vector_db = QdrantVectorStore()

Step 2: Inserting Data

Now that we have a way to connect to Qdrant and have created a collection, we can begin inserting data into Qdrant.

The data we will be inserting is a series of quotes spoken by famous people.

Here is an example of that data:

[
   {"quote": "A rose by any other name would smell as sweet.", "person": "William Shakespeare"},
   {"quote": "All that glitters is not gold.", "person": "William Shakespeare"},
   {"quote": "Ask not what your country can do for you; ask what you can do for your country.", "person": "John Kennedy"},
   {"quote": "Genius is one percent inspiration and ninety-nine percent perspiration.", "person": "Thomas Edison"},
   {"quote": "He travels the fastest who travels alone.", "person": "Rudyard Kipling"},
   {"quote": "Houston, we have a problem.", "person": "Jim Lovell"},
   {"quote": "That’s one small step for a man, a giant leap for mankind.", "person": "Neil Armstrong"}
]

Code for inserting the quotes data:

    def upsert_data(self, data: List[dict]):
        points = []
        for item in data:
            quote = item.get("quote")
            person = item.get("person")

            # Embed the quote text; the payload keeps the original text and speaker.
            text_vector = get_embedding(quote, engine="text-embedding-ada-002")
            text_id = str(uuid.uuid4())
            payload = {"quote": quote, "person": person}
            point = PointStruct(id=text_id, vector=text_vector, payload=payload)
            points.append(point)

        # wait=True blocks until the server has applied the changes.
        operation_info = self.client.upsert(
            collection_name=self.collection_name,
            wait=True,
            points=points)

        if operation_info.status == UpdateStatus.COMPLETED:
            print("Data inserted successfully!")
        else:
            print("Failed to insert data")

Alright, let's break down what's happening in the code above.

The data parameter is a list of dictionaries as you have seen from the example data above.

The for loop runs over your data and does the following for each item:

  1. Extracts the quote and the person who said it
  2. Converts the quote into a vector embedding using the OpenAI embeddings endpoint
  3. Creates a random ID for the item
  4. Builds the payload, which holds any metadata you want to save along with the vector
  5. Creates a PointStruct from the ID, vector, and payload
  6. Appends the point to a list called points

Points are the mechanism that Qdrant uses to store and retrieve data. Points not only contain the vector embedding itself but also any additional metadata you want to include.
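
If you want to see the shape of a point without calling OpenAI or Qdrant, the sketch below builds plain dictionaries with the same three fields a PointStruct carries (id, vector, payload). The fake_embedding function is a made-up stand-in that hashes the text into a few numbers; it has no semantic meaning and is only there so the example runs offline.

```python
import hashlib
import uuid
from typing import Dict, List


def fake_embedding(text: str, size: int = 8) -> List[float]:
    """Hypothetical stand-in for the OpenAI call: hash the text into `size` floats in [0, 1]."""
    digest = hashlib.sha256(text.encode("utf-8")).digest()
    return [byte / 255 for byte in digest[:size]]


def build_points(data: List[Dict[str, str]]) -> List[dict]:
    """Turn quote records into point-like dictionaries (id, vector, payload)."""
    points = []
    for item in data:
        points.append({
            "id": str(uuid.uuid4()),                   # random ID, as in upsert_data
            "vector": fake_embedding(item["quote"]),   # placeholder for the real embedding
            "payload": {"quote": item["quote"], "person": item["person"]},
        })
    return points


quotes = [
    {"quote": "All that glitters is not gold.", "person": "William Shakespeare"},
    {"quote": "Houston, we have a problem.", "person": "Jim Lovell"},
]
points = build_points(quotes)
for point in points:
    print(point["id"], len(point["vector"]), point["payload"]["person"])
```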

The self.client.upsert function will insert our data into Qdrant once we have our data converted into a list of Points.
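
Seven quotes fit comfortably in a single upsert call. For a much larger dataset you may want to send the points in smaller batches instead. The helper below is a minimal sketch of that; the name chunked and the batch size are my own assumptions, not Qdrant requirements.

```python
from typing import Iterator, List, Sequence, TypeVar

T = TypeVar("T")


def chunked(items: Sequence[T], batch_size: int) -> Iterator[List[T]]:
    """Yield consecutive slices of `items`, each at most `batch_size` long."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield list(items[start:start + batch_size])


batches = list(chunked(list(range(7)), batch_size=3))
print(batches)  # [[0, 1, 2], [3, 4, 5], [6]]
```

Inside upsert_data, you could then loop over chunked(points, batch_size) and call self.client.upsert once per batch.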

You can insert the data using the code below:

if __name__ == "__main__":
    vector_db = QdrantVectorStore()
    famous_quotes = [
        {"quote": "A rose by any other name would smell as sweet.", "person": "William Shakespeare"},
        {"quote": "All that glitters is not gold.", "person": "William Shakespeare"},
        {"quote": "Ask not what your country can do for you; ask what you can do for your country.", "person": "John Kennedy"},
        {"quote": "Genius is one percent inspiration and ninety-nine percent perspiration.", "person": "Thomas Edison"},
        {"quote": "He travels the fastest who travels alone.", "person": "Rudyard Kipling"},
        {"quote": "Houston, we have a problem.", "person": "Jim Lovell"},
        {"quote": "That’s one small step for a man, a giant leap for mankind.", "person": "Neil Armstrong"}
    ]

    vector_db.upsert_data(famous_quotes)

Sweet! Once our data is uploaded into Qdrant we can start performing vector search.

Step 3: Vector Search

To perform vector search we need to compare an input vector with vectors in our Qdrant database. Our input vector will come from text a user has typed in.

We will take the user's text, convert it into a vector embedding and then find the closest vectors in Qdrant that match our input text vector.

    def search(self, input_query: str, limit: int = 3):
        # Embed the query with the same model used for the stored quotes.
        input_vector = get_embedding(input_query, engine="text-embedding-ada-002")
        search_result = self.client.search(
            collection_name=self.collection_name,
            query_vector=input_vector,
            limit=limit
        )

        # Flatten each scored point into a plain dictionary.
        result = []
        for item in search_result:
            similarity_score = item.score
            payload = item.payload
            data = {"id": item.id, "similarity_score": similarity_score, "quote": payload.get("quote"),
                    "person": payload.get("person")}
            result.append(data)

        return result

In the function above, input_query is the raw text the user has typed. The input_vector is the embedding of that text, which lets us compute distances to the stored vectors.

The limit is the maximum number of results to return, starting with the vectors closest to the input vector.

The Qdrant self.client.search function returns a list of results that we can loop over to extract the necessary information. In this case, we are extracting the quote, person, and similarity score.
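
Conceptually, a vector search scores every stored vector against the input vector and keeps the best few. The sketch below does exactly that by brute force in pure Python, using toy two-dimensional vectors and made-up IDs. Qdrant itself uses an index (HNSW) rather than scanning every vector, so treat this only as a picture of the result you get back, not of how Qdrant finds it.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def brute_force_search(query_vector: Sequence[float],
                       stored: Dict[str, List[float]],
                       limit: int = 3) -> List[Tuple[str, float]]:
    """Score every stored vector against the query and return the top `limit`."""
    scored = [(point_id, cosine_similarity(query_vector, vector))
              for point_id, vector in stored.items()]
    # Highest similarity first, mirroring how search results are ordered.
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:limit]


# Toy "database": four directions on a 2D plane (IDs are illustrative).
stored = {
    "north": [0.0, 1.0],
    "north_east": [1.0, 1.0],
    "east": [1.0, 0.0],
    "south": [0.0, -1.0],
}

results = brute_force_search([0.1, 1.0], stored, limit=2)
for point_id, score in results:
    print(point_id, round(score, 3))
```

The query points almost straight north, so "north" ranks first and "north_east" second; "south" scores lowest and is cut off by the limit.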

You can run the code like so:

if __name__ == "__main__":
    vector_db = QdrantVectorStore()
    result = vector_db.search("gold rush")
    print(result)

Step 4: Vector Search With Filters

In the example above we performed vector search by creating a vector from the input text a user passes in and comparing it to existing vectors in the database.

We can filter these results further based on a specific key-value pair in the payload.

    def search_with_filter(self, input_query: str, filter: dict, limit: int = 3):
        input_vector = get_embedding(input_query, engine="text-embedding-ada-002")
        filter_list = []
        for key, value in filter.items():
            filter_list.append(
                FieldCondition(key=key, match=MatchValue(value=value))
            )

        search_result = self.client.search(
            collection_name=self.collection_name,
            query_vector=input_vector,
            query_filter=Filter(
                must=filter_list
            ),
            limit=limit
        )

        result = []
        for item in search_result:
            similarity_score = item.score
            payload = item.payload
            data = {"id": item.id, "similarity_score": similarity_score, "quote": payload.get("quote"),
                    "person": payload.get("person")}
            result.append(data)

        return result

Suppose we only want quotes by William Shakespeare.

if __name__ == "__main__":
    vector_db = QdrantVectorStore()
    results = vector_db.search_with_filter("gold rush", filter={"person": "William Shakespeare"})
    print(results)

Pretty cool, right?
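
Because search_with_filter puts every condition into the must list, a point is only returned when all of the key-value pairs match its payload. The pure-Python sketch below models just that matching rule on plain dictionaries; the helper names are made up, and in practice Qdrant evaluates the conditions on the server as part of the search.

```python
from typing import Any, Dict, List


def matches_all(payload: Dict[str, Any], conditions: Dict[str, Any]) -> bool:
    """True when every key in `conditions` exists in the payload with an equal value."""
    return all(key in payload and payload[key] == value
               for key, value in conditions.items())


def filter_payloads(payloads: List[Dict[str, Any]],
                    conditions: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Keep only the payloads that satisfy all conditions."""
    return [payload for payload in payloads if matches_all(payload, conditions)]


payloads = [
    {"quote": "All that glitters is not gold.", "person": "William Shakespeare"},
    {"quote": "Houston, we have a problem.", "person": "Jim Lovell"},
    {"quote": "A rose by any other name would smell as sweet.", "person": "William Shakespeare"},
]

shakespeare = filter_payloads(payloads, {"person": "William Shakespeare"})
for payload in shakespeare:
    print(payload["quote"])
```

Adding a second key to the dictionary narrows the results further, since both conditions then have to hold.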

Step 5: Delete Vectors In Qdrant

We've added some vectors, searched through them, and even learned to use a filter.

But how do we delete vectors?

    def delete(self, text_ids: list):
        self.client.delete(
            collection_name=self.collection_name,
            points_selector=models.PointIdsList(
                points=text_ids,
            )
        )

If you look back at Step 2, where we inserted vectors, we had a variable called text_id, which was the ID of each point in our vector database. We can delete vectors easily by passing a list of those IDs into the Qdrant delete function.

if __name__ == "__main__":
    vector_db = QdrantVectorStore()
    vector_db.delete(text_ids=["ea44cfaa-c14e-40d8-8f99-931e564d8539"])
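
One catch is that upsert_data generates random IDs with uuid.uuid4() and never returns them, so you only learn an ID by reading it from a search result. An alternative, sketched below, is to derive the ID from the quote text with uuid.uuid5, so the same quote always maps to the same ID. The namespace string and the quote_id helper are assumptions I made for this example.

```python
import uuid

# Hypothetical namespace for this project; any fixed UUID works.
QUOTE_NAMESPACE = uuid.uuid5(uuid.NAMESPACE_DNS, "quotes.example")


def quote_id(quote: str) -> str:
    """Derive a stable UUID string from the quote text."""
    return str(uuid.uuid5(QUOTE_NAMESPACE, quote))


first = quote_id("All that glitters is not gold.")
second = quote_id("All that glitters is not gold.")
other = quote_id("Houston, we have a problem.")

print(first == second)  # True: same text, same ID
print(first == other)   # False: different text, different ID
```

If upsert_data used quote_id(quote) in place of str(uuid.uuid4()), running the insert twice would overwrite the existing points instead of adding duplicates, and you could delete a quote by recomputing its ID from the text.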

Phew!

I know this post was kinda long, but I hope it helps you get started with Qdrant. The information covered here is just the beginning and there are far more advanced topics you can delve into.

Check out the Qdrant documentation if you want to learn more.

Thanks for reading!