Adding pgvector to Django
In this post, I walk through setting up pgvector in Django, working with dense and sparse vectors, and performing similarity queries, along with an important warning about sparse vector indexing that can break your results.
Introduction
Working with large datasets can be hard on databases, especially when you need to compare items to find which ones are most similar. Recently, while building a tool to compare data that has many dimensions, I realized we needed a way to do similarity searches right inside our main database.
This is where pgvector comes in. If you are a developer looking to add vector search to a Django project, this post will walk you through the setup, how to use it, and a tricky detail about sparse vectors that can easily cause bugs.
What is pgvector and Why Use It?
In the past, if you wanted to find similar items based on data points, you usually had to send your data to a separate, special database. This meant you had to manage two databases and try to keep them in sync.
pgvector is a tool (an extension) that turns PostgreSQL into a vector database. It lets you store your vectors (arrays of numbers) right next to your normal data and search them easily. This keeps your setup simple.
Setting Up pgvector in Django
First, make sure your PostgreSQL database has the extension installed. Then, install the Python packages:
pip install pgvector psycopg
A quick note on the package: The pgvector package is maintained by the same developers who created the PostgreSQL extension. This makes it the most reliable choice for this work.
Next, you need to turn on the extension in your database using a Django migration. Create an empty migration:
python manage.py makemigrations --empty your_app_name
Then add the VectorExtension() operation, which ships with the pgvector package, to your migration file like this:
from django.db import migrations
from pgvector.django import VectorExtension
class Migration(migrations.Migration):
    dependencies = [
        ('your_app_name', '0001_initial'),
    ]
    operations = [
        VectorExtension(),
    ]
Alternatively, you can use a custom SQL command with IF NOT EXISTS. Open the migration file and use this code instead:
from django.db import migrations
class Migration(migrations.Migration):
    dependencies = [
        ('your_app_name', '0001_initial'),
    ]
    operations = [
        migrations.RunSQL(
            sql="CREATE EXTENSION IF NOT EXISTS vector;",
            reverse_sql="DROP EXTENSION IF EXISTS vector;"
        ),
    ]
This tells the database to only create the extension if it isn’t already there.
Normal Vectors (Dense Vectors): Inserting Data
For normal use cases, where your array is full of numbers, you use VectorField.
Here is how you set up the model:
from django.db import models
from pgvector.django import VectorField
class Document(models.Model):
    title = models.CharField(max_length=255)
    embedding = VectorField(dimensions=4)
To save data, you just pass a standard Python list:
Document.objects.create(
    title="My First Doc",
    embedding=[0.5, 0.2, 0.9, 0.3]
)
Sparse Vectors: When Most of Your Data is Zero
For our project, normal vectors didn’t make sense. Our data had thousands of dimensions, but almost all the values were zero. Storing an array of 10,000 numbers where 9,950 of them are zero takes up too much memory and space.
This is where SparseVectorField is very useful. It only saves the numbers that are not zero, and remembers their position.
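To put rough numbers on the savings, here is a small sketch based on the storage formulas stated in the pgvector README (4 bytes per dimension plus 8 for a dense vector, 8 bytes per non-zero element plus 16 for a sparse vector). The function names are hypothetical helpers for illustration only, and the real on-disk size also includes normal row overhead.

```python
def dense_vector_bytes(dimensions: int) -> int:
    """Approximate storage for one dense pgvector value."""
    return 4 * dimensions + 8


def sparse_vector_bytes(non_zero_elements: int) -> int:
    """Approximate storage for one sparse pgvector value."""
    return 8 * non_zero_elements + 16


# The example from the text: 10,000 dimensions, only 50 non-zero values.
dense = dense_vector_bytes(10_000)
sparse = sparse_vector_bytes(50)
print(dense, sparse)  # 40008 416
```

Under these assumptions, the sparse form is roughly a hundred times smaller per row.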
Here is the model:
from pgvector.django import SparseVectorField
class ScienceData(models.Model):
    name = models.CharField(max_length=100)
    features = SparseVectorField(dimensions=10000)
A plain list or dictionary will not work here. You must wrap your dictionary in a SparseVector object before saving it, passing both the dictionary and the total number of dimensions.
from pgvector.utils import SparseVector
ScienceData.objects.create(
    name="Sample A",
    # Pass the dictionary and the total dimension count (10000)
    features=SparseVector({0: 5.5, 42: 1.2}, 10000)
)
In the call SparseVector({0: 5.5, 42: 1.2}, 10000), the first argument is a dictionary. The keys (0 and 42) are the zero-indexed positions of your non-zero values, and the values (5.5 and 1.2) are the data at those positions. Any position not listed is treated as exactly zero. The second argument (10000) is the vector's total length. It is required so the database can align and compare this record against other 10,000-dimension vectors during similarity searches.
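If your source data arrives as a full, mostly-zero list, you need to build that dictionary yourself. Below is a minimal sketch in plain Python. The helper name to_sparse_parts is hypothetical and is not part of the pgvector package.

```python
def to_sparse_parts(dense_values):
    """Return ({index: value}, dimensions) for a mostly-zero sequence.

    Keys are 0-indexed positions, matching what SparseVector expects.
    """
    non_zero = {i: v for i, v in enumerate(dense_values) if v != 0}
    return non_zero, len(dense_values)


row = [5.5, 0, 0, 0, 1.2, 0]
non_zero, dimensions = to_sparse_parts(row)
print(non_zero, dimensions)  # {0: 5.5, 4: 1.2} 6
```

The two returned values are what you would then pass along as SparseVector(non_zero, dimensions).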
Querying: Ordering and Annotations
Whether you are using dense vectors or sparse vectors, the way you search is the same. I will use sparse vectors for these examples.
1. Just Ordering (Simple Case)
If you only care about getting the closest matches and don’t need to see the actual score, you can just order by the distance.
from pgvector.django import CosineDistance
from pgvector.utils import SparseVector
target = SparseVector({0: 5.1, 42: 1.0}, 10000)
ordered_items = ScienceData.objects.order_by(
    CosineDistance('features', target)
)[:10]
2. Just Annotating (To see the score)
If you want to know exactly how similar the items are, you can use .annotate(). This calculates the distance and attaches it to each result.
annotated_items = ScienceData.objects.annotate(
    distance=CosineDistance('features', target)
)
3. Complex: Annotate, Filter, and Order
Often, you want to do all three: see the score, filter out bad matches, and order the results.
similar_items = ScienceData.objects.annotate(
    distance=CosineDistance('features', target)
).filter(
    distance__lt=0.2  # Only keep items with a distance less than 0.2
).order_by('distance')  # Order from closest to furthest
Choosing the Right Distance Type
pgvector supports several ways to calculate how similar two vectors are. You should choose the one that fits your data:
- CosineDistance: Measures the angle between two vectors. It is great for text embeddings or data where the overall pattern shape matters more than the exact size of the numbers.
- L2Distance (Euclidean): Measures the straight-line distance between two points. Use this when the exact size or magnitude of your numbers is important.
- MaxInnerProduct: Multiplies the matching numbers together and sums them. pgvector returns the negative of this value, so ordering from smallest to largest puts the largest inner product first. This is mostly used for recommendation engines where the vectors are already normalized (scaled to a length of 1).
- L1Distance (Taxicab): Measures distance by only moving in right angles (like a taxi driving along city blocks).
- HammingDistance and JaccardDistance: These are best used when your data is binary (just 0s and 1s). Hamming counts the positions where the bits differ, and Jaccard compares how much the set bits overlap.
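To make the differences concrete, here is a plain-Python sketch of what four of these measures compute. These functions are illustrative stand-ins written for this post, not the pgvector implementation, and they assume two equal-length, non-zero lists.

```python
import math


def l2_distance(a, b):
    """Straight-line (Euclidean) distance."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def l1_distance(a, b):
    """Taxicab distance: sum of absolute differences."""
    return sum(abs(x - y) for x, y in zip(a, b))


def negative_inner_product(a, b):
    """Negated dot product, so smaller means more similar."""
    return -sum(x * y for x, y in zip(a, b))


def cosine_distance(a, b):
    """1 minus the cosine of the angle between a and b."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1 - dot / (norm_a * norm_b)


# b points in the same direction as a but is twice as long.
a = [3.0, 4.0]
b = [6.0, 8.0]
print(cosine_distance(a, b))         # 0.0: same direction
print(l2_distance(a, b))             # 5.0: magnitude matters
print(l1_distance(a, b))             # 7.0
print(negative_inner_product(a, b))  # -50.0
```

The two vectors are identical by cosine distance but clearly apart by L2 and L1, which is the practical difference between "pattern shape" and "exact size".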
Exact vs. Approximate Search
You might read online that you should add an HNSW or IVFFlat index to your vector table to make searches faster. These are Approximate Nearest Neighbor (ANN) indexes. They organize the vectors ahead of time (IVFFlat into clusters, HNSW into a graph) so the database does not have to check every row.
While they make searches very fast, they give you an approximate answer and might miss the true most similar item. Our project needed exactly correct answers, so we did not use these indexes. Without an index, PostgreSQL performs an exact search: it computes the distance for every single row in the table.
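Conceptually, an exact search is just "score every row, then keep the best k". The sketch below illustrates that idea in plain Python. It is not how PostgreSQL implements it, and the function and variable names are made up for this example.

```python
import heapq
import math


def cosine_distance(a, b):
    """1 minus the cosine of the angle between a and b."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1 - dot / (norm_a * norm_b)


def exact_nearest(rows, target, k):
    """Score every (name, vector) row, so no candidate can be missed."""
    scored = ((cosine_distance(vector, target), name) for name, vector in rows)
    return heapq.nsmallest(k, scored)


rows = [
    ("A", [1.0, 0.0]),
    ("B", [0.0, 1.0]),
    ("C", [1.0, 1.0]),
]
closest = exact_nearest(rows, [1.0, 0.1], k=2)
names = [name for _, name in closest]
print(names)  # ['A', 'C']
```

The cost grows with the number of rows, which is the trade-off you accept in exchange for guaranteed-correct results.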
0-Indexing vs. 1-Indexing
If you use sparse vectors, you need to know this: PostgreSQL arrays start at 1, but Python starts at 0.
In the actual database table, pgvector uses 1-indexing. However, the Python package expects a 0-indexed dictionary inside the SparseVector object. When you insert data or query data in Django, the Python code automatically adds 1 to your keys before sending them to the database.
To visualize what we mean, look at how the exact same data looks in Python compared to how it looks saved inside the PostgreSQL database (assuming 5 total dimensions):
In Python (Django expects this):
# The keys start at 0
my_features = SparseVector({0: 1, 2: 2, 4: 3}, 5)
In the PostgreSQL Database (pgvector saves this format):
-- The keys are shifted up by 1
'{1:1,3:2,5:3}/5'
Why this matters: If you are reading data from a file or manually converting the data, and you parse it without taking this into account, all of your numbers will be in the wrong spots. The code will not crash, but your similarity search will give you completely wrong results. Always make sure your dictionary keys start at 0 before wrapping them in a SparseVector!
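If you ever have to handle the database's text format by hand, the shift has to be applied explicitly. Here is a minimal sketch, assuming input in the '{index:value,...}/dimensions' form shown above. Both function names are hypothetical helpers, not part of the pgvector package.

```python
def parse_sparsevec_text(text):
    """Parse e.g. '{1:1,3:2,5:3}/5' into a 0-indexed dict and a dimension count."""
    body, dimensions = text.rsplit("/", 1)
    entries = body.strip()[1:-1]  # drop the surrounding braces
    elements = {}
    for entry in filter(None, entries.split(",")):
        index, value = entry.split(":")
        elements[int(index) - 1] = float(value)  # 1-indexed -> 0-indexed
    return elements, int(dimensions)


def format_sparsevec_text(elements, dimensions):
    """Inverse of the parser: shift 0-indexed keys back up by 1."""
    # ':g' is compact but keeps only 6 significant digits; fine for a demo.
    pairs = ",".join(f"{i + 1}:{v:g}" for i, v in sorted(elements.items()))
    return "{" + pairs + "}/" + str(dimensions)


elements, dimensions = parse_sparsevec_text("{1:1,3:2,5:3}/5")
round_trip = format_sparsevec_text(elements, dimensions)
print(elements, dimensions)  # {0: 1.0, 2: 2.0, 4: 3.0} 5
print(round_trip)            # {1:1,3:2,5:3}/5
```

The parsed dictionary has keys 0, 2, and 4, which is the form SparseVector expects. Skipping the "- 1" step would silently move every value one position to the right.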
Conclusion
By keeping your vector search inside PostgreSQL, you keep your system simple while getting exactly the results you need. Just remember to watch out for those zero-indexes!