Doing on-device retrieval augmented generation with Ollama and SQLite
Using Ollama and SQLite for on-device retrieval augmented generation, building a simple movie recommendation system
Doing RAG with Postgres is a very popular approach, but it requires some infrastructure to get running: namely, a Postgres server and a compute instance to run it on.
This is mostly a solved problem if you can run a docker container:
docker run -e POSTGRES_USER="postgres" -e POSTGRES_PASSWORD="postgres" -p 5432:5432 -v ~/postgres-data:/var/lib/postgresql/data pgvector/pgvector:pg16
But if you don't want to run a Postgres server, you can use SQLite.
The advantage of SQLite is that it's a single file that you can easily copy around, and storing vectors and querying them is the only real requirement for basic RAG. It's also a great way to get started with RAG if you don't want to deal with server setup.
Setting up
sqlite
To start this process off, we'll install the sqlite CLI, sqlite-utils, and the sqlite-vec extension. sqlite-utils lets us manage extensions, and sqlite-vec is the extension that lets us store and query vectors in SQLite.
brew install sqlite sqlite-utils
sqlite-utils install sqlite-utils-sqlite-vec
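To confirm the extension installed correctly, you can ask for its version. sqlite-vec exposes a vec_version() function, and sqlite-utils memory runs SQL against a throwaway in-memory database:
sqlite-utils memory "SELECT vec_version();"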
This will give you a sqlite3 command that you can use to interact with a SQLite database. So, let's create a database called rag.db and an embeddings table.
ollama
Ollama is a lightweight, open-source, and easy-to-use way to run LLMs in your local environment. If you haven't already, you'll need to install Ollama.
We're going to use the mxbai-embed-large model for our embeddings, so let's pull that model down to our local Ollama instance.
ollama pull mxbai-embed-large
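To confirm the pull succeeded, ollama list shows every model available locally; mxbai-embed-large should appear in the output:
ollama list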
Creating the database
Since we're using the sqlite-vec extension, we need to create a virtual table to store our embeddings. A virtual table looks like a normal table, but it's backed by an extension that provides the actual storage and query capabilities. In our case, the vec0 module from sqlite-vec provides the storage and query capabilities for our embeddings.
The following command creates a virtual table called embeddings in our rag.db database.
sqlite-utils rag.db "CREATE VIRTUAL TABLE embeddings USING vec0(
document TEXT PRIMARY KEY,
embedding FLOAT[1024]
);"
FLOAT[1024] is the data type for a 1024-dimensional float32 vector. Where does the 1024 come from? It's the dimension of the vector space we're embedding our documents into, and it's fixed by the embedding model we're going to use: mxbai-embed-large.
Ollama provides a simple API for us to call to get embeddings for a given text.
curl http://localhost:11434/api/embeddings -d '{
"model": "mxbai-embed-large",
"prompt": "Hello, world!"
}'
If you run this, you'll get a response that looks like this:
{
"embedding": [0.30582195520401,0.7939208745956421,0.009617628529667854 ... ] // 1024 elements
}
This is a 1024-dimensional vector that represents the input text.
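You can verify the dimension yourself by counting the elements with jq; it should match the FLOAT[1024] column we declared earlier:
curl -s http://localhost:11434/api/embeddings -d '{ "model": "mxbai-embed-large", "prompt": "Hello, world!" }' | jq '.embedding | length'
This should print 1024.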
Loading the database with documents
We have a few dozen movies in the data.json file, which we can generate embeddings for and load into our database.
To do that, we'll curl the data file, split out each movie with jq, and send each one to our embedding model.
curl -s https://raw.githubusercontent.com/inferablehq/sqlite-ollama-rag/refs/heads/main/data.json | jq -c '.movies | .[]' | while read -r movie; do
  TITLE=$(echo "$movie" | jq -r '.title')
  SYNOPSIS=$(echo "$movie" | jq -r '.synopsis')
  YEAR=$(echo "$movie" | jq -r '.year')
  MOVIE="$TITLE($YEAR): $SYNOPSIS"
  EMBEDDING=$(curl -s http://localhost:11434/api/embeddings -d "{ \"model\": \"mxbai-embed-large\", \"prompt\": \"$MOVIE\" }" | jq '.embedding')
  sqlite-utils rag.db "INSERT INTO embeddings(document, embedding) VALUES ('$MOVIE', '$EMBEDDING');"
done
Let's ensure that our embeddings table has the right data:
sqlite-utils rag.db "SELECT * FROM embeddings;"
You should see base64-encoded embeddings like this:
[
{
"document": "The Last Echo(2024): A forensic memory archaeologist discovers an encrypted memory that leads to a decades-old conspiracy.",
"embedding": { "$base64": true, "encoded": "+f+aPvwJ67..." }
}
...
]
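As a final sanity check, we can count the rows to make sure every movie made it in:
sqlite-utils rag.db "SELECT count(*) AS movies FROM embeddings;"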
Querying the database
Now that we have stored our embeddings in the database, we need to find the closest embedding to a given query, like "A movie about a bunch of actors being stuck in their characters".
To do this, we first generate an embedding for our search query using the same model.
QUERY="A movie about a bunch of actors being stuck in their characters"
QUERY_EMBEDDING=$(curl -s http://localhost:11434/api/embeddings -d "{ \"model\": \"mxbai-embed-large\", \"prompt\": \"$QUERY\" }" | jq '.embedding')
Now, we can do a semantic similarity search to find the closest embedding to our query.
sqlite-utils rag.db "SELECT document,distance FROM embeddings WHERE embedding match '$QUERY_EMBEDDING' ORDER BY distance LIMIT 5;"
This will return the 5 closest documents to our query, along with the distance between the query embedding and each document embedding (by default, sqlite-vec ranks by L2 / Euclidean distance).
[
{
"document": "Final Act(2021): Theater performers become trapped in their roles after the curtain falls.",
"distance": 14.468597412109375
},
{
"document": "Time Lock(2021): Bank robbers get trapped in a time loop inside a vault.",
"distance": 16.6357421875
},
{
"document": "The Voice(2024): A voice actor discovers her recordings can alter listeners memories.",
"distance": 16.638591766357422
},
{
"document": "Echo Chamber(2021): Social media influencers become trapped in a digital reality of their own making.",
"distance": 16.7578125
},
{
"document": "The Forgotten(2023): A small towns residents start losing their memories in a mysterious pattern.",
"distance": 17.152050018310547
}
]
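Retrieval is only half of RAG; to close the loop, you can hand the top matches to a local chat model as context. Here's a minimal sketch, assuming you've pulled a chat model with ollama pull llama3.2 (any local chat model will do):
# Reuses QUERY and QUERY_EMBEDDING from above; collect the top 5 documents as plain text.
CONTEXT=$(sqlite-utils rag.db "SELECT document FROM embeddings WHERE embedding match '$QUERY_EMBEDDING' ORDER BY distance LIMIT 5;" | jq -r '.[].document')
# Build the request body with jq so quotes in the documents can't break the JSON.
PAYLOAD=$(jq -n --arg q "$QUERY" --arg ctx "$CONTEXT" '{model: "llama3.2", stream: false, prompt: ("Using only these movies:\n\n" + $ctx + "\n\nRecommend the best match for: " + $q)}')
curl -s http://localhost:11434/api/generate -d "$PAYLOAD" | jq -r '.response'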
Putting it all together
- The complete source code for this example is available here.
- The quality of the results really depends on the embedding model you're using. Many foundation models are available as hosted APIs from OpenAI, Bedrock, Google, etc. that may work better for your use case. Each embedding model has its own embedding size, which is the dimension of the vector space it embeds documents into; the mxbai-embed-large model we used has a vector size of 1024.
- At the time of writing, the sqlite-vec extension is still in beta, so there may be some instability. Internally at Inferable, we use pgvector with Postgres, which does the same thing for Postgres (a rough equivalent is sketched below).
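For comparison, assuming an embeddings table in Postgres with a vector(1024) column, a rough pgvector equivalent of the search above might look like this (<-> is pgvector's Euclidean distance operator):
psql -c "SELECT document, embedding <-> '$QUERY_EMBEDDING' AS distance FROM embeddings ORDER BY distance LIMIT 5;"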