In this quickstart you will learn how to build a "philosophy quote finder & generator" using OpenAI's vector embeddings and DataStax Astra DB (or a vector-capable Apache Cassandra® cluster, if you prefer) as the vector store for data persistence.
The basic workflow of this notebook is outlined below. You will compute and store the vector embeddings for a number of quotes by famous philosophers, use them to build a powerful search engine and, after that, even a generator of new quotes!
The notebook exemplifies some of the standard usage patterns of vector search -- while showing how easy it is to get started with the Vector capabilities of Astra DB.
For a background on using vector search and text embeddings to build a question-answering system, please check out this excellent hands-on notebook: Question answering using embeddings.
Please note that this notebook uses the CassIO library; other technology choices for accomplishing the same task are covered in this folder's README. This notebook can run either as a Colab notebook or as a regular Jupyter notebook.
Indexing
Each quote is made into an embedding vector with OpenAI's Embedding API. These are saved in the Vector Store for later use in searching. Some metadata, including the author's name and a few other pre-computed tags, are stored alongside, to allow for search customization.
Search
To find a quote similar to the provided search quote, the latter is made into an embedding vector on the fly, and this vector is used to query the store for similar vectors ... i.e. similar quotes that were previously indexed. The search can optionally be constrained by additional metadata ("find me quotes by Spinoza similar to this one ...").
The key point here is that "quotes similar in content" translates, in vector space, to vectors that are metrically close to each other: thus, vector similarity search effectively implements semantic similarity. This is the key reason vector embeddings are so powerful.
The sketch below tries to convey this idea. Each quote, once it's made into a vector, is a point in space. Well, in this case it's on a sphere, since OpenAI's embedding vectors, as most others, are normalized to unit length. Oh, and the sphere is actually not three-dimensional, rather 1536-dimensional!
So, in essence, a similarity search in vector space returns the vectors that are closest to the query vector.
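To make the idea concrete, here is a minimal sketch (using numpy, with toy 3-dimensional vectors in place of real 1536-dimensional embeddings, so the numbers are easy to follow) of why, for unit-length vectors, "metrically close" boils down to a plain dot product:

import numpy as np

# Toy stand-ins for embedding vectors (real ones have 1536 components):
v1 = np.array([0.8, 0.6, 0.0])  # e.g. "a quote about happiness"
v2 = np.array([0.6, 0.8, 0.0])  # e.g. "another quote about happiness"
v3 = np.array([0.0, 0.0, 1.0])  # e.g. "a quote about teapots"

# All three lie on the unit sphere, just like OpenAI's embedding vectors:
assert all(abs(np.linalg.norm(v) - 1.0) < 1e-9 for v in (v1, v2, v3))

# For unit vectors, cosine similarity reduces to the dot product:
print(np.dot(v1, v2))  # 0.96 -> close in vector space, i.e. "similar"
print(np.dot(v1, v3))  # 0.0  -> far apart, i.e. "unrelated"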
Generation
Given a suggestion (a topic or a tentative quote), the search step is performed, and the first returned results (quotes) are fed into an LLM prompt which asks the generative model to invent a new text along the lines of the passed examples and the initial suggestion.
First install some required packages:
!pip install "cassio>=0.1.3" openai
In order to connect to your Astra DB, you need two things:

- an Astra Token, with role "Database Administrator" (it looks like AstraCS:...)
- the database ID (it looks like 3df2a5b6-...)
Make sure you have both strings: both are obtained in the Astra UI once you sign in. For more information, see here: database ID and Token.
If you want to connect to a Cassandra cluster (which, however, must support Vectors), use cassio.init(session=..., keyspace=...) instead, with a suitable Session and keyspace name for your cluster.
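For reference, a sketch of this Cassandra-cluster path could look like the following (the contact point "127.0.0.1" and the keyspace name "demo_keyspace" are placeholders to adapt to your own cluster):

# Sketch: connecting CassIO to a self-managed, vector-capable Cassandra cluster.
from cassandra.cluster import Cluster
import cassio

cluster = Cluster(["127.0.0.1"])  # placeholder contact point(s)
session = cluster.connect()
cassio.init(session=session, keyspace="demo_keyspace")  # placeholder keyspace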
from getpass import getpass
astra_token = getpass("Please enter your Astra token ('AstraCS:...')")
database_id = input("Please enter your database id ('3df2a5b6-...')")
Please enter your Astra token ('AstraCS:...') ········
Please enter your database id ('3df2a5b6-...') 00000000-0000-0000-0000-000000000000
This is how you create a connection to Astra DB:

(Incidentally, you could also use any Cassandra cluster (as long as it provides Vector capabilities), just by changing the parameters to the cassio.init(...) call below.)

import cassio
cassio.init(token=astra_token, database_id=database_id)
You need a table which supports vectors and is equipped with metadata. Call it "philosophers_cassio":
# create a vector store with cassIO
from cassio.table import MetadataVectorCassandraTable
v_table = MetadataVectorCassandraTable(table="philosophers_cassio", vector_dimension=1536)
OPENAI_API_KEY = getpass("Please enter your OpenAI API Key: ")
Please enter your OpenAI API Key: ········
import openai
openai.api_key = OPENAI_API_KEY
Quickly check how one can get the embedding vectors for a list of input texts:
embedding_model_name = "text-embedding-ada-002"
result = openai.Embedding.create(
    input=[
        "This is a sentence",
        "A second sentence"
    ],
    engine=embedding_model_name,
)
print(f"len(result.data) = {len(result.data)}")
print(f"result.data[1].embedding = {str(result.data[1].embedding)[:55]}...")
print(f"len(result.data[1].embedding) = {len(result.data[1].embedding)}")
len(result.data) = 2
result.data[1].embedding = [-0.011011358350515366, 0.0033741754014045, 0.004608382...
len(result.data[1].embedding) = 1536
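As a side check, you can verify the unit-length normalization mentioned earlier (a small sketch: it assumes numpy is available in your environment and reuses the result object from the cell above):

import numpy as np

# Each embedding should have (numerically) unit Euclidean norm:
for i, emb_obj in enumerate(result.data):
    print(f"norm of embedding #{i} = {np.linalg.norm(emb_obj.embedding):.6f}")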
Get a JSON file containing our quotes. We already prepared this collection and put it into this repo for quick loading.
(Note: we adapted the following from a Kaggle dataset -- which we acknowledge -- and also added a few tags to each quote.)
# Don't mind this cell, just autodetecting if we're on a Colab or not
try:
    from google.colab import files
    IS_COLAB = True
except ModuleNotFoundError:
    IS_COLAB = False
import json
import requests
if IS_COLAB:
    # load from Web request to (github) repo
    json_url = "https://raw.githubusercontent.com/openai/openai-cookbook/main/examples/vector_databases/cassandra_astradb/sources/philo_quotes.json"
    quote_dict = json.loads(requests.get(json_url).text)
else:
    # load from local repo
    quote_dict = json.load(open("./sources/philo_quotes.json"))
A quick inspection of the input data structure:
print(quote_dict["source"])
total_quotes = sum(len(quotes) for quotes in quote_dict["quotes"].values())
print(f"\nQuotes loaded: {total_quotes}.\nBy author:")
print("\n".join(f"    {author} ({len(quotes)})" for author, quotes in quote_dict["quotes"].items()))
print("\nSome examples:")
for author, quotes in list(quote_dict["quotes"].items())[:2]:
    print(f"    {author}:")
    for quote in quotes[:2]:
        print(f"        {quote['body'][:50]} ... (tags: {', '.join(quote['tags'])})")
Adapted from this Kaggle dataset: https://www.kaggle.com/datasets/mertbozkurt5/quotes-by-philosophers (License: CC BY-NC-SA 4.0)

Quotes loaded: 450.
By author:
    aristotle (50)
    freud (50)
    hegel (50)
    kant (50)
    nietzsche (50)
    plato (50)
    sartre (50)
    schopenhauer (50)
    spinoza (50)

Some examples:
    aristotle:
        True happiness comes from gaining insight and grow ... (tags: knowledge)
        The roots of education are bitter, but the fruit i ... (tags: education, knowledge)
    freud:
        We are what we are because we have been what we ha ... (tags: history)
        From error to error one discovers the entire truth ... (tags: )
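For reference, the JSON file is structured along these lines (a sketch inferred from the fields accessed above, with the contents abridged):

# Shape of the quote file (illustrative excerpt, not the full content):
quote_dict_shape = {
    "source": "Adapted from this Kaggle dataset: ... (License: CC BY-NC-SA 4.0)",
    "quotes": {
        "aristotle": [
            {"body": "True happiness comes from gaining insight and grow...",
             "tags": ["knowledge"]},
            # ... more quote entries ...
        ],
        # ... one such list per author ...
    },
}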
You will compute the embeddings for the quotes and save them into the Vector Store, along with the text itself and the metadata planned for later use. Note that the author is added as a metadata field along with the "tags" already found with the quote itself.
To optimize speed and reduce the number of API calls, you'll perform batched calls to the OpenAI embedding service, with one batch per author.
(Note: for faster execution, Cassandra and CassIO would let you do concurrent inserts, which we don't do here to keep the demo code more straightforward.)
for philosopher, quotes in quote_dict["quotes"].items():
    print(f"{philosopher}: ", end="")
    # compute the embedding vectors for this author's quotes in a single batch:
    result = openai.Embedding.create(
        input=[quote["body"] for quote in quotes],
        engine=embedding_model_name,
    )
    for quote_idx, (quote, q_data) in enumerate(zip(quotes, result.data)):
        v_table.put(
            row_id=f"q_{philosopher}_{quote_idx}",
            body_blob=quote["body"],
            vector=q_data.embedding,
            metadata={**{tag: True for tag in quote["tags"]}, **{"author": philosopher}},
        )
        print("*", end='')
    print(f" Done ({len(quotes)} quotes inserted).")
print("Finished inserting.")
aristotle: ************************************************** Done (50 quotes inserted).
freud: ************************************************** Done (50 quotes inserted).
hegel: ************************************************** Done (50 quotes inserted).
kant: ************************************************** Done (50 quotes inserted).
nietzsche: ************************************************** Done (50 quotes inserted).
plato: ************************************************** Done (50 quotes inserted).
sartre: ************************************************** Done (50 quotes inserted).
schopenhauer: ************************************************** Done (50 quotes inserted).
spinoza: ************************************************** Done (50 quotes inserted).
Finished inserting.
For the quote-search functionality, you first need to make the input quote into a vector, and then use it to query the store (besides handling the optional metadata in the search call, that is).
Encapsulate the search-engine functionality into a function for ease of re-use:
def find_quote_and_author(query_quote, n, author=None, tags=None):
    query_vector = openai.Embedding.create(
        input=[query_quote],
        engine=embedding_model_name,
    ).data[0].embedding
    metadata = {}
    if author:
        metadata["author"] = author
    if tags:
        for tag in tags:
            metadata[tag] = True
    #
    results = v_table.ann_search(
        query_vector,
        n=n,
        metadata=metadata,
    )
    return [
        (result["body_blob"], result["metadata"]["author"])
        for result in results
    ]
Passing just a quote:
find_quote_and_author("We struggle all our life for nothing", 3)
[('Life to the great majority is only a constant struggle for mere existence, with the certainty of losing it at last.', 'schopenhauer'),
 ('The meager satisfaction that man can extract from reality leaves him starving.', 'freud'),
 ('To live is to suffer, to survive is to find some meaning in the suffering.', 'nietzsche')]
Search restricted to an author:
find_quote_and_author("We struggle all our life for nothing", 2, author="nietzsche")
[('To live is to suffer, to survive is to find some meaning in the suffering.', 'nietzsche'),
 ('What makes us heroic?--Confronting simultaneously our supreme suffering and our supreme hope.', 'nietzsche')]
Search constrained to a tag (out of those saved earlier with the quotes):
find_quote_and_author("We struggle all our life for nothing", 2, tags=["politics"])
[('Mankind will never see an end of trouble until lovers of wisdom come to hold political power, or the holders of power become lovers of wisdom', 'plato'),
 ('Everything the State says is a lie, and everything it has it has stolen.', 'nietzsche')]
The vector similarity search generally returns the vectors that are closest to the query, even if that means results that might be somewhat irrelevant if there's nothing better.
To keep this issue under control, you can get the actual "distance" between the query and each result, and then set a cutoff on it, effectively discarding results that are beyond that threshold. Tuning this threshold correctly is not an easy problem: here, we'll just show you the way.
To get a feeling for how this works, try the following query and play with the choice of quote and threshold to compare the results:
Note (for the mathematically inclined): this "distance" is exactly the cosine similarity between the vectors, i.e. the scalar product divided by the product of the norms of the two vectors. As such, it is a number ranging from -1 to +1. Elsewhere (e.g. in the "CQL" version of this example) you will see this quantity rescaled to fit the [0, 1] interval, which means the numerical values and adequate thresholds will be slightly different.
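(If you need to translate thresholds between the two conventions, the rescaling is, to our understanding, just the affine map shown in this small sketch:)

# Map a cosine similarity in [-1, +1] to the rescaled [0, 1] convention
# (assumed affine rescaling; adjust if your variant differs):
def rescale_to_unit_interval(cos_sim: float) -> float:
    return (cos_sim + 1.0) / 2.0

print(rescale_to_unit_interval(0.8))  # -> 0.9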
quote = "Animals are our equals."
# quote = "Be good."
# quote = "This teapot is strange."
metric_threshold = 0.8
quote_vector = openai.Embedding.create(
    input=[quote],
    engine=embedding_model_name,
).data[0].embedding
results = list(v_table.metric_ann_search(
    quote_vector,
    n=8,
    metric="cos",
    metric_threshold=metric_threshold,
))
print(f"{len(results)} quotes within the threshold:")
for idx, result in enumerate(results):
    print(f"    {idx}. [distance={result['distance']:.3f}] \"{result['body_blob'][:70]}...\"")
8 quotes within the threshold:
    0. [distance=0.858] "The assumption that animals are without rights, and the illusion that ..."
    1. [distance=0.849] "Animals are in possession of themselves; their soul is in possession o..."
    2. [distance=0.846] "At his best, man is the noblest of all animals; separated from law and..."
    3. [distance=0.840] "Man is the only animal that must be encouraged to live...."
    4. [distance=0.838] ".... we are a part of nature as a whole, whose order we follow...."
    5. [distance=0.828] "Because Christian morality leaves animals out of account, they are at ..."
    6. [distance=0.827] "Every human endeavor, however singular it seems, involves the whole hu..."
    7. [distance=0.826] "A dog has the soul of a philosopher...."
For this task you need another component from OpenAI, namely an LLM to generate the quote for us (based on input obtained by querying the Vector Store).
You also need a template for the prompt that will be filled for the generate-quote LLM completion task.
completion_model_name = "gpt-3.5-turbo"
generation_prompt_template = """Generate a single short philosophical quote on the given topic,
similar in spirit and form to the provided actual example quotes.
Do not exceed 20-30 words in your quote.
REFERENCE TOPIC: "{topic}"
ACTUAL EXAMPLES:
{examples}
"""
Like for search, this functionality is best wrapped into a handy function (which internally uses search):
def generate_quote(topic, n=2, author=None, tags=None):
    quotes = find_quote_and_author(query_quote=topic, n=n, author=author, tags=tags)
    if quotes:
        prompt = generation_prompt_template.format(
            topic=topic,
            examples="\n".join(f"  - {quote[0]}" for quote in quotes),
        )
        # a little logging:
        print("** quotes found:")
        for q, a in quotes:
            print(f"**    - {q} ({a})")
        print("** end of logging")
        #
        response = openai.ChatCompletion.create(
            model=completion_model_name,
            messages=[{"role": "user", "content": prompt}],
            temperature=0.7,
            max_tokens=320,
        )
        return response.choices[0].message.content.replace('"', '').strip()
    else:
        print("** no quotes found.")
        return None
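If you are curious what the finished prompt will look like, you can render the template with placeholder values (the topic and example quotes below are made up purely for illustration):

# Preview the rendered prompt with placeholder values:
print(generation_prompt_template.format(
    topic="happiness",
    examples="\n".join(f"  - {q}" for q in [
        "An example quote.",
        "Another example quote.",
    ]),
))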
Just passing a text (a "quote", but one can actually just suggest a topic since its vector embedding will still end up at the right place in the vector space):
q_topic = generate_quote("politics and virtue")
print("\nA new generated quote:")
print(q_topic)
** quotes found:
**    - Happiness is the reward of virtue. (aristotle)
**    - Enthusiasm is always connected with the senses, whatever be the object that excites it. The true strength of virtue is serenity of mind, combined with a deliberate and steadfast determination to execute her laws. That is the healthful condition of the moral life; on the other hand, enthusiasm, even when excited by representations of goodness, is a brilliant but feverish glow which leaves only exhaustion and languor behind. (kant)
** end of logging

A new generated quote:
Politics without virtue is like a ship without a compass - destined to drift aimlessly, guided only by self-interest and corruption.
Use inspiration from just a single philosopher:
q_topic = generate_quote("animals", author="schopenhauer")
print("\nA new generated quote:")
print(q_topic)
** quotes found:
**    - Because Christian morality leaves animals out of account, they are at once outlawed in philosophical morals; they are mere 'things,' mere means to any ends whatsoever. They can therefore be used for vivisection, hunting, coursing, bullfights, and horse racing, and can be whipped to death as they struggle along with heavy carts of stone. Shame on such a morality that is worthy of pariahs, and that fails to recognize the eternal essence that exists in every living thing, and shines forth with inscrutable significance from all eyes that see the sun! (schopenhauer)
**    - The assumption that animals are without rights, and the illusion that our treatment of them has no moral significance, is a positively outrageous example of Western crudity and barbarity. Universal compassion is the only guarantee of morality. (schopenhauer)
** end of logging

A new generated quote:
By disregarding the worth of animals, we reveal our own moral ignorance. True morality lies in extending compassion to all living beings.
There's an interesting topic to examine before completing this quickstart. While, generally, tags and quotes can be in any relationship (e.g. a quote having multiple tags), authors are effectively an exact grouping (they define a "disjoint partitioning" on the set of quotes): each quote has exactly one author (for us, at least).
Now, suppose you know in advance your application will usually (or always) run queries on a single author. Then you can take full advantage of the underlying database structure: if you group quotes in partitions (one per author), vector queries on just an author will use less resources and return much faster.
We'll not dive into the details here, which have to do with the Cassandra storage internals: the important message is that if your queries are run within a group, consider partitioning accordingly to boost performance.
You'll now see this choice in action.
First, you need a different table abstraction from CassIO:
from cassio.table import ClusteredMetadataVectorCassandraTable
v_table_partitioned = ClusteredMetadataVectorCassandraTable(table="philosophers_cassio_partitioned", vector_dimension=1536)
Now repeat the compute-embeddings-and-insert step on the new table.
Compared to what you have seen earlier, there is a crucial difference in that now the quote's author is stored as the partition id for the inserted row, instead of being added to the catch-all "metadata" dictionary.
While you are at it, by way of demonstration, you will insert all quotes by a given author concurrently: with CassIO, this is done by using the asynchronous put_async method for each quote, collecting the resulting list of Future objects, and calling the result() method on them all afterwards, to ensure they have all executed. Cassandra / Astra DB supports a high degree of concurrency in I/O operations very well.
(Note: one could have cached the embeddings computed previously to save a few API tokens -- here, however, we wanted to keep the code easier to inspect.)
for philosopher, quotes in quote_dict["quotes"].items():
    print(f"{philosopher}: ", end="")
    result = openai.Embedding.create(
        input=[quote["body"] for quote in quotes],
        engine=embedding_model_name,
    )
    futures = []
    for quote_idx, (quote, q_data) in enumerate(zip(quotes, result.data)):
        # fire off the insertions concurrently, collecting the Future objects:
        futures.append(v_table_partitioned.put_async(
            partition_id=philosopher,
            row_id=f"q_{philosopher}_{quote_idx}",
            body_blob=quote["body"],
            vector=q_data.embedding,
            metadata={tag: True for tag in quote["tags"]},
        ))
    # wait until all insertions for this author have completed:
    for future in futures:
        future.result()
    print(f"Done ({len(quotes)} quotes inserted).")
print("Finished inserting.")
aristotle: Done (50 quotes inserted).
freud: Done (50 quotes inserted).
hegel: Done (50 quotes inserted).
kant: Done (50 quotes inserted).
nietzsche: Done (50 quotes inserted).
plato: Done (50 quotes inserted).
sartre: Done (50 quotes inserted).
schopenhauer: Done (50 quotes inserted).
spinoza: Done (50 quotes inserted).
Finished inserting.
With this new table, the similarity search changes accordingly (note the arguments to ann_search):
def find_quote_and_author_p(query_quote, n, author=None, tags=None):
    query_vector = openai.Embedding.create(
        input=[query_quote],
        engine=embedding_model_name,
    ).data[0].embedding
    metadata = {}
    partition_id = None
    if author:
        partition_id = author
    if tags:
        for tag in tags:
            metadata[tag] = True
    #
    results = v_table_partitioned.ann_search(
        query_vector,
        n=n,
        partition_id=partition_id,
        metadata=metadata,
    )
    return [
        (result["body_blob"], result["partition_id"])
        for result in results
    ]
That's it: the new table still supports the "generic" similarity searches all right ...
find_quote_and_author_p("We struggle all our life for nothing", 3)
[('Life to the great majority is only a constant struggle for mere existence, with the certainty of losing it at last.', 'schopenhauer'),
 ('The meager satisfaction that man can extract from reality leaves him starving.', 'freud'),
 ('To live is to suffer, to survive is to find some meaning in the suffering.', 'nietzsche')]
... but it's when an author is specified that you would notice a huge performance advantage:
find_quote_and_author_p("We struggle all our life for nothing", 2, author="nietzsche")
[('To live is to suffer, to survive is to find some meaning in the suffering.', 'nietzsche'),
 ('What makes us heroic?--Confronting simultaneously our supreme suffering and our supreme hope.', 'nietzsche')]
Well, you would notice a performance gain, if you had a realistic-size dataset. In this demo, with a few tens of entries, there's no noticeable difference -- but you get the idea.
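If you want to see this for yourself on a larger dataset, a quick-and-dirty comparison could look like the following sketch (rough wall-clock timing only; on a demo-sized table the numbers will be dominated by network latency and the embedding call):

import time

# Unscientific timing: author-restricted search via metadata filter
# vs. via the partition key.
for label, finder in [
    ("metadata filter", find_quote_and_author),
    ("partition key  ", find_quote_and_author_p),
]:
    t0 = time.perf_counter()
    finder("We struggle all our life for nothing", 2, author="nietzsche")
    print(f"{label}: {time.perf_counter() - t0:.3f} s")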
Congratulations! You have learned how to use OpenAI for vector embeddings and Astra DB / Cassandra for storage in order to build a sophisticated philosophical search engine and quote generator.
This example used CassIO to interface with the Vector Store - but this is not the only choice. Check the README for other options and integration with popular frameworks.
To find out more on how Astra DB's Vector Search capabilities can be a key ingredient in your ML/GenAI applications, visit Astra DB's web page on the topic.
If you want to remove all resources used for this demo, run this cell (warning: this will delete the tables and the data inserted in them!):
# we peek at CassIO's config to get a direct handle to the DB connection
session = cassio.config.resolve_session()
keyspace = cassio.config.resolve_keyspace()
session.execute(f"DROP TABLE IF EXISTS {keyspace}.philosophers_cassio;")
session.execute(f"DROP TABLE IF EXISTS {keyspace}.philosophers_cassio_partitioned;")
<cassandra.cluster.ResultSet at 0x7fc7c4287940>