Azure Cognitive Search as a vector database for OpenAI embeddings

Farzad Sunavala
Sep 11, 2023

This notebook provides step-by-step instructions for using Azure Cognitive Search as a vector database with OpenAI embeddings. Azure Cognitive Search (formerly known as "Azure Search") is a cloud search service that gives developers infrastructure, APIs, and tools for building a rich search experience over private, heterogeneous content in web, mobile, and enterprise applications.

Prerequisites:

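For this exercise you need an Azure Cognitive Search service (endpoint and admin key) and an Azure OpenAI resource (endpoint and key) with an embedding model deployed. Both are configured in the code that follows.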

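! pip install wget
# Note: this notebook was written against a 2023 preview (--pre) build of azure-search-documents
# and the pre-1.0 openai package; later releases of both renamed or removed some of the calls used here.
! pip install azure-search-documents --pre
import json
import openai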
import wget
import pandas as pd
import zipfile
from azure.core.credentials import AzureKeyCredential  
from azure.search.documents import SearchClient  
from azure.search.documents.indexes import SearchIndexClient  
from azure.search.documents.models import Vector 
from azure.search.documents import SearchIndexingBufferedSender
from azure.search.documents.indexes.models import (
    SearchIndex,
    SearchField,
    SearchFieldDataType,
    SimpleField,
    SearchableField,
    SemanticConfiguration,
    PrioritizedFields,
    SemanticField,
    SemanticSettings,
    VectorSearch,
    HnswVectorSearchAlgorithmConfiguration,
)
openai.api_type = "azure"
openai.api_base = "YOUR_AZURE_OPENAI_ENDPOINT"
openai.api_version = "2023-05-15"
openai.api_key = "YOUR_AZURE_OPENAI_KEY"
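# With api_type "azure", this value is passed as engine=..., so it must match your embedding deployment name
model: str = "text-embedding-ada-002"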
search_service_endpoint: str = "YOUR_AZURE_SEARCH_ENDPOINT"
search_service_api_key: str = "YOUR_AZURE_SEARCH_ADMIN_KEY"
index_name: str = "azure-cognitive-search-vector-demo"
credential = AzureKeyCredential(search_service_api_key)
embeddings_url = "https://cdn.openai.com/API/examples/data/vector_database_wikipedia_articles_embedded.zip"

# The file is ~700 MB so this will take some time
wget.download(embeddings_url)
'vector_database_wikipedia_articles_embedded.zip'
with zipfile.ZipFile("vector_database_wikipedia_articles_embedded.zip","r") as zip_ref:
    zip_ref.extractall("../../data")
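
Because the archive is large, it can be worth skipping the download when the file is already on disk. The following is a minimal sketch of that idea using only the standard library; `download_if_missing` and its `fetch` parameter are hypothetical names introduced here, not part of the original notebook or of the `wget` package.

```python
import os
import urllib.request


def download_if_missing(url, dest_path, fetch=urllib.request.urlretrieve):
    """Download `url` to `dest_path` unless a non-empty file already exists there.

    Returns True if a download happened and False if the existing file was reused.
    `fetch` is injectable (hypothetical parameter) so the logic can be tested offline.
    """
    if os.path.exists(dest_path) and os.path.getsize(dest_path) > 0:
        return False
    fetch(url, dest_path)
    return True
```

In this notebook it could be called as `download_if_missing(embeddings_url, "vector_database_wikipedia_articles_embedded.zip")` in place of `wget.download(embeddings_url)`. Note that the check only tests for a non-empty file; it does not detect a partially downloaded archive.
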
article_df = pd.read_csv('../../data/vector_database_wikipedia_articles_embedded.csv')  
  
# Read vectors from strings back into a list using json.loads  
article_df["title_vector"] = article_df.title_vector.apply(json.loads)  
article_df["content_vector"] = article_df.content_vector.apply(json.loads)  
article_df['vector_id'] = article_df['vector_id'].apply(str)  
article_df.head()  
  | id | url | title | text | title_vector | content_vector | vector_id
0 | 1 | https://simple.wikipedia.org/wiki/April | April | April is the fourth month of the year in the J... | [0.001009464613161981, -0.020700545981526375, ... | [-0.011253940872848034, -0.013491976074874401,... | 0
1 | 2 | https://simple.wikipedia.org/wiki/August | August | August (Aug.) is the eighth month of the year ... | [0.0009286514250561595, 0.000820168002974242, ... | [0.0003609954728744924, 0.007262262050062418, ... | 1
2 | 6 | https://simple.wikipedia.org/wiki/Art | Art | Art is a creative activity that expresses imag... | [0.003393713850528002, 0.0061537534929811954, ... | [-0.004959689453244209, 0.015772193670272827, ... | 2
3 | 8 | https://simple.wikipedia.org/wiki/A | A | A or a is the first letter of the English alph... | [0.0153952119871974, -0.013759135268628597, 0.... | [0.024894846603274345, -0.022186409682035446, ... | 3
4 | 9 | https://simple.wikipedia.org/wiki/Air | Air | Air refers to the Earth's atmosphere. Air is a... | [0.02224554680287838, -0.02044147066771984, -0... | [0.021524671465158463, 0.018522677943110466, -... | 4
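
The vector columns are stored in the CSV as JSON strings, and the index defined below declares 1536 dimensions for each vector field. A quick sanity check on the parsed vectors can catch a mismatch before upload. This is a sketch only; `parse_vector` and `EXPECTED_DIMENSIONS` are hypothetical names introduced for illustration.

```python
import json

# Assumed to match vector_search_dimensions in the index definition
EXPECTED_DIMENSIONS = 1536


def parse_vector(raw, expected_dim=EXPECTED_DIMENSIONS):
    """Parse a JSON-encoded vector (or accept a sequence) and verify its length."""
    vector = json.loads(raw) if isinstance(raw, str) else list(raw)
    if len(vector) != expected_dim:
        raise ValueError(f"expected {expected_dim} dimensions, got {len(vector)}")
    # Normalize every component to float so integer-valued entries are handled uniformly
    return [float(x) for x in vector]
```

It could be used in place of `json.loads` in the `apply` calls above, for example `article_df.title_vector.apply(parse_vector)`.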
# Configure a search index
index_client = SearchIndexClient(
    endpoint=search_service_endpoint, credential=credential)
fields = [
    SimpleField(name="id", type=SearchFieldDataType.String),
    SimpleField(name="vector_id", type=SearchFieldDataType.String, key=True),
    SimpleField(name="url", type=SearchFieldDataType.String),
    SearchableField(name="title", type=SearchFieldDataType.String),
    SearchableField(name="text", type=SearchFieldDataType.String),
    SearchField(name="title_vector", type=SearchFieldDataType.Collection(SearchFieldDataType.Single),
                searchable=True, vector_search_dimensions=1536, vector_search_configuration="my-vector-config"),
    SearchField(name="content_vector", type=SearchFieldDataType.Collection(SearchFieldDataType.Single),
                searchable=True, vector_search_dimensions=1536, vector_search_configuration="my-vector-config"),
]

# Configure the vector search configuration
vector_search = VectorSearch(
    algorithm_configurations=[
        HnswVectorSearchAlgorithmConfiguration(
            name="my-vector-config",
            kind="hnsw",
            parameters={
                "m": 4,
                "efConstruction": 400,
                "efSearch": 500,
                "metric": "cosine"
            }
        )
    ]
)

# Optional: configure semantic reranking by passing your title, keywords, and content fields
semantic_config = SemanticConfiguration(
    name="my-semantic-config",
    prioritized_fields=PrioritizedFields(
        title_field=SemanticField(field_name="title"),
        prioritized_keywords_fields=[SemanticField(field_name="url")],
        prioritized_content_fields=[SemanticField(field_name="text")]
    )
)
# Create the semantic settings with the configuration
semantic_settings = SemanticSettings(configurations=[semantic_config])

# Create the index 
index = SearchIndex(name=index_name, fields=fields,
                    vector_search=vector_search, semantic_settings=semantic_settings)
result = index_client.create_or_update_index(index)
print(f'{result.name} created')
azure-cognitive-search-vector-demo created
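
The vector configuration above sets `"metric": "cosine"`, which ranks vectors by the angle between them rather than by their magnitude. As a reference point, here is a minimal pure-Python sketch of cosine similarity. It illustrates the metric only and is not how the service computes it internally; `cosine_similarity` is a name introduced here.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)
```

Identical directions give 1, orthogonal vectors give 0, and opposite directions give -1.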

Insert text and embeddings into vector store

The Wikipedia articles dataset provided by OpenAI ships with pre-computed embeddings. The code below converts the data frame into a list of dictionaries and uploads it to your Azure Cognitive Search index.

# Convert the 'id' and 'vector_id' columns to string so one of them can serve as our key field  
article_df['id'] = article_df['id'].astype(str)  
article_df['vector_id'] = article_df['vector_id'].astype(str)  
  
# Convert the DataFrame to a list of dictionaries  
documents = article_df.to_dict(orient='records')  
  
# Use SearchIndexingBufferedSender to upload the documents in batches optimized for indexing 
with SearchIndexingBufferedSender(search_service_endpoint, index_name, AzureKeyCredential(search_service_api_key)) as batch_client:  
    # Add upload actions for all documents  
    batch_client.upload_documents(documents=documents)  
  
print(f"Uploaded {len(documents)} documents in total")  
Uploaded 25000 documents in total
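
Since `vector_id` is the key field of the index, each document needs a unique string key. The sketch below checks a list of records for missing, duplicate, or malformed keys before upload. The allowed character set (letters, digits, underscore, dash, equal sign) is our understanding of the service's key rules and should be verified against the Azure documentation; `find_invalid_keys` is a hypothetical helper.

```python
import re

# Assumption: document keys may contain only letters, digits, '_', '-', and '='
_KEY_PATTERN = re.compile(r"[A-Za-z0-9_\-=]+")


def find_invalid_keys(records, key_field="vector_id"):
    """Return (position, reason) tuples for records whose key would be rejected."""
    problems = []
    seen = set()
    for position, record in enumerate(records):
        key = record.get(key_field)
        if not isinstance(key, str):
            problems.append((position, "key is missing or not a string"))
        elif not _KEY_PATTERN.fullmatch(key):
            problems.append((position, "key contains disallowed characters"))
        elif key in seen:
            problems.append((position, "duplicate key"))
        else:
            seen.add(key)
    return problems
```

Calling `find_invalid_keys(documents)` before the upload step should return an empty list for the dataset used here, given that both ID columns were converted to strings.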

If your dataset does not already contain pre-computed embeddings, you can create them with the function below, which uses the openai Python library. The same function and model are also used to generate query embeddings for vector searches.

# Example function to generate document embedding  
def generate_document_embeddings(text):  
    response = openai.Embedding.create(  
        input=text, engine=model)  
    embeddings = response['data'][0]['embedding']  
    return embeddings  
  
# Sampling the first document content as an example 
first_document_content = documents[0]['text']  
print(f"Content: {first_document_content[:100]}")    
    
# Generate the content vector using the `generate_document_embeddings` function    
content_vector = generate_document_embeddings(first_document_content)    
print("Content vector generated")
Content: April is the fourth month of the year in the Julian and Gregorian calendars, and comes between March
Content vector generated
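
Embedding one document per request is slow for a large dataset. The sketch below groups texts into batches and delegates each batch to a caller-supplied function. `embed_in_batches`, `embed_batch`, and the default `batch_size` of 16 are assumptions made for illustration; check the per-request input limit of your own deployment.

```python
def embed_in_batches(texts, embed_batch, batch_size=16):
    """Embed `texts` in order, calling `embed_batch` on slices of at most `batch_size`.

    `embed_batch` is a hypothetical callable: it takes a list of strings and
    returns a list of vectors in the same order.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    vectors = []
    for start in range(0, len(texts), batch_size):
        batch = texts[start:start + batch_size]
        batch_vectors = embed_batch(batch)
        if len(batch_vectors) != len(batch):
            raise RuntimeError("embed_batch returned a different number of vectors than inputs")
        vectors.extend(batch_vectors)
    return vectors
```

With the pre-1.0 openai library used in this notebook, `embed_batch` could wrap `openai.Embedding.create(input=batch, engine=model)` and return the `'embedding'` entry of each item in `response['data']`, sorted by each item's `'index'`.
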
# Function to generate query embedding
def generate_embeddings(text):
    response = openai.Embedding.create(
        input=text, engine=model)
    embeddings = response['data'][0]['embedding']
    return embeddings

# Pure Vector Search
query = "modern art in Europe"
  
search_client = SearchClient(search_service_endpoint, index_name, AzureKeyCredential(search_service_api_key))  
vector = Vector(value=generate_embeddings(query), k=3, fields="content_vector")  
  
results = search_client.search(  
    search_text=None,  
    vectors=[vector],  
    select=["title", "text", "url"] 
)
  
for result in results:  
    print(f"Title: {result['title']}")  
    print(f"Score: {result['@search.score']}")  
    print(f"URL: {result['url']}\n")  
Title: Documenta
Score: 0.8599451
URL: https://simple.wikipedia.org/wiki/Documenta

Title: Museum of Modern Art
Score: 0.85260946
URL: https://simple.wikipedia.org/wiki/Museum%20of%20Modern%20Art

Title: Expressionism
Score: 0.85235393
URL: https://simple.wikipedia.org/wiki/Expressionism
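
The `@search.score` values above are not raw cosine similarities. As we understand the Azure documentation, for the cosine metric the service reports `1 / (1 + cosine_distance)`, where the distance is `1 - cosine_similarity`, so scores fall between roughly 0.333 and 1. The sketch below converts between the two under that assumption; both function names are introduced here for illustration.

```python
def score_to_cosine_similarity(score):
    """Convert a vector @search.score back to cosine similarity.

    Assumes score = 1 / (1 + cosine_distance) with cosine_distance = 1 - similarity.
    """
    if not 0 < score <= 1:
        raise ValueError("score must be in the interval (0, 1]")
    cosine_distance = (1 - score) / score
    return 1 - cosine_distance


def cosine_similarity_to_score(similarity):
    """Inverse of score_to_cosine_similarity under the same assumption."""
    return 1 / (1 + (1 - similarity))
```

Under this assumption, the top score of 0.8599451 corresponds to a cosine similarity of about 0.837.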

# Hybrid Search
query = "Famous battles in Scottish history"  
  
search_client = SearchClient(search_service_endpoint, index_name, AzureKeyCredential(search_service_api_key))  
vector = Vector(value=generate_embeddings(query), k=3, fields="content_vector")  
  
results = search_client.search(  
    search_text=query,  
    vectors=[vector],
    select=["title", "text", "url"],
    top=3
)  
  
for result in results:  
    print(f"Title: {result['title']}")  
    print(f"Score: {result['@search.score']}")  
    print(f"URL: {result['url']}\n")  
Title: Wars of Scottish Independence
Score: 0.03306011110544205
URL: https://simple.wikipedia.org/wiki/Wars%20of%20Scottish%20Independence

Title: Battle of Bannockburn
Score: 0.022253260016441345
URL: https://simple.wikipedia.org/wiki/Battle%20of%20Bannockburn

Title: Scottish
Score: 0.016393441706895828
URL: https://simple.wikipedia.org/wiki/Scottish
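
The hybrid scores above are on a different scale from the pure vector scores because hybrid queries in Azure Cognitive Search merge the keyword ranking and the vector ranking with Reciprocal Rank Fusion (RRF). Each document's score is the sum of reciprocal ranks across the result lists in which it appears. The sketch below shows the general technique; the constant `k=60`, the 1-based ranks, and the document IDs in the usage note are assumptions, and the service's exact constants may differ.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document IDs into one list of (id, score) pairs.

    Each appearance of a document at 1-based position `rank` contributes 1 / (k + rank).
    Results are sorted by fused score, highest first.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)
```

For example, fusing a keyword ranking `["wars", "scottish", "bannockburn"]` with a vector ranking `["bannockburn", "wars", "expressionism"]` puts `"wars"` first, because it ranks highly in both lists, followed by `"bannockburn"`.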

# Semantic Hybrid Search
query = "Famous battles in Scottish history" 

search_client = SearchClient(search_service_endpoint, index_name, AzureKeyCredential(search_service_api_key))  
vector = Vector(value=generate_embeddings(query), k=3, fields="content_vector")  

results = search_client.search(  
    search_text=query,  
    vectors=[vector], 
    select=["title", "text", "url"],
    query_type="semantic",
    query_language="en-us",
    semantic_configuration_name='my-semantic-config',
    query_caption="extractive",
    query_answer="extractive",
    top=3
)

semantic_answers = results.get_answers()
for answer in semantic_answers:
    if answer.highlights:
        print(f"Semantic Answer: {answer.highlights}")
    else:
        print(f"Semantic Answer: {answer.text}")
    print(f"Semantic Answer Score: {answer.score}\n")

for result in results:
    print(f"Title: {result['title']}")
    print(f"URL: {result['url']}")
    captions = result["@search.captions"]
    if captions:
        caption = captions[0]
        if caption.highlights:
            print(f"Caption: {caption.highlights}\n")
        else:
            print(f"Caption: {caption.text}\n")
Semantic Answer: The<em> Battle of Bannockburn,</em> fought on 23 and 24 June 1314, was an important Scottish victory in the Wars of Scottish Independence. A smaller Scottish army defeated a much larger and better armed English army. Background  When King Alexander III of Scotland died in 1286, his heir was his granddaughter Margaret, Maid of Norway.
Semantic Answer Score: 0.8857421875

Title: Wars of Scottish Independence
URL: https://simple.wikipedia.org/wiki/Wars%20of%20Scottish%20Independence
Caption: Important Figures Scotland King David II King John Balliol King Robert I the Bruce William Wallace  England King Edward I King Edward II King Edward III  Battles  Battle of Bannockburn  The Battle of Bannockburn (23–24 June 1314) was an important Scottish victory. It was the decisive battle in the First War of Scottish Independence.

Title: Battle of Bannockburn
URL: https://simple.wikipedia.org/wiki/Battle%20of%20Bannockburn
Caption: The Battle of Bannockburn, fought on 23 and 24 June 1314, was an important<em> Scottish</em> victory in the Wars of<em> Scottish</em> Independence. A smaller Scottish army defeated a much larger and better armed English army. Background  When King Alexander III of Scotland died in 1286, his heir was his granddaughter Margaret, Maid of Norway.

Title: First War of Scottish Independence
URL: https://simple.wikipedia.org/wiki/First%20War%20of%20Scottish%20Independence
Caption: The First War of<em> Scottish Independence</em> lasted from the outbreak of the war in 1296 until the 1328. The Scots were defeated at Dunbar on 27 April 1296. John Balliol abdicated in Montrose castle on 10 July 1296.
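
The answers and captions above mark highlighted phrases with `<em>` tags, which is convenient for HTML rendering but noisy in plain-text output. A small helper can remove the tags and tidy the whitespace. This is a sketch; `strip_highlight_tags` is a name introduced here and it handles only the bare `<em>` and `</em>` markers seen in the output above.

```python
import re

_EM_TAG = re.compile(r"</?em>")


def strip_highlight_tags(text):
    """Remove <em>/</em> markers and collapse runs of whitespace into single spaces."""
    without_tags = _EM_TAG.sub("", text)
    return re.sub(r"\s+", " ", without_tags).strip()
```

It could be applied as `strip_highlight_tags(caption.highlights)` before printing.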