Visualizing the embeddings in Kangas

OpenAI Logo
Douglas Blank
Jul 11, 2023
Open in Github

In this Jupyter Notebook, we construct a Kangas DataGrid containing the data and projections of the embeddings into 2 dimensions.

What is Kangas?

Kangas as an open source, mixed-media, dataframe-like tool for data scientists. It was developed by Comet, a company designed to help reduce the friction of moving models into production.

1. Setup

To get started, we pip install kangas, and import it.

%pip install kangas --quiet
import kangas as kg

2. Constructing a Kangas DataGrid

We create a Kangas Datagrid with the original data and the embeddings. The data is composed of a rows of reviews, and the embeddings are composed of 1536 floating-point values. In this example, we get the data directly from github, in case you aren't running this notebook inside OpenAI's repo.

We use Kangas to read the CSV file into a DataGrid for further processing.

data = kg.read_csv("https://raw.githubusercontent.com/openai/openai-cookbook/main/examples/data/fine_food_reviews_with_embeddings_1k.csv")
Loading CSV file 'fine_food_reviews_with_embeddings_1k.csv'...
1001it [00:00, 2412.90it/s]
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1000/1000 [00:00<00:00, 2899.16it/s]

We can review the fields of the CSV file:

data.info()
DataGrid (in memory)
    Name   : fine_food_reviews_with_embeddings_1k
    Rows   : 1,000
    Columns: 9
#   Column                Non-Null Count DataGrid Type       
--- -------------------- --------------- --------------------
1   Column 1                       1,000 INTEGER             
2   ProductId                      1,000 TEXT                
3   UserId                         1,000 TEXT                
4   Score                          1,000 INTEGER             
5   Summary                        1,000 TEXT                
6   Text                           1,000 TEXT                
7   combined                       1,000 TEXT                
8   n_tokens                       1,000 INTEGER             
9   embedding                      1,000 TEXT                

And get a glimpse of the first and last rows:

data
row-id Column 1 ProductId UserId Score Summary Text combined n_tokens embedding
1 0 B003XPF9BO A3R7JR3FMEBXQB 5 where does one Wanted to save Title: where do 52 [0.007018072064
2 297 B003VXHGPK A21VWSCGW7UUAR 4 Good, but not W Honestly, I hav Title: Good, bu 178 [-0.00314055196
3 296 B008JKTTUA A34XBAIFT02B60 1 Should advertis First, these sh Title: Should a 78 [-0.01757248118
4 295 B000LKTTTW A14MQ40CCU8B13 5 Best tomato sou I have a hard t Title: Best tom 111 [-0.00139322795
5 294 B001D09KAM A34XBAIFT02B60 1 Should advertis First, these sh Title: Should a 78 [-0.01757248118
...
996 623 B0000CFXYA A3GS4GWPIBV0NT 1 Strange inflamm Truthfully wasn Title: Strange 110 [0.000110913533
997 624 B0001BH5YM A1BZ3HMAKK0NC 5 My favorite and You've just got Title: My favor 80 [-0.02086931467
998 625 B0009ET7TC A2FSDQY5AI6TNX 5 My furbabies LO Shake the conta Title: My furba 47 [-0.00974910240
999 619 B007PA32L2 A15FF2P7RPKH6G 5 got this for th all i have hear Title: got this 50 [-0.00521062919
1000 999 B001EQ5GEO A3VYU0VO6DYV6I 5 I love Maui Cof My first experi Title: I love M 118 [-0.00605782261
[1000 rows x 9 columns]
* Use DataGrid.save() to save to disk
** Use DataGrid.show() to start user interface

Now, we create a new DataGrid, converting the numbers into an Embedding:

import ast # to convert string of a list of numbers into a list of numbers

dg = kg.DataGrid(
    name="openai_embeddings",
    columns=data.get_columns(),
    converters={"Score": str},
)
for row in data:
    embedding = ast.literal_eval(row[8])
    row[8] = kg.Embedding(
        embedding, 
        name=str(row[3]), 
        text="%s - %.10s" % (row[3], row[4]),
        projection="umap",
    )
    dg.append(row)

The new DataGrid now has an Embedding column with proper datatype.

dg.info()
DataGrid (in memory)
    Name   : openai_embeddings
    Rows   : 1,000
    Columns: 9
#   Column                Non-Null Count DataGrid Type       
--- -------------------- --------------- --------------------
1   Column 1                       1,000 INTEGER             
2   ProductId                      1,000 TEXT                
3   UserId                         1,000 TEXT                
4   Score                          1,000 TEXT                
5   Summary                        1,000 TEXT                
6   Text                           1,000 TEXT                
7   combined                       1,000 TEXT                
8   n_tokens                       1,000 INTEGER             
9   embedding                      1,000 EMBEDDING-ASSET     

We simply save the datagrid, and we're done.

dg.save()

3. Render 2D Projections

To render the data directly in the notebook, simply show it. Note that each row contains an embedding projection.

Scroll to far right to see embeddings projection per row.

The color of the point in projection space represents the Score.

dg.show()

Group by "Score" to see rows of each group.

dg.show(group="Score", sort="Score", rows=5, select="Score,embedding")