CSCI 572 HW5 Vector-based similarity search!






In this final HW, you will use Weaviate, which is a vector DB (it stores data as vectors, and computes a search
query by vectorizing it and doing a similarity search against the existing vectors).
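
To make 'similarity search' concrete, here is a minimal sketch in plain Python. The three-dimensional vectors and the labels other than 'DNA' are made up for illustration (a real vectorizer produces far longer vectors), and cosine similarity is assumed as the measure; it is a common choice, not the only one.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def nearest(query_vec, stored, k=1):
    """Return the k labels whose vectors are most similar to query_vec."""
    ranked = sorted(
        stored,
        key=lambda label: cosine_similarity(query_vec, stored[label]),
        reverse=True,
    )
    return ranked[:k]

# Toy "database": label -> made-up vector (hypothetical values)
stored = {
    "DNA": [0.9, 0.1, 0.0],
    "cells": [0.7, 0.3, 0.1],
    "guitars": [0.1, 0.9, 0.2],
}
query = [1.0, 0.0, 0.0]  # pretend this is the vector for 'biology'
print(nearest(query, stored, k=2))
```

A vector DB does the same ranking, only with real embeddings and with an index that avoids comparing the query against every stored vector.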


The (three) steps we need are really simple:
• install Weaviate plus a vectorizer via Docker as images, and run them as containers
• specify a schema for the data, and upload the data (in .json format) to have it vectorized
• run a query (which gets vectorized and sim-searched), and get back results (as JSON)
The following sections describe the above steps. The entire HW will only take you 2 to 3 hours to complete, pinky swear 🙂

1. Installing Weaviate and a vectorizer module

After installing Docker, bring it up (e.g. on Windows, run Docker Desktop). Then, in your (ana)conda shell, run the docker-compose
command below, which uses the ‘docker-compose.yml’ config file to pull in two images: the ‘weaviate’ one, and a text2vec transformer
called ‘t2v-transformers’:
docker-compose up -d

The output shows the download progress, then completion, and subsequently two containers being started automatically (one for weaviate,
one for t2v-transformers).

Yeay! Now we have the vectorizer transformer (to convert sentences to vectors), and weaviate (our vector DB search engine)
running! On to data handling 🙂
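
Before moving on, it can help to confirm that the weaviate container is actually answering. The sketch below polls Weaviate's readiness endpoint using only the standard library. The address http://localhost:8080 is an assumption; use whatever port your ‘docker-compose.yml’ maps.

```python
import urllib.error
import urllib.request

def weaviate_is_ready(base_url="http://localhost:8080", timeout=3):
    """Return True if Weaviate's readiness endpoint answers with HTTP 200."""
    url = base_url.rstrip("/") + "/v1/.well-known/ready"
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or a non-2xx answer: treat as "not ready"
        return False
```

Calling `weaviate_is_ready()` right after `docker-compose up -d` may return False for a little while, since the containers need some time to start.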

2. Loading data to search for

This is the data that we’d like searched, part of which will get returned to us as results. The data is conveniently represented as an
array of JSON documents, similar to Solr/Lunr. Our data file is conveniently named data.json (you can rename it if you like);
place it in the ‘root’ directory of your webserver (see below). Each datum/‘row’/JSON document contains three k:v pairs, with
‘Category’, ‘Question’, ‘Answer’ as keys. As you might guess, it is in Jeopardy(TM) answer-question (reversed) format 🙂
The original file has a different name; I simply made a local copy called data.json.

The overall idea is this: we’d get the 10 documents vectorized, then specify a query word, eg. ‘biology’, and automagically have that
pull up related docs, eg. the ‘DNA’ one! This is a really useful semantic search feature where we don’t need to specify exact
keywords to search for.
Start by installing the weaviate Python client:
pip install weaviate-client
So, how do we submit our JSON data to get it vectorized? Simply use a Python loader script.


If you look in the script, you’ll see that we are creating a schema: we create a class called ‘SimSearch’ (you can call it something else
if you like). The data we load into the DB will be associated with this class (the last line in the script does this via add_data_object()).
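
For orientation, here is a minimal sketch of what such a loader can look like. It is not the course's script: it assumes the v3-style Python client (`weaviate.Client`, e.g. from `pip install "weaviate-client<4"`), Weaviate at http://localhost:8080, data.json served at http://localhost:8000, and property names that are simply the lower-cased JSON keys. All of these are assumptions to adjust.

```python
import json
import urllib.request

CLASS_NAME = "SimSearch"

def build_class_schema(class_name=CLASS_NAME):
    """Schema for one class whose objects are vectorized by the transformers module."""
    return {"class": class_name, "vectorizer": "text2vec-transformers"}

def to_properties(doc):
    """Map one JSON document to Weaviate properties (lower-cased keys, an assumption)."""
    return {key.lower(): value for key, value in doc.items()}

def load(data_url="http://localhost:8000/data.json",
         weaviate_url="http://localhost:8080"):
    """Create the class, fetch data.json, and upload every document in a batch."""
    import weaviate  # imported here so the helpers above work without the client

    client = weaviate.Client(weaviate_url)  # v3-style client (assumption)
    client.schema.create_class(build_class_schema())
    with urllib.request.urlopen(data_url) as resp:  # data.json from the local webserver
        docs = json.load(resp)
    with client.batch as batch:  # objects are sent when the block exits
        for doc in docs:
            batch.add_data_object(to_properties(doc), CLASS_NAME)
    return len(docs)
```

Calling `load()` with both services running returns the number of documents submitted. Running it twice will fail at `create_class`, because the class already exists.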
NOTE: you NEED to run a local webserver [in a separate ana/conda (or other) shell], e.g. via ‘python’ like you did for
HW4; it’s what will ‘serve’ data.json to weaviate 🙂
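
If you no longer have your HW4 setup handy, the sketch below serves a directory over HTTP using only the standard library. The port (8000) and the choice to bind to 127.0.0.1 are assumptions; it does roughly what `python -m http.server 8000` does.

```python
import functools
import http.server
import threading

def start_server(directory=".", port=8000):
    """Serve `directory` over HTTP in a background thread; return the server object."""
    handler = functools.partial(
        http.server.SimpleHTTPRequestHandler, directory=directory
    )
    server = http.server.ThreadingHTTPServer(("127.0.0.1", port), handler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server
```

Call `start_server()` from the directory that holds data.json and keep that Python session open; `server.shutdown()` stops it.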
Great! Now we have specified our searchable data, which has been first vectorized (by ‘t2v-transformers’), then stored as vectors
(in weaviate).
Only one thing left: querying!

3. Querying our vectorized data

To query, run this simple shell script.
As you can see in the script, we search for ‘physics’-related docs, and sure enough, that’s what we get:

Why is this exciting? Because the word ‘physics’ isn’t in any of our results!
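
The same kind of query can be sketched in Python. This is not the course's shell script; it builds a GraphQL ‘nearText’ query and posts it to Weaviate's GraphQL endpoint. The address http://localhost:8080 and the lower-case property names (‘category’, ‘question’, ‘answer’) are assumptions that must match how your data was loaded.

```python
import json
import urllib.request

def build_near_text_query(class_name, concepts, fields, limit=2):
    """Build a GraphQL 'Get' query that ranks objects by similarity to `concepts`."""
    concept_list = json.dumps(list(concepts))  # e.g. ["physics"]
    field_block = " ".join(fields)
    return "{ Get { %s(nearText: {concepts: %s}, limit: %d) { %s } } }" % (
        class_name, concept_list, limit, field_block)

def run_query(query, weaviate_url="http://localhost:8080"):
    """POST the query to Weaviate's GraphQL endpoint and return the parsed JSON."""
    body = json.dumps({"query": query}).encode("utf-8")
    request = urllib.request.Request(
        weaviate_url + "/v1/graphql",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as resp:
        return json.load(resp)

query = build_near_text_query(
    "SimSearch", ["physics"], ["category", "question", "answer"])
print(query)
```

With the containers running and data loaded, `run_query(query)` returns the matching documents as JSON. The `concepts` list may hold several items, which is how multi-keyword queries are expressed.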

Now it’s your turn:

• first, MODIFY the contents of data.json, to replace the 10 docs in it with your own data, where you’d replace (“Category”, “Question”, “Answer”)
with ANYTHING you like, e.g. (“Author”, “Book”, “Summary”), (“MusicGenre”, “SongTitle”, “Artist”), (“School”, “CourseName”, “CourseDesc”), etc.
HAVE fun coming up with this! You can certainly add more docs, e.g. have 20 of them instead of 10
• next, MODIFY the query keyword(s) in the query .sh file, e.g. you can query for ‘computer science’ courses, ‘female’ singer, ‘American’ books,
[‘Indian’, ‘Chinese’] food dishes (the query list can contain multiple items), etc. Like in the above screenshot, ‘cat’ the query, then run it, and get a
screenshot to submit. BE SURE to also modify the data loader .py script, to put in your keys (instead of (“Category”, “Question”, “Answer”))
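
When you swap in your own keys, a quick consistency check can catch a doc with a missing or misspelled key before you load it. The sketch below is one hypothetical way to do that; the two sample docs are made up, and the second one is deliberately incomplete.

```python
import json

def check_docs(docs, expected_keys):
    """Return a list of problems; empty means every doc has exactly the expected keys."""
    problems = []
    expected = set(expected_keys)
    for i, doc in enumerate(docs):
        if set(doc) != expected:
            problems.append(
                "doc %d: keys %s != %s" % (i, sorted(doc), sorted(expected)))
    return problems

# Made-up sample using ("School", "CourseName", "CourseDesc") keys
docs = json.loads("""[
  {"School": "Example U", "CourseName": "Intro to Search", "CourseDesc": "Indexing and ranking basics"},
  {"School": "Example U", "CourseName": "Databases"}
]""")
print(check_docs(docs, ["School", "CourseName", "CourseDesc"]))
```

For your real file, replace the inline sample with `json.load(open("data.json"))` and pass your own three keys.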
That’s it, you’re done w/ the HW 🙂 In real life you will have a .json file (or data in other formats) with BILLIONS of items! Later, do
feel free to play with bigger JSON files, e.g. a larger Jeopardy JSON file 🙂

Here are two more things you can do via ‘curl’ (you can also open the same URLs in your browser).

Weaviate has a cloud version too; you can try that as an alternative to using the Dockerized version 🙂
Also, for fun, see if you can print the raw vectors for the data (the 10 docs)…

What to submit

• your data.json that contains the data (10 docs) you put in
• a screenshot of the ‘cat’ of your query and the results

Alternative (!) submission [omg]
You can just submit a README.txt file that notes why you didn’t/couldn’t do the HW.
“Wait, WHAT?” Turns out THIS HW IS OPTIONAL!!! You will get the full 10 points for this HW, regardless of what you submit – but
you DO need to submit something – either the .json+screenshot combo, or a README. Such a deal! For your own benefit, it’s
worth doing the HW of course – you’ll get first-hand experience using a vector DB (Weaviate is a worthy alternative to Pinecone
btw); but if you aren’t able, we understand (hope you’ll do it after the course!) 🙂

Getting help

There is a hw5 ‘forum’ on Piazza, for you to post questions/answers. You can also meet w/ the TAs, CPs, or me.
Have fun! This is a really useful piece of tech to know. Vector DBs are sure to be used more and more in the near future, as a way to
provide ‘infinite external runtime memory’ for pretrained LLMs.