Forum OpenACS Development: Re: Semantic Search in OpenACS

Posted by Neophytos Demetriou on
One thing to note here is that the quality of the results depends on the language model being used. I used a very simplistic language model in the demo because of its small size (20MB). In production, I would use a better model like all-mpnet-base-v2 (200MB). Please let me know if you need instructions on how to download and convert it so that it can be used. I can also add a parameter to the two drivers so that you can easily set up a different model.

The other thing to note is that the results from pgvector-driver were poor in the demo after I added an index on the table (pgvector does approximate search in that case). So I switched the VectorDriver parameter in the search package to pgembedding-driver, which produces better results. In other words, the demo now uses pgembedding-driver for vector similarity search.
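To illustrate why an index can hurt result quality, here is a minimal Python sketch of exact versus approximate nearest-neighbour search. It assumes an inverted-list style index (similar in spirit to pgvector's IVFFlat); the thread does not say which index type was used, and the points, centroids and the `probes` parameter name are made up for illustration.

```python
import math

def nearest(query, points):
    """Exact search: compare the query against every point."""
    return min(points, key=lambda p: math.dist(query, p))

def build_lists(points, centroids):
    """Assign each point to the list of its closest centroid."""
    lists = {i: [] for i in range(len(centroids))}
    for p in points:
        closest = min(range(len(centroids)),
                      key=lambda i: math.dist(p, centroids[i]))
        lists[closest].append(p)
    return lists

def approx_nearest(query, centroids, lists, probes=1):
    """Approximate search: only scan the `probes` lists closest to the query."""
    order = sorted(range(len(centroids)),
                   key=lambda i: math.dist(query, centroids[i]))
    candidates = [p for i in order[:probes] for p in lists[i]]
    return nearest(query, candidates)

# Hypothetical 2-D data; real embeddings have hundreds of dimensions.
centroids = [(0.0, 0.0), (10.0, 0.0)]
points = [(4.99, 0.0), (1.0, 0.0), (5.4, 0.0), (9.0, 0.0)]
lists = build_lists(points, centroids)
query = (5.05, 0.0)

exact = nearest(query, points)
approx_one = approx_nearest(query, centroids, lists, probes=1)
approx_two = approx_nearest(query, centroids, lists, probes=2)
```

With one probe the true nearest point is missed because it sits just across the boundary in another list; scanning more lists recovers it at the cost of speed.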

Finally, you cannot switch to a different language model after data has been indexed without migrating the indexed data to the new model, because vectors computed by different models are not comparable. You have to choose a model at the beginning and stick with it.

Posted by Adrian Ferenc on
Instructions on how to convert the better model would be great. Any kind of documentation about what is happening and where would be much appreciated.

Also, I finally got access to the docker container today. I was working with my colleague (Dr. Yuen). We were able to build the image after the change you made, but when trying to run the container we weren't able to access it outside of the container itself. We added this line:

RUN sed -i 's/127.0.0.1/0.0.0.0/g' /usr/local/ns/config-oacs-5-10-0.tcl

to the Dockerfile so it would listen on the IP address Docker assigns. That sed command, or its equivalent in the config file, may need to be refined. Then, to run it, we used

docker run -d -p 8000:8000 pgvector-driver:latest
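
As a rough illustration of why the listen address matters (this is not the NaviServer configuration itself): a server bound to 127.0.0.1 only accepts connections arriving on the container's own loopback interface, while the port published with -p reaches the container through its network interface, so the server has to listen on 0.0.0.0 or on the container's address. A minimal Python sketch, using a hypothetical helper `bound_address` and port 0 so the OS picks any free port:

```python
import socket

def bound_address(host):
    """Bind a TCP socket to `host` on any free port and report the bound IP."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.bind((host, 0))          # port 0: let the OS choose a free port
        return s.getsockname()[0]

# Loopback only: reachable from inside the same host/container.
loopback = bound_address("127.0.0.1")

# All IPv4 interfaces: also reachable through a published Docker port.
all_interfaces = bound_address("0.0.0.0")
```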

For reference, I am working with macOS and I believe my colleague is working on Windows.

Oh, also, at one point we were accidentally trying to build using the Dockerfile in the pgembedding-driver directory and the build was failing. I'm not sure how concerned you are with that, but I thought I'd bring it to your attention just in case. Again, thanks for your help.

Posted by Neophytos Demetriou on
Instructions on how to convert the better model would be great. Any kind of documentation about what is happening and where would be much appreciated.

I will try to write something up over the weekend. In the meantime, you might want to check out this video on how embeddings are trained as part of a language model: https://developers.google.com/machine-learning/crash-course/embeddings/video-lecture

In short, embeddings encode semantic relations: they bring words that are related in meaning close together. So the vector of a word will be very close in distance (e.g. Euclidean, cosine) to the vector of a similar word. For example, cat and dog are similar in at least one dimension, i.e. they are both animals.
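
To make the distance idea concrete, here is a small Python sketch. The three-dimensional vectors are hand-made for illustration and do not come from any real language model; real embeddings have hundreds of dimensions.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

def euclidean_distance(a, b):
    """Straight-line distance between two vectors (0.0 means identical)."""
    return math.dist(a, b)

# Hypothetical toy "embeddings", invented for this example.
vectors = {
    "cat": [0.9, 0.8, 0.1],
    "dog": [0.8, 0.9, 0.2],
    "car": [0.1, 0.2, 0.9],
}

cat_dog_cos = cosine_similarity(vectors["cat"], vectors["dog"])
cat_car_cos = cosine_similarity(vectors["cat"], vectors["car"])
cat_dog_dist = euclidean_distance(vectors["cat"], vectors["dog"])
cat_car_dist = euclidean_distance(vectors["cat"], vectors["car"])
```

Here cat is closer to dog than to car under both measures. When the vectors are normalized to unit length, the two measures always produce the same ranking, since the squared Euclidean distance then equals 2 - 2 * cosine similarity.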

These relations are learned when the language model is trained as a neural network. tbert is based on bert.cpp, which runs inference for the BERT neural network architecture with the pooling and normalization steps from SentenceTransformers (https://www.sbert.net/, which is what I used in Python). tbert computes the embedding vector of a text based on a language model. There are lots of such models on Hugging Face.

When a title is indexed, pgvector-driver and pgembedding-driver ask tbert to compute its vector with the language model in use, and the result is stored in a pgvector or pgembedding column in the database. On search, tbert computes the vector of the user's query in the same way, and the driver then asks pgvector or pgembedding to rank the stored vectors by similarity to it (basically the Euclidean distance between the vectors, in both of these drivers).
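
The flow above can be sketched in a few lines of Python. This is only an illustration under loud assumptions: `embed` is a toy word-count stand-in for tbert, the `index` dictionary stands in for the vector column in the database, and the names `index_title` and `search` are hypothetical, not the drivers' actual API. In pgvector, the ranking step corresponds to ordering by the `<->` (Euclidean distance) operator.

```python
import math

# Toy vocabulary; a real language model learns its own representation.
VOCAB = ["install", "docker", "search", "vector", "forum", "database"]

def embed(text):
    """Toy stand-in for tbert: count vocabulary words, then L2-normalize."""
    words = text.lower().split()
    counts = [float(words.count(w)) for w in VOCAB]
    norm = math.sqrt(sum(c * c for c in counts)) or 1.0
    return [c / norm for c in counts]

index = {}  # stand-in for the table with a vector column

def index_title(object_id, title):
    """Indexing: compute the vector once and store it next to the title."""
    index[object_id] = (title, embed(title))

def search(query, limit=3):
    """Search: embed the query, then rank stored vectors by Euclidean distance."""
    q = embed(query)
    ranked = sorted(index.values(), key=lambda item: math.dist(q, item[1]))
    return [title for title, _ in ranked[:limit]]

index_title(1, "Install OpenACS with Docker")
index_title(2, "Vector search in the database")
index_title(3, "Forum moderation tips")

results = search("docker install problems", limit=2)
```

The same `embed` function must be used for indexing and for queries, which is why the stored vectors have to be recomputed if the language model changes.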

We were able to build the image after the change you made, but when trying to run the container we weren't able to access it outside of the container itself. We added this line: RUN sed -i 's/127.0.0.1/0.0.0.0/g' /usr/local/ns/config-oacs-5-10-0.tcl

Will check it out and make the change. Thanks.

Oh, also, at one point we were accidentally trying to build using the Dockerfile in the pgembedding-driver directory and the build was failing.

I forgot I had it there as well. I was updating the one in openacs-packages and copying to pgvector-driver. Fixed in pgembedding-driver as well.

Posted by Adrian Ferenc on
Thank you. That video and your explanation were very helpful. I hope that in your write-up you can also explain or point to the code of the implementation, for example the steps that go from making a query in OpenACS, to creating an embedding with tbert, to querying the database with the computed vector.

Posted by Neophytos Demetriou on
Here is the document I promised: Semantic Search with tBERT

I look forward to improving it based on your feedback.

Posted by Adrian Ferenc on
Thank you so much! I am very excited to look through it.

Posted by Neophytos Demetriou on
Hi Adrian, thanks for being so kind. If I can elaborate on anything, either in the document or here, please do not hesitate to let me know.