Forum OpenACS Development: Re: Semantic Search in OpenACS

25: Re: Semantic Search in OpenACS (response to 22)

Posted by Neophytos Demetriou on 08/02/23 09:28 AM

One thing to note here is that the quality of the results depends on the language model being used. I used a very simplistic language model in the demo because of its size (20MB). In production, I would use a better model like all-mpnet-base-v2 (200MB). Please let me know if you need instructions on how to download and convert it so that it can be used. I can also add a parameter to the two drivers so that you can set up a different model easily.

The other thing to note is that the results from pgvector-driver were poor in the demo after I added an index on the table (it does approximate search in that case). So, I decided to switch the VectorDriver parameter in the search package to pgembedding-driver, which now produces better results. In other words, the demo now uses pgembedding-driver for vector similarity search.

Finally, you cannot switch to a different language model after data has been indexed without migrating to the new language model. You have to make a choice from the beginning and go with it.
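To illustrate why a model switch needs a migration, here is a minimal sketch in Python. Everything in it is hypothetical: the two toy "models", the dictionary used as an index, and the function names are invented for illustration and are not part of tbert or either driver. The point is only that vectors produced by one model cannot be compared with vectors from another, so every stored vector has to be recomputed.

```python
# Two hypothetical "models" that produce vectors of different sizes.
# Real language models differ in the same way: their vectors are not comparable.
def embed_with_model_a(text):
    return [float(len(text)), 1.0]

def embed_with_model_b(text):
    return [float(text.count("a")), float(text.count("e")), 1.0]

def build_index(titles, embed, model_name):
    """Compute and store one vector per title, remembering which model made them."""
    return {"model": model_name, "vectors": {t: embed(t) for t in titles}}

def check_query_model(index, model_name):
    """Refuse to search an index with a model other than the one that built it."""
    if index["model"] != model_name:
        raise ValueError(
            f"index was built with {index['model']}, not {model_name}; migrate first"
        )

def migrate(index, embed, model_name):
    """Migration here simply means recomputing every stored vector with the new model."""
    return build_index(list(index["vectors"]), embed, model_name)

old_index = build_index(["alpha", "beta"], embed_with_model_a, "model-a")
new_index = migrate(old_index, embed_with_model_b, "model-b")
check_query_model(new_index, "model-b")  # passes only after the migration
```

In this sketch the cost of a migration grows with the number of indexed items, which is why it pays to pick the model up front.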

26: Re: Semantic Search in OpenACS (response to 25)

Posted by Adrian Ferenc on 08/03/23 10:16 PM

Instructions on how to convert the better model would be great. Any kind of documentation about what is happening and where would be much appreciated.

Also, I finally got access to the docker container today. I was working with my colleague (Dr. Yuen). We were able to build the image after the change you made, but when trying to run the container we weren't able to access it outside of the container itself. We added this line:

RUN sed -i 's/127.0.0.1/0.0.0.0/g' /usr/local/ns/config-oacs-5-10-0.tcl

to the Dockerfile so that the server would listen on the IP address Docker assigns. That sed command, or its equivalent in the config file, may need to be refined. To run it, we used:

docker run -d -p 8000:8000 pgvector-driver:latest

For reference, I am working with macOS and I believe my colleague is working on Windows.

Oh, also, at one point we accidentally were trying to build using the dockerfile in the pgembedding-driver directory and the build was failing. I'm not sure how concerned you are with that, but I thought I'd bring it to your attention just in case. Again, thanks for your help.

27: Re: Semantic Search in OpenACS (response to 26)

Posted by Neophytos Demetriou on 08/03/23 10:46 PM

Instructions on how to convert the better model would be great. Any kind of documentation about what is happening and where would be much appreciated.

I will try to write up something over the weekend. In the meantime, you might want to check out this video on the way they are trained as part of a language model: https://developers.google.com/machine-learning/crash-course/embeddings/video-lecture

In short, embeddings encode semantic relations (they bring words with related meanings close together). So the vector of a word will be very close in distance (e.g. Euclidean, cosine) to the vector of a similar word. For example, cat and dog are similar in at least one dimension, i.e. they are both animals.
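As a rough illustration of "close in distance", here is a small Python sketch. The three-dimensional vectors are made up by hand for this example and do not come from any language model; real embeddings have hundreds of dimensions. The sketch only shows how the two distance measures mentioned above are computed.

```python
import math

def euclidean_distance(a, b):
    """Straight-line distance between two vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine_distance(a, b):
    """One minus the cosine of the angle between the two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

# Hypothetical vectors, invented for illustration only.
toy_vectors = {
    "cat": [0.9, 0.8, 0.1],
    "dog": [0.8, 0.9, 0.2],
    "car": [0.1, 0.2, 0.9],
}

for other in ("dog", "car"):
    print(
        "cat vs", other,
        "euclidean:", round(euclidean_distance(toy_vectors["cat"], toy_vectors[other]), 3),
        "cosine:", round(cosine_distance(toy_vectors["cat"], toy_vectors[other]), 3),
    )
```

With these invented numbers, cat ends up much closer to dog than to car under both measures, which is the behaviour a trained model is meant to produce for real words.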

This is done when the language model is trained via a neural network. tbert is based on bert.cpp, which performs inference for the BERT neural network architecture with pooling and normalization from SentenceTransformers (https://www.sbert.net/ - this is what I used in Python). tbert computes the embedding vector based on a language model. There are lots of them on Hugging Face.

When a title is indexed, pgvector-driver and pgembedding-driver ask tbert to compute the vector based on the language model that is used, and the result is stored in pgvector or pgembedding columns in the database. Upon search, tbert again computes the vector of the user's query and then asks pgvector or pgembedding to rank the stored vectors by similarity to it (basically Euclidean distance between the vectors in both of these drivers).
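The index-then-search flow described above can be sketched in a few lines of Python. This is not the code of either driver and not the tbert API: toy_embed is a hypothetical stand-in that just counts words from a tiny fixed vocabulary and normalizes the result, and an in-memory list plays the role of the database column. It captures no real semantics; it only shows the two steps (store a vector at indexing time, rank by Euclidean distance at query time).

```python
import math

# Hypothetical vocabulary for the stand-in embedding function.
VOCABULARY = ["cat", "dog", "pet", "car", "engine", "road"]

def toy_embed(text):
    """Stand-in for the language model: normalized word counts over VOCABULARY."""
    words = text.lower().split()
    counts = [float(words.count(term)) for term in VOCABULARY]
    norm = math.sqrt(sum(c * c for c in counts))
    return [c / norm for c in counts] if norm > 0 else counts

def euclidean_distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

class ToyVectorIndex:
    """In-memory stand-in for a table with a vector column."""

    def __init__(self):
        self.rows = []  # list of (title, vector) pairs

    def index(self, title):
        # Indexing: compute the vector once and store it next to the title.
        self.rows.append((title, toy_embed(title)))

    def search(self, query, limit=3):
        # Searching: embed the query, then rank stored rows by distance to it.
        query_vector = toy_embed(query)
        ranked = sorted(
            self.rows, key=lambda row: euclidean_distance(query_vector, row[1])
        )
        return [title for title, _ in ranked[:limit]]

index = ToyVectorIndex()
for title in ["dog and cat pet care", "car engine repair", "road trip by car"]:
    index.index(title)

results = index.search("pet dog")
print(results[0])
```

Because the vectors in this sketch are normalized to unit length, ranking by Euclidean distance gives the same order as ranking by cosine similarity.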

We were able to build the image after the change you made, but when trying to run the container we weren't able to access it outside of the container itself. We added this line: RUN sed -i 's/127.0.0.1/0.0.0.0/g' /usr/local/ns/config-oacs-5-10-0.tcl

Will check it out and make the change. Thanks.

Oh, also, at one point we accidentally were trying to build using the dockerfile in the pgembedding-driver directory and the build was failing.

I forgot I had it there as well. I was updating the one in openacs-packages and copying to pgvector-driver. Fixed in pgembedding-driver as well.

28: Re: Semantic Search in OpenACS (response to 27)

Posted by Adrian Ferenc on 08/04/23 08:24 PM

Thank you. That video and your explanation were very helpful. I hope that in what you write up you can also explain or point to the code of the implementation, for example the steps that go from making a query in OpenACS to creating an embedding with tbert to querying the db with the computed vector.

30: Re: Semantic Search in OpenACS (response to 28)

Posted by Neophytos Demetriou on 08/06/23 10:16 AM

Here is the document I promised: Semantic Search with tBERT

Looking forward to improving it based on your feedback.

31: Re: Semantic Search in OpenACS (response to 30)

Posted by Adrian Ferenc on 08/07/23 06:06 PM

Thank you so much! I am very excited to look through it.

33: Re: Semantic Search in OpenACS (response to 31)

Posted by Neophytos Demetriou on 08/07/23 09:31 PM

Hi Adrian, thanks for being so kind. If I can elaborate on anything either in the document or here, please do not hesitate to let me know.