Building Question Similarity Search using Vector DB

Stack Overflow Posts, Sentence Embedding, Vector DB Milvus

Feb 11, 2024



Have you ever had a programming question and wanted to ask the Stack Overflow community, only to later discover that a similar question had already been answered? This common scenario leads to duplicated posts, making it harder for both users and contributors to navigate the forum efficiently.

In this tutorial, we'll explore how to create a question similarity system based on Stack Overflow question posts using Vector DBs. Imagine you're a beginner programmer who encounters an error and wants to ask the community for help. With this system, as you type your question's title, a list of similar questions with relevant answers appears. This not only saves you from creating a duplicate post, but also reduces the burden on contributors who would otherwise have to check for existing answers.

By leveraging text embeddings and Vector DBs, this system enables smarter question management and enhances the overall user experience on Stack Overflow. Let's dive into the details and learn how to implement this solution!

There are four steps to building a question vector database:

1. Get the sample dataset from BigQuery
2. Create the database schema using Milvus
3. Transform each post title into an embedding
4. Input a question and find the most similar posts
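Before wiring up real tools, it may help to see the core idea of the last step in isolation. The sketch below is a plain-Python stand-in, not the Milvus-based implementation this tutorial builds: the three-dimensional vectors are made up for illustration (a real sentence-embedding model produces vectors with hundreds of dimensions), and `cosine_similarity` and `top_k_similar` are hypothetical helper names.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # a zero vector has no direction to compare
    return dot / (norm_a * norm_b)


def top_k_similar(query_vec, indexed, k=2):
    """Return the k (title, score) pairs most similar to query_vec."""
    scored = [
        (title, cosine_similarity(query_vec, vec))
        for title, vec in indexed.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Made-up toy embeddings (assumption: not produced by any real model).
toy_index = {
    "How do I reverse a list in Python?": [0.9, 0.1, 0.0],
    "Reverse an array in Python": [0.8, 0.2, 0.1],
    "How to center a div with CSS?": [0.0, 0.1, 0.9],
}

# Pretend this is the embedding of the title the user is typing.
query = [0.85, 0.15, 0.05]

results = top_k_similar(query, toy_index, k=2)
for title, score in results:
    print(f"{score:.3f}  {title}")
```

A vector database such as Milvus plays the role of `top_k_similar` here, but over millions of stored vectors and with index structures that avoid comparing the query against every row.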

Let’s get into it.

Get the sample dataset from BigQuery

Load the BigQuery Python SDK and authenticate.

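import pandas as pd
from google.cloud import bigquery

# With no arguments, the client picks up Application Default Credentials
# and the default project from the environment.
client = bigquery.Client()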
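With a client in hand, a sample of question titles can be pulled from the public Stack Overflow dataset. The sketch below is one possible query, not necessarily the one used in this tutorial: the column selection, the `WHERE` filter, the default limit, and the helper names `build_sample_query` and `fetch_sample_questions` are all assumptions. It takes the `client` created above as an argument and needs pandas installed for `to_dataframe()`.

```python
def build_sample_query(limit=1000):
    """Build a SQL string selecting a sample of question posts.

    Assumption: the public table
    `bigquery-public-data.stackoverflow.posts_questions` is the source.
    """
    return (
        "SELECT id, title, tags, creation_date\n"
        "FROM `bigquery-public-data.stackoverflow.posts_questions`\n"
        "WHERE title IS NOT NULL\n"
        f"LIMIT {int(limit)}"  # int() guards against non-numeric input
    )


def fetch_sample_questions(client, limit=1000):
    """Run the sample query and return the rows as a pandas DataFrame.

    `client` is expected to be a google.cloud.bigquery.Client.
    """
    query_job = client.query(build_sample_query(limit))
    return query_job.to_dataframe()


# Inspect the SQL without touching the network.
print(build_sample_query(limit=5))
```

Calling `fetch_sample_questions(client, limit=1000)` would then return a DataFrame with one row per question. Keeping the limit small while experimenting keeps the result easy to handle locally.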
