The Beep

Building Question Similarity Search using Vector DB

Stack Overflow Posts, Sentence Embedding, Vector DB Milvus

Andreas
Feb 11, 2024



Have you ever had a programming question and wanted to ask the Stack Overflow community, only to later discover that a similar question had already been answered? This common scenario leads to duplicated posts, making it harder for both users and contributors to navigate the forum efficiently.

In this tutorial, we'll explore how to create a question similarity system based on Stack Overflow question posts using Vector DBs. Imagine you're a beginner programmer who encounters an error and wants to ask the community for help. With this system, as you type your question's title, a list of similar questions with relevant answers appears. This not only saves you from creating a duplicate post, but also reduces the burden on contributors who would otherwise have to check for existing answers.

By leveraging text embeddings and Vector DBs, this system enables smarter question management and enhances the overall user experience on Stack Overflow. Let's dive into the details and learn how to implement this solution!

Building the question vector database takes four steps:

  1. Get the sample dataset from BigQuery

  2. Create database schema using Milvus

  3. Transform each post title into an embedding

  4. Input questions and find the similarity
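
As a rough illustration of what steps 3 and 4 do, the sketch below runs a similarity search in plain Python. It is not the tutorial's pipeline: the `embed` function is a toy bag-of-words stand-in for a real sentence embedding model, the in-memory list stands in for a Milvus collection, and the sample titles and every function name are hypothetical. The underlying idea is the same, though: turn each title into a vector, then rank stored vectors by cosine similarity to the query vector.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_vocabulary(titles):
    """Collect every distinct token across the titles, in sorted order."""
    words = set()
    for title in titles:
        words.update(tokenize(title))
    return sorted(words)


def embed(title, vocabulary):
    """Toy embedding (assumption): one word count per vocabulary entry.

    A real system would call a sentence embedding model here instead.
    """
    counts = Counter(tokenize(title))
    return [float(counts[word]) for word in vocabulary]


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def search(query, titles, top_k=3):
    """Return the top_k (score, title) pairs most similar to the query."""
    vocabulary = build_vocabulary(titles)
    # The "index": every stored title alongside its vector.
    index = [(title, embed(title, vocabulary)) for title in titles]
    query_vector = embed(query, vocabulary)
    scored = [(cosine_similarity(query_vector, vec), title) for title, vec in index]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]


# Hypothetical sample titles, for illustration only.
TITLES = [
    "How do I merge two dictionaries in Python?",
    "What does the yield keyword do in Python?",
    "How to undo the most recent local commits in Git?",
]

results = search("merge two dictionaries python", TITLES, top_k=2)
for score, title in results:
    print(f"{score:.3f}  {title}")
```

A vector database does the same ranking, but with an index built for fast approximate search over millions of vectors instead of a linear scan.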

Let’s get into it.

Get the sample dataset from BigQuery

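Load the BigQuery Python SDK and authenticate.

import pandas as pd
from google.cloud import bigquery

# With no arguments, the client picks up Application Default Credentials
# and the default project from the environment.
client = bigquery.Client()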
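
With a client in hand, pulling a sample of question titles might look like the sketch below. This is an assumption about the next step, not the tutorial's own query: it assumes the public `bigquery-public-data.stackoverflow.posts_questions` table with its `id` and `title` columns, and the helper names `build_sample_query` and `fetch_sample` are hypothetical.

```python
def build_sample_query(limit=1000):
    """Build a SQL string that selects question ids and titles.

    The table name is an assumption (the public Stack Overflow dataset);
    swap in whichever table you actually sample from.
    """
    if isinstance(limit, bool) or not isinstance(limit, int) or limit <= 0:
        raise ValueError("limit must be a positive integer")
    return (
        "SELECT id, title\n"
        "FROM `bigquery-public-data.stackoverflow.posts_questions`\n"
        "WHERE title IS NOT NULL\n"
        f"LIMIT {limit}"
    )


def fetch_sample(client, limit=1000):
    """Run the query with a google.cloud.bigquery.Client.

    Returns a pandas DataFrame; requires the pandas extras of the
    BigQuery client library to be installed.
    """
    return client.query(build_sample_query(limit)).to_dataframe()


# Inspect the SQL without touching the network.
print(build_sample_query(5))
```

Calling `fetch_sample(client, 1000)` with the client created above would then return the titles as a DataFrame, ready for the embedding step.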

© 2025 Andreas Chandra and Alamsyah Hanz