The Beep

Multimodal Search Using Vector DB

Image, Text, CLIP, Vector Database, All in one
Andreas ∙ Feb 25, 2024

[Photo: spiral concrete staircase, by Tine Ivanič on Unsplash]

In today's information-driven world, search capabilities have become increasingly advanced, enabling users to retrieve relevant results from vast amounts of data. One such advancement is multimodal search, which lets users search across multiple types of data, such as text, images (Google image search), and audio (Shazam music recognition). A key building block for this kind of search functionality is a vector database, which lets us build our own search engine by efficiently storing and querying high-dimensional vector representations of data.

What is an embedding?

In deep learning, an embedding is a way to represent information, like words or images, in a numerical format that a computer can understand. It is like translating human language or pictures into a code that machines can use to learn and make decisions. For example, in natural language processing, words can be turned into embeddings using techniques like Word2Vec or GloVe. These embeddings capture the meaning of words in a way that computers can process, helping machines understand how words relate to each other and learn from them more effectively. However, embeddings are not limited to text; they can also be used for images and audio.

In short, embeddings are a key part of deep learning that helps machines make sense of complex information, allowing them to learn and improve their performance in tasks like image recognition, language translation, and more.
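To make the idea concrete, here is a minimal sketch of how embeddings enable similarity comparisons. The vectors below are hypothetical four-dimensional values chosen for illustration; real embeddings come from a trained model such as Word2Vec and are typically hundreds of dimensions.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity: close to 1.0 for vectors pointing the same way."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical embeddings -- real ones would come from a trained model.
king  = np.array([0.90, 0.80, 0.10, 0.20])
queen = np.array([0.88, 0.82, 0.15, 0.25])
apple = np.array([0.10, 0.20, 0.90, 0.70])

print(cosine_similarity(king, queen))  # high: related words lie close together
print(cosine_similarity(king, apple))  # much lower: unrelated words lie far apart
```

Because semantically related items end up near each other in this space, "search" reduces to finding the stored vectors closest to a query vector.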

Types of Embeddings

As mentioned before, there are several kinds of embeddings, depending on the form of the source data.

Text embeddings map words or subwords, usually called tokens, to vectors that represent the meaning and context of each token. Before embeddings were adopted for text machine learning, text was typically represented with one-hot encoding or bag-of-words, usually at the document level. Text embeddings power many deep learning applications, such as question answering, text summarization, and machine translation.
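The difference between the older representations and embeddings can be shown in a few lines. The dense vectors below are made-up two-dimensional values for illustration; a real model would learn them from data.

```python
import numpy as np

vocab = ["cat", "dog", "car"]

# One-hot encoding: each token is a sparse vector the size of the vocabulary,
# and every pair of distinct words is equally unrelated (dot product 0).
one_hot = {w: np.eye(len(vocab))[i] for i, w in enumerate(vocab)}
print(one_hot["cat"] @ one_hot["dog"])  # 0.0 -- no notion of similarity

# Dense embeddings (hypothetical values) place related words close together.
dense = {
    "cat": np.array([0.8, 0.6]),
    "dog": np.array([0.7, 0.7]),
    "car": np.array([-0.9, 0.1]),
}
print(dense["cat"] @ dense["dog"])  # positive: cat and dog are related
print(dense["cat"] @ dense["car"])  # negative: cat and car are not
```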

Similarly, an image embedding represents an image as a vector. The raw value of an image is indeed a two- or three-dimensional array of RGB values; however, to capture the semantic objects in it, we need a single long vector that represents the whole image. Applications of image embeddings include face similarity, image retrieval, and image generation.
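A quick sketch of the array-to-vector step. A real image embedding comes from a trained model (a CNN, or CLIP later in this tutorial); the average pooling here is only a crude placeholder showing how a 3-D array collapses into a fixed-size vector.

```python
import numpy as np

# A tiny stand-in for an RGB image: height x width x 3 colour channels.
image = np.random.default_rng(0).random((32, 32, 3))

# Flattening gives one long vector, but it is tied to exact pixel
# positions and carries no semantic meaning.
flat = image.reshape(-1)          # shape (3072,)

# Global average pooling over the spatial dimensions also yields a
# fixed-size vector; a model like CLIP instead outputs e.g. 512 values
# that encode the objects in the image.
pooled = image.mean(axis=(0, 1))  # shape (3,)

print(flat.shape, pooled.shape)
```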

In addition, audio embeddings are vector representations of audio signals that capture their content in a compact form. The vector aims to capture the signal's meaning comprehensively, so that we can build applications on top of it such as speaker recognition and song recognition.

Multimodal Embeddings

Multimodal embeddings combine two or more types of embeddings that share the same vector space: an image of a dog and the text “A dog in a backyard” will have similar vectors. The purpose of multimodal embeddings is to capture semantic similarity between different sources. They are used for text-to-image search, image-to-image search, and more.
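Because text and images live in the same space, text-to-image search is just nearest-neighbour lookup. A minimal sketch with hypothetical three-dimensional vectors (real CLIP vectors are 512-dimensional and come from the model, not by hand):

```python
import numpy as np

def normalize(v: np.ndarray) -> np.ndarray:
    """Scale vectors to unit length so dot product equals cosine similarity."""
    return v / np.linalg.norm(v, axis=-1, keepdims=True)

# Hypothetical image vectors in the shared space (stand-ins for CLIP output).
image_embeddings = normalize(np.array([
    [0.90, 0.10, 0.30],   # photo of a dog in a backyard
    [0.10, 0.90, 0.20],   # photo of a city skyline
]))
captions = ["a dog in a backyard", "a city skyline"]

# Hypothetical text embedding of the query "a dog in a backyard".
query = normalize(np.array([0.85, 0.15, 0.25]))

# Text-to-image search: return the image whose vector is closest to the query.
scores = image_embeddings @ query
best = int(np.argmax(scores))
print(captions[best])  # "a dog in a backyard"
```

A vector database performs this same nearest-neighbour search, but at scale and with approximate indexes instead of a brute-force dot product.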

In this article, I will use Qdrant as the vector database, which will store the image embeddings. The CLIP model is used to generate the embeddings for both images and text, and this tutorial includes some preprocessing.

Prerequisites

  • autogluon

  • torch, torchaudio, torchvision

  • qdrant-client

Install libraries

!pip install -q autogluon
!pip install -q torch torchvision torchaudio
!pip install -q qdrant-client

Import modules

import json
import pandas as pd
from PIL import Image
from autogluon.multimodal import download
from autogluon.multimodal import MultiModalPredictor
import numpy as np

Dataset

Download the COCO dataset images and annotations for image caption generation.

© 2025 Andreas Chandra and Alamsyah Hanz