The Beep

Multimodal Search Using Vector DB

Image, Text, CLIP, Vector Database, All in one
Andreas ∙ Feb 25, 2024

[Photo: spiral concrete staircase, by Tine Ivanič on Unsplash]

In today's information-driven world, search capabilities have become increasingly advanced, enabling users to retrieve relevant results from vast amounts of data. One such advancement is multimodal search, which lets users search across multiple types of data, such as text, images (Google image search), and audio (Shazam music recognition). A key building block for this kind of search functionality is a vector database, which lets us build our own search engine by efficiently storing and querying high-dimensional vector representations of data.

What is an embedding?

In deep learning, an embedding is a way to represent information, like words or images, in a numerical format that a computer can understand. It is like translating human language or pictures into a code that machines can use to learn and make decisions. For example, in natural language processing, words can be turned into embeddings using techniques like Word2Vec or GloVe. These embeddings capture the meaning of words in a way that computers can process, helping machines understand how words relate to each other and learn from them more effectively. However, embeddings are not limited to text; they can also be used for images and audio.

In short, embeddings are a key part of deep learning that helps machines make sense of complex information, allowing them to learn and improve their performance in tasks like image recognition, language translation, and more.
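To make the idea concrete, here is a minimal sketch of how embeddings enable similarity comparisons. The vectors below are hypothetical four-dimensional values chosen for illustration; real embeddings come from a trained model such as Word2Vec and are typically hundreds of dimensions.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity: close to 1.0 for vectors pointing the same way."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical embeddings -- real ones would come from a trained model.
king  = np.array([0.90, 0.80, 0.10, 0.20])
queen = np.array([0.88, 0.82, 0.15, 0.25])
apple = np.array([0.10, 0.20, 0.90, 0.70])

print(cosine_similarity(king, queen))  # high: related words lie close together
print(cosine_similarity(king, apple))  # much lower: unrelated words lie far apart
```

Because semantically related items end up near each other in this space, "search" reduces to finding the stored vectors closest to a query vector.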

Types of Embeddings

As mentioned before, there are several kinds of embeddings, depending on the form of the source data.

Text embeddings map words or subwords, usually called tokens, to vectors that represent the meaning and context of each token. Before embeddings were adopted for text machine learning, text was typically represented with one-hot encoding or bag-of-words, usually at the document level. Text embeddings power many deep learning applications, such as question answering, text summarization, and machine translation.
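The difference between the older representations and embeddings can be shown in a few lines. The dense vectors below are made-up two-dimensional values for illustration; a real model would learn them from data.

```python
import numpy as np

vocab = ["cat", "dog", "car"]

# One-hot encoding: each token is a sparse vector the size of the vocabulary,
# and every pair of distinct words is equally unrelated (dot product 0).
one_hot = {w: np.eye(len(vocab))[i] for i, w in enumerate(vocab)}
print(one_hot["cat"] @ one_hot["dog"])  # 0.0 -- no notion of similarity

# Dense embeddings (hypothetical values) place related words close together.
dense = {
    "cat": np.array([0.8, 0.6]),
    "dog": np.array([0.7, 0.7]),
    "car": np.array([-0.9, 0.1]),
}
print(dense["cat"] @ dense["dog"])  # positive: cat and dog are related
print(dense["cat"] @ dense["car"])  # negative: cat and car are not
```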

Similarly, an image embedding represents an image as a vector. The raw value of an image is indeed a two- or three-dimensional array of RGB values; however, to capture the semantic objects in it, we need a single long vector that represents the whole image. Applications of image embeddings include face similarity, image retrieval, and image generation.
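A quick sketch of the array-to-vector step. A real image embedding comes from a trained model (a CNN, or CLIP later in this tutorial); the average pooling here is only a crude placeholder showing how a 3-D array collapses into a fixed-size vector.

```python
import numpy as np

# A tiny stand-in for an RGB image: height x width x 3 colour channels.
image = np.random.default_rng(0).random((32, 32, 3))

# Flattening gives one long vector, but it is tied to exact pixel
# positions and carries no semantic meaning.
flat = image.reshape(-1)          # shape (3072,)

# Global average pooling over the spatial dimensions also yields a
# fixed-size vector; a model like CLIP instead outputs e.g. 512 values
# that encode the objects in the image.
pooled = image.mean(axis=(0, 1))  # shape (3,)

print(flat.shape, pooled.shape)
```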

In addition, audio embeddings are vector representations of audio signals that capture their content in a compact form. The vector aims to capture the signal's meaning comprehensively, so that we can build applications on top of it such as speaker recognition and song recognition.

Multimodal Embeddings

Multimodal embeddings combine two or more types of embeddings that share the same vector space: an image of a dog and the text “A dog in a backyard” will have similar vectors. The purpose of multimodal embeddings is to capture semantic similarity between different sources. They are used for text-to-image search, image-to-image search, and more.
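Because text and images live in the same space, text-to-image search is just nearest-neighbour lookup. A minimal sketch with hypothetical three-dimensional vectors (real CLIP vectors are 512-dimensional and come from the model, not by hand):

```python
import numpy as np

def normalize(v: np.ndarray) -> np.ndarray:
    """Scale vectors to unit length so dot product equals cosine similarity."""
    return v / np.linalg.norm(v, axis=-1, keepdims=True)

# Hypothetical image vectors in the shared space (stand-ins for CLIP output).
image_embeddings = normalize(np.array([
    [0.90, 0.10, 0.30],   # photo of a dog in a backyard
    [0.10, 0.90, 0.20],   # photo of a city skyline
]))
captions = ["a dog in a backyard", "a city skyline"]

# Hypothetical text embedding of the query "a dog in a backyard".
query = normalize(np.array([0.85, 0.15, 0.25]))

# Text-to-image search: return the image whose vector is closest to the query.
scores = image_embeddings @ query
best = int(np.argmax(scores))
print(captions[best])  # "a dog in a backyard"
```

A vector database performs this same nearest-neighbour search, but at scale and with approximate indexes instead of a brute-force dot product.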

In this article, I will use Qdrant as the vector database, which will store the image embeddings. The CLIP model is used to generate the embeddings for both images and text, and this tutorial includes some preprocessing.

Prerequisites

  • autogluon

  • torch, torchaudio, torchvision

  • qdrant-client

Install libraries

!pip install -q autogluon
!pip install -q torch torchvision torchaudio
!pip install -q qdrant-client

Import modules

import json
import pandas as pd
from PIL import Image
from autogluon.multimodal import download
from autogluon.multimodal import MultiModalPredictor
import numpy as np

Dataset

Download the COCO dataset images and annotations for image caption generation.

© 2025 Andreas Chandra and Alamsyah Hanz