In this article, we'll show how to use the CLIP model from OpenAI for text-to-image and image-to-image search. We'll also compare the PyTorch model, the FP16 OpenVINO format, and the INT8 OpenVINO format in terms of speed.
Here's a summary of what's covered:
- Using the PyTorch model
- Converting to the OpenVINO FP16 format for a ~1.7x speedup
- Quantizing to INT8 with OpenVINO NNCF for a ~4x speedup
All results reported below are from a 13th Gen Intel® Core™ i5-13420H CPU using OpenVINO 2023.2 and NNCF 2.7.0.
If you'd like to code along, here's a Colab notebook with all the code you need to get started!
CLIP from OpenAI
CLIP (Contrastive Language–Image Pre-training) is a multimodal neural network, meaning it can process both images and text. This capability allows it to embed both kinds of input into a shared multimodal space, where the positions of images and text carry semantic meaning regardless of their format: an image and a caption describing the same thing land close together.
The following image presents a visualization of the pre-training procedure.

OpenVINO by Intel
OpenVINO is a free toolkit from Intel for optimizing deep learning models from a variety of frameworks and deploying them with an inference runtime on Intel hardware. We'll run the CLIP model in both the FP16 and INT8 OpenVINO formats.
This post demonstrates how to use OpenVINO to accelerate an embedding pipeline in LanceDB.
Implementation
In this section, we walk through a comparative implementation of the CLIP model in its Hugging Face and OpenVINO formats, using the Conceptual Captions dataset.
We start by loading the Conceptual Captions dataset from Hugging Face and selecting a sample of 100 images from it.
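As a rough sketch, the sampling step might look like the following; the dataset id `google-research-datasets/conceptual_captions` and its `caption`/`image_url` columns are assumptions based on the Hugging Face hub listing, and streaming avoids downloading the full dataset:

```python
def take_sample(rows, n=100):
    """Collect the first n rows from an (iterable) dataset split."""
    sample = []
    for row in rows:
        if len(sample) >= n:
            break
        sample.append(row)
    return sample

if __name__ == "__main__":
    from datasets import load_dataset  # pip install datasets

    # Stream the split so we never download the full dataset.
    ds = load_dataset("google-research-datasets/conceptual_captions",
                      split="train", streaming=True)
    sample = take_sample(ds, n=100)
    print(sample[0]["caption"], sample[0]["image_url"])
```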
Next, we define helper functions to validate the image URLs and to fetch the images and captions from each URL.
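Here's one way those helpers might look; `is_valid_url` is a cheap structural check, and `fetch_image` (a hypothetical name) downloads the image with `requests` and decodes it with Pillow:

```python
import urllib.parse

def is_valid_url(url: str) -> bool:
    """Cheap structural check that a string looks like an http(s) URL."""
    parsed = urllib.parse.urlparse(url)
    return parsed.scheme in ("http", "https") and bool(parsed.netloc)

def fetch_image(url: str, timeout: float = 5.0):
    """Download an image URL into a PIL Image, or return None on failure."""
    import io
    import requests            # pip install requests
    from PIL import Image      # pip install pillow
    try:
        resp = requests.get(url, timeout=timeout)
        resp.raise_for_status()
        return Image.open(io.BytesIO(resp.content)).convert("RGB")
    except Exception:
        # Dead links are common in Conceptual Captions; skip them.
        return None
```

Filtering the 100 sampled rows through these helpers is what leaves the 83 valid images used in the timings below.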
With the dataset prepared, we're ready to run CLIP through both Hugging Face and OpenVINO and compare their speed.
PyTorch CLIP using Hugging Face
We'll start with CLIP using Hugging Face and report the time taken to extract embeddings and search using LanceDB.
Let's write a helper function to extract text and image embeddings:
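A minimal sketch of such helpers, assuming the `openai/clip-vit-base-patch32` checkpoint and the standard `transformers` CLIP API (`get_text_features`/`get_image_features`):

```python
def embed_texts(model, processor, texts):
    """Return text embeddings as a (len(texts), dim) tensor."""
    import torch  # local import keeps the helpers importable without torch
    inputs = processor(text=texts, padding=True, truncation=True,
                       return_tensors="pt")
    with torch.no_grad():
        return model.get_text_features(**inputs)

def embed_images(model, processor, images):
    """Return image embeddings as a (len(images), dim) tensor."""
    import torch
    inputs = processor(images=images, return_tensors="pt")
    with torch.no_grad():
        return model.get_image_features(**inputs)

if __name__ == "__main__":
    from transformers import CLIPModel, CLIPProcessor  # pip install transformers
    model_id = "openai/clip-vit-base-patch32"  # assumed checkpoint
    model = CLIPModel.from_pretrained(model_id)
    processor = CLIPProcessor.from_pretrained(model_id)
    vecs = embed_texts(model, processor, ["a photo of a cat"])
    print(vecs.shape)
```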
Using LanceDB to store and search the embeddings
Extracting the embeddings of the 83 valid images with the Hugging Face CLIP model took 55.79 seconds.
Data ingestion and creating embeddings in LanceDB
Next, we show how to create the embeddings and ingest them into LanceDB.
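One possible shape for the ingestion step, assuming a local database path `./clip_lancedb` and a table named `clip_embeddings` (both illustrative names); `to_records` is a hypothetical helper pairing each vector with its metadata:

```python
def to_records(embeddings, captions, urls):
    """Pair each embedding vector with its caption and image URL."""
    return [
        {"vector": list(map(float, vec)), "caption": cap, "image_url": url}
        for vec, cap, url in zip(embeddings, captions, urls)
    ]

if __name__ == "__main__":
    import lancedb  # pip install lancedb

    db = lancedb.connect("./clip_lancedb")  # local on-disk database
    records = to_records(
        [[0.1, 0.2], [0.3, 0.4]],  # stand-in embeddings
        ["a cat", "a dog"],
        ["https://example.com/cat.jpg", "https://example.com/dog.jpg"],
    )
    table = db.create_table("clip_embeddings", data=records, mode="overwrite")
```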
Query the embeddings
You can easily query the embeddings via similarity in LanceDB as follows:
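A hedged sketch of the query step; the database path and table name are the illustrative ones used in the ingestion sketch, and in the real pipeline the query vector would come from CLIP's text encoder rather than the stand-in below:

```python
def search_similar(table, query_vector, k=3):
    """Nearest-neighbour search over a LanceDB table; returns a DataFrame."""
    return table.search(query_vector).limit(k).to_pandas()

if __name__ == "__main__":
    import lancedb  # pip install lancedb

    db = lancedb.connect("./clip_lancedb")    # assumed local DB path
    table = db.open_table("clip_embeddings")  # assumed table name
    query_vec = [0.0] * 512                   # stand-in for a CLIP text embedding
    print(search_similar(table, query_vec, k=3))
```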
CLIP model using FP16 OpenVINO format
Next, we'll show the results from the same pipeline with the CLIP FP16 OpenVINO format.
Compiling the CLIP OpenVINO model
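A sketch of the conversion and compilation, assuming the `openai/clip-vit-base-patch32` checkpoint; `ov.convert_model` traces the PyTorch model from an example input, and `ov.save_model(..., compress_to_fp16=True)` writes the IR with FP16 weights. The file path `clip-fp16.xml` is an illustrative choice:

```python
def compile_ov_model(xml_path, device="CPU"):
    """Read an OpenVINO IR file and compile it for the target device."""
    import openvino as ov  # pip install openvino
    core = ov.Core()
    return core.compile_model(xml_path, device)

if __name__ == "__main__":
    import torch
    import openvino as ov
    from transformers import CLIPModel

    model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
    # Dummy example input so the converter can trace shapes; CLIP's forward
    # expects token ids, an attention mask, and a batch of pixel values.
    example = {
        "input_ids": torch.ones((1, 10), dtype=torch.long),
        "attention_mask": torch.ones((1, 10), dtype=torch.long),
        "pixel_values": torch.rand((1, 3, 224, 224)),
    }
    ov_model = ov.convert_model(model, example_input=example)
    ov.save_model(ov_model, "clip-fp16.xml", compress_to_fp16=True)
    compiled = compile_ov_model("clip-fp16.xml", device="CPU")
```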
Extracting the embeddings of the same 83 images with the CLIP FP16 OpenVINO model now takes 31.79 seconds, a 43% reduction!
The embeddings can be ingested into LanceDB the same way as before:
We query the embeddings and run search just like before:
NNCF INT8 Quantization
You can also apply 8-bit post-training quantization from NNCF (Neural Network Compression Framework) and run inference on the quantized model with the OpenVINO toolkit.
Here's a helper function to convert the model into INT8 format using NNCF:
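A hedged outline of the quantization step: `nncf.quantize` takes the FP16 IR plus an `nncf.Dataset` of calibration samples. The file paths and the identity `transform_fn` below are placeholders; in the real pipeline the transform mirrors the CLIP processor's preprocessing:

```python
def build_calibration_dataset(samples, transform_fn):
    """Wrap raw samples in an nncf.Dataset for post-training quantization."""
    import nncf  # pip install nncf
    return nncf.Dataset(samples, transform_fn)

if __name__ == "__main__":
    import nncf
    import openvino as ov

    core = ov.Core()
    ov_model = core.read_model("clip-fp16.xml")  # FP16 IR saved earlier (assumed path)

    def transform_fn(item):
        # Placeholder: items are assumed to already be model-input dicts.
        return item

    calibration_items = []  # fill with a few hundred preprocessed input dicts
    calib = build_calibration_dataset(calibration_items, transform_fn)
    quantized = nncf.quantize(ov_model, calib)
    ov.save_model(quantized, "clip-int8.xml")
    compiled_int8 = core.compile_model("clip-int8.xml", "CPU")
```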
Initializing NNCF and Saving the Quantized Model
Compiling the INT8 model and a helper function for extracting features
With the updated pipeline using the CLIP INT8 OpenVINO format, the time taken to extract embeddings of the 83 images is brought down to just 13.70 seconds. That's a 75.4% reduction from the original CLIP model!
We can ingest the embeddings into LanceDB as follows:
We've now shown the performance improvements across all three CLIP model formats: PyTorch from Hugging Face, FP16 OpenVINO, and INT8 OpenVINO.
Conclusions
All of these results were measured on CPU, comparing the PyTorch model with the OpenVINO model formats (FP16 and INT8):
| Format | Time (s) |
|---|---|
| PyTorch model from Hugging Face | 55.26 |
| OpenVINO FP16 format | 31.79 |
| OpenVINO INT8 format | 13.70 |
The FP16 model is 1.73 times faster than the PyTorch model, a relatively modest (yet decent) speedup. Switching to the INT8 OpenVINO format, however, yields a 4.03x speedup over the PyTorch model.
Visit the LanceDB GitHub to learn more about how to work with vector search at scale, and for more such tutorials and demo applications, visit the vectordb-recipes repo. For the latest updates from LanceDB, follow our LinkedIn and X pages.




