How to Build a Local Multimodal Search Engine (Whisper, Florence-2 & MiniLM)

A modern local hard drive contains a huge variety of files: PDFs, Word documents, spreadsheets, screenshots of receipts, photos of whiteboards, and recordings of Zoom meetings or webinars.

If you try to search this digital archive using native OS tools, you run into a brick wall. Traditional file search is keyword-based (lexical) and blind to media. Unless you have meticulously named your files or manually tagged them, a screenshot of an invoice or a video of a presentation is completely invisible.

To solve this, we can build a Local Multimodal Search Engine. By combining open-source AI models, we can translate documents, images, and video transcripts into a single mathematical vector space where files can be searched by their conceptual meaning.

Here is an architectural blueprint and step-by-step guide to building it locally, so that it runs fully offline once the model weights are on disk.

The Multimodal Pipeline Architecture

A multimodal search engine needs to handle three distinct types of content: text documents, images/screenshots, and audio/video files.

Instead of building a massive, slow end-to-end multimodal network, a highly efficient and practical approach is to use a multimodal-to-text pipeline. We translate media modalities into text representations first, and then map all text into a unified embedding space using a single, fast text-embedding model.
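As a rough sketch of that routing step, the dispatcher below picks a pipeline branch from the file extension. The extension sets and the `modality_for` name are illustrative assumptions, not part of any library. Formats such as PDF or Word documents would need their own text extractor before they could join the text branch, so they are left out here.

```python
from pathlib import Path

# Hypothetical extension groups; extend these to match your own archive.
TEXT_EXTS = {".txt", ".md", ".csv"}
IMAGE_EXTS = {".png", ".jpg", ".jpeg", ".webp"}
MEDIA_EXTS = {".mp3", ".wav", ".m4a", ".mp4", ".mov"}


def modality_for(path: str) -> str:
    """Return which branch of the pipeline should turn this file into text."""
    ext = Path(path).suffix.lower()
    if ext in TEXT_EXTS:
        return "text"    # already text: read it directly
    if ext in IMAGE_EXTS:
        return "image"   # caption and OCR with a vision-language model
    if ext in MEDIA_EXTS:
        return "media"   # transcribe with a speech-to-text model
    return "skip"        # unknown or unsupported format
```

Every branch ends in plain text, so supporting a new file type only means adding one more extractor in front of the same embedding model.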

For our pipeline, we will combine three specialized local models:

1. OpenAI Whisper (Speech-to-Text): Processes audio and video files to generate text transcripts with timestamps.
2. Microsoft Florence-2 (Vision-Language): Analyzes images and screenshots to generate descriptive text captions, tags, and OCR outputs.
3. MiniLM (Text Embeddings): Converts all the generated text (from documents, transcripts, and image captions) into 384-dimensional vector embeddings.

All files remain secure and indexed locally on your machine.

Why Use Florence-2 + MiniLM Instead of CLIP?

You might wonder why we don't just use a direct vision-language embedding model like CLIP. While CLIP can map images and text to the same vector space, it has significant limitations:

Poor Document Context: CLIP only supports very short text inputs (typically max 77 tokens), making it useless for indexing long-form PDFs or transcripts.
Strict Query Nuances: CLIP is optimized for matching image descriptions, not for handling complex conceptual searches or synonyms in documents.
Unified Space Complexity: Aligning different modalities directly often leads to reduced accuracy in text-to-text search.

By using Florence-2 to describe the image in rich text (e.g., "A whiteboard with handwritten flowchart diagrams about database schema") and then embedding that text with MiniLM, we keep everything in a single, high-quality text embedding space. This allows you to query documents, images, and videos using the exact same search model.
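To make the "one search model for everything" point concrete, here is a minimal in-memory sketch. The `Record` class and the `cosine` and `search` functions are hypothetical names introduced for illustration. The vectors in a real index would come from the text embedding model, while the example uses tiny hand-written vectors.

```python
import math
from dataclasses import dataclass


@dataclass
class Record:
    path: str       # file the text came from
    modality: str   # "text", "image", or "media"
    vector: list    # embedding of the extracted text


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def search(query_vector, records, top_k=3):
    """Rank every record, regardless of modality, against one query vector."""
    scored = [(cosine(query_vector, r.vector), r) for r in records]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]
```

Because a caption, a transcript chunk, and a document paragraph are all just text by the time they are embedded, the ranking code never needs to know which modality a hit came from.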

---

Step-by-Step Pseudocode Implementation

Let’s look at the conceptual pseudocode to build this multimodal pipeline locally on your machine.

Step 1: Transcribing Audio and Video with Whisper

We load a local Whisper model to extract spoken text from audio or video files.

```
function transcribe_audio(file_path):
    # Load a lightweight local speech model
    model = LoadSpeechModel("whisper-base")
    # Transcribe the file to text
    result = model.transcribe(file_path)
    return result.text
```
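For readers who want something closer to runnable code, the sketch below assumes the open-source `openai-whisper` Python package (with ffmpeg installed). Its `transcribe` call returns a dictionary with the full `"text"` and a list of timestamped `"segments"`. The `group_segments` helper and its 30-second window are illustrative assumptions: they merge segments into larger chunks so that a search hit can point to a moment in the recording.

```python
def transcribe_with_timestamps(file_path, model_name="base"):
    """Run Whisper locally and return its timestamped segments."""
    import whisper  # lazy import: needs `pip install openai-whisper` and ffmpeg

    model = whisper.load_model(model_name)
    result = model.transcribe(file_path)
    # Each segment is a dict with "start" and "end" (seconds) and "text".
    return result["segments"]


def group_segments(segments, window_seconds=30.0):
    """Merge consecutive segments into chunks of roughly window_seconds each."""
    chunks, current = [], None
    for seg in segments:
        if current is None:
            current = {"start": seg["start"], "end": seg["end"],
                       "text": seg["text"].strip()}
        else:
            current["end"] = seg["end"]
            current["text"] += " " + seg["text"].strip()
        # Close the chunk once it spans the target window.
        if current["end"] - current["start"] >= window_seconds:
            chunks.append(current)
            current = None
    if current is not None:
        chunks.append(current)
    return chunks
```

A typical call would be `group_segments(transcribe_with_timestamps("meeting.mp4"))`, where the file name is a placeholder. Each resulting chunk can be embedded on its own and stored with its start time.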

Step 2: Captioning Images with Florence-2

We load a local vision model (like Florence-2) to perform OCR (read text in receipts) and describe the visual contents of images.

```
function describe_image(image_path):
    # Load a lightweight local vision-language model
    model = LoadVisionModel("florence-2-base")
    # Generate a detailed caption describing visual objects and text
    description = model.generate(image_path, task="detailed_caption")
    return description
```
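A closer-to-real version might look like the sketch below. It follows the usage pattern published on the `microsoft/Florence-2-base` model card for Hugging Face `transformers`, where tasks are selected with prompt tokens such as `<DETAILED_CAPTION>` and `<OCR>`. Exact arguments may vary with your `transformers` version, and the model is reloaded on every call only to keep the example short. The `build_image_text` helper is a hypothetical addition that merges the caption and OCR output into one string for embedding.

```python
def run_florence_task(image_path, task="<DETAILED_CAPTION>"):
    """Run one Florence-2 task prompt on an image and return the text result."""
    from PIL import Image  # lazy imports: need pillow, torch and transformers
    from transformers import AutoModelForCausalLM, AutoProcessor

    name = "microsoft/Florence-2-base"
    # Florence-2 ships custom model code, hence trust_remote_code=True.
    model = AutoModelForCausalLM.from_pretrained(name, trust_remote_code=True)
    processor = AutoProcessor.from_pretrained(name, trust_remote_code=True)

    image = Image.open(image_path).convert("RGB")
    inputs = processor(text=task, images=image, return_tensors="pt")
    generated = model.generate(
        input_ids=inputs["input_ids"],
        pixel_values=inputs["pixel_values"],
        max_new_tokens=256,
    )
    raw = processor.batch_decode(generated, skip_special_tokens=False)[0]
    parsed = processor.post_process_generation(
        raw, task=task, image_size=(image.width, image.height)
    )
    return parsed[task]


def build_image_text(caption, ocr_text):
    """Combine caption and OCR output into one string to embed."""
    parts = [caption.strip()]
    if ocr_text.strip():
        # Collapse line breaks so the OCR text reads as one sentence fragment.
        parts.append("Text in image: " + " ".join(ocr_text.split()))
    return " ".join(p for p in parts if p)
```

Running the model once with `<DETAILED_CAPTION>` and once with `<OCR>`, then passing both results to `build_image_text`, gives a single description in which a receipt is findable both by what it looks like and by what it says.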

Step 3: Generating Vector Embeddings with MiniLM

Now that we can convert any file type into text (documents are already text, videos become transcripts, images become captions), we convert this text into vector embeddings using MiniLM.

```
function get_embedding(text):
    # Load a fast, local text embedding model (produces 384 dimensions)
    embedder = LoadEmbeddingModel("all-MiniLM-L6-v2")
    # Generate the vector representation
    vector = embedder.encode(text)
    return vector
```
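One practical detail is worth sketching here: by default `all-MiniLM-L6-v2` truncates input longer than 256 word pieces, so long documents and transcripts are usually split into chunks before embedding. The example below assumes the `sentence-transformers` package. The `chunk_words` helper and its window of 150 words with an overlap of 30 are rough, illustrative choices rather than tuned values.

```python
def chunk_words(text, max_words=150, overlap=30):
    """Split text into overlapping word windows that fit the embedder's input limit."""
    if overlap >= max_words:
        raise ValueError("overlap must be smaller than max_words")
    words = text.split()
    step = max_words - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        # Stop once this window has reached the end of the text.
        if start + max_words >= len(words):
            break
    return chunks


def embed_chunks(chunks):
    """Embed chunks with MiniLM; returns an array of shape (len(chunks), 384)."""
    from sentence_transformers import SentenceTransformer  # lazy import

    embedder = SentenceTransformer("all-MiniLM-L6-v2")
    # Normalized vectors let a plain dot product act as cosine similarity.
    return embedder.encode(chunks, normalize_embeddings=True)
```

At query time the search phrase is embedded with the same model, and the chunks with the highest similarity scores point back to the files, images, or video timestamps they came from.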

---

Bringing It to the Desktop with Dhito

While building a Python pipeline is a fun weekend project, running it every day on your desktop is a different story. Background indexing, parsing corrupt files, managing model lifecycles, and optimizing for battery and memory usage require thousands of lines of robust engineering.

This is exactly why we built Dhito.

Dhito packages this entire local AI architecture into a beautiful, native macOS app.

Unified Apple Silicon Optimization: Instead of slow CPU execution or complex GPU setups, Dhito compiles and runs Whisper, Florence-2, and MiniLM models using macOS Metal and the Apple Neural Engine.
Smart Background Indexing: Dhito watches your local folders and indexes files in the background only when your machine is idle, preserving battery life and performance.
Zero Configuration: No terminal, no Python dependencies, and no model downloads. Dhito runs out of the box with 100% offline privacy.

By running these state-of-the-art multimodal embeddings locally, Dhito gives you the speed and intelligence of a search engine, with absolute control over your private data.

If you are ready to search your local documents, images, and videos by concept, download Dhito today and experience the future of private local file management.

