---
license: gemma
language:
  - en
base_model:
  - google/gemma-3-4b-it
pipeline_tag: visual-document-retrieval
library_name: transformers
---

NetraEmbed

NetraEmbed is a state-of-the-art multilingual multimodal embedding model for visual document retrieval with Matryoshka representation learning, powered by the Gemma3 backbone.

Model Description

NetraEmbed is a multilingual multimodal embedding model that encodes both visual documents and text queries into single dense vectors. It supports multiple languages and enables efficient similarity search at multiple embedding dimensions (768, 1536, 2560) through Matryoshka representation learning.

  • Model Type: Multilingual Multimodal Embedding Model with Matryoshka embeddings
  • Architecture: BiEncoder with Gemma3-4B backbone
  • Embedding Dimensions: 768, 1536, 2560 (Matryoshka)
  • Capabilities: Multilingual, Multimodal (Vision + Text)
  • Use Case: Visual document retrieval, multilingual semantic search, cross-lingual document understanding

Paper

📄 M3DR: Towards Universal Multilingual Multimodal Document Retrieval (arXiv:2512.03514)

Installation

pip install git+https://github.com/adithya-s-k/colpali.git

Quick Start

import torch
from PIL import Image
from colpali_engine.models import BiGemma3, BiGemmaProcessor3

# Load model and processor
model_name = "Cognitive-Lab/NetraEmbed"

# Choose embedding dimension: 768, 1536, or 2560
embedding_dim = 1536  # Use lower dims for faster search, higher for better accuracy

model = BiGemma3.from_pretrained(
    model_name,
    dtype=torch.bfloat16,
    device_map="cuda",
    embedding_dim=embedding_dim,  # Matryoshka dimension
)
processor = BiGemmaProcessor3.from_pretrained(model_name)

# Load your images
images = [
    Image.open("document1.jpg"),
    Image.open("document2.jpg"),
]

# Define queries
queries = [
    "What is the total revenue?",
    "Show me the organizational chart",
]

# Process and encode
batch_images = processor.process_images(images).to(model.device)
batch_queries = processor.process_texts(queries).to(model.device)

with torch.no_grad():
    image_embeddings = model(**batch_images)  # Shape: (num_images, embedding_dim)
    query_embeddings = model(**batch_queries)  # Shape: (num_queries, embedding_dim)

# Compute similarity scores using cosine similarity
scores = processor.score(
    qs=query_embeddings,
    ps=image_embeddings,
)  # Shape: (num_queries, num_images)

# Get best matches
for i, query in enumerate(queries):
    best_idx = scores[i].argmax().item()
    print(f"Query: '{query}' -> Best match: Image {best_idx + 1} (score: {scores[i, best_idx]:.4f})")

Matryoshka Embeddings

NetraEmbed supports three embedding dimensions:

| Dimension | Use Case                 | Speed | Accuracy |
|-----------|--------------------------|-------|----------|
| 768       | Fast search, large-scale | ⚡⚡⚡   | ⭐⭐       |
| 1536      | Balanced performance     | ⚡⚡    | ⭐⭐⭐      |
| 2560      | Maximum accuracy         | ⚡     | ⭐⭐⭐⭐     |

Choose the dimension that best fits your latency and accuracy requirements. Because Matryoshka training nests smaller embeddings inside larger ones, you can switch dimensions without retraining.
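
In practice this means a lower-dimensional embedding can be recovered from a higher-dimensional one by truncation, without re-encoding. A sketch of the idea (prefix truncation plus re-normalization is standard Matryoshka practice, assumed here to match this model's embedding_dim option):

import torch.nn.functional as F

# Truncate full 2560-dim embeddings to a smaller Matryoshka dimension
full = image_embeddings                  # encoded with embedding_dim=2560
small = full[:, :768]                    # keep the first 768 dimensions
small = F.normalize(small, p=2, dim=-1)  # restore unit norm for cosine similarity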

Use Cases

  • Efficient Document Retrieval: Fast search through millions of documents
  • Semantic Search: Find visually similar documents
  • Scalable Vector Search: Works with FAISS, Milvus, Pinecone, etc.
  • Cross-lingual Retrieval: Multilingual visual document search

Model Details

  • Base Model: Gemma3-4B (google/gemma-3-4b-it)
  • Vision Encoder: SigLIP
  • Training Data: Multilingual document datasets
  • Embedding Strategy: Single-vector (BiEncoder)
  • Similarity Function: Cosine similarity (see the sketch after this list)
  • Matryoshka Dimensions: 768, 1536, 2560
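
With single-vector embeddings and cosine similarity, scoring reduces to one matrix product over L2-normalized vectors. A sketch of what the scoring step amounts to (assuming processor.score implements plain cosine similarity, as stated above):

import torch.nn.functional as F

# Cosine similarity between every query/image pair as a single matrix product
q = F.normalize(query_embeddings, p=2, dim=-1)  # (num_queries, dim)
p = F.normalize(image_embeddings, p=2, dim=-1)  # (num_images, dim)
scores = q @ p.T                                # (num_queries, num_images)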

Integration with Vector Databases

NetraEmbed works seamlessly with popular vector databases:

import faiss

# Create FAISS index
dimension = 1536
index = faiss.IndexFlatIP(dimension)  # Inner product for cosine similarity

# Add image embeddings to index
embeddings_np = image_embeddings.float().cpu().numpy()  # FAISS expects float32; bfloat16 can't convert directly
faiss.normalize_L2(embeddings_np)  # L2-normalize in place so inner product equals cosine similarity
index.add(embeddings_np)

# Search
query_np = query_embeddings[0:1].float().cpu().numpy()
faiss.normalize_L2(query_np)  # normalize the query the same way
k = 5  # Top 5 results
distances, indices = index.search(query_np, k)

print(f"Top {k} matches:", indices[0])
print(f"Scores:", distances[0])

Performance

NetraEmbed achieves competitive performance on visual document retrieval benchmarks while being significantly faster than multi-vector approaches. See our paper for detailed evaluation.

Citation

@misc{kolavi2025m3druniversalmultilingualmultimodal,
  title={M3DR: Towards Universal Multilingual Multimodal Document Retrieval}, 
  author={Adithya S Kolavi and Vyoman Jain},
  year={2025},
  eprint={2512.03514},
  archivePrefix={arXiv},
  primaryClass={cs.IR},
  url={https://arxiv.org/abs/2512.03514}
}

License

This model is released under the same license as the base Gemma3 model.

Acknowledgments

Built on top of the Gemma3 architecture with Matryoshka representation learning.