--- library_name: kernels license: apache-2.0 --- ![maxsim](https://cdn-uploads.huggingface.co/production/uploads/63148c4db031f7b1c7bc36f9/5_3oT7_vFstoke-_0VjEf.png) # MaxSim A fast, memory-efficient exact MaxSim kernel for late-interaction retrieval and reranking (PyLate / ColBERT-style), packaged as a [Hugging Face `kernels`](https://github.com/huggingface/kernels) repo. The kernel computes ``` score(q, d) = sum_i max_j ``` over a batch of `(query, document)` pairs **without materialising the full `[Lq, Ld]` similarity matrix**. It tiles over document tokens, keeps running per-q-token maxima in shared memory, and reduces those into the per-pair score. ## Install ```bash uv add kernels # or: pip install kernels ``` ## Usage Two entry points. The packed/ragged form is the canonical kernel-facing API; the padded form is an ergonomic wrapper around the same kernel for the common batched-reranking case. ### Packed (ragged segments) ```python import torch from kernels import get_kernel maxsim = get_kernel("erikkaum/maxsim", version=1) scores = maxsim.score_pairs_packed( queries, # [total_q_tokens, dim] query_offsets, # [num_queries + 1], int32/int64 documents, # [total_d_tokens, dim] document_offsets, # [num_documents + 1], int32/int64 pair_query_ids, # [num_pairs] pair_document_ids, # [num_pairs] ) # scores.shape == [num_pairs], dtype == float32 ``` ### Padded (batched reranking) ```python scores = maxsim.score_candidates_padded( queries, # [B, Lq, dim] documents, # [B, candidates, Ld, dim] query_lengths, # [B] doc_lengths, # [B, candidates] ) # scores.shape == [B, candidates], dtype == float32 ``` A pure-PyTorch reference (`maxsim.maxsim_reference`, `maxsim.score_pairs_packed_reference`, `maxsim.score_candidates_padded_reference`) ships alongside for tests and benchmarks. ## Supported | Backend | Devices | Input dtypes | Accum / output | | ------- | ---------------------- | ----------------- | -------------- | | Metal | Apple Silicon (MPS) | fp32 / fp16 / bf16 | fp32 | | CUDA | sm_80, sm_86, sm_89 (Ampere + Lovelace) | fp32 / fp16 / bf16 | fp32 | `dim` is generic; the fast `simdgroup_matrix` / WMMA paths fire when `dim % 8 == 0` (Metal) / `dim % 16 == 0` (CUDA), which covers the typical embedding sizes (64 / 96 / 128). ## Benchmarks Three padded-API workloads taken straight from the design plan, comparing the kernel to a vectorised but naïve PyTorch baseline that materialises the `[Lq, Ld]` similarity matrix. ### Apple M3 Pro (Metal, fp16, dim=128) | Workload | Kernel | Naive | Speedup | | -------------------------------------------------------------- | ------ | ------ | ------- | | SmallRerank — B=32, C=10, Lq=32, Ld=180 | 0.45 ms | 1.44 ms | **3.18×** | | HeavyRerank — B=32, C=100, Lq=32, Ld=256 | 4.34 ms | 16.63 ms | **3.83×** | | LongDocStress — B=8, C=16, Lq=64, Ld=1024 | 1.69 ms | 3.70 ms | **2.19×** | ### NVIDIA CUDA (fp16, dim=128) | Workload | A10G (sm_86) | L4 (sm_89) | A100 (sm_80) | | ------------- | ------------ | ---------- | ------------ | | SmallRerank | 2.28× | 2.05× | 2.80× | | HeavyRerank | **4.48×** | **5.18×** | **5.29×** | | LongDocStress | 3.41× | **6.21×** | 1.89× | (A100's naive einsum is so well-tuned by cuBLAS that LongDocStress barely benefits there; on memory-bandwidth-bound GPUs like L4 the kernel pulls ahead significantly.) Reproduce with: ```bash kernels benchmark erikkaum/maxsim ``` ## Limitations - No backward pass (forward-only scoring kernel for now). - No argmax-position output (just the score). - CUDA fast path requires `dim % 16 == 0` and `Lq_max % 16 == 0`; other shapes fall back to a correctness-preserving scalar kernel. - Hopper (sm_90) is supported via PTX forward-compat but doesn't yet use WGMMA — Ampere/Lovelace gets the best tuning. ## Source / contribute Source: . License: Apache-2.0.