ml/embedding
ml/multimodal embedding
paper author
paper/📕 jina-embeddings-v4: Universal Embeddings for Multimodal Multilingual Retrieval

jina-embeddings-v4: Universal Embeddings for Multimodal Multilingual Retrieval

Abstract

We introduce jina-embeddings-v4, a 3.8 billion parameter multimodal embedding model that unifies text and image representations through a novel architecture supporting both single-vector and multi-vector embeddings in the late interaction style. The model incorporates task-specific Low-Rank Adaptation (LoRA) adapters to optimize performance across diverse retrieval scenarios, including query-document retrieval, semantic text similarity, and code search. Comprehensive evaluations demonstrate that jina-embeddings-v4 achieves state-of-the-art performance on both single-modal and cross-modal retrieval tasks, with particular strength in processing visually rich content such as tables, charts, diagrams, and mixed-media formats. To facilitate evaluation of this capability, we also introduce Jina-VDR, a novel benchmark specifically designed for visually rich image retrieval.

url

Hot takes

They don't train the embedding model itself at all! They just apply a (very complicated) training recipe to 3 LoRA adapters on top of a frozen Qwen model. The key insight: to get the embedding property you want, you need to train with a corresponding loss and methodology. They use multiple LoRA adapters because the tasks are fundamentally incompatible with each other. E.g. we want different embeddings for these tasks:

  • Asymmetric Query-Document Retrieval
  • Semantic Similarity and Symmetric Retrieval
  • Code Retrieval
  1. Separating them into LoRA adapters makes training lightweight while leveraging the computation already put into the Qwen 3B base model. I really, really like the configurable-embedder aspect: you pick the LoRA adapter based on what you want to embed and what you want to do with it.
  2. A 3B model is beyond something we can run locally on CPU, which pushes it into API territory. This is unfortunate. I wonder if it would make more sense to disentangle each modality into smaller models instead of making one larger model do more, because 3.8B parameters is quite steep for something that outputs a vector.
  3. The training methodology is quite bonkers compared to regular LLM pretraining. It shows how retrieval learning is still a bit of alchemy with no easy recipe. I wonder if this is directly enabled by the speed of LoRA training: if they had to train a 3B model from scratch, they would not be able to do this kind of experimentation.
  4. I liked the analysis of alignment between CLIP image and text embeddings and the comparison with their method; I find it very difficult to reason about contrastive models and how they behave in embedding space.
  5. I previously saw mentions of matryoshka learning, but it looks incredible here. By default the model outputs 2048-dim embeddings, but due to matryoshka learning the values in the vector are ordered by semantic significance, so you can just truncate (cut out) the first 512 dimensions and get some guarantees on the performance degradation when used this way. That is really cool! I am caveman.
  6. I enjoy that it has multiple output modes (single-vector, multi-vector), and each output type comes with some guarantee about its behavior because of how meticulously they defined the multitask training. It definitely feels better than just pretraining and praying.
  7. While reusing the autoregressive Qwen 3B is super fast and efficient, I wonder if we can disentangle things more. I imagine separate lightweight (CPU-friendly) ~300M semantic embedders for code, image, and text, maybe trained multilingually to map them to similar vectors. During embedding search, we configure what we want to search using the same approach as above: e.g. for asymmetric query-document retrieval, another mapping model takes the base 512-dim vector to specialized (either "query" or "query target") vectors, and we run semantic search on those. By configuring them in two steps, you separately define the base information and how it should be searched.
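The matryoshka truncation in point 5 can be sketched in a few lines: keep a prefix of the full embedding and re-normalize it before computing cosine similarity. The random vectors here are toy stand-ins for real model outputs, not actual jina-embeddings-v4 embeddings:

```python
import numpy as np

def truncate_matryoshka(emb: np.ndarray, dim: int) -> np.ndarray:
    """Keep the first `dim` components and re-normalize to unit length."""
    sliced = emb[..., :dim]
    return sliced / np.linalg.norm(sliced, axis=-1, keepdims=True)

# Toy stand-ins for two 2048-dim unit-norm embeddings from the model.
rng = np.random.default_rng(0)
a, b = rng.normal(size=(2, 2048))
a /= np.linalg.norm(a)
b /= np.linalg.norm(b)

full_sim = float(a @ b)   # cosine similarity at the full 2048 dims
small_sim = float(truncate_matryoshka(a, 512) @ truncate_matryoshka(b, 512))
```

With real matryoshka-trained embeddings, `small_sim` stays close to `full_sim` because the most semantically significant values come first; with these random toys it merely demonstrates the mechanics.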
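The multi-vector mode in point 6 is scored with late interaction: each query token vector is matched against its best-matching document token vector, and the per-token maxima are summed (ColBERT-style MaxSim). A minimal sketch, assuming unit-normalized token embeddings:

```python
import numpy as np

def maxsim(query_vecs: np.ndarray, doc_vecs: np.ndarray) -> float:
    """Late-interaction score: for each query token vector, take the
    similarity to its best-matching document token vector, then sum
    over query tokens. Inputs are (n_tokens, dim) arrays."""
    sims = query_vecs @ doc_vecs.T        # (n_q, n_d) token-level similarities
    return float(sims.max(axis=1).sum())  # best doc match per query token

# Toy example with one-hot "token embeddings" in a 4-dim space.
q = np.eye(4)[:2]   # 2 query token vectors: e0, e1
d = np.eye(4)[1:]   # 3 document token vectors: e1, e2, e3
score = maxsim(q, d)  # -> 1.0 (only e1 overlaps)
```

This is why multi-vector output is better for fine-grained retrieval than a single pooled vector: token-level matches survive instead of being averaged away.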
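The two-step idea in point 7 (a shared base embedder plus small per-role mapping heads) could look something like this. The 512-dim size and the "query"/"query target" roles come from the note above; the random linear heads are purely hypothetical placeholders for trained mapping models:

```python
import numpy as np

rng = np.random.default_rng(0)
DIM = 512

# Hypothetical per-role mapping heads; in the proposal these would be
# small trained models, not random matrices.
heads = {
    "query": rng.normal(size=(DIM, DIM)) / np.sqrt(DIM),
    "query_target": rng.normal(size=(DIM, DIM)) / np.sqrt(DIM),
}

def project(base_emb: np.ndarray, role: str) -> np.ndarray:
    """Step 2: map a base embedding into a role-specific search space."""
    out = base_emb @ heads[role]
    return out / np.linalg.norm(out)

# Step 1 would be a lightweight modality-specific embedder; a random
# vector stands in for its 512-dim output here.
base = rng.normal(size=DIM)
query_vec = project(base, "query")
target_vec = project(base, "query_target")
```

The appeal of the split: the base embedder defines *what* the content is, and the cheap role heads define *how* it is searched, so adding a new retrieval style means training a small head instead of a new 3B-scale adapter.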
[Image] Jina v4 architecture -> Most parameters are frozen!