update
app/public/images/transformers/classic_encoders.png
ADDED
|
app/src/content/article.mdx
CHANGED
|
@@ -68,7 +68,7 @@ These principles were not decided in a vacuum. The library _evolved_ towards the
|
|
| 68 |
<li class="tenet">
|
| 69 |
<a id="source-of-truth"></a>
|
| 70 |
<strong>Source of Truth</strong>
|
| 71 |
-
<p>We aim to be a [source of truth for all model definitions](https://huggingface.co/blog/transformers-model-definition). This is more of a goal than a tenet, but it strongly guides our decisions. Model implementations should be reliable, reproducible, and faithful to the original implementations. If we are successful, they should become reference baselines for the ecosystem, so they'll be easily adopted by downstream libraries and projects. It's much easier for a project to
|
| 72 |
<em>This overarching guideline ensures quality and reproducibility across all models in the library, and aspires to make the community work easier.</em>
|
| 73 |
</li>
|
| 74 |
|
|
@@ -81,20 +81,20 @@ These principles were not decided in a vacuum. The library _evolved_ towards the
|
|
| 81 |
<li class="tenet">
|
| 82 |
<a id="code-is-product"></a>
|
| 83 |
<strong>Code is Product</strong>
|
| 84 |
-
<p>Optimize for reading, diffing, and tweaking, our users are power users. Variables
|
| 85 |
<em>Code quality matters as much as functionality - optimize for human readers, not just computers.</em>
|
| 86 |
</li>
|
| 87 |
<li class="tenet">
|
| 88 |
<a id="standardize-dont-abstract"></a>
|
| 89 |
<strong>Standardize, Don't Abstract</strong>
|
| 90 |
-
<p>If it's model behavior, keep it in the file; abstractions only for generic infra.</p>
|
| 91 |
<em>Model-specific logic belongs in the model file, not hidden behind abstractions.</em>
|
| 92 |
</li>
|
| 93 |
<li class="tenet">
|
| 94 |
<a id="do-repeat-yourself"></a>
|
| 95 |
<strong>DRY* (DO Repeat Yourself)</strong>
|
| 96 |
<p>Copy when it helps users; keep successors in sync without centralizing behavior.</p>
|
| 97 |
-
<p><strong>
|
| 98 |
<em>Strategic duplication can improve readability and maintainability when done thoughtfully.</em>
|
| 99 |
</li>
|
| 100 |
<li class="tenet">
|
|
@@ -160,7 +160,7 @@ Transformers is an opinionated library. The previous [philosophy](https://huggin
|
|
| 160 |
|
| 161 |
We amended the principle of [DRY*](#do-repeat-yourself) by progressively removing all pieces of code that were "copied from" another file.
|
| 162 |
|
| 163 |
-
It works as follows. In order to contribute a model,
|
| 164 |
The modular file can use inheritance across models, and it is then unravelled into a fully functional, self-contained modeling file.
|
| 165 |
|
| 166 |
<summary id="generated-modeling">Auto-generated modeling code</summary>
|
|
@@ -216,7 +216,7 @@ The _attention computation_ itself happens at a _lower_ level of abstraction tha
|
|
| 216 |
However, we were adding specific torch operations for each backend (sdpa, the several flash-attention iterations, flex attention) but it wasn't a [minimal user api](#minimal-user-api). Next section explains what we did.
|
| 217 |
|
| 218 |
<div class="crumbs">
|
| 219 |
-
Evidence: effective (i.e.,
|
| 220 |
|
| 221 |
<strong>Next:</strong> how the attention interface stays standard without hiding semantics.
|
| 222 |
</div>
|
|
@@ -236,8 +236,8 @@ attention_interface: Callable = eager_attention_forward
|
|
| 236 |
if self.config._attn_implementation != "eager":
|
| 237 |
attention_interface = ALL_ATTENTION_FUNCTIONS[self.config._attn_implementation]
|
| 238 |
```
|
| 239 |
-
|
| 240 |
-
|
| 241 |
|
| 242 |
Backend integrations sometimes require specific kwargs.
|
| 243 |
|
|
@@ -365,23 +365,20 @@ So what do we see?
|
|
| 365 |
Check out the [full viewer here](https://huggingface.co/spaces/Molbap/transformers-modular-refactor) (tab "dependency graph", hit "build graph") for better manipulation and exploration.
|
| 366 |
<HtmlEmbed src="transformers/dependency-graph.html" />
|
| 367 |
|
| 368 |
-
|
| 369 |
-
|
| 370 |
-
Llama is a basis and an influence for many models, and it shows.
|
| 371 |
|
| 372 |

|
| 373 |
|
| 374 |
-
Radically different architectures such as mamba have spawned their own dependency subgraph.
|
| 375 |
|
| 376 |
-
Audio models form sparser archipelagos, see for instance wav2vec2 which is a significant basis.
|
| 377 |
|
| 378 |

|
| 379 |
|
| 380 |
-
In the case of VLMs, there's far too many vision-based architectures that are not yet defined as modulars of other existing archs. In other words, there is no strong reference point in terms of software for vision models.
|
| 381 |
-
)
|
| 382 |
|
| 383 |
-
|
| 384 |
-
As you can see, there is a small DETR island:
|
| 385 |

|
| 386 |
|
| 387 |
There is also a little llava pocket, and so on, but it's not comparable to the centrality observed for llama.
|
|
@@ -402,7 +399,7 @@ Llama-lineage is a hub; several VLMs remain islands — engineering opportunity
|
|
| 402 |
|
| 403 |
I looked into Jaccard similarity, which we use to measure set differences, to find similarities across models. I know that code is more than a set of characters strung together. We also tried code-embedding models that ranked candidates better in practice, but for this post we stick to the deterministic Jaccard index.
|
| 404 |
|
| 405 |
-
It is interesting, for our comparison, to look at _when_ we deployed the modular logic and what was its rippling effect on the library.
|
| 406 |
|
| 407 |
Yet, we still have a lot of gaps to fill.
|
| 408 |
|
|
@@ -412,13 +409,21 @@ Zoom out below - it's full of models. You can click on a node to see its connect
|
|
| 412 |
|
| 413 |
Let's look at a few highly connected models, starting with the foundational work of [Llava](https://arxiv.org/abs/2304.08485).
|
| 414 |
|
| 415 |
-
 but being much more readable with [DRY*](#do-repeat-yourself).
|
| 419 |
|
| 420 |
<div class="crumbs">
|
| 421 |
-
Similarity metrics (Jaccard index or embeddings) surfaces likely parents; the timeline shows consolidation after modular landed. Red nodes/edges = candidates (e.g., <code>llava_video</code> → <code>llava</code>) for refactors that preserve behavior.
|
| 422 |
</div>
|
| 423 |
|
| 424 |
### VLM improvements, avoiding abstraction
|
|
@@ -489,6 +494,8 @@ The following [Pull request to standardize placeholder masking](https://github.c
|
|
| 489 |
|
| 490 |
But this is _within_ the modeling file, not in the `PreTrainedModel` base class. It will not move away from it, because it'd break the [self-contained logic](#one-model-one-file) of the model.
|
| 491 |
|
| 492 |
<div class="crumbs">
|
| 493 |
Keep VLM embedding mix in the modeling file (semantics), standardize safe helpers (e.g., placeholder masking), don't migrate behavior to <code>PreTrainedModel</code>.
|
| 494 |
<strong>Next:</strong> pipeline-level wins that came from PyTorch-first choices (fast processors).
|
|
@@ -497,9 +504,9 @@ Keep VLM embedding mix in the modeling file (semantics), standardize safe helper
|
|
| 497 |
|
| 498 |
### On image processing and processors
|
| 499 |
|
| 500 |
-
Deciding to become a `torch`-first library meant relieving a tremendous amount of support for `jax ` and `TensorFlow`, and it also meant that we could be more lenient
|
| 501 |
|
| 502 |
-
The gains in performance are immense, up to 20x speedup for most models when using compiled torchvision ops. Furthermore,
|
| 503 |
|
| 504 |

|
| 505 |
<p class="figure-legend">Thanks <a href="https://huggingface.co/yonigozlan">Yoni Gozlan</a> for the great work!</p>
|
|
@@ -519,19 +526,21 @@ Having a framework means forcing users into it. It restrains flexibility and cre
|
|
| 519 |
|
| 520 |
Among the most valuable contributions to `transformers` is of course the addition of new models. Very recently, [OpenAI added GPT-OSS](https://huggingface.co/blog/welcome-openai-gpt-oss), which prompted the addition of many new features to the library in order to support [their model](https://huggingface.co/openai/gpt-oss-120b).
|
| 521 |
|
| 522 |
-
|
| 523 |
|
| 524 |
|
| 525 |
<div class="crumbs">
|
| 526 |
The shape of a contribution: add a model (or variant) with a small modular shard; the community and serving stacks pick it up immediately. Popularity trends (encoders/embeddings) guide where we invest.
|
| 527 |
<strong>Next:</strong> power tools enabled by a consistent API.
|
| 528 |
</div>
|
| 529 |
|
| 530 |
|
| 531 |
### <a id="encoders-ftw"></a> Models popularity
|
| 532 |
|
| 533 |
-
Talking about dependencies, we can take a look at the number of downloads as a measure of popularity. One thing we see is the prominence of encoders, despite the apparent prevalence of decoder LLMs. The reason is that encoders are used to generate embeddings, which have multiple downstream uses. Just check out [EmbeddingGemma](https://huggingface.co/blog/embeddinggemma) for a modern recap. Hence, it is vital to keep the encoders portion of the library viable, usable, fine-
|
| 534 |
-
|
| 535 |
|
| 536 |
<div>
|
| 537 |
<HtmlEmbed src="transformers/model-visualisation.html" />
|
|
@@ -552,6 +561,8 @@ Encoders remain critical for embeddings and retrieval; maintaining them well ben
|
|
| 552 |
|
| 553 |
## A surgical toolbox for model development
|
| 554 |
|
| 555 |
### Attention visualisation
|
| 556 |
|
| 557 |
All models have the same API for attention computation, thanks to [the externalisation of attention classes](#external-attention-classes).
|
|
@@ -579,7 +590,9 @@ It just works with PyTorch models and is especially useful when aligning outputs
|
|
| 579 |
|
| 580 |
|
| 581 |
<div class="crumbs">
|
| 582 |
-
Forward interception and nested JSON logging align ports to reference implementations, reinforcing "Source of Truth."
|
|
| 583 |
</div>
|
| 584 |
|
| 585 |
|
|
@@ -613,7 +626,7 @@ curl -X POST http://localhost:8000/v1/chat/completions \
|
|
| 613 |
```
|
| 614 |
|
| 615 |
|
| 616 |
-
`transformers-serve` uses continuous batching (see [this PR](https://github.com/huggingface/transformers/pull/38085) and also [this one](https://github.com/huggingface/transformers/pull/40426)) for better GPU utilization, and is very much linked to the great work of vLLM with the `paged attention kernel` – a
|
| 617 |
|
| 618 |
`transformers-serve` is not meant for user-facing production services; tools like vLLM or SGLang are super optimized for that. Still, it's useful for several use cases:
|
| 619 |
- Quickly verify that your model is compatible with continuous batching and paged attention.
|
|
@@ -624,6 +637,7 @@ For model deployment, check [Inference Providers](https://huggingface.co/docs/in
|
|
| 624 |
|
| 625 |
<div class="crumbs">
|
| 626 |
OpenAI-compatible surface + continuous batching; kernels/backends slot in because the modeling API stayed stable.
|
| 627 |
<strong>Next:</strong> reuse across vLLM/SGLang relies on the same consistency.
|
| 628 |
</div>
|
| 629 |
|
|
@@ -635,13 +649,16 @@ The transformers-serve CLI built on transformers, for sure, but the library is m
|
|
| 635 |
Adding a model to transformers means:
|
| 636 |
|
| 637 |
- having it immediately available to the community
|
| 638 |
-
- having it immediately usable in vLLM, [SGLang](https://huggingface.co/blog/transformers-backend-sglang), and so on without additional code. In
|
| 639 |
|
| 640 |
-
|
| 641 |
|
| 642 |
|
| 643 |
<div class="crumbs">
|
| 644 |
Being a good backend consumer requires a consistent public surface; modular shards and configs make that stability practical.
|
| 645 |
<strong>Next:</strong> what changes in v5 without breaking the promise of visible semantics.
|
| 646 |
</div>
|
| 647 |
|
| 68 |
<li class="tenet">
|
| 69 |
<a id="source-of-truth"></a>
|
| 70 |
<strong>Source of Truth</strong>
|
| 71 |
+
<p>We aim to be a [source of truth for all model definitions](https://huggingface.co/blog/transformers-model-definition). This is more of a goal than a tenet, but it strongly guides our decisions. Model implementations should be reliable, reproducible, and faithful to the original implementations. If we are successful, they should become reference baselines for the ecosystem, so they'll be easily adopted by downstream libraries and projects. It's much easier for a project to always refer to the transformers implementation than to learn a different research codebase every time a new architecture is released.</p>
|
| 72 |
<em>This overarching guideline ensures quality and reproducibility across all models in the library, and aspires to make the community work easier.</em>
|
| 73 |
</li>
|
| 74 |
|
| 81 |
<li class="tenet">
|
| 82 |
<a id="code-is-product"></a>
|
| 83 |
<strong>Code is Product</strong>
|
| 84 |
+
<p>Optimize for reading, diffing, and tweaking; our users are power users. Variables should be explicit, full words, even several words; readability is paramount.</p>
|
| 85 |
<em>Code quality matters as much as functionality - optimize for human readers, not just computers.</em>
|
| 86 |
</li>
|
| 87 |
<li class="tenet">
|
| 88 |
<a id="standardize-dont-abstract"></a>
|
| 89 |
<strong>Standardize, Don't Abstract</strong>
|
| 90 |
+
<p>If it's model behavior, keep it in the file; use abstractions only for generic infra.</p>
|
| 91 |
<em>Model-specific logic belongs in the model file, not hidden behind abstractions.</em>
|
| 92 |
</li>
|
| 93 |
<li class="tenet">
|
| 94 |
<a id="do-repeat-yourself"></a>
|
| 95 |
<strong>DRY* (DO Repeat Yourself)</strong>
|
| 96 |
<p>Copy when it helps users; keep successors in sync without centralizing behavior.</p>
|
| 97 |
+
<p><strong>Evolution:</strong> With the introduction and global adoption of <a href="#modular">modular</a> transformers, we do not repeat any logic in the modular files, but end user files remain faithful to the original tenet.</p>
|
| 98 |
<em>Strategic duplication can improve readability and maintainability when done thoughtfully.</em>
|
| 99 |
</li>
|
| 100 |
<li class="tenet">
|
| 160 |
|
| 161 |
We amended the principle of [DRY*](#do-repeat-yourself) by progressively removing all pieces of code that were "copied from" another file.
|
| 162 |
|
| 163 |
+
It works as follows. In order to contribute a model, `GLM` for instance, we define a `modular_` file that can inherit _any class or function from all the other modeling, configuration and processor files_ already existing in the library.
|
| 164 |
The modular file can use inheritance across models, and it is then unravelled into a fully functional, self-contained modeling file.
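To make this concrete, here is a minimal sketch of what such a modular file could look like. The class names and parent choices below are illustrative assumptions for the example, not the actual `modular_glm.py` shipped in the library:

```python
# Illustrative sketch of a modular_*.py file (names and parent classes are assumptions).
from transformers.models.llama.configuration_llama import LlamaConfig
from transformers.models.llama.modeling_llama import LlamaAttention, LlamaForCausalLM


class GlmConfig(LlamaConfig):
    # Only the configuration deltas need to live here.
    model_type = "glm"


class GlmAttention(LlamaAttention):
    # Inherit everything from Llama; override only what actually differs.
    pass


class GlmForCausalLM(LlamaForCausalLM):
    pass
```

The converter script in the repository then expands this into a self-contained `modeling_glm.py`, so readers of the generated file never have to chase the inheritance chain.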
|
| 165 |
|
| 166 |
<summary id="generated-modeling">Auto-generated modeling code</summary>
|
| 216 |
However, we were adding specific torch operations for each backend (sdpa, the several flash-attention iterations, flex attention) but it wasn't a [minimal user api](#minimal-user-api). Next section explains what we did.
|
| 217 |
|
| 218 |
<div class="crumbs">
|
| 219 |
+
Evidence: effective (i.e., maintainable) LOC growth drops ~15× when counting shards instead of expanded modeling files. Less code to read, fewer places to break.
|
| 220 |
|
| 221 |
<strong>Next:</strong> how the attention interface stays standard without hiding semantics.
|
| 222 |
</div>
|
| 236 |
if self.config._attn_implementation != "eager":
|
| 237 |
attention_interface = ALL_ATTENTION_FUNCTIONS[self.config._attn_implementation]
|
| 238 |
```
|
| 239 |
+
Having the attention interfaces functionalized also allows dynamic switching of attention implementations, increasing their [hackability](#code-is-product).
|
| 240 |
+
Another strength of the new attention interface is the ability to enforce specific kwargs, which are needed by kernel providers and other dependencies.
|
| 241 |
|
| 242 |
Backend integrations sometimes require specific kwargs.
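As a sketch of how this looks from user land, assuming the `AttentionInterface` registry available in recent versions of the library (the wrapper name and checkpoint below are made up for the example):

```python
from transformers import AttentionInterface, AutoModelForCausalLM
from transformers.integrations.sdpa_attention import sdpa_attention_forward


def logged_sdpa(module, query, key, value, attention_mask, **kwargs):
    # Backend-specific kwargs simply flow through **kwargs to the wrapped implementation.
    print(f"attention called with extra kwargs: {sorted(kwargs)}")
    return sdpa_attention_forward(module, query, key, value, attention_mask, **kwargs)


AttentionInterface.register("logged_sdpa", logged_sdpa)

# Switching implementation is a config value, not a code change in the modeling file.
model = AutoModelForCausalLM.from_pretrained("your-org/your-model", attn_implementation="logged_sdpa")
```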
|
| 243 |
|
| 365 |
Check out the [full viewer here](https://huggingface.co/spaces/Molbap/transformers-modular-refactor) (tab "dependency graph", hit "build graph") for better manipulation and exploration.
|
| 366 |
<HtmlEmbed src="transformers/dependency-graph.html" />
|
| 367 |
|
| 368 |
+
Let's walk through some sections of this graph together.
|
| 369 |
+
First, Llama is a basis and an influence for many models, and this is clearly visible in the graph.
|
| 370 |
|
| 371 |

|
| 372 |
|
| 373 |
+
The linked models sometimes pull components from models other than `llama`, of course. Radically different architectures such as mamba have spawned their own dependency subgraph.
|
| 374 |
|
| 375 |
+
Audio models form sparser archipelagos; see for instance wav2vec2, which is a significant basis for a dozen of them.
|
| 376 |
|
| 377 |

|
| 378 |
|
| 379 |
+
In the case of VLMs, which have grown massively in popularity since 2024, there are far too many vision-based architectures that are not yet defined as modulars of other existing architectures. In other words, there is no strong reference point in terms of software for vision models.
|
| 380 |
|
| 381 |
+
As you can see, there is a small `DETR` island:
|
| 382 |

|
| 383 |
|
| 384 |
There is also a little llava pocket, and so on, but it's not comparable to the centrality observed for llama.
|
| 399 |
|
| 400 |
I looked into Jaccard similarity, which we use to measure set differences, to find similarities across models. I know that code is more than a set of characters strung together. We also tried code-embedding models that ranked candidates better in practice, but for this post we stick to the deterministic Jaccard index.
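For reference, the deterministic metric is nothing more exotic than the sketch below; the whitespace tokenizer is a simplifying assumption, the real pipeline can tokenize source code more carefully:

```python
def jaccard_similarity(code_a: str, code_b: str) -> float:
    """Jaccard index over the sets of tokens appearing in two source files."""
    tokens_a, tokens_b = set(code_a.split()), set(code_b.split())
    if not tokens_a and not tokens_b:
        return 1.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)
```

A pair of modeling files scoring close to 1.0 is a strong hint that one could be re-expressed as a modular shard of the other.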
|
| 401 |
|
| 402 |
+
It is interesting, for our comparison, to look at _when_ we deployed the modular logic and what its ripple effect on the library was. Looking at the timeline makes it obvious: adding modular allowed us to connect more and more models to solid reference points.
|
| 403 |
|
| 404 |
Yet, we still have a lot of gaps to fill.
|
| 405 |
|
| 409 |
|
| 410 |
Let's look at a few highly connected models, starting with the foundational work of [Llava](https://arxiv.org/abs/2304.08485).
|
| 411 |
|
| 412 |
+

|
| 413 |
|
| 414 |
|
| 415 |
You see that `llava_video` is a red node, connected by a red edge to `llava`: it's a candidate, something that we can _likely_ remodularize, [not touching the actual model](#backwards-compatibility) but being much more readable with [DRY*](#do-repeat-yourself).
|
| 416 |
|
| 417 |
+
The same can be identified with the classical encoders family, centered on `BERT`:
|
| 418 |
+
|
| 419 |
+
Here `roberta`, `xlm_roberta`, `ernie` are `modular`s of BERT, while models like `mobilebert` are likely candidates.
|
| 420 |
+

|
| 421 |
+
|
| 422 |
+
|
| 423 |
<div class="crumbs">
|
| 424 |
+
Similarity metrics (Jaccard index or embeddings) surface likely parents; the timeline shows consolidation after modular landed. Red nodes/edges = candidates (e.g., <code>llava_video</code> → <code>llava</code>) for refactors that preserve behavior.
|
| 425 |
+
|
| 426 |
+
<strong>Next:</strong> concrete VLM choices that avoid leaky abstractions.
|
| 427 |
</div>
|
| 428 |
|
| 429 |
### VLM improvements, avoiding abstraction
|
| 494 |
|
| 495 |
But this is _within_ the modeling file, not in the `PreTrainedModel` base class. It will not move away from it, because it'd break the [self-contained logic](#one-model-one-file) of the model.
|
| 496 |
|
| 497 |
+
What do we conclude? Going forward, we should aim for VLMs to have a form of centrality similar to that of `Llama` for text-only models. This centrality should not be achieved at the cost of abstracting and hiding away crucial inner workings of said models.
|
| 498 |
+
|
| 499 |
<div class="crumbs">
|
| 500 |
Keep VLM embedding mix in the modeling file (semantics), standardize safe helpers (e.g., placeholder masking), don't migrate behavior to <code>PreTrainedModel</code>.
|
| 501 |
<strong>Next:</strong> pipeline-level wins that came from PyTorch-first choices (fast processors).
|
| 504 |
|
| 505 |
### On image processing and processors
|
| 506 |
|
| 507 |
+
Deciding to become a `torch`-first library meant shedding a tremendous amount of support burden for `jax` and `TensorFlow`, and it also meant that we could be more lenient about the number of torch-dependent utilities we accept. One of these is the _fast processing_ of images. Where inputs were once minimally assumed to be ndarrays, enforcing native `torch` and `torchvision` inputs allowed us to massively improve processing speed for each model.
|
| 508 |
|
| 509 |
+
The gains in performance are immense, up to 20x speedup for most models when using compiled torchvision ops. Furthermore, it lets us run the whole pipeline solely on the GPU.
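In practice, opting into the fast path is a single flag; the checkpoint below is a placeholder, and the `device` argument assumes a CUDA machine and a recent version of the library:

```python
import torch
from transformers import AutoImageProcessor

# use_fast=True picks the torchvision-backed "fast" processor class when one exists.
processor = AutoImageProcessor.from_pretrained("your-org/your-vision-model", use_fast=True)

images = [torch.randint(0, 256, (3, 480, 640), dtype=torch.uint8)]  # native torch inputs
inputs = processor(images=images, return_tensors="pt", device="cuda")  # preprocessing runs on the GPU
```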
|
| 510 |
|
| 511 |

|
| 512 |
<p class="figure-legend">Thanks <a href="https://huggingface.co/yonigozlan">Yoni Gozlan</a> for the great work!</p>
|
| 526 |
|
| 527 |
Among the most valuable contributions to `transformers` is of course the addition of new models. Very recently, [OpenAI added GPT-OSS](https://huggingface.co/blog/welcome-openai-gpt-oss), which prompted the addition of many new features to the library in order to support [their model](https://huggingface.co/openai/gpt-oss-120b).
|
| 528 |
|
| 529 |
+
These additions are immediately available for other models to use.
|
| 530 |
+
|
| 531 |
+
Another important advantage is the ability to fine-tune and pipeline these models into many other libraries and tools. Check on the Hub how many fine-tunes are registered for [gpt-oss 120b](https://huggingface.co/models?other=base_model:finetune:openai/gpt-oss-120b), despite its size!
|
| 532 |
|
| 533 |
|
| 534 |
<div class="crumbs">
|
| 535 |
The shape of a contribution: add a model (or variant) with a small modular shard; the community and serving stacks pick it up immediately. Popularity trends (encoders/embeddings) guide where we invest.
|
| 536 |
+
|
| 537 |
<strong>Next:</strong> power tools enabled by a consistent API.
|
| 538 |
</div>
|
| 539 |
|
| 540 |
|
| 541 |
### <a id="encoders-ftw"></a> Models popularity
|
| 542 |
|
| 543 |
+
Talking about dependencies, we can take a look at the number of downloads as a measure of popularity. One thing we see is the prominence of encoders, despite the apparent prevalence of decoder LLMs. The reason is that encoders are used to generate embeddings, which have multiple downstream uses. Just check out [EmbeddingGemma](https://huggingface.co/blog/embeddinggemma) for a modern recap. Hence, it is vital to keep the encoders portion of the library viable, usable, fine-tunable.
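To make the embedding use case concrete, here is the usual mean-pooling recipe on top of a generic encoder; the checkpoint name is a placeholder:

```python
import torch
from transformers import AutoModel, AutoTokenizer

checkpoint = "your-org/your-encoder"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModel.from_pretrained(checkpoint)

sentences = ["Encoders quietly power retrieval.", "Decoders get the headlines."]
batch = tokenizer(sentences, padding=True, return_tensors="pt")

with torch.no_grad():
    hidden = model(**batch).last_hidden_state             # (batch, seq_len, hidden_dim)

mask = batch["attention_mask"].unsqueeze(-1)               # zero out padding positions
embeddings = (hidden * mask).sum(dim=1) / mask.sum(dim=1)  # mean pooling
embeddings = torch.nn.functional.normalize(embeddings, dim=-1)
```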
|
| 544 |
|
| 545 |
<div>
|
| 546 |
<HtmlEmbed src="transformers/model-visualisation.html" />
|
| 561 |
|
| 562 |
## A surgical toolbox for model development
|
| 563 |
|
| 564 |
+
Transformers provides many tools that help you add a new architecture and understand the inner workings of a model, as well as of the library itself.
|
| 565 |
+
|
| 566 |
### Attention visualisation
|
| 567 |
|
| 568 |
All models have the same API for attention computation, thanks to [the externalisation of attention classes](#external-attention-classes).
|
| 590 |
|
| 591 |
|
| 592 |
<div class="crumbs">
|
| 593 |
+
Forward interception and nested JSON logging align ports to reference implementations, reinforcing "Source of Truth."
|
| 594 |
+
|
| 595 |
+
<strong>Next:</strong> CUDA warmup reduces load-time without touching modeling semantics.
|
| 596 |
</div>
|
| 597 |
|
| 598 |
|
| 626 |
```
|
| 627 |
|
| 628 |
|
| 629 |
+
`transformers-serve` uses continuous batching (see [this PR](https://github.com/huggingface/transformers/pull/38085) and also [this one](https://github.com/huggingface/transformers/pull/40426)) for better GPU utilization, and is very much linked to the great work of vLLM with the `paged attention kernel`, a further justification of [external kernels](#community-kernels).
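Because the surface is OpenAI-compatible, the Python-side mirror of the `curl` call above is the standard `openai` client pointed at the local server; the model name and port are whatever you launched the server with:

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")  # local server, no real key required

response = client.chat.completions.create(
    model="your-org/your-model",
    messages=[{"role": "user", "content": "Hello!"}],
)
print(response.choices[0].message.content)
```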
|
| 630 |
|
| 631 |
`transformers-serve` is not meant for user-facing production services; tools like vLLM or SGLang are super optimized for that. Still, it's useful for several use cases:
|
| 632 |
- Quickly verify that your model is compatible with continuous batching and paged attention.
|
| 637 |
|
| 638 |
<div class="crumbs">
|
| 639 |
OpenAI-compatible surface + continuous batching; kernels/backends slot in because the modeling API stayed stable.
|
| 640 |
+
|
| 641 |
<strong>Next:</strong> reuse across vLLM/SGLang relies on the same consistency.
|
| 642 |
</div>
|
| 643 |
|
| 649 |
Adding a model to transformers means:
|
| 650 |
|
| 651 |
- having it immediately available to the community
|
| 652 |
+
- having it immediately usable in vLLM, [SGLang](https://huggingface.co/blog/transformers-backend-sglang), and so on without additional code. In the case of vLLM, transformers was added as a backend: vLLM optimizes throughput/latency on top of _existing_ transformers architectures, [as seen in this great vLLM x HF blog post](https://blog.vllm.ai/2025/04/11/transformers-backend.html) (see the sketch after this list).
|
| 653 |
+
- being the reference code for implementations in MLX, llama.cpp and other libraries.
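A sketch of the vLLM side, using the `model_impl="transformers"` switch described in the blog post linked above (the checkpoint is a placeholder):

```python
from vllm import LLM, SamplingParams

# Ask vLLM to run the model through its transformers backend instead of a native port.
llm = LLM(model="your-org/your-new-model", model_impl="transformers")

outputs = llm.generate(["Hello, my name is"], SamplingParams(max_tokens=32))
print(outputs[0].outputs[0].text)
```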
|
| 654 |
|
| 655 |
+
|
| 656 |
+
This further cements the need for a [consistent public surface](#consistent-public-surface): we are a backend and a reference, and there's more software than us to handle serving. At the time of writing, more effort is being put in that direction. We already have compatible configs for VLMs for vLLM (say that three times fast), check [here for GLM4 video support](https://github.com/huggingface/transformers/pull/40696/files), and here for [MoE support](https://github.com/huggingface/transformers/pull/40132), for instance.
|
| 657 |
|
| 658 |
|
| 659 |
<div class="crumbs">
|
| 660 |
Being a good backend consumer requires a consistent public surface; modular shards and configs make that stability practical.
|
| 661 |
+
|
| 662 |
<strong>Next:</strong> what changes in v5 without breaking the promise of visible semantics.
|
| 663 |
</div>
|
| 664 |
|
app/src/styles/components/_tenet.css
CHANGED
|
@@ -5,7 +5,7 @@
|
|
| 5 |
}
|
| 6 |
|
| 7 |
.tenet-list ol {
|
| 8 |
-
counter-reset: tenet-counter
|
| 9 |
list-style: none;
|
| 10 |
padding-left: 0;
|
| 11 |
display: grid;
|
| 5 |
}
|
| 6 |
|
| 7 |
.tenet-list ol {
|
| 8 |
+
counter-reset: tenet-counter 0;
|
| 9 |
list-style: none;
|
| 10 |
padding-left: 0;
|
| 11 |
display: grid;
|