roTripathi committed
Commit b5903c7 · verified · 1 Parent(s): 1e356ab

Switch name to Molmo2-O-7B

Files changed (1):
- README.md +7 -7
README.md CHANGED

@@ -36,7 +36,7 @@ You can find all models in the Molmo2 family [here](https://huggingface.co/colle
 
 **Learn more** about the Molmo2 family [in our announcement blog post](https://allenai.org/blog/molmo2).
 
-Molmo2 7B is based on [Olmo3-7B-Instruct](https://huggingface.co/allenai/Olmo-3-7B-Instruct) and uses [SigLIP 2](https://huggingface.co/google/siglip-so400m-patch14-384) as vision backbone.
+Molmo2-O-7B is based on [Olmo3-7B-Instruct](https://huggingface.co/allenai/Olmo-3-7B-Instruct) and uses [SigLIP 2](https://huggingface.co/google/siglip-so400m-patch14-384) as vision backbone.
 It outperforms others in the class of open weight and data models on short videos, counting, and captioning, and is competitive on long-videos.
 
 Ai2 is commited to open science. The Molmo2 datasets are available [here](https://huggingface.co/collections/allenai/molmo2-data).
@@ -63,7 +63,7 @@ pip install torch pillow einops torchvision accelerate decord2 molmo_utils
 from transformers import AutoProcessor, AutoModelForImageTextToText
 import torch
 
-model_id="allenai/Molmo2-7B"
+model_id="allenai/Molmo2-O-7B"
 
 # load the processor
 processor = AutoProcessor.from_pretrained(
@@ -122,7 +122,7 @@ import torch
 from molmo_utils import process_vision_info
 import re
 
-model_id="allenai/Molmo2-7B"
+model_id="allenai/Molmo2-O-7B"
 
 # load the processor
 processor = AutoProcessor.from_pretrained(
@@ -219,7 +219,7 @@ import torch
 from molmo_utils import process_vision_info
 import re
 
-model_id="allenai/Molmo2-7B"
+model_id="allenai/Molmo2-O-7B"
 
 # load the processor
 processor = AutoProcessor.from_pretrained(
@@ -315,7 +315,7 @@ import torch
 import requests
 from PIL import Image
 
-model_id="allenai/Molmo2-7B"
+model_id="allenai/Molmo2-O-7B"
 
 # load the processor
 processor = AutoProcessor.from_pretrained(
@@ -377,7 +377,7 @@ import re
 from PIL import Image
 import requests
 
-model_id="allenai/Molmo2-7B"
+model_id="allenai/Molmo2-O-7B"
 
 # load the processor
 processor = AutoProcessor.from_pretrained(
@@ -500,7 +500,7 @@ For details on the evals, refer to the main video results table in our [technica
 | VideoChat-Flash-7B | 56.1 |
 | Molmo2-4B | 62.8 |
 | Molmo2-8B | 63.1 |
-| **Molmo2-7B (this model)** | 59.7 |
+| **Molmo2-O-7B (this model)** | 59.7 |
 
 ## License and Use