Improve model card with Updates, Model Zoo, and Training information
#3
by nielsr (HF Staff) · opened

README.md CHANGED
@@ -5,9 +5,9 @@ library_name: transformers
 license: apache-2.0
 metrics:
 - accuracy
+pipeline_tag: video-text-to-text
 tags:
 - multimodal
-pipeline_tag: video-text-to-text
 model-index:
 - name: InternVL2.5_HiCo_R64
   results:
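For completeness, the reordered front matter can be sanity-checked once the change lands. A minimal sketch with `huggingface_hub`; the repo id is an assumption, taken from the R64 link in the Model Zoo further down:

```python
# Sketch: parse this card's YAML front matter and print the fields touched here.
# The repo id is assumed from the Model Zoo's R64 link, not stated in this hunk.
from huggingface_hub import ModelCard

card = ModelCard.load("OpenGVLab/InternVL_2_5_HiCo_R64")
meta = card.data.to_dict()

print(meta.get("pipeline_tag"))  # expected "video-text-to-text" after this change
print(meta.get("license"))       # "apache-2.0"
print(meta.get("tags"))          # ["multimodal"]
```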
@@ -61,23 +61,31 @@ model-index:
       value: 66.4
       name: accuracy
       verified: true
 ---
 
+# InternVideo2.5_HiCo_R64
 <!-- [\[Blog\]](https://internvideo.github.io/blog/2024-12-31-VideoChat-Flash) -->
 [\[GitHub\]](https://github.com/OpenGVLab/InternVideo/tree/main/InternVideo2.5)
 [\[Tech Report\]](https://arxiv.org/abs/2501.12386)
 <!-- [\[Chat Demo\]](https://huggingface.co/spaces/OpenGVLab/VideoChat-Flash) -->
 
+InternVideo2.5 is a video multimodal large language model (MLLM, built upon InternVL2.5) enhanced with **long and rich context (LRC) modeling**. It significantly improves upon existing MLLMs in perceiving fine-grained details and capturing long-form temporal structure. We achieve this through dense vision task annotations using task preference optimization (TPO) and compact spatiotemporal representations via adaptive hierarchical token compression (HiCo). This model is an ablation variant of InternVideo2.5 that uses HiCo only (**R64 means 64 tokens per frame**).
+
+## Updates
+- `2025/06/11`: [InternVideo2.5 (InternVL2.5 + LRC)](https://huggingface.co/OpenGVLab/InternVideo2_5_Chat_8B) achieves 53.2% accuracy on [VideoEval-Pro (MCQ)](https://huggingface.co/spaces/TIGER-Lab/VideoEval-Pro) (thanks to the authors for the benchmark), placing it among the top-performing open-source MLLMs at the 7-8B parameter scale.
+- `2025/01/23`: [InternVideo2.5 (InternVL2.5 + LRC)](https://huggingface.co/OpenGVLab/InternVideo2_5_Chat_8B) and [InternVL2.5-HiCo](https://huggingface.co/OpenGVLab/InternVL_2_5_HiCo_R16) are officially released on Hugging Face.
+- `2025/01/22`: The [technical report](https://arxiv.org/pdf/2501.12386) of InternVideo2.5 is released.
 
+## Model Zoo
+| MLLM | Link | MVBench | Perception Test | LongVideoBench | MLVU | VideoMME | LVBench | #Tokens per frame | #Params |
+| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
+| InternVideo2.5 | [huggingface](https://huggingface.co/OpenGVLab/InternVideo2_5_Chat_8B) | 75.7 | 74.9 | 60.6 | 72.8 | 65.1 | 46.4 | 16 | 8B |
+| InternVL2.5 + HiCo | [huggingface](https://huggingface.co/OpenGVLab/InternVL_2_5_HiCo_R16) | 74.0 | 71.4 | 59.6 | 71.5 | 64.9 | - | 16 | 8B |
+| InternVL2.5 + HiCo | [huggingface](https://huggingface.co/OpenGVLab/InternVL_2_5_HiCo_R64) | 74.4 | 71.9 | 62.7 | 72.6 | 66.4 | - | 64 | 8B |
 
+## Training
 
-| Model | MVBench | LongVideoBench | VideoMME (w/o sub) |
-| --- | --- | --- | --- |
-| InternVL2.5_HiCo_R64 | 74.4 | 62.7 | 66.4 |
+See [Finetuning Code](https://github.com/OpenGVLab/VideoChat-Flash/tree/main/xtuner-train_internvideo2_5).
 
 ## How to use the model
 
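All Model Zoo checkpoints above load the same way. As a pointer for the R64 variant this card describes, here is a minimal loading sketch, assuming the repo id from the table and the `AutoModel`/`AutoTokenizer` + `trust_remote_code` pattern that the card's own "How to use the model" section relies on:

```python
# Minimal loading sketch for the R64 variant listed in the Model Zoo above.
# Assumptions: repo id taken from the table; InternVL-style remote-code loading.
import torch
from transformers import AutoModel, AutoTokenizer

model_path = "OpenGVLab/InternVL_2_5_HiCo_R64"

tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
model = AutoModel.from_pretrained(
    model_path,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
).eval()
model = model.to("cuda" if torch.cuda.is_available() else "cpu")

# With HiCo at R64, each sampled frame is compressed to 64 visual tokens, so,
# for example, 128 sampled frames cost about 128 * 64 = 8192 visual tokens;
# the R16 checkpoints use 16 tokens per frame instead.
```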
@@ -233,7 +241,8 @@ with torch.no_grad():
 
 pixel_values, num_patches_list = load_video(video_path, num_segments=num_segments, max_num=1, get_frame_by_duration=False)
 pixel_values = pixel_values.to(torch.bfloat16).to(model.device)
-video_prefix = "".join([f"Frame{i+1}: <image
+video_prefix = "".join([f"Frame{i+1}: <image>\n" for i in range(len(num_patches_list))])
 # single-turn conversation
 question1 = "Describe this video in detail."
 question = video_prefix + question1
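To make the corrected `video_prefix` line concrete, here is the prompt string it builds; the four-element `num_patches_list` is a made-up stand-in for what `load_video` returns:

```python
# Standalone illustration of the prompt built by the corrected line above.
# num_patches_list is hypothetical; with max_num=1, load_video yields one
# patch per sampled frame.
num_patches_list = [1, 1, 1, 1]

video_prefix = "".join([f"Frame{i+1}: <image>\n" for i in range(len(num_patches_list))])
question1 = "Describe this video in detail."
question = video_prefix + question1

print(question)
# Frame1: <image>
# Frame2: <image>
# Frame3: <image>
# Frame4: <image>
# Describe this video in detail.
```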