Vietnamese Sign Language Recognition (VSLR) Model
This repository contains a VideoMAE (Base) model fine-tuned for multi-class Vietnamese Sign Language Recognition (VSLR). It adapts self-supervised video representations to classify short clips of sign gestures into Vietnamese text labels.
Model Description
- Base Architecture: MCG-NJU/videomae-base-finetuned-kinetics (Video Masked Autoencoders)
- Dataset: star092304/ViSignLanguage-Video
- Task: Multi-class Video Classification (Spatiotemporal Feature Extraction)
- Target Language: Vietnamese Sign Language (VNSL)
The model splits each short video clip into spatiotemporal patches and maps the signed gesture (such as "Ăn" (eat), "Bệnh viện" (hospital), or "Xin lỗi" (sorry)) to its corresponding class label.
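To make the patching step concrete, the sketch below counts how many tokens the encoder would see for one clip. The defaults (16 frames, 224x224 resolution, 16x16 spatial patches, tubelets spanning 2 frames) are the usual VideoMAE-Base settings and are assumed here rather than read from this repository's config; `count_tubelet_tokens` is a hypothetical helper name.

```python
def count_tubelet_tokens(num_frames=16, image_size=224, patch_size=16, tubelet_size=2):
    """Return the number of spatiotemporal patches (tokens) for one clip.

    All default values are assumptions based on the common VideoMAE-Base setup.
    """
    if num_frames % tubelet_size != 0:
        raise ValueError("num_frames must be divisible by tubelet_size")
    if image_size % patch_size != 0:
        raise ValueError("image_size must be divisible by patch_size")
    temporal_slots = num_frames // tubelet_size      # groups of consecutive frames
    spatial_slots = (image_size // patch_size) ** 2  # patches per frame group
    return temporal_slots * spatial_slots


print(count_tubelet_tokens())  # 8 * 14 * 14 = 1568
```

Under these assumptions each clip becomes 1568 tokens, which is why the clip length is fixed at 16 frames at inference time.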
Training & Evaluation Visualizations
Training was monitored on the validation split with the metrics below, to catch overfitting early while maximizing classification accuracy.
1. Training and Validation Metrics
The plot below illustrates the progression of accuracy, precision, recall, and F1-score across successive training epochs.
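For readers who want to reproduce these four metrics on their own predictions, here is a minimal pure-Python sketch. It assumes macro averaging (an unweighted mean over classes); the model card does not state which averaging scheme was used, and `macro_metrics` is a hypothetical helper, not part of this repository.

```python
from collections import Counter


def macro_metrics(y_true, y_pred):
    """Accuracy plus macro-averaged precision, recall, and F1 (assumed scheme)."""
    if not y_true or len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must be non-empty and of equal length")
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted class p, but it was wrong
            fn[t] += 1  # true class t was missed
    precisions, recalls, f1s = [], [], []
    for c in labels:
        prec = tp[c] / (tp[c] + fp[c]) if tp[c] + fp[c] else 0.0
        rec = tp[c] / (tp[c] + fn[c]) if tp[c] + fn[c] else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        precisions.append(prec)
        recalls.append(rec)
        f1s.append(f1)
    n = len(labels)
    return {
        "accuracy": sum(tp.values()) / len(y_true),
        "precision": sum(precisions) / n,
        "recall": sum(recalls) / n,
        "f1": sum(f1s) / n,
    }


# Toy example with three classes, identified by integer IDs.
scores = macro_metrics([0, 0, 1, 1, 2, 2], [0, 1, 1, 1, 2, 0])
print({name: round(value, 4) for name, value in scores.items()})
```

Macro averaging weights every sign class equally, which matters when some classes have far fewer clips than others.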
2. Loss Curves
The loss curves show stable convergence as the Kinetics-400 pretrained weights adapt to the sign-language classification task.
Inference & Usage
The following example is adapted from inference/inference_for_colab.ipynb and demonstrates how to run local inference using the fine-tuned VideoMAE model.
Prerequisites
pip install transformers torch decord huggingface-hub
Python Inference Example
import torch
import torch.nn as nn
import numpy as np
from transformers import VideoMAEImageProcessor, VideoMAEForVideoClassification
from decord import VideoReader, cpu
from huggingface_hub import hf_hub_download

MODEL_NAME = "star092304/vi-sign-language-videomae-base"
VIDEO_PATH = "path_to_a_test_sign_video.mp4"
NUM_FRAMES = 16
DEVICE = "cuda" if torch.cuda.is_available() else "cpu"

processor = VideoMAEImageProcessor.from_pretrained(MODEL_NAME)
# ignore_mismatched_sizes tolerates shape mismatches in the stock head, which is replaced below.
model = VideoMAEForVideoClassification.from_pretrained(
    MODEL_NAME,
    ignore_mismatched_sizes=True,
)

# Rebuild the sequential classifier head exactly as used in the original notebook.
in_features = model.classifier.in_features
NUM_CLASSES = model.config.num_labels
model.classifier = nn.Sequential(
    nn.LayerNorm(in_features),
    nn.Dropout(0.3),
    nn.Linear(in_features, NUM_CLASSES),
)

# Download and load the weights saved for the sequential head.
seq_ckpt_path = hf_hub_download(
    repo_id=MODEL_NAME,
    filename="classifier_sequential.pth",
)
seq_sd = torch.load(seq_ckpt_path, map_location="cpu", weights_only=True)
# strict=False loads only the keys present in the checkpoint; keys that do not
# match the model are skipped silently.
model.load_state_dict(seq_sd, strict=False)
model = model.to(DEVICE)
model.eval()


def load_video(video_path: str, num_frames: int = 16) -> list:
    """Read a video and return num_frames uniformly spaced frames as arrays."""
    vr = VideoReader(video_path, ctx=cpu(0))
    total = len(vr)
    indices = np.linspace(0, total - 1, num_frames).astype(int)
    frames = vr.get_batch(indices).asnumpy()
    return list(frames)


frames = load_video(VIDEO_PATH, num_frames=NUM_FRAMES)
inputs = processor(frames, return_tensors="pt")
inputs = {k: v.to(DEVICE) for k, v in inputs.items()}

with torch.no_grad():
    outputs = model(**inputs)
    logits = outputs.logits

pred_id = logits.argmax(-1).item()
pred_label = model.config.id2label[pred_id]
probs = torch.softmax(logits, dim=-1)[0]

print(f"Predicted class : {pred_label}")
print(f"Class ID : {pred_id}")
print(f"Confidence : {probs[pred_id].item():.4f}")

print("\nTop-5 predictions:")
for rank, idx in enumerate(torch.argsort(probs, descending=True)[:5], 1):
    idx = idx.item()
    print(f" {rank}. [{idx:3d}] {model.config.id2label[idx]:<30s} {probs[idx].item():.4f}")
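The `load_video` helper above picks frames at evenly spaced positions across the clip. The sketch below is a pure-Python stand-in for that `np.linspace(...).astype(int)` step, useful for checking which frames a clip of a given length would contribute. `uniform_frame_indices` is a hypothetical name, and because it uses integer arithmetic it may differ from NumPy by one index in rare floating-point rounding cases.

```python
def uniform_frame_indices(total_frames, num_frames=16):
    """Return num_frames evenly spaced frame indices in [0, total_frames - 1]."""
    if total_frames < 1 or num_frames < 1:
        raise ValueError("total_frames and num_frames must be positive")
    if num_frames == 1:
        return [0]
    last = total_frames - 1
    # Integer floor division keeps the first index at 0 and the final one at `last`.
    return [i * last // (num_frames - 1) for i in range(num_frames)]


print(uniform_frame_indices(100))  # a 100-frame clip
print(uniform_frame_indices(5))    # a clip shorter than 16 frames repeats indices
```

Clips shorter than 16 frames still yield 16 indices because some frames are repeated, so the model always receives a fixed-length input.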
Acknowledgments
- Dataset source: The star092304/ViSignLanguage-Video collection, originally hosted via the PTIT AI Challenge platform.
- Pretrained Weights: Multimedia Computing Group, Nanjing University (MCG-NJU).
