Vietnamese Sign Language Recognition (VSLR) Model

This repository hosts a VideoMAE (Base) model fine-tuned for multi-class Vietnamese Sign Language Recognition (VSLR). It adapts self-supervised video representations to classify short video clips of sign gestures into distinct Vietnamese text labels.

🚀 Model Description

Base Architecture: MCG-NJU/videomae-base-finetuned-kinetics (Video Masked Autoencoders)
Dataset: star092304/ViSignLanguage-Video
Task: Multi-class Video Classification (Spatiotemporal Feature Extraction)
Target Language: Vietnamese Sign Language (VNSL)

The model splits each short clip into spatiotemporal patches (tubelets), encodes them with the VideoMAE transformer, and maps the signed gesture (for example "Ăn", "Bệnh viện", or "Xin lỗi") to its class label.
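To make the patch partitioning concrete, the sketch below counts how many spatiotemporal tokens one clip produces. The values 224 (frame size), 16 (spatial patch size), and 2 (tubelet length in frames) are assumptions based on the usual VideoMAE-Base configuration and are not stated in this card; the 16 frames match NUM_FRAMES in the usage example. The function name is hypothetical, and model.config should be checked for the actual values.

```python
def videomae_token_count(num_frames=16, image_size=224, patch_size=16, tubelet_size=2):
    """Count the spatiotemporal patches (tokens) produced for one clip.

    All defaults are assumed VideoMAE-Base settings, not values read from
    this repository's config.
    """
    if num_frames % tubelet_size != 0:
        raise ValueError("num_frames must be divisible by tubelet_size")
    if image_size % patch_size != 0:
        raise ValueError("image_size must be divisible by patch_size")
    temporal_patches = num_frames // tubelet_size        # groups of consecutive frames
    spatial_patches = (image_size // patch_size) ** 2    # patch grid within one frame
    return temporal_patches * spatial_patches


# 16 frames -> 8 temporal groups, each a 14 x 14 grid of patches
print(videomae_token_count())
```

Under these assumptions a single 16-frame clip becomes 8 x 196 = 1568 tokens, which is the sequence the transformer encoder attends over.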

📊 Training & Evaluation Visualizations

The training routine was monitored closely across key evaluation metrics to prevent overfitting while maximizing classification accuracy on the validation split.

1. Training and Validation Metrics

The plot below illustrates the progression of accuracy, precision, recall, and F1-score across successive training epochs.
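For readers who want to reproduce these metrics on their own predictions, here is a minimal sketch in plain Python. It assumes macro averaging (an unweighted mean over classes); this card does not state which averaging the training run used, and the function name macro_metrics is hypothetical.

```python
def macro_metrics(y_true, y_pred):
    """Return accuracy and macro-averaged precision, recall, and F1.

    Macro averaging is an assumption; the original training code may
    have used a different averaging scheme.
    """
    if not y_true or len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must be non-empty and of equal length")
    pairs = list(zip(y_true, y_pred))
    labels = sorted(set(y_true) | set(y_pred))
    precisions, recalls, f1s = [], [], []
    for label in labels:
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        precision = tp / (tp + fp) if (tp + fp) else 0.0
        recall = tp / (tp + fn) if (tp + fn) else 0.0
        f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
        precisions.append(precision)
        recalls.append(recall)
        f1s.append(f1)
    n = len(labels)
    return {
        "accuracy": sum(1 for t, p in pairs if t == p) / len(pairs),
        "precision": sum(precisions) / n,
        "recall": sum(recalls) / n,
        "f1": sum(f1s) / n,
    }


# Toy example with three of the sign labels mentioned in this card.
truth = ["Ăn", "Ăn", "Xin lỗi", "Bệnh viện"]
preds = ["Ăn", "Xin lỗi", "Xin lỗi", "Bệnh viện"]
print(macro_metrics(truth, preds))
```

Macro averaging weights every sign class equally, which is a common choice when some classes have far fewer clips than others.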

2. Loss Curves

The loss curves show stable convergence as the model adapts the spatiotemporal features from the Kinetics-400 checkpoint to the sign-language classification task.

🛠 Inference & Usage

The following example is adapted from inference/inference_for_colab.ipynb and demonstrates how to run local inference using the fine-tuned VideoMAE model.

Prerequisites

pip install transformers torch numpy decord huggingface-hub

Python Inference Example

import torch
import torch.nn as nn
import numpy as np
from transformers import VideoMAEImageProcessor, VideoMAEForVideoClassification
from decord import VideoReader, cpu
from huggingface_hub import hf_hub_download

MODEL_NAME = "star092304/vi-sign-language-videomae-base"
VIDEO_PATH = "path_to_a_test_sign_video.mp4"
NUM_FRAMES = 16
DEVICE = "cuda" if torch.cuda.is_available() else "cpu"

processor = VideoMAEImageProcessor.from_pretrained(MODEL_NAME)
model = VideoMAEForVideoClassification.from_pretrained(
    MODEL_NAME,
    ignore_mismatched_sizes=True,
)

# Rebuild the sequential classifier head exactly as used in the original notebook.
in_features = model.classifier.in_features
NUM_CLASSES = model.config.num_labels
model.classifier = nn.Sequential(
    nn.LayerNorm(in_features),
    nn.Dropout(0.3),
    nn.Linear(in_features, NUM_CLASSES),
)

seq_ckpt_path = hf_hub_download(
    repo_id=MODEL_NAME,
    filename="classifier_sequential.pth",
)
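seq_sd = torch.load(seq_ckpt_path, map_location="cpu", weights_only=True)
# strict=False tolerates a partial checkpoint, so report keys that did not match:
# a classifier head that silently failed to load would give random predictions.
load_result = model.load_state_dict(seq_sd, strict=False)
if load_result.unexpected_keys:
    print(f"Warning: ignored unexpected keys: {load_result.unexpected_keys}")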

model = model.to(DEVICE)
model.eval()


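def load_video(video_path: str, num_frames: int = 16) -> list:
    """Sample `num_frames` frames uniformly across the clip as HxWx3 arrays."""
    vr = VideoReader(video_path, ctx=cpu(0))
    total = len(vr)
    if total == 0:
        raise ValueError(f"No frames could be decoded from {video_path}")
    # Evenly spaced indices; clips shorter than num_frames repeat some frames.
    indices = np.linspace(0, total - 1, num_frames).astype(int)
    frames = vr.get_batch(indices).asnumpy()
    return list(frames)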

frames = load_video(VIDEO_PATH, num_frames=NUM_FRAMES)
inputs = processor(frames, return_tensors="pt")
inputs = {k: v.to(DEVICE) for k, v in inputs.items()}

with torch.no_grad():
    outputs = model(**inputs)

logits = outputs.logits
pred_id = logits.argmax(-1).item()
pred_label = model.config.id2label[pred_id]
probs = torch.softmax(logits, dim=-1)[0]

print(f"Predicted class : {pred_label}")
print(f"Class ID        : {pred_id}")
print(f"Confidence      : {probs[pred_id].item():.4f}")

print("\nTop-5 predictions:")
for rank, idx in enumerate(torch.argsort(probs, descending=True)[:5], 1):
    idx = idx.item()
    print(f"  {rank}. [{idx:3d}] {model.config.id2label[idx]:<30s} {probs[idx].item():.4f}")
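The uniform frame sampling inside load_video can be examined on its own. The sketch below is a plain-Python stand-in for np.linspace(0, total - 1, num_frames).astype(int); the helper name sample_frame_indices is hypothetical, and it is meant to illustrate the index pattern rather than replace the NumPy call.

```python
def sample_frame_indices(total_frames, num_frames=16):
    """Pick `num_frames` evenly spaced frame indices in [0, total_frames - 1].

    Mirrors the truncating behaviour of np.linspace(...).astype(int)
    used in load_video; written in plain Python for illustration only.
    """
    if total_frames <= 0 or num_frames <= 0:
        raise ValueError("total_frames and num_frames must be positive")
    if num_frames == 1:
        return [0]
    step = (total_frames - 1) / (num_frames - 1)
    indices = [int(i * step) for i in range(num_frames)]
    indices[-1] = total_frames - 1  # the last sample is always the final frame
    return indices


# A clip with exactly 16 frames uses every frame once.
print(sample_frame_indices(16))
# A 4-frame clip still yields 16 indices, so frames are repeated.
print(sample_frame_indices(4))
```

Because the model always receives NUM_FRAMES frames, very short clips are padded by repetition and long clips are subsampled, so signs performed at very different speeds may look different to the model than the training clips did.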

👥 Acknowledgments

Dataset source: The star092304/ViSignLanguage-Video collection, originally hosted via the PTIT AI Challenge platform.
Pretrained Weights: Multimedia Computing Group, Nanjing University (MCG-NJU).
