Vietnamese Sign Language Recognition (VSLR) Model

This repository hosts a VideoMAE (Base) model fine-tuned for multi-class Vietnamese Sign Language Recognition (VSLR). It adapts self-supervised video representations to classify short clips of sign gestures into Vietnamese text labels.


🚀 Model Description

The model partitions each short video clip into spatiotemporal patches and maps the signed gesture (such as "Ăn" (eat), "Bệnh viện" (hospital), or "Xin lỗi" (sorry)) to its corresponding class label.
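
As a rough illustration of the patch partitioning, the sketch below counts the spatiotemporal tokens produced for one clip. The default values (16 frames, 224x224 input, 16x16 spatial patches, tubelets spanning 2 frames) are assumed from the standard VideoMAE-Base configuration rather than read from this checkpoint, and `videomae_token_count` is a hypothetical helper name; check `model.config` for the actual settings.

```python
def videomae_token_count(
    num_frames: int = 16,
    image_size: int = 224,
    patch_size: int = 16,
    tubelet_size: int = 2,
) -> int:
    """Count the spatiotemporal patch tokens for a single clip.

    Defaults are assumed VideoMAE-Base settings, not values read
    from the fine-tuned checkpoint.
    """
    if num_frames % tubelet_size != 0:
        raise ValueError("num_frames must be divisible by tubelet_size")
    if image_size % patch_size != 0:
        raise ValueError("image_size must be divisible by patch_size")
    temporal_patches = num_frames // tubelet_size      # groups of frames
    spatial_patches = (image_size // patch_size) ** 2  # patches per frame group
    return temporal_patches * spatial_patches


print(videomae_token_count())  # 1568
```

Under these assumptions, each 16-frame clip becomes 8 x 196 = 1568 tokens before classification.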


📊 Training & Evaluation Visualizations

Training was monitored on the validation split with the metrics below, to detect overfitting while maximizing classification accuracy.

1. Training and Validation Metrics

The plot below illustrates the progression of accuracy, precision, recall, and F1-score across successive training epochs.

[Figure: Training Metrics]
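
For reference, here is a minimal sketch of how accuracy, precision, recall, and F1-score can be computed from predicted and true class IDs. The model card does not say which averaging was used for the multi-class metrics, so macro averaging is an assumption here, and `macro_metrics` is a hypothetical helper rather than the training code.

```python
def macro_metrics(y_true, y_pred):
    """Accuracy plus macro-averaged precision, recall, and F1.

    Macro averaging is an assumption; the original training run
    may have used a different averaging scheme.
    """
    if not y_true or len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must be non-empty and equal in length")
    pairs = list(zip(y_true, y_pred))
    labels = sorted(set(y_true) | set(y_pred))
    precisions, recalls, f1s = [], [], []
    for label in labels:
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        precisions.append(precision)
        recalls.append(recall)
        f1s.append(f1)
    n = len(labels)
    return {
        "accuracy": sum(1 for t, p in pairs if t == p) / len(pairs),
        "precision": sum(precisions) / n,
        "recall": sum(recalls) / n,
        "f1": sum(f1s) / n,
    }


# Toy example with three classes (IDs are illustrative only).
example = macro_metrics([0, 0, 1, 2], [0, 1, 1, 2])
print(example)
```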

2. Loss Curves

The training and validation losses converge stably as the spatiotemporal features from the Kinetics-400 pretrained weights adapt to the sign-recognition task.

[Figure: Training and Validation Loss]


🛠 Inference & Usage

The following example is adapted from inference/inference_for_colab.ipynb and demonstrates how to run local inference using the fine-tuned VideoMAE model.

Prerequisites

pip install transformers torch decord huggingface-hub

Python Inference Example

import torch
import torch.nn as nn
import numpy as np
from transformers import VideoMAEImageProcessor, VideoMAEForVideoClassification
from decord import VideoReader, cpu
from huggingface_hub import hf_hub_download

MODEL_NAME = "star092304/vi-sign-language-videomae-base"
VIDEO_PATH = "path_to_a_test_sign_video.mp4"
NUM_FRAMES = 16
DEVICE = "cuda" if torch.cuda.is_available() else "cpu"

processor = VideoMAEImageProcessor.from_pretrained(MODEL_NAME)
model = VideoMAEForVideoClassification.from_pretrained(
    MODEL_NAME,
    ignore_mismatched_sizes=True,
)

# Rebuild the sequential classifier head exactly as used in the original notebook.
in_features = model.classifier.in_features
NUM_CLASSES = model.config.num_labels
model.classifier = nn.Sequential(
    nn.LayerNorm(in_features),
    nn.Dropout(0.3),
    nn.Linear(in_features, NUM_CLASSES),
)

seq_ckpt_path = hf_hub_download(
    repo_id=MODEL_NAME,
    filename="classifier_sequential.pth",
)
seq_sd = torch.load(seq_ckpt_path, map_location="cpu", weights_only=True)
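load_result = model.load_state_dict(seq_sd, strict=False)
# strict=False hides key mismatches, so report any checkpoint keys that were not applied.
if load_result.unexpected_keys:
    print(f"Warning: unused checkpoint keys: {load_result.unexpected_keys}")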

model = model.to(DEVICE)
model.eval()


def load_video(video_path: str, num_frames: int = 16) -> list:
    vr = VideoReader(video_path, ctx=cpu(0))
    total = len(vr)
    indices = np.linspace(0, total - 1, num_frames).astype(int)
    frames = vr.get_batch(indices).asnumpy()
    return list(frames)

frames = load_video(VIDEO_PATH, num_frames=NUM_FRAMES)
inputs = processor(frames, return_tensors="pt")
inputs = {k: v.to(DEVICE) for k, v in inputs.items()}

with torch.no_grad():
    outputs = model(**inputs)

logits = outputs.logits
pred_id = logits.argmax(-1).item()
pred_label = model.config.id2label[pred_id]
probs = torch.softmax(logits, dim=-1)[0]

print(f"Predicted class : {pred_label}")
print(f"Class ID        : {pred_id}")
print(f"Confidence      : {probs[pred_id].item():.4f}")

print("\nTop-5 predictions:")
for rank, idx in enumerate(torch.argsort(probs, descending=True)[:5], 1):
    idx = idx.item()
    print(f"  {rank}. [{idx:3d}] {model.config.id2label[idx]:<30s} {probs[idx].item():.4f}")
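
The uniform frame sampling used in `load_video` can be checked in isolation, without decoding a video. The sketch below reproduces the same `np.linspace` index computation; `sample_frame_indices` is a hypothetical helper name, and the frame counts are made-up examples.

```python
import numpy as np


def sample_frame_indices(total_frames: int, num_frames: int = 16) -> np.ndarray:
    """Pick num_frames evenly spaced frame indices in [0, total_frames - 1]."""
    if total_frames < 1:
        raise ValueError("video must contain at least one frame")
    # Same computation as load_video: evenly spaced positions, truncated to ints.
    return np.linspace(0, total_frames - 1, num_frames).astype(int)


long_clip = sample_frame_indices(100)  # more frames than needed: subsample
short_clip = sample_frame_indices(5)   # fewer frames than needed: indices repeat

print(long_clip)
print(short_clip)
```

Clips shorter than `NUM_FRAMES` still yield 16 indices because frames are repeated, so the processor always receives a fixed-length input.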

👥 Acknowledgments

  • Dataset source: The star092304/ViSignLanguage-Video collection, originally hosted via the PTIT AI Challenge platform.
  • Pretrained Weights: Multimedia Computing Group, Nanjing University (MCG-NJU).