Kimi-Audio Whisper Encoder

A Whisper encoder fine-tuned as part of Kimi-Audio. It extracts continuous acoustic features from audio.

Installation

pip install transformers librosa torch

Usage

Using Transformers (Recommended)

import torch
import librosa
from transformers import WhisperModel

# Load model
model = WhisperModel.from_pretrained("Atotti/Kimi-Audio-Whisper-Encoder")
model = model.encoder.to("cuda", dtype=torch.bfloat16)
model.eval()

# Load audio
audio, sr = librosa.load("audio.wav", sr=16000)

# Extract features using Whisper's feature extractor
from transformers import WhisperFeatureExtractor
feature_extractor = WhisperFeatureExtractor.from_pretrained("Atotti/Kimi-Audio-Whisper-Encoder")
inputs = feature_extractor(audio, sampling_rate=16000, return_tensors="pt")
input_features = inputs.input_features.to("cuda", dtype=torch.bfloat16)

# Get encoder output
with torch.no_grad():
    encoder_output = model(input_features)
    features = encoder_output.last_hidden_state  # [1, T, 1280]

print(f"Features shape: {features.shape}")
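Whisper's feature extractor pads or truncates every input to a fixed 30 s window, so audio longer than 30 s must be split into chunks and each chunk encoded separately. A minimal chunking sketch (the chunk length and helper name are illustrative, not part of this model's API):

```python
# Hypothetical chunking helper: split audio into 30 s windows so each
# fits within Whisper's fixed input length before feature extraction.
CHUNK_SECONDS = 30
SAMPLE_RATE = 16000

def chunk_audio(audio, chunk_samples=CHUNK_SECONDS * SAMPLE_RATE):
    return [audio[i:i + chunk_samples] for i in range(0, len(audio), chunk_samples)]

# A 70 s clip splits into 30 s + 30 s + 10 s chunks.
chunks = chunk_audio([0.0] * (70 * SAMPLE_RATE))
print([len(c) // SAMPLE_RATE for c in chunks])  # [30, 30, 10]
```

Each chunk can then be passed through the feature extractor and encoder as shown above, and the resulting feature sequences concatenated along the time axis.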

Pooled Features

# Mean pooling for utterance-level embedding
pooled = features.mean(dim=1)  # [1, 1280]
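Because the feature extractor pads every input to 30 s, plain mean pooling averages over padded frames as well. If the real audio is shorter, it may help to pool only over the frames that cover actual audio. In standard Whisper, log-mel frames are hopped every 160 samples (10 ms) and the encoder's convolutional stack downsamples time by 2, so one encoder frame covers about 320 samples (20 ms). A sketch under those assumptions (the helper name is illustrative):

```python
# Hypothetical helper: estimate how many encoder frames cover real audio,
# assuming standard Whisper framing (160-sample mel hop x 2 conv stride).
def valid_encoder_frames(num_samples: int) -> int:
    samples_per_frame = 320  # 160-sample hop * 2 (encoder downsampling)
    return max(1, num_samples // samples_per_frame)

# 5 s of 16 kHz audio -> 250 valid frames out of the 1500 in a 30 s window.
print(valid_encoder_frames(5 * 16000))  # 250
```

With that count, pooling can be restricted to the valid prefix, e.g. `pooled = features[:, :valid_encoder_frames(len(audio))].mean(dim=1)`.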

Output

  • Sequential features: [batch, time_steps, 1280] - frame-level (time-series) features
  • Pooled features: [batch, 1280] - utterance-level features
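Utterance-level embeddings like these are typically compared with cosine similarity (e.g. for audio retrieval or speaker/content matching). A minimal, framework-free sketch; with torch tensors, `torch.nn.functional.cosine_similarity(a, b)` does the same:

```python
import math

# Cosine similarity between two utterance-level embedding vectors
# (e.g. the pooled [1, 1280] features above, flattened to lists).
def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))  # 1.0 (identical direction)
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # 0.0 (orthogonal)
```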

License

See moonshotai/Kimi-Audio-7B-Instruct for license information.
