A Whisper encoder fine-tuned in Kimi-Audio. It extracts continuous acoustic features from speech.
```bash
pip install transformers librosa torch
```
```python
import torch
import librosa
from transformers import WhisperModel, WhisperFeatureExtractor

# Load the model and keep only the encoder
model = WhisperModel.from_pretrained("Atotti/Kimi-Audio-Whisper-Encoder")
model = model.encoder.to("cuda", dtype=torch.bfloat16)
model.eval()

# Load audio at 16 kHz (the sampling rate Whisper expects)
audio, sr = librosa.load("audio.wav", sr=16000)

# Extract log-mel input features using Whisper's feature extractor
feature_extractor = WhisperFeatureExtractor.from_pretrained("Atotti/Kimi-Audio-Whisper-Encoder")
inputs = feature_extractor(audio, sampling_rate=16000, return_tensors="pt")
input_features = inputs.input_features.to("cuda", dtype=torch.bfloat16)

# Get encoder output
with torch.no_grad():
    encoder_output = model(input_features)

features = encoder_output.last_hidden_state  # [1, T, 1280]
print(f"Features shape: {features.shape}")

# Mean pooling for an utterance-level embedding
pooled = features.mean(dim=1)  # [1, 1280]
```
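As a minimal sketch of how the pooled embeddings can be used: two utterance-level vectors can be compared with cosine similarity. The example below uses random tensors as stand-ins for the pooled embeddings (in practice they would come from `features.mean(dim=1)` above); the Kimi-Audio encoder itself is not required here.

```python
import torch
import torch.nn.functional as F

# Stand-ins for two pooled utterance embeddings, shape [1, 1280].
# Real embeddings would come from the encoder's mean-pooled output.
emb_a = torch.randn(1, 1280)
emb_b = torch.randn(1, 1280)

# Cosine similarity in [-1, 1]; higher means the utterances are
# closer in the encoder's acoustic feature space.
sim = F.cosine_similarity(emb_a, emb_b, dim=-1)
print(sim.item())
```

For random 1280-dimensional vectors the similarity will be close to 0; embeddings of acoustically similar utterances should score noticeably higher.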
Output shapes:

- `[batch, time_steps, 1280]`: frame-level (time-series) features
- `[batch, 1280]`: utterance-level features after mean pooling

See moonshotai/Kimi-Audio-7B-Instruct for license information.
Base model: moonshotai/Kimi-Audio-7B-Instruct