Introduction to ESPnet
ESPnet is an end-to-end speech processing toolkit, focused mainly on end-to-end speech recognition and end-to-end speech synthesis. ESPnet uses Chainer and PyTorch as its main deep learning engines, and follows Kaldi-style data processing, feature extraction/formatting, and recipes (Kaldi's way of packaging an experiment) to provide a complete setup for speech recognition and other speech processing experiments.
Pulling the Docker image
The Docker image comes with Kaldi, ESPnet's dependency, preinstalled. ESPnet uses a Conda environment to install Python and its dependencies.
<code> docker pull espnet/espnet:gpu-cuda10.0-cudnn7-u18</code>
You can also skip the Docker image; there are two alternatives:
- Build and install ESPnet from source, in which case you must build and install Kaldi and Warp-CTC yourself
- Use ESPnet's precompiled Kaldi and ESPnet binaries
Downloading the pretrained Chinese ASR model
The ESPnet project provides a Chinese ASR model pretrained on the Aishell dataset.
<code>| Task | CER (%) | WER (%) | Pretrained model |
| ----------- | :----: | :----: | :----: |
| Aishell dev | 6.0 | N/A | [link](https://github.com/espnet/espnet/blob/master/egs/aishell/asr1/RESULTS.md#transformer-result-default-transformer-with-initial-learning-rate--10-and-epochs--50) |
| Aishell test | 6.7 | N/A | same as above |</code>
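The CER (character error rate) figures above are based on edit distance between the hypothesis and the reference transcript. A minimal sketch of how CER is computed (the `cer` helper below is illustrative, not ESPnet's own scoring code):

```python
def cer(ref: str, hyp: str) -> float:
    """Character error rate: Levenshtein distance between hypothesis
    and reference characters, divided by the reference length."""
    m, n = len(ref), len(hyp)
    if m == 0:
        return float(n)
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i
    for j in range(n + 1):
        dp[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,          # deletion
                           dp[i][j - 1] + 1,          # insertion
                           dp[i - 1][j - 1] + cost)   # substitution
    return dp[m][n] / m

# One substituted character out of six gives CER = 1/6
print(cer("今天天氣很好", "今天天器很好"))
```

A 6.0% dev-set CER thus means roughly 6 wrong characters per 100 reference characters.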
Cloning the ESPnet source
<code> git clone git@github.com:espnet/espnet</code>
Placing the pretrained model into the egs/aishell directory
<code> ├── conf
│ ├── decode.yaml
│ └── train.yaml
├── data
│ └── train_sp
│ └── cmvn.ark
└── exp
├── train_rnnlm_pytorch_lm
│ ├── model.json
│ └── rnnlm.model.best
└── train_sp_pytorch_train_pytorch_transformer_lr1.0
└── results
├── model.json
└── model.last10.avg.best</code>
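Before launching the container, it can be worth verifying that every file in the layout above is actually in place. A small sketch (the `REQUIRED` list mirrors the tree shown, and `missing_files` is a hypothetical helper, not part of ESPnet):

```python
import os

# Files the demo expects, relative to egs/aishell/asr1, per the tree above
REQUIRED = [
    "conf/decode.yaml",
    "conf/train.yaml",
    "data/train_sp/cmvn.ark",
    "exp/train_rnnlm_pytorch_lm/model.json",
    "exp/train_rnnlm_pytorch_lm/rnnlm.model.best",
    "exp/train_sp_pytorch_train_pytorch_transformer_lr1.0/results/model.json",
    "exp/train_sp_pytorch_train_pytorch_transformer_lr1.0/results/model.last10.avg.best",
]

def missing_files(root):
    """Return the expected pretrained-model files absent under root."""
    return [p for p in REQUIRED if not os.path.exists(os.path.join(root, p))]

if __name__ == "__main__":
    print(missing_files("espnet/egs/aishell/asr1"))
```

An empty list means all expected files are present under the recipe directory.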
Launching the container
Part of the source tree needs to be mounted into the container; the mounts below follow the contents of egs/aishell/asr1/run.sh.
<code> docker run -it --rm \
 -v /home/ubuntu/jack/espnet/egs:/espnet/egs \
 -v /home/ubuntu/jack/espnet/espnet:/espnet/espnet \
 -v /home/ubuntu/jack/espnet/test:/espnet/test \
 -v /home/ubuntu/jack/espnet/utils:/espnet/utils \
 -v /home/ubuntu/jack/espnet/demo_asr:/espnet/demo_asr \
 --workdir /espnet/demo_asr/ \
 espnet/espnet:gpu-cuda10.0-cudnn7-u18 \
 /bin/bash</code>
Running the Chinese ASR example
The pretrained Chinese ASR package includes a language model, and the acoustic model uses the Transformer architecture. This demo does not use the language model.
Pick a random audio file from the Aishell training set as the example: BAC009S0730W0125.wav.
<code>import argparse
import json
import os

import scipy.io.wavfile as wav
import torch
from espnet.bin.asr_recog import get_parser
from espnet.nets.pytorch_backend.e2e_asr_transformer import E2E
from python_speech_features import fbank

# Read the demo waveform
filename = os.path.join(os.path.dirname(__file__), 'BAC009S0730W0125.wav')
sample_rate, waveform = wav.read(filename)

# Extract filter-bank features; fbank() returns (features, frame energies)
feats, _ = fbank(waveform, samplerate=16000, winlen=0.025, winstep=0.01,
                 nfilt=86, nfft=512, lowfreq=0, highfreq=None, preemph=0.97)
print(feats.shape)

root = "espnet/egs/aishell/asr1"
root = os.path.join(os.path.dirname(__file__), '../..', root)
model_dir = root + "/exp/train_sp_pytorch_train_pytorch_transformer_lr1.0/results"

# Load the model configuration and weights
with open(model_dir + "/model.json", "r") as f:
    idim, odim, conf = json.load(f)
model = E2E(idim, odim, argparse.Namespace(**conf))
model.load_state_dict(torch.load(model_dir + "/model.last10.avg.best"), strict=False)
model.cpu().eval()

# Load the token list
token_list = conf['char_list']

# Recognize the utterance
parser = get_parser()
args = parser.parse_args(["--beam-size", "2", "--ctc-weight", "1.0",
                          "--result-label", "out.json", "--model", ""])
result = model.recognize(feats, args, token_list)
print("result")
print(result)

# Map the token ids back to characters and strip the special symbols
s = "".join(token_list[y] for y in result[0]["yseq"])
s = s.replace("<eos>", "").replace("<space>", " ").replace("<blank>", "")
print("prediction: ", s)</code>
Recognition result
The recognition result is empty; the cause has not yet been analyzed. (One suspect: the raw fbank features here are not normalized with the provided cmvn.ark and may not match the feature configuration used in training.)
<code>python demo_asr.py
(280, 86)
result
[{'score': -5.416276266070469, 'yseq': [4232, 4232]}]
prediction: </code>
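One thing to check when debugging the empty output is feature normalization: the recipe ships cmvn.ark under data/train_sp, but the demo feeds raw fbank features to the model. A hedged sketch of applying global CMVN with plain NumPy is below; the stats layout (a 2×(dim+1) matrix where row 0 holds per-dimension sums plus the frame count in the last column, and row 1 holds squared sums) follows Kaldi's convention, and actually reading cmvn.ark would need something like kaldiio, which is not shown here.

```python
import numpy as np

def apply_global_cmvn(feats, stats):
    """Normalize features with Kaldi-style global CMVN statistics.

    feats: (frames, dim) feature matrix
    stats: (2, dim + 1) matrix; row 0 holds per-dimension sums with the
           frame count in the last column, row 1 holds squared sums.
    """
    count = stats[0, -1]
    mean = stats[0, :-1] / count
    var = stats[1, :-1] / count - mean ** 2
    return (feats - mean) / np.sqrt(np.maximum(var, 1e-20))

# Synthetic demonstration: build stats from random features, then check
# that the normalized output has near-zero mean and unit variance.
rng = np.random.default_rng(0)
x = rng.normal(loc=3.0, scale=2.0, size=(200, 4))
stats = np.zeros((2, 5))
stats[0, :-1] = x.sum(axis=0)
stats[1, :-1] = (x ** 2).sum(axis=0)
stats[0, -1] = x.shape[0]
y = apply_global_cmvn(x, stats)
print(y.mean(axis=0), y.std(axis=0))
```

If mismatched features are indeed the problem, normalizing `feats` with the shipped cmvn.ark statistics before calling `model.recognize` would be the first fix to try.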