DDESE is an efficient end-to-end automatic speech recognition (ASR) engine built on Xilinx FPGAs. It is designed to accelerate deep neural networks (especially LSTMs) through DeePhi's algorithm, software, and hardware co-design solution, which covers pruning, quantization, compilation, and FPGA inference. We use the DeepSpeech2 framework with the 1000-hour LibriSpeech dataset for model training and compression. DDESE has been officially launched on AWS Marketplace and HUAWEI Cloud, where users can run the test scripts for CPU/FPGA performance comparison and single-sentence recognition.
Innovative full-stack acceleration solution for deep learning in the field of automatic speech recognition.
ESE: best paper of FPGA 2017
- Support both unidirectional and bi-directional LSTM acceleration on FPGA for model inference
- Support CNN layers, Fully-Connected (FC) layers, Batch Normalization (BN) layers and varieties of activation functions such as Sigmoid, Tanh and HardTanh
- Support testing for both performance comparison of CPU/FPGA and single sentence recognition
- Support recognition of the user's own test audio (English, 16 kHz sample rate, no longer than 3 seconds)
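The activation functions listed above are standard; a minimal scalar sketch for reference (not the hardware implementation, which operates on quantized fixed-point values):

```python
import math

def sigmoid(x):
    # 1 / (1 + e^-x): squashes input to (0, 1)
    return 1.0 / (1.0 + math.exp(-x))

def tanh(x):
    # hyperbolic tangent: squashes input to (-1, 1)
    return math.tanh(x)

def hardtanh(x, min_val=-1.0, max_val=1.0):
    # piecewise-linear approximation of tanh: clip to [min_val, max_val],
    # which is cheap to evaluate in hardware
    return max(min_val, min(max_val, x))
```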
Our solution is an algorithm, software, and hardware co-design, comprising pruning, quantization, compilation, and FPGA inference.
Pruning reduces the model to a sparse one (15%~20% density) with little loss of accuracy. The weights and activations are then quantized to 16 bits, so the whole model is compressed by more than 10X. The compressed model is encoded in CSC (Compressed Sparse Column) format at compilation time and deployed on the Descartes platform for efficient inference on the FPGA.
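As a rough illustration of the two compression steps described above, the sketch below quantizes weights to 16-bit fixed point and encodes a sparse matrix in CSC format. The `frac_bits` value and the pure-Python encoding are illustrative assumptions, not DDESE internals:

```python
def quantize_int16(weights, frac_bits=12):
    # Fixed-point quantization: scale by 2^frac_bits, round, and
    # saturate to the signed 16-bit range.
    # frac_bits=12 is an illustrative choice, not the value used by DDESE.
    scale = 1 << frac_bits
    lo, hi = -(1 << 15), (1 << 15) - 1
    return [max(lo, min(hi, int(round(w * scale)))) for w in weights]

def to_csc(dense):
    # Compressed Sparse Column: store only the nonzeros, column by column.
    # Returns (values, row_indices, col_pointers); column c's nonzeros
    # occupy values[col_ptr[c]:col_ptr[c+1]].
    rows, cols = len(dense), len(dense[0])
    values, row_idx, col_ptr = [], [], [0]
    for c in range(cols):
        for r in range(rows):
            if dense[r][c] != 0:
                values.append(dense[r][c])
                row_idx.append(r)
        col_ptr.append(len(values))
    return values, row_idx, col_ptr
```

At 15%~20% density, storing only nonzeros plus indices is far smaller than the dense matrix, which is what makes the >10X overall compression possible.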
Our ASR system and model structure are as follows:
Our achievements with DDESE are as follows:
Considering only the LSTM layers, 2.87X and 2.56X speedups over GPU (Tesla P4 + cuDNN) are achieved for the unidirectional and bi-directional LSTM models respectively.
For LSTM layers only (input audio: 1 second)
Considering both the CNN and bi-directional LSTM layers for further acceleration, a 2.06X speedup over GPU (Tesla P4 + cuDNN) is achieved for the whole end-to-end speech recognition process.
For CNN layers + bi-directional LSTM layers (input audio: 1 second)
Note: E2E is short for end-to-end, ACT is short for activation, WER is short for word error rate. Input audio length: 1 second.
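WER, mentioned in the note above, is the word-level edit distance between a reference transcript and the recognizer's hypothesis, divided by the number of reference words. A minimal sketch:

```python
def wer(reference, hypothesis):
    # Word error rate: word-level Levenshtein distance / reference length.
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between the first i reference words
    # and the first j hypothesis words
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,          # deletion
                           dp[i][j - 1] + 1,          # insertion
                           dp[i - 1][j - 1] + cost)   # substitution
    return dp[len(ref)][len(hyp)] / len(ref)
```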
The details of performance comparison for bi-directional LSTM model are as follows:
You should launch and log in to the DDESE instance before the test.
# sudo bash (make sure you are under root environment)
# source /opt/Xilinx/SDx/2017.1.rte/setup.sh (start the SDAccel platform)
# cd ASR_Accelerator/deepspeech2 (where the test tools are placed)
# source activate test_py3 (activate python3.6 environment)
After the above steps are done, you are free to test the ASR process.
The following command deploys a model on CPU and transcribes the same sentence 1000 times.
# python aws_test.py --audio_path data/middle_audio/wav/middle1.wav --single_test
The following command deploys a model on FPGA and transcribes the same sentence 1000 times.
# python aws_test.py --fpga_config deephi/config/fpga_cnnblstm_0.15.json --audio_path data/middle_audio/wav/middle1.wav --no_cpu --single_test
With the help of these tests, you could compare the performance of the same automatic speech recognition task on CPU and FPGA.
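The speedup figures from such a comparison are simply the ratio of mean per-sentence latencies. A hypothetical sketch of the measurement, where `transcribe_fn` stands in for the actual model call (not included here):

```python
import time

def mean_latency(transcribe_fn, audio, runs=1000):
    # Time repeated transcription of the same sentence, as the
    # --single_test mode above does, and return the mean latency
    # in seconds per run.
    start = time.perf_counter()
    for _ in range(runs):
        transcribe_fn(audio)
    return (time.perf_counter() - start) / runs

def speedup(cpu_latency, fpga_latency):
    # Speedup of FPGA over CPU: ratio of mean latencies.
    return cpu_latency / fpga_latency
```

Timing many runs of the same sentence averages out per-call jitter, which is why the test transcribes each sentence 1000 times.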
In this part, we detail more commands that you can use to test DeePhi_ASRAcc. Furthermore, you can change some parameters according to the parameter descriptions below.
By default, this command will deploy a model on CPU and transcribe all the sentences (“.wav” format) under data/short_audio/wav/ and print the output logs.
# python aws_test.py (multi-sentence test to show the performance of FPGA over CPU)
By default, this command will deploy the model on CPU and transcribe data/short_audio/wav/short_audio1.wav and print the output logs.
# python transcribe.py (single-sentence test to show the accuracy of the model)
By default, both commands deploy the model only on CPU; you can add an FPGA configuration to deploy the model on FPGA, as below:
By running this command, the model will be deployed on both CPU and FPGA, and the ASR process will be tested on CPU and FPGA one by one.
# python aws_test.py --fpga_config deephi/config/fpga_bilstm_0.15.json
(deploy the model on both CPU and FPGA and run the test)
By running this command, the model will be deployed on FPGA instead of CPU, and the ASR process will run there.
# python transcribe.py --fpga_config deephi/config/fpga_bilstm_0.15.json
(deploy the model on FPGA and do the ASR)
Command Parameters Description
A. for command aws_test.py:
--no_cpu: set this parameter to avoid running the ASR process on CPU.
:specify ROOTDIR_OF_YOUR_WAV_FILE as the folder where your wav files are saved; the command will then transcribe every .wav file under that folder. This parameter SHOULD NOT be used together with the --single_test parameter.
--audio_path: specify PATH_TO_YOUR_WAV_FILE as the wav file that you want to transcribe; the command will then transcribe the specified sentence 1000 times. This parameter SHOULD be used together with the --single_test parameter.
--single_test: set this parameter to run single-test mode, i.e., transcribe the same sentence 1000 times on the specified model. Otherwise, every sentence under the specified folder is transcribed once.
B. for command transcribe.py:
--audio_path: specify PATH_TO_YOUR_WAV_FILE as the wav file that you want to transcribe.
Note: The folder named “data” consists of short audios, middle audios and long audios.
Try Using Your Own Input
Please upload your own wav file (must be 16 kHz sample rate, recorded in a clean environment, shorter than 3 seconds). Then use the following command to transcribe the uploaded sentence:
# python transcribe.py --audio_path PATH_TO_YOUR_WAV_FILE
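Before uploading, you may want to verify the constraints above locally. A small sketch using Python's standard wave module (`check_wav` is a hypothetical helper, not part of the DDESE tooling):

```python
import wave

def check_wav(path):
    # Verify the input constraints stated above:
    # 16 kHz sample rate and duration shorter than 3 seconds.
    with wave.open(path, "rb") as w:
        rate = w.getframerate()
        duration = w.getnframes() / float(rate)
    if rate != 16000:
        raise ValueError("sample rate must be 16 kHz, got %d Hz" % rate)
    if duration >= 3.0:
        raise ValueError("audio must be shorter than 3 s, got %.2f s" % duration)
    return rate, duration
```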