DDESE
DeePhi Descartes Efficient Speech Recognition Engine

DDESE is an efficient end-to-end automatic speech recognition (ASR) engine that runs on Xilinx FPGAs. It is designed for deep neural networks (especially LSTMs) and uses DeePhi's deep learning acceleration solution of algorithm, software and hardware co-design (pruning, quantization, compilation and FPGA inference). The model is trained and compressed with the DeepSpeech2 framework on the LibriSpeech 1000h dataset. DDESE has been officially launched on AWS Marketplace and HUAWEI Cloud, where users can run the test scripts for CPU/FPGA performance comparison and single-sentence recognition.

Innovative full-stack acceleration solution for deep learning in the field of automatic speech recognition

ESE: Best Paper of FPGA 2017
  • Supports both unidirectional and bi-directional LSTM acceleration on FPGA for model inference
  • Supports CNN layers, Fully-Connected (FC) layers, Batch Normalization (BN) layers and a variety of activation functions such as Sigmoid, Tanh and HardTanh
  • Supports testing for both CPU/FPGA performance comparison and single-sentence recognition
  • Supports recognition of the user's own test audio (English, 16 kHz sample rate, no longer than 3 seconds)
Usage: Hardware PCIe interface, software API
Supported layers: CNN, uni/bi-directional LSTM, FC, BN
LSTM layer number: According to the requirements and resources
Channel number: According to the requirements and resources
Quantization: 16-bit
Maximum input of LSTM: 1024
Maximum size of LSTM: 2048
Density of LSTM: Any
Peephole in LSTM: Selectable
Projection in LSTM: Selectable
Activation function: Sigmoid, Tanh, HardTanh (defined in the sketch below)
Note: The hardware configuration can be changed for different requirements, such as layer number and model size.
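
For reference, a minimal NumPy sketch of the three supported activation functions is given below. These are the standard mathematical definitions, shown only for illustration; they are not DeePhi's FPGA implementation:

    # Illustrative definitions of the supported activation functions.
    # Standard formulas only, not DeePhi's hardware implementation.
    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))   # maps inputs to (0, 1)

    def tanh(x):
        return np.tanh(x)                 # maps inputs to (-1, 1)

    def hard_tanh(x):
        return np.clip(x, -1.0, 1.0)      # piecewise-linear approximation of tanh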

Our solution is an algorithm, software and hardware co-design (comprising pruning, quantization, compilation and FPGA inference).

Pruning reduces the model to a sparse one (15%~20% density) with little loss of accuracy. The weights and activations are then quantized to 16 bits, so the whole model is compressed by more than 10X. The compressed model is compiled into CSC (Compressed Sparse Column) format and deployed on the Descartes platform for efficient inference on the FPGA.
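
The idea of this compression pipeline can be illustrated with a minimal Python sketch. It is illustrative only: magnitude pruning, symmetric linear quantization and the use of NumPy/SciPy here are assumptions for the example, not DeePhi's actual toolchain:

    # Minimal sketch: prune to ~15% density, quantize to 16 bits, store in CSC format.
    # Illustrative only -- not the DeePhi compression toolchain.
    import numpy as np
    from scipy.sparse import csc_matrix

    def prune(weights, density=0.15):
        """Keep only the largest-magnitude weights so that ~density of them remain."""
        k = int(weights.size * density)
        threshold = np.sort(np.abs(weights), axis=None)[-k]
        return weights * (np.abs(weights) >= threshold)

    def quantize_int16(weights):
        """Symmetric linear quantization of weights to 16-bit integers."""
        scale = np.abs(weights).max() / 32767.0
        return np.round(weights / scale).astype(np.int16), scale

    w = np.random.randn(1024, 2048).astype(np.float32)  # a dense LSTM weight matrix
    w_sparse = prune(w, density=0.15)                    # ~15% of the weights remain
    w_q, scale = quantize_int16(w_sparse)                # 16-bit fixed-point values
    w_csc = csc_matrix(w_q)                              # column-wise indices + values
    print(w_csc.nnz / w.size, w_csc.data.dtype)          # achieved density and storage type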

Our ASR system and model structure are as follows:

The achievements of DDESE are as follows:

If only LSTM layers are considered, a 2.87X speedup (unidirectional LSTM model) and a 2.56X speedup (bi-directional LSTM model) can be achieved compared to GPU (Tesla P4 + cuDNN).

  • For LSTM layers only (input audio: 1 second)

If both the CNN layers and the bi-directional LSTM layers are accelerated, a 2.06X speedup can be achieved compared to GPU (Tesla P4 + cuDNN) for the whole end-to-end speech recognition process.

  • For CNN layers + bi-directional LSTM layers (input audio: 1 second)
    Note: E2E is short for end-to-end, ACT is short for activation, WER is short for word error rate; input audio length: 1 second.

The details of the performance comparison for the bi-directional LSTM model are as follows:

For DDESE on HUAWEI cloud, please visit:
https://app.huaweicloud.com/product/00301-111291-0--0
For DDESE on AWS Amazon, please visit:
https://aws.amazon.com/marketplace/pp/B079N2J42R?qid=1528341878497&sr=0-1&ref_=srh_res_product_title
We assume you are familiar with AWS F1 instances. If you are not, please refer to
https://docs.aws.amazon.com/zh_cn/AWSEC2/latest/UseGuide/concepts.html
You should launch and log in to the DDESE instance before running the tests.
Environment Settings
# sudo bash (make sure you are in the root environment)
# source /opt/Xilinx/SDx/2017.1.rte/setup.sh (start the SDAccel platform)

# cd ASR_Accelerator/deepspeech2 (where the test tools are placed)
# source activate test_py3 (activate the Python 3.6 environment)

After the above steps are done, you are free to test the ASR process.

Test Example

The following command deploys a model on CPU and transcribes the same sentence 1000 times.

# python aws_test.py --audio_path data/middle_audio/wav/middle1.wav --single_test

The following command deploys a model on FPGA and transcribes the same sentence 1000 times.

# python aws_test.py --fpga_config deephi/config/fpga_cnnblstm_0.15.json --audio_path data/middle_audio/wav/middle1.wav --no_cpu --single_test

With the help of these tests, you can compare the performance of the same automatic speech recognition task on CPU and FPGA.

Command Description

This part details more commands that you can use to test DeePhi_ASRAcc. You can also change some parameters according to the parameter descriptions below.

By default, this command will deploy the model on the CPU, transcribe all the sentences (“.wav” format) under data/short_audio/wav/, and print the output logs.

# python aws_test.py (multi-sentence test to show the performance of FPGA over CPU)

By default, this command will deploy the model on the CPU, transcribe data/short_audio/wav/short_audio1.wav, and print the output logs.

# python transcribe.py (single-sentence test to show the accuracy of the model)

By default, both commands deploy the model only on the CPU. You can add an FPGA configuration to deploy the model on the FPGA, as shown below:

With the following command, the model is deployed on both the CPU and the FPGA, and the ASR process is tested on each in turn.

# python aws_test.py --fpga_config deephi/config/fpga_bilstm_0.15.json
(deploy the model on both CPU and FPGA and run the test)

With the following command, the model is deployed on the FPGA instead of the CPU, and the ASR process runs on the FPGA.

# python transcribe.py --fpga_config deephi/config/fpga_bilstm_0.15.json
(deploy the model on FPGA and do the ASR)
Command Parameters Description

A. For command aws_test.py:

--no_cpu
    Set this parameter to skip running the ASR process on the CPU.

--wav_folder ROOTDIR_OF_YOUR_WAV_FILES
    Specify ROOTDIR_OF_YOUR_WAV_FILES as the folder where your wav files are saved; the command will then transcribe every .wav file under this folder. This parameter SHOULD NOT be used together with the --single_test parameter.

--audio_path PATH_TO_YOUR_WAV_FILE
    Specify PATH_TO_YOUR_WAV_FILE as the wav file that you want to transcribe; the command will then transcribe the specified sentence 1000 times. This parameter SHOULD be used together with the --single_test parameter.

--single_test
    Set this parameter to run single-test mode, which transcribes the same sentence 1000 times on the specified models. Otherwise, all the sentences under the specified folder are transcribed once.

B. For command transcribe.py:

--audio_path PATH_TO_YOUR_WAV_FILE
    Specify PATH_TO_YOUR_WAV_FILE as the wav file that you want to transcribe.

Note: The folder named “data” consists of short audios, middle audios and long audios.
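
For example, these parameters can be combined to transcribe every wav file in a folder on the FPGA only. The folder path below is taken from the earlier middle_audio example; substitute your own folder of 16 kHz wav files:

# python aws_test.py --fpga_config deephi/config/fpga_cnnblstm_0.15.json --no_cpu --wav_folder data/middle_audio/wav/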

Try Using Your Own Input

Please upload your own wav file (must be 16 kHz sample rate, recorded in a clean environment, shorter than 3 seconds). Then use the following command to transcribe the uploaded sentence:

# python transcribe.py --audio_path PATH_TO_YOUR_WAV_FILE
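
If your recording is not already a 16 kHz mono wav file, it can be converted before uploading. The command below is a generic ffmpeg example (ffmpeg is assumed to be installed on your local machine; it is not part of DDESE):

# ffmpeg -i YOUR_RECORDING.m4a -ar 16000 -ac 1 PATH_TO_YOUR_WAV_FILE (resample to 16 kHz, mono)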

If you are interested in our work or have any problems running our solution on AWS F1, please contact us at the following email address: