Accelerating and Improving Deep Learning Inference on Embedded Platforms via TensorRT

Date

2024-05

Authors

Zhou, Yuxiao

Abstract

Deep learning (DL) has advanced remarkably, emerging as one of the most successful machine learning techniques. Deploying DL models on a variety of computing platforms, including embedded systems, edge devices, and system-on-chips (SoCs), has enabled a wide range of DL-powered applications. Despite achieving impressive accuracy, the large size of deep neural networks can demand significant execution time and computing resources, not only for training but also for inference; the latter is often more time-sensitive and may need to run on embedded and edge platforms with limited computing resources. To address these challenges, NVIDIA developed TensorRT, which integrates seamlessly with popular DL frameworks such as PyTorch and TensorFlow. On the hardware side, modern SoCs, such as NVIDIA Jetson devices, are often equipped with a diverse range of accelerators, each characterized by distinct power and performance features. This dissertation focuses on improving the execution time and power efficiency of model inference using TensorRT on embedded hardware platforms. The specific contributions are threefold. First, we explore and evaluate several alternative workflows for deploying TensorRT in model inference. Each workflow involves steps to convert a given PyTorch or TensorFlow model into a TensorRT counterpart. We implement each of these workflows and evaluate them on several crucial aspects, including model accuracy, inference time, inference throughput, and GPU utilization. Our experimental results demonstrate that TensorRT significantly improves inference efficiency without compromising accuracy. Furthermore, each of the alternative workflows for incorporating TensorRT has its own pros and cons, and we provide a discussion and comparison to guide workflow selection under different circumstances. Second, we aim to further improve inference efficiency by incorporating model quantization with TensorRT. We deploy model quantization methods within the TensorRT framework and assess their effectiveness through experiments on the NVIDIA Jetson Orin SoC. There are also different workflows for implementing quantization with TensorRT, which we investigate and profile across various DL models and batch sizes. Our experiments demonstrate that employing quantization within TensorRT significantly improves inference efficiency while maintaining a high level of inference accuracy. Additionally, we discuss the advantages and disadvantages of these quantization workflows and, based on our analysis, provide recommendations for selecting an appropriate workflow for different application scenarios. Third, we investigate approaches to efficiently leverage multiple accelerators on an SoC for model inference using NVIDIA TensorRT. We profile the execution time and energy characteristics of neural network layers running on different accelerators and examine the factors that influence layer execution. We propose two algorithms that assign individual layers of a single model across multiple accelerators, aiming to minimize energy consumption while meeting a predetermined target inference execution time. We implement the proposed approaches with the ResNet50 model on the NVIDIA Jetson Orin platform. Our experiments demonstrate that adopting a coarse-grained layer grouping strategy and properly assigning layer groups to different accelerators can yield greater energy savings while preserving the desired end-to-end inference time.
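As a concrete illustration of the kind of deployment workflow the abstract describes, the sketch below exports a pretrained PyTorch model to ONNX and builds a TensorRT engine from it using the TensorRT 8.x Python API. This is a minimal, generic example rather than the dissertation's exact pipeline; the model choice (torchvision's ResNet50), input shape, file names, and precision flags are illustrative assumptions.

# Minimal sketch: PyTorch -> ONNX -> TensorRT engine (assumed setup, TensorRT 8.x API)
import torch
import torchvision
import tensorrt as trt

# 1) Export a pretrained PyTorch model to ONNX (model and input shape are illustrative).
model = torchvision.models.resnet50(weights="IMAGENET1K_V1").eval()
dummy = torch.randn(1, 3, 224, 224)
torch.onnx.export(model, dummy, "resnet50.onnx", opset_version=13,
                  input_names=["input"], output_names=["output"])

# 2) Parse the ONNX file and build a serialized TensorRT engine.
logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
network = builder.create_network(
    1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
parser = trt.OnnxParser(network, logger)
with open("resnet50.onnx", "rb") as f:
    if not parser.parse(f.read()):
        raise RuntimeError(parser.get_error(0))

config = builder.create_builder_config()
config.set_flag(trt.BuilderFlag.FP16)  # reduced precision, if the hardware supports it
serialized_engine = builder.build_serialized_network(network, config)
with open("resnet50_fp16.engine", "wb") as f:
    f.write(serialized_engine)

# For the quantization study, INT8 can be enabled analogously via
# config.set_flag(trt.BuilderFlag.INT8) together with a calibrator; for the
# multi-accelerator study, layers can be steered toward the DLA through
# config.default_device_type / config.DLA_core with GPU fallback enabled.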

Keywords

deep learning, SoCs, TensorRT, model optimization, DLA, NVIDIA Jetson, model inference, multiple accelerators

Citation

Zhou, Y. (2024). Accelerating and improving deep learning inference on embedded platforms via TensorRT (Unpublished dissertation). Texas State University, San Marcos, Texas.
