http://hdl.handle.net/1893/36026
Appears in Collections: | Computing Science and Mathematics eTheses |
Title: | Domain-Specific Optimisations for Image Processing Algorithms on Heterogeneous Architectures |
Author(s): | Ali, Teymoor R |
Supervisor(s): | Nicol, Robert L Bhowmik, Deepayan |
Keywords: | Image Processing FPGA Hetergeneous Computing Hardware Benchmarking Computer Vision Domain-Specifc Optimisations |
Issue Date: | 29-Dec-2023 |
Publisher: | University of Stirling |
Citation: | Teymoor Ali, Deepayan Bhowmik, and Robert Nicol. 2023. Domain-Specific Optimisations for Image Processing on FPGAs. J. Signal Process. Syst. 95, 10 (Oct 2023), 1167–1179. https://doi.org/10.1007/s11265-023-01888-2 |
Abstract: | As real-time embedded vision systems become more ubiquitous, the demand for better energy efficiency, runtime, and accuracy have become vital metrics in evaluating overall performance. These requirements have led to innovative computing architectures, leveraging heterogeneity that combine various accelerators into a single processing fabric. These new architectures lead to new challenges in understanding the most efficient way to partition and optimise algorithms on the most suitable accelerator. In this thesis, domain-specific optimisation techniques are applied to enhance performance and resource efficiency for image processing algorithms on heterogeneous hardware. Domain-specific optimisations are preferred for being hardware agnostic and their ability to cater to a wider range of image processing pipelines within the domain. First, a literature analysis is conducted on image processing implementations on heterogeneous hardware, high-level synthesis tools, optimisation strategies, and frameworks. The first objective is to develop macro-micro benchmarks for image processing algorithms to determine the suitability of these algorithms on hardware accelerators. The profiling led to the development of a comprehensive benchmarking framework, Heterogeneous Architecture Benchmarking on Unified Resources (HArBoUR). The framework decomposes each algorithm into its fundamental properties that would affect overall performance. A collection of representative image processing algorithms from various operation domains (\eg Filters, Morphological, Geometric, Arithmetic, CNNs, Feature Extraction ) and full pipelines (\eg edge detection, feature extraction, convolutional neural network) are used as examples to understand the compute efficiency of on three hardware platforms (CPU, GPU, FPGA). The results show that parallelism and memory access patterns influence hardware performance. GPUs excel for algorithms with large data-size parallel operations and regular memory access patterns. FPGAs better suit lower parallel factor and data-sized operations. In addition, optimising for irregular memory access patterns and complex computations remains challenging on both FPGA and GPU architectures. However, FPGAs offer high performance relative to their resource and clock speed, but their specialised architecture requires careful implementation for optimal results. In the case of feature extraction algorithms, GPU acceleration is preferable for high matrix operation-intensive stages due to faster execution times. At the same time, FPGAs are more suitable for lower arithmetic stages due to comparable performance and energy consumption profiles. Edge detection and CNN pipelines demonstrate GPUs faster performance but at a significantly higher energy consumption than FPGAs. FPGAs exhibit lower latency than GPUs, considering initialisation and memory transfer times. CPUs perform comparably to both hardware in low-complexity and data-dependant algorithms. In CNN pipelines, FPGAs compute particular layers faster but generally have slower total inference times than GPUs. Nonetheless, FPGAs offer flexibility with bit-widths and operation-fused custom kernels. Domain-specific optimisations are applied to algorithms such as SIFT feature extraction, filter operations, and CNN pipelines to understand the runtime, energy, and accuracy. Techniques such as downsampling, datatype conversion, and convolution kernel size reduction are investigated to enhance performance. These optimisations notably improve computation time across different processing architectures, with the SIFT algorithm implementation surpassing state-of-the-art FPGA implementations and achieving comparable runtime to GPUs at low power. However, these optimisations led to a 5-20\% image accuracy loss across all algorithms. Finally, the research outcomes described above are applied to two constructed heterogeneous architectures aimed at two domains, low-power (LP) and high-power (HP) systems. Partitioning strategies are explored for mapping CNN layers and operation stages of feature extraction algorithms onto heterogeneous architectures. The results demonstrate that layer-based partitioning methods outperform their fasted homogeneous accelerator counterparts regarding energy efficiency and execution time, suggesting a promising approach for efficient deployment on heterogeneous architectures. |
Type: | Thesis or Dissertation |
URI: | http://hdl.handle.net/1893/36026 |
File | Description | Size | Format | |
---|---|---|---|---|
Teymoor_Thesis_2023.pdf | 17.73 MB | Adobe PDF | Under Embargo until 2026-01-01 Request a copy |
Note: If any of the files in this item are currently embargoed, you can request a copy directly from the author by clicking the padlock icon above. However, this facility is dependent on the depositor still being contactable at their original email address.
This item is protected by original copyright |
Items in the Repository are protected by copyright, with all rights reserved, unless otherwise indicated.
The metadata of the records in the Repository are available under the CC0 public domain dedication: No Rights Reserved https://creativecommons.org/publicdomain/zero/1.0/
If you believe that any material held in STORRE infringes copyright, please contact library@stir.ac.uk providing details and we will remove the Work from public display in STORRE and investigate your claim.