Runtime Reconfigurable Hardware Accelerator for Energy-Efficient Transposed Convolutions
Spagnolo F.; Perri S.
2022-01-01
Abstract
Transposed convolution is a crucial operation in several computer vision applications, including emerging Convolutional Neural Networks for super-resolution, generative adversarial, and segmentation tasks. Such algorithms involve high computational loads and memory requirements, which hinder their deployment in real-time and power-constrained embedded systems. In addition, they may adopt different kernel sizes along the network, making the design of flexible yet efficient hardware architectures highly desirable. This paper presents a reconfigurable accelerator able to adapt its computational capabilities at runtime to perform transposed convolutions with different kernel sizes. When accommodated within the Xilinx XC7Z020 and XC7K410T chips, the proposed design dissipates less than 95 mW at 125 MHz and 179 mW at 250 MHz, exhibiting throughputs of 1.95 and 3.9 Giga outputs per second, respectively. Both implementations outperform state-of-the-art counterparts, achieving an energy efficiency up to 4.4 times higher. When used to accelerate the Fast Super-Resolution Convolutional Neural Network, the novel reconfigurable architecture achieves an energy efficiency at least 23% better than its competitors.
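As background for the operation being accelerated, transposed convolution upsamples a feature map by scattering each input sample, scaled by the kernel weights, into a strided window of the output. The following minimal single-channel NumPy sketch illustrates only this scatter-based formulation; the function name, stride, and kernel values are illustrative assumptions and do not reproduce the hardware architecture described in the paper.

import numpy as np

def transposed_conv2d(x, kernel, stride=2):
    # Scatter-based 2-D transposed convolution (single channel, no padding):
    # each input pixel multiplies the whole kernel, and the result is
    # accumulated into a stride-spaced window of the enlarged output.
    h, w = x.shape
    k = kernel.shape[0]
    out_h = (h - 1) * stride + k
    out_w = (w - 1) * stride + k
    y = np.zeros((out_h, out_w), dtype=x.dtype)
    for i in range(h):
        for j in range(w):
            y[i * stride:i * stride + k, j * stride:j * stride + k] += x[i, j] * kernel
    return y

# Example: a 3x3 input with a 3x3 kernel and stride 2 yields a 7x7 output,
# since each output side is (H - 1) * stride + k.
x = np.arange(9, dtype=np.float32).reshape(3, 3)
k = np.ones((3, 3), dtype=np.float32)
print(transposed_conv2d(x, k, stride=2).shape)  # (7, 7)

When the stride is smaller than the kernel size, neighbouring windows overlap and their contributions must be accumulated; handling this efficiently across several kernel sizes is the kind of flexibility the proposed runtime-reconfigurable accelerator targets.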