State-of-the-art Convolutional Neural Networks are characterized by heterogeneous convolutional layers that balance accuracy against computational complexity. Run-time adaptive convolution architectures able to process feature maps with kernels of various sizes and strides are highly desirable to achieve a favorable trade-off between speed and power dissipation. This paper presents the design of an adaptive architecture able to efficiently manage convolutional layers with different run-time parameters. To guarantee high resource utilization for all the supported kernel sizes and strides, and in contrast with existing competitors, the proposed design combines non-uniform basic blocks, each customized differently from the others. As a further advantage, the hardware architecture presented here efficiently manages both odd and even kernel sizes, which is useful in models that also require transposed convolutional layers. When implemented on a Xilinx XC7Z045 FPGA SoC device, the proposed engine reaches a peak throughput of 217.2 GOPS and dissipates about 2.75 W at a clock frequency of 150 MHz.
Run-time adaptive hardware accelerator for convolutional neural networks
Sestito C.;Spagnolo F.;Corsonello P.;Perri S.
2021-01-01
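The performance figures quoted in the abstract can be combined into two derived metrics. The following is a minimal sketch that assumes only the three numbers stated there (217.2 GOPS, about 2.75 W, 150 MHz); the function names are illustrative and do not come from the paper. Since the power figure is approximate, and the abstract does not say how an operation is counted, the results should be read as estimates.

```python
# Figures quoted in the abstract (the power value is stated as approximate).
PEAK_THROUGHPUT_GOPS = 217.2
POWER_W = 2.75
CLOCK_MHZ = 150.0


def energy_efficiency_gops_per_watt(throughput_gops: float, power_w: float) -> float:
    """Throughput delivered per watt of dissipated power."""
    return throughput_gops / power_w


def ops_per_cycle(throughput_gops: float, clock_mhz: float) -> float:
    """Operations completed per clock cycle at peak throughput.

    GOPS * 1e9 / (MHz * 1e6) simplifies to GOPS * 1e3 / MHz.
    """
    return throughput_gops * 1e3 / clock_mhz


efficiency = energy_efficiency_gops_per_watt(PEAK_THROUGHPUT_GOPS, POWER_W)
peak_ops = ops_per_cycle(PEAK_THROUGHPUT_GOPS, CLOCK_MHZ)

print(f"{efficiency:.1f} GOPS/W")
print(f"{peak_ops:.0f} operations per clock cycle")
```

Under these assumptions the engine delivers roughly 79 GOPS/W and about 1448 operations per clock cycle at peak.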