Hypothetical Implementation of Field-Programmable Gate Arrays in Discreet Low-Profile Real-Time Image Analysis and Classification

Daniel P Corcoran
University of Cincinnati College of Engineering and Applied Science

Abstract: State-of-the-art real-time object detection systems remain limited by their hardware requirements. The neural networks at the core of systems such as YOLOv3 [1] cannot detect objects in real time without a dedicated video card or a substantially powerful CPU, and to my knowledge a low-profile real-time neural image classification system has not yet been demonstrated. Through proper implementation of a Field-Programmable Gate Array, a Raspberry Pi Zero W would hypothetically be capable of achieving desktop-speed image analysis (10-30 FPS) [2] while maintaining a small physical footprint and a low power profile. With this implementation, a network of self-contained "Neural Cameras" could independently parameterize and report data to a thin client or web client with no dependence on external processing.
Introduction:
Problem Analysis: Current implementations of neural networks require expensive

Motivation: I believe that parameterizing and analyzing visual data independently of typical neural network systems would allow the computational challenges of cognitive awareness and situational awareness to be separated. By simplifying the process of collecting and analyzing visual data, exciting new applications can be developed across many industries.

Potential Applications:
1. Streamline employee evaluations by directly quantifying performance metrics
2. Enhance classroom engagement by directly quantifying individual student attention
3. Augment wildlife research by autonomously collecting, categorizing, quantifying and parameterizing biological data across a wide area
4. Identify which items need to be restocked or reorganized
5. Identify which areas need to be cleaned
6. Automatically clock in/clock out employees
7. Automatically evaluate quality of complex products
8. Simplify deployment of intelligent robots
9. Simplify geographical surveying
10. Quantify and analyze behaviors, shopping habits, tastes and preferences of large aggregate groups
System Design Overview:
Included is a rudimentary overview of the fundamental elements of the system.

TODO: Include representation of memory management between FPGA and Pi.
IMPORTANT: Look into OpenVINO (acceleration library for optimized computing with Intel's hardware portfolio)




FPGA Design:  

Simple FPGAs that are in my budget:


Not an FPGA, but the Intel Movidius does basically what I want, minus the camera.

Useful Excerpt from [4]:
_______________

One approach to reduce the silicon count (therefore power consumption) required for machine learning inference is reducing the dynamic range of calculations [4]. Reducing from 32-bit to 16-bit floating point arithmetic, for example, only slightly reduces the application performance in recognition accuracy, yet can greatly reduce hardware requirements. 


...


Processing convolutions within CNN networks require many millions of coefficients to be stored and processed. Traditionally, each of these coefficients are stored in full single precision representation. Research has demonstrated that coefficients can be reduced to half precision without any significant change to overall accuracy while reducing the amount of storage needed and memory bandwidth requirements. Most of the pre-trained CNN models available today use partially reduced precision.


_____________
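The storage saving described in the excerpt can be illustrated with a small sketch. It packs the same coefficients as single precision (4 bytes each) and as IEEE 754 half precision (2 bytes each) and measures the rounding error introduced by the narrower format. The coefficient values are made up for illustration and are not taken from any trained model.

```python
import struct

def to_single_bytes(values):
    """Pack floats as little-endian IEEE 754 single precision (4 bytes each)."""
    return struct.pack(f"<{len(values)}f", *values)

def to_half_bytes(values):
    """Pack floats as little-endian IEEE 754 half precision (2 bytes each)."""
    return struct.pack(f"<{len(values)}e", *values)

def from_half_bytes(data):
    """Unpack half-precision bytes back into Python floats."""
    return list(struct.unpack(f"<{len(data) // 2}e", data))

# Hypothetical coefficients, chosen only for illustration.
weights = [0.5, -0.25, 0.1, 0.0123, -1.75]

single = to_single_bytes(weights)
half = to_half_bytes(weights)
restored = from_half_bytes(half)

# Largest absolute rounding error introduced by the half-precision round trip.
max_error = max(abs(a - b) for a, b in zip(weights, restored))

print(f"single precision: {len(single)} bytes")
print(f"half precision:   {len(half)} bytes")
print(f"max rounding error: {max_error:.2e}")
```

Halving the bytes per coefficient halves both the storage and the memory bandwidth needed to stream the weights, which is the trade-off the excerpt describes. Whether the rounding error is acceptable for a given network still has to be checked against its recognition accuracy.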


TODO: 


1. Calculate the resources required to perform the 1024 16-bit accumulations that represent each 32x32 convolution
2. Develop serial pathway for memory management between FPGA and
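As a starting point for the first TODO item, the accumulator width can be estimated with simple arithmetic. This is a rough sketch that assumes signed 16-bit fixed-point operands and one multiply-accumulate per kernel tap; if the 16-bit values are half-precision floats instead, the sizing is different. The function name is hypothetical.

```python
def accumulator_bits(operand_bits, num_terms):
    """Bits needed to sum num_terms products of two signed operands
    of operand_bits each without overflow (fixed-point assumption)."""
    product_bits = 2 * operand_bits            # a 16x16 multiply yields 32 bits
    growth_bits = (num_terms - 1).bit_length() # ceil(log2(num_terms)) guard bits
    return product_bits + growth_bits

KERNEL_SIZE = 32
terms = KERNEL_SIZE * KERNEL_SIZE              # 1024 accumulations per window
bits = accumulator_bits(16, terms)

# Worst case: every product is (-2^15) * (-2^15) = 2^30.
worst_case_sum = terms * (2 ** 15) * (2 ** 15)
fits = worst_case_sum <= 2 ** (bits - 1) - 1

print(f"{terms} accumulations need a {bits}-bit accumulator (fits: {fits})")
```

Under these assumptions each 32x32 window needs 1024 multiply-accumulate operations and a 42-bit accumulator. How many windows can be processed in parallel then depends on the multiplier and adder resources of the chosen FPGA, which this sketch does not model.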




Data Serialization:
-Develop a general format for forwarding matrix data to the FPGA controller
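One possible shape for such a format is sketched here. Everything in it is an assumption made for illustration: the magic bytes, the 8-byte header layout, the 16-bit signed element type, and the trailing CRC-32 are placeholders, not a format that the FPGA controller or darknet defines.

```python
import struct
import zlib

MAGIC = b"NC"                          # hypothetical frame marker
HEADER = struct.Struct("<2sBBHH")      # magic, dtype code, reserved, rows, cols
DTYPE_INT16 = 1                        # hypothetical code for signed 16-bit data

def pack_matrix(matrix):
    """Serialize a list-of-lists matrix of 16-bit integers into one frame."""
    rows = len(matrix)
    cols = len(matrix[0]) if rows else 0
    if any(len(row) != cols for row in matrix):
        raise ValueError("matrix rows must all have the same length")
    flat = [value for row in matrix for value in row]   # row-major order
    body = HEADER.pack(MAGIC, DTYPE_INT16, 0, rows, cols)
    body += struct.pack(f"<{len(flat)}h", *flat)
    return body + struct.pack("<I", zlib.crc32(body))   # checksum over header + data

def unpack_matrix(frame):
    """Validate a frame and rebuild the matrix it carries."""
    body = frame[:-4]
    (crc,) = struct.unpack("<I", frame[-4:])
    if zlib.crc32(body) != crc:
        raise ValueError("checksum mismatch")
    magic, dtype, _reserved, rows, cols = HEADER.unpack_from(body)
    if magic != MAGIC or dtype != DTYPE_INT16:
        raise ValueError("unsupported frame")
    flat = struct.unpack_from(f"<{rows * cols}h", body, HEADER.size)
    return [list(flat[r * cols:(r + 1) * cols]) for r in range(rows)]

sample = [[1, -2, 3], [4, 5, -6]]
frame = pack_matrix(sample)
print(len(frame), unpack_matrix(frame) == sample)
```

A fixed-size header with explicit dimensions lets the receiving side allocate its buffer before the payload arrives, and the checksum gives a cheap way to detect corruption on a serial link.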
Neural Network Library Compatibility:
-Implement a data serialization service in place of the CUDA service in darknet. Is darknet the best choice of NN library? (Assuming so, since most general software compatibility issues have already been worked out on the Pi.)
Results:
Anecdotal - Running TINY-YOLO weights on the Pi Zero W SoC (64 MB allocated to the GPU, 1 GB swap file, 850 MHz clock, 2.4 A power supply, and a serial-USB connection to Windows while running SSH) still takes 2 minutes to process a frame.
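To put the anecdotal figure in context, the arithmetic below compares the measured 2 minutes per frame with the 10-30 FPS target from the abstract. It is only a back-of-the-envelope calculation based on that single observation.

```python
SECONDS_PER_FRAME = 120.0        # anecdotal: 2 minutes per frame on the Pi Zero W
TARGET_FPS_RANGE = (10.0, 30.0)  # desktop-speed target from the abstract

def required_speedup(seconds_per_frame, target_fps):
    """Factor by which frame processing must accelerate to hit target_fps."""
    return seconds_per_frame * target_fps

baseline_fps = 1.0 / SECONDS_PER_FRAME
low, high = (required_speedup(SECONDS_PER_FRAME, fps) for fps in TARGET_FPS_RANGE)

print(f"baseline: {baseline_fps:.4f} FPS")
print(f"required speedup: {low:.0f}x to {high:.0f}x")
```

By this estimate the FPGA path would need to be roughly three orders of magnitude faster than the CPU-only baseline to reach the stated target.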

-I'd like to do some preliminary testing with an Intel Movidius. Is it worth shelling out $$$ for?

Assumption:
-Will the size of the FPGA be enough to handle most neural network applications? Am I being too
Future Improvements:
Conclusion:
Acknowledgements:





Citations: 
[1] https://pjreddie.com/media/files/papers/YOLOv3.pdf
[2] https://www.youtube.com/watch?v=_iMboyu8iWc
[3] https://forums.xilinx.com/t5/Xcell-Daily-Blog-Archived/Tincy-YOLO-a-real-time-low-latency-low-power-object-detection/ba-p/815840
[4] https://www.bittware.com/resources/bwnn/ 
[5] https://www.researchgate.net/publication/323375650_A_Lightweight_YOLOv2_A_Binarized_CNN_with_A_Parallel_Support_Vector_Regression_for_an_FPGA
