# **RE-THINKING COMPUTING WITH NEURO-INSPIRED LEARNING: ALGORITHMS TO HARDWARE**

#### Kaushik Roy

kaushik@purdue.edu, Elmore School of Electrical and Computer Engineering, Purdue University



#### Motivation







Google Edge TPU



RetinaNet DNN\* on smart glasses

| Metric               | Value        |
|----------------------|--------------|
| **Performance**      |              |
| Frames/sec           | 13.3         |
| **Battery Life**     |              |
| Energy/op            | 0.5 pJ/op    |
| Energy/frame         | 0.15 J/frame |
| Time-to-die (2.1 Wh) | 64 mins      |

\*300 GOPs/inference
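The table entries follow from simple arithmetic, which the short sketch below reproduces. It assumes that the whole 2.1 Wh battery is spent on inference compute and nothing else (display, radio, sensors); the function names are illustrative only.

```python
# Back-of-the-envelope check of the smart-glasses numbers quoted on the slide.
OPS_PER_INFERENCE = 300e9   # 300 GOPs per inference
ENERGY_PER_OP_J = 0.5e-12   # 0.5 pJ/op
FRAMES_PER_SEC = 13.3
BATTERY_WH = 2.1


def energy_per_frame_j(ops, joules_per_op):
    """Energy of one inference in joules."""
    return ops * joules_per_op


def battery_life_minutes(battery_wh, joules_per_frame, fps):
    """Minutes until the battery is empty, assuming compute is the only load."""
    power_w = joules_per_frame * fps   # average compute power in watts
    battery_j = battery_wh * 3600.0    # 1 Wh = 3600 J
    return battery_j / power_w / 60.0


e_frame = energy_per_frame_j(OPS_PER_INFERENCE, ENERGY_PER_OP_J)
minutes = battery_life_minutes(BATTERY_WH, e_frame, FRAMES_PER_SEC)
print(f"{e_frame:.2f} J/frame, {minutes:.1f} min")
```

Under these assumptions the result is roughly an hour of operation, in line with the 64 mins in the table.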

Where do the inefficiencies come from?

- Algorithms
- Sensors/Hardware Architecture
- Circuits and Devices

ML application trends (Training)

#### **Comparison with Biological Systems**

Biological systems still possess a level of functionality that is unmatched in artificial systems

> Consider the reactive behavior of a fruit fly (~100K neurons):

- Fly fast while avoiding obstacles in cluttered environments
- Dodge dynamic obstacles and active attacks



Flying monkey UAV, UPenn: ~1-2 W compute



Fruit fly: ~µW compute



Dickinson's lab, Caltech

### **The Big Picture**

Enable autonomous intelligent systems by **improving the compute efficiency and robustness of cognitive tasks** through cross-layer innovations from algorithms to hardware

#### **Exemplary application driver: Autonomous drones**



#### **Cross-Layer Design: Sensors, Algorithms, Hardware**



## Modalities.....




#### **Tradeoffs: Starting with Sensors....**



#### **Frame vs Event-based Cameras**






#### **Our Approaches**



#### **Event inputs**

Data at high temporal but low spatial resolution

#### **Frame inputs**

Data at high spatial but low temporal resolution

Combined inputs for better flow estimation
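As a rough illustration of combining the two modalities, the sketch below accumulates events into a two-channel (ON/OFF) count image and stacks it with a grayscale frame. The event layout `(x, y, t, polarity)`, the function names, and the simple channel concatenation are assumptions made for this example; the slides do not specify how the fusion networks combine their inputs.

```python
import numpy as np


def events_to_count_image(events, height, width):
    """Accumulate events into a 2-channel (ON, OFF) count image.

    `events` is an (N, 4) array with rows (x, y, t, polarity), polarity in {+1, -1}.
    This layout is an assumption for illustration.
    """
    counts = np.zeros((2, height, width), dtype=np.float32)
    x = events[:, 0].astype(int)
    y = events[:, 1].astype(int)
    channel = (events[:, 3] < 0).astype(int)  # 0 = ON events, 1 = OFF events
    np.add.at(counts, (channel, y, x), 1.0)   # unbuffered add: repeated pixels accumulate
    return counts


def fuse_inputs(gray_frame, event_counts):
    """Stack a (H, W) grayscale frame with (2, H, W) event counts along the channel axis."""
    frame = gray_frame[None].astype(np.float32)
    return np.concatenate([frame, event_counts], axis=0)


example_events = np.array([
    [1, 0, 0.001, +1],
    [1, 0, 0.002, +1],
    [2, 1, 0.003, -1],
])
example_counts = events_to_count_image(example_events, height=2, width=3)
example_fused = fuse_inputs(np.zeros((2, 3)), example_counts)
print(example_fused.shape)
```

Collapsing events into counts discards their fine timing, so this representation trades away part of the temporal resolution that makes event data attractive in the first place.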

#### Adaptive-FlowNet: Fully Spiking Architecture

SNN



Model with all spiking layers

- Directly compatible with event inputs
- Capture temporal information
- Combat vanishing gradients with an adaptive spiking neuron model



C. Lee, A. Kosta, and K. Roy, Fusion-FlowNet:..., ICRA 2022



**Full-fledged ANN** 

**Fusion-FlowNet** 

Alex Zihao Zhu, Dinesh Thakur, Tolga Ozaslan, Bernd Pfrommer, Vijay Kumar and Kostas Daniilidis. "The multivehicle stereo event camera dataset: An event camera dataset for 3D perception." IEEE Robotics and Automation Letters 3.3 (2018): 2032-2039.

Chankyu Lee, Adarsh Kosta and Kaushik Roy. "Spike-FlowNet: event-based optical flow estimation with energy-efficient hybrid neural networks." *European Conference on Computer Vision*, Springer, Cham, 2020.

Spike-FlowNet

#### Tradeoffs



**51x** lower compute energy at similar error to the Base-ANN

[Figure: tradeoff charts for FireFlow-SNN and Micro-SNN. Data labels: 1.05, 0.48, 0.95, 1031.1, 23.4, 51x, 26x; 0.93, 0.27, 0.057, 14x, 48x, 142x]

Kosta, Roy, ICRA 2023




#### HALSIE: Hybrid Approach to Learning Segmentation by Simultaneously Exploiting Image and Event Modalities










# **DOTIE: Detecting Objects through Temporal Isolation of Events**



[Figure: membrane potential versus time (t), with the firing threshold and the resulting output]



Our single layer network can isolate events corresponding to moving objects and detect objects accurately, with low latency and energy consumption
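One way to picture temporal isolation is a layer of leaky integrate-and-fire neurons, one per pixel: events that arrive in quick succession push the membrane potential over the threshold, while sparse events leak away. The sketch below is a simplified illustration of that idea and not the authors' implementation; the parameters `tau`, `weight`, `threshold` and `radius` are hypothetical values chosen for the example.

```python
import numpy as np


def isolate_events(events, height, width, tau=0.005, weight=1.0,
                   threshold=2.0, radius=1):
    """Keep only events whose pixel neuron spikes.

    `events` is a time-sorted list of (x, y, t) tuples with t in seconds.
    """
    v = np.zeros((height, width))       # membrane potential per pixel neuron
    t_last = np.zeros((height, width))  # time of the last update per neuron
    kept = []
    for x, y, t in events:
        x, y = int(x), int(y)
        y0, y1 = max(0, y - radius), min(height, y + radius + 1)
        x0, x1 = max(0, x - radius), min(width, x + radius + 1)
        # Lazily apply the exponential leak to the neurons this event reaches.
        v[y0:y1, x0:x1] *= np.exp(-(t - t_last[y0:y1, x0:x1]) / tau)
        t_last[y0:y1, x0:x1] = t
        v[y0:y1, x0:x1] += weight
        if v[y, x] >= threshold:
            kept.append((x, y, t))      # the event passes the temporal filter
            v[y, x] = 0.0               # reset the neuron after it spikes
    return kept


fast = [(5, 5, 0.000), (5, 5, 0.001), (5, 5, 0.002), (5, 5, 0.003)]
slow = [(1, 1, 0.0), (1, 1, 0.1), (1, 1, 0.2), (1, 1, 0.3)]
stream = sorted(fast + slow, key=lambda e: e[2])
kept = isolate_events(stream, height=8, width=8)
print(kept)
```

In this toy stream only the densely timed events at pixel (5, 5) produce a spike; the events at (1, 1), spaced 100 ms apart, never accumulate enough potential. Grouping the surviving events into bounding boxes would be a separate step.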

**Events do not contain photometric characteristics** such as light intensity and texture

Redmon, Joseph, and Ali Farhadi. "Yolov3: An incremental improvement." arXiv preprint arXiv:1804.02767 (2018).

Nagaraj, Liyanagedera, Roy, "DOTIE: ...", ICRA 2023

# Demonstration of detecting speeds of objects using the DOTIE Algorithm






[Figure: objects operating at 3 speeds]

Nagaraj, Liyanagedera, Roy, "DOTIE: ...", ICRA 2023



## **EV-Catcher: Taking Inspiration from Nature**

<u>Aim</u>: To minimize *latency* in *reactive* behaviors such as:

- Interception
- Time to collision



Kostas Daniilidis, UPenn




#### Hardware Architecture

- Circuits and architectures that can efficiently implement the algorithms (SNNs and ANNs): need for hybrid systems
- Near-/In-Memory Computing for MVMs
- Approximate and stochastic hardware
- Neuromorphic devices and interconnects



#### Hardware Architecture: CiM



**Efficient MVM** 


Spatially Distributed Cores

## **Hardware Implementations**



## Hardware Architecture: Jetson TX-2





**NVIDIA Jetson TX-2** 

#### **Example optical flow prediction**



Gray image



Spike image



Flow prediction

#### Performance estimation of optical flow networks






#### Energy Efficient DNN: Adaptive-SNR Sparsity-Aware CiM Core with Load Balancing Support



- Hierarchical Microarchitecture with Sparsity-aware Bit-Serial Compute Units and reconfigurable ADC
- Row Gating based on SNR requirements of DNN workloads
- On-chip row and column re-arrangement hardware support for load balancing
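To make the bit-serial, sparsity-aware idea concrete, the sketch below computes a matrix-vector product by streaming the input one bit-plane at a time and skipping bit-planes that are entirely zero. This is a software analogy under stated assumptions (unsigned inputs, skipping at bit-plane granularity); the slides do not describe how the chip detects or exploits sparsity, and `bit_serial_mvm` is a hypothetical name.

```python
import numpy as np


def bit_serial_mvm(weights, inputs, input_bits=4):
    """Compute weights @ inputs by streaming unsigned inputs bit by bit.

    weights: (n_out, n_in) integer array; inputs: (n_in,) non-negative integers
    smaller than 2**input_bits. Returns the result and the number of skipped cycles.
    """
    acc = np.zeros(weights.shape[0], dtype=np.int64)
    skipped = 0
    for b in range(input_bits):
        bit_plane = (inputs >> b) & 1      # one input bit per array row
        if not bit_plane.any():            # sparsity-aware: nothing to accumulate
            skipped += 1
            continue
        acc += (weights @ bit_plane) << b  # shift-and-add the partial sum
    return acc, skipped


rng = np.random.default_rng(0)
W = rng.integers(-8, 8, size=(3, 5))
x = np.array([0, 2, 0, 6, 0])
result, skipped_cycles = bit_serial_mvm(W, x, input_bits=4)
print(result, skipped_cycles)
```

For this input only bit positions 1 and 2 carry any ones, so two of the four cycles are skipped while the result still equals the full-precision product `W @ x`.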


M. Ali, "A 35.5-127.2 TOPS/W dynamic sparsity-aware reconfigurable-precision compute-in-memory SRAM macro for machine learning", IEEE Solid-State Circuits Letters, 2021

#### **Chip Results & Summary**



| Chip Summary                             | Value                      |
|------------------------------------------|----------------------------|
| Technology (nm)                          | 65                         |
| Voltage (V)                              | 1.2                        |
| Frequency (MHz)                          | 100.806                    |
| Input/Weight Precision                   | 4b/4b, 4b/8b, 8b/4b, 8b/8b |
| Output Precision                         | 18b/22b                    |
| Total Area (mm<sup>2</sup>)              | 7                          |
| CiM Macro Area (mm<sup>2</sup>)          | 0.036                      |
| Total Digital SRAM (KB)                  | 90.2                       |
| CiM SRAM (KB)                            | 16                         |
| Performance (1b/1b operation)            | 117-552 GOPs               |
| CIFAR-10 Accuracy (ResNet-20)            | 91.8%                      |
| Chip Energy Efficiency (8b/8b operation) | 1.4-6.7 TOPs/W             |








#### 65 nm Spiking Neural Network (SNN) Accelerator based on In-Memory Processing: Suitable for DVS Cameras

- Motivation (DVS input...)
- Spiking Neural Networks (SNNs) can perform sequential learning tasks efficiently using spike-based membrane potential (Vmem) accumulation over several timesteps.
- However, the movement of Vmems creates additional memory accesses, making data transfer a bottleneck.
- Additionally, the sparsity in binary spike inputs can be leveraged for efficiency.
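The points above can be illustrated with a small software model of one spiking layer: the membrane potential is carried across timesteps, and only weight rows whose input actually spiked are read. The leak factor, the hard reset, and the `row_reads` counter are assumptions for illustration and do not describe the accelerator's circuitry.

```python
import numpy as np


def run_snn_layer(spikes, weights, threshold=1.0, leak=0.9):
    """Simulate one layer of leaky integrate-and-fire neurons.

    spikes: (T, n_in) binary array; weights: (n_in, n_out) array.
    Returns the (T, n_out) output spikes and the number of weight rows read.
    """
    n_steps, _ = spikes.shape
    n_out = weights.shape[1]
    v_mem = np.zeros(n_out)                  # state that persists across timesteps
    out = np.zeros((n_steps, n_out), dtype=np.uint8)
    row_reads = 0
    for t in range(n_steps):
        active = np.flatnonzero(spikes[t])   # inputs that spiked at this timestep
        row_reads += active.size             # silent inputs cost no weight access
        v_mem = leak * v_mem + weights[active].sum(axis=0)
        fired = v_mem >= threshold
        out[t] = fired
        v_mem[fired] = 0.0                   # hard reset after an output spike
    return out, row_reads


in_spikes = np.array([[1, 0, 0],
                      [0, 0, 0],
                      [1, 1, 0]])
layer_weights = np.array([[0.6], [0.5], [9.0]])
out_spikes, reads = run_snn_layer(in_spikes, layer_weights)
print(out_spikes[:, 0], reads)
```

Here only 3 of the 9 possible weight-row reads occur, but `v_mem` is still read and written at every timestep, which is the overhead that fusing weights and membrane potential in the same memory aims to reduce.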



A. Agrawal et al., "IMPULSE: A 65nm Digital Compute-in-Memory Macro with Fused Weights and Membrane Potential for Spike-based Sequential Learning Tasks", IEEE Solid-State Circuits Letters, 2021





# Energy and latency of the IMC architectures



Both IMC architectures (IMPULSE and ADC-Less) consume less energy than the Jetson platform. In addition, the ADC-Less architecture improves latency, enabling real-time inference.

#### ADC-Less IMC energy and latency analysis: Spike-FlowNet



**36-76x lower energy consumption** and **7.8-12x faster** than the Jetson platform.

**1.9-2.5x lower energy consumption** and **8.9-12.6x faster** than a conventional HP-ADC IMC.

#### Hardware-aware training for the ADC-Less IMC

Optical flow predictions of the Fully-Spiking FlowNet (FS-FN) under hardware-aware training.





Full precision

ADC-Less training
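A hedged sketch of what the forward pass might look like when no ADC is available: the weight matrix is split across tiles, and each tile's analog partial sum is reduced to a single bit by a comparator before accumulation. The sign-based 1-bit readout and the tile size are assumptions for illustration; the slides do not state the exact quantization used. In hardware-aware training, a forward pass of this kind would typically be paired with a surrogate gradient such as a straight-through estimator, which is not shown here.

```python
import numpy as np


def adc_less_mvm(weights, inputs, rows_per_array=4):
    """Matrix-vector product with 1-bit readout of each tile's partial sum.

    weights: (n_out, n_in) array; inputs: (n_in,) array.
    """
    n_out, n_in = weights.shape
    total = np.zeros(n_out)
    for start in range(0, n_in, rows_per_array):
        tile = slice(start, start + rows_per_array)
        partial = weights[:, tile] @ inputs[tile]   # column sums of one tile
        total += np.where(partial >= 0, 1.0, -1.0)  # comparator in place of an ADC
    return total


W_demo = np.array([[1.0, -1.0, 2.0, 0.5, -3.0, 1.0, 0.0, 0.0]])
x_demo = np.ones(8)
full_precision = W_demo @ x_demo
quantized = adc_less_mvm(W_demo, x_demo, rows_per_array=4)
print(full_precision, quantized)
```

The quantized output differs from the full-precision one, which is why the network is trained with the hardware behavior in the loop rather than quantized after training.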

#### Performance on MVSEC dataset (dt=1) [AEE lower is better]

| Model         | IN1 – AEE   | IN2 – AEE   | IN3 – AEE   | Training          |
|---------------|-------------|-------------|-------------|-------------------|
| FSFN          | <u>0.82</u> | <u>1.21</u> | <u>1.07</u> | Full-precision    |
| FSFN          | 0.88        | 1.39        | 1.18        | ADC-Less training |
| Spike-FlowNet | 0.84        | 1.28        | 1.11        | Full-precision    |
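For reference, the average endpoint error (AEE) is the mean Euclidean distance between predicted and ground-truth flow vectors. The sketch below assumes flow stored as `(2, H, W)` arrays and an optional boolean mask selecting the pixels to evaluate; the array layout and the mask are assumptions and may differ from the evaluation protocol behind the table.

```python
import numpy as np


def average_endpoint_error(pred_flow, gt_flow, mask=None):
    """Mean Euclidean distance between predicted and ground-truth flow.

    pred_flow, gt_flow: (2, H, W) arrays holding the (u, v) components in pixels.
    mask: optional (H, W) boolean array of pixels to include.
    """
    endpoint_error = np.linalg.norm(pred_flow - gt_flow, axis=0)  # (H, W)
    if mask is not None:
        endpoint_error = endpoint_error[mask]
    return float(endpoint_error.mean())


gt = np.zeros((2, 2, 2))
pred = np.stack([np.full((2, 2), 3.0), np.full((2, 2), 4.0)])
aee_all = average_endpoint_error(pred, gt)
print(aee_all)
```

With a constant error vector of (3, 4) pixels at every location, the endpoint error is 5 pixels everywhere, so the average is 5 as well.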




### **Results of Implementing DOTIE on Loihi**

Intersection over union (IoU) of ~81% for object-detection bounding boxes on a single-car events dataset
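As a reminder of how the metric is computed, the sketch below evaluates IoU for two axis-aligned boxes. The `(x_min, y_min, x_max, y_max)` box format is an assumption for illustration.

```python
def box_iou(box_a, box_b):
    """Intersection over union of two boxes given as (x_min, y_min, x_max, y_max)."""
    ix0, iy0 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix1, iy1 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix1 - ix0) * max(0.0, iy1 - iy0)  # zero when boxes do not overlap
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


print(box_iou((0, 0, 2, 2), (1, 1, 3, 3)))
```

An IoU of 1 means the two boxes coincide exactly, and 0 means they do not overlap at all.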



## **Key Takeaways – Sensors and Algorithms**



Frame

Event

Fusion

**Sensor fusion** of Frame and Event data exploits their complementary benefits, improving overall performance

Hybrid SNN-ANN models naturally handle event data while preserving performance benefits and ease of training of ANNs

**Fully-Spiking Architectures better capture** timing information and lead to lightweight models suitable for the edge

These techniques improve on the current state of the art, in both accuracy and efficiency, on several tasks for vision-based autonomous navigation


## **Key Takeaways – Efficient Hardware Platforms**





Adapting to SNR and exploiting sparsity in a workload can significantly improve the overall performance of IMC architectures.

**Specialized hardware accelerators** for Spiking Neural Networks focused on reducing the membrane potential overhead can **give better performance and energy benefits**.



Hardware/Software co-design approaches can lead to energy-efficient implementations based on co-optimization processes.

Hardware architectures and design techniques enable the deployment of energy-efficient vision-based autonomous navigation.