A measured comparison of FPGA MAC array variants showing that compute sharing is only worth its fixed cost under genuine resource pressure.
Projects
A fixed-function FPGA CNN inference accelerator implemented in SystemVerilog. The project explores how architectural decisions around data layout, precision, and scheduling affect performance and flexibility.
Compute latency (frame loaded → TX start): ~466k cycles (≈ 4.7 ms @ 100 MHz)
End-to-end latency (UART RX + compute + UART TX): ~7.28M cycles (≈ 73 ms @ 100 MHz, with UART at 115,200 baud)
MNIST accuracy: ~92.2% (float baseline: ~94.3%, 1 epoch)
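The latency figures above can be cross-checked with a little arithmetic. The sketch below is only a sanity check, not the project's own tooling. It takes the 100 MHz clock, the 115,200 baud rate, and the ~466k compute cycles from the figures, and it assumes three things the source does not state: a 784-byte input frame (one 28×28 MNIST image at one byte per pixel), a one-byte result, and 8N1 UART framing (10 bits on the wire per byte).

```python
CLOCK_HZ = 100_000_000      # system clock, from the figures above
BAUD = 115_200              # UART baud rate, from the figures above
COMPUTE_CYCLES = 466_000    # approximate compute latency, from the figures above

# Assumptions for illustration only; none of these are stated in the source.
FRAME_BYTES = 28 * 28       # one MNIST image at one byte per pixel
RESULT_BYTES = 1            # a single predicted-class byte
BITS_PER_BYTE = 10          # 8N1 framing: start bit + 8 data bits + stop bit


def cycles_to_ms(cycles, clock_hz=CLOCK_HZ):
    """Convert a cycle count to milliseconds at the given clock frequency."""
    return cycles / clock_hz * 1e3


def uart_cycles(num_bytes, baud=BAUD, clock_hz=CLOCK_HZ):
    """Clock cycles needed to move num_bytes over the UART back to back."""
    return num_bytes * BITS_PER_BYTE * clock_hz / baud


rx_cycles = uart_cycles(FRAME_BYTES)
tx_cycles = uart_cycles(RESULT_BYTES)
total_cycles = rx_cycles + COMPUTE_CYCLES + tx_cycles

print(f"UART RX : {rx_cycles:,.0f} cycles ({cycles_to_ms(rx_cycles):.2f} ms)")
print(f"compute : {COMPUTE_CYCLES:,.0f} cycles ({cycles_to_ms(COMPUTE_CYCLES):.2f} ms)")
print(f"UART TX : {tx_cycles:,.0f} cycles ({cycles_to_ms(tx_cycles):.2f} ms)")
print(f"total   : {total_cycles:,.0f} cycles ({cycles_to_ms(total_cycles):.2f} ms)")
print(f"RX share of total: {rx_cycles / total_cycles:.0%}")
```

Under these assumptions the total lands close to the reported ~7.28M cycles, and receiving the frame over UART accounts for most of the end-to-end time. In this setup, the serial link is the bottleneck rather than the accelerator.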
This project implements a low-latency market data ingestion pipeline in SystemVerilog. I built it to understand how protocol handling, backpressure, and control timing shape end-to-end latency in real hardware pipelines.
End-to-end latency (ingress → decision): ~38 cycles (≈ 152 ns @ 250 MHz, simulation)
Clock target: 250 MHz (closed with positive slack in Vivado)
Verification: cycle-accurate (Verilator + Python reference models)
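To illustrate what a cycle-accurate check against a Python reference model can look like, here is a minimal sketch. It is not the project's actual test harness. The names `reference_model` and `first_mismatch` are hypothetical, and the model assumes a fixed ingress-to-decision latency of 38 cycles with no backpressure stalls, which is a simplification of any real pipeline.

```python
CLOCK_HZ = 250_000_000   # clock target, from the figures above
LATENCY_CYCLES = 38      # approximate ingress-to-decision latency, from the figures above


def cycles_to_ns(cycles, clock_hz=CLOCK_HZ):
    """Convert a cycle count to nanoseconds at the given clock frequency."""
    return cycles * 1e9 / clock_hz


def reference_model(ingress_cycles, latency=LATENCY_CYCLES):
    """Hypothetical reference: every ingress event produces a decision
    exactly `latency` cycles later (assumes no stalls)."""
    return [cycle + latency for cycle in ingress_cycles]


def first_mismatch(dut_cycles, ref_cycles):
    """Return the index of the first entry where the traces differ,
    or None if they agree cycle for cycle."""
    for i, (dut, ref) in enumerate(zip(dut_cycles, ref_cycles)):
        if dut != ref:
            return i
    if len(dut_cycles) != len(ref_cycles):
        # One trace ended early: the mismatch is at the first missing entry.
        return min(len(dut_cycles), len(ref_cycles))
    return None


# Made-up traces for illustration: cycle numbers at which events were observed.
ingress = [0, 10, 25]
expected = reference_model(ingress)
simulated = [38, 48, 64]  # pretend simulator output; last decision is one cycle late

print(f"{LATENCY_CYCLES} cycles = {cycles_to_ns(LATENCY_CYCLES):.0f} ns")
print("first mismatch at index:", first_mismatch(simulated, expected))
```

Comparing exact cycle numbers, rather than only output values, is what lets a check like this catch a decision that is correct but arrives one cycle late.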