Comparison between Systolic Array and Vector Multiplier based on FPGA Implementation 🛠️

This project pits systolic arrays against vector multipliers for matrix multiplication on Xilinx’s xcu250 FPGA. Let’s break it down! 🧠


Introduction 🎯

With AI accelerators booming, efficient matrix multiplication units (MXUs) are critical. We compare:

  • Systolic Arrays: Inspired by Google’s TPUv1
  • Vector Multipliers: GPU-style adder-tree designs

Both designs are tested on matrix sizes from 8×8 to 64×64 and compared on timing, resource use, and scalability. 📈


Architectural Highlights 🏗️

  • Systolic Array: A grid of processing elements (PEs) with built-in pipelining. Weights stay in place while data flows through the grid: simple yet powerful! 🔄
  • Vector Multiplier: Direct vector ops with an adder tree. Needs explicit pipelining for big matrices. 🧮
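
To make the two dataflows concrete, here is a minimal behavioral sketch in Python. It is not the project's RTL. It assumes a weight-stationary array in which activations move one PE to the right and partial sums move one PE down on every cycle, and a vector multiplier that reduces element-wise products with a binary adder tree. The names `systolic_matmul`, `adder_tree_dot`, and `vector_matmul` are illustrative only.

```python
def systolic_matmul(A, W):
    """Cycle-level model of a weight-stationary systolic array (assumed dataflow).

    PE (k, n) holds W[k][n]. Row m of A is fed into PE row k at cycle m + k
    (skewed input); partial sums travel down each column.
    Returns (C, cycles) where C = A x W.
    """
    M, K, N = len(A), len(W), len(W[0])
    a_reg = [[0] * N for _ in range(K)]  # activation register in each PE
    p_reg = [[0] * N for _ in range(K)]  # partial-sum register in each PE
    C = [[0] * N for _ in range(M)]
    cycles = M + K + N - 2
    for t in range(cycles):
        new_a = [[0] * N for _ in range(K)]
        new_p = [[0] * N for _ in range(K)]
        for k in range(K):
            for n in range(N):
                if n == 0:
                    m = t - k  # skewed feed at the left edge
                    a_in = A[m][k] if 0 <= m < M else 0
                else:
                    a_in = a_reg[k][n - 1]  # from the left neighbour
                p_in = p_reg[k - 1][n] if k > 0 else 0  # from the PE above
                new_a[k][n] = a_in
                new_p[k][n] = p_in + a_in * W[k][n]
        a_reg, p_reg = new_a, new_p
        # Finished sums leave the bottom row; column n lags column 0 by n cycles.
        for n in range(N):
            m = t - (K - 1) - n
            if 0 <= m < M:
                C[m][n] = p_reg[K - 1][n]
    return C, cycles


def adder_tree_dot(x, w):
    """Dot product as parallel multiplies followed by a binary adder tree."""
    level = [a * b for a, b in zip(x, w)]
    if not level:
        return 0
    while len(level) > 1:
        if len(level) % 2:
            level.append(0)  # pad odd levels with a zero input
        level = [level[i] + level[i + 1] for i in range(0, len(level), 2)]
    return level[0]


def vector_matmul(A, W):
    """Matrix product built from one adder-tree dot product per output element."""
    columns = list(zip(*W))
    return [[adder_tree_dot(row, col) for col in columns] for row in A]
```

Both models compute the same product. Under the assumptions above, the systolic model needs M + K + N - 2 cycles from the first input to the last output, with every PE talking only to its neighbours, whereas the adder tree combines all products of a row in a single combinational reduction unless registers are inserted between levels.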

Implementation Snapshot 🔧

  • FPGA: xcu250 UltraScale+
  • Matrix Sizes: 8×8, 16×16, 32×32, 64×64
  • Vivado synthesis attributes: dont_touch, use_dsp, ram_style=block

Results Roundup 📊


Resource Use

  • Systolic Array: Resource use rises steadily with size, e.g., 4096 DSPs at 64×64.
  • Vector Multiplier (Non-pipelined): Leanest in LUTs, but timing degrades badly at scale.
  • Vector Multiplier (Pipelined): Timing improves, but LUTs soar (e.g., 36166 at 64×64).

LUT Utilization Comparison (number of LUTs; the chart uses a log scale)

  Matrix Size                          N=8    N=16    N=32     N=64
  Systolic Array (LUT)                 202     656    1502     4125
  Vector Multiplier (LUT)               38      87     169      325
  Vector Multiplier Pipelined (LUT)     38      87    9129    36166
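
The scaling trend can be made explicit by computing how much LUT use grows each time N doubles. The following is a small sketch using the LUT values reported above; the helper name `growth_factors` is illustrative.

```python
# LUT counts for N = 8, 16, 32, 64, copied from the results above.
LUTS = {
    "Systolic Array": [202, 656, 1502, 4125],
    "Vector Multiplier": [38, 87, 169, 325],
    "Vector Multiplier Pipelined": [38, 87, 9129, 36166],
}


def growth_factors(values):
    """Ratio between consecutive entries, i.e. growth per doubling of N."""
    return [round(later / earlier, 2) for earlier, later in zip(values, values[1:])]


for design, counts in LUTS.items():
    print(f"{design}: {growth_factors(counts)}")
```

On this data the non-pipelined vector multiplier roughly doubles its LUT count per doubling of N, while the pipelined variant jumps by about two orders of magnitude between N=16 and N=32.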



Timing (Max Delay)

  • Systolic Array: Stays between roughly 4.4 ns and 6.4 ns across all sizes. ⏱️
  • Vector Multiplier (Non-pipelined): Delay balloons to 54.936 ns at 64×64. 😬
  • Vector Multiplier (Pipelined): Better at 22.722 ns at 64×64, but still lags the systolic array.

Maximum Path Delay Comparison (ns)

  Matrix Size                                  N=8     N=16     N=32     N=64
  Systolic Array                             4.995    4.904    4.443    6.433
  Vector Multiplier (non-pipelined)          9.383   13.413   24.552   54.936
  Vector Multiplier (pipelined)               6.16    8.986    8.509   22.722
  Vector Multiplier (pipelined + DSP flag)    6.16    8.986     9.44   25.555
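
A path delay is easier to compare once converted to an achievable clock frequency. The sketch below assumes the reported maximum path delay is the critical path that bounds the clock period, and ignores clock skew and uncertainty margins; `fmax_mhz` is an illustrative helper, not a Vivado report field.

```python
# Maximum path delay in ns at N = 64, copied from the results above.
DELAY_NS_AT_64 = {
    "Systolic Array": 6.433,
    "Vector Multiplier (non-pipelined)": 54.936,
    "Vector Multiplier (pipelined)": 22.722,
    "Vector Multiplier (pipelined + DSP flag)": 25.555,
}


def fmax_mhz(delay_ns):
    """Upper bound on clock frequency in MHz for a critical path of delay_ns."""
    if delay_ns <= 0:
        raise ValueError("delay must be positive")
    return 1000.0 / delay_ns  # 1 / (delay in ns) gives GHz; times 1000 gives MHz


for design, delay in DELAY_NS_AT_64.items():
    print(f"{design}: {fmax_mhz(delay):.1f} MHz")
```

Under this simplification, the 64×64 systolic array could run at roughly 155 MHz, the pipelined vector multiplier at roughly 44 MHz, and the non-pipelined one at roughly 18 MHz.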


Key Takeaways 🔍

  • Scalability: Systolic arrays win on timing, holding delay near 5 ns while resource use grows predictably with matrix size. 🌟
  • Trade-offs: Vector multipliers need heavy pipelining to keep delay in check, which spikes LUT use at larger sizes.
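
As a rough illustration of why pipelining an adder tree costs resources, the sketch below counts pipeline stages and register bits for a single tree. Everything in it is an assumption made for illustration: 8-bit operands, full-precision products, one extra bit of width per adder level, and a register after every `levels_per_stage` levels. It does not model the LUT figures reported in this project.

```python
def adder_tree_pipeline(n_inputs, operand_bits=8, levels_per_stage=1):
    """Return (stages, register_bits) for one pipelined binary adder tree.

    Assumptions (hypothetical, for illustration): products are
    2 * operand_bits wide, each adder level adds one bit, and a register is
    placed after every `levels_per_stage` levels and after the final level.
    """
    if n_inputs < 1 or levels_per_stage < 1:
        raise ValueError("n_inputs and levels_per_stage must be >= 1")
    levels = (n_inputs - 1).bit_length()  # ceil(log2(n_inputs)) adder levels
    stages = 0
    register_bits = 0
    for level in range(1, levels + 1):
        if level % levels_per_stage == 0 or level == levels:
            values = -(-n_inputs // (1 << level))  # ceil(n_inputs / 2**level)
            width = 2 * operand_bits + level
            register_bits += values * width
            stages += 1
    return stages, register_bits


for n in (8, 16, 32, 64):
    print(n, adder_tree_pipeline(n))
```

A 64-input tree has six adder levels, so registering every level adds six cycles of latency per dot product, and the register count grows with both the number of inputs and the number of trees instantiated in parallel.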

Conclusion 🏁

  • Systolic Arrays: Champs for large-scale matrix ops. 💪
  • Vector Multipliers: Fit for smaller, parallel setups (think GPUs).

In a nutshell: Systolic arrays rule for big matrix crunching; vector multipliers shine in compact roles! 🚀

Summarized using Perplexity (Claude 3.7 Sonnet) · Retouched by Duhyeon Kim
