Matrix Multiplication

Libraries Optimized for GPU Computing

To perform General Matrix Multiplication (GEMM) and Sparse Matrix-Matrix Multiplication (SpMM) on your NVIDIA RTX 3070 Ti, you can use several libraries optimized for GPU computing. Here are some of the most popular ones:

1. cuBLAS (CUDA Basic Linear Algebra Subroutines)

  • Purpose: Optimized for dense linear algebra operations, including GEMM.
  • Performance Metrics: You can measure FLOPS and latency using CUDA events or profiling tools such as Nsight Compute (nvprof does not support Ampere GPUs like the RTX 3070 Ti).
  • Usage:
    cublasSgemm(handle, transa, transb, m, n, k, &alpha, d_A, lda, d_B, ldb, &beta, d_C, ldc);
  • Documentation: cuBLAS Documentation

2. cuSPARSE (CUDA Sparse Matrix Library)

  • Purpose: Optimized for sparse matrix operations, including SpMM.
  • Performance Metrics: Similar to cuBLAS, you can use CUDA events or profiling tools.
  • Usage (a full setup sketch follows this list):
    cusparseSpMM(handle, opA, opB, alpha, matA, matB, beta, matC, computeType, alg, externalBuffer);
  • Documentation: cuSPARSE Documentation
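
The generic cuSPARSE SpMM API is descriptor-based: you wrap raw device pointers in matrix descriptors, query a scratch-buffer size, then run the multiply. A minimal sketch, assuming a single-precision m × k CSR matrix A with nnz nonzeros and column-major dense matrices B (k × n) and C (m × n) already resident on the device; the function name and pointer parameters are placeholders:

#include <cuda_runtime.h>
#include <cusparse.h>

void spmm_sketch(cusparseHandle_t handle, int m, int n, int k, int nnz,
                 int *dA_rowOffsets, int *dA_cols, float *dA_vals,
                 float *d_B, float *d_C) {
    float alpha = 1.0f, beta = 0.0f;

    // Wrap the raw device pointers in generic-API descriptors
    cusparseSpMatDescr_t matA;
    cusparseCreateCsr(&matA, m, k, nnz, dA_rowOffsets, dA_cols, dA_vals,
                      CUSPARSE_INDEX_32I, CUSPARSE_INDEX_32I,
                      CUSPARSE_INDEX_BASE_ZERO, CUDA_R_32F);
    cusparseDnMatDescr_t matB, matC;
    cusparseCreateDnMat(&matB, k, n, k, d_B, CUDA_R_32F, CUSPARSE_ORDER_COL);
    cusparseCreateDnMat(&matC, m, n, m, d_C, CUDA_R_32F, CUSPARSE_ORDER_COL);

    // Query and allocate the scratch buffer SpMM needs
    size_t bufferSize = 0;
    cusparseSpMM_bufferSize(handle, CUSPARSE_OPERATION_NON_TRANSPOSE,
                            CUSPARSE_OPERATION_NON_TRANSPOSE,
                            &alpha, matA, matB, &beta, matC,
                            CUDA_R_32F, CUSPARSE_SPMM_ALG_DEFAULT, &bufferSize);
    void *dBuffer = nullptr;
    cudaMalloc(&dBuffer, bufferSize);

    // C = alpha * A * B + beta * C
    cusparseSpMM(handle, CUSPARSE_OPERATION_NON_TRANSPOSE,
                 CUSPARSE_OPERATION_NON_TRANSPOSE,
                 &alpha, matA, matB, &beta, matC,
                 CUDA_R_32F, CUSPARSE_SPMM_ALG_DEFAULT, dBuffer);

    // Clean up
    cudaFree(dBuffer);
    cusparseDestroySpMat(matA);
    cusparseDestroyDnMat(matB);
    cusparseDestroyDnMat(matC);
}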

3. CUTLASS (CUDA Templates for Linear Algebra Subroutines and Solvers)

  • Purpose: A CUDA C++ template library that provides highly optimized, composable GEMM implementations; recent versions also cover structured-sparse GEMM.
  • Performance Metrics: You can measure performance using CUDA events or profiling tools.
  • Usage: CUTLASS offers a more flexible and customizable interface than cuBLAS and cuSPARSE; see the sketch after this list.
  • Documentation: CUTLASS Documentation
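
As an illustration of the device-level API, here is a minimal sketch of a single-precision, column-major GEMM using cutlass::gemm::device::Gemm with its default tile configuration (CUTLASS 2.x style; the wrapper function and its parameters are hypothetical):

#include "cutlass/gemm/device/gemm.h"

// Element types and layouts for A, B, and C; everything else defaulted
using Gemm = cutlass::gemm::device::Gemm<
    float, cutlass::layout::ColumnMajor,
    float, cutlass::layout::ColumnMajor,
    float, cutlass::layout::ColumnMajor>;

cutlass::Status run_gemm(int M, int N, int K, float alpha,
                         const float *d_A, int lda,
                         const float *d_B, int ldb,
                         float beta, float *d_C, int ldc) {
    Gemm gemm_op;
    Gemm::Arguments args({M, N, K},      // problem size
                         {d_A, lda},     // A
                         {d_B, ldb},     // B
                         {d_C, ldc},     // C (source, scaled by beta)
                         {d_C, ldc},     // D (destination)
                         {alpha, beta}); // epilogue scalars
    return gemm_op(args);                // launches the kernel
}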

4. TensorRT

  • Purpose: Optimized for deep learning inference, but can also be used for GEMM operations.
  • Performance Metrics: TensorRT provides built-in profiling tools to measure latency and throughput.
  • Usage: TensorRT is more specialized for deep learning models but can be used for general matrix operations.
  • Documentation: TensorRT Documentation

5. Thrust

  • Purpose: A parallel algorithms library similar to the C++ Standard Template Library (STL), but for CUDA.
  • Performance Metrics: You can measure performance using CUDA events or profiling tools.
  • Usage: Thrust is more general-purpose, but its primitives can serve as building blocks for matrix operations; see the sketch after this list.
  • Documentation: Thrust Documentation
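
Thrust has no GEMM routine of its own, but its primitives map onto the pieces of one. A small sketch of the basic building block, the dot product of a row of A with a column of B, using thrust::inner_product:

#include <thrust/device_vector.h>
#include <thrust/inner_product.h>
#include <iostream>

int main() {
    // One row of A and one column of B, already gathered into
    // contiguous device vectors (values here are placeholders)
    thrust::device_vector<float> row(1024, 1.0f);
    thrust::device_vector<float> col(1024, 2.0f);

    // Dot product on the GPU: sum of elementwise products
    float dot = thrust::inner_product(row.begin(), row.end(),
                                      col.begin(), 0.0f);
    std::cout << "dot = " << dot << std::endl; // prints 2048
    return 0;
}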

6. Eigen (with CUDA support)

  • Purpose: A C++ template library for linear algebra; its types can also be used inside CUDA kernels, though GPU support is limited.
  • Performance Metrics: You can measure performance using CUDA events or profiling tools.
  • Usage: Eigen provides a high-level interface for matrix operations; see the sketch after this list.
  • Documentation: Eigen Documentation
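
For example, a host-side GEMM in Eigen is a single expression. Note that Eigen's CUDA support means its types can be used inside kernels; this product itself is not offloaded to the GPU:

#include <Eigen/Dense>
#include <iostream>

int main() {
    // Random 5x6 and 6x7 matrices, multiplied on the CPU
    Eigen::MatrixXf A = Eigen::MatrixXf::Random(5, 6);
    Eigen::MatrixXf B = Eigen::MatrixXf::Random(6, 7);
    Eigen::MatrixXf C = A * B;
    std::cout << C << std::endl;
    return 0;
}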

Measuring Performance Metrics

  • FLOPS: Divide the total number of floating-point operations by the execution time; a GEMM of size m × n × k performs 2·m·n·k operations (see the snippet after the latency example below).
  • Latency: Measure the time taken for the operation using CUDA events (cudaEventRecord, cudaEventSynchronize, cudaEventElapsedTime).
  • Throughput: Measure the amount of data processed per unit time.

Example Code for Measuring Latency

cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start);
// Perform GEMM or SpMM operation here
cudaEventRecord(stop);
cudaEventSynchronize(stop);

float milliseconds = 0;
cudaEventElapsedTime(&milliseconds, start, stop);

printf("Latency: %f ms\n", milliseconds);

cudaEventDestroy(start);
cudaEventDestroy(stop);
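
To turn the measured latency into a throughput figure: an m × n × k SGEMM performs 2·m·n·k floating-point operations (one multiply and one add per k-step per output element). A minimal continuation of the example above, assuming m, n, and k hold the dimensions of the timed call and that a warm-up run preceded the timed one:

// Convert the measured time into achieved throughput
double seconds = milliseconds / 1000.0;
double ops = 2.0 * m * n * k;      // total floating-point operations
double gflops = ops / seconds / 1e9;
printf("Achieved: %.2f GFLOP/s\n", gflops);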

Profiling Tools

  • nvprof: Legacy command-line profiler for CUDA applications; it does not support Ampere GPUs such as the RTX 3070 Ti, where the Nsight tools replace it.
  • Nsight Compute: Advanced profiler for detailed performance analysis.
  • Nsight Systems: System-wide performance analysis tool.

Conclusion

For GEMM, cuBLAS is the most straightforward choice, while for SpMM, cuSPARSE is highly optimized. If you need more flexibility, consider using CUTLASS. Use CUDA events or profiling tools to measure performance metrics like FLOPS, latency, and throughput.

cuBLAS for GEMM

To use cuBLAS for performing General Matrix Multiplication (GEMM), you can follow these steps:

  1. Initialize cuBLAS: Create a cuBLAS handle using cublasCreate.

  2. Allocate Memory: Allocate memory for the matrices on the host (CPU) and device (GPU).

  3. Transfer Data: Copy the matrices from the host to the device using cudaMemcpy.

  4. Set Parameters: Define the parameters for the GEMM operation, including the dimensions of the matrices, the scaling factors (alpha and beta), and the leading dimensions (lda, ldb, ldc).

  5. Call cuBLAS GEMM: Use the cublasSgemm function to perform the matrix multiplication. The function signature is as follows:

    cublasStatus_t cublasSgemm(cublasHandle_t handle,
    cublasOperation_t transa, cublasOperation_t transb,
    int m, int n, int k,
    const float *alpha,
    const float *A, int lda,
    const float *B, int ldb,
    const float *beta,
    float *C, int ldc);
    • handle: The cuBLAS handle.
    • transa, transb: Specify whether A and B should be transposed (CUBLAS_OP_N, CUBLAS_OP_T, or CUBLAS_OP_C).
    • m, n, k: m is the number of rows of op(A) and C, n is the number of columns of op(B) and C, and k is the shared inner dimension.
    • alpha, beta: Scaling factors in C = alpha * op(A) * op(B) + beta * C.
    • A, B, C: Pointers to the matrices on the device.
    • lda, ldb, ldc: Leading dimensions, i.e., the stride between consecutive columns in cuBLAS's column-major storage.
  6. Retrieve Results: Copy the result matrix C from the device to the host.

  7. Clean Up: Free the allocated memory and destroy the cuBLAS handle.

Here is an example code snippet demonstrating these steps:

#include "cuda_runtime.h"
#include "cublas_v2.h"
#include <iostream>
#include <stdlib.h>

int main() {
    // Define matrix dimensions
    const int A_ROW = 5;
    const int A_COL = 6;
    const int B_ROW = 6;
    const int B_COL = 7;

    // Allocate memory for matrices on the host
    float *h_A = (float *)malloc(sizeof(float) * A_ROW * A_COL);
    float *h_B = (float *)malloc(sizeof(float) * B_ROW * B_COL);
    float *h_C = (float *)malloc(sizeof(float) * A_ROW * B_COL);

    // Initialize matrices with random values (row-major layout)
    for (int i = 0; i < A_ROW * A_COL; i++) h_A[i] = (float)(rand() % 10 + 1);
    for (int i = 0; i < B_ROW * B_COL; i++) h_B[i] = (float)(rand() % 10 + 1);

    // Allocate memory for matrices on the device
    float *d_A, *d_B, *d_C;
    cudaMalloc((void **)&d_A, sizeof(float) * A_ROW * A_COL);
    cudaMalloc((void **)&d_B, sizeof(float) * B_ROW * B_COL);
    cudaMalloc((void **)&d_C, sizeof(float) * A_ROW * B_COL);

    // Copy matrices from host to device
    cudaMemcpy(d_A, h_A, sizeof(float) * A_ROW * A_COL, cudaMemcpyHostToDevice);
    cudaMemcpy(d_B, h_B, sizeof(float) * B_ROW * B_COL, cudaMemcpyHostToDevice);

    // Create cuBLAS handle
    cublasHandle_t handle;
    cublasCreate(&handle);

    // Perform GEMM. cuBLAS assumes column-major storage, but the host
    // arrays are row-major. A row-major matrix is the transpose of the
    // same buffer read column-major, so we compute C^T = B^T * A^T by
    // swapping the operand order; d_C then holds C in row-major order.
    float alpha = 1.0f, beta = 0.0f;
    cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, B_COL, A_ROW, A_COL,
                &alpha, d_B, B_COL, d_A, A_COL, &beta, d_C, B_COL);

    // Copy result matrix from device to host
    cudaMemcpy(h_C, d_C, sizeof(float) * A_ROW * B_COL, cudaMemcpyDeviceToHost);

    // Print result matrix (A_ROW x B_COL, row-major)
    std::cout << "Matrix C (A * B):" << std::endl;
    for (int i = 0; i < A_ROW * B_COL; ++i) {
        std::cout << h_C[i] << " ";
        if ((i + 1) % B_COL == 0) std::cout << std::endl;
    }

    // Free memory
    cudaFree(d_A);
    cudaFree(d_B);
    cudaFree(d_C);
    free(h_A);
    free(h_B);
    free(h_C);

    // Destroy cuBLAS handle
    cublasDestroy(handle);

    return 0;
}

This code performs the matrix multiplication C = A × B using cuBLAS. Note that cuBLAS assumes column-major storage while the host arrays here are filled row-major; the call above reconciles the two by computing C^T = B^T × A^T with swapped operands, so d_C ends up holding C in row-major order.
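
As a sanity check on the layout handling, you can recompute C on the CPU with a naive row-major triple loop and compare it against the cuBLAS result; this snippet belongs before the free() calls and needs <cmath> for std::fabs:

// Naive row-major reference product, compared elementwise against h_C
for (int i = 0; i < A_ROW; ++i) {
    for (int j = 0; j < B_COL; ++j) {
        float ref = 0.0f;
        for (int p = 0; p < A_COL; ++p)
            ref += h_A[i * A_COL + p] * h_B[p * B_COL + j];
        if (std::fabs(h_C[i * B_COL + j] - ref) > 1e-3f)
            std::cout << "Mismatch at (" << i << ", " << j << ")" << std::endl;
    }
}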