CUDA C++ extensions for Python

Introduction

Write a simple CUDA program in C++ (matrix multiplication), compile it into a Python extension that Python can call, and compare the speed of matrix multiplication on the CPU versus the GPU, as well as the speed of a hand-written CUDA kernel versus the cuBLAS library.

Installing CUDA and its basic syntax are not covered here; the material is given mostly as code, but the code is simple and there is no real barrier to entry.

CUDA kernel

A quick overview of the workflow:

  • Write the CUDA kernel in C++, then compile it with nvcc into a dynamic library (a .dll on Windows, a .so on Linux)

  • Use Cython to wrap the C++ CUDA function and expose a Python interface. This is done by writing a .pyx file, defining the extension in setup.py, generating the corresponding .cpp file, and compiling it into a .so that Python can import directly

Now straight to the code.

Header file matmul_wrapper.h:

#ifndef MATMUL_WRAPPER_H
#define MATMUL_WRAPPER_H

#ifdef __cplusplus
extern "C" {
#endif

void matrix_multiply_cuda(const float* h_A, const float* h_B, float* h_C, int M, int N, int K);
#ifdef __cplusplus
}
#endif

#endif // MATMUL_WRAPPER_H

matmul.cu, the CUDA C++ core file:

#include "matmul_wrapper.h"
#include <cuda_runtime.h>
#include <iostream>

#define TILE_DIM 32

__global__ void matmul_shared_mem_kernel(const float* A, const float* B, float* C, int M, int N, int K) {
    __shared__ float sA[TILE_DIM][TILE_DIM];
    __shared__ float sB[TILE_DIM][TILE_DIM];

    int tx = threadIdx.x;
    int ty = threadIdx.y;

    int bx = blockIdx.x;
    int by = blockIdx.y;

    int row = by * TILE_DIM + ty;
    int col = bx * TILE_DIM + tx;

    float C_value = 0.0f;

    // Loop over tiles along the K dimension
    for (int t = 0; t < (K + TILE_DIM - 1) / TILE_DIM; ++t) {
        // Load one tile of A and one tile of B into shared memory,
        // padding with zeros outside the matrix bounds
        if (t * TILE_DIM + tx < K && row < M) {
            sA[ty][tx] = A[row * K + (t * TILE_DIM + tx)];
        } else {
            sA[ty][tx] = 0.0f;
        }

        if (t * TILE_DIM + ty < K && col < N) {
            sB[ty][tx] = B[(t * TILE_DIM + ty) * N + col];
        } else {
            sB[ty][tx] = 0.0f;
        }

        __syncthreads();
        // Accumulate the partial dot product for this tile
        for (int i = 0; i < TILE_DIM; ++i) {
            C_value += sA[ty][i] * sB[i][tx];
        }

        __syncthreads();
    }

    // Write the result back to global memory
    if (row < M && col < N) {
        C[row * N + col] = C_value;
    }
}

void matrix_multiply_cuda(const float* h_A, const float* h_B, float* h_C, int M, int N, int K) {
    float *d_A, *d_B, *d_C;
    size_t size_A = (size_t)M * K * sizeof(float);
    size_t size_B = (size_t)K * N * sizeof(float);
    size_t size_C = (size_t)M * N * sizeof(float);

    cudaError_t err;
    err = cudaMalloc((void**)&d_A, size_A);
    if (err != cudaSuccess) { std::cerr << "Failed to allocate device memory for A: " << cudaGetErrorString(err) << std::endl; return; }

    err = cudaMalloc((void**)&d_B, size_B);
    if (err != cudaSuccess) { std::cerr << "Failed to allocate device memory for B: " << cudaGetErrorString(err) << std::endl; return; }

    err = cudaMalloc((void**)&d_C, size_C);
    if (err != cudaSuccess) { std::cerr << "Failed to allocate device memory for C: " << cudaGetErrorString(err) << std::endl; return; }

    err = cudaMemcpy(d_A, h_A, size_A, cudaMemcpyHostToDevice);
    if (err != cudaSuccess) { std::cerr << "Failed to copy A to device: " << cudaGetErrorString(err) << std::endl; return; }

    err = cudaMemcpy(d_B, h_B, size_B, cudaMemcpyHostToDevice);
    if (err != cudaSuccess) { std::cerr << "Failed to copy B to device: " << cudaGetErrorString(err) << std::endl; return; }

    dim3 threadsPerBlock(TILE_DIM, TILE_DIM);
    dim3 numBlocks((N + TILE_DIM - 1) / TILE_DIM, (M + TILE_DIM - 1) / TILE_DIM);
    matmul_shared_mem_kernel<<<numBlocks, threadsPerBlock>>>(d_A, d_B, d_C, M, N, K);

    err = cudaGetLastError();
    if (err != cudaSuccess) { std::cerr << "Kernel launch failed: " << cudaGetErrorString(err) << std::endl; return; }
    err = cudaDeviceSynchronize();
    if (err != cudaSuccess) { std::cerr << "Device synchronization failed: " << cudaGetErrorString(err) << std::endl; return; }

    err = cudaMemcpy(h_C, d_C, size_C, cudaMemcpyDeviceToHost);
    if (err != cudaSuccess) { std::cerr << "Failed to copy C to host: " << cudaGetErrorString(err) << std::endl; return; }

    cudaFree(d_A);
    cudaFree(d_B);
    cudaFree(d_C);
}

Next, compile it into a shared library:

nvcc -shared -Xcompiler -fPIC matmul.cu -o libmatmul_lib.so

Python can call any dynamic library built from C++ directly, CUDA or not, using only the standard library ctypes, with no wrapping at all. The downside is that it is less flexible: you have to declare the argument types in your Python code. Compiling an extension module with Cython is a bit more convenient to use, since the typing lives in the wrapper instead of the calling code, so Cython is the approach taken here. For reference, this is roughly how the function prototype would be set up with ctypes:

import ctypes
import numpy as np

# First, load the shared library built above
lib_path = "./libmatmul_lib.so"
matmul_lib = ctypes.CDLL(lib_path)

...
def setup_c_func(func_name):
    """Helper that sets up the argument and return types of a C function."""
    try:
        func = getattr(matmul_lib, func_name)
        # C prototype: void func(const float* h_A, const float* h_B, float* h_C, int M, int N, int K)
        func.argtypes = [
            # Plain ctypes.POINTER(ctypes.c_float) would also work for the float* arguments
            np.ctypeslib.ndpointer(dtype=np.float32, ndim=2, flags='C_CONTIGUOUS'), # const float* h_A
            np.ctypeslib.ndpointer(dtype=np.float32, ndim=2, flags='C_CONTIGUOUS'), # const float* h_B
            np.ctypeslib.ndpointer(dtype=np.float32, ndim=2, flags='C_CONTIGUOUS'), # float* h_C
            ctypes.c_int,  # int M
            ctypes.c_int,  # int N
            ctypes.c_int,  # int K
        ]
        # The C function returns void, so restype is None
        func.restype = None
        return func
    except AttributeError:
        print(f"Error: function '{func_name}' was not found in the shared library.")
        return None

# Look up the function and set up its prototype
matrix_multiply_cuda = setup_c_func('matrix_multiply_cuda')

...
# Then call it just like a normal Python function
matrix_multiply_cuda(A, B, C_gpu, M, N, K)

...

Cython wrapping

Write the pyx file (matmul_cython.pyx, matching the name used in setup.py below):

import numpy as np
cimport numpy as np

# Declare the external C functions from the header via a cdef extern block
cdef extern from "matmul_wrapper.h":
    void matrix_multiply_cuda(const float* h_A, const float* h_B, float* h_C, int M, int N, int K)
    # the cuBLAS variant is added to the header in the optimization section below
    void matrix_multiply_cublas(const float* h_A, const float* h_B, float* h_C, int M, int N, int K)

# Make sure the NumPy C-API is initialized
np.import_array()

def matmul_gpu(np.ndarray[np.float32_t, ndim=2, mode="c"] A, 
               np.ndarray[np.float32_t, ndim=2, mode="c"] B):
    """
    使用自定义CUDA Kernel在GPU上执行矩阵乘法 C = A * B。
    """
    # 检查矩阵维度是否匹配
    if A.shape[1] != B.shape[0]:
        msg = "Matrix dimension mismatch for multiplication: A.shape={}, B.shape={}".format((<object>A).shape, (<object>B).shape)
        raise ValueError(msg)

    # 获取矩阵维度
    cdef int M = A.shape[0]
    cdef int K = A.shape[1] 
    cdef int N = B.shape[1]

    # 创建用于存放结果的NumPy数组
    cdef np.ndarray[np.float32_t, ndim=2, mode="c"] C = np.empty((M, N), dtype=np.float32)

    # 调用C++封装函数,将NumPy数组的数据指针传递过去
    # <float*>是类型转换,告诉Cython如何处理指针类型
    matrix_multiply_cuda(<const float*>A.data, <const float*>B.data, <float*>C.data, M, N, K)

    return C

This is wrapped as matmul_gpu for Python to call; its two parameters are the two matrices to multiply.
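
Note that the buffer declarations above (np.float32_t, mode="c") mean matmul_gpu only accepts float32, C-contiguous arrays. Once the extension is built (next section), a call with arbitrary NumPy inputs might look like this sketch:

import numpy as np
import matmul_cython

# np.random.rand returns float64; convert to float32, C-contiguous first
A = np.ascontiguousarray(np.random.rand(256, 512), dtype=np.float32)
B = np.ascontiguousarray(np.random.rand(512, 128), dtype=np.float32)

C = matmul_cython.matmul_gpu(A, B)  # (256, 128) float32 result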

Next, write the setup.py file:

# setup.py
import os
from setuptools import setup, Extension
from Cython.Build import cythonize
import numpy

ext_modules = [
    Extension(
        "matmul_cython",
        sources=["matmul_cython.pyx"],  
        language="c++",
        include_dirs=[
            numpy.get_include(),
            '/usr/local/cuda/include'
        ],
        library_dirs=["."], 
        libraries=["matmul_lib"]
    )
]

setup(
    name="matmul_cython",
    ext_modules=cythonize(ext_modules),
)

Then build the extension:

python setup.py build_ext --inplace

This generates matmul_cython.cpp and, from that file, the compiled shared library matmul_cython.cpython-3xx-x86_64-linux-gnu.so, where 3xx corresponds to your Python version. This shared library is the Linux counterpart of a .pyd file on Windows and can be imported directly. One caveat: at import time the extension still has to locate libmatmul_lib.so; the simplest workaround is to run with LD_LIBRARY_PATH set to the directory containing it.
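
Alternatively (a sketch, not the only option), an rpath can be recorded when defining the Extension so the dynamic loader also searches the extension's own directory for libmatmul_lib.so:

Extension(
    "matmul_cython",
    sources=["matmul_cython.pyx"],
    language="c++",
    include_dirs=[numpy.get_include(), '/usr/local/cuda/include'],
    library_dirs=["."],
    libraries=["matmul_lib"],
    # search the directory of the extension module itself at load time
    extra_link_args=["-Wl,-rpath,$ORIGIN"],
)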

Calling from Python

import numpy as np
import time
import matmul_cython  # import the module we just built

def main():
    # Matrix sizes
    M, K, N = 1024, 1024, 1024
    print(f"Performing matrix multiplication for C({M}x{N}) = A({M}x{K}) * B({K}x{N})")

    # Create random matrices (float32, matching the types declared in the Cython wrapper)
    print("Initializing matrices...")
    A = np.random.rand(M, K).astype(np.float32)
    B = np.random.rand(K, N).astype(np.float32)

    # --- GPU 计算 ---
    print("Calculating on GPU...")
    start_time_gpu = time.time()
    C_gpu = matmul_cython.matmul_gpu(A, B)
    end_time_gpu = time.time()
    gpu_duration = end_time_gpu - start_time_gpu
    print(f"GPU computation took: {gpu_duration:.6f} seconds")

    print("Calculating on CPU for verification...")
    start_time_cpu = time.time()
    C_cpu = np.dot(A, B)
    end_time_cpu = time.time()
    cpu_duration = end_time_cpu - start_time_cpu
    print(f"CPU computation took: {cpu_duration:.6f} seconds")

    # Naive three-nested-loop version without vectorization
    # # Matrix shapes
    # m, n = A.shape
    # n, p = B.shape

    # # Initialize the result matrix C
    # C = np.zeros((m, p), dtype=np.float32)

    # # Nested loops for matrix multiplication (this kind of loop is a good candidate for numba #TODO)
    # for i in range(m):
    #     for j in range(p):
    #         for k in range(n):
    #             C[i, j] += A[i, k] * B[k, j]

    # --- Verify the results ---
    print("Verifying results...")
    if np.allclose(C_gpu, C_cpu, atol=1e-3):  # atol because GPU floating-point results may differ slightly
        print("Verification successful! The results from GPU and CPU are close enough.")
        print(f"GPU was approximately {cpu_duration / gpu_duration:.2f} times faster than CPU.")
    else:
        print("Verification failed! The results are different.")
        diff = np.abs(C_gpu - C_cpu)
        print(f"Maximum difference: {np.max(diff)}")

if __name__ == "__main__":
    main()

Running it should produce output similar to:

Performing matrix multiplication for C(1024x1024) = A(1024x1024) * B(1024x1024)
Initializing matrices...
Calculating on GPU...
GPU computation took: 0.659936 seconds
Calculating on CPU for verification...
CPU computation took: 0.012031 seconds
Verifying results...
Verification successful! The results from GPU and CPU are close enough.
GPU was approximately 0.02 times faster than CPU.

This may look odd: matrix multiplication is exactly the kind of massively parallel workload GPUs excel at, yet here the GPU is slower than the CPU. There are three reasons:

  • First, a 1024x1024 matrix multiplication is not actually that large and cannot fully exploit the GPU, while moving the data between CPU and GPU adds a non-trivial amount of time

  • Second, the baseline is NumPy's vectorized matrix multiplication, which is backed by highly optimized BLAS libraries (Intel MKL, OpenBLAS, ATLAS, etc.) and is extremely fast in its own right. The code above also includes the naive three-nested-loop version; uncomment it and you will find it runs extremely slowly. On my machine it took about 10 minutes, tens of thousands of times slower than the vectorized NumPy call

  • Third, CUDA has a warm-up cost: the first call has to create the context and ramp up the clocks. That time is actually quite short, but since the computation itself only takes a fraction of a second, the warm-up dominates. Simply performing one extra CUDA call (matmul_cython.matmul_gpu(A, B)) before the timed section, as sketched below, brings the GPU time right down to roughly the NumPy vectorized time
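
A minimal sketch of such a warm-up placed before the timed section (same variables as in the test script above):

    # Warm-up: the first CUDA call creates the context and ramps up the clocks
    _ = matmul_cython.matmul_gpu(A, B)

    # Only start timing afterwards
    start_time_gpu = time.time()
    C_gpu = matmul_cython.matmul_gpu(A, B)
    end_time_gpu = time.time()
    print(f"GPU computation took: {end_time_gpu - start_time_gpu:.6f} seconds")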

Optimization

The kernel we wrote only applies basic shared-memory tiling, and there is plenty more room for optimization: register blocking, better memory access coalescing, and so on. But the simplest and most effective option is to use the cuBLAS library directly.

Add to the header file:

void matrix_multiply_cublas(const float* h_A, const float* h_B, float* h_C, int M, int N, int K);

Add to matmul.cu:

#include <cublas_v2.h> // header for the cuBLAS library
// New function that performs matrix multiplication with cuBLAS
void matrix_multiply_cublas(const float* h_A, const float* h_B, float* h_C, int M, int N, int K) {
    float *d_A, *d_B, *d_C;
    size_t size_A = (size_t)M * K * sizeof(float);
    size_t size_B = (size_t)K * N * sizeof(float);
    size_t size_C = (size_t)M * N * sizeof(float);
    cudaError_t cuda_stat;
    cublasStatus_t cublas_stat;
    cublasHandle_t handle;

    // Allocate memory on the GPU
    cuda_stat = cudaMalloc((void**)&d_A, size_A);
    if (cuda_stat != cudaSuccess) { std::cerr << "cuBLAS: Failed to allocate device memory for A" << std::endl; return; }
    cuda_stat = cudaMalloc((void**)&d_B, size_B);
    if (cuda_stat != cudaSuccess) { std::cerr << "cuBLAS: Failed to allocate device memory for B" << std::endl; return; }
    cuda_stat = cudaMalloc((void**)&d_C, size_C);
    if (cuda_stat != cudaSuccess) { std::cerr << "cuBLAS: Failed to allocate device memory for C" << std::endl; return; }

    // Copy the matrices from CPU to GPU
    cuda_stat = cudaMemcpy(d_A, h_A, size_A, cudaMemcpyHostToDevice);
    if (cuda_stat != cudaSuccess) { std::cerr << "cuBLAS: Failed to copy A to device" << std::endl; return; }
    cuda_stat = cudaMemcpy(d_B, h_B, size_B, cudaMemcpyHostToDevice);
    if (cuda_stat != cudaSuccess) { std::cerr << "cuBLAS: Failed to copy B to device" << std::endl; return; }

    // Create the cuBLAS handle
    cublas_stat = cublasCreate(&handle);
    if (cublas_stat != CUBLAS_STATUS_SUCCESS) { std::cerr << "cuBLAS: Failed to create handle" << std::endl; return; }

    // alpha and beta parameters for C = alpha * A * B + beta * C
    const float alpha = 1.0f;
    const float beta = 0.0f;

    // cuBLAS expects column-major storage while our arrays are row-major.
    // Since (A*B)^T = B^T * A^T, passing B and A in swapped order to a
    // column-major GEMM yields the row-major C = A * B directly.
    cublas_stat = cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, N, M, K, &alpha, d_B, N, d_A, K, &beta, d_C, N);
    if (cublas_stat != CUBLAS_STATUS_SUCCESS) {
        std::cerr << "cuBLAS: cublasSgemm failed" << std::endl;
        return;
    }

    // Copy the result from GPU back to CPU
    cuda_stat = cudaMemcpy(h_C, d_C, size_C, cudaMemcpyDeviceToHost);
    if (cuda_stat != cudaSuccess) { std::cerr << "cuBLAS: Failed to copy C to host" << std::endl; return; }

    // Destroy the cuBLAS handle and free GPU memory
    cublasDestroy(handle);
    cudaFree(d_A);
    cudaFree(d_B);
    cudaFree(d_C);
}

Include the cublas_v2.h header (also under /usr/local/cuda/include) and let cuBLAS's cublasSgemm do the multiplication; as noted in the comment above the call, B and A are passed in swapped order because cuBLAS works with column-major matrices. Add the corresponding wrapper to the Cython pyx file:

def matmul_gpu_cublas(np.ndarray[np.float32_t, ndim=2, mode="c"] A, 
                      np.ndarray[np.float32_t, ndim=2, mode="c"] B):
    # Check that the matrix dimensions are compatible
    if A.shape[1] != B.shape[0]:
        msg = "Matrix dimension mismatch for multiplication: A.shape={}, B.shape={}".format((<object>A).shape, (<object>B).shape)
        raise ValueError(msg)

    # Matrix dimensions
    cdef int M = A.shape[0]
    cdef int K = A.shape[1]
    cdef int N = B.shape[1]

    # Allocate the NumPy array that will hold the result
    cdef np.ndarray[np.float32_t, ndim=2, mode="c"] C = np.empty((M, N), dtype=np.float32)

    # Call the C++ wrapper that uses cuBLAS
    matrix_multiply_cublas(<const float*>A.data, <const float*>B.data, <float*>C.data, M, N, K)

    return C

Build it the same way as before; note that the shared library now also has to link against cuBLAS (e.g. add -lcublas to the nvcc command above) before rebuilding the Cython extension. Then add the corresponding test to the main function; the verification afterwards works exactly as before. A warm-up call is again needed before timing, this time matmul_cython.matmul_gpu_cublas(A, B), because cuBLAS also does some internal initialization on its first call:

    print("Calculating on GPU with BLAS..")
    start_time_gpu2 = time.time()
    C_gpu_blas = matmul_cython.matmul_gpu_cublas(A, B)
    end_time_gpu2 = time.time()
    gpu_duration2 = end_time_gpu2 - start_time_gpu2
    print(f"GPU BLAS computation took: {gpu_duration2:.6f} seconds")

The output now looks something like:

Performing matrix multiplication for C(1024x1024) = A(1024x1024) * B(1024x1024)
Initializing matrices...
Calculating on GPU...
GPU computation took: 0.014495 seconds
Calculating on GPU with BLAS..
GPU BLAS computation took: 0.013747 seconds
Calculating on CPU for verification...
CPU computation took: 0.011865 seconds
Verifying results...
Verification successful! The results from GPU and CPU are close enough.
GPU was approximately 0.82 times faster than CPU.
Verifying results...
Verification successful! The results from GPU and CPU are close enough.
GPU was approximately 0.86 times faster than CPU.

As you can see, vectorized NumPy, the hand-written GPU kernel, and the cuBLAS GPU version all take roughly the same time here, with NumPy slightly ahead. As noted earlier, the computation is still fairly small, so the GPU is not fully utilized and data transfer dominates. As the problem size grows, the GPU versions pull away from vectorized NumPy, and cuBLAS in turn pulls away from the hand-written kernel, which itself still stays ahead of the CPU.

For example, increasing the size to 2048x2048:

Performing matrix multiplication for C(2048x2048) = A(2048x2048) * B(2048x2048)
Initializing matrices...
Calculating on GPU...
GPU computation took: 0.081954 seconds
Calculating on GPU with BLAS..
GPU BLAS computation took: 0.039383 seconds
Calculating on CPU for verification...
CPU computation took: 0.085890 seconds
Verifying results...
Verification successful! The results from GPU and CPU are close enough.
GPU was approximately 1.05 times faster than CPU.
Verifying results...
Verification successful! The results from GPU and CPU are close enough.
GPU was approximately 2.18 times faster than CPU...

Now the hand-written kernel already beats the vectorized NumPy computation, and cuBLAS is more than twice as fast as the CPU; increase the size further and the gap keeps widening.

Summary

NumPy's vectorized matrix operations are very efficient in their own right, backed by highly optimized BLAS libraries, and are entirely adequate for small and medium-sized matrices; for large-scale matrix computation, though, the GPU has a natural, substantial advantage. Nor do we have to hand-write CUDA extensions ourselves: libraries such as PyCUDA, Numba, and CuPy let us run code on the GPU directly, and CuPy in particular is almost fully compatible with NumPy. For deep learning, the various deep learning frameworks take care of this as well.
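
For instance, a minimal CuPy sketch of the same multiplication (assuming CuPy is installed; the API mirrors NumPy):

import numpy as np
import cupy as cp

A = np.random.rand(2048, 2048).astype(np.float32)
B = np.random.rand(2048, 2048).astype(np.float32)

A_gpu = cp.asarray(A)            # copy to the GPU
B_gpu = cp.asarray(B)
C_gpu = A_gpu @ B_gpu            # dispatches to cuBLAS under the hood
cp.cuda.Device().synchronize()   # wait for the GPU to finish

C = cp.asnumpy(C_gpu)            # copy the result back to the host
print(np.allclose(C, A @ B, atol=1e-3))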