# MNIST CUDA Neural Network
Build, train, and test a CUDA-accelerated multilayer perceptron for MNIST digit classification.
## What This Skill Does
This skill helps you work with a CUDA-accelerated neural network implementation for MNIST digit classification. The codebase includes:
- CUDA kernels for forward/backward passes
- MNIST IDX file loading and normalization
- ASCII art digit visualization
- Complete training pipeline with accuracy reporting
- Make-based build system for C and CUDA code

## Architecture Overview
**Network:** 784 (input) → 128 (hidden, ReLU) → 10 (output logits)
**Training Pipeline:**
```
MNIST IDX files → MNISTDataset → normalize_mnist() → NormalizedMNIST
↓
copy to GPU (d_input)
↓
forward() → compute_loss() → backward() → update_params()
```
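To make the pipeline concrete, here is a hedged sketch of one training epoch on the host side. The call names mirror the diagram above, but the signatures, the `count` field, and constants like `BATCH_SIZE` and `INPUT_DIM` are assumptions, not the repository's actual API:

```c
// Hypothetical glue code; the real loop lives in train.cu and may differ.
// Assumes device buffers d_input/d_labels were allocated elsewhere with
// CUDA_CHECK(cudaMalloc(...)).
void train_epoch(Model *model, TrainState *state, const NormalizedMNIST *norm,
                 float *d_input, unsigned char *d_labels, float lr) {
  int num_batches = norm->count / BATCH_SIZE; // `count` field is assumed
  for (int b = 0; b < num_batches; b++) {
    size_t px = (size_t)b * BATCH_SIZE * INPUT_DIM;
    CUDA_CHECK(cudaMemcpy(d_input, norm->pixels + px,
                          BATCH_SIZE * INPUT_DIM * sizeof(float),
                          cudaMemcpyHostToDevice));
    CUDA_CHECK(cudaMemcpy(d_labels, norm->labels + (size_t)b * BATCH_SIZE,
                          BATCH_SIZE * sizeof(*d_labels),
                          cudaMemcpyHostToDevice));
    forward(model, state, d_input);   // linear_relu_kernel + linear_kernel
    compute_loss(state, d_labels);    // softmax_cross_entropy_kernel
    backward(model, state, d_input);  // fills d_dW1/d_db1/d_dW2/d_db2
    update_params(model, state, lr);  // sgd_kernel on each parameter tensor
  }
}
```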
**Core Files:**
- `model.cu/h` - CUDA neural network implementation
- `train.cu` - Training program with GPU memory management
- `mnist.c/h` - IDX format data loader and normalizer
- `display.c/h` - ASCII visualization
- `bswap.c/h` - Big-endian byte swapping utilities

## Instructions
### 1. Verify Environment Setup
Check that CUDA is properly installed:
```bash
nvcc --version
```
If CUDA is not at `/usr/local/cuda`, set the `CUDA_PATH` environment variable:
```bash
export CUDA_PATH=/path/to/cuda
```
### 2. Verify MNIST Data Files
Ensure training and test data exist:
- `data/mnist/train-images.idx3-ubyte`
- `data/mnist/train-labels.idx1-ubyte`
- `data/mnist/t10k-images.idx3-ubyte`
- `data/mnist/t10k-labels.idx1-ubyte`

If any are missing, download them from the [MNIST Database](http://yann.lecun.com/exdb/mnist/).
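For background, IDX is a simple binary format: a big-endian header followed by raw bytes, which is why `bswap.c` exists. The image-file header is four big-endian 32-bit integers: magic `0x00000803`, image count, rows, and cols. A hedged sketch of reading it (the real loader is in `mnist.c`; these helper names are illustrative):

```c
#include <stdint.h>
#include <stdio.h>

// Illustrative big-endian read, analogous to what bswap.c provides.
static uint32_t read_be_u32(FILE *f) {
  uint8_t b[4];
  if (fread(b, 1, 4, f) != 4) return 0;
  return ((uint32_t)b[0] << 24) | ((uint32_t)b[1] << 16) |
         ((uint32_t)b[2] << 8) | (uint32_t)b[3];
}

// Reads an idx3-ubyte image header; returns 0 on success.
// For MNIST: count = 60000 (train) or 10000 (test), rows = cols = 28.
int read_idx_image_header(FILE *f, uint32_t *count, uint32_t *rows,
                          uint32_t *cols) {
  if (read_be_u32(f) != 0x00000803) return -1; // image-file magic number
  *count = read_be_u32(f);
  *rows = read_be_u32(f);
  *cols = read_be_u32(f);
  return 0;
}
```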
### 3. Build Commands
**Build all targets:**
```bash
make all
```
**Build and run digit display:**
```bash
make display
```
**Build and run C tests:**
```bash
make test
```
**Build and run CUDA model tests:**
```bash
make test-cuda
```
**Train the model:**
```bash
make train
```
**Clean build artifacts:**
```bash
make clean
```
Build outputs go to the `bin/` directory.
### 4. Code Modification Guidelines
**Compilers:**
- C code: Clang with the C23 standard and strict warnings enabled
- CUDA code: NVCC with `-O2 -g` (optimized, with debug symbols)

**Memory Management Rules:**
- All `Model` and `TrainState` pointers use GPU device memory (prefix `d_`)
- The MNIST labels pointer is shared between `MNISTDataset` and `NormalizedMNIST`; free it only once
- Always wrap CUDA API calls in the `CUDA_CHECK` macro (see the sketch below)
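For reference, error-checking macros like `CUDA_CHECK` conventionally look something like this; the project's actual definition may differ in detail:

```c
#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

// Conventional CUDA error-checking macro; the repository's actual
// definition may differ.
#define CUDA_CHECK(call)                                                  \
  do {                                                                    \
    cudaError_t err_ = (call);                                            \
    if (err_ != cudaSuccess) {                                            \
      fprintf(stderr, "CUDA error: %s at %s:%d\n",                        \
              cudaGetErrorString(err_), __FILE__, __LINE__);              \
      exit(EXIT_FAILURE);                                                 \
    }                                                                     \
  } while (0)
```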
**Code Style:**
- Formatting is enforced by `.clang-format` (LLVM style, 2-space indent)
- Run `make format` or ensure pre-commit hooks are active

### 5. Understanding CUDA Kernels
**Forward Pass:**
- `linear_relu_kernel` - Fused linear layer + ReLU for the hidden layer
- `linear_kernel` - Linear layer for the output logits
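As an illustration of the fused pattern, a minimal sketch follows; the actual kernel in `model.cu` may use a different layout or launch geometry, and all names and dimensions here are assumptions:

```cuda
// Sketch of a fused linear + ReLU kernel: out = max(0, x * W + b).
// One thread per (sample, output feature) pair; assumes row-major
// x [batch x in_dim] and W [in_dim x out_dim].
__global__ void linear_relu_sketch(const float *x, const float *W,
                                   const float *b, float *out,
                                   int batch, int in_dim, int out_dim) {
  int row = blockIdx.y * blockDim.y + threadIdx.y; // sample index
  int col = blockIdx.x * blockDim.x + threadIdx.x; // output feature
  if (row >= batch || col >= out_dim) return;
  float acc = b[col];
  for (int k = 0; k < in_dim; k++)
    acc += x[row * in_dim + k] * W[k * out_dim + col];
  out[row * out_dim + col] = fmaxf(acc, 0.0f); // fused ReLU
}
```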
**Loss & Gradients:**
- `softmax_cross_entropy_kernel` - Numerically stable fused softmax, loss, and gradient
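Numerical stability here means subtracting the per-row maximum before exponentiating, so `expf` never overflows; fusion means the gradient (`softmax - onehot`) comes out of the same pass. A hedged sketch, one thread per sample, with illustrative names:

```cuda
// Sketch of fused, numerically stable softmax + cross-entropy.
// losses[i] = -log softmax(logits[i])[labels[i]]
// dlogits   = (softmax - onehot) / batch  (mean-loss convention, assumed)
__global__ void softmax_xent_sketch(const float *logits,
                                    const unsigned char *labels,
                                    float *dlogits, float *losses,
                                    int batch, int classes) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i >= batch) return;
  const float *row = logits + i * classes;
  float m = row[0];
  for (int c = 1; c < classes; c++) m = fmaxf(m, row[c]); // row max
  float sum = 0.0f;
  for (int c = 0; c < classes; c++) sum += expf(row[c] - m);
  losses[i] = logf(sum) + m - row[labels[i]]; // -log p(label)
  for (int c = 0; c < classes; c++) {
    float p = expf(row[c] - m) / sum; // stable softmax probability
    dlogits[i * classes + c] = (p - (c == labels[i] ? 1.0f : 0.0f)) / batch;
  }
}
```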
**Backward Pass:**
- `matmul_at_b_kernel` - Weight gradients via C = A^T × B
- `bias_grad_kernel` - Bias gradients by summing over the batch
- `hidden_grad_kernel` - Backprop through layer 2 with the ReLU mask
**Optimization:**
- `sgd_kernel` - SGD update: `param -= lr * grad`
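The update is elementwise, which makes it the simplest kernel in the pipeline; a sketch with assumed names:

```cuda
// Sketch of an elementwise SGD step over a flattened parameter tensor.
__global__ void sgd_sketch(float *param, const float *grad, float lr, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) param[i] -= lr * grad[i];
}
```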
**Metrics:**
- `count_correct_kernel` - Accuracy via argmax comparison

### 6. Key Data Structures
**Model:**
- `d_W1`, `d_b1` - Hidden layer weights and biases (GPU)
- `d_W2`, `d_b2` - Output layer weights and biases (GPU)

**TrainState:**
- `d_hidden`, `d_logits` - Forward pass activations (GPU)
- `d_dW1`, `d_db1`, `d_dW2`, `d_db2` - Parameter gradients (GPU)
- `d_loss` - Scalar loss value (GPU)

**MNISTDataset:**
- `pixels` - Raw uint8 pixel data (0-255), CPU memory
- `labels` - Label array (0-9), CPU memory

**NormalizedMNIST:**
- `pixels` - Normalized float pixel data (0.0-1.0), CPU memory
- `labels` - Shared pointer from the source dataset
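Taken together, the GPU-side structs plausibly look like the following; the authoritative definitions are in `model.h`, so treat these field groupings as a sketch:

```c
// Plausible layout of the GPU-side structs; see model.h for the real ones.
typedef struct {
  float *d_W1, *d_b1; // hidden layer: 784x128 weights, 128 biases (device)
  float *d_W2, *d_b2; // output layer: 128x10 weights, 10 biases (device)
} Model;

typedef struct {
  float *d_hidden, *d_logits;           // forward activations (device)
  float *d_dW1, *d_db1, *d_dW2, *d_db2; // parameter gradients (device)
  float *d_loss;                        // scalar loss (device)
} TrainState;
```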
### 7. Current Implementation Status
**Implemented:**
- IDX data loading with big-endian support
- Xavier weight initialization
- Forward pass with ReLU activation
- Fused softmax + cross-entropy loss
- Full backward pass with gradient computation
- SGD parameter updates
- Accuracy computation
- Complete training loop

**Future Optimizations:**
- Shared memory tiling for matrix operations
- cuBLAS integration for GEMM operations
- Mixed-precision training

## Example Usage
**Train for 10 epochs and monitor accuracy:**
```bash
make train
```
**Visualize sample digits:**
```bash
make display
```
**Run test suite:**
```bash
make test && make test-cuda
```
## Important Notes
- Ensure `CUDA_PATH` is set if CUDA is not in `/usr/local/cuda`
- The network uses Xavier initialization for stable training
- Softmax + cross-entropy is fused for numerical stability
- All GPU memory uses device pointers prefixed with `d_`
- MNIST labels are shared between the raw and normalized datasets; free the pointer only once to avoid a double-free (see the teardown sketch below)
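Since both structs point at the same labels buffer, teardown has to free it exactly once. A minimal sketch, assuming helper and field names that may not match the repository:

```c
#include <stdlib.h>
#include "mnist.h" // repository header for MNISTDataset / NormalizedMNIST

// Hypothetical cleanup; the real teardown code in train.cu may differ.
// norm->labels aliases raw->labels, so the label buffer is freed once.
void free_datasets(MNISTDataset *raw, NormalizedMNIST *norm) {
  free(norm->pixels);  // normalized float copy, owned by norm
  free(raw->pixels);   // raw uint8 pixels, owned by raw
  free(raw->labels);   // shared with norm->labels -- free exactly once
  norm->labels = NULL; // clear the alias to avoid a dangling pointer
  raw->labels = NULL;
}
```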