Oliver Grainge

AI Engineer & Researcher

Specializing in production ML optimization and deployment. I take models from research to production with measurable performance improvements across edge and cloud platforms, through model compression, custom kernel development, and hardware-software co-design.

4 publications (IEEE RAL, AAAI) · PhD candidate, Southampton · CUDA / Triton / ARM NEON · Open-source contributor

Experience

  • Contract Researcher — Performance Engineering Feb 2025 – Present
    Arm · Remote
Authored 6 hands-on tutorials for the Arm Total Performance toolkit covering memory optimization, library acceleration (APL, KleidiAI), and automated porting for AWS Graviton instances.
  • Research Assistant Jun 2025 – Nov 2025
    University College London · London, UK
    Engineered 1.58-bit precision pipeline for Stable Diffusion (4x memory reduction, 95% quality retention). Designed custom CUDA and Triton kernels for bit-packed tensor operations, delivering 30% speedup over PyTorch baseline.
  • Visiting Researcher Aug 2024 – Jan 2025
    Queensland University of Technology · Brisbane, Australia
    Engineered speculative decoding for vision-language transformers achieving 2.5x inference speedup for sub-100ms robotic navigation. Implemented training data filtering methods achieving equivalent accuracy with 38% less data.
  • Research Fellow Jan 2024 – Jan 2025
    AI Security Institute · Remote
    Built automated benchmarking framework evaluating 25+ VLMs across 26k geo-tagged images with 99.9% reliability over 500k+ API calls. Developed privacy-preserving techniques reducing geolocation accuracy by 40%, with interactive demo attracting 5k+ users.
  • Contract Researcher — AI Inference Optimization Nov 2024 – Jan 2025
    Arm · Remote
    Engineered demonstrations achieving 40% latency reduction via SIMD/INT8 on mobile and 2.1x throughput on cloud instances. Built Hyperopt-based per-layer precision optimizer demonstrating 22% memory reduction on GPT models.
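The per-layer precision optimizer mentioned in the last role can be sketched as a search over bit-widths per layer that trades memory against a quality penalty. A minimal stdlib-only version follows, with a plain exhaustive search standing in for Hyperopt's TPE sampler; the layer names, parameter counts, and sensitivity weights are illustrative placeholders, not values from the actual GPT models:

```python
import itertools

# Hypothetical per-layer stats (illustrative only).
LAYERS = ["embed", "attn", "mlp", "head"]
PARAMS = {"embed": 50e6, "attn": 7e6, "mlp": 28e6, "head": 50e6}      # parameter counts
SENSITIVITY = {"embed": 0.002, "attn": 0.02, "mlp": 0.005, "head": 0.01}  # quality cost per dropped bit

def cost(bits):
    """Memory footprint (GB) plus a crude quality penalty for
    quantizing sensitive layers below 16 bits."""
    memory_gb = sum(PARAMS[l] * bits[l] / 8 for l in LAYERS) / 1e9
    penalty = sum(SENSITIVITY[l] * (16 - bits[l]) for l in LAYERS)
    return memory_gb + penalty

# Exhaustive search over {4, 8, 16} bits per layer; a TPE optimizer
# would sample this same space instead of enumerating it.
best = min(
    (dict(zip(LAYERS, combo)) for combo in itertools.product([4, 8, 16], repeat=len(LAYERS))),
    key=cost,
)
```

With a real model the objective would measure actual quantized memory and validation perplexity rather than this proxy, but the structure of the search is the same.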

Open Source Projects

  • C++ · CUDA · ARM NEON · AVX2
    High-performance ternary matrix multiplication library with multi-backend support. 16x memory reduction via 2-bit weight packing with kernels outperforming PyTorch FP32 on edge devices.
  • PyTorch
    Quantization-aware training toolkit with drop-in BitLinear layers (BitNet, TWN, ParetoQ 1.58-bit). Train-to-deploy workflow with optional BitOps acceleration for 8x inference memory savings.
  • Python · Gradio · CUDA
    Interactive chat with 1.58-bit BitNet models. 24x speedup and 80% memory reduction on ARM M4 vs PyTorch FP32, with backend switching and streaming responses.
  • Python · OpenCV · NumPy
    Pure-Python stereo SLAM pipeline (KITTI-compatible) with feature tracking, stereo matching, PnP/ICP motion estimation, and bundle adjustment.
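The 2-bit weight packing behind the ternary matmul project above can be illustrated in a few lines. This NumPy sketch (the real library uses native C++/CUDA/NEON kernels; the layout here is one plausible scheme, not necessarily the library's) packs four ternary weights into one byte, which is where the 16x reduction versus 32-bit floats comes from:

```python
import numpy as np

def pack_ternary(w):
    """Pack ternary weights {-1, 0, +1} into 2 bits each, four per byte.
    Assumes len(w) is divisible by 4."""
    codes = (w + 1).astype(np.uint8)        # map -1, 0, +1 -> 0, 1, 2
    codes = codes.reshape(-1, 4)
    return (codes[:, 0]
            | (codes[:, 1] << 2)
            | (codes[:, 2] << 4)
            | (codes[:, 3] << 6)).astype(np.uint8)

def unpack_ternary(packed):
    """Recover ternary int8 weights from the 2-bit packed form."""
    shifts = np.array([0, 2, 4, 6], dtype=np.uint8)
    codes = (packed[:, None] >> shifts) & 0b11   # four 2-bit fields per byte
    return codes.reshape(-1).astype(np.int8) - 1
```

A fast kernel would compute dot products directly on the packed representation rather than unpacking first, but the round trip above shows the storage format itself.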

Technical Skills

Languages
Python · C/C++ · CUDA · Triton · Bash · SQL
ML Frameworks
PyTorch · TensorFlow · Hugging Face · vLLM · llama.cpp · ONNX Runtime · TensorRT · OpenCV
Model Optimization
Quantization (INT8/INT4/ternary) · QAT / PTQ / GPTQ / AWQ · Pruning · Knowledge Distillation · LoRA / QLoRA · Custom CUDA Kernels · FlashAttention · SIMD (NEON, AVX2)
Infrastructure & MLOps
AWS (EC2, Graviton) · Docker · Kubernetes · SLURM · Ray · MLflow · W&B · Triton Inference Server · FastAPI

Education

PhD (iPhD) in Machine Intelligence
University of Southampton · Oct 2022 – Present
Thesis: Efficient Resource-Constrained Visual Place Recognition
BEng Electronics and Electrical Engineering — First Class Honours (83%)
University of Southampton · Sept 2019 – Jul 2022

Interested in collaborating on efficient ML research or production AI deployment?