Oliver Grainge
AI Engineer & Researcher
Specializing in production ML optimization and deployment. I take models from research to production, delivering measurable performance improvements on edge and cloud platforms through model compression, custom kernel development, and hardware-software co-design.
4 publications (IEEE RAL, AAAI) · PhD candidate, Southampton · CUDA / Triton / ARM NEON · Open source contributor
Experience
- Contract Researcher — Performance Engineering · Arm · Remote · Feb 2025 – Present
  Architected 6 hands-on tutorials for the Arm Total Performance toolkit covering memory optimization, library acceleration (APL, KleidiAI), and automated porting for AWS Graviton instances.
- Research Assistant · University College London · London, UK · Jun 2025 – Nov 2025
  Engineered 1.58-bit precision pipeline for Stable Diffusion (4x memory reduction, 95% quality retention). Designed custom CUDA and Triton kernels for bit-packed tensor operations, delivering 30% speedup over PyTorch baseline. (The quantizer behind 1.58-bit weights is sketched after this list.)
- Visiting Researcher · Queensland University of Technology · Brisbane, Australia · Aug 2024 – Jan 2025
  Engineered speculative decoding for vision-language transformers achieving 2.5x inference speedup for sub-100ms robotic navigation. Implemented training data filtering methods achieving equivalent accuracy with 38% less data. (Speculative decoding is sketched after this list.)
- Research Fellow · AI Security Institute · Remote · Jan 2024 – Jan 2025
  Built automated benchmarking framework evaluating 25+ VLMs across 26k geo-tagged images with 99.9% reliability over 500k+ API calls. Developed privacy-preserving techniques reducing geolocation accuracy by 40%, with interactive demo attracting 5k+ users.
- Contract Researcher — AI Inference Optimization · Arm · Remote · Nov 2024 – Jan 2025
  Engineered demonstrations achieving 40% latency reduction via SIMD/INT8 on mobile and 2.1x throughput on cloud instances. Built Hyperopt-based per-layer precision optimizer demonstrating 22% memory reduction on GPT models.
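The 1.58-bit work above centers on ternary weight quantization. A minimal sketch of the absmean quantizer popularized by BitNet b1.58, not the actual pipeline code (the function name and per-tensor scaling are illustrative choices):

```python
import torch

def absmean_ternary_quantize(w: torch.Tensor, eps: float = 1e-5):
    """Quantize a weight tensor to {-1, 0, +1} plus one FP scale.

    Follows the absmean scheme from BitNet b1.58: divide by the mean
    absolute weight, round, and clip to the ternary set.
    """
    scale = w.abs().mean().clamp(min=eps)      # per-tensor absmean scale
    w_q = (w / scale).round().clamp_(-1, 1)    # values in {-1, 0, +1}
    return w_q, scale                          # dequantize as w_q * scale
```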
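Speculative decoding, as used in the QUT navigation work, pairs a cheap draft model with the expensive target model: the draft proposes a few tokens autoregressively, the target verifies all of them in a single forward pass, and the longest agreeing prefix is kept. A greedy, batch-size-1 sketch assuming Hugging Face-style models that return `.logits` (function and parameters are illustrative):

```python
import torch

@torch.no_grad()
def speculative_decode(target, draft, ids, k=4, max_new=64):
    """Greedy speculative decoding: draft proposes k tokens, target
    verifies them in one pass; keep the agreeing prefix plus target's
    correction at the first mismatch."""
    start = ids.shape[1]
    while ids.shape[1] < start + max_new:
        # 1) Draft k tokens autoregressively with the small model.
        prop = ids
        for _ in range(k):
            nxt = draft(prop).logits[:, -1].argmax(-1, keepdim=True)
            prop = torch.cat([prop, nxt], dim=1)
        # 2) One target pass scores every drafted position at once.
        tgt = target(prop).logits[:, ids.shape[1] - 1 : -1].argmax(-1)
        drafted = prop[:, ids.shape[1]:]
        # 3) Accept the longest prefix where draft and target agree,
        #    then append target's token at the first disagreement.
        n_ok = int((tgt == drafted).cumprod(dim=1).sum())
        ids = torch.cat([ids, drafted[:, :n_ok], tgt[:, n_ok : n_ok + 1]], dim=1)
    return ids
```

A production implementation would reuse KV caches and sample rather than take the argmax; the sketch only shows the propose/verify/accept structure that yields the speedup.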
Selected Publications
- AAAI 2025 · First comprehensive benchmark of VLM geolocation capabilities across 4 datasets
- IEEE RAL · 65% memory reduction and 35% latency reduction for VPR transformers
- IEEE RAL · Deployment guidelines for extreme quantization on embedded devices
- IEEE RAL · Channel pruning achieving 21% latency and 16% memory reduction with <1% accuracy loss
Open Source Projects
- High-performance ternary matrix multiplication library with multi-backend support (C++ · CUDA · ARM NEON · AVX2). 16x memory reduction via 2-bit weight packing (sketched after this list) with kernels outperforming PyTorch FP32 on edge devices.
- Quantization-aware training toolkit (PyTorch) with drop-in BitLinear layers (BitNet, TWN, ParetoQ 1.58-bit). Train-to-deploy workflow with optional BitOps acceleration for 8x inference memory savings. (A minimal BitLinear sketch appears after this list.)
- Interactive chat with 1.58-bit BitNet models (Python · Gradio · CUDA). 24x speedup and 80% memory reduction on ARM M4 vs PyTorch FP32, with backend switching and streaming responses.
- Pure-Python stereo SLAM pipeline, KITTI-compatible (Python · OpenCV · NumPy), with feature tracking, stereo matching, PnP/ICP motion estimation, and bundle adjustment.
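The ternary matmul library's 16x memory reduction comes from storing each {-1, 0, +1} weight in 2 bits instead of 32. A PyTorch sketch of the packing and unpacking step (names are illustrative; the library itself does this in C++/CUDA):

```python
import torch

def pack_ternary(w_q: torch.Tensor) -> torch.Tensor:
    """Pack ternary weights {-1, 0, +1} into 2 bits each, 4 per byte:
    16x smaller than FP32 storage."""
    codes = (w_q.flatten() + 1).to(torch.uint8)        # {-1,0,1} -> {0,1,2}
    pad = (-codes.numel()) % 4
    codes = torch.cat([codes, codes.new_zeros(pad)]).view(-1, 4)
    b0, b1, b2, b3 = codes.unbind(dim=1)
    return b0 | (b1 << 2) | (b2 << 4) | (b3 << 6)      # one byte per 4 weights

def unpack_ternary(packed: torch.Tensor, n: int) -> torch.Tensor:
    """Inverse of pack_ternary: recover the first n ternary weights."""
    shifts = torch.tensor([0, 2, 4, 6], dtype=torch.uint8)
    codes = (packed.unsqueeze(1) >> shifts) & 3        # (bytes, 4) 2-bit codes
    return codes.flatten()[:n].to(torch.int8) - 1      # back to {-1, 0, +1}
```

Fast kernels consume the packed bytes directly rather than unpacking to a dense tensor; the unpacker is shown here only to make the encoding concrete.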
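The QAT toolkit's drop-in layers follow the standard fake-quantization pattern: quantize weights in the forward pass while gradients flow to the full-precision master weights via the straight-through estimator. A minimal sketch of the BitNet-style ternary case only (the toolkit's actual BitLinear supports several schemes):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BitLinear(nn.Linear):
    """nn.Linear drop-in that trains against ternary fake-quantized weights.

    Forward uses quantized weights; backward updates the FP32 master
    weights via the straight-through estimator (STE).
    """
    def forward(self, x: torch.Tensor) -> torch.Tensor:
        w = self.weight
        scale = w.abs().mean().clamp(min=1e-5)            # absmean scale
        w_q = (w / scale).round().clamp(-1, 1) * scale    # fake-quantize
        w_q = w + (w_q - w).detach()                      # STE: identity grad
        return F.linear(x, w_q, self.bias)
```

Swapping nn.Linear for BitLinear and training normally yields weights that survive ternary quantization at deployment time.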
Technical Skills
Languages
Python · C/C++ · CUDA · Triton · Bash · SQL
ML Frameworks
PyTorch · TensorFlow · Hugging Face · vLLM · llama.cpp · ONNX Runtime · TensorRT · OpenCV
Model Optimization
Quantization (INT8/INT4/ternary) · QAT / PTQ / GPTQ / AWQ · Pruning · Knowledge Distillation · LoRA / QLoRA · Custom CUDA Kernels · FlashAttention · SIMD (NEON, AVX2)
Infrastructure & MLOps
AWS (EC2, Graviton) · Docker · Kubernetes · SLURM · Ray · MLflow · W&B · Triton Inference Server · FastAPI
Education
PhD (iPhD) in Machine Intelligence
University of Southampton · Oct 2022 – Present
Thesis: Efficient Resource-Constrained Visual Place Recognition
BEng Electronics and Electrical Engineering — First Class Honours (83%)
University of Southampton · Sept 2019 – Jul 2022
Interested in collaborating on efficient ML research or production AI deployment?
