Deploying Bayesian models in production often requires balancing predictive accuracy with low-latency inference. A common approach is to load the InferenceData produced by PyMC and call pm.sample_posterior_predictive. While effective, this typically takes seconds per request, which is unacceptable when latency budgets are measured in milliseconds. The alternative, reimplementing the inference logic by hand, is tedious and error-prone.
After years of exploring solutions, I developed a streamlined approach to serve Bayesian models efficiently:
* Represent the Bayesian posterior as PyTorch tensors in a (dim, draw) format
* Extract the model transformation into framework-agnostic code. PyTorch tensors and PyTensor variables share a highly compatible interface, making this step seamless.
* Use the same code for both online and offline inference, ensuring consistency and reducing maintenance overhead.
* Serialize the model to ONNX, which enables low-latency inference in production. ONNX’s standardized format makes deployment straightforward.
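As a sketch of the first three steps (the hierarchical linear model and all names here are illustrative, not taken from the gist): the posterior is stored as arrays with a trailing draw dimension, and the transformation is plain array code that runs unchanged on NumPy arrays, PyTorch tensors, or PyTensor variables, since it only uses indexing and broadcasting.

```python
import numpy as np

def predict_mean(alpha, beta, group_idx, x):
    """Posterior predictive mean for a toy hierarchical linear model
    y = alpha[group] + beta[group] * x.

    Framework-agnostic: works on NumPy arrays, PyTorch tensors, or
    PyTensor variables alike.
    """
    a = alpha[group_idx]        # gather per-row draws: (batch, draw)
    b = beta[group_idx]         # (batch, draw)
    y = a + b * x[:, None]      # broadcast x over the draw dimension
    return y.mean(axis=-1)      # average over posterior draws

# Posterior draws in (dim, draw) format, here faked for illustration
rng = np.random.default_rng(0)
n_groups, n_draws = 3, 1000
alpha = rng.normal(1.0, 0.1, size=(n_groups, n_draws))
beta = rng.normal(2.0, 0.1, size=(n_groups, n_draws))

y = predict_mean(alpha, beta, np.array([0, 1, 2]), np.array([0.0, 1.0, 2.0]))
```

Because `predict_mean` makes no framework-specific calls, the exact same function serves both offline batch scoring and the online path.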
Below is a fully functional example (available on [GitHub](https://gist.github.com/apetrov/48bf738d6ae252e1542e67ab486967ca)). For simplicity, it omits the encoding of categorical features to indices.
Bonus: Since ONNX is language-agnostic, you can deploy Bayesian models in non-Python environments, such as Go, JVM or Rust.