
Workshop: Serving AI Models at Scale with Nvidia Triton

In this workshop, join Machine Learning Research Engineer Sachin Sharma to learn how to use Nvidia’s Triton Inference Server (formerly known as TensorRT Inference Server), which simplifies the deployment of AI models at scale in production. We focus on hosting and deploying multiple trained models (TensorFlow, PyTorch) on the Triton Inference Server to leverage its full potential. Once the models are deployed, we can send inference requests and get back predictions.


To keep the workshop flowing smoothly, attendees should have a few things installed beforehand:

  1. Install Docker (https://docs.docker.com/get-docker/)
  2. Pull the Triton server Docker image from Nvidia NGC: docker pull nvcr.io/nvidia/tritonserver:21.05-py3
  3. Note that the image is about 10.6 GB, so the pull can take 10–15 minutes depending on your internet connection
  4. To view the downloaded Docker image, run: docker images
  5. (Optional) Clone the repository we will follow throughout the workshop: https://github.com/sachinsharma9780/AI-Enterprise-Workshop-Building-ML-Pipelines
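With the image pulled and the server container started, you can confirm that Triton is up before moving on. The snippet below is a minimal sketch using only the Python standard library; it assumes the server is already running locally with its HTTP endpoint on the default port 8000 (the helper function names are ours, not part of Triton).

```python
# Minimal sketch: check whether a locally running Triton server is ready.
# Assumes the container exposes the HTTP endpoint on the default port 8000.
import urllib.request


def health_url(host: str = "localhost", port: int = 8000) -> str:
    """Build the URL of Triton's standard readiness endpoint."""
    return f"http://{host}:{port}/v2/health/ready"


def server_is_ready(host: str = "localhost", port: int = 8000) -> bool:
    """Return True if the server answers the readiness probe with HTTP 200."""
    try:
        with urllib.request.urlopen(health_url(host, port), timeout=5) as resp:
            return resp.status == 200
    except OSError:
        return False


if __name__ == "__main__":
    print("Triton ready:", server_is_ready())
```

A 200 response from /v2/health/ready means the server has started and its loaded models are ready to serve requests.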

– Introduction to ArangoDB and Nvidia’s Triton Inference Server (need, features, applications, etc.)
– Setting up the Triton Inference Server on a local machine
– Deploying your first trained model (TensorFlow) on the Triton Inference Server, with an application to image classification
– Deploying almost any Hugging Face PyTorch model on the Triton Inference Server, with an application to zero-shot text classification (here we convert the given PyTorch models into a format Triton accepts)
– Writing a Python client-side script to interact with the Triton server once models are deployed (i.e., sending requests and receiving predictions)
– Exploring the image_client.py script to make an image classification request
– Writing our own client-side script to interact with NLP models
– Triton metrics
– Storing inference results in ArangoDB using python-arango
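As a preview of the client-side scripting above, the sketch below assembles a request body for Triton’s HTTP/REST inference endpoint and shows where it would be POSTed. The model name text_model and input name INPUT__0 are placeholders for illustration; substitute the names your deployed model actually uses. The workshop’s client scripts use Nvidia’s tritonclient library, which wraps this same protocol.

```python
# Sketch of a raw HTTP inference request to Triton, using only the
# standard library. Model and tensor names below are placeholders.
import json
import urllib.request


def build_infer_payload(input_name, data, datatype="FP32"):
    """Assemble a request body for Triton's /v2/models/<name>/infer endpoint."""
    return {
        "inputs": [
            {
                "name": input_name,
                "shape": [1, len(data)],
                "datatype": datatype,
                "data": data,
            }
        ]
    }


def infer(host, port, model_name, payload):
    """POST the payload to the model's infer endpoint and return the JSON reply."""
    url = f"http://{host}:{port}/v2/models/{model_name}/infer"
    req = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())


if __name__ == "__main__":
    payload = build_infer_payload("INPUT__0", [0.1, 0.2, 0.3])
    # result = infer("localhost", 8000, "text_model", payload)  # needs a running server
    print(json.dumps(payload, indent=2))
```

The server’s JSON reply mirrors this shape, with an "outputs" list containing the prediction tensors, which is what we later store in ArangoDB.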

About the Presenter:



Sachin Sharma