TensorRT + gunicorn + multi-threading: Ready for Takeoff • shytab

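With TensorRT, gunicorn, and multi-threading, Resnet101 reaches a throughput of 100+ images/sec on a single M40 machine. Our pipeline also has to download and upload the images; counting only the inference step, the theoretical throughput is 400 images/sec.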

Each test covers downloading, processing, and uploading 100 images.

Many thanks to the TensorRT development team!

gunicorn handles task distribution, scheduling requests across multiple model instances.
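As a rough illustration of that setup, a gunicorn configuration file with one worker process per model instance could look like the sketch below. All values here are assumptions chosen for illustration and are not the configuration behind the numbers in this post; `workers`, `worker_class`, `threads`, `bind`, and `timeout` are standard gunicorn settings.

```python
# gunicorn.conf.py -- illustrative values only, not the settings used in this post
bind = "0.0.0.0:8000"      # address the server listens on (assumed)
workers = 5                # one worker process per model instance (assumed count)
worker_class = "gthread"   # threaded workers, so a process can overlap network I/O
threads = 4                # threads per worker (assumed)
timeout = 120              # seconds before an unresponsive worker is restarted (assumed)
```

It would be started with something like `gunicorn -c gunicorn.conf.py app:app`, where `app:app` is a hypothetical WSGI module and callable.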

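Besides being fast, TensorRT also optimizes GPU memory usage once a model is loaded, so the same GPU can hold more instances (5 resnet27 instances take up 15% of the 12G of GPU memory), and that is what lets gunicorn go faster. In actual tests with multi-threading, however, processing time does not scale linearly with the number of instances: once throughput hits its ceiling, adding more instances brings no further improvement.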
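The memory figure and the saturation effect can be worked through in a few lines. The memory numbers come from the text; the `throughput` function is only a toy model of "linear until a ceiling", and the per-instance rate and ceiling passed to it are hypothetical, not measurements.

```python
GPU_MEMORY_GB = 12.0     # from the text: a 12G card
USED_FRACTION = 0.15     # from the text: 5 instances use 15% of it
INSTANCES = 5

used_gb = GPU_MEMORY_GB * USED_FRACTION
per_instance_gb = used_gb / INSTANCES
print("total: %.2f GB, per instance: %.2f GB" % (used_gb, per_instance_gb))


def throughput(n_instances, per_instance_rate, ceiling):
    """Toy model: scales linearly with instances until it reaches a fixed ceiling."""
    return min(n_instances * per_instance_rate, ceiling)


# Hypothetical numbers: 30 images/sec per instance, ceiling of 100 images/sec.
for n in range(1, 8):
    print(n, "instances ->", throughput(n, 30, 100), "images/sec")
```

That works out to about 1.8 GB in total, or roughly 0.36 GB per instance, which is why several instances fit on one card long before memory becomes the limit.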

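In theory, this should be related to the number of threads. gunicorn + TensorRT is still bound by the maximum number of concurrent threads. If network communication did not require multi-threading and everything were processed locally, the speedup should scale linearly.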

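PS: both threading and multiprocessing.pool.ThreadPool have pitfalls. With the former, TensorRT calls cannot be invoked from inside the thread; the latter, if used carelessly, easily leads to an explosion in the number of threads.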
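One common way to end up with too many threads is to create a new pool for every request. The post does not say this is what happened here, so the following is only a minimal sketch of the usual remedy: a single bounded pool created once and shared. `POOL_SIZE`, `fetch_image`, and `fetch_all` are hypothetical names, and the download is simulated with a short sleep.

```python
import threading
import time
from multiprocessing.pool import ThreadPool

POOL_SIZE = 8                    # assumed upper bound on concurrent I/O threads
_pool = ThreadPool(POOL_SIZE)    # created once and shared by every request


def fetch_image(url):
    """Stand-in for a real download: sleeps briefly and returns a fake result."""
    time.sleep(0.01)
    return (url, threading.current_thread().name)


def fetch_all(urls):
    # map() blocks until all URLs are done and reuses the same worker threads,
    # so the thread count never exceeds POOL_SIZE regardless of len(urls).
    return _pool.map(fetch_image, urls)


if __name__ == "__main__":
    urls = ["http://example.invalid/%d.jpg" % i for i in range(100)]
    results = fetch_all(urls)
    thread_names = {name for _, name in results}
    print(len(results), "images fetched by", len(thread_names), "threads")
```

Because the pool lives at module level, each gunicorn worker process gets its own pool, and the total thread count stays at most the number of workers times `POOL_SIZE`.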