Channel: Recent Questions - Stack Overflow

Where is the Bottleneck for multiple requests using Whisper on Nvidia A100


I want to use Whisper-Large-v3 (speech-to-text) for a real-time application, and I want to process several requests at the same time. My Whisper instance runs on an Nvidia A100 with 80 GB of VRAM. In principle, I would expect to be able to process many requests concurrently, but it seems that the KV matrices can only be accessed again once the first request has finished, so the requests are effectively processed sequentially.

I then used Gunicorn to start my application with several workers, so that 2 independent Whisper instances are loaded.

Now I can process up to 2 requests at the same time, because every instance has its own attention weights. If I load more instances, e.g. 4 Whisper instances, I can still process a maximum of 2 requests at the same time; the other 2 requests wait until the first ones are finished.
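To put a number on "a maximum of 2 requests at the same time", the observation could be measured with a small harness along these lines. This is only a sketch: `fake_request` is a hypothetical stand-in that simulates a server with 2 processing slots, and in a real test it would be replaced by an HTTP call to the Gunicorn service.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-in for the service: only 2 requests run at once,
# mimicking the behaviour described above. Replace with a real HTTP call.
_slots = threading.Semaphore(2)


def fake_request(duration: float = 0.2) -> None:
    with _slots:
        time.sleep(duration)


def measured_concurrency(request_fn, n_requests: int) -> float:
    """Return n_requests * single_latency / wall_time.

    A value near 1 means the requests were served one after another;
    a value near n_requests means they all overlapped.
    """
    # Baseline: latency of one request with nothing else in flight.
    start = time.perf_counter()
    request_fn()
    single = time.perf_counter() - start

    # Fire all requests at once and time the whole burst.
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=n_requests) as pool:
        futures = [pool.submit(request_fn) for _ in range(n_requests)]
        for future in futures:
            future.result()
    wall = time.perf_counter() - start

    return n_requests * single / wall


if __name__ == "__main__":
    print(f"effective concurrency: {measured_concurrency(fake_request, 4):.2f}")
```

On a real GPU the per-request latency usually grows under load, so the number reflects the throughput gain rather than an exact count of slots.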

My question is where the bottleneck is, and whether anyone has an idea for a more optimized implementation that uses the card more effectively for inference. After all, there is enough VRAM.

Implementation method:

  • Transformers library + PyTorch
  • Attention: Flash Attention 2 or SDPA (both deliver pretty much the same results for short audio clips)
  • PyTorch backend
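For reference, a setup like the one listed above is typically loaded roughly as follows. This is a sketch rather than my exact code: the helper names `build_load_kwargs` and `load_whisper` are made up for illustration, and half precision (`torch.float16`) is an assumption, although Flash Attention 2 does require fp16 or bf16.

```python
ALLOWED_ATTN = ("sdpa", "flash_attention_2", "eager")


def build_load_kwargs(attn_implementation: str = "sdpa") -> dict:
    """Validate the attention backend and return kwargs for from_pretrained."""
    if attn_implementation not in ALLOWED_ATTN:
        raise ValueError(f"unknown attention implementation: {attn_implementation!r}")
    return {"attn_implementation": attn_implementation}


def load_whisper(
    model_id: str = "openai/whisper-large-v3",
    attn_implementation: str = "sdpa",
    device: str = "cuda:0",
):
    """Load the model and processor onto one GPU (one copy per worker process)."""
    # Imported lazily so the helper above can be used without a GPU stack.
    import torch
    from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor

    model = AutoModelForSpeechSeq2Seq.from_pretrained(
        model_id,
        torch_dtype=torch.float16,  # assumption: half precision
        **build_load_kwargs(attn_implementation),
    )
    model.to(device)
    model.eval()
    processor = AutoProcessor.from_pretrained(model_id)
    return model, processor
```

Each Gunicorn worker that calls something like `load_whisper()` ends up with its own copy of the weights and its own CUDA context on the same card.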

If anyone has tips for faster inference, or can simply explain technically where the bottleneck is, I would appreciate it.

My guess: the PyTorch backend reserves contiguous memory blocks for the output and attention matrices, which is why unused matrices cannot be accessed to process more requests in parallel. Since there is far too much VRAM for the capacity of the high-bandwidth memory to be the problem, I assume that the matrices are loaded into the on-chip SRAM, which is not large enough to handle multiple requests at the same time?
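As a rough sanity check on the claim that VRAM capacity is not the limit, the footprint can be estimated on the back of an envelope. The model dimensions below are the published Whisper large figures and are assumptions here, not measurements from my setup; fp16 storage and greedy decoding (beam size 1) are assumed as well, and activations and the CUDA context are ignored.

```python
# Assumed model dimensions (published Whisper large configuration).
N_PARAMS = 1_550_000_000   # about 1.55 billion parameters
DECODER_LAYERS = 32
D_MODEL = 1280
MAX_TARGET_TOKENS = 448    # maximum decoder length
ENCODER_POSITIONS = 1500   # encoder outputs for one 30 s window
BYTES_FP16 = 2             # assumption: half precision


def weights_bytes() -> int:
    """Memory for one copy of the weights."""
    return N_PARAMS * BYTES_FP16


def kv_cache_bytes_per_request() -> int:
    """Upper bound on the decoder KV cache for one request.

    Each decoder layer caches keys and values for self-attention
    (up to MAX_TARGET_TOKENS positions) and for cross-attention
    (ENCODER_POSITIONS positions).
    """
    positions = MAX_TARGET_TOKENS + ENCODER_POSITIONS
    per_layer = 2 * positions * D_MODEL * BYTES_FP16  # 2 = keys and values
    return DECODER_LAYERS * per_layer


if __name__ == "__main__":
    print(f"weights:  {weights_bytes() / 1e9:.2f} GB")
    print(f"KV cache: {kv_cache_bytes_per_request() / 1e9:.2f} GB per request")
```

Under these assumptions one instance needs roughly 3.1 GB for weights and at most about 0.32 GB of KV cache per request, so even 4 instances stay far below 80 GB. That fits the observation that capacity is not what limits concurrency.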

Constructive feedback and discussion are welcome.

