Where is the Bottleneck for multiple requests using Whisper on Nvidia A100

I want to use Whisper-Large-v3 (Speech-to-Text) for a real-time application. However, I want to process several requests at the same time. My Whisper instance runs on an Nvidia A100 with 80GB VRAM.In principle, I would assume that I could process many requests at the same time, but that the KV matrices can probably only be accessed again once the first request has been processed. So it processes the requests sequentially, so to speak.

Then I used Gunicorn to start my application with several workers so that 2 independent Whisper instances are loaded.

Now I can process up to 2 requests at the same time, because every instance has his own attention weights. If I load more instances, e.g. 4 Whisper instances, I can still process a maximum of 2 requests at the same time and wait 2 more requests until the others are finished.

My question now consists of the contents, where the bottleneck is and whether someone has a more optimized implementation idea, how I can use the card for the inference more effectively. After all, there is enough VRAM.

Implementation method:

Transformers library + Pytorch
Attention: Flash Attention 2 or SDPA (both deliver pretty much the same results for short audios)
Pytorch backend

If anyone has tips for a faster inference or can simply enlighten me technically where the bottleneck is, that would be good.

My guess:Pytorch backend is blocking contiguous memory blocks for the output and attention matrices, which is why I can't access unused matrices to process more requests in parallel. Since the VRAM is too large for me to have a problem with the high bandwidth memory, I assume that the matrices are loaded into the SRAM, which is not large enough to handle multiple requests at the same time?

Constructive feedback and discussion please.

Where is the Bottleneck for multiple requests using Whisper on Nvidia A100

Trending Articles

Practice Sheet of Right form of verbs for HSC Students

Download: FK ft Shenky – Nakuyewa ”Prod by: Shenky”

How to win at Markstrat (Markstrat Tips and Tricks) – Vodites

Ominde Commission Report and Recommendations – Ominde Report of 1964

Bureau of Internal Revenue: Regional Offices (Directory)

GO 53 on Enhancement of Ex-gratia upto 5 Lakhs Toddy Tappers in Telangana

Cakewalk CA-2A Leveling Amplifier v2.0.1.97 WiN, v2.0.1.96 OSX Incl Keygen

Mp3 Download: Mdu - Kunjenjenjena

How the kill the job , when DTP request running for long hours.

Microsoft Intune から展開しているアプリのアップデートについて

18-year-old girl was beaten for half an hour by two Northampton men in 'an...

Car crash in Dunton Bassett leaves driver in critical condition

Macky 2, Two Others In Road Accident

Application log 00000000000000089514: Could not convert queue DLVST90CLNT

Detroit mafia: D’Anna Brothers agree to plea deal

Delivery block field greyed out using VA02

Muloraki Au

【個人撮影】スマホのプライベート映像♪「中に出さないで///」カラオケ屋での生ハメ撮りが流出ｗ【リベンジポルノ】＠PornHub

BREAKING NEWS: Diamond Platnumz Is Reported Dead After Ghastly Car Accident

FIAT 500 B0111 B0112