WhisperX, along with whisper-diarization, runs at roughly 20x real time on a modern GPU, so for that part you're looking at around $1 per twenty hours of content on a g5.xlarge, not counting time to spin up a node (or about half that at Spot prices, assuming you're much luckier than I am at getting stable Spot instances these days).
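The arithmetic behind that figure, as a quick sketch (the ~$1.01/hr on-demand rate and the 20x speedup are my assumptions, not quoted from AWS):

```python
# Back-of-envelope cost estimate for GPU transcription.
# Assumptions: ~$1.01/hr on-demand g5.xlarge and ~20x real-time speed.
HOURLY_RATE = 1.01   # USD per instance-hour (assumed on-demand price)
SPEEDUP = 20         # transcription runs ~20x real time

def transcription_cost(audio_hours: float, spot_discount: float = 0.0) -> float:
    """Estimated USD to transcribe audio_hours of content, ignoring startup time."""
    gpu_hours = audio_hours / SPEEDUP
    return gpu_hours * HOURLY_RATE * (1 - spot_discount)

print(transcription_cost(20))       # ~$1 for twenty hours of audio
print(transcription_cost(20, 0.5))  # roughly half that at Spot pricing
```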
You can shortcut that spin-up time a bit with a prebaked AMI on AWS, but there's still a delay before a new node can run at full speed, around 10 minutes in my experience.
I haven't looked at this particular solution yet, but I find LLMs hit or miss at summarizing transcripts. Sometimes it's impressive; sometimes it's literally "informal conversation between multiple people about various topics".