Sampling theory says you can reconstruct a band-limited signal by sampling it more than twice per cycle of its highest frequency. So 44.1 kHz is, in theory, an adequate sampling rate for content up to 22.05 kHz.
Unfortunately, to make digital-to-analog conversion work according to theory, you must first construct an ideal analog filter which removes everything above 22.05 kHz while leaving everything below 22.05 kHz unmolested. That's not possible in reality. If 20 kHz is the goal, you have a measly 2.05 kHz in which to make the filter ramp from kill-nothing to kill-everything. I'd imagine real-world CD players with cheap filters probably kill everything above 15 kHz.
In reality you want a lot of headroom between half the sampling frequency and the actual max frequency you want to pass unmolested. Even 48 kHz only grants you a 4 kHz band in which to let the filter roll off.
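To put rough numbers on the roll-off problem, here is a minimal sketch. It assumes a 20 kHz passband edge, a stopband that starts at half the sampling rate, and a 96 dB attenuation target (roughly the dynamic range of 16-bit audio). The function names and the 96 dB figure are illustrative assumptions, not anything from a standard.

```python
import math

def transition_band_hz(sample_rate_hz, passband_edge_hz=20_000.0):
    """Width of the band between the passband edge and half the sample rate."""
    return sample_rate_hz / 2.0 - passband_edge_hz

def required_slope_db_per_octave(sample_rate_hz, passband_edge_hz=20_000.0,
                                 attenuation_db=96.0):
    """Average slope a filter needs to fall by attenuation_db (an assumed
    target) between the passband edge and half the sample rate."""
    octaves = math.log2((sample_rate_hz / 2.0) / passband_edge_hz)
    return attenuation_db / octaves

for rate in (44_100.0, 48_000.0, 96_000.0, 192_000.0):
    print(f"{rate / 1000:6.1f} kHz: "
          f"{transition_band_hz(rate):8.0f} Hz transition band, "
          f"{required_slope_db_per_octave(rate):6.0f} dB/octave")
```

Under these assumptions the required slope at 44.1 kHz works out to several hundred dB per octave, which is far beyond what a simple analog filter delivers.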
Second, significant playback timing jitter can render the LSBs useless. At 44.1 kHz with 16-bit sampling, the max difference between a pair of samples of a 22.05 kHz max-amplitude input is very roughly 2^15. What does that mean? If your jitter is more than 692 ps [1] you have just lost an LSB.
Sure, 24/192 is serious and unnecessary overkill. The advantage is that it has lots of headroom. The disadvantage is that it takes more space. If you were designing a new format today in our era of large hard drives, why wouldn't you waste a bit of space?
This isn't gold-plated monstrous-cable properly-broken-in HDMI snake oil. The current format isn't perfect; an upgrade is a reasonable idea.
[1] The 44.1 kHz period is 22.7 µs; one part in 2^15 of that is 692 ps.
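A quick check of the arithmetic in footnote [1], as a rough sketch. It assumes the sample period is simply 1/44100 s and applies the footnote's one-part-in-2^15 rule of thumb; the function name and its defaults are made up for illustration.

```python
def lsb_jitter_threshold_s(sample_rate_hz=44_100.0, bits=15):
    """Timing error (seconds) equal to one part in 2**bits of a sample period.

    This is the footnote's rule of thumb, not a rigorous jitter model.
    """
    period_s = 1.0 / sample_rate_hz
    return period_s / (2 ** bits)

period_us = 1e6 / 44_100.0
threshold_ps = lsb_jitter_threshold_s() * 1e12
print(f"period = {period_us:.2f} us, threshold = {threshold_ps:.0f} ps")
```

With the period in microseconds rather than milliseconds, the threshold lands a little under a nanosecond.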
The author is an expert, and correct. In the linked video he mentions that the difficulty of creating a sharp analog low-pass filter is in practice completely mitigated by oversampling, which is described in the Wikipedia article on DACs.
Suppose you have a 96 kHz DAC fed by your computer. Surely you see that a computer can solve for the Nyquist reconstruction of some lower-sample-rate recording (e.g. CD audio) at that sample rate, and then (still in digital, at 96 kHz) remove ALL the noise between, say, 21 kHz and the 48 kHz Nyquist limit, at which point a final analog filter will have an easy time leaving 0-21 kHz unmolested while killing all higher frequencies, right? It's practically implied by what you wrote. Per what OP says, that's more or less the effect of any DAC you'll use today.
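The digital step described above can be sketched in a few lines. This is only an illustration under simplifying assumptions: it oversamples by exactly 2x (44.1 kHz to 88.2 kHz, not 96 kHz), uses a short Hann-windowed sinc filter, and runs in pure Python. The function names and the 63-tap length are arbitrary choices, and real DACs use much longer, carefully designed filters.

```python
import math

def windowed_sinc_lowpass(num_taps, cutoff):
    """Hann-windowed sinc low-pass taps.

    cutoff is a fraction of the sample rate (0 < cutoff < 0.5);
    num_taps must be odd so the filter has a centre tap.
    """
    mid = (num_taps - 1) / 2.0
    taps = []
    for n in range(num_taps):
        x = n - mid
        if x == 0:
            ideal = 2.0 * cutoff
        else:
            ideal = math.sin(2.0 * math.pi * cutoff * x) / (math.pi * x)
        window = 0.5 - 0.5 * math.cos(2.0 * math.pi * n / (num_taps - 1))
        taps.append(ideal * window)
    return taps

def oversample_2x(samples, num_taps=63):
    """Double the sample rate: zero-stuff, then low-pass at the old Nyquist."""
    stuffed = []
    for s in samples:
        stuffed.extend((s, 0.0))  # insert a zero after every input sample
    # Cutoff 0.25 of the new rate equals the old Nyquist frequency.
    # The factor 2 restores the amplitude lost to zero-stuffing.
    taps = [2.0 * t for t in windowed_sinc_lowpass(num_taps, cutoff=0.25)]
    delay = (num_taps - 1) // 2  # compensate the filter's group delay
    out = []
    for i in range(len(stuffed)):
        acc = 0.0
        for k, t in enumerate(taps):
            j = i + delay - k
            if 0 <= j < len(stuffed):
                acc += t * stuffed[j]
        out.append(acc)
    return out

# A 1 kHz tone sampled at 44.1 kHz, oversampled to 88.2 kHz.
tone = [math.sin(2.0 * math.pi * 1000.0 * n / 44_100.0) for n in range(200)]
oversampled = oversample_2x(tone)
```

After this step the images of the audio band sit near the new sample rate instead of just above 22.05 kHz, so the analog filter that follows gets a much wider band to roll off in.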
The typical way to get around the requirement for a steep reconstruction filter is to oversample just before the DAC, which even the earliest CD players did. Today most audio DACs are sigma-delta, which is essentially oversampling taken to the extreme. As a result, the reconstruction filter is mostly irrelevant to output audio quality and is there mainly for EMC reasons.
It's kind of funny that typical DIY audiophile-grade DAC designs use an R-2R DAC without oversampling and thus need significantly steeper reconstruction filters.
The overall point is that you don't need to store and, more importantly, transfer the additional data, because it does not contain any useful information.
As for the jitter, I somehow don't believe it is as significant a problem as it is often presented, but in any case its perceived effects should be mostly independent of sample rate and sample size.
Both of the points you bring up are unrelated to the format in which files are stored. Sure, they would be requirements for the format that gets sent to a primitive DAC. But the answer to your problem is basically contained in your post: before the signal is sent to your theoretical DAC component, the signal's rate could be digitally upsampled by a factor of 2 (to avoid artifacts), and dither could be applied to eliminate any jitter problems (with the unnoticeable side effect of raising the noise floor to -66 dB, in the case of 16-bit depth).
Anyway, as dfox mentioned, most sound cards nowadays use sigma-delta DACs, which do not need upsampling as they do not involve a filter in the way you described.
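For anyone unfamiliar with sigma-delta conversion, here is a toy first-order 1-bit modulator. It is a textbook-style sketch, not how any particular sound card works; real parts use higher-order loops and very high oversampling ratios. The function name and the DC test input are illustrative.

```python
def sigma_delta_1bit(samples):
    """First-order sigma-delta modulator.

    Takes samples in [-1.0, 1.0] (already at the oversampled rate) and
    returns a stream of +1.0 / -1.0 values whose local average tracks
    the input.
    """
    integrator = 0.0
    feedback = 0.0
    bits = []
    for x in samples:
        integrator += x - feedback  # accumulate the error so far
        bit = 1.0 if integrator >= 0.0 else -1.0
        bits.append(bit)
        feedback = bit              # feed the quantized value back
    return bits

# A constant input of 0.3 becomes a bitstream that averages to about 0.3.
stream = sigma_delta_1bit([0.3] * 1000)
average = sum(stream) / len(stream)
```

The quantization error ends up pushed toward high frequencies, which is why a gentle analog low-pass after the 1-bit output is enough.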
You are overestimating the jitter problem by a large amount. The first S/PDIF receiver IC I find via Google (CS8416) claims a typical output jitter of 200 ps on the clock output, running from its internal PLL synced to an externally supplied S/PDIF input.
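One way to see what a figure like that means is the usual jitter-limited SNR approximation for a full-scale sine, SNR = -20·log10(2π·f·t_j). The sketch below assumes the 200 ps figure can be treated as RMS jitter, which the quoted number does not actually confirm, and the function name is made up for illustration.

```python
import math

def jitter_limited_snr_db(signal_hz, rms_jitter_s):
    """Approximate SNR ceiling (dB) that clock jitter imposes on a
    full-scale sine at signal_hz, assuming rms_jitter_s is RMS jitter."""
    return -20.0 * math.log10(2.0 * math.pi * signal_hz * rms_jitter_s)

for freq in (1_000.0, 10_000.0, 20_000.0):
    snr = jitter_limited_snr_db(freq, 200e-12)
    print(f"{freq / 1000:5.1f} kHz: {snr:5.1f} dB")
```

Under that assumption the worst case, a full-scale 20 kHz tone, comes out at roughly 92 dB, and the figure improves by about 6 dB for each halving of signal frequency.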