I did something along the lines of his suggestion using OpenGL sparse textures (D3D would call them tiled resources) and persistently/coherently mapped buffers, with disk I/O done through memory-mapped files. It's a rather crude proof of concept for on-demand loading of large textures (I used a 16k x 8k satellite image). I didn't properly detect "page faults", but I had some of the mechanisms implemented (it outputs yellow pixels where page faults would occur).
To make it work fully end-to-end, it would look something like this:
1. Shader samples from a sparse texture, and detects that the requested page is non-resident.
1b. Fall back to lower mip map level.
2. Shader uses atomic-or to write to a "page fault" bitmap (one bit per page)
3. The bitmap is transferred to the CPU
4. For each set bit, start async copy from disk to DMA buffer (ie. pixel buffer object in GL)
5. When disk i/o is complete, start texture upload from buffer to a "page pool" texture
6. When texture upload is complete, re-map the texture page from "page pool" to the actual texture
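For illustration, the CPU side of steps 2-4 might look like this toy simulation (all names invented; in the real thing the shader fills the bitmap with atomic-or on the GPU and the CPU reads it back via a buffer copy):

```python
# Toy stand-in for the page-fault bitmap: one bit per sparse texture page.
PAGE_COUNT = 64
WORD_BITS = 32

def mark_fault(bitmap, page):
    # What the shader's atomic-or does: duplicate faults collapse into one bit.
    bitmap[page // WORD_BITS] |= 1 << (page % WORD_BITS)

def faulting_pages(bitmap):
    # Step 4: scan the read-back bitmap and list pages to fetch from disk.
    pages = []
    for word_idx, word in enumerate(bitmap):
        while word:
            bit = (word & -word).bit_length() - 1  # lowest set bit
            pages.append(word_idx * WORD_BITS + bit)
            word &= word - 1                       # clear that bit
    return pages

bitmap = [0] * (PAGE_COUNT // WORD_BITS)
for page in (3, 3, 40, 7):
    mark_fault(bitmap, page)
print(faulting_pages(bitmap))  # -> [3, 7, 40]
```

Each resulting page index would then kick off an async disk read into a DMA buffer (step 4 onwards).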
Now this approach works alright, but there are a number of issues that make it impractical for the time being. Off the top of my head:
1. Sparse textures are only supported on Nvidia and AMD hardware. Not Intel, ARM or IMG.
2. Requires Vulkan or D3D12 for step #6 (the demo doesn't do this, so there may be pipeline stalls)
3. One or two frames of latency, which could only be avoided if this were done in the kernel mode driver.
4. Poor fit for existing KMD architecture (which has its own concept of residency)
5. Detecting page faults is easy. Detecting which pages can be dropped is hard.
Here's the source code of my demo. It's not pretty because it was a one off demo project for very specific hardware (Android + OpenGL 4.5, which means Nvidia Shield hardware with Maxwell GPUs). The technique is portable, though.
(In the code above, all the interesting bits are the functions named xfer_*)
Based on the experience from writing this demo, I have to agree with Carmack here. File-backed textures would make a lot of sense for a lot of use cases.
Splash screens and loading bars vanish. Everything is just THERE.
I'm not sure I agree with this. It might be more convenient to have a filesystem-like interface, but at the end of the day everything still has to be loaded into the (rather limited) GPU memory at some point.
Most CPU applications can handle RAM swapping from disk, but I really doubt that big games could maintain 60fps if even a few assets needed re-loading.
If a frame is 16ms, and the best consumer SSDs are around 2GB/s, you can only load 33MB of assets in one frame in a best-case scenario.
> If a frame is 16ms, and the best consumer SSDs are around 2GB/s, you can only load 33MB of assets in one frame in a best-case scenario.
33 MB of assets per frame is more than enough for a lot of practical tasks. Modern texture compression achieves 2 bits per pixel (e.g. ASTC 8x8) with very little perceivable loss of quality once you apply texture mapping, lighting, effects, etc. (ASTC 12x12 gets down to 0.89 bits per pixel, but with noticeable loss of quality). At this rate, 33 MB/frame is 16k x 8k worth of texture data, every frame. Vastly more than there are pixels on screen.
And this figure does not take the block cache into account at all. If the data has been used recently (within several minutes, depending on RAM size), a copy of it might still be around in RAM, which would make it nearly instantaneous: almost a gigabyte per frame of data (assuming all memory bandwidth is put to this use).
As for the framerate issues, it would be imperative that this technique is implemented in a stall-free way (which is possible with Vulkan/D3D12) so that a steady 60 fps is maintained. All textures should have at least a few mip map levels resident at all times to fall back on. A single "standard" GPU sparse page (64k) is 512x512 texels at ASTC 8x8, which is quite a large texture already.
In other words: 33 MB per frame is a lot more than what we get today.
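A quick back-of-envelope check of those numbers (plain arithmetic):

```python
BITS_PER_TEXEL = 2  # ASTC 8x8: one 128-bit block per 8x8 texels

# 33 MB/frame vs. a 16k x 8k texture at 2 bits/texel:
texture_bytes = 16384 * 8192 * BITS_PER_TEXEL // 8
print(texture_bytes / 2**20)  # -> 32.0 MiB, i.e. roughly the 33 MB figure

# One "standard" 64k sparse page holds a 512x512 ASTC 8x8 tile:
print(512 * 512 * BITS_PER_TEXEL // 8)  # -> 65536 bytes
```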
The GPU doesn't have DMA access to the disk or RAM; it has to go through the CPU, which causes quite a delay.
Even if you read from RAM during normal operations, you'll have a very low framerate unless you are using pretty damn good latency masking. A good example is texture streaming, where you hold low-res textures in GPU memory and load the higher-res ones from RAM or disk; even on very high end systems it causes a lot of texture pop-ins, which people find rather annoying.
Latency hiding is indeed required, but it's a solvable problem. There's no way that this would work synchronously.
See my other comment above. I implemented a technique like this (on a <10 watt mobile device!) and I had no issues whatsoever with framerate, even with my suboptimal technique (using OpenGL sparse textures, which are quite restrictive), which might stall. These stalls can be avoided with Vulkan/D3D12 techniques (using aliased sparse pages, ie. one physical page mapped on several textures), so maintaining a stable framerate simply isn't an issue.
This was done with the CPU orchestrating the whole deal. The GPU isn't required to have access to disk or initiate the DMA transfer.
Latency, on the other hand, is an issue but it doesn't seem to be that bad in practice (warning: anecdotal evidence). In my simple demo, practically every page uploaded was resident on the GPU by the next frame after it was required. A small number of pages (less than 1%) had 2 frames of latency. None had 3 or more.
Hiding the visual artifacts from texture popping is a real issue too, but can be mitigated by speculatively uploading pages and applying filtering between mip map levels.
All of this can be done today using Vulkan or D3D12. What Carmack is suggesting (if I understood him correctly) is to make the kernel driver on the CPU initiate asynchronous upload without userspace intervention, which would improve latency and bandwidth.
What if the GPU had its own private SSD with textures installed to it when you install the game? A 100GB SSD is about 20% of the price of a high end video card.
Inb4 the next GTX titan comes with 24GB of HBM and an M.2 slot.
I don't think it's necessary; for applications that require that much memory, you can probably develop some proprietary tech to access storage (NVIDIA does it).
JC isn't talking about compute though, he's talking about gaming. He really likes megatextures, but he's in a minority there, and I honestly can't claim to have enough expertise to judge whether he's right or not.
GPU memory management is a mixed bag. There are many cards with different RAM speeds and bandwidths, all doing magic voodoo in the background regarding low-level memory access and compression, way beyond what you are exposed to at the driver and API layers (e.g. NVIDIA's in-card memory compression, which is enabled regardless of how you compress or load your textures in the first place).
WDDM allows lower-end systems to take advantage of more video memory to improve performance in less demanding applications (which are most apps today), while still enabling all the nice graphics we've come to expect from our window managers and applications.
I'm not sure that allowing the GPU to read directly from the SSD or RAM (outside the current scope of virtualized GPU memory and asset loading) actually has enough benefit to justify both the engineering costs and the potential security pitfalls that come with multiple applications all of a sudden getting DMA access to your RAM and local storage. This is even more the case for gaming, considering the amount of RAM that graphics cards come with today, which is only increasing. I don't know if anyone but JC actually wants/needs this.
I am also wondering how consoles handle it. IIRC from playing a bit with the UDK (UE3) at the time, it supported streaming textures directly from the optical media on the Xbox 360, so if anyone knows how it works under the hood and is willing to share, I think it would add to this topic.
That could happen with Intel's Optane SSDs, but I don't think it is necessary. The latency between the CPU and GPU is not measured in any sort of human-perceivable time, so it doesn't matter unless the GPU is making millions of CPU memory accesses, and that isn't how games or GPUs work, for obvious reasons.
The bandwidth is also not a problem, the entire memory of a 16GB GPU can be filled in under a second.
The latency between the GPU and CPU is a very big problem for gaming; if it weren't, we wouldn't be having this discussion. A bad CPU call can easily add 50 ms of latency to your frame, which makes you all of a sudden drop from 60fps to 20.
You are talking about two different things. One is the latency of data from main memory to the gpu memory and the other is whatever video game metric you are using.
To get 60fps your video card needs to push out a frame every 16ms.
If, say, 6ms of this is actual in-card processing and 10ms is CPU/API overhead, and you add another CPU call + system memory access that adds 30-35 ms of latency, you are now operating at ~50ms per frame, which means you can only output 20 frames each second.
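Spelled out (with the illustrative numbers from the example above):

```python
def fps(frame_ms):
    # Frames per second achievable at a given total frame time.
    return 1000.0 / frame_ms

base_ms = 6 + 10        # in-card processing + CPU/API overhead
added_ms = 34           # extra CPU call + system memory access (30-35 ms)
print(fps(base_ms))             # -> 62.5 fps
print(fps(base_ms + added_ms))  # -> 20.0 fps
```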
GPUDirect is very specific: it requires implementation on the PCIe switch (some modification to the IOMMU), in UEFI/BIOS and the card, as well as in the OS kernel and the display drivers.
Because it requires both very specific software and hardware configurations only some features work on some systems.
Also, calling this DMA is kind of misleading; it does require the CUDA software layer to work, and for the most part it's a lot of hacks coupled together to form a feature set.
So I would still stand by what I said: there currently isn't a standard and generic way for GPUs to have DMA access.
GPU can just queue page faults and raise an IRQ. Those faults can then be handled in any size of group on the CPU side. By using a queue, latency can be effectively hidden.
Perhaps disk I/O DMA can even be directly connected to GPU DMA, bypassing system RAM entirely. Even currently, ethernet and disk DMA can bypass memory and go directly to CPU L3 cache. With existing flexibility like that, it's not far-fetched to be able to connect DMA between two devices.
Pretty much the same happens when you touch a page that's not present from user mode. The CPU will get interrupted and will queue disk I/O to get the faulting page.
Even disks themselves have their internal queues (like SATA NCQ). Disks also have CPUs to serve queued I/O requests.
> GPU can just raise IRQ and have a queue for the page faults.
Handling GPU IRQ's is possible but has unnecessarily high latency and pipeline stalling problems. However, the kernel mode drivers do a bit of this behind the scenes (mainly when switching from app to app).
But using sparse textures (ie. GL_EXT_sparse_texture2), the GPU can detect if a fault would occur and react to it. Instead of IRQ'ing for every missed page, a list of all missed pages can be extracted at the end of a frame.
> Perhaps disk I/O DMA can even be connected to GPU DMA, bypassing system RAM entirely.
Afaik Nvidia's NVLink does this but it's aimed at super computers, not graphics.
> Handling GPU IRQ's is possible but has unnecessarily high latency and pipeline stalling problems. However, the kernel mode drivers do a bit of this behind the scenes (mainly when switching from app to app).
Time it takes for GPU to access data present in its local internal GDDR RAM: about 1 microsecond. Time it takes from GPU asserting (pulls up) [1] IRQ until CPU handles IRQ: 5-50 microseconds. Time it takes for CPU to insert I/O request into OS internal queue: unknown, but it's almost certainly a small fraction of a microsecond. Hundreds of faults can be handled with one IRQ request.
So there's not really that much more latency than in a normal CPU page fault. Actually, it's probably less on average, because these faults can be grouped. GPU memory access latency is very high, and GPUs are already built to handle high latency with a massive number of hardware threads.
IRQ can for example be edge triggered by the first GPU page fault. Each GPU page fault doesn't need to trigger CPU side interrupt. That'd be pointless, it'd just cause high CPU load for no benefit.
I think some sort of page fault FIFO is simpler hardware-wise than waiting for some arbitrary event like "end of a frame".
GPU can just keep pushing new faults in a CPU-visible FIFO. That way it's possible to amortize latency and to avoid IRQ storm (=excessive number of IRQ requests).
CPU side can just pull faults from the FIFO at whatever rate I/O system can support.
Fault data from GPU could also include priority, so that more visually important data can be fetched first even if it wasn't encountered first. For example geometry or textures covering major parts of screen could have high priority.
(I've co-designed a HW mechanism for avoiding an excessive number of IRQs without sacrificing latency and designed and implemented a kernel mode driver to support it. Of course GPUs are quite a bit more complicated than that simple case.)
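The fault-FIFO idea above, as a toy simulation (all names invented; a real design would be a ring buffer in CPU-visible memory written by the GPU, with one edge-triggered IRQ when it goes non-empty):

```python
import heapq
from collections import deque

fifo = deque()  # stand-in for the CPU-visible fault FIFO

def gpu_push_fault(page, priority):
    # GPU side: push a (priority, page) record; no per-fault IRQ needed.
    fifo.append((priority, page))

def cpu_drain(batch_size):
    # CPU side: pull a batch at whatever rate the I/O system supports,
    # then service the most visually important pages first.
    batch = [fifo.popleft() for _ in range(min(batch_size, len(fifo)))]
    heapq.heapify(batch)  # min-heap: lower number = higher priority
    return [heapq.heappop(batch)[1] for _ in range(len(batch))]

gpu_push_fault(17, priority=2)  # small background texture
gpu_push_fault(4, priority=0)   # covers a major part of the screen
gpu_push_fault(9, priority=1)
print(cpu_drain(8))  # -> [4, 9, 17]
```

Batching the drain is what amortizes the IRQ cost and avoids the IRQ storm mentioned above.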
How would GPU or GPU drivers know how to issue DMA transfer and know when DMA is ready on another piece of hardware? Those things are specific to a piece of hardware.
Besides there's a small detail called IOMMU that prevents PCI(-e) devices from writing or reading wherever they please.
You can pull assets from RAM somewhat transparently already (if you want to do it well you need to optimize it further, but to some extent NVIDIA and AMD do quite a bit of optimization in the driver too); a shared page table between the CPU and GPU is also already in place under WDDM.
So adding a disk based page file to this won't be so hard, it could more or less work the same way as the page file currently works.
I'm still not sure this is actually needed. GPUs already come with stupid amounts of RAM today; I have 24GB of GDDR5 in my system, and even mid-range cards today will not come with less than 6GB of RAM.
> You can pull assets from RAM somewhat transparently already (if you want to do it well you need to optimize it further, but to some extent NVIDIA and AMD do quite a bit of optimization in the driver too); a shared page table between the CPU and GPU is also already in place under WDDM.
Can a GPU page table entry point to non-present page(s)? Or does it only work for "pinned pages" [1], which cannot be paged out of RAM?
If it doesn't require pinning, then what prevents mmapping assets from disk even today?
If, however, it does require pinning, then John Carmack has a damn good point.
IIRC the driver handles the pinning. There are basically 3 types of video memory under WDDM: Dedicated Video Memory (on-card), Dedicated System Memory (memory that is assigned to the GPU only, usually via BIOS configuration, and is off limits to the OS), and Shared System Memory (a part of the virtual system memory allocated for the GPU). So in theory, if you only use Dedicated Video Memory and Shared System Memory, you could be pulling data from the on-disk page file, but it's not like you can map a specific asset (say a texture file) on disk to your video memory directly.
WDDM also limits the volume and commit sizes based on some "arbitrary" limits that MSFT set out (IIRC it's something like system memory / 2 or something silly like that). There's a bit more silliness: for example, if the limit of the max memory available for graphics is 2GB, you can't commit 3GB, but you can do 3x1GB just fine.
I'm pretty sure that at the moment WDDM / the vendor display driver pins all pages to RAM so they won't end up in the page file. TBH I haven't had a page file on my system for a long, long time; Windows doesn't use SSDs for paging unless it really has to, and considering I haven't been using a system with less than 32GB of RAM for the past 5 years, I never had issues with it.
P.S.
Apologies if I used any terminology incorrectly this is stretching both my knowledge and recollection regarding this subject, it's also more or less limited to how GPU/Graphics are handled within Windows.
I think it can only work through pinning, because it seems to rely on GPU bus mastering for accessing pages that are dedicated for graphics.
So you have 32 GB RAM and say 8 GB GPU RAM. Imagine you have a program that has 100 GB of graphics assets, without a built-in mechanism to guess which subset of assets might be required for the current scene. The program needs about 50 MB in any given frame -- in other words, it has a 50 MB working set.
As an operating system, which assets are you going to keep in memory?
Now imagine you have multiple programs running concurrently, each having 100 GB of assets. Each app has that same 50 MB working set.
How can the system handle this situation efficiently (or at all!) if the pages need to be pinned to RAM?
If you can have true on-demand GPU paging, all of these apps need only swap in their current working set of data. The user would not perceive any delay when switching from app to app. She could even display all of them at the same time without any issues.
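A toy model of that scenario (all numbers and class names invented): three apps, each with a ~50 MB working set out of 100 GB of assets, sharing one demand-paged pool with LRU eviction:

```python
from collections import OrderedDict

PAGE = 64 * 1024  # one sparse page

class DemandPagedPool:
    def __init__(self, budget_bytes):
        self.budget = budget_bytes
        self.resident = OrderedDict()  # page id -> True, in LRU order

    def touch(self, page_id):
        if page_id in self.resident:
            self.resident.move_to_end(page_id)
            return "hit"
        if len(self.resident) * PAGE >= self.budget:
            self.resident.popitem(last=False)  # evict least recently used
        self.resident[page_id] = True          # demand-load from disk
        return "fault"

pool = DemandPagedPool(8 * 2**30)  # 8 GB of GPU RAM
for app in range(3):               # three concurrent apps
    for page in range(800):        # ~50 MB working set each (800 x 64 KB)
        pool.touch((app, page))
print(len(pool.resident) * PAGE // 2**20)  # -> 150 (MB resident, not 300 GB)
```

With pinning, every app's whole asset set would have to fit in RAM; with demand paging, only the combined working sets do.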
"In the event that video memory allocation is required, and both video memory and system memory are full, the WDDM and the overall virtual memory system will then turn to disk for video memory surfaces. This is an extremely unusual case, and the performance would suffer dearly in that case, but the point is that the system is sufficiently robust to allow this to occur and for the application to reliably continue."
The only issue here is that, as far as I can understand, you can't really choose how this is done very specifically.
When you allocate memory, the driver pretty much takes over. If you want granular control over memory allocation, you pretty much have to go through the GPUMMU path, which means each process has a separate GPU and CPU address space. So while you can control what you store in GPU memory and what you store in system memory, I still don't see a way to map an asset to disk specifically, other than the edge case of running out of system memory, which results in the page file being used.
I guess it depends on if the IOMMU has a fault interrupt that can be acted upon. If it does, then it can probably be done. However, I'm not sure if the OS will handle this or not.
When the OS runs out of RAM, it can swap pages to disk. If this page also has a mapping in an IOMMU, it can invalidate the mapping there as well.
Then, when the device attempts to touch the page, the IOMMU faults and the CPU would swap the page back in.
I'm not sure if this is possible, but this is the route I'd expect something like this to follow.
> I guess it depends on if the IOMMU has a fault interrupt that can be acted upon. If it does, then it can probably be done. However, I'm not sure if the OS will handle this or not.
Interesting idea. I'd also like to know if IOMMU faults can be acted upon. It might require protocol support between the bus and hardware device (GPU) [1], to tell it the page is not currently present. And a way to tell GPU once the page is available.
As far as I can see this kind of mechanism would require one CPU interrupt per GPU fault. That might be too inefficient.
[1]: Edit: It's indeed possible to handle, if the device supports "PCI-SIG PCIe Address Translation Services (ATS) Page Request Interface (PRI) extension".
> Can GPU page table entry point to non-present page(s)?
Yes, new GPUs allow you to do this. This feature is called sparse textures / buffers in OpenGL (GL_ARB_sparse_texture and GL_EXT_sparse_texture2) and Vulkan (optional feature in Vulkan 1.0 core) or tiled resources in D3D.
This allows you to leave textures (or buffers) partially non-resident (accesses to which are safe but results undefined) and allow you to detect when accessing a non-resident region (EXT_sparse_texture2) so that you can write a fallback path in the shader (lookup lower mip level and/or somehow tell the CPU that the page will be required for the next frame).
The OpenGL extensions are a bit restrictive, but Vulkan/D3D12 allow much greater control (such as sharing pages between textures or repeating the same page inside a texture).
Hardware support for this is not ubiquitous at the moment, but should improve as time goes on.
This feature is somewhat orthogonal to pinned pages and WDDM residency magic (which is afaik more oriented to switching between processes), hopefully it will get more unified in the future.
Of course, something like this would need driver support.
What I was getting at was the fact that there should be nothing physically preventing them from implementing DMA support, so I was wondering why they didn't already support it. I’m not familiar with GPUs, so I assumed that this is how all large transfers between system RAM and the GPU worked.
I can only speak for the Linux context, but the IOMMU isn’t an issue. You can just allocate memory with the DMA API (dma_alloc_coherent) which will automatically populate the IOMMU tables (if required), pin the pages, and return you a PCI bus address as well as a kernel virtual address which both correspond to the same chunk of physical memory. Or, you can map an existing buffer in page by page using the dma_map routines (I forget the names).
Now, you have a shared pool of memory which can be accessed by both devices at the same time. The coherency fabric (if one exists) will handle all synchronization automatically, though this can be a bottleneck sometimes. If the CPU isn’t cache coherent, then the pages get marked as no-cache in the kernel PTEs so that any read from the CPU side pulls straight from memory.
Passing “messages” can be accomplished by an external notification like an interrupt or something.
You can then even map this buffer into a user space program.
I’m sure there are some security concerns with this approach though.
More complicated things like device to device transfers (GPU to and from disk) would have to be arbitrated by the CPU, but I see no reason that the CPU would actually have to do the copy itself. Why couldn’t the CPU just provide the GPU with the PCI bus address of the disk controller which should be written to?
If the GPU wanted to write to the disk, the CPU would initiate a transfer to disk, but before writing the actual data, you pass the destination PCI address to the GPU and let it write the data. Then, the CPU can resume doing whatever it has to do while this happens in the background.
I suspect if anyone had thought through this it would be John Carmack. His game design skills can be called into question. His 3D engine skills are second to none[1].
[1] As an individual. I suspect why we haven't seen as much from him since Quake 3 is that his skills don't scale to a team.
> I suspect why we haven't seen as much from him since Quake 3
Not quite accurate. Since Quake 3 he made unified lighting and shadowing (for Doom 3) and megatextures (for Enemy Territory: Quake Wars and Rage).
I can't really think of any other major paradigm shifts over the last 10-15 years in game graphics engines.
There are lots of little improvements everywhere, but they seem more evolutionary than the revolutionary steps we saw in the 90s (texture mapping, raycasting, BSPs, PVS, lightmaps, sorted-edge rasterization, bilinear filtering, bezier surfaces, shaders, unified lighting/shadowing and beam trees, etc.).
>I can't really think of any other major paradigm shifts over the last 10-15 in game graphics engines.
Deferred shading is a pretty major paradigm shift compared to forward rendering, and it's within the ~10-15 year range (it wasn't practical on pre-MRT hardware, which was shader model 3?), along with plenty of screen space effects like SSAO.
Rage was gorgeous. The game was terrible, but it was incredibly pretty, mostly because of megatextures. Some pundit (TotalBiscuit?) referred to it looking like a moving painting, and in places it really was that breathtaking.
Again, the game was unfortunately a few fun sections interspersed with really unfun things, but that has nothing to do with the tech.
(FWIW I ran it on AMD and it had major problems, and then a year or so later it was perfect. AFAICT they patched most of the issues away on PC)
Personally, I didn't think Rage looked all that - I thought it was muddy and bland (and not in the good, post-apocalyptic way) and graphically as mediocre as the gameplay.
I think for Rage it was an idea ahead of its time, but they're still used in the new Doom, which doesn't seem to have the same issues and has been praised for good optimisation.
I believe the new Doom still uses virtual textures to an extent, but they seem to have given up on the idea of using it to apply unique texture assets to every surface in the game as they did in Rage.
Without globally unique textures Megatexture goes from a paradigm shift to just an implementation detail, unfortunately.
>Let's be honest though, did anyone actually like megatextures?
Phrased this way, this question makes no sense to me. Perhaps you mean specifically the art in Rage. Virtual texture memory strictly increases the capacity for detail, realism, and/or expression.
I meant as a technology in general, given that they pushed it so hard as a killer feature of the game and it ended up being the reason most people couldn't play it.
Unless I'm understanding things the wrong way, he isn't saying it should be the only way, just the standard way. The majority of applications are not performance sensitive, and mmap will be simpler and probably save resources (by leaving untouched assets on the disk).
If profiling shows you're limited by page faults, then preloading by simply touching the memory will have it ready. It simplifies things, which is almost always a good thing.
Wouldn't that work completely naturally with unified memory? You mmap your assets, refer to that from GPU code, the OS would logically load the mapped data to RAM then push a subset of it to the GPU.
I suspect we'll see more of this as 16+GB of RAM becomes more standard, at least for gamers. In the past it just wasn't a case that was worth optimizing for.
I would really like it if some games would take advantage of the gigs of RAM I have just sitting around unused. Why are you thrashing disk loading assets all the time, when there's 12 GB of pristine, untouched RAM available?
Obviously, there's the x86/x64 divide there - hopefully nobody is buying new 32-bit systems anymore, and that limitation can go away - although I'd rather get a x64 version of Visual Studio, but apparently that's not a good idea, because reasons[1]
Similarly, I wish games would be smarter about level loading.
One game I recently played, a racing game, reloads the entire level (3d models, textures etc) when you select "restart race", when really all it had to do was reset a handful of variables (car positions, time, a few other things like that).
If it's not possible to reset these variables, then take a snapshot!
It's especially annoying when a game, e.g., autosaves before a boss battle, and then when you die, instead of just resetting some stats and inventory, you have to reload everything... They already know there's a good chance you will die (that's why they autosave!). If it's too much effort to reload just the bits that need it, then take a snapshot of non-graphics memory at the autosave point and just reload that.
There are too many games out now where you can die very quickly (within seconds) if you're not that good yet, and then have to sit through a multi-minute loading phase over and over... very frustrating.
(obviously this applies only to "reloading", not terminating the game and loading)
Games are one-off code written in memory unsafe languages. They are usually a pile of hacks and bugs. Reset from a known state is a nice, foolproof way to recover from problems and reduce the likelihood of reaching edge cases.
I mean, yeah, when a game does allow bypassing load screens, it is a pretty amazing bonus to gameplay: it's the key to success in games like Super Meat Boy, to constantly maintain flow through failure and difficulty. But it's hard.
If you have that much RAM, there should be no disk thrashing, since your OS will cache it anyway. Unless by "thrashing" you mean memory-to-memory copies and page remappings.
Ish. Right now, games explicitly handle resource loading/unloading from/to RAM in userspace. Using memory-mapped texture files would allow utilizing proven and much faster kernel infrastructure for this.
> If a frame is 16ms, and the best consumer SSDs are around 2GB/s, you can only load 33MB of assets in one frame in a best-case scenario.
I think page faulting can work a lot better for graphics assets (= textures) than for general purpose computing, because the system has special knowledge about data and can avoid processing pauses by (temporarily) using lower fidelity mipmap levels.
1) All of those assets aren't present in the current frame. 33 MB might cover the whole scene anyways.
2) A texture-aware page fault mechanism can transfer lower detail mipmap levels first. A mipmap two detail levels lower is just 1/16th of the data, but is still going to be good enough for those few tens of milliseconds until a better LoD can be loaded.
3) So in 3 frames (~50 ms) you have 100 MB. This probably covers the current scene pretty well.
So everything will just appear to be there. Eyes won't have time to focus in the time it takes to load all full quality assets for the current scene, even if your SSD can load just 500 MB/s.
Remember, with page fault mechanism you don't need to load all assets, just those that are actually visible in the current scene. So initially there's less data to transfer.
But it doesn't change anything. A lower LoD mipmap level is going to be less data with or without compression. Each lower level's data size is just 25% of the previous one.
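The ratio is easy to sanity-check (plain arithmetic):

```python
def mip_bytes(w, h, bytes_per_texel, level):
    # Each mip level halves both dimensions, so data shrinks 4x per level.
    return (w >> level) * (h >> level) * bytes_per_texel

base = mip_bytes(1024, 1024, 4, 0)
print(mip_bytes(1024, 1024, 4, 1) / base)  # -> 0.25 (one level lower)
print(mip_bytes(1024, 1024, 4, 2) / base)  # -> 0.0625 (two levels = 1/16)
```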
Slight caveat, Windows 10 rarely pages out to the disk.[1] I'm not sure if it's possible to ask it to treat your mmaps in this way. Regardless, implementing the synchronization required to pull this off would be a nightmare - especially in Vulkan/DX12. The OS would also need some form of API where you are notified that a mmaped page is faulting. Except for the very top studios (AAA) it would likely be a completely unapproachable API. Still, it would be fascinating to see what the very best could do with it.
> You can only load 33MB of assets in one frame in a best-case scenario.
Carmack indicates that "you could still manually pre-touch media to guarantee residence". Meaning that we're back to loading screens (hopefully shorter ones, though).
1) Hardware doesn't usually support 3-byte pixels. It's pretty safe to assume at least 32bpp, i.e. 4 bytes per pixel.
2) Hardware might only support power-of-two line pitches (width, number of bytes between screen pixel rows). Horizontal resolution would still be 2560, remaining pixels would just be hidden. That way hardware can always rely on bit shifting when computing display addresses and possibly other tricks.
3) Frame buffer might not even be in row-major format in the first place.
So 2560 * 1600 32bpp might just as well be 4096 * 1600 * 4 = 25 MB. Or something else.
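The padding arithmetic above, spelled out (illustrative only; real pitch constraints vary by hardware):

```python
def framebuffer_bytes(pitch_px, height, bytes_per_pixel):
    # pitch is the allocated row width, which may exceed the visible width
    return pitch_px * height * bytes_per_pixel

print(framebuffer_bytes(2560, 1600, 4) / 2**20)  # -> 15.625 MiB unpadded
print(framebuffer_bytes(4096, 1600, 4) / 2**20)  # -> 25.0 MiB, power-of-two pitch
```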
Which is tricky, actually, because real video can come in 24fps or 59.94fps (60/1.001). Your TV can handle this but your computer monitor can't, so you lose frames all the time.
I was facepalming too while reading this. I'd expect Carmack to understand the ugly truth behind mmap: it's not free. It's more than not free, demand-paging actually adds significant cost: TLB flushes aren't performance friendly.
Linux ppl realized this and implemented MAP_POPULATE but once you're doing that you might as well just eagerly populate the normal way.
Mmap() can be a convenient interface when you aren't latency sensitive but otherwise it's not appropriate.
It's just a lot better than the current status quo: loading more or less full scene assets to GPU RAM regardless of whether they're actually present in the current frame or not.
Even if some texture is visible, it might be that only a tiny fraction of some particular mipmap level is actually needed.
TLB flush cost is pretty insignificant here. A bit like accidentally crushing your finger with a sledgehammer and complaining the hammer was a bit cold. Besides, you'd be talking about TLB flush cost on the GPU. Maybe GPUs can just hide all of that latency with a high number of hardware threads, just like they've hidden RAM latency for over 10 years.
mmap() is not the only obvious solution for unused assets. There are other ways with more explicit operational specs. If you design the solution for this problem from first principles, you'll come up with something more appropriate than mmap().
Sure you can figure out if a texture is used in the current visible set (like in a certain part of a game level). But then it starts to get tricky!
Any concrete ideas how to determine which parts (= memory pages) of a texture are actually needed to draw the scene?
If you don't know, you're going to waste a lot of I/O and memory capacity for something you don't need in the first place.
Remember a texture... :
1) ... has multiple mipmap [1] levels. Say you have a 1024x1024 texture. You'll also need 512x512, 256x256, 128x128... etc. versions. Depending on the triangle orientation [2], the GPU might need some spatial areas (not all!) from any of those levels: 1024x1024 for the corner that's near the camera and 128x128 for the faraway portion.
2) ... has a spatial data order. It's not in row-major order (think scanlines), but in some space-filling curve order (like the Z-order curve [3]). This helps both caching AND paging schemes -- texture accesses are spatially local, so nearby texels are very likely to be at nearby memory addresses within the texture.
3) ... is rarely drawn alone. There are often multiple objects using the same texture. Sometimes that's true even when the actual visible portions are completely different, as with texture atlases [4]. This makes it non-trivial to consider points 1 and 2.
4) ... is sometimes huge. An uncompressed 4096x4096 float32 RGBA texture takes 256 MB (4096 * 4096 * 4 * 4) of memory just for the first mipmap level. All (traditional "pyramid" type) mipmap levels together would take about 341 MB.
So how are you going to determine which memory ranges of the texture really need to be in memory?
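The sizes in point 4 and the layout in point 2 are easy to sanity-check in code. This sketch (helper names invented) computes the full pyramid size and shows a textbook Z-order index:

```c
#include <stdint.h>
#include <stddef.h>

/* Total bytes for a full mipmap pyramid down to 1x1, for a square
 * power-of-two texture. Checks the back-of-envelope math above. */
size_t mip_chain_bytes(size_t dim, size_t bytes_per_texel)
{
    size_t total = 0;
    for (;;) {
        total += dim * dim * bytes_per_texel;
        if (dim == 1)
            break;
        dim >>= 1;
    }
    return total;
}

/* Interleave the low 16 bits of x and y into a Z-order (Morton) index,
 * the kind of space-filling layout point 2 describes: texels that are
 * close in 2D end up close in memory. */
uint32_t morton2(uint32_t x, uint32_t y)
{
    uint32_t z = 0;
    for (int i = 0; i < 16; i++) {
        z |= ((x >> i) & 1u) << (2 * i);
        z |= ((y >> i) & 1u) << (2 * i + 1);
    }
    return z;
}
```

For 4096x4096 at 16 bytes per texel, the base level is exactly 256 MiB and the full chain lands on the ~341 MB figure quoted above.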
It doesn't have to be this complicated. It only seems complicated because your solutions are stuck in the context of memory mapping.
A very simple solution is to add another logic layer to GPU processing, like a shader. Call it an "asset shader": it would run each frame and tell the GPU what assets to start preloading. That would give the programmer tight control over asset loading latencies without having to load everything at once.
> and tells the GPU what assets to start preloading, etc.
How does that shader gain information about which portions of the texture are needed at each mipmap level?
Or do you just load the whole texture and consume memory you don't actually need to render the image? It'd perform badly: the unnecessary I/O causes long loading times, and you'd waste large portions of GPU RAM.
Or does your shader try to guess? Do you attempt to reverse engineer exactly how the GPU's trilinear texture sampler operates? Because otherwise you won't know which parts of the asset data are needed. Guess wrong and you get weird artifacts when the GPU samples your texture at a memory location you didn't load. Oops. Rounding differences compared to the hardware texture sampler would almost certainly get you. Not sure if it's still true, but at least in the past different brands/models of GPUs implemented texture sampling slightly differently [1], enough to force you to have a version for many GPU vendors and models.
Or do you disable trilinear sampling and use just one mipmap level you somehow pick? You'd get bad image quality: blur and/or aliasing (like moiré).
Even after considering all that, how are you going to deal with texture atlases?
Your way sounds really complicated, unless you're willing to make rather serious compromises in robustness, quality, or loading-time performance.
The "shader" would take in the current application-specific description of the scene.
This description would in theory be much, much smaller than the corresponding assets required to render the scene. It would be trivial to eagerly bind it to the shader.
The asset shader API could provide information on the current GPU's texture rendering parameters/quirks.
I have been advocating this for many years, but the case gets stronger all the time. Once more unto the breach.
GPUs should be able to have buffer and texture resources directly backed by memory mapped files. Everyone has functional faulting in the GPUs now, right? We just need extensions and OS work.
On startup, applications would read-only mmap their entire asset file and issue a bunch of glBufferMappedDataEXT() / glTexMappedImage2DEXT() or Vulkan equivalent extension calls. Ten seconds of resource loading and creation becomes ten milliseconds.
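A sketch of what that startup path might look like. Note that glTexMappedImage2DEXT is the hypothetical extension name from the post, not a real entry point, and the asset-table layout here is likewise invented for illustration:

```c
#include <stdint.h>
#include <stddef.h>

/* Hypothetical on-disk layout for a packed asset file: a small out-of-line
 * header describing where each texture's texels live within the mapping.
 * Keeping metadata out of line also means the payload could be uploaded
 * conventionally on hardware without file-backed textures. */
typedef struct {
    uint32_t width;
    uint32_t height;
    uint64_t offset;   /* byte offset of texel data within the mapping */
} asset_entry;

/* After mmap()ing the asset file read-only, registration would be one call
 * per resource against the hypothetical extension from the post:
 *
 *   glTexMappedImage2DEXT(GL_TEXTURE_2D, 0, format,
 *                         e->width, e->height,
 *                         asset_texels(map_base, e));
 *
 * No texel data is copied up front; the driver would fault pages in on
 * first use. */
const unsigned char *asset_texels(const unsigned char *map_base,
                                  const asset_entry *e)
{
    return map_base + e->offset;
}
```

The point of the sketch: resource "loading" collapses to pointer arithmetic over one mapping, which is where the ten-seconds-to-ten-milliseconds claim comes from.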
Splash screens and loading bars vanish. Everything is just THERE.
You could switch through a dozen rich media applications with a gig of resources each, and come back to the first one without finding that it had been terminated to clear space for the others – read-only memory mapped files are easy for the OS to purge and reload without input from the applications. This is Metaverse plumbing.
Not that many people give a damn, but asset loading code is a scary attack surface from a security standpoint, and resource management has always been a rich source of bugs.
It will save power. Hopefully these are the magic words. Lots of data gets loaded and never used, and many applications get terminated unnecessarily to clear up GPU memory, forcing them to be reloaded from scratch.
There are many schemes for avoiding the hard stop of a page fault by using a lower detail version of a texture and so on, but it always gets complicated and requires shader changes. I’m suggesting a complete hard stop and wait. GPU designers usually throw up their hands at this point and stop considering it, but this is a big system level win, even if it winds up making some frames run slower on the GPU.
You can actually handle quite a few page faults to an SSD while still holding 60 fps, and you could still manually pre-touch media to guarantee residence, but I suspect it largely won’t be necessary. There might also be little tweaks to be done, like boosting the GPU clock frequency for the remainder of the frame after a page fault, or maybe even the following frame for non-VR applications that triple buffer.
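The "quite a few page faults" claim can be sanity-checked with rough numbers (all of the figures below are assumptions, not from the post): a 60 fps frame is ~16.7 ms, and if the GPU work itself takes ~14 ms, the remaining slack can absorb a number of fully serialized ~100 µs SSD block reads.

```c
/* How many fully serialized page faults fit in the slack of one frame?
 * All inputs are in microseconds; the caller supplies assumed values. */
unsigned faults_per_frame(unsigned frame_us, unsigned gpu_work_us,
                          unsigned fault_us)
{
    if (gpu_work_us >= frame_us)
        return 0;                        /* no slack at all */
    return (frame_us - gpu_work_us) / fault_us;
}
```

Even this worst case (faults never overlapped, never prefetched) leaves room for a couple dozen faults per frame under the assumed numbers; real hardware that overlaps I/O with rendering would do better.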
I imagine an initial implementation of GPU faulting to SSD would be an ugly multi-process communication mess with lots of inefficiency, but the lower limits set by the hardware are pretty exciting, and some storage technologies are evolving in directions that can have extremely low block read latencies.
Unity and Unreal could take advantage of this almost completely under the hood, making it a broadly usable feature. Asset metadata would be out of line, so the mapped data could be loaded conventionally if necessary on unsupported hardware.
A common objection is that there are lots of different tiling / swizzling layouts for uncompressed texture formats, but this could be restricted to just ASTC textures if necessary. I’m a little hesitant to suggest it, but drivers could also reformat texture data after a page fault to optimize a layout, as long as it can be done at something close to the read speed. Specifying a generously large texture tile size / page fault size would give a lot of freedom. Mip map layout is certainly an issue, but we can work it out.
There may be scheduling challenges for high priority tasks like Async Time Warp if a single unit of work can create dozens of page faults. It might be necessary to abort and later re-run a tile / bin that has suffered many page faults if a high priority job needs to run Right Now.
Come on, let's make this happen! Who is going to be the leader? I would love it to happen in the Samsung/Qualcomm Android space so Gear VR could immediately benefit, but it would probably be easiest for Apple to do it, and I would be just fine with that if everyone else chased them in a panic.
"This is a Facebook +Premium article. In order to access this content, you must sign in with a Facebook +Premium account[?]. [?] Facebook +Premium accounts are Facebook accounts where you have also confirmed your identity with your national passport [level1] and have your yearly retina scan access enabled and updated [level2]. As an option you may wish to access the Facebook extra features and free access to all services by enabling [level3] on your account by enabling location services and allowing us to store your GPS position throughout your day. A level 3 account will have access to location based and time based offers, content and Facebook friendship features that will simplify and improve your life. Level 4 access is not yet ready for us to offer you, but it will involve a small chip which you implant into your arm. This will let you get the full and unfettered Facebook experience without the need for any cellphone or other device!"
I'm not quite sure mmap is such a good idea if you're trying to have low-level control over performance. It's weird that Carmack is advocating this, because you can't really guarantee the latency of grabbing any resource if you incur a fault and need to grab it from disk.
He notes that reasonable hardware should have the performance margin to load a reasonable number of pages from a SSD without dropping a frame, which seems a very good plan. Looking forward to actual tests, of course.
Considering that prefetching schemes allow the programmer to spread asset loading evenly over many frames, and cheap rendering approximations can be used in troublesome frames, there should also be enough low-level control.
> reasonable hardware should have the performance margin to load a reasonable number of pages from a SSD without dropping a frame
My disks are usually encrypted though and sometimes I can choose faster or slower encryption methods (thus affecting throughput when loading). I don't see how this can work reliably without forcing the user to reserve specific disk areas just for GPU assets.
Just guessing, but from my reading it sounds like the aim is to maintain generally good frame rates and not worry too much about dropped frames due to page faults, since those will be rare. Presumably the idea is to rely on ATW so that when frames do get dropped, it's imperceptible.
It doesn't mean that everything has to be done like this, only that it would be an available feature. Even then you could touch memory to make sure it is available.
It seems to me that when Apple does something, it quickly becomes "accepted" by consumers (even if it is a technical thing and not consumer facing). This is not always a bad thing for competitors.
I think it's a combination of Apple owning enough of the stack to make it happen, and occasionally Apple's secrecy catching the rest of the industry flat-footed (see the 64-bit ARM transition).
Funny you mentioned ARM! I have some inside stories to tell about that!
At the time it happened, many had already looked into the architecture and realized that to them there were no real benefits: 32-bit ARM could already address 1TB of memory, you could get the accelerated crypto instructions with an architecture extension and 64-bit ARM implementations were not very power efficient (a problem just recently solved by the latest A7x devices).
But when Apple switched to 64-bit ARM, everyone else just had to follow along. This resulted in weaker Android phones for about a year. The funny thing is, the reason Apple switched so early is that they needed a head start, since they use native apps. They didn't really need 64-bit ARM at that time yet.
I think one of the main advantages was to get onto the latest (or newer) ARM instruction set, which I recall did increase performance vs the 32-bit chips/instructions.
I'm definitely no expert though, any inside knowledge on this?
Yes, but you also lost a number of useful instructions (such as LDM). Also, many 3rd party implementations already incorporated at least a good subset of those accelerated instructions. So you could (and people did) create synthetic benchmarks where one architecture was 200% faster than the other.
Now, I am not saying that 64-bit was just fluff. The new design has a much nicer pipeline (especially thanks to the instructions they removed), which is MUCH better suited for things like speculative execution. But the implementation to make use of this wasn't really there until very recently. Here is a fun fact for you: the 64-bit Cortex-A53 and the 32-bit Cortex-A7 are 80% the same CPU. What does that tell you about the first generation of 64-bit devices?
Factor tech debt in: the sooner Apple introduced 64-bit ARM CPUs, the sooner they could drop support for 32-bit devices from the latest iOS (possibly even as soon as this year), and then from Xcode.
It does sound like memory mapped assets would be a great feature. One thing to read (not really an objection, just a commentary that remains relevant) is "On the design of display processors" Myer, Sutherland; Communications of the ACM, 1968 (also called the wheel of reincarnation http://cva.stanford.edu/classes/cs99s/papers/myer-sutherland... ).
I recently upgraded from a 1.5MB L2 cache Athlon II to a 6MB L3 Core i5, and surprisingly, game loading is still just as slow. I guess that copying asset files into RAM doesn't result in a speed-up?
So if I understand the problem right, it's because copying data to the GPU goes through the PCI Express bus, and is done "piece by piece" instead of in larger batches? A little like grouping draw calls? It's funny how that problem can be seen everywhere in hardware, where multiplying queries makes latencies snowball.
I think it has more to do with the fact that the GPU's memory accesses aren't cache coherent with the CPU, so a larger cache doesn't really bring much to the table.
Generally, DMA to/from the GPU is cache coherent (either via bus snooping for cache invalidation, or via software managing the regions used for DMA, e.g. marking the relevant PTEs as no-cache).
So accesses are _coherent_, but the cache is simply irrelevant (or even more costly, if it's using snooping).
Funny, this is only becoming necessary because SSDs are so common and crazy fast (in both latency and throughput). But it's true, we do need proper mmap-to-GPU. This is going to be challenging (and fun).
The GPU => CPU memory bus is a major bottleneck for NVIDIA's growing Deep Neural Net driven adoption.
GPUs churn through data once it is across the bus.
A hierarchy of GPUs, outputs wired to inputs, mirroring the hierarchy of deep nets, would be useful for real-time robots & cars, NVIDIA's other big market.
> A hierarchy of GPUs, outputs wired to inputs, mirroring the hierarchy of deep nets would be useful for real-time robots & cars, NVIDIA's other big market.
Isn't that exactly what they're doing with NVLink?
You just use the PCI bus. GPUDirect gives you DMA access to the other GPUs on the same bus. If they're in PCI express x16 slots, it's relatively speedy.
Totally aside and ranty: Facebook has started (recently?) greeting outsiders with a login dialog that covers the whole page. You can click "not now", but that just permanently lowers the login dialog to cover about the bottom 1/3 of the page.
If you want to make your content publicly available in a free and seamless way, stop using Facebook. Please.
Your rant may be justified but I'm getting really tired of clicking on HN threads I'm interested in and finding the top comment plus a screenful or two of replies addressing some nit somebody wants to pick with the implementation of the site in the article instead of addressing the content of the article.
I'm at that point too. I know that if I click a link, I'm not going to get to read intelligent discussion until much deeper in the page. Take, for example, the article about the LinkedIn acquisition: the top comment has people arguing over the origin of the word "secular"...
How about we solve this problem once and for all by turning this thread into a two-page series of complaints about how HN doesn't allow you to collapse comment threads. :P
I look at the comments to determine whether the article is worth reading. "The article is unreadable" is as useful for me to know as something like "Hm, everyone seems to be talking about a programming language I don't use". So I appreciate complaints about popups and the like being at the top.
(The other very useful comment at the top is "Warning, it autoplays a video")
I think Hacker News should allow tangents, but make them easier to hide. Maybe have a button to collapse everything under the top-level comment somehow. This would also be good data to collect as input into ranking.
I disagree that we should discourage engaging discussion just because it happens to be meta or only tangentially related. This is a user experience problem, not a content problem. More specifically, this is a use case for comment-thread collapsing: when, as a user, you come across a long thread that doesn't interest you, you can collapse it and immediately be at more content, rather than needing to manually find the end of the uninteresting thread.
This allows both for the community to drive the discussion to what most interests the community, and to allow those with different tastes than the majority to still participate without alienating or frustrating them or those involved in said discussion.
There are many browser extensions for HN, and thread collapse is possibly the most common feature in them.
I'm of the opinion that HN doesn't implement features like this on purpose: they expect you to find the workaround that works best for you, which will likely result in a more customized and comfortable experience than having their opinion of how the content should be presented forced on you. HN is somewhat notorious for these soft limitations (such as not showing the reply link on very active threads for a few minutes, which also has an easy workaround for those who know it).
I completely agree that I should be able to easily collapse any sub thread and also see only the comments that have been made since the last time I viewed comments on an article.
And I wish there was some filter in HN that let us ignore upvoted posts that complain about gated sites, design, javascript requirements or how slow Atom is. It's an unfair world.
Another great option is the "Block Element" feature of uBlock Origin which lets you easily create a blocking rule for any element on the page. In Firefox it's found in the right click menu, and you can then visually choose the element to be blocked.
I regularly use it, e.g., for cookie acceptance boxes. They are a major annoyance: since I don't save cookies, the setting for accepting them isn't preserved either, and I have to approve them again and again. If I visit such a page a second time, it's easier to just hide the whole question with uBlock.
Having everything the app needs available to the GPU at all times, without having to explicitly load it from disk to graphics memory beforehand in userland code (and cycle it out as needed). Having it happen automatically means less code needs to be written and debugged in the app/framework, and having it happen in the kernel means it is potentially more efficient (allowing for more complex scenes and/or more detail and/or faster frames on the same hardware).
Don't like the idea. Games that stream assets don't drop frames. Before the I/O is complete, they display lower-quality placeholders. Even a universal placeholder like an amorphous black shadow is better than a game that stops rendering and waits for I/O.
(Source code of the demo: https://github.com/rikusalminen/sparsedemo/blob/master/jni/g...)