Llama.cpp/ggml is uniquely suited to LLMs. The memory requirements are huge, quantization is effective, and token generation is surprisingly serial and bandwidth-bound, making it a good fit for CPUs, and an even better fit for ggml's unique pipelined CPU/GPU inference.
...But Stable Diffusion is not the same. It doesn't quantize as well, the UNet is very compute-intensive, and batched image generation is effective and useful even for single users. It's a better fit for GPUs/iGPUs. Additionally, it massively benefits from the hackability of the Python implementations.
I think ML compilation to executables is the way forward for SD. AITemplate is already blazing fast [1], and TVM Vulkan is very promising if anyone will actually flesh out the demo implementation [2]. And they preserve most of the hackability of the pure PyTorch implementations.
The above project somewhat supports GPUs if you pass the correct GGML compile flags, e.g. `GGML_CUBLAS` at compile time. You get a decent speedup relative to pure C/C++.
On the other hand, this is nice for anyone who wants to play with these networks locally and does not have an Nvidia GPU with 6+ gigabytes of VRAM. I can run this on an old laptop, even if it takes a while.
> On the other hand, this is nice for anyone who wants to play with these networks locally and does not have an Nvidia GPU with 6+ gigabytes of VRAM.
SD 1.x works in stable-diffusion-web-ui (aka, A1111, one of the more popular frontends) with (reportedly) as little as 2GB of VRAM (I’ve personally used 1.5 and 2.1 without any problems with a 4GB card.)
It's about a 20-40% speedup, depending on the GPU, from my tests.
And only very recent builds of torch 2.1 (with dynamic input shapes) work properly, and it still doesn't like certain input changes, or augmentations like ControlNet.
AIT is the most usable compiled implementation I have personally tested, but SHARK (running IREE/MLIR/Vulkan) and torch-mlir are said to be very good.
Hidet is promising but doesn't really work yet. TVM doesn't have a complete implementation outside of the WebGPU demo.
Speaking of CLIP, I'm always troubled that the next CLIP might not get released as both OpenAI and Google are shifting into competition mode. Sad to think there might be a more advanced version of CLIP already but sitting in a secret vault somewhere.
Edit: I'm not referring to a CLIP-2 but any advance on the same level of importance as CLIP.
It really depends on what you're trying to achieve. If you want to build a semantic image search, then a small/base model would be fine; I think bigger models usually leak too much information into the embedding space, making it too difficult to interpret with a simple algorithm like cosine similarity. If you want to condition a generative model, then a bigger model should provide more information about the prompt or the image.
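For the semantic-search case mentioned above, the scoring step really is that simple. A minimal numpy sketch, with toy 3-element vectors standing in for real CLIP embeddings:

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors: 1.0 means same
    direction, 0.0 means orthogonal (unrelated)."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy "embeddings" in place of real CLIP outputs.
query = np.array([1.0, 0.0, 2.0])
match = np.array([2.0, 0.0, 4.0])   # same direction, different magnitude
other = np.array([0.0, 3.0, 0.0])   # orthogonal to the query

print(round(cosine_similarity(query, match), 3))  # 1.0
print(round(cosine_similarity(query, other), 3))  # 0.0
```

In a real search, you would rank a gallery of image embeddings by their cosine similarity to the query's text embedding.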
This is nice, it basically does what I asked for a year ago[0], and at the time pretty much every solution wanted a litany of Python dependencies that I ended up failing to install because it took ages... and then I ran out of disk space.
No, really, this replaces literal gigabytes of disk space with just a 799KB binary. And as a bonus using the Q8_0 format (the one that seems to be the fastest) it also saves ~2.3GB of data too.
That said, it seems to be buggy with anything other than the default 512x512 image size. Some sizes (e.g. 544x544) tend to cause assertion failures, and sizes smaller than 512x512 (which I tried since 512x512 is quite slow on my PC) sometimes generate garbage (anything smaller than 384x384 seems to always do that).
> sizes smaller than 512x512 (which i tried as 512x512 is quite slow on my PC) sometimes generate garbage (anything smaller than 384x384 seems to always do that).
Not sure about the speed, but the garbage output might be due to the model instead of the library. I’ve always got garbage (using other tools) when I tried 256x256.
IME, the SD 1.5 base model, IIRC (I don't use the base model much), works tolerably well with the larger dimension at 512 and the smaller dimension between 384-512, and quite a lot of SD 1.5-based checkpoints work pretty well up to a larger dimension of 768 (some with the smaller dimension up to that, too).
But SD also has a hard requirement that the output dimensions be divisible by 8 (a consequence of the pixel-to-latent ratio being 8:1 in each dimension). Note that 544 is itself a multiple of 8, though, so the hard failure at 544x544 likely comes from a stricter constraint in this particular implementation; the UNet downsamples the latent further, so many implementations effectively want multiples of 64.
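A rough sketch of those constraints. The 8:1 pixel-to-latent ratio is standard for SD 1.x; the multiple-of-64 heuristic is an assumption about this implementation, not taken from its source:

```python
def check_sd_dims(width, height):
    """Latent shape and dimension checks for SD-style models.

    The 8:1 pixel-to-latent ratio is standard for SD 1.x; the multiple-of-64
    rule (latent itself divisible by 8 for the UNet's downsampling) is an
    assumption, not read from stable-diffusion.cpp's code.
    """
    latent = (width // 8, height // 8)
    ok_latent = width % 8 == 0 and height % 8 == 0
    ok_unet = width % 64 == 0 and height % 64 == 0
    return latent, ok_latent, ok_unet

print(check_sd_dims(512, 512))  # ((64, 64), True, True)
print(check_sd_dims(544, 544))  # ((68, 68), True, False)
```

544x544 yields a valid latent grid but fails the stricter check, which would be consistent with the assertion failures reported above.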
I did a quick run under a profiler, and on my AVX2 laptop the slowest part (>50%) was matrix multiplication (sgemm).
In the current version of GGML, if OpenBLAS is enabled, matrices are converted to FP32 before running sgemm.
If OpenBLAS is disabled, on AVX2 platforms FP16 is converted to FP32 on every FMA operation, which is even worse (due to the repetition). In that case, both ggml_vec_dot_f16 and ggml_vec_dot_f32 take first place in the profiler.
But I agree that, in theory, AVX-512 BF16 (not exactly FP16, but similar) will be fast with the VDPBF16PS instruction. The implementation is not there yet.
I saw some discussion on llama.cpp that, theoretically, implementing matmul directly for each quantization format should be much faster, since it can skip the conversion. But in practice it's quite difficult to win, since the various BLAS libraries are so good.
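To make the quantization being discussed concrete, here is a minimal numpy sketch in the spirit of ggml's Q8_0 (32-element blocks, one scale per block, int8 values). It illustrates the idea only and does not match ggml's actual memory layout:

```python
import numpy as np

BLOCK = 32  # ggml's Q8_0 quantizes weights in blocks of 32

def q8_0_quantize(w):
    """Per-block absmax quantization to int8, one fp scale per block.
    A numpy sketch of the idea, not ggml's actual data layout."""
    w = w.reshape(-1, BLOCK)
    scale = np.abs(w).max(axis=1, keepdims=True) / 127.0
    q = np.round(w / np.where(scale == 0, 1, scale)).astype(np.int8)
    return q, scale

def q8_0_dequantize(q, scale):
    """Recover approximate fp32 weights from int8 values and per-block scales."""
    return (q.astype(np.float32) * scale).reshape(-1)

rng = np.random.default_rng(0)
w = rng.standard_normal(64).astype(np.float32)
q, s = q8_0_quantize(w)
err = np.abs(q8_0_dequantize(q, s) - w).max()
print(f"max abs error: {err:.4f}")  # bounded by half a quantization step per block
```

A dequantize-free matmul would multiply the int8 values directly and fold the per-block scales into the accumulation, which is exactly the conversion-skipping idea mentioned above.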
There's just something special to these C/C++ implementations of AI stuff. They feel so clean and straightforward and make the entire field of AI feel tangible and learnable.
Nice to see ML folks getting weaned off of Python and using a language that can optimally exploit the underlying hardware and doesn't require setting up a specialized environment to build and run.
That's a rather odd comparison to make. First of all, OP, like llama.cpp, doesn't use the GPU – in contrast to most Python ML code. It's not hard to write Python code that "optimally exploits" the GPU. You might call the GPU a "specialized environment to build and run" but it's arguably much better suited to the problem.
Second, OP, like llama.cpp, produced efficient and highly specialized code after it was clear the model being specialized for (StableDiffusion / LLaMa / …) works well. Where Python shines, though, is the prototyping phase when you have yet to find an appropriate model. We have yet to see this sort of easy & convenient prototyping in C++.
Now, this is not to take away anything from the fantastic work that's being done by the llama.cpp people (to whom I also count OP) in the "ML on a CPU" space. But the problems being solved are entirely different.
> Where Python shines, though, is the prototyping phase when you have yet to find an appropriate model. We have yet to see this sort of easy & convenient prototyping in C++.
+1.
Producing a highly optimized C/C++ kernel that utilizes the CPU to the fullest extent requires a tremendous amount of talent and expertise. For example, not everyone can write a hand-vectorized kernel with AVX2 intrinsics (outside a few specialized applications like 3D graphics, media encoding, and the like), and even fewer people can exploit the underlying features of the algorithm for optimization, such as producing usable output at greatly reduced numerical precision. The appeal of LLMs provides strong motivation, driving the brainpower of countless programmers all over the world to do just that. New techniques are proposed and implemented on a monthly basis, with people thinking up and applying every possible trick to the LLM optimization problem. In this regard, moving from Python to C is totally reasonable.
In comparison, right now I'm working on optimizing a niche open-source scientific simulation kernel with a naive C codebase. Before me, there were hardly any contributors in the last decade.
Python has its place because not everyone has a level of resources and expertise comparable to ML. In particular, when the bulk of the data processing in a Python script is done in a function call to a C++ or FORTRAN kernel like scipy, the differences between naive C and naive Python code (or Julia code, if you're following the trend) are not that large, especially when it's a one-off project for publishing a single paper.
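A tiny illustration of that point: the same reduction computed once by the interpreter, element by element, and once by a single call into numpy's compiled kernel (BLAS under the hood for float dot products). The results are identical; only where the time is spent differs.

```python
import numpy as np

# Pure-Python loop: the interpreter executes every multiply and add.
x = list(range(100_000))
y = list(range(100_000))
loop_dot = sum(a * b for a, b in zip(x, y))

# One call into numpy's compiled kernel does the same work in C.
blas_dot = int(np.dot(np.array(x, dtype=np.int64), np.array(y, dtype=np.int64)))

print(loop_dot == blas_dot)  # True
```

Once the heavy lifting is inside the kernel, the glue language barely matters; the gap only shows up when you write the inner loop in Python itself.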
Yeah, I make a living in the GPU space. I think my comment comes from colleagues having to hold my hand to set up their ML/Python environments with all of their peccadilloes. In fact, it's bad enough that I have to use Docker to create an insular environment tailored to their specific setup. And Python is like a thousand times slower when it's not using other libs like numpy.
Everyone has their own way to do this. Every step is broken by some unfamiliar dependency that requires special arcane knowledge to fix. Part of me is a grumpy old man that doesn’t gravitate to the shiny new tools that come out every week that the younger devs keep up with :)
pip and venv are neither shiny nor new; they've been the standard way of doing things for a while. I'm an outsider to Python and am incredibly thankful for this standardization, because I agree that getting a Python env set up correctly before venv was a huge pain.
If your guys aren't on this, I'd suggest you get them on it; it dramatically simplifies setup.
Here is a tiny excerpt from trying to get dvc to work just so I could get the training weights for deployment... remember, I don't develop much with Python...
$ dvc pull
Command 'dvc' not found, but can be installed with:
sudo snap install dvc
$ sudo snap install dvc
error: This revision of snap "dvc" was published using classic confinement and thus may perform
arbitrary system changes outside of the security sandbox that snaps are usually confined to,
which may put your system at risk.
If you understand and want to proceed repeat the command including --classic.
ok I get dvc installed somehow -- don't remember. Time to get the weights...
$ python3 -m dvc pull
ERROR: unexpected error - Forbidden: An error occurred (403) when calling the HeadObject operation: Forbidden
Having any troubles? Hit us up at https://dvc.org/support, we are always happy to help!
Finally I just have my colleague manually copy the weights. This kind of thing went for hours.
Thanks… I know my colleague uses it a lot. I generally use his models and don't do much ML development yet. At some point I need to properly learn all of this. It seems ML tools are only for developers, not for those who simply want to deploy and use the resulting NN.
> Are they not using venvs or something? It should be as simple as python -m venv venv; source venv/bin/activate; pip install -r requirements.txt
In most cases, it would be possible to do close to that, but it is extremely common for things in the AI/ML space to ship with install instructions that don't include it: they instruct you to have a global install of a certain Python version and then to pip install the dependencies (and to globally install non-Python package dependencies, if there are any). So even if they'd work in a venv, you have to (1) independently know you should be doing that, and (2) translate the instructions, which where (1) applies is usually trivial if all the dependencies are proper Python packages, but can be more involved otherwise.
So, yeah, I can see that a lot of the time the path of least resistance is just to create an isolated container environment for it.
Unfortunately it's not that simple, especially for the NVIDIA driver and CUDA install. That's why we usually use conda, which can handle the CUDA install, but even with that, sometimes it works flawlessly and sometimes it doesn't.
>You might call the GPU a "specialized environment to build and run" but it's arguably much better suited to the problem.
I feel like the person you're replying to knows that the GPU is better suited than the CPU to do this task, and your argument doesn't really make sense. I think they were referring to the python venv environment with all the library dependencies as the "specialized environment"
The point is that, as awesome as this repo is, it doesn't do much to wean the "ML folks" off of Python, since it doesn't provide the flexibility and GPU support that people designing and training DL systems rely on.
I don't disagree that Python environments are a mess. I'm actually a developer on quite a prominent large scale neural network training library and a DL researcher that uses said library. With my developer hat on I like to have minimal dependencies and keep Python scripting as decoupled as possible from the CUDA C++ implementation. With my researcher hat on I don't want to be slowed down by C++ development every time I want to change my model or training pipeline. At least for me, C++ development is slower and more error prone than modifying Python.
Obviously doing any heavy lifting in Python is a bad idea. But as a scripting language I think it's good, especially if you keep the environment simple. I don't think the answer for DL training is to dump Python entirely and start over in pure C/C++/Rust/Julia/whatever. Learning C/C++ is too big of an ask for everyone working on the model design and training side and it would slow down progress significantly - most of that work is actually data munging and targeted model tweaks. But I do think there's still a lot that can be done to decouple Python from the underlying engine and yield networks where inference can be run in a minimal dependency environment. There's lots of great people working on all these things.
>That's a rather odd comparison to make. First of all, OP, like llama.cpp, doesn't use the GPU
When was the last time you looked at llama.cpp? It has supported GPU, GPU+CPU, and distributed inference using OpenMPI for a while now. It also supports training, as well as negative prompting and grammars! The ease of getting llama.cpp running on just about anything has already spurred innovation.
Not sure what "It's not hard to write Python code that 'optimally exploits' the GPU" exactly means, but Python is so far from exploiting GPU resources, even with C/C++ bindings, that it's not even funny. I'm sure the HPC folks would have migrated away from FORTRAN and C/C++ a long time ago if it were that easy.
I wasn't trying to claim that Python is great at fully exploiting GPU resources on generic GPU tasks. But in ML applications it often does, at least in my experience.
Yup. I would much prefer if every ML model had a simple C inference API that could be called directly from pretty much any language on any platform, without a mess of dependencies and environment setup.
ML is such a beautiful and perfect setup for dependency free execution too. It should just be like downloading a mathematical function. I'm glad we're finally embracing that.
It's not like any performance significant component of the ML stack is actually implemented in Python.
Everything is and has always been cuda, c or c++ under the hood.
Python is just the extremely effective glue binding it all together.
Sometimes implementations will spend a little too much time in Python interpretation, but yeah, its largely lower level code.
The problem with PyTorch specifically is that (without Triton compilation) pretty much all projects run in eager mode. That's fine for experimentation and demonstrations in papers, but it's crazy that it's used so much in production without any compilation. It would be like shipping debug C binaries to production, and it only achieves any kind of sane performance on a single vendor's hardware.
I really appreciate the people doing this work. It's the only way I've run these models without any headaches. The difference is so stark, even with CUDA and Linux it's bad, with AMD and Windows it's miserable. I'm pretty sure it's not just me..
It’s interesting to me that my CPU can run some of these things in quantized form almost as fast as the GPU. Has the whole thing been all about memory bandwidth all along?
In addition to compute the GPU architecture is one that somewhat colocates working memory alongside compute. Units have local memories that sync with global memory. Is that a big part of why GPUs are so good for this?
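A back-of-envelope sketch of why memory bandwidth dominates: each generated token streams (roughly) every weight through the memory hierarchy once, so bandwidth caps token throughput regardless of compute. The numbers below are illustrative assumptions, not measurements:

```python
def tokens_per_sec_upper_bound(model_bytes, bandwidth_bytes_per_sec):
    """Crude ceiling on autoregressive decoding speed: every weight is read
    once per token, so throughput <= bandwidth / model size."""
    return bandwidth_bytes_per_sec / model_bytes

# Illustrative figures (assumptions): a 7B-parameter model at 4 bits is
# ~3.5 GB of weights; dual-channel DDR5 is ~60 GB/s; high-end GPU HBM ~900 GB/s.
model = 3.5e9
cpu_ddr5 = 60e9
gpu_hbm = 900e9

print(f"CPU ceiling: ~{tokens_per_sec_upper_bound(model, cpu_ddr5):.0f} tok/s")
print(f"GPU ceiling: ~{tokens_per_sec_upper_bound(model, gpu_hbm):.0f} tok/s")
```

Under this model, a quantized network on a CPU lands within the same order of magnitude as a GPU whenever the bandwidth gap, not the compute gap, is the binding constraint, which matches the observation above.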
Yeah, and since C++17 the language is already quite productive for scripting-like workflows; the missing piece of the puzzle is that there are too few C++ REPLs around, ROOT/CINT being one of the few well-known ones.
Since when does C++ optimally exploit the underlying hardware? It has no vector instructions, does not run on the GPU, and is arguably too hard to make multithreaded. Which leaves you with about 0.5% of the performance of a current PC.
Vector types / instructions would be nice. The C++20 STL algorithms are very friendly to vectorization, with the various parallel policies (e.g. std::execution::unsequenced_policy) opening your code up to being vectorized. Wonderful libs like Eigen handle a lot of my numeric needs for linear algebra. I think you are forgetting that CUDA is C/C++.
Those are traditionally dangerous since they tend to compile poorly; not as bad as autovectorization but not as good as just writing in assembly. And since vectorization is platform-dependent anyway (because it's so different across platforms), assembly really isn't nearly as bad as it sounds.
Though it's certainly gotten better, the reason people push those is that they're written by compiler authors, who don't want to hear that their compiler doesn't work.
Some of the reason for this is that C doesn't let you specify memory aliasing as precisely as you want to. Fortran is better about this.
ABC, the predecessor from which Python took many syntax features, was. I wonder if Python also took a lot of the ABC implementation, given that it is still copyright CWI.
I agree that its popularity is very odd, but academics take what they are given when attending fully paid conferences (aka vacations).
>I think it would be very confusing for a child to start with a language so far away from low-level logic.
Why?
I started with C++, and when they showed me C# I instantly fell in love because I didn't have to deal with unnecessary complexity and annoyances and could focus on pure programming, algorithms, etc.
I loved BASIC as a kid. C was really confusing because I was constantly committing egregious memory errors without realizing it, the spooky effects were totally mystifying to me, and I didn’t even have the reasoning tools to know where to start with debugging it. It was definitely distracting from simple logic.
On the other hand, it made me think a lot about computer memory, and it makes computer memory easy to work with. Now I’m really comfortable with memory and encodings so that’s nice. I don’t think I would have gotten that by starting with Java or Python.
> I think it would be very confusing for a child to start with a language so far away from low-level logic.
Depending on the person.
For some, it would be very frustrating to start with a language so close to the implementation detail, and so far away from what you want to do. It's very possible that someone might have long lost the motivation before one can do anything non-trivial.
I started from Python, to C, to assembly, to 4-layer circuit boards. Whenever I went a level deeper, it felt like opening up the inner workings of a black box that I normally only interact with through the pushbuttons on its front panel, but am otherwise roughly aware of what it does.
On the other hand, much of my childhood was spent on tinkering with PCs and servers, including hosting websites and compiling packages from source, so I was already well aware of the basic concepts in computing before I started programming. So, top-down and bottom-up are both absolutely workable, under the right circumstances.
This is a Ycombinator site, the traditional term is Blub.
And they're right. Python is not a well designed programming language - it has exceptions and doesn't have value types so that's two strikes against it.
As long as we are language trolling: why would anyone start a greenfield project like this in C++ these days? The Android, Windows, Firefox, and now Chrome projects have all begun to shift toward Rust, and in the case of Android and Firefox, significant amounts of the project are now written in Rust. Migrating an existing project like that is difficult; the Chrome team in particular lamented the difficulty. But starting a new project? If you have a team familiar with performant C++, the speed bump of starting a greenfield project in Rust is negligible, and the ergonomic improvements in the build system and the language itself will make up for it in any project that takes more than a few months. For that speed bump you get memory and thread-race safety, far better than any stack of C++ analysis tools could ever provide with a tiny fraction of the unit tests you'd write in C++. And you lose no performance.
Rust is great when you know what you're building. That qualifier encompasses quite a lot of the software space, but not all of it, and I would argue not even the majority of it.
If you don't know what you are doing, if you are exploring ideas, Rust will just get in the way. At some point you will end up realizing you need to adjust lifetimes, and that will require you to touch a non-trivial amount of your code base. If you need to do that multiple times, friction will overwhelm your desire to code.
I have a pet theory that, the people that find Rust intuitive and fun, are the people that are working on well beaten paths; Rust is almost boring at doing that, which is a good thing. And the people that find Rust gets in their way are the people that like to experiment with their solutions, because there aren't any set, trusted solutions within their problem space, and even if there are, they like to approach the problem on their own, for better or worse.
In any case:
> why would anyone start a greenfield project like this in C++ these days?
The video game industry can single-handedly carry C++ on their back, kicking and screaming, if need be. Rust is uniquely unfit to write gameplay code due to game development's iterative nature. Using scripting languages doesn't cut it either, because often, slower designer made scripts will need to be converted to C++ by a programmer, and pull in the crazy reference hell of the game state into the C++ land.
I would say Rust is OK for engine-level features; those don't change that often, and the requirements are usually well understood. But that introduces a cadence mismatch between different systems too, so there is a cost there as well. But for gameplay? There's a reason why many Rust-based game engines use a crazy amount of unsafe Rust to make their ECS. Just not a good fit.
And of course, there are the consoles, where Sony seems to have a political reason for not supporting Rust for non-1st-party studios. I have no idea what they are thinking, honestly.
MS has started implementing pieces of windows in rust. If you have windows 11 you are running rust. The cuda bindings are good for ml, but missing for cufft and similar. There are people working on better cuda support, but there are even more people working on vendor agnostic gpgpu using spirv and webgpu. It isn't there yet. Right now you are mostly left to your own bindings unless you are doing ml or blas.
Edit: I can't argue about the drama part. The competing compilers will get there: a couple of GCC frontends are in the works, and Cranelift as a competing backend to LLVM for full self-hosting. There is also mrustc, I guess, to emit C? People use that to get Rust on the C64 or other niche processors.
> why would anyone start a greenfield project like this in C++ these days?
TLDR: quite often, using C++ instead of Rust saves software development costs.
Some software needs to consume many external APIs. Examples on Windows: Direct3D, Direct2D, DirectWrite, Media Foundation. Examples on Linux: V4L2, ALSA, DRM/KMS, GLES. These things are huge in terms of API surface. Choose Rust, and you're going to need to write and support a non-trivial amount of boilerplate code for the interop. Choose C++ (on Linux, C is good too) and that code is gone; you only need the well-documented and well-supported APIs supplied by the OS vendors.
Similarly, some software needs to integrate with other systems or libraries written in C or C++. An example often relevant to HPC applications is Eigen. Another related thing, game console SDKs, and game engines, don’t support Rust.
For the project being discussed here, GGML, the implementation needs vector intrinsics for optimal performance. Technically Rust has the support, but in practice Intel and ARM only support them for C and C++. And beyond the CPU vendors, when using C or C++ there are useful relevant resources: articles, blogs, and Stack Overflow. These things help a lot in practice. I don't program Rust, but I do program C# in addition to C++; technically most vector intrinsics are available in the current version of C#, but they are much harder to use from C# for this very reason.
All current C and C++ compilers support OpenMP for parallelism. While not a silver bullet, and not available on all platforms supported by C or C++, some software benefits tremendously from that thing.
Finally, it’s easier to find good C++ developers, compared to good Rust developers.
There are existing supported bindings for Direct3D from MS, as they themselves are migrating. GLES and GGML also have supported bindings. I like nalgebra + rustfft better than Eigen now. Nalgebra still isn't quite as performant on small matrices until const generic eval stabilizes, but it is close enough for 6x6 stuff that it is in the noise. Rustfft is faster than FFTW, even. Rust has intrinsic support on par with clang and gcc, and the autovectorizer uses whatever LLVM knows about, so again equivalent to clang.
On the last point, I will again assert that a good C++ developer is just a good Rust developer minus a month of ramp-up, which you'll get back from not having to fight combinations of automake, cmake, conan, vcpkg, meson, bazel and hunter.
I'll be very surprised if MS will ever support rust bindings for media foundation. That thing is COM based, requires users to implement COM interfaces not just consume, and is heavily multithreaded.
About SIMD, automatic vectorizers are very limited. I was talking about manually vectorized stuff with intrinsics.
I've been programming C++ for living for decades now. Tried to learn rust but failed. I have an impression the language is extremely hard to use.
Yes, Rust directly supports modern intrinsics; that is what rustfft, for instance, uses. I try to stick with autovectorization myself, because my needs are simple enough that a couple of tweaks usually get me close to hand-rolled speedups on both AVX-512 and AArch64. But for more complicated stuff, yeah, Rust seems to be keeping up. Some intrinsics are still only in nightly, but plenty of major projects use nightly for production; it is quite stable, and with a good pipeline you'll be fine.
I've written c++ since ~94, and mostly c++17 since it came out. About a quarter of a century of that getting paid for it. I never liked or used exceptions or rtti, and generally used functional style except for preallocation of memory for performance. I think those habits might have made the transition a little easier, but the people on my team who had used a more OOP style and full c++ don't seem to have adapted much more slowly if at all. I struggled for years to internalize rust at home until I just jumped in at work by declaring the project I lead would be in rust. I have had absolutely no regrets. It really isn't as bad a learning curve as c++. But we learned c++ one revision at a time. Also, much like c++ rust has bits you mostly only need to know for writing libraries. So getting started you can put those things to the side for a bit right at first.
C# also directly supports them, but it doesn’t help with usability. The support alone is not enough, the API needs to match the C compiler extensions defined decades ago by Intel and ARM.
I have minimal experience with Rust. OTOH, programming C++ for living since 2000, with a few gaps when I used other languages like Obj-C and C#.
I agree C++ is very hard to learn if you only have experience with higher-level languages like Python and Scala. I think there’re two reasons for that.
C++ is unsafe. There’s no way around this one, it was designed that way, like C or assembly. Still, with modern toolset it’s not terribly bad. Compilers print warnings, BTW I typically ask them to treat warnings as errors to deliberately fail the build. On Windows, a combination of debug build, debug C runtime, and visual studio debugger helps tremendously. Linux compilers have these sanitizers (address, memory, thread, undefined behavior) which are comparable, they too sacrifice runtime speed for diagnostics and debuggability.
Another reason, the language itself is very complicated, especially the templates. However, just because something is in the language doesn’t mean it’s a good idea to use it. You don’t need to be familiar with that stuff unless doing something very advanced, like customizing the Eigen C++ library. Don’t follow the patterns found in the standard library: unlike your code, that library has good reasons to use that template BS. If instead of templates you do something else, C++ becomes much easier to use, and most importantly other people will still be able to read and understand your code. Another reason to avoid excessive template metaprogramming, it slows down the compiler, because template-heavy code often needs to be in headers as opposed to cpp files.
P.S. If you don’t need extreme levels of performance (defined as “approach the numbers listed in CPU specs”, the numbers are FLOPS or memory bandwidth), and you don’t need the ecosystem too much, consider C# instead of C++. Much faster than Python, often faster than Scala or Java, easy integration with C should you need that (about the same as Rust, much easier than Python or Java), the only downside is these ~100MB of the runtime. The reputation is weird, but technically the language and runtime are pretty good. For example, here’s a C# library which re-implements a subset of ffmpeg and libavcodec C libraries for one particular platform, Linux on Raspberry Pi4: https://github.com/Const-me/Vrmac/tree/master/VrmacVideo
Personally, I use rust for the ability to go as fast as C++ but with far fewer footguns. Not to mention that I personally really enjoy how they made algebraic data types and package management first class.
I suspect that if you really spent time learning rust that you'd really appreciate it coming from C++.
When I just want the ability to go as fast as conventional readable C++, I use C# instead. Compared to conventional C++, C# is typically only slightly worse in terms of performance. C# is safer than Rust, and the usability is IMO an order of magnitude better. Just the compilation speed is already a huge contributing factor.
I use C++ for 2 main reasons, integration with other software or libraries, and performance.
When I need the performance, I don’t write conventional C++. I implement custom data structures, usually use vector intrinsics, sometimes use other platform intrinsics like BMI2, sometimes implement custom threading strategies on top of OpenMP or other platform-supplied thread pools. Most of these things are impossible or very hard to express in idiomatic safe Rust, which means the language constantly gets in the way.
Also, most programs contain both performance critical and performance agnostic pieces. When both pieces have non-trivial complexity, I typically use both C# and C++ for the software. Often the frontend part is in .NET, and the performance-critical backend is in C++ DLL.
Looking at the examples at the different quantization levels, I'm quite impressed. The change from f16 to q8_0 seems to be more of a change in direction than a loss of quality. The q5_1 result seems indistinguishable from the q8_0.
So you're losing reproducibility relative to the higher-precision models, but the results are potentially quite usable.
This is not surprising, because the GPU support in GGML is said to be preliminary and it is optimized for being run on CPUs.
Seeing the times reported by other people, it seems that using the GPU with GGML, instead of the CPU, still provides a speed improvement, but it is small.
Nevertheless, I appreciated that after following the instructions of this project exactly, everything was up and running within a few minutes and could be tested.
Past attempts to install all the environment needed to run such models have required much more work.
1: https://github.com/VoltaML/voltaML-fast-stable-diffusion
2: https://github.com/mlc-ai/web-stable-diffusion