Stable Diffusion in C/C++ (github.com/leejet)
303 points by kikalo00 on Aug 19, 2023 | hide | past | favorite | 115 comments


Llama.cpp/ggml is uniquely suited to LLMs. The memory requirements are huge, quantization is effective, and token generation is surprisingly serial and bandwidth bound, making it good for CPUs, and an even better fit for ggml's unique pipelined CPU/GPU inference.

...But Stable Diffusion is not the same. It doesn't quantize as well, the unet is very compute intense, and batched image generation is effective and useful to single users. It's a better fit for GPUs/IGPs. Additionally, it massively benefits from the hackability of the Python implementations.

I think ML compilation to executables is the way for SD. AITemplate is already blazing fast [1], and TVM Vulkan is very promising if anyone will actually flesh out the demo implementation [2]. And they preserve most of the hackability of the pure PyTorch implementations.

1: https://github.com/VoltaML/voltaML-fast-stable-diffusion

2: https://github.com/mlc-ai/web-stable-diffusion


The above project somewhat supports GPUs if you pass the correct GGML compile flags. `GGML_CUBLAS`, for example, is supported when compiling. You get a decent speedup relative to pure C/C++.


Interesting. It still doesn't seem to be very quick: https://github.com/leejet/stable-diffusion.cpp/issues/6

But don't get me wrong, I look forward to playing with ggml SD and its development.


Yeah for comparison, `tinygrad` takes a little over a second per iteration on my machine. https://github.com/tinygrad/tinygrad/blob/master/examples/st...


Is that on GPU or CPU? 1 it/s would be very respectable on CPU.

The fastest implementation on my 2060 laptop is AITemplate, being about 2x faster than pure optimized HF diffusers.


That was on GPU, and there are various CPU implementations (e.g. based on Tencent/ncnn) on github that have similar runtime (1-3s / iteration).


On the other hand, this is nice for anyone who wants to play with these networks locally and does not have an Nvidia GPU with 6+ gigabytes of VRAM. I can run this on an old laptop, even if it takes a while.


> On the other hand, this is nice for anyone who wants to play with these networks locally and does not have a nvidia GPU with 6+ gigabytes of VRAM. I

SD 1.x works in stable-diffusion-webui (aka A1111, one of the more popular frontends) with (reportedly) as little as 2GB of VRAM (I’ve personally used 1.5 and 2.1 without any problems on a 4GB card.)


I would also highly recommend https://tinybots.net/artbot

You can even run CLIP or (if it's fast enough) llama.cpp on the CPU to contribute to the network, if you wish.


ComfyUI works pretty well on old computers using the CPU. It takes over 30 seconds per sampling step on a 2015 MacBook Air (i7, 8 GB RAM).


IIRC we had good speedups on it with torch.compile, and I remember working on it. Let me see if I can find numbers…


It's about 20-40% depending on the GPU, from my tests.

And only very recent builds of torch 2.1 (with dynamic input support) work properly, and it still doesn't like certain input changes or augmentations like ControlNet.

AIT is the most usable compiled implementation I have personally tested, but SHARK (running IREE/MLIR/Vulkan) and torch-mlir are said to be very good.

Hidet is promising but doesn't really work yet. TVM doesn't have a complete implementation outside of the WebGPU demo.


Try head of master. If there’s any bugs or graph breaks you hit, lmk, I can take a look. My numbers say 71% with a few custom hacks.

Glad the dynamic stuff is working out tho!


I will, thanks!

I have been away for a month, but I will start testing it again later and submit some issues I run into.


My username without underscore, at meta. Email me any bugs, I can help file them on GH and lend a hand fixing.


Awesome that they implemented CLIP as well. That alone could be cool to extract and compile as a wasm implementation.

Edit: Seems like someone already has https://github.com/monatis/clip.cpp :) Now to wasmify it


Speaking of CLIP, I'm always troubled that the next CLIP might not get released as both OpenAI and Google are shifting into competition mode. Sad to think there might be a more advanced version of CLIP already but sitting in a secret vault somewhere.

Edit: I'm not referring to a CLIP-2 but any advance on the same level of importance as CLIP.


The biggest CLIP models we know of are open source.

If a company has a bigger CLIP model, they haven't even reported it.

Also, for a while OpenAI had a proprietary CLIP model bigger than any other model available: the CLIP-H used by DALL-E 2.


As someone who is out of the loop but could use high quality image embeddings right now, what's the best CLIP model right now?


It really depends on what you're trying to achieve. If you want to build a semantic image search, then a small/base model would be fine; I think bigger models usually leak too much information, which makes the embedding space too difficult to interpret for a simple algorithm like cosine similarity. If you want to condition a generative model, then a bigger model should provide more information about the prompt or the image.


SDXL uses OpenCLIP, and then OpenAI CLIP as a backup basically to allow it to spell words properly, but I think you could replace the second one.


Stable Diffusion switched to OpenCLIP for version 2. But it looks like they went back to CLIP for the XL version.

People complained about OpenCLIP not being as good. Hopefully we can have a better and open CLIP model eventually.


This is incredibly easy to set up; just tried it for the first time.

How fast is it supposed to go?

Just tried on Linux with `cmake .. -DGGML_OPENBLAS=ON` on an AMD Ryzen 7 5700G (no discrete GPU, only integrated graphics)

    ./bin/sd -m ../models/sd-v1-4-ggml-model-f32.bin -p "a lovely cat"
    [INFO]  stable-diffusion.cpp:2525 - loading model from '../models/sd-v1-4-ggml-model-f32.bin'
    ...
    [INFO]  stable-diffusion.cpp:3375 - start sampling
    [INFO]  stable-diffusion.cpp:3067 - step 1 sampling completed, taking 12.25s
    [INFO]  stable-diffusion.cpp:3067 - step 2 sampling completed, taking 12.22s
    [INFO]  stable-diffusion.cpp:3067 - step 3 sampling completed, taking 12.56s
    ...
    sampling completed, taking 246.40s

Is that expected performance?

(EDIT: I don't have OpenBLAS installed, so that flag is a no-op)


This is nice, it basically does what I asked for a year ago[0], and at the time pretty much every solution wanted a litany of Python dependencies that I ended up failing to install because it took ages... and then I ran out of disk space.

No, really, this replaces literal gigabytes of disk space with just a 799KB binary. And as a bonus, using the Q8_0 format (the one that seems to be the fastest) also saves ~2.3GB of data.

That said, it seems to be buggy with anything other than the default 512x512 image size. Some sizes (e.g. 544x544) tend to cause assert failures, and sizes smaller than 512x512 (which I tried since 512x512 is quite slow on my PC) sometimes generate garbage (anything smaller than 384x384 seems to always do that).

[0] https://news.ycombinator.com/item?id=32555608


> sizes smaller than 512x512 (which i tried as 512x512 is quite slow on my PC) sometimes generate garbage (anything smaller than 384x384 seems to always do that).

Not sure about the speed, but the garbage output might be due to the model instead of the library. I’ve always got garbage (using other tools) when I tried 256x256.


SD 1.x does not work well at resolutions less than 512x512, it's only really trained to output 512x512.


IME, the SD 1.5 base model (IIRC; I don’t use the base model much) works tolerably well with the larger dimension at 512 and the smaller dimension between 384 and 512, and quite a lot of SD 1.5-based checkpoints work pretty well up to a larger dimension of 768 (some with the smaller dimension up to that, too.)

But SD also has a hard requirement that the output dimensions be divisible by 8 (I believe as a consequence of the pixel space to latent space ratio being 8:1 in each dimension), and the UNet downsamples the latent further, so sizes like 544x544 (divisible by 8, but not by 64) can still fail hard in implementations that don't pad the latent.


Also got a segfault (core dump) with different sizes. 512w x 768h worked.


You should quantize the model, but 12s/iter seems about right.


Nice. I tried fp32, q8_0, and q4_0, and for some reason they all take ~12s/iter.

Must be something wrong with my setup, but no big deal; for my minimal usage of it, and the amount of time spent, fp32 @ 12s/iter is fine.


Hmm, theoretically FP16 might be the fastest, if that's an option in the implementation now.


I did a quick run under a profiler, and on my AVX2 laptop the slowest part (>50%) was matrix multiplication (sgemm).

In the current version of GGML, if OpenBLAS is enabled, they convert matrices to FP32 before running sgemm.

If OpenBLAS is disabled, on an AVX2 platform they convert FP16 to FP32 on every FMA operation, which is even worse (due to the repetition). With that path, both ggml_vec_dot_f16 and ggml_vec_dot_f32 took first place in the profiler.

Source: https://github.com/ggerganov/ggml/blob/master/src/ggml.c#L10...

But I agree that, in theory, BF16 (not exactly FP16, but similar) will be fast, though only with AVX-512's VDPBF16PS instruction. The implementation is not there yet.


Interesting.

I saw some discussion on llama.cpp that, theoretically, implementing matmul for each quantization should be much faster since it can skip the conversion. But in practice it's quite difficult to beat the various BLAS libraries, since they're so good.


CPU-only, 8-bit quant, Intel Core i7 4770S, 16 GB DDR3 RAM, 10 year old fanless PC: 32 seconds per sampling step, correct output.


There's just something special to these C/C++ implementations of AI stuff. They feel so clean and straightforward and make the entire field of AI feel tangible and learnable.

Is that because Python's ecosystem is so messy?


Rewrites tend to improve code quality, and replacing dependencies with custom-tailored code that does just what you need also improves code quality.

And while the Python version uses C and C++ code for speed, this is all just one language.

A trifecta of factors enabling clean code.


Nice to see ML folks getting weaned off of Python and using a language that can optimally exploit the underlying hardware and not require setting up a specialized environment to build and run.


That's a rather odd comparison to make. First of all, OP, like llama.cpp, doesn't use the GPU – in contrast to most Python ML code. It's not hard to write Python code that "optimally exploits" the GPU. You might call the GPU a "specialized environment to build and run" but it's arguably much better suited to the problem.

Second, OP, like llama.cpp, produced efficient and highly specialized code after it was clear the model being specialized for (StableDiffusion / LLaMa / …) works well. Where Python shines, though, is the prototyping phase when you have yet to find an appropriate model. We have yet to see this sort of easy & convenient prototyping in C++.

Now, this is not to take away anything from the fantastic work that's being done by the llama.cpp people (to whom I also count OP) in the "ML on a CPU" space. But the problems being solved are entirely different.


> Where Python shines, though, is the prototyping phase when you have yet to find an appropriate model. We have yet to see this sort of easy & convenient prototyping in C++.

+1.

Producing a highly optimized C/C++ kernel that utilizes the CPU to the fullest extent requires a tremendous amount of talent and expertise. For example, not everyone can write a hand-vectorized kernel with AVX2 intrinsics (outside a few specialized applications like 3D graphics, media encoding, and the like), and even fewer people can exploit the underlying features of the algorithm for optimization, such as producing usable output at greatly reduced numerical precision. The power of LLMs provides strong motivation, driving the brainpower of countless programmers all over the world to do just that. New techniques are proposed and implemented on a monthly basis, with people thinking up and applying every possible trick to the LLM optimization problem. In this regard, moving from Python to C is totally reasonable.

In comparison, right now I'm working on optimizing a niche open-source scientific simulation kernel with a naive C codebase. Before me, there were hardly any contributors in the last decade.

Python has its place because not everyone has a level of resources and expertise comparable to ML. In particular, when the bulk of the data processing in a Python script is done in a function call to a C++ or FORTRAN kernel like scipy, the difference between naive C and naive Python code (or Julia code, if you're following the trend) is not that much, especially when it's a one-off project for publishing a single paper.


It’s going to be a tf or PyTorch feature rather than going directly to writing things in C. No point solving this problem only once.


Yeah, I make a living in the GPU space. I think my comment comes from colleagues having to hold my hand to set up their ML/Python environments with all of their peccadilloes. In fact, it's bad enough that I have to use Docker to create an insular environment tailored to their specific setup. And Python is like 1000 times slower when it's not using other libs like numpy.


Are they not using venvs or something? It should be as simple as: python -m venv venv; . venv/bin/activate; pip install -r requirements.txt


Everyone has their own way to do this. Every step is broken by some unfamiliar dependency that requires special arcane knowledge to fix. Part of me is a grumpy old man that doesn’t gravitate to the shiny new tools that come out every week that the younger devs keep up with :)


pip and venv are neither shiny nor new; they've been the standard way of doing things for a while. I am an outsider to Python and am incredibly thankful for this standardization, because I agree that getting a Python env set up correctly before venv was a huge pain.

If your guys aren't on this, I'd suggest you get them on it; it dramatically simplifies setup.


Here is a tiny excerpt of trying to get DVC to work, just so I could get the training weights for deployment... remember, I don't develop much with Python...

    $ dvc pull
    Command 'dvc' not found, but can be installed with:
    sudo snap install dvc

    $ sudo snap install dvc
    error: This revision of snap "dvc" was published using classic confinement and thus may perform
    arbitrary system changes outside of the security sandbox that snaps are usually confined to,
    which may put your system at risk.

    If you understand and want to proceed repeat the command including --classic.

OK, I got DVC installed somehow -- I don't remember how. Time to get the weights...

    $ python3 -m dvc pull
    ERROR: unexpected error - Forbidden: An error occurred (403) when calling the HeadObject operation: Forbidden            

    Having any troubles? Hit us up at https://dvc.org/support, we are always happy to help!

Finally I just had my colleague manually copy the weights. This kind of thing went on for hours.


Hey, DVC maintainer here.

Thanks for giving DVC a try!

There are a few ways to install dvc, see https://dvc.org/doc/install/linux

With snap, you need to use `--classic` flag, as noted in https://dvc.org/doc/install/linux#install-with-snap Unfortunately that's just how snap works for us there :(

Regarding the pull error, it simply looks like you don't have some credentials set up. See https://dvc.org/doc/user-guide/data-management/remote-storag... Still, the error could be better, so that's on us.

Feel free to ping us in discord (see invite link in https://dvc.org/support). I'm @ruslan there. We'll be happy to help.


Thanks… I know my colleague uses it a lot. I generally use his models and don’t do much ML development yet. At some point I need to properly learn all of this. It seems ML tools are only for developers, not for those who simply want to deploy and use the resulting NN.


Researchers are notorious for writing bad code

What even is dvc

edit: also, I'd avoid snap and just use your regular package manager.


I think DVC is like git for large binary files. You need some way to manage your NN weights -- what are the other methods?


git lfs is what everyone is using, HF in particular


> Are they not using venvs or something? It should be as simple as python -m venv venv; ./activate; pip install -r requirements.txt

In most cases, it would be possible to do close to that, but it is extremely common for things in the AI/ML space to be distributed with install instructions that don’t include that, and instead instruct you to have a global install of a certain Python version, and then to pip install the dependencies (and globally install non-Python package dependencies, if there are any). So even if they’d work in a venv, you have to (1) independently know you should be doing that, and (2) translate the instructions – which, where (1) applies, is usually trivial if all the dependencies are proper Python packages, but can be more involved otherwise.

So, yeah, I can see that a lot of the time the path of least resistance is just to create an isolated container environment for it.


Unfortunately it's not that simple, especially for the NVIDIA driver and CUDA install. That's why we usually use conda, which can handle the CUDA install, but even with that it sometimes works flawlessly and sometimes not.


>You might call the GPU a "specialized environment to build and run" but it's arguably much better suited to the problem.

I feel like the person you're replying to knows that the GPU is better suited than the CPU to do this task, and your argument doesn't really make sense. I think they were referring to the python venv environment with all the library dependencies as the "specialized environment"


The point is that, as awesome as this repo is, it doesn't do much to wean the "ML folks" off of Python, since it doesn't provide the flexibility and GPU support that people designing and training DL systems rely on.


I’m just encouraged when I see ML libraries not using Python with its environment kludges. Just a step in the right direction.


I don't disagree that Python environments are a mess. I'm actually a developer on quite a prominent large scale neural network training library and a DL researcher that uses said library. With my developer hat on I like to have minimal dependencies and keep Python scripting as decoupled as possible from the CUDA C++ implementation. With my researcher hat on I don't want to be slowed down by C++ development every time I want to change my model or training pipeline. At least for me, C++ development is slower and more error prone than modifying Python.

Obviously doing any heavy lifting in Python is a bad idea. But as a scripting language I think it's good, especially if you keep the environment simple. I don't think the answer for DL training is to dump Python entirely and start over in pure C/C++/Rust/Julia/whatever. Learning C/C++ is too big of an ask for everyone working on the model design and training side and it would slow down progress significantly - most of that work is actually data munging and targeted model tweaks. But I do think there's still a lot that can be done to decouple Python from the underlying engine and yield networks where inference can be run in a minimal dependency environment. There's lots of great people working on all these things.


That Python ML code is calling C++ code running on the GPU, which is one more reason to use C++ across the whole stack.

CERN was already prototyping in C++, with ROOT and CINT, 20 years ago.

https://root.cern/

Nowadays it is even usable from notebooks via Xeus.

It is more a matter of lack of exposure to C++ interpreters than anything else.


Add to that it's only inference code, not training.


>That's a rather odd comparison to make. First of all, OP, like llama.cpp, doesn't use the GPU

When was the last time you looked at llama.cpp? It has supported GPU, GPU+CPU, and distributed inference using OpenMPI for a while now. It also supports training, as well as negative prompting and grammars! The ease of getting llama.cpp running on just about anything has already spurred innovation.


Not sure what "It's not hard to write Python code that 'optimally exploits' the GPU" exactly means, but Python is so far from exploiting GPU resources, even with C/C++ bindings, that it's not even funny. I am sure HPC folks would have migrated away from FORTRAN and C/C++ a long time ago if it were so easy.


I wasn't trying to claim that Python is great at fully exploiting GPU resources on generic GPU tasks. But in ML applications it often does, at least in my experience.


Yup. I would much prefer if every ML model had a simple C inference API that could be called directly from pretty much any language on any platform, without a mess of dependencies and environment setup.


ML is such a beautiful and perfect setup for dependency free execution too. It should just be like downloading a mathematical function. I'm glad we're finally embracing that.


It's not like any performance-significant component of the ML stack is actually implemented in Python. Everything is and has always been CUDA, C, or C++ under the hood. Python is just the extremely effective glue binding it all together.


Sometimes implementations will spend a little too much time in Python interpretation, but yeah, it's largely lower-level code.

The problem with PyTorch specifically is that (without Triton compilation) pretty much all projects run in eager mode. That's fine for experimentation and demonstrations in papers, but it's crazy that it's used so much for production without any compilation. It would be like shipping debug C binaries to production, and they only deliver any kind of sane performance on hardware from a single maker.


I really appreciate the people doing this work. It's the only way I've run these models without any headaches. The difference is so stark, even with CUDA and Linux it's bad, with AMD and Windows it's miserable. I'm pretty sure it's not just me..


It’s interesting to me that my CPU can run some of these things in quantized form almost as fast as the GPU. Has the whole thing been all about memory bandwidth all along?

In addition to compute the GPU architecture is one that somewhat colocates working memory alongside compute. Units have local memories that sync with global memory. Is that a big part of why GPUs are so good for this?


> Has the whole thing been all about memory bandwidth all along

Yeah, sort of.

LLMs like llama at a batch size of 1 are hilariously bandwidth bound.

Stable Diffusion less so. It's still bandwidth-heavy on GPUs, but compute is much more of a bottleneck.


It's a bummer there's so little work on the training side in C++.

Especially since the python training systems are mostly calls into libraries written in C++!


Yeah, and since C++17 the language is already quite productive for scripting-like workflows; the missing piece of the puzzle is that there are too few C++ REPLs around, ROOT/CINT being one of the few well-known ones.


CINT is now obsolete, replaced by cling (which is way better!).


Since when does C++ optimally exploit the underlying hardware? It has no vector instructions, does not run on the GPU and is arguably too hard to make multithreaded. Which leaves you with about 0.5% performance of a current PCs.


> does not run on the GPU

both Cuda and the Metal shader language are C++, so is OpenCL since 2.0 (https://www.khronos.org/opencl/), so is AMD ROCm's HIP (https://github.com/ROCm-Developer-Tools/HIP), so is SYCL (https://www.khronos.org/sycl/)? C++ is pretty much the language that runs most on GPUs.

> no vector instructions,

There's a thousand different possibilities for SIMD in C++, from #pragma omp simd, to libs such as std::experimental::simd (https://en.cppreference.com/w/cpp/experimental/simd/simd), Eve (https://github.com/jfalcou/eve), Highway (https://github.com/google/highway), Vc (https://github.com/VcDevel/Vc)...


When compared against Python, more than enough.

C++ is one of the supported CUDA languages, even standard C++17 does run just fine on the GPU.

Metal uses C++14 alongside some extensions.


Vector types/instructions would be nice. The C++20 STL algorithms are very friendly to vectorization with the various parallel policies (e.g. std::execution::unsequenced_policy) that open up your code to being vectorized. Wonderful libs like Eigen handle a lot of my numeric needs for linear algebra. And I think you are forgetting that CUDA is C/C++.


> Vector types / instructions would be nice

It’s technically not a C++ feature, but both gcc (https://gcc.gnu.org/onlinedocs/gcc/Vector-Extensions.html) and Clang (https://releases.llvm.org/3.1/tools/clang/docs/LanguageExten...) have vector types, and clang even supports the gcc way of writing them, so it gets pretty close.


Those are traditionally dangerous since they tend to compile poorly; not as bad as autovectorization but not as good as just writing in assembly. And since vectorization is platform-dependent anyway (because it's so different across platforms), assembly really isn't nearly as bad as it sounds.

Though it's certainly gotten better, the reason people push those is that they're written by compiler authors, who don't want to hear that their compiler doesn't work.

Some of the reason for this is that C doesn't let you specify memory aliasing as precisely as you want to. Fortran is better about this.


Do you mean Julia language?


Amen.


Wasn't Python originally designed as a language to teach children how to code? Weird to see so many, otherwise intelligent, folks latch onto it.

It really doesn't have any redeeming characteristics vs. Common Lisp, or Haskell, to warrant this bizarre popularity imo


ABC, the predecessor from which Python took many syntax features, was. I wonder if Python also took a lot of the ABC implementation, given that it is still copyright CWI.

I agree that its popularity is very odd, but academics take what they are given when attending fully paid conferences (aka vacations).


> Wasn't Python originally designed as a language to teach children how to code

I think it would be very confusing for a child to start with a language so far away from low-level logic.

...And some people said BASIC was evil. At least what it is doing looks plain and direct.


>I think it would be very confusing for a child to start with a language so far away from low-level logic.

Why?

I started with C++, and when they showed me C# I instantly fell in love, because I didn't have to deal with unnecessary complexity and annoyances and could focus on pure programming, algorithms, etc.


Yes Tester,

but you are confirming my point :) ...You started with C++, then went to C#...


I started with C++ and switched near the beginning, so there wasn't any "low-level knowledge" involved, nor anything above beginner-level C++ concepts.

Both: high-to-low and low-to-high have some advantages, but it's not like one is always better than the other.

high-to-low allows you to write stuff earlier - like programs that do something useful, GUI, web, whatever.

but at the cost of understanding internals / under the hood.


I loved BASIC as a kid. C was really confusing because I was constantly committing egregious memory errors without realizing it, the spooky effects were totally mystifying to me, and I didn’t even have the reasoning tools to know where to start with debugging it. It was definitely distracting from simple logic.

On the other hand, it made me think a lot about computer memory, and it makes computer memory easy to work with. Now I’m really comfortable with memory and encodings so that’s nice. I don’t think I would have gotten that by starting with Java or Python.


> I think it would be very confusing for a child to start with a language so far away from low-level logic.

Depending on the person.

For some, it would be very frustrating to start with a language so close to the implementation details and so far away from what you want to do. It's very possible that someone might lose motivation long before they can do anything non-trivial.

I started from Python, to C, to assembly, to 4-layer circuit boards. Whenever I went a level deeper, it felt like opening up the inner workings of a black box that I normally only interact with through the pushbuttons on its front panel, while otherwise being roughly aware of what it does.

On the other hand, much of my childhood was spent on tinkering with PCs and servers, including hosting websites and compiling packages from source, so I was already well aware of the basic concepts in computing before I started programming. So, top-down and bottom-up are both absolutely workable, under the right circumstances.


What's the term for an ad hominem fallacy directed at programming languages? Asked the almighty chat and got this new term:

"Code Persona Attack"

Python is fine.


This is a Ycombinator site, the traditional term is Blub.

And they're right. Python is not a well designed programming language - it has exceptions and doesn't have value types so that's two strikes against it.

Of course, C++ isn't either.


As long as we are language trolling, why would anyone start a greenfield project like this in C++ these days? The Android, Windows, Firefox, and now Chrome projects have all begun to shift toward Rust, and in the case of Android and Firefox, write a significant amount of their code in Rust. Migrating an existing project like that is difficult. The Chrome team in particular lamented the difficulty. But starting a new project? If you have a team familiar with performant C++, the speed bump of starting a greenfield project in Rust is negligible, and the ergonomic improvements in the build system and the language itself will make up for that in any project that takes more than a few months. For that speed bump you get memory and thread-race safety, far beyond what any stack of C++ analysis tools could ever provide, with a tiny fraction of the unit tests you'd write in C++. And you lose no performance.


Rust is great when you know what you're building. That qualifier encompasses quite amount of software space, but not all of it, and I would argue not even the majority of it.

If you don't know what you are doing, if you are exploring ideas, Rust will just get in the way. At some point you will realize you need to adjust lifetimes, and that will require you to touch a non-trivial amount of your code base. If you need to do that multiple times, the friction will overwhelm your desire to code.

I have a pet theory that, the people that find Rust intuitive and fun, are the people that are working on well beaten paths; Rust is almost boring at doing that, which is a good thing. And the people that find Rust gets in their way are the people that like to experiment with their solutions, because there aren't any set, trusted solutions within their problem space, and even if there are, they like to approach the problem on their own, for better or worse.

In any case:

> why would anyone start a greenfield project like this in C++ these days?

The video game industry can single-handedly carry C++ on their back, kicking and screaming, if need be. Rust is uniquely unfit to write gameplay code due to game development's iterative nature. Using scripting languages doesn't cut it either, because often, slower designer made scripts will need to be converted to C++ by a programmer, and pull in the crazy reference hell of the game state into the C++ land.

I would say Rust is OK for engine-level features -- those don't change that often, and the requirements are usually well understood. But that introduces a cadence mismatch between different systems, so there is a cost there as well. But for gameplay? There's a reason why many Rust-based game engines use a crazy amount of unsafe Rust to make their ECS. Just not a good fit.

And of course, there are the consoles, where Sony seems to have a political reason for not supporting Rust for non-first-party studios. I have no idea what they are thinking, honestly.


C++ has a standard, multiple competing implementations and a largely drama-free community.

Does CUDA even have Rust bindings, and if so, are they on the same level as the C++ ones?

What do you mean by "the windows projects" that shift towards Rust?


MS has started implementing pieces of Windows in Rust; if you have Windows 11, you are running Rust. The CUDA bindings are good for ML, but missing for cuFFT and similar. There are people working on better CUDA support, but there are even more people working on vendor-agnostic GPGPU using SPIR-V and WebGPU. It isn't there yet. Right now you are mostly left to write your own bindings unless you are doing ML or BLAS.

Edit: I can't argue about the drama part. The competing compilers will get there: a couple of GCC frontends are in the works, and Cranelift as a competing backend alongside LLVM, with full self-hosting. There is also mrustc, which emits C; people use approaches like that to get Rust onto the C64 and other niche processors.


Yes, they have started, yet there are 30 years of Windows NT history's worth of C++ left to rewrite.

Meanwhile, Visual Studio team released better tooling for Unreal in Visual C++.


> why would anyone start a greenfield project like this in C++ these days?

TLDR: quite often, using C++ instead of Rust saves software development costs.

Some software needs to consume many external APIs. Examples on Windows: Direct3D, Direct2D, DirectWrite, Media Foundation. Examples on Linux: V4L2, ALSA, DRM/KMS, GLES. These things are huge in terms of API surface. Choose Rust, and you are going to need to write and support a non-trivial amount of boilerplate code for the interop. Choose C++ (on Linux, C is good too) and that code is gone; you only need the well-documented and well-supported APIs supplied by the OS vendors.
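To make that boilerplate concrete, here's a minimal sketch of what binding even a single trivial C function looks like from Rust (using libc's strlen as a stand-in; a real API like Direct3D multiplies this by thousands of entry points):

```rust
use std::ffi::{CStr, CString};
use std::os::raw::c_char;

// In C or C++ this declaration comes for free from <string.h>. In Rust,
// every foreign function must be declared by hand (or generated with
// bindgen), and every call site is unsafe.
extern "C" {
    fn strlen(s: *const c_char) -> usize;
}

fn c_string_length(s: &CStr) -> usize {
    // SAFETY: CStr guarantees a valid, NUL-terminated pointer.
    unsafe { strlen(s.as_ptr()) }
}

fn main() {
    let s = CString::new("hello").unwrap();
    println!("{}", c_string_length(&s));
}
```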

Similarly, some software needs to integrate with other systems or libraries written in C or C++. An example often relevant to HPC applications is Eigen. Another related thing: game console SDKs and game engines don't support Rust.

For the project being discussed here, GGML, the implementation needs vector intrinsics for optimal performance. Technically Rust has the support, but in practice Intel and ARM only support them for C and C++. And it's not just the CPU vendors: when using C or C++, there are useful resources (articles, blog posts, Stack Overflow answers) that help a lot in practice. I don't program in Rust, but I do program C# in addition to C++; technically most vector intrinsics are available in the current version of C#, but they are much harder to use from C# for this very reason.

All current C and C++ compilers support OpenMP for parallelism. While it is not a silver bullet, and not available on every platform C and C++ themselves support, some software benefits tremendously from it.
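For comparison, Rust has no OpenMP, though the common parallel-for-with-reduction case can be covered with stdlib scoped threads (rayon being the usual crate choice). A hedged, stdlib-only sketch of what `#pragma omp parallel for reduction(+:sum)` buys you:

```rust
use std::thread;

// Split a sum across threads, roughly what an OpenMP reduction does in C/C++.
// Sketch only; rayon's par_iter() is the idiomatic Rust way to write this.
fn parallel_sum(data: &[u64], n_threads: usize) -> u64 {
    let chunk = ((data.len() + n_threads - 1) / n_threads).max(1);
    thread::scope(|s| {
        // Spawn one scoped worker per chunk; borrows of `data` are checked.
        let handles: Vec<_> = data
            .chunks(chunk)
            .map(|c| s.spawn(move || c.iter().sum::<u64>()))
            .collect();
        // Join and combine the partial sums.
        handles.into_iter().map(|h| h.join().unwrap()).sum()
    })
}

fn main() {
    let v: Vec<u64> = (1..=1_000).collect();
    println!("{}", parallel_sum(&v, 4));
}
```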

Finally, it’s easier to find good C++ developers, compared to good Rust developers.


There are existing supported bindings for Direct3D from MS, as they themselves are migrating. GLES and GGML also have supported bindings. I like nalgebra + rustfft better than Eigen now. Nalgebra still isn't quite as performant on small matrices until const generic eval stabilizes, but it is close enough for 6x6 stuff that the difference is in the noise. RustFFT is even faster than FFTW. Rust has intrinsics support on par with clang and GCC, and the autovectorizer uses whatever LLVM knows about, so again equivalent to clang.
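To illustrate the intrinsics point with a minimal sketch (my own, not from any of those projects): std::arch exposes the same intrinsic names Intel documents for C, e.g.

```rust
// Add two float4 vectors. SSE is baseline on x86_64, so the intrinsics
// are always available there; other targets get a scalar fallback.
#[cfg(target_arch = "x86_64")]
fn add4(a: [f32; 4], b: [f32; 4]) -> [f32; 4] {
    use std::arch::x86_64::*;
    // Same names as Intel's C intrinsics: _mm_loadu_ps, _mm_add_ps, ...
    unsafe {
        let sum = _mm_add_ps(_mm_loadu_ps(a.as_ptr()), _mm_loadu_ps(b.as_ptr()));
        let mut out = [0.0f32; 4];
        _mm_storeu_ps(out.as_mut_ptr(), sum);
        out
    }
}

#[cfg(not(target_arch = "x86_64"))]
fn add4(a: [f32; 4], b: [f32; 4]) -> [f32; 4] {
    let mut out = [0.0f32; 4];
    for i in 0..4 {
        out[i] = a[i] + b[i];
    }
    out
}

fn main() {
    println!("{:?}", add4([1.0, 2.0, 3.0, 4.0], [10.0, 20.0, 30.0, 40.0]));
}
```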

On the last point, I will again assert that a good C++ developer is just a good Rust developer minus a month of ramp-up, which you'll get back from not having to fight combinations of automake, CMake, Conan, vcpkg, Meson, Bazel, and Hunter.


I'll be very surprised if MS ever supports Rust bindings for Media Foundation. That thing is COM-based, requires users to implement COM interfaces, not just consume them, and is heavily multithreaded.

About SIMD: automatic vectorizers are very limited. I was talking about manually vectorized code using intrinsics.

I've been programming C++ for a living for decades now. I tried to learn Rust but failed. My impression is that the language is extremely hard to use.


Not sure how fully featured it is. https://lib.rs/crates/mmf

Yes, Rust directly supports modern intrinsics; that is what RustFFT, for instance, uses. I try to stick with autovectorization myself, because my needs are simple enough that a couple of tweaks usually gets me close to hand-rolled speedups on both AVX-512 and AArch64. But for more complicated stuff, yes, Rust seems to be keeping up. Some intrinsics are still nightly-only, but plenty of major projects use nightly in production; it is quite stable, and with a good pipeline you'll be fine.
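A sketch of the kind of tweak I mean (a hypothetical dot product; the iterator form hands LLVM a loop with no per-element bounds checks, which is usually what the autovectorizer needs):

```rust
// Indexed form: correct, but the a[i]/b[i] accesses can leave bounds checks
// in the hot loop unless LLVM proves them away from the min() bound.
fn dot_indexed(a: &[f32], b: &[f32]) -> f32 {
    let n = a.len().min(b.len());
    let mut acc = 0.0f32;
    for i in 0..n {
        acc += a[i] * b[i];
    }
    acc
}

// Iterator form: zip() encodes the shared length, so there are no bounds
// checks at all and LLVM is free to vectorize.
fn dot_iter(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b.iter()).map(|(x, y)| x * y).sum()
}

fn main() {
    let a = [1.0, 2.0, 3.0];
    let b = [4.0, 5.0, 6.0];
    println!("{} {}", dot_indexed(&a, &b), dot_iter(&a, &b));
}
```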

I've written C++ since ~94, and mostly C++17 since it came out, so about a quarter century of getting paid for it. I never liked or used exceptions or RTTI, and generally used a functional style except for preallocating memory for performance. I think those habits might have made the transition a little easier, but the people on my team who had used a more OOP style and the full language don't seem to have adapted any more slowly, if at all. I struggled for years to internalize Rust at home, until I just jumped in at work by declaring that the project I lead would be in Rust. I have had absolutely no regrets. It really isn't as bad a learning curve as C++. But we learned C++ one revision at a time. Also, much like C++, Rust has bits you mostly only need to know for writing libraries, so when getting started you can set those things aside at first.


> Not sure how fully featured it is

The crate you have linked contains just a single line of source code. Here it is:

   pub struct SourceReader {}
Media foundation API reference: https://learn.microsoft.com/en-us/windows/win32/medfound/med...

> rust directly supports modern intrinsics

C# also directly supports them, but that doesn't help with usability. Support alone is not enough; the API needs to match the C compiler extensions defined decades ago by Intel and ARM.


Curious how long you tried to learn Rust? I've found C++ much harder to learn (coming from a Python/Scala background).

Is it just a case of you forgetting how hard C++ was to learn?


I have minimal experience with Rust. OTOH, I've been programming C++ for a living since 2000, with a few gaps when I used other languages like Obj-C and C#.

I agree C++ is very hard to learn if you only have experience with higher-level languages like Python and Scala. I think there’re two reasons for that.

C++ is unsafe. There's no way around this one; it was designed that way, like C or assembly. Still, with a modern toolset it's not terribly bad. Compilers print warnings; BTW, I typically ask them to treat warnings as errors to deliberately fail the build. On Windows, the combination of a debug build, the debug C runtime, and the Visual Studio debugger helps tremendously. Linux compilers have sanitizers (address, memory, thread, undefined behavior) which are comparable; they too sacrifice runtime speed for diagnostics and debuggability.

Another reason: the language itself is very complicated, especially the templates. However, just because something is in the language doesn't mean it's a good idea to use it. You don't need to be familiar with that stuff unless you're doing something very advanced, like customizing the Eigen C++ library. Don't follow the patterns found in the standard library: unlike your code, that library has good reasons to use that template BS. If you use something other than templates, C++ becomes much easier to use, and most importantly other people will still be able to read and understand your code. Another reason to avoid excessive template metaprogramming: it slows down the compiler, because template-heavy code often needs to be in headers as opposed to cpp files.

P.S. If you don’t need extreme levels of performance (defined as “approach the numbers listed in CPU specs”, the numbers are FLOPS or memory bandwidth), and you don’t need the ecosystem too much, consider C# instead of C++. Much faster than Python, often faster than Scala or Java, easy integration with C should you need that (about the same as Rust, much easier than Python or Java), the only downside is these ~100MB of the runtime. The reputation is weird, but technically the language and runtime are pretty good. For example, here’s a C# library which re-implements a subset of ffmpeg and libavcodec C libraries for one particular platform, Linux on Raspberry Pi4: https://github.com/Const-me/Vrmac/tree/master/VrmacVideo


Personally, I use rust for the ability to go as fast as C++ but with far fewer footguns. Not to mention that I personally really enjoy how they made algebraic data types and package management first class.

I suspect that if you really spent time learning rust that you'd really appreciate it coming from C++.


When I just want the ability to go as fast as conventional readable C++, I use C# instead. Compared to conventional C++, C# is typically only slightly worse in terms of performance. C# is safer than Rust, and the usability is IMO an order of magnitude better. Just the compilation speed is already a huge contributing factor.

I use C++ for 2 main reasons, integration with other software or libraries, and performance.

When I need the performance, I don’t write conventional C++. I implement custom data structures, usually use vector intrinsics, sometimes use other platform intrinsics like BMI2, sometimes implement custom threading strategies on top of OpenMP or other platform-supplied thread pools. Most of these things are impossible or very hard to express in idiomatic safe Rust, which means the language constantly gets in the way.
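To be fair to Rust on the BMI2 example, the intrinsic is reachable there too, just only through unsafe plus runtime feature detection, which is part of why it doesn't feel idiomatic. A hedged sketch of _pext_u64 with a portable fallback:

```rust
// Portable fallback for PEXT (parallel bit extract): gathers the bits of x
// selected by mask into the low bits of the result.
fn pext_fallback(x: u64, mut mask: u64) -> u64 {
    let mut out = 0u64;
    let mut bit = 0u32;
    while mask != 0 {
        let lsb = mask & mask.wrapping_neg(); // lowest set bit of the mask
        if x & lsb != 0 {
            out |= 1u64 << bit;
        }
        bit += 1;
        mask &= mask - 1; // clear that mask bit
    }
    out
}

fn pext(x: u64, mask: u64) -> u64 {
    #[cfg(target_arch = "x86_64")]
    {
        if std::is_x86_feature_detected!("bmi2") {
            // SAFETY: guarded by the runtime feature check above.
            return unsafe { std::arch::x86_64::_pext_u64(x, mask) };
        }
    }
    pext_fallback(x, mask)
}

fn main() {
    // Extract the high nibble of 0b1011_0010.
    println!("{:#b}", pext(0b1011_0010, 0b1111_0000));
}
```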

Also, most programs contain both performance critical and performance agnostic pieces. When both pieces have non-trivial complexity, I typically use both C# and C++ for the software. Often the frontend part is in .NET, and the performance-critical backend is in C++ DLL.


I believe Rust is actually more thread-safe than C#, though your other points are pretty fair.


No there aren't, unless you mean Rust/WinRT demos with community bindings.

Agility SDK and XDK have zero Rust support. If it isn't on the Agility SDK and XDK, it isn't official.

Hardly the same as the official Swift bindings to Metal, written in Objective-C and C++14.


It appears to be in C++, why state it as C/C++?


From what I understand the underlying ggml dependency is written in C.


Saw this repo today, fetched it, built a .dylib (Mac), and used Dart's ffigen tooling to generate the bindings from the provided header file.

I'm just experimenting with it together with Flutter. FFI because I'm trying to avoid spawning a subprocess.

Fast forward: ended up with a severe headache and a broken app. Will continue my attempt tomorrow with a fresh mind, haha.

This repo is great though, had it up and running within 10 min on my M1 (using f16). Thanks for sharing!


Looking at the examples at the different quantization levels, I'm quite impressed. The change from f16 to q8_0 seems to be more a change in direction than a loss of quality. The q5_1 result seems indistinguishable from the q8_0.

So relative to the higher-precision models you lose reproducibility of the exact output, but the results are potentially quite usable.


Any benchmarks?


Some people have timed it here; it looks like it's taking 15-20 s/it (depending on quantization and hardware).

https://github.com/leejet/stable-diffusion.cpp/issues/1


I have compiled it with the command:

    cmake .. -DGGML_CUBLAS=ON -DCMAKE_CUDA_COMPILER=/opt/cuda/bin/nvcc

to use my NVIDIA GeForce RTX 2060 SUPER.

I have converted the model to use FP16.

With these choices, the time per iteration is between 8.5 s and 9 s and the total time for making an image is around 200 s.


That seems a lot worse than a 2060 SUPER with PyTorch in A1111.

https://vladmandic.github.io/sd-extension-system-info/pages/... (search for 2060 SUPER)


This is not surprising, because the GPU support in GGML is said to be preliminary, and GGML is optimized for running on CPUs.

Seeing the times reported by other people, it seems that using the GPU with GGML instead of the CPU still provides a speed improvement, but a small one.

Nevertheless, I appreciated that after following this project's instructions exactly, everything was up and running within a few minutes and could be tested.

Past attempts to install the full environment needed to run such models have required much more work.





