A similar thing is that I often come across people who are well-informed but still surprised that compilers will combine two observable operations into one, and complain that some other thread somehow 'should' be able to observe the intermediate state. But I don't understand how they think they would be able to tell the difference between an interleaving that never happens to happen, and one that will never happen.
> A similar thing is that I often come across people who are well-informed but still surprised that compilers will combine two observable operations into one
Well, think of it this way: if you're not into compilers, wouldn't you expect your code to run as written? Especially if you put a breakpoint in there and can see each step happening sequentially. I don't really think people who don't know these things should be viewed as ignorant, because the compiler tends to do some very unintuitive things. IMO everyone would be better off if somehow the tooling or the compiler could report some of these things in an easy-to-visualize, not-super-verbose way.
You use a compiler to abstract away the CPU. Because of that, there is no “run as written”.
Simple example: if you declare a structure that is 14 bytes in size, declare variables foo and bar of that type, and then write
foo = bar;
C must behave as if 14 bytes are copied from the memory where bar is stored to the memory where foo is stored, with two observable states: foo either has its old contents or the new ones. Chances are very high that your CPU cannot do that in a single operation.
The compiler will instead do something different. It might call memcpy, it might copy 8 bytes first, then 4 more, then the last 2, it might copy the first 8, then the last 8 (copying the two center bytes twice), etc.
Similarly, if you declare an int named i and do
i += 1;
There’s no guarantee that will translate to a single instruction (what if your CPU has 8-bit registers, but your compiler chose to have 16-bit ints?). Even if it does, does “as written” mean an increment instruction or an addition? Your CPU might have neither, but only subtraction and negation.
Also, a literal interpretation of “as written” requires spilling the result of every statement back to memory; a strict literal interpretation even forbids using registers (if you don’t write “copy i into a register, add 1, write the result back to i”, the compiler can’t do that if it has to run your code “as written”)
But yes, it would be awesome if somehow the tooling or the compiler could report some of these things in an easy to visualize not-super-verbose way.
Unfortunately, that isn’t possible. Modern languages and their compilers are just too far abstracted away from modern hardware.
Being caught in the middle of an instruction, like all your examples, is much less surprising than when something never happens at all.
> Also, a literal interpretation of “as written” requires spilling the result of every statement back to memory; a strict literal interpretation even forbids using registers (if you don’t write “copy i into a register, add 1, write the result back to i”, the compiler can’t do that if it has to run your code “as written”)
If you don't make a pointer to a variable then it's allowed to be in a register, isn't it? There's even a 'register' keyword, however obsolete it may be.
I think it’s only less surprising because it happens more often, so that ‘everybody’ knows about it, not because it is inherently less strange.
As examples of merging, if you do
s = sin(x)
c = cos(x)
would you find it surprising if that generated a single sincos instruction if the CPU has one?
If you do
typedef struct { char a; char b; } s;
s c;
[…]
c.a = 3;
c.b = 6;
do you find it surprising if the compiler generates a single instruction equivalent to storing one 16-bit value into c? (Exact code will vary according to the CPU architecture, and expressing it in C may need a cast, or a memcpy, to avoid undefined behavior.)
I thought compilers ignored those words when optimizations are turned on, compilers reorganize and rewrite your code.
Using an interpreter seems to be the best option to abstract away the CPU. Write the interpreter once, get enough eyeballs onto it, test it 'till death and use it for predictability.
Even if you write assembly these days, there is all sorts of ambiguity around. Even the underlying hardware we use can fail.
I used to believe that computers did as they're told, that they could make no mistakes, that all mistakes were human-made. Now I see that there is no such thing as error-free. We are basically living on chance; much is uncertain.
I trust my eyes and ears more and more instead of the machine.
> You use a compiler to abstract away the CPU. Because of that, there is no “run as written”.
Um, no. I use a compiler to generate code that would otherwise be very tedious to write. I do not expect it in any way to abstract away the CPU. I do not want it to abstract away the CPU. If I did I would write Java or python. I'm using C++. I am very aware of the CPU. What I'm not aware of is whatever silly transformations the optimizer thinks it can do to my carefully crafted program because I didn't read page 783 of the spec.
You want a macro assembler. C is not a macro assembler. C was never supposed to be a macro assembler. It was designed as a cross-platform language, which fundamentally means you are programming to an abstract machine, which necessarily means that you cannot have a macro assembler [0].
I can, however, see an argument that people want a C like language which is a macro assembler. I am lost again when people start asking for a C++ like macro assembler. The features that C++ brings over C are not concepts that exist on any mainstream CPU.
[0] Unless you decide to build a concrete machine (either in bare metal, or as a VM).
C was originally a macro assembler! This new mutation of C is just that, new. I don't mind optimizations as long as it doesn't remove intent, which is usually clearly indicated in the source, standards be damned.
There is no definitive start point for C, as it grew fairly organically out of B and began as a living language. However, a reasonable approximation for C 1.0 is The C Programming Language (Kernighan and Ritchie, 1978) [0], commonly known as K&R C.
The language they describe is much simpler than modern C. Notably for this discussion, it contains no synchronization mechanisms.
There is nothing in that book that suggests C was a macro assembler. On the contrary, its description of C precludes it from being a macro assembler for the reasons I talked about in a parent comment:
> Although C matches the capabilities of many computers, it is independent of any particular machine architecture, and so with a little care it is easy to write "portable" programs, that is, programs which can be run without change on a variety of hardware. It is now routine in our environment that software developed on UNIX is transported to the local Honeywell, IBM and Interdata systems. In fact, the C compilers and runtime support on these four machines are much more compatible than the supposedly ANSI standard versions of Fortran.
> Because the data types and control structures provided by C are supported directly by most existing computers, the runtime library required to implement self-contained programs is tiny. On the PDP-11, for example, it contains only the routines to do 32-bit multiplication and division and to perform the subroutine entry and exit sequences.
Notice the way this paragraph frames the issue. Since the features of C are supported by most machines, it can be implemented with a small runtime library. It does not say that C must use those features directly. In fact, it explicitly anticipates that C could provide some of its primitives in ways other than directly leveraging a corresponding capability of the hardware. The feature here is efficiency and a small runtime.
I'm not sure where you got that idea. C was never ever a macro assembler, and its antecedent language (B[1]) is actually even more abstracted away. C was a slight lowering, not a raising, of hardware abstraction.
If C were a macro assembler the following things would be true of it, most likely:
- types would have fixed bit sizes (ie. uint32_t would be the norm)
- you'd be able to do various 'normal' things for assembly like rotate left and right.
- inline assembly would probably be.. you know.. standard, instead of something that compiler vendors only started adding in like the 80s.
You might be thinking of the fact that C++ was originally sort of a Macro C (aka Cfront)?
[1] Ever wondered why `int` is so fuzzily defined in C? It's because in B there was only one type and it was always word sized. That's also why various things are typed `int` by default if you don't specify.
History: C was essentially PDP11 macro assembler. There are so many assumptions in C that basically came from the PDP11. C was originally a very utilitarian language to make writing unix easier. That it became massively popular was more of a happy accident.
I'm sorry, but this is just a gross mischaracterization of what a "macro assembler" is. I agree it was utilitarian and made to make unix easier and that many of its core assumptions were based on the platform of the pdp11, but it never did in any way resemble a macro assembler as far as I know.
And in fact, as gizmo86 pointed out, one of the explicit goals of C in terms of "making unix easier to write" was, afaik, that it would allow the code to be more portable to other architectures, because as the 11 in the name implies, the history that led to the creation of unix was not short on new architectures, and one of the key weaknesses of Multics was, afaik, the fact that it was rigidly written specifically for one specific architecture.
By this logic nearly every programming language is just a macro assembler at a certain level of (non-)complexity. All compilers do is translate syntax (macros?) into machine code (assembly?). So, is Pascal also a macro assembler?
Anyways here is a list of instructions on the pdp11 that you can't directly represent even in the earliest versions of C to my knowledge:
ROR
ROL
SWAB
SXT
ADC
SBC
BVC
BVS
BCS
MARK
HALT
WAIT
RESET
MTPD
MTPI
MFPD
MFPI
MTPS
MFPS
MFPT
CLC
CLV
CLZ
CLN
CCC
SEC
SEV
SEZ
SEN
SCC
That's the distinguishing factor of a "compiler" vs. a "macro assembler". A macro assembler is a layer on top of an assembler that still allows you full access to the instruction set and allows control flow that most compilers would consider unsafe. C was never that. It wasn't derived from a language that was that and it didn't, to my knowledge, allow that even on pdp-11 unix.
I'd be really happy to see some code from one of the code archives of unix' history that proves me wrong there, though, if you have it.
Unfortunately, classes teaching C often start with telling people that with C they learn how the real machine and the real CPU works on the lowest levels.
I'd be surprised if this was true. It certainly doesn't match my experience, and it also doesn't really make sense. The difference between machine code, assembly, and programming languages like C is first-week CS 101 stuff. What are you basing this on?
Have any good sources of this topic of macro assemblers? I'm interested in reading about this type of thing. I'll search Google, but you can't always find good sources there today.
I searched a bit, and found a large set of pages with information on the topic in the context of the 360 architecture. I'd say this architecture is unique in that it is still to this day used to write high level software.
Put "360 assembler macros" in your search engine and you'll find lots of information.
"run as written" implies targeting an abstract machine that performs operations the way programmer imagines it does. This machine isn't specified anywhere.
The hard part is that programmers don't imagine the code to be executed in a naive unoptimized way, but still expect it to be reasonably well optimized (converted to "obviously" equivalent assembly), except where optimizations combine into a result they haven't thought of.
This is more of a multi threading & ordering topic than a compiler topic. There were never any guarantees about another thread discretely observing those two observable operations in the first place, due to the memory model.
Not to mention that a big part of the confusion is not even the compilation - it's not beyond possibilities to inspect the machine code - but the processor itself still does a lot of unintuitive stuff, even more so with concurrency. I wish the tooling you mention existed, but also some microcode simulation/implementation to look beyond that.
I’m not sure what you're thinking of, but CPUs are guaranteed to run as if the given instructions were executed one after another, even though in practice they are free to reorder them slightly.
I've dealt with many CPUs designed to run out of sequence and to guarantee that. I've dealt with multiprocessor computers where order is not guaranteed at all outside the use of hardware interlinks. I've dealt with embarrassingly parallel processors where instruction order is orthogonal.
Perhaps you're thinking of the simple abstract CPUs used in "introduction to computers" courses instead of the real-life silicon used in most applications?
Consider the reverse. What if the interleaving was undesirable, would you like some way to be informed that it’s impossible so you don’t have to reason about it?
When trying to reason about code, finding something that looks like a bug but isn't can result in a lot of wasted effort.
That information is moving in the wrong direction. I don’t want the programmer to be telling a compiler what to do, I want the compiler to inform the programmer what it did.
> That information is moving in the wrong direction.
You set the direction not me! You asked me 'would you like some way to be informed that [the interleaving is] impossible'. Yes... and the user gives me that information through a synchronised block. If they don't tell me they want an interleaving to be impossible then why should I not make it possible? There are countless other interleavings I make impossible or possible when I do any trivial compiler task such as scheduling. Should I consult the user on all of them? I don't think you'd like all the questions you'd get!
Informing the user isn’t about what you did, it’s about the user having some way of knowing what you did. At least compilers give you assembly code to look at; CPUs are little better than black magic from a programmer's perspective.
We don't - and in this example we'd combine atomics rather than guess that the user wanted the thread to yield between them, since they never told us that.
That's not true, you are making a guess: that the user doesn't care about the intermediate stage. Not guessing would mean generating the code as specified. It doesn't matter if you think it's a bad idea; the whole ethos of C is to trust the programmer.
> you are making a guess: that the user doesn't care about the intermediate stage
That's not a guess - the user didn't put anything in between them so we know they haven't required them to be separate.
Keeping them separate would be making the guess - a guess that they need them to be separate. Why should we invent that constraint when the user never wrote it?
> Not guessing would mean generating the code as specified.
We are generating it as specified - nothing in the specification says atomic actions have to be run independently. Why should I invent extra constraints that don't exist in the specification?
Oh come on, that language lawyering is sophistry. If I write 4 instructions that are "atomic", I expect 4 instructions. If I end up with one, they weren't really "atoms", were they? I do not expect what you just said. I may think, gee, perhaps I don't need 4 instructions, but what you are doing is undeniably changing the meaning of the program, even if you think it's close enough. The fact that there's an article about this tends to agree with me that most people think: wtf happened?
> If I write 4 instructions that are "atomic" I expect 4 instructions.
Why do you expect that? Nobody ever promised you that. If you expect something you weren’t promised then that’s on you.
> If I end up with one, they werent really "atoms" were they?
Atomic has a technical definition - it means either applied fully or not at all. That doesn’t preclude combining.
> what you are doing is undeniably changing the meaning of the program
It doesn’t! According to the spec we’re both working off. If you think the spec is wrong then change it. But I can’t guess what you want unless you tell me using the spec.
I just assumed that compiler writers had empathy for humans, my bad.
> It doesn’t! According to the spec we’re both working off. If you think the spec is wrong then change it. But I can’t guess what you want unless you tell me using the spec.
How big is the spec and how many working programmers do you think are super intimately familiar? Do you think this article would be on the front page if this information wasn't surprising? Are you building a compiler for humans, or are you just chasing metrics and getting the smug satisfaction of telling users "I told you so" when shit doesn't work in an intuitive way?
If you want to beat me over the head with "the spec says", you win, master language lawyer. I build things for a living and I like it when the code I specify is the code that's generated. You can blame users all you want, but maybe after the trillionth zero day exploit because nobody gets these things right you might consider empathy and developer ergonomics.
> I like it when the code I specify is the code that's generated
But you aren't specifying it!
You're asking me to follow extra constraints that you haven't written down and we haven't agreed. Don't you think it's unreasonable when you say it out loud like that?
If you want extra constraints on top of what is already agreed in the spec then we can do that, but you have to ask for it and you have to bring me a workable spec. You want me to 'intuit' but really you want me to read your mind and follow a lot of unwritten rules!
I think we have a fundamental difference on what a compiler should do. I want the compiler to take my exact instructions and generate the best machine code it can that follows those exact instructions. You want to rewrite my program to be more optimal. The tension here is that, even though I'm sure you can do that, I have to debug the result, and the result I have to debug is now stripped of symbols and super hard to relate back to my original source code.

So OK, I am only adding one constraint, but it's a big one: don't ever change the line-by-line meaning of my source code. If I wrote a line of code, I intended for it to be executed. I don't want my memsets removed or my atomics collapsed, not because you can't do it, but because I need to debug it.

I don't think taking advantage of undefined behavior to create optimizations is useful, because being wrong 10% faster isn't really that useful. And I think optimizations like removing memsets are a really bad idea: not only do they potentially make people's data insecure, but the only people who are ever going to find a bug like that are people looking for vulnerabilities.
And frankly, here's the thing: I don't care that much about most of your optimizations. If I'm making my code fast, what I'm fretting about is designing my data structures to fit in cache, or using SIMD instructions, or offloading work to the GPU or other coprocessors. Screwing with the meaning of my code just makes my life harder.
But the meaning of your program is not being changed or screwed with. The optimised version implements the same meaning as your original program.
The problem is you don't understand the meaning of what you've written. You think the language has extra meaning that it doesn't. You can already get the meaning you want through existing constructs like volatile and yield and memset_s.
> The problem is you don't understand the meaning of what you've written.
This is the problem with the attitude of compiler writers. Of course there are bugs. Nobody's perfect. Pobody's nerfect? If I accidentally left a sign overflow in the code, or something of that nature, I think any human reading the code would understand what I meant, but the compiler is like "Let's get weird!". Yes, I CAN get the meaning through the existing constructs. That, however, is not my point. I am perfectly skilled and capable of doing this. What I'm questioning is why tool vendors think it's useful or practical to take advantage of users' mistakes. I'd rather have a diagnostic of "why are you adding these numbers multiple times?" than have the compiler silently rewrite something I did on purpose.
I'm not writing this in the abstract; I'm writing this because things like this have hurt me in the past. It's super annoying. I don't care about your optimization to make things five percent faster if I spend three days debugging why the release build crashes if the window is full screen, but the debug build runs fine 100% of the time. The code has a bug, I own this, but I don't see how the attitude of "if your code has a bug, we will turn it into fucking dragons" is even remotely useful. Usually it's not even my code I'm fixing, it's someone else's. I just want people to stop breaking things to meet a dumb benchmark.
It sounds like you're railing against optimising based on undefined behaviour. That's a separate thing - that's not what's being discussed in this article or thread - this is all well-defined behaviour.
> I want the compiler to take my exact instructions and generate the best machine code it can, that follows those exact instructions.
Two problems.
1. Your definition of "exact" is not exact. It is exact in your mind, but it isn't going to be precisely the same across all developers, and compiler authors cannot reasonably develop hard requirements from this. This is especially true when targeting different machines. Operations on a uint64_t aren't going to look the same on a 32-bit machine, for example.
2. Attempting to do this makes your programs slow as hell. You can claim that compiler optimizations don't matter... but you are free to deploy debug builds if you really think this. C and C++ especially aim to provide a sort of ambient performance where everything is fast rather than just providing speed for hot loops. People complain nearly daily about modern applications being slow and then turn around and say we should throw away decades of compiler optimizations.
"Can the compiler inline this" is like 5000 on my list of things I care about. Compiler optimizations are mostly only useful to people who werent paying attention in the first place.
> "Can the compiler inline this" is like 5000 on my list of things I care about. Compiler optimizations are mostly only useful to people who werent paying attention in the first place.
Okay, you're definitely severely underestimating how much slower everything would be if compilers weren't allowed to optimize anymore.
Yeah. For somebody who really cares a lot about cache misses he sure doesn't seem to be concerned about basic things like leaving virtual function calls in place.
-O0 doesn't necessarily change this. After all, optimizations aren't allowed to violate the spec (if they do, they're buggy). So any code that can be output by an optimizing compiler could (in principle) be output by a non-optimizing compiler.
I’d be curious to know what benchmarks you’re using on which O0 is only 5% slower. I suspect your program must be almost entirely IO-bound, if this is the case. In that case, Python, Ruby, or shell probably wouldn’t be much slower either.
If you're going to use atomics, you should be reading what guarantees those atomics actually provide. In this case of memory_order_relaxed they pretty clearly specify they don't provide any cross-thread synchronization. So if you wrote those 4 atomics expecting something the docs explicitly & clearly say it doesn't provide, that's pretty much entirely your fault and not "language lawyering"
'Atomic' isn't uncuttable in the sense that it is irreducible, but in the sense that it cannot be stopped/does not produce intermediaries. If you write four atomic instructions you should expect that they will not be interrupted, nothing more. Of course I'm not an expert, but I think that is where your misunderstanding has originated.
Did you even read the post you're replying to? "Uncuttable" can apply to multiple properties of an instruction's execution - "atomic" colloquially refers to one of these properties.
You seem to want "atomic" to mean "the data that corresponds to the instruction cannot be divided" but it means "the execution of the instruction cannot be divided", those aren't the same thing.
The four hardest problems in computer science are cache invalidation, naming stuff, and off by one errors. Please accept that your understanding was flawed and have empathy for the people who named stuff.
It can be tricky in that fuzzing your program to discover race conditions seems to behave perfectly on one compilation target where these atomics are detected and reified, while you later discover on another platform surprisingly your app crashes 20% of the time.
Ideally we test our applications on all hardware, with all configuration permutations etc. etc. In practice we do rely on our compilers translating our intent accurately and sometimes such edge cases matter.
Compatibility is a tricky thing. It's kind of like the argument whether adding optional arguments to a function breaks BC. It doesn't break BC if you don't pass those parameters. But if for some reason you were passing extra parameters hoping they'd be ignored (for example as a handler some other place) then adding optional parameters WILL break BC and cause your software's behavior to be undefined.
It is not about translating intent accurately. What you need is for compilation to be fully specified. Anywhere there are multiple correct ways to compile something introduces a risk of bugs that only show up with some compilers.
If you are a compiler or library writer, one solution is to avoid having useful properties that are not part of the spec. For instance, Go does not guarantee any particular iteration order for hashmaps; so they go out of their way to randomize the iteration order, thereby preventing developers from writing code that depends on a deterministic order.
In the case of threading, what you would need to do is essentially have a compiler/runtime that goes out of its way to order and time operations in a random/adversarial manner.
I've seen research that looks into doing this in a VM environment; which would be inhibited by the type of compiler optimizations being discussed. And others that modify the compiler itself to replace the concurrency primitives with runtime functions, that can then execute them in a fuzzed order.
Ultimately, fuzzing and testing can only give you confidence that what is being tested is mostly correct. It can never give you confidence that what is written is entirely correct. If you want confidence in the latter, you need to invest in some form of static analysis (which could either be built into the language, such as a type system, or be an external analysis tool). Ultimately, writing even a moderately complicated program (by modern standards) with full confidence in its correctness would involve advancing the state of the art of the field by decades (if not longer).
For the most part, the field just doesn't care about programs being fully correct; and accept it as a fact of life that going onto new/untested platforms and configurations will introduce/expose bugs.
> In the case of threading, what you would need to do is essentially have a compiler/runtime that goes out of its way to order and time operation in a random/adversarial manner.
> others that modify the compiler itself to replace the concurrency primitives with runtime functions, that can then execute them in a fuzzed order.
Loom[1] is somewhat of that. It's a testing system (not a runtime for a full app) which tests multiple permutations of program execution (all possible permutations I think, limited by test case complexity or an optional "maximum thread switch count"), as well as modeling atomic/weak memory effects to some degree.
Very interesting. I would love to see tools emerge out of that research. In my experience one of the basic problems with testing/verifying concurrent code is that it is often difficult-to-impossible to perturb a particular part of the system, to exercise different orderings, and thus find corners.
Can a different thread observe x and y only incremented once? Before the optimization yes, it's possible. After the optimization, no, it's not possible. The compiler removes a possible observable state in the course of optimization, and that occasionally surprises people.
They tell me they're not happy with 0% chance of some interleaving they need. Are they happy with 0.000000000001%? I'm not sure I understand the practical difference between that and 0%? Can I tell them it is possible, but rely on the death of the universe happening before they have a right to complain it didn't turn up yet? If they have some hard minimum they require, then what do they expect me to do? Completely abstract from the system scheduler to make it happen?
Doesn't seem a reasonable expectation from programmers on compiler writers, and likely wouldn't really be what they wanted even if I made it happen.
It might be 0.000000000001% but it need not be, right? That's kind of the point. I'm pretty sure I could write you a program where the probability of observing that would be (say) 50% without the optimization, and with the optimization that would go to 0%, defeating the program. Should that be allowed?
This is basically a statistical/probabilistic version of another problem: should a compiler be allowed to change the time complexity of an algorithm? If I write linear search, and the compiler proves the list is sorted, should it be allowed to switch to binary search—or vice-versa? Traditionally the answer might be 'yes', but it's not exactly hard to argue an answer of 'no' might also be desirable in a number of situations. Now in this case, it's not the time complexity that's changing, but the statistical properties of the program... and while in some cases it might be negligible (1E-10 or what have you), that's not necessarily the case, at which point a difference in degree really can become a difference in kind.
But no mechanism was ever guaranteeing the 50% in the first place. So their program could already stop working correctly without any changes and with the same compiler. What I'm saying is I don't know how they'd even tell the difference between my compiler being the problem, or them just being unlucky and not getting the interleavings they want even though they're statistically likely.
I'm saying if you made a compiler that reduced that expected 50% to roughly 0%, it wouldn't matter to the user if it was 1E-10 or exactly 0%. Both could be undesirable for exactly the same reason. Again: you can complain that the 1E-10 compiler is still "correct" because there was never a 50% guarantee, but in doing that you're completely missing the point.
Here's an exaggerated example to get the point across. Imagine your Tesla autopilot's accuracy suddenly dropped from >99.9999% to <0.00001% overnight because of some stupid optimization (maybe this one) that was technically correct. Assume there was never a guarantee provided to the user that it would stay this accurate. Now if your 99.9999% became 99.9998% overnight, OK, I think a sane user would completely anticipate that and live with it. But when it drops orders of magnitude lower, do you really think you could shift the blame to the user because you were "technically" not guaranteeing anything? Is it really the user's fault that your optimization reduced that accuracy to 0.00001% and killed them? Can you walk away with your conscience clear at that point that you did everything 'right'? At what point do you feel like some kinds of changes are just blatantly unfair and uncalled for, even if they're technically "by the book"?
Tesla's autopilot has a guarantee for its accuracy. If that drops to 0.00001%, then that is a problem, because the autopilot functionality is failing to meet its requirements. If that drop in accuracy happens because of a technically correct change, then that just means the actual bug was somewhere else and just happened to work previously. It also means that there is a flaw in Tesla's quality control process that allowed such a change to ship to customers.
If your program breaks because the compiler no longer interleaves atomics, then your program was always broken, and you have just been getting lucky until now. Normally when this type of situation comes up, the programmer's perspective is at least understandable, since what they were doing was somewhat reasonable, even if not allowed by the standard. In this case, the point is that I (and, I presume chrisseaton) cannot come up with any reasonable program that would rely on an interleaving happening 50% of the time.
If the interleaving were originally observed 99.999999% of the time, I could imagine a program relying on it happening. However, if the writes are close enough that the compiler could optimize them together, chances are the pre-optimization probability of it happening was already much closer to 0% than to 100%; which, again, begs the question 'what was your program possibly doing that relied on the original behaviour?'
> Tesla's autopilot has a guarantee for its accuracy.
Which (if true) is completely beside the point. Just pick something that doesn't have a guarantee. The comment wasn't about Tesla. It was about empirical performance vs. written guarantees.
> Normally when this type of situation comes up, the programmers perspective is at least understandable, since what they were doing was somewhat reasonable, even if not allowed by the standard.
Okay great, we agree on that.
> In this case, the point is that I cannot come up with any reasonable program that would rely on an interleaving happening 50% of the time.
That's easily explained by our lack of imagination. No need to assume the person relying on it is being unreasonable.
Hypothetical example: suppose two threads are trying to alternate running some specific instructions with minimal latency. Maybe there are hardware requirements (e.g. they want to avoid heating some specific components of a CPU core for too long?), so they try to alternate work via an atomically increasing counter? That's a starting point at least; feel free to improve the scenario.
> so they try to alternate work via an atomically increasing counter?
Then why did they ask for an atomic that doesn't provide that instead of one that does? The docs on memory_order_relaxed are pretty clear it provides no thread synchronization of any kind: https://en.cppreference.com/w/cpp/atomic/memory_order#Relaxe...
This isn't some undefined behavior quirk being used to optimize, this is well-defined behavior being optimized in a well-defined fashion.
> The docs on memory_order_relaxed are pretty clear it provides no thread synchronization of any kind
No, that's not what it's saying. "Synchronization" is an imprecise term, and you're misunderstanding what it means here. Atomics (as is literally in their name) perform each operation atomically; by definition they provide at least that thread synchronization guarantee, even if you believe they provide nothing else.
The documentation explains what "are not synchronization operations" refers to in the subsequent sentences; it's most certainly not saying it doesn't provide thread synchronization "of any kind" (again, I already listed one above). In fact, it lists 2 specific kinds of thread synchronization it does provide: "atomicity" (i.e. operations are all-or-nothing, as above) and "modification order consistency" (i.e. alterations of that same variable occur in a consistent order across different threads, even under std::memory_order_relaxed). What it does not provide is a consistent ordering across different variables ("an order among concurrent memory accesses"), which is fine as it's not something we need in this example.
Collapsing multiple atomic operations into one has nothing to do with ordering, nor with multiple locations in memory. In fact, I don't think even std::memory_order_seq_cst can prevent this. The only way I can imagine preventing it is to mark atomics as volatile, and even then I'm not sure if it's guaranteed under the as-if rule.
volatile should work. Each volatile access is a side effect from the point of view of the abstract machine and compilers are supposed to document this mapping to the concrete CPU (and usually is the obvious one).
In fact the fact that volatile atomic has additional guarantees is a strong reason for plain atomic not to have them, otherwise there wouldn't be a way to express the more relaxed requirement.
I think atomic not being volatile by default is a good default: the majority of code doesn't need to care about atomic in the first place, for the majority of the remaining code, these sort of optimizations are benign and desirable. For the remaining code that cares, volatile is the safety hatch.
I'm not so sure volatile is sufficient... I expect it would depend at least on how the variable is allocated/stored. Merely putting volatile shouldn't be sufficient to prevent access to it being optimized out. I forget if this was disallowed since C++20, but consider the program 'int main() { volatile int x = 0; ++x; }'. I'm fairly certain the compiler can optimize out the body despite the use of volatile, because of the as-if rule: it would be impossible to tell the difference from within the program.
Moreover, I'm not sure 'volatile' memory access without some sort of external linkage is even meaningful. The idea of volatile is that access to a volatile variable could have observable side effects outside the program, but that discussion seems rather moot for a variable that doesn't have connections to the outside world to begin with. Heck, by the same reasoning, even for a variable with external linkage, the rules might still permit it to be optimized out under link-time optimization, unless it has been declared to be externally reachable somehow (e.g. at a fixed/exported/otherwise 'known' address)... otherwise it's by definition externally unreachable and thus again unobservable under the as-if rule.
At least, I could see the reasoning going that way, and that's how it makes sense to me as far as the abstract machine goes. I can also see people disputing it though, and claiming (say) a debugger inspecting a program is sufficient accessibility for considering side effects, or something like that. I'm not sure I agree with every objection, but I see room for debate. It's just not clear what the implications are under the as-if rule.
The ++x is a side effect from the abstract machine point of view. The issue is of course how the compiler maps the abstract machine to the real machine. Volatile is (purposefully) underspecified and I guess it is not completely unreasonable for a compiler to declare that a local volatile object whose address is never taken does not have a mapping at all.
There is little value on explicitly optimizing volatiles but of course that can happen as a side effect (ah!) of unrelated optimizations.
I rarely use volatile at all, but when I do, I do make sure not to apply it to local objects.
edit: I think the most reliable and portable way to make sure that the address of a variable escapes is something like this:
Yeah, I think this is probably sufficient without whole-program optimization, but I'm not sure it is with LTO. It might be sufficient today, but I can imagine Clang optimizing it away eventually. Really, the only realistic-yet-futureproof method I can imagine (and I'm not sure if it's even within the jurisdiction of the language spec) is either (a) exporting the variable, or (b) exposing its address through some other established I/O mechanism (like printf).
I guess I don't know why the user would be betting on interleavings which I never said I could guarantee. If you build your car with a part that isn't rated for what you use it for, that's your fault, not the part manufacturer's - even if it worked every time when they tested it and they never saw it fail before the one time it kills someone.
I'm open to ideas on how to create understanding both ways about what the user wants and what the compiler is able to provide.
> I guess I don't know why the user would be betting on interleavings which I never said I could guarantee.
Same reason people walk outside without bodyguards and "bet" on coming back home alive despite the fact that nobody ever gave them such a guarantee. It's not insane to expect the future to bear some kind of resemblance to the past without some contract to guarantee it, is it?
> It's not insane to expect the future to bear some kind of resemblance to the past without some contract to guarantee it, is it?
Well you used the example of a car, and I think it would be insane to use a component in a car when the manufacturer tells you it's not designed to do that and that they can't possibly guarantee it's going to work just because you think you tested it and it seemed fine.
I don't get why you're sighing at me when I'm answering your questions politely. If you don't like my replies feel free to stop asking me questions.
Your example about going outside isn't covered by a technical specification with a long list of guarantees formally agreed between two parties so I don't really see how it's comparable.
The question was about whether the original designer deserves any blame for making a sudden change like that. Not about whether the manufacturer is at fault vs. the user vs. a bunch of other middlemen that weren't in the story. That twist just derailed the discussion, and it's frustrating to discuss something when the other person is so keen to avoid the crux of the question. So much so that you didn't even try to address everything else I wrote.
> The question was about whether the original designer deserves any blame for making a sudden change like that.
I don't believe so - as long as there is a version change and it's documented in the change log, which I aim to do with optimisations I write.
If you're using undocumented apparent behaviour, please check the change log when upgrading, or ask us to document the behaviour. But as already described, we may not be able to guarantee what you want while keeping other things you depend on like performance!
> If they have some hard minimum they require, then what do they expect me to do? Completely abstract from the system scheduler to make it happen?
What if they say the hard minimum is "occasionally"?
They're not holding you responsible for the scheduler, they just want you to leave it alone as much as possible. (and not be deliberately malicious, just in case that needs to be said)
And sure, there should be a way to signal intentions. But my point is that there is a reasonable way to describe what certain people would expect by default.
But is it possible? The CPU can, in principle, coalesce writes, especially to the same cache line, so on a given microarchitecture the intermediate value might actually be impossible to observe even without compiler optimizations.
It's because people still confuse std::atomic with volatile. Actually, there is a way to get both behaviors: 'volatile std::atomic'. Personally, I never had a situation where I would've needed that, and I use atomics on a regular basis.
I think at least regarding concurrency, most developers need to rely on a model that is "good enough". You can't know everything, therefore you need to contend with knowing at least enough that is necessary to do your work.
Thinking that atomics will not be optimized by the compiler is "good enough" in my opinion for almost, and I really mean almost, everybody.
Heck, I interview people and atomics optimization is one of my staple questions (Tell me, what does volatile do in Java, exactly?). I have never heard a thorough answer to that question and yet people do work as developers and do produce working applications.
The only real users of these optimizations are atomics primitives developers, and they use them to improve those primitives a tiny bit in some special cases.
You can technically contrive a test where you run an update that changes something to 1, then immediately to 2 on each iteration. Then you try to look at it from another core and you would expect that at least once in a blue moon you will see 1. But I just don't see any sanely written application to rely on that kind of behavior.
If they need some interleaving to happen at least n% of the time for their program to work... what am I supposed to do? Because it was never guaranteed to be interleaved at least n% of the time, even when the atomics were not merged. Do they want me to bypass the system scheduler and run threads concurrently with forced interleavings against a statistical distribution? I doubt they want that. So what do they want?
> int rlo() {
> // Dead store eliminated.
> y.store(0, std::memory_order_release);
> // Redundant load eliminated.
> x = 1;
> return 0; // Stored value propagated here.
> }
This is actually wrong, for subtler reasons than one might initially think[0]. Consider the case where rlo is called with x!=0,1 and y!=0. Another thread that observes y=0 is guaranteed not to see the original value of x - it might see 0, it might see 1, but something has to have been written to x by the time it observes y=0. The above optimization allows even a single-processor execution to context-switch at "// Redundant load eliminated" and observe y=0, x=??? from another thread.
Oddly, the correct code seems to be:
x = 1;
y.store(0, std::memory_order_release);
return 0;
On the face of it, the load acquire seems like it would prohibit that, but that's only a problem if it (the acquire) observes some release with which x = 1 can't be reordered. But it's hard-coded to always observe the y = 0 release, so that can't happen. (I'm less sure that this is correct than that the previous version is wrong, admittedly.)
0: In particular, it's not just a matter of "x = 0 has to happen because store-release".
I'm fairly sure it's correct, precisely because of the initial store `int x = 0;`, for the reason given by the article. If that were `int x = 2;` instead, then I agree that the transformation would be invalid.
(Perhaps this is what you meant? but I wanted to clarify.)
In case anyone is curious, here are the relevant parts of the C++ reference:
> Absent any constraints on a multi-core system, when multiple threads simultaneously read and write to several variables, one thread can observe the values change in an order different from the order another thread wrote them. Indeed, the apparent order of changes can even differ among multiple reader threads. Some similar effects can occur even on uniprocessor systems due to compiler transformations allowed by the memory model.
> If an atomic store in thread A is tagged memory_order_release and an atomic load in thread B from the same variable is tagged memory_order_acquire, all memory writes (non-atomic and relaxed atomic) that happened-before the atomic store from the point of view of thread A, become visible side-effects in thread B. That is, once the atomic load is completed, thread B is guaranteed to see everything thread A wrote to memory.
> The synchronization is established only between the threads releasing and acquiring the same atomic variable. Other threads can see different order of memory accesses than either or both of the synchronized threads.
Right, but writes to x from another thread (i.e. by a thread other than the one in which rlo is called or the thread reading y) aren't required to be visible when y is read.
Edit: ah, I think I understand what you're saying – if a thread that calls rlo also modifies x elsewhere, beforehand, then yeah, the store can't be eliminated.
That depends. rlo() is not the only function that is allowed to have synchronization primitives. It is possible that before rlo() is called, some other code modifies x, and all threads use some synchronization primitive to guarantee that they observe it. There is no notion of lexical or compilation unit scope for synchronization, so (unless it is doing whole program optimization) the compiler has no way of knowing that no other writes are visible after initialization.
Agreed, but we might be stepping outside the scope of the example. Likely, you can make most examples incorrect if you're allowed to add arbitrary code; I read this section of the article with the implicit assumption that "the thread calling rlo() performs no other writes to x" and "there's no additional synchronization", since the author didn't put any such code in their example and only claimed that the optimization was possible for the example they gave. (This is Grice's maxim of quantity, I suppose?) Although they probably should have been more careful and clarified that the example they gave doesn't work regardless of context.
I'd love to have a compiler that just tells me what optimizations could be made, but let me either do it myself (if syntactically available in the language), or explicitly mark the optimization as ok through a #pragma or something like that. I just think having to read the disassembly to figure out what happened isn't a great user experience.
Maybe not, but I feel like as a C programmer I'm already kind of a control freak, and I really want to be able to tell the compiler what I want and where, I don't want magic. I'm using C explicitly because I want to know and control what's going on.
If I'm dealing with atomics and concurrent code I really want to tell the compiler, hey, inline constants maybe, but don't reorder or combine operations. Even if the compiler is "right" or my code is subtly wrong, I don't want the compiler making things worse.
I hope you're not programming for anything more advanced than a Z80, PIC16F or 6502 then because pretty much any CPU more advanced than that will likely also do that internally.
Yeah. What good is demanding that the assembly has a certain structure if that doesn't mean that the machine actually does it that way? Out of order execution and speculative execution are optimizations that improve performance by 10x or more. They aren't optional.
> If I'm dealing with atomics and concurrent code I really want to tell the compiler, hey, inline constants maybe, but don't reorder or combine operations.
If that's what you want, then declare your variables volatile (in addition to being atomic, not instead).
Atomic instructions wouldn’t be reordered like that on a CPU, that’s the point of using them.
Non-atomic ones might be, and volatile ones would be, since volatile is only a C concept. Although it wouldn't be through instruction fusion, more like non-coherent caches.
Doesn't do exactly what you want, but try -fopt-info-all in GCC. Gives you a reasonably informative guide of what it tried to do and whether it used the transformation or not.
This would be fun, and it can be made available from the major compilers, but using it almost certainly bad for most developers. Any nonstandard decisions you make here are likely to be unstable. Even if they are better for one build, they will sit in a config file and rot to the point that you've sabotaged your future performance by making decisions for an older build. I see this all the time with nonstandard flags for performance. Somebody five years ago had a problem with the VM and needed to change the way GC worked and set a flag. Nobody remembers why and they are all afraid to touch it. Now the application is getting worse performance.
I too was briefly worried that Ix was up to no good with their thinking machines. Honor the Butlerian Jihad! This kind of optimization should be left to the Mentats.
Writing correct, even if non-optimal, code that uses atomics is already highly non-trivial. Transforming code that uses atomics into more efficient forms while guaranteeing the same behavior is even more non-trivial.
>Get back to work, there’s so much more to optimize… and so much code to break! Help users write good code: the compiler should provide diagnostics when it detects anti-patterns or misuses of atomics.
Is it worth putting effort into compiler optimizations instead of putting more effort into providing "diagnostics" and information for the programmer?
Compilers are sane and written by very smart people. They do have bugs just like any other software, but in my 20+ yrs of coding, I can count the number of compiler bugs I've encountered on one hand.
Calls to memset tend to be optimized out in the dead-code-elimination pass, which relies heavily on SSA. Compilers adhere to the abstract machine spec. If you want the write not to be optimized out, use memset_s, whose removal the spec explicitly disallows.
If an optimization relies on UB and surprises the programmer, then the programmer was also relying on UB in the first place.
The problem with programming is almost never the lack of smartness. It's almost always the opposite: People being too clever.
The sane thing in a realistic scenario is to ban all usage of memset and add a diagnostic to use memset_s instead, like any responsible programmer should do with strcpy. But then, was it really wise to make this machinery necessary, when the people who need it most are the most likely not to have it? I really, really don't think so.
Edit: Here is some code written by a very smart person, Niels Ferguson. I think (not sure) it tries to wipe sensitive material from memory and gets it wrong. Really not sure, tho.
There's absolutely nothing wrong with using memset. I don't understand why you would suggest such a ban. If your goal is to scrub a memory region after you're completely done using it then you shouldn't use memset.
Compilers won't optimize out memset if it changes the program's behavior within the boundaries of abstract machine's specification.
A call to memset to clear a memory region after you're done with it because a password was stored there can be optimized out because it doesn't change the behavior of the program.
The code you shared is incorrect if your definition of correctness here includes K being wiped out.
> A call to memset to clear a memory region after you're done with it because a password was stored there can be optimized out because it doesn't change the behavior of the program.
Well, that does change the behavior of the program. After all, the behavior of the program was to wipe the memory. The programmer specified it. The compiler is clearly wrong by any reasonable measure. It's removing an intentional, clearly specified instruction. It's just wrong. It's making an "interpretation" of something the programmer intended as a fact, and breaking things in the process. Why does anyone think this is desirable? Any good programmer knows that correctness comes before speed.
I've been programming C++ since 1998 and I didn't even know this till reading this thread. Blaming the user here is counterproductive. Compilers should not make this optimization, plain and simple. memset should set the memory. If you think otherwise, I invite you to take responsibility for all the users data you have placed in harms way.
Btw to the downvoters, explain to me why a function that is clearly intended to create a side effect that is visible outside the process, i.e. something you can't reason about, is safe to remove? Memory debuggers are not new or exotic tech. Why do you think it is reasonable to remove something a programmer clearly thought should be done?
> explain to me why a function that is clearly intended to create a side effect that is visible outside the process
This is simply not accurate. memset_s provides that guarantee. memset does not. It would appear that you have been under the impression that it did. However, it did not.
It does not. Almost all memset_s implementations are broken; just not mine. memset_s merely ensures that an unsane compiler will not optimize it away, whilst it should also ensure that it is fenced. Otherwise you can still read the memory from the cache.
Calling the ordinary memset_s secure is a bug, it's a mere convenience function.
Doesn't it seem bad that the default that's been taught is wrong? Admittedly I haven't read a C programming book since 1998 because I've been coding that long, but it strikes me as incredibly fucked that you need a separate memset_s to mean "I'm not fucking around". When I learned C, memset just meant, uh, set this memory, it was pretty straightforward.
There's a big difference between you not liking something and it being wrong. You want to make the common case slower for everyone so that you don't have to learn one extra rule for your uncommon case.
> it strikes me as incredibly fucked that you need a separate memset_s to mean "I'm not fucking around". When I learned C, memset just meant, uh, set this memory, it was pretty straightforward.
I don't think it's ever worked the way you want it to work.
So you are saying that we should throw away every single optimization and go back to C being a fancy assembler? If we consider the memory layout inspectable at all times, pretty much any optimization is impossible. Also, the compiler produces code for the CPU, not for a debugger; there you can't implicitly view memory at arbitrary places.
I personally prefer to write code that is readable to humans and gets translated to the optimal instructions, over having to learn tricks known only by some elite C hacker that will actually be inferior, since CPUs are finicky. Even with a very good understanding of the CPU, one cannot really guess which instructions are faster to execute without benchmarks.
Nonetheless, there are tiny C compilers that do next to no optimizations, feel free to compare the speed of the resulting binaries.
memset is not clearly intended to create a side effect that is visible outside the process. In fact, setting memory is not a visible side effect from the C abstract machine's point of view.
The compiler cannot read minds; if the programmer clearly thought the code should behave in a certain way, they should have told the compiler in a language it understands. Volatile would probably be part of that solution.
That's the thing: we live in the real world. Yes, the C abstract machine, blah blah blah. Ok, but I called memset for a reason, and the compiler writer, being also of the real world, should realize maybe I'm deleting sensitive data? And yes, I know there's memset_s, but it strikes me as bad prioritization if I have to seek out a secondary function to do what I mean.
How can the compiler realize you are deleting sensitive data? The compiler is dumb and just applies a bunch of rules; it has no way to infer intent from your code unless you explicitly tell it so. And no, despite the name, memset does not mean "set the memory to 0" any more than int i = 0 means "set the memory to 0".
BTW, I think memset_s is also problematic: there is nothing preventing a compiler from storing a copy of your data anywhere else. At the very least is likely to be in registers, possibly partially spilled on the stack.
> how can the compiler realize you are deleting sensitive data?
It can't, so don't remove my memset. Why is this hard. Compilers are dumb, yes, so don't delete the thing I wrote on purpose.
You are right about the definition of memset. It's unreliable. My point is why is that ok with anyone sane. If I clear 128 bytes on a stack frame that's returning, sure, it's not optimal, but the compiler doesn't know why I did it and I surely did it for a reason. The definition of memset is bad.
why do you even expect those 128 bytes to be on a stack frame at all? I would expect the compiler to keep stuff in registers as much as it can and reuse stack slots as soon as they are no longer needed. And I would certainly expect to remove dead stores aggressively.
I don't use memset much, but I use memcpy and certainly do want (and expect) the compiler convert memcpys to register moves when possible and even to eliminate them when the result is not actually used.
Memory debuggers aren't mentioned in the C spec, and an implementation that let you see all past contents of memory would be valid (and pretty useful), in which case that memset wouldn't be hiding anything.
being smart doesn't eliminate the possibility of making mistakes. Are you expecting that a smart human being isn't going to make mistakes? Go through the revision history and see how many bugs got fixed.
Isn't memset different in that it's an external function that compiler writers have chosen to intercept and perform themselves? Why is saying 'use memset_s' acceptable when they could have stopped treating memset as a special case?
If you include a standard header (using <> brackets) and call a function that the standard says get declared when doing that, the compiler is allowed to assume you want the standard behavior, and inline that/replace it by a specialized version/etc.
Optimizing memset, in particular, can reap huge benefits, for example for sizes of 4 or 8 bytes in tight loops or for zeroing cache line sized blocks.
> Isn't memset different in that it's an external function that compiler writers have chosen to intercept and perform themselves?
The semantics of memset and memset_s are defined by a standard. They're not unknown functions that could do anything, and compilers aren't breaking a rule by treating them differently, as you're suggesting they are.
memset is not a special case. Compiler designers are very much against special casing, and rightfully so.
Compilers do have an intrinsic equivalent but that has nothing to do with whether it gets optimized out or kept.
Following SSA rules, if you are writing to non-volatile memory and not reading from it for the remaining duration of its lifetime, then it's dead code.
use memset_s was a suggestion if your goal is to scrub memory because it guarantees the write will happen. The guarantee comes from the C standard:
> Unlike memset, any call to the memset_s function shall be evaluated strictly according to the rules of the abstract machine as described in (5.1.2.3). That is, any call to the memset_s function shall assume that the memory indicated by s and n may be accessible in the future and thus must contain the values indicated by c
Which essentially means treat the memory as-if it's volatile.
Treating "the memory as-if it's volatile" would not block such an optimization. Specifically, zeroing a volatile or atomic that lives on the stack immediately before returning from a function whose scope it lives in is freely optimized away, because it is not observable.
Many people mentally model atomic as akin to volatile, about which they also frequently harbor superstitions.
I knew a FreeBSD core committer who insisted (and convinced our boss!) that volatile was entirely sufficient to serialize interactions between threads.
Someone else expected a line with a volatile or atomic operation never to be optimized away, and was appalled that it happened. Assignment to a volatile stack object whose address has not escaped and is not used will, in fact, be optimized away, regardless of the "clear language of the Standard".
Often, not performing what would appear to be a pointless optimization would block a more substantial optimization. Compiler writers optimize aggressively in order to break through to the substantial ones, which are most usually dead-code elimination that can then propagate backward and upward.
If you really need for a volatile object and operations on it not to be optimized away, you need to allow a pointer to it to escape to somewhere the compiler cannot see. Have a function in a separate ".o" file, and pass the pointer to that. Or, better, use a compiler intrinsic or other facility provided specifically for the purpose, such as memset_s or explicit_bzero.
> Specifically, zeroing a volatile [...] that lives on the stack immediately before returning from a function whose scope it lives in is freely optimized away
> Assignment to a volatile stack object whose address has not escaped and is not used will, in fact, be optimized away
I don't believe either of these statements. https://godbolt.org/z/8cqEdY9oE is a simple case where this doesn't happen. Can you post a Godbolt link or something that demonstrates when it does?
Keep in mind that the standard says "Accesses to volatile objects are evaluated strictly according to the rules of the abstract machine." In particular, it doesn't say that accesses to non-volatile objects through volatile-qualified pointers will be handled in any special way. (This looks like it's going to be changed in C2x, but it hasn't been changed as of C17.) Also, it says "If an attempt is made to refer to an object defined with a volatile-qualified type through use of an lvalue with non-volatile-qualified type, the behavior is undefined."
Any particular compiler and command-line setting where it is not optimized away tells you nothing about what that or some other compiler may do in a different release or command-line setting.
Talk to any core compiler implementer, and they will say that the optimization is allowed. No compiler performs all permitted optimizations, in any release, but the set of those each does do, varies.
> Any particular compiler and command-line setting where it is not optimized away tells you nothing about what that or some other compiler may do in a different release or command-line setting.
Can you provide a different program, compiler/version, and command line where it will be optimized away?
> Talk to any core compiler implementer, and they will say that the optimization is allowed.
Are you just guessing that they'll say this, or have they actually said this? If the latter, can you link to where they did?
> No compiler performs all permitted optimizations, in any release, but the set of those each does do, varies.
My Godbolt link is a really obvious, easy place that the optimization you claim is possible could be performed. I think the fact that none of the 4 major compilers optimized it away is decent evidence that it isn't allowed.
> have they actually said this? If the latter, can you link to where they did?
They have said it face-to-face, in person. It is possible that, with enough pushback, they have since disabled the optimization. (They all work at large corporations, nowadays, and must obey management direction.) Or, more likely, disabled certain cases of it. I will ask again.
For documentary evidence, you may consult the definition of C11 memset_s, which is specified not to permit the optimization. It is hard to imagine such a requirement being perceived as needed, absent cases where calls to memset have actually been elided.
I will also experiment further with Godbolt. You might have failed to tickle the compilers in just the right way.
> They have said it face-to-face, in person. [...] I will ask again.
If you do ask again, can you do so somewhere like a public mailing list, so I can read their response myself?
> For documentary evidence, you may consult the definition of C11 memset_s, which is specified not to permit the optimization. It is hard to imagine such a requirement being perceived as needed, absent cases where calls to memset have actually been elided.
The point of memset_s is to prohibit the optimization when the object isn't volatile.
Are you allowed to cast the void * to volatile void *, pass that to memset and expect the same result, or is the void * to volatile void * to void * conversion eliminated as well?
As of C17, adding a volatile qualifier like that doesn't do anything useful. The standard currently only requires that accesses to volatile objects be handled specially, not accesses to nonvolatile objects through volatile-qualified pointers. A lot of people, including Linus Torvalds [1], are unhappy that this is how it works, and it looks like DR 476 may finally change it in C2x.
You can wave whichever dead chickens you like over the keyboard, all to the same effect. The compiler ensures only semantics that are observable.
If you pass a pointer to a volatile or atomic object to another function the compiler cannot see into, the compiler knows that any specified operations might be observable, so must act accordingly. Then, operations on that object will not be optimized away. Or, anyway, not until someone builds the program using Link Time Optimization ("-flto") and the called function body becomes visible.
Casting the pointer passed to memset has no effect. The compiler knows what the storage pointed-to is: a temporary object. The compiler knows that memset does not allow a pointer to the object to escape from view. So, it can eliminate the call.
Calling C11's memset_s, or MS's SecureZeroMemory, or FreeBSD's explicit_bzero, has the effect, to the compiler, of making the object observable, so that operations on it will not be optimized away. There is no way to express this locally within the language; you must rely on facilities provided by a Standard or your platform, or on the dodgy linkage trick cited elsewhere.
However, memset's declaration takes a void *, which prevents you from passing a volatile void *. Casting or not won't change anything in this case, because what matters is the parameter type memset declares, not the pointer's fully qualified type prior to the call.
From the compiler's perspective, you're passing a non-volatile pointer to memset, which makes the call a candidate for elimination if the pointed-to object's lifetime ends right after the memset call.
The default "memset" should just do what the programmer intended, and there should be a "memset_kinda_maybe" for the current interpretation. I think it's totally busted that users need to guess what way the compiler is going to maul their program.
99% of C/C++ programmers are not bored language lawyers with time to memorize thousand-page specs. They use the language because they're pragmatic. Messing up people's code with incredibly over-eager optimization is the opposite of pragmatism. I'd rather spend a day writing inline assembly for a hot path than two days dealing with the compiler breaking my code and making users' data insecure because "usually" it's OK to ignore a memset.
Frankly, just learn the language properly…
Compiler writers can’t throw their hands up in the air and do some non-deterministic/non-specified thingy. That would be terrible.
I know the language perfectly fine; that doesn't stop me from recognizing how broken this all is. I can dance around these horrible land mines, I'm just saying take the landmines out because people are getting hurt.
If anything I think they are written by people who are too smart --- theoreticians who care only about the standard and can't be bothered exercising common sense when the standard doesn't specify what to do.
No, I've read the standard. The "abstract machine" is, as its name implies, abstract. In reality, there are other concerns that are equally if not more important. That is the problem with compiler writers, ignoring those practical concerns. They prefer to downvote rather than face the reality of going outside their imaginary abstract machine world.
"In theory, theory and practice are the same. In practice, they are not."
The C standard can be thought of as a bunch of minimum requirements that program authors and compiler authors have to follow. What you seem to be arguing is for compiler authors to go above and beyond what's required of them, so that you as a program author can get away with doing less than the minimum that's required for you.
Reminds me of a comment I once saw about optimizing on UB, from a compiler maintainer at Microsoft. He pointedly said the gcc maintainers would do better to work on their crummy code generation.
On the contrary, first of all there is no rule that requires a C++ compiler to also be a C compiler.
Second, Microsoft considered C done on Windows, and held that green-field systems programming should be done in C++; there were a couple of blog posts on the subject, which apparently haters keep forgetting to read.
So no, it wasn't about when it was convenient, rather to the extent required by ISO C++ compliance.
In fact, as per Microsoft Security Center advisory, C# or other managed languages, Rust or C++ with Core Guidelines, should be the main languages to use for new applications.
Most likely they ended up backtracking on the goal to drop C support from Visual Studio due to pressure from high-profile customers, WSL and FOSS code on Windows, or even the culture change brought by Satya.
> On the contrary, first of all there is no rule that requires a C++ compiler to also be a C compiler.
This is true, but an OS with no C compiler is horrifically bad. If Microsoft didn't want their C++ compiler to also be a C compiler, then they should have made a separate C compiler, instead of forcing people to rely on something like MinGW.
The compilers are separate, just like in most C and C++ compilers out there.
There is hardly any reason to keep writing new C code in 2021 other than UNIX like codebases or embedded development that keeps resisting C++ adoption.
I was actually disappointed with Microsoft's decision to backtrack on their decision.
Was there a separate C compiler? I thought Microsoft had a C++-only compiler but not a corresponding C-only compiler, and that the lack of a separate C compiler was the problem.
> There is hardly any reason to keep writing new C code in 2021 other than UNIX like codebases or embedded development that keeps resisting C++ adoption.
> I was actually disappointed with Microsoft's decision to backtrack on their decision.
You seem to be saying that C is deprecated or something, but it isn't.
Then, just like in most C and C++ compilers, there is a common frontend that invokes the corresponding compiler infrastructure.
C was deprecated, from Microsoft's point of view, for systems programming on Windows; apparently complainers keep forgetting to actually read Microsoft employees' blog posts.
Here, have a go at them, to spare you some Google-fu.
This attitude drives me crazy. C was once a simple language, but now the only people that can write it "correctly" are compiler implementers, and if they fuck up your code, it's your fault. You know how Linus will practically murder anyone that breaks user space, even if user space is "wrong"? I wish compiler makers had the same ethos.
BTW, go look at how many optimizations are explicitly disabled to compile the kernel without it breaking and then come back to me with how these are a good idea. One of the most critical pieces of software infrastructure in the world looks at how broken compilers are and says yeah, leave that junk out.
There are plenty of optimizations in Linux, they just pick ones that don't break userspace. Similarly compiler writers could do optimizations without breaking user programs. Although I guess testing compiler backward compatibility would probably be more difficult than kernel backward compatibility, since compilers present a larger interface (multiple entire programming languages).
> Similarly compiler writers could do optimizations without breaking user programs.
This isn't possible, since "breaking user programs" in this context means "violating the intuition of the programmer", who is an unknown person who has never written their intuition down to be observed by the compiler authors.
If you are actually talking about breaking user programs, both C and C++ already have ABI stability as a design goal, which hugely limits changes to the languages but enables people to keep linking against binaries that were compiled ages ago.
GCC and clang developers already do regularly compile and test a limited number of open source packages, and investigate any newly occurring failures. For now they ignore failures if the code does not conform to standards (except for notifying upstream developers). In principle, in such cases they could expend effort to find similar (possibly simpler) optimizations that do not change the existing behavior of the code. Alternatively, they could disable the relevant optimizations at -O2 (and maybe -O3) level.
I believe MSVC developers do put more effort than GCC/Clang developers to maintain source-level backwards compatibility. Alternative development strategies are certainly possible, but I understand that the GCC/Clang developers do face pressure from a significant subset of users to maximize performance at any cost.
> In principle, in such cases they could expend effort to find similar (possibly simpler) optimizations that do not change the existing behavior of the code.
This wouldn't just be expending effort. It would be making all code GCC/Clang ever compiled be slower.
> Alternatively, they could disable the relevant optimizations at -O2 (and maybe -O3) level.
Your point is that the standard of "optimizations shouldn't fuck up working code" is unreasonable? What point is there to optimizing something that's wrong? So you can break things 10% faster?
> Your point is the standard that "optimizations shouldn't fuck up working code" is unreasonable?
By "working code", do you mean code that the standard and/or target implementation promise will do what you want, or do you mean code that just happened to do what you want the last time you tested it on your system? If you mean the latter, then yes.
What percentage of code do you honestly think has no undefined behavior? (hint: hardly any). Are you fortunate enough to work in one of those code bases? Remember it's not up to you alone not to write UB, it's up to everyone that touches the code base.
If you are stuck with UB, would you prefer a compiler that does the safe thing, or a compiler that unleashes dragons?
> What percentage of code do you honestly think has no undefined behavior? (hint: hardly any). Are you fortunate enough to work in one of those code bases? Remember its not up to you to not write UB, its up to everyone that touches the code base.
Maybe we should focus on better warnings, static analyzers, and UBSan, so that people with UB can fix their code, rather than not letting compilers optimize anymore.
> If you are stuck with UB, would you prefer a compiler that does the safe thing, or a compiler that unleashes dragons?
I'd very much prefer that my compiler unleashes dragons, so that I notice the bugs and can fix them. If my compiler did what you want, then when someone else compiled my code with a different compiler, they'd find them instead. (And it's obviously totally unrealistic to expect that literally every compiler in the world would do what you're asking.)
The example about signed overflow misses the point: on good code it results in a performance improvement. Similar for dereference of null pointers: it allows the compiler to remove redundant successive checks (e.g., once you dereference *ptr, all further null checks on ptr can be dropped until it is written to). This makes it faster to express preconditions while letting the compiler make safer code faster.
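The null-pointer case described above looks roughly like this (a hedged sketch; the function name is illustrative):

```c
/* After the dereference on the first line of the body, the compiler may
 * assume p is non-null, because a null dereference would be undefined
 * behavior. The later check is then provably dead and may be removed. */
int first_or_default(int *p)
{
    int v = *p;     /* UB if p is NULL, so the compiler assumes it isn't */
    if (p == 0)     /* candidate for elimination under that assumption */
        return -1;
    return v;
}
```

The same reasoning shape applies to signed overflow: `i + 1 > i` can be folded to true for a signed int, which lets loop-bound analysis produce tighter code.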
Because, when a programmer writes a call to memset, she expects, if you as a compiler take her seriously at all, the memory to actually be set, and not for it to be a mere suggestion 1). They should have added a separate function and called it memsetOrDontIfYouThinkINeverReadThisAgain. Does this make sense?
1) Yes, I know there are rules and all that, which are precise and allow this.
If you take this stance most optimizations are off the table. Optimizations rewrite code in some way to ensure it runs faster. In that sense yes, the entirety of a program "is a mere suggestion on how to compute the output". If you actually want to tell the CPU what to do you have to use ASM and an ISA that specs exactly what every instruction does and how it does it.
> If you actually want to tell the CPU what to do you have to use ASM and an ISA that specs what exactly every instruction does and how it does it.
Even that isn't enough, unless you go with a CPU that doesn't have any branch prediction, speculative execution, or out-of-order execution. I'm not aware of any such processors.
That's exactly what I meant by "how". If an ISA has branch prediction, speculative execution or out-of-order execution, that is definitely included in how the ISA handles its instructions.
Let me paraphrase this differently, on a more pragmatic level, from the perspective of a compiler user rather than a compiler author: how likely do you think it is that a program that contains memset and has this optimisation applied thereby diverges from the intended behaviour and now has a serious flaw in it? The answer to this question is the answer to your question "Why not?".
When people write C, they often do so for specific domains: crypto, embedded, device drivers. They choose C for these domains because they think, and they are taught, that C is some kind of macro assembler, because in these domains they care deeply about the "how". If they didn't, they would write Java or any other current top-20 programming language. Compared to this, running with almost all optimizations off looks reasonable, anyhow.
I am deeply confused by your argument, for the simple reason that you can just turn off optimizations. In fact, it's literally the default in most compilers to have no optimization, you have to explicitly request them. Are you actually complaining that a compiler told to optimize is... optimizing?
> How likely do you think it is that a program that contains memset and has this optimisation applied thereby diverges from the intended behaviour and now has a serious flaw in it?
What if this optimization happens after three levels of inlining and some other dead code elimination? Is that "diverting from intended behavior"?
> When people write C, they often do so for specific domains.
The article is about C++ which people choose because they want to go fast. Which is also the dominant reason people choose C (embedded & device drivers both absolutely care about performance & efficiency, too, after all).
You can't have both "C is fast" and also "compilers can't optimize." That's not how that works. Similarly if you're just starting out in crypto and you're reaching for C you're already in a bad place and atomic optimizations are the least of your concerns. C is a terrible language for crypto.
> How likely do you think it is that a program that contains memset and has this optimisation applied thereby diverges from the intended behaviour and now has a serious flaw in it?
Almost every time.
Yeah, the times when it leads to problems are more important individually. And I have no idea of the overall picture. But looking purely at frequency, the cases you enumerate basically never happen.
C really needs a "I mean this!" marker that one can use on those cases. Rust also would gain from some kind of predictable region similar to unsafe, but C would have to gain it first.
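GCC and Clang do offer a non-standard approximation of such an "I mean this!" marker: an empty inline-asm statement with a memory clobber, which forces the compiler to assume the pointed-to memory may be read afterwards (a compiler-specific sketch, not portable C; the function name is illustrative):

```c
#include <string.h>

/* The empty asm takes p as an input and declares a "memory" clobber,
 * so the compiler must assume the statement may read or write any
 * memory. The preceding memset therefore cannot be treated as a dead
 * store: its result is now "observable". */
void scrub_with_barrier(void *p, size_t n)
{
    memset(p, 0, n);
    __asm__ __volatile__("" : : "r"(p) : "memory");
}
```

It is a vendor extension rather than a language feature, which is exactly the gap being complained about here.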
I would assume it doesn't diverge, because the user cannot tell from the output of the program whether a memset was optimized away. If you want the memset itself to be observable you need to use volatile or memset_s. Otherwise you are not describing your intent properly.
If people truly still think C is a macro assembler, then that's the problem. Besides, if you care deeply about the "how", modern high-performance architectures won't cut it anyway.
I think the thing is users of C compilers generally care about raw performance much much less than compiler maintainers who are universally using C++.
Compiler writers are stuck in a trap. The C++ compilation model is broken, which makes compiling C++ programs very slow, which means compiler writers care a lot about how fast compilers are. So they desperately add more optimizations to speed up the compiler itself, which in turn adds more work to the process of compiling a program.
Most other users, especially C programmers, don't care about speed nearly as much, notably because while the compilation model for C is also broken, it's not nearly as broken. So their programs compile in a few seconds, not minutes to hours.
I submit that this viewpoint is very mistaken. If it were true, "users of C compilers" would just compile with optimizations off and there would not be any problem. Evidently, "most of the other users" do care about performance. Unsurprisingly, I might add.
Doesn't Wirth actually believe the same thing about compilers? Specifically the part about
> compiler writers care a lot about how fast compilers are. So they desperately add more optimizations to speed up the compiler program. But which adds more work to the process of compiling a program.
It's the opposite. People use C and C++ because of their speed. If you don't need the speed then there is little reason to use C or C++. This makes these optimizations absolutely vital for C and C++.
I think a lot (not all) of people will use C and C++ because they need good performance, compile time checks, portability across OS/CPU architectures and reasonable memory usage.
Most of the people who use C and C++ are average programmers who think their programs will do exactly what they wrote in the source code.
The issue is that optimizations are required to get expected performance but some optimizations will result in programs not doing what user specified in the source code.
Compiler should offer a clear split between "safe" optimizations that will not change what source code says and "unsafe" optimizations that will do their best to create fastest code.
I don't use C or C++ very often and I don't know how optimizations are currently structured but if a lot of people are constantly caught unaware of what optimizations are in use and how they affect the output then this is obviously either a documentation issue or compiler interface issue - both of these are up to compiler writers to fix.
Distinguish programs not doing what the user specified in the source code from programs not doing what the user incorrectly thought they specified in the source code.
Exactly what the source code does is specified in the language standard. If the compiler changes that, it's a compiler bug. What you really mean is what people assume source code does, not what it says it does.
Yes if you write C++ compilers in C++ they are the same.
Compiler writers are stuck: the more optimizations they add, the more work the compiler has to do, and the slower it runs, defeating the purpose of those optimizations.
If you write other types of programs, the compiler's speed and your program's speed aren't the same at all. People that write other types of programs care more about how fast the compiler is and less about how fast their programs run. People that write programs that process untrusted data care vastly more about having no surprises than about speed. Even more so, in modern code bases the hot sections are very small parts of the total code base. For 99% of the code, CPU speed doesn't matter.
The whole thing is made worse by the disconnect between modern CPUs and the abstract machine implemented by C++.
> If you write other types of programs, the compiler's speed and your program's speed aren't the same at all. People that write other types of programs care more about how fast the compiler is and less about how fast their programs run. People that write programs that process untrusted data care vastly more about having no surprises than about speed. Even more so, in modern code bases the hot sections are very small parts of the total code base. For 99% of the code, CPU speed doesn't matter.
Literally everything you ask for here is fulfilled by not turning on compiler optimizations. That's the default. I don't understand what your complaint is.
And since nobody stated it plainly to you yet: This whole world view is completely bogus. Entire branches of the industry require C++ compilers that produce the fastest possible code (to name a few: Games, HFT, image processing). I don't know how you convinced yourself that compiler writers are the only consumers of compiler optimizations, but it's dead wrong. I don't claim your experience or what tradeoffs you seek in a compiler are wrong, but they are extremely far from universal.
But self-compilation speed is a common benchmark for this, because compilers themselves are rich programs that exercise all parts of the compiler and the compilation speed of which can be measured.
> the entirety of a program "is a mere suggestion on how to compute the output".
Maybe people using Java or various interpreted languages are fine with that, but most C programmers want their code to be a very close mapping to what they specified.
The problem is, I want optimizations... that don't break things. There's -O1 or -O3 but no -Ononewbugs. I don't expect perfection, but changing the meaning of memset or collapsing atomic operations is like, "settle down".
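The "collapsing atomic operations" complaint refers to transformations like the following (a C11 stdatomic sketch; whether a given compiler actually performs the merge varies by version and flags):

```c
#include <stdatomic.h>

atomic_int progress;  /* statically initialized to 0 */

/* The as-if rule permits merging the two stores into a single store
 * of 2: an interleaving in which another thread observes the value 1
 * is indistinguishable from one in which that thread simply was never
 * scheduled between the stores. */
void report(void)
{
    atomic_store(&progress, 1);   /* intermediate state... */
    atomic_store(&progress, 2);   /* ...may never become visible */
}
```

This is the same point made at the top of the thread: no conforming program can tell an interleaving that never happens to happen from one that can never happen.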
Every optimization changes the observable behavior of a program in some way. How is the compiler supposed to know which changes you deem acceptable?
Eliminating dead stores (including calls to memcpy whose result is never read) seems like one of the most basic and least objectionable optimizations I can imagine. So, if you’re against eliminating dead stores, what optimizations _are_ you okay with?
What you seem to want is "do all the optimizations, except the ones that contradict what I 'obviously' mean", which is just not possible with current technology.
The set of all the optimizations that do not break any program is very likely empty. So you are happy to break other people's programs as long as yours keeps working?
I'm pretty happy to leave the optimizations off. Writing the correct data structures matters a lot more than a clever compiler collapsing two commands into one.