A similar thing is that I often come across people who are well-informed but still surprised that compilers will combine two observable operations into one, and complain that some other thread somehow 'should' be able to observe the intermediate state. But I don't understand how they think they would be able to tell the difference between an interleaving that never happens to happen, and one that will never happen.
> A similar thing is that I often come across people who are well-informed but still surprised that compilers will combine two observable operations into one
Well, think of it this way: if you're not into compilers, wouldn't you expect your code to run as written? Especially if you put a breakpoint in there and can see each step happening sequentially. I don't really think people who don't know these things should be viewed as ignorant, because the compiler tends to do some very unintuitive things. IMO everyone would be better off if somehow the tooling or the compiler could report some of these things in an easy-to-visualize, not-super-verbose way.
You use a compiler to abstract away the CPU. Because of that, there is no “run as written”.
Simple example: if you declare a structure that is 14 bytes in size, declare variables foo and bar of that type, and then write
foo = bar;
C must behave as if 14 bytes are copied from the memory where bar is stored to the memory where foo is stored, with two observable states: foo either has its old contents or the new ones. Chances are very high that your CPU cannot do that in a single operation.
The compiler will instead do something different. It might call memcpy, it might copy 8 bytes first, then 4 more, then the last 2, it might copy the first 8, then the last 8 (copying the two center bytes twice), etc.
Similarly, if you declare an int named i and do
i += 1;
There’s no guarantee that will translate to a single instruction (what if your CPU has 8-bit registers, but your compiler chose to have 16-bit ints?). Even if it does, does “as written” mean an increment instruction or an addition? Your CPU might have neither, but only subtraction and negation.
Also, a literal interpretation of “as written” requires spilling the result of every statement back to memory; a strict literal interpretation even forbids using registers (if you don’t write “copy i into a register, add 1, write the result back to i”, the compiler can’t do that if it has to run your code “as written”)
But yes, it would be awesome if somehow the tooling or the compiler could report some of these things in an easy to visualize not-super-verbose way.
Unfortunately, that isn’t possible. Modern languages and their compilers are just too far abstracted away from modern hardware.
Being caught in the middle of an instruction, like all your examples, is much less surprising than when something never happens at all.
> Also, a literal interpretation of “as written” requires spilling the result of every statement back to memory; a strict literal interpretation even forbids using registers (if you don’t write “copy i into a register, add 1, write the result back to i”, the compiler can’t do that if it has to run your code “as written”)
If you don't make a pointer to a variable then it's allowed to be in a register, isn't it? There's even a 'register' keyword, however obsolete it may be.
I think it’s only less surprising because it happens more often, so that ‘everybody’ knows about it, not because it is inherently less strange.
As examples of merging, if you do
s = sin(x)
c = cos(x)
would you find it surprising if that generated a single sincos instruction if the CPU has one?
If you do
typedef struct { char a; char b; } s;
s c;
[…]
c.a = 3;
c.b = 6;
do you find it surprising if the compiler generates a single instruction equivalent to storing one 16-bit value into c? (Exact code will vary according to the CPU architecture, and expressing it in C may need a cast, or a memcpy, to avoid undefined behavior.)
I thought compilers ignored those words when optimizations are turned on, compilers reorganize and rewrite your code.
Using an interpreter seems to be the best option to abstract away the CPU. Write the interpreter once, get enough eyeballs onto it, test it 'till death and use it for predictability.
Even if you write assembly these days, there is all sorts of ambiguity around. Even the underlying hardware we use can fail.
I used to believe that computers did as they're told, that they could make no mistakes, that all mistakes were human-made. Now I see that there is no such thing as error-free. We are basically living on chance; much is uncertain.
I trust my eyes and ears more and more instead of the machine.
> You use a compiler to abstract away the CPU. Because of that, there is no “run as written”.
Um, no. I use a compiler to generate code that would otherwise be very tedious to write. I do not expect it in any way to abstract away the CPU. I do not want it to abstract away the CPU. If I did I would write Java or python. I'm using C++. I am very aware of the CPU. What I'm not aware of is whatever silly transformations the optimizer thinks it can do to my carefully crafted program because I didn't read page 783 of the spec.
You want a macro assembler. C is not a macro assembler. C was never supposed to be a macro assembler. It was designed as a cross-platform language, which fundamentally means you are programming to an abstract machine, which necessarily means that you cannot have a macro assembler [0].
I can, however, see an argument that people want a C like language which is a macro assembler. I am lost again when people start asking for a C++ like macro assembler. The features that C++ brings over C are not concepts that exist on any mainstream CPU.
[0] Unless you decide to build a concrete machine (either in bare metal, or as a VM).
C was originally a macro assembler! This new mutation of C is just that, new. I don't mind optimizations as long as it doesn't remove intent, which is usually clearly indicated in the source, standards be damned.
There is no definitive start point for C, as it grew fairly organically out of B and began as a living language. However, a reasonable approximation for C 1.0 is The C Programming Language (Kernighan and Ritchie, 1978) [0], commonly known as K&R C.
The language they describe is much simpler than modern C. Notably for this discussion, it contains no synchronization mechanisms.
There is nothing in that book that suggests C was a macro assembler. On the contrary, its description of C precludes it from being a macro assembler for the reasons I talked about in a parent comment:
> Although C matches the capabilities of many computers, it is independent of any particular machine architecture, and so with a little care it is easy to write "portable" programs, that is, programs which can be run without change on a variety of hardware. It is now routine in our environment that software developed on UNIX is transported to the local Honeywell, IBM and Interdata systems. In fact, the C compilers and runtime support on these four machines are much more compatible than the supposedly ANSI standard versions of Fortran.
> Because the data types and control structures provided by C are supported directly by most existing computers, the runtime library required to implement self-contained programs is tiny. On the PDP-11, for example, it contains only the routines to do 32-bit multiplication and division and to perform the subroutine entry and exit sequences.
Notice the way this paragraph frames the issue. Since the features of C are supported by most machines, it can be implemented with a small runtime library. It does not say that C must use those features directly. In fact, it explicitly anticipates that C could provide some of its primitives in ways other than directly leveraging a corresponding capability of the hardware. The feature here is efficiency and a small runtime.
I'm not sure where you got that idea. C was never ever a macro assembler, and its antecedent language (B[1]) is actually even more abstracted away. C was a slight lowering, not a raising, of hardware abstraction.
If C were a macro assembler the following things would be true of it, most likely:
- types would have fixed bit sizes (ie. uint32_t would be the norm)
- you'd be able to do various 'normal' things for assembly like rotate left and right.
- inline assembly would probably be.. you know.. standard, instead of something that compiler vendors only started adding in like the 80s.
You might be thinking of the fact that C++ was originally sort of a Macro C (aka Cfront)?
[1] Ever wondered why `int` is so fuzzily defined in C? It's because in B there was only one type and it was always word sized. That's also why various things are typed `int` by default if you don't specify.
History: C was essentially PDP11 macro assembler. There are so many assumptions in C that basically came from the PDP11. C was originally a very utilitarian language to make writing unix easier. That it became massively popular was more of a happy accident.
I'm sorry, but this is just a gross mischaracterization of what a "macro assembler" is. I agree it was utilitarian and made to make unix easier and that many of its core assumptions were based on the platform of the pdp11, but it never did in any way resemble a macro assembler as far as I know.
And in fact, as gizmo86 pointed out, one of the explicit goals of C in terms of "making unix easier to write" was, afaik, that it would allow the code to be more portable to other architectures, because as the 11 in the name implies, the history that led to the creation of unix was not short on new architectures, and one of the key weaknesses of Multics was, afaik, the fact that it was rigidly written specifically for one specific architecture.
By this logic nearly every programming language is just a macro assembler at a certain level of (non-)complexity. All compilers do is translate syntax (macros?) into machine code (assembly?). So, is Pascal also a macro assembler?
Anyways here is a list of instructions on the pdp11 that you can't directly represent even in the earliest versions of C to my knowledge:
ROR
ROL
SWAB
SXT
ADC
SBC
BVC
BVS
BCS
MARK
HALT
WAIT
RESET
MTPD
MTPI
MFPD
MFPI
MTPS
MFPS
MFPT
CLC
CLV
CLZ
CLN
CCC
SEC
SEV
SEZ
SEN
SCC
That's the distinguishing factor of a "compiler" vs. a "macro assembler". A macro assembler is a layer on top of an assembler that still allows you full access to the instruction set and allows control flow that most compilers would consider unsafe. C was never that. It wasn't derived from a language that was that and it didn't, to my knowledge, allow that even on pdp-11 unix.
I'd be really happy to see some code from one of the code archives of unix' history that proves me wrong there, though, if you have it.
Unfortunately, classes teaching C often start with telling people that with C they learn how the real machine and the real CPU works on the lowest levels.
I'd be surprised if this was true. It certainly doesn't match my experience, and it also doesn't really make sense. The difference between machine code, assembly, and programming languages like C is first-week CS 101 stuff. What are you basing this on?
Have any good sources of this topic of macro assemblers? I'm interested in reading about this type of thing. I'll search Google, but you can't always find good sources there today.
I searched a bit, and found a large set of pages with information on the topic in the context of the 360 architecture. I'd say this architecture is unique in that it is still to this day used to write high level software.
Put "360 assembler macros" in your search engine and you'll find lots of information.
"run as written" implies targeting an abstract machine that performs operations the way programmer imagines it does. This machine isn't specified anywhere.
The hard part is that programmers don't imagine the code to be executed in a naive unoptimized way, but still expect it to be reasonably well optimized (converted to "obviously" equivalent assembly), except where optimizations combine into a result they haven't thought of.
This is more of a multi threading & ordering topic than a compiler topic. There were never any guarantees about another thread discretely observing those two observable operations in the first place, due to the memory model.
Not to mention that a big part of the confusion is not even the compilation - it's not beyond possibilities to inspect the machine code - but the processor itself still does a lot of unintuitive stuff, even more so with concurrency. I wish the tooling you mention existed, but also some microcode simulation/implementation to look beyond that.
I’m not sure what you're thinking of, but CPUs are guaranteed to run as if the given instructions were executed one after another, even though in practice they are free to reorder them slightly.
I've dealt with many CPUs designed to run out of sequence and to guarantee that. I've dealt with multiprocessor computers where order is not guaranteed at all outside the use of hardware interlinks. I've dealt with embarrassingly parallel processors where instruction order is orthogonal.
Perhaps you're thinking of the simple abstract CPUs used in "introduction to computers" courses instead of the real-life silicon used in most applications?
Consider the reverse. What if the interleaving was undesirable, would you like some way to be informed that it’s impossible so you don’t have to reason about it?
When trying to reason about code, finding something that looks like a bug but isn't can result in a lot of wasted effort.
That information is moving in the wrong direction. I don’t want the programmer to be telling a compiler what to do, I want the compiler to inform the programmer what it did.
> That information is moving in the wrong direction.
You set the direction not me! You asked me 'would you like some way to be informed that [the interleaving is] impossible'. Yes... and the user gives me that information through a synchronised block. If they don't tell me they want an interleaving to be impossible then why should I not make it possible? There are countless other interleavings I make impossible or possible when I do any trivial compiler task such as scheduling. Should I consult the user on all of them? I don't think you'd like all the questions you'd get!
Informing the user isn’t about what you did, it’s about the user having some way of knowing what you did. At least compilers give you assembly code to look at; CPUs are little better than black magic from a programmer's perspective.
We don't - and in this example we'd combine atomics rather than guess that the user wanted the thread to yield between them, since they never told us that.
That's not true, you are making a guess: that the user doesn't care about the intermediate stage. Not guessing would mean generating the code as specified. It doesn't matter if you think it's a bad idea; the whole ethos of C is to trust the programmer.
> you are making a guess: that the user doesn't care about the intermediate stage
That's not a guess - the user didn't put anything in between them so we know they haven't required them to be separate.
Keeping them separate would be making the guess - a guess that they need them to be separate. Why should we invent that constraint when the user never wrote it?
> Not guessing would mean generating the code as specified.
We are generating it as specified - nothing in the specification says atomic actions have to be run independently. Why should I invent extra constraints that don't exist in the specification?
Oh come on, that language lawyering is sophistry. If I write 4 instructions that are "atomic", I expect 4 instructions. If I end up with one, they weren't really "atoms", were they? I do not expect what you just said. I may think, gee, perhaps I don't need 4 instructions, but what you are doing is undeniably changing the meaning of the program, even if you think it's close enough. The fact that there's an article about this tends to agree with me that most people think: wtf happened?
> If I write 4 instructions that are "atomic" I expect 4 instructions.
Why do you expect that? Nobody ever promised you that. If you expect something you weren’t promised then that’s on you.
> If I end up with one, they werent really "atoms" were they?
Atomic has a technical definition - it means either applied fully or not at all. That doesn’t preclude combining.
> what you are doing is undeniably changing the meaning of the program
It doesn’t! According to the spec we’re both working off. If you think the spec is wrong then change it. But I can’t guess what you want unless you tell me using the spec.
I just assumed that compiler writers had empathy for humans, my bad.
> It doesn’t! According to the spec we’re both working off. If you think the spec is wrong then change it. But I can’t guess what you want unless you tell me using the spec.
How big is the spec and how many working programmers do you think are super intimately familiar? Do you think this article would be on the front page if this information wasn't surprising? Are you building a compiler for humans, or are you just chasing metrics and getting the smug satisfaction of telling users "I told you so" when shit doesn't work in an intuitive way?
If you want to beat me over the head with "the spec says", you win, master language lawyer. I build things for a living and I like it when the code I specify is the code that's generated. You can blame users all you want, but maybe after the trillionth zero day exploit because nobody gets these things right you might consider empathy and developer ergonomics.
> I like it when the code I specify is the code that's generated
But you aren't specifying it!
You're asking me to follow extra constraints that you haven't written down and we haven't agreed. Don't you think it's unreasonable when you say it out loud like that?
If you want extra constraints on top of what is already agreed in the spec then we can do that, but you have to ask for it and you have to bring me a workable spec. You want me to 'intuit' but really you want me to read your mind and follow a lot of unwritten rules!
I think we have a fundamental difference on what a compiler should do. I want the compiler to take my exact instructions and generate the best machine code it can that follows those exact instructions. You want to rewrite my program to be more optimal. The tension here is that, even though I'm sure you can do that, I have to debug the result, and the result I have to debug is now stripped of symbols and super hard to relate back to my original source code.

So OK, I am only adding one constraint, but it's a big one: don't ever change the line-by-line meaning of my source code. If I wrote a line of code, I intended for it to be executed. I don't want my memsets removed or my atomics collapsed, not because you can't do it, but because I need to debug it.

I don't think taking advantage of undefined behavior to create optimizations is useful, because being wrong 10% faster isn't really that useful. And I think optimizations like removing memsets are a really bad idea: not only do they potentially make people's data insecure, but the only people who are ever going to find a bug like that are people looking for vulnerabilities.
And frankly, here's the thing: I don't care that much about most of your optimizations. If I'm making my code fast, what I'm fretting about is designing my data structures to fit in cache, or using SIMD instructions, or offloading work to the GPU or other coprocessors. Screwing with the meaning of my code just makes my life harder.
But the meaning of your program is not being changed or screwed with. The optimised version implements the same meaning as your original program.
The problem is you don't understand the meaning of what you've written. You think the language has extra meaning that it doesn't. You can already get the meaning you want through existing constructs like volatile and yield and memset_s.
> The problem is you don't understand the meaning of what you've written.
This is the problem with the attitude of compiler writers. Of course there are bugs. Nobody's perfect. Pobody's nerfect? If I accidentally left a sign overflow in the code, or something of that nature, I think any human reading the code would understand what I meant, but the compiler is like "Let's get weird!". Yes, I CAN get the meaning through the existing constructs. That, however, is not my point. I am perfectly skilled and capable of doing this. What I'm questioning is why tool vendors think it's useful or practical to take advantage of users' mistakes. I'd rather have a diagnostic of "why are you adding these numbers multiple times?" than have the compiler silently rewrite something I did on purpose.
I'm not writing this in the abstract; I'm writing this because things like this have hurt me in the past. It's super annoying. I don't care about your optimization to make things five percent faster if I spend three days debugging why the release build crashes if the window is full screen, but the debug build runs fine 100% of the time. The code has a bug, I own this, but I don't see how the attitude of "if your code has a bug, we will turn it into fucking dragons" is even remotely useful. Usually it's not even my code I'm fixing, it's someone else's. I just want people to stop breaking things to meet a dumb benchmark.
It sounds like you're railing against optimising based on undefined behaviour. That's a separate thing - that's not what's being discussed in this article or thread - this is all well-defined behaviour.
> I want the compiler to take my exact instructions and generate the best machine code it can, that follows those exact instructions.
Two problems.
1. Your definition of "exact" is not exact. It is exact in your mind, but it isn't going to be precisely the same across all developers, and compiler authors cannot reasonably develop hard requirements from this. This is especially true when targeting different machines. Operations on a uint64_t aren't going to look the same on a 32-bit machine, for example.
2. Attempting to do this makes your programs slow as hell. You can claim that compiler optimizations don't matter... but you are free to deploy debug builds if you really think this. C and C++ especially aim to provide a sort of ambient performance where everything is fast rather than just providing speed for hot loops. People complain nearly daily about modern applications being slow and then turn around and say we should throw away decades of compiler optimizations.
"Can the compiler inline this" is like 5000 on my list of things I care about. Compiler optimizations are mostly only useful to people who werent paying attention in the first place.
> "Can the compiler inline this" is like 5000 on my list of things I care about. Compiler optimizations are mostly only useful to people who werent paying attention in the first place.
Okay, you're definitely severely underestimating how much slower everything would be if compilers weren't allowed to optimize anymore.
Yeah. For somebody who really cares a lot about cache misses he sure doesn't seem to be concerned about basic things like leaving virtual function calls in place.
-O0 doesn't necessarily change this. After all, optimizations aren't allowed to violate the spec (if they do, they're buggy). So any code that can be output by an optimizing compiler could (in principle) be output by a non-optimizing compiler.
I’d be curious to know what benchmarks you’re using on which O0 is only 5% slower. I suspect your program must be almost entirely IO-bound, if this is the case. In that case, Python, Ruby, or shell probably wouldn’t be much slower either.
If you're going to use atomics, you should be reading what guarantees those atomics actually provide. In this case of memory_order_relaxed they pretty clearly specify they don't provide any cross-thread synchronization. So if you wrote those 4 atomics expecting something the docs explicitly & clearly say it doesn't provide, that's pretty much entirely your fault and not "language lawyering"
'Atomic' isn't uncuttable in the sense that it is irreducible, but in the sense that it cannot be stopped/does not produce intermediaries. If you write four atomic instructions you should expect that they will not be interrupted, nothing more. Of course I'm not an expert, but I think that is where your misunderstanding has originated.
Did you even read the post you're replying to? "Uncuttable" can apply to multiple properties of an instruction's execution - "atomic" colloquially refers to one of these properties.
You seem to want "atomic" to mean "the data that corresponds to the instruction cannot be divided" but it means "the execution of the instruction cannot be divided", those aren't the same thing.
The four hardest problems in computer science are cache invalidation, naming stuff, and off by one errors. Please accept that your understanding was flawed and have empathy for the people who named stuff.
It can be tricky in that fuzzing your program to discover race conditions seems to behave perfectly on one compilation target where these atomics are detected and reified, while you later discover on another platform surprisingly your app crashes 20% of the time.
Ideally we test our applications on all hardware, with all configuration permutations etc. etc. In practice we do rely on our compilers translating our intent accurately and sometimes such edge cases matter.
Compatibility is a tricky thing. It's kind of like the argument whether adding optional arguments to a function breaks BC. It doesn't break BC if you don't pass those parameters. But if for some reason you were passing extra parameters hoping they'd be ignored (for example as a handler some other place) then adding optional parameters WILL break BC and cause your software's behavior to be undefined.
It is not about translating intent accurately. What you need is for compilation to be fully specified. Anywhere there are multiple correct ways to compile something introduces a risk of bugs that only show up with some compilers.
If you are a compiler or library writer, one solution is to avoid having useful properties that are not part of the spec. For instance, Go does not guarantee any particular iteration order for hashmaps; so they go out of their way to randomize the iteration order, thereby preventing developers from writing code that depends on a deterministic order.
In the case of threading, what you would need to do is essentially have a compiler/runtime that goes out of its way to order and time operations in a random/adversarial manner.
I've seen research that looks into doing this in a VM environment; which would be inhibited by the type of compiler optimizations being discussed. And others that modify the compiler itself to replace the concurrency primitives with runtime functions, that can then execute them in a fuzzed order.
Ultimately, fuzzing and testing can only give you confidence that what is being tested is mostly correct. It can never give you confidence that what is written is entirely correct. If you want confidence in the latter, you need to invest in some form of static analysis (which could either be built into the language, such as a type system, or be an external analysis tool). Ultimately, writing even a moderately complicated program (by modern standards) with full confidence in its correctness would involve advancing the state of the art of the field by decades (if not longer).
For the most part, the field just doesn't care about programs being fully correct; and accept it as a fact of life that going onto new/untested platforms and configurations will introduce/expose bugs.
> In the case of threading, what you would need to do is essentially have a compiler/runtime that goes out of its way to order and time operation in a random/adversarial manner.
> others that modify the compiler itself to replace the concurrency primitives with runtime functions, that can then execute them in a fuzzed order.
Loom[1] is somewhat of that. It's a testing system (not a runtime for a full app) which tests multiple permutations of program execution (all possible permutations I think, limited by test case complexity or an optional "maximum thread switch count"), as well as modeling atomic/weak memory effects to some degree.
Very interesting. I would love to see tools emerge out of that research. In my experience one of the basic problems with testing/verifying concurrent code is that it is often difficult-to-impossible to perturb a particular part of the system, to exercise different orderings, and thus find corners.
Can a different thread observe x and y only incremented once? Before the optimization yes, it's possible. After the optimization, no, it's not possible. The compiler removes a possible observable state in the course of optimization, and that occasionally surprises people.
They tell me they're not happy with 0% chance of some interleaving they need. Are they happy with 0.000000000001%? I'm not sure I understand the practical difference between that and 0%? Can I tell them it is possible, but rely on the death of the universe happening before they have a right to complain it didn't turn up yet? If they have some hard minimum they require, then what do they expect me to do? Completely abstract from the system scheduler to make it happen?
Doesn't seem a reasonable expectation from programmers on compiler writers, and likely wouldn't really be what they wanted even if I made it happen.
It might be 0.000000000001% but it need not be, right? That's kind of the point. I'm pretty sure I could write you a program where the probability of observing that would be (say) 50% without the optimization, and with the optimization that would go to 0%, defeating the program. Should that be allowed?
This is basically a statistical/probabilistic version of another problem: should a compiler be allowed to change the time complexity of an algorithm? If I write linear search, and the compiler proves the list is sorted, should it be allowed to switch to binary search—or vice-versa? Traditionally the answer might be 'yes', but it's not exactly hard to argue an answer of 'no' might also be desirable in a number of situations. Now in this case, it's not the time complexity that's changing, but the statistical properties of the program... and while in some cases it might be negligible (1E-10 or what have you), that's not necessarily the case, at which point a difference in degree really can become a difference in kind.
But no mechanism was ever guaranteeing the 50% in the first place. So their program could already stop working correctly without any changes and with the same compiler. What I'm saying is I don't know how they'd even tell the difference between my compiler being the problem, or them just being unlucky and not getting the interleavings they want even though they're statistically likely.
I'm saying if you made a compiler that reduced that expected 50% to roughly 0%, it wouldn't matter to the user if it was 1E-10 or exactly 0%. Both could be undesirable for exactly the same reason. Again: you can complain that the 1E-10 compiler is still "correct" because there was never a 50% guarantee, but in doing that you're completely missing the point.
Here's an exaggerated example to get the point across. Imagine your Tesla autopilot's accuracy suddenly dropped from >99.9999% to <0.00001% overnight because of some stupid optimization (maybe this one) that was technically correct. Assume there was never a guarantee provided to the user that it would stay this accurate. Now if your 99.9999% became 99.9998% overnight, OK, I think a sane user would completely anticipate that and live with it. But when it drops orders of magnitude lower, do you really think you could shift the blame to the user because you were "technically" not guaranteeing anything? Is it really the user's fault that your optimization reduced that accuracy to 0.00001% and killed them? Can you walk away with your conscience clear at that point that you did everything 'right'? At what point do you feel like some kinds of changes are just blatantly unfair and uncalled for, even if they're technically "by the book"?
Tesla's autopilot has a guarantee for its accuracy. If that drops to 0.00001%, then that is a problem, because the autopilot functionality is failing to meet its requirements. If that drop in accuracy happens because of a technically correct change, then that just means the actual bug was somewhere else and just happened to work previously. It also means that there is a flaw in Tesla's quality control process that allowed such a change to ship to customers.
If your program breaks because the compiler no longer interleaves atomics, then your program was always broken, and you have just been getting lucky until now. Normally when this type of situation comes up, the programmer's perspective is at least understandable, since what they were doing was somewhat reasonable, even if not allowed by the standard. In this case, the point is that I (and, I presume chrisseaton) cannot come up with any reasonable program that would rely on an interleaving happening 50% of the time.
If the interleaving were originally observed 99.999999% of the time, I could imagine a program relying on it happening. However, if the writes are close enough that the compiler could optimize them together, chances are the pre-optimization probability of it happening was already much closer to 0% than to 100%; which, again, begs the question 'what was your program possibly doing that relied on the original behaviour?'
> Tesla's autopilot has a guarantee for its accuracy.
Which (if true) is completely beside the point. Just pick something that doesn't have a guarantee. The comment wasn't about Tesla. It was about empirical performance vs. written guarantees.
> Normally when this type of situation comes up, the programmers perspective is at least understandable, since what they were doing was somewhat reasonable, even if not allowed by the standard.
Okay great, we agree on that.
> In this case, the point is that I cannot come up with any reasonable program that would rely on an interleaving happening 50% of the time.
That's easily explained by our lack of imagination. No need to assume the person relying on it is being unreasonable.
Hypothetical example: suppose two threads are trying to alternate running some specific instructions with minimal latency. Maybe there are hardware requirements (e.g. they want to avoid heating some specific components of a CPU core for too long?), so they try to alternate work via an atomically increasing counter? That's a starting point at least; feel free to improve the scenario.
> so they try to alternate work via an atomically increasing counter?
Then why did they ask for an atomic that doesn't provide that instead of one that does? The docs on memory_order_relaxed are pretty clear it provides no thread synchronization of any kind: https://en.cppreference.com/w/cpp/atomic/memory_order#Relaxe...
This isn't some undefined behavior quirk being used to optimize, this is well-defined behavior being optimized in a well-defined fashion.
> The docs on memory_order_relaxed are pretty clear it provides no thread synchronization of any kind
No, that's not what it's saying. "Synchronization" is an imprecise term, and you're misunderstanding what it means here. Atomics (as is literally in their name) perform each operation atomically; by definition they provide at least that thread synchronization guarantee, even if you believe they provide nothing else.
The documentation explains what "are not synchronization operations" refers to in the subsequent sentences; it's most certainly not saying it doesn't provide thread synchronization "of any kind" (again, I already listed one above). In fact, it lists 2 specific kinds of thread synchronization it does provide: "atomicity" (i.e. operations are all-or-nothing, as above) and "modification order consistency" (i.e. alterations of that same variable occur in a consistent order across different threads, even under std::memory_order_relaxed). What it does not provide is a consistent ordering across different variables ("an order among concurrent memory accesses"), which is fine as it's not something we need in this example.
Collapsing multiple atomic operations into one has nothing to do with ordering, nor with multiple locations in memory. In fact, I don't think even std::memory_order_seq_cst can prevent this. The only way I can imagine preventing it is to mark atomics as volatile, and even then I'm not sure if it's guaranteed under the as-if rule.
volatile should work. Each volatile access is a side effect from the point of view of the abstract machine and compilers are supposed to document this mapping to the concrete CPU (and usually is the obvious one).
In fact the fact that volatile atomic has additional guarantees is a strong reason for plain atomic not to have them, otherwise there wouldn't be a way to express the more relaxed requirement.
I think atomic not being volatile by default is a good default: the majority of code doesn't need to care about atomic in the first place, for the majority of the remaining code, these sort of optimizations are benign and desirable. For the remaining code that cares, volatile is the safety hatch.
I'm not so sure volatile is sufficient... I expect it would depend at least on how the variable is allocated/stored. Merely putting volatile shouldn't be sufficient to prevent access to it being optimized out. I forget if this was disallowed since C++20, but consider the program 'int main() { volatile int x = 0; ++x; }'. I'm fairly certain the compiler can optimize out the body despite the use of volatile, because of the as-if rule: it would be impossible to tell the difference from within the program.
Moreover, I'm not sure 'volatile' memory access without some sort of external linkage is even meaningful. The idea of volatile is that access to a volatile variable could have observable side effects outside the program, but that discussion seems rather moot for a variable that doesn't have connections to the outside world to begin with. Heck, by the same reasoning, even for a variable with external linkage, the rules might still permit it to be optimized out under link-time optimization, unless it has been declared to be externally reachable somehow (e.g. at a fixed/exported/otherwise 'known' address)... otherwise it's by definition externally unreachable and thus again unobservable under the as-if rule.
At least, I could see the reasoning going that way, and that's how it makes sense to me as far as the abstract machine goes. I can also see people disputing it though, and claiming (say) a debugger inspecting a program is sufficient accessibility for considering side effects, or something like that. I'm not sure I agree with every objection, but I see room for debate. It's just not clear what the implications are under the as-if rule.
The ++x is a side effect from the abstract machine point of view. The issue is of course how the compiler maps the abstract machine to the real machine. Volatile is (purposefully) underspecified and I guess it is not completely unreasonable for a compiler to declare that a local volatile object whose address is never taken does not have a mapping at all.
There is little value on explicitly optimizing volatiles but of course that can happen as a side effect (ah!) of unrelated optimizations.
I rarely use volatile at all, but when I do, I do make sure not to apply it to local objects.
edit: I think the most reliable and portable way to make sure that the address of a variable escapes is something like this:
Yeah, I think this is probably sufficient without whole-program optimization, but I'm not sure it is with LTO. It might be sufficient today, but I can imagine Clang optimizing it away eventually. Really, the only realistic-yet-futureproof method I can imagine (and I'm not sure if it's even within the jurisdiction of the language spec) is either (a) exporting the variable, or (b) exposing its address through some other established I/O mechanism (like printf).
I guess I don't know why the user would be betting on interleavings which I never said I could guarantee. If you build your car with a part that isn't rated for what you use it for, that's your fault, not the part manufacturer's - even if it worked every time when they tested it and they never saw it fail before the one time it kills someone.
I'm open to ideas on how to create understanding both ways about what the user wants and what the compiler is able to provide.
> I guess I don't know why the user would be betting on interleavings which I never said I could guarantee.
Same reason people walk outside without bodyguards and "bet" on coming back home alive despite the fact that nobody ever gave them such a guarantee. It's not insane to expect the future to bear some kind of resemblance to the past without some contract to guarantee it, is it?
> It's not insane to expect the future to bear some kind of resemblance to the past without some contract to guarantee it, is it?
Well you used the example of a car, and I think it would be insane to use a component in a car when the manufacturer tells you it's not designed to do that and that they can't possibly guarantee it's going to work just because you think you tested it and it seemed fine.
I don't get why you're sighing at me when I'm answering your questions politely. If you don't like my replies feel free to stop asking me questions.
Your example about going outside isn't covered by a technical specification with a long list of guarantees formally agreed between two parties so I don't really see how it's comparable.
The question was about whether the original designer deserves any blame for making a sudden change like that. Not about whether the manufacturer is at fault vs. the user vs. a bunch of other middlemen that weren't in the story. That twist just derailed the discussion, and it's frustrating to discuss something when the other person is so keen to avoid the crux of the question. So much so that you didn't even try to address everything else I wrote.
> The question was about whether the original designer deserves any blame for making a sudden change like that.
I don't believe so - as long as there is a version change and it's documented in the change log, which I aim to do with optimisations I write.
If you're using undocumented apparent behaviour, please check the change log when upgrading, or ask us to document the behaviour. But as already described, we may not be able to guarantee what you want while keeping other things you depend on like performance!
> If they have some hard minimum they require, then what do they expect me to do? Completely abstract from the system scheduler to make it happen?
What if they say the hard minimum is "occasionally"?
They're not holding you responsible for the scheduler, they just want you to leave it alone as much as possible. (and not be deliberately malicious, just in case that needs to be said)
And sure, there should be a way to signal intentions. But my point is that there is a reasonable way to describe what certain people would expect by default.
But is it possible? The CPU can, in principle, coalesce writes, especially to the same cache line, so on a given microarchitecture the intermediate value might actually be impossible to observe even without compiler optimizations.
It's because people still confuse std::atomic with volatile. Actually, there is a way to get both behaviors: 'volatile std::atomic'. Personally, I never had a situation where I would've needed that, and I use atomics on a regular basis.
I think at least regarding concurrency, most developers need to rely on a model that is "good enough". You can't know everything, therefore you need to contend with knowing at least enough that is necessary to do your work.
Thinking that atomics will not be optimized by the compiler is "good enough" in my opinion for almost, and I really mean almost, everybody.
Heck, I interview people and atomics optimization is one of my staple questions (Tell me, what does volatile do in Java, exactly?). I have never heard a thorough answer to that question and yet people do work as developers and do produce working applications.
The only real users of these optimizations are atomics primitives developers, and they use them to improve those primitives a tiny bit in some special cases.
You can technically contrive a test where you run an update that changes something to 1, then immediately to 2 on each iteration. Then you try to look at it from another core and you would expect that at least once in a blue moon you will see 1. But I just don't see any sanely written application to rely on that kind of behavior.
If they need some interleaving to happen at least n% of the time for their program to work... what am I supposed to do? Because it was never guaranteed to be interleaved at least n% of the time, even when the atomics were not merged. Do they want me to bypass the system scheduler and run threads concurrently with forced interleavings against a statistical distribution? I doubt they want that. So what do they want?
> int rlo() {
> // Dead store eliminated.
> y.store(0, std::memory_order_release);
> // Redundant load eliminated.
> x = 1;
> return 0; // Stored value propagated here.
> }
This is actually wrong, for subtler reasons than one might initially think[0]. Consider the case where rlo is called with x!=0,1 and y!=0. Another thread that observes y=0 is guaranteed not to see the original value of x - it might see 0, it might see 1, but something has to have been written to x by the time it observes y=0. The above optimization allows even a single-processor execution to context-switch at "// Redundant load eliminated" and observe y=0, x=??? from another thread.
Oddly, the correct code seems to be:
x = 1;
y.store(0, std::memory_order_release);
return 0;
On the face of it, the load acquire seems like it would prohibit that, but that's only a problem if it (the acquire) observes some release with which x = 1 can't be reordered. But it's hard-coded to always observe the y = 0 release, so that can't happen. (I'm less sure that this is correct than that the previous version is wrong, admittedly.)
0: In particular, it's not just a matter of "x = 0 has to happen because store-release".
I'm fairly sure it's correct, precisely because of the initial store `int x = 0;`, for the reason given by the article. If that were `int x = 2;` instead, then I agree that the transformation would be invalid.
(Perhaps this is what you meant? but I wanted to clarify.)
In case anyone is curious, here are the relevant parts of the C++ reference:
> Absent any constraints on a multi-core system, when multiple threads simultaneously read and write to several variables, one thread can observe the values change in an order different from the order another thread wrote them. Indeed, the apparent order of changes can even differ among multiple reader threads. Some similar effects can occur even on uniprocessor systems due to compiler transformations allowed by the memory model.
> If an atomic store in thread A is tagged memory_order_release and an atomic load in thread B from the same variable is tagged memory_order_acquire, all memory writes (non-atomic and relaxed atomic) that happened-before the atomic store from the point of view of thread A, become visible side-effects in thread B. That is, once the atomic load is completed, thread B is guaranteed to see everything thread A wrote to memory.
> The synchronization is established only between the threads releasing and acquiring the same atomic variable. Other threads can see different order of memory accesses than either or both of the synchronized threads.
Right, but writes to x from another thread (i.e. by a thread other than the one in which rlo is called or the thread reading y) aren't required to be visible when y is read.
Edit: ah, I think I understand what you're saying – if a thread that calls rlo also modifies x elsewhere, beforehand, then yeah, the store can't be eliminated.
That depends. rlo() is not the only function that is allowed to have synchronization primitives. It is possible that before rlo() is called, some other code modifies x, and all threads use some synchronization primitive to guarantee that they observe it. There is no notion of lexical or compilation unit scope for synchronization, so (unless it is doing whole program optimization) the compiler has no way of knowing that no other writes are visible after initialization.
Agreed, but we might be stepping outside the scope of the example. Likely, you can make most examples incorrect if you're allowed to add arbitrary code; I read this section of the article with the implicit assumption that "the thread calling rlo() performs no other writes to x" and "there's no additional synchronization", since the author didn't put any such code in their example and only claimed that the optimization was possible for the example they gave. (This is Grice's maxim of quantity, I suppose?) Although they probably should have been more careful and clarified that the example they gave doesn't work regardless of context.
I'd love to have a compiler that just tells me what optimizations could be made, but let me either do it myself (if syntactically available in the language), or explicitly mark the optimization as ok through a #pragma or something like that. I just think having to read the disassembly to figure out what happened isn't a great user experience.
Maybe not, but I feel like as a C programmer I'm already kind of a control freak, and I really want to be able to tell the compiler what I want and where, I don't want magic. I'm using C explicitly because I want to know and control what's going on.
If I'm dealing with atomics and concurrent code I really want to tell the compiler, hey, inline constants maybe, but don't reorder or combine operations. Even if the compiler is "right" or my code is subtly wrong, I don't want the compiler making things worse.
I hope you're not programming for anything more advanced than a Z80, PIC16F or 6502 then because pretty much any CPU more advanced than that will likely also do that internally.
Yeah. What good is demanding that the assembly has a certain structure if that doesn't mean that the machine actually does it that way? Out of order execution and speculative execution are optimizations that improve performance by 10x or more. They aren't optional.
> If I'm dealing with atomics and concurrent code I really want to tell the compiler, hey, inline constants maybe, but don't reorder or combine operations.
If that's what you want, then declare your variables volatile (in addition to being atomic, not instead).
Atomic instructions wouldn’t be reordered like that on a CPU, that’s the point of using them.
Non-atomic ones might be, and volatile ones would be, since volatile is only a C concept. Although it wouldn't be through instruction fusion, more like non-coherent caches.
Doesn't do exactly what you want, but try -fopt-info-all in GCC. Gives you a reasonably informative guide of what it tried to do and whether it used the transformation or not.
This would be fun, and it can be made available from the major compilers, but using it almost certainly bad for most developers. Any nonstandard decisions you make here are likely to be unstable. Even if they are better for one build, they will sit in a config file and rot to the point that you've sabotaged your future performance by making decisions for an older build. I see this all the time with nonstandard flags for performance. Somebody five years ago had a problem with the VM and needed to change the way GC worked and set a flag. Nobody remembers why and they are all afraid to touch it. Now the application is getting worse performance.
I too was briefly worried that Ix was up to no good with their thinking machines. Honor the Butlerian Jihad! This kind of optimization should be left to the Mentats.
Writing correct, even if non-optimal, code that uses atomics is already highly non-trivial. Transforming code that uses atomics into more efficient forms while guaranteeing the same behavior is even more non-trivial.
>Get back to work, there’s so much more to optimize… and so much code to break! Help users write good code: the compiler should provide diagnostics when it detects anti-patterns or misuses of atomics.
Is it worth putting effort into compiler optimizations instead of putting more effort into providing "diagnostics" and information for the programmer?
Compilers are sane and written by very smart people. They do have bugs just like any other software, but in my 20+ yrs of coding, I can count the number of compiler bugs I've encountered on one hand.
Calls to memset tend to be optimized out in the dead-code-elimination pass, which relies heavily on SSA. Compilers adhere to the abstract machine spec. If you want the write not to be optimized out, use memset_s, whose removal the spec explicitly disallows.
If an optimization relies on UB and surprises the programmer, then the programmer was also relying on UB in the first place.
The problem with programming is almost never the lack of smartness. It's almost always the opposite: People being too clever.
The sane thing in a realistic scenario is to ban all usage of memset and add a diagnostic to use memset_s instead, like any responsible programmer should do with strcpy. But then, was it really wise to make this machinery necessary, when the people who need it most are the most likely not to have it? I really, really don't think so.
Edit: Here is some code written by a very smart person, Niels Ferguson. I think (not sure) it tries to wipe sensitive material from memory and gets it wrong. Really not sure, tho.
There's absolutely nothing wrong with using memset. I don't understand why you would suggest such a ban. If your goal is to scrub a memory region after you're completely done using it then you shouldn't use memset.
Compilers won't optimize out memset if it changes the program's behavior within the boundaries of abstract machine's specification.
A call to memset to clear a memory region after you're done with it because a password was stored there can be optimized out because it doesn't change the behavior of the program.
The code you shared is incorrect if your definition of correctness here includes K being wiped out.
> A call to memset to clear a memory region after you're done with it because a password was stored there can be optimized out because it doesn't change the behavior of the program.
Well, that does change the behavior of the program. After all, the behavior of the program was to wipe the memory. The programmer specified it. The compiler is clearly wrong by any reasonable measure. It's removing an intentional, clearly specified instruction. It's just wrong. It's making an "interpretation" of something the programmer intended as a fact, and breaking things in the process. Why does anyone think this is desirable? Any good programmer knows that correctness comes before speed.
I've been programming C++ since 1998 and I didn't even know this till reading this thread. Blaming the user here is counterproductive. Compilers should not make this optimization, plain and simple. memset should set the memory. If you think otherwise, I invite you to take responsibility for all the users data you have placed in harms way.
Btw to the downvoters, explain to me why a function that is clearly intended to create a side effect that is visible outside the process, i.e. something you can't reason about, is safe to remove? Memory debuggers are not new or exotic tech. Why do you think it is reasonable to remove something a programmer clearly thought should be done?
> explain to me why a function that is clearly intended to create a side effect that is visible outside the process
This is simply not accurate. memset_s provides that guarantee. memset does not. It would appear that you have been under the impression that it did. However, it did not.
It does not. Almost all memset_s implementations are broken; just not mine. memset_s merely ensures that an unsane compiler will not optimize it away, whilst it should also ensure that it is fenced. Otherwise you can still read the memory from the cache.
Calling the ordinary memset_s secure is a bug, it's a mere convenience function.
Doesn't it seem bad that the default that's been taught is wrong? Admittedly I haven't read a C programming book since 1998 because I've been coding that long, but it strikes me as incredibly fucked that you need a separate memset_s to mean "I'm not fucking around". When I learned C, memset just meant, uh, set this memory, it was pretty straightforward.
There's a big difference between you not liking something and it being wrong. You want to make the common case slower for everyone so that you don't have to learn one extra rule for your uncommon case.
> it strikes me as incredibly fucked that you need a separate memset_s to mean "I'm not fucking around". When I learned C, memset just meant, uh, set this memory, it was pretty straightforward.
I don't think it's ever worked the way you want it to work.
So you are saying that we should throw away every single optimization and go back to C being a fancy assembler? If we consider the memory layout inspectable at all times, pretty much any optimization is impossible. Also, the compiler produces code for the CPU, not for a debugger; there you can't implicitly view memory at arbitrary places.
I personally prefer to write code that is readable to humans and gets translated to the optimal instructions, over having to learn tricks known only by some elite C hacker that will actually be inferior, since CPUs are finicky. Even with a very good understanding of the CPU, one cannot really guess which instructions are faster to execute without benchmarks.
Nonetheless, there are tiny C compilers that do next to no optimizations, feel free to compare the speed of the resulting binaries.
memset is not clearly intended to create a side effect that is visible outside the process. In fact, setting memory is not a visible side effect from the C abstract machine's point of view.
The compiler cannot read minds; if the programmer clearly thought the code should behave in a certain way, they should have told the compiler in a language it understands. Volatile would probably be part of that solution.
That's the thing: we live in the real world. Yes, the C abstract machine, blah blah blah. Ok, but I called memset for a reason, and the compiler writer, being also of the real world, should realize maybe I'm deleting sensitive data? And yes, I know there's memset_s, but it strikes me as bad prioritization if I have to seek out a secondary function to do what I mean.
How can the compiler realize you are deleting sensitive data? The compiler is dumb and just applies a bunch of rules; it has no way to infer intent from your code unless you explicitly tell it so. And no, despite the name, memset does not mean "set the memory to 0" any more than int i = 0 means "set the memory to 0".
BTW, I think memset_s is also problematic: there is nothing preventing a compiler from storing a copy of your data anywhere else. At the very least is likely to be in registers, possibly partially spilled on the stack.
> how can the compiler realize you are deleting sensitive data?
It can't, so don't remove my memset. Why is this hard. Compilers are dumb, yes, so don't delete the thing I wrote on purpose.
You are right about the definition of memset. It's unreliable. My point is why is that ok with anyone sane. If I clear 128 bytes on a stack frame that's returning, sure, it's not optimal, but the compiler doesn't know why I did it and I surely did it for a reason. The definition of memset is bad.
why do you even expect those 128 bytes to be on a stack frame at all? I would expect the compiler to keep stuff in registers as much as it can and reuse stack slots as soon as they are no longer needed. And I would certainly expect to remove dead stores aggressively.
I don't use memset much, but I use memcpy and certainly do want (and expect) the compiler convert memcpys to register moves when possible and even to eliminate them when the result is not actually used.
Memory debuggers aren't mentioned in the C spec, and an implementation that let you see all past contents of memory would be valid (and pretty useful), in which case that memset wouldn't be hiding anything.
being smart doesn't eliminate the possibility of making mistakes. Are you expecting that a smart human being isn't going to make mistakes? Go through the revision history and see how many bugs got fixed.
Isn't memset different in that it's an external function that compiler writers have chosen to intercept and perform themselves? Why is saying 'use memset_s' acceptable when they could have stopped treating memset as a special case?
If you include a standard header (using <> brackets) and call a function that the standard says get declared when doing that, the compiler is allowed to assume you want the standard behavior, and inline that/replace it by a specialized version/etc.
Optimizing memset, in particular, can reap huge benefits, for example for sizes of 4 or 8 bytes in tight loops or for zeroing cache line sized blocks.
> Isn't memset different in that it's an external function that compiler writers have chosen to intercept and perform themselves?
The semantics of memset and memset_s are defined by a standard. They're not unknown functions that could do anything, and compilers aren't breaking a rule by treating them differently, as you're suggesting they are.
memset is not a special case. Compiler designers are very much against special casing, and rightfully so.
Compilers do have an intrinsic equivalent but that has nothing to do with whether it gets optimized out or kept.
Following SSA rules, if you are writing to non-volatile memory and not reading from it for the remaining duration of its lifetime, then it's dead code.
use memset_s was a suggestion if your goal is to scrub memory because it guarantees the write will happen. The guarantee comes from the C standard:
> Unlike memset, any call to the memset_s function shall be evaluated strictly according to the rules of the abstract machine as described in (5.1.2.3). That is, any call to the memset_s function shall assume that the memory indicated by s and n may be accessible in the future and thus must contain the values indicated by c
Which essentially means treat the memory as-if it's volatile.
Treating "the memory as-if it's volatile" would not block such an optimization. Specifically, zeroing a volatile or atomic that lives on the stack immediately before returning from a function whose scope it lives in is freely optimized away, because it is not observable.
Many people mentally model atomic as akin to volatile, about which they also frequently harbor superstitions.
I knew a FreeBSD core committer who insisted (and convinced our boss!) that volatile was entirely sufficient to serialize interactions between threads.
Someone else expected a line with a volatile or atomic operation never to be optimized away, and was appalled that it happened. Assignment to a volatile stack object whose address has not escaped and is not used will, in fact, be optimized away, regardless of the "clear language of the Standard".
Often, not performing what would appear to be a pointless optimization would block a more substantial optimization. Compiler writers optimize aggressively in order to break through to the substantial ones, which are most usually dead-code elimination that can then propagate backward and upward.
If you really need for a volatile object and operations on it not to be optimized away, you need to allow a pointer to it to escape to somewhere the compiler cannot see. Have a function in a separate ".o" file, and pass the pointer to that. Or, better, use a compiler intrinsic or other facility provided specifically for the purpose, such as memset_s or explicit_bzero.
> Specifically, zeroing a volatile [...] that lives on the stack immediately before returning from a function whose scope it lives in is freely optimized away
> Assignment to a volatile stack object whose address has not escaped and is not used will, in fact, be optimized away
I don't believe either of these statements. https://godbolt.org/z/8cqEdY9oE is a simple case where this doesn't happen. Can you post a Godbolt link or something that demonstrates when it does?
Keep in mind that the standard says "Accesses to volatile objects are evaluated strictly according to the rules of the abstract machine." In particular, it doesn't say that accesses to non-volatile objects through volatile-qualified pointers will be handled in any special way. (This looks like it's going to be changed in C2x, but it hasn't been changed as of C17.) Also, it says "If an attempt is made to refer to an object defined with a volatile-qualified type through use of an lvalue with non-volatile-qualified type, the behavior is undefined."
Any particular compiler and command-line setting where it is not optimized away tells you nothing about what that or some other compiler may do in a different release or command-line setting.
Talk to any core compiler implementer, and they will say that the optimization is allowed. No compiler performs all permitted optimizations, in any release, but the set of those each does do, varies.
> Any particular compiler and command-line setting where it is not optimized away tells you nothing about what that or some other compiler may do in a different release or command-line setting.
Can you provide a different program, compiler/version, and command line where it will be optimized away?
> Talk to any core compiler implementer, and they will say that the optimization is allowed.
Are you just guessing that they'll say this, or have they actually said this? If the latter, can you link to where they did?
> No compiler performs all permitted optimizations, in any release, but the set of those each does do, varies.
My Godbolt link is a really obvious, easy place that the optimization you claim is possible could be performed. I think the fact that none of the 4 major compilers optimized it away is decent evidence that it isn't allowed.
> have they actually said this? If the latter, can you link to where they did?
They have said it face-to-face, in person. It is possible that, with enough pushback, they have since disabled the optimization. (They all work at large corporations, nowadays, and must obey management direction.) Or, more likely, disabled certain cases of it. I will ask again.
For documentary evidence, you may consult the definition of C11 memset_s, which is specified not to permit the optimization. It is hard to imagine such a requirement being perceived as needed, absent cases where calls to memset have actually been elided.
I will also experiment further with Godbolt. You might have failed to tickle the compilers in just the right way.
> They have said it face-to-face, in person. [...] I will ask again.
If you do ask again, can you do so somewhere like a public mailing list, so I can read their response myself?
> For documentary evidence, you may consult the definition of C11 memset_s, which is specified not to permit the optimization. It is hard to imagine such a requirement being perceived as needed, absent cases where calls to memset have actually been elided.
The point of memset_s is to prohibit the optimization when the object isn't volatile.
Are you allowed to cast the void * to volatile void *, pass that to memset and expect the same result, or is the void * to volatile void * to void * conversion eliminated as well?
As of C17, adding a volatile qualifier like that doesn't do anything useful. The standard currently only requires that accesses to volatile objects be handled specially, not accesses to nonvolatile objects through volatile-qualified pointers. A lot of people, including Linus Torvalds [1], are unhappy that this is how it works, and it looks like DR 476 may finally change it in C2x.
You can wave whichever dead chickens you like over the keyboard, all to the same effect. The compiler ensures only semantics that are observable.
If you pass a pointer to a volatile or atomic object to another function the compiler cannot see into, the compiler knows that any specified operations might be observable, so must act accordingly. Then, operations on that object will not be optimized away. Or, anyway, not until someone builds the program using Link Time Optimization ("-flto") and the called function body becomes visible.
Casting the pointer passed to memset has no effect. The compiler knows what the storage pointed-to is: a temporary object. The compiler knows that memset does not allow a pointer to the object to escape from view. So, it can eliminate the call.
Calling C11's memset_s, or MS's SecureZeroMemory, or FreeBSD's explicit_bzero, has the effect, to the compiler, of making the object observable, so that operations on it will not be optimized away. There is no way to express this locally within the language; you must rely on facilities provided by a Standard or your platform, or on the dodgy linkage trick cited elsewhere.
However, memset's declaration takes a void *, which prevents you from passing a volatile void *. Casting or not won't change anything in this case, because what matters is the parameter type memset declares, not the pointer's fully qualified type prior to the call.
From the compiler's perspective, you're passing a non-volatile pointer to memset, which makes the call a candidate for elimination if the pointed-to object's lifetime ends right after the memset call.
The default "memset" should just do what the programmer intended, and there should be a "memset_kinda_maybe" for the current interpretation. I think it's totally busted that users need to guess what way the compiler is going to maul their program.
99% of C/C++ programmers are not bored language lawyers with time to memorize thousand-page specs. They use the language because they're pragmatic. Messing up people's code with incredibly over-eager optimization is the opposite of pragmatism. I'd rather spend a day writing inline assembly for a hot path than two days dealing with the compiler breaking my code and making users' data insecure because "usually" it's OK to ignore a memset.
Frankly, just learn the language properly…
Compiler writers can’t throw their hands up in the air and do some non-deterministic/non-specified thingy. That would be terrible.
I know the language perfectly fine; that doesn't stop me from recognizing how broken this all is. I can dance around these horrible land mines, I'm just saying take the landmines out because people are getting hurt.
If anything I think they are written by people who are too smart --- theoreticians who care only about the standard and can't be bothered exercising common sense when the standard doesn't specify what to do.
No, I've read the standard. The "abstract machine" is, as its name implies, abstract. In reality, there are other concerns that are equally if not more important. That is the problem with compiler writers, ignoring those practical concerns. They prefer to downvote rather than face the reality of going outside their imaginary abstract machine world.
"In theory, theory and practice are the same. In practice, they are not."
The C standard can be thought of as a bunch of minimum requirements that program authors and compiler authors have to follow. What you seem to be arguing is for compiler authors to go above and beyond what's required of them, so that you as a program author can get away with doing less than the minimum that's required for you.
Reminds me of a comment I once saw about optimizing on UB, from a compiler maintainer at Microsoft. He pointedly said the gcc maintainers would do better to work on their crummy code generation.
On the contrary, first of all there is no rule that requires a C++ compiler to also be a C compiler.
Second, Microsoft considered C done on Windows, and held that green-field systems programming should be done in C++; there were a couple of blog posts on the subject, which apparently haters keep forgetting to read.
So no, it wasn't about when it was convenient, rather to the extent required by ISO C++ compliance.
In fact, as per Microsoft Security Center advisory, C# or other managed languages, Rust or C++ with Core Guidelines, should be the main languages to use for new applications.
Most likely they ended up backtracking on the goal to drop C support from Visual Studio due to pressure from high-profile customers, WSL and FOSS code on Windows, or even the culture change brought by Satya.
> On the contrary, first of all there is no rule that requires a C++ compiler to also be a C compiler.
This is true, but an OS with no C compiler is horrifically bad. If Microsoft didn't want their C++ compiler to also be a C compiler, then they should have made a separate C compiler, instead of forcing people to rely on something like MinGW.
The compilers are separate, just like in most C and C++ compilers out there.
There is hardly any reason to keep writing new C code in 2021 other than UNIX like codebases or embedded development that keeps resisting C++ adoption.
I was actually disappointed with Microsoft's decision to backtrack on their decision.
Was there a separate C compiler? I thought Microsoft had a C++-only compiler but not a corresponding C-only compiler, and that the lack of a separate C compiler was the problem.
> There is hardly any reason to keep writing new C code in 2021 other than UNIX like codebases or embedded development that keeps resisting C++ adoption.
> I was actually disappointed with Microsoft's decision to backtrack on their decision.
You seem to be saying that C is deprecated or something, but it isn't.
Then, just like in most C and C++ compilers, there is a common frontend that invokes the corresponding compiler infrastructure.
C was deprecated, from Microsoft's point of view, for systems programming on Windows; apparently complainers keep forgetting to actually read Microsoft employees' blog posts.
Here, have a go at them, to spare you some Google-fu.
This attitude drives me crazy. C was once a simple language, but now the only people that can write it "correctly" are compiler implementers, and if they fuck up your code, it's your fault. You know how Linus will practically murder anyone that breaks user space, even if user space is "wrong"? I wish compiler makers had the same ethos.
BTW, go look at how many optimizations are explicitly disabled to compile the kernel without it breaking and then come back to me with how these are a good idea. One of the most critical pieces of software infrastructure in the world looks at how broken compilers are and says yeah, leave that junk out.
There are plenty of optimizations in Linux, they just pick ones that don't break userspace. Similarly compiler writers could do optimizations without breaking user programs. Although I guess testing compiler backward compatibility would probably be more difficult than kernel backward compatibility, since compilers present a larger interface (multiple entire programming languages).
> Similarly compiler writers could do optimizations without breaking user programs.
This isn't possible, since "breaking user programs" in this context means "violating the intuition of the programmer", who is an unknown person who has never written their intuition down to be observed by the compiler authors.
If you are actually talking about breaking user programs, both C and C++ already have ABI stability as a design goal, which hugely limits changes to the languages but enables people to keep linking against binaries that were compiled ages ago.
GCC and clang developers already do regularly compile and test a limited number of open source packages, and investigate any newly occurring failures. For now they ignore failures if the code does not conform to standards (except for notifying upstream developers). In principle, in such cases they could expend effort to find similar (possibly simpler) optimizations that do not change the existing behavior of the code. Alternatively, they could disable the relevant optimizations at -O2 (and maybe -O3) level.
I believe MSVC developers do put more effort than GCC/Clang developers to maintain source-level backwards compatibility. Alternative development strategies are certainly possible, but I understand that the GCC/Clang developers do face pressure from a significant subset of users to maximize performance at any cost.
> In principle, in such cases they could expend effort to find similar (possibly simpler) optimizations that do not change the existing behavior of the code.
This wouldn't just be expending effort. It would be making all code GCC/Clang ever compiled be slower.
> Alternatively, they could disable the relevant optimizations at -O2 (and maybe -O3) level.
Your point is that the standard of "optimizations shouldn't fuck up working code" is unreasonable? What point is there to optimizing something that's wrong? So you can break things 10% faster?
> Your point is the standard that "optimizations shouldn't fuck up working code" is unreasonable?
By "working code", do you mean code that the standard and/or target implementation promise will do what you want, or do you mean code that just happened to do what you want the last time you tested it on your system? If you mean the latter, then yes.
What percentage of code do you honestly think has no undefined behavior? (hint: hardly any). Are you fortunate enough to work in one of those code bases? Remember it's not up to you alone not to write UB, it's up to everyone that touches the code base.
If you are stuck with UB, would you prefer a compiler that does the safe thing, or a compiler that unleashes dragons?
> What percentage of code do you honestly think has no undefined behavior? (hint: hardly any). Are you fortunate enough to work in one of those code bases? Remember its not up to you to not write UB, its up to everyone that touches the code base.
Maybe we should focus on better warnings, static analyzers, and UBSan, so that people with UB can fix their code, rather than not letting compilers optimize anymore.
> If you are stuck with UB, would you prefer a compiler that does the safe thing, or a compiler that unleashes dragons?
I'd very much prefer that my compiler unleashes dragons, so that I notice the bugs and can fix them. If my compiler did what you want, then when someone else compiled my code with a different compiler, they'd find them instead. (And it's obviously totally unrealistic to expect that literally every compiler in the world would do what you're asking.)
The example about signed overflow misses the point: on good code it results in a performance improvement. Similar for dereference of null pointers: it allows the compiler to remove redundant successive checks (e.g., once you dereference *ptr, all further null checks on ptr can be dropped until it is written to). This makes it faster to express preconditions while letting the compiler make safer code faster.
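The null-pointer case described above looks roughly like this (a hedged sketch; the function name is illustrative):

```c
/* After the dereference on the first line of the body, the compiler may
 * assume p is non-null, because a null dereference would be undefined
 * behavior. The later check is then provably dead and may be removed. */
int first_or_default(int *p)
{
    int v = *p;     /* UB if p is NULL, so the compiler assumes it isn't */
    if (p == 0)     /* candidate for elimination under that assumption */
        return -1;
    return v;
}
```

The same reasoning shape applies to signed overflow: `i + 1 > i` can be folded to true for a signed int, which lets loop-bound analysis produce tighter code.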
Because, when a programmer writes a call to memset, she expects, if you as a compiler take her seriously at all, the memory to actually be set, and not for it to be a mere suggestion 1). They should have added a separate function and called it memsetOrDontIfYouThinkINeverReadThisAgain. Does this make sense?
1) Yes, I know there are rules and all that, which are precise and allow this.
If you take this stance most optimizations are off the table. Optimizations rewrite code in some way to ensure it runs faster. In that sense yes, the entirety of a program "is a mere suggestion on how to compute the output". If you actually want to tell the CPU what to do you have to use ASM and an ISA that specs exactly what every instruction does and how it does it.
> If you actually want to tell the CPU what to do you have to use ASM and an ISA that specs what exactly every instruction does and how it does it.
Even that isn't enough, unless you go with a CPU that doesn't have any branch prediction, speculative execution, or out-of-order execution. I'm not aware of any such processors.
That's exactly what I meant by "how". If an ISA has branch prediction, speculative execution or out-of-order execution, that is definitely included in how the ISA handles its instructions.
Let me paraphrase this differently, on a more pragmatic level, from the perspective of a compiler user rather than a compiler author: how likely do you think it is that a program that contains memset and has this optimisation applied thereby diverges from the intended behaviour and now has a serious flaw in it? The answer to this question is the answer to your question "Why not?".
When people write C, they often do so for specific domains: crypto, embedded, device drivers. They choose C for these domains because they think, and they are taught, that C is some kind of macro assembler, because in these domains they care deeply about the "how". If they didn't, they would write Java or any other current top-20 programming language. Compared to this, running with almost all optimizations off looks reasonable, anyhow.
I am deeply confused by your argument, for the simple reason that you can just turn off optimizations. In fact, it's literally the default in most compilers to have no optimization, you have to explicitly request them. Are you actually complaining that a compiler told to optimize is... optimizing?
> How likely do you think it is that a program that contains memset and has this optimisation applied thereby diverges from the intended behaviour and now has a serious flaw in it?
What if this optimization happens after three levels of inlining and some other dead code elimination? Is that "diverting from intended behavior"?
> When people write C, they often do so for specific domains.
The article is about C++ which people choose because they want to go fast. Which is also the dominant reason people choose C (embedded & device drivers both absolutely care about performance & efficiency, too, after all).
You can't have both "C is fast" and also "compilers can't optimize." That's not how that works. Similarly if you're just starting out in crypto and you're reaching for C you're already in a bad place and atomic optimizations are the least of your concerns. C is a terrible language for crypto.
> How likely do you think it is that a program that contains memset and has this optimisation applied thereby diverges from the intended behaviour and now has a serious flaw in it?
Almost every time.
Yeah, the times when it leads to problems are more important individually. And I have no idea of the overall picture. But looking purely at frequency, the cases you enumerate basically never happen.
C really needs a "I mean this!" marker that one can use on those cases. Rust also would gain from some kind of predictable region similar to unsafe, but C would have to gain it first.
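GCC and Clang do offer a non-standard approximation of such an "I mean this!" marker: an empty inline-asm statement with a memory clobber, which forces the compiler to assume the pointed-to memory may be read afterwards (a compiler-specific sketch, not portable C; the function name is illustrative):

```c
#include <string.h>

/* The empty asm takes p as an input and declares a "memory" clobber,
 * so the compiler must assume the statement may read or write any
 * memory. The preceding memset therefore cannot be treated as a dead
 * store: its result is now "observable". */
void scrub_with_barrier(void *p, size_t n)
{
    memset(p, 0, n);
    __asm__ __volatile__("" : : "r"(p) : "memory");
}
```

It is a vendor extension rather than a language feature, which is exactly the gap being complained about here.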
I would assume it doesn't diverge, because the user cannot tell from the output of the program whether a memset was optimized away. If you want the memset itself to be observable you need to use volatile or memset_s. Otherwise you are not describing your intent properly.
If people truly still think C is a macro assembler, then that's the problem. Besides, if you care deeply about the "how", modern high-performance architectures won't cut it anyway.
I think the thing is users of C compilers generally care about raw performance much much less than compiler maintainers who are universally using C++.
Compiler writers are stuck in a trap. The C++ compilation model is broken, which makes compiling C++ programs very slow, which means compiler writers care a lot about how fast compilers are. So they desperately add more optimizations to speed up the compiler itself, which in turn adds more work to the process of compiling a program.
Most other users, especially C programmers, don't care about speed nearly as much, notably because while the compilation model for C is also broken, it's not nearly as broken. So their programs compile in a few seconds, not minutes to hours.
I submit that this viewpoint is very mistaken. If it were true, "users of C compilers" would just compile with optimizations off and there would not be any problem. Evidently, "most of the other users" do care about performance. Unsurprisingly, I might add.
Doesn't Wirth actually believe the same thing about compilers? Specifically the part about
> compiler writers care a lot about how fast compilers are. So they desperately add more optimizations to speed up the compiler program. But which adds more work to the process of compiling a program.
It's the opposite. People use C and C++ because of their speed. If you don't need the speed then there is little reason to use C or C++. This makes these optimizations absolutely vital for C and C++.
I think a lot (not all) of people will use C and C++ because they need good performance, compile time checks, portability across OS/CPU architectures and reasonable memory usage.
Most of the people who use C and C++ are average programmers who think their programs will do exactly what they wrote in the source code.
The issue is that optimizations are required to get expected performance but some optimizations will result in programs not doing what user specified in the source code.
Compiler should offer a clear split between "safe" optimizations that will not change what source code says and "unsafe" optimizations that will do their best to create fastest code.
I don't use C or C++ very often and I don't know how optimizations are currently structured but if a lot of people are constantly caught unaware of what optimizations are in use and how they affect the output then this is obviously either a documentation issue or compiler interface issue - both of these are up to compiler writers to fix.
Distinguish programs not doing what the user specified in the source code from programs not doing what the user incorrectly thought they specified in the source code.
Exactly what the source code does is specified in the language standard. If the compiler changes that, it's a compiler bug. What you really mean is what people assume source code does, not what it says it does.
Yes if you write C++ compilers in C++ they are the same.
Compiler writers are stuck: the more optimizations they add, the more work the compiler has to do, and the slower it runs, defeating the purpose of those optimizations.
If you write other types of programs, the compiler's speed and your program's speed aren't the same at all. People that write other types of programs care more about how fast the compiler is and less about how fast their programs run. People that write programs that process untrusted data care vastly more about having no surprises than about speed. Even more so, in modern code bases the hot sections are very small parts of the total code base. For 99% of the code, CPU speed doesn't matter.
The whole thing is made worse by the disconnect between modern CPUs and the abstract machine implemented by C++.
> If you write other types of programs, the compiler's speed and your program's speed aren't the same at all. People that write other types of programs care more about how fast the compiler is and less about how fast their programs run. People that write programs that process untrusted data care vastly more about having no surprises than about speed. Even more so, in modern code bases the hot sections are very small parts of the total code base. For 99% of the code, CPU speed doesn't matter.
Literally everything you ask for here is fulfilled by not turning on compiler optimizations. That's the default. I don't understand what your complaint is.
And since nobody stated it plainly to you yet: This whole world view is completely bogus. Entire branches of the industry require C++ compilers that produce the fastest possible code (to name a few: Games, HFT, image processing). I don't know how you convinced yourself that compiler writers are the only consumers of compiler optimizations, but it's dead wrong. I don't claim your experience or what tradeoffs you seek in a compiler are wrong, but they are extremely far from universal.
But self-compilation speed is a common benchmark for this, because compilers themselves are rich programs that exercise all parts of the compiler and the compilation speed of which can be measured.
> the entirety of a program "is a mere suggestion on how to compute the output".
Maybe people using Java or various interpreted languages are fine with that, but most C programmers want their code to be a very close mapping to what they specified.
The problem is, I want optimizations... that don't break things. There's -O1 or -O3 but no -Ononewbugs. I don't expect perfection, but changing the meaning of memset or collapsing atomic operations is like, "settle down".
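The "collapsing atomic operations" complaint refers to transformations like the following (a C11 stdatomic sketch; whether a given compiler actually performs the merge varies by version and flags):

```c
#include <stdatomic.h>

atomic_int progress;  /* statically initialized to 0 */

/* The as-if rule permits merging the two stores into a single store
 * of 2: an interleaving in which another thread observes the value 1
 * is indistinguishable from one in which that thread simply was never
 * scheduled between the stores. */
void report(void)
{
    atomic_store(&progress, 1);   /* intermediate state... */
    atomic_store(&progress, 2);   /* ...may never become visible */
}
```

This is the same point made at the top of the thread: no conforming program can tell an interleaving that never happens to happen from one that can never happen.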
Every optimization changes the observable behavior of a program in some way. How is the compiler supposed to know which changes you deem acceptable?
Eliminating dead stores (including calls to memcpy whose result is never read) seems like one of the most basic and least objectionable optimizations I can imagine. So, if you’re against eliminating dead stores, what optimizations _are_ you okay with?
What you seem to want is "do all the optimizations, except the ones that contradict what I 'obviously' mean", which is just not possible with current technology.
The set of all the optimizations that do not break any program is very likely empty. So you are happy to break other people's programs as long as yours keeps working?
I'm pretty happy to leave the optimizations off. Writing the correct data structures matters a lot more than a clever compiler collapsing two commands into one.