atom3's comments

atom3 · on Nov 24, 2021

According to the CppCon talk about Mesh [1] (an allocator that implements compaction for C++ programs), the overhead can be massive too (17% overhead measured on Firefox, 50% on Redis!)

[1] https://youtu.be/XRAP3lBivYM?t=1374

jimsimmons · on Nov 24, 2021

Nice link. I guess theoretically you can always optimise this in languages with manual management. With complex GC you have to figure out a way to tame the beast and I’m not sure if it’s easier to reason about

atom3 · on July 21, 2021

I haven't tried it, but wouldn't the HTML meta tag work as well?

Something like

  <head>
    <meta http-equiv="Cross-Origin-Embedder-Policy" content="require-corp" />
    <meta http-equiv="Cross-Origin-Opener-Policy" content="same-origin" />
    ...
  </head>

Jasper_ · on July 21, 2021

No, http-equiv only supports a hard-coded whitelist of headers. https://html.spec.whatwg.org/multipage/semantics.html#pragma...
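Since the meta tag route is closed, the two policies have to go out as real HTTP response headers. Here is a minimal sketch of what the server has to emit, assuming a hand-rolled server that writes raw HTTP/1.1 responses. The helper name `build_isolated_response` is hypothetical; only the two header names and values come from the snippet above.

```cpp
#include <string>

// Hypothetical helper (not from any library): builds a raw HTTP/1.1
// response whose headers carry the two cross-origin isolation policies.
std::string build_isolated_response(const std::string& body) {
    std::string r;
    r += "HTTP/1.1 200 OK\r\n";
    r += "Content-Type: text/html; charset=utf-8\r\n";
    r += "Content-Length: " + std::to_string(body.size()) + "\r\n";
    // Same names and values as in the meta tag attempt, but sent as
    // actual response headers instead of http-equiv pragmas.
    r += "Cross-Origin-Embedder-Policy: require-corp\r\n";
    r += "Cross-Origin-Opener-Policy: same-origin\r\n";
    r += "\r\n";  // blank line separates headers from the body
    r += body;
    return r;
}
```

With an off-the-shelf web server the same thing would normally be done in its configuration rather than in code.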

infogulch · on July 21, 2021

It seems the answer to the question "wouldn't this work?" is "no", but I'd ask "shouldn't this work?" What's the problem with this being configured in a meta tag?

flohofwoe · on July 21, 2021

Apparently not, but I haven't tried myself:

https://stackoverflow.com/questions/67259043/can-coop-coep-h...

atom3 · on July 10, 2021

When I found out about this, I wrote some macros to replicate some of the semantics of ISPC [1] in C++ as a fun experiment [2].

Of course it has no practical value but it was really cool to see it was possible to do so.

[1] https://ispc.github.io/

[2] https://github.com/aTom3333/ispc-in-cpp-poc

atom3 · on April 27, 2021

> Here’s an example how to compute FP32 dot product with intrinsics: https://stackoverflow.com/a/59495197/126995 I have doubts the ISPC’s reduction gonna result in similar code. Even clang’s automatic vectorizer (which I have a high opinion of) is not doing that kind of stuff with multiple independent accumulators.

ISPC lets you request that the gang size be larger than the vector size [1] to get 2 accumulators out of the box. If having more accumulators is crucial, you can have them at the cost of not using idiomatic ISPC, but I'd argue the resulting code is still more readable.

I'm no expert, so there might be flaws that I don't see, but the generated code looks good to me. The main difference I see is that ISPC does more unrolling (which may be better?).

Here is the reference implementation: https://godbolt.org/z/MxT1Kedf1

Here is the ISPC implementation: https://godbolt.org/z/qcez47GT5

[1] https://ispc.github.io/perfguide.html#choosing-a-target-vect...

Const-me · on April 27, 2021

> Here is the ISPC implementation

Line 36 computes ymm6 = (ymm6 * mem) + ymm4, the next instruction on line 37 computes ymm6 = (ymm8 * mem) + ymm6

These two instructions form a dependency chain. The CPU can’t start the instruction on line 37 before the one on line 36 has produced its result. That’s gonna take 5-6 CPU cycles depending on CPU model. Same happens for the ymm5 vector between the instructions on lines 38 and 41, and in a few other places.

In the reference code all 4 FMA instructions in the body of the loop are independent from each other, a CPU will run all 4 of them in parallel. The data dependencies are across loop iterations, only the complete loop is limited to 4-5 cycles/iteration. That’s OK because the throughput limit (probably not the FMA throughput though, I think load ports throughput is saturated before FMA, especially for unaligned inputs) is smaller than that.
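To make the dependency-chain point concrete, here is a scalar sketch in plain C++. It is not the code from either godbolt link and uses no intrinsics; the function names are made up for illustration, and both functions assume the two inputs have the same length.

```cpp
#include <cstddef>
#include <vector>

// One accumulator: every multiply-add needs the result of the previous
// one, so the whole loop is a single dependency chain.
float dot_single(const std::vector<float>& a, const std::vector<float>& b) {
    float acc = 0.0f;
    for (std::size_t i = 0; i < a.size(); ++i)
        acc += a[i] * b[i];
    return acc;
}

// Four accumulators: within one iteration the four multiply-adds do not
// depend on each other, only on their own value from the previous
// iteration, so an out-of-order CPU can overlap them.
float dot_four(const std::vector<float>& a, const std::vector<float>& b) {
    float acc0 = 0.0f, acc1 = 0.0f, acc2 = 0.0f, acc3 = 0.0f;
    const std::size_t n = a.size();
    std::size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        acc0 += a[i]     * b[i];
        acc1 += a[i + 1] * b[i + 1];
        acc2 += a[i + 2] * b[i + 2];
        acc3 += a[i + 3] * b[i + 3];
    }
    // Leftover elements when n is not a multiple of 4.
    for (; i < n; ++i)
        acc0 += a[i] * b[i];
    // Combine the partial sums only once, after the loop.
    return (acc0 + acc1) + (acc2 + acc3);
}
```

The two versions add the products in a different order, so on arbitrary floating-point data the results can differ in the last bits. The vectorized versions discussed here do the same thing with whole registers instead of single floats.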

atom3 · on April 27, 2021

Oh right, I didn't think to look for that. I guess you're right and doing things by hand is still better.

Const-me · on April 27, 2021

It’s not terribly bad because CPUs are out-of-order. As far as I can tell, there’s no single dependency chain over all instructions in the loop body, some of these FMAs gonna run in parallel in your ISPC version. Still, I would expect manually-vectorized code to be slightly faster.