gRPC benchmark results (github.com/lesnyrumcajs)
217 points by pjmlp on May 8, 2021 | 193 comments


I worked on gRPC Java optimization for a few years. The primary reason it is so fast is that it does hardly anything and avoids synchronization. However, there is a deeper reason why Java was so much faster: more people were working on making it faster. The external C++ version of gRPC was ignored for years, with people focusing on optimizing the internal version. The other languages (C#, Python, etc.) all had significantly less attention than Java did, so they aren't nearly as polished. It's been amusing to watch how stirred up people get about how fast a "language" is, rather than how much time has been spent optimizing the particular implementation.


Looking at the 99% times and memory usage, it looks to me like rust came out on top if you’re looking at best worst-case performance. Any idea why the tail latency was so much better for rust vs Java or c++? E.g. specific work that went into it or language optimizations?


I've heard the anecdote many times that the idiomatic way of writing Rust tends to also be performant. I've spent some time fiddling with C code to reuse buffers and avoid copies, make loops amenable to auto vectorization. Many times my first try in Rust will still be faster. In C reusing memory over and over is dangerous and confusing. In Rust it's the default because you use borrowing and transfer ownership.

My guess is that little time was spent optimizing the C++ and Rust codebase, and Rust performs better because the code doesn't do copies.


My money is on garbage collection. In Rust, where memory gets freed is decided at compile time: https://stackoverflow.com/a/32677591/235463


That doesn’t explain C++’s poor tail performance, unless it’s just poorly optimized like the top comment suggested


Yes it’s hard to tell without looking at the Java options. But especially since it’s a single cpu benchmark, it’s likely something blocked for a period of time.


That's pretty interesting. Are you saying, in other words, that gRPC C++ suffered from the competence of the C++ stubby team, while gRPC Java benefited from the comparative neglect of Java stubby?


Comments from one of the maintainers on why Java is on top

https://www.reddit.com/r/grpc/comments/muy8dj/grpc_bench_ope...

Small clarification (to my understanding, I'm not a Java Guru) on why Java got on top - those Java implementations use something called Direct Executor. It's super performant when there's no chance of a blocking operation. But if you are to do anything more than echo service, you might be in trouble. Other implementations probably don't suffer from the same constraint

PR and discussions here- https://github.com/LesnyRumcajs/grpc_bench/pull/91


To me it seems pretty misleading to publish a benchmark with a language setting that basically cannot be used in real world applications. Makes the whole benchmark pretty much pointless except as a game.


It’s not necessarily a “language setting”: a Java “ExecutorService” is an abstraction which allows one to dispatch futures / async jobs. In this case, the DirectExecutor is an implementation of this service that doesn’t do anything async at all, avoiding context switches and all the other overhead that comes with them.

If anything, to me it’s a clear example of why benchmarks like these are silly as they benchmark very uninteresting aspects that typically are never real world bottlenecks.
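
To make the "doesn't do anything async at all" part concrete, here is a minimal sketch of the idea in Python rather than Java. The `DirectExecutor` class below is a hypothetical stand-in written for illustration, not gRPC's actual implementation; it only shows that an executor can run the submitted task inline on the caller's thread, whereas a pool hands it to a worker.

```python
import threading
from concurrent.futures import Executor, Future, ThreadPoolExecutor


class DirectExecutor(Executor):
    """Hypothetical illustration: run every submitted task inline."""

    def submit(self, fn, *args, **kwargs):
        future = Future()
        try:
            # No queue and no thread handoff: the caller's thread does the work.
            future.set_result(fn(*args, **kwargs))
        except Exception as exc:
            future.set_exception(exc)
        return future


def current_thread_name():
    return threading.current_thread().name


caller = threading.current_thread().name

# Inline execution: the task observes the caller's own thread.
direct_thread = DirectExecutor().submit(current_thread_name).result()

# Pooled execution: the task is handed over to a worker thread.
with ThreadPoolExecutor(max_workers=1, thread_name_prefix="worker") as pool:
    pooled_thread = pool.submit(current_thread_name).result()

print("direct ran on caller thread:", direct_thread == caller)
print("pool ran on:", pooled_thread)
```

The flip side, as mentioned upthread, is that a task which blocks also blocks whichever thread submitted it, which is why this kind of setup suits an echo service better than a handler that does real work.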


That's specifically the purpose of the benchmarking, right? To be misleading and get attention. I guess this benchmark achieved that.

Java subs are celebrating it like they won a war. Java is a great language and will remain one of the top languages in the near future. But these silly benchmarks don't serve any purpose.

I am expecting to see this benchmark thrown around a lot from now on in any language discussion. Whoever is doing these benchmarks should be more responsible and sensible.


I'm not primarily a Java developer and this use of "blocking" is very confusing to me. Outside of the Java world whether something is blocking or asynchronous is a property of the API, and usually obvious from the method signature (e.g. if it returns a future or accepts a callback it's asynchronous; if it has a timeout it's blocking, etc.). For example Go unary gRPC is certainly "blocking" as far as any code a user writes cares - both the server start/stop and the individual calls are completely synchronous on their goroutines. And this is also how the term is used in the Java gRPC documentation.

Instead here it seems to mean something more like "is capable of deadlocking if a thread pool is exhausted" or therefore roughly equivalent to "not lock-free". Is that correct? (But then the comments about `volatile` confuse me more - can you deadlock a thread with purely `volatile` access if your platform doesn't natively support it? That seems like a large failing of the JVM.)


Can I upvote this a few times?


Tried to run the benchmark on my laptop (with a Ryzen 5 3500U), got different results much closer to what I would expect normally:

==> Running benchmark for java_grpc_pgc_bench... Requests/sec: 25563.46

==> Running benchmark for cpp_grpc_mt_bench... Requests/sec: 31389.24

==> Running benchmark for dotnet_grpc_bench... Requests/sec: 25376.18

==> Running benchmark for go_grpc_bench... Requests/sec: 29158.60

==> Running benchmark for rust_tonic_st_bench... Requests/sec: 28120.25

Different images for the same language (java, rust, cpp) performed with similar if not worse results


I think this makes sense. I suspect the core count makes a pretty big difference, especially for Go, given it’s optimized for multi core.

Edit: what were your latencies and memory stats like?


I didn't get the final report because of some error; those numbers were printed during the execution of the benchmark.

I executed it giving both the client and the server 4 CPUs, now I'm running it again with 180s duration and gonna update the original comment if I get to see the report or there is any significant change


It's optimized for multi-core in a safe way, but since the safety of the actor idiom may incur more copies, it can't be compared to the speed of implementations that are more complicated to write but perform better in the end, given that you can customize them for that particular scenario.

With C++ and Rust you will be able to implement this in a more optimized way, as you can look for approaches that avoid copies, which can be the main factor in a slow implementation, especially in multi-threaded scenarios.


Nice benchmark, but I think the results tell us more about the grpc+protobuf+http2 libraries in use than the languages. It's quite amazing though to see such highly optimized Java libraries.


Looking at the 99% latency and memory usage Java is pretty far from the winner. 99% latency actually matters a lot.


> 99% latency actually matters a lot.

It does, but the numbers are not too far. It's 3.5ms for Java versus 2.3ms for C++.

Even when running a lot of requests when 99% latency becomes the average one, that's like 1 millisecond difference, way under typical user's network latency.


And these numbers were when pushed to the max. When you are running at half the capacity, if not lower, it really shouldn't matter.


This! Benchmarks that fully saturate the throughput are rarely useful. If you’re running at that capacity your service is melting.

Latency histograms for fixed throughput (e.g. 1k RPS, 10k RPS, 50k RPS) are far more useful. You want to know if going from 1k to 10k increases latency.
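
A rough sketch of what "latency at fixed throughput" means, using a toy single-server queue instead of a real gRPC service. The 0.1 ms service time and the three request rates are made-up numbers for illustration only; the point is just that latency stays flat below capacity and grows once the offered rate exceeds it.

```python
import math


def simulate_latencies(rate_rps, service_time_s, n_requests):
    """Toy model: one FIFO server, requests arriving at evenly spaced times."""
    interval = 1.0 / rate_rps
    server_free_at = 0.0
    latencies = []
    for i in range(n_requests):
        arrival = i * interval
        start = max(arrival, server_free_at)  # queue if the server is still busy
        server_free_at = start + service_time_s
        latencies.append(server_free_at - arrival)
    return latencies


def percentile(values, pct):
    """Nearest-rank percentile."""
    ordered = sorted(values)
    rank = max(1, math.ceil(len(ordered) * pct / 100))
    return ordered[rank - 1]


SERVICE_TIME_S = 0.0001  # assumed: 0.1 ms per request, i.e. capacity of 10k RPS

p99_by_rate = {
    rate: percentile(simulate_latencies(rate, SERVICE_TIME_S, 1000), 99)
    for rate in (1_000, 5_000, 20_000)
}

for rate, p99 in p99_by_rate.items():
    print(f"{rate:>6} RPS -> p99 = {p99 * 1000:.3f} ms")
```

A real measurement would need an open-loop load generator against the actual server, but the shape of the result is what you would be looking for.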


With Java you'll hit GC-breaks even under low load conditions. If we were to look at the 99.5-percentile, java would look even worse.


You assume that this is being used by a user.

These numbers are meaningless unless accompanied by specific use cases. Don't just choose Java or C++ because the benchmark says it's fast or the best.

High frequency trading needs that extra millisecond, but a batch job or backend computation won't.


Also if you want to do high frequency you probably don't want GC.



'allows pauseless garbage collection regardless of the Java heap size'. I thought this is not possible.


Clearly it is, since the C4 collector manages to do it. To be fair though, the "pauseless" part does come at the cost of a slight slowdown in normal code execution since it instates read barriers to work its magic. This means the lack of pauses is paid for by significantly higher overall CPU overhead for garbage collection compared to a collector design that does include pauses. It could be worth it for some workloads where it is very important to have low 99th percentile response times though.


Touché :-)


I'm pretty sure that Jane Street begs to differ. They're an OCaml company and they do very well on high frequency trading.


Does the ocaml backend do the high frequency trading?


As far as I understand, it does.


Milliseconds are at least 4 orders of magnitude too high for much of HFT.


This is a classic issue for Java vs compiled to binary languages such as C. Java does its own memory management thus allocates a big chunk of memory from the OS at startup. It also has garbage collection which runs from time to time dependent on what gc implementation is used.

These issues can be mitigated by carefully tuning the parameters of the Java virtual machine, but in practice, for most projects, it is not an issue.


You are correct. Some of this has changed significantly somewhere between Java 8 and 16. For example, the Java applications we're running against Hotspot 16 regularly give memory back to the host. Its GC pauses are also quite insignificant now for a microservice regularly allocating and GCing.


Seems like rust is actually the winner. Average latency is a meaningless metric for performance. If not, why not perc20?


Throughput is a very valuable metric.


Ok, so let's compare it to batch-methods of delivery then.

No measurements are valuable all by themselves. None.


Even 95%, 90% or 50% would be better points of comparison than average. Average is nearly meaningless.


Average is the metric to look at if you have millions of requests to do one after the other, and care about total completion time.
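
Both sides of this sub-thread can be seen in one small calculation. The latency sample below is invented for illustration (980 fast requests and 20 slow ones); it shows the average tracking total completion time for sequential requests while saying little about what a typical or a tail request experiences.

```python
import math
import statistics


def percentile(values, pct):
    """Nearest-rank percentile."""
    ordered = sorted(values)
    rank = max(1, math.ceil(len(ordered) * pct / 100))
    return ordered[rank - 1]


# Hypothetical sample: 980 requests at 1 ms, 20 requests at 200 ms.
latencies_ms = [1] * 980 + [200] * 20

mean_ms = statistics.mean(latencies_ms)
median_ms = percentile(latencies_ms, 50)
p99_ms = percentile(latencies_ms, 99)
total_ms = sum(latencies_ms)  # completion time if requests run one after another

print(f"mean   = {mean_ms} ms")
print(f"median = {median_ms} ms")
print(f"p99    = {p99_ms} ms")
print(f"total  = {total_ms} ms (= count * mean)")
```

So the mean is the right number for "how long does the whole batch take", and the percentiles are the right numbers for "how bad does a single request get".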


As a rust fanboy, I was very happy to see the 99% and memory numbers. Still, very impressive for Java, I wonder whether this is just trading off the 99% percentile peak vs higher common case throughput? I guess rust actually allocates/deallocates all the memory on each request while java batches the deallocation?


Looking at the memory column, it looks to me like the Java benchmark uses a large heap to avoid deallocations for most requests. This means fewer but larger garbage collections and reduces average time while increasing 99% where the GC shows up.

A long time ago I had to implement a real time service with Java. The best solution was to use a whopping 16 megabytes of heap so that it would do a full GC multiple times per second but each of the GCs lasted less than a millisecond.


Was there a theoretical upper bound for how long collecting those 16 MB could take?


No idea. It was much faster than the service really needed to be, so we just set the heap to 128M in production and never heard about any latency problems.


Just curious, did you experiment with different GCs?


At the time CMS was the only option.


In server applications you really shouldn't use the default allocator, but have one allocator per request which can then allocate and batch-deallocate very quickly. Unless you want to keep objects after the request ends, of course.
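
For anyone unfamiliar with the pattern, here is a toy sketch of a per-request bump allocator. Python does not expose manual memory management, so this only models the bookkeeping on top of a preallocated `bytearray`; `RequestArena` and `handle_request` are hypothetical names invented for this example.

```python
class RequestArena:
    """Toy bump allocator: hands out slices of one preallocated buffer."""

    def __init__(self, capacity):
        self._buffer = bytearray(capacity)
        self._view = memoryview(self._buffer)
        self._offset = 0

    def alloc(self, size):
        if self._offset + size > len(self._buffer):
            raise MemoryError("arena exhausted")
        start = self._offset
        self._offset += size  # "bump" the offset; no per-object bookkeeping
        return self._view[start:start + size]

    def reset(self):
        self._offset = 0  # batch-deallocate everything in one step

    @property
    def used(self):
        return self._offset


def handle_request(arena, payload):
    """Hypothetical handler: scratch memory comes from the arena."""
    scratch = arena.alloc(len(payload))
    scratch[:] = payload
    # Anything that must outlive the request is copied out of the arena.
    return bytes(scratch).upper()


arena = RequestArena(capacity=1024)
for payload in (b"ping", b"hello"):
    reply = handle_request(arena, payload)
    print(reply, "arena bytes used:", arena.used)
    arena.reset()  # end of request: all allocations are released at once
```

The copy in `handle_request` is the "unless you want to keep objects after the request ends" case: long-lived data has to live somewhere other than the arena.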


Is that possible in Rust these days?


Kind of. You can use a bump allocator like bumpalo[0], but any code that allocates needs to be aware of it (or it won't be able to take advantage of it). It also needs to reinvent all collection types (bumpalo provides strings and vectors atm).

The plan is to parametrize all std collections on the allocator eventually, but that's not stable yet.

[0]: https://docs.rs/bumpalo/3.6.1/bumpalo/


Or you could just use jemallocator with background threads.

https://crates.io/crates/jemalloc-sys

"background_threads (disabled by default): enables background threads by default at run-time. When set to true, background threads are created on demand (the number of background threads will be no more than the number of CPUs or active arenas). Threads run periodically, and handle purging asynchronously. [...]"


Hopefully this project could allow allocations inside memory-mapped files (which can then later be loaded at a different base address) in one quick load operation (instead of serializing everything).


What about an in-memory cache living in the same process as the app server, if using per-request allocators?

> keep objects after the request ends

Yes. Can that be made in a way compatible with per request allocators?


Don't use the per-request allocator for those objects. You don't need to use the same allocator for everything.


That sounds great :-) Thanks


With optimized memory handling Rust should easily beat Java. Reusing allocations or at least using jemalloc/mimalloc.


There was follow up discussion with improvement for Go. https://www.reddit.com/r/golang/comments/mwr4iw/go_grpc_benc...


I would look into re-running the rust benchmarks w/ jemalloc or mimalloc instead of the system allocator and see how it performs.


The difference in avg memory for first and second is substantial, 115.41 MiB Java vs 4.15 MiB Rust.


Indeed, but is it relevant? If you have a machine with average memory installed, both are fine. If you have something with very little memory available, both would be too much.


It matters when you do other stuff than rust running a benchmark. If every piece of your software uses 25x the memory you might end up needing 25x the servers for your application.


Most of that is probably up front cost for a jvm instance. So the 25x only happens because it is a micro benchmark that isn't doing anything useful. Might as well complain that the size of a C executable with an empty main is infinitely larger than a python script that does nothing.


So you’re saying Java is a bad choice for something like a sidecar (most likely multiple) running along your main app, due to that JVM overhead. I think you’re all agreeing here.


You can probably cut down the default heap size, get it to use 32 bit pointers for a small heap, etc. The JVM has quite a few startup options that you could use if the memory footprint of hundreds of tiny instances is a bottleneck for you. Might even speed things up a bit more.


Unfortunately there is very very little information (last I looked) on the web about tuning down a jvm methodically and sensibly.


> Indeed, but is it relevant? If you have a machine with average memory installed, both are fine.

The whole point is that you don't. You literally pay for the memory your application requires, proportionally to the amount of memory. The difference in the resources required to run is around two orders of magnitude.

Let's put things in perspective: in some platforms such as AWS Lambda, you are charged per memory used per second.
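
As a back-of-the-envelope sketch of memory-times-duration billing, using the two memory figures quoted upthread: the price constant, the one-second duration and the invocation count are placeholders invented for the example, not real pricing, and real platforms typically bill for configured rather than measured memory.

```python
def memory_seconds_cost(memory_mib, duration_s, invocations, price_per_gib_second):
    """Cost under a simple 'memory x time' billing model."""
    gib_seconds = (memory_mib / 1024) * duration_s * invocations
    return gib_seconds * price_per_gib_second


PRICE_PER_GIB_SECOND = 0.00002  # hypothetical placeholder, not an actual price

# Average memory figures from the benchmark table quoted in this thread.
java_cost = memory_seconds_cost(115.41, 1.0, 1_000_000, PRICE_PER_GIB_SECOND)
rust_cost = memory_seconds_cost(4.15, 1.0, 1_000_000, PRICE_PER_GIB_SECOND)
ratio = java_cost / rust_cost

print(f"cost ratio at equal duration: {ratio:.1f}x")
```

Under this model the cost ratio is simply the memory ratio whenever the durations are equal, whatever the actual price is.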


Yeah, sure, that's one use case, to run it on AWS Lambda. Everyone does not, and if you really care about latency/throughput, you won't be anywhere near Lambda, and probably not even on AWS or any "cloud" for that matter.

Paying for the amount of memory you use down to the MB is hardly something everyone does, which makes it weird that you are now the third reply that assumes this runs on AWS/Lambda.


> Yeah, sure, that's one use case, to run it on AWS Lambda.

It's not an edge case. It is a clear and irrefutable example that you pay for the memory you use.

Let's be very clear here: when you provision a VM anywhere in the world, you have to pick how much memory you require. You are charged for that memory, proportionally to the memory you require. If your app requires over 100x memory to run, that comes out of your wallet.


> It's not an edge case

Never said it was an edge case....

> when you provision a VM anywhere in the world, you have to pick how much memory you require

Yes, thank you! That's exactly my point! You create a VM (or a dedicated instance) somewhere, they ask you for the memory usage up front. Usually they start at 128MB and go their way up from there.

Even if you take the instance with the smallest amount of memory, you'll fit any of the benchmarked programs, effectively making the "115.41 MiB Java vs 4.15 MiB Rust" statement not as important anymore.

> You are charged for that memory, proportionally to the memory you require. If your app requires over 100x memory to run, that comes out of your wallet

Hm, maybe on Lambda you pay more if the memory your app uses grows. But this is certainly not standard for "normal" hosting where you rent either a VM or proper instance. Then the memory grows until it cannot take any more memory, and the process either crashes, gets killed by OOM or does whatever operation you've designed it to take when running out of memory.


One project I worked on recently ran on containers with 256MB memory, on k8s. Main app would use ~80MB, plus five or six sidecars for metrics, logging, discovery, http proxy etc using 5-10MB each. Having a 100MB+ baseline makes it prohibitive for most of those applications.
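
The arithmetic behind that, as a small sketch. The 256 MB limit, the ~80 MB main app and the six sidecars come from the comment above; taking 10 MB as the upper end of the sidecar range and 100 MB as the heavier baseline is an assumption made for the comparison.

```python
def pod_memory(main_app_mib, sidecar_mib, sidecar_count):
    """Total memory for one main app plus identical sidecars."""
    return main_app_mib + sidecar_mib * sidecar_count


LIMIT_MIB = 256

light_total = pod_memory(main_app_mib=80, sidecar_mib=10, sidecar_count=6)
heavy_total = pod_memory(main_app_mib=80, sidecar_mib=100, sidecar_count=6)

print(f"10 MiB sidecars:  {light_total} MiB, fits: {light_total <= LIMIT_MIB}")
print(f"100 MiB sidecars: {heavy_total} MiB, fits: {heavy_total <= LIMIT_MIB}")
```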


The JVM's memory usage is based on a heuristic. It can get away with usually much less, but it errs on the bigger memory footprint side, because that way it has to do less actual work GC-wise.

For example, anecdotally a small, semi-complex JavaFX 2D game I wrote uses by default something like 250 MiB of RAM, but manually limiting it to max 80 was possible (but in the latter case, the GC had to run quite often)


And AWS lambda cpu performance is proportional to mem setting https://aws.amazon.com/blogs/compute/operating-lambda-perfor...


Giving further proof that if you actually care about increasing throughput + saving costs, you won't be anywhere near Lambda. Thanks for the additional resource and confirmation.


You're insisting on a red herring that tries to misrepresent the actual problem and the whole point of this discussion.

The whole point is that memory costs money, and the more memory you require, the more you pay.

AWS Lambda is a clear example that demonstrates this, but is not an isolated case. This is the case for all cloud providers selling VM time. All of them.

Even bare metal providers charge you more for more memory.

Isn't it clear that if your app requires more and more memory, that comes out of your pocket? Is this something that really warrants a debate?


The problem is that the only way to get more CPU is to assign more memory to the lambda. Most node.js lambdas would have stable idle/low memory usage around 200MB, but you get ~1/9 of a vCPU core, which has a severe impact on performance. You are forced to overprovision lambdas with more memory, and even then performance is still very much a disappointment. People who sold the promise of serverless frameworks should have a special place in hell.


Memory may matter when you want to run a lot of application instances via docker.


I worked in a web scraping team for a couple of years and memory was always the bottleneck since we ran hundreds of thousands of instances. Our scrapers were written in python and with the interpreter + all the imports we had 40MB reserved before even starting work.

Not just that, but they were very sensitive to fluctuations.


Are not cloud resources often billed memory x seconds?


If you're optimizing for pricing/billing you wouldn't use cloud in the first place.


It matters for things like AWS Lambda


If you're out after performance you wouldn't use AWS Lambda or even AWS at all. Go for dedicated hosting, better performance and cheaper.


> If you're out after performance you wouldn't use AWS Lambda or even AWS at all.

That point makes as much sense as complaining that if you were after performance you'd use a Formula1 car and not a Tesla.

People who live in the real world and have to do real work need to use real world tools, and one of which is AWS Lambda.

Let's put things in perspective: would it make any sense at all to advise a company to not only rewrite a whole application stack from scratch in your pet performant language but also jump head on to some boutique service provider? I mean, who in their right mind would get accountants involved in a goal to shave a few milliseconds over a few gRPC calls? Is a suggestion to improve performance expected to be considered even sane if it requires rewriting everything and change shop?


> People who live in the real world and have to do real work need to use real world tools, and one of which is AWS Lambda.

Another real tool is dedicated servers, something people used before "cloud" and something that people who care about performance still use. AWS even offers dedicated servers themselves, so not sure why you would need to involve any "boutique service provider". Otherwise you have OVH, Hetzner and a range of others who compete well with AWS on dedicated instances as well, none of which I'd call "boutique".

> rewrite a whole application stack from scratch in your pet performant language

Not sure where this comes from, which one of these languages are "pet performant (SIC) languages"?

> who in their right mind would get accountants involved in a goal to shave a few milliseconds over a few gRPC calls

Hmm, unless the accountants are involved somehow in the API design (not sure what you're building), I don't know what the accountants have to do with anything here.

In the end, AWS and Lambda absolutely do not fit every use case. Depending on your use case, and if a few ms here and there are important, you choose different solutions. Since this benchmark is about throughput, I thought we were discussing the use case of needing the best throughput, otherwise this is all off-topic. And with that, I'm just sharing that if that is your focus, you would probably not be using Lambda in the first place, as you'll get very shitty throughput and you have to pay a lot, compared to other mature solutions for this that we already had for many many years.


If you have a large number of users, accounting is involved in virtually everything you do in the capacity of serving them. Engineering history is littered with projects to shave milliseconds or cents off a very small component which is used extremely frequently.


> If you have a large number of users, accounting is involved in virtually everything you do in the capacity of serving them

What do you count as "large number of users"? Worked on projects with millions of users, thousands of requests per second and when we refactored based on increasing throughput/decreasing latency we had exactly 0 accountants involved, even if the company had accountants in-house and full-time.

I think it depends more on the company size than the number of users you have.

> Engineering history is littered with projects to shave milliseconds or cents off a very small component which is used extremely frequently.

Yup, agree and been there myself, hence my comments about staying away from Lambda and VPS for this kind of focus and going for dedicated instances where this matters.


> (...) we had exactly 0 accountants involved,

You should really rephrase that as "I was totally unaware that there were accountants involved" because it is simply inconceivable that a business activity involving allocating resources and changing operational needs would not be tracked. Either you are grossly misrepresenting your personal anecdote or you are filling in quite a few blindspots.


Yeah, it's completely impossible we both have two different experiences, I'm either lying or missing something, because every engineering-team waits for accountants before they do performance optimizations or capacity planning.

Whatever happened to "Assume good faith"?


> Worked on projects with millions of users

Maybe interacting with the finance department was above your pay grade. The bigger the operation, the more costs matter. That's a thumb-rule you can find anywhere.


> Maybe interacting with the finance department was above your pay grade.

It was not, we were a nimble and lightweight team with everyone doing everything they could possibly do. I had insight into accounting and helped them with implementing some stuff and some of the designers helped with frontend as some of them knew HTML and CSS and so on.

> The bigger the operation, the more costs matter. That's a thumb-rule you can find anywhere.

Yeah... I think I agree? "The bigger the operation" is referring to the employees working for the company, not the number of users, right? If so, that was exactly my point. It's the number of employees that dictates whether accountants get involved or not, not the number of users.


Being the SME on accounting for your team is not even a little bit comparable to working with the department with accountability to the company's cash flow. I don't know why you have tried that angle. It is a significant category error.


Most instances on AWS get dedicated cpu and memory resources, so noisy neighbors isn't a problem. The difference with metal is that you get the whole box for yourself.


> Most instances on AWS get dedicated cpu and memory resources, so noisy neighbors isn't a problem

First time I heard of that and also doesn't match my own experience using AWS. AWS tends to have a lot of noise from neighbors but might be because of the region I was using. Could you link the official statement you got this from?

> The difference with metal is that you get the whole box for yourself.

Yes, + you normally avoid virtualization as that can have impact on your performance too. I'm glad we agree there is a difference that is worth mentioning when it comes to performance :)


https://youtu.be/mZy6E2I5Rek

At 15 minutes. Afaik it's all but T type instances that get reserved resources. Might have been different in early AWS, pre nitro etc.

Something that i just remembered is that in AWS dedicated means that the hardware is dedicated to you, so no other customer vm on it, but you can have multiple vm's on it.


Server management is a significant part of cost.

Why would cloud provider be a business in the first place, do you think?


Sure, it's up to you what you optimize for. Easy and fast to scale vertically or better numbers in terms of latency/throughput? With the former, go for cloud. For the latter, go for dedicated. Want best performance for each buck spent? Again, dedicated.


If I am already using Lambda because of infrastructure management costs, it is always nice to reduce my bill by using less memory and/or improving response time.


In practice you might get through the night with a memory-hogging process but that's not how you get to scale.


How can people make generalized statements like this without anything about the use case or problem that would have to be solved?

You have exactly 0 information available and yet you make a comment like this? Is this why we have cargo-culting?

Not every solution to a problem needs to use as little memory as possible. And also not every solution can ignore memory usage. It depends, of course.

Since this thread is about throughput and optimizing for that, I could tell you that most times I've been put in charge of optimizing throughput, having low memory usage has been pretty far down the list.


> Not every solution to a problem needs to use as little memory as possible.

But that's not the point. People aren't suggesting micromanaging memory. Java has a well deserved reputation of being a memory hog and it's using 20x what is needed in this benchmark.


In optimizations, you focus on a few metrics, like throughput, latency, resource usage, but deciding to improve one can (and usually does) impact the others. Java (that is, OpenJDK) usually trades off RAM for throughput. And compared to high-level languages doing comparable work, it doesn't use much memory. Of course a low-level language like Rust can be more lean, but that's beside the point. And in this case presumably, Rust could be made to have a better throughput by using more memory.


> Not every solution to a problem needs to use as little memory as possible

Those problems are called "one-offs" relative to things I design for scale.


Unless you talk about embedded, I have to disagree. RAM is the cheapest resource to scale and it will likely continue to grow without problems. There are server machines with 1TB of RAM available. And in the domain where Java is used, it uses totally apt amounts of RAM compared to other managed languages.


...

Yes, "relative to things I design"

You think that applies to everyone?

You think that comes close to applying to people focusing on improving throughput specifically?


I'd be curious about why the Java benchmarks are so much better than the Kotlin ones. I could buy hotspot JVM optimizations making it better than, say, Rust. But shouldn't Kotlin be leveraging the same speed ups?

The explanations could be: a) the bytecode generated by the Kotlin compiler is much worse, or b) the benchmark Kotlin code is not as good or optimized as the Java one.


For comparison:

https://github.com/LesnyRumcajs/grpc_bench/blob/master/kotli...

https://github.com/LesnyRumcajs/grpc_bench/blob/master/java_...

Looking at the build files, different versions of the libraries appear to be used. The Java implementation is configurable but defaults to something called "direct executor".

Maybe that explains the difference.


Also the kotlin example is using coroutines


Coroutines in Kotlin can be convenient, but they are overall about 20% slower than non-suspending code. I didn't spend too much time investigating the cause, but I believe it's because of the extra object that is passed along in all method calls.

In my case, I have a programming language interpreter implemented in Kotlin, and as an experiment I made the entire interpreter using suspending calls so that I could call asynchronous functions, and my performance tests dropped by about 20%.


As I understand it, Kotlin coroutines generate what is known as irreducible control flow. This means that loops have more than one entry point, or equivalently, there are loops that have a goto jumping into the body from the outside. (Java the language and I believe also Kotlin the language don't have goto, but the Java bytecode does.)

Irreducible loops make many optimizations much more complex. Bytecode generated by javac never contains irreducible loops, and since bytecode generated by javac is the number 1 use case targeted by JVM JIT compilers, they probably just don't bother trying to be that smart about irreducibility.
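
For readers who haven't met the term, here is a small sketch of how reducibility can be checked on a control-flow graph, using the textbook criterion that every retreating edge found by a depth-first search must point at a node that dominates its source. The two graphs are hand-built illustrations, not actual javac or Kotlin compiler output: `resumable_loop` just models a dispatch block that can jump either to the loop head or straight into the loop body, which is the "more than one entry point" shape described above.

```python
def dominators(graph, entry):
    """Iteratively compute dom[n]: the nodes on every path from entry to n."""
    nodes = set(graph)
    preds = {n: [p for p in graph if n in graph[p]] for n in nodes}
    dom = {n: set(nodes) for n in nodes}
    dom[entry] = {entry}
    changed = True
    while changed:
        changed = False
        for n in nodes - {entry}:
            incoming = [dom[p] for p in preds[n]]
            new = {n} | (set.intersection(*incoming) if incoming else set())
            if new != dom[n]:
                dom[n] = new
                changed = True
    return dom


def is_reducible(graph, entry):
    """Assumes every node is reachable from entry."""
    dom = dominators(graph, entry)
    visited, on_stack = set(), set()

    def dfs(node):
        visited.add(node)
        on_stack.add(node)
        for succ in graph[node]:
            if succ in on_stack:
                # Retreating edge node -> succ: fine only if succ dominates node.
                if succ not in dom[node]:
                    return False
            elif succ not in visited and not dfs(succ):
                return False
        on_stack.discard(node)
        return True

    return dfs(entry)


# Ordinary structured loop: the only way into the loop is through "head".
structured_loop = {
    "entry": ["head"],
    "head": ["body", "exit"],
    "body": ["head"],
    "exit": [],
}

# Loop with two entry points: "dispatch" may jump into the middle of the loop.
resumable_loop = {
    "dispatch": ["loop_head", "resume_point"],
    "loop_head": ["resume_point"],
    "resume_point": ["loop_head", "exit"],
    "exit": [],
}

print("structured loop reducible:", is_reducible(structured_loop, "entry"))
print("resumable loop reducible: ", is_reducible(resumable_loop, "dispatch"))
```

In the second graph neither loop node dominates the other, so there is no single loop header for an optimizer to anchor its loop transformations on.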


Java has the compiler in memory while running. Rust's compiler is a separate program so isn't in memory while it runs. That by itself is a lot of code.


Kotlin runs in JVM and should get the same benefits from live profiling and JIT compilation.


Like all guest languages it needs to generate additional boilerplate to pretend to be Java, and support its additional features not available out of the box in JVM bytecodes.


Kotlin should be close-to-identical* in terms of the bytecode it emits for straightforward code like this; it's designed to very closely match Java capabilities, among other things to make sure interop is very good.

* Exception is supporting things like default methods on JVM 1.6 bytecode


Apparently that isn't the case.

Additionally it is stuck on Java 8 view of the world, otherwise those .class files won't be usable on Android toolchain thanks Google.


Looks like it supports generating class files up to java 16 https://kotlinlang.org/docs/compiler-reference.html#jvm-targ...


Yeah, but it can't make use of JVM bytecodes without DEX counterparts, nor JVM abilities unknown to ART.


Incorrect, the Kotlin compiler allows you to specify the target JVM level, it can generate JVM 1.6 bytecode (for android) up to 16.

I know for a fact it uses different bytecode features if the level is >= 1.8, not sure how smart it is above that.


It can only generate bytecode features that ART understands and that D8 is able to convert into DEX; such is the price of the Android marriage.

Going forward, Java code can use SIMD, the JNI replacement, and value types without worry, while Kotlin code will need to use KMM for anything that is supposed to target both the JVM and Android.


Are you so dense that you cannot understand that the Kotlin compiler can do gasp different things depending on different settings?

It can do stuff android doesn't support if you set the bytecode target level to > 1.8

It's like using modern Javascript features but providing a polyfill for older browsers.


Which produces libraries that are worthless to be consumed from Android without recompiling.

Also stuff like value classes will have different semantics in memory consumption and performance across targets.

Maybe some learning required?


Those are bytecode level related. You can't consume Java 16 libraries on Android either. It's not related to Kotlin at all.

You are either being deliberately obtuse or you genuinely have difficulty understanding that Java version, bytecode version, Kotlin compiler output and what bytecode Android supports are totally independent concepts.


Enjoy #KotlinFirst, I am out of this thread.

It is going to be fun to port back stuff to Java.


Oh I replied to the wrong comment - I was commenting on why it used more memory than Rust.


Optimizers can be finicky; Kotlin and Java are going to generate different bytecode. The optimizer has been optimized for Java, and those differences will matter.


I haven't looked at the benchmark implementations, but one particular area where I've heard other languages lag behind Java is the cost of allocation.

Since openjdk has had to cope with lack of (user defined) value types and the garbage heavy ecosystem it has a very well optimised GC for handling heap allocations.

I wonder if the results will be different if other language benchmarks (C++ or rust) use different allocators (eg: bump allocator, per request collectors) for cheaper allocation.


Most of the compared languages (except C#) are likely to lag behind Java on many things, performance included, since Java has been around since 1996, while Golang has only been around since 2009 and Rust since 2010. More time = more engineering time to optimize.


While this _could_ be correct if we are just talking about Go, which has its own backend and optimizer, it is definitely not the case for Rust, which uses LLVM. The differences seen in these benchmarks are all about the gRPC implementation used; languages are irrelevant. C++, Rust, Go and Java all have their own standalone implementation written from scratch, so it's kind of an apples and oranges situation here. You can write good or crummy code in any language for what it's worth.


I agree with you and I think we're both right here, on both measures :)


This comparison is more about the different gRPC implementations rather than the languages. There is no inherent reason why Java (as a language) should be faster than C++ or Rust in a benchmark like this. A lot of time and money have been spent optimizing Java (or HotSpot), but only because Java is really hard to optimize compared to GC-less low level languages.


Ref counting is more efficient than any GC allocator, which is generally the default in C++.


I don't know, is it though? Properly done reference counting AFAIK requires using atomic increase/decrease in order to be safe, and that creates a bit of overhead every time you assign a reference, while assignment to a reference with a precise GC is basically a pointer assignment. It's much better for latency though, given that the overhead from rc is deterministic. It has been a few years though since I've looked into garbage collection techniques, so I could be a bit rusty about the current state of the art.


You only need atomic reference counting if you're sharing objects between multiple threads; if an object is only used from one thread at a time, non-atomic inc/dec is enough. Rust lets you choose between the two kinds (`Rc` and `Arc`), and the compiler rejects code that sends the non-atomic kind across threads.


Yeah - Rust can, because it also ensures you can't exchange data unsafely between threads, but C++ can't. That's why `std::shared_ptr` has to be thread-safe.
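To make the atomic vs. non-atomic distinction concrete, here is a minimal sketch in Rust. The function names are made up for illustration; `Rc`, `Arc` and `thread::spawn` are from the standard library. `Rc` keeps a plain counter and is confined to one thread, while `Arc` pays for atomic increments so clones can move to other threads.

```rust
use std::rc::Rc;
use std::sync::Arc;
use std::thread;

/// Single-threaded sharing: `Rc` bumps a plain (non-atomic) counter.
/// Returns the strong count observed while all clones are alive.
fn rc_count_after_clones(clones: usize) -> usize {
    let value = Rc::new(String::from("request"));
    let handles: Vec<Rc<String>> = (0..clones).map(|_| Rc::clone(&value)).collect();
    let count = Rc::strong_count(&value);
    drop(handles);
    count
}

/// Cross-thread sharing: `Arc` uses atomic increments, so each clone may be
/// moved into another thread. Returns the sum of the lengths each thread saw.
fn total_len_across_threads(threads: usize) -> usize {
    let value = Arc::new(String::from("request"));
    let workers: Vec<_> = (0..threads)
        .map(|_| {
            let local = Arc::clone(&value);
            // Swapping `Arc` for `Rc` here fails to compile, because
            // `Rc<String>` does not implement `Send`.
            thread::spawn(move || local.len())
        })
        .collect();
    workers.into_iter().map(|w| w.join().unwrap()).sum()
}
```

Calling `rc_count_after_clones(3)` counts the original plus three clones, and `total_len_across_threads(4)` has four threads each read the same 7-byte string.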


Not necessarily. Bump allocations are extremely cheap, and collections can be batched to avoid interruption and performance impact.
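For anyone wondering why bump allocation is so cheap, a rough sketch in Rust follows. `BumpArena` and its methods are hypothetical names, not taken from any of the benchmarked implementations, and the arena hands out offsets rather than real pointers to keep the example safe and short. Allocation is an alignment round-up plus one addition, and everything is freed in one step.

```rust
/// Hypothetical fixed-size bump arena: allocating is just advancing an offset.
struct BumpArena {
    buf: Vec<u8>,
    next: usize,
}

impl BumpArena {
    fn with_capacity(bytes: usize) -> Self {
        BumpArena { buf: vec![0; bytes], next: 0 }
    }

    /// Returns the offset of a fresh `size`-byte region aligned to `align`
    /// (which must be a power of two), or `None` when the arena is full.
    fn alloc(&mut self, size: usize, align: usize) -> Option<usize> {
        debug_assert!(align.is_power_of_two());
        // Round the current offset up to the requested alignment.
        let start = (self.next + align - 1) & !(align - 1);
        let end = start.checked_add(size)?;
        if end > self.buf.len() {
            return None;
        }
        self.next = end;
        Some(start)
    }

    /// Frees everything at once, for example at the end of a request.
    fn reset(&mut self) {
        self.next = 0;
    }
}
```

The trade-off is that individual objects cannot be freed on their own, which is why this pattern tends to show up as per-request arenas or as the young generation of a GC.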


Absolutely have been loving Java and even Spring Boot lately. Have been doing a lot of productive magic number hunting as well as REST API design.


The postfix for the Java entries is the type of garbage collector being used:

pgc: ParallelGC, sgc: SerialGC, g1gc: G1GC, she: ShenandoahGC, zgc: ZGC


But look at 95% and 99% latency numbers - Rust is on top. And surprisingly (to me) Go totally tanks in 99% latency.


It's not that surprising -- Go is designed for speed by simplicity, and a low-pause-latency GC is anything but simple.


In 2021, Go is supposed to have GC pauses on the order of several ms at worst. Not 40 ms. So it kind of is surprising, something seems to be broken there. I'm wondering whether this isn't a limitation caused by forcing a single core operation, the runtime might not be designed for that.

EDIT: Someone else noted (https://news.ycombinator.com/item?id=27085507) a discussion on Reddit where a <5ms latency was achieved in 99.9% cases, so perhaps this is indeed a subpar result.


When you allocate while the GC is running, it penalizes you for it (literally sleeps your thread), hence the poor tail latency. The solution is to avoid allocation, which is what the gogo library strives for.


Most phases of GC can be done in parallel, so allocation should stop only for a really small amount of time. For comparison, ZGC, Java's low-latency GC (which optimizes for latency but may reduce throughput), has worst-case pause times under 1 ms; that is, the OS scheduler will cost more than the GC.


I’m talking about the Go garbage collector specifically, which does mark and sweep and will deliberately slow down functions doing frequent allocations in order to catch up. This is different from a stop-the-world kind of scenario, e.g. Shenandoah.


Iirc, the official protobuf module for Go still uses reflection in the generated code as opposed to fully generating the encoding and decoding code, so maybe that creates additional garbage or performance issues or lock contention. I think I remember there being an alternative module that fully generates the code, and it would be interesting to see that in the table as well.


Why is it surprising that Go tanks in 99% latency? That's what I would've expected.


You would? I wouldn't. It definitely looks like a somewhat pathological case to me, at least in 2021. Maybe five years earlier the number would have been appropriate, but there seems to be something wrong with Go slowing down this much with that small a heap. I'm wondering if it was tuned at all.


Because Golang was touted as a systems programming language (a better C).


Go is generally slower than Java and C#. Many languages claim C-like performance, but this is just false advertising.


I was about to say because of GC, but doesn’t java also GC ?


Java also has a JIT compiler which might eliminate object allocations at runtime.


It also has better GCs


Benchmarked on a single 6 core / 12 thread machine, with 9 client threads and 1-3 server threads. Language wars aside, it's hard to extract useful information like CPU time overhead on the server from these results.

For the single core test, we can infer an upper bound of 20-33us of server cpu time per request for most of the relevant languages for gRPC. That seems pretty good.
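The arithmetic behind that bound, as a small Rust sketch: if the server can use at most N cores, it cannot spend more than N seconds of CPU per wall-clock second, so CPU time per request is at most N divided by the throughput. The function name is made up, and the 30,000 and 50,000 req/s figures used in the note below are back-computed from the 20-33us range rather than read off the results table.

```rust
/// Upper bound on server CPU time per request, in microseconds, given how many
/// cores the server may use and the measured throughput in requests per second.
fn max_cpu_micros_per_request(server_cores: f64, requests_per_sec: f64) -> f64 {
    server_cores * 1_000_000.0 / requests_per_sec
}
```

With one core, 50,000 req/s gives a bound of 20us per request and 30,000 req/s gives about 33us. It is only an upper bound because the server may also sit idle waiting on the clients.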


None of these are likely to compare to using GRPC with BPF:

https://github.com/fujita/greeter-bpf

2-3x faster than GRPC-Go.


Where is the source code for the winning Java implementation?

Not really anything here:

https://github.com/LesnyRumcajs/grpc_bench/tree/master/java_...


Follow the code!

    COPY java_grpc_sgc_bench /app
Directory here: https://github.com/LesnyRumcajs/grpc_bench/tree/master/java_...



I've got very different results with Go 1.16 and the latest stable version of gRPC, I ran things manually under WSL but it gives you another perspective:

90k req/sec with p99 at about 1 ms

  ghz --proto=/proto/helloworld/helloworld.proto --call=helloworld.Greeter.SayHello --insecure --concurrency="50" --connections="5" --duration "60s" --data-file 1kb.json 127.0.0.1:50051 --cpus=4

  Summary:
    Count:        5411979
    Total:        60.00 s
    Slowest:      15.24 ms
    Fastest:      0.03 ms
    Average:      0.34 ms
    Requests/sec: 90194.36

  Response time histogram:
    0.026 [1]     |
    1.548 [997969]        |∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
    3.070 [1426]  |
    4.592 [251]   |
    6.114 [108]   |
    7.636 [87]    |
    9.157 [80]    |
    10.679 [17]   |
    12.201 [40]   |
    13.723 [11]   |
    15.245 [10]   |

  Latency distribution:
    10 % in 0.12 ms
    25 % in 0.19 ms
    50 % in 0.29 ms
    75 % in 0.42 ms
    90 % in 0.58 ms
    95 % in 0.72 ms
    99 % in 1.04 ms

  Status code distribution:
    [OK]            5411959 responses
    [Canceled]      3 responses
    [Unavailable]   17 responses

  Error distribution:
    [3]    rpc error: code = Canceled desc = grpc: the client connection is closing
    [17]   rpc error: code = Unavailable desc = transport is closing


That in itself is not too useful, for reference you should run a few other implementations on your machine, as the numbers will likely be better for all of them on a beefier machine.


It shows that you can have a p99 of about 1 ms at roughly 90k req/sec. The benchmark is bad and does not reflect Go performance at all.

Also, ghz, the tool used in that benchmark, is itself written in Go, so somehow the client is able to sustain high throughput and verify requests but the server can't?


It's interesting that the java ones use an order of magnitude more ram than rust. Is that just the overhead of being bytecode, or a different choice in processing algorithm?


It's because Rust frees memory at the moment it's not needed while Java runtime maintains a pool of memory and does garbage collection every now and then. This is a huge performance boost for code that does a lot of small allocations, and if you want to play dirty, reserving a large enough heap may allow you to run through a benchmark without a single garbage collection.

Seriously, every now and then there's a story about some fintech company who use Java and use large enough heap that there's no garbage collection during stock market opening hours.


Rust can do that but it also can not do that. You’d want to not do that here. That is, this is a library concern, not really a language one.


Interesting, makes sense. So presumably if you ran the benchmark over a long enough time, performance would take a dive while the gc did its thing.


Possibly. I haven't tried running this benchmark with different heap sizes, but to me the 99% latency looks like it doesn't try to avoid GC. The JVM is really fast at memory management.


The GC definitely ran during the benchmark. Java simply has really advanced GCs.


It likely counts the memory used by the JVM so it includes memory for the runtime, JIT, initial heap allocation, etc.


Also both bytecode and the native code generated from bytecode are in memory at the same time, aren't they?


Yes, but it's more nuanced than that.

If I remember correctly the bytecode is interpreted and executed by the runtime. This means the bytecode is fully in memory as bytecode when the jvm loads .class files. The native code for the JVM interpreter and runtime execution is also present in memory.

At this point execution happens without "duplicate code" in memory.

Note: the JVM could compile the bytecode fully into native code before execution but that's an implementation detail for a specific JVM implementation.

Then we add the JIT. The JIT will notice hot paths during interpretation/execution and compile a method to native code and inject it so it calls the jit_native version instead of the bytecode_interpreted version. The JIT does not overwrite original bytecode but stores the compilation result separately.

So to come back to your statement "both bytecode and native code generated from bytecode in memory at the same time": yes, but only for the JIT-compiled hot paths.

This to me is the "power" of the JVM + JIT. Run interpreted bytecode and compile to native for the hot paths.


This is more like a "gRPC libraries" benchmark instead of language benchmark.


Can we change the title to something less flame bait like: Java implementation currently most competitive in gRPC benchmark


I agree, the results of this benchmark only show that this C++/Rust gRPC implementation has room for performance improvements


I would hardly complain about this title being too sensationalized given that it's a factual statement


It's factual if you consider average latency important, which is debatable. On 90th, 95th and 99th percentile latency, it isn't the most performant. On memory usage, it isn't.
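A quick Rust sketch of why average and tail latency can rank implementations differently. The sample data in the note below is invented for illustration and is not from the benchmark; the function names are also made up, and `percentile` uses the nearest-rank definition, which may differ slightly from what ghz reports.

```rust
/// Nearest-rank percentile of a non-empty set of latencies; `p` is in (0, 100].
fn percentile(samples: &[f64], p: f64) -> f64 {
    assert!(!samples.is_empty(), "need at least one sample");
    let mut sorted = samples.to_vec();
    sorted.sort_by(|a, b| a.partial_cmp(b).expect("latencies must not be NaN"));
    // Rank of the smallest sample that covers at least p percent of the data.
    let rank = (p * sorted.len() as f64 / 100.0).ceil() as usize;
    sorted[rank.clamp(1, sorted.len()) - 1]
}

/// Arithmetic mean of a non-empty set of latencies.
fn mean(samples: &[f64]) -> f64 {
    samples.iter().sum::<f64>() / samples.len() as f64
}
```

For 98 requests at 1 ms plus 2 requests at 40 ms, the mean is 1.78 ms and the median is 1 ms, but the 99th percentile is 40 ms. An implementation can look great on averages while a small fraction of requests hits a pause.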


No, it isn't


AFAIK it is far from being a usable benchmark for the real world.

Why aren't the SDK/language major versions mentioned on the wiki page?

Am I making a mistake in reading this? As the number of CPUs increases, req/s and latency do not change, except for a few Java and Rust readings. Network connections and IO don't seem to scale as the number of CPUs increases?


It is worth noting that the 90th, 95th, and 99th percentile latencies for Java are not the best ones.

Also, I believe the heavily editorialized title should be replaced with something that is close to the title of the original post, such as "2021-04-13 gRPC benchmark results".


The title specifically mentions throughput. It does not say anything about latency.


That is correct. However, from the HN guidelines [1]:

> Otherwise please use the original title, unless it is misleading or linkbait; don't editorialize.

[1] https://news.ycombinator.com/newsguidelines.html


Where is the source for the Java Benchmarks? All I see is a docker file in the folder?


All Java Benchmarks share the same program, except the gc options. The benchmark code is in this subdirectory: java_grpc_sgc_bench


Thanks! So they test the same code against multiple GC options? Nice.


I’m not surprised to see the php implementation so far down, you can get better throughput with just json.

Edit: it’s not even php, it’s roadrunner (a golang server) proxying for php. By default, it’s probably not configured for production.


My hypothesis is that the Java implementation has simply received the most R&D and tuning. It’s developed mostly by Google, where Java has a large footprint.


There are probably better reasons to use GRPC in a language other than C++ than this, like the annoying dependency on Abseil and the general disadvantages that come with using anything from Google.

Unfortunately I've never found a best-in-class modern RPC system for C++ that embraces integration with other libraries (for I/O, concurrency, etc)

Thrift (Apache not Facebook) and CapnProto are the only other contenders that I know of.


Really surprised to see crystal below Ruby in all the tests. Perhaps there hasn't been enough time to fine tune the implementation of grpc yet.


To me, cpp looks like the best overall compromise with that extremely low memory usage.


How is gRPC used at Google?

I know the Cloud APIs use gRPC, but only on the client side. But I'm under the impression that gRPC and Stubby (the internal RPC framework?) are completely different codebases (even if they are allegedly similar).


The node benchmark numbers are suspiciously low, and actually get worse for 3 cores. It seems the implementation is not using parallelism via cluster/workers/threads at all.


The lua numbers are fairly impressive being that a dynamically typed interpreted language should be at a disadvantage with grpc. It's not luajit either, just lua.


This is using OpenJDK 14; I expect even better GC performance in OpenJDK 16 (stable). GraalVM CE and EE could also push performance even further and maybe improve the 99th percentile.


Why does Golang have such a poor showing here?


I also wondered this. I suspect that, like a lot of Go code, the gRPC internals are nonchalant about starting goroutines for fairly small things, because Go's scheduler is quite good. In the single-core case this may therefore involve a lot of goroutine switches. Go moves up the list rapidly in the 2 and 3 core cases. The 99th percentile is pretty bad, and I'm not too surprised: Go has a GC, and my experience is that gRPC in Go does a lot of interface/reflection wrapping and makes a lot of garbage. The requests and responses are not pooled, and at least in our gRPC APIs the response often contains enough slices of the request that I expect it keeps most of both around for too long.


The Java numbers (1 CPU) don't look so good to me, especially compared to P99 and memory consumption for Rust and C++

Results for 2 and 3 CPUs are strange too. So little scaling, what's the point...


Also interesting that Java seems to be a lot better than C# / .NET which normally would be quite comparable.


It's not a fault of the runtimes, but of how optimized the library implementations are. The Java grpc library has had a lot of love.


I remember playing minecraft and the server was constantly crashing. Not sure how it was related to java.



