How is this using rsync? Does it invoke an rsync process, or implement the rsync wire protocol? One of my gripes with rsync is I've not found another way to integrate it into other systems. librsync doesn't implement the wire protocol and parsing the output of rsync itself is fraught with peril (for instance the formatting of --no-h has changed before.)
Edit: it seems to invoke an rsync process but doesn't parse stdout for progress. A bit disappointing, but I suppose that's all that's needed for this application.
librsync has absolutely no connection to the rsync utility, it's a distinct implementation of the underlying differential algorithm. The rsync utility doesn't use it, and they don't even share a common data format at any level.
Yes I'm aware, but usually when I voice my gripes with integrating rsync somebody unfamiliar with the matter quickly googles 'rsync library', finds librsync, and assumes it must be the answer to my problem. Then I have to explain to them that it's not. I thought to head that off this time by mentioning upfront that librsync is not the answer.
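As an aside, the core of the differential algorithm that librsync implements is a weak rolling checksum paired with a strong per-block hash. Here is a minimal, illustrative Python sketch of just the weak rolling part (Adler-32-style) - not librsync's actual code or data format, only the idea:

```python
# Weak rolling checksum sketch, in the spirit of the rsync algorithm.
# Real implementations pair this with a strong hash (MD4/MD5) per block;
# this is illustrative only.

MOD = 1 << 16

def weak_checksum(block: bytes) -> int:
    """Two-part weak checksum over a whole block (O(n))."""
    a = sum(block) % MOD
    b = sum((len(block) - i) * byte for i, byte in enumerate(block)) % MOD
    return (b << 16) | a

def roll(old: int, out_byte: int, in_byte: int, block_len: int) -> int:
    """Slide the window one byte to the right in O(1)."""
    a = old & 0xFFFF
    b = old >> 16
    a = (a - out_byte + in_byte) % MOD
    b = (b - block_len * out_byte + a) % MOD
    return (b << 16) | a

# The O(1) rolling update agrees with recomputing from scratch at every
# offset, which is what makes scanning a file for matching blocks cheap.
data = b"the quick brown fox jumps over the lazy dog"
n = 8
s = weak_checksum(data[:n])
for i in range(1, len(data) - n + 1):
    s = roll(s, data[i - 1], data[i - 1 + n], n)
    assert s == weak_checksum(data[i:i + n])
```

The cheap rollability is the whole trick: the receiver hashes fixed blocks, the sender slides a window over its file byte by byte and checks the weak sum against the receiver's block table before bothering with the strong hash.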
Just to be sure I'm getting the question - you want to use rsync(1), and you're wondering if there's a good way to guard your code that's using/calling it from underlying changes in rsync options and output, how to read status/progress and such?
Here's an example of a library that wraps rsync: https://metacpan.org/pod/File::Rsync - you can provide callback functions that will be called for each line of stdout/stderr.
Granted, it's in Perl - and chances are your code isn't written in Perl. So if you can't find something similar for your language of choice, and if you can't be bothered with checking out the implementation in Perl and rewriting it...
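For languages without a ready-made wrapper, the File::Rsync "callback per output line" approach is easy to reproduce. A minimal Python sketch - the rsync argv in the comment is just an example and assumes rsync is on PATH (`--out-format` is a real rsync option that pins the line layout so you're not parsing the human-oriented default output); only a generic command is actually run here:

```python
import subprocess
import sys
from typing import Callable, Sequence

def run_with_line_callback(argv: Sequence[str],
                           on_line: Callable[[str], None]) -> int:
    """Run argv, invoking on_line for each line of combined stdout/stderr;
    return the process exit code."""
    proc = subprocess.Popen(argv, stdout=subprocess.PIPE,
                            stderr=subprocess.STDOUT, text=True)
    assert proc.stdout is not None
    for line in proc.stdout:
        on_line(line.rstrip("\n"))
    return proc.wait()

# Hypothetical rsync usage (not executed here):
#   run_with_line_callback(
#       ["rsync", "-a", "--out-format=%i %n", "src/", "host:dst/"],
#       print)

# Demonstrate with a generic command instead of rsync:
lines: list[str] = []
code = run_with_line_callback(
    [sys.executable, "-c", "print('a'); print('b')"], lines.append)
assert code == 0 and lines == ["a", "b"]
```

This still leaves you exposed to rsync changing its output between versions, which is why pinning a machine-oriented format like `--out-format` (rather than scraping the default progress display) is the less fragile option.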
And next to those - Perlito can "Compile Perl to Java/JavaScript/Python/Ruby/Go/etc" https://github.com/fglock/Perlito. I think we had/have some Perl code running in production Hadoop/JVM.
Indeed there is no widespread documentation on how to play with rsync itself. A long time ago I played with the idea of replicating rproxy (https://rproxy.samba.org/), which was just a cool idea. The results are here (https://rproxy.samba.org/) but are very very basic, it was just a toy. I did have to invent my own protocol, but I did it over HTTP.
A major problem with rsync is it relies on filenames matching at the source and destination for an accelerated delta transfer to occur.
If a file gets renamed or copied, modified or not, rsync will transfer the whole file.
They added --fuzzy to try to improve this situation, but it generally only helps with unmodified files copied/renamed within the same directory.
You can build robust solutions around rsync that work efficiently most of the time, but pathological cases such as users having a huge file regularly regenerated using something like a $(date --iso-8601=seconds) filename (think mysqldump backups) are surprisingly common once you have enough users. Even if you can convince them to uncompress such files for delta transfer's sake, something as common as a versioned filename prevents rsync from automatically finding the relationship for use in the delta transfer.
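One partial workaround for versioned filenames is to point rsync at a likely basis directory explicitly. `--fuzzy` and `--link-dest` are real rsync options (and, if memory serves, newer rsync versions let you repeat `--fuzzy` to extend the fuzzy search into the `--link-dest` directories); the paths and the command-construction helper below are hypothetical, and the command is only built here, not executed:

```python
from typing import Optional

def build_rsync_argv(src: str, dest: str,
                     basis_dir: Optional[str] = None) -> list[str]:
    """Construct an rsync command that can seed deltas from basis_dir."""
    argv = ["rsync", "-a", "--fuzzy"]
    if basis_dir is not None:
        # Unchanged files are hard-linked from basis_dir; a second --fuzzy
        # asks newer rsyncs to look for fuzzy basis files there too.
        argv += ["--fuzzy", f"--link-dest={basis_dir}"]
    argv += [src, dest]
    return argv

argv = build_rsync_argv("backups/db-2021-05-01.sql", "host:/backups/cur/",
                        basis_dir="/backups/prev")
assert "--fuzzy" in argv
assert "--link-dest=/backups/prev" in argv
```

It only helps when you (or a heuristic) can guess which prior file is related - which is exactly the relationship rsync can't discover on its own.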
So we need a new tool that is basically git and rsync rolled into one.
Might be easier to use something like zfs and use its features to build a tool that behaves almost like git. Then a (zfs + rsync + [zfs-git hybrid]) monster is born, which may or may not work.
FYI have a look at ZFS. It has some neat traits/behaviours. So my thinking is, someone should write a database that uses gazillions of files (instead of a file per table), a tiny file for each cell/field in a table. Then use ZFS for versioning of data (or git). ZFS can already do software mirroring to drive pools on the machine, so maybe add rsync for remote pools. Optimize it for SSDs and you have a filesystem-database-replicating-versioned store.
Maybe we just need a nice cli/gui for ZFS to make it easier/safer to expose more of its interesting features.
Who is "they"? Are you asking about rsync developers or filesystem developers?
For a filesystem the developers are free to innovate on any form of filesystem-wide block checksumming and content-addressed deduplication approach. This kind of global block-level content awareness is entirely decoupled from filenames and paths.
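To make that filename-independence concrete, here's a toy content-addressed block store in Python: files are ingested as lists of block hashes, so a rename or copy of identical content adds no new blocks. (Fixed-size blocks for simplicity; real filesystems such as ZFS do this at the block layer with checksums and reference counting, not in userspace like this sketch.)

```python
import hashlib

BLOCK = 4096
store: dict[str, bytes] = {}  # content hash -> block bytes (the "dedup table")

def ingest(data: bytes) -> list[str]:
    """Store a file as a list of block references; duplicate blocks are free."""
    refs = []
    for i in range(0, len(data), BLOCK):
        block = data[i:i + BLOCK]
        ref = hashlib.sha256(block).hexdigest()
        store.setdefault(ref, block)  # only new content costs space
        refs.append(ref)
    return refs

file_a = ingest(b"x" * 10000)   # original file
file_b = ingest(b"x" * 10000)   # same content under a "new name"
assert file_a == file_b          # identical references; no filename involved
assert len(store) == 2           # one full 4096-byte "x" block + one 1808-byte tail
```

Note the contrast with rsync: here the dedup key is the block content itself, so a rename is invisible, whereas rsync's delta search is keyed on matching paths.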
Using rsync to underpin a filesystem strikes me as a gross hack done by developers punting on the real challenges of modern filesystem development.
WRT rsync doing better than --fuzzy, the options are more limited since it's a userspace tool working within the confines of filesystem APIs like POSIX.
So basically there is no good solution at the moment, and a good solution would probably involve enhancing the underlying filesystem - this was the same conclusion I came to.
Windows Server has a variant of the Rsync algorithm that it uses for its Distributed File System Replication (DFS-R) that can look for up to 4 similar files. It works okay-ish in practice, and is better than nothing.
Minor rant: The trouble with all these lovely Apache projects is their dependency on Java.
If these things were written in C, Go or Rust then I'd leap at the opportunity to explore them and maybe use them in production.
But a Java dependency just brings so much baggage with it. It's all well and good if you've got a Java pro or two on your team. But if you don't, you usually end up spending hours and days troubleshooting obscure JVM problems or some such.
Cannot upvote this enough. I'm sure a frequent Java user will come along to tell me how easy it is - but at this point I'm not interested in delving into their world. If a new application I'm looking at is a single binary I can drop on my system (a la Go/Rust) then great - otherwise it has to be something I really need for me to not just resort to "whatever is in the package manager, or find a different solution".
I'm still trying to figure that out honestly, but https://www.meilisearch.com/ is pretty sweet. Doesn't cover everything ES can do obviously, but it covers a big one.
Funny how I'm the complete opposite. For me, seeing it is written in Java is a big plus.
As soon as I see Java I know it's going to have a whole set of properties that will make it highly manageable to deploy and run. And when they go wrong I have a lot of hooks and tools to understand and resolve the issue that are completely unavailable to me with native applications. You apply the pejorative "baggage" to these as if they are valueless but sometimes you just have to develop enough experience before the utility of things becomes apparent.
(Writing an application in Java however ... I would go straight for a JVM language like Kotlin/Groovy/Scala)
Personally I never want to touch Scala again. Implicit and operator overloading hell. Let alone the shit show that is sbt and c++ compile times from scalac.
Ironically it pushed me back towards Groovy, because I found I was constantly hitting unexpected performance bottlenecks caused by implicit conversions that jumped in and executed things I didn't even realise were happening. Groovy was a completely inelegant but pragmatic option that actually did what I expected most of the time.
Not sure what to say to that, except that it might have been a long time ago, or with some very specific libraries or teams?
If it was years ago, then I can say that things have changed a lot since the last time you touched Scala, and in the right direction. Most of the criticisms from, say, 5 years ago, are being addressed.
For example, if by "operator overloading" you mean "symbolic method names", then those have largely fallen out of fashion. sbt has seen huge improvements (some people dislike it, some people like it, but it's one of the most powerful build tools out there). Compilation times also have improved.
As far as I am concerned, it is my favorite language, and there is so much going for it: great tooling, the upcoming version 3, binary compatibility improvements, rock-solid JavaScript compilation, native compilation under development, and, of course, all the benefits of a language with one of the most powerful type systems around.
Both of you are right. You happen to be a “Java pro” who knows how to use the tool well. It’s a bit uphill for shops without Java experience - more uphill than with other languages, perhaps.
Thank you for pointing this out. I read the title and it sounded nice, but if Java is anywhere near close to it, then no thanks. Thank you, you saved me from wasting time. :)
You have to go back almost 20 years to understand the relationship between the ASF and the Java community. Java at the time was positioned to be a core web technology (originally as applets, then as backend tech via J2EE) and Apache was interested in ensuring the web ecosystem as a whole wasn't locked up in proprietary tech. The ASF was rather influential in working with Sun at the time to foster what eventually became a vibrant open source community.
I know the ASF isn't as trendy these days, but that's never really bothered those of us who volunteer. The focus has always been about a stable, open source ecosystem, primarily, but not exclusively for the internet and its underlying infrastructure. The goal is measured in decades.
It wasn't about being "enterprisey" or anything like that. The ASF as a foundation doesn't really care about the language or tech stack. Being a more mature non-profit, many corporations have found it easier to interface with the ASF than many other open source organizations out there, which does lend itself to seeing a lot of enterprisey donations.
What baggage is specific to Java? You'd want some experts for the language your critical services are written in, regardless of it being java or C or Rust or whatever. Likewise, a java shop wouldn't want to use a critical service written in C.
For Java you need people in operations who are good at things like JVM tuning. Just a fact of life, speaking from experience working on teams that do that.
Languages like C and Rust do not have any equivalent to JVM tuning.
> For Java you need people in operations who are good at things like JVM tuning. Just a fact of life, speaking from experience working on teams that do that.
You really don't. Anything you could do in, say, Go, you can do in Java without touching any JVM flags. The only time you need to do JVM tuning is when you have performance requirements that are simply impossible in most languages.
Yes, Oracle changed the JDK's licensing model from part-proprietary/part-open, part-free/part-paid to 100% open and free (as of JDK 11). What confused some people is that the old website for the semi-free JDK is now for Oracle support customers who pay for support (even though that page directs non-customers to the free version) and Oracle provides the free JDK over at http://jdk.java.net/
Hi Ron. If you're still employed by Oracle, it might not be a bad idea to disclose that, either in the specific comments where you may be talking (semi-)authoritatively about Java and its licensing / commercial model, or within your HN profile. I recall the last talk I saw from you was an AMA at QCon London in 2019 about Java, where you were with Oracle.
I found the comments really clueful but also appreciate when others like e.g. @_msw_ are very overt with disclosure of their employment and interests whilst talking about $employer's stuff on HN (and in his case that'd be AWS and EC2).
Hey Alex! Yes I am with Oracle, and indeed I always make sure to disclose my affiliation: I do so on HN every couple of months or whenever it's relevant (when saying something that could be biased/controversial). It's not just a nice thing to do, but it's also company policy.
Plus, ZGC, the low-latency collector (which gives ~1-2ms max pause times on heaps of up to 16TB in JDK 16), was introduced in 11 and made production-ready in 15.
You would do that even with a GC language like Java. The serious performance code paths in Java also do meticulous memory management to minimize allocations, GC costs, and utilize off-heap memory (Netty comes immediately to mind).
In my opinion it’s the worst of both worlds at that point—you’ve got to manually manage memory and tip toe around the runtime.
You're not wrong, but that's a bit of an unfair comparison.
You're likely talking about work on custom memory allocators and impact on real time performance. In the hosted services domain this isn't so much an issue. Nobody's doing that in Java, and you likewise wouldn't care much if writing the same type of service in Rust.
Rust very much enables you to write high level, Java-style code. It's not bad code, nor is it difficult to write.
I've been staring down compiler flags for two weeks straight, now, since I've been working on cross-compilation toolchains, but that kind of narrative doesn't fit your current level of snark, I guess.
I have run operations for programs written in Java, C++, and Go. I've done it both at small companies and large companies, for everywhere from a rack of dedicated servers running Apache Tomcat to thousands of virtual machines running a mixture of C++ and Go microservices.
The constant thread is... over the last ten years... the people running JVMs complain about tuning the JVM, and always have stories to tell about it. The JVM is fantastically tuneable, and when people online complain about GC pauses or memory problems, there's often some way to tune the JVM to fix those problems. The JVM is amazing, it's a marvelous piece of technology.
But there's also a bunch of people who don't know how to do it and just turn knobs. Like, oh, customers are calling in to complain about latency, so I'll increase the size of the Java heap. (Which, for those not familiar with GC, will make throughput better but make latency worse.) Running services on the JVM means that "working with the JVM" is now a skill you need to select for whatever operations team you have, and if you're a company running a mix of different services (like, I don't know, some databases, memcache, load balancers, etc) then "knowing how to tune the JVM" competes with several other skills you'd like your team to learn.
Just like there are whole conference talks on how to use PGO data properly and optimization flags on AOT compilation toolchains, naturally one needs to spend the effort to learn how to use them and not just type -O2 and go home.
Or those that complain about RDBMS queries being slow, without having normalized their data or written proper indexes.
There is always bunch of people that don't know how to do things, and then there are those among them, that care to improve their skill set and get to know how to turn those knobs.
Sure, but it’s only at large scale that you care about stuff like PGO. My experience is that even small shops care about JVM tuning.
It’s not a question of whether there are knobs to turn. It’s about the typical experience of an actual person running a JVM app versus, say, C or Go, and those experiences are quite different.
And we’re also talking about running someone else’s app. If your database queries are slow, that’s a conversation between the DBA and the devs. If you’re running some Apache app, you’re probably not talking to the devs.
Yeah, I've been a C/C++ programmer for 15+ years with heavy focus on performance, and have not once run PGO. I think it's quite often the case for C/C++ programs that you -O2 and walk away, at least for codebases that your team doesn't specifically own.
Unlike C/C++, you do have to care about tuning, at a minimum to set the memory sizes.
That said you probably can’t get very far in c/c++ without worrying about memory the whole way through. So, you only worry about tuning when writing your code.
Well, it's a trade-off. If someone compiled the C++ app for your architecture, sure – if not, you need to figure out how to compile it and depending on the compiler and set of headers you have installed, it's not always easy. When I say, not always easy I'm being generous, there have been cases when the function signature was different between the sets of headers the author had and what I had...
In Java's case if you have all the class files, all you need is the JVM for your architecture. You don't need the program's author to compile it for your architecture.
And finally, JVM tuning can be performed by the user, but does not need to be. The most frequent tuning users had to do was setting the maximum memory the JVM could allocate, but that hasn't been needed for a long time (since Java 8, I think). Now the JVM, by default, sensibly sizes its heap from the machine's available memory (typically up to a quarter of physical RAM) as the application asks for more.
But you can use 99% of C/C++ applications without any compiler flag trickery.
My experience is not that you can run 99% of Java applications without running into out-of-memory issues or the like. That's why I avoid Java these days when I have the option. And I don't think things have changed all that much in recent years, enough to make my experience no longer valid.
I've been using Java daily for years at this point. The only time I've had that sort of issue is when the program I was writing dealt with large files in memory, and it was solved with a simple -Xmx16G.
Hah, I still use laptops with 2G and 4G :) They work perfectly fine when you know what you are doing. But running Java with 16GB pool does not fall in that category. The same holds for small cloud instances.
Admittedly, even with a C program, a problem requiring 16GB might not run well on a 2GB machine, but the fact remains that Java applications are often much more memory hungry than comparable C/C++ implementations. And if you have the memory, they will just use it without any -Xmx. Needing to allocate memory in advance really sounds like computing in the 1960s/70s. (I'm old enough to have seen such stuff, although not when it was new.)
ZGC, the low-latency collector (which gives ~1-2ms max pause times on heaps of up to 16TB in JDK 16), was introduced in 11 and made production-ready in 15.
In general, the way the GCs work is just very different from 8.
The JVM that a Java shop is using is likely written in C, which should make you question the straight equivalence you're setting up. There is an operational difference between getting a native executable and having something that sits on top of Java (or any other managed platform).
Not sure what you mean by "managed platform". Java being written in C doesn't mean you need to know C to run Java. You hardly deal with any native code at all unless you write your own JNI interface or do some obscure tuning, which is pretty rare.
I don’t see how it does at all - the point isn’t C versus Java, it’s native versus managed. Any reasonably proficient team running managed services needs to know how to deal with native dependencies, but the reverse is not true.
not in my view, as a person who knows neither java nor c. but i think the counterargument the people in this thread would make would be something like:
i can write my C programs without ever thinking about java. it is irrelevant to me. however, C is very important for java since the JVM it is running on top of is written in C.
i think, personally, the outcome of this type of thinking is that "therefore every java program is in fact just cruft on top of C", which, as someone who does 80% of their job in SQL, i am ill-equipped to object to on java's behalf.
You’ll note the word “likely” in my comment. And there’s nothing special about C in my argument - I was responding to a comment which used it as an example. The point is simply that one way or another any given team likely knows how to tune, monitor and/or diagnose issues in native bits - in fact even a Java focussed team likely has these skills. But the reverse is clearly not true.
For me, at this point, the biggest baggage is dealing with Oracle and funneling them money if you want an LTS JVM, alongside the shifting sands of their licensing rules. Unfortunately it can still be a bit of a wheel of fortune as to how well (if at all) something runs on OpenJDK.
Between the open-world assumption the classloader uses, and the lack of a performant, general purpose garbage collector, you can’t seriously consider deploying Java at scale without having experts on hand.
I’ve deployed things written in many languages at extremely large scale. They’ve been written in Java, C, C++, python, go, sh, perl, and other languages I’ve forgotten. Java is, by far, the least operable of those languages.
I’m an expert Java developer, along with most of those other languages (where expert is defined by me as “over 10 years experience”).
In fairness to java, it’s not my least favorite language on the list.
One of the most important and valuable (at least to me) things about Java and the JVM is its ecosystem. And Apache is a part of that ecosystem. BTW, a lot of the Apache projects aren't even in Java and don't depend on the JVM.
> If these things were written in C, Go or Rust...
Go and Rust have very good package management (Go's kinda sucked but go modules is helping a lot), so yes. They both work much nicer than the occasions I've used maven etc.
I think that package management is very different from an ecosystem. I'm somewhat aware of Rust's and Go's package managers, but I'm really curious whether they have ecosystems similar to Apache's. I wouldn't be too surprised if the Apache ecosystem embraces more and more Go and Rust projects as time goes on...
You can check if there is a project like that in the incubator, or suggest one if you have an idea or some code to donate.
Most new projects are related to existing ASF projects, or use ASF libraries, or are created by someone already in the ASF that maybe use/prefer Java.
But there is no requirement on a project being Java.
See Airflow, Arrow, Log4net, httpd... it's just a question of people creating the proposals that must show a possible community of users/devs to use/support it.
Unless you are happily using a JVM-hosted language of course (Clojure in my case), in which case it is a huge advantage. I can use this immediately and easily in production.
It would be useful for you to recognize that you are letting an emotional reaction stop you from using useful technology. I understand that Java makes you uncomfortable, but it is a powerful and stable language and you're missing out by avoiding it. It would take much more time to implement a similar project in C, and it would likely have lower stability. Yes, implementing it in C would use a lot less memory, and if that's what you're optimizing for, going with something other than Java is a wise choice most of the time.
Most technology is useful, depending on what you're optimizing for. I urge you not to optimize for being comfortable.
The language isn’t the problem. The runtime is the problem. It’s a dependency mess, a licensing lottery, and a tuning crapshoot. The overhead of simply making it work, and maintaining that status across heterogeneous environments, is more than non-Java shops want to undertake.
Having spent much of the first half of my career writing C and Perl and then Java, I was very happy to set aside the latter two - for completely differing reasons.
The OP's points are valid. Filesystem work requires experience with system calls and I/O behavior. Java strives to abstract the underlying system away from the developer. System-level tooling written in Java frequently requires twice as much expertise because there are two big systems in play instead of one, hence the unnecessary baggage.
My point was not about this project in particular, but a general one. I have seen this reaction towards Java often. Java is not familiar; for many people Java is inscrutable. I think this is because of the existence of the JVM and the need for the user to be aware of its existence. Rust is not simpler than Java, but the user just deals with a compiled executable and doesn't see any complexity. Maybe GraalVM with native-image will help Java in this regard.
It's not an emotional reaction they were expressing at all - they even explicitly stated it's because of the need to bring in someone savvy with Java and the JVM to be able to support it. Sometimes there aren't the resources.
Someone not having an emotional reaction would not use expressions like "All the worst and most time-consuming troubleshooting experiences of my life have involved Java" and "the JVM decided to vomit all over my screen".
"All the worst and most time-consuming troubleshooting experiences of my life have involved Java" is simply a statement.
All the worst and most time-consuming troubleshooting experiences of my life have involved dealing with a legacy VB6 app with a truly terrible MSSQL database structure - that's not an emotional statement, it's a fact.
> All the worst and most time-consuming troubleshooting experiences of my life have involved dealing with a legacy VB6 app with a truly terrible MSSQL database structure
Do you feel any powerful emotion when you think about that experience? Would you be indifferent if you would need to do it again?
Yes, dread. But it doesn't make the statement any less true. Myself and every single person maintaining that system felt exactly the same - it was a complete mess.
It was the owner of the platform & its developers, neither technology - I think we both understand that.
I was just trying to make the point that it's not necessarily emotional for someone to not want to work with a certain tool without an expert available because of terrible past experiences where that was the case.
I'd say it's almost impossible to imagine any past event someone has present at without having some emotional response.
The emotional reaction is the resistance to Java and the assumption that it will break and you will need a Java expert to fix it. I don't know much C, and I use MySQL and don't need to worry about C internals at all when using it. Just some configuration stuff, and maybe installing some C build tools. I wouldn't expect it to be much different for this project.
I think you've overlooked the OP's point a bit. You don't need to understand the C/libc internals to be able to operate MySQL (you do need to know its quirks, though). With a JVM application, you're going to need to understand the application-level configs - think of something like Spark - and if you're doing anything with large amounts of data or high RPS, you're also going to need to understand the JVM internals. The closest analog might be needing to understand the libc allocator for certain usage patterns, and that happens, but it's far more rare than someone needing to tune garbage collection algorithms and parameters. JVM tuning is just a reality of operating in the Java ecosystem.
> java is very easy to troubleshoot compared to c, go, and rust
All the worst and most time-consuming troubleshooting experiences of my life have involved Java. Whether staring at pages of obscure stacktraces that the JVM decided to vomit all over my screen, or hours on end with Oracle tech support troubleshooting why Java decided to throw a tantrum (followed by editing obscure Java XML config files).
I really haven't seen all that much XML in Java recently outside of legacy applications... there's libraries for loading config data from JSON, YAML, etc etc.
JSON, YAML, or XML, it's all the same crap. The fact that programmers are feeling the need to externalise their business logic into "config" files is a sign that the language has failed. In a decent language you would write logic or "config" in the language itself, because the language would be a comfortable medium for expressing things (this is the norm in Python for example - while a few overengineered pieces insist on using TOML or something, most Python libraries don't require any separate config, you just "configure" them the same way you do the rest of your programming, in Python).
well yes... it's a learning curve and it takes a while, but if you are familiar with the JVM and XML and to some degree spring-framework and reflection, it's usually pretty easy to find a root cause - you also have great tooling like async-profiler, or just sending `kill -3 <pid>` to get a stacktrace. At least you've got a hill to climb instead of an empty field.
As an example, this app (Apache Helix) implements MBeans, which means that you can connect to the running application to observe its state and manage it. You can do this with the standard JMX toolchain, without having to implement anything else in the app other than the MBeans.
I realize, writing this, that for someone unfamiliar with the ecosystem this sounds very abstract, but MBeans (managed beans) are simply counters, operations, stats, faults that the app makes available to other apps/users via a standard protocol.
How extensive is your experience with other toolchains? Programmers notoriously use easy as a synonym for familiar and given how long Java has been around by now we'd have some evidence showing a major productivity advantage. Companies like Google with massive Java experience also investing heavily in other languages suggests that this is not so compelling.
Being able to off-the-cuff use JMX to do realtime heap and stack analysis and then go deep with an offline heap dump (via Eclipse MAT, for example) is quite nice. On top of it you have a standardized interface for emitting counters and gauges. You can kind of get that through effort with other ecosystems (closest probably being .NET), but with the JVM it's just there and it's standardized. Have a memory leak that takes days of operation to manifest? Set up a trigger to heap dump at a certain time, then open up MAT and you can find some very subtle memory leaks and lock contention issues. Any JVM is basically running in a valgrind sandbox, with enough symbols to support a low overhead gdb connect, with a standard API for emitting meta-performance data.
All the above can be done in other environments, but generally with specially prepared deployments or executables, with choices being made along the way as to how to do it. I think the two key things are that in the JVM the heap is actually a quite structured database, which allows for introspection without recompilation or special tooling, and there's a standard mechanism for exposing detailed performance data.
Whether or not that translates into actual productivity differences is tricky. In my experience large companies tend to build their own equivalent and better-targeted tooling, and smaller companies increasingly pass their performance diagnostics to SAAS companies. It takes time to learn diagnostic tools and procedures in general, and in my experience a lot of Java teams don't know the tooling they have. I'd say the main productivity gain would be in quickly diagnosing production issues. You could argue that the JVM leads to software thinking that increases production issues by relying on long-running processes that need the stability to survive, but in my experience there are definitely niches where that is the only way to meet your performance goals.
>Programmers notoriously use easy as a synonym for familiar and given how long Java has been around by now we'd have some evidence showing a major productivity advantage
Have you ever heard the phrase "the dog that didn't bark?" You appear to be using an absence of evidence to prove evidence of absence, not to mention an argument from ignorance (which is always possible in our post-Gödel world).
That's a rather pretentious way to miss the point. The question was “Can you expand on this? Why do you think that?” and the response was a sweeping assertion and the name of a couple of products with no further details, or even a clear indication of which of the three languages mentioned further up-thread or others were being used as a basis for comparison. I wasn't asking for a peer-reviewed paper but even, say, “My team had a productivity loss when they switched to Go because [reasons]” would be better than nothing at all.
> java is very easy to troubleshoot compared to c, go, and rust
>> Can you expand on this? Why do you think that?
So the comparison is C, Go, and Rust. Yes, I'm familiar with all three of them, and Java is easier to troubleshoot than any of them because it has better introspection out of the box.
That’s getting closer to what would have been an informative response. It could use further expansion, but even that much would allow someone reading it to weigh how much it would remove a limiting factor in their experience.
For example, also having experience with all of them, the overall advantage over C is clear-cut, but the comparison with Rust is a lot less clear, since its stronger type system and better culture around package management and complexity avoid a lot of problems that would otherwise require runtime troubleshooting.
Granted, a key part of that is really the question of whether you’re talking about a modern Java project or the more common enterprise Java sort with layers of accreted complexity and probably architecturally frozen at Java 8, where the problem is cultural rather than the language itself.
> That’s getting closer to what would have been an informative response. It could use further expansion, but even that much would allow someone reading it to weigh how much it would remove a limiting factor in their experience.
Consider that you've gotten more useful information from my response than what you've paid for. If you want the whole picture, buy the book. Your sense of entitlement to more of my time seems misplaced.
>Companies like Google with massive Java experience also investing heavily other languages suggests that this is not so compelling.
Google's scale means that performance actually matters a lot versus productivity. When you've got millions of servers the costs add up. So making something twice as fast but taking 25% more dev time is probably a good tradeoff for them. For most other companies that's not the case.
Java is extremely old and stable. The obsession with Rust as open-source savior is overwrought. Rust may be good, but Java has been around for 25 years; give Rust another 10 years and 3 major language revisions and then I'll trust it.
No, it's not. My experience is that for poor programmers, Java is easy because they lack the fundamental skills, and are familiar w/ the mechanics of Java. Of course this is a broad generalization and terribly unfair empirical anecdata.
Also, every single Java application I've ever used has used an entirely unreasonable amount of memory for the job it's supposed to do. I'm sure you can write things in Java that are lightweight on resources, but I've never seen it.
I also wonder why are they using Java? Filesystems should be written in something like Rust. Using languages like Java for systems programming always looked bizarre to me.
I love using things written in Rust, because if nothing else I know it's going to be resource efficient and the deployment will be nice and easy.
Java applications: I hope there's enough spare memory, and I hope the setup instructions aren't littered with "set this obscure JVM config in some obscure way". It isn't the application config, and nobody tells you how or where to set it; you've just got to figure that one out.
I don't see a reason for users to be aware of Maven; similarly, when you get a C or Go app, you aren't aware of the build tool that was used to create it.
While I agree that some of the Apache projects would be better written in something other than Java, the comment about spending hours debugging obscure JVM issues is a huge exaggeration.
Java is pretty much the only language I'd want to see low-level libraries written in. Go libraries can only be used from Go, Rust libraries can only be used from Rust, and any nontrivial C library contains an arbitrary code execution vulnerability when built with a newer compiler. Java is a lowest-common-denominator language but it's memory-safe, cross-platform (in a way that C# isn't really yet), and there's a decent range of languages that you can run on the JVM and use Java libraries from.
If a Go or Rust library takes the time to offer a C-compatible interface (most don't), or you add one yourself, then you can invoke it via the anaemic, unsafe C ABI - no first-class objects, functions, or sum types, raw pointers everywhere. At that point it might as well be a C library, because any program that uses it is still inherently memory-unsafe.
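To make that flatness concrete, here's a sketch of calling through the C ABI from Python's ctypes (libm's sqrt is just a convenient stand-in, and the libm.so.6 fallback assumes a glibc system): every argument and return type has to be re-declared by hand, and nothing checks the declarations against the library.

```python
import ctypes
import ctypes.util

# Locate the C math library; fall back to the common glibc soname.
path = ctypes.util.find_library("m") or "libm.so.6"
libm = ctypes.CDLL(path)

# The ABI carries no type information: argument and return types
# must be declared by hand, and nothing verifies them against the
# library. Get them wrong and you read garbage, not an error.
libm.sqrt.restype = ctypes.c_double
libm.sqrt.argtypes = [ctypes.c_double]

root = libm.sqrt(9.0)
```

Get those declarations wrong and you don't get a type error, you get whatever bits happen to be on the stack, which is the memory-unsafety being described.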
I wonder how near-realtime this is. It seems that will dictate the types of applications this will be applicable for.
I currently use Gluster which has its downsides, but is what I consider near-realtime, but with full filesystem features in a nice fuse wrapper. I'd love it if someone would contrast the performance and more of the features.
In the mid-2000s (the "oughts?") I went to my doctor at PAMF with a sore throat and he said he would do a strep test.
I asked him if we would get "real time results?"
He stopped, and looked at me puzzled with a tilted head... "What other kind of time is there?"
Your question is highly reasonable, but I heard a colleague today say that the issue could happen only in a very, very short time period of, say, 2-3 minutes. We routinely work with corner cases of 1-2 ms; so the context is relative :-)
"context is relative" made me smile, thinking "PAMF" is Palo Alto Medical Foundation, but why would anyone in general have the context to understand that
Gluster's performance with metadata heavy workloads is very poor. If you're running files with a high ratio of data:metadata, it works well; if you're running something like a git repo on Gluster it is not a great experience.
It's also thrashing a bit development wise; there's a lot of features being pruned, and I'm a little concerned what its future looks like as Red Hat push harder down the Ceph route, since they're the main source of code contributions to Gluster.
I use Resilio Sync in production with my servers. I like it more than this solution because it does not require a primary and replica design but instead uses bit torrent where all servers are equal. I can add a new server and drop an old server and it handles everything gracefully.
Haven't come across this before. At a glance, it looks like it's a file sync tool for small scale use, rather than a "serious" tool for replicating a file system.
Curious to know more about your production use case?
I used to run a distributed system based on a few bash scripts and rsync that worked in real time. It became more complex when each "cell" exceeded the capacity of a node. The file naming scheme was not flexible enough to split these cells across smaller nodes, and I abandoned writing a router for Nginx to direct requests to the right server based on the filenames. I just decided to throw a couple of big machines at the problem instead, and that removed all the complexity. It wouldn't have been possible just a couple of years ago though.
For a couple of years the rsync solution worked very well.
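For reference, driving rsync from a script usually means spawning the binary and, if you need per-file detail, parsing its --itemize-changes output. A hedged Python sketch: the 11-character flags field is documented in rsync(1), but rsync's output formatting has changed between versions, so treat any such parsing as fragile.

```python
import subprocess

def parse_itemized(line):
    """Split an rsync --itemize-changes line into (flags, path).

    The flags field is the 11-character YXcstpoguax string described
    in rsync(1); everything after the separating space is the path.
    """
    flags, _, path = line.partition(" ")
    return flags, path

def changed_files(src, dest):
    """Run rsync in dry-run mode and yield (flags, path) per change.

    Assumes an rsync binary on PATH; a real integration would also
    check the exit status and capture stderr.
    """
    proc = subprocess.run(
        ["rsync", "-a", "--dry-run", "--itemize-changes", src, dest],
        capture_output=True, text=True, check=True,
    )
    for line in proc.stdout.splitlines():
        if line:
            yield parse_itemized(line)

# A ">f.st......" prefix means a regular file being transferred
# because its size and mtime differ.
flags, path = parse_itemized(">f.st...... docs/readme.txt")
```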
I'm using a distributed filesystem for persisted files (config, assets and resources, etc - anything that'd need to be in a docker volume and persisted between restarts) in my container workloads. This mostly works fine, except for sqlite applications that insist on using WAL - this completely breaks and results in corrupted database files.
"Near-realtime" isn't good enough for SQLite. It has a strong dependency on the consistency guarantees provided by a POSIX-like filesystem.
The approach described in this page does make some attempt to provide consistency. The master generates a single stream of file updates, and all of the replicas consistently apply those changes in the same order. But rsync is not guaranteed to observe changes to files on the master in the same order that they were written.
In particular, SQLite (by default) uses a rollback journal to recover from failures. While a transaction is in process and some parts of the database might have been partially updated, the journal stores the old contents of the modified pages. Since the old data is fsync'ed to the journal before the new data is written to the database file, the data on disk at any moment in time is recoverable.
But there's no clean way for rsync to read an atomic snapshot of both the database file and the journal at the same instant in time. If it reads them at different times, a database change might be partially applied or partially rolled back, which has a high likelihood of corrupting the database.
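One way to sidestep this on the master, outside the replication layer, is to snapshot the database through SQLite itself rather than copying files. For example, Python's sqlite3 module exposes SQLite's backup API, which holds a read transaction for the duration, so the copy reflects a single point in time; the resulting snapshot file is then safe to ship with rsync.

```python
import os
import sqlite3
import tempfile

def snapshot(db_path, snap_path):
    """Take a transactionally consistent copy of a SQLite database.

    Unlike copying the .db and journal/WAL files separately (as rsync
    would), Connection.backup reads the database under a transaction,
    so the copy can never mix old and new pages.
    """
    src = sqlite3.connect(db_path)
    dst = sqlite3.connect(snap_path)
    with dst:
        src.backup(dst)
    src.close()
    dst.close()

# Usage: write a row, snapshot, and read it back from the copy.
d = tempfile.mkdtemp()
db, snap = os.path.join(d, "a.db"), os.path.join(d, "a.snap")
con = sqlite3.connect(db)
con.execute("CREATE TABLE t (x)")
con.execute("INSERT INTO t VALUES (42)")
con.commit()
snapshot(db, snap)
rows = sqlite3.connect(snap).execute("SELECT x FROM t").fetchall()
```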
This won't work for database files, and it won't work for any other application that requires a filesystem to behave consistently over a network either.
The property you are looking for in a distributed filesystem is called strong consistency, or cache coherent (different name, same effect).
If a distributed filesystem does genuinely provide that property, it should be ok for SQLite and other applications, because it means the filesystem behaves the same as if it were multiple processes on the same local system accessing it.
Most distributed filesystems don't provide that guarantee though. It's a very nice guarantee, but it comes at a complexity cost, usually a performance cost (although there are clever ways to approach the performance of a non-consistent system), and in particular it means the filesystem can't be used when the network connection is down.
There is another property called durability, which you might also care about. That affects whether you get corruption when devices fail or are rebooted suddenly.
I actually don't need strong consistency here: I'll only have a single instance of the container accessing the SQLite file at any one time, but it can be running on any host. As long as I can ensure replication has synchronized before the process starts when a container gets rescheduled, we're good.
It goes over my head a bit exactly why (related to POSIX lock mechanisms?), but SQLite doesn't work reliably on NFS/Ceph/Gluster/networked filesystems, even when only accessed from a single host, distributed or not.
> It goes over my head a bit exactly why (related to POSIX lock mechanisms?), but SQLite doesn't work reliably on NFS/Ceph/Gluster/networked filesystems, even when only accessed from a single host, distributed or not.
I don't see anything in that forum thread which says SQLite is unreliable on sufficiently POSIX-ish network filesystems. Single host or not.
The linked post says it will be slower to commit write transactions than is possible using other methods, but I don't see anything there saying it's unreliable.
(What looks unreliable to me is Joelmo's proposal to remove the WAL lock for better performance. But the lock is there for a reliability reason, you can't just remove it. To get the boost in performance Joelmo would like requires significantly different techniques.)
You could try implementing a small shim over the FS using POSIX or Dokan which just passes through to the other FS but makes sure writes are synced correctly. It would probably be quite slow, but it sounds like you mostly care about having something that works.
> Most distributed filesystems don't provide that guarantee though. It's a very nice guarantee, but it comes at a complexity cost, usually a performance cost (although there are clever ways to approach the performance of a non-consistent system), and in particular it means the filesystem can't be used when the network connection is down.
Which distributed filesystems do provide this guarantee? I can't think of one, so I'm hoping to learn something new today.
Heh, even Microsoft SMB provides some level of coherency guarantee combined with performance, using its read and write leases. NFSv4 has a similar concept. It's an old idea by now.
I actually don't know of any distributed (multi-master, fully replicated) filesystems that are published with this property. I only know it's possible because I designed one that isn't published.
The basic principle is similar to CPU MESI caching but with predicate-scopes suited to a filesystem rather than cache lines. (Predicate-scopes are similar to predicate-locks in databases). As you know, multi-core CPU systems remain fast despite sharing memory, incurring significant overhead only for the changes that need to be communicated between cores. Same applies on a network.
> To facilitate this, the master logs each transaction in a file and each transaction is associated with an 64 bit ID in which the 32 LSB represents a sequence number and MSB represents the generation number The sequence number gets incremented on every transaction and the generation is incremented when a new master is elected
So, what happens after 2^32 transactions without a re-election?
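For illustration, the ID layout described in the quote can be sketched as follows (function names are mine, not the project's); note what a naive increment of the full 64-bit ID does at the sequence boundary.

```python
def make_txn_id(generation, sequence):
    """Pack generation (high 32 bits) and sequence (low 32 bits)."""
    return (generation << 32) | (sequence & 0xFFFFFFFF)

def split_txn_id(txn_id):
    """Recover (generation, sequence) from a packed 64-bit ID."""
    return txn_id >> 32, txn_id & 0xFFFFFFFF

# At sequence 2**32 - 1, naively adding 1 to the full 64-bit ID
# carries into the generation field, which is supposed to change
# only when a new master is elected.
last = make_txn_id(generation=7, sequence=2**32 - 1)
gen, seq = split_txn_id(last + 1)
```

Whether the real implementation increments the packed ID or the sequence field separately (and how it handles the wrap) is exactly the question being asked.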
As I haven't seen it mentioned yet, if you want bi-directional sync of changes with relatively low overhead, I've found csync2 to be reasonably good for this task.
Unison is only intended for two "points" isn't it? i.e. it's meant for pretty narrow range of "client : server" sync jobs, and is essentially unusable without exactly the same patch version on both ends.
For purely server-side "cluster" sync I've found csync2 to be a reasonable solution.