Hacker News | mccoyb's comments

I wanted to write the same comment. These people are fucking hucksters. Don’t listen to their words, look at their software … says all you need to know.

The opposite is true.

There is barely any magic in the harness, the magic is in the model.

Try it: write your own harness with (bash, read, write, edit) ... it's trivial to get a 99% version of (pick your favorite harness) -- minus the bells and whistles.

The "magic of the harness" comes from the fun auxiliary orchestration stuff - hard engineering for sure! - but seriously, the model is the key item.
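For a concrete sense of how small the harness side is, here's a minimal sketch of the four tools named above in Python. All names here (`run_tool`, `TOOLS`) are mine, not from any real harness, and the model client is deliberately left out; any chat API with tool calling could drive this dispatch loop.

```python
import pathlib
import subprocess

# Toy harness tool set: bash, read, write, edit.
# A real agent loop would feed each tool's return value back to the
# model as the "observation" for the next turn.

def bash(cmd: str) -> str:
    r = subprocess.run(cmd, shell=True, capture_output=True, text=True, timeout=60)
    return r.stdout + r.stderr

def read(path: str) -> str:
    return pathlib.Path(path).read_text()

def write(path: str, content: str) -> str:
    pathlib.Path(path).write_text(content)
    return f"wrote {len(content)} chars to {path}"

def edit(path: str, old: str, new: str) -> str:
    # Require a unique match, like most harness edit tools do.
    p = pathlib.Path(path)
    text = p.read_text()
    if text.count(old) != 1:
        return "error: old string must occur exactly once"
    p.write_text(text.replace(old, new))
    return f"edited {path}"

TOOLS = {"bash": bash, "read": read, "write": write, "edit": edit}

def run_tool(name: str, args: dict) -> str:
    """Dispatch one model-issued tool call and return the observation."""
    return TOOLS[name](**args)
```

That's essentially the whole "99% version"; everything else is context management and orchestration around this loop.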


Yeah, I agree with this. The only tool that really matters is file patching -- for that, you can look at something like the opencode patch implementation; it's fairly straightforward.

Not all percentages are weighted equally. That 1% is worth a lot more than the low-hanging 99% from your example

Is it? Look at pi, for instance.

It turns out that "most of the bells and whistles" could amount to instructing models how to use tools like tmux


This is the complete opposite of my experience.

Does your experience include writing your own agent? Send a link

"Training the specific harness" is marginal -- it's obvious if you've used anything else. pi with Claude is as good as Claude Code with Claude (even better, given the obvious care to context management in pi).

This whole game is a bizarre battle.

In the future, many companies will have slightly different secret RL sauces. I'd want to use Gemini for documentation, Claude for design, Codex for planning, yada yada ... there will be no generalist take-all model, I just don't believe RL scaling works like that.

I'm not convinced that a single company can own the best performing model in all categories, I'm not even sure the economics make it feasible.

Good for us, of course.


> pi with Claude is as good as Claude Code with Claude (even better, given the obvious care to context management in pi)

And that’s out of the box. With how comically extensible pi is and how much control it gives you over every aspect of the pipeline, as soon as you start building extensions for your own personal workflow, Claude Code legitimately feels like a trash app in comparison.

I don’t care what Anthropic does - I’ll keep using pi. If they think they need to ban me for that, then, oh well. I’ll just continue to keep using pi. Just no longer with Claude models.


As a Claude Code user looking for alternatives, I am very intrigued by this statement.

Can you please share good resources I can learn from to extend pi?


Pi has specific instructions to extend itself.

You can just tell it to create an extension to connect to any AI API provider and it'll most likely one or two-shot it for you.

IMO it's the most self-aware of all of the current harnesses.


OpenAI has endorsed OAuth from 3rd party harnesses, and their limits are way higher. Use better tools (OpenCode, pi) with an arguably better model (xhigh reasoning) for longer …

I am looking forward to switching to OpenAI once my claude max account is banned for using pi....

Broadly agree with the author's points, except for this one:

> TypeScript/Node.js: Better concurrency story thanks to the event loop, but still fundamentally single-threaded. Worker threads exist but they're heavyweight OS threads, not 2KB processes. There's no preemptive scheduling: one CPU-bound operation blocks everything.

This cannot be a real objection: 100% of the time spent in agent frameworks is spent ... waiting for the agent to respond, or waiting for a tool call to execute. Almost no time is spent in the logic of the framework itself.

Even if you use heavyweight OS threads, I just don't believe this matters.

Now, the other points about hot code swapping ... so true, painfully obvious to those of us who have used Elixir or Erlang.

For instance, OpenClaw: how much easier would "in-place updating" be if the language runtime had been designed with that ability in mind in the first place?


> 100% of the time spent in agent frameworks is spent ... waiting for the agent to respond, or waiting for a tool call to execute. Almost no time is spent in the logic of the framework itself.

But that’s exactly where multi-threaded Elixir is better! For CPU-bound work you don't want to stop the world; for I/O-bound work like AI agents you want extreme concurrency. In Elixir you can do both: heavy CPU work without worrying about stopping the world, and heavy concurrency across millions of lightweight processes where work is I/O-bound and you want to saturate your network connection. In Node you can't do either of those things easily - it's just a single thread.


> Even if you use heavyweight OS threads, I just don't believe this matters.

It matters a lot. How many OS threads can you run on 1 machine? With Elixir you can easily run thousands without breaking a sweat. But even if you need only a few agents on one machine, OS thread management is a headache if you have any shared state whatsoever (locks, mutexes, etc.). On Unix you can't even reliably kill dependent processes[1]. All those problems just disappear with Elixir.

[1] https://matklad.github.io/2023/10/11/unix-structured-concurr...
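To illustrate the linked point, here's a sketch (in Python, my own example) of the usual Unix workaround: start the child in its own process group via `start_new_session=True`, then signal the whole group so grandchildren die too. As the post argues, even this isn't airtight, since a descendant can move itself into a new group and escape.

```python
import os
import signal
import subprocess
import time

# The shell forks a background sleep, then execs into another sleep,
# so there are two processes in the child's process group.
proc = subprocess.Popen(
    ["sh", "-c", "sleep 30 & exec sleep 30"],
    start_new_session=True,  # child becomes leader of a new process group
)
time.sleep(0.2)  # give the shell a moment to fork its background child
os.killpg(proc.pid, signal.SIGTERM)  # signal the entire group at once
code = proc.wait()  # negative return code: terminated by SIGTERM
```

A plain `proc.terminate()` here would only kill the direct child and orphan the background `sleep`.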


Presumably if you can afford to pay for all those tokens, the computational cost should be mostly insignificant?

Spending too much time optimizing away the 1% of extra overhead seems suboptimal.


Even if I were building one single agent for a one-off hobby project I would still use Elixir. It’s elegantly suited to the job. With any long-running, failure-prone process you’re constantly writing try/catches, health checks, network timeouts, retries, and a whole lot of other orchestration stuff just to keep a flaky real-world agent running. With Erlang it’s all just a built-in state machine. The process knows what state it is in, crashes, and recovers gracefully in exactly the right state.

> How many OS threads can you run on 1 machine?

Any modern Linux machine should be able to spawn thousands of simultaneous threads without breaking a sweat.
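This is easy to probe directly. A quick Python sketch (the function name is mine) that measures how long it takes to start a thousand OS threads:

```python
import threading
import time

def spawn_idle_threads(n: int) -> float:
    """Start n OS threads that each sleep briefly; return the seconds
    taken just to start them all. A rough probe of spawn overhead."""
    start = time.perf_counter()
    threads = [threading.Thread(target=time.sleep, args=(0.2,)) for _ in range(n)]
    for t in threads:
        t.start()
    startup = time.perf_counter() - start
    for t in threads:
        t.join()
    return startup
```

On a typical modern Linux box this starts a thousand threads in well under a second; the limits you actually hit first are memory for stacks and the coordination headaches around shared state, not raw spawn capacity.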


Effectively everyone is building the same tools, with zero quantitative benchmarks or evidence behind the ideas ... this entire space is a nightmare to navigate because of this. Who cares, without proper science, seriously? I look through this website and it looks like a preview for a course I'm supposed to buy ... when someone builds something with these sorts of claims attached, I expect some "real graphs" ("these are the number of times this model deviated from the spec before we added error correction ...").

What we have instead are many people creating hierarchies of concepts, a vast “naming” of their own experiences, without rigorous quantitative evaluation.

I may be alone in this, but it drives me nuts.

Okay, so with that in mind, it amounts to hearsay ("these guys are doing something cool") -- why not put up or shut up with either (a) an evaluation of the ideas in a rigorous, quantitative way or (b) applying the ideas to produce a "hard" artifact (analogous, e.g., to the Anthropic C compiler or the Cursor browser) with a reproducible pathway to generation.

The answer seems to be that (b) is impossible (as long as we're on the teat of the frontier labs, which disallow the kind of access that would make (b) possible), and the answer for (a) is "we can't wait, we have to get our names out there first".

I’m disappointed to see these types of posts on HN. Where is the science?


Honestly I've not found a huge amount of value from the "science".

There are plenty of papers out there that look at LLM productivity and every one of them seems to have glaring methodology limitations and/or reports on models that are 12+ months out of date.

Have you seen any papers that really elevated your understanding of LLM productivity with real-world engineering teams?


The writing on this website is giving strong web3 vibes to me / doesn't smell right.

The only reason I'm not dismissing it out of hand is basically because you said this team was worth taking a look at.

I'm not looking for a huge amount of statistical ceremony, but some detail would go a long way here.

What exactly was achieved for what effort and how?


Nothing in this space “smells right” at the moment.

Half the “ai” vendors outside of frontier labs are trying to sell shovels to each other, every other bubbly new post is about this-weeks-new-ai-workflow, but very few instances of “shutting up and delivering”. Even the Anthropic C compiler was torn to pieces in the comments the other day.

At the moment everything feels a lot like the people meticulously organising desks and calendars and writing pretty titles on blank pages and booking lots of important sounding meetings, but not actually…doing any work?


This was my reaction as well, a lot of hand-waving and invented jargon reminiscent of the web3 era - which is a shame, because I'd really like to understand what they've actually done in more detail.


Yeah, they've not produced as much detail as I'd hoped - but there's still enough good stuff in there that it's a valuable set of information.


No, I agree! But I don’t think that observation gives us license to avoid the problem.

Further, I’m not sure this elevates my understanding: I’ve read many posts in this space which could be viewed as analogous to this one (this one is more tempered, of course). Each one has the same flaw: someone is telling me I need to make an “organization” out of agents and positive things will follow.

Without a serious evaluation, how am I supposed to validate the author’s ontology?

Do you disagree with my assessment? Do you view the claims in this content as solid and reproducible?

My own view is that these are “soft ideas” (GasTown and Ralph fall into a similar category) without rigorous justification.

What this amounts to is “synthetic biology” with billion-dollar probability distributions -- where the incentives are set up so that companies are rewarded for conveying that they have the “secret sauce” … for massive amounts of money.

To that end, it’s difficult to trust a word out of anyone’s mouth — even if my empirical experiences match (along some projection).


The multi-agent "swarm" thing (that seems to be the term that's bubbling to the top at the moment) is so new and frothy that it is difficult to determine how useful it actually is.

StrongDM's implementation is the most impressive I've seen myself, but it's also incredibly expensive. Is it worth the cost?

Cursor's FastRender experiment was also interesting but also expensive for what was achieved.

I think my favorite current example at the moment was Anthropic's $20,000 C compiler from the other day. But they're an AI vendor, demos from non-vendors carry more weight.

I've seen enough to be convinced that there's something there, but I'm also confident we aren't close to figuring out the optimal way of putting this stuff to work yet.


But the absence of papers is precisely the problem and why all this LLM stuff has become a new religion in the tech sphere.

Either you have faith and every post like this fills you with fervor and pious excitement for the latest miracles performed by machine gods.

Or you are a nonbeliever and each of these posts is yet another false miracle you can chalk up to baseless enthusiasm.

Without proper empirical method, we simply do not know.

What's even funnier about it is that large-scale empirical testing is actually necessary in the first place to verify that a stochastic process is even doing what you want (at least on average). But the tech community has become such a brainless atmosphere, totally absorbed by anecdata and marketing hype, that no one seems to care anymore. It's quite literally devolved into the religious ceremony of performing the rain dance (use AI) because we said so.

One thing the papers help provide is basic understanding and consistent terminology, even when the models change.

You may not find value in them, but I assure you that the actual building of models and product improvements around them is highly dependent on the continual production of scientific research in machine learning, including experiments around applications of LLMs. The literature covers many prompting techniques well, and in a scientific fashion, and many of these have been adopted directly in products. Chain of thought, to name one big example: part of the reason people integrate it is not some "fingers crossed guys, worked on my query" but that researchers have produced actual statistically significant results on benchmarks using the technique.

To be a bit harsh, I find your very dismissal of the literature here in favor of hype-drenched blog posts soaked in ridiculous language and fantastical incantations to be precisely symptomatic of the brain rot the LLM craze has produced in the technical community.


I do find value in papers. I have a series of posts where I dig into papers that I find noteworthy and try to translate them into more easily understood terms. I wish more people would do that - it frustrates me that paper authors themselves only occasionally post accompanying commentary that helps explain the paper outside of the confines of academic writing. https://simonwillison.net/tags/paper-review/

One challenge we have here is that there are a lot of people who are desperate for evidence that LLMs are a waste of time, and they will leap on any paper that supports that narrative. This leads to a slightly perverse incentive where publishing papers that are critical of AI is a great way to get a whole lot of attention on that paper.

In that way academic papers and blogging aren't as distinct as you might hope!


> There are plenty of papers out there that look at LLM productivity and every one of them seems to have glaring methodology limitations and/or reports on models that are 12+ months out of date.

This is a general problem with papers measuring productivity in any sense. It's often hard to define what "productivity" means and to figure out how to measure it. But there's also the fact that any study with worthwhile results will:

1. Probably take some time (perhaps months or longer) to design, get funded, and get through an IRB.

2. Take months to conduct. You generally need to get enough people to say anything, and you may want to survey them over a few weeks or months.

3. Take months to analyze, write up, and get through peer review. That's kind of a best case; peer review can take years.

So I would view the studies as necessarily time-boxed snapshots due to the practical constraints of doing the work. And if LLM tools change every year, like they have, good studies will always lag and may always feel out of date.

It's totally valid to not find a lot of value in them. On the other hand, people all-in on AI have been touting dramatic productivity gains since ChatGPT first arrived. So it's reasonable to have some historical measurements to go with the historical hype.

At the very least, it gives our future agentic overlords something to talk about on their future AI-only social media.


How does this model compare to the syndicated actor model of Tony Garnock-Jones?

(which, as far as I can tell, also supports capabilities and caveats for security)

Neat work!


The animation on the Syndicated Actors home page [0] does a pretty good job of showing the difference, I think. Goblins is much more similar to the classic actor model shown at the beginning of the animation. The "syndicated" part, as far as I understand, relates to things like eventually consistent state sync being built-in as primitives. In Goblins, we provide the actor model (actually the vat model [1] like the E language) which can be used to build eventually consistent constructs on top. Recently we prototyped this using multi-user chat as a familiar example. [2]

[0] https://syndicate-lang.org/

[1] https://files.spritely.institute/docs/guile-goblins/0.17.0/T...

[2] https://spritely.institute/news/composing-capability-securit...


Thank you, very helpful!


My 5 minute read is that the divergences are primarily in the communication model and in transactions:

- the SAM coordinates through the dataspace, whereas Goblins is focused on ("point-to-point") message passing

- SAM (as presented) doesn't contain transactional semantics -- e.g. turns being atomic, with a rollback mechanism (I'm not up to speed on recent work; I do wonder if this could be designed into SAM)


a better term might be “feedback engineering” or “verification engineering”: what feedback loop do I need to construct to ensure that the output artifact from the agent matches my specification?

This includes standard testing strategies, but also much more general processes

I think of it as steering a probability distribution

At least to me, this makes it clear where “vibe coding” sits … someone who doesn’t know how to express precise verification or feedback loops is going to get “the mean of all software”


It's not in the harness today, it's a special RL technique they discuss in https://www.kimi.com/blog/kimi-k2-5.html (see "2. Agent Swarm")

I looked through the harness and all I could find is a `Task` tool.


Claude Code gets functionally worse with every update. They need to get their shit together; it's hilarious to see Amodei at Davos talking a big game about AGI while the latest update to a TUI application fucking changes observable behavior (like history scrolling with arrow keys), renders random characters in the "newest" native version in iTerm2, breaks the status line ... the list goes on and on.

This is the new status quo for software ... changing and breaking beneath your feet like sand.


Software changing and breaking beneath your feet is not new

