Amazing how much the mainstream media are still obsessed with LLMs and factual accuracy. I prefer to see them as the world's greatest innovation: lateral thinking machines, i.e. pattern A applied over pattern B to give a credible pattern C.
LLMs are, and will continue to be, used for applications where factual accuracy is important. The decisions that LLM outputs prompt will soon be having big impacts in people's actual lives, if they aren't already.
The appearance of credibility of the output makes this much worse, since inevitably decisions will be deferred to LLMs that they are not sufficiently accurate for.
Real-world industry won't stand for "hallucinated" outputs, unless there is more innovation around UI/UX on outputs. For example, there's no way lawyers/bankers/doctors are going to use LLMs in their current form, with their current limitations, if they can't trust the outputs.
Disagree. As a lawyer I use LLMs with RAG to help me surface information all the time. Often, this allows me to find niche case law that I just wouldn't have had the time to find on my own. However, I double-check everything, and read all the original sources.
LLMs are best treated as the AI equivalent of a human assistant who is knowledgeable and fast, but also inexperienced and thus prone to making mistakes. You won't throw out the work of such an assistant; it'll still save you hours of effort. However, you won't take the work at face value either.
I use LLMs to help with coding, and it's proving invaluable. I see it very similarly: an inexperienced assistant with very broad knowledge. I also find that they don't make too many mistakes in tasks such as refactoring or finding bugs; it's when you ask them to just wholesale generate code for you that you hit problems. If I were to take the code as is, not test it, not check that I understand it, and then use it, it's me that would be making the mistake, not the LLM.
I have a friend in law school at the moment, and while he obviously can't use AI for school, multiple professors of his have recommended he get familiar with using them now so he'll be efficient at using them after passing the bar.
It's the trick where you take a question from a user, search for documents that match that question, stuff as many of the relevant chunks of content from those documents as you can into the prompt (usually 4,000 or 8,000 tokens, but Claude can go up to 100,000) and then say to the LLM "Based on this context, answer this question: QUESTION".
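A minimal sketch of that trick (the retrieval function and model name are placeholders I'm assuming, and a real implementation would budget by tokens rather than characters):

    # RAG sketch: retrieve chunks that match the question, pack as many as fit
    # into the prompt, then ask the model to answer from that context only.
    # `search_documents` is a hypothetical retrieval call (e.g. a vector store query).
    from openai import OpenAI

    client = OpenAI()

    def answer_with_rag(question: str, max_context_chars: int = 16000) -> str:
        chunks = search_documents(question)        # hypothetical: returns ranked text chunks
        context, used = [], 0
        for chunk in chunks:                       # stuff chunks until the budget runs out
            if used + len(chunk) > max_context_chars:
                break
            context.append(chunk)
            used += len(chunk)

        prompt = ("Based on this context, answer this question.\n\n"
                  "Context:\n" + "\n---\n".join(context) +
                  f"\n\nQuestion: {question}")
        response = client.chat.completions.create(
            model="gpt-4",                         # placeholder model name
            messages=[{"role": "user", "content": prompt}],
        )
        return response.choices[0].message.content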
Real world industry already relies heavily on hallucinated outputs. They are called humans.
One of the main differences between people who see the astonishing value of LLMs right now and people who are some combo of skeptical, dismissive, and indignant is the expectation that for something to be a valuable source of information, it has to be factually accurate every time.
The entirety of civilization has been built on the back of inaccurate sources of information.
That will never change and it absolutely can't, because (1) factual accuracy is not something that can be determined by consensus in a variety of significant cases, and (2) factual accuracy as a concept itself does not have a consensus definition, operationally or in the abstract.
Absolutely frustrating to see these topics addressed as if thousands of years of intense thinking around truth and factual accuracy has not taken place.
The results of those inquiries do not support the basic assumptions of these conversations (i.e. that factual accuracy is amenable to exhaustive algorithmic verification).
If you have employees, you also need to create transverse structures (make people work in teams, set checklists, QA, HR, accounting, create corporate charters on gender equality and many other topics, make lots of "we believe that ..." or "our mission is ..." statements, create corporate culture, and so on) because you can't trust human employees 100%.
Actually, some of banks' biggest failures came when only a few people, and in some cases only one person, were in charge.
> Real world industry wont stand for "hallucinated" outputs
Of course they will, if the other benefits are large enough. Checking factual accuracy of a large corpus can often be considerably simpler than generating the large corpus to begin with.
But how long will factual accuracy remain below human level? I expect the incorrect-information issue to be a short-term problem. That new GPT-4-based legal AI the BigLaw firms are signing up for already produces some documents with a lower error rate than a human lawyer.
Perfection isn't necessary, it only needs to be better than the average human. In a year or two I expect those kinks will be worked out for countless job tasks.
If I'm making a tool for doctors, I don't need to surface the exact scientific fact the LLM recalled from memory, I can design an interface that surfaces verbatim text from sources based on the LLM's understanding of the situation.
No serious product should be surfacing a ChatGPT-style chat window to the user; it's poor UX anyway, with awful discoverability.
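A rough sketch of the kind of interface described above (the source store, prompt, and model name are all placeholders; the point is that the UI shows the stored passages verbatim, not the model's restatement of them):

    # Sketch: the LLM only selects which sources are relevant; the interface
    # then displays the stored text untouched. SOURCES stands in for a real
    # document store.
    import json
    from openai import OpenAI

    client = OpenAI()

    SOURCES = {
        "S1": "Verbatim text of guideline paragraph 1 ...",
        "S2": "Verbatim text of guideline paragraph 2 ...",
    }

    def relevant_passages(situation: str) -> list[str]:
        prompt = ("Here are sources, each with an ID:\n"
                  + "\n".join(f"{sid}: {text}" for sid, text in SOURCES.items())
                  + f"\n\nSituation: {situation}\n"
                    "Reply with a JSON list of the IDs of the relevant sources and nothing else.")
        reply = client.chat.completions.create(
            model="gpt-4",                                   # placeholder model name
            messages=[{"role": "user", "content": prompt}],
        )
        ids = json.loads(reply.choices[0].message.content)   # assumes the model complied
        return [SOURCES[i] for i in ids if i in SOURCES]     # surface the originals, verbatim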
Remember the outcry against using Wikipedia as a source in schoolwork, etc.? Pretty convinced that the fear that "LLMs just make stuff up" will gradually go away as they get better.
Also, at the moment you should view e.g. ChatGPT as your autistic, well-read friend who really does know quite a lot of things, but only approximately, and is prone to making things up instead of being found out for not knowing. It's OK to ask your autistic friend how electricity works, but don't expect him to correctly cite the titles and authors of relevant research papers. Just common-sense stuff if you really think of the LLM as a brain of compressed information and not a search engine.
I think there is more innovation to happen with LLM output, for example leveraging the attention weights to give information on which key events in history should be paid attention to given a sequence of current events, etc. More predictive information beyond just generated text responses ("completions").
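A rough illustration of reading attention weights out of a model (a sketch using Hugging Face transformers with gpt2 as a stand-in; which layers/heads to aggregate, and whether the weights carry the kind of signal described above, are open questions):

    # Sketch: pull per-token attention weights out of a pretrained model and
    # see which earlier tokens the final position attends to most.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("gpt2")
    model = AutoModelForCausalLM.from_pretrained("gpt2", output_attentions=True)

    text = "Event A led to event B, which triggered event C."
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs)

    # out.attentions: one (batch, heads, seq, seq) tensor per layer.
    # Average over layers and heads, then look at what the last token attends to.
    attn = torch.stack(out.attentions).mean(dim=(0, 2))[0]   # (seq, seq)
    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
    for tok, w in sorted(zip(tokens, attn[-1].tolist()), key=lambda x: -x[1])[:5]:
        print(f"{tok:>12s}  {w:.3f}")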
This is the real breakthrough of LLMs imo. Karpathy had a series of tweets about this a year ago. It makes sense: pretrained LLMs learn to continue sequences of tokens. These are language tokens, but the language is so complex and rich that a kind of general pattern recognition and continuation ability is learned.
I think finetuning and alignment destroy a large part of this in-context learning, so we are currently not focusing on this ability. In the future we will train on all kinds of token sequences.
If we can analyze these machines deeply and understand how they are doing this, is it possible that we can extract some kind of general purpose pattern recognition and manipulation algorithm or set of algorithms?
Are we just inductively generating algorithms that could in fact run more powerfully and efficiently if they were “liberated” from the sea of matrix math in which they are embedded?
Or is the sea of matrix math fundamental to how these things work?
It seems at least likely that a model that only has this general pattern-prediction ability could be much smaller than current models. After all, these models don't just predict abstract patterns, they also "predict" a lot of declarative knowledge about history, science, human languages, pop culture, programming languages, and so on. They memorize a lot. But the vast majority of these memories are not necessary for a general pattern inductor. Such a model (or "algorithm") wouldn't need or have the ability to answer factual questions about the capital of France, or even the ability to understand English.
So I think it should be theoretically possible to separate the "inductive inference" abilities of a model from its "knowledge". Currently both faculties are mixed together, but this doesn't mean that they can't be separated somehow. We already know that knowledge and "inductive inference ability" (~intelligence) are not in fact the same in humans, even though both appear to be implemented in a fairly unified way in the brain.
Though I still think the pattern inductor algorithm wouldn't be something you could implement in C++. Programming languages are far too discrete, where all conditions are either met or not met, instead of being met more or less. Neural networks don't have this problem.
This makes me think a bit about how digital technology is changing the way humans use their intelligence. Memorization is de-emphasized since I can look up anything at any time. Instead, it's more valuable to use space in my brain to learn concepts and "meta" reasoning tasks.
We're probably becoming smarter in terms of general intelligence but dumber in terms of remembered knowledge. There's only so much space up there.
Unfortunately as a species we are in fact getting dumber because of dysgenics. Intelligence (which is strongly heritable) and fertility are anti-correlated, especially in women. See e.g.
We can't ask it questions, but we can enter a data sequence and let the model extrapolate it. Similar to how foundation language models (without instruction tuning) used to work, but not just for human readable text.
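A rough sketch of what that could look like with an off-the-shelf base model (gpt2 here purely as a stand-in; the real version would be a model trained on arbitrary token sequences, not text):

    # Sketch: treat a base (non-instruction-tuned) LM as a sequence extrapolator.
    # Feed it a serialized data sequence and let it continue the pattern.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("gpt2")
    model = AutoModelForCausalLM.from_pretrained("gpt2")

    sequence = "2, 4, 8, 16, 32, 64,"   # any data serialized into tokens
    inputs = tokenizer(sequence, return_tensors="pt")
    out = model.generate(**inputs, max_new_tokens=12, do_sample=False,
                         pad_token_id=tokenizer.eos_token_id)
    print(tokenizer.decode(out[0]))     # the model's extrapolation of the pattern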
I don't think there is a magic algorithm at play. I think it's all about studying the data, the geometry of the data manifold; in the case of a transformer, more precisely, learning the geometry of the stochastic process of tokens. Common "patterns", such as repetition or symmetry, appear there and can be learned. I think, like learning eigenvectors in PCA, you just learn more and more patterns and combinations of patterns. Then at some point that starts looking like pattern recognition.
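As a loose illustration of the PCA analogy (synthetic data made up for the example, just to show how each added component captures another "pattern" in the data):

    # Loose analogy: each extra PCA component captures another direction of
    # structure in the data; stacking components explains more and more of it.
    import numpy as np
    from sklearn.decomposition import PCA

    rng = np.random.default_rng(0)
    latents = rng.normal(size=(1000, 3)) @ rng.normal(size=(3, 20))  # 3 latent "patterns"
    data = latents + 0.1 * rng.normal(size=(1000, 20))               # plus noise

    pca = PCA(n_components=10).fit(data)
    print(np.cumsum(pca.explained_variance_ratio_))  # climbs quickly for the first 3 components, then flattens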
The sea of matrix math is the fundamental way these things work. You could grossly simplify individual patterns into small explicit algorithms, but the matrix math is the way general patterns get handled: compress your information, and then synthesize new information with your understanding of the world into insights.
I always thought better sequence prediction was the natural prerequisite to AGI, given that it’s kind of the whole foundation of algorithmic information theory (specifically, the upper limit to prediction is a prefix-free universal Turing machine that maximally compresses an input sequence and is allowed to continue running once it has reproduced the given input sequence).
It’s kind of funny to me that the general public seems to be recognizing this in a backward fashion — “oh hey LLMs can be used for other stuff too!” I would be extremely surprised if any LLM researcher hasn’t heard of Solomonoff induction though.
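A toy illustration of the prediction-by-compression idea behind this (zlib as a crude stand-in for the ideal, uncomputable compressor; nothing here implements Solomonoff induction itself):

    # Toy "prediction by compression": prefer the continuation that adds the
    # least to the compressed length of the sequence seen so far.
    import zlib

    def compressed_len(s: str) -> int:
        return len(zlib.compress(s.encode(), 9))

    def best_continuation(history: str, candidates: list[str]) -> str:
        return min(candidates, key=lambda c: compressed_len(history + c))

    history = "ab" * 200
    print(best_continuation(history, ["ab" * 10, "ba" * 10, "zq" * 10]))
    # picks the pattern-preserving continuation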
I don't think maximal (lossless) compression is an ideal for AI. Sense data (our eyes basically provide video feeds) are very noisy and compress extremely badly if you try to do it losslessly. I don't think the Kolmogorov complexity of Oppenheimer would be much shorter than its gzip length.
Usually, when we try to make predictions about unseen data from given data (induction), we make a trade-off between the complexity of a hypothesis and how well it "predicts" the available data. Solomonoff induction doesn't make this trade-off. Being based on lossless compression, it always demands perfect "prediction" of the available data, even if this means that the complexity of the "hypothesis" (~its Kolmogorov complexity) is very high and perhaps hardly smaller than the length of the uncompressed available data. Such greedy strategies usually tend to extreme overfitting and terrible generalization to new unseen data. What you really want is to accept some lossiness in predicting the available data for a much simpler hypothesis (better compression ratio). An optimal AI would always make the perfect such trade-off.
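Put slightly more formally (my gloss, not the commenter's notation): the trade-off described here is essentially the two-part MDL objective, i.e. pick the hypothesis H minimizing

    K(H) + L(D | H)

where K(H) is the description length of the hypothesis and L(D | H) is the code length of the data given the hypothesis (small when the fit is good). On this framing, Solomonoff-style induction insists on L(D | H) = 0, i.e. lossless reconstruction of the data, no matter how large K(H) has to become.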
> I don't think the Kolmogorov complexity of Oppenheimer would be much shorter than its gzip length.
I think the conditional Kolmogorov complexity of Oppenheimer given all real world data we could collect [K(Oppenheimer|earth)] would be massively shorter than its gzip length. I agree that by using just the universal distribution alone this might not be the case.
Consider an input tape that consists of all real world data we are able to collect, concatenated with the input data of interest. Now consider the set of all prefix-free universal Turing machines that generate the full sequence on the tape. We then consider a normalized probability distribution based on this set where each program’s output beyond the reproduction of the input tape is weighted by 2^-[length of program].
Roughly speaking (glossing over some details), you can’t do better at predicting the future than this probability distribution assuming the physical Church-Turing thesis. An optimal AGI must incorporate optimal prediction. At this point, you can now develop an agent that utilizes reinforcement learning based upon this probability distribution, but my guess is that this part won’t be nearly as difficult as the prediction step. The particulars of the actual objective function for the agent are kind of irrelevant (could be something along the lines of “maximize whatever the p99 of humans consider AGI to be”), but whatever is chosen, there cannot be a general algorithmic process that would perform better.
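To make that construction a bit more explicit (my paraphrase of the standard formulation, glossing over the same details): with U the prefix-free universal machine and |p| the length of program p, the induced predictive weight of a continuation y given the observed tape x is roughly

    M(y | x) = [ sum over p with U(p) = xy... of 2^-|p| ] / [ sum over p with U(p) = x... of 2^-|p| ]

where "xy..." means the program's output starts with x followed by y.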
> I think the conditional Kolmogorov complexity of Oppenheimer given all real world data we could collect [K(Oppenheimer|earth)] would be massively shorter than its gzip length.
To predict Oppenheimer perfectly, pixel for pixel, we would need at least Laplace's Demon with knowledge of both the laws and the complete initial state of the universe. And we need to assume the evolution of the universe is even deterministic and computable, otherwise this wouldn't work in the first place.
And even if we have the laws and the complete initial state of the universe, and its evolution is computable: if we live in some form of multiverse (or extremely large repeating universe, which is not unlikely, since many physicists assume the universe is infinitely large), it would still not work. There may be countless copies of Earth which are indistinguishable given (i.e. underdetermined by[1]) the data we can collect on Earth, but where each version of Oppenheimer has a completely different pixel noise pattern, and consequently different Kolmogorov complexity. So the "compressed" UTM input string would have to hardcode basically our Earth's entire Oppenheimer in order to perfectly and uniquely determine the movie, resulting in a terrible compression ratio.
And even if the universe is in fact small and not a multiverse, our data available to us on Earth is probably not enough to uniquely retrodict the initial state (and laws) of our universe, since there are countless possible but non-actual universes which are consistent with our Earth data, but which all differ in initial state and laws in a way such that they again predict different versions of Oppenheimer, and so again the "compression" has to hardcode basically everything. Underdetermination strikes again, be it uncertainty about location in a multiverse or uncertainty about which merely possible universe is the actual one.
> Roughly speaking (glossing over some details), you can’t do better at predicting the future than this probability distribution assuming the physical Church-Turing thesis.
That only holds if you restrict yourself to perfect predictions, not prediction in general. (And of course ignore that Solomonoff induction is uncomputable even if the universe itself is computable.) The issue is that we don't care about just maximizing the probability of perfectly accurate predictions, because, given underdetermination, the best perfect prediction would still be highly unlikely to be true, because it would be dominated by overfitting on tons of hardcoded noise and not by real high level patterns in the data. What we really want for an "optimal" prediction is the prediction with the best "expected accuracy", the best expected "goodness of fit", expected "closeness to the truth" etc. The best such prediction may be perfectly accurate with probability 0% (most good predictions make unrealistic simplifying assumptions where we already know they are strictly speaking false), but that doesn't matter when the prediction is likely very accurate, very close to the truth.
An ideal prediction algorithm would probably minimize something like the quantity (complexity of hypothesis + inverse of goodness of fit to the given data). And it would be computable, since uncomputability is a rather non-ideal property of an algorithm.
Looking at the ARC problems that the model didn't solve correctly, for some of them I honestly have no idea how the model got it wrong or what the solution should have been, given the train examples.