This seems like a very important paper. It basically shows that Markov models, where the influence of a token decays exponentially with distance, are often a poor model, whereas deep neural networks with LSTMs (long short-term memory) exhibit power-law decay of influence, which works better for a variety of sequential data.
BTW, I went to the North American Chapter of the Association for Computational Linguistics (NAACL) conference in April and it seemed like half the papers used LSTMs.
Forgive me, but I'm less impressed by the paper. As far as I can tell, they've only really shown that (1) language is recursive, which we know already; (2) Markov models cannot capture recursive languages, which we've known; and (3) RNNs can, which we've known. But so can PCFGs and many other formalisms from the past 25 years, which they ignore.
> We can formalize the above considerations by giving rules for a toy language L over an alphabet A. In the parlance of theoretical linguistics, our language is generated by a stochastic or probabilistic context-free grammar (PCFG) [41–44]. We will discuss the relationship between our model and a generic PCFG in Section C.
The paper itself is a mess. It's very interesting but they aren't doing themselves any favors. They need to clean up the presentation and get rid of the distracting phrases like "fail epically".
> This seems like a very important paper. It basically shows that Markov models, where the influence of a token decays exponentially with distance, are often a poor model, whereas deep neural networks with LSTMs (long short-term memory) exhibit power-law decay of influence, which works better for a variety of sequential data.
As someone outside of this field, this kind of result seems like it should have been very obviously foreseeable - hindsight bias and all that, of course - but I would never have considered Markov processes to be an adequate predictive model for natural language. Though obviously the formalized results are important.
Could someone with more knowledge comment on what the current working assumptions were prior to this paper and what the consequences would be?
> but I would never have considered Markov processes to be an adequate predictive model for natural language
Isn't it the case that for very short distances (a few elements), power-law decay and exponential decay are (or can be made, with proper constants) quite similar? So if predictive models were originally studied only on very short sequences (limited computational resources!), it seems like a mistake anyone could easily have made back then.
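For a rough sense of that (my own toy numbers, not from the paper): with hand-tuned constants, an exponential and a power law stay within a small factor of each other over the first few distances and only end up orders of magnitude apart much further out.

    import numpy as np

    # Exponential vs power-law decay; constants picked by hand so the two
    # roughly agree for d <= 5. They diverge badly at long range.
    d = np.arange(1, 201)
    exponential = np.exp(-d / 3.0)   # ~ e^{-d/lambda}
    power_law = 0.72 * d ** -0.83    # ~ c * d^{-alpha}

    for k in (1, 2, 3, 5, 10, 50, 200):
        print(f"d={k:3d}  exp={exponential[k - 1]:.2e}  power={power_law[k - 1]:.2e}")

If all you ever measured were dependencies over a handful of symbols, the two curves would be hard to tell apart.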
Personally, no. I think all of these models are essentially trivial and a long way from genuine NLP.
That doesn't mean they're not useful in very narrow domains. But language is pretty much the definition of the ultimate wide domain, and trying to cover it with statistical correlations makes as much sense as word counting Shakespeare to try to generate some new plays.
I think you might be pleasantly surprised by recent results using DL and LSTMs for building models of natural language. The next advancement I would like to see is handling anaphora resolution (resolving pronouns to previous noun phrases in text, resolving words like 'there' to a place mentioned elsewhere in the text, etc.). Progress has been so rapid that I bet I don't have to wait long.
Have you seen the results from Dynamic Memory Networks? [0]
The relevant example from the paper:
I: Jane went to the hallway.
I: Mary walked to the bathroom.
I: Sandra went to the garden.
I: Daniel went back to the garden.
I: Sandra took the milk there.
Q: Where is the milk?
A: garden
Obviously just a toy task, but as you said, progress is rapid!
> Deep models are important because without the extra “dimension” of depth/abstraction, there is no way to construct “shortcuts” between random variables that are separated by large amounts of time with short-range interactions; 1D models will be doomed to exponential decay.
From my non-professional perspective, the above seems like it should have been very obvious (and also that correlations between variables in natural language would be better explained by multi-dimensional structure). That is, if you told me that this had been proved / formally supported, as it is in this paper, my reaction would be "that sounds like a reasonable approximation", not "that result sounds very surprising, I must read the paper".
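To make the quoted claim a bit more concrete (my own toy example, not the paper's): in a two-state Markov chain, the covariance between symbols d steps apart is proportional to the chain's second eigenvalue raised to the power d, so it falls off exponentially no matter how the chain is tuned.

    import numpy as np

    # Two-state Markov chain; p and q are arbitrary "stay" probabilities.
    p, q = 0.9, 0.8
    P = np.array([[p, 1 - p],
                  [1 - q, q]])
    lam2 = p + q - 1                              # second eigenvalue (the first is 1)

    pi1 = (1 - p) / ((1 - p) + (1 - q))           # stationary probability of state 1
    pi = np.array([1 - pi1, pi1])

    for d in (1, 2, 4, 8, 16, 32):
        Pd = np.linalg.matrix_power(P, d)
        cov = pi[1] * Pd[1, 1] - pi[1] ** 2       # Cov(X_0, X_d) for the 0/1-valued chain
        print(d, cov, pi[0] * pi[1] * lam2 ** d)  # the two columns agree: exponential decay

The "shortcuts" the authors mention are exactly what's missing here: with only short-range interactions along one dimension, information about the distant past has to be passed through every intermediate step, and it leaks away exponentially.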
@mark_l_watson -- the paper also seems important in another respect -- in the conclusion the authors suggest abandoning loss functions as optimization objectives in machine learning and replacing them with mutual information functions.
Hah, I'll take your word for it, then. :) Are there any recent comprehensive monographs you'd recommend for state-of-the-art NLP, for someone who has yet to enter the field?
Monographs aren't really a thing in NLP, beyond theses. There are a couple of reference books most people lean on, though:
* Foundations of Statistical Natural Language Processing by Manning and Schütze
* Speech and Language Processing by Jurafsky and Martin (which is being revised for a third edition, which you can look at: https://web.stanford.edu/~jurafsky/slp3/ )
Beyond that, you're basically stuck reading the research literature. On the up-side, most of that literature is freely available from the ACL anthology at http://aclweb.org/anthology/
Thank you very much. I'm aware of a large number of books that I could read, but if Sturgeon's law applies, or at least if a lot of the reading would be redundant anyway, I'd rather ask for pointers than waste a lot of time reinventing the wheel. Basically, what I was asking for was NLP's respected equivalent of Norvig and Russell, or similar books from other fields. Those two books seem to fit the bill.
If you have the time, I would start by taking Andrew Ng's machine learning class and then this NLP class when it is next offered in September 2016 https://www.coursera.org/learn/nlp
Ah yes, the one I've heard about but still have to take. :) Well, I guess I should give Coursera a chance. (Somehow I'm not fond of their "timelined" format; it seems redundant if you're communicating with a machine. I hope the future of online learning will avoid it like the plague.)
Not to nitpick, but Max Tegmark is a cosmologist. I only recently learned the difference when I called a cosmologist friend an astrophysicist. Cosmology deals with the big stuff, almost philosophically, like: "where did the universe come from" and "what is the fate of the universe", while astrophysics deals with the nature of the things within, like: "how do stars form" and "what happens when black holes collide".
Max is deeply invested in modeling, analysis and prediction software, and I suspect did the bulk of the work in the paper.
Henry Lin is a student who is focused on astrophysics. He gave an interesting TED talk (http://www.ted.com/speakers/henry_lin) a few years back about studying distant galaxy clusters.
Henry is energetic and almost viscerally inspired by the beauty of science and mathematics, such a wonderful quality! His voice is definitely in the prose of the paper.
> Did a colleague take a peek at the screen and say, hey, I have the same equations?
This would be an interesting thing to try - a computer system that would scan all the papers for math and find parallels. I think we already have something like term indexing for deductive systems?
One of the goals of OEIS is to discover when the same integer sequence arises in different mathematical contexts, and to confirm whether it's really the same sequence, which then gives the opportunity to prove why it's the same.
That might be a bit of a narrower domain, but it seems to work out pretty well!
One of the common threads I've noticed between the best researchers I know is their uncanny ability to draw these parallels between the most obscure domains.
> Corollary: No probabilistic regular grammar exhibits criticality. In the next section, we will show that this statement is not true for context-free grammars (CFGs).
That is, there exist CFGs that exhibit criticality. Programming languages are often parsed by CFGs, so it's likely that some programming languages exhibit the same criticality structure as natural languages.
I think that means only that programs written in a language described by a CFG could exhibit "criticality", not that they will. "Exhibiting criticality" is a property of a distribution (e.g. a corpus of human-written programs or an algorithm for generating programs), not of a grammar, IIUC.
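One way to check that empirically (a rough plug-in sketch of my own, not the estimator the paper uses) is to treat a corpus of programs as a symbol sequence and estimate the mutual information between symbols d apart; a power-law tail would be the criticality signature, exponential decay would not.

    from collections import Counter
    from math import log2

    def mi_at_distance(seq, d):
        # Plug-in estimate of I(X_i; X_{i+d}) pooled over all positions i.
        pairs = list(zip(seq, seq[d:]))
        n = len(pairs)
        joint = Counter(pairs)
        left = Counter(x for x, _ in pairs)
        right = Counter(y for _, y in pairs)
        mi = 0.0
        for (x, y), c in joint.items():
            mi += (c / n) * log2(c * n / (left[x] * right[y]))
        return mi

    # Hypothetical usage on some corpus file:
    # text = open("corpus.txt").read()
    # for d in (1, 2, 4, 8, 16, 32, 64, 128):
    #     print(d, mi_at_distance(text, d))

(The plug-in estimator is biased upward on small samples, so you'd want a lot of data and ideally a bias correction before trusting the tail.)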
It seems to me that if we humans are the only ones doing this, it is either trivially natural or trivially unnatural, depending on whether you include our creations in the natural world or exclude them from it.
"... which explains why natural languages are poorly approximated by Markov processes."
Is that a joke? It's 2016 and you think you need to explain that a Markov process is a poor approximation of natural language? This has been obvious for computational linguists, and anyone working in the field, from day one.
> The Bach data consists of 5727 notes from Partita No. 2 [11], with all notes mapped into a 12-symbol alphabet consisting of the 12 half-tones {C, C#, D, D#, E, F, F#, G, G#, A, A#, B}, with all timing, volume and octave information discarded.
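For what it's worth, that preprocessing step is tiny; something along these lines would do it (my own sketch, assuming the notes arrive as MIDI numbers, not the authors' actual pipeline):

    # Collapse each note to one of the 12 pitch classes, discarding octave
    # (along with timing and volume, which are simply not used).
    PITCH_CLASSES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]

    def to_symbol(midi_note):
        return PITCH_CLASSES[midi_note % 12]

    print([to_symbol(n) for n in (60, 62, 64, 65, 67)])  # ['C', 'D', 'E', 'F', 'G']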
BTW, I went to the North American Chapter of the Association for Computational Linguistics (NAACL) conference in April and it seemed like half the papers used LSTMs.
Edit: the NAACL 2016 papers are here: http://aclweb.org/anthology/N/N16/