This seems like a very important paper. It basically shows that Markov models, where the influence of a token decays exponentially with distance, are often a poor model, whereas deep neural networks with LSTMs (long short-term memory) exhibit power-law decay of influence, which works better for a variety of sequential data.
BTW, I went to the North American Chapter of the Association for Computational Linguistics (NAACL) conference in April and it seemed like half the papers used LSTMs.
Forgive me, but I'm less impressed by the paper. As far as I can tell, they've only really shown that (1) language is recursive, which we know already; (2) Markov models cannot capture recursive languages, which we've known; and (3) RNNs can, which we've known. But so can PCFGs and many other formalisms from the past 25 years, which they ignore.
> We can formalize the above considerations by giving rules for a toy language L over an alphabet A. In the parlance of theoretical linguistics, our language is generated by a stochastic or probabilistic context-free grammar (PCFG) [41–44]. We will discuss the relationship between our model and a generic PCFG in Section C.
The paper itself is a mess. It's very interesting but they aren't doing themselves any favors. They need to clean up the presentation and get rid of the distracting phrases like "fail epically".
> This seems like a very important paper. It basically shows that Markov models, where the influence of a token decays exponentially with distance, are often a poor model, whereas deep neural networks with LSTMs (long short-term memory) exhibit power-law decay of influence, which works better for a variety of sequential data.
As someone outside of this field, this kind of result seems like it should have been very obviously foreseeable - hindsight bias and all that, of course - but I would never have considered Markov processes to be an adequate predictive model for natural language. Though obviously the formalized results are important.
Could someone with more knowledge comment on what the current working assumptions were prior to this paper and what the consequences would be?
> but I would never have considered Markov processes to be an adequate predictive model for natural language
Isn't it the case that for very short distances (a few elements), power-law decay and exponential decay are (or can be made, with proper constants) quite similar? So if predictive models were originally studied only on very short sequences (limited computational resources!), it seems like a mistake anyone could easily have made back then.
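For a rough sense of that (my own toy numbers, not from the paper): with hand-tuned constants, an exponential and a power law stay within a small factor of each other over the first few distances and only end up orders of magnitude apart much further out.

    import numpy as np

    # Exponential vs power-law decay; constants picked by hand so the two
    # roughly agree for d <= 5. They diverge badly at long range.
    d = np.arange(1, 201)
    exponential = np.exp(-d / 3.0)   # ~ e^{-d/lambda}
    power_law = 0.72 * d ** -0.83    # ~ c * d^{-alpha}

    for k in (1, 2, 3, 5, 10, 50, 200):
        print(f"d={k:3d}  exp={exponential[k - 1]:.2e}  power={power_law[k - 1]:.2e}")

If all you ever measured were dependencies over a handful of symbols, the two curves would be hard to tell apart.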
Personally, no. I think all of these models are essentially trivial and a long way from genuine NLP.
That doesn't mean they're not useful in very narrow domains. But language is pretty much the definition of the ultimate wide domain, and trying to cover it with statistical correlations makes as much sense as word counting Shakespeare to try to generate some new plays.
I think you might be pleasantly surprised by recent results using DL and LSTMs for building models of natural language. The next advancement I would like to see is handling anaphora resolution (resolving pronouns to previous noun phrases in text, resolving words like 'there' to a place mentioned elsewhere in the text, etc.). Progress has been so rapid that I bet I don't have to wait long.
Have you seen the results from Dynamic Memory Networks? [0]
The relevant example from the paper:
I: Jane went to the hallway.
I: Mary walked to the bathroom.
I: Sandra went to the garden.
I: Daniel went back to the garden.
I: Sandra took the milk there.
Q: Where is the milk?
A: garden
Obviously just a toy task, but as you said, progress is rapid!
> Deep models are important because without the extra “dimension” of depth/abstraction, there is no way to construct “shortcuts” between random variables that are separated by large amounts of time with short-range interactions; 1D models will be doomed to exponential decay.
From my non-professional perspective, the above seems like it should have been very obvious (and also that correlations between variables in natural language would be better explained by multi-dimensional structure). That is, if you told me that this had been proved / formally supported, as it is in this paper, my reaction would be "that sounds like a reasonable approximation", not "that result sounds very surprising, I must read the paper".
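To make the quoted claim a bit more concrete (my own toy example, not the paper's): in a two-state Markov chain, the covariance between symbols d steps apart is proportional to the chain's second eigenvalue raised to the power d, so it falls off exponentially no matter how the chain is tuned.

    import numpy as np

    # Two-state Markov chain; p and q are arbitrary "stay" probabilities.
    p, q = 0.9, 0.8
    P = np.array([[p, 1 - p],
                  [1 - q, q]])
    lam2 = p + q - 1                              # second eigenvalue (the first is 1)

    pi1 = (1 - p) / ((1 - p) + (1 - q))           # stationary probability of state 1
    pi = np.array([1 - pi1, pi1])

    for d in (1, 2, 4, 8, 16, 32):
        Pd = np.linalg.matrix_power(P, d)
        cov = pi[1] * Pd[1, 1] - pi[1] ** 2       # Cov(X_0, X_d) for the 0/1-valued chain
        print(d, cov, pi[0] * pi[1] * lam2 ** d)  # the two columns agree: exponential decay

The "shortcuts" the authors mention are exactly what's missing here: with only short-range interactions along one dimension, information about the distant past has to be passed through every intermediate step, and it leaks away exponentially.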
@mark_l_watson -- the paper also seems important in another respect -- in the conclusion the authors suggest abandoning loss functions as optimization objectives in machine learning and replacing them with mutual information functions.
Hah, I'll take your word for it, then. :) Are there any recent comprehensive monographs you'd recommend for state-of-the-art NLP, for someone who has yet to enter the field?
Monographs aren't really a thing in NLP, beyond theses. There are a couple of reference books most people lean on, though:
* Foundations of Statistical Natural Language Processing by Manning and Schütze
* Speech and Language Processing by Jurafsky and Martin (which is being revised for a third edition, which you can look at: https://web.stanford.edu/~jurafsky/slp3/ )
Beyond that, you're basically stuck reading the research literature. On the up-side, most of that literature is freely available from the ACL anthology at http://aclweb.org/anthology/
Thank you very much. I'm aware of a large number of books that I could read, but if Sturgeon's law applies, or at least if a lot of the reading would be redundant anyway, I'd rather ask for pointers than waste a lot of time reinventing the wheel. Basically, what I was asking for was NLP's respected equivalent of Norvig and Russell, or similar books from other fields. Those two books seem to fit the bill.
If you have the time, I would start by taking Andrew Ng's machine learning class and then this NLP class when it is next offered in September 2016 https://www.coursera.org/learn/nlp
Ah yes, the one I've heard about but still have to take. :) Well, I guess I should give Coursera a chance. (Somehow I'm not fond of their "timelined" format; it seems redundant if you're communicating with a machine. I hope the future of online learning will avoid it like the plague.)
Not to nitpick, but Max Tegmark is a cosmologist. I only recently learned the difference when I called a cosmologist friend an astrophysicist. Cosmology deals with the big stuff, almost philosophically, like: "where did the universe come from" and "what is the fate of the universe", while astrophysics deals with the nature of the things within, like: "how do stars form" and "what happens when black holes collide".
Max is deeply invested in modeling, analysis and prediction software, and I suspect did the bulk of the work in the paper.
Henry Lin is a student who is focused on astrophysics. He gave an interesting TED talk (http://www.ted.com/speakers/henry_lin) a few years back about studying distant galaxy clusters.
Henry is energetic and almost viscerally inspired by the beauty of science and mathematics, such a wonderful quality! His voice is definitely in the prose of the paper.
> Did a colleague take a peek at the screen and say, hey, I have the same equations?
This would be an interesting thing to try - a computer system that would scan all the papers for math and find parallels. I think we already have something like term indexing for deductive systems?
One of the goals of OEIS is to discover when the same integer sequence arises in different mathematical contexts, and to confirm whether it's really the same sequence, which then gives the opportunity to prove why it's the same.
That might be a bit of a narrower domain, but it seems to work out pretty well!
One of the common threads I've noticed between the best researchers I know is their uncanny ability to draw these parallels between the most obscure domains.
> Corollary: No probabilistic regular grammar exhibits criticality. In the next section, we will show that this statement is not true for context-free grammars (CFGs).
That is, there exist CFGs that exhibit criticality. Programming languages are often parsed by CFGs, so it's likely that some programming languages exhibit the same criticality structure as natural languages.
I think that means only that programs written in a language described by a CFG could exhibit "criticality", not that they will. "Exhibiting criticality" is a property of a distribution (e.g. a corpus of human-written programs or an algorithm for generating programs), not of a grammar, IIUC.
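One way to check that empirically (a rough plug-in sketch of my own, not the estimator the paper uses) is to treat a corpus of programs as a symbol sequence and estimate the mutual information between symbols d apart; a power-law tail would be the criticality signature, exponential decay would not.

    from collections import Counter
    from math import log2

    def mi_at_distance(seq, d):
        # Plug-in estimate of I(X_i; X_{i+d}) pooled over all positions i.
        pairs = list(zip(seq, seq[d:]))
        n = len(pairs)
        joint = Counter(pairs)
        left = Counter(x for x, _ in pairs)
        right = Counter(y for _, y in pairs)
        mi = 0.0
        for (x, y), c in joint.items():
            mi += (c / n) * log2(c * n / (left[x] * right[y]))
        return mi

    # Hypothetical usage on some corpus file:
    # text = open("corpus.txt").read()
    # for d in (1, 2, 4, 8, 16, 32, 64, 128):
    #     print(d, mi_at_distance(text, d))

(The plug-in estimator is biased upward on small samples, so you'd want a lot of data and ideally a bias correction before trusting the tail.)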
It seems to me that if we humans are the only ones doing this, it is either trivially natural or trivially unnatural, depending on whether you include our creations in the natural world or exclude them from it.
"... which explains why natural languages are poorly approximated by Markov processes."
Is that a joke? It's 2016 and you think you need to explain that a Markov process is a poor approximation of natural language? This has been obvious for computational linguists, and anyone working in the field, from day one.
> The Bach data consists of 5727 notes from Partita No. 2 [11], with all notes mapped into a 12-symbol alphabet consisting of the 12 half-tones {C, C#, D, D#, E, F, F#, G, G#, A, A#, B}, with all timing, volume and octave information discarded.
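For what it's worth, that preprocessing step is tiny; something along these lines would do it (my own sketch, assuming the notes arrive as MIDI numbers, not the authors' actual pipeline):

    # Collapse each note to one of the 12 pitch classes, discarding octave
    # (along with timing and volume, which are simply not used).
    PITCH_CLASSES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]

    def to_symbol(midi_note):
        return PITCH_CLASSES[midi_note % 12]

    print([to_symbol(n) for n in (60, 62, 64, 65, 67)])  # ['C', 'D', 'E', 'F', 'G']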
BTW, I went to the North American Chapter of the Association for Computational Linguistics (NAACL) conference in April and it seemed like half the papers used LSTMs.
Edit: the NAACL 2016 papers are here: http://aclweb.org/anthology/N/N16/