Critical Behavior from Deep Dynamics: A Hidden Dimension in Natural Language (arxiv.org)
175 points by hacker42 on July 13, 2016 | 44 comments


This seems like a very important paper. It basically shows that Markov models, in which a token's influence decays exponentially with distance, are often a poor fit, whereas deep neural networks with LSTMs (long short-term memory) exhibit power-law decay of influence, which performs better for a variety of sequential data.
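As a toy illustration of the exponential-decay claim (my sketch, not from the paper): for any finite-state Markov chain, the mutual information between symbols d steps apart falls off geometrically in d, governed by the second eigenvalue of the transition matrix.

```python
import numpy as np

# Two-state Markov chain; a toy check that mutual information
# between symbols d steps apart decays exponentially.
P = np.array([[0.9, 0.1],
              [0.2, 0.8]])

# Stationary distribution (left eigenvector of P for eigenvalue 1).
evals, evecs = np.linalg.eig(P.T)
pi = np.real(evecs[:, np.argmax(np.real(evals))])
pi /= pi.sum()

def mutual_information(d):
    """I(X_0; X_d) in nats for the chain above."""
    Pd = np.linalg.matrix_power(P, d)
    joint = pi[:, None] * Pd              # P(X_0 = i, X_d = j)
    indep = np.outer(pi, pi)
    mask = joint > 0
    return float(np.sum(joint[mask] * np.log(joint[mask] / indep[mask])))

for d in (1, 2, 4, 8, 16):
    print(d, mutual_information(d))
# The values fall off geometrically in d (exponential decay),
# roughly as the second eigenvalue of P (here 0.7) to the power 2d.
```

The paper's point is that mutual information in natural-language corpora instead falls off like a power law, which no model of this form can reproduce.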

BTW, I went to the North American Chapter of the Association for Computational Linguistics (NAACL) conference in April, and it seemed like half the papers used LSTMs.

Edit: the NAACL 2016 papers are here: http://aclweb.org/anthology/N/N16/


Forgive me, but I'm less impressed by the paper. As far as I can tell, they've only really shown that (1) language is recursive, which we know already; (2) Markov models cannot capture recursive languages, which we've known; and (3) RNNs can, which we've known. But so can PCFGs and many other formalisms from the past 25 years, which they ignore.

I did not read it very closely though.


From the article:

> We can formalize the above considerations by giving rules for a toy language L over an alphabet A. In the parlance of theoretical linguistics, our language is generated by a stochastic or probabilistic context-free grammar (PCFG) [41–44]. We will discuss the relationship between our model and a generic PCFG in Section C.


The paper itself is a mess. It's very interesting but they aren't doing themselves any favors. They need to clean up the presentation and get rid of the distracting phrases like "fail epically".


> This seems like a very important paper. It basically shows that Markov models, in which a token's influence decays exponentially with distance, are often a poor fit, whereas deep neural networks with LSTMs (long short-term memory) exhibit power-law decay of influence, which performs better for a variety of sequential data.

As someone outside this field, it seems to me that this kind of result should have been obviously foreseeable (hindsight bias and all that, of course), but I would never have considered Markov processes an adequate predictive model for natural language. The formalized results are obviously important, though.

Could someone with more knowledge comment on what the current working assumptions were prior to this paper and what the consequences would be?


> but I would never have considered Markov processes to be an adequate predictability model for natural language

Isn't it the case that for very short distances (a few elements), power-law decay and exponential decay are (or can be made, with suitable constants) quite similar? If predictive models were originally studied only on very short sequences (limited computational resources!), it seems plausible that this is a mistake anyone could easily have made back then.
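To make that concrete (toy constants of my own choosing, not from the paper): match a power law and an exponential at d = 1 and they track each other closely for the first few elements, then diverge by orders of magnitude.

```python
import math

# Toy comparison: with suitable constants, a power law and an
# exponential look alike over the first few elements, then diverge.
def power_law(d, c=1.0, alpha=1.0):
    return c * d ** -alpha

def exponential(d, c=1.0, beta=0.5):
    return c * math.exp(-beta * (d - 1))

for d in (1, 2, 3, 5, 10, 50):
    print(d, round(power_law(d), 4), round(exponential(d), 6))
# At d <= 3 the two are within a factor of ~1.2 of each other;
# by d = 50 the exponential is smaller by about nine orders of magnitude.
```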


> I would never have considered Markov processes to be an adequate predictability model for natural language.

Would you consider LSTM an adequate model?


Personally, no. I think all of these models are essentially trivial and a long way from genuine NLP.

That doesn't mean they're not useful in very narrow domains. But language is pretty much the definition of the ultimate wide domain, and trying to cover it with statistical correlations makes as much sense as word counting Shakespeare to try to generate some new plays.


I think you might be pleasantly surprised by recent results using DL and LSTM for building models of natural language. The next advancement I would like to see is handling anaphora resolution (resolving pronouns to previous noun phrases in text, resolving words like 'there' to a place mentioned elsewhere in text, etc.) Progress has been so rapid that I bet I don't have to wait long.


Have you seen the results from Dynamic Memory Networks? [0]

The relevant example from the paper:

  I: Jane went to the hallway.
  I: Mary walked to the bathroom.
  I: Sandra went to the garden.
  I: Daniel went back to the garden.
  I: Sandra took the milk there.
  Q: Where is the milk?
  A: garden
Obviously just a toy task, but as you said, progress is rapid!

[0]: http://arxiv.org/abs/1506.07285


Thanks for the link!


> I think you might be pleasantly surprised by recent results using DL and LSTM for building models of natural language.

What exactly are those models modelling?


> Deep models are important because without the extra “dimension” of depth/abstraction, there is no way to construct “shortcuts” between random variables that are separated by large amounts of time with short-range interactions; 1D models will be doomed to exponential decay.

From my non-professional perspective, the above seems like it should have been fairly obvious (and also that correlations between variables in natural language would be better explained by a multi-dimensional structure). That is, if you told me this had been proved / formally supported as it is in this paper, my reaction would be "that sounds like a reasonable approximation", not "that result sounds very surprising, I must read the paper".


@mark_l_watson -- the paper also seems important in another respect: in the conclusion, the authors suggest abandoning loss functions as optimization objectives in machine learning and replacing them with mutual-information functions.


> This seems like a very important paper

> mark_l_watson

Hah, I'll take your word for it, then. :) Are there any recent comprehensive monographs you'd recommend for state-of-the-art NLP, for someone who has yet to enter the field?


Monographs aren't really a thing in NLP, beyond theses. There are a couple of reference books most people lean on, though:

* Foundations of Statistical Natural Language Processing by Manning and Schütze

* Speech and Language Processing by Jurafsky and Martin (which is being revised for a third edition, which you can look at: https://web.stanford.edu/~jurafsky/slp3/ )

Beyond that, you're basically stuck reading the research literature. On the up-side, most of that literature is freely available from the ACL anthology at http://aclweb.org/anthology/


Thank you very much. I'm aware of a large number of books I could read, but if Sturgeon's law applies, or at least if a lot of the reading would be redundant anyway, I'd rather ask for pointers than waste a lot of time reinventing the wheel. Basically, what I was asking for was NLP's respected equivalent of Norvig and Russell, or similar books from other fields. Those two books seem to fit the bill.


If you have the time, I would start by taking Andrew Ng's machine learning class and then this NLP class when it is next offered in September 2016 https://www.coursera.org/learn/nlp


> Andrew Ng's machine learning class

Ah yes, the one I've heard about but still have to take. :) Well, I guess I should give Coursera a chance. (Somehow I'm not fond of their "timelined" format; it seems redundant if you're communicating with a machine. I hope the future of online learning will avoid it like the plague.)


> [...] A Hidden Dimension in Natural Language

Mmmh

> [...] We show that in many data sequences — from texts in different languages to melodies and genomes

Hum, ehrm

> [...] natural languages are poorly approximated by Markov processes.

Alright, alright

> [...] This model class captures the essence of probabilistic context-free grammars

Ok, ok

> [...] and cosmological inflation

Wat.

Out of nowhere, Creation of the Universe.

-------------

I'm always baffled by the ability to draw parallels. Did a colleague take a peek at the screen and say, "hey, I have the same equations"?


Worth noting the authors, Henry Lin and Max Tegmark, are both astrophysicists. Among other things.


Not to nitpick, but Max Tegmark is a cosmologist. I only recently learned the difference when I called a cosmologist friend an astrophysicist. Cosmology deals with the big stuff, almost philosophically, like "where did the universe come from" and "what is the fate of the universe", while astrophysics deals with the nature of the things within, like "how do stars form" and "what happens when black holes collide".

Max is deeply invested in modeling, analysis and prediction software, and I suspect did the bulk of the work in the paper.

Henry Lin is a student who is focused on astrophysics. He gave an interesting TED talk (http://www.ted.com/speakers/henry_lin) a few years back about studying distant galaxy clusters.

Henry is energetic and almost viscerally inspired by the beauty of science and mathematics, such a wonderful quality! His voice is definitely in the prose of the paper.


> Did a colleague take at peek at the screen and said, hey I have the same equations?

This would be an interesting thing to try: a computer system that scans all papers for math and finds parallels. I think we already have something like term indexing for deductive systems?


This is theoretically very possible, and I know that at least a few people (http://ccimi.maths.cam.ac.uk/projects/create-semantic-search...) are working on it.


Very yummy, thanks!


One of the goals of OEIS is to discover when the same integer sequence arises in different mathematical contexts, and to confirm whether it's really the same sequence, which then gives the opportunity to prove why it's the same.

That might be a bit of a narrower domain, but it seems to work out pretty well!

https://oeis.org/


One of the common threads I've noticed between the best researchers I know is their uncanny ability to draw these parallels between the most obscure domains.


Heh! For some reason, that reminded me of this famous conversation:

"...and that, my liege, is how we know the Earth to be banana-shaped."

"This new learning amazes me, Sir Bedevere. Explain again how sheep's bladders may be employed to prevent earthquakes."


Max Tegmark is a cosmologist who writes ~5 serious papers and then one "out there" paper.

https://en.wikipedia.org/wiki/Max_Tegmark


Maybe this whole article has been generated by one of their models..


I wonder what programming languages would look like under this analysis. A lot like natural language, I'd guess?


They say their results from Wikipedia data were influenced by XML tags, so programming languages might look a lot like their Wikipedia data.

For reasons like this, I don't trust their empirical results at all.


Not quite, but as the paper says:

    Corollary: No probabilistic regular grammar
    exhibits criticality.

    In the next section, we will show that this statement is
    not true for context-free grammars (CFGs).

That is, there exist CFGs that exhibit criticality. Programming languages are often parsed by CFGs, so it's likely that some programming languages exhibit the same criticality structure as natural languages.
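For intuition about why tree-structured grammars behave differently from chains, here's a minimal PCFG sampler (a hypothetical toy grammar of my own, not the paper's): symbols far apart in the emitted string can still share a common ancestor high in the derivation tree, and that tree depth is the extra "dimension".

```python
import random

# Toy PCFG: S -> S S with probability q, else S -> 'a' | 'b'.
# With q < 0.5 the derivation terminates almost surely; a depth
# cap guards against rare long runs.
random.seed(0)

def sample(q=0.4, depth=0, max_depth=50):
    """Return the terminal string of one derivation as a list of chars."""
    if depth >= max_depth or random.random() >= q:
        return [random.choice("ab")]
    # Branch: two independent subtrees, whose leaves end up adjacent
    # in the string but may be arbitrarily far apart within them.
    return sample(q, depth + 1) + sample(q, depth + 1)

sentence = "".join(sample())
print(sentence)
```

Whether a given grammar actually exhibits criticality depends on the production probabilities, which is the point the reply below makes.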


I think that means only that programs written in a language described by a CFG could exhibit "criticality", not that they will. "Exhibiting criticality" is a property of a distribution (e.g. a corpus of human-written programs or an algorithm for generating programs), not of a grammar, IIUC.


Here's an implemented Markov + word2vec chatbot http://lexcognition.com/lexi.html
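For anyone curious what the Markov half of that looks like, here's a bare-bones word-level sketch (my own illustration, not the linked bot's code; the word2vec part, which would pick semantically related states, is omitted):

```python
import random
from collections import defaultdict

def train(text):
    """Map each word to the list of words observed to follow it."""
    model = defaultdict(list)
    words = text.split()
    for prev, nxt in zip(words, words[1:]):
        model[prev].append(nxt)
    return model

def generate(model, start, length=8):
    """Walk the chain from `start`, sampling successors at random."""
    out = [start]
    for _ in range(length - 1):
        choices = model.get(out[-1])
        if not choices:
            break
        out.append(random.choice(choices))
    return " ".join(out)

random.seed(1)
model = train("the cat sat on the mat and the dog sat on the rug")
print(generate(model, "the"))
```

Each word is chosen from the successors of the previous word only, which is exactly the short-memory property the paper argues against; the sample dialogue below shows how quickly that memory runs out.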


I am VERY impressed:

-----

you: tell me something

me: Don't speak for me first time I asked it

you: tell me about your mother

me: just like my mother is on that AK47 diet

you: Tell me about artificial intelligence

me: Okay , maybe not intelligence capabilities, etc

you: ask me something

me: Points bow No one would ask this haha


Hmm... maybe the way we currently encode data digitally is also "wrong", or should I say unnatural.


It seems to me that if we humans are the only ones doing this, it is either trivially natural or trivially unnatural, depending on whether you count our creations as part of the natural world.


"... which explains why natural languages are poorly approximated by Markov processes."

Is that a joke? It's 2016 and you think you need to explain that a Markov process is a poor approximation of natural language? This has been obvious for computational linguists, and anyone working in the field, from day one.


  The Bach data consists of 5727 notes from Partita No. 2 [11],
  with all notes mapped into a 12-symbol alphabet
  consisting of the 12 half-tones {C, C#, D, D#, E, F, F#, G, G#, A, A#, B}
  with all timing, volume and octave information
  discarded.

I was good until the last part.


Can you elaborate?


I misread your comment as saying you discarded the paper. Actually, 'discarded.' is part of the quote.


Was this a sample of text generated by a Markov chain or their 'new' model?


Oh. I feel a little silly.



