Transformers for software engineers (nelhage.com)
175 points by datkinson on April 2, 2022 | hide | past | favorite | 20 comments


Thanks for this writeup! Software engineers should definitely learn about Transformers. They represent a significant step-change in deep learning and have proven useful in multiple domains outside of natural language, including computer vision and audio.

This opens up the world of multi-modal models, which represent concepts by combining multiple types of input. The most popular has been OpenAI's CLIP [0].

If anyone's interested, I run a podcast and did an episode on Transformer models and their implications [1]. Check it out!

[0] https://openai.com/blog/clip/ [1] https://www.youtube.com/watch?v=Kb0II5DuDE0


Just a fun technical note:

CLIP would be possible with any language and vision encoder.

It turns out that transformers were the most efficient encoder choice available at the time, but the approach would most likely have given interesting results with a ResNet and some sort of convolutional language encoder. In fact, the paper uses a ResNet as one model for the vision side.
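To make that concrete: at inference time, CLIP-style zero-shot classification doesn't care what architecture produced the embeddings, it's just a nearest-neighbor search in a shared embedding space. A minimal sketch (the vectors here are made-up toy values, not real CLIP embeddings):

```rust
// Zero-shot classification over a shared embedding space: pick the
// text embedding most similar (by cosine similarity) to the image
// embedding. The encoders that produced the vectors are interchangeable.

fn dot(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| x * y).sum()
}

fn normalize(v: &[f32]) -> Vec<f32> {
    let n = dot(v, v).sqrt();
    v.iter().map(|x| x / n).collect()
}

/// Return the index of the label embedding most similar to the image embedding.
fn zero_shot_classify(image: &[f32], labels: &[Vec<f32>]) -> usize {
    let img = normalize(image);
    labels
        .iter()
        .map(|l| dot(&img, &normalize(l)))
        .enumerate()
        .max_by(|a, b| a.1.partial_cmp(&b.1).unwrap())
        .map(|(i, _)| i)
        .unwrap()
}

fn main() {
    let image = vec![0.9, 0.1, 0.0];
    let labels = vec![
        vec![1.0, 0.0, 0.0], // e.g. "a photo of a dog"
        vec![0.0, 1.0, 0.0], // e.g. "a photo of a cat"
    ];
    println!("{}", zero_shot_classify(&image, &labels)); // prints 0
}
```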


Yeah, they trained multiple ResNet models. The ViTs (vision transformers) tended to outperform the ResNets, although they evaluate on a wide range of datasets and I believe the ResNet variants are better in some cases. They have also been used for visual salience tasks, as they may have a better representation of positional information in a scene.

I think they do use a modified ResNet with some form of attention in it, but I'm not clear on the specifics, and my understanding is that that's somewhat common now.


The Limitations section of [0] isn't very encouraging.

> While CLIP usually performs well on recognizing common objects, it struggles on more abstract or systematic tasks such as counting the number of objects in an image and on more complex tasks such as predicting how close the nearest car is in a photo. On these two datasets, zero-shot CLIP is only slightly better than random guessing. Zero-shot CLIP also struggles compared to task specific models on very fine-grained classification, such as telling the difference between car models, variants of aircraft, or flower species.

> CLIP also still has poor generalization to images not covered in its pre-training dataset. For instance, although CLIP learns a capable OCR system, when evaluated on handwritten digits from the MNIST dataset, zero-shot CLIP only achieves 88% accuracy, well below the 99.75% of humans on the dataset. Finally, we’ve observed that CLIP’s zero-shot classifiers can be sensitive to wording or phrasing and sometimes require trial and error “prompt engineering” to perform well.


This is a really accessible write-up.

Although I agree with the comment elsewhere asking for an "examples are in Rust" note somewhere, to save everyone who doesn't instantly recognize it the 15 seconds of squinting: "is this … C++, or …?"

Also funny that I just shared a Twitter thread on this topic with a coworker.

I was sharing this prof's POV of "wow, it's amazing to think all the hard-won gains I ever made with deep feature engineering and tuning are now useless!"

Sharing in case others find it helpful.

https://twitter.com/moyix/status/1469401502422818823?s=10&t=...

Which references this thread:

https://twitter.com/karpathy/status/1468370605229547522?s=10...


There are millions of "Transformers Explained" blog posts by now. The one I got the most out of is "Transformers from Scratch" by Peter Bloem:

http://peterbloem.nl/blog/transformers


I hoped this would be an explanation of electricity, magnetism, the Lorentz force and induction.


Principles probably 10x more useful than ML gobbledygook. We need more fundamental research.


Curious why you think this. Do you not believe there is some value in automation of certain tasks?


No, he doesn't believe it is useless; he said it was probably 0.1x as useful as something else, so unless he thinks that something else is useless, he considers this knowledge to still be useful.


Here is one that shows how to implement Transformers using Excel.

https://www.youtube.com/watch?v=S9eKuRVigjY


Thanks for sharing this!


It's funny how nobody outside expensive classes at Stanford says anything about the relationship of attention to a (somewhat fuzzy) lookup table. With that in mind, transformers suddenly become easy for software engineers.
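For anyone who hasn't seen that framing: a hash map returns the value for the one exactly-matching key, while attention returns a weighted blend of all values, weighted by how well each key matches the query. A minimal single-query sketch (all vectors are toy data):

```rust
// Attention as a "soft" lookup table: score the query against every key,
// softmax the scores into weights, and return the weighted sum of values.

fn dot(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| x * y).sum()
}

fn softmax(scores: &[f32]) -> Vec<f32> {
    let max = scores.iter().cloned().fold(f32::MIN, f32::max);
    let exps: Vec<f32> = scores.iter().map(|s| (s - max).exp()).collect();
    let sum: f32 = exps.iter().sum();
    exps.iter().map(|e| e / sum).collect()
}

/// Single-query attention: a fuzzy lookup over (key, value) pairs.
fn attention(query: &[f32], keys: &[Vec<f32>], values: &[Vec<f32>]) -> Vec<f32> {
    let scores: Vec<f32> = keys.iter().map(|k| dot(query, k)).collect();
    let weights = softmax(&scores);
    let dim = values[0].len();
    let mut out = vec![0.0; dim];
    for (w, v) in weights.iter().zip(values) {
        for i in 0..dim {
            out[i] += w * v[i];
        }
    }
    out
}

fn main() {
    // The query matches the first key strongly, so the output is
    // almost exactly the first value -- a near-"hard" lookup.
    let out = attention(
        &[10.0, 0.0],
        &[vec![1.0, 0.0], vec![0.0, 1.0]],
        &[vec![1.0, 0.0], vec![0.0, 1.0]],
    );
    println!("{:?}", out);
}
```

Turn the query/key match into an exact one-hot and you recover an ordinary table lookup; everything in between is the "fuzzy" part.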


Very good content! It's interesting that you use Rust here.


(author here)

I actually did a v0 writeup in Go, but I wanted a language that had a bit more powerful type system and more support for a fluent/functional style in some of the expressions. I was optimizing for what I know and felt was highly expressive; I've since gotten a bunch of feedback from people who are interested in the content but not comfortable reading Rust, so perhaps it wasn't the best choice.


What does the `.0` calling convention mean in Rust? e.g. ``` for (i, r) in right.0.iter().enumerate() { out.0[i] += r; } ```

Aside from that, I thought it was fairly legible. Great write-up by the way. Squashing things into state helps get rid of some of the spookiness created by matrix multiplication and back-propagation. I also really appreciated seeing the explanation on the actual MLP part of the transformer as that is typically assumed to be prior knowledge in other tutorials.
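For readers coming from other tutorials: the MLP (feed-forward) block mentioned above is just two linear maps with a nonlinearity in between, applied to each position independently. A hedged sketch with toy weights (real transformers typically use GELU and learned parameters; ReLU is used here for simplicity):

```rust
// The transformer's per-position feed-forward block:
//   y = W2 * relu(W1 * x)   (biases omitted for brevity)

fn dot(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| x * y).sum()
}

fn relu(x: f32) -> f32 {
    x.max(0.0)
}

/// Apply the two-layer MLP to a single position's vector.
/// w1 projects up to the hidden dimension, w2 projects back down.
fn mlp(x: &[f32], w1: &[Vec<f32>], w2: &[Vec<f32>]) -> Vec<f32> {
    let hidden: Vec<f32> = w1.iter().map(|row| relu(dot(row, x))).collect();
    w2.iter().map(|row| dot(row, &hidden)).collect()
}

fn main() {
    // 1-d input, 2-d hidden layer, 1-d output, with made-up weights.
    let out = mlp(&[1.0], &[vec![2.0], vec![-1.0]], &[vec![1.0, 1.0]]);
    println!("{:?}", out);
}
```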


It's to access the tuple struct's single field (the newtype pattern). Note that `#[repr(transparent)]` only affects memory layout; to avoid writing `.0` everywhere, the author could have implemented `Deref` or indexing on the wrapper instead.
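To illustrate (the `Logits` type here is a hypothetical stand-in, not one of the author's actual types): a tuple struct with one field is read with `.0`, and implementing `Deref` lets callers skip it.

```rust
use std::ops::Deref;

// Newtype wrapper around a vector, as in the article's Rust code.
struct Logits(Vec<f32>);

// Deref makes the wrapper transparently usable as its inner type.
impl Deref for Logits {
    type Target = Vec<f32>;
    fn deref(&self) -> &Vec<f32> {
        &self.0
    }
}

fn main() {
    let l = Logits(vec![1.0, 2.0]);
    // Explicit field access, as in the article:
    assert_eq!(l.0.len(), 2);
    // With Deref, the `.0` disappears:
    assert_eq!(l.len(), 2);
}
```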


You should mention that you're using Rust. I got distracted trying to figure out what language you were using.


Python would be awesome!


I think it’s better to use other languages for pseudocode. Rust is not the easiest language to read and understand for people who have never used it.



