Writing a Compiler is Surprisingly Easy (part 1) (sebmestre.blogspot.com)
284 points by mmphosis on Nov 7, 2023 | hide | past | favorite | 130 comments


Writing a compiler is not too difficult. Most of the techniques and procedures are well researched and documented. The best thing is that you can pick the parts that are useful and stop instead of implementing the full blown end to end pipeline.

Recently in an app, instead of supporting the bare minimum I wanted to support somewhat more sophisticated search expressions, like embedded commands and compound conditional expressions. I just hand-rolled the lexical tokenizer and the parser, and stopped without going the full mile. The tokenizer used simple regexes to break up the terms and tagged them with token types. I came up with a grammar for the mini language. The parser used the standard recursive descent technique on the grammar productions. It was simple to handle errors and auto-complete missing syntax because the parser knew what to expect at the point of error and could fill in default nodes in the tree as needed. The generated AST was enough to implement the search tree and the embedded commands. The whole thing was only a couple of files and done in a couple of days, but it's very useful for the app.


I wrote a parser/interpreter for VBA a few years back, and even though the core concepts were straightforward, conforming to a language that had been in development for decades turned out to be brain-wreckingly tedious. One of the biggest parts of it was a block of thousands(?) of conditional statements to figure out the result type of every combination of type+operator+type. At the end of the ordeal, though, I threw the engine into a test app that looked like GWBASIC, and got a little nostalgia kick from being able to run BASIC like it was 1983.


Same boat here. I wrote a VB-to-C++ transpiler to make a VB6-like IDE for PalmOS (basically a clone of the existing Handheld Basic++). The type system was a mess.

After one year of development in my free time I had the compiler and a form editor, but lots of out-of-memory errors when running the code on the simulator (and real devices) :-|

At the same time, Palm sold the OS licenses to ACCESS Inc., IIRC, and the future of the platform was sealed. Long story short, I shelved the project and lost it during one of my cyclical "move backups to a bigger disk" migrations.


It's amazing to implement a full interpreter for VBA.

I would imagine a lot of the type resolution is redundant. E.g. +, -, *, and / would have the same type resolution for numbers, so those can reuse the same type resolver. Maybe + also works on strings; then additionally call the string type resolver to see if it can be resolved.

Anyway, kudos for the amazing work.


I have an open source compiler-interpreter for VBA here: https://github.com/sancarn/stdVBA/blob/master/src/stdLambda....

In case you were interested :P (I say compiler; it compiles to bytecode, not asm.)


Is that really needed? You can use VBA to do all the heavy lifting. That's what we do here: https://github.com/sancarn/stdVBA/blob/master/src/stdLambda....


> One of the biggest parts of it was a block of thousands(?) of conditional statements to figure out the result type of every combination of type+operator+type

This is not uncommon for a dynamically typed language; I think the simplest approach is a multidimensional array lookup.


Did you need thousands of conditionals? Couldn't you map types and operators into an efficient representation for a single lookup?!


That, and there were never official grammars published for any version of any BASIC from Microsoft, AFAIK.


I doubt they ever existed.

I had to write a VBScript parser some 25 years ago (to syntax-check ASP files before they would crash live) and to get it to parse valid syntax correctly, I ended up with a messy patchwork of logic which was also letting through some really pathological edge cases. While trying to fix it, to my utter delight, I found that Microsoft’s parsers also allowed the exact same cases which looked nothing like valid code otherwise. I don’t recall specific examples anymore, but they were far from obvious without looking at the parser code, and I made and won bets with other programmers on whether those snippets would throw syntax errors or not.


> I doubt they ever existed.

Probably right, since Microsoft's BASIC interpreter is probably derived from the one Bill Gates wrote as a teenage slacker. He wrote it, tried to sell it at the Homebrew Computer Club, and then threw a hissy fit when people kept copying his paper tapes for their friends. Thus, Microsoft was born. (He got lucky later when Microsoft was chosen by IBM to develop a DOS, then bought a working DOS from somebody else. The rest, as they say, is history.)


By the time Microsoft secured a deal with IBM, they were already a multimillion-dollar company, which was quite significant in 1980. Judging by their future moves, it's evident that luck wasn't the only factor contributing to their massive success.


Microsoft was the most important vendor of programming language compilers and interpreters for the CP/M operating system, for computers based on Intel 8080 or Zilog Z80.

They had compilers for many programming languages, but before buying the operating system that they proudly rebranded as MS-DOS and resold to IBM at a markup, they had nothing to do with operating systems.

So the deal had nothing to do with their previous technical competence; they just happened to know the right information about who was selling and who was buying, which allowed them to gain the most as the middlemen.


> Writing a compiler is not too difficult.

Wait till you're faced with the endless .h files with their inevitable use of nutburger undocumented C extensions.


Here's a secret trick to parsing C headers: whenever you hit an error, discard the line and restart parsing. If you have a fallback explicit FFI, most ~mystery extensions~ can be done without.


Don't write a compiler for C. Write compilers that convert your sources to C.

You create the header files.


D's ImportC compiler can convert C code to D!


Works for Swift!


Everyone should learn about bison/yacc/lex.


> We will store that intermediate value in the stack, and read it back after the right term is done computing. If we take care that the code we generate for the right term doesn't write into the same stack slot, then we can be sure that the value will be preserved.

There are push and pop instructions which are perfectly suited to this nested value use. In the past, lamenting how many compilers don't seem to realise that (and the fact that push/pop are specially optimised on x86 via hardware known as the stack engine), I wrote a piece of a code generator with similar capabilities to the one in this article as a proof-of-concept, and with the addition of some trivial (rotating FIFO-like) register allocation, was able to generate surprisingly compact and fast code that looks like it could've been written by a human Asm programmer.

I believe Crenshaw's famous compiler tutorial also makes the same simplification, and it's also a powerful generalisation since it means that arbitrarily deep nested expressions can be compiled, as long as there is sufficient stack space.


> There are push and pop instructions which are perfectly suited to this nested value use. In the past, lamenting how many compilers don't seem to realise that (and the fact that push/pop are specially optimised on x86 via hardware known as the stack engine)

That's fair, but in the article I also store local variables relative to %rsp, so I wouldn't want it changing at all

I could use %rbp for that instead, but then I'd have to explain it in the article :^)


> That's fair, but in the article I also store local variables relative to %rsp, so I wouldn't want it changing at all

You only need to keep track of how much you've put on the stack (which can be a single counter) and adjust the offsets appropriately.

But then you'd have to explain that too...


crenshaw mostly just emits pushes and pops; he recommends a simpler version of that rotating fifo technique (just spilling deeply nested values onto the stack so you get a performance cliff, instead of pushing old values out of the fifo) but doesn't actually show it

in ur-scheme i just pushed and popped, getting performance much less terrible than i expected, but still probably a third of the performance the rotating fifo register allocator would give


Writing a toy compiler for one sane input language and one target platform is indeed surprisingly easy, and very educational. I had to do this back in the 90's for C when the O/S SW group I worked for needed to start work on a UNIX port to a new proprietary architecture (simulator) and the company's compiler group wasn't getting anything started any time soon. It took maybe 3-4 KLOC to get a boring old recursive descent parser and simple code generator working well enough to compile a minimal kernel configuration and pass a compiler self-replication test. The most interesting bit was actually the preprocessor.


Writing a compiler was part of the undergraduate compsci curriculum back where I did it. I still have vague memories of ancient tools like Flex and Bison (and I’m puzzled that the author seems to be hand-rolling code rather than use existing tooling).


AFAIK nowadays it is more common to write a recursive descent parser to build whatever structure you want to work with (if not use the parsing directly) than rely on Flex/Bison.

Personally I've written so many parsers over the years that the code for writing one "flows" naturally, and I feel like Flex/Bison would introduce complexity and dependencies for no gain.


I keep writing parser generators because I hate writing parsers manually, but keep returning to writing parsers manually because making a parser generator survive the first few meetings with reality without the pretty grammar turning into a hairy beast (when e.g. adding error recovery or trying to wrangle it into producing the trees you'd prefer rather than what fits the structure of the grammar) remains elusive.

Ironically, I think the sheer number of parser generators reflects our collective failure to make a good one more so than their amazing utility (sure, some projects do actually use them).

I'm knee deep in yet another parser generator these days, and I'm mildly hopeful (though that's always the case at the outset) of taking an augmented BNF variant and producing something DFA-like that retains links from each state to higher-level rules, keeping enough information to backtrack to a reasonable place to report errors from and/or resume parsing after an error.

I doubt the overall approach is novel in any way, and I suspect I'll be back to hand rolling parsers soon enough, but who knows, I might end up with some worthwhile reusable components from it...


there are boatloads of projects that use flex and bison, or antlr, for their parsers. handy here right now i have gcc 0.9, gforth, yosys's ilang and verilog frontends, linux kconfig, perf, and pcc, all using yacc or bison. also wget uses lex for css (maybe flex) but not yacc. but i only have the source code of a few dozen free software projects handy here to grep

many of them do eventually switch to shotgun parsers, but in a sense that's sort of like rewriting a c subroutine in assembly; it gives you, usually, modestly more benefit, for enormously more effort — but it's a benefit you can't get any other way

i recently worked on a project that parsed pdf files with a peg parser generator, a parser generator we had to extend in significant ways during the project in order to get what we needed


There is also the ANTLR parser generator, which generates a top-down parser. It's nice to write a grammar in. Still, almost all production lexers and parsers are hand written.

https://www.antlr.org


> Still, almost all production lexers and parsers are hand written.

As a counterpoint, I consider the JetBrains parsing system to be world class and they hand-write very few (instead building on this system: https://github.com/JetBrains/Grammar-Kit#readme )

- https://github.com/JetBrains/intellij-community/blob/idea/23... (the parser I'll concede, as they do seem to be hand-rolling that part)

- https://github.com/JetBrains/intellij-community/blob/idea/23... (same for its parser)

- https://github.com/JetBrains/intellij-community/blob/idea/23... and https://github.com/JetBrains/intellij-community/blob/idea/23...

- https://github.com/JetBrains/intellij-plugins/blob/idea/233.... and https://github.com/JetBrains/intellij-plugins/blob/idea/233....


it introduces knowing where the ambiguities are in your grammar and eliminates large classes of parsing bugs

it's also dramatically easier for other people to understand


Unfortunately `%expect` is so easy to reach for, even in yacc.


Those Dodekalogue people are laughing.


Undergrad was when I learned about the subject. It was done with tools like Lex and Yacc, along with the Dragon book. These tools are fine but only work in C. Yacc/Bison generate LALR bottom up parsers, which can be difficult to debug and use. Also it's difficult to implement error recovery. If you want to do it in another language or want to have more control, hand rolling the lexer and parser is not too bad.


yeah had to do it as part of a Carnegie Mellon undergrad course. One of the higher level optional ones. Generally worthwhile experience.


I can agree with that approach, because I used it to write a C compiler for a fantasy console I made (Vircon32). Since it has a fully custom ISA and set of chips I could not use something like LLVM, so I wrote it more "manually".

The first versions were very barebones (they supported little more than variables, expressions and if/else), but over time I expanded it to support loops, arrays, pointers, structures... nowadays it's still not a full C compiler but it has been good enough for me to write several full games reliably, and even a few people have been using it too.


Playing chess is easy. Playing good chess - not so much. Playing world class chess - good luck.


But a lot of programmers think compilers are magic. Maybe because they read about stuff like parser generators, register allocators, optimizations, etc., that are not strictly necessary. It's empowering to remind people that a seemingly hard task is within their reach.


People always think that what they don't understand is magic, and when they do understand it they think it's simple.

For example: React, Operating Systems, AI, CSS and assembly.


For AI, it is magic all the way down.

Understanding how the math works is one level. Understanding _why_ the math works is another level.


For some flavors of AI it's not simple, true.


React is simple?


Sometimes people just want to learn how to play chess, and that's ok.


To elaborate, I've written a number of compilers in years past. I highly recommend it as an exercise in learning about parsing, code emission, etc. Anything that gets a developer closer to understanding that process, rather than a mystery black box, is a very good thing. So I'm not trying to be a downer, just noting that it can rapidly become an extremely complex task to make an excellent compiler which includes end user expected code optimizations, informative error messages, syntax error recovery, etc.


I'm not sure anyone would think otherwise, though.


Wow! My blog made the HN front page!

I've been lurking for a while but only now made an account :^)


Thanks for your article. Perhaps (after some more parts) you'd like to compare how tinycc (tcc) [1] differs from your approach. If tcc is more complex, is that only due to dealing with C, or because of more general issues?

[1] https://repo.or.cz/tinycc.git


thanks for writing it! i look forward to seeing future installments


Writing a toy compiler is easy enough - often given as a comp. sci. student exercise. Writing a highly optimizing standards compliant one, one with dozens of real world constraints is another story, especially if you are writing the whole thing and not building atop of something like LLVM.

It's a bit like saying that building a rocket is no big deal, and wanting others to assume you've built a space shuttle when really it's a hobby rocket.


Exactly this: the complexity of a production compiler project is its sheer vastness. Hundreds to thousands of language features need to work together in a performant and correct manner.

Plus, the language itself is probably less than 15% of the work. You'll need a build system, a standard library, a code formatter, a syntax highlighter, support for incremental and cached builds, helpful error messages, bindings for a systems language like C, among other things.


i'd have phrased it more like building a shed vs. building a mansion or shopping mall.

you can learn a lot of basics building a shed -- pouring concrete, adding spaces for wiring, drainage, etc. -- but ultimately it's just a small one off for learning.


"Writing a Compiler is Surprisingly Easy (part 1/72)"

Joke aside, great write-up! Thank you.


Wirth's classic recursive descent Pascal compiler in his 1976 book Algorithms + Data Structures = Programs is a basic compiler for a somewhat useful language. That's kind of dated. What's a modern equivalent?


https://mitpress.mit.edu/author/jeremy-g-siek-37497/ - He has two books, both are also available through open access. One is in Python implementing a Python-ish language and the other in Racket implementing a Scheme-ish language.

I ran through the first half or so of both versions earlier this year while looking for material for a colleague (and a few other books, courses, and tutorials). Overall I like the approach which is based on the nanopass compiler concept mentioned by some other commenters. You get a functioning compiler (as in it produces an executable program) very early, and then spend the rest of the book extending the language and adding passes.


I keep recommending “Compiling to Assembly”, by Vladimir Keleshev: https://keleshev.com/compiling-to-assembly-from-scratch/

The author chose TypeScript for the compiler, which makes it somewhat more accessible to today's programmers.


Crafting Interpreters?


I'd argue that Crafting Interpreters is becoming dated as well. There are modern alternatives that could make the process a lot more streamlined. Would love a rewrite!


Could you expand on what alternatives you prefer? I'm not sure what is dated about Crafting Interpreters (but I'm also not following the latest developments in the compiler space).

Thorsten Ball's compiler and interpreter book series is the only other book on the subject I find widely recommended and modern.


If you don't mind interpreters and/or swift, I'm working on a tutorial here:

https://github.com/codr7/swift-interpreter


My first introduction to this subject was “Let’s Build a Compiler” by Jack Crenshaw. Even though it’s quite old now, I still highly recommend if you’re looking for a gentle, no-frills introduction: https://compilers.iecc.com/crenshaw/


Yet it gets more and more time consuming the closer you want to get to real-world compilers.


It gets harder when you try to deviate from the norm of “40 years of research and experience, and this has survived 40 years for a reason”


The nice thing about real world compilers is that they already exist. You can write your own compiler for situations where you need to do other things. If you need performance, compile to a language with a high-performance compiler.


I wish there were a standard cross platform assembler library in C for x86, ARM, and WASM. Because frankly writing an asm text generator and shelling out to GCC or NASM is really annoying, and it feels like something so foundational should just exist.

I get there's yasm, and libnasm, and asmjit, and libgccjit, and llvm, etc but they're all varying degrees of "bad" for the purposes of a simple code generator for someone that doesn't want to dig through ARM and Intel's developer docs.


Can you elaborate on what's "bad" about asmjit?

I wrote it for the purpose of making it easy to write JIT compilers in C++, and it has been adopted even by projects that are written in C (like Erlang). It's also one of the smallest libraries out there and has great performance.

I understand why people don't want to use projects such as libgccjit or llvm, but asmjit is just super easy to use compared to others.


It's in C++. I don't really mind interfacing with C, but C++ is kind of a non-starter.

And back when I last used it, the docs were pretty rough. I understand that's changed a lot and they look much better now.


The reason it's written in C++ is to make it practical. I understand why some people would want it in C, but honestly I have never seen a nice JIT assembler library written in C. I've seen a few incomplete assemblers in C before as parts of other projects, but they were usually ugly code full of macros, so no proper C API to interface with anyway.

I think that asmjit's biggest strength is in its completeness and performance, so it can be used to generate code in a sub-millisecond time, which higher level code generators such as LLVM don't offer.


I understand, but I don't think you understand me.

asmjit is great if you're writing a fast JIT and want to write C++, or can tolerate a C++ requirement. If you're writing a compiler in any other language, requiring C++ is a burden that may be insurmountable. What I'm lamenting is the lack of a widely used, decent C library for generating x86 and ARM code, which doesn't seem to exist, nor does asmjit fill that gap.

> I think that asmjit's biggest strength is in its completeness and performance, so it can be used to generate code in a sub-millisecond time, which higher level code generators such as LLVM don't offer.

Not everyone cares about sub-millisecond performance for code generation. It's easier to generate GCC-syntax asm and shell out to `as` or `cc` than to write bindings to asmjit and interface from a different language, and the performance is fine because lowering your AST into an IR that can be used to generate asm, either as text or as input to asmjit, is a much bigger bottleneck.


Yeah, I usually write compilers in C++ and it seems we have totally different use-cases :)

High performance assembling is literally what asmjit was designed for and that allows it to be used in interesting projects - for example there are multiple databases that use asmjit to JIT compile queries, and low-latency compilation is the key feature here.

BTW if you need a pure C library to encode x86 there is zydis (yeah, it used to be a disassembler only, but it now also has an encoder), but it's only for x86.


The fact that it's written in C++ doesn't make it 'bad'. It just means it's not suited to your use, which is your problem not the library's problem.


> I wish there were a standard cross platform assembler library in C for x86, ARM, and WASM. Because frankly writing an asm text generator and shelling out to GCC or NASM is really annoying, and it feels like something so foundational should just exist.

It does, it's called C :-)

> I get there's yasm, and libnasm, and asmjit, and libgccjit, and llvm, etc but they're all varying degrees of "bad" for the purposes of a simple code generator for someone that doesn't want to dig through ARM and Intel's developer docs.

This is only for language developers who aren't writing in C, right? Because if you're writing in C, what would you use this assembler for?

Anything a cross-platform assembler emits is going to be portable, so it's only going to be as performant as C is, so why not just emit C?

Having a C library that emits C might be a nice thing to have - then all the other languages can emit "code" that is cross-platform.


C is not assembler, nor is it close to assembler, nor is it desirable to have a C compiler in your toolchain to compile something that needs assembler.

For example, C has a calling convention. It has a notion of a stack and heap. Certain things like arbitrary control flow, threading, and process spawning cannot be expressed in C. It has a memory model that you must adhere to.

If you actually want to implement a programming language, C is not sufficient as a transpilation target except for mostly trivial designs.


works pretty well most of the time; chicken scheme even compiles call/cc to c


> For example, C has a calling convention. It has a notion of a stack and heap. Certain things like arbitrary control flow, threading, and process spawning cannot be expressed in C. It has a memory model that you must adhere to.

Yes, all trivially bypassed or worked-around if you're emitting C. Considering the initial constraint was "cross-platform":

1. Calling convention - what would be the point of a calling convention different to the rest of the system? You wouldn't be able to do system calls, for one. And if you need a different calling convention within the language itself C will let you push elements onto a stack and jump to a different piece of code[1], optionally saving and restoring registers.

2. The notion of a stack and heap - what's the alternative you have in mind when using assembler? Assemblers also have a notion of a stack and non-stack addresses, after all.

3. Arbitrary control flow - you can emit goto's to your hearts content (see [1] below).

4. Threading - this is platform-specific by nature, but if you're emitting C you can emit calls to the platform's thread primitives (CreateThreadEx, pthread_create, etc) the same way you'd have to do if you were using an assembler. Nothing stops you from providing a wrapper for threads that your emitted C code calls (so that it works on multiple platforms).

5. Process-spawning - once again, this is platform-specific, so even in assembler you're going to have to make a call to the platform's CreateProcess/posix_spawn/exec functions. If you're going to call a platform function from assembly, you may as well call it from C.

6. Memory model - I'll give you this point, even though the memory model is not that much different from assembler. If you're emitting C, you're still free to allocate/define a large array and use that memory in whatever way you wish.

> If you actually want to implement a programming language, C is not sufficient as a transpilation target except for mostly trivial designs.

Actually, many languages with non-trivial and advanced features, such as Common Lisp, already have implementations that transpile to C. They implement the entire Common Lisp standard, and don't seem to be missing any features even though they are emitting C and not assembler.

Anonymous functions? Done with emitted C.

Closures, continuations? Done with emitted C.

Garbage collection? Same.

What feature do you have in mind that cannot be implemented by emitting C[2], but can be implemented by emitting assembler[3]?

[1] Some constraints on the jumping exist, but emitting GCC-specific C lets you use computed expressions for the jump target - you don't need to `goto` a constant address.

[2] I initially thought that something like Goroutines or async functions might be a candidate, but off the top of my head, I can think of at least one way to implement async functions and/or go routines by emitting C. There's probably more.

[3] I'm not being contentious, I'd really rather like to know - my own toy language (everyone has one) currently in the phase of "I'm thinking really hard about what it should look like" is going to be emitting C. I couldn't find any feature for the language that can't be done by emitting C, so I really want to know what you have in mind.


Plenty of languages have language-internal calling conventions nothing like C, and usually for good reasons.

Some try to stay close to make passing the boundary easy, some are so different they need an expensive conversion step anyway and then there's little point.

Sure, you can always map your requirement onto the standard calling convention somehow, but if you can't directly call out or in anyway, it's a constraint it's totally reasonable not to want to be forced into.

EDIT: Note that there's nothing wrong with emitting C if you don't have any compelling reasons not to. When we don't it's often for reasons such as wanting a simpler toolchain, wanting the environment to be self-hosted etc. that are tradeoffs that sometimes are frankly ideological (take that as you want - I don't mean anything inherently negative with that other than when the authors ideology differs from mine ;) I like that there's choice and that some deviates from the norms ), sometimes based on requirements that doesn't really have to do with the language itself. I do agree with you that you can fit all languages onto C, and most languages neatly onto C.

One example where it has traditionally been painful without compiler extensions, though, is tail call elimination. You can still do it. Other fun times involve closures, but you can do that too (been there, done that, wrote a blog post years ago) even though it quickly gets ugly.

Often you end up wanting to demand a specific set of extensions.

Overall nothing stops you, but at some point it also doesn't necessarily buy you all that much - you still need to do most of the hard bits.


> Sure, you can always map your requirement onto the standard calling convention somehow, but if you can't directly call out or in anyway it's a constraint it's totally reasonable to not want to be forced into.

I get that, but what I don't get is what about C makes it harder to define your own calling convention than defining it in an assembler language.

In the context of this thread, OP wants a portable assembler. Implementing a custom calling convention by emitting any assembly language is bound to be more work than implementing a custom calling convention by emitting C.

Maybe the situation regarding floating point registers when passing floating point values? That's about the only thing I can think of that makes it harder to emit C code that implements a custom calling convention than doing it in assembler, but even for that I can think of workarounds, just so that other high-level constructs can be used.


> I get that, but what I don't get is what about C makes it harder to define your own calling convention than defining it in an assembler language.

You don't get to control which registers are caller or callee saved, or which registers are free to use.

If your desired calling convention fits within those constraints, you can use it via C all you want. If it doesn't, you usually will need to either resort to compiler extensions or give up on using C. Whether it's wise to change that for your given use case can be a worthwhile question to ask, but that is another issue.

Ironically sometimes this choice is about making C interop easier. E.g. I have a prototype Ruby compiler. It can call out to C from a low level "escape". The same exact code is used to generate calls to Ruby methods and to C. But the Ruby code needs the arity of the code that's being called to be passed along. Putting it in a register that's not used for arguments in the standard calling convention means that when calling out to e.g. glibc the C code just won't be able to access those values.

Now, in my case, if I'd been on the fence about using C as the target, then I'd have handled that in a different way instead, so that didn't really influence my choice to emit assembler. But I can't define that calling convention in ISO C.

> Implementing a custom calling convention by emitting any assembly language is bound to be more work than implementing a custom calling convention by emitting C.

Generally, your calling convention simply involves what goes into registers and what goes on the stack in which order. Storing to a register vs. outputting to an argument list is only marginally different in effort. In the compiler mentioned above, preparing arguments is literally:

    args.each_with_index { |a,i| @e.save_to_stack(compile_eval_arg(scope, a), i) }
If I was emitting to a C argument list instead, the only thing that'd change is the `@e.save_to_stack` bit. `save_to_stack` is a one-line method. The non-C compatible parts add up to 2-3 more lines in that case. One doing something more unusual might be longer, but not a lot.

There is plenty of complexity in dealing with assembler directly, and you certainly ought to have a reason for doing so over outputting to a higher-level target, but the calling convention is not where that complexity lies.

> OP wants a portable assembler

And as they pointed out, C is not a portable assembler. It is sometimes close enough, but it imposes plenty of restrictions that you can't get around in ISO C. That can be perfectly fine, but it's a tradeoff. Sometimes you can find a halfway house of requiring a specific compiler, but that also removes at least some of the advantage.


You make some very good points, and I agree with your conclusion, viz it's a trade-off.

I'm still wondering if the OP (@duped) meant the same thing as you did, or if he had another use-case in mind when he said "calling conventions".

FWIW ...

> But the Ruby code needs the arity of the code that's being called to be passed along. Putting it in a register that's not used for arguments in the standard calling convention means that when calling out to e.g. glibc the C code just won't be able to access those values.

In general, when a call returns you can't trust the contents of registers that are unused in the calling convention, because the callee might clobber them (for example by calling back into your language before returning).

What I did[1] when doing the ffi to the external C functions was emit a `setjmp()`, call the function, and emit `longjmp()` when the callee returned.

That allowed recursive calls no matter how many times that FFI boundary is crossed in a single function call: The language calls into C API, which calls something that invokes a callback in the language, which makes another call to a C function, etc.

[1] IOW, this is not the usual way, as far as I know. All a bit hackier than when I (for a different toy language) used `libffi` when emitting the code to call external C functions, which (IIRC) makes certain guarantees about some of the registers.


> In general, when a call returns you can't trust the register contents for those registers that are unused in the calling convention, because the callee might clobber it (for example by calling back into your language before returning).

Depends on the register and the calling convention. E.g. for i386 the Intel ABI requires that eax, edx, and ecx are caller-saved, and so can be clobbered by the called function, but the rest are callee-saved, and so you can rely on them surviving just fine if you're calling any code generated by a compiler conforming to the Intel ABI. So calling out to C-code is fine.

(Calling back into the Ruby code requires extra effort, because the C code can't know how to set the arity. I might change this to store the arity of a method so that it can be looked up by the caller, in which case it might be doable to not require setting it at the time of the call. But that increases the amount of code at each call site: you'd need to resolve the method address and look the arity up relative to it, instead of making an indirect call, in order to avoid race conditions if someone redefines a method. We do love redefining methods at runtime in Ruby; it doesn't actually happen that often, but it can in theory happen at any time, and proving it won't for a given piece of code is a pain. It might be worth it at some point, even optionally, as when you can safely cache the method address you'd be able to jump straight past the arity check if you can guarantee the right number of arguments. But it's not simple.)


> what would be the point of a calling convention different to the rest of the system

C is actually the exception and not the rule; most languages have different or multiple calling conventions.

> what's the alternative you have in mind when using assembler? Assemblers also have a notion of a stack and non-stack addresses, after all.

Growable segmented stacks.

Now sure, you can do mostly anything in C. But again, it's not assembly. Is it that odd that someone would want to write a compiler without having to write an x86 code generator, or without tying themselves to C, which, again, cannot implement certain things at all?

> you can emit calls to the platform's thread primitives (CreateThreadEx, pthread_create, etc)

pthread_create is not a platform primitive, it's a glibc function, and interestingly cannot be written in C.

> What feature do you have in mind that cannot be implemented by emitting C

You can't implement fibers with growable stacks in C. Now you can use <ucontext.h> (getcontext, setcontext, swapcontext are all libc functions implemented in assembly), but they are very expensive, with low-hanging fruit for optimization (for example, not saving the fp status registers or signal mask). Context switching also depends on your calling convention, so you may want to have multiple variants that are emitted for different parts of the program, or evolve over time.

For growable stacks you need to insert a prologue into each function to check whether the stack needs to grow. You probably don't want to pay for this on every function call, though, and can write peephole optimizations in your compiler to avoid it where possible.

There are other problems with C that aren't even worth mentioning, notably the linkage/loading model.


> What feature do you have in mind that cannot be implemented by emitting C[2], but can be implemented by emitting assembler[3]?

If all features can be implemented in C, then why do all major C compilers have an asm extension?


You can generate the opcodes directly if you want to avoid having a separate assembler.

AFAIK MSVC does that unless you tell it to generate Asm output, whereas GCC always goes through the separate assembler route.


One way would be to target and generate LLVM IR.

You get multiple platforms for free if I'm not wrong.


a problem that a lot of these series run into is that the author runs out of steam before they finish writing them. crenshaw's otherwise excellent series suffers from this, for example

so far the author of this one has only written the first chapter

i've written a few didactic compilers that are complete enough to compile themselves, though not complete enough to compile anything else; these may be of more interest to someone who's looking for how to make writing a compiler surprisingly easy, though they do lack the extensive explanatory text in the linked blog post

- http://canonical.org/~kragen/sw/urscheme (from a subset of scheme to at&t-syntax i386 assembly, 1553 lines of code)

- https://github.com/kragen/stoneknifeforth (from a forth-like language to an i386 linux elf executable, 132 lines of code)

- https://github.com/kragen/peg-bootstrap/blob/master/peg.md (from a peg grammar with semantic actions to javascript, 66 lines of code)

- https://github.com/kragen/peg-bootstrap/blob/master/handaxew... (tangles a literate program in virtually any plain-text format, such as markdown, html, restructuredtext, or tex, into virtually any plain-text programming language, 208 lines of code; used to tangle both itself and the peg compiler compiler above)

- http://canonical.org/~kragen/sw/dev3/meta5ix.m5 (from meta5ix, a language for writing parsers derived from val schorre's meta-ii, to meta5ix assembly, 18 lines of code; it is a top-down parsing language with limited backtracking, like pegs, but less expressive and implementable without heap allocation or the ability to backtrack past the beginning of a line; see http://canonical.org/~kragen/sw/dev3/meta5ixrun.py for an m5asm interpreter)

- http://canonical.org/~kragen/sw/dev3/meta5ix2c2.m5 (from a slightly different variation of meta5ix to c, 133 lines of code)

most of these do take the same ast-free approach as the linked blog post and crenshaw, just emitting code as they finish parsing a production (or, in the case of stoneknifeforth, a token), but ur-scheme in particular does not; it uses scheme s-expressions as its syntax tree, because some kind of syntax tree is necessary to be able to allocate some local variables on the stack while also supporting closures. peg-bootstrap can certainly handle code that builds an ast with its semantic actions, but its implementation of itself just glues together snippets of javascript text

also worth mentioning in this connection:

- darius bacon's work, especially parson (for example, https://github.com/darius/parson/blob/master/eg_calc_compile...)

- the oberon compiler in wirth and gutknecht's oberon book

- the later chapters of sicp


This is the first time I'm noticing you have a treatment on versioning in handaxeweb. I don't have time to read now, but I've bookmarked it for later meditation.

You might also consider using the file format from the WHATWG/W3C hypertext system as your source code + presentation format. (I.e. HTML, CSS, and JavaScript.) Readers that support the format (i.e. web browsers) are ubiquitous, and JS is capable and every bit as good of a didactic language as other choices (if not better, e.g. in comparison to Python). Viz:

<https://www.colbyrussell.com/LP/debut/index.html>

(It happens to have the nice property of supporting forward declarations and function redefinition, not demonstrated in the link above.)

Once you understand the trick, you can exploit it to great effect for other purposes, like the plain.txt.htm hack:

<https://www.colbyrussell.com/LP/debut/plain.txt.htm>


> the author runs out of steam before they finish

One approach is to finish the whole compiler first and then to split it up for presentation, but I imagine most writers don't want to do that.


Thank you for your hard work.


i hope you enjoy it


I like this approach, but why not take it one step further toward Turbo Pascal and generate the opcodes? ;-)


Author here. I wanted to get the bar for writing a compiler as low as possible, and emitting assembly code seemed better in that regard.

Also I don't have time to learn x86 instruction encoding right now :(


Thanks for the response! I was joking a bit (though not entirely - I really like dynamic code generators and am inspired by tiny, powerful systems) but as I mentioned I like the approach and hope that you get the chance to finish the series.


I did not understand the stack part.

Let's say the slot value is 2; the instruction for this would be to move the value located at %rsp-24 into the %rax register.

So your instruction becomes mov -24(%rsp), %rax

Correct?


That's right!


The 2014 (doesn't time fly) edition of the ICFP Programming Contest had participants make a Pacman brain, deliverable in SECD machine assembly, and ghost brains, deliverable in a simple microcontroller assembly.

Most teams, ours included, wrote 2 (probably rather crappy) compilers in 3 days.

https://en.wikipedia.org/wiki/ICFP_Programming_Contest

https://en.wikipedia.org/wiki/SECD_machine


Have written compilers, can confirm. For parsing especially using recursive descent is surprisingly pleasant, even in something badly suited like C.


One thing I haven't seen yet is how a compiler would keep track of types and emit the corresponding assembly


class Type { int size; } and then add methods for every conceivable type-dependent operation. Alternatively, a sum type or enum+pointer will also do. Then you just attach a type to every expression. At least, that's how you do it in a "forward-typed" language like C. In a language like Haskell it's a bit harder because your type assignment is a full graph, not a DAG like in C.


That's... not a compiler. It's a small part of a compiler backend (code generator).


I like the sentiment of the title, even if it gets people riled up. I'm fondly reminded of the introduction from Bob Nystrom's _Crafting Interpreters_ [1]

> This last reason is hard for me to admit, because it’s so close to my heart. Ever since I learned to program as a kid, I felt there was something magical about languages. When I first tapped out BASIC programs one key at a time I couldn’t conceive how BASIC itself was made.

> Later, the mixture of awe and terror on my college friends’ faces when talking about their compilers class was enough to convince me language hackers were a different breed of human—some sort of wizards granted privileged access to arcane arts.

> It’s a charming image, but it has a darker side. I didn’t feel like a wizard, so I was left thinking I lacked some inborn quality necessary to join the cabal. Though I’ve been fascinated by languages ever since I doodled made-up keywords in my school notebook, it took me decades to muster the courage to try to really learn them. That “magical” quality, that sense of exclusivity, excluded me.

> And its practitioners don’t hesitate to play up this image. Two of the seminal texts on programming languages feature a dragon and a wizard on their covers.

> When I did finally start cobbling together my own little interpreters, I quickly learned that, of course, there is no magic at all. It’s just code, and the people who hack on languages are just people.

> There are a few techniques you don’t often encounter outside of languages, and some parts are a little difficult. But not more difficult than other obstacles you’ve overcome. My hope is that if you’ve felt intimidated by languages and this book helps you overcome that fear, maybe I’ll leave you just a tiny bit braver than you were before.

1: https://craftinginterpreters.com/


Who needs a lexer, a parser, and an AST when you can just explicitly feed the tokens into a switch statement that completely falls through if it runs into something it doesn't recognize? This is super easy!

Sorry, I am by no means a compiler expert, but spent enough years in study of them and taken enough stabs at writing my own that this take is hopelessly naive, and I really hope the author realizes it.


I have written multiple compilers, from C-like interpreters to protobuffer like compilers over the years, all in production, some still.

Not using a finite state machine for any kind of stream interpretation is a huge mistake, outside of compiler tokenisation just as much as within it: protocol decoders, even just parsing config files.

edit: There is a well-known book with a dragon on it, but I first learnt from The Unix Programming Environment, by Brian W. Kernighan and Rob Pike. In just a few pages they clearly explain how to write a basic calculator like this, one that generates intermediate code to run on a stack-based interpreter. I started with this and expanded it to emit assembly for an arcane DG machine. I know the code it generated ran for at least 20 years. Kinda surreal, as nothing lasts that long usually.


Could you elaborate on what you mean by that? To me this looks like a code generation library, an approach I like because of its concreteness and adaptability (for example, if we changed it to emit opcodes then it could be used for dynamic code generation, which has many fun and interesting applications.)

Also this post is aimed at the author's imagined teenage self. Does that change anything?


When writing a compiler you want to first look at what you expect, and when you get that, what to expect next. If that does not come, where do you go?

We call this a graph, but there is no need to confuse things with advanced terms.


I don't see any parsing in TFA - I think they are taking a bottom-up approach by starting with a code generation library before even thinking about parsing.

I like this approach because it produces something that generates usable code without worrying about things like parsing or high level language syntax, and you've basically written a DSL/intermediate format which can be compiled by an existing C compiler.


thanks for the edit, i will endeavor to have a look at that source


Very true. Writing a compiler that works on correct code is often trivial. It’s not much harder than a minimal lisp or forth.

However, figuring out errors, adding type checking, etc now that’s hard.


Starting with the grammar vs starting with the backend: fight! Personally, I'm on Team Backend; if you have something that works, it's a lot easier to glue things in front of it that maintain the property of it working, because you can just run the product. Testing a parser in isolation is a lot more annoying.


The article describes AST -> Assembly transformation, no?


* compiler without optimization phase


Aka: a compiler.

A compiler is just `source -> generate -> target`. All the other fancy things are optional. IMPORTANT things, but don't detract from calling it a "compiler".

---

This is similar to anything else: A single `.html` served on `:80` is a website. A 2d, grayscale `pacman` is a videogame. `print("hello world")` is a program, etc.


Aka a function tongue-in-cheek :p


If I read the article correctly, it's actually "a compiler for a language that only supports basic arithmetic, without a lexer, a parser, symbols, nor an optimization phase".

Which _is_ surprisingly easy to write. But it might be a stretch to call it "a compiler" :D


Yeah, I've written stuff like this many times -- mostly producing AST then bytecode, etc. I never really wanted to call it a 'compiler' per se though that's what it is. Because I generally stand in awe of serious optimizing compiler engineers, they have deep domain expertise that is admirable.


True, but no need to self-gatekeep the terminology here just to appease a rando commenter. :)


Yes draw two circles and then the rest of the owl kind of thing.


Eh, a college friend put together a C compiler in 100 lines of yacc (not particularly golfed, either, C just isn't that sophisticated.) Great for playing with "what if C + this one feature" kind of things in a concrete way. So from my perspective, not a stretch at all...


Parsing and symbol resolution on simple languages is kinda trivial in my opinion, but I'll probably write some articles about that stuff later

Control flow and pointers are coming in the next part, and it's still surprisingly easy


Is memory / register allocation in your roadmap too ? (honest question, I'm not sure it's in the scope of your articles)


Register allocation is not on the roadmap, I think it's too hard for this series... But maybe I just haven't figured out an easy enough way to do it yet :^)

Memory allocation? You mean making a heap allocator? In the spirit of lowering the bar for creating new languages, I lean more towards calling into libc's malloc instead of building a custom solution haha


As opposed to a tree-walking interpreter, which is usually presented as the only viable way to write your first programming language implementation.


* compiler without good error messages


Clang may have better static checking, but Apple's MPW C compiler still had better error messages.

https://news.ycombinator.com/item?id=37283375


Cute and whimsical, but I was thinking along the lines of context for answering "why?"

"..because it was defined at [....] and used [....] and [...] as a int32 type (thus coerced to int64), but passed in [...] as a"


if you use the nanopass approach you can add optimisation phases relatively easily


but since he hardcodes the register allocation, i wonder how he can control anything here


you can refactor over time to add a register allocator. the key point is it's usually easier to do these things within the context of an end-to-end working compiler than with a big design up front.


yeah, fair enough


Probably adheres to the principle; "Make it work, then make it fast"


It's actually three steps: make it, make it work, make it work faster.

this may actually be somewhere between step one and two.


If you cared about performance, you'd feed it into LLVM anyways.



