Rob Pike: Notes on Programming in C

drblast · on Nov 9, 2010

Can anyone who's worked on a well-written, large C project comment on the best way to structure headers and include files?

I don't like including stdio.h and stdint.h in every .c and .h file, but it seems like the "standard" way to compile with Makefiles is to compile each .h and .c pair as an object then link them together at the end. So you need the header files to use includes and such in every file or the compiler barfs.

Every time I try to structure a C project I feel like I'm doing it wrong and that there must be an elegant solution that I'm missing.

tptacek · on Nov 11, 2010

Forget what 'gte910h says. Here's what every project I've worked on professionally has ended up doing:

  #ifndef MYPROJECT_H
  #define MYPROJECT_H

  #include <stdio.h>
  #include <stdlib.h>
  #include <string.h>
  #include <unistd.h>
  #include <fcntl.h>
  #include <sys/types.h>
  #include <sys/time.h>
  #include <sys/socket.h>

  #include "trie.h"
  #include "hash.h"

  #endif

Then, in each project file, at the top, just include:

  #include "myproject.h"

There are all sorts of niggling little reasons why you aren't supposed to do this, just like there are niggling little reasons why you're supposed to cast functions that return values to (void) when you don't use their values. Ignore them and get on with your life.

anonymousDan · on Nov 9, 2010

I'm not a C/C++ expert but I can highly recommend the book "Large-Scale C++ Software Design" by John Lakos. It gives an extensive discussion on how to structure programs and addresses the issue above in detail, in addition to many others. I'd be very interested to hear the opinions of any HN C++ experts on the principles espoused in this book (assuming a few have read it!).

jbn · on Nov 10, 2010

It's a great book with valuable insights, i have successfully used them multiple times to decrease build times of multi-million LoC projects.

sixtofour · on Nov 10, 2010

Logged in specifically to up vote Lakos.

Will log out and log back in again to attempt a second up vote.

drblast · on Nov 10, 2010

Thanks, just what I was looking for!

carefulfungi · on Nov 10, 2010

Absolutely - a wonderful book!

tedunangst · on Nov 9, 2010

I'm not positive what you mean "you need the header files to use includes". header files should generally not include other files. C files include headers. If the headers have a dependency that can't be resolved by a forward declaration, it's often best to make the C file include the headers in the right order.

headers including headers leads to messes where people just include a random selection of headers and hope that pulls in everything they want. It's very annoying to untangle years later.

drblast · on Nov 10, 2010

If I have data types defined as function arguments in header files, I need to include the headers for those data types. There's apparently no way around this, and it does result in messy dependencies.

I'm just curious if large projects have a systematic way of dealing with the dependency tree created by having fine-grained compilation units.

The simplest way to write a C program is as a single file. Then you have none of these problems, other than it's very difficult to navigate through the code as nothing is separated by functional area, and you have to recompile the whole program for every small change.

Next in the complexity chain, you could break apart the single C file into multiple .c files and include them in the right order, and not use header files at all. However, that still requires that you recompile the whole program for every change.

So now we're down to compiling every separate .h/.c entity as its own object file to speed up compilation, but structuring these files so that they compile and include everything in the correct order becomes a much bigger problem.

jws · on Nov 10, 2010

If I have data types defined as function arguments in header files, I need to include the headers for those data types.

You do not need the definition of a structure to use a pointer to it in a function prototype…

  struct puzzle *newPuzzle( int difficulty, struct preferences *prefs);

…will compile just fine without the definition of struct puzzle. this let's you hide your opaque types inside the .c files or a private .h file.

drblast · on Nov 10, 2010

There are other data types besides pointers to structs. Specifically, stdint.h data types are a good example (on linux).

  int
  myfunction(uint64_t arg0);

When writing cross-platform code, there are typically a lot of these kinds of things going on.

Also, I typically include data type definitions in header files if they're used in more than one context. If I define a new data type in a header file that depends on other data types, I have to include those headers in the header file.

Let's say I'm writing an encoder/decoder for a protocol and I want to separate the encode functions from the decode functions. Maybe they're in separate programs. They'll both use the same data structures for the most part.

It it's a network protocol the layers of dependencies can get pretty deep.

This isn't a problem with a one-liner solution, it's "how do you structure a huge project to reduce weird dependency issues when compiling while keeping the code readable?"

gte910h · on Nov 9, 2010

(Apoligies, HN is mangling my comment when I try to include C style comments, so I'm using C99/C++ style ones)

You should include whatever .h files you need in every file that uses them (things like stdlib and stdio for sure when you're using them). This makes it so you have fewer interfile dependencies which keeps your compile times down. So you only include other .h files in .h files if you need the types in them (or put a blank prototype in for it, aka struct foo;). Otherwise you include .h files only in .c files.

Feels hairy, yes, but that's the correct way to do it. By doing it creative ways, you cause great pain down the road (especially compilation times, and hunting for simple bugs). If you're only using one or two functions, sometimes people just prototype the function:

// some people just would stick this prototype in their .c

//file they're writing if they need nothing else in math.h

double sqrt(double x);

I think you're missing something about C there though the way you talk about compiling things.

h files aren't really compiled at all. They are substituted into files where they're #included. You can actually write your entire project without them if you really want to (not suggested, but you can) by just putting prototypes for the functions you call at the top of the c files that need to copy them. That's actually why its important to put guard include statements at the beginning and end of h files to make sure they aren't included more than once. A properly written h file shouldn't cause compiler errors when you do this to it:

#include "foo.h"

#include "foo.h"`

The way you make it proper is you do this the top:

#ifndef __FOO_H__

#define __FOO_H__

and this at the bottom

#endif // __FOO_H__

Many IDEs do this automatically and vim and emacs both have plugins you can get to do this. Robert Pike's piece is from 1989, when computers ran much slower. The lexical analysis time from this is not worth talking about anymore.

Structure:

Generally C is structured with a main .c file which has a few functions related to start up in it and includes of header files in other modules which are needed. Each of the .c files includes the header files it needs, from either internal project modules, or from system level header files (aka <thingsInAlligatorBrackets.h>). Generally each .c file is a module, although in Object Oriented C, you may find modules built up of several different c files with each "class family" in a single .c files. Modules are purely conceptual in C.

Some C people get macro happy. Do no do this. It is unclear to the rest of the C developing universe. Modern C compilers can do inline substitution quite well, often giving better performance than your macros. When people get macro happy, they end up making a file like "GlobalMacros.h" or even if they don't, they just have a lot of constants and end up with "LotsOfConstants.h" included in everything.

Resist the urge to put your .h includes in there but NOT in the files that actually depend on it. If you do that you break code if that .h file is ever taken out.

As far as final compilation goes: Every C file is compiled into an object file then linked using the linker. Alternatively, groups of .c files are compiled into .o files then linked into .a files (static libraries). These libraries are then linked in during the final compilation step.

TheNewAndy · on Nov 10, 2010

Your header guard is bad (but not horribly so). Identifiers with two underscores at the beginning are reserved in the C90 standard (as are identifiers with one underscore followed by an upper case letter).

Your comments are also only valid in C99. Unless you have a really good reason, then you should use the block style comments (and I can't think of any good reason). (oh, maybe I should have read your very first sentence... yes that seems like a good reason :-)

gte910h · on Nov 10, 2010

>Your header guard is bad (but not horribly so). Identifiers with two underscores at the beginning are reserved in the C90 standard (as are identifiers with one underscore followed by an upper case letter).

That's not actually the header guard that pops out when I put them in (as I do so automatically using vim code similar to that described below), however you are correct that you have a chance to clash with compiler dependent info macros (such as __LINE__) if you use a double underscore before a header guard.

I'm going to see if I can edit the post for those who don't see this exchange below.

http://yvan.seth.id.au/Entries/Technology/Code/Vim_Header_Gu...

I was trying to simplify yet avoid the UTIL_H clash problem and made an error there.

loup-vaillant · on Nov 10, 2010

To write code, skip 2 or more characters at the beginning of the line, Markdown style.

  #include "foobar.h" /* no mangling */

http://news.ycombinator.com/formatdoc

gte910h · on Nov 10, 2010

Thanks. I was trying to do the methods explained on daring fireball about markdown, but they weren't working so I just went C99 style.

signa11 · on Nov 10, 2010

if you have not looked at 'c interfaces and implementation' i would heartily recommend it. it is very, very good.

rcfox · on Nov 9, 2010

I disagree with the include files section.

As long as each file has a proper include guard, it won't matter how many times you've included a file. I'd much rather have the preprocessor figure out when and where to include a file than do it myself. Maybe this was a significant performance issue in 1989, but it shouldn't be anymore.

That said, you should only include files that are direct dependencies. There's no need to include stdio.h in every other file.

palish · on Nov 9, 2010

That's a naive viewpoint. It's more of a concern in C++ ... if your codebase grows to massive proportions, and you #include headers within headers, then modifying a header can result in ~60% of your codebase being rebuilt. (At work, I deal with this pretty much every day.)

Including header files within headers is simply unnecessary, especially in C. You need to forward declare in header files, then include in source files where you actually use that header file.

gte910h · on Nov 9, 2010

C is not C++.

In C you shouldn't be chaining header files because you don't make function calls in header files in C code (well most good C code at least, people abuse macros sometimes).

In C you should only include header files in header files if you have types in the other header files. It's okay to forward declare if you must, but its often quite a bit clearer to not.

Additionally, you should always include all the h files that have your undefined functions (only relying on include chaining for header files at most). So if you have:

DogModule.c calling Sounds.c functions and exporting a public function in DogModule.c called "PlayBark(struct soundNode);", DogModule.h and things that use DogModule should both import Sounds.h.

Good C habits and structure are very different than good C++ habits and structure. They're not the same language, not even close. Mangling your concepts of Good C and Good C++ is a great way to write both poorly.

palish · on Nov 9, 2010

Insightful, thank you.

rcfox · on Nov 9, 2010

Suppose you've written some key data structure in its own file. Let's say it's a binary tree. The binary tree's functions are all implemented in bintree.c, and its functions, defines, and struct are all declared in bintree.h.

Now, you want to write some application-specific functionality that works on your binary tree. You need to pass a pointer to the binary tree to your new functions. Of course, you implement these new functions in app.c, and the functions are declared in app.h so that other parts of your application can call them.

You're going to need to include bintree.h in app.h to ensure that the compiler knows about the binary tree's structure. Otherwise, you'll have to make sure to always include bintree.h before you include app.h. Is that really something you want to be bothered with?

palish · on Nov 9, 2010

In app.h, you should forward-declare "typedef struct bintree_t bintree;"

Then, in each source file that #includes "app.h" and also CALLS binary-tree-related functions, you #include "bintree.h".

Like I said, this is more of a concern in C++ where compile times are way, way slower than C. And in C++, it's very much worth it to forward declare in header files. It makes a big difference once your source code base has grown to epic proportions.

frou_dh · on Nov 10, 2010

It's a technique, but not so applicable to say "you should" do it.

edsrzf · on Nov 9, 2010

I totally agree.

Not only will headers including headers cause files to rebuild unnecessarily, but it can cause build times to increase simply because the compiler has to compile all those extra headers each time they're included in a source file. (Precompiled headers can help, but they aren't always an option.)

jbn · on Nov 10, 2010

yes, and that's using the Pimpl idiom to decouple things can improve build times. If your whole codebase rebuilds everytime a header is touched, then your headers are probably not correctly organized. Having minimal headers (using forwards decls as much as possible) is vital to keeping rebuild times down.

meese_ · on Nov 9, 2010

I think that advice is just outdated. It's a different situation now that it's a standard practice and there are optimizations supporting it (http://gcc.gnu.org/onlinedocs/gcc-2.95.3/cpp%5F1.html#SEC8).

mwcremer · on Nov 9, 2010

I think Pike's point about "thousands of needless lines of code" is that people do this:

#include "foo.h"

instead of this:

#ifndef __FOO_H__

#include "foo.h"

#endif

where "foo.h" looks like this:

#ifndef __FOO_H__

#define __FOO_H__

/* contents /

#endif / __FOO_H__ */

which in the former case results in having to lex foo.h in order to skip the contents.

However, as pointed out below, modern compilers will effectively do the same thing for you automatically.

[Edited to make the code readable]

TheNewAndy · on Nov 10, 2010

Just a quick pedantic point - __FOO_H__ is a reserved identifier (C90 reserved). You shouldn't be writing it in your code.

ori_b · on Nov 9, 2010

If you keep your header files independent, you have no reason to include a single header multiple times, and you can't do it accidentally through another header.

If your header files don't transitively include other headers, then

    #include "a.h"
    #include "b.h"

will never cause an error. Therefore, in well written code, the include guards will be no-ops, and for simplicity's sake, they may as well be left off.

rcfox · on Nov 10, 2010

Sure, this isn't a problem in trivial projects. If you want to use types other than what the language gives you, you're going to have to introduce a dependency.

jbn · on Nov 10, 2010

even with internal include guards, just the time to read through (and skip) the whole files for later inclusions can become dominant in cpp processing time (i.e. it's mostly waiting on I/O reading the same headers many times over). That's why Lakos advises external include guards for large systems.

cdavid · on Nov 10, 2010

Unless you have gigantic projects, your project source code is likely to be mostly in memory through kernel IO buffers nowadays, so even this is not true anymore IMO.

EDIT: just for kicks, I compiled twice numpy (100 kLOC), first just after flushing IO buffers (echo 1 > /proc/sys/vm/drop_caches in recent linux kernels):

- cold case: 24 seconds - host case: 16 seconds

Of course, it is hard to say what contributes to the slowness (files themselves, loading in memory all the programs needed for compilation, etc...).

viraptor · on Nov 10, 2010

If you include `ccache` (or similar), the recompilation time will drop to ~0. One-off compilation time doesn't matter that much and developers have many other tools they can use (pch?).

cdavid · on Nov 10, 2010

Partial rebuilds are obviously quite important - but in that case, IO is even less of an issue, because it is the hot case assuming you have enough memory. I forgot about one case where you may not have enough memory: large C++ program with multiple compilation in // - this can easily takes GB of memory for template heavy code.

jbn · on Nov 11, 2010

one easy way to slow down compiles on Windows is to include windows.h everywhere, this pulls down about 2 Mb of declarations that the cpp front-end has to parse just to reach in the internal guards... it all depends on how large your header files are (and recursively how large is all headers that they recursively pull in).

gte910h · on Nov 10, 2010

C is not C++

C compilers are leagues, elephants and donuts faster than C compilers.

In C, internal guards are preferred and cause very little to no compilation time hit. In C++, external are often preferred, especially in large files.

Do not take C++ advice to apply to C. The best practices of both languages are very different.

rauljara · on Nov 9, 2010

I think it's helpful to have criteria like this out there, so that we can discuss them. That said, what's with the bias against capital letters? maxphysaddr is more readable than MaxPhysAddr? godisnowhere is more readable than GodIsNowHere? Or is that GodIsNowhere? I'm personally a fan of coding styles that cut down on ambiguity.

jmulho · on Nov 9, 2010

I prefer god_is_nowhere. The easiest thing to read is what you are used to. To fairly compare godisnowhere, GodIsNowhere, and god_is_nowhere, you need to stick with each for several days. I bet you'll discover, as I have, that god_is_nowhere is the easiest to read.

TheNewAndy · on Nov 10, 2010

It makes the distinction between god_is_no_where and god_is_nowhere much clearer than GodIsNowhere and GodIsNowHere. A slightly more subjective one is for things with acronyms in them. Compare: check_url_index, CheckURLIndex and CheckUrlIndex.

amh · on Nov 9, 2010

I think "godisnowhere" is more an example of a straight-up badly chosen name (identifiers should not be complete sentences) rather than a case where caps are better than lowercase.

I'm not against camelcase and its variants, but there are a wide variety of naming standards using caps, it's quite possible to get mixed up between them and introduce ambiguity or worse (referring to the wrong variable).

TheNewAndy · on Nov 10, 2010

I believe the intention of this name was the parsing ambiguity of the word "nowhere" when you don't have word delimiters, not as an example of an actual name, or a statement about the location of God :)

epo · on Nov 10, 2010

Surely, that's his point, putting an ambiguity into a variable name is just asking for trouble, choose a better name.

TheNewAndy · on Nov 10, 2010

The ambiguity isn't in the name on its own "God is nowhere". It is in the combination of that name, and a bad naming convention.

If you had a naming convention which prohibits vowels, then the identifier "shrt" is ambiguous because it could be "short" or "shirt". That doesn't mean that "short" is a bad name for an identifier.

antirez · on Nov 9, 2010

iLoveFunctionLikeThis() but I must admit this is one of the rare case of useless tip in this article. All the rest is almost gold for C programmers.

What is funny is that reading it will hardly make me or you a better programmer because most of this things will take anyway years to be fully understood / mastered, and of course it's an endless process of refinement.

But still very useful to read what we should try to focus on when writing good C code.

georgecmu · on Nov 10, 2010

We use capital letters, and so half the timestamp variables are named xxxTimestamp and the other half yyyTimeStamp, which makes it a fun guessing game.

greatfog · on Nov 12, 2010

The yyyTimeStamp form could be useful if the program also dealt with postage stamps, rubber stamps, and food stamps.

pgbovine · on Nov 9, 2010

''' The following data structures are a complete list for almost all practical programs:

array linked list hash table binary tree '''

this is why it's such a pleasure to program in modern high-level languages (e.g, python, ruby) --- these basic data structures (essentially sequences and mappings) are baked into the language with simple, concise syntax

chc · on Nov 9, 2010

Except for one. Quick: What's the syntax for a binary tree in Python?

OK, I'll give away the answer: It's not there. I don't think binary trees are even in the standard library for Python or Ruby. It's a sadly neglected data structure because, when exposed for what it is, it's a little bit too computer science-y for most people's taste.

scott_s · on Nov 9, 2010

Balanced binary tree versus hash table is an implementation choice for the associative array abstract data type. Similarly, growable array versus linked list are implementation choices for the list abstract data type.

I don't buy that balanced binary trees are "too computer sciencey for most people's taste" as an argument against not have them as implementations of native types in Python. Designing a good hash table is equally as subtle in terms of both the theory and the implementation itself. But those details are not exposed through the interface.

shadowmatter · on Nov 9, 2010

"Balanced binary tree versus hash table is an implementation choice for the associative array abstract data type."

You can treat it as an implementation detail, but it's a good idea to let the client know that the underlying key-value collection is actually sorted by key. This allows the client to iterate over the key-value pairs in sorted order without dumping the keys to an array and then sorting it, retrieving the k smallest keys through forward iteration, retrieving the k largest keys through reverse iteration, etc. I think Java solves this nicely by creating the SortedMap subinterface of Map; if you have a SortedMap, you're assured these properties hold. (The typical implementation of SortedMap is a balanced tree, while the typical implementation of Map is a hash table.)

caf · on Nov 9, 2010

A binary tree also allows efficient iteration in key order ("next" and "previous" operations).

eru · on Nov 10, 2010

Yes. As the other poster argued, a tree needs some kind of ordering (like Hashtables need the keys to be hashable). But trees also give you a persistent data structure. Most hashtables including Python's dict are ephemeral, i.e. not copy-on-write.

derefr · on Nov 10, 2010

Python:

    conn = sqlite3.connect('foo')
    curs = conn.cursor()
    curs.execute('create table things (id INTEGER PRIMARY KEY, stuff TEXT)')
    curs.execute('CREATE UNIQUE INDEX tree ON things (id ASC)')

Now "tree" is your tree. ;)

Quite seriously, though, people (that aren't systems programmers) don't think "this is a load of data—I need a binary tree!" People instead think "this is a load of data—I need a database!" and then, when the database isn't actually help their queries go faster, they slap their foreheads and exclaim "oh, wait, I need an indexed database!" Of course, the index just happens to be stored as a b-tree—but they don't know that, and they don't care.

So, to reinterpret, this is basically the data-structure-usage flowchart for the average app programmer:

1. Need quick random access? Use an array.

2. Need quick insertion and removal, or lazy reification (to support, i.e., infinite sequences)? Linked lists, then.

3. Need quick search by key? Hash tables have you covered. (Well, these days, any "hash table" ADT also carries around its keys, so you also get O(N) enumeration of pairs.)

4. And, if your data set makes any of the above solutions choke? Just throw a database at it, and hope it helps.

chc · on Nov 10, 2010

I apologize for being a little bit pedantic here, but a B-tree is not the same thing as a balanced binary tree, and a SQL database is something different still. Even if the SQL database happens to use binary trees somewhere in its implementation, using that database is not at all the same as using a binary tree for your data structures.

derefr · on Nov 10, 2010

I didn't mean to imply that they were the same—just that, in practice, where a systems programmer thinks "use an in-memory binary tree data structure", an application programmer thinks "use a database", and a B-tree is what they end up using under the layers of abstraction and encapsulation. Yes, a binary tree and a B-tree are bad substitutes for one another, but you'll be much, much better off substituting a B-tree for a binary tree, or vice-versa, than substituting in, say, a linked list.

A cooking analogy: you wouldn't drink cream instead of milk—but if a recipe calls for milk and you don't have any, cream is a much better call than orange soda.

For some things (implementing a virtual filesystem, say), the app programmer will make the call to use a database, and they'll be fine, because a B-tree is just what a virtual filesystem needs. For others—storing and evaluating a Lisp AST, for example—they'll have to make O(N) roundtrips through the database and create a bunch of useless indexing cruft, but if the AST is complex enough, and if they're beta-reducing and caching their results in each parent node as they go, then this is still likely to outperform an alternative implementation that naively uses any of the other three data structures.

derefr · on Nov 10, 2010

Am I -1 points of wrong here? I made two assertions:

1. People writing code in HLLs don't really use binary trees directly;

2. When people writing code in HLLs have a problem that would be best solved with a binary tree, and they can't find a wrapped native library that internally uses binary trees, then the generic solution that 90% of developers rely on (whether it helps or not) is to use SQLite, BDB, or another library which basically acts as a complex surrogate for a tree data structure.

archangel_one · on Nov 9, 2010

I think it's simpler than that: when you have arrays, linked lists and hash tables there aren't _that_ many cases when you need a binary tree, so I guess the designers felt they weren't worth baking into the language at a syntax level. I agree it would have been nice to have in the standard library though.

chc · on Nov 9, 2010

This is sort of a "when all you have is a hammer…" situation. You don't need binary trees, but only in the same sense most of these data structures can stand in for one or more of the others. Binary trees can actually be better than hash tables in many cases where hash tables are used in mainstream scripting languages simply due to language support. An informative post on the subject: http://enfranchisedmind.com/blog/posts/problems-with-hash-ta...

jerf · on Nov 10, 2010

If you are seriously worried about the performance implications of a binary tree vs. the Python hashes, you are working in the wrong language. Whatever performance implications you are worried about are utterly dwarfed by the Python penalty.

I am not being sarcastic or cynical. There really isn't much point in really nailing some of these things to the wall in Python or Perl or Ruby or the other interpreted languages hanging out at the bottom of the shootout[1] because you've already thrown performance out the window as a consideration just by cracking open your editor and starting a new file. This is an argument to take up with languages much higher up the shootout.

[1]: http://shootout.alioth.debian.org/ , if you don't know what I mean by that. The benchmarks there may not see all and know all, but they aren't meaningless, either.

chc · on Nov 10, 2010

I really hate arguments along these lines. Python programs might not be be as fast in the average case as, say, a C program, but that doesn't mean you don't care about performance at all. Yes, I will take a several-microsecond slowdown for the convenience of using Python over C. That's a fair tradeoff, as it really isn't noticeable from my perspective — it's the difference between, like, 10% CPU utilization and 1% utilization, and still done in the blink of an eye. But the difference between 56 seconds to process some data and 0.8 seconds? That matters.

jerf · on Nov 10, 2010

I agree. But a binary tree vs. a hash table isn't going to be that difference. That's my point. Python won't excuse an O(n^3) when you should have been using an O(log n) algorithm, but the constant factors of Python are so huge they dominate a lot of the choices you'd make between two different O(log n) algorithms. You'd never notice the difference between the existing built-in Python dict and a hypothetical built-in binary tree, it would just be buried by some of the fundamentally-cache-hostile stuff Python does as it is bouncing all around memory with all the indirection it has on every line that has nothing to do with either structure.

jemfinch · on Nov 10, 2010

> But the difference between 56 seconds to process some data and 0.8 seconds? That matters.

This must be a purely theoretical argument for you, because you wouldn't be arguing this point if you were consistently analyzing data in Python and C/C++.

I routinely analyze gigabytes of aggregated data in both Python and C++, and have occasionally rewritten the Python in C++ to be faster (as an aside, for basic processing it's surprising how simple the translation is). I routinely get between 10x and 100x improvement in computation speed, and often see similar reductions in memory used.

Python is not slower than C/C++ by a few microseconds; it's slower for most operations by two or more orders of magnitude. Whether you have a dictionary implemented by a hash table or by a balanced binary tree is not going to have any significant impact on your analysis speed.

chc · on Nov 10, 2010

It's not purely theoretical — it's just not universally applicable. Even Ruby is fast enough to do most things faster than I can perceive, but I know from practical experience that there are fast and slow ways to do those things even in a "slow" language. I've tweaked Ruby algorithms before to make programs go from "OMG Ruby is so slow!" to perceptually the same as a C program (though a few tenths of a second slower in absolute terms).

What you're arguing here is that because there are some cases where pure Python is simply not fast enough, there is no point in thinking about performance at all in a Python program. It's a false dichotomy. There's a wide range of performance options between "just forget about performance altogether" and "rewrite the entire application in C++."

jemfinch · on Nov 10, 2010

> What you're arguing here is that because there are some cases where pure Python is simply not fast enough, there is no point in thinking about performance at all in a Python program. It's a false dichotomy.

No, it's a very real dichotomy. If your complexities are right (e.g., you're not using a quadratic algorithm where you could be using a linearithmic algorithm) and Python still isn't fast enough for your task, switching from a hash table to a binary tree is not going to make it fast enough.

jemfinch · on Nov 10, 2010

That's a great article, and I agree with most of it (e.g., that a balanced binary tree should be the default set/map data structure in programming languages; unfortunately it seems only C++ gets that right), but its reference to FNV as a good hash function is anachronistic.

For a more modern and much faster hash, see http://sites.google.com/site/murmurhash/ . Murmurhash3 is an order of magnitude faster than FNV: http://code.google.com/p/smhasher/wiki/MurmurHash3 .

chc · on Nov 10, 2010

The article actually just says FNV is popular, not good. The point (as I understand it) was how common bad hash functions are.

eru · on Nov 10, 2010

Haskell also gets it right. Look at Data.Map and Data.Set.

jemfinch · on Nov 10, 2010

Haskell "gets it right" because you can't write an applicative hash table with the same computational complexity as a mutable one. Haskell's option is either "ordinary balanced binary trees" or "hash table in the IO monad".

rg3 · on Nov 9, 2010

Your point about the need of certain data structures reminds me of the famous humorous text "Real programmers don't use Pascal", which said:

"As all Real Programmers know, the only useful data structure is the Array. Strings, Lists, Structures, Sets-- these are all special cases of arrays and can be treated that way just as easily without messing up your programming language with all sorts of complications."

The full text is available here and is a recommended read if you have some spare minutes: http://www.pbm.com/~lindahl/real.programmers.html

jemfinch · on Nov 9, 2010

Except for two, actually: there aren't any linked lists in Python, either.

Why? Because there's only one thing that a linked list is better at than a growable array: O(1) splice. Everything else is better with a growable array.

jedbrown · on Nov 9, 2010

jemfinch · on Nov 10, 2010

Yes, and?

Growable arrays use the same or less memory and offer random access to elements. They're not applicative, but then, if you insist on applicative data structures you're obviously not very concerned with performance anyway so there's no reason to bother with any particular data structure.

loup-vaillant · on Nov 10, 2010

The burden of proof should be on the mutable end, not of the applicative one. Purely functional data-structures are far less error prone than mutable ones. They have simpler APIs, which lead to simpler and clearer code. In other words, mutable state is the biggest source of errors. It should be avoided when possible, and insulated otherwise. http://www.loup-vaillant.fr/articles/assignment

Therefore, mutable data-structures are almost always an optimization. You shouldn't use them unless your program is demonstrably too slow for your needs. (Incidentally, the same argument applies more strongly to C itself.)

Finally, immutable data structures can sometimes be faster, for instance when you do backtracking, or keep track of undo information. With mutable data structure, you have to copy the whole thing.

jemfinch · on Nov 10, 2010

> The burden of proof should be on the mutable end, not of the applicative one.

On the contrary: the burden of proof is on the uncommon software development methodology, on the uncommon data structure, on the uncommon programming languages. Billions of lines of effective code have been written in imperative languages using mutable data structures. Such programs are what make entire industries possible today. Your love of abstract applicative elegance does not make up for the fact that in the world we live, imperative languages with mutable data structures rule. The burden of proof is not on the status quo, but on the challengers to the status quo.

As of yet, applicative programming has not borne that burden.

loup-vaillant · on Nov 11, 2010

Mutable state and manual memory management are responsible for almost all bugs I have seen. There are other classes of bugs of course, but these two far out-cost them.

Mutable state and manual memory management are mostly unnecessary. That has been proven years ago. See GHC, Xmonad, and Darcs for substantial examples. Of course, many important pieces of software have to use imperative programming, but this is the minority.

Conclusion: we should all learn functional programming techniques. And we should all use them where appropriate (meaning, most of the time). All of us. No exception. And I mean it literally.

Yet almost everyone never use functional programming. That's the reality, OK. That doesn't mean however, that what is so, should be. For the status quo to be a good thing, it must have good reasons behind it.

And I don't know of any. I know of several bad reasons though: anthropomorphism, analogies, lack of teaching, refusal to learn or to change one's mind, and the passive acceptance of the resulting network effect. Dijkstra warned us. We didn't listen. Mistakes ensued. Money was wasted. Computers were breached. People died.

I'll grant you that the lack of competence may mean that functional programming isn't the best idea for the next project. But that's no excuse for not even trying. That's no excuse for not teaching it.

Still not convinced? Then learn FP and read' Okasaki's thesis, if you haven't already. If that's not enough, then we may be in an actual, interesting disagreement. In this case, I beg you to give me more details. Debunking my essay on the assignment statement would be great. And I now operate under the Croker's rule[1].

[1]: http://sl4.org/crocker.html

jemfinch · on Nov 15, 2010

> Mutable state and manual memory management are responsible for almost all bugs I have seen.

I don't think this proves what you seem to think it proves. Mutable state is pervasive in most software written today. Almost all bugs can be classified as bugs of mutable state in one way or another if you so insist. In other words, your statement is akin to saying, "Computers are responsible for almost all bugs I have seen." Yes, they are. No, it's not useful to say they are.

> Mutable state and manual memory management are mostly unnecessary.

This depends on how you define "unnecessary". There is still no applicative, constant-time finite map.

> That has been proven years ago. See GHC, Xmonad, and Darcs for substantial examples.

Programs have been written in every kind of programming language with every kind of programming methodology. The existence of a few examples of such programs is not a compelling demonstration, let alone proof, of the usefulness of a paradigm. The plural of anecdote is not data.

> Of course, many important pieces of software have to use imperative programming

All software has to use imperative programming at some level. Purely applicative programs can do nothing. The question is not whether the state of the world is modified, but how it is modified and if those modifications should be somehow segregated from other parts of the program.

> Conclusion: we should all learn functional programming techniques. And we should all use them where appropriate (meaning, most of the time). All of us. No exception.

These sound like the ravings of an ideologue, not someone who's thought through the issues.

> For the status quo to be a good thing, it must have good reasons behind it.

There are several fundamentally good reasons why imperative programming is the status quo and applicative programming is not:

1. Imperative programming maps with significantly less impedance to the actual machines the code runs on, leading to a simpler and more comprehensible performance model for imperative programs.

2. Imperative programming does not require of programmers the same mathematical expertise and intuition that applicative programming requires. Whatever your blue sky hopes, the mass of programmers writing useful code today do not have the intellect or education to be as effective with applicative languages as they are with imperative languages. Programmers intuitively understand the concept of telling a computer what to do; far fewer can construct bodies side-effect-free functions expressing their instructions to an abstract mathematical reduction machine.

3. The problems we wish to solve in the Real World are largely imperative problems. Imperative programming maps more easily to these problems of command and control than applicative programming.

> Dijkstra warned us. We didn't listen.

Dijkstra also would have been the first to oppose to your applicative ideology. He considered applicative programming to be abstraction inversion, inefficient, and unnecessary. He would argue (with Hoare) that reasoning about programs is as easy with an appropriate imperative development methodology as with applicative methods.

> Still not convinced? Then learn FP

I've already learned FP. I've written tens of thousands of lines of code in OCaml. I've implemented Lisp interpreters. I understand monads and call/cc. The fact remains, when I need to get work done, imperative programming is faster and easier than applicative programming. The language I'm designing for my own edification is not an applicative language because I don't believe that side effects are the fundamental constraint on programming in the large (in fact, I would argue that side effects are required for programming in the large), and I recognize that real programming requires side effects to accurately model the real problems that real programs solve.

As for your essay on assignment, you're not entirely wrong, but your solution is misguided. Okasaki may have shown that in many cases you can implement applicatively a data structure similar to the common mutable one, and even (much of the time) get the same complexity bounds for that data structure. Both of you completely gloss over the constant factors, though, and those constant factors are massive, both in terms of the processing requirements for accurate garbage collection and in terms of the memory inefficiency induced by all the page faults applicative data structures' indirections require.

The real problem with programming today is aliasing.

loup-vaillant · on Nov 16, 2010

Thank you. I really appreciate that you took your time to answer.

> All software has to use imperative programming at some level. Purely applicative programs can do nothing. The question is not whether the state of the world is modified, but how it is modified and if those modifications should be somehow segregated from other parts of the program.

We probably misunderstood each other, because I completely agree with that point. Now my point is that effect aren't segregated well enough. And that anyone who knows FP will have a better grasp of state, and will be able to segregate it more properly. I'm not against mutability. I'm against pervasive, rampant, unsegregated mutability, which I see way too much at work.

About your good reasons… I'd say reason 1, while still important, is invoked too often. It looks to me like premature optimization justify the use of verbose, unsafe, but familiar, techniques and languages. Reason 2 is basically the lack of teaching I was talking about. I don't know how to fix it, I just believe it should be. I partly deny reason 3. Command and control is of course best expressed with imperative programming, but no problem I saw is limited to that. Most problems have huge "pure" chunks, which are easily expressed applicatively. And in the large… I think modules developed by different teams should have explicit and small interfaces (if only to reduce communication overhead). Making those interfaces functionally pure where possible looks like the easiest way to achieve that. (Note that imperative inner cores don't bother me much, as long as the outer shells are kept pure.)

Apparently, I haven't read Dijkstra enough. I'll check it out.

It looks like I could learn from your code (you're definitely more experienced than I am). Could you show some you think I should look at?

> The real problem with programming today is aliasing.

Wait a minute, this sounds extremely interesting. I could say that the substitutability provided by purity makes that point moot, but I may have overlooked other problems (or solutions). Do you have a few pointers about that?

jemfinch · on Nov 18, 2010

I'll come back and reply more thoroughly later, but to give you some interesting reading in the meantime, I figured I'd drop you a link: http://www.cse.ohio-state.edu/rsrg/documents/2009/09PHABKW.p... . The citations in that paper will be somewhat interesting as well.

loup-vaillant · on Nov 18, 2010

Reading the paper. Really interesting, I'm learning something new there. You can assume I've read it thoroughly when you reply.

eru · on Nov 10, 2010

Indeed. The lack of persistent data structures in a Python is a much bigger barrier to a functional coding style than the lack of tail call optimization.

You can mostly get around the missing tail call optimization by using combinators and avoiding naked recursion---a good idea anyway. And then implement your combinators, say map or foldr, with a loop. It's ugly, but nobody has to see it.

There's no real way around the ephemeral nature of python lists and dicts---short of implementing your own versions. To say something positive: Strings, numbers and tuples are immutable in Python, though.

jedbrown · on Nov 10, 2010

The dynamic array is only amortized O(1) instead of guaranteed O(1) (do you have real-time constraints?), insertion mid-list/queue is always O(n) instead of O(1) with an iterator for the linked list variant. I was just disagreeing with "everything else is better with a growable array", the array implementation is indeed better for many common use cases, but splice is not the only exception.

jemfinch · on Nov 10, 2010

Your point about amortization is good.

Insertion mid-list is just splice with a singleton; I don't bother mentioning it separately.

Those really are the only two advantages of linked lists over growable arrays.

jedbrown · on Nov 11, 2010

Okay, how about removal mid-list with an iterator? Technically that could also be called "splice" after taking sublists, but that makes the term so broad as to be almost useless.

acqq · on Nov 10, 2010

Still, Pike writes from the perspective of a C programmer. Having the pointers and 1-1 mapping between the types and the memory layout of the can combine the principles of the structures in something bigger and more complex. For example, the tree with where some elements are also linked in different lists. I'm not saying that you can't have some logical equivalent in higher languages, but you can't have that expressiveness and efficiency.

yason · on Nov 9, 2010

I agree with about everything but the best part is the function pointer chapter.

Function pointers are an insanely novel concept. They're not even a C thing per se but since C has first-class pointers they fit right it.

Nevertheless, most design patterns, closures, higher-order functions, more complex dispatching, etc. are effectively just a function pointer and possibly some userdata. You can think in Lisp but do it all in C with function pointers.

eru · on Nov 10, 2010

I agree that the paragraphs about tables driven programs and function pointers are the best part. You can read "Combining datadriven programs with function pointers leads to an astonishingly expressive way of working, a way that, in my experience, has often led to pleasant surprises." as much as an endorsement of functional languages than the endorsement of OO-like techniques it was meant to.

Though C is not really adequate to do Lisp-like programming. Because you can't capture free variables with C function pointers. Of course you can get around this restriction with lambda lifting [1], but why would you want to do the job of the compiler?

As an example, it is quite hard to encode function composition in C, i.e. write a function that takes two function and returns their composition. Or the same specification in (non-idiomatic [2]) Haskell:

  composition :: (b -> c) -> (a -> b) -> a -> c
  composition f g =
      let h a = f (g a)
      in h

I don't know of any way to make up new functions / function pointers in C at runtime.

[1] http://en.wikipedia.org/wiki/Lambda_lifting [2] Idiomatic would be

  composition :: (b -> c) -> (a -> b) -> a -> c
  composition f g a = f (g a)

but that's harder to understand for a C programmer who's not familiar with currying. Of course Haskell also has composition as a build-in.

Or in Scheme (sorry, I am getting carried away with this):

  (define ((composition f g) a)
    (f (g a)))

viraptor · on Nov 10, 2010

> They're not even a C thing per se

Really? from stdlib.h:

    void
    qsort(void *base, size_t nel, size_t width,
        int (*compar)(const void *, const void *));

yason · on Nov 10, 2010

Not even a C thing per se, as in "C has function pointers but it's not really a C specific feature as some other languages have them too. Like assembly, very obviously."

Many other languages don't. If they're sufficiently high-level they might have function objects or closures, which often cover everything you want. If they're Java they have neither but you can write a Visitor-style class or some other scheme from the design patterns book that emulates the behaviour of a function pointer.

But fundamentally (and when the program is compiled down to machine code also hopefully) they're just function pointers.

eru · on Nov 10, 2010

I agree. Though what it gets compiled down to depends on the compilation strategy. If your compiler is smart enough, and your use of functions as variables / function pointers is weak [1] enough, your functions may, say, just get inlined into a big case-statement-like construct.

[1] In the sense of http://en.wikipedia.org/wiki/Strength_reduction

contol-m · on Nov 9, 2010

There is a lot more in his book, "The Practice of Programming" - http://cm.bell-labs.com/cm/cs/tpop/toc.html

wtracy · on Nov 9, 2010

"Thus I say maxphysaddr (not MaximumPhysicalAddress) for a global variable, but np not NodePointer for a pointer locally defined and used."

Yes, because "maxphysaddr" is extremely readable for people who are not native English speakers.

Wait, was that "maxiphysadd" or "maxmphyaddrss"? Better stop coding (breaking up my train of thought) and scroll back up to the variable declaration to double-check how I spelled it.

(Yes, I know that modern IDEs can remember variable names for you, but there's always those emergency situations where you wind up trying to debug something with VIM over SSH via a slow-as-molasses hotel wifi service.)

drv · on Nov 9, 2010

I don't have any first-hand experience with the non-native English aspect, but I doubt such a person can remember how to spell MaximumPhysicalAddress without looking either.

If you stick with a consistent mapping of terms to shortened version ("maximum" always becomes "max") throughout a code base, these names read (and can be typed) much more smoothly than the 30-character monstrosities common in big frameworks.

tedunangst · on Nov 9, 2010

If you can't remember maxphysaddr vs maxiphysadd, I doubt you'll do much better remembering MaxPhysAddr vs MaxiPhysAdd.

chc · on Nov 9, 2010

MaxPhysicalAddress is more readable than all the above. Pike simply had it wrong, IMO, probably from having been immersed in a culture of poor readability for so long that he had become blind to some of it.

caf · on Nov 9, 2010

I agree with Pike on this one - and I think it's likely true what he says about being used to reading prose. For those of us who do a lot of non-programming reading, Funny Out-Of-Place Capitals interrupt the smooth flow when reading variable names.

I do_tend_to_use underscores, though - I find that they don't have the semiotic noise value as WeirdCapitals.

chc · on Nov 9, 2010

I find the opposite — underscores are hard to read, but innercaps are almost effortless. At any rate, they're equivalent — you can mechanically transform from one to the other without data loss. I would say that either is preferable to not delimiting your words at all. It sounds to me like you agree with me more than Pike, modulo an insignificant matter of taste.

tedunangst · on Nov 9, 2010

In 1989, external symbols were (possibly) only unique wrt the first 6 letters. Trying to guess if anybody else would name a variable MaxPhy* isn't a fun game...

chc · on Nov 9, 2010

That's true, but Pike didn't explain it that way. He wasn't talking about compatibility, but readability. That part of the article was explaining that longer and more complicated is not necessarily better in terms of readability, and there is a sweet spot where something is concise but not overabbreviated. That's all obviously true. Then he suggested that "maxphysaddr" is in this sweet spot for the idea of the highest point in physical memory that can be addressed, which I think is wrong.

greyman · on Nov 10, 2010

I program in C++ and not C, but in my opinion, unless we are talking about a very small project, it is even wrong to have a global variable, not to mention naming it maxphysaddr! If it is some utility with 50 lines of code, then ok, but otherwise it's a maintenance nightmare to have such names.

As a local variable, it is ok to have "ii" for loops or "p" for pointers, though.

eru · on Nov 10, 2010

Global variables are semi-OK as long as you don't mutate them.

Lots of project have global functions and global classes, and nobody cares.

cma · on Nov 10, 2010

ctrl-x p

basman · on Nov 10, 2010

The rule against nested includes seems dated — he doesn't like include guards in header files because "[t]he result is often thousands of needless lines of code passing through the lexical analyzer, which is (in good compilers) the most expensive phase." Surely not any more?

cdavid · on Nov 10, 2010

Indeed. I have read that parsing is the slowest part of compilation several times, but I guess this is very dated. For example, compiling numpy (a python package with ~ 100 kLOC of C code):

  - CFLAGS="-O0" -> ~ 20 sec
  - CFLAGS="-O1" -> ~ 35 sec
  - CFLAGS="-O3" -> ~ 60 sec

Since the amount of parsed code for those modes is approximately the same (for numpy it is exactly the same, but one could imagine differences in glibc headers, etc...), this shows that parsing is not the bottleneck anymore if you build non-debug builds.

mfukar · on Nov 10, 2010

No, that shows that the optimizer work takes a non-trivial amount of time. Optimizations (how often?) happen after parsing, type-checking and constructing the intermediate representation.

cdavid · on Nov 10, 2010

sure, that's exactly the point I am trying to make: if it takes 75 % time more to compile @ O1 than O0, it means that parsing takes at most ~ 60 % of compilation time for quite conservative optimization. I think it is quite common to compile above the lowest level of optimization.

To have an accurate estimation, one could look at clang, and only run the tokenizer.

gte910h · on Nov 10, 2010

Yeah, the parsing argument is bunk anyhow.

The pre-processor parsing is trivial and fast. Its the actual language which is the bulk of the time.

kqueue · on Nov 9, 2010

I disagree with him on the include part. It is very difficult to remember the order of includes + dependencies for each file you want to include. It becomes very messy.

__david__ · on Nov 10, 2010

I agree. I have better things to do than to memorize the random prerequisites of header files. Not to mention cluttering my C source with junk that isn't important to the source in that particular file. My rule is that I should be able to make an empty C file and put a single #include "header.h" line. If it doesn't compile then there is a bug in that .h file.

carefulfungi · on Nov 10, 2010

One also sees the common advice to include the paired header (ie, foo.c should include foo.h) as the first #include to verify that foo.h is dependency-complete.

The problem with dependency-complete headers is that once a .c file relies on that header for a symbol, if the header is changed to no longer require the symbol, all those .c files have to have their #includes fixed. This can take a lot of time to iterate through in a projects that require hours to compile.

gte910h · on Nov 10, 2010

Notice the date folks: 1989.

Lots of development has gone into C for the better since then.