The Design Of SQLite4 (work in progress)

bane · on June 27, 2012

Sqlite is really just one of those rare software projects that simply makes the world a better place. No more thinking about file storage formats, searching and indexing strategies, long term file readability (it's almost as future proofed as ASCII), extensible, flexible, fast and all around awesome.

The number of projects where sqlite turned a difficult project from impossible or very hard to doable is almost uncountable.

Here's a fun use case. Generate sqlite databases for each user, let them do whatever they want in that database from whatever interface you are letting them use. Store the sqlite file as a binary blob in a "real" database with associated metadata for fast look up and retrieval, plus whatever other things your database gives you (replication, redundancy etc.) And now you basically get sand boxed work spaces for your users almost for free.

fusiongyro · on June 28, 2012

It's a clever idea. Please don't do it.

SQLite has one very limited write scenario: one writer at a time. Your real database isn't going to understand that. The real database will happily hand out the inner SQLite database to multiple connections. At best, you'll be converting SQLite's locking into a bunch of transaction failures for conflicting updates; at worst, you'll have a race condition where only the last of several edits gets persisted.

On top of that you have the disk churn associated with rewriting the database on every change. Suppose your SQLite database is 20 MB—every write to that means 20 MB from the real database to your application and back, which means 20 MB of disk thrashing too. Guess how many users need to do this before your database is spending all day rewriting essentially unchanged bytes. Depending on your storage backend, you can't even necessarily expect that the old ones are overwritten—with MVCC, the database may wind up just allocating new blocks, in which case you're creating a lot of garbage collection work and wasting a lot of space.

Every real database has a facility to let you isolate schemas from each other. In MySQL, you can create separate databases for each user; in PostgreSQL you can do that or just create separate schemas. There are better solutions to this problem that don't have the drawbacks identified above. It's a cute idea, but in practice I don't think it will work out at all well.

bane · on June 28, 2012

Oh it's clearly some of the best/worst kind of mad-scientist type of madness.

But it can often be tolerably handled with a row-level flag in the outer-database to handle "locking" the database if you are operating in a scanrio where that makes sense.

There are other scenarios where users simply aren't writing anything to the inner SQLite databases, but you are simply using it as a storage medium. Imagine getting a SQLite database of publicly accessible government data instead of a csv dump or some unparsable XML file. Some organizations have already caught on to the idea, the NTSB for example gives out MS-Access databases of preliminary crash investigations.

I just think it's a crazy fun hacker-style idea that I've seen from time-to-time as a replacement for even harder to work with custom binary file formats stuffed into a relational database (don't ask, there are oddball projects where that kind of thing gets specified -- like certain high-end modeling systems where the user might check out a model to work on in isolation of the parent system, the data might be stored as SQLite databases instead of some proprietary bit bucket, and then uploaded back to the server in total where the model data is yanked out for server side processing).

The upside is that you can replace two sets of tools (relational tools and custom binary tooling) for working with data with one set of tools (just the relational) which ends up clarifying things quite a bit.

The above scenario, where isolating the user's datapool so that they can take it offline, do work, then check it back in to the system is almost absurdly simple.

But honestly, there's a very very limited number of use-cases where this makes sense and doing something more sane makes much more sense -- but when it works, it's ends up being one of those "so stupid it's brilliant" things.

fusiongyro · on June 28, 2012

> But it can often be tolerably handled with a row-level flag in the outer-database to handle "locking" the database if you are operating in a scanrio where that makes sense.

You do that and you're just trading one race condition for another.

> There are other scenarios where users simply aren't writing anything to the inner SQLite databases, but you are simply using it as a storage medium.

It's a pretty labor-intensive method. Many databases have other ways of introducing unstructured data that aren't going to murder performance. PostgreSQL's hstore, for example.

> the user might check out a model to work on in isolation of the parent system, the data might be stored as SQLite databases instead of some proprietary bit bucket, and then uploaded back to the server in total where the model data is yanked out for server side processing

I have a friend who has used SQLite as an interchange format. It worked reasonably well, but he also didn't have to worry much about schema versions. Still, you can do that without making your backend pay the penalty.

> The above scenario, where isolating the user's datapool so that they can take it offline, do work, then check it back in to the system is almost absurdly simple.

It's still more work than using a file system for no advantage. Just having it in the database doesn't necessarily constitute a win—most DBAs would suggest you weigh the filesystem alternative against dumping files into BLOBs in the database before deciding which way to go.

There will always be exceptions to any hard and fast rule, but I don't think the ones you've described above warrant this design.

lloeki · on June 28, 2012

> You do that and you're just trading one race condition for another.

    SELECT pk, sqlite, version FROM foo;
    -- later
    UPDATE foo SET sqlite='myblob', version=43 WHERE pk = whatever and version = 42;
    -- if zero row update, there's a conflict

Those are not the race conditions you are looking for.

Terrible design though.

Purely a thought experiment but I'm wondering if a Postgres SQLite data type (similar to the JSON data type) could be workable.

fusiongyro · on June 28, 2012

It's true. Hibernate even has special support for this ("row versioning") which I should have remembered. This isn't a race condition, but it is foisting more work on the developer that the database is already prepared to do (albeit in a different manner).

The advantage to having a SQLite datatype would be that you could do the kinds of things you can do with the XML data type (extracting information in the outer layer for querying, indexing or simply ensuring validity) but you're still going to thrash the disk with every little change. And once you've committed to Postgres, you may as well use its own solution to this problem, schemas.

neeleshs · on June 28, 2012

Agree. I have seen similar irritating problems in the past with huge JSON blobs stored in a real database and application acting upon the JSON. The concurrency issues on writing this blob and the contention was terrible.

bane · on June 28, 2012

I've seen a bunch of similar ones where they just dumped XML blobs in as well. It's a terribly common usage pattern that doesn't work well in the general sense, but can work in some specific scenarios. It's a shame databases don't provide better support for this kind of pattern.

fusiongyro · on June 28, 2012

Postgres has a nice XML type. The advantage to doing this is at least you can rely on the database to guarantee well-formedness and use XPath functions to do things like query the XML from SQL and build indexes on stuff buried within.

Postgres 9.2 will have JSON support as well, but lacking accompanying functions it will be of limited utility.

gouranga · on June 28, 2012

That's a filesystem isn't it (not NFS as locks don't work properly on NFS filesystems)?

1. One per user - in home directory. Check.

2. Real database with metadata - filesystem. Check.

3. Replication - filesystem (block replication). Check.

4. Redundancy - filesystem (RAID). Check.

So technically it already does that without any need for further abstraction. You can use UNIX perms to set it read only for all and read-write for user, then you can share it :)

For reference, I've built a HUGE amount of stuff with sqlite3 - you are right it does make the world a better place! We use it for:

1. unit test back end for NHibernate as it's fast and in memory.

2. issue/tickets/wiki (trac)

3. data processing.

4. archival datasets.

5. general swiss army knife when you can't be bothered to do it in SSIS.

malkia · on June 28, 2012

Perfect replacement for what you usually have in game dev studios - lots of small loose files (we have about 380,000 of those) - materials, textures, shaders, models, animations, etc.

No easier format than just reusing SQLite.

netshade · on June 28, 2012

Depending on the goal you're trying to achieve, it could be a great tool, but I wouldn't diss the filesystem too much. You lose the discoverability of all your fs tools, a whole suite of APIs no longer becomes available (async IO, say), and you throw another access layer in front of your data w/ a corresponding performance drop.

Not to say that SQLite wouldn't offer lots of advantages, but I'm not sure I'd be in a huge rush to start packing all my game assets into a db unless I had some pretty convincing wins. I'd be much more likely to stick w/ the 380k in development and then pack into contiguous level binfiles before shipping. ( I'm totally armchair coding though, I admit :) )

ahi · on June 27, 2012

I am interested in more details/criticisms of the above use case. Better approaches? or "Holy balls, that's idiocy!"

simcop2387 · on June 27, 2012

I would hope it would be simpler to manage via some kind of clustered filesystem instead nested databases.

columbo · on June 28, 2012

Hrm... an online application perhaps like autocad where you have hundreds/thousands of bits of information for every user's "file"?

Instead of saving all that data in xml/json you could keep it in a database. This would also allow you to release in versions of the main software by keeping older users and their files on the previous version.

Even as I type this I start to see massive alarm bells but it was something that came to mind.

Another option is an external application (like survey monkey) that publishes a sqlite file that you could then use in your system to store data for that external form without modifying your database... again much pain could be had down this road.

bane · on June 28, 2012

That's almost exactly the use case I've seen it used for. A custom modeling software with collaboration features. Modellers built models in local client software which stored the model as a SQlite database and u/l them to a central MySQL database for distribution.

When changes were made to the model, a new one was simply uploaded and a flag was set in the old one to take it off line (or was it only the newer versions were available unless the user requested an older one...I can't remember to be honest).

The metadata about the model gave details like the user who created it, dates and times of upload, name of the model, various key words about the model etc. "Replacing" an older version with a newer version just involved uploading the new version into a new row in the database and copying over a dozen or so meta-data fields.

Once the model user downloaded a model, it was as easy to work with on their end with a variety of tools as new tooling simply had to be written to the schema not to some bizarro binary format or parse some gargantuan text file into a data structure. Plus elements in the model were already indexed in the SQLite file so internal searches for model elements was trivial.

There simply wasn't a case where two users would be writing to the same SQLite inner database, but dozens could read them without problem as that was all handled by the outer database, and it preserved old versions of the models without issue.

It also meant migrating to a better outer database meant only migrating a fairly simple meta-data schema and extracting out the SQLite database then just writing it all back into the new database. It took a few days to migrate a few TB but a very short time to write all the migration scripts...an hour or so.

The plus side is a modeler could take their model off-line with them and work on them at their leisure, then upload them when they came back into work.

I was never personally involved in the project, but thought it was one of the craziest MacGuyver-esque things I'd ever seen. But once I stopped thinking about the SQLite inner databases as databases and more as just file blobs with a convenient to work with standard format it all made sense. The outer database was never responsible for searching/reading/writing the inner databases, that all happened at the client end.

mark_story · on June 28, 2012

I really hope you never have to do a schema migration. That's going to be hell with your matryoshka database.

DrJokepu · on June 27, 2012

If you do this that would mean that you could have no shared locks safely on the SQLite database while you have a reserved lock, that is, if one process is planning to write to the database, other processes cannot even read it. Even then, you would have to be very very careful if you want to avoid losing data.

Basically, unless you can (and willing to) write your own SQLite VFS that is designed to handle this safely, I wouldn't do it.

zacharyvoase · on June 27, 2012

As long as SQLite4 preserves the level of testing we've seen in SQLite3, I’ll be extremely happy: http://www.sqlite.org/testing.html

swah · on June 27, 2012

Sidenote: this flat file organization (http://www.sqlite.org/src4/dir?name=src) gives me a wonderful feel about the design and size of the project, although it probably makes Java programmers cringe.

gojomo · on June 27, 2012

"The default built-in storage engine is a log-structured merge database. It is very fast, faster than LevelDB, supports nested transactions, and stores all content in a single disk file. Future versions of SQLite4 might also include a built-in B-Tree storage engine. "

Faster than LevelDB, and in a single file? Interesting.

swah · on June 27, 2012

Yeah, also I thought that comparison was a little out of place...

Splenivore · on June 27, 2012

In all fairness, the LevelDB guys made the comparison first: http://leveldb.googlecode.com/svn/trunk/doc/benchmark.html

Maybe outperforming LevelDB is a design goal for SQLite 4?

rozim · on June 28, 2012

The leveldb page shows mixed results wrt sqlite3 - I would say that it is generally faster that sqlite3 based on the results. There is no comparison to sqlite4.

Most notably in large value operations, and also slightly in random reads, sqlite3 wins. In in other benchmarks however, leveldb is faster.

The leveldb page notes that they perform poorly with large values due to the increased I/O that is done. The relatively poor random read performance isn't discussed but my guess is that it's due to having to fetch a compressed page of data that contains values for other keys, thus leveldb is doing "more work" than strictly necessary.

wvenable · on June 27, 2012

The thing that really bugs me with SQLite3 is the inability to modify tables. It's not much of a problem unless you have foreign key constraints and then it's a nightmare. Ultimately I resorted to just recreating the entire database from a script every time I had to make a change. If this is "fixed" in SQLite4 then I'm sold.

I'm surprised that there aren't any real binary integers stored anymore but I can see how that would reduce complexity compared to how numbers were handled in SQLite3.

rogerbinns · on June 27, 2012

You can alter tables to add columns. Doing more complex modifications would require SQLite internally to do pretty much what you do - recreating a new database from existing content.

wvenable · on June 27, 2012

I'd love if it just created a new table, moved the data over, and dropped the original table while maintaining the foreign key relationships. That would be perfect.

If you have foreign key relationships referencing the table, you can't drop the table. You can't remove the foreign key relationships without modifying the related table. But you can't modify that related table, you can only drop it. But you can't drop that table because of other foreign key relationships...

premchai21 · on June 27, 2012

But, last I checked, you could PRAGMA foreign_keys=off temporarily, which would also inhibit the automatic handling of foreign key constraints in other tables when the target table is renamed or dropped. Then you can create a new table, populate it, drop the old one, rename the new one to the old name, and turn foreign_keys back on, and the constraints will now target the new table.

wvenable · on June 27, 2012

Does that actually work for dropping and renaming tables? I thought I tried that.

rogerbinns · on June 28, 2012

In any event you can get the foreign keys from sqlite_master, delete them during the reorganisation and then put them back afterwards.

premchai21 · on June 28, 2012

It does in 3.7.13 (and presumably older versions, but I don't have a solid record of this immediately available).

EzGraphs · on June 27, 2012

SQLite3 is great. Had a few questions that were not directly addressed on the linked page. 1) Will the SQL implementation be identical between SQL3 and SQL4? 2) Does "I/O is direct to disk" indicate the same sort of locking that exists in SQLite3 (the database is locked DML which makes its application as a web server db somewhat limited).

My guess is yes to both...

cryptonector · on June 29, 2012

I want all sorts of things in SQLite4, but let's start by saying that usign a log-structured DB format is very welcomed news.

I'd also like the VFS to provide the page cache, so that we could use mmap(2) windows instead of read(2) and pread(2), which should result in fewer data copies and better performance.

Everything else I'd like to see in SQLite4 is icing on the cake.

spdegabrielle · on June 27, 2012

I'm a big fan of SQLite3, and fossil-scm is my weapon of choice for solo projects. '4 is looking good.

zvrba · on June 28, 2012

> SQLite4 does all numeric computations using decimal arithmetic. SQLite4 never uses C datatypes double or float (except in interface routines when converting between double and the internal decimal representation).

Is it only me who thinks this is a bad idea? Suddenly, what the DB computes for you won't be bit-for-bit identical with what your C program would have computed using IEEE arithmetic. Also, Goldberg has a number of slides showing that (counterintuitively), extra precision and multiple roundings produce LESS accurate results.

AnIrishDuck · on June 28, 2012

Yeah, this section of the design document made me do a double-take. The problems with this aren't just limited to the impedance mismatch; it also seems that arithmetic will be a ton slower due to lack of dedicated hardware.

j_s · on June 28, 2012

Nearly-on-topic warning: Android automatically deletes corrupt SQLite files http://stackoverflow.com/questions/7764943

rogerbinns · on June 27, 2012

Not enough of a change from the existing SQLite 3.

What I'd like to see most is trivial interoperability with JSON. Almost all the code I write that goes near SQLite has a trip through JSON as well and it gets tedious mapping between them. This also means being able to support things like a value being a list which Postgres can do.

Other than that a way of doing offline mode and online synchronization builtin. CouchDB tried to address this space, but wasn't practical to integrate into most products.

gcp · on June 28, 2012

This really sounds like the kind of thing that is perfectly doable outside the SQLite 3 core.

rogerbinns · on June 28, 2012

Which bit? Be able to do lists and a flexible schema would have to be part of the core.

Being able to do synchronization can be done outside of the core, but is fiddly and error prone. It is far better done as a fundamental part of the database.

edwinyzh · on June 28, 2012

Good news. Sqlite3 has already been great. It's also good to read that the API will not be changed much, it's very important.

pdw · on June 27, 2012

I wonder how this pluggable storage architecture relates to the SQLite/BerkeleyDB hybrid that Oracle created a while back.

ArbitraryLimits · on June 27, 2012

TL;DR All replacements for fopen() grow until they span multiple files.

dchest · on June 27, 2012

Why multiple files?

The (default) on-disk database image is a single disk file with a well-documented and stable file format, making the SQLite4 library suitable for use as an application file format.

rogerbinns · on June 28, 2012

Incidentally SQLite isn't actually a single file. The main database is. Transactions in progress and the WAL are stored in other files. If you copied just the main file at an inopportune time then it would be invalid. You have to keep the main file and the journal/WAL together.

dchest · on June 28, 2012

Journal is a temporary file created when writing into database, which you can turn off.

http://www.sqlite.org/pragma.html#pragma_journal_mode

WAL mode is a different beast specifically not intended to be a replacement for application format database, and it's not used by default.

So, no, SQLite is a single-file database. It provides more options than that, but the statement is true.

rogerbinns · on June 28, 2012

You can't turn off the journal unless you are happy to lose data in which case don't bother with a database. See section 1.3 of http://www.sqlite.org/howtocorrupt.html

dchest · on June 28, 2012

There are many more reasons to use single-file databases other than safety of file write operations.

TextEdit in OS X also creates a temporary file when it saves a document, and then atomically renames it to the original filename; would you claim that RTF document consists of two files in this case?

rogerbinns · on July 2, 2012

Not even a close analogy as the text file is always valid.

More accurate would be changes being done by modifying blocks of the original file as well as appends on the end so that it is not valid by itself, and storing enough information to make it valid/undo in a second file.