PostgreSQL Data Migration Tips (tilt.com)
79 points by throughnothing on March 19, 2015 | 21 comments


It was bugging me that the "rosser" link was not pointing to the comment where he shared his bulk-update technique, so I dug it up for anyone else interested:

https://news.ycombinator.com/item?id=9018756


Author here. Great point, I just updated this link to point directly to his comment rather than his HN profile.


  FOR UPDATE NOWAIT immediately locks the rows being retrieved in the first step (as if they were to be updated)
Technically, "FOR UPDATE" only attempts to lock the retrieved rows, and "NOWAIT" instructs Postgres, in the case that another transaction already holds a lock on a row, to raise an error immediately instead of the default behavior of waiting for the lock to become available.

http://www.postgresql.org/docs/9.4/static/sql-select.html
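A minimal two-session sketch of the difference, assuming a hypothetical `users` table (the error message is the one Postgres raises for `lock_not_available`):

  -- Session 1: take row locks and hold them.
  BEGIN;
  SELECT * FROM users WHERE id <= 100 FOR UPDATE;

  -- Session 2: the default form blocks until session 1 commits...
  SELECT * FROM users WHERE id <= 100 FOR UPDATE;

  -- ...while NOWAIT fails immediately with
  -- "ERROR: could not obtain lock on row in relation "users"",
  -- letting the caller catch the error and retry.
  SELECT * FROM users WHERE id <= 100 FOR UPDATE NOWAIT;

  -- Plain reads are never blocked by FOR UPDATE locks:
  SELECT * FROM users WHERE id <= 100;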


That's a good clarification. For a data migration, I think it's better to error out while acquiring the row locks and then retry, rather than waiting indefinitely, but YMMV.


I'm not sure I understand the purpose of the loop in the last example. AFAIK top-level plpgsql statements (including DO blocks run in psql) execute in a single transaction, so it seems like you end up slowly locking the entire table, as the transaction won't commit until the loop completes. (I learned this the hard way by trying to "batch"-update a table with tens of millions of rows in production.)

The normal way to handle batch updates is to perform the loop outside of postgresql, so that each batch is in its own transaction.
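A sketch of that pattern, with hypothetical table and column names: the client runs this statement repeatedly until it reports 0 rows updated, so each batch commits (and releases its locks) before the next begins.

  -- Run from the client in a loop until "UPDATE 0".
  BEGIN;
  UPDATE users
     SET migrated = true
   WHERE id IN (
           SELECT id
             FROM users
            WHERE migrated = false
            ORDER BY id
            LIMIT 1000
         );
  COMMIT;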


  AFAIK top-level plpgsql statements (including DO blocks run in psql) execute in a single transaction
This is true according to the docs. Anonymous code blocks are "transient anonymous functions", and functions are executed within a transaction.

  it seems like you end up slowly locking the entire table
The selected rows would be locked for update, delete, and select for updates, but not for regular reads. Perhaps his users table is used primarily for reads, which made this command run with negligible consequences?

http://www.postgresql.org/docs/9.4/static/sql-do.html

http://www.postgresql.org/docs/9.4/static/plpgsql-structure....


Author here. That's a great point bhahn, I just updated my gist to properly handle the case that you just outlined:

https://gist.github.com/aanari/349c7d97ed50c6f69930#file-bat...

By creating a separate function for the locking and updating of rows, we ensure that the `BEGIN/END` transaction is handled per iteration rather than at the very end, so we only lock rows while they are being processed. Since Postgres does not support nested transaction blocks, calling a defined function from within an anonymous function block seemed to be the easiest and clearest path to achieve this.


https://gist.github.com/aanari/349c7d97ed50c6f69930#file-bat...

The anonymous block is implicitly called in a transaction, so all the calls to batch_at_will will be executed in the parent transaction; there's no way around this (except for using dblink, but that's pretty smelly imo).

The only way to batch update while only locking rows in the batch is to run the loop outside of postgres like OP suggested.


I like how SQL is becoming cool again.


Did it ever really go out of style?


Not for me. I've used ORMs for the simplest of CRUD operations from web apps but couldn't imagine ever working without SQL for everything else.


To be fair, when it comes to data consistency, especially on structured DBs in high-load environments, all-SQL solutions are usually better than, say, client-side scripts wrapped in transactions.


Bone to pick with the CTE that archives users. I'd rather see it insert them into the archive table first, then delete them. Not much of a difference, but my instinct is to cover all possible failures, and be especially careful around deleting rows.

It also means that if, god forbid, it did die halfway through and PG isn't smart enough to pick up where it left off safely, you won't lose any data; at worst you'd end up with a duplicate archived row (easy enough to catch with some maintenance scripts and origin ids).
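A sketch of that ordering with a writable CTE, assuming hypothetical `users` / `users_archive` tables and a `deactivated_at` column: the INSERT happens in the CTE, and the DELETE only touches rows the CTE actually returned.

  WITH archived AS (
      INSERT INTO users_archive
           SELECT * FROM users
            WHERE deactivated_at IS NOT NULL
        RETURNING id
  )
  DELETE FROM users
   USING archived
   WHERE users.id = archived.id;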


Should be a transaction. I bet that if you could find a way to corrupt data on safe hardware with this query and a power-plug test, the pg developers would treat it as a high-priority bug.


That's correct; as jpitz mentioned, if the code is run inside a transaction block, we don't have to worry about a failing DELETE stranding the INSERTed rows, since the whole transaction rolls back together.


The transaction is implicit - in fact, I don't know of a way to do this outside the scope of a transaction. Is there?


My instinct here is to write the query in such a way that even if pg changes in the future, or if I migrate to an almost compatible alternative tomorrow, it should work the same way. Perhaps more importantly, it sends the message to future devs (including me) that rows should only be deleted after they've already been archived. It also covers that 0.000000001% chance that everything else went wrong and the intern did it manually and managed to screw it up.

In the end, queries are code, and code is our way to communicate our intent to the next developer, so it's better to do the delete after the insert.


> My instinct here is to write the query in such a way that even if pg changes in the future, or if I migrate to an almost compatible alternative tomorrow, it should work the same way.

Transaction semantics are a fundamental feature of pg, of SQL, and of ACID databases generally. Anything that doesn't preserve them is not an "almost compatible alternative".

> Perhaps more importantly, it sends the message to future devs (including me) that rows should only be deleted after they've already been archived.

Using an explicit transaction, even if it is not necessary, communicates to future devs the actual intent, which isn't "rows deleted after they are archived" but "deletion and archive should occur in a single atomic step, and either both happen or neither happen." The ordered approach is just a way to express the least harm alternative in an environment that doesn't provide atomicity guarantees, but that's inappropriate when working in an environment that actually provides atomic transactions.
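For illustration, an explicit transaction around two plain statements (hypothetical table names again) states exactly that intent: either both happen or neither does.

  BEGIN;

  INSERT INTO users_archive
       SELECT * FROM users
        WHERE deactivated_at IS NOT NULL;

  DELETE FROM users
   WHERE deactivated_at IS NOT NULL;

  COMMIT;  -- any failure before this point rolls back both statements

One caveat with the two-statement form: under the default READ COMMITTED isolation, each statement takes its own snapshot, so a row deactivated between the INSERT and the DELETE could be deleted without being archived; deleting by the archived ids (as in the single-statement CTE) avoids that.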

One of the biggest problems that programmers with a stronger background outside of databases often have in writing SQL DB code is that they keep using ingrained workarounds for the fact that many programming environments lack convenient support for grouping operations into atomic units, even though SQL supports such grouping directly.


> My instinct here is to write the query in such a way that even if pg changes in the future, or if I migrate to an almost compatible alternative tomorrow, it should work the same way.

I agree, but, to nitpick, if transaction semantics change that much, you aren't dealing with an almost compatible alternative. Further, I'll wager vast sums of money that pg will not change in that way, ever.

> In the end, queries are code, and code is our way to communicate or intent to the next developer, so it's better to do the delete after the insert

I wholeheartedly agree, so long as your code correctly captures the requirements. For a (contrived, uncommon) example: if the requirement states that, transactionally, the row must be in either the live table or the archive table, but not both (for reporting non-duplication), then the form that performs the insert-then-delete is the correct capture of that requirement.


Nice! Thank you for this.


No problem mizerable! Glad you enjoyed it.



