Large SQL data migrations
# Use PL/pgSQL and batches
See the PL/pgSQL docs (SQL Procedural Language). It is useful for batching (see below).
# Transaction Management
One important caveat here is that we can't COMMIT transactions inside a PL/pgSQL function. This becomes an issue when using LOOP for batched operations, where it would be ideal to commit on each iteration of the loop. For this, we need PostgreSQL stored procedures; see the PL/pgSQL transaction management docs.
See also https://newbedev.com/postgresql-cannot-begin-end-transactions-in-pl-pgsql
# Stored Procedures
Available in PostgreSQL 11+. The procedure below will:
- Emit progress feedback via RAISE NOTICE
- Commit at the conclusion of each loop iteration
```sql
CREATE OR REPLACE PROCEDURE update_things_foo_to_bar()
LANGUAGE plpgsql AS $$
DECLARE
  current_id INT := (SELECT min(id) FROM things);
  max_id INT := (SELECT max(id) FROM things);
  batch_size INT := 1000;
  rows_updated INT;
BEGIN
  WHILE current_id <= max_id LOOP
    UPDATE things
    SET foo = 'bar'
    WHERE id >= current_id AND id < current_id + batch_size;
    GET DIAGNOSTICS rows_updated = ROW_COUNT;
    COMMIT;
    RAISE NOTICE 'current_id: %; rows_updated: %', current_id, rows_updated;
    current_id := current_id + batch_size;
  END LOOP;
END
$$;
```
# dblink extension
For older versions of Postgres, one hack is to use the dblink extension: build a SQL string, execute it over the dblink connection, and COMMIT. Note the use of format() to interpolate variables into the SQL string — variables declared in the script's scope won't be available for evaluation in the dblink context.
The overall structure here is two functions:
- An outer function that handles looping.
- An inner function, called on each loop iteration, that performs the data operation (i.e., inserts and/or updates).
Call the outer function to kick it off. For example:
```sql
-- Inner function: performs one batch over the dblink connection.
CREATE OR REPLACE FUNCTION f_migrate_things_batch(batch_start integer, batch_end integer)
RETURNS void AS $$
DECLARE
  sql_string text;
BEGIN
  PERFORM dblink_connect('my_dblink', 'dbname=my_database port=5432 user=username');
  sql_string := $query$
    -- Some long SQL string that references batch_start/batch_end as `%1$s` and `%2$s`
  $query$;
  -- Interpolate dynamic variables since dblink won't have access to them.
  sql_string := format(sql_string, batch_start, batch_end);
  -- dblink_exec (not dblink) is the call for statements that don't return rows.
  PERFORM dblink_exec('my_dblink', sql_string);
  PERFORM dblink_exec('my_dblink', 'COMMIT;');
  PERFORM dblink_disconnect('my_dblink');
END;
$$ LANGUAGE plpgsql;

-- Outer function: handles looping over batches.
CREATE OR REPLACE FUNCTION f_migrate_things_in_batches()
RETURNS void AS $$
DECLARE
  row_count integer := 0;
  batch_size integer := 5000;
  batch_start integer := 1;
  batch_end integer := batch_size;
BEGIN
  CREATE EXTENSION IF NOT EXISTS dblink;
  row_count := (SELECT count(*) FROM things);
  RAISE NOTICE '% things to update', row_count;
  WHILE row_count > 0 LOOP
    PERFORM f_migrate_things_batch(batch_start, batch_end);
    batch_start := batch_start + batch_size;
    batch_end := batch_end + batch_size;
    row_count := row_count - batch_size;
    RAISE NOTICE '% things remaining', row_count;
  END LOOP;
END;
$$ LANGUAGE plpgsql;

SELECT f_migrate_things_in_batches();
```
# Drop and recreate indexes
Dropping indexes on the target table before a large migration and recreating them afterwards can speed things up, since each write no longer has to maintain the indexes.
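As a sketch (the `things_foo_idx` index name and column are hypothetical):

```sql
-- Drop the index before the bulk operation...
DROP INDEX IF EXISTS things_foo_idx;

-- ...run the migration...
-- UPDATE things SET ...

-- ...then rebuild the index afterwards.
CREATE INDEX things_foo_idx ON things (foo);
```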
# Use batches
Here I'm just using a counter that increments by a certain batch size; this is a kind of boilerplate template for the pattern. Note the use of GET DIAGNOSTICS, a handy tool for getting information about the previously executed statement. ROW_COUNT is one of the items available through GET DIAGNOSTICS, and it is not the same as the row_count variable in the example.
```sql
DO $$
DECLARE
  row_count integer := 0;
  batch_size integer := 5000;
  batch_start integer := 1;
  batch_end integer := batch_size;
  affected integer;
BEGIN
  row_count := (SELECT count(*) FROM things);
  RAISE NOTICE '% things to update', row_count;
  WHILE row_count > 0 LOOP
    -- Do stuff like INSERT or UPDATE records:
    -- UPDATE things ...
    GET DIAGNOSTICS affected = ROW_COUNT;
    RAISE NOTICE '% things migrated', affected;
    batch_start := batch_start + batch_size;
    batch_end := batch_end + batch_size;
    row_count := row_count - batch_size;
    RAISE NOTICE '% things remaining', row_count;
  END LOOP;
END $$;
```
# Dealing with constraints
A nice feature is ON CONFLICT (UPSERT), which takes a specified action when a constraint violation is encountered, so a single conflict doesn't blow up the entire migration. Note that the UPDATE action cannot affect the same row more than once in a single statement.
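A minimal sketch, assuming a `things` table with a unique constraint on a hypothetical `ext_id` column:

```sql
-- Insert new rows; on a unique violation of ext_id, update the
-- existing row instead of aborting the whole statement.
INSERT INTO things (ext_id, foo)
VALUES (1, 'bar'), (2, 'baz')
ON CONFLICT (ext_id)
DO UPDATE SET foo = EXCLUDED.foo;

-- Or simply skip conflicting rows:
-- ON CONFLICT (ext_id) DO NOTHING;
```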