Alex's Slip-box

These are my org-mode notes, kept in a loose Zettelkasten style.

Scrubbing data

:ID: 1608AF00-E150-468D-9387-B28A36DADEE0

A strategy for scrubbing sensitive data from a Postgres dump for use in a performance testing environment. A dump of prod data may contain sensitive information.

Someone authorized can load the dump, run a SQL script against the loaded data to scrub the sensitive fields, then dump the database again and share the scrubbed dump with developers.
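The workflow above can be sketched as a short command sequence. The database names, file paths, and script name here are hypothetical placeholders, not anything prescribed by a particular setup:

```shell
# Sketch of the scrub workflow (hypothetical names throughout):
pg_dump -Fc prod_db > prod.dump        # authorized person dumps prod (custom format)
createdb scrub_db                      # isolated database for scrubbing
pg_restore -d scrub_db prod.dump       # load the prod dump
psql -d scrub_db -f scrub.sql          # run the scrubbing SQL script
pg_dump -Fc scrub_db > scrubbed.dump   # dump the scrubbed data to share with developers
```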

# Compose a function

See also https://github.com/teoljungberg/fx for database function version management in Rails.

Use a function to transform data while keeping it generally consistent (e.g., scrubbed strings are the same length as the originals). See also https://www.postgresql.org/docs/current/functions-string.html for useful functions such as

  • CONCAT
  • SUBSTRING
  • MD5
  • RANDOM
  • GREATEST
  • SPLIT_PART

# Email scrubber example

Here is an example of an email scrubbing function.

CREATE OR REPLACE FUNCTION scrub_email(email_address varchar(255))
  RETURNS varchar(255) AS $$
  SELECT CONCAT(
    -- random hex local part, roughly matching the original length (at least 5 chars)
    SUBSTRING(
      MD5(RANDOM()::TEXT),
      1,
      GREATEST(LENGTH(SPLIT_PART(email_address, '@', 1)), 5)
    ),
    '@',
    -- keep the original domain
    SPLIT_PART(email_address, '@', 2)
  );
$$ LANGUAGE SQL;

To use it:

SELECT scrub_email(email) FROM users;

# Update table strategy

Use the function to update the table in place, destructively. Run VACUUM and REINDEX after the updates, and consider updating in batches.

UPDATE users SET email = scrub_email(email);

# Batches

See also Large SQL data migrations for using stored procedures.
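A batched update can be sketched with an anonymous block. This is a sketch under a few assumptions: `users.id` is a roughly sequential integer primary key, and the block is run outside an explicit transaction on Postgres 11+ (which allows COMMIT inside DO blocks):

```sql
-- Sketch: scrub emails in batches of 10,000 rows by id range.
-- Assumes users.id is a sequential integer PK and Postgres 11+.
DO $batch$
DECLARE
  batch_size bigint := 10000;
  max_id     bigint;
  current_id bigint := 0;
BEGIN
  SELECT COALESCE(MAX(id), 0) INTO max_id FROM users;
  WHILE current_id < max_id LOOP
    UPDATE users
    SET email = scrub_email(email)
    WHERE id > current_id AND id <= current_id + batch_size;
    current_id := current_id + batch_size;
    COMMIT;  -- releases row locks and lets vacuum reclaim dead tuples per batch
  END LOOP;
END;
$batch$;
```

Smaller batches hold locks for less time and keep the transaction log manageable, at the cost of a longer total run.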

# Vacuum

After updating rows, the old row versions (dead tuples) are still retained until vacuumed.

VACUUM (ANALYZE, VERBOSE) users;

# Reindex

REINDEX INDEX users_email_idx;

# Table copying strategy

The copy will take the place of the original table, with the sensitive columns replaced by scrubbed data. This approach is non-destructive and can be rolled back.

# Copy table

This copies only the table schema, not the data. See also https://www.postgresql.org/docs/current/sql-createtable.html#SQL-CREATETABLE-PARMS-LIKE

CREATE TABLE users_copy (LIKE users INCLUDING ALL);

# Excluding things for speedier copying

  • You can use EXCLUDING after INCLUDING ALL (e.g., LIKE users INCLUDING ALL EXCLUDING INDEXES).
  • Not having the indexes in place when the data is copied in the next step will be faster. That can matter for millions of rows.
  • The trade-off is that the indexes and FKs will still need to be created after the copying is done.
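Putting the two clauses together, the faster variant of the schema copy looks like:

```sql
-- Copy the schema (defaults, constraints, etc.) but skip indexes,
-- so the bulk INSERT in the next step runs without index maintenance.
CREATE TABLE users_copy (LIKE users INCLUDING ALL EXCLUDING INDEXES);
```

The indexes are then created once, after the data is in place, which is typically faster than maintaining them row by row during the copy.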

# Copy data

Here custom scrubbing functions can be applied to copy data into the new table.

INSERT INTO users_copy (first_name, last_name, email, type, created_at, updated_at)
SELECT first_name, last_name, scrub_email(email), type, created_at, updated_at
FROM users;

# Change sequence ownership

If the primary key is generated by a sequence, that sequence is still owned by the original table it was copied from.

ALTER SEQUENCE users_id_seq OWNED BY users_copy.id;

# Copy indexes and FKs

If the table was copied excluding indexes and constraints, they will need to be recreated, but with slightly different names (index names must be unique within a schema).

To list the indexes on the original table:

SELECT pg_get_indexdef(indexrelid) || ';' AS index FROM pg_index WHERE indrelid = 'public.users'::regclass;
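The definitions returned by pg_get_indexdef can be edited to target the copy under a temporary name, then renamed back after the swap. A sketch, using the users_email_idx index from above and a hypothetical temporary name:

```sql
-- Recreate the email index on the copy under a temporary name
-- (the original name is still taken by the index on users).
CREATE INDEX users_copy_email_idx ON users_copy (email);

-- After the drop-and-rename step, restore the original index name:
ALTER INDEX users_copy_email_idx RENAME TO users_email_idx;
```

Renaming afterwards keeps application code and migrations that reference the index name working unchanged.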

# Drop and rename

Drop the original table and rename the copy in the same transaction. CASCADE will also drop dependent objects such as views, which will need to be recreated.

BEGIN;
DROP TABLE users CASCADE;
ALTER TABLE users_copy RENAME TO users;
COMMIT;
