Scrubbing data
A strategy for scrubbing sensitive data from a Postgres dump so it can be used in a performance testing environment. You might have a dump of prod data that contains sensitive information.
Someone authorized loads the dump, runs a SQL script against the loaded data to scrub the sensitive values, then dumps the database again and shares the scrubbed dump with developers.
# Compose a function
See also https://github.com/teoljungberg/fx for database function version management in Rails.
Use a function to transform data while keeping it generally consistent (for example, scrubbed strings keep the same length). See also https://www.postgresql.org/docs/current/functions-string.html for useful string functions such as:
- CONCAT
- SUBSTRING
- MD5
- RANDOM
- GREATEST
- SPLIT_PART
# Email scrubber example
Here is an example of an email scrubbing function.
```sql
CREATE OR REPLACE FUNCTION scrub_email(email_address varchar(255)) RETURNS varchar(255) AS $$
  SELECT CONCAT(
    SUBSTRING(
      MD5(RANDOM()::TEXT),
      1,
      GREATEST(LENGTH(SPLIT_PART(email_address, '@', 1)), 5)
    ),
    '@',
    SPLIT_PART(email_address, '@', 2)
  );
$$ LANGUAGE SQL;
```
To use it:
```sql
SELECT scrub_email(email) FROM users;
```
# Update table strategy
Use the function to update the table in place, destructively. Run VACUUM and REINDEX after the updates, and consider updating in batches.
```sql
UPDATE users SET email = scrub_email(email);
```
# Batches
See also Large SQL data migrations for using stored procedures.
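A minimal sketch of batched updates using a PL/pgSQL DO block. The batch size, the assumption of an integer id primary key, and committing inside the loop (PostgreSQL 11+, run outside an explicit transaction) are all assumptions, not part of the original note.

```sql
-- Sketch: scrub emails in id-range batches so each chunk commits separately.
-- Assumes a bigint/integer primary key named id; COMMIT inside a DO block
-- needs PostgreSQL 11+ and must be run from the top level (not inside BEGIN).
DO $$
DECLARE
  batch_size bigint := 10000;
  last_id    bigint := 0;
  max_id     bigint;
BEGIN
  SELECT COALESCE(MAX(id), 0) INTO max_id FROM users;
  WHILE last_id < max_id LOOP
    UPDATE users
       SET email = scrub_email(email)
     WHERE id > last_id AND id <= last_id + batch_size;
    last_id := last_id + batch_size;
    RAISE NOTICE 'scrubbed rows with id <= %', last_id;
    COMMIT;
  END LOOP;
END;
$$;
```

Committing per batch keeps each transaction short and lets autovacuum start reclaiming dead tuples while the scrub is still running.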
# Vacuum
After updating rows, the old row versions (dead tuples) are still retained until a vacuum reclaims them.
```sql
VACUUM (ANALYZE, VERBOSE) users;
```
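To get a rough idea of how many dead row versions the updates left behind, the statistics view can be checked before and after the vacuum (a small sketch, not from the original note):

```sql
-- Dead vs. live tuple counts for the users table
SELECT relname, n_live_tup, n_dead_tup
FROM pg_stat_user_tables
WHERE relname = 'users';
```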
# Reindex
```sql
REINDEX INDEX users_email_idx;
```
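If several indexes cover the scrubbed columns, they can be listed from pg_indexes and rebuilt individually, or REINDEX TABLE can rebuild all of them at once; a short sketch:

```sql
-- List the indexes defined on the table
SELECT indexname FROM pg_indexes WHERE tablename = 'users';

-- Or rebuild every index on the table in one go
REINDEX TABLE users;
```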
# Table copying strategy
The copy will take the place of the original table, with the sensitive columns holding scrubbed data. This is non-destructive and can be rolled back, since the original data is left untouched until the final swap.
# Copy table
This copies only the table schema, not the data. See also https://www.postgresql.org/docs/current/sql-createtable.html#SQL-CREATETABLE-PARMS-LIKE
```sql
CREATE TABLE users_copy (LIKE users INCLUDING ALL);
```
# Excluding things for speedier copying
- You can use `EXCLUDING` after `INCLUDING ALL` (e.g. `LIKE users INCLUDING ALL EXCLUDING INDEXES`), as shown below.
- Not having the indexes in place when the data is copied in the next step will be faster. That could be useful for millions of rows.
- The trade-off is that the indexes and FKs will still need to be created after the copying is done.
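For example, the copy statement above becomes:

```sql
CREATE TABLE users_copy (LIKE users INCLUDING ALL EXCLUDING INDEXES);
```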
# Copy data
Here the custom scrubbing functions can be applied as the data is copied into the new table.
```sql
INSERT INTO users_copy (first_name, last_name, email, type, created_at, updated_at)
SELECT first_name, last_name, scrub_email(email), type, created_at, updated_at
FROM users;
```
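A quick sanity check after the copy; comparing row counts is just one possible verification, not something the original note prescribes:

```sql
SELECT (SELECT COUNT(*) FROM users)      AS original_rows,
       (SELECT COUNT(*) FROM users_copy) AS copied_rows;
```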
# Change sequence ownership
If the primary key is generated by a sequence, that sequence is still owned by the column of the original table it was copied from. Reassign it to the copy so the sequence is not dropped along with the original table.
```sql
ALTER SEQUENCE users_id_seq OWNED BY users_copy.id;
```
# Copy indexes and FKs
If the table was copied excluding indexes and constraints, they will need to be recreated on the copy, but with slightly different names since the originals still exist.
To list the indexes on the original table:
```sql
SELECT pg_get_indexdef(indexrelid) || ';' AS index
FROM pg_index
WHERE indrelid = 'public.users'::regclass;
```
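The definitions returned by that query can be edited to point at the copy and given new names. A hypothetical example for an email index (the index name and column are assumptions):

```sql
-- Hypothetical: recreate the email index on the copy under a new name,
-- because users_email_idx still exists on the original table
CREATE INDEX users_copy_email_idx ON users_copy (email);
```

Foreign keys can be recreated the same way with ALTER TABLE ... ADD CONSTRAINT.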
# Drop and rename
Drop the original table and rename the copy in the same transaction. Cascade will drop related objects like views.
```sql
BEGIN;
DROP TABLE users CASCADE;
ALTER TABLE users_copy RENAME TO users;
COMMIT;
```
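Because the swap happens in one transaction, it can also be abandoned before committing, which is the rollback path mentioned above; a sketch:

```sql
BEGIN;
DROP TABLE users CASCADE;
ALTER TABLE users_copy RENAME TO users;
-- If anything looks wrong at this point, restore the original instead of committing:
ROLLBACK;
```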