Pseudonymizer (DEPRECATED) (ULTIMATE)
Deprecated in GitLab 14.7.
WARNING: This feature was deprecated in GitLab 14.7.
Your GitLab database contains sensitive information. To protect sensitive information when you run analytics on your database, you can use the Pseudonymizer service, which:
- Uses
HMAC(SHA256)
to mutate fields containing sensitive information. - Preserves references (referential integrity) between fields.
- Exports your GitLab data, scrubbed of sensitive material.
WARNING: If the source data is available, users can compare and correlate the scrubbed data with the original.
To generate a pseudonymized data set:
- Configure Pseudonymizer fields and output location.
- Enable Pseudonymizer data collection.
- Optional. Generate a data set manually.
Configure Pseudonymizer
To use the Pseudonymizer, configure both the fields you want to anonymize, and the location to store the scrubbed data:
-
Create a manifest file: This file describes the fields to include or pseudonymize.
-
Default manifest - GitLab provides a default manifest in your GitLab installation
(example
manifest.yml
file). To use the example manifest file, use theconfig/pseudonymizer.yml
relative path when you configure connection parameters. - Custom manifest - To use a custom manifest file, use the absolute path to the file when you configure the connection parameters.
-
Default manifest - GitLab provides a default manifest in your GitLab installation
(example
-
Configure connection parameters: In the configuration method appropriate for
your version of GitLab, specify the object storage
connection parameters (
pseudonymizer.upload.connection
).
For Omnibus installations:
-
Edit
/etc/gitlab/gitlab.rb
and add the following lines by replacing with the values you want:gitlab_rails['pseudonymizer_manifest'] = 'config/pseudonymizer.yml' gitlab_rails['pseudonymizer_upload_remote_directory'] = 'gitlab-elt' # bucket name gitlab_rails['pseudonymizer_upload_connection'] = { 'provider' => 'AWS', 'region' => 'eu-central-1', 'aws_access_key_id' => 'AWS_ACCESS_KEY_ID', 'aws_secret_access_key' => 'AWS_SECRET_ACCESS_KEY' }
If you are using AWS IAM profiles, omit the AWS access key and secret access key/value pairs.
gitlab_rails['pseudonymizer_upload_connection'] = { 'provider' => 'AWS', 'region' => 'eu-central-1', 'use_iam_profile' => true }
-
Save the file and reconfigure GitLab for the changes to take effect.
For installations from source:
-
Edit
/home/git/gitlab/config/gitlab.yml
and add or amend the following lines:pseudonymizer: manifest: config/pseudonymizer.yml upload: remote_directory: 'gitlab-elt' # bucket name connection: provider: AWS aws_access_key_id: AWS_ACCESS_KEY_ID aws_secret_access_key: AWS_SECRET_ACCESS_KEY region: eu-central-1
-
Save the file and restart GitLab for the changes to take effect.
Enable Pseudonymizer data collection
To enable data collection:
- On the top bar, select Menu > Admin.
- On the left sidebar, select Settings > Metrics and Profiling, then expand Pseudonymizer data collection.
- Select Enable Pseudonymizer data collection.
- Select Save changes.
Generate data set manually
You can also run the Pseudonymizer manually:
- Set these environment variables:
-
PSEUDONYMIZER_OUTPUT_DIR
- Where to store the output CSV files. Defaults to/tmp
. These commands produce CSV files that can be quite large. Make sure the directory can store a file at least 10% of the size of your database. -
PSEUDONYMIZER_BATCH
- The batch size when querying the database. Defaults to100000
.
-
- Run the command appropriate for your application:
-
Omnibus GitLab:
sudo gitlab-rake gitlab:db:pseudonymizer
-
Installations from source:
sudo -u git -H bundle exec rake gitlab:db:pseudonymizer RAILS_ENV=production
-
Omnibus GitLab:
After you run the command, upload the output CSV files to your configured object storage. After the upload completes, delete the output file from the local disk.