Data Masking: Anonymizing Production Data Securely For Testing & Development Environments
Your production environment should be as secure as possible, and access to it should be as limited as possible. But what happens if something goes wrong and you have to investigate a query?
Sometimes it is possible to create a similar test example manually, but in many cases you need a sizable number of documents with corresponding relations to test and investigate properly. The easiest way to get them is to use the existing data set from production, but extracting personal information such as real names, birthdays or credit card numbers might not be approved by management.
The data masking feature of arangodump provides a convenient way to extract production data while masking critical information that should not be visible. This includes names, birthdays, credit card numbers, addresses, emails or phone numbers.
In this tutorial, we show how to use the data masking feature of ArangoDB, in both the Community and the Enterprise Edition, to mask your production data dumps for safe use in environments with a lower security level. You can find all configuration options with examples in the Data Masking Docs. The following masking functions are available:
In the Community Edition:
- Random String
In the Enterprise Edition:
- Xify Front
- Zip
- Datetime
- Integral Number
- Decimal Number
- Credit Card Number
- Phone Number
- Email Address
Example
Assume you have two collections, persons and creditcards. The persons collection contains personal data, for example:
{ "_key": "196", "_id": "persons/196", "_rev": "_YKgHeVe--_", "name": "Hans Meier", "birthday": "2017-01-01", "spouse": { "name": "Anna Meier-Müller" } }
The creditcards collection contains credit card information:
{ "_key": "157", "_id": "creditcards/157", "_rev": "_YKgGZla--_", "name": "Hans Meier", "cardnumber": "2010034121851593" }
For the sake of simplicity, let name be the join key between persons and creditcards. In a real-world data model, you would have an identifier linking them.
All this personal data must be protected from illegitimate access. You cannot simply dump all persons and use them as test data for your development and test environments.
Masking the name
OK, so we cannot simply dump the persons collection, but we also cannot simply replace each name with an independent random string, because in this admittedly naive example the name serves as the foreign key in creditcards. Hence, the same name must be obfuscated to the same value in both collections, persons and creditcards.
In order to obfuscate all names, you can use the following masking definition:
{ "*": { "type": "masked", "maskings": [ { "path": ".name", "type": "randomString" } ] } }
This will change the person to:
{ "_key" : "196", "_id" : "persons/196", "_rev" : "_YKgHeVe--_", "birthday" : "2017-01-01", "name" : "nsDqD93iWFQ=", "spouse" : { "name" : "++AravpM+Us=++Arav" } }
and the credit card to:
{ "_key" : "157", "_id" : "creditcards/157", "_rev" : "_YKgGZla--_", "cardnumber" : "2010034121851593", "name" : "nsDqD93iWFQ=" }
Please note that the name Hans Meier is mapped to the same obfuscated string in both collections. However, two different runs of arangodump will not produce the same obfuscated string for the name Hans Meier, because the cryptographic hash used is randomly seeded.
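The behavior described above can be illustrated with a small Python sketch. It is not arangodump's implementation; the class name SeededMasker, the use of SHA-256 and the 8-byte truncation are assumptions chosen only to show why a randomly seeded hash keeps join keys consistent within one run but not across runs.

```python
import base64
import hashlib
import os


class SeededMasker:
    """Toy stand-in for a randomly seeded string masker (hypothetical, not arangodump's code)."""

    def __init__(self, seed=None):
        # A fresh random seed per instance plays the role of one arangodump run.
        self.seed = os.urandom(16) if seed is None else seed

    def mask(self, value):
        # Same seed and same input give the same output, so join keys stay joinable.
        digest = hashlib.sha256(self.seed + value.encode("utf-8")).digest()
        return base64.b64encode(digest[:8]).decode("ascii")


run_a = SeededMasker()
run_b = SeededMasker()

# Within one run, "Hans Meier" masks identically in persons and creditcards.
same_run = run_a.mask("Hans Meier") == run_a.mask("Hans Meier")
# Across runs the seeds differ, so the masked values are expected to differ too.
across_runs = run_a.mask("Hans Meier") == run_b.mask("Hans Meier")
print(same_run, across_runs)
```

A practical consequence is that dumps from two separate runs cannot be joined with each other on a masked attribute.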
Masking the credit card information
Likewise, you can obfuscate the credit card information inside the creditcards collection and the birthday inside the persons collection by using:
{ "creditcards": { "type": "masked", "maskings": [ { "path": "cardnumber", "type": "randomString" }, { "path": ".name", "type": "randomString" } ] }, "persons": { "type": "masked", "maskings": [ { "path": "birthday", "type": "randomString" }, { "path": ".name", "type": "randomString" } ] } }
Note that the above example now uses a stricter definition of the maskings. The masking for birthday is restricted to the persons collection and to the top-level attribute. The masking for cardnumber is restricted to the creditcards collection and to the top-level attribute.
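The difference between a path with and without a leading dot can be modeled in a few lines of Python. This is a simplified sketch of the rule as described in this tutorial, not arangodump's matcher; the function name path_matches is hypothetical.

```python
def path_matches(rule_path, attribute_path):
    """Return True if a masking rule path applies to a dotted attribute path.

    Simplified model: a leading "." matches any attribute path that ends in
    the given name(s); any other rule must equal the full path from the top level.
    """
    attr_parts = attribute_path.split(".")
    if rule_path.startswith("."):
        rule_parts = rule_path[1:].split(".")
        return attr_parts[-len(rule_parts):] == rule_parts
    return attr_parts == rule_path.split(".")


# ".name" applies to the top-level name and to the nested spouse.name.
print(path_matches(".name", "name"), path_matches(".name", "spouse.name"))
# "birthday" applies only to the top-level attribute.
print(path_matches("birthday", "birthday"), path_matches("birthday", "spouse.birthday"))
```

This is why spouse.name is masked in the outputs shown in this tutorial, while a nested birthday attribute would be left untouched.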
An alternative is to mask these fields everywhere, as we did for name. The masking definition above obfuscates the birthday:
{ "_key" : "196", "_id" : "persons/196", "_rev" : "_YKgHeVe--_", "birthday" : "dg0if0css6o=", "name" : "v5eZiFu9SwY=", "spouse" : { "name" : "ipRWCV74A4I=ipRWCV" } }
and the credit card:
{ "_key" : "157", "_id" : "creditcards/157", "_rev" : "_YKgGZla--_", "cardnumber" : "rAHX1roXG7Q=rA", "name" : "v5eZiFu9SwY=" }
The information is now obfuscated, but the structure of the credit card number and the birthday has been changed. It might be important for your testing and development purposes to keep the basic structure of a specific attribute value.
The data masking functions in the Enterprise Edition of ArangoDB allow you to generate random dates and credit card numbers:
{ "creditcards": { "type": "masked", "maskings": [ { "path": "cardnumber", "type": "creditCard" }, { "path": ".name", "type": "randomString" } ] }, "persons": { "type": "masked", "maskings": [ { "path": "birthday", "type": "datetime", "begin" : "1960-01-01", "end": "2019-12-31", "format": "%yyyy-%mm-%dd" }, { "path": ".name", "type": "randomString" } ] } }
This produces a more familiar birthday:
{ "_key" : "196", "_id" : "persons/196", "_rev" : "_YKgHeVe--_", "birthday" : "2018-08-17", "name" : "u1dwSWu5PwA=", "spouse" : { "name" : "+7fWsZU9x1k=+7fWsZ" } }
and credit card number:
{ "_key" : "157", "_id" : "creditcards/157", "_rev" : "_YKgGZla--_", "cardnumber" : "2014003182153159", "name" : "u1dwSWu5PwA=" }
As you can see in the examples above, we obfuscated the birthday and credit card number without losing their structure.
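To make the idea of structure-preserving masking concrete, here is a minimal Python sketch. It does not reproduce the Enterprise Edition functions; random_date and random_digits_like are hypothetical helpers that only show the principle of replacing a value with a random one of the same shape.

```python
import random
from datetime import date, timedelta


def random_date(begin, end, rng):
    """Pick a random ISO date between begin and end (both inclusive)."""
    start, stop = date.fromisoformat(begin), date.fromisoformat(end)
    offset = rng.randint(0, (stop - start).days)
    return (start + timedelta(days=offset)).isoformat()


def random_digits_like(value, rng):
    """Replace every digit with a random digit, keeping length and separators."""
    return "".join(str(rng.randint(0, 9)) if ch.isdigit() else ch for ch in value)


rng = random.Random()
masked_birthday = random_date("1960-01-01", "2019-12-31", rng)
masked_card = random_digits_like("2010034121851593", rng)
print(masked_birthday, masked_card)
```

The masked values still look like a birthday and a 16-digit card number, so code that parses or validates their format keeps working against the test data.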
Running the example
Create a 'production' data-set
To create the example "production data set", you can either follow the instructions in the AQL tutorial to create the persons and creditcards collections via the web UI and fill them with the data, or follow the instructions below.
To test all features we will describe here, you may want to install the 3.5 Enterprise Edition (completely free for evaluation purposes).
Create a file persons.json containing:
{ "name": "Hans Meier", "birthday": "2017-01-01", "spouse": { "name": "Anna Meier-Mueller" } }
{ "name": "Anna Meier-Mueller", "birthday": "2017-07-12", "spouse": { "name": "Hans Meier" } }
{ "name": "Hugo Egon", "birthday": "2010-03-17", "spouse": { "name": "Gerda Balder" } }
{ "name": "Gerda Balder", "birthday": "2010-09-23", "spouse": { "name": "Hugo Egon" } }
and a file creditcards.json containing:
{ "name": "Hans Meier", "cardnumber": "2010034121851593" }
{ "name": "Gerda Balder", "cardnumber": "4109384211585194" }
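If you would rather generate the two files from a script than type them by hand, the following Python sketch writes them in JSON Lines form, one document per line. The helper name write_jsonl is hypothetical; the file names and documents are the ones used in this tutorial.

```python
import json

persons = [
    {"name": "Hans Meier", "birthday": "2017-01-01", "spouse": {"name": "Anna Meier-Mueller"}},
    {"name": "Anna Meier-Mueller", "birthday": "2017-07-12", "spouse": {"name": "Hans Meier"}},
    {"name": "Hugo Egon", "birthday": "2010-03-17", "spouse": {"name": "Gerda Balder"}},
    {"name": "Gerda Balder", "birthday": "2010-09-23", "spouse": {"name": "Hugo Egon"}},
]
creditcards = [
    {"name": "Hans Meier", "cardnumber": "2010034121851593"},
    {"name": "Gerda Balder", "cardnumber": "4109384211585194"},
]


def write_jsonl(path, documents):
    """Write one JSON document per line, the layout expected for JSON Lines input."""
    with open(path, "w", encoding="utf-8") as handle:
        for doc in documents:
            handle.write(json.dumps(doc) + "\n")


write_jsonl("persons.json", persons)
write_jsonl("creditcards.json", creditcards)
```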
Import these two files into a running instance of ArangoDB with arangoimp:
arangoimp --server.database example --create-database --collection persons --create-collection --type jsonl --file persons.json
arangoimp --server.database example --collection creditcards --create-collection --type jsonl --file creditcards.json
You should now see in arangosh:
production> db.persons.toArray()
[
  { "_key" : "6117008", "_id" : "persons/6117008", "_rev" : "_YY-vgnK---", "name" : "Hans Meier", "birthday" : "2017-01-01", "spouse" : { "name" : "Anna Meier-Mueller" } },
  { "_key" : "6117009", "_id" : "persons/6117009", "_rev" : "_YY-vgnK--A", "name" : "Anna Meier-Mueller", "birthday" : "2017-07-12", "spouse" : { "name" : "Hans Meier" } },
  { "_key" : "6117010", "_id" : "persons/6117010", "_rev" : "_YY-vgnK--C", "name" : "Hugo Egon", "birthday" : "2010-03-17", "spouse" : { "name" : "Gerda Balder" } },
  { "_key" : "6117011", "_id" : "persons/6117011", "_rev" : "_YY-vgnK--E", "name" : "Gerda Balder", "birthday" : "2010-09-23", "spouse" : { "name" : "Hugo Egon" } }
]
production> db.creditcards.toArray()
[
  { "_key" : "6117105", "_id" : "creditcards/6117105", "_rev" : "_YY-yFH2---", "name" : "Hans Meier", "cardnumber" : "2010034121851593" },
  { "_key" : "6117106", "_id" : "creditcards/6117106", "_rev" : "_YY-yFH2--A", "name" : "Gerda Balder", "cardnumber" : "4109384211585194" }
]
Exporting with masking
Now that we have created the production data set, let's export it again using the data masking feature.
Create a file maskings.json containing:
{ "creditcards": { "type": "masked", "maskings": [ { "path": "cardnumber", "type": "randomString" }, { "path": ".name", "type": "randomString" } ] }, "persons": { "type": "masked", "maskings": [ { "path": "birthday", "type": "randomString" }, { "path": ".name", "type": "randomString" } ] } }
Now export using:
arangodump --maskings maskings.json --server.database example --output-directory example
and import into a new database:
arangorestore --server.database masked --input-directory example
Check the result
Now look at the masked database:
masked> db.persons.toArray()
[
  { "_key" : "6117010", "_id" : "persons/6117010", "_rev" : "_YY-vgnK--C", "birthday" : "6VboePBV8I4=", "name" : "3CO/HOVddJI=", "spouse" : { "name" : "ZGivCqK+kHk=" } },
  { "_key" : "6117009", "_id" : "persons/6117009", "_rev" : "_YY-vgnK--A", "birthday" : "uPHeHaFtNk0=", "name" : "QtzuCEGSaH0=QtzuCE", "spouse" : { "name" : "1g8GpUKNOXk=" } },
  { "_key" : "6117011", "_id" : "persons/6117011", "_rev" : "_YY-vgnK--E", "birthday" : "PaQzrHjnddw=", "name" : "ZGivCqK+kHk=", "spouse" : { "name" : "3CO/HOVddJI=" } },
  { "_key" : "6117008", "_id" : "persons/6117008", "_rev" : "_YY-vgnK---", "birthday" : "2Fi7Ja9fNE4=", "name" : "1g8GpUKNOXk=", "spouse" : { "name" : "QtzuCEGSaH0=QtzuCE" } }
]
masked> db.creditcards.toArray()
[
  { "_key" : "6117106", "_id" : "creditcards/6117106", "_rev" : "_YY-yFH2--A", "cardnumber" : "o3/NExaLEgM=o3/N", "name" : "ZGivCqK+kHk=" },
  { "_key" : "6117105", "_id" : "creditcards/6117105", "_rev" : "_YY-yFH2---", "cardnumber" : "1R8fvrSJRE0=1R8f", "name" : "1g8GpUKNOXk=" }
]
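A quick way to confirm that the join key survived masking is to check that every masked name in creditcards still occurs in persons. The Python sketch below does this on the masked values from the output above, reduced to the relevant attributes; the helper dangling_references is hypothetical and not part of any ArangoDB tool.

```python
persons = [
    {"name": "3CO/HOVddJI=", "spouse": {"name": "ZGivCqK+kHk="}},
    {"name": "QtzuCEGSaH0=QtzuCE", "spouse": {"name": "1g8GpUKNOXk="}},
    {"name": "ZGivCqK+kHk=", "spouse": {"name": "3CO/HOVddJI="}},
    {"name": "1g8GpUKNOXk=", "spouse": {"name": "QtzuCEGSaH0=QtzuCE"}},
]
creditcards = [
    {"cardnumber": "o3/NExaLEgM=o3/N", "name": "ZGivCqK+kHk="},
    {"cardnumber": "1R8fvrSJRE0=1R8f", "name": "1g8GpUKNOXk="},
]


def dangling_references(known_names, referenced_names):
    """Return the referenced names that do not appear among the known names."""
    known = set(known_names)
    return [name for name in referenced_names if name not in known]


person_names = [doc["name"] for doc in persons]
# Credit cards must still point at an existing (masked) person.
card_orphans = dangling_references(person_names, [doc["name"] for doc in creditcards])
# The nested spouse names were masked with the same mapping, so they resolve too.
spouse_orphans = dangling_references(person_names, [doc["spouse"]["name"] for doc in persons])
print(card_orphans, spouse_orphans)  # prints: [] []
```

Both lists are empty, which shows that the relations between the documents are intact even though none of the original names remain in the dump.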
Enterprise version
If you have the Enterprise Edition, you can also use the following maskings.json. This masking definition uses the datetime and creditCard masking functions to preserve the structure of the attribute values of birthday and cardnumber.
{ "creditcards": { "type": "masked", "maskings": [ { "path": "cardnumber", "type": "creditCard" }, { "path": ".name", "type": "randomString" } ] }, "persons": { "type": "masked", "maskings": [ { "path": "birthday", "type": "datetime", "begin" : "1960-01-01", "end": "2019-12-31", "format": "%yyyy-%mm-%dd" }, { "path": ".name", "type": "randomString" } ] } }
This will result in:
masked> db.persons.toArray()
[
  { "_key" : "6117010", "_id" : "persons/6117010", "_rev" : "_YY-vgnK--C", "birthday" : "1973-11-22", "name" : "pnenaFH6ygI=", "spouse" : { "name" : "2gD27vP66+k=" } },
  { "_key" : "6117009", "_id" : "persons/6117009", "_rev" : "_YY-vgnK--A", "birthday" : "2007-06-09", "name" : "Jhhr0axG9S4=Jhhr0a", "spouse" : { "name" : "SpS1y/cb6uw=" } },
  { "_key" : "6117011", "_id" : "persons/6117011", "_rev" : "_YY-vgnK--E", "birthday" : "1965-06-02", "name" : "2gD27vP66+k=", "spouse" : { "name" : "pnenaFH6ygI=" } },
  { "_key" : "6117008", "_id" : "persons/6117008", "_rev" : "_YY-vgnK---", "birthday" : "1985-08-11", "name" : "SpS1y/cb6uw=", "spouse" : { "name" : "Jhhr0axG9S4=Jhhr0a" } }
]
masked> db.creditcards.toArray()
[
  { "_key" : "6117106", "_id" : "creditcards/6117106", "_rev" : "_YY-yFH2--A", "cardnumber" : "3000004160779761", "name" : "2gD27vP66+k=" },
  { "_key" : "6117105", "_id" : "creditcards/6117105", "_rev" : "_YY-yFH2---", "cardnumber" : "3400003796970808", "name" : "SpS1y/cb6uw=" }
]
We hope you find the new data masking feature useful and that this tutorial helps you get started with it. If you have any feedback or questions about this tutorial, please let us know via learn@arangodb.com.