Data Masking: Anonymizing Production Data Securely For Testing & Development Environments

Your production environment should be as secure as possible, and access to it should be as limited as possible. But what happens if something goes wrong and you have to investigate a query?

Sometimes it is possible to create a similar test example manually, but in many cases you need a sizable number of documents with corresponding relations to test and investigate properly. The easiest source is the existing production data set, but extracting personal information such as real names, birthdays, or credit card numbers might not be approved by management.

The data masking feature of arangodump provides a convenient way to extract production data while masking critical information that should not be visible. This includes names, birthdays, credit card numbers, addresses, email addresses, and phone numbers.

In this tutorial, we show how to use the data masking feature of ArangoDB, available in both the Community and Enterprise Editions, to mask your production data dumps for safe use in environments with a lower security level. All configuration options, with examples, are described in the Data Masking Docs. The following masking functions are available:

In the Community Edition:

Random String

In the Enterprise Edition:

Xify Front
Zip
Datetime
Integral Number
Decimal Number
Credit Card Number
Phone Number
Email Address

Example

Assume you have two collections persons and creditcards. The persons collection contains personal data, for example:

{
  "_key": "196",
  "_id": "persons/196",
  "_rev": "_YKgHeVe--_",
  "name": "Hans Meier",
  "birthday": "2017-01-01",
  "spouse": {
    "name": "Anna Meier-Müller"
  }
}

The creditcards collection contains credit card information:

{
  "_key": "157",
  "_id": "creditcards/157",
  "_rev": "_YKgGZla--_",
  "name": "Hans Meier",
  "cardnumber": "2010034121851593"
}

For the sake of simplicity, let name be the join key between persons and creditcards. In a real-world data model you would have a dedicated identifier linking them.

All this personal data must be protected from illegitimate access. You cannot simply dump all persons and use them as test data for your development and test environments.

Masking the name

So we cannot simply dump the persons collection, but we also cannot simply replace each name with an unrelated random string, because in this admittedly naive example the name serves as the foreign key in creditcards. Hence, the same name must be obfuscated to the same value in both collections, persons and creditcards.

In order to obfuscate all names, you can use the following masking definition:


    {
        "*": {
            "type": "masked",
            "maskings": [
                {
                    "path": ".name",
                    "type": "randomString"
                }
            ]
        }
    }

This will change the person to:

    {
      "_key" : "196",
      "_id" : "persons/196",
      "_rev" : "_YKgHeVe--_",
      "birthday" : "2017-01-01",
      "name" : "nsDqD93iWFQ=",
      "spouse" : {
        "name" : "++AravpM+Us=++Arav"
      }
    }

and the credit card to:

    {
      "_key" : "157",
      "_id" : "creditcards/157",
      "_rev" : "_YKgGZla--_",
      "cardnumber" : "2010034121851593",
      "name" : "nsDqD93iWFQ="
    }

Note that the name Hans Meier is mapped to the same obfuscated string in both collections. However, two different runs of arangodump will not produce the same obfuscated string for Hans Meier, because the cryptographic hash uses a random seed for each run.
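The behavior described above (stable within one run, different between runs) can be illustrated with a keyed hash. The following is only a sketch of the idea, not arangodump's actual algorithm: the helper make_masker, the choice of HMAC-SHA256, and the 8-byte truncation are all assumptions made for illustration.

```python
import base64
import hashlib
import hmac
import secrets
from typing import Callable, Optional


def make_masker(seed: Optional[bytes] = None) -> Callable[[str], str]:
    """Return a masking function bound to one (random) key.

    Hypothetical helper for illustration; not arangodump's implementation.
    """
    # A fresh random key per "run" unless a seed is supplied explicitly.
    key = seed if seed is not None else secrets.token_bytes(16)

    def mask(value: str) -> str:
        digest = hmac.new(key, value.encode("utf-8"), hashlib.sha256).digest()
        # Keep 8 bytes and encode them as text (assumed length, for readability).
        return base64.b64encode(digest[:8]).decode("ascii")

    return mask


run_a = make_masker()
run_b = make_masker()

# Within one run, the join key stays consistent across collections.
person_name = run_a("Hans Meier")
card_name = run_a("Hans Meier")
print(person_name == card_name)

# A second run has a different key, so the pseudonym almost certainly differs.
print(run_a("Hans Meier") == run_b("Hans Meier"))
```

Because the key is discarded after the run, the mapping cannot be reproduced later, which is also why two dumps of the same data do not share obfuscated values.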

Masking the credit card information

Likewise, you can obfuscate the credit card information inside the creditcards collection and the birthday inside the persons collection by using:

    {
        "creditcards": {
            "type": "masked",
            "maskings": [
                {
                    "path": "cardnumber",
                    "type": "randomString"
                },
                {
                    "path": ".name",
                    "type": "randomString"
                }
            ]
        },
        "persons": {
            "type": "masked",
            "maskings": [
                {
                    "path": "birthday",
                    "type": "randomString"
                },
                {
                    "path": ".name",
                    "type": "randomString"
                }
            ]
        }
    }

Note that the above example now uses a stricter definition of the maskings.

The masking for birthday is restricted to the persons collection and to the top-level attribute. The masking for cardnumber is restricted to the creditcards collection and to the top-level attribute.
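The difference between a path with a leading dot (".name") and one without ("birthday") can be modeled in a few lines. This is a simplified sketch of the matching rule as described in this tutorial, not arangodump's code; path_matches and apply_maskings are hypothetical helpers, and the nested spouse birthday is invented to make the difference visible.

```python
from typing import Any, Callable, Dict, Sequence, Tuple


def path_matches(rule: str, path: Sequence[str]) -> bool:
    """Simplified model: '.name' matches an attribute called name at any
    depth, while 'birthday' matches only the top-level attribute."""
    if rule.startswith("."):
        return path[-1] == rule[1:]
    return len(path) == 1 and path[0] == rule


def apply_maskings(
    doc: Dict[str, Any],
    rules: Sequence[str],
    mask: Callable[[Any], Any],
    path: Tuple[str, ...] = (),
) -> Dict[str, Any]:
    """Return a copy of doc with every matching leaf value masked."""
    masked = {}
    for key, value in doc.items():
        current = path + (key,)
        if isinstance(value, dict):
            # Descend into sub-objects such as "spouse".
            masked[key] = apply_maskings(value, rules, mask, current)
        elif any(path_matches(rule, current) for rule in rules):
            masked[key] = mask(value)
        else:
            masked[key] = value
    return masked


person = {
    "name": "Hans Meier",
    "birthday": "2017-01-01",
    # The nested birthday is hypothetical; it is not part of the tutorial data.
    "spouse": {"name": "Anna Meier-Müller", "birthday": "2016-05-05"},
}
masked_person = apply_maskings(person, [".name", "birthday"], lambda value: "***")
print(masked_person)
```

In this model both name attributes and the top-level birthday are replaced, while the nested birthday is left untouched because the rule "birthday" has no leading dot.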

An alternative is to mask these fields everywhere, as we did for name. Applying the definition above obfuscates the birthday:

    {
      "_key" : "196",
      "_id" : "persons/196",
      "_rev" : "_YKgHeVe--_",
      "birthday" : "dg0if0css6o=",
      "name" : "v5eZiFu9SwY=",
      "spouse" : {
        "name" : "ipRWCV74A4I=ipRWCV"
      }
    }

and the credit card:

    {
      "_key" : "157",
      "_id" : "creditcards/157",
      "_rev" : "_YKgGZla--_",
      "cardnumber" : "rAHX1roXG7Q=rA",
      "name" : "v5eZiFu9SwY="
    }

The information is now obfuscated, but the structure of the credit card number and the birthday has been changed. It might be important for your testing and development purposes to keep the basic structure of a specific attribute value.

The data masking functions in the Enterprise Edition of ArangoDB allow you to generate random dates and credit card numbers instead:

    {
        "creditcards": {
            "type": "masked",
            "maskings": [
                {
                    "path": "cardnumber",
                    "type": "creditCard"
                },
                {
                    "path": ".name",
                    "type": "randomString"
                }
            ]
        },
        "persons": {
            "type": "masked",
            "maskings": [
                {
                    "path": "birthday",
                    "type": "datetime",
                    "begin" : "1960-01-01",
                    "end": "2019-12-31",
                    "format": "%yyyy-%mm-%dd"
                },
                {
                    "path": ".name",
                    "type": "randomString"
                }
            ]
        }
    }

This produces a more familiar birthday:

  
    {
      "_key" : "196",
      "_id" : "persons/196",
      "_rev" : "_YKgHeVe--_",
      "birthday" : "2018-08-17",
      "name" : "u1dwSWu5PwA=",
      "spouse" : {
        "name" : "+7fWsZU9x1k=+7fWsZ"
      }
    }

and credit card number:

    {
      "_key" : "157",
      "_id" : "creditcards/157",
      "_rev" : "_YKgGZla--_",
      "cardnumber" : "2014003182153159",
      "name" : "u1dwSWu5PwA="
    }

As you can see in the examples above, we obfuscated the birthday and credit card number without losing their structure.
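To make the idea of structure-preserving masking concrete, here is a small sketch that replaces a date with a random date from a range and a card number with a random 16-digit number. It is an illustration under assumptions, not the Enterprise Edition implementation: the function names are hypothetical, and appending a Luhn check digit is an assumption about what a plausible card number looks like, not something this tutorial states about arangodump.

```python
import random
from datetime import date, timedelta


def random_date(begin: str, end: str, rng: random.Random) -> str:
    """Pick a random day between begin and end (inclusive), as YYYY-MM-DD."""
    start = date.fromisoformat(begin)
    stop = date.fromisoformat(end)
    offset = rng.randint(0, (stop - start).days)
    return (start + timedelta(days=offset)).isoformat()


def luhn_check_digit(payload: str) -> str:
    """Compute the Luhn check digit for a string of digits."""
    total = 0
    # Walk right to left, doubling every second digit starting with the rightmost.
    for index, char in enumerate(reversed(payload)):
        digit = int(char)
        if index % 2 == 0:
            digit *= 2
            if digit > 9:
                digit -= 9
        total += digit
    return str((10 - total % 10) % 10)


def random_card_number(rng: random.Random, length: int = 16) -> str:
    """Generate random digits and append a check digit (assumed format)."""
    payload = "".join(str(rng.randint(0, 9)) for _ in range(length - 1))
    return payload + luhn_check_digit(payload)


rng = random.Random()
print(random_date("1960-01-01", "2019-12-31", rng))
print(random_card_number(rng))
```

The masked values carry no information about the originals, yet code that parses dates or validates card number length still works against the test data.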

Running the example

Create a 'production' data-set

To create the example “production data set”, you can either follow the instructions in the AQL tutorial to create the persons and creditcards collections via the WebUI and fill them with data, or follow the instructions below, which use the command-line tools.

To test all features described here, you may want to install the 3.5 Enterprise Edition (completely free for evaluation purposes).

Create a file persons.json containing:

    { "name": "Hans Meier", "birthday": "2017-01-01", "spouse": { "name": "Anna Meier-Mueller" } }
    { "name": "Anna Meier-Mueller", "birthday": "2017-07-12", "spouse": { "name": "Hans Meier" } }
    { "name": "Hugo Egon", "birthday": "2010-03-17", "spouse": { "name": "Gerda Balder" } }
    { "name": "Gerda Balder", "birthday": "2010-09-23", "spouse": { "name": "Hugo Egon" } }

and a file creditcards.json containing:

    { "name": "Hans Meier", "cardnumber": "2010034121851593" }
    { "name": "Gerda Balder", "cardnumber": "4109384211585194" }
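Both files use the JSON Lines layout: one complete JSON document per line, with no surrounding array. If you prefer to generate such files from a script, the following sketch shows the idea; the helper to_jsonl is hypothetical and only produces the text, which you would then save as persons.json or creditcards.json.

```python
import json
from typing import Any, Dict, Iterable

creditcards = [
    {"name": "Hans Meier", "cardnumber": "2010034121851593"},
    {"name": "Gerda Balder", "cardnumber": "4109384211585194"},
]


def to_jsonl(docs: Iterable[Dict[str, Any]]) -> str:
    """Serialize documents as JSON Lines: one JSON object per line."""
    return "".join(json.dumps(doc) + "\n" for doc in docs)


jsonl_text = to_jsonl(creditcards)
print(jsonl_text, end="")

# Reading the text back line by line recovers the original documents.
round_trip = [json.loads(line) for line in jsonl_text.splitlines()]
print(round_trip == creditcards)
```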

Import these two files into a running instance of ArangoDB using arangoimp (renamed to arangoimport in ArangoDB 3.4; use that name if arangoimp is not available on your system):

    arangoimp --server.database example --create-database --collection persons --create-collection --type jsonl --file persons.json
    arangoimp --server.database example --collection creditcards --create-collection --type jsonl --file creditcards.json

You should now see in arangosh:

    production>  db.persons.toArray()
    [
      {
        "_key" : "6117008",
        "_id" : "persons/6117008",
        "_rev" : "_YY-vgnK---",
        "name" : "Hans Meier",
        "birthday" : "2017-01-01",
        "spouse" : {
          "name" : "Anna Meier-Mueller"
        }
      },
      {
        "_key" : "6117009",
        "_id" : "persons/6117009",
        "_rev" : "_YY-vgnK--A",
        "name" : "Anna Meier-Mueller",
        "birthday" : "2017-07-12",
        "spouse" : {
          "name" : "Hans Meier"
        }
      },
      {
        "_key" : "6117010",
        "_id" : "persons/6117010",
        "_rev" : "_YY-vgnK--C",
        "name" : "Hugo Egon",
        "birthday" : "2010-03-17",
        "spouse" : {
          "name" : "Gerda Balder"
        }
      },
      {
        "_key" : "6117011",
        "_id" : "persons/6117011",
        "_rev" : "_YY-vgnK--E",
        "name" : "Gerda Balder",
        "birthday" : "2010-09-23",
        "spouse" : {
          "name" : "Hugo Egon"
        }
      }
    ]
 
    production>  db.creditcards.toArray()
    [
      {
        "_key" : "6117105",
        "_id" : "creditcards/6117105",
        "_rev" : "_YY-yFH2---",
        "name" : "Hans Meier",
        "cardnumber" : "2010034121851593"
      },
      {
        "_key" : "6117106",
        "_id" : "creditcards/6117106",
        "_rev" : "_YY-yFH2--A",
        "name" : "Gerda Balder",
        "cardnumber" : "4109384211585194"
      }
    ]

Exporting with masking

Now that we created the production dataset, let’s export it again using the Data Masking feature.

Create a file maskings.json containing:


    {
        "creditcards": {
            "type": "masked",
            "maskings": [
                {
                    "path": "cardnumber",
                    "type": "randomString"
                },
                {
                    "path": ".name",
                    "type": "randomString"
                }
            ]
        },
        "persons": {
            "type": "masked",
            "maskings": [
                {
                    "path": "birthday",
                    "type": "randomString"
                },
                {
                    "path": ".name",
                    "type": "randomString"
                }
            ]
        }
    }
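Since a typo in maskings.json is easy to miss, it can help to sanity-check the file before running the dump. The sketch below only checks the structure used throughout this tutorial (a "type" per collection and a "path" and "type" per masking entry); check_maskings is a hypothetical helper, and the checks are assumptions derived from the examples here, not the full set of rules arangodump enforces.

```python
import json
from typing import Any, Dict, List


def check_maskings(definition: Dict[str, Any]) -> List[str]:
    """Return a list of structural problems; an empty list means none found."""
    problems = []
    for collection, spec in definition.items():
        if "type" not in spec:
            problems.append(f"{collection}: missing 'type'")
        for index, rule in enumerate(spec.get("maskings", [])):
            for field in ("path", "type"):
                if field not in rule:
                    problems.append(f"{collection}: masking #{index} missing '{field}'")
    return problems


definition_text = """
{
    "creditcards": {
        "type": "masked",
        "maskings": [
            {"path": "cardnumber", "type": "randomString"},
            {"path": ".name", "type": "randomString"}
        ]
    }
}
"""
# json.loads also catches plain syntax errors such as a missing comma.
problems = check_maskings(json.loads(definition_text))
print(problems or "no structural problems found")
```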

Now export using:

    arangodump --maskings maskings.json --server.database example --output-directory example

and import into a new database:

    arangorestore --server.database masked --create-database true --input-directory example

Check the result

Now look at the masked database:

    masked> db.persons.toArray()
    [
      {
        "_key" : "6117010",
        "_id" : "persons/6117010",
        "_rev" : "_YY-vgnK--C",
        "birthday" : "6VboePBV8I4=",
        "name" : "3CO/HOVddJI=",
        "spouse" : {
          "name" : "ZGivCqK+kHk="
        }
      },
      {
        "_key" : "6117009",
        "_id" : "persons/6117009",
        "_rev" : "_YY-vgnK--A",
        "birthday" : "uPHeHaFtNk0=",
        "name" : "QtzuCEGSaH0=QtzuCE",
        "spouse" : {
          "name" : "1g8GpUKNOXk="
        }
      },
      {
        "_key" : "6117011",
        "_id" : "persons/6117011",
        "_rev" : "_YY-vgnK--E",
        "birthday" : "PaQzrHjnddw=",
        "name" : "ZGivCqK+kHk=",
        "spouse" : {
          "name" : "3CO/HOVddJI="
        }
      },
      {
        "_key" : "6117008",
        "_id" : "persons/6117008",
        "_rev" : "_YY-vgnK---",
        "birthday" : "2Fi7Ja9fNE4=",
        "name" : "1g8GpUKNOXk=",
        "spouse" : {
          "name" : "QtzuCEGSaH0=QtzuCE"
        }
      }
    ]
 
    masked> db.creditcards.toArray()
    [
      {
        "_key" : "6117106",
        "_id" : "creditcards/6117106",
        "_rev" : "_YY-yFH2--A",
        "cardnumber" : "o3/NExaLEgM=o3/N",
        "name" : "ZGivCqK+kHk="
      },
      {
        "_key" : "6117105",
        "_id" : "creditcards/6117105",
        "_rev" : "_YY-yFH2---",
        "cardnumber" : "1R8fvrSJRE0=1R8f",
        "name" : "1g8GpUKNOXk="
      }
    ]
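A quick way to confirm that masking kept the relations intact is to check that every masked name in creditcards still appears in persons, and that every spouse name still refers to an existing person. The sketch below does this with the masked values shown above, reduced to the relevant attributes; the two helper functions are hypothetical.

```python
from typing import Dict, List

# Masked values copied from the output above (only name and spouse.name).
masked_persons = [
    {"name": "3CO/HOVddJI=", "spouse": {"name": "ZGivCqK+kHk="}},
    {"name": "QtzuCEGSaH0=QtzuCE", "spouse": {"name": "1g8GpUKNOXk="}},
    {"name": "ZGivCqK+kHk=", "spouse": {"name": "3CO/HOVddJI="}},
    {"name": "1g8GpUKNOXk=", "spouse": {"name": "QtzuCEGSaH0=QtzuCE"}},
]
masked_cards = [
    {"name": "ZGivCqK+kHk="},
    {"name": "1g8GpUKNOXk="},
]


def dangling_card_names(persons: List[Dict], cards: List[Dict]) -> List[str]:
    """Card holder names that no longer match any person."""
    known = {person["name"] for person in persons}
    return sorted(card["name"] for card in cards if card["name"] not in known)


def dangling_spouse_names(persons: List[Dict]) -> List[str]:
    """Spouse names that no longer match any person."""
    known = {person["name"] for person in persons}
    return sorted(
        person["spouse"]["name"]
        for person in persons
        if person["spouse"]["name"] not in known
    )


print(dangling_card_names(masked_persons, masked_cards))
print(dangling_spouse_names(masked_persons))
```

Both lists come back empty for this data, which shows that the join on name still works in the masked database.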

Enterprise version

If you have the Enterprise Edition, you can also use the following maskings.json. This masking definition uses the datetime and creditCard masking functions to preserve the structure of the birthday and cardnumber attribute values.

    {
        "creditcards": {
            "type": "masked",
            "maskings": [
                {
                    "path": "cardnumber",
                    "type": "creditCard"
                },
                {
                    "path": ".name",
                    "type": "randomString"
                }
            ]
        },
        "persons": {
            "type": "masked",
            "maskings": [
                {
                    "path": "birthday",
                    "type": "datetime",
                    "begin" : "1960-01-01",
                    "end": "2019-12-31",
                    "format": "%yyyy-%mm-%dd"
                },
                {
                    "path": ".name",
                    "type": "randomString"
                }
            ]
        }
    }

This will result in:

    masked> db.persons.toArray()
    [
      {
        "_key" : "6117010",
        "_id" : "persons/6117010",
        "_rev" : "_YY-vgnK--C",
        "birthday" : "1973-11-22",
        "name" : "pnenaFH6ygI=",
        "spouse" : {
          "name" : "2gD27vP66+k="
        }
      },
      {
        "_key" : "6117009",
        "_id" : "persons/6117009",
        "_rev" : "_YY-vgnK--A",
        "birthday" : "2007-06-09",
        "name" : "Jhhr0axG9S4=Jhhr0a",
        "spouse" : {
          "name" : "SpS1y/cb6uw="
        }
      },
      {
        "_key" : "6117011",
        "_id" : "persons/6117011",
        "_rev" : "_YY-vgnK--E",
        "birthday" : "1965-06-02",
        "name" : "2gD27vP66+k=",
        "spouse" : {
          "name" : "pnenaFH6ygI="
        }
      },
      {
        "_key" : "6117008",
        "_id" : "persons/6117008",
        "_rev" : "_YY-vgnK---",
        "birthday" : "1985-08-11",
        "name" : "SpS1y/cb6uw=",
        "spouse" : {
          "name" : "Jhhr0axG9S4=Jhhr0a"
        }
      }
    ]
 
    masked> db.creditcards.toArray()
    [
      {
        "_key" : "6117106",
        "_id" : "creditcards/6117106",
        "_rev" : "_YY-yFH2--A",
        "cardnumber" : "3000004160779761",
        "name" : "2gD27vP66+k="
      },
      {
        "_key" : "6117105",
        "_id" : "creditcards/6117105",
        "_rev" : "_YY-yFH2---",
        "cardnumber" : "3400003796970808",
        "name" : "SpS1y/cb6uw="
      }
    ]

We hope the new data masking feature is useful for you and that this tutorial helps you get started with it. If you have any feedback or questions about this tutorial, please let us know via learn@arangodb.com.
