How We Evaluated Data Scrubbing Services (And Then Decided To Build Our Own)
At Turngate, we scrub data so we can perform analysis without compromising privacy. We want to be able to find trends, tag data, and see what additional context to highlight, and we want to do all of that without giving anyone internal access to raw customer data. There are a lot of really great data scrubbing services, and we learned a lot as we evaluated them.
As a quick refresher, the goal of data scrubbing is to remove sensitive information, usually some form of PII (personally identifiable information). Examples of PII include date of birth, full name, mailing address, and IP address.
Our requirements
While evaluating the various data scrubbing services, we had the following requirements:
#1: No identifiable information of any kind
We had a hard requirement that no identifiable information of any kind could find its way into the resulting data. To build a privacy-centered tool, we drew a strict boundary around one principle: there should be no way to link a piece of data back to the customer or company it originated from. That way, the person doing the analysis has no way of knowing whose data they are looking at.
So we started with scrubbing your normal list of PII suspects:
- Names
- Email addresses
- Mailing addresses
- Phone numbers
- IP addresses
But then we added a bunch of additional items that might leak information about the customer (or their activity), such as:
- Domain names
- Company names
- URLs
- Any free-form text (such as labels, subject lines, comments, etc)
#2: Data must remain in the original shape
We scrub data to enable data analysis. If the data doesn’t accurately reflect the original data, the decisions drawn from the analysis won’t be accurate, rendering the entire exercise useless.
For example, let’s pretend we have one person with four records, and a second with two records (see below). We would expect the output to be sanitized so that any identifiable information is changed, yet there should still be the same number of people, each with the same number of records. This is only one example: if you’re not careful, as complex data compounds across thousands of such decisions, you can quickly end up with a pile of scrubbed data that is useless for data analysis.
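To make that concrete, here’s an invented illustration of input records and the kind of output we’d want (nothing below is real customer data):

```python
from collections import Counter

# Invented input: one person with four records, a second person with two.
original = [
    {"user": "alice@acme.com", "event": "login"},
    {"user": "alice@acme.com", "event": "login"},
    {"user": "alice@acme.com", "event": "login"},
    {"user": "alice@acme.com", "event": "login"},
    {"user": "bob@acme.com", "event": "login"},
    {"user": "bob@acme.com", "event": "login"},
]

# Desired output: the identifiers are fake, but there are still exactly two
# distinct people, one with four records and one with two.
scrubbed = [
    {"user": "user-1@example.com", "event": "login"},
    {"user": "user-1@example.com", "event": "login"},
    {"user": "user-1@example.com", "event": "login"},
    {"user": "user-1@example.com", "event": "login"},
    {"user": "user-2@example.com", "event": "login"},
    {"user": "user-2@example.com", "event": "login"},
]

# The shape is preserved: per-person record counts match even though identities changed.
assert sorted(Counter(r["user"] for r in original).values()) == \
       sorted(Counter(r["user"] for r in scrubbed).values())
```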
#3: We must work on nested JSON data
The data that we are scrubbing comes from devices and services with complex APIs. A lot of the scrubbing products we looked at required the data to be modeled into columnar tables, which would have demanded a huge amount of preprocessing before it was ready to scrub. For example, the sketch below shows the kind of nested structure an API response can have (field names are simplified for illustration):
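```python
# Illustrative only: a sketch of the kind of nested, dynamic structure these
# APIs return. The field names are simplified, not taken from a specific vendor.
event = {
    "id": "evt_123",
    "actor": {
        "type": "user",
        "email": "alice@acme.com",
        "profile": {"displayName": "Alice Example", "department": "Finance"},
    },
    "target": [
        {"type": "application", "displayName": "Payroll App"},
        {"type": "user", "email": "bob@acme.com"},
    ],
    "client": {
        "ipAddress": "203.0.113.7",
        "geo": {"city": "Seattle", "country": "US"},
    },
    "outcome": {"result": "SUCCESS", "reason": None},
}
```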
Given that we have a bunch of integrations and hundreds of events per integration (each with its own data format), manually modeling all of this was infeasible. We needed something that could read, scrub, and write JSON.
#4: Minimal effort
We’re a small startup and we don’t have a person we can dedicate to these scrubbing efforts, much less an entire team like some of the largest companies have. We need something that can be designed, implemented, and operated by a single engineer who is juggling other responsibilities (that’s me, by the way!).
#5: No surprises
Lastly, we wanted to make a decision on every field in the JSON. If something slipped through, we wanted it to be because of a bad decision we made, not because something unexpected happened.
Examples of surprises we wanted to avoid:
- The API didn’t name things the way we expected, and since we were only looking for bespoke fields to scrub, they slipped through.
- The API added a new field. Since we didn’t know about it and didn’t have a scrubber attached to that field, it slipped through.
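One way to think about this requirement is as an explicit allow-list: anything we haven’t made a decision about gets flagged rather than silently passed through. A minimal sketch of the idea (the field names are hypothetical):

```python
# Hypothetical allow-list of top-level fields we have explicitly made a decision about.
ALLOWED_FIELDS = {"id", "actor", "target", "client", "outcome"}

def unexpected_fields(event: dict) -> set:
    """Return any top-level fields we have not made an explicit decision about."""
    return set(event) - ALLOWED_FIELDS

# A new field shows up in the API response: surface it instead of letting it slip through.
assert unexpected_fields({"id": "evt_1", "actor": {}, "riskScore": 42}) == {"riskScore"}
```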
Common Strategies Used By Data Scrubbing Services
Now that we’ve come up with a list of requirements, let’s take a look at common data scrubbing strategies and see how they stack up against our requirements.
We’ll be using the following grades for each of the five criteria listed above:
- 🟢 The strategy sufficiently meets the requirement
- 🔴 The strategy fails to meet the requirement
- 🟡 It’s complicated… depending on certain factors it might work
No Scrubbing
Obviously no data scrubbing service actually does this, but one approach we could have taken is to throw caution (and our customers’ trust) to the wind and combine all our customers’ raw data into one giant, nasty pile. While this certainly is easy and enables accurate data analysis, it is completely unacceptable. Not only is it an awful breach of user trust, it would have landed us in legal hot water, to boot.
Evaluation:
- No identifiable information – 🔴 Nope, data’s not scrubbed at all
- Retain original shape – 🟢 The data isn’t modified at all
- Works on nested JSON – 🟢 Doing nothing works on anything
- Minimal effort – 🟢 Yup, this is minimal effort for sure
- No surprises – 🟢 Since you’re choosing not to protect PII at all, you can’t be surprised if PII data slips through… at least not until a lawsuit shows up at your door
So while this technically does tick most boxes, it was a completely unacceptable solution for us.
Data Redaction
Another method is to redact sensitive data. There are many ways to do this, but common strategies are to replace values with empty strings or placeholder values, such as [NAME] or $NAME. Scrubadub is a popular library that supports data redaction in text.
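For example, scrubadub’s top-level API does something roughly like this (exactly which detectors run by default, and what the placeholders look like, varies by version and configuration):

```python
import scrubadub

text = "Alice can be reached at alice@acme.com"
print(scrubadub.clean(text))
# With the default detectors this yields something like:
# "Alice can be reached at {{EMAIL}}"
```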
Evaluation:
- No identifiable information – 🟢 With the right configuration, you can scrub everything you need
- Retain original shape – 🔴 Since you’re just stripping out values, you have no idea how data links together and it makes a lot of data analysis impossible. It might work for you if you have numerical identifiers that don’t need to be scrubbed. However, most of the data we were getting back from APIs used email addresses as identifiers.
- Works on nested JSON – 🔴 For simple JSON, you can technically treat the JSON blob as plain text and run the scrubber on that. For more deeply nested data, where the scrubber needs to know where it is in the tree, that approach won’t cut it
- Minimal effort – 🟢 There are some good popular open source libraries that can be leveraged
- No surprises – 🔴 Since you’re looking for bespoke things to redact, there’s no way to prevent new surprises from slipping through
Generating Synthetic Data
Yet another common approach is to use completely synthetic data. By generating the data from scratch, the risk of leaking data is much reduced (you might assume the risk is zero, but that isn’t necessarily the case; more on that in a bit).
Generating synthetic data is usually a two-step process. First, a model is trained so the system knows what structure to generate the data in; this training might be an automated pass that infers the structure from the data, or it might require the operator to define the structure manually. Second, the trained model generates brand-new records in that structure.
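As a rough sketch of what the generation step can look like once a structure has been defined (hand-written here for brevity), a library like Faker can fill that structure with entirely invented values; the field names below are illustrative:

```python
from faker import Faker

fake = Faker()
Faker.seed(0)  # make the invented output repeatable

def synthetic_event() -> dict:
    """Generate one fully invented record that follows a hand-defined structure."""
    return {
        "actor": {"name": fake.name(), "email": fake.email()},
        "client": {"ipAddress": fake.ipv4()},
        "timestamp": fake.iso8601(),
    }

dataset = [synthetic_event() for _ in range(100)]
```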
Evaluation:
- No identifiable information – 🟡 Since you’re generating the data from scratch, at first glance it might seem there is no chance of leaking the source data, and there’s a good chance that’s true. However, products that rely on AI and LLMs may have a tougher time proving it. Check out https://leak-llm.github.io/ for a deep dive into how LLMs can sometimes leak their underlying data
- Retain original shape – 🟡 For simple data this would probably be possible. However, as the complexity increases, it becomes less and less likely that a synthetic dataset generated from scratch will retain the original characteristics
- Works on nested JSON – 🟡 For simple JSON, this is possible. However, some of the data generation services we evaluated choked on the deeply nested data structures we were working with
- Minimal effort – 🔴 Generating synthetic data, especially data that can serve as an acceptable facsimile during data analysis, is very difficult, and the difficulty only increases as the complexity of the data grows
- No surprises – 🟢 One good thing about synthetic data is that you don’t have to worry about new source data affecting the output unless you retrain the model.
Data Masking
Data masking is the process of swapping values in authentic data for inauthentic, generated values. Fields containing sensitive information are swapped for realistic-looking (but fake) values. Tools that do this well maintain an internal database so that values can be swapped consistently. This allows the data to retain its original shape and characteristics (which, again, is important for data analysis).
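A minimal sketch of that consistency idea (illustrative code, not any particular product): keep a mapping from real value to fake value and reuse it, so the same input always gets the same stand-in.

```python
from faker import Faker

fake = Faker()
email_map: dict[str, str] = {}  # in a real tool this mapping lives in a database

def mask_email(real: str) -> str:
    """Return the same fake email every time the same real email is seen."""
    if real not in email_map:
        email_map[real] = fake.unique.email()
    return email_map[real]

# Consistent swapping preserves linkage without exposing the original value.
assert mask_email("alice@acme.com") == mask_email("alice@acme.com")
assert mask_email("alice@acme.com") != mask_email("bob@acme.com")
```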
Evaluation:
- No identifiable information – 🟢 With the right configuration, you should be able to scrub everything you need
- Retain original shape – 🟢 Assuming the system you’re using is consistently swapping values, the scrubbed data should have the data characteristics of the original data
- Works on nested JSON – 🔴 The services and tools we looked at required data structured into tables and wouldn’t work on nested dynamic JSON
- Minimal effort – 🟡 The scrubbing itself is fairly straightforward, and it’s trivial to retain the original shape of the data. However, deconstructing all the nested JSON and modeling it into tables (and then reconstructing it back into JSON) would have been a huge amount of work
- No surprises – 🟢 Since you’re forced to model all the data into tables, new fields appearing in the JSON can’t slip through undetected; they simply won’t show up until the model is manually updated
Summary
Here’s a summary of how the data scrubbing methods we evaluated stack up against our requirements (excluding “no scrubbing”, which was never an acceptable option):

| Requirement | Data Redaction | Synthetic Data | Data Masking |
| --- | --- | --- | --- |
| No identifiable information | 🟢 | 🟡 | 🟢 |
| Retain original shape | 🔴 | 🟡 | 🟢 |
| Works on nested JSON | 🔴 | 🟡 | 🔴 |
| Minimal effort | 🟢 | 🔴 | 🟡 |
| No surprises | 🔴 | 🟢 | 🟢 |

While some methods came closer than others, unfortunately nothing we evaluated met all of our requirements.
So what’s an engineer to do when he has a problem to solve and no solution in sight? Start coding, of course!
Introducing Loki
At Turngate, we developed an internal tool which combines the best concepts of data masking with synthetic data generation to meet all of our requirements. We called it Loki, named after the shape-shifting god of mischief from Norse mythology.
Taking inspiration from synthetic data generation, we use a two-step process where we first analyze the data and learn how the data is structured. Then we use data masking principles to substitute values in a consistent manner so that the data’s shape and analytical properties are maintained. Those mappings are stored in a database so things remain consistent between runs, even between schemas.
As a reminder, we’re not only scrubbing for obvious PII, but we also perform substitution on all fields coming back from APIs which are free-form text (such as labels, subject lines, and comments). In order to protect customer data, we generate replacement free-form text that, although often nonsensical, shares the needed analytical properties (such as having the same number of words).
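Loki’s internals aren’t shown here, but a heavily simplified toy version of the core idea (schema-driven substitution backed by a mapping store, with word-count-preserving replacement for free-form text) looks roughly like this; every name and field below is illustrative:

```python
from faker import Faker

fake = Faker()
mappings: dict[tuple[str, str], str] = {}  # (kind, original) -> replacement; persisted in a database in practice

def substitute(kind: str, value: str) -> str:
    """Swap a value for a generated one, consistently across runs and schemas."""
    key = (kind, value)
    if key not in mappings:
        if kind == "email":
            mappings[key] = fake.unique.email()
        elif kind == "name":
            mappings[key] = fake.unique.name()
        elif kind == "text":
            # Nonsense words, but the same word count as the original.
            mappings[key] = " ".join(fake.word() for _ in value.split())
        else:
            mappings[key] = value  # fields the schema marks as safe pass through unchanged
    return mappings[key]

def scrub(node, schema):
    """Walk nested JSON-like data, emitting only the fields the schema knows about."""
    if isinstance(node, list):
        return [scrub(item, schema) for item in node]
    if isinstance(node, dict):
        return {k: scrub(v, schema[k]) for k, v in node.items() if k in schema}
    return substitute(schema, node)

schema = {"actor": {"email": "email", "displayName": "name"}, "comment": "text"}
event = {
    "actor": {"email": "alice@acme.com", "displayName": "Alice Smith"},
    "comment": "please review the Q3 budget",
    "riskScore": 42,  # not in the schema, so it never appears in the output
}
print(scrub(event, schema))
```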
So how does Loki stack up against our original criteria?
Evaluation:
- Contains no identifiable information – 🟢 We are able to scrub all fields we need, even free-form text fields (see above for details)
- Retains original shape – 🟢 Since we’re swapping values consistently, the scrubbed data has the data characteristics of the original data
- Works on nested JSON – 🟢 Loki is able to analyze and generate deeply nested and dynamic JSON data
- Requires minimal effort – 🟢 Writing the first version of Loki took a couple of weeks of work. Now that the engine is complete, the effort needed to generate new schemas and define new mapping rules is minimal
- Gives no surprises – 🟢 Loki only generates what is stored in the schema. When new fields are returned by the API, they don’t appear in the output until they are added to the schema.
We noticed a gap in existing data scrubbing tooling, so we created our own tool. If you’re also struggling to scrub data in a way that protects user privacy without diminishing its usefulness for data analysis, please contact us! We’re considering open sourcing this tool to help shape the ecosystem of data scrubbing services, and with enough interest we’ll start working on it.