Use Cases for Anonymizing Log Data
At Treasure Data, we store and manage large volumes of data for our customers as a cloud-based service for big data. Our customers often ask whether there is a way to ensure that sensitive information (or personally identifiable information, known as PII) is redacted or anonymized in order to maintain compliance with company policies and data-related regulations such as HIPAA, the PCI DSS, or EU data protection directives.
The answer is a resounding yes, and here is the even better news: The solution works with all SaaS providers supported by Fluentd, including Loggly.
How It Works
Fluentd is an open source data collector designed to unify logging infrastructure. Fluentd forms the core of Treasure Agent, which is a lightweight data collector for Treasure Data’s big data backend, but it can also send logs to other destinations such as Loggly.
Fluentd has an open, pluggable architecture that allows users to extend its functionality via plugins. One of these is the anonymizer plugin, which replaces the values of selected fields with one-way hashes such as SHA512 (strictly speaking this is hashing rather than reversible encryption).
Sometimes hashing isn’t sufficient; you may not want certain data to be stored at all. For these use cases, there is the record_reformer plugin, which can delete specific fields.
You and your team can use Loggly and other cloud-based backend systems without worrying about leaking sensitive, personally identifiable information (PII) by doing the following:
- Hashing sensitive fields with the anonymizer plugin
- Removing unwanted fields with the record_reformer plugin before streaming the data to Loggly with the Loggly output plugin
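To make the two steps concrete, here is a minimal Python sketch of what happens to a single log record. It is not the plugins' actual code: the function names and the sample event are hypothetical, and it assumes a plain, unsalted SHA-512 over the UTF-8 bytes of the field value.

```python
import hashlib


def anonymize_field(record, key):
    """Return a copy of record with record[key] replaced by its SHA-512 hex digest."""
    out = dict(record)
    if key in out:
        # Assumption: unsalted SHA-512 of the UTF-8 encoded value.
        out[key] = hashlib.sha512(str(out[key]).encode("utf-8")).hexdigest()
    return out


def remove_fields(record, keys):
    """Return a copy of record without the listed keys."""
    return {k: v for k, v in record.items() if k not in keys}


# Hypothetical event, shaped like the one discussed in this post.
event = {"host": "web01.example.com", "user_id": "please-hide-me", "action": "login"}

# Step 1: hash the sensitive field. Step 2: drop the field we never want stored.
cleaned = remove_fields(anonymize_field(event, "user_id"), {"host"})
print(cleaned)
```

The original event is left untouched; only the cleaned copy would be forwarded to the backend.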
Setting Up Fluentd and Loggly
First, you must download Fluentd. The easiest way is to use Treasure Data’s td-agent package for deb (Ubuntu/Debian), rpm (RHEL/CentOS) and OSX (dmg installer). Note that the program will be called td-agent, not fluentd (some people find this confusing).
For example, here is the script to install td-agent on Ubuntu Trusty (14.04):
curl -L http://toolbelt.treasuredata.com/sh/install-ubuntu-trusty-td-agent2.sh | sh
The second step is installing the anonymizer, record_reformer, and Loggly plugins.
sudo /usr/sbin/td-agent-gem install fluent-plugin-anonymizer
sudo /usr/sbin/td-agent-gem install fluent-plugin-record-reformer
sudo /usr/sbin/td-agent-gem install fluent-plugin-loggly
Next, configure td-agent by editing /etc/td-agent/td-agent.conf (please visit our documentation to learn more about how to configure Fluentd). Open /etc/td-agent/td-agent.conf with your favorite editor and replace the example configuration with the following:
Finally, restart td-agent to start streaming logs into Loggly.
sudo /etc/init.d/td-agent restart
When you head over to Loggly, you’ll see events like this:
You’ll note that:
- It no longer has the “host” field. This was successfully redacted by the record_reformer plugin.
- The “user_id” field has been anonymized using SHA512 (the original user id was “please-hide-me”).
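One reason a hashed “user_id” remains useful in Loggly is that the same input always produces the same digest, so events can still be grouped per user without revealing who the user is. The sketch below illustrates this with Python's standard hashlib. It assumes a plain, unsalted SHA-512; if the plugin is configured with a salt, the value stored in Loggly will differ from what this prints, and the helper name is hypothetical.

```python
import hashlib


def sha512_hex(value):
    """Hex-encoded SHA-512 digest of a string (assumption: no salt, UTF-8 input)."""
    return hashlib.sha512(value.encode("utf-8")).hexdigest()


first = sha512_hex("please-hide-me")
second = sha512_hex("please-hide-me")
other = sha512_hex("someone-else")

print(len(first))       # 128 hex characters (512 bits)
print(first == second)  # True: same user id, same digest
print(first == other)   # False: different user ids stay distinguishable
```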
What’s Next for Your Logs?
You can learn more about Fluentd in general or the plugins used here (anonymizer, record-reformer, loggly). Also, I want to mention that this is the simplest of anonymization techniques. For those interested, I strongly recommend this blog about differential privacy. (Be aware that differential privacy techniques are more involved than what Fluentd’s anonymizer plugin offers.)
Follow Kiyoto on Twitter: @kiyototamura