
Anonymizing Data with LogStream

September 10, 2018
Written by Dritan Bitincka

Dritan Bitincka is Cribl's Head of Product and co-founder. Dritan has built a career of nearly 20 years as a customer-centric technical leader, having led and trained the Professional Services and BD Tech for AWS teams as a Principal Architect at Splunk. He has designed and implemented hundreds of deployments of large-scale multi-TB distributed systems.

Categories: Engineering

One of the key problems with creating a centralized repository of logs is that it also creates a single place where attackers can get to sensitive information, whether that’s implementation details like network traffic or sensitive data like usernames, API keys, or social security numbers. A common requirement, especially in the context of regulations like GDPR, is to minimize this risk by obfuscating or masking potentially sensitive information.

With LogStream you can mask data using a variety of techniques by applying the Masking Function on any event that matches any arbitrary condition. Let’s take a look and see how we do it.

Masking With Cribl

Cribl ships out of the box with a Masking Function that offers multiple ways to anonymize data. Similar to sed, it looks for a target pattern and then applies a replacement, but it is much more flexible and has many more features.


Match Regex: is a regex pattern that describes the content to be replaced. By default it will stop after the first match unless the /g flag is used. Matching groups are optional.
Replace Expression: is a JS expression or literal to replace matched content.
Matching groups: can be referenced in the Replace Expression as g1, g2, ..., gN and the entire match as g0.
Apply to Fields: Either one or a set of fields to apply masking to. Defaults to _raw.
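The effect of the /g flag can be seen with plain JavaScript regex replacement. This is a minimal sketch, assuming the Masking Function follows the same first-match versus global semantics described above; the sample string and variable names are made up for illustration.

```javascript
// Made-up sample line containing two values that match the pattern.
const raw = 'social=123456789 backup_social=987654321';

// Without the g flag, only the first match is replaced.
const firstOnly = raw.replace(/social=\d+/, 'social=XXXX');
console.log(firstOnly);
// social=XXXX backup_social=987654321

// With the g flag, every match is replaced.
const allMatches = raw.replace(/social=\d+/g, 'social=XXXX');
console.log(allMatches);
// social=XXXX backup_social=XXXX
```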

The Replace Expression input field accepts a full JS expression that evaluates to a value, so you’re not limited to what’s under C.Mask. For example, you can do conditional replacement: g1%2==1 ? `fieldA="odd"` : `fieldA="even"`. In addition, the Replace Expression can reference other event fields as event.<field>. For example, with a pattern that has two matching groups, `${g1}${event.source}` keeps g1 and replaces g2 with the source of that event.
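To make the group and event references concrete, here is a minimal sketch in plain JavaScript that mimics how a Replace Expression might be evaluated. The applyMask helper and the sample event are hypothetical and written only for illustration; they are not LogStream's implementation.

```javascript
// Hypothetical helper (not LogStream code): runs a regex over one field of an
// event and hands the match (g0) and capture groups (g1, g2, ...) to a callback.
function applyMask(event, matchRegex, replaceFn, field = '_raw') {
  const masked = String(event[field]).replace(matchRegex, (...args) => {
    // replace() passes: match, groups..., offset, input. The offset is the
    // first numeric argument, so everything before it is [g0, g1, g2, ...].
    const offsetIdx = args.findIndex((a, i) => i > 0 && typeof a === 'number');
    return replaceFn(args.slice(0, offsetIdx), event);
  });
  return { ...event, [field]: masked };
}

// Made-up sample event.
const event = { _raw: 'user=42 action=login', source: 'auth.log' };

// Conditional replacement, like: g1%2==1 ? `fieldA="odd"` : `fieldA="even"`
const parity = applyMask(event, /user=(\d+)/, ([g0, g1]) =>
  g1 % 2 == 1 ? 'fieldA="odd"' : 'fieldA="even"');
console.log(parity._raw); // fieldA="even" action=login

// Referencing another event field, like: `${g1}${event.source}`
const withSource = applyMask(event, /(user=)(\d+)/, ([g0, g1, g2], ev) =>
  `${g1}${ev.source}`);
console.log(withSource._raw); // user=auth.log action=login
```

The helper returns a copy of the event, so the original _raw value stays available for comparison.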


Different Ways To Mask

Data masking can be broadly categorized into two sets:

  • Set 1: Masking where uniqueness of information is flattened and lost forever. E.g., when social=123456789 is masked to social=XXXXXX
  • Set 2: Masking where uniqueness of information is preserved. E.g., when social=123456789 is masked with a hash, say sha256() (and optionally a salt), to social=ae7be433101d00266be6a201....38bf5fff

Why is the uniqueness of masks important? Well, if data is flattened to XXXX or REDACTED, or replaced with a random string, you lose the ability to distinguish between unique values. This means that analytics, trending, and reporting on that data are now impossible. With hashing you’re guaranteed, within the probability of collision of that specific algorithm, to have unique and non-reversible masks, which you can leverage to your advantage: you can mask but you can also report.
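The difference between the two sets can be illustrated by counting distinct values before and after masking. This is a minimal sketch using Node.js's built-in crypto module rather than C.Mask; the sample values and helper names are made up for illustration.

```javascript
const crypto = require('crypto');

// Made-up sample: two distinct values, one of them repeated.
const socials = ['123456789', '987654321', '123456789'];

// Set 1 style mask: every value collapses to the same string.
const flatten = () => 'XXXXXX';
// Set 2 style mask: equal inputs give equal hashes, different inputs differ.
const sha256 = (v) => crypto.createHash('sha256').update(v).digest('hex');

const flattened = socials.map(flatten);
const hashed = socials.map(sha256);

// Count distinct values in a list.
const distinct = (values) => new Set(values).size;

console.log(distinct(socials));   // 2
console.log(distinct(flattened)); // 1
console.log(distinct(hashed));    // 2
```

A distinct count, or any group-by report, still works on the hashed values, while the flattened values all fall into one bucket.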

There are several masking methods available under C.Mask:
C.Mask.random: Generates a random alphanumeric string.
C.Mask.repeat: Generates a repeating char/string pattern, e.g., YYYY.
C.Mask.REDACTED: The literal 'REDACTED'.
C.Mask.md5: Generates an MD5 hash of a given value.
C.Mask.sha1: Generates a SHA1 hash of a given value.
C.Mask.sha256: Generates a SHA256 hash of a given value.

Almost all methods have an optional len parameter which can be used to control the length of the replacement.


Masking Examples

Let’s assume we’re masking the digits in this pattern: cardNumber=214992458870391. The Match Regex that we’ll use is: /(cardNumber=)(\d+)/g. In this example:

  • g0 = cardNumber=214992458870391
  • g1 = cardNumber=
  • g2 = 214992458870391

Set 1 : uniqueness of information is lost

Random Masking with default character length (4):

Replace Expression: `${g1}${C.Mask.random()}`
Result: cardNumber=HRhc

Random Masking with defined character length:

Replace Expression: `${g1}${C.Mask.random(7)}`
Result: cardNumber=neNSm8r

Random Masking with length preserving replacement:

Replace Expression: `${g1}${C.Mask.random(g2)}`
Result: cardNumber=DroJ73qmyaro51u3

Repeat Masking with default character length (4):

Replace Expression: `${g1}${C.Mask.repeat()}`
Result: cardNumber=XXXX

Repeat Masking with defined character choice and length:

Replace Expression: `${g1}${C.Mask.repeat(6, 'Y')}`
Result: cardNumber=YYYYYY

Repeat Masking with length preserving replacement:

Replace Expression: `${g1}${C.Mask.repeat(g2)}`
Result: cardNumber=XXXXXXXXXXXXXXX

Literal REDACTED masking:

Replace Expression: `${g1}${C.Mask.REDACTED}`
Result: cardNumber=REDACTED
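The Set 1 examples can be reproduced outside LogStream with a short sketch. The maskRepeat and maskRandom functions below are hypothetical stand-ins whose behavior is inferred from the examples above (default length 4, and a string argument contributing its own length); the real C.Mask functions may behave differently.

```javascript
// Hypothetical stand-ins for C.Mask.repeat and C.Mask.random, inferred from
// the examples above. Not LogStream code.
const ALPHANUM = 'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789';

// A number is used as the length; a string contributes its own length.
const toLength = (len) => (typeof len === 'string' ? len.length : len);

function maskRepeat(len = 4, char = 'X') {
  return char.repeat(toLength(len));
}

function maskRandom(len = 4) {
  let out = '';
  for (let i = 0; i < toLength(len); i++) {
    // Math.random is fine for a demo but is not cryptographically secure.
    out += ALPHANUM[Math.floor(Math.random() * ALPHANUM.length)];
  }
  return out;
}

const raw = 'cardNumber=214992458870391';
const re = /(cardNumber=)(\d+)/g;

console.log(raw.replace(re, (g0, g1, g2) => `${g1}${maskRepeat()}`));       // cardNumber=XXXX
console.log(raw.replace(re, (g0, g1, g2) => `${g1}${maskRepeat(6, 'Y')}`)); // cardNumber=YYYYYY
console.log(raw.replace(re, (g0, g1, g2) => `${g1}${maskRepeat(g2)}`));     // cardNumber=XXXXXXXXXXXXXXX

// Length-preserving random mask: the output is as long as the input.
const randomMasked = raw.replace(re, (g0, g1, g2) => `${g1}${maskRandom(g2)}`);
console.log(randomMasked.length === raw.length); // true
```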


Set 2: uniqueness of information is preserved

Hash Masking (applies to: md5, sha1 and sha256):

Replace Expression: `${g1}${C.Mask.md5(g2)}`
Result: cardNumber=f5952ec7e6da54579e6d76feb7b0d01f

Hash Masking with Salt:

Replace Expression: `${g1}${C.Mask.md5(g2+'SALTSTRINGHERE')}`
Result: cardNumber=<32-character MD5 hash of the salted value, which differs from the unsalted hash>

Hash Masking with left N-length substring:

Replace Expression: `${g1}${C.Mask.md5(g2, 12)}`
Result: cardNumber=d65a3ddb2749

Hash Masking with right N-length substring:

Replace Expression: `${g1}${C.Mask.md5(g2, -12)}`
Result: cardNumber=933bfcebf992
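The hash variants can be sketched with Node.js's built-in crypto module. The maskMd5 function below is a hypothetical stand-in for C.Mask.md5, assuming that a positive len keeps the leftmost characters of the hex digest and a negative len keeps the rightmost, as the examples above suggest.

```javascript
const crypto = require('crypto');

// Hypothetical stand-in for C.Mask.md5(value, len). Not LogStream code.
// Assumption: positive len = left substring, negative len = right substring.
function maskMd5(value, len) {
  const hex = crypto.createHash('md5').update(String(value)).digest('hex');
  if (len === undefined || len === 0) return hex;
  return len > 0 ? hex.slice(0, len) : hex.slice(len);
}

const g2 = '214992458870391';
const full = maskMd5(g2);
const salted = maskMd5(g2 + 'SALTSTRINGHERE');
const left12 = maskMd5(g2, 12);
const right12 = maskMd5(g2, -12);

console.log(full.length);              // 32
console.log(salted !== full);          // true: the salt changes the digest
console.log(full.startsWith(left12));  // true
console.log(full.endsWith(right12));   // true
```

Truncated hashes are shorter to store and read, at the cost of a higher chance that two different inputs share the same mask.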



When To Use What

Which of the above methods to apply will depend on your exact requirements, but here are a few best practices on when to use what:

  • If you simply want to hide information without ever needing to report on it, use a plain string replacement or C.Mask.REDACTED. It’s super fast and does not add to the cardinality of your dataset, i.e., it does not increase the index size and therefore storage requirements.
  • If you want to report on the data in the future, use any of the hashing algorithms. If the dataset is small, use the left or right N-length substrings if you don’t want to deal with full hashes.
  • Use hashing only when needed. While still fast, it is more computationally expensive than the other methods.
  • Use salted hashes to defend against dictionary or rainbow table attacks.

If you are excited about what we’re doing, please join us in Slack #cribl, tweet at us @cribl_io, or contact us via hello@cribl.io. We’d love to hear your stories!
