One of the key problems with creating a centralized repository of logs is that it also creates a single place where attackers can get to sensitive information, whether that's implementation details like network traffic or sensitive data like usernames, API keys, or social security numbers. A common requirement, especially in the context of regulations like GDPR, is to minimize this risk by obfuscating or masking potentially sensitive information.
With LogStream you can mask data using a variety of techniques by applying the Masking Function to any event that matches an arbitrary condition. Let's take a look at how it's done.
Masking With Cribl
Cribl ships out of the box with a Masking Function that allows for multiple ways to anonymize data. Similar to sed, it looks for a target pattern and then applies a replacement, but it is much more flexible and has many more features.
Match Regex: A regex pattern that describes the content to be replaced. By default it stops after the first match unless the /g flag is used. Matching groups are optional.
Replace Expression: A JS expression or literal that replaces the matched content.
Matching groups: Can be referenced in the Replace Expression as g1, g2 … gN, and the entire match as g0.
Apply to Fields: Either one field or a set of fields to apply masking to. Defaults to _raw.
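To make the interplay of these settings concrete, here is a minimal sketch in plain Node.js. It is not Cribl's implementation: applyMask, replaceFn, and the sample event are hypothetical names introduced only for illustration. The callback plays the role of the Replace Expression and receives an array g, where g[0] is the entire match and g[1]..g[N] are the matching groups.

```javascript
// Hypothetical stand-in for a masking rule; not Cribl's actual code.
function applyMask(event, matchRegex, replaceFn, fields = ['_raw']) {
  const masked = { ...event };
  for (const field of fields) {
    if (typeof masked[field] !== 'string') continue;
    masked[field] = masked[field].replace(matchRegex, (...args) => {
      // String.prototype.replace passes: match, p1..pN, offset, whole string.
      // The first numeric argument after the match is the offset.
      const offsetIndex = args.findIndex((a, i) => i > 0 && typeof a === 'number');
      const g = args.slice(0, offsetIndex); // [g0, g1, ..., gN]
      return replaceFn(g, masked);
    });
  }
  return masked;
}

// Made-up sample event for illustration.
const sample = { _raw: 'cardNumber=214992458870391 cardNumber=111' };

// With the /g flag every match is replaced.
const all = applyMask(sample, /(cardNumber=)(\d+)/g, (g) => `${g[1]}XXXX`);
console.log(all._raw); // cardNumber=XXXX cardNumber=XXXX

// Without /g only the first match is replaced.
const first = applyMask(sample, /(cardNumber=)(\d+)/, (g) => `${g[1]}XXXX`);
console.log(first._raw); // cardNumber=XXXX cardNumber=111
```

The fields argument mirrors Apply to Fields and defaults to _raw; non-string fields are skipped in this sketch.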
The Replace Expression input field accepts a full JS expression that evaluates to a value, so you're not limited to what's under C.Mask. For example, you can do conditional replacement: g1%2==1 ? `fieldA="odd"` : `fieldA="even"`. In addition, the Replace Expression can reference other event fields as event.<field>. For example, with a two-group pattern, `${g1}${event.source}` keeps g1 and replaces g2 with the source of that event.
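Both expressions can be tried outside of LogStream with plain JavaScript. The sketch below uses String.prototype.replace as a stand-in for the Masking Function; the event object, its field values, and the id= pattern are made up for illustration.

```javascript
// Hypothetical event; field names and values are assumptions.
const event = { _raw: 'id=7 id=10', source: 'web01' };

// Conditional replacement: g1 holds the captured digits as a string, and
// the % operator coerces it to a number, as in the expression in the text.
const oddEven = event._raw.replace(/id=(\d+)/g, (g0, g1) =>
  g1 % 2 == 1 ? 'fieldA="odd"' : 'fieldA="even"');

// Referencing another event field: keep g1, replace g2 with event.source.
const withSource = 'cardNumber=214992458870391'.replace(
  /(cardNumber=)(\d+)/g, (g0, g1, g2) => `${g1}${event.source}`);

console.log(oddEven);    // fieldA="odd" fieldA="even"
console.log(withSource); // cardNumber=web01
```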
Different Ways To Mask
Data masking can be broadly categorized into two sets:
Set 1: Masking where uniqueness of information is flattened and lost forever. E.g., when social=123456789 is masked to social=XXXXXX.
Set 2: Masking where uniqueness of information is preserved. E.g., when social=123456789 is masked with a hash, say sha256() (and optionally a salt), to social=ae7be433101d00266be6a201....38bf5fff.
Why is the uniqueness of masks important? Well, if data is flattened to XXXX or REDACTED, or replaced with a random string, you lose the ability to distinguish between unique values. This means that analytics, trending, and reporting on that data are now impossible. With hashing you're guaranteed (within the probability of collision of that specific algorithm) to have unique, one-way masks, which you can leverage to your advantage: you can mask but you can also report.
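The difference is easy to see by counting distinct values before and after masking. This is a small sketch using Node's built-in crypto module rather than C.Mask; the sample values are made up.

```javascript
const crypto = require('crypto');

// Made-up sample values; note that the first and third are identical.
const values = ['123456789', '987654321', '123456789', '555443333'];

// Set 1: flatten every value to the same literal.
const flattened = values.map(() => 'XXXXXX');

// Set 2: replace each value with its SHA-256 hex digest.
const hashed = values.map((v) =>
  crypto.createHash('sha256').update(v).digest('hex'));

const distinct = (xs) => new Set(xs).size;

console.log(distinct(values));    // 3
console.log(distinct(flattened)); // 1
console.log(distinct(hashed));    // 3
```

A distinct count, a top-N report, or a join on the masked field still works on the hashed values, while the flattened values collapse into one bucket.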
There are several masking methods available under C.Mask:
C.Mask.random: Generates a random alphanumeric string.
C.Mask.repeat: Generates a repeating char/string pattern, e.g. YYYY.
C.Mask.REDACTED: The literal 'REDACTED'.
C.Mask.md5: Generates an MD5 hash of the given value.
C.Mask.sha1: Generates a SHA1 hash of the given value.
C.Mask.sha256: Generates a SHA256 hash of the given value.
Almost all methods have an optional len parameter which can be used to control the length of the replacement.
Masking Examples
Let's assume we're masking the digits in this pattern: cardNumber=214992458870391. The Match Regex that we'll use is: /(cardNumber=)(\d+)/g. In this example:
g0 = cardNumber=214992458870391
g1 = cardNumber=
g2 = 214992458870391
Set 1: uniqueness of information is lost
Random Masking with default character length (4):
Replace Expression: `${g1}${C.Mask.random()}`
Result: cardNumber=HRhc
Random Masking with defined character length:
Replace Expression: `${g1}${C.Mask.random(7)}`
Result: cardNumber=neNSm8r
Random Masking with length preserving replacement:
Replace Expression: `${g1}${C.Mask.random(g2)}`
Result: cardNumber=DroJ73qmyaro51u3
Repeat Masking with default character length (4):
Replace Expression: `${g1}${C.Mask.repeat()}`
Result: cardNumber=XXXX
Repeat Masking with defined character choice and length:
Replace Expression: `${g1}${C.Mask.repeat(6, 'Y')}`
Result: cardNumber=YYYYYY
Repeat Masking with length preserving replacement:
Replace Expression: `${g1}${C.Mask.repeat(g2)}`
Result: cardNumber=XXXXXXXXXXXXXXX
Literal REDACTED masking:
Replace Expression: `${g1}${C.Mask.REDACTED}`
Result: cardNumber=REDACTED
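The Set 1 behaviors can be approximated outside of LogStream. The helpers below (maskRandom, maskRepeat, resolveLen) are hypothetical stand-ins, not the C.Mask implementation. They assume what the examples above suggest: the length argument may be a number or a string whose length is reused, and the default length is 4.

```javascript
const crypto = require('crypto');

const ALPHANUM =
  'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789';

// Assumption: a number is used directly, a string contributes its length,
// and anything else falls back to the default of 4.
function resolveLen(len, fallback = 4) {
  if (typeof len === 'number') return len;
  if (typeof len === 'string') return len.length;
  return fallback;
}

// Stand-in for random masking: n random alphanumeric characters.
function maskRandom(len) {
  const n = resolveLen(len);
  let out = '';
  for (let i = 0; i < n; i++) {
    out += ALPHANUM[crypto.randomInt(ALPHANUM.length)];
  }
  return out;
}

// Stand-in for repeat masking: the character ch repeated n times.
function maskRepeat(len, ch = 'X') {
  return ch.repeat(resolveLen(len));
}

const raw = 'cardNumber=214992458870391';
const re = /(cardNumber=)(\d+)/g;

console.log(raw.replace(re, (g0, g1, g2) => `${g1}${maskRepeat()}`));       // cardNumber=XXXX
console.log(raw.replace(re, (g0, g1, g2) => `${g1}${maskRepeat(6, 'Y')}`)); // cardNumber=YYYYYY
console.log(raw.replace(re, (g0, g1, g2) => `${g1}${maskRepeat(g2)}`));     // cardNumber=XXXXXXXXXXXXXXX
console.log(raw.replace(re, (g0, g1, g2) => `${g1}${maskRandom(g2)}`).length); // 26
```

Length-preserving replacement keeps downstream field-width assumptions intact, at the cost of revealing how long the original value was.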
Set 2: uniqueness of information is preserved
Hash Masking (applies to: md5, sha1 and sha256):
Replace Expression: `${g1}${C.Mask.md5(g2)}`
Result: cardNumber=f5952ec7e6da54579e6d76feb7b0d01f
Hash Masking with Salt:
Replace Expression: `${g1}${C.Mask.md5(g2+'SALTSTRINGHERE')}`
Result: cardNumber= followed by a 32-character MD5 digest that differs from the unsalted result above, because the salt changes the hashed input
Hash Masking with left N-length substring:
Replace Expression: `${g1}${C.Mask.md5(g2, 12)}`
Result: cardNumber=d65a3ddb2749
Hash Masking with right N-length substring:
Replace Expression: `${g1}${C.Mask.md5(g2, -12)}`
Result: cardNumber=933bfcebf992
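The hash variants can be sketched the same way with Node's crypto module. maskMd5 is a hypothetical helper, not C.Mask.md5 itself. It assumes, following the examples above, that a positive length keeps the leftmost characters of the hex digest and a negative length keeps the rightmost.

```javascript
const crypto = require('crypto');

// Hypothetical helper: MD5 hex digest, optionally truncated.
// len > 0 keeps the leftmost len characters, len < 0 the rightmost.
function maskMd5(value, len) {
  const hex = crypto.createHash('md5').update(String(value)).digest('hex');
  if (!len) return hex;
  return len > 0 ? hex.slice(0, len) : hex.slice(len);
}

const g2 = '214992458870391';
const full = maskMd5(g2);
const salted = maskMd5(g2 + 'SALTSTRINGHERE');

console.log(full.length);                       // 32
console.log(full !== salted);                   // true
console.log(full.startsWith(maskMd5(g2, 12)));  // true
console.log(full.endsWith(maskMd5(g2, -12)));   // true
```

Appending a salt changes the input to the hash, so the salted digest differs from the unsalted one. Truncating shortens the mask but raises the chance that two different values share the same substring.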
When To Use What
The application of either of the above methods will depend on your exact requirements but here are a few best practices on when to use what:
If you simply want to hide information without ever needing to report on it, use a plain string replacement or C.Mask.REDACTED. It's super fast and does not add to the cardinality of your dataset, i.e. it does not increase the index size and therefore storage requirements.
If you want to report on the data in the future, use any of the hashing algorithms. If the dataset is small, use the left or right N-length substrings if you don't want to deal with full hashes.
Use hashing only when needed. While still fast, it is more computationally expensive than the other methods.
Use salted hashes to defend against dictionary or rainbow table attacks.
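To illustrate the last point, here is a small sketch of a dictionary attack against unsalted hashes. The 4-digit candidate space, the secret value, and the salt are all made up to keep the example fast; real identifiers have larger but still enumerable spaces.

```javascript
const crypto = require('crypto');

const sha256 = (s) => crypto.createHash('sha256').update(s).digest('hex');

// The attacker precomputes digests for every candidate (here, 4-digit PINs).
const table = new Map();
for (let i = 0; i < 10000; i++) {
  const pin = String(i).padStart(4, '0');
  table.set(sha256(pin), pin);
}

const secretPin = '4821';       // made-up sensitive value
const SALT = 'SALTSTRINGHERE';  // assumed to be unknown to the attacker

// Unsalted mask: a table lookup recovers the original value.
console.log(table.get(sha256(secretPin)));        // 4821

// Salted mask: the digest is not in the precomputed table.
console.log(table.get(sha256(secretPin + SALT))); // undefined
```

The protection depends on the salt staying secret: anyone who learns it can rebuild the table with the salt appended.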
If you are excited and interested about what we’re doing, please join us in Slack #cribl, tweet at us @cribl_io, or contact us via hello@cribl.io. We’d love to hear your stories!