One of the key problems with creating a centralized repository of logs is that it also creates a single place where attackers can get to sensitive information, whether that’s implementation details like network traffic or sensitive data like usernames, API keys, or Social Security numbers. A common requirement, especially in the context of regulations like GDPR, is to minimize this risk by obfuscating or masking potentially sensitive information.
With LogStream you can mask data using a variety of techniques by applying the Masking Function to any event that matches an arbitrary condition. Let’s take a look at how it’s done.
Cribl ships out of the box with a Masking Function that allows for multiple ways to anonymize data. Similar to sed, it looks for a target pattern and then applies a replacement, but it’s much more flexible and has many more features.
Match Regex: A regex pattern that describes the content to be replaced. By default it stops after the first match unless the /g flag is used. Matching groups are optional.
Replace Expression: A JS expression or literal that replaces the matched content.
Matching groups: Can be referenced in the Replace Expression as g1, g2 … gN, and the entire match as g0.
Apply to Fields: Either one field or a set of fields to apply masking to. Defaults to _raw.
The Replace Expression input field accepts a full JS expression that evaluates to a value, so you’re not limited to what’s under C.Mask. For example, you can do conditional replacement: g1%2==1 ? `fieldA="odd"` : `fieldA="even"`. In addition, the Replace Expression can reference other event fields as event.<field>. For example, with a Match Regex that has two groups, `${g1}${event.source}` keeps g1 and replaces g2 with the source of that event.
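To make the g0 … gN convention concrete, here is a minimal sketch in plain JavaScript that emulates how a Replace Expression could be evaluated for each match. It is not Cribl's implementation: `applyMask`, its callback signature, and the sample event are hypothetical names and data chosen for illustration only.

```javascript
// Hypothetical helper (not part of Cribl): applies `replaceFn` to regex matches in one field.
// `replaceFn` receives the groups as an array g (g[0] is the whole match) plus the event.
function applyMask(event, field, regex, replaceFn) {
  const value = String(event[field]);
  const masked = value.replace(regex, (...args) => {
    // String.prototype.replace passes: match, p1..pN, offset, whole string.
    // The offset is the first numeric argument, so everything before it is a group.
    const offsetIdx = args.findIndex((a) => typeof a === 'number');
    const g = args.slice(0, offsetIdx);
    return replaceFn(g, event);
  });
  return { ...event, [field]: masked };
}

// Illustrative event; field names and values are made up.
const sampleEvent = { _raw: 'count=7 user=alice', source: 'web01' };

// Conditional replacement, mirroring g1%2==1 ? `fieldA="odd"` : `fieldA="even"`.
const parity = applyMask(sampleEvent, '_raw', /count=(\d+)/, (g) =>
  g[1] % 2 == 1 ? 'fieldA="odd"' : 'fieldA="even"');
console.log(parity._raw); // fieldA="odd" user=alice

// Referencing another event field, mirroring `${g1}${event.source}`.
const withSource = applyMask(sampleEvent, '_raw', /(user=)(\w+)/, (g, e) =>
  `${g[1]}${e.source}`);
console.log(withSource._raw); // count=7 user=web01
```

The callback stands in for the Replace Expression: whatever it returns takes the place of the entire match (g0), which is why g1 has to be re-emitted if you want to keep it.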
Data masking can be broadly categorized into two sets:
Masks that do not preserve uniqueness: social=123456789 is masked to social=XXXXXX
Masks that preserve uniqueness: social=123456789 is masked with a hash, say sha256() (and optionally a salt), to social=ae7be433101d00266be6a201....38bf5fff
Why is the uniqueness of masks important? Well, if data is flattened to XXXX or REDACTED, or replaced with a random string, you lose the ability to distinguish between unique values. This means that analytics, trending, and reporting on that data are no longer possible. With hashing you’re guaranteed (within the collision probability of that specific algorithm) to have unique masks that cannot be directly reversed, which you can leverage to your advantage: you can mask, but you can also report. For values drawn from a small space, such as nine-digit numbers, use a salt, since unsalted hashes of such values can be recovered by brute force.
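The difference between the two categories can be sketched in a few lines of Node.js. This uses Node's built-in `crypto` module rather than C.Mask, and the sample values are made up; it only illustrates why hashed masks still support counting distinct values while flattened masks do not.

```javascript
const { createHash } = require('crypto');

// Three events, two distinct social values (illustrative data).
const socials = ['123456789', '987654321', '123456789'];

// Flattening: every value collapses to the same mask.
const flattened = socials.map(() => 'XXXXXX');

// Hashing: equal inputs give equal masks, and distinct inputs give distinct masks
// (up to the collision probability of the hash function).
const sha256Hex = (v) => createHash('sha256').update(v).digest('hex');
const hashed = socials.map(sha256Hex);

const distinctCount = (arr) => new Set(arr).size;
console.log(distinctCount(flattened)); // 1
console.log(distinctCount(hashed));    // 2
```

A distinct-count report over the hashed field returns the same answer it would have on the original data, without exposing the original values.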
There are several masking methods available under C.Mask:
C.Mask.random: Generates a random alphanumeric string.
C.Mask.repeat: Generates a repeating char/string pattern, e.g. YYYY.
C.Mask.REDACTED: The literal 'REDACTED'.
C.Mask.md5: Generates an MD5 hash of the given value.
C.Mask.sha1: Generates a SHA1 hash of the given value.
C.Mask.sha256: Generates a SHA256 hash of the given value.
Almost all methods have an optional len parameter, which can be used to control the length of the replacement.
Let’s assume we’re masking the digits in this pattern: cardNumber=214992458870391. The Match Regex that we’ll use is: /(cardNumber=)(\d+)/g. In this example:
g0 = cardNumber=214992458870391
g1 = cardNumber=
g2 = 214992458870391
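The group numbering above matches how capture groups work in an ordinary JavaScript regex, so it can be checked outside of LogStream. The snippet below is plain JavaScript, not the Masking Function itself; the surrounding log text and variable names are made up for illustration. It also shows the effect of the /g flag on a line with two matches.

```javascript
// Plain JavaScript regex behavior, used here only to illustrate the group numbering.
const sampleLine = 'ts=1 cardNumber=214992458870391 status=ok';
const [m0, m1, m2] = /(cardNumber=)(\d+)/.exec(sampleLine);
console.log(m0); // cardNumber=214992458870391
console.log(m1); // cardNumber=
console.log(m2); // 214992458870391

// Without /g only the first match is replaced; with /g every match is.
const twoCards = 'cardNumber=111 cardNumber=222';
const firstOnly = twoCards.replace(/(cardNumber=)(\d+)/, '$1XXXX');
const allMatches = twoCards.replace(/(cardNumber=)(\d+)/g, '$1XXXX');
console.log(firstOnly);  // cardNumber=XXXX cardNumber=222
console.log(allMatches); // cardNumber=XXXX cardNumber=XXXX
```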
Random Masking with default character length (4):
Replace Expression: `${g1}${C.Mask.random()}`
Result: cardNumber=HRhc
Random Masking with defined character length:
Replace Expression: `${g1}${C.Mask.random(7)}`
Result: cardNumber=neNSm8r
Random Masking with length preserving replacement:
Replace Expression: `${g1}${C.Mask.random(g2)}`
Result: cardNumber=DroJ73qmyaro51u3
Repeat Masking with default character length (4):
Replace Expression: `${g1}${C.Mask.repeat()}`
Result: cardNumber=XXXX
Repeat Masking with defined character choice and length:
Replace Expression: `${g1}${C.Mask.repeat(6, 'Y')}`
Result: cardNumber=YYYYYY
Repeat Masking with length preserving replacement:
Replace Expression: `${g1}${C.Mask.repeat(g2)}`
Result: cardNumber=XXXXXXXXXXXXXXX
Literal REDACTED masking:
Replace Expression: `${g1}${C.Mask.REDACTED}`
Result: cardNumber=REDACTED
Hash Masking (applies to: md5, sha1 and sha256):
Replace Expression: `${g1}${C.Mask.md5(g2)}`
Result: cardNumber=f5952ec7e6da54579e6d76feb7b0d01f
Hash Masking with Salt:
Replace Expression: `${g1}${C.Mask.md5(g2+'SALTSTRINGHERE')}`
Result: cardNumber=<MD5 hash of the salted value, which differs from the unsalted hash above>
Hash Masking with left N-length substring:
Replace Expression: `${g1}${C.Mask.md5(g2, 12)}`
Result: cardNumber=d65a3ddb2749
Hash Masking with right N-length substring:
Replace Expression: `${g1}${C.Mask.md5(g2, -12)}`
Result: cardNumber=933bfcebf992
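For readers who want to experiment with these behaviors outside of LogStream, here is a rough Node.js approximation. The functions `repeatMask`, `randomMask`, and `md5Mask` are hypothetical stand-ins written from the descriptions above; they are not the C.Mask implementation, and details such as the random alphabet are assumptions.

```javascript
const crypto = require('crypto');

// Hypothetical stand-ins for the behaviors described above; not Cribl's code.
// `len` may be a number, or a string whose length is used (length-preserving masks).
const toLen = (len, dflt = 4) =>
  typeof len === 'string' ? len.length : (len ?? dflt);

function repeatMask(len, ch = 'X') {
  return ch.repeat(toLen(len));
}

function randomMask(len) {
  // Assumed alphabet: ASCII letters and digits.
  const alphabet = 'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789';
  const n = toLen(len);
  let out = '';
  for (let i = 0; i < n; i++) {
    out += alphabet[crypto.randomInt(alphabet.length)];
  }
  return out;
}

function md5Mask(value, len) {
  const hex = crypto.createHash('md5').update(String(value)).digest('hex');
  if (len === undefined) return hex;
  // Positive len keeps the leftmost characters, negative len the rightmost.
  return len >= 0 ? hex.slice(0, len) : hex.slice(len);
}

const rawLine = 'cardNumber=214992458870391';
const cardRegex = /(cardNumber=)(\d+)/g;

// Length-preserving repeat mask: one X per original digit.
const repeated = rawLine.replace(cardRegex, (g0, g1, g2) => `${g1}${repeatMask(g2)}`);
console.log(repeated);

// Salting changes the hash, so the salted and unsalted masks differ.
const unsalted = md5Mask('214992458870391');
const salted = md5Mask('214992458870391' + 'SALTSTRINGHERE');
console.log(unsalted !== salted); // true
```

The positive and negative `len` handling in `md5Mask` follows the left and right N-length substring examples above, so `md5Mask(v, 12)` is the first 12 hex characters of the full hash and `md5Mask(v, -12)` is the last 12.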
Which of the above methods to apply will depend on your exact requirements, but here are a few best practices on when to use what:
C.Mask.REDACTED: It’s very fast and does not add to the cardinality of your dataset, i.e., it does not increase the index size and therefore the storage requirements.