Extending Cribl: Building Custom Functions

Written by Clint Sharp

November 29, 2018

One constant in log use cases is that you can’t plan for what you’re going to find at customers. Whether it’s multiple levels of encapsulation, like JSON-in-XML-in-Pipe-Separated (yes, we’ve seen this), a need to radically transform the structure of events in a way we haven’t seen, or a need to reach out to an external system we’ve never worked with before, we knew going into this market that we’d need to provide an easily extensible product. When a customer’s need can’t be met by chaining our flexible out-of-the-box functions like Drop, Eval, or Mask, either we or the customer can drop in a custom function that does the job.

We chose JavaScript in part for its rich ecosystem of libraries and its emergence as a universal runtime with WebAssembly. Cribl allows you to easily drop in your own code, interpreted or compiled, and get full access to your log data in motion. Even before we had a UI or any out-of-the-box functions, we proved out all our use case ideas through custom functions. We provide a very simple API for working with data, requiring you to implement only two methods: init and process. Configuration for custom functions works the same way as for out-of-the-box functions: you provide a JSON Schema and a UI schema, as implemented via React JSON Schema Form. From this simple schema definition language, which you may recognize as the one behind Swagger, the UI automatically renders the forms, so you can make even sophisticated configuration available to your end users.

Lastly, one other advantage of JavaScript as a language is that users can configure functions with JavaScript itself, in the form of JavaScript Expressions. Combined with the library of functions we ship for Masking and Encoding, one-liner JavaScript expressions in Cribl configurations make powerful transformations and operations possible.

With this post, we will walk you through how to build some custom functions with Cribl. We’ll start with some examples from the functions we ship, then conclude with a common use case: doing a DNS lookup against IP Addresses found in raw data.

What is a function?

First, we should define what a function is in Cribl. Functions are a combination of code, configuration, and data, packaged as a directory of files. Here is the regex_filter function that we ship with Cribl:

regex_filter
├── conf.schema.json
├── conf.ui-schema.json
└── index.js

index.js contains our JavaScript code. It can include any built-in Node modules or reference other JavaScript files in its directory. Support for npm modules is on the backlog.

conf.schema.json and conf.ui-schema.json are schema files for React JSON Schema Form, which will be covered in more detail below.

You can use Data Preview to store sample data with out-of-the-box or custom functions, for testing and validation.

To install a function, perhaps from our content repository, simply drop the function directory into $CRIBL_HOME/local/cribl/functions, and restart Cribl. After that, the function will be available in the UI.

Note: Prior to LogStream 1.7, this subdirectory was: $CRIBL_HOME/bin/functions.

Next, let’s look into the details of how a function is implemented.

Drop: The Simplest Function

Let’s examine a function which Cribl ships with: Drop. Drop is an incredibly simple function. If the Filter expression matches, we’ll drop the event. The Filter expression gets evaluated before the function itself gets called, so Drop is only executed for events which should be dropped.

Let’s look at the code for the function:

exports.name = 'Drop';
exports.version = '0.1';
exports.group = 'Standard';
exports.process = () => null;

Cribl functions are Node.js modules, and we look for several module exports to be defined, the names of which should seem obvious: name defines how the UI will display the function name, version documents the function’s version, and group is used by the UI to group like functions.

The process method is called for every event. It is passed the event, a JavaScript object that contains all the key/value pairs from our event. These key/value pairs are sent to our destination systems: in Splunk, they become index-time fields; in Elastic, they become the shape of the event; and for a FileSystem or S3 destination, they are serialized as JSON documents, one per line. The Drop function does not use the contents of the event, so the method simply returns null for every call. When Cribl receives a falsy return value, it drops the event.
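
To make the return-value contract concrete, here is a minimal sketch of a harness that feeds events through a process method and serializes the survivors as one JSON document per line. The runFunction helper, the passthrough function, and the sample events are assumptions made up for illustration; this is not how Cribl’s pipeline is actually implemented.

```javascript
// Hypothetical stand-in for a Cribl function module: drops every event.
const drop = { process: () => null };

// Hypothetical stand-in for a function that passes events through untouched.
const passthrough = { process: (event) => event };

// Illustrative harness (not Cribl's implementation): call process() once per
// event, discard falsy results, and serialize the survivors as one JSON
// document per line.
function runFunction(fn, events) {
  return events
    .map((event) => fn.process(event))
    .filter((result) => Boolean(result))
    .map((result) => JSON.stringify(result))
    .join('\n');
}

const sampleEvents = [
  { _raw: 'GET /index.html 200', host: 'web01' },
  { _raw: 'GET /health 200', host: 'web02' },
];

console.log(runFunction(passthrough, sampleEvents)); // two JSON lines
console.log(JSON.stringify(runFunction(drop, sampleEvents))); // "" (everything dropped)
```

Running a function against a handful of hand-written events like this is a quick way to check its behavior before installing it.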

Next Example: RegEx Filter

Now let’s introduce a slightly more sophisticated example. The next function, RegEx Filter, will drop an event if a given regular expression matches. This introduces some configuration into the function, allowing the user to input data. It implements both init and process, and ships conf.schema.json and conf.ui-schema.json for defining configurable variables.

First, let’s look at the biggest new item we’ve introduced, JSON Schema. If you’ve never heard of JSON Schema, check out their tutorial. We use React JSON Schema Form to render JSON Schema as forms. You can use their interactive playground to test forms and see what options are available. For RegEx Filter, we’ve introduced a simple schema which defines two config variables: regex which defines the RegEx we’ll execute against the data, and field which defines which field we’ll test for a RegEx match. Here’s the Schema JSON, contained in conf.schema.json:

{
  "type": "object",
  "title": "",
  "properties": {
    "regex": {
      "title": "Regex",
      "description": "Regex to test against",
      "type": "string",
      "regexp": true
    },
    "field": {
      "title": "Field",
      "description": "Name of the field to apply the regex on (defaults to _raw)",
      "type": "string",
      "default": "_raw"
    }
  }
}

This should seem straightforward. The schema describes an object whose properties, regex and field, each have attributes such as a title, description, type, and default value. Any JSON Schema will work here, including sophisticated examples we’ve seen in Swagger. For some more sophisticated examples in Cribl, look at the Mask or Lookup functions.
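
To illustrate what the default keyword provides, here is a rough sketch of a helper that fills in missing configuration values from a schema of this shape. The applyDefaults helper is hypothetical and handles only top-level properties; React JSON Schema Form and Cribl each apply defaults in their own way.

```javascript
// The schema from conf.schema.json, trimmed to the keys this sketch uses.
const schema = {
  type: 'object',
  properties: {
    regex: { type: 'string' },
    field: { type: 'string', default: '_raw' },
  },
};

// Hypothetical helper: copy the user's conf and fill any missing top-level
// property from the schema's "default" keyword. Nested objects are not handled.
function applyDefaults(schemaDef, conf = {}) {
  const result = { ...conf };
  for (const [key, prop] of Object.entries(schemaDef.properties || {})) {
    if (result[key] === undefined && prop.default !== undefined) {
      result[key] = prop.default;
    }
  }
  return result;
}

console.log(applyDefaults(schema, { regex: 'DEBUG' }));
// { regex: 'DEBUG', field: '_raw' }
console.log(applyDefaults(schema, { regex: 'DEBUG', field: 'message' }));
// { regex: 'DEBUG', field: 'message' }
```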

React JSON Schema Form also allows us to specify information that isn’t covered by the data schema alone. The UI may need to differentiate a password field from a normal string field, for example. In this case, we define the regex field to use a custom input type that validates a regular expression. This goes in conf.ui-schema.json:

{
  "regex": {
    "ui:widget": "RegexInput",
    "ui:placeholder": "Regular expression"
  }
}

The UI Schema matches a given field name, in this case regex, and tells the form to use a ui:widget of RegexInput. Now, let’s look at the code in index.js:

exports.name = 'Regex Filter';
exports.version = '0.1';
exports.group = 'Standard';

const { NamedGroupRegExp } = C.util; 

let regex;
let field = '_raw';
exports.init = (opts) => {
  const conf = opts.conf || {};
  regex = null;
  field = '_raw';

  if (conf.regex) {
    regex = new NamedGroupRegExp(conf.regex);
  }
  if (conf.field) {
    field = conf.field;
  }
};

exports.process = (event) => {
  if (regex) {
    regex.lastIndex = 0; // common trap of setting "global" flag
    return regex.test(event[field]) ? null : event;
  }
  return event;
};

The function is, again, quite simple. Most of the code is validating inputs to ensure the user has properly filled out regex and field. Let’s look at the new concepts. First, we declare module-level variables:

let regex;
let field = '_raw';

JavaScript is single-threaded, so we can safely declare state at the module level, where it persists across invocations of the function’s process method. Next, we declare an init method, which is called with a single object. We name it opts; its conf property contains the key/value pairs configured by the user.

exports.init = (opts) => {
  const conf = opts.conf || {};
  regex = null;
  field = '_raw';

  if (conf.regex) {
    regex = new NamedGroupRegExp(conf.regex);
  }
  if (conf.field) {
    field = conf.field;
  }
};

React JSON Schema Form validates input provided through the UI, but users can also configure functions via YAML or JSON config files, so our functions must do their own validation to guard against misconfiguration. The majority of the code in init checks that the user has provided regex and field in the configuration. Now, let’s look at process:

exports.process = (event) => {
  if (regex) {
    regex.lastIndex = 0; // common trap of setting "global" flag
    return regex.test(event[field]) ? null : event;
  }
  return event;
};

Here again, we’re simply testing the value in field to see if it matches regex. If so, we return null; if not, we return the event unmodified.
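
The lastIndex reset in process guards against a standard JavaScript behavior that is easy to miss: a RegExp created with the global flag remembers where its previous match ended. The short sketch below shows the trap using a plain RegExp and a made-up log line, rather than Cribl’s NamedGroupRegExp.

```javascript
// A regex with the global ("g") flag remembers where its last match ended.
const globalRegex = /error/g;
const line = 'error: disk full';

const first = globalRegex.test(line);  // true; lastIndex is now 5
const second = globalRegex.test(line); // false; search resumed at index 5

// Resetting lastIndex before each test, as process does, gives stable results.
globalRegex.lastIndex = 0;
const third = globalRegex.test(line);  // true
globalRegex.lastIndex = 0;
const fourth = globalRegex.test(line); // true

console.log(first, second, third, fourth); // true false true true
```

Without the reset, a function reusing one module-level regex across events could let every other matching event slip through.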

Reaching Out: Enriching Data using DNS

Lastly, let’s look at an example which shows a few more capabilities: asynchronous execution, reaching out to a third-party system, and modifying an event. This really shows the power of Cribl’s extensibility: custom user code can take information from an event and use it to enrich that event with information retrieved from elsewhere. Even though Cribl did not originally ship with this function, we can meaningfully extend LogStream to implement a use case that is currently difficult in most logging systems: doing a DNS lookup at ingestion time instead of read time. This function is hosted in our content repo, under dns.

Note: Since first publishing this post, we’ve developed this example into the Reverse DNS (beta) out-of-the-box function that now ships with Cribl LogStream.

To keep it simple, this version of the function has no configuration; it simply enriches any IPv4 address it finds in the event’s _raw field. Our example function also does not support cache expiry, nor a few other features we’d likely implement for use beyond a demo. We’ve since enhanced it to make it more full-featured. But this original version shows how we enable users to extend LogStream with less full-featured implementations than Cribl would need in order to ship a generic version. Let’s look at the code:

exports.name = 'DNS Lookup';
exports.version = '0.1';
exports.group = 'Demo Functions';

const dns = require('dns');

const ipv4Regex = /(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)/gm;
const cache = {};

function reverse(IP, midx) {
  if (!cache[IP]) {
    cache[IP] = {
      promise: new Promise((resolve, reject) => { // eslint-disable-line
        dns.reverse(IP, (err, hostnames) => {
          if (!err) {
            const value = [`dns${midx !== 1 ? midx.toString() : ''}`, hostnames.join(' ')]; // if idx is not 1, name field dns2, dns3, etc
            cache[IP].value = value;
            resolve(value);
          } else {
            resolve([]);
          }
        });
      }),
    };
    return cache[IP].promise;
  } else if (!cache[IP].value) {
    return cache[IP].promise;
  }
  return Promise.resolve(cache[IP].value);
}

exports.disabled = 0;
exports.asyncTimeout = 500; // ms
exports.process = (event) => {
  const promises = [];
  let matches;
  let matchIdx = 1;
  ipv4Regex.lastIndex = 0; // ensure this is properly reset
  while (matches = ipv4Regex.exec(event._raw)) {
    const midx = matchIdx;
    const IP = matches[0];
    promises.push(reverse(IP, midx));
    matchIdx++;
  }
  if (promises.length === 0) {
    return event;
  }
  return Promise.all(promises)
    .then((entries) => {
      // Skip failed lookups, which resolve to an empty array
      entries.filter(e => Array.isArray(e) && e.length === 2).forEach(e => {
        event[e[0]] = e[1];
      });
      return event;
    })
    .catch(() => {
      return event;
    });
};

Our function starts with some module-level setup: it imports Node’s dns module, defines a RegEx for matching IPv4 addresses, and creates a cache object. Let’s look at our process implementation:

exports.process = (event) => {
  const promises = [];
  let matches;
  let matchIdx = 1;
  ipv4Regex.lastIndex = 0; // ensure this is properly reset
  while (matches = ipv4Regex.exec(event._raw)) {
    const midx = matchIdx;
    const IP = matches[0];
    promises.push(reverse(IP, midx));
    matchIdx++;
  }
  if (promises.length === 0) {
    return event;
  }
  return Promise.all(promises)
    .then((entries) => {
      // Skip failed lookups, which resolve to an empty array
      entries.filter(e => Array.isArray(e) && e.length === 2).forEach(e => {
        event[e[0]] = e[1];
      });
      return event;
    })
    .catch(() => {
      return event;
    });
};

We first match all the instances of the IPv4 regex we find in the _raw field, which is hard-coded for this function. For each match, we add a promise to an array which we then pass to Promise.all. With Promise.all, our function waits for all DNS resolutions to complete before calling our .then() implementation, which merges the DNS responses back into the event object itself before returning it. The meat of the logic is in the reverse function we’ve implemented, which wraps Node’s dns.reverse in a promise:

function reverse(IP, midx) {
  if (!cache[IP]) {
    cache[IP] = {
      promise: new Promise((resolve, reject) => { // eslint-disable-line
        dns.reverse(IP, (err, hostnames) => {
          if (!err) {
            const value = [`dns${midx !== 1 ? midx.toString() : ''}`, hostnames.join(' ')]; // if idx is not 1, name field dns2, dns3, etc
            cache[IP].value = value;
            resolve(value);
          } else {
            resolve([]);
          }
        });
      }),
    };
    return cache[IP].promise;
  } else if (!cache[IP].value) {
    return cache[IP].promise;
  }
  return Promise.resolve(cache[IP].value);
}

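This function first checks our module-level cache object, called cache. If there is no entry for the IP, it creates a new promise, which resolves when the asynchronous dns.reverse call returns: on success it stores the value in the cache and resolves with it, and on error it resolves with an empty array. If an entry exists but its lookup is still in flight, the function returns the pending promise; otherwise it returns a promise that resolves immediately with the cached value.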
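
The caching pattern can be tried in isolation. The sketch below swaps dns.reverse for a hypothetical stub, fakeReverse, so that it runs without a network, and counts how many lookups actually happen. The names cachedLookup and lookupCount and the example.com hostnames are likewise made up for illustration, and this sketch caches only the hostnames rather than a field name.

```javascript
// Hypothetical stand-in for dns.reverse so the sketch runs without a network.
let lookupCount = 0;
function fakeReverse(ip, callback) {
  lookupCount++;
  setTimeout(() => callback(null, [`host-${ip.replace(/\./g, '-')}.example.com`]), 10);
}

const cache = {};

// Return a promise of the hostnames for ip. Concurrent callers share the
// in-flight promise; later callers get the stored value. Errors resolve to an
// empty array so that one failed lookup never rejects a Promise.all.
function cachedLookup(ip) {
  if (!cache[ip]) {
    cache[ip] = {
      promise: new Promise((resolve) => {
        fakeReverse(ip, (err, hostnames) => {
          if (err) {
            resolve([]);
            return;
          }
          cache[ip].value = hostnames;
          resolve(hostnames);
        });
      }),
    };
  }
  return cache[ip].value ? Promise.resolve(cache[ip].value) : cache[ip].promise;
}

// Three requests for two distinct addresses trigger only two lookups.
const done = Promise.all([
  cachedLookup('10.0.0.1'),
  cachedLookup('10.0.0.1'),
  cachedLookup('10.0.0.2'),
]).then((results) => {
  console.log(results.map((hostnames) => hostnames.join(' ')));
  console.log(`lookups performed: ${lookupCount}`); // lookups performed: 2
  return results;
});
```

Sharing the in-flight promise matters for log data, where the same address often appears in many events within a few milliseconds.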

As you can see, this is fairly straightforward. In fewer than 60 lines, we’ve implemented a meaningful extension to Cribl’s functionality.

Conclusion

There are hundreds of different use cases which can be easily implemented as Cribl functions. We don’t want to require everyone to invent their own implementations, so we’re launching a shared repo of functions that users have built to solve various use cases. In a version coming soon, you’ll be able to point Cribl at a URL for a repo on GitHub or BitBucket and import a function with a single click. For now, it’s simple to clone these repos and copy the function directories into $CRIBL_HOME/local/cribl/functions, and the functions will show up in your UI upon restart.

Note: Prior to LogStream 1.7, this subdirectory was: $CRIBL_HOME/bin/functions.

What would you like Cribl to do, that it doesn’t do today? We’d love to collaborate on publishing a new extension to our content repo. We want everyone to be able to conceive of, and easily ship, their own ideas and share them with the community. We’d love to see your contributions, or file an issue and we’ll build you an implementation!
