Anonymizing IP addresses with Vector VRL

Vector is the new kid on the block for collecting and aggregating logs. It’s similar to Fluentd, but one of the main benefits of Vector is that it is written in Rust, which makes it highly performant and very efficient.

Vector comes with its own domain-specific language (DSL) for transforming log records: the Vector Remap Language (VRL).

For one of the services we recently deployed, we want to keep the logs for a very long period of time (on the order of years), but don’t want to store personal information such as IP addresses forever. We could remove the IP addresses altogether, but then we would lose the information about whether a request came from an internal network (LAN) or from the internet. Hence, we need to write a VRL expression that performs the necessary IP address anonymization.

When looking through the list of functions, the ip_subnet function immediately seemed interesting:

ip_subnet(ip: <string>, subnet: <string>): Extracts the subnet address from the ip using the supplied subnet. Works with both IPv4 and IPv6 addresses. The IP version for the mask must be the same as the supplied address.

Exactly what we’re looking for! However, since the function works for both IPv4 and IPv6 addresses and we need a different prefix length for each (for example: removing the last two bytes from an IPv6 address won’t do much anonymization), we’ll need to distinguish between them. Luckily, VRL comes with handy is_ipv4 and is_ipv6 functions and supports if conditions.
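
To make the quoted description a bit more concrete, here is a rough Python sketch of what “extracting the subnet address” amounts to: every bit after the prefix length is set to zero. This is only an illustration, not how Vector implements ip_subnet; the helper name subnet_address is made up for this example and relies on Python’s standard ipaddress module.

```python
import ipaddress


def subnet_address(ip: str, prefix_len: int) -> str:
    """Zero all bits after the first `prefix_len` bits of `ip` (hypothetical helper)."""
    addr = ipaddress.ip_address(ip)
    total_bits = addr.max_prefixlen  # 32 for IPv4, 128 for IPv6
    # Build a mask with `prefix_len` leading ones followed by zeros.
    mask = ((1 << prefix_len) - 1) << (total_bits - prefix_len)
    masked = int(addr) & mask
    # Re-create an address of the same family from the masked integer.
    return str(type(addr)(masked))


print(subnet_address("192.168.10.32", 24))
print(subnet_address("2001:db8:85a3:1234::1", 64))
```

The same helper handles both address families because the mask is derived from the address width, which is also why the prefix length has to be chosen per family.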

But how much should we strip off the IP addresses?

For IPv4 addresses, the consensus seems to be to remove the last two bytes, i.e. to keep only the first two bytes (a /16 subnet): 123.123.x.x. This is, for example, the recommended privacy setting in the open source web analytics platform Matomo.

For IPv6 addresses the situation is a bit trickier, since IPv6 global addressing is very hierarchical. Stripping only the last 64 bits (the interface identifier, in EUI-64 format) still makes it easy to identify an individual user, since the remaining /64 prefix designates a single LAN (e.g. a household). It appears reasonable to keep only the first 48 bits: according to RFC 2374 (since obsoleted by RFC 3587, but still a useful reference for the layout), this denotes the “public topology”. Practically, it means we can still identify the country and the ISP or organization, but not more than that (see also this StackOverflow post for more context).

+--+-----+---+--------+--------+--------------------------------+
| 3|  13 | 8 |   24   |   16   |          64 bits               |
+--+-----+---+--------+--------+--------------------------------+
|FP| TLA |RES|  NLA   |  SLA   |         Interface ID           |
|  | ID  |   |  ID    |  ID    |                                |
+--+-----+---+--------+--------+--------------------------------+

<--Public Topology--->   Site
                      <-------->
                       Topology
                                <------Interface Identifier----->
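
As a quick sanity check of where the /48 comes from, the field widths in the diagram can simply be added up from the left. The small Python sketch below does that; the FIELDS list is transcribed from the diagram, and the prefix_length helper is a made-up name for this illustration.

```python
# Field widths (in bits) from the diagram above, left to right.
FIELDS = [
    ("FP", 3),
    ("TLA ID", 13),
    ("RES", 8),
    ("NLA ID", 24),
    ("SLA ID", 16),
    ("Interface ID", 64),
]


def prefix_length(up_to: str) -> int:
    """Sum the field widths from the left, up to and including `up_to`."""
    total = 0
    for name, bits in FIELDS:
        total += bits
        if name == up_to:
            return total
    raise KeyError(up_to)


public_topology = prefix_length("NLA ID")     # FP + TLA ID + RES + NLA ID
site_boundary = prefix_length("SLA ID")       # public topology + site topology
full_address = prefix_length("Interface ID")  # the whole address

print(public_topology, site_boundary, full_address)
```

The public topology ends after 48 bits and the site topology after 64, which is why a /48 prefix drops both the site-level subnet and the interface identifier.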

Let’s get back to VRL and implement these rules:

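# Abort unless remote_addr is a string (is_ipv4/is_ipv6 expect one)
.remote_addr = string!(.remote_addr)
# IPv6: keep only the first 48 bits
if is_ipv6(.remote_addr) {
  .remote_addr = ip_subnet!(.remote_addr, "/48")
}
# IPv4: keep only the first 16 bits
if is_ipv4(.remote_addr) {
  .remote_addr = ip_subnet!(.remote_addr, "/16")
}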

We can give the expression a try on the VRL playground:

# input event (ipv4)
{ "remote_addr": "10.1.2.3", "url": "https://example.com" }

# output
{ "remote_addr": "10.1.0.0", "url": "https://example.com" }

# input event (ipv6)
{ "remote_addr": "fdaa:2395:059c:f2cf:1:2:3:4", "url": "https://example.com" }

# output
{ "remote_addr": "fdaa:2395:59c::", "url": "https://example.com" }

Note that, thanks to the is_ipv4 and is_ipv6 checks, the remote_addr field is left as-is if it contains a string that is not a valid IP address.

# input event (not a valid IP address)
{ "remote_addr": "not.a:valid:ip", "url": "https://example.com" }

# output
{ "remote_addr": "not.a:valid:ip", "url": "https://example.com" }
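
If you want to double-check the three cases outside the playground (for example in a test suite for a log-processing script), the same rules can be mirrored in a few lines of Python. This is a sketch under assumptions: anonymize_ip is a hypothetical helper, it uses Python’s standard ipaddress module rather than anything from Vector, and the two parsers may disagree on unusual inputs.

```python
import ipaddress


def anonymize_ip(value: str) -> str:
    """Mirror of the VRL rules: /16 for IPv4, /48 for IPv6, anything else unchanged."""
    try:
        addr = ipaddress.ip_address(value)
    except ValueError:
        # Not a valid IP address: leave the value as-is.
        return value
    prefix = 16 if addr.version == 4 else 48
    # strict=False allows host bits to be set and masks them off.
    network = ipaddress.ip_network(f"{addr}/{prefix}", strict=False)
    return str(network.network_address)


for sample in ["10.1.2.3", "fdaa:2395:059c:f2cf:1:2:3:4", "not.a:valid:ip"]:
    print(sample, "->", anonymize_ip(sample))
```

For the three sample events above, this returns the same remote_addr values as the playground outputs.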

The VRL expression can be used in the Vector configuration (TOML) as follows:

[transforms.anonymize_ips]
type = "remap"
inputs = ["my_input"]
source = """
  .remote_addr = string!(.remote_addr)
  if is_ipv6(.remote_addr) {
    .remote_addr = ip_subnet!(.remote_addr, "/48")
  }
  if is_ipv4(.remote_addr) {
    .remote_addr = ip_subnet!(.remote_addr, "/16")
  }
"""

Happy logging!