Mo' Logs, Mo' Problems

As a sysadmin and/or backend engineer, you spend a lot of time digging through logs, and responding to problems based on what you see there. Doing so is often difficult and tedious, for a myriad of reasons:

  • Logs come from many different sources on a host.
  • Each source generates logs in a different format and puts them in a different place.
  • Each source treats the urgency behind its logs differently.

And those only apply within a single host, so multiply that headache by the number of hosts in your cluster!

Goal

Ultimately, our goal was to have a normalized dataset consisting of every log entry our system has ever produced. With that in place we can write an alert service which watches the dataset and alerts us humans when stuff starts hitting fans.

Solution

We've dealt with these problems while attempting to reach this goal in past projects, learned from our mistakes, and come up with the following system for collecting and processing logs:

  • Specify a log format for all of our projects to use, and use it wherever possible.
  • Collect the logs of all applications into the local journald.
  • Forward all journald instances to a central logstash instance, where they can be viewed and explored using kibana.
  • Use our own thumper to watch the logs and generate alerts based on them.

Format

The first step toward a normalized dataset was to formalize our log format: make sure every service we write uses the same format for its logs, and decide exactly what we want each log severity level (INFO, WARN, etc.) to mean.

Here's an example of the format we came up with:

~ ERROR -- an error happened -- userID="1111" err="some error" sky="blue"

We currently have implementations for go and javascript available.

The actual specifics of the format will likely be a good source of bike-shedding within your organization, but there are a couple things we found to actually be important:

  • Starting with a weird character like ~ makes it easier to identify our log format vs some other random service's log format in later steps.
  • Including a datetime string is not really necessary, and it's omitted by default in ours; every language likes a different datetime format, and we end up using journald's timestamp for the entry anyway.
  • The entry description is static; all contextual, dynamic information about the entry is included in the key/value pairs. This makes it easy to see at a glance what the entry is for, and also makes it easy to search in kibana for all entries related to a given context (e.g. all logs that have been caused by a particular userID or IP address).
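
For illustration, here's a minimal go sketch of a logger which emits entries in this format to stdout, where journald will pick them up. This is not the API of the actual go or javascript implementations, just a sketch of the idea:

package main

import (
    "fmt"
    "os"
    "sort"
)

// kv holds the contextual key/value pairs attached to a log entry.
type kv map[string]string

// logEntry writes a single entry in our format to stdout. Under systemd the
// output lands in journald automatically.
func logEntry(level, description string, ctx kv) {
    line := fmt.Sprintf("~ %s -- %s --", level, description)

    // Sort the keys so the same entry always renders the same way.
    keys := make([]string, 0, len(ctx))
    for k := range ctx {
        keys = append(keys, k)
    }
    sort.Strings(keys)

    for _, k := range keys {
        line += fmt.Sprintf(" %s=%q", k, ctx[k])
    }
    fmt.Fprintln(os.Stdout, line)
}

func main() {
    // Prints: ~ ERROR -- an error happened -- err="some error" sky="blue" userID="1111"
    logEntry("ERROR", "an error happened", kv{
        "userID": "1111",
        "err":    "some error",
        "sky":    "blue",
    })
}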

Obviously we can't control the log format of all the services we use, or of things like kernel messages. For those we have less fidelity, and later on we rely on parsing them with service-specific regexes.

Collection (local)

Most Linux distributions are using systemd these days, and with systemd we get journald. Journald's job is to collect all logs on the system into a single place and to make it easier to search through and manage them. At first many people (including us) disliked journald, but now it's hard to imagine going back.

We've mandated that all services on our hosts must have their logs collected by journald. In most cases this is as simple as making sure they are run in non-daemon mode by systemd and are outputting their logs to stdout/stderr, where journald will automatically collect them.

Some services make this a bit difficult (looking at you, nginx and mongo!), but it's worth the effort. Having all logs in a single place locally means there's only one thing which needs to be forwarded to logstash, rather than having to set up a separate process to read each service's specific log files. It also means we rarely end up with a service whose logs aren't being collected and processed.

Collection (global)

Once logs are herded into a single place on each host, we can easily forward them all to logstash. On every host we have a simple service which looks like this:

[Unit]
Description=journald-to-logstash
After=network.target

[Service]
Restart=always
TimeoutStartSec=5
StartLimitInterval=0
RestartSec=1s
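# journalctl flags: -o json (one json object per entry), -f (follow new entries),
# -n0 (don't replay old entries on startup); the shell is needed for the pipe.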
User=root
ExecStart=/bin/sh -c '/usr/bin/journalctl -o json -f -n0 | nc logstash.host 11111'

[Install]
WantedBy=multi-user.target

And in our logstash we have an input which looks like this:

input {
  tcp {
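    # every host's journald-to-logstash unit connects here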
    port  => 11111
    codec => json_lines
    tags => [ "source_systemd" ]
  }
}
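
To give an idea of what logstash receives, here's roughly what a single journald entry contains, trimmed down to the fields we actually use (the values are invented for illustration, it's pretty-printed here for readability, and a real entry carries many more fields):

{
  "__REALTIME_TIMESTAMP": "1450818445104184",
  "_HOSTNAME": "host1",
  "_SYSTEMD_UNIT": "some-service.service",
  "_PID": "1234",
  "_UID": "0",
  "_GID": "0",
  "_EXE": "/usr/bin/some-service",
  "_CMDLINE": "/usr/bin/some-service --listen-addr :8000",
  "MESSAGE": "~ ERROR -- an error happened -- userID=\"1111\" err=\"some error\" sky=\"blue\""
}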

Journalctl outputs each entry as a single json-encoded line, and logstash takes them in accordingly. Each json object contains information from journald about the entry, such as its timestamp, the service it came from, etc... Here's the filter we use to process the journald fields we care about:

if "source_systemd" in [tags] {
  mutate { rename => [ "MESSAGE", "message" ] }
  mutate { rename => [ "_SYSTEMD_UNIT", "service" ] }
  mutate { rename => [ "_HOSTNAME", "hostname" ] }
  mutate { rename => [ "_EXE", "binary" ] }
  mutate { rename => [ "_CMDLINE", "command" ] }
  mutate { rename => [ "_PID", "pid" ] }
  mutate { rename => [ "_UID", "uid" ] }
  mutate { rename => [ "_GID", "gid" ] }
  ruby { code => "event['__REALTIME_TIMESTAMP'] = event['__REALTIME_TIMESTAMP'].to_i / 1000" }
  date { match => [ "__REALTIME_TIMESTAMP", "UNIX_MS" ] }
}

Network issues with forwarding

In our current system we have our servers connected together via openvpn, and all communication is done over that connection. We were seeing a problem where the netcat process in the journald-to-logstash service wouldn't notice a disconnection if the vpn went down. Sometimes it would sit there for upwards of 20 minutes before reconnecting.

The cause of this, we found, is that linux's tcp retransmission algorithm will retry sending an unacknowledged packet up to 15 times (the default) before it gives up, waiting longer and longer between attempts. This is not something which can easily be changed at the application level, and not something netcat gives us any control over. The following changes the retry count to 6, which amounts to giving up after around 30 seconds:

echo 6 > /proc/sys/net/ipv4/tcp_retries2

Kibana

With everything under the sun being processed by logstash, setting up kibana to actually view the logs is probably the easiest part of this whole process. Once it was set up we were able to easily create some saved queries and dashboards.

One thing to note about kibana is that by default it will query every index matching logstash-*. We had logstash set up to create a new index every day, so after a few months our queries started taking too long to complete.

We found the solution in this issue:

When I switched the pattern to [logstash-]YYYY.MM.DD and ticked Use event times to create index names it started working much better.

This effectively makes kibana dynamically choose which indexes to hit based on the query's time range.

Alerts

With all logs going to logstash and being processed that way, we now have the task of responding to those logs. Obviously we can't have someone manually watching all logs in kibana at all times, but we can have software do it. Elastic has watcher, which watches elasticsearch data (where logstash stores its logs) and can generate alerts based on what it sees, but it costs money. So we wrote our own!

Thumper is basically watcher, but written in go and using lua scripting. You specify a search query you want to run, and a lua script to run on the results. The lua script can generate an arbitrary set of actions which should be taken as a result, including generating a PagerDuty alert, hitting an http endpoint, etc... More action types can also be added easily.

Here's an example thumper alert which checks for errors every 5 minutes, and sends a pagerduty alert if it sees any:

- name: errors
  interval: "*/5 * * * *"

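  # {{.Format "2006.01.02"}} is go's reference-time formatting, so this matches
  # logstash's daily index names (e.g. logstash-2015.12.22)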
  search_index: logstash-{{.Format "2006.01.02"}}
  search_type: logs
  search: {
    "size":0,
    "query":{
      "filtered":{
        "query":{
          "bool":{
            "should":[
              { "query_string": { "query":"severity:error" } },
              { "range":{ "@timestamp":{"gt":"now-5m", "lte":"now"} } }
            ]
          }
        }
      }
    }
  }

  process:
    lua_inline: |
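      -- ctx.HitCount is the total number of entries matched by the search above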
      if ctx.HitCount > 0 then
        return {
          {
            type = "pagerduty",
            description = "Some errors happened!",
          },
        }
      end

Elasticsearch has a very powerful query language which lets you do lots of crazy things, and since arbitrary lua can be run on the results we're able to check for just about anything. We currently have checks for things like "is the number of warnings for any service significantly greater than it was this time yesterday?" and "have any hosts stopped logging anything altogether?".

Hopefully this overview of our system helps you come up with your own! Please let us know if you have any further ideas or see any ways we could improve our setup.

Brian Picciano