Taking the Pulse of Optify’s SaaS Platform with Graylog2

September 20, 2012

Monitoring a SaaS (software as a service) platform is a complex undertaking. At Optify, we’re continually improving our monitoring and systems management capabilities to ensure that we’re providing a reliable system for our customers, while minimizing the need for hands-on monitoring by our engineering and operations teams. Our first line of defense is Zabbix, which we use for real-time monitoring of our many distinct services and third-party integrations. While Zabbix is great for monitoring and alerting on issues such as message queue backlogs, unavailable services, and other infrastructure problems, there is an additional set of issues that generally show up only in application or server logs. In the past, we’ve exposed these issues in Zabbix by writing scripts that scan our application logs for certain patterns and generate alerts or update metrics based on the matches. The key problem with this approach is obvious: you have to know what you’re looking for ahead of time, and you’re still stuck grepping logs if you want to explore what other issues may be happening.
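
To make the limitation concrete, here is a minimal sketch of the kind of pattern scan described above. It is an illustration only, not one of our actual scripts: the class name, pattern names, and sample log lines are all hypothetical. Note that anything not listed in the pattern map is simply never counted.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Pattern;

// Hypothetical illustration of a fixed-pattern log scan.
public class LogPatternScan {

    /** Counts, for each named pattern, how many log lines contain a match. */
    public static Map<String, Integer> countMatches(List<String> lines,
                                                    Map<String, Pattern> patterns) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String name : patterns.keySet()) {
            counts.put(name, 0); // report zero for patterns that never match
        }
        for (String line : lines) {
            for (Map.Entry<String, Pattern> entry : patterns.entrySet()) {
                if (entry.getValue().matcher(line).find()) {
                    counts.merge(entry.getKey(), 1, Integer::sum);
                }
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        // Patterns must be chosen ahead of time (made-up examples).
        Map<String, Pattern> patterns = new LinkedHashMap<>();
        patterns.put("oom", Pattern.compile("OutOfMemoryError"));
        patterns.put("timeout", Pattern.compile("timed out"));

        List<String> lines = Arrays.asList(
                "2012-09-20 10:00:01 ERROR java.lang.OutOfMemoryError: Java heap space",
                "2012-09-20 10:00:02 WARN request to queue timed out",
                "2012-09-20 10:00:03 ERROR unexpected null in campaign import");

        // The third line is an error too, but no pattern anticipates it.
        System.out.println(countMatches(lines, patterns));
    }
}
```

The per-pattern counts are what would be fed to a metric or alert; the unmatched third line is exactly the kind of issue that stays invisible until someone greps for it.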

To solve this problem, we sought out a solution that would allow us to aggregate logs from across our production servers, explore them in a simple way (i.e., search), and easily trigger alerts based on the rate of messages matching a particular set of rules. After a bit of research and prototyping, we identified Graylog2 as a good fit for our needs.

Graylog2 web interface

Graylog2 is a free open source log management and exception tracking tool. The core of Graylog2 is a Java-based server component that receives messages over TCP or UDP and stores/indexes them in ElasticSearch and MongoDB. In addition, the Graylog2 web interface is a Ruby on Rails web application that allows you to view and search your log messages, configure pre-filtered streams for messages matching defined criteria, and generate alerts based on the count of messages matching a particular stream in a given time period.
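
The alerting idea (count the messages matching a stream within a time period and compare against a limit) can be sketched as a sliding-window counter. This is a simplified illustration under our own assumptions, not Graylog2's implementation; the class and method names are hypothetical.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical sliding-window counter; not Graylog2's actual alert logic.
public class StreamRateAlert {
    private final long windowMillis;
    private final int threshold;
    private final Deque<Long> timestamps = new ArrayDeque<>();

    public StreamRateAlert(long windowMillis, int threshold) {
        this.windowMillis = windowMillis;
        this.threshold = threshold;
    }

    /** Records one message that matched the stream at the given time. */
    public void record(long nowMillis) {
        timestamps.addLast(nowMillis);
    }

    /** True while more than `threshold` matching messages fall inside the window. */
    public boolean isAlerting(long nowMillis) {
        // Drop messages that have aged out of the window.
        while (!timestamps.isEmpty() && timestamps.peekFirst() <= nowMillis - windowMillis) {
            timestamps.removeFirst();
        }
        return timestamps.size() > threshold;
    }

    public static void main(String[] args) {
        // Alert when more than 3 matching messages arrive within 60 seconds.
        StreamRateAlert alert = new StreamRateAlert(60_000L, 3);
        for (long t = 0; t < 4_000L; t += 1_000L) {
            alert.record(t);
        }
        System.out.println("alerting at t=3s:  " + alert.isAlerting(3_000L));
        System.out.println("alerting at t=70s: " + alert.isAlerting(70_000L));
    }
}
```

Because old timestamps are evicted on every check, the alert condition clears by itself once the message rate falls back below the threshold.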

Installing and configuring Graylog2 is fairly straightforward. The installation documentation is adequate for both the server and the web interface; however, both sets of instructions assume you already have preconfigured ElasticSearch and MongoDB instances in your environment.

First, some statistics on our implementation:

  • All components (ElasticSearch, MongoDB, Graylog2 server, Graylog2 web interface) running on a single server with a quad-core Xeon processor and 12GB of RAM
  • 49.6 million syslog and application messages stored since July 31, 2012
  • 54 hosts (servers) monitored
  • Note: If Graylog2 is a mission-critical part of your infrastructure, you’ll want multiple servers with ElasticSearch and MongoDB running in a multi-node configuration

Graylog2 streams view

To send our application messages to Graylog2, we had to introduce one additional component to our development stack. Nearly all of our software runs on the JVM and uses Log4j for logging. Adding support for Graylog2 was as simple as adding Gelf4j, a Log4j-compatible appender that sends messages to the Graylog2 server in Graylog Extended Log Format (GELF). Once you’ve added this dependency via Ivy or Maven, configuring the appender is as simple as:


<!-- define the appender -->
<appender name="gelf" class="gelf4j.log4j.GelfAppender">
    <param name="Threshold" value="INFO"/>
    <param name="host" value="graylog.example.com"/>
    <param name="port" value="1942"/>
    <param name="compressedChunking" value="false"/>
    <param name="defaultFields" value='{"environment": "DEV", "application": "MyAPP"}'/>
    <param name="additionalFields" value='{"thread_name": "threadName", "exception": "exception"}'/>
</appender>

<!-- add the appender to a logger; ref must match the appender name above -->
<root>
    <priority value="INFO"/>
    <appender-ref ref="gelf"/>
</root>


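// To do this programmatically...
import gelf4j.log4j.GelfAppender;
import org.apache.log4j.Level;
import org.apache.log4j.Logger;

GelfAppender appender = new GelfAppender();
// Level.WARN replaces the deprecated Priority.toPriority("WARN") lookup
appender.setThreshold(Level.WARN);
appender.setHost("127.0.0.1");
appender.setPort(12201);
appender.setCompressedChunking(true);
// Fields attached to every message, given as a JSON object
appender.setDefaultFields(String.format(
        "{\"environment\": \"%s\", \"application\": \"%s\", \"host\": \"%s\"}",
        "DEV",
        "MYAPP",
        "MYHOST"));
// Extra per-event fields, same format as the XML additionalFields param
appender.setAdditionalFields("{\"thread_name\": \"threadName\", \"exception\": \"exception\", \"logger_name\": \"loggerName\"}");
// Apply the settings before the appender is used
appender.activateOptions();

Logger.getRootLogger().addAppender(appender);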
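
For a sense of what an appender like this puts on the wire, here is a rough sketch that builds a small GELF message by hand, gzips it, and sends it as a single UDP datagram. It is not Gelf4j's code. The class name, the "_environment" field, the "1.0" version string, and the target address are assumptions for illustration, and the sketch skips both JSON escaping and the chunking that GELF uses for messages too large for one datagram.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.DatagramPacket;
import java.net.DatagramSocket;
import java.net.InetAddress;
import java.nio.charset.StandardCharsets;
import java.util.Locale;
import java.util.zip.GZIPOutputStream;

// Hypothetical hand-rolled GELF sender; a real appender handles escaping and chunking.
public class GelfUdpSketch {

    /** Builds a minimal GELF JSON document. Inputs must not need JSON escaping. */
    public static String buildMessage(String host, String shortMessage,
                                      long epochSeconds, int syslogLevel,
                                      String environment) {
        // Additional (user-defined) fields are prefixed with an underscore.
        return String.format(Locale.ROOT,
                "{\"version\":\"1.0\",\"host\":\"%s\",\"short_message\":\"%s\","
                        + "\"timestamp\":%d,\"level\":%d,\"_environment\":\"%s\"}",
                host, shortMessage, epochSeconds, syslogLevel, environment);
    }

    /** Gzip-compresses the UTF-8 bytes of the JSON document. */
    public static byte[] gzip(String json) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(buffer)) {
            gz.write(json.getBytes(StandardCharsets.UTF_8));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }

    /** Sends the payload as one UDP datagram (no chunking). */
    public static void send(String graylogHost, int port, byte[] payload) throws IOException {
        try (DatagramSocket socket = new DatagramSocket()) {
            InetAddress address = InetAddress.getByName(graylogHost);
            socket.send(new DatagramPacket(payload, payload.length, address, port));
        }
    }

    public static void main(String[] args) throws IOException {
        String json = buildMessage("MYHOST", "Queue backlog detected",
                System.currentTimeMillis() / 1000L, 4, "DEV"); // 4 = syslog "warning"
        // Address and port are placeholders; use your own server's GELF input.
        send("127.0.0.1", 12201, gzip(json));
    }
}
```

In practice the appender does all of this for you; the sketch is only meant to show that a GELF message is a small JSON document with a few standard fields plus any underscore-prefixed extras.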

What we like about Graylog:

  • Easy to search for and correlate log messages across multiple hosts and JVMs
  • Search performs well even with tens of millions of messages
  • Streams feature is useful for "taking the pulse" of a set of hosts or messages

What could be improved:

  • Stream alert email has never worked correctly. Alerts trigger as expected, but the system continues to send mail after the message rate has dropped back below the defined threshold. We’ve applied a community patch to our fork of the web interface, which we’ll be testing soon.
  • The documentation is a little lacking regarding how search phrases should be constructed. Once you understand that the log message is analyzed and tokenized using the ElasticSearch whitespace analyzer, it’s much easier to construct the correct query.
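
To illustrate the search point above: a whitespace analyzer splits text on whitespace only, leaving case and punctuation untouched. The sketch below imitates that behavior in plain Java and assumes a single search term has to equal a whole token exactly. It is a simplification for illustration, not ElasticSearch code, and the class and method names are hypothetical.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical imitation of whitespace-only tokenization.
public class WhitespaceTokenSketch {

    /** Splits on runs of whitespace; no lowercasing, no punctuation stripping. */
    public static List<String> tokenize(String message) {
        return Arrays.asList(message.trim().split("\\s+"));
    }

    /** A single search term matches only if it equals a whole token exactly. */
    public static boolean matchesTerm(String message, String term) {
        return tokenize(message).contains(term);
    }

    public static void main(String[] args) {
        String message = "ERROR: Connection refused (host=db01)";
        System.out.println(tokenize(message));
        for (String term : new String[] {"ERROR:", "ERROR", "error:", "db01", "(host=db01)"}) {
            System.out.println(term + " -> " + matchesTerm(message, term));
        }
    }
}
```

Under these assumptions, a term has to include any attached punctuation and match the original case, which is why a search that looks reasonable at first glance can come back empty.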

Overall, Graylog2 has provided us with a much-needed tool for analyzing and monitoring the health of our inbound marketing software platform. It’s not perfect, but development is active, and we expect to get even more value from Graylog2 in the future.