Introduction to Elasticsearch, Logstash and Kibana (ELK) Stack

Posted by Swapnil Desai on 14 November 2014

This is an introduction to the Elasticsearch, Logstash and Kibana (ELK) stack and how we have used it to capture, store and visualise application logs.

What is the ELK Stack?

The ‘ELK’ stack contains the following three components:

  1. Elasticsearch: A powerful open source search and analytics engine that makes data easy to explore. It is a search server based on Apache Lucene.
  2. Logstash: A log management tool used for centralised logging, log enrichment and parsing.
  3. Kibana: A browser-based HTML5 dashboard used to visualize Elasticsearch data.

The open-source ELK stack provides the ability to perform operational and data analytics including deep search functionality on almost any type of structured or unstructured data source.

How they work together

Figure: ELK topology

The diagram above shows a typical flow of data through the ELK stack. Think of Logstash as an event processing pipeline: it collects logs and other data from sources across your IT infrastructure, enriches them, and stores each event as a JSON document in Elasticsearch. Elasticsearch is a distributed document store for JSON documents, with rich functionality for querying and searching the data within them. Kibana is a web application that integrates easily with Elasticsearch to generate real-time visualisations quickly, which can help when making business decisions.

Logstash - Collect Data

Logstash is a free and open source tool for managing events and logs. It provides an integrated framework for log collection, centralisation, parsing, storage and search. It can ship logs from many types of sources, parse them, get the right timestamp, index them, and store them. It has a large collection of filters that let you modify, manipulate and transform log events and extract the information you need to give them context.

Logstash is typically used in two roles:

  1. Shipper/Agent: Sends events to the Logstash server. Remote Logstash agents will generally run only this component.
  2. Indexer/Server: Receives and indexes the events on the Logstash server.

Logstash servers run one or more of these components independently, which allows us to separate components and scale Logstash.

Configuring Logstash

There are three main sections in a Logstash configuration file: input, filter and output.

Logstash supports different types of input for reading from different log sources. Here, I will explain how to read multi-line log4j logs with the Logstash “file” input, joining the lines that belong together into a single event.

A typical log4j log entry might look like this:

INFO 2014-10-03 10:47:02,415 [ActiveMQ Session Task-533]
GetDocument.JMSHandler.Request: <?xml version="1.0" encoding="utf-8"?>

We can capture these lines in a single Logstash event using a file input configured like this:

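input {
  file {
    # Placeholder path: the file input expects an absolute path
    path => "/path/to/logs/*.log"
    codec => multiline {
      # Anchored with ^ so only lines that start with a level and timestamp begin a new event
      pattern => "^%{LOGLEVEL}\s+%{TIMESTAMP_ISO8601}"
      negate => true
      what => "previous"
    }
  }
}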

The “multiline” codec configuration means: “if a line does not start with {LOGLEVEL} {TIMESTAMP}, then it belongs to the previous line”.
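
The joining rule described above can be sketched in a few lines of Python. This is only an illustration of the idea, not how Logstash implements it: the regular expression is a rough stand-in for the LOGLEVEL and TIMESTAMP_ISO8601 grok patterns, and the second log entry is made up for the example.

```python
import re

# Rough stand-in for "^%{LOGLEVEL}\s+%{TIMESTAMP_ISO8601}" (an assumption:
# the real grok patterns accept more level names and timestamp variants).
EVENT_START = re.compile(
    r"^(?:TRACE|DEBUG|INFO|WARN|ERROR|FATAL)\s+"
    r"\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}"
)


def join_multiline(lines):
    """Group raw lines into events.

    A line that does not look like the start of a new event is appended
    to the previous event (the negate + what => previous behaviour).
    """
    events = []
    for line in lines:
        if EVENT_START.match(line) or not events:
            # A new event starts here (or there is no previous event yet).
            events.append(line)
        else:
            events[-1] += "\n" + line
    return events


raw_lines = [
    "INFO 2014-10-03 10:47:02,415 [ActiveMQ Session Task-533]",
    'GetDocument.JMSHandler.Request: <?xml version="1.0" encoding="utf-8"?>',
    "ERROR 2014-10-03 10:47:03,001 [main]",  # hypothetical second entry
]
events = join_multiline(raw_lines)
print(len(events))
```

The first two lines end up in one event because the second does not begin with a level and timestamp; the third starts a new event.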

After the multi-line events have been captured, Logstash can modify and transform them using filters. One of the most widely used filters is “grok”, which parses plain text into something structured and queryable. The following filter configuration splits the log4j log event into separate user-defined fields:

# within the 'filter' block
grok {
  match => { "message" => "%{LOGLEVEL:loglevel}\s+%{TIMESTAMP_ISO8601:logdate}\s+%{DATA:thread}\s+%{JAVACLASS:category}:\s+%{GREEDYDATA:msgbody}" }
}

Grok uses regular expressions and comes with several handy pre-defined patterns to match common log fragments like ‘LOGLEVEL’, timestamps and Java class names.
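
To make the field extraction concrete, here is a minimal Python sketch that mimics the grok expression with named groups. The sub-patterns are simplified approximations written for this example, not the definitions that ship with Logstash, and the event is the sample log4j entry shown earlier after multi-line joining.

```python
import re

# Each named group approximates one grok pattern (assumed simplifications).
LOG4J_EVENT = re.compile(
    r"(?P<loglevel>[A-Z]+)\s+"                                        # ~ LOGLEVEL
    r"(?P<logdate>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3})\s+"      # ~ TIMESTAMP_ISO8601
    r"(?P<thread>.*?)\s+"                                             # ~ DATA (non-greedy)
    r"(?P<category>(?:[A-Za-z_$][\w$]*\.)*[A-Za-z_$][\w$]*):\s+"      # ~ JAVACLASS
    r"(?P<msgbody>.*)",                                               # ~ GREEDYDATA
    re.DOTALL,  # let "." cross the newline inside the joined event
)

event = (
    "INFO 2014-10-03 10:47:02,415 [ActiveMQ Session Task-533]\n"
    'GetDocument.JMSHandler.Request: <?xml version="1.0" encoding="utf-8"?>'
)

match = LOG4J_EVENT.match(event)
fields = match.groupdict() if match else {}
for name, value in fields.items():
    print(f"{name}: {value}")
```

Each named group plays the role of a `%{PATTERN:field}` pair, so the result is a dictionary keyed by the same field names used in the grok configuration.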

A “mutate” filter performs general mutations on fields: you can rename, remove, replace and modify fields in your events. The following filter configuration uses the “strip” option to remove leading and trailing whitespace (including newlines) from the message field. Apply it before doing any grep-style matching on log events.

# within the 'filter' block
mutate {
  strip => ["message"]
}

Also consider the “gsub” option to remove any other unwanted characters from within the message.

A “date” filter can be used to parse date and time from the grok-ed fields. This value will be used as a timestamp when the events are stored in Elasticsearch to generate accurate visualizations over time.

# within the 'filter' block
date {
  match => [ "logdate", "yyyy-MM-dd HH:mm:ss,SSS" ]
}
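
For readers more familiar with Python, the same parsing step can be sketched with `datetime.strptime`. The format string below is my translation of the pattern used in the date filter, and the sketch ignores time zones, which a real pipeline would need to consider.

```python
from datetime import datetime

# Value of the "logdate" field taken from the sample log4j entry.
logdate = "2014-10-03 10:47:02,415"

# "yyyy-MM-dd HH:mm:ss,SSS" expressed as a strptime format;
# %f reads the digits after the comma as a fraction of a second.
timestamp = datetime.strptime(logdate, "%Y-%m-%d %H:%M:%S,%f")

print(timestamp.isoformat())
```

The parsed value, rather than the time at which the line happened to be read, is what lets charts place each event at the moment it was actually logged.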

Logstash can drop events that do not add any business value using a “drop” filter. (This example also shows how to use conditional statements in a Logstash configuration.)

# within the 'filter' block
if !([category] =~ "GetDocument") {
  drop { }
}

Another powerful Logstash feature is querying XML data with the “xml” filter. It supports XPath, so you can query only specific parts of an XML message field. Below we extract “DocumentID” from the XML in the “msgbody” field and store it in a new field called “documentID”. Setting “store_xml” to false discards the rest of the parsed XML.

# within the 'filter' block
xml {
  source => "msgbody"
  xpath => [ "//*[local-name()='DocumentID']/text()", "documentID" ]
  target => "xml_documentID"
  store_xml => false
}
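
The effect of the `local-name()` expression, matching an element by name regardless of its namespace, can be illustrated with Python's standard library. The XML payload, its namespace and the ID value are invented for this sketch, since the real message body is not shown in full above.

```python
import xml.etree.ElementTree as ET

# Hypothetical message body; the element layout and namespace are assumptions.
msgbody = (
    '<?xml version="1.0" encoding="utf-8"?>'
    '<req:GetDocument xmlns:req="urn:example:documents">'
    "<req:DocumentID>12345</req:DocumentID>"
    "</req:GetDocument>"
)


def local_name(tag):
    """Strip the '{namespace}' prefix that ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def extract_document_ids(xml_text):
    """Return the text of every element whose local name is DocumentID."""
    root = ET.fromstring(xml_text.encode("utf-8"))
    return [
        element.text
        for element in root.iter()
        if local_name(element.tag) == "DocumentID"
    ]


document_ids = extract_document_ids(msgbody)
print(document_ids)
```

Matching on the local name keeps the query working even when different services use different namespace prefixes for the same element.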

The last section in the Logstash configuration file is the “output” section which is often used to save the log events as documents in Elasticsearch using the “elasticsearch” configuration as shown:

output {
  elasticsearch {
    host => "localhost"
    index => "application"
    index_type => "logs"
    protocol => "http"
  }
}

Elasticsearch - Store Data

Elasticsearch is a real-time distributed search and analytics engine built on top of Apache Lucene, a full-text search engine library. It is used for full-text search, structured search and analytics, and it is designed to let you explore large volumes of data quickly. Elasticsearch stores JSON documents; the JSON format is hierarchical in nature and Elasticsearch is aware of this structure. Beyond simply searching, Elasticsearch can also apply result ranking algorithms and calculate aggregate statistics across search results.

Elasticsearch stores documents of the same type in an ‘index’. An index requires a schema, called a mapping (comparable to the schema of an SQL database table), but Elasticsearch is typically smart enough to create and modify the mapping based on the documents being indexed. Because documents can arrive with different fields each time, objects mapped this way are known as “dynamic”. Sometimes, though, dynamic mapping does not work the way we want it to. This is where an explicit mapping helps: it predefines the fields and gives them sensible defaults.
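
As a rough idea of what an explicit mapping might look like for the fields produced by the grok and xml filters above, the sketch below builds the request body as a Python dictionary. It assumes the 1.x-era mapping syntax that was current when this post was written (later Elasticsearch versions replaced `string` and `not_analyzed` with `text` and `keyword`), and the choice of which fields to leave unanalysed is mine, not something prescribed by Elasticsearch.

```python
import json

# Assumed mapping for the "logs" type in the "application" index.
# "not_analyzed" keeps a value as one exact term, which suits fields
# that are filtered or counted rather than searched as free text.
mapping = {
    "mappings": {
        "logs": {
            "properties": {
                "loglevel": {"type": "string", "index": "not_analyzed"},
                "category": {"type": "string", "index": "not_analyzed"},
                "documentID": {"type": "string", "index": "not_analyzed"},
                "msgbody": {"type": "string"},  # left analysed for full-text search
            }
        }
    }
}

# The JSON body that would be sent when creating the index.
body = json.dumps(mapping, indent=2)
print(body)
```

Keeping fields such as the category unanalysed means a value like GetDocument.JMSHandler.Request is treated as a single term instead of being split into separate tokens.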

Kibana - Visualise data

Kibana is an open source, browser-based analytics and search dashboard for Elasticsearch. It is written entirely in HTML and JavaScript, so it needs only a plain web server and no fancy server-side components. Kibana strives to be easy to get started with, while also being flexible and powerful, just like Elasticsearch. Elasticsearch works seamlessly with Kibana to let you see and interact with your data.

Figure: Elasticsearch statistics viewed in Kibana

Kibana’s dashboards are organised into a system of rows and panels. The Kibana “stats” panel above shows the mean, maximum and minimum time taken for every operation or feature call, along with the individual counts. These values are pulled dynamically from the Elasticsearch documents in real time. Other panels that can be used to visualise data in Kibana include “histogram” and “terms”, which support different types of platform and operational analytics.
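
The figures in a stats panel are simple aggregates over the matching documents. The Python sketch below computes the same four numbers by hand over a few made-up documents. The `duration_ms` field and all of the values are hypothetical, and in practice Elasticsearch does this work on the server rather than in client code.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical documents; "duration_ms" is an assumed field name.
docs = [
    {"category": "GetDocument.JMSHandler.Request", "duration_ms": 120},
    {"category": "GetDocument.JMSHandler.Request", "duration_ms": 80},
    {"category": "GetDocument.JMSHandler.Response", "duration_ms": 200},
]


def stats_by_operation(documents, key="category", value="duration_ms"):
    """Return count, min, max and mean of `value`, grouped by `key`."""
    grouped = defaultdict(list)
    for doc in documents:
        grouped[doc[key]].append(doc[value])
    return {
        name: {
            "count": len(values),
            "min": min(values),
            "max": max(values),
            "mean": mean(values),
        }
        for name, values in grouped.items()
    }


stats = stats_by_operation(docs)
for operation, figures in stats.items():
    print(operation, figures)
```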

Stay tuned for part 2 of this blog, where I will discuss how to capture Windows performance metrics and visualise them using the ELK stack.


