Logging

Although logging is very important part of a software, there’s no too much information about it like about writing good software documentation. Developers know that it’s a mandatory step, but usually here the story ends.

Log is the story of the software, like a story of a book. Later the story becomes to history and finally it ends as a garbage. People write books, because others are interested about the story. If the story is untraceable, people stop reading. Logging is writing a story. If the story untraceable, the log is unusable. How can you write good log entries in a software? I hope this short page can answer this question.

Audience

Before you start anything in life, you need to know who is your audience, otherwise your effort is just waste of time. So for logging we need to know who is our audience, who reads logs.

System administrator: is responsible for the service works properly. They install and start new components, reload software when the system configuration changes and stabilize the software when some error happen. So they must know,

  • when the software has started properly,
  • when the software is down or does not work properly.
  • what the problem is when the software is down or does not work properly.

Production support: fixes issues in the software. They must know

  • what was the issue
  • what happened in the system, before the issue occurred
  • the state and the context of the software (configuration settings, variable dumps) in that time

Business analyst: analysis the system, makes statistics, measures quality. They use logs as a data source. They

  • make patterns for behaviours and do examinations
  • define user experience factors, like average response time, canceled processes.

QA Team: ensures the system administrator team about software works properly. So they need to know,

  • the software works properly

Developer: works on the software. They need

  • to see the run path
  • debugging information

Developer, QA Team, System administrator are usually working on the story of the software, in the present when the context is clear. Production support and Business analyst are working with the history, when lots of information are missing. Therefore as a developer we need to think with the head of a production support member or business analyst if we want to write good logs.

Levels

In logging we use different severity levels to classify messages. Lower levels are error messages and as we increase the log level we get more information about the system. RFC 5424 defines 8 different severity levels, but companies, frameworks sometimes use more or less levels according to their business strategy. For instance Ruby on Rails uses only 6 levels, :debug, :info, :warn, :error, :fatal, and :unknown.

    0 Emergency: system is unusable
    1 Alert: action must be taken immediately
    2 Critical: critical conditions
    3 Error: error conditions
    4 Warning: warning conditions
    5 Notice: normal but significant condition
    6 Informational: informational messages
    7 Debug: debug-level messages

To talk about these levels I would use a metaphor. Imagine a railway company who has a fix route from Bristol to London. The train starts at the same time each morning, goes on the same route and arrives at the same time. One execution of the software is this train on one morning. It has passengers (QA), mechanics (Production support), operators (system admins).

Info

We use info level to inform our passengers about the process. Passengers need information about departure of the train, the next stop, arrival time, connections.

    [info] The train started; time='8:30'
    [info] Reaching the next stop within 15 minutes; next_stop='Swindon'
    [info] The train stopped; city='Swindon', time='9:12'
    [info] The train continue its route; time='9:16'
    ...
    [info] Delay expected; delay='5min'
    [info] The train stopped; time='9:42'

In an application the following information are required on info level:

  • when the component is started
  • when the component needs help from other component
  • the component got help from another component
  • the component stopped
  • any information that can be useful for sysops, QA

Debug

Debug level is being used to detect errors. Software is usually executed in info level, the administrators turn on debug level when they experience something strange in the system. On this level every little detail must be share that can be useful for debugging.

    [info] The train started; time='8:30'
    [debug] Heat sensor information has been received; air='25C', wheel='33C', break='55C'
    [debug] Speed information; speed='89kmh'
    [debug] Crossing switch; switch_position='L2'
    [info] Reaching the next stop within 15 minutes; next_stop='Swindon'
    [debug] Speed information; speed='92kmh'

Error

The log standard defines 5 different errors levels, but my opinion is these are too many. In most of the cases two type of errors are enough, normal and critical. Don’t make your life harder!

Normal

When the system do abnormal activity, the software must notify its audience. Abnormal activity that is unusual, or can cause problem in the future and developer should fix the problem as soon as possible. Normal error doesn’t cause system abortion.

    [info] The train started; time='8:30'
    [error] Heat sensor detected high temperature in the engine; temp='122C'
    [debug] Heat sensor information has been received; air='44C', wheel='55C', break='82C'

Whether when an application can’t open a file is an error or debug message? It depends on the environement. When the file is required by the system and it should be there in normal case, then it’s an error.

    [error] Could not open database configuration; path='config/database.yml'

When the file comes from an input, then it’s a normal case when the file doesn’t exist, therefore in this case the error is just a debug message, since system administrators don’t care about a user used wrong file path.

    [debug] Missing database configuration, using default config; path='/tmp/database.yml'

Critical

An error is critical error, when the application can’t continue running and needs to abort. It can be a problem in the application, problem in the environment. If the engine of the train broke down, the train can’t start off, so it’s a critical error and the operators needs to do something quickly.

    [info] The train started; time='8:30'
    [error] Heat sensor detected high temperature in the engine; temp='122C'
    [critical] Could not start the engine.
    [info] Train broke down.

Format

Log messages must be unique to Production support and Business analyst be able to define the point of the system where the message was created. If the log message always changes, then very difficult to find the trace in the log or create pattern for this. So there are some suggestions for log format.

Static + Dynamic

Log message must contain a static and a dynamic part where the static describes the event that happened, the dynamic part defines the context. Before I describe what does it mean, just imagine the case when you need to find an issue in a software, and you see the following log entry in the log file:

    [ERROR] Thomas.peary could not log in MerchantService.

As a developer first you want to know the place where the error occurred. So you do a search:

    grep -ir "Thomas.peary could not log in to MerchantService" .

Unfortunately there’s no result. Why? Because the log is totally dynamic. What can you do? You can search for some static parts, for instance to “could not log in to”, but unfortunately there’re 20 different places where “could not log in to” is used. The log looks something like this in the code:
logger.info username + ” could not log in to ” + service_name.

Split the message to static and dynamic part with “;” sign. The first static part is a description, the second dynamic part gives context. See the previous example again:

    [ERROR] User could not log in to the service; user='thomas.peary', service='MerchantService', locale="en-ie"

The static part is “User could not log in to the service”. This static entry is unique in the software. If it’s not true, we need to extend it with addition information. The dynamic part user='thomas.peary', service='MerchantService' describes the context. It uses key, value pairs, where the key can be a variable name the value is the value of the variable. We can add additional information to the dynamic part, like session id or other fields that can give more context to the support team later.
If support team wants to find the message, just simply look for the static part, and there must be only one result in the code:

    $ grep -ir "User could not log in to the service" .
    user_session_controller.rb: logger.info "User could not log in to the service", :user, :service, :locale

By adding addition locale parameter to the context, Business Analyst is able to create a statistics about how many people could not log in to service Merchant in Ireland.

    pattern("Failed merchant logins in IE", "User could not log in to the service", "locale=en-ie", "service=MerchantService")