Using Log References
====================
Extracting references from log messages
---------------------------------------
If you're using structured logging, log references are contained in the ``ref``
field of the structured message.
If you're not using structured logging, you'll need to extract references
from log message text through configuration of a log ingestion tool
(such as `Vector `_).
References can be extracted from log messages using the following regular
expression:
.. code-block::
\[ref: ([0-9]{1,10})\]
When matching, the regular expression above results in one match group
containing the numerical reference in the log message. Valid references
may be in the range 1 to 4294967295.
The expression is compatible with common PCRE, ECMAScript, Rust, Python,
Golang, Java 8, C# and many other regular expression implementations.
Whenever possible, you should extract references from log messages before
storing or analysing them. This way your log storage and analysis system can
store references in a structured field, allowing you to more easily refer
to them in queries and probably resulting in them using less storage.
Using references to improve log analysis
----------------------------------------
.. note::
Log query examples in this section are given in SQL to clearly communicate
intent in a (mostly) vendor-agnostic way.
Identifying log event types
^^^^^^^^^^^^^^^^^^^^^^^^^^^
Often we use text search patterns to identify application events from log
messages.
For example, consider the following log message from the
`Rocket web framework `_:
.. code-block::
2023-05-02T21:50:19.963Z INFO Bad incoming HTTP request.
We might use a query over the stored logs to count the events this log message
is referring to:
.. code-block:: sql
SELECT COUNT(message) WHERE message = 'Bad incoming HTTP request.';
Now imagine the framework developers decide to improve the context within
their log messages. The same event is now logged with an updated message:
.. code-block::
Bad incoming GET HTTP request.
Great! There's now a bit more useful information in the log message!
However, this now breaks our log query. Let's update it:
.. code-block:: sql
SELECT COUNT(message) WHERE message = 'Bad incoming GET HTTP request.';
Hold on... the ``GET`` in the log message looks like it might be interpolated,
and could potentially be replaced with alternative values (like ``POST`` or
``PUT``).
We can update our query to accommodate this:
.. code-block:: sql
SELECT COUNT(message) WHERE message LIKE 'Bad incoming % HTTP request.';
This works, but doing it this way means we have to accept that:
* Changes to log messages - which may happen without notice - might require us
to change our log analysis queries, and break our log analysis until we do!
* In the above example, our original query would produce the wrong results
after the log message change.
* Log analysis query complexity and execution cost will vary with log message
content.
* In the above example, we have introduced a wildcard string comparison
which will almost certainly incur more cost at query time than a
literal string comparison.
* Technical or performance limitations (e.g. with the query language or
analysis tool) may introduce ambiguity into our log analysis.
* In the above example, a different unrelated log message could be confused
with the one we were trying to count after introducing the wildcard
(imagine another log message like ``Bad incoming file over HTTP request.``)
The log references stored in log messages by Breadlog provide a more reliable,
lower effort alternative for identifying events based on logs. The equivalent
original log message from Rocket but with a Breadlog reference looks like
this:
.. code-block::
2023-05-02T21:50:19.963Z INFO [ref: 24] Bad incoming HTTP request.
(Or, if using structured logging, the ``ref`` value would be exposed as a
structured field. This could be serialised as JSON or another structured
format).
Our query to count these events is instead:
.. code-block:: sql
SELECT COUNT(ref_id) WHERE ref_id = 24;
After being updated, the log message looks like this:
.. code-block::
2023-05-02T21:50:19.963Z INFO [ref: 24] Bad incoming GET HTTP request.
Notice how the numerical reference doesn't change? This means the query we
used to analyse the logs still works after the change to the log message.
Also, the query works regardless of the interpolated content and without us
needing any knowledge upfront of the different possible values that could be
inserted into the log message.
Aggregating event types
^^^^^^^^^^^^^^^^^^^^^^^
As illustrated above, log references make running aggregate queries across
logs easier. This is important because it's very common to analyse logs this
way, for example to:
* Produce a histogram of event types.
.. code-block:: sql
SELECT ref_id, COUNT(ref_id) GROUP BY ref_id;
* Produce a time series of event types.
.. code-block:: sql
SELECT ref_id, EXTRACT(DAY FROM event_time) as event_day GROUP BY ref_id, event_day ORDER BY event_day DESC;
* Identify how event types are distributed by host.
.. code-block:: sql
SELECT hostname, ref_id, COUNT(ref_id) AS num_events GROUP BY hostname, ref_id;
By contrast, doing this with queries using log message text means:
* Having to do text processing within the query.
* Having to handle variable portions of the log message (like in the HTTP verb
example above).
* Updating the queries when log message text changes.
Sequence analysis
^^^^^^^^^^^^^^^^^
Looking at sequences of events can be useful, and of course is made more
accurate with reliable event identifiers. For example:
* An investigation into an application fault might reveal that the fault is
preceded by a certain sequence of events. If this sequence could be
identified automatically, there might be an opportunity to predict future
instances of the fault before it occurs.
* Unusual sequences of events might indicate unexpected system behaviour. If a
model of normal event sequences could be built and kept up-to-date, sequences
deviating from this model could be used to trigger additional checks.