Remove Duplicate Messages

Introduction

This example shows how to use the dup.lua module to filter out duplicate messages. If you need an efficient filter that eliminates duplicate messages, this may be what you are looking for.

This script matches messages that are byte-for-byte identical, so it will not work if you need to match messages which have the same content but varying structure.

If you have any questions please contact us at support@interfaceware.com.

Using the Code

  • Import the Remove Duplicate Messages channel from the Builtin: Iguana Tools repository
  • Experiment with the code to find out how it works
  • Then add the module(s) to your Translator project
  • Copy the require statement from the channel and add it at the top of your script
    Note: This module uses require to return a table
  • Adapt the code to your own requirements
  • Use the dup.isDuplicate{data=Data, lookback_amount=###} function to compare the incoming message with previous ones, as in the sketch after this list
  • There is also an optional “transform” argument which can be given a function to remove trivial differences before comparison; see below
  • Interactive scripting help is included for this module
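
A minimal sketch of calling the module from a filter script is shown below. It assumes the module is required under the name 'dup' (copy the actual require statement from the channel) and that non-duplicate messages should be pushed to the channel queue; the lookback figure is illustrative.

    local dup = require 'dup'

    function main(Data)
       -- Skip the message if an identical one arrived recently.
       if dup.isDuplicate{data=Data, lookback_amount=1000} then
          return
       end
       queue.push{data=Data}
    end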

Quite often when looking for duplicates there are trivial differences which make what are really duplicate messages seem different. For example, the MSH timestamp field might change with every message. For this reason the module provides the optional “transform” argument: you supply a function that pre-processes messages to remove trivial differences before they are compared.
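For instance, a transform might blank out the MSH-7 timestamp before the messages are compared. The sketch below assumes the transform function receives and returns the raw HL7 message text; the field handling is illustrative.

    -- Illustrative transform: blank the MSH-7 timestamp so it cannot
    -- make otherwise identical messages look different.
    local function stripTimestamp(Msg)
       local Msh = Msg:match('^[^\r\n]+') or Msg   -- the MSH segment
       local Count = 0
       local Stripped = Msh:gsub('|[^|]*', function(Field)
          Count = Count + 1
          -- MSH-7 (the timestamp) is the 6th |-delimited field after MSH-1.
          if Count == 6 then return '|' end
          return Field
       end)
       return Stripped..Msg:sub(#Msh + 1)
    end

The transform would then be passed alongside the other arguments, e.g. dup.isDuplicate{data=Data, lookback_amount=1000, transform=stripTimestamp}.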

The source for the main module is available on GitHub.

How it works

When the channel starts, the script queries the Iguana log for the last N messages, where N is the lookback amount you specify. From then on it maintains a list of MD5 hashes for the last N messages in first in, first out (FIFO) order. A message is a duplicate if its MD5 hash matches the MD5 hash of a previous message.

It’s very efficient. The only disk I/O happens at startup, when you begin running the channel. A linked list maintains the FIFO buffer, so adding and evicting hashes costs constant time per message and scales to large numbers of messages with no additional overhead. A hash table is used for the lookups, so each duplicate check is also constant time on average.
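
As a rough illustration of the technique (this is not the module’s own code), the sketch below combines a FIFO buffer of hashes with a hash table for the membership checks. It uses a simple index-based queue rather than a linked list for brevity, and util.md5 refers to Iguana’s built-in MD5 helper; the names and buffer size are illustrative.

    local MaxSize = 100   -- illustrative lookback size
    local Seen    = {}    -- hash -> true, for the fast membership check
    local Queue   = {}    -- hashes in arrival order
    local First, Last = 1, 0

    local function isDuplicate(Data)
       local Hash = util.md5(Data)
       if Seen[Hash] then return true end
       -- Remember this hash, evicting the oldest once the buffer is full.
       Last = Last + 1
       Queue[Last] = Hash
       Seen[Hash] = true
       if Last - First + 1 > MaxSize then
          Seen[Queue[First]] = nil
          Queue[First] = nil
          First = First + 1
       end
       return false
    end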

The only real overhead is memory usage, and that is modest: each MD5 hash is 16 bytes (32 hex characters), so the buffer stores a few dozen bytes per message rather than the full text of each message.

More information