Round Robin Data Storage in MySQL

If you want to store large amounts of volatile data (e.g. log file entries) in a database with a constant storage footprint and no maintenance jobs to purge old entries, a round robin database is a good solution. But how do you implement one in MySQL?

Some of the heaviest tables in our database are event-logging tables. Let’s look at a table which logs the most recent visitors to a member profile. Look at that ugly output:

mysql> SHOW TABLE STATUS LIKE "profile_visits_log" \G
*************************** 1. row ***************************
           Name: profile_visits_log
           Rows: 6'226'066
 Avg_row_length: 21
    Data_length: 130'747'386
   Index_length: 393'205'760

It grows and grows… And the queries get slower and slower.

This is the case because we are keeping a lot of old, unused data. For example, on my member profile there’s data back to September 2007:

[Screenshot: profile visitor list]
So we have to get rid of that old data. For example, we can do it like Xing and store only the n most recent entries for each user, dropping the rest:

[Screenshot: profile visitors in Xing]

Now, how do we do that in the most elegant way?

Note: We always want the last 5 entries, even if they date back half a year. So purging via WHERE last_update < DATE_SUB(NOW(), INTERVAL 7 DAY) is not a solution.

Here are two options:

Delete old posts by a cron job

The old-fashioned way is to run a separate cron job every night to purge the old data. Here’s some pseudo code:

foreach (tilllate-member-profile)
     get all visits of this profile
     delete all visits except the 5 most recent ones
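As a sketch of what such a cron job boils down to, here is the purge step in Python, using SQLite as a stand-in for MySQL (the table layout, the `id` surrogate key, and the keep-count are my assumptions, not taken from the production schema):

```python
import sqlite3

KEEP_N = 5  # number of most recent visits to keep per profile

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE profile_visits_log (
    id          INTEGER PRIMARY KEY AUTOINCREMENT,
    uid         INTEGER NOT NULL,
    last_update TEXT    NOT NULL,
    visitor_uid INTEGER NOT NULL)""")

# Seed eight visits for one member profile (uid 5)
conn.executemany(
    "INSERT INTO profile_visits_log (uid, last_update, visitor_uid) "
    "VALUES (?, ?, ?)",
    [(5, "2008-01-0%d 12:00:00" % d, 100 + d) for d in range(1, 9)])

# The nightly purge: for each profile, delete everything
# except the newest KEEP_N rows
for (uid,) in conn.execute("SELECT DISTINCT uid FROM profile_visits_log"):
    conn.execute("""
        DELETE FROM profile_visits_log
        WHERE uid = ? AND id NOT IN (
            SELECT id FROM profile_visits_log
            WHERE uid = ?
            ORDER BY last_update DESC
            LIMIT ?)""", (uid, uid, KEEP_N))

remaining = conn.execute(
    "SELECT COUNT(*) FROM profile_visits_log WHERE uid = 5").fetchone()[0]
```

After the purge, only the five newest visits per profile remain. Note that the `DELETE ... NOT IN (SELECT ... LIMIT ...)` subquery form needs a derived-table wrapper on older MySQL versions.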

Problems with this method

  • A separate, asynchronous cron job is needed -> logic is spread across the application, which increases maintenance cost.
  • A complex job causes a lot of heavy queries all at once -> long locking times.

Use the Round Robin Strategy

Here’s an idea I got from RRDtool. The idea is to have a sequence number going from 0 to 4 and then wrapping back to 0. With that, the “system storage footprint remains constant over time”. This is also known as a “circular buffer”.

Additional advantage: with the round robin event logging strategy you have the whole code in one place (the principle of cohesion). You don’t need a separate cron job in charge of the purging.

We’re working with the following table:

CREATE TABLE  `test`.`profile_visits_log_rrd` (
  `uid` int(10) unsigned NOT NULL default '0',
       # this is the uid of the visited member profile
  `sequence_id` tinyint(3) unsigned NOT NULL default '0',
       # This is the sequence number.
       # It goes from 0 to 4. Once at 4 it
       # wraps and restarts at 0.
  `last_update` datetime NOT NULL,
  `visitor_uid` varchar(45) NOT NULL,
  PRIMARY KEY (`uid`,`sequence_id`),
  KEY `uid-last_update` (`uid`,`last_update`)
) ;

Here’s some sample data:

uid  sequence_id  last_update          visitor_uid
4    0            2007-12-30 21:02:14  6755474
4    1            2007-12-31 09:48:25  9887412
4    2            2007-12-31 10:59:22  208573
4    3            2007-12-31 11:51:21  757983
4    4            2008-01-01 13:54:08  3548743
5    0            2008-01-01 13:55:25  3548743
5    1            2008-01-01 19:01:57  940873
5    2            2008-01-01 20:23:54  1270881
5    3            2007-12-30 11:30:00  1400251
5    4            2007-12-31 16:17:36  620098

For each user you have only the last 5 entries.

Reading out data for a user

SELECT * FROM profile_visits_log_rrd WHERE uid=5
ORDER BY last_update DESC;

uid  sequence_id  last_update          visitor_uid
5    2            2008-01-01 20:23:54  1270881
5    1            2008-01-01 19:01:57  940873
5    0            2008-01-01 13:55:25  3548743
5    4            2007-12-31 16:17:36  620098
5    3            2007-12-30 11:30:00  1400251

It uses the index `uid-last_update` and is therefore efficient.

Adding new data for a user

Imagine that at 2008-01-01 22:03:23 a new visitor visits the member profile of user ‘5’.

You don’t want a new row, so you need to replace the oldest one. In the case above that is this row:

uid  sequence_id  last_update          visitor_uid
5    3            2007-12-30 11:30:00  1400251

But to replace that row you have to find out its primary key. In this case it is uid=5 AND sequence_id=3.

You already know the uid. So the first step is:

Step 1: Get the next sequence_id

SELECT (`sequence_id`+1) % 5 AS next_sequence_id
 FROM profile_visits_log_rrd
 WHERE uid=5
 ORDER BY last_update DESC
 LIMIT 1;

The price is low: the query can use the index `uid-last_update`, so there is no filesort and no temporary table.

In the example above this returns ‘3’. (We’re using modulo to make sure the number wraps back to 0 when it goes past ‘4’.)

Step 2: Add the data

REPLACE INTO profile_visits_log_rrd SET
  uid = 5,
  sequence_id = 3,
  last_update = '2008-01-01 22:03:23',
  visitor_uid = 1400254;

Maybe an INSERT ... ON DUPLICATE KEY UPDATE is cheaper. But either way: every statement uses an index.
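The two steps can be sketched end to end. The snippet below uses Python with SQLite as a stand-in for MySQL; the `log_visit` helper and the sample values are my own illustration, not from the production code. It also falls back to sequence_id 0 when a user has no rows yet, since the SELECT then returns nothing:

```python
import sqlite3

RRD_SIZE = 5  # number of round robin slots per user

def log_visit(conn, uid, visitor_uid, when):
    # Step 1: the slot to overwrite is (newest sequence_id + 1) % RRD_SIZE,
    # or 0 when the user has no entries yet.
    row = conn.execute(
        """SELECT (sequence_id + 1) % ? FROM profile_visits_log_rrd
           WHERE uid = ? ORDER BY last_update DESC LIMIT 1""",
        (RRD_SIZE, uid)).fetchone()
    next_seq = row[0] if row else 0
    # Step 2: REPLACE hits the primary key (uid, sequence_id),
    # overwriting the oldest entry once the buffer is full.
    conn.execute(
        """REPLACE INTO profile_visits_log_rrd
           (uid, sequence_id, last_update, visitor_uid)
           VALUES (?, ?, ?, ?)""",
        (uid, next_seq, when, visitor_uid))

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE profile_visits_log_rrd (
    uid         INTEGER NOT NULL,
    sequence_id INTEGER NOT NULL,
    last_update TEXT    NOT NULL,
    visitor_uid INTEGER NOT NULL,
    PRIMARY KEY (uid, sequence_id))""")

for i in range(8):  # 8 visits; only the last 5 survive
    log_visit(conn, 5, 1000 + i, "2008-01-01 10:0%d:00" % i)
```

After eight visits the table holds exactly five rows for uid 5, and reading them back `ORDER BY last_update DESC` yields visitors 1007 down to 1003.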

Add atomicity

To make the two steps atomic you need to wrap everything in a table lock, or put it in a stored procedure (thanks, Leo):

LOCK TABLES member_gold_visitor_rrd WRITE;
SET @next_sequence_id = 0;
SELECT (@next_sequence_id := ((`sequence_id`+1) % 5))
    AS next_sequence_id
 FROM member_gold_visitor_rrd
 WHERE uid=5
 ORDER BY last_update DESC
 LIMIT 1;
# For debugging only: SELECT @next_sequence_id;
REPLACE INTO member_gold_visitor_rrd SET
  uid = 5,
  sequence_id = @next_sequence_id,
  last_update = NOW(),
  visitor_uid = 1400254;
UNLOCK TABLES;
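The SELECT and the REPLACE can also be collapsed into a single statement, which sidesteps the explicit lock; a COALESCE fallback to 0 covers a user with no rows yet. Here is a sketch in Python against SQLite (in MySQL the self-referencing subquery may need a derived-table wrapper, so verify it there before relying on it):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE member_gold_visitor_rrd (
    uid         INTEGER NOT NULL,
    sequence_id INTEGER NOT NULL,
    last_update TEXT    NOT NULL,
    visitor_uid INTEGER NOT NULL,
    PRIMARY KEY (uid, sequence_id))""")

# One statement: compute the next sequence_id inline
# and overwrite that slot atomically.
ONE_SHOT = """
REPLACE INTO member_gold_visitor_rrd
    (uid, sequence_id, last_update, visitor_uid)
VALUES (?,
        COALESCE((SELECT (sequence_id + 1) % 5
                  FROM member_gold_visitor_rrd
                  WHERE uid = ?
                  ORDER BY last_update DESC
                  LIMIT 1), 0),
        ?, ?)
"""
for i in range(7):  # 7 visits; only 5 slots exist
    conn.execute(ONE_SHOT, (5, 5, "2008-01-01 10:0%d:00" % i, 1400250 + i))
```

The COALESCE handles the empty-table case that makes a plain REPLACE ... SELECT do nothing when a user has no entries yet.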

Benefits of this technique

  • You don’t need a separate cron job (Cohesion!)
  • There’s no long table locking
  • Your tables don’t grow with time. Just with the number of users. (Constant memory footprint)
  • You can abstract this logging via a component to make it transparent for your developers. (As a matter of fact we have extended Zend Framework with such a class)

Better suggestions?

Does anyone have other ideas on how to solve this problem elegantly?

Maybe MySQL is not the right tool for this kind of data? Let me know what you think.

There’s a PostgreSQL implementation for that issue.

This entry was posted in Database and PHP.

14 Responses to Round Robin Data Storage in MySQL

  1. Ingo says:

    What about a trigger? Although you wouldn’t eliminate an insert/delete query, it would be purely in database and extensive locking might be avoided.

  2. Dennis J. says:

    Is the code under “Add atomicity” really correct? Using the test data above the SELECT sets @next_sequence_id to 4 and then the REPLACE updates the row with sequence_id 4 which is *not* the least recent entry unless I’m missing something here.

  3. Jared says:

    Can you not combine the SELECT & REPLACE into one statement, and do away with the explicit locking?

    REPLACE INTO member_gold_visitor_rrd(uid, sequence_id, last_update, visitor_uid)
    SELECT 5, ((sequence_id+1) % 5), NOW(), 1400254
    FROM member_gold_visitor_rrd
    WHERE uid=5
    ORDER BY last_update DESC
    LIMIT 1;

    Can’t remember if there are any limitations that prevent doing that.

  4. Dennis J. says:

    Forget that. I misinterpreted the ORDER BY.

  5. The problem is that in most cases you’d want to preserve historical data for statistical purposes. So I don’t see a way to avoid a CRON

  6. tomatenschaf says:

    I think, the following should also work:

    REPLACE INTO member_gold_visitor_rrd (uid, last_update, visitor_uid, sequence_id) SELECT 5, NOW(), 1400254, ((`sequence_id`+1) % 5)
    FROM member_gold_visitor_rrd
    WHERE uid=5
    ORDER BY last_update DESC
    LIMIT 1


    REPLACE INTO member_gold_visitor_rrd (uid, last_update, visitor_uid, sequence_id) SELECT 5, NOW(), 1400254, ((max(`sequence_id`) % 5)+1)
    FROM member_gold_visitor_rrd
    WHERE uid=5

    (not tested, but I would try these)

  7. leo says:

    We also tried stuff with “REPLACE INTO”. At the beginning the table is empty (or at least there’s no record for a certain user), so the SELECT will not return a record and it fails.

  8. Pingback: Round Robin Data Storage in MySQL - Tilllate Techblog - To take away

  9. Lukas says:

    You definitely want to use INSERT ON DUPLICATE KEY UPDATE instead of REPLACE. With REPLACE you will do an insert and then a delete is issued in case of a constraint violation. With ON DUPLICATE KEY UPDATE you will do the insert, it will detect the constraint violation and will do an update instead. Since the row size will remain the same, there is no need for any page shuffling to be done. With REPLACE you are poking holes into your data pages, which will require frequent OPTIMIZE TABLE runs in order to ensure that you are getting as much data as possible in a single IO cycle.

    I have an example of a fixed-length queue implemented using ON DUPLICATE KEY UPDATE in one of my recent talks.

  10. Lathrop Preston says:

    I think in the application discussed here, all you really need for archive data would be a weekly or monthly roll-up count. So that could still be done somehow.

  11. James says:

    Doubtful this would scale well. Particularly for data mining queries later on.

    I’m looking into scaling strategies right now and the one I’m thinking of is literally splitting data across dated databases or tables. Depending on your storage requirements, having data split across multiple databases would allow for manageable clusters of servers, each with dedicated processing power for data mining. The cost is typically financial (minimal with something like EC2) and code (locally working out which database (and possibly connection) and table to write to and later read back from).

  12. leo says:

    Well, it’s mainly not about scaling, it’s about storage. With this you reduce the amount of data, which will save you from scaling for some time. At the same time you also lose data, so when it comes to data mining, you will not have data to dig through anymore.

    If you’re interested in splitting databases with MySQL, you might look at DB Slayer, MySQL Proxy or MySQL’s own partitioning feature. Although when it comes to the mining itself you might still have to write your own code.

  13. Martin says:

    I agree with Lukas… With REPLACE the DBMS has to check the keys and, if a matching value is found, it will execute a DELETE statement and then the corresponding INSERT statement.

  14. Maryada says:

    RRD defines a timeline-based implementation of data in a database. That part is missing here.