
Staying Slim: Trimming Irrelevant Data From Production Cassandra Clusters

I’m proud to be the most recent addition to the technical team here at Youneeq and, while I’ve worked with databases before, this has been my first experience with Cassandra. Cassandra is a non-relational, highly scalable database that allows us to conduct real-time behavioral analysis and provide personalized recommendations for our clients’ users.

As with any database, archiving and deleting stale data is a necessary task to keep the database performing at a high level. This has been one of my first responsibilities at Youneeq, and working closely with Cassandra has been an exciting learning experience. At Youneeq, our goal is to provide 24/7 service to our clients and to avoid periods of inefficient or inaccurate service due to maintenance.

There are a number of options for removing data from an existing production cluster without degrading the service. The first is simply to truncate the table, which is by far the quickest and easiest option. However, we need to retain some of the more recent records in our tables, so truncating is not appropriate. Dropping and recreating the table is similarly unfeasible in production because of the potential delay before recreation. A simple row-by-row deletion would, because of our schema, have required the use of ALLOW FILTERING, an anti-pattern in Cassandra, and was thus deemed both too inefficient and too unreliable.

The approach we settled on was to select a subset of the table's columns and insert them into a relational database. This allowed us to select only those rows that required archival and subsequently remove them from Cassandra using their unique primary key. Using a relational database in this manner allows us to avoid anti-patterns in Cassandra and achieve regular archival of specific data in an automated fashion. Most importantly, our clients are guaranteed uninterrupted service and our cluster is keeping those terabytes off without changing its lifestyle. If only keeping slim were so easy for humans.
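For readers who want a concrete picture, the workflow above could be sketched roughly as follows. This is a minimal illustration and not our production code: SQLite stands in for the relational database, and the table and column names (`staged`, `events`, `event_id`, `created_at`) are hypothetical. The `execute` argument is assumed to be a callable that accepts a CQL string and a parameter tuple, such as the `execute` method of a session from the DataStax Python driver.

```python
import sqlite3


def stage_keys(conn, rows):
    """Load (primary key, timestamp) pairs exported from Cassandra
    into a relational staging table (hypothetical schema)."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS staged ("
        "event_id TEXT PRIMARY KEY, created_at TEXT NOT NULL)"
    )
    conn.executemany("INSERT OR REPLACE INTO staged VALUES (?, ?)", rows)
    conn.commit()


def keys_to_archive(conn, cutoff):
    """Let the relational database do the filtering: return the primary
    keys of rows older than the cutoff. Timestamps are assumed to be
    ISO 8601 strings in one consistent format, so they compare correctly
    as text."""
    cursor = conn.execute(
        "SELECT event_id FROM staged WHERE created_at < ? ORDER BY event_id",
        (cutoff,),
    )
    return [row[0] for row in cursor]


def delete_by_key(execute, keys):
    """Issue one primary-key DELETE per archived row, so Cassandra never
    needs ALLOW FILTERING. `execute` is an assumed callable taking
    (cql_string, parameters)."""
    for key in keys:
        execute("DELETE FROM events WHERE event_id = %s", (key,))
```

In use, `stage_keys` would be fed from a paged read of the chosen columns, `keys_to_archive` would be called with a cutoff such as `"2021-01-01T00:00:00"`, and the resulting keys would be passed to `delete_by_key` along with the session's `execute` method. Because every delete targets a full primary key, each statement is an ordinary single-row operation from Cassandra's point of view.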

Stay tuned for our next video in which we’ll explore some of the differences between Google Analytics and The Youneeq Dashboard Experience!
