Storing measurement data

Measurement data

To store the measurement data that I'm retrieving from efergy (and other sources), I needed a database that could handle large amounts of data. Storing it is usually not a big issue, as most databases can handle millions of records easily, but the problems start when you try to read or aggregate that much data. Until recently I used PostgreSQL, which was fast enough for the time-series data that I wanted to store and read from a web application. But reading a value every 15 seconds adds up to over 2 million records per year (4 readings per minute, roughly 2.1 million per year), and to aggregate that amount of data, PostgreSQL was no longer fast enough on a small server. Sure, I could scale up the server with a faster processor, more RAM and SSD disks, but that would drive up the cost of a simple monitoring solution. Another option was to use the aggregation API available from energyhive, but that would become more difficult to maintain if data sources from other devices were added.

There are a lot of options available to handle this kind of data. After doing some research I started looking at NoSQL solutions. Being quite new to NoSQL databases, I selected a couple of them and tested their performance at aggregating large amounts of data. Insert performance was not much of an issue because the inserts are 15 seconds apart. I tested MongoDB, Hadoop and Elasticsearch.

MongoDB

MongoDB has a small learning curve for getting started with inserting and reading data. However, like most NoSQL databases, it can be quite difficult to query once things get a little more complex, as the aggregation sketch below illustrates. The speed of a single-node instance is also not as fast as I wanted it to be. Most of these NoSQL databases get their speed from being set up as a cluster.
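
A minimal sketch of the kind of query that gets verbose in MongoDB: averaging power readings per hour with an aggregation pipeline. The database, collection and field names ("energy", "readings", "timestamp", "watts") are assumptions for illustration, not my actual schema.

```typescript
import { MongoClient } from "mongodb";

async function hourlyAverages() {
  const client = new MongoClient("mongodb://localhost:27017");
  await client.connect();
  const readings = client.db("energy").collection("readings");

  // Group readings by year/month/day/hour and compute the average wattage.
  const result = await readings.aggregate([
    { $match: { timestamp: { $gte: new Date("2015-01-01") } } },
    {
      $group: {
        _id: {
          year: { $year: "$timestamp" },
          month: { $month: "$timestamp" },
          day: { $dayOfMonth: "$timestamp" },
          hour: { $hour: "$timestamp" },
        },
        avgWatts: { $avg: "$watts" },
      },
    },
    { $sort: { "_id.year": 1, "_id.month": 1, "_id.day": 1, "_id.hour": 1 } },
  ]).toArray();

  await client.close();
  return result;
}
```

Even a simple hourly average already needs a three-stage pipeline with the date split out by hand, and on a single node it still has to scan all matching documents at query time.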

Hadoop

Hadoop is more of a NoSQL ecosystem containing all sorts of tools than a standalone product. The setup time and the learning curve are quite high. It can be used standalone, but to get a decent solution it needs multiple tools to achieve the results I wanted. Hadoop is not designed for fast, on-demand queries over a large amount of data. It is actually meant to easily create a cluster of servers that share a distributed file system and batch-process the raw data into new, prepared 'report' datasets, which can then be read. To use it for live data you would have to set up additional tools like Spark or Storm to process the data on the fly into report datasets. I'm pretty sure that with the time invested to learn and set these up it would be a very nice system that is easy to scale out to a cluster, but since I'm going to use it on a single server, the speed and the setup time were not what I was looking for.

Elasticsearch

Elasticsearch is actually meant as a search index, as it is built on Apache Lucene. However, it contains several aggregation functions, including time-interval aggregations, out of the box. The API of Elasticsearch is JSON based, which makes it a good candidate to save data to straight from Node-RED (see my post on Automating data collection). And the speed of this NoSQL database is, well, super fast, even in a single-node installation. To give you an idea of its speed: it can aggregate over 600k values of time-series data in 230 milliseconds, and that on a single node with only 4GB RAM, an Atom CPU and a non-SSD hard disk. So this is a really fast real-time engine; it aggregates the data on request without crunching it into read tables first.
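
To show what that looks like, here is a sketch of an hourly average over raw readings using Elasticsearch's date_histogram aggregation via the JSON REST API. The index and field names ("energy", "timestamp", "watts") are assumptions; the same JSON body could just as well be posted from a Node-RED http request node.

```typescript
// Aggregation request: one bucket per hour, with the average wattage per bucket.
const query = {
  size: 0, // only the aggregation buckets are needed, not the raw hits
  aggs: {
    per_hour: {
      // older Elasticsearch versions use "interval"; 7.x and later expect "calendar_interval"
      date_histogram: { field: "timestamp", interval: "hour" },
      aggs: {
        avg_watts: { avg: { field: "watts" } },
      },
    },
  },
};

async function hourlyAverages() {
  const response = await fetch("http://localhost:9200/energy/_search", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify(query),
  });
  const result = await response.json();
  // Each bucket carries the hour's start time and the averaged value.
  return result.aggregations.per_hour.buckets.map(
    (b: any) => ({ hour: b.key_as_string, watts: b.avg_watts.value })
  );
}
```

The aggregation is computed on request from the raw documents, which is exactly what makes it usable as a live back end for a dashboard without a separate reporting step.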
