Notes on Time Series Database InfluxDB



#time series database  #InfluxDB 


Introduction

A time series database is special in many ways. It must support write-heavy workloads and handle spikes in reads. Moreover, it needs to handle time-series data efficiently. To some extent, we can view a time-series data point as a special key-value record whose key includes the timestamp. The problem with this view is that it ignores an important aspect: timestamps carry meaning and introduce relationships among data points. Most queries in time-series analysis involve some sort of range search, which in turn involves a scan of the database. To get good performance, we need to organize the data so that points close to each other in time are also close to each other in storage. Achieving this locality property is one of the key challenges.

Key Concepts

InfluxDB has a few key concepts: measurement, tag, field, and timestamp.

Roughly speaking, data in InfluxDB resembles a "stacked" pandas DataFrame. The measurement corresponds to the name of the DataFrame; tags correspond to the indices of the DataFrame; fields correspond to columns in the "unstacked" form.
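The analogy can be sketched with a small hypothetical "census" measurement (the data values are illustrative, not real sample data): tags become index levels, fields become columns, and stacking yields one row per data point.

```python
import pandas as pd

# Unstacked form: one column per field, tags (plus time) as the index.
df = pd.DataFrame(
    {"bees": [23, 28], "ants": [30, 32]},
    index=pd.MultiIndex.from_tuples(
        [("klamath", "anderson", "2019-08-18T00:00:00Z"),
         ("klamath", "anderson", "2019-08-18T00:06:00Z")],
        names=["location", "scientist", "time"],
    ),
)

# The "stacked" form resembles how InfluxDB presents rows:
# one (tags, time, field key) -> field value entry per data point.
stacked = df.stack()
print(stacked)
```

Each entry of `stacked` is addressed by the full tag set, the timestamp, and a field key, which is exactly the shape of an InfluxDB row.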

In InfluxDB, tags are indexed and fields are not. The primary key consists of the timestamp and the tags. This leads to the so-called high series cardinality issue: if tag values vary widely (or are unbounded), the number of series explodes and the index becomes useless. A simple rule of thumb is to store only categorical data in tags.
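The tags-vs-fields split is visible in InfluxDB's line protocol, where tags ride along with the measurement name and fields form a separate section. Below is a minimal sketch (the helper function is made up for illustration) that builds a line-protocol string, putting categorical values in tags and the unbounded numeric reading in a field:

```python
# Line protocol shape: measurement,tag_set field_set timestamp
def to_line_protocol(measurement, tags, fields, timestamp_ns):
    tag_str = ",".join(f"{k}={v}" for k, v in sorted(tags.items()))
    field_str = ",".join(f"{k}={v}" for k, v in sorted(fields.items()))
    return f"{measurement},{tag_str} {field_str} {timestamp_ns}"

line = to_line_protocol(
    "census",
    tags={"location": "klamath", "scientist": "anderson"},  # categorical -> tags
    fields={"bees": 23},                                    # unbounded -> field
    timestamp_ns=1566086400000000000,
)
print(line)
# census,location=klamath,scientist=anderson bees=23 1566086400000000000
```

If we had stored the bee count as a tag instead, every distinct reading would create a new series, which is exactly the cardinality problem described above.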

Note: As we mentioned earlier, the primary key consists of the timestamp and the tags, so technically speaking, the timestamp is also a kind of tag. On the other hand, timestamp values vary a lot and are unbounded, which seems to contradict the rule of thumb. As we will see later, InfluxDB stores data in shard groups, which are organized by retention policy (e.g. its duration). Dividing data into multiple shard groups essentially discretizes the time dimension, which turns timestamps into categorical data.

Here is an example of the data represented in InfluxDB (copied from the official website). location and scientist are tag names.

influxDB_data_table.png

We haven't yet defined what a series means in InfluxDB. A series is a key-value record whose key consists of the measurement, the tag set, and a field key. For example, key = (census, location = klamath, scientist = anderson, bees). The value is the collection of data points; in our case it is [(2019-08-18T00:00:00Z, 23), (2019-08-18T00:06:00Z, 28)]
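The grouping of raw points into series can be sketched in a few lines of Python (the data layout here is an illustration, not InfluxDB's actual internal representation):

```python
from collections import defaultdict

# Each raw point: (measurement, tag set, field key, timestamp, value)
points = [
    ("census", (("location", "klamath"), ("scientist", "anderson")), "bees",
     "2019-08-18T00:00:00Z", 23),
    ("census", (("location", "klamath"), ("scientist", "anderson")), "bees",
     "2019-08-18T00:06:00Z", 28),
]

# A series key is (measurement, tag set, field key); its value is
# the list of (timestamp, value) data points.
series = defaultdict(list)
for measurement, tag_set, field_key, ts, value in points:
    series[(measurement, tag_set, field_key)].append((ts, value))

key = ("census", (("location", "klamath"), ("scientist", "anderson")), "bees")
print(series[key])
# [('2019-08-18T00:00:00Z', 23), ('2019-08-18T00:06:00Z', 28)]
```

Both points land in the same series because they share the measurement, tag set, and field key; only their timestamps differ.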

The schema of the measurement census is given in the table below. It's called Bucket Schema in InfluxDB:

influxDB_bucket_schema.png

Storage Engine

The storage engine has the following components: the Write Ahead Log (WAL), the in-memory cache, Time-Structured Merge Tree (TSM) files, and the Time Series Index (TSI).

The first two components are common building blocks; TSM and TSI are InfluxDB-specific implementations. The details of TSM and TSI are out of the scope of this post; here we just document a common sequence of operations. When InfluxDB receives a write request, it goes through the following steps: the points are appended to the WAL and synced to disk for durability, then added to the in-memory cache so they are immediately queryable; when the cache exceeds its size threshold, it is snapshotted to an immutable TSM file, and a compactor merges TSM files in the background.
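This write sequence can be sketched as a toy in-memory engine. All names here (`ToyStorageEngine`, `cache_limit`) are made up for illustration; the real engine also handles fsync, compression, and compaction, which are omitted.

```python
class ToyStorageEngine:
    """Illustrative only: mimics the WAL -> cache -> TSM write sequence."""

    def __init__(self, cache_limit=2):
        self.wal = []          # append-only log (durability)
        self.cache = {}        # in-memory, immediately queryable
        self.tsm_files = []    # immutable, sorted snapshots
        self.cache_limit = cache_limit

    def write(self, series_key, timestamp, value):
        self.wal.append((series_key, timestamp, value))                   # 1. log first
        self.cache.setdefault(series_key, []).append((timestamp, value))  # 2. cache
        if sum(len(v) for v in self.cache.values()) >= self.cache_limit:
            self._snapshot()                                              # 3. flush

    def _snapshot(self):
        # Freeze the cache into a sorted, immutable "TSM file" and reset it.
        self.tsm_files.append({k: sorted(v) for k, v in self.cache.items()})
        self.cache = {}

engine = ToyStorageEngine(cache_limit=2)
engine.write("census,location=klamath#bees", "2019-08-18T00:06:00Z", 28)
engine.write("census,location=klamath#bees", "2019-08-18T00:00:00Z", 23)
print(len(engine.tsm_files), len(engine.cache))  # 1 0
```

Note how the snapshot sorts points by time before freezing them: this is the locality property from the introduction, arranged so that time-range scans read contiguous data.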

When the storage engine handles a query, it uses a copy of the data (e.g. cached snapshots plus TSM files). This way, the storage engine can continue accepting writes while serving queries. The price we pay is having to deal with eventual consistency, which is a reasonable cost in most scenarios.

InfluxDB Shards and Shard Groups

Key concepts about shards and shard groups:

InfluxDB_shards_and_shard_group.jpg
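The time discretization mentioned earlier can be sketched as a simple bucketing function. The 7-day window below matches InfluxDB's default shard group duration for an infinite retention policy; the function itself is a simplification for illustration:

```python
from datetime import datetime, timezone

# A shard group covers a fixed time window; a timestamp maps to exactly one bucket.
SHARD_GROUP_DURATION_S = 7 * 24 * 3600  # default 7-day window

def shard_group_id(ts: datetime) -> int:
    """Map a timestamp to its shard group bucket (illustrative)."""
    return int(ts.timestamp()) // SHARD_GROUP_DURATION_S

a = shard_group_id(datetime(2019, 8, 18, 0, 0, tzinfo=timezone.utc))
b = shard_group_id(datetime(2019, 8, 18, 0, 6, tzinfo=timezone.utc))
c = shard_group_id(datetime(2019, 9, 30, 0, 0, tzinfo=timezone.utc))
print(a == b, a == c)  # nearby points share a group; distant ones do not
```

Points close in time land in the same shard group, so a time-range query only has to touch the few groups overlapping the range, and expiring old data is as cheap as dropping whole groups.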

Example Use Case

A common use case of time series database is IoT event processing. A typical event processing architecture has the following components:

IoT_event_processing_architecture.jpg

IoT devices generate tons of data. It's not cost-effective to store the raw data for an extended period of time. For real-time analysis such as anomaly detection, we may want to work with the raw data, but for some offline analysis, aggregated data is good enough. For long-term storage, we can downsample the data. For example, instead of saving the data reported from devices every second, we can aggregate it and store only daily values.
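A minimal downsampling sketch (the readings are fabricated for illustration): group per-second readings by day and keep only the daily mean, shrinking up to 86,400 points per day down to one.

```python
from collections import defaultdict
from statistics import mean

# Raw per-second readings: (ISO timestamp, value)
readings = [
    ("2019-08-18T00:00:00Z", 23.0),
    ("2019-08-18T00:00:01Z", 25.0),
    ("2019-08-19T00:00:00Z", 30.0),
]

by_day = defaultdict(list)
for ts, value in readings:
    by_day[ts[:10]].append(value)  # group on the YYYY-MM-DD prefix

daily = {day: mean(values) for day, values in by_day.items()}
print(daily)  # {'2019-08-18': 24.0, '2019-08-19': 30.0}
```

In practice this kind of aggregation would run as a continuous/scheduled task in the database itself rather than in application code, but the transformation is the same.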

The architecture/workflow above is described in IoT Event Processing and Analytics with InfluxDB in Google Cloud, but it's not the only way to process event data. One problem with this design is that all other services depend on the time series database, which will soon become the bottleneck of the system. Note that, technically speaking, the time series database and the other services are all consumers of IoT data. In this type of scenario, a typical setup is to leverage a queueing system to decouple the producers and the consumers.
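The decoupling idea can be sketched with Python's standard `queue` module (a single consumer for brevity; a real setup would use a broker like Kafka or Pub/Sub, with the database writer and the anomaly detector as independent consumers):

```python
import queue
import threading

# The queue decouples the IoT producer from its consumers: the producer
# never waits on the database, and consumers drain at their own pace.
events = queue.Queue()
stored, alerts = [], []

def producer():
    for reading in [12, 95, 14]:   # fabricated sensor readings
        events.put(reading)
    events.put(None)               # sentinel: no more events

def consumer():
    while (reading := events.get()) is not None:
        stored.append(reading)     # simulated time-series DB write
        if reading > 90:
            alerts.append(reading) # simulated real-time anomaly detection

t1, t2 = threading.Thread(target=producer), threading.Thread(target=consumer)
t1.start(); t2.start()
t1.join(); t2.join()
print(stored, alerts)  # [12, 95, 14] [95]
```

With a real broker, each consumer group gets its own cursor into the event stream, so a slow time series database no longer stalls the anomaly detector or any other downstream service.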

----- END -----
