- InfluxDB Documentation - Key Concepts
- IoT Event Processing and Analytics with InfluxDB in Google Cloud
A time series database is special in many ways. It needs to support write-heavy workloads and cope with spikes in reads. Moreover, it needs to support time-series data. To some extent, we can view time-series data as a special key-value record with the timestamp being part of the key. The problem with this view is that it ignores an important aspect: the timestamps have meaning and they introduce relationships among data points. In most cases, a query in a time-series analysis involves some sort of range search, which in turn involves a scan of the database. To get good performance, we need to organize the data so that data points that are close to each other in time are also close to each other in storage. Achieving this locality property is one of the main challenges.
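As a rough illustration of why time locality matters, the sketch below keeps points sorted by timestamp so that a range query becomes two binary searches plus one contiguous slice. The sample data and the `range_query` helper are made up for illustration; this is not how InfluxDB lays out data internally.

```python
from bisect import bisect_left, bisect_right

# Hypothetical points as (unix_seconds, value), kept sorted by timestamp.
points = [(0, 1.0), (60, 1.5), (120, 0.9), (180, 2.1), (240, 1.7)]
timestamps = [t for t, _ in points]


def range_query(start, stop):
    """Return all points with start <= timestamp <= stop."""
    lo = bisect_left(timestamps, start)   # first index with timestamp >= start
    hi = bisect_right(timestamps, stop)   # first index with timestamp > stop
    # Neighbours in time are neighbours in the list, so the result is one slice.
    return points[lo:hi]


result = range_query(60, 180)
```

If the points were stored in arbitrary order, the same query would have to inspect every record.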
InfluxDB has the following key concepts:
- Bucket schema
- Series key
- Retention policy
- Shard group
Roughly speaking, the data presented in InfluxDB is similar to a "stacked" pandas DataFrame. The measurement corresponds to the name of the DataFrame; tags correspond to the index levels of the DataFrame; fields correspond to columns in the "unstacked" form.
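The stacked/unstacked analogy can be sketched in plain Python, without pandas, to keep the example dependency-free. The rows below are in the "stacked" form (one row per field value), and the helper pivots them into the "unstacked" form (one row per timestamp and tag set, one column per field). The `weather` data, the `city` tag and the `unstack` helper are all hypothetical.

```python
# "Stacked" rows for a hypothetical measurement named "weather":
# (time, city tag, field name, value).
stacked = [
    ("2022-01-01T00:00:00Z", "paris", "temp", 4.0),
    ("2022-01-01T00:00:00Z", "paris", "humidity", 81.0),
    ("2022-01-01T00:10:00Z", "paris", "temp", 4.5),
    ("2022-01-01T00:10:00Z", "paris", "humidity", 80.0),
]


def unstack(rows):
    """Pivot field names into columns, indexed by (time, tag)."""
    wide = {}
    for time, city, field, value in rows:
        index = (time, city)                        # timestamp + tags act as the index
        wide.setdefault(index, {})[field] = value   # each field becomes a column
    return wide


unstacked = unstack(stacked)
```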
In InfluxDB, tags are indexed and fields are not. The primary key consists of the timestamp and the tags. A known pitfall is high series cardinality, which arises when a tag takes a large variety of values. This is problematic because tags are indexed, and if the values vary a lot (or are unbounded), the index grows large and stops being useful. A simple rule of thumb is to store only categorical data in tags.
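To make the cardinality problem concrete, here is a minimal sketch that counts distinct series within one measurement for a categorical tag versus an unbounded one. The tag names (`region`, `request_id`) and the `series_cardinality` helper are assumptions for illustration only.

```python
def series_cardinality(points):
    """Count distinct (tag set, field) combinations, i.e. distinct series."""
    return len({(tuple(sorted(tags.items())), field) for tags, field in points})


# A categorical tag: 3000 points, but only three distinct tag values.
bounded = [({"region": r}, "temp") for r in ["us", "eu", "asia"] * 1000]

# An unbounded tag: a unique request id on every one of the 3000 points.
unbounded = [({"request_id": str(i)}, "temp") for i in range(3000)]

bounded_count = series_cardinality(bounded)
unbounded_count = series_cardinality(unbounded)
```

With the unbounded tag, the number of index entries grows with the number of points, which is exactly what the rule of thumb tries to avoid.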
Note: As we mentioned earlier, the primary key consists of the timestamp and the tags. Therefore, technically speaking, the timestamp is also a kind of tag. On the other hand, timestamp values vary a lot and are unbounded, which seems to contradict the rule of thumb. As we will see later, InfluxDB stores data in shard groups, which are organized by retention policy (e.g. duration). Dividing data into multiple shard groups essentially discretizes the time axis, which turns the timestamp values into categorical data.
Here is an example of the data represented in InfluxDB (copied from the official website).
location and scientist are tag names.
We haven't defined what a series means in InfluxDB. A series is a key-value record. The key consists of the measurement, the tag set, and a field. For example, key = (census, location = klamath, scientist = anderson, bees). The value is the collection of data points, which in our case is [(2019-08-18T00:00:00Z, 23), (2019-08-18T00:06:00Z, 28)].
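The definition of a series can be sketched as a grouping operation: points that share the same measurement, tag set and field end up under one key. The bees rows follow the example in the text; the second series (portland, mullen, ants) and the `group_by_series` helper are illustrative additions.

```python
from collections import defaultdict

# Points as (measurement, tags, field, time, value).
points = [
    ("census", {"location": "klamath", "scientist": "anderson"}, "bees", "2019-08-18T00:00:00Z", 23),
    ("census", {"location": "portland", "scientist": "mullen"}, "ants", "2019-08-18T00:00:00Z", 30),
    ("census", {"location": "klamath", "scientist": "anderson"}, "bees", "2019-08-18T00:06:00Z", 28),
    ("census", {"location": "portland", "scientist": "mullen"}, "ants", "2019-08-18T00:06:00Z", 32),
]


def group_by_series(points):
    """Map each series key to its list of (time, value) data points."""
    series = defaultdict(list)
    for measurement, tags, field, time, value in points:
        # Sort the tags so that the key does not depend on dictionary order.
        key = (measurement, tuple(sorted(tags.items())), field)
        series[key].append((time, value))
    return dict(series)


series = group_by_series(points)
bees_key = ("census", (("location", "klamath"), ("scientist", "anderson")), "bees")
```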
The schema of the measurement census is given in the table below. It's called Bucket Schema in InfluxDB:
The storage engine has the following components:
- Write Ahead Log (WAL)
- Cache
- Time-Structured Merge Tree (TSM)
- Time Series Index (TSI)
The first two components are common setups. TSM and TSI are special implementations in InfluxDB. The details of TSM and TSI are out of the scope of this post; here we just document a common sequence of operations. When InfluxDB receives a write request, it goes through the following steps:
- The write request is appended to the end of the WAL file.
- Data is written to disk using
- The in-memory cache is updated.
- When data is successfully written to disk, a response confirms the write request was successful.
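The write path above can be sketched with a toy engine. This is only a model of the sequence of steps, not InfluxDB's implementation: the `ToyEngine` class, the JSON-lines WAL format, and the use of `os.fsync` to force data to disk are all assumptions made for this example.

```python
import json
import os
import tempfile


class ToyEngine:
    """Toy write path: append to WAL, force to disk, update cache, acknowledge."""

    def __init__(self, wal_path):
        self.wal_path = wal_path
        self.cache = {}  # series key -> list of (timestamp, value)

    def write(self, series_key, timestamp, value):
        record = {"key": series_key, "ts": timestamp, "value": value}
        # 1. Append the write request to the end of the WAL file.
        with open(self.wal_path, "a", encoding="utf-8") as wal:
            wal.write(json.dumps(record) + "\n")
            # 2. Force the data to disk (os.fsync is this sketch's choice).
            wal.flush()
            os.fsync(wal.fileno())
        # 3. Update the in-memory cache so the point becomes queryable.
        self.cache.setdefault(series_key, []).append((timestamp, value))
        # 4. Confirm that the write succeeded.
        return True

    def recover(self):
        """Rebuild the cache by replaying the WAL, e.g. after a restart."""
        self.cache = {}
        with open(self.wal_path, encoding="utf-8") as wal:
            for line in wal:
                r = json.loads(line)
                self.cache.setdefault(r["key"], []).append((r["ts"], r["value"]))


wal_file = os.path.join(tempfile.mkdtemp(), "toy.wal")
engine = ToyEngine(wal_file)
ack = engine.write("census,bees", 0, 23)
```

Because the WAL is written before the cache, a restarted engine can call `recover()` to replay the log and end up with the same cache contents.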
When the storage engine handles a query, it uses a copy of the data (e.g. cached snapshots + TSM files). In this way, the storage engine can continue accepting writes while serving queries. The price we pay is the need to deal with eventual consistency, which is a reasonable cost in most scenarios.
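The trade-off can be illustrated with a tiny sketch in which a query reads from a snapshot taken before a later write arrives. The dictionary-based cache and the `snapshot` helper are hypothetical stand-ins, not InfluxDB's data structures.

```python
import copy

# Hypothetical in-memory cache: series key -> list of (timestamp, value).
cache = {"census,bees": [(0, 23)]}


def snapshot(cache):
    """Return an independent copy that a query can read without blocking writes."""
    return copy.deepcopy(cache)


snap = snapshot(cache)                    # a query starts and takes its snapshot
cache["census,bees"].append((360, 28))    # a write arrives while the query runs

query_result = snap["census,bees"]        # the query does not see the new point
```

The query result is slightly stale, but the write was never blocked; a later query with a fresh snapshot would see both points.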
InfluxDB Shards and Shard Groups
Key concepts about shards and shard groups:
- Retention Period: The duration of time that a bucket retains data. InfluxDB drops points with timestamps older than their bucket’s retention period. The minimum retention period is one hour.
- Bucket: A bucket is a named location where time series data is stored. All buckets have a retention period. A bucket belongs to an organization.
- Shard: A shard contains encoded and compressed data for a specific set of series. A shard consists of one or more TSM files on disk. All points in a series in a given shard group are stored in the same shard (TSM file) on disk. A shard belongs to a single shard group.
- Shard Group: Shard groups are logical containers for shards organized by bucket. Every bucket with data has at least one shard group. A shard group contains all shards with data for the time interval covered by the shard group. The interval spanned by each shard group is the shard group duration.
- Shard Group Duration: The duration of time or interval that each shard group covers. Set the shard-group-duration for each bucket.
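A minimal sketch of how these concepts fit together: a timestamp is mapped to the shard group window that covers it, and a shard group becomes eligible for removal once its whole window is older than the retention period. Aligning windows to the Unix epoch and expiring data one whole shard group at a time are assumptions of this sketch, as are the helper names.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def shard_group_window(ts, duration):
    """Return the [start, end) window of the shard group that covers ts."""
    index = (ts - EPOCH) // duration    # timedelta // timedelta gives an int
    start = EPOCH + index * duration
    return start, start + duration


def is_expired(window, now, retention):
    """True once the whole window is older than the retention period."""
    _, end = window
    return end <= now - retention


duration = timedelta(days=1)     # shard group duration
retention = timedelta(days=7)    # retention period of the bucket
ts = datetime(2019, 8, 18, 0, 6, tzinfo=timezone.utc)
window = shard_group_window(ts, duration)
```

All timestamps that fall into the same window map to the same shard group, which is the discretization of time mentioned in the note on tags.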
Example Use Case
A common use case of a time series database is IoT event processing. A typical event processing architecture has the following components:
- data collection
- data processing
- data archiving
IoT devices generate tons of data. It's not cost-effective to store the raw data for an extended period of time. For real-time analysis such as anomaly detection, we may want to deal with the raw data. But for some offline analysis, aggregated data is good enough. For long-term storage, we can downsample the data. For example, instead of saving the data reported from devices every second, we can aggregate the data and only store the daily data.
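Downsampling can be sketched as a group-by on the calendar day followed by an aggregation. The example below uses the mean as the aggregate and one synthetic reading per minute to keep it small; the `downsample_daily` helper and the data are hypothetical.

```python
from collections import defaultdict
from datetime import datetime, timezone
from statistics import mean


def downsample_daily(points):
    """Aggregate (unix_seconds, value) points into one mean value per UTC day."""
    by_day = defaultdict(list)
    for ts, value in points:
        day = datetime.fromtimestamp(ts, tz=timezone.utc).date()
        by_day[day].append(value)
    return {day: mean(values) for day, values in sorted(by_day.items())}


# Synthetic readings: one per minute for two days (20.0 on day one, 30.0 on day two).
raw = [(t, 20.0) for t in range(0, 86400, 60)]
raw += [(t, 30.0) for t in range(86400, 172800, 60)]

daily = downsample_daily(raw)
```

Here 2880 raw points shrink to two daily values, which is the kind of reduction that makes long-term storage affordable.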
The architecture/workflow above is described in IoT Event Processing and Analytics with InfluxDB in Google Cloud, but it's not the only way to process the event data. One of the problems of this design is that all other services depend on the time series database, which will soon become the bottleneck of the system. Note that, technically speaking, the time series database and the other services are all consumers of IoT data. In this type of scenario, a typical setup is to leverage a queueing system to decouple the producers and consumers.
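The decoupling idea can be sketched with a small in-process stand-in for a message queue. A real deployment would use an actual queueing system; the `Broker` class, the consumer names and the event fields below are all hypothetical.

```python
from collections import deque


class Broker:
    """Minimal in-process stand-in for a message queue with fan-out."""

    def __init__(self):
        self.queues = {}  # consumer name -> deque of pending events

    def subscribe(self, name):
        self.queues[name] = deque()

    def publish(self, event):
        # Producers only talk to the broker; each subscriber has its own queue.
        for pending in self.queues.values():
            pending.append(event)

    def poll(self, name):
        """Return the next pending event for a consumer, or None if caught up."""
        pending = self.queues[name]
        return pending.popleft() if pending else None


broker = Broker()
broker.subscribe("tsdb_writer")        # would write to the time series database
broker.subscribe("anomaly_detector")   # would run the real-time analysis

broker.publish({"device": "sensor-1", "ts": 0, "temp": 21.5})
broker.publish({"device": "sensor-1", "ts": 1, "temp": 21.7})

first_for_tsdb = broker.poll("tsdb_writer")
```

Each consumer drains its own queue at its own pace, so a slow time series database no longer holds back the other services.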