Data Collection Infrastructure
Here’s how I think of data collection for research purposes:
- Flat-file Data lake (e.g. S3/Filesystem)
- NoSQL Data lake (e.g. InfluxDB/MongoDB)
- Data warehouse (e.g. TimescaleDB/Clickhouse)
Depending on the type of data, it first lands in one of the three categories above, each of which has its own backup and post-processing priority.
The lake/warehouse split means “I don’t care what the data looks like yet” (unstructured/lake) vs. “I know how I want to use the data” (structured/warehouse).
Data Lake - Flat files
When in doubt: use a compressed flat file (e.g. csv or json, bzip’d) for reduced failure surface area — even for tick data archival in the beginning. I prioritize reducing failure surface area over performance early on because the amount of data we’re usually talking about is <2TB, which can be batch converted to another format down the line (parquet, compressed sqlite, whatever).
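As a minimal sketch of the “compressed flat file first” approach — file name and tick fields here are hypothetical, not a fixed schema — appending bzip’d JSON lines from the standard library is enough:

```python
import bz2
import json

# Hypothetical tick records; the field names are illustrative.
ticks = [
    {"ts": "2024-01-01T00:00:00Z", "symbol": "BTC-USD", "price": 64123.5, "size": 0.02},
    {"ts": "2024-01-01T00:00:01Z", "symbol": "BTC-USD", "price": 64124.0, "size": 0.10},
]

path = "ticks-2024-01-01.jsonl.bz2"  # one file per day keeps restores simple

# Append-mode bz2 keeps the failure surface small: if the collector dies,
# everything flushed so far is still a valid, decompressable archive.
with bz2.open(path, "at", encoding="utf-8") as f:
    for tick in ticks:
        f.write(json.dumps(tick) + "\n")

# Reading it back is just as simple -- no database to stand up.
with bz2.open(path, "rt", encoding="utf-8") as f:
    restored = [json.loads(line) for line in f]
```

There’s nothing to restore but a file: `cp` to S3 is the whole backup story, and batch conversion later is a read loop like the last one.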
Different types of data land in different places first, and that matters because your backup solution doesn’t need to be equally robust for every type of data.
Flat file data lake:
Flat file data lake is to capture unstructured ephemeral data (e.g. you reverse engineered an endpoint and want to archive this data).
- Solution: save to a flat file (csv, json) and then compress it
- Backup to S3 or a lower cost storage box.
Once I figure out how to use the flat file data, I either write another consumer into the data lake or data warehouse, or just download from S3 and work on it directly.
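When the access pattern becomes clear, the batch conversion can be a script this small — a sketch using only the standard library, with hypothetical file names and fields (a real run would pull the archive down from S3 first):

```python
import bz2
import json
import sqlite3

src = "ticks-2024-01-01.jsonl.bz2"  # hypothetical archive from the flat-file lake
dst = "ticks.db"

# Write a tiny archive so the sketch is self-contained.
with bz2.open(src, "wt", encoding="utf-8") as f:
    f.write(json.dumps({"ts": "2024-01-01T00:00:00Z", "symbol": "BTC-USD", "price": 64123.5}) + "\n")
    f.write(json.dumps({"ts": "2024-01-01T00:00:01Z", "symbol": "BTC-USD", "price": 64124.0}) + "\n")

# Batch-convert the archive into a queryable format (here: sqlite).
conn = sqlite3.connect(dst)
conn.execute("CREATE TABLE IF NOT EXISTS ticks (ts TEXT, symbol TEXT, price REAL)")
with bz2.open(src, "rt", encoding="utf-8") as f:
    rows = ((r["ts"], r["symbol"], r["price"]) for r in map(json.loads, f))
    conn.executemany("INSERT INTO ticks VALUES (?, ?, ?)", rows)
conn.commit()

count = conn.execute("SELECT COUNT(*) FROM ticks").fetchone()[0]
```

Swapping sqlite for parquet or a warehouse insert changes only the last few lines; the archive format stays the same.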
Data Lake - NoSQL
NoSQL Data lake:
Captures data to monitor often, but not sure how to use it yet:
- Solutions are InfluxDB and MongoDB. I prefer InfluxDB so the data is available in Grafana, where I can stare at it and figure out how to use it
- Backup into a compressed flat-file and push to S3
Once I figure out how to use this unstructured data, I post-process the data from data lake into the data warehouse, or just pull from data lake directly.
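As a sketch of the capture step — the official influxdb-client library does this for you, but the wire format is simple enough to show directly, and the measurement/tag/field names here are hypothetical — points go in as InfluxDB line protocol, and the same lines can be dumped to a compressed flat file for the S3 backup:

```python
def to_line_protocol(measurement, tags, fields, ts_ns):
    """Format one point as InfluxDB line protocol:
    measurement,tag=val field=val,field=val timestamp_ns"""
    tag_str = ",".join(f"{k}={v}" for k, v in sorted(tags.items()))
    field_str = ",".join(
        f'{k}="{v}"' if isinstance(v, str) else f"{k}={v}"
        for k, v in fields.items()
    )
    return f"{measurement},{tag_str} {field_str} {ts_ns}"

# Hypothetical quote point; nanosecond timestamp as InfluxDB expects.
line = to_line_protocol(
    "quotes",
    {"symbol": "BTC-USD"},
    {"bid": 64123.5, "ask": 64124.0},
    1704067200000000000,
)
```

Because the backup is just these text lines, restoring the lake after a failure is a replay of a compressed flat file — the same low-failure-surface story as before.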
Data Warehouse
Data warehouse:
Captures data you access often and that benefits from structure. This is where you can start thinking about custom APIs.
- Solutions are TimescaleDB/Clickhouse. I prefer TimescaleDB because it sits on Postgres, which makes it easy to compare against multiple SQL databases
- Backup to S3 with pgbackrest.
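To make “benefit from structure” concrete, here is a hypothetical TimescaleDB schema for tick data — DDL strings only, with illustrative names; you’d execute them with psql or psycopg2 against your warehouse:

```python
# Hypothetical tick table for a TimescaleDB warehouse. Column names and
# types are illustrative, not a recommendation for your exact data.
create_table = """
CREATE TABLE IF NOT EXISTS ticks (
    ts      TIMESTAMPTZ NOT NULL,
    symbol  TEXT        NOT NULL,
    price   DOUBLE PRECISION,
    size    DOUBLE PRECISION
);
"""

# TimescaleDB then partitions the table into time chunks behind the scenes.
create_hypertable = "SELECT create_hypertable('ticks', 'ts', if_not_exists => TRUE);"
```

Once the schema exists, pgbackrest backs up the whole Postgres instance, so the warehouse’s backup story is handled at the database level rather than per-table.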
There are a lot of competing technologies for each of the three categories, but they all fit into this framework. What’s more important is to reduce the failure surface area of your solution and to think through your backup and restore story for failures.
Focus on reducing failure surface area and thinking through the backup solutions; reformat later:
- Data lake (flat file): compressed flat files, backup to S3
- Data lake (NoSQL): InfluxDB/MongoDB, backup to S3 or warehouse
- Data warehouse: TimescaleDB/Clickhouse, backup to S3