Checklist
- What are the functional requirements?
- What is the data model?
- What are the non-functional requirements?
- What are the functional components/services in the system?
- How is data stored?
- How do we back up important data?
- Do we need reconciliation processes?
- What is the consistency model?
- Is there an existing reference design to follow, or is the problem we are trying to solve completely new?
- Is the system read-heavy or write-heavy?
- What is the scaling strategy? How do we scale the system?
- Is the system naturally partitioned? How do we shard data?
- Does the system have bottlenecks or hotspots?
- How do we monitor the system? Do we need an alarm system?
- What are the key metrics that characterize the system's runtime behavior?
- How do we store log files? How can we query log files?
- Can we implement distributed tracing? What is the trace context?
- What is the recovery process? How do we recover from a disaster?
- What is the deployment strategy? How do we deploy our code?
- How do we set up a testing environment?
- How do we document the system behavior?
- How will the system evolve?
- Do we anticipate any migration needs in the near future?
Common Techniques
- Make functional components stateless whenever possible. Stateless components are easy to scale.
- Make processes idempotent.
- Make the data flowing in the system immutable.
- Always validate the input data.
- Taking snapshots of the system can be helpful.
- Most of the time, having deduplication logic at the entry point is helpful.
- Use replicas to improve read performance.
- Use sharding or partitioning to improve write performance.
- Use a message queue to decouple components.
- Use a cache to improve performance. Avoid repeating the same computation.
- Evaluate if an approximate solution is acceptable.
- Always have a translation layer to hide technical complexity.
- Logs should be easy to parse.
- All operations should have timeout handling.
- Always assume network delay is arbitrary.
- Always assume we cannot tell the difference between a long delay and a disconnection event.
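
Several of the techniques above work together: input validation, dedupe at the entry point, idempotent processing, and timeout handling. The sketch below is one possible illustration in Python, not a definitive implementation. The names `DedupingProcessor`, `deposit`, and `call_with_timeout`, the caller-supplied request ID, and the in-memory dictionary are all assumptions made for this example; a real system would likely keep seen request IDs in durable storage.

```python
import concurrent.futures
import time


class DedupingProcessor:
    """Applies each request at most once, keyed by a caller-supplied request ID."""

    def __init__(self):
        # Hypothetical in-memory store: request_id -> result of the first attempt.
        self._results = {}
        self.balance = 0

    def deposit(self, request_id, amount):
        # Always validate the input before touching any state.
        if not isinstance(amount, int) or amount <= 0:
            raise ValueError("amount must be a positive integer")
        # Dedupe at the entry point: a retried request returns the stored
        # result instead of being applied again, so the call is idempotent.
        if request_id in self._results:
            return self._results[request_id]
        self.balance += amount
        self._results[request_id] = self.balance
        return self.balance


def call_with_timeout(fn, timeout_seconds, *args):
    """Run fn(*args) in a worker thread and stop waiting after timeout_seconds.

    Raises concurrent.futures.TimeoutError when the deadline passes. The worker
    is not cancelled: the caller cannot tell whether the call is slow or lost,
    which is why a retry must be safe to apply twice.
    """
    pool = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    try:
        future = pool.submit(fn, *args)
        return future.result(timeout=timeout_seconds)
    finally:
        pool.shutdown(wait=False)


if __name__ == "__main__":
    processor = DedupingProcessor()
    processor.deposit("req-1", 100)
    processor.deposit("req-1", 100)  # retry of the same request, applied once
    print(processor.balance)  # 100
```

Because a timed-out caller cannot know whether the first attempt took effect, it retries with the same request ID, and the dedupe check makes that retry harmless.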