Chapter 8

Truth knowledge and Lies, after this, you should know which design does not work fundamentally

Single computer system, result is deterministic, hardware problem - > blue screen, this is a good choice(deliberate), we prefer to a computer to crash completely rather than returning a wrong result

no absolute reliable fault detector
widely acceptable indicator is timeout, could not tell if it's node failure or network failure, retry few times when possible
a request could be called several times and results in disastrous outcomes. Idempotent. https://www.i3geek.com/archives/841 call 多次，返回相同结果
如何让一个system 变得idempotent？设计API的时候
network partitions: 写到A，读B
without testing, the network failure could break the fundamental assumptions
network partition naturally happens, we can only resolve it when it happens, we can not prevent it from happening. CAP theory -> CP or AP (consistency, availability and partition)

traditionally telephones lines are almost guaranteed no delays, focus on latency
network use patterns are burst mode, system design is focusing on the maximum throughput / resource utilization
similar thing happens with retail time OS

has this request time out yet?
what's the 99th percentile response time of this service? https://engineering.linkedin.com/performance/who-moved-my-99th-percentile-latency
how many queries per second did this service handle on average in the last five minutes?
how long did the user spend on our site?
when was this article published?
at what date and time should the reminder email be sent?
when does this cache entry expire?
what is the timestamp on this error message in the log file?

Most commonly used mechanism is the Network Time Protocol (NTP) Individual NTP is not reliable.

time-of-day. It returns the current date and time according to some calendar. Java System.currentTimeMillis() and Linux: clock_gettime(CLOCK_REALTIME)
1. highly unreliable, could jump forward, backward.
2. in google spanner, time drift assumption is 30 secs per 24 hrs
. monotonic clock is suitable for measuring a duration (time interval).
1. clock_gettime(CLOCK_MONOTONIC) on linux and System.nanoTime() in Java
2. duration could be measured relatively accurate

most of the components in distributed system provides accurate deterministic result of crash. Not clock

后来的request的time clock比其他的慢，request会被覆盖
distributed id generator (UUID) -> very difficult, mac id + machine local time, different machine has different time clock

Java virtual machine has a garbage collector which occasionally needs to stop all running threads. "Stop-the-world" GC pauses have sometimes been know to last for several minutes.
When the OS context-switches to another thread, or when the hypervisor switches to a different virtual machine, the current running thread can be paused at any arbitrary point in the code.
if the application performs synchronous disk access, a thread may be paused waiting for a slot disk I/O operation to complete, Java class loader lazy load first time.
Thrashing: swapping to disk (paging)
real time OS: every library call has a worst time guaranteed.