Design Twitter Search
Last updated
Last updated
Now let us built the most complicated Index Server. Since our tweets consist of words, we will build our index across words. So indexes in most layman terms are nothing just distributed hash table where the key is the word, and the set of tweet id that contains the words are the values.
We can use either of two technique to shard our data :
Shard based on words: We first calculate the hash based on word and then find the server based on the hash.
Problems with this approach :
There might be too much load on the single server when a search term becomes hot.
No uniform distribution of load
2. Shard based on tweet id: We calculate the hash based on tweet id and find the server based on this hash. All the servers maintain the index for all the words . while fetching results, the result is first fetched from each server then compiled together on a central server and returned to the user. -> avoid hot partitions, slow search
two level shard, first shard by word