Sunday, September 29, 2013

System Design for Big Data [tinyurl]

What is tinyurl?

tinyurl is a URL service that users enter a long URL and then the service return a shorter and unique url such as "http://tiny.me/5ie0V2". The highlight part can be any string with 6 letters containing [0-9, a-z, A-Z]. That is, 62^6 ~= 56.8 billions unique strings.

How it works?

On Single Machine
Suppose we have a database which contains three columns: id (auto increment), actual url, and shorten url.

Intuitively, we can design a hash function that maps the actual url to shorten url. But string to string mapping is not easy to compute.

Notice that in the database, each record has a unique id associated with it. What if we convert the id to a shorten url?
Basically, we need a Bijective function f(x) = y such that
  • Each x must be associated with one and only one y;
  • Each y must be associated with one and only one x.
In our case, the set of x's are integers while the set of y's are 6-letter-long strings. Actually, each 6-letter-long string can be considered as a number too, a 62-base numeric, if we map each distinct character to a number,
e.g. 0-0, ..., 9-9, 10-a, 11-b, ..., 35-z, 36-A, ..., 61-Z.
Then, the problem becomes Base Conversion problem which is bijection (if not overflowed :).
 public String shorturl(int id, int base, HashMap map) {
  StringBuilder res = new StringBuilder();
  while (id > 0) {
    int digit = id % base;
    res.append(map.get(digit));
    id /= base;
  }
  while (res.length() < 6)  res.append('0');
  return res.reverse().toString();
}
For each input long url, the corresponding id is auto generated (in O(1) time). The base conversion algorithm runs in O(k) time where k is the number of digits (i.e. k=6).

On Multiple Machine
Suppose the service gets more and more traffic and thus we need to distributed data onto multiple servers.

We can use Distributed Database. But maintenance for such a db would be much more complicated (replicate data across servers, sync among servers to get a unique id, etc.).

Alternatively, we can use Distributed Key-Value Datastore.
Some distributed datastore (e.g. Amazon's Dynamo) uses Consistent Hashing to hash servers and inputs into integers and locate the corresponding server using the hash value of the input. We can apply base conversion algorithm on the hash value of the input.

The basic process can be:
Insert
  1. Hash an input long url into a single integer;
  2. Locate a server on the ring and store the key--longUrl on the server;
  3. Compute the shorten url using base conversion (from 10-base to 62-base) and return it to the user.
Retrieve
  1. Convert the shorten url back to the key using base conversion (from 62-base to 10-base);
  2. Locate the server containing that key and return the longUrl.
---------

Further Readings

25 comments:

  1. This comment has been removed by the author.

    ReplyDelete
  2. Thanks! Your blog is really exhaustive!

    ReplyDelete
  3. This comment has been removed by the author.

    ReplyDelete
  4. On single machine, every time you want to insert a new longURL and get a id, how do you de-duplicate longURL? My understanding we can rely on database to do it. Something like SQL Unique key word. If so, this is not good.

    ReplyDelete
    Replies
    1. First of all, since you have lots of ids to give, it's not necessary to dedup. As a matter of fact, try http://tiny.cc, with different urls, it will return the same.
      Second, it's a good way to save though. So just keep another in memory hashmap to dedup. That could be a distributed hashmap as well, if it grows large, just build a LRU type of thing to serialize it on disk

      Delete
  5. This is such a good blog. Thank you!

    ReplyDelete
  6. Will the auto-generated id be duplicated on multiple machine case?

    ReplyDelete
    Replies
    1. It might depends on different implementation of DBs. Maybe you could just use a GUID/UUID to generate on your own

      Delete
    2. What has function would you use in distributed case? If it's MD5, then the integer could be so big that the mapped url is still too long?

      Delete
  7. if you use hash on long-url, you will get the id directly without a db. In multi machine environment, it is compatible with consistent hashing.

    ReplyDelete
  8. Thanks for the post. in practice, I would use some lib to generate UUID and use this UUID as key and store value into redis/memcached.

    ReplyDelete
    Replies
    1. good thinking! however, UUID is 128 bit, 32 digit in HEX, your tiny url might not be that tiny then, right?
      How can you solve that?

      Delete
  9. For this part
    -------------
    Insert
    Hash an input long url into a single integer;
    ---------------------
    How to deal with the hash collision of long urls?
    I mean, two long urls have the same hashcode.

    ReplyDelete
    Replies
    1. That shouldn't affect it. Here it only uses this hashcode to find its partition in the consistent hash ring. If it collides, so be it.

      Delete
    2. Yes, this solution may not affect the ability to locate the right server on the ring. But how would you store the two key-value pairs with the same key mapping to different long urls in the database? When you retrieve based on the key, which long url would be returned?

      Delete
    3. We can use the separate chaining approach in hashing to avoid collisions.

      Delete
  10. Thanks for your post, but I dont quite understand the step 3 on multiple machine.
    -->Compute the shorten url using base conversion (from 10-base to 62-base) and return it to the user
    I have two questions:
    1. Which hash function is used to get this integer?
    2. Does this hash function really maps(narrows) the spatial long-url space to short-url space?

    In my view, the key of tinyurl lies in choosing the hash function carefully so as to maps spatial long-url space to short-url space. On single machine, id is a great mapping, on multiple machines, similar idea are in need as well.

    ReplyDelete
  11. For distributed key-value datastore, there's no such unique id for each record. Then how to handle it?

    ReplyDelete
  12. Post is very informative,It helped me with great information so I really believe you will do much better in the future.
    free short url

    ReplyDelete
  13. I am really very agree with your qualities it is very helpful for look like home. Thanks so much for info and keep it up.
    url shortner

    ReplyDelete
  14. I dont understand why CDN would not be talked about for static content like urls

    ReplyDelete