N00tc0d3r: System Design for Big Data [tinyurl]

Sunday, September 29, 2013

System Design for Big Data [tinyurl]

What is tinyurl?

tinyurl is a URL-shortening service: a user enters a long URL and the service returns a shorter, unique URL such as "http://tiny.me/5ie0V2". The final path segment ("5ie0V2" here) can be any 6-character string over [0-9, a-z, A-Z]. That gives 62^6 ~= 56.8 billion unique strings.
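
As a quick check of that count, here is a minimal sketch. The class and method names are illustrative only and do not come from the original post.

```java
// Illustrative helper: counts the strings of a given length over an alphabet.
final class ShortUrlCapacity {
    // Returns alphabetSize^length using exact long arithmetic.
    static long capacity(int alphabetSize, int length) {
        long total = 1;
        for (int i = 0; i < length; i++) {
            total = Math.multiplyExact(total, alphabetSize); // throws on overflow
        }
        return total;
    }

    public static void main(String[] args) {
        // 10 digits + 26 lowercase + 26 uppercase = 62 symbols, 6 positions.
        System.out.println(capacity(62, 6)); // 56800235584
    }
}
```

Note that the result does not fit in a 32-bit int, which matters for the id type used later.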

How does it work?

On a Single Machine
Suppose we have a database table with three columns: id (auto-increment), actual URL, and shortened URL.

Intuitively, we could design a hash function that maps the actual URL directly to the shortened URL. But a string-to-string mapping that never collides is not easy to construct.

Notice that in the database, each record has a unique id associated with it. What if we convert the id to a shortened URL?
Basically, we need a bijective function f(x) = y such that

Each x must be associated with one and only one y;
Each y must be associated with one and only one x.

In our case, the x's are integers and the y's are 6-character strings. Each 6-character string can itself be read as a number, written in base 62, if we map each distinct character to a digit value,

e.g. 0-0, ..., 9-9, 10-a, 11-b, ..., 35-z, 36-A, ..., 61-Z.

Then the problem becomes a base conversion problem, which is a bijection (as long as the id does not overflow 6 digits, i.e. id < 62^6 :).

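import java.util.Map;

// Converts a non-negative id to a 6-character string in the given base.
// map holds digit value -> character, e.g. 0 -> '0', 10 -> 'a', 61 -> 'Z'.
// id is a long because 62^6 does not fit in an int.
public String shorturl(long id, int base, Map<Integer, Character> map) {
  StringBuilder res = new StringBuilder();
  // Emit digits least-significant first.
  while (id > 0) {
    int digit = (int) (id % base);
    res.append(map.get(digit));
    id /= base;
  }
  // Pad with the zero digit; after reverse() these become leading zeros.
  while (res.length() < 6)  res.append(map.get(0));
  return res.reverse().toString();
}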
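
To make the bijection concrete, the sketch below builds the digit map in the order given above (0-9, a-z, A-Z), repeats the same conversion with the base fixed to 62, and adds the inverse direction. The class name `Base62RoundTrip` and its method names are assumptions made for this example.

```java
import java.util.HashMap;

// Illustrative sketch; the class and method names are assumptions.
final class Base62RoundTrip {
    // Digit order follows the post: 0-9, then a-z, then A-Z.
    static final String ALPHABET =
        "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ";

    // Builds the digit -> character map that shorturl(id, base, map) expects.
    static HashMap<Integer, Character> digitMap() {
        HashMap<Integer, Character> map = new HashMap<>();
        for (int i = 0; i < ALPHABET.length(); i++) {
            map.put(i, ALPHABET.charAt(i));
        }
        return map;
    }

    // Same conversion as shorturl above, with the base fixed to 62.
    static String encode(long id) {
        HashMap<Integer, Character> map = digitMap();
        StringBuilder res = new StringBuilder();
        while (id > 0) {
            res.append(map.get((int) (id % 62)));
            id /= 62;
        }
        while (res.length() < 6) res.append(map.get(0));
        return res.reverse().toString();
    }

    // Inverse direction: reads the string as a base-62 number.
    static long decode(String shortUrl) {
        long id = 0;
        for (char c : shortUrl.toCharArray()) {
            int digit = ALPHABET.indexOf(c);
            if (digit < 0) throw new IllegalArgumentException("bad character: " + c);
            id = id * 62 + digit;
        }
        return id;
    }

    public static void main(String[] args) {
        String s = encode(125);        // 125 = 2 * 62 + 1
        System.out.println(s);         // 000021
        System.out.println(decode(s)); // 125
    }
}
```

Every id in [0, 62^6) maps to exactly one 6-character string and back, which is the property the two conditions on f require.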

For each input long URL, the corresponding id is auto-generated (in O(1) time). The base conversion algorithm runs in O(k) time, where k is the number of digits (here k = 6).
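
The single-machine flow can be sketched with an in-memory list standing in for the database table. This is only an illustration: `SingleMachineShortener`, `insert`, and `lookup` are hypothetical names, and a real deployment would rely on the database's auto-increment column rather than a list index.

```java
import java.util.ArrayList;
import java.util.List;

// In-memory stand-in for the three-column table; names are illustrative.
final class SingleMachineShortener {
    private static final String ALPHABET =
        "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ";

    // Row i holds the long URL whose auto-increment id is i + 1.
    private final List<String> rows = new ArrayList<>();

    // Inserts a long URL and returns its 6-character code.
    String insert(String longUrl) {
        rows.add(longUrl);
        long id = rows.size(); // plays the role of the auto-increment id
        return encode(id);
    }

    // Looks up the long URL for a code, or null if that id was never issued.
    String lookup(String code) {
        long id = decode(code);
        if (id < 1 || id > rows.size()) return null;
        return rows.get((int) (id - 1));
    }

    // Base-10 to base-62, always 6 characters (assumes 0 <= id < 62^6).
    private static String encode(long id) {
        char[] out = new char[6];
        for (int i = 5; i >= 0; i--) {
            out[i] = ALPHABET.charAt((int) (id % 62));
            id /= 62;
        }
        return new String(out);
    }

    // Base-62 back to base-10.
    private static long decode(String code) {
        long id = 0;
        for (char c : code.toCharArray()) {
            int digit = ALPHABET.indexOf(c);
            if (digit < 0) throw new IllegalArgumentException("bad character: " + c);
            id = id * 62 + digit;
        }
        return id;
    }

    public static void main(String[] args) {
        SingleMachineShortener s = new SingleMachineShortener();
        String code = s.insert("http://example.com/a/very/long/path");
        System.out.println(code);           // 000001
        System.out.println(s.lookup(code)); // http://example.com/a/very/long/path
    }
}
```

This sketch does not de-duplicate: inserting the same long URL twice yields two different codes.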

On Multiple Machines
Suppose the service gets more and more traffic, so we need to distribute the data across multiple servers.

We can use a distributed database. But maintaining such a database is much more complicated (replicating data across servers, synchronizing the servers to generate unique ids, etc.).

Alternatively, we can use a distributed key-value datastore.
Some distributed datastores (e.g. Amazon's Dynamo) use consistent hashing: servers and inputs are both hashed to integers, and the server responsible for an input is located from the input's hash value. We can apply the base conversion algorithm to the hash value of the input.
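
The ring lookup can be sketched with a sorted map. This is a minimal illustration of the idea, not Dynamo's implementation: `HashRing` and its methods are hypothetical names, and the caller supplies each server's ring position directly, whereas a real system would derive it by hashing the server's name.

```java
import java.util.Map;
import java.util.TreeMap;

// Minimal consistent-hashing ring; an illustration, not Dynamo's implementation.
final class HashRing {
    // Ring position -> server name, kept sorted by position.
    private final TreeMap<Long, String> ring = new TreeMap<>();

    // Places a server on the ring at a caller-chosen position.
    void addServer(long position, String server) {
        ring.put(position, server);
    }

    // Returns the first server at or after the key, wrapping to the start.
    String serverFor(long key) {
        if (ring.isEmpty()) throw new IllegalStateException("no servers");
        Map.Entry<Long, String> e = ring.ceilingEntry(key);
        return (e != null ? e : ring.firstEntry()).getValue();
    }

    public static void main(String[] args) {
        HashRing r = new HashRing();
        r.addServer(100, "A");
        r.addServer(200, "B");
        r.addServer(300, "C");
        System.out.println(r.serverFor(150)); // B
        System.out.println(r.serverFor(301)); // A (wraps around the ring)
    }
}
```

Adding or removing one server only changes the owner of the keys between that server and its predecessor on the ring, which is why this scheme suits a growing cluster.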

The basic process can be:
Insert

Hash an input long URL into a single integer (the key);
Locate a server on the ring and store the key-longUrl pair on that server;
Compute the shortened URL from the key using base conversion (from base 10 to base 62) and return it to the user.

Retrieve

Convert the shortened URL back to the key using base conversion (from base 62 to base 10);
Locate the server containing that key and return the longUrl.
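
The insert and retrieve steps above can be sketched end to end as follows. Several parts are assumptions made for illustration: the post does not name a hash function, so this uses `String.hashCode()` with a multiplicative mix reduced modulo 62^6 as a placeholder; servers are plain in-memory maps spread evenly over the ring; and a hash collision simply raises an error, since the post does not say how collisions should be handled.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

// End-to-end sketch of the insert/retrieve steps; all names are illustrative.
final class DistributedShortener {
    private static final String ALPHABET =
        "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ";
    private static final long KEY_SPACE = 56_800_235_584L; // 62^6

    // Ring position -> that server's local key/value store.
    private final TreeMap<Long, Map<Long, String>> ring = new TreeMap<>();

    DistributedShortener(int servers) {
        if (servers < 1) throw new IllegalArgumentException("need at least one server");
        // Assumption: servers are spread evenly instead of being hashed onto the ring.
        for (int i = 0; i < servers; i++) {
            ring.put(KEY_SPACE / servers * i, new HashMap<>());
        }
    }

    // Placeholder hash (not from the post): mixes hashCode() and reduces it to
    // [0, 62^6) so that the key always fits in 6 base-62 digits.
    static long key(String longUrl) {
        long mixed = longUrl.hashCode() * 0x9E3779B97F4A7C15L;
        return Math.floorMod(mixed, KEY_SPACE);
    }

    // First server at or after the key, wrapping around the ring.
    private Map<Long, String> serverFor(long key) {
        Map.Entry<Long, Map<Long, String>> e = ring.ceilingEntry(key);
        return (e != null ? e : ring.firstEntry()).getValue();
    }

    // Insert: hash, locate the server, store key -> longUrl, return the code.
    String insert(String longUrl) {
        long key = key(longUrl);
        String previous = serverFor(key).putIfAbsent(key, longUrl);
        if (previous != null && !previous.equals(longUrl)) {
            // Collision handling is not specified in the post; this sketch just fails.
            throw new IllegalStateException("hash collision for key " + key);
        }
        return encode(key);
    }

    // Retrieve: decode the code to the key, locate the server, read the value.
    String retrieve(String code) {
        long key = decode(code);
        return serverFor(key).get(key);
    }

    // Base-10 to base-62, always 6 characters.
    private static String encode(long key) {
        char[] out = new char[6];
        for (int i = 5; i >= 0; i--) {
            out[i] = ALPHABET.charAt((int) (key % 62));
            key /= 62;
        }
        return new String(out);
    }

    // Base-62 back to base-10.
    private static long decode(String code) {
        long key = 0;
        for (char c : code.toCharArray()) {
            int digit = ALPHABET.indexOf(c);
            if (digit < 0) throw new IllegalArgumentException("bad character: " + c);
            key = key * 62 + digit;
        }
        return key;
    }
}
```

Because the key is derived from the URL itself, inserting the same long URL twice returns the same code. The trade-off is that two different URLs can hash to the same key, so a real service would need an explicit collision strategy.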

---------

25 comments:

Unknown, January 22, 2014 at 1:05 PM
Very good post. Thanks a lot.

Whizkid, February 4, 2014 at 9:50 AM
Thanks! Your blog is really exhaustive!

Anonymous, March 12, 2014 at 11:06 AM
On a single machine, every time you insert a new longURL and get an id, how do you de-duplicate the longURL? My understanding is that we can rely on the database to do it, with something like the SQL UNIQUE keyword. If so, this is not good.

Anonymous, May 11, 2014 at 3:22 PM
This is such a good blog. Thank you!

Unknown, May 27, 2014 at 1:56 AM
Will the auto-generated id be duplicated in the multiple-machine case?

Anonymous, August 8, 2014 at 11:44 AM
If you hash the long URL, you get the id directly without a database. In a multi-machine environment, this is compatible with consistent hashing.

Bo Duan, December 9, 2014 at 4:08 PM
Thanks for this article, it's very good.

Unknown, February 1, 2015 at 10:32 PM
Thanks for the post. In practice, I would use some library to generate a UUID, use this UUID as the key, and store the value in redis/memcached.

Unknown, March 19, 2015 at 11:55 PM
For this part:
-------------
Insert
Hash an input long url into a single integer;
---------------------
How do you deal with hash collisions between long URLs? I mean, two long URLs that have the same hash code.

Nitin Gupta, April 19, 2015 at 10:57 AM
Very nice blog.

Kamel, October 20, 2015 at 6:31 AM
Thanks for your post, but I don't quite understand step 3 on multiple machines.
-->Compute the shorten url using base conversion (from 10-base to 62-base) and return it to the user
I have two questions:
1. Which hash function is used to get this integer?
2. Does this hash function really map (narrow) the long-URL space to the short-URL space?

In my view, the key to tinyurl lies in choosing the hash function carefully so that it maps the long-URL space to the short-URL space. On a single machine, the id is a great mapping; on multiple machines, a similar idea is needed as well.

Anonymous, November 1, 2015 at 10:20 PM
For a distributed key-value datastore, there is no such unique id for each record. How do we handle that?

Shivam Kumar, November 29, 2015 at 2:03 AM
Nice article...

Jhon Marshal, December 8, 2015 at 11:39 PM
Post is very informative. It helped me with great information, so I really believe you will do much better in the future.

Anonymous, January 28, 2016 at 1:25 AM
I am really very agree with your qualities it is very helpful for look like home. Thanks so much for info and keep it up.

Anonymous, August 25, 2021 at 3:12 PM
I don't understand why a CDN is not discussed for static content like URLs.
