Information Theory part 9: What is a bit?

Consider the following: Alice and Bob have figured out how to transmit messages between their treehouses. At first, they used flames at night and shutters during the day. Then they used a wire, which they plucked in different ways. Eventually, they electrified this wire to send electrical pulses, and were now at work on an experimental wireless method. The problem was that in order to pay for their equipment, they needed money. So they decided to offer their service, for a fee, to others. And on the first day, Alice had three new customers who wanted to transmit messages to their friends over at Bob’s treehouse. The first customer wanted to send a list of 10 coin flips. The second customer wanted to send a 6-letter word. And the third customer wanted to send a poker hand.

The question now is, “How much should she charge?” Well, the price of a message should depend on how long it takes Alice to transmit it. But how could she measure the length of different types of messages using a common unit? To find out, let’s play a game. Imagine you are Bob. You know Alice wants to send you these messages, but all you can do is get the answer to yes-or-no questions you’ve arranged. Alice will answer by sending a sequence of zeros and ones using some method of variation. Recall that all of their methods of transmission involve the exchange of differences. So a 1 could be represented by an open flame, or an open shutter, or an electrical pulse. No matter how they are manifested, we can simply call them ‘binary digits’ — because a binary digit can have only one of two values: 0 or 1. So let’s say 0 represents a ‘no,’ and 1 represents a ‘yes.’ Your challenge, now, is to always ask the minimum number of questions in order to determine the exact message.

First, let’s consider the coin flips. For each symbol, the sender, Alice, can be thought of as selecting one of two different symbols: ‘heads’ or ‘tails.’ Now how many questions do you need to ask to determine which she selected? One question — such as, “Is it heads?” — will suffice. For 10 flips, what is the minimum number of questions? Well, 10 flips times one question per flip equals 10 questions — or 10 binary digits to transmit this message.

Next, let’s consider the letters. For each symbol, the sender, Alice, can be thought of as selecting 1 of 26 different symbols. Let’s start with the simplest message, which is 1 letter. How many questions are needed? Is it A? Is it B? Is it C? Is it D? And so on. But that is not the minimum number of questions. The best you can do is ask questions which eliminate half of the possibilities. For example, the middle of the alphabet is between M and N, so we could first ask, “Is it less than N?” If we receive a 1 — yes — we cut out half of the possibilities, leaving 13. And since we can’t split a letter in half, we divide the possible symbols into sets of 6 and 7, and ask, “Is it less than G?” We receive a 1, which is yes, and now we are left with 6 possible letters. These we can split in half, and ask, “Is it less than D?” We receive a 0, which is no, leaving us with three possible letters. Now we can pick a side and ask, “Is it D?” We receive a 0, which is no. And finally, we are left with two possibilities. We ask, “Is it E?” We receive a no. And after five questions, we have correctly identified the symbol: F. Realize that we will never need to ask more than five questions. So the number of questions will be at least 4, and at most 5.
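To make the question-counting concrete, here is a minimal Python sketch of the guessing game (an illustration, not part of the original lesson). It plays Bob’s side: repeatedly ask whether the letter is in the lower half of the remaining candidates, and count how many yes/no answers it takes to pin down one letter.

```python
import string

def identify_letter(secret):
    """Identify a letter by halving the remaining candidates with each
    yes/no question; return how many questions (binary digits) it took."""
    candidates = list(string.ascii_uppercase)  # 26 possible symbols
    questions = 0
    while len(candidates) > 1:
        mid = len(candidates) // 2
        questions += 1
        if secret in candidates[:mid]:  # "Is it less than candidates[mid]?" -- yes (1)
            candidates = candidates[:mid]
        else:                           # -- no (0)
            candidates = candidates[mid:]
    return questions

counts = [identify_letter(c) for c in string.ascii_uppercase]
print(min(counts), max(counts))   # 4 5  -- never more than five questions
print(sum(counts) / len(counts))  # ~4.77 questions per letter on average
```

Running it over all 26 letters confirms the claim above: every letter is identified in at least 4 and at most 5 questions.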
And in general, 2 to the power of the number of questions equals the number of possible messages — which we previously defined as the ‘message space.’ So how can we calculate the exact average — or expected — number of questions, given a message space of 26? We ask the reverse question: 2 to the power of what equals 26? To answer these types of questions, we naturally use a logarithmic function, base 2 — because log base 2 of 26 is the exponent to which 2 must be raised to give us 26, which is approximately 4.7. So on average, approximately 4.7 questions will be needed per letter, at minimum. And since she wants to transmit a word with 6 letters, Bob can expect to ask, at minimum, 28.2 questions — which means Alice will need to send at most 29 binary digits.

Finally, let’s apply this formula to a new message: the poker hand. For each symbol, the sender, Alice, can be thought of as selecting 1 of 52 different symbols. And in this case, the number of questions is the same as the number of times we need to split the deck and ask Alice which pile it is in, until we are left with one card — which we will find is usually 6 splits, or questions, and sometimes 5. But we can save time and just use our equation: log base 2 of 52 is approximately 5.7, since 2 to the power of 5.7 is approximately 52. So the minimum number of questions, on average, is 5.7 per card. A poker hand contains five cards, so to transmit a poker hand requires 28.5 questions, on average.

We are done. We now have our unit. It’s based on the minimum number of questions to define the message — or the ‘height’ of the decision tree. And since Alice transmits this information as binary digits, we can shorten this term and call our unit the ‘bit,’ instead of ‘binary digit.’ So the 10 coin flips require 10 bits, the 6-letter word requires 28.2 bits, and the poker hand requires 28.5 bits. Alice, then, decides to charge one penny per bit, and begins collecting her fees.

Now this idea emerged in the 1920s. It was one of the more abstract problems that communication engineers were thinking about. Ralph Hartley was a prolific electronics researcher who built on the ideas of Harry Nyquist — both of whom worked at Bell Labs after World War I. In 1928, Hartley published an important paper titled “Transmission of Information,” and in it he defines the word ‘information’ using the symbol h, as h = n × log(s), where h is our information, n is the number of symbols — whether they’re notes, letters, numbers, etc. — and s is the number of different symbols available at each selection. This can also be written as h = log(sⁿ). And Hartley writes, “What we have done, then, is to take, as our practical measure of information, the logarithm of the number of possible symbol sequences.” So, information is the logarithm of the message space.

However, realize that throughout this lesson we have been assuming that the symbol selection is random — a convenient simplification. We know that in reality most communication, such as speech, isn’t always random. It’s a subtle mix of predictability and surprise. We do not roll dice when we write letters. And it is this predictability which can result in significant savings in the length of transmission, because when we can predict things in advance, we shouldn’t need to ask as many yes-or-no questions to define them. But how could we formally model this subtle difference? This question brings us to the key insight in our story.
Can you think of what it might be?
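Hartley’s formula is easy to check numerically. Here is a small sketch, assuming only the definition above with the logarithm taken base 2, that reproduces the three bit counts from the lesson:

```python
from math import log2

def hartley_info(n, s):
    """Hartley's measure h = n * log2(s): n symbols in the message,
    s different symbols available at each selection."""
    return n * log2(s)

print(hartley_info(10, 2))   # 10 coin flips: 10.0 bits
print(hartley_info(6, 26))   # 6-letter word: ~28.2 bits
print(hartley_info(5, 52))   # poker hand:    ~28.5 bits
```

At a penny per bit, those are exactly the fees Alice collects.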


37 thoughts on “Information Theory part 9: What is a bit?”

  • This message would cost you more than $1 to send. 104.60752504759633 pennies, to be precise.

    I would wait until the next episode. Things will get cheaper when they introduce concepts like symbol probability and entropy.

    Wait a second… I just spent 15 bucks on this comment…

  • In the poker hand example you did not account for the fact that when you pick a card from a deck, it is no longer there, thus reducing the message space and consequently the average number of questions.

  • Art of the Problem says:

    Very, very good catch. I should have mentioned that the cards are chosen with replacement…which means this isn't a "true" poker hand – just 5 selections from a set of 52.

  • Nifty Fingers says:

    Everything in information boils down to the bit, though. You can have nats and hartleys as well. Nats are when you use Euler's number instead of 2, and hartleys are when you use 10 instead of 2. If you give me a quantity of information, say one nat, that's 1/ln 2 bits, or about 1.44 bits.

    Words and phrases are multiple bits, so there's really no difference. How do you send a phrase over a wire by the way? 1V = "to", 1.1V = "the", 1.2V = "an"… engineers would say what a pain that would be.
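For anyone who wants to verify the unit conversions in the comment above, a short sketch follows (an illustration, not from the thread). One nat is 1/ln 2 ≈ 1.44 bits, and one hartley is log2(10) ≈ 3.32 bits:

```python
from math import log, log2

BITS_PER_NAT = 1 / log(2)    # ~1.4427 bits per nat (base e)
BITS_PER_HARTLEY = log2(10)  # ~3.3219 bits per hartley (base 10)

print(1 * BITS_PER_NAT)      # ~1.44 bits in one nat
print(5 * BITS_PER_NAT)      # ~7.21 bits in five nats
print(2 * BITS_PER_HARTLEY)  # ~6.64 bits in two hartleys
```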

  • REDRUM JOODEN says:

    At the end you posed a question. I'm guessing the answer is to compile all the words in the English language and make an algorithm that would know the entire set of possible subsequent letters based upon the beginning pattern. Is there another way to limit the letters to fewer than 4-5 questions? Also, I was thinking the number of signals is important (you can use a pause to end the letter), but isn't the speed also important? And if so, wouldn't you need a number to declare the end of the letter/unit?

  • Is 4.7 the minimum number of bits per letter, or the average number of bits per letter? You seem to change your mind at least twice every sentence, on average.

  • If the message is known to be an English word, and you receive the letter "T", then it's more likely that the next letter will be "H", which might spell the common word "THE". At each node in the decision tree, the letters should be arranged between the two branches so that you have a 50% chance of ending up on either branch. Having an equal quantity of letters on each branch doesn't really make sense; the split should be based on how likely the letters are to follow the previous letter, or even previous words.
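The splitting idea in this comment can be sketched in a few lines of Python. The probabilities below are invented purely for illustration; the point is that the split balances probability mass rather than letter count:

```python
def best_split(symbols):
    """Split (symbol, probability) pairs into two branches whose
    probability masses are as close to 50/50 as possible."""
    total = sum(p for _, p in symbols)
    best_i = min(range(1, len(symbols)),
                 key=lambda i: abs(2 * sum(p for _, p in symbols[:i]) - total))
    return symbols[:best_i], symbols[best_i:]

# Invented probabilities for the letter following a 'T':
after_t = [('H', 0.40), ('E', 0.25), ('A', 0.15), ('O', 0.12), ('R', 0.08)]
left, right = best_split(after_t)
print(left)   # [('H', 0.40)] -- 'H' alone nearly fills one branch
print(right)  # the remaining letters share the other branch
```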

  • I would assume, when sending words, that for the longer ones you would reach a point at which the combination of bits could only represent that word and that specific word alone. You could then save on information by 'shortening' the number of bits that the word would otherwise consist of, so you could save quite some money if your message is long enough.

  • At 9:32 the 'M' tile and the 'W' tile are switched. That 'Z' tile also looks suspiciously like another 'N' tile. However, at 1:21, the tiles all look normal. What happened?

    Also, I love this series and am looking forward to the next installment.

  • I love how the W and M are switched in the tiles. Also, great video; these go a long way to explaining these complex topics.

  • Kristupas Stumbrys says:

    There should be a base fee included, because you first need to ask a few questions to know how long the incoming message will be (if you don't have an agreement on how long the pauses have to be), and also what type of questions you need to ask. At first you would have needed to ask 2 questions to know whether the information sent represents H/T, letters, or cards. These points are a bit advanced; if most people learned and understood at least the concept presented here, it would be very cool.

  • Never mind, I just noticed the date of the upload, I will be eagerly anticipating the last 2 episodes :). I have a feeling they have something to do with entropy…

  • wait… isn't the code with which you decipher the message considered "information"? shouldn't she be charging for that?

  • Henrique de Oliveira says:

    Also, the calculation you presented is the required one to transmit both the cards in the hand AND the order that the cards appear in the hand. If you don't care about the order, as you normally wouldn't in a true poker hand, the number of possibilities is smaller.

  • 5:56 "at most 29 binary digits" is incorrect. If you are extremely unlucky, you will have to send 30 digits (message "ffffff"); if you are lucky, you will only have to send 24 digits ("dddddd"). 29 is the average number of digits.

  • David Müller says:

    Btw, for anyone wondering why it is H:
    I believe it's not an "H" but originally a capital Greek "Eta" (Η), for "Entropy", which is written like a Latin H but is still a capital Eta. 😀

  • I'm pretty sure you only need 22 bits to send a poker hand?

    You cannot get duplicate cards which reduces the amount of data you need to send!
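This can be checked with standard counting. A sketch (an illustration, nothing here is from the thread): an ordered draw without replacement needs log2(52·51·50·49·48) ≈ 28.2 bits, and an unordered 5-card hand needs log2(C(52,5)) ≈ 21.3 bits, so 22 bits are indeed enough:

```python
from math import comb, log2

ordered_with_replacement = 5 * log2(52)                 # ~28.5 bits (the video's number)
ordered_no_replacement = log2(52 * 51 * 50 * 49 * 48)   # ~28.2 bits
unordered_hand = log2(comb(52, 5))                      # ~21.3 bits -> 22 bits suffice

print(ordered_with_replacement, ordered_no_replacement, unordered_hand)
```

The gap between 28.5 and 21.3 bits is exactly what this comment and the earlier replacement/ordering comments are pointing at.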

  • I'm sorry, three things in this video don't make any sense to me.
    In the example with the six-letter word, Brit shows the most efficient way to determine the letter (eliminating about half of the letters with every question). With this method you will never need to ask more than five questions, and sometimes four will be enough. There is no better way of doing it. Based on that, he then goes on to calculate the average number of questions you'd need to ask for one letter – and here's my first point: if you use the described method and go through every possible letter, you will find that for exactly 6 of them you'll only need to ask four questions (for example a, d, g, n, q, t, if you always choose the smaller one of a pair of three). The 20 remaining letters require five questions. Now if you calculate the average you get (6*4 + 20*5)/26 = 4.769… questions per letter. But Brit does the following and gets a different answer: he uses the formula to calculate the message space based on the differences per symbol and the symbol rate. Message space = (differences per symbol) to the power of (symbol rate). Basically what he calculates is "How many times do I have to send a bit (two differences per symbol) to be able to construct 26 different messages (i.e. 26 different combinations of those bits)?" And his answer is log base 2 of 26 = 4.7004… . But then he says: "So, on average, approximately 4.7 questions will be needed per letter at minimum […]" But besides the first two digits, his number is different from what I just calculated. Now, did I make a mistake in calculating the average number of questions per letter, or did he simply calculate a different thing?
    Also, does saying "on average at minimum" make any sense? It's either the average or the minimum, isn't it? And he then goes on to calculate 6*4.7 = 28.2 and states that "Bob can expect to ask, at minimum, 28.2 questions, which means Alice will need to send, at most, 29 binary digits." But clearly, the minimum number of questions would be 24 if you happen to only need four questions for every letter. And Alice will need to send, at most, 30 bits, if all the letters require five questions. So this is the second thing I don't understand.
    And finally, does using the formula for the message space in this case make any sense at all? Because you can't send a bit 4.7 times. Either 4 times or 5 times. And if it's not the average number of bits, then what is the point of this calculation?
    I really hope some of you can understand my confusion and maybe help me out ':)
    (Btw it's also possible that I just didn't understand what he says correctly because English is not my native language, but I don't think that's the case)
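For what it's worth, both numbers in this comment can be reproduced, and they measure different things: 4.769 is the achievable per-letter average of the balanced tree, while log2(26) ≈ 4.70 is a floor that per-letter questioning cannot quite reach (grouping several letters into one block of questions closes the gap). A quick check:

```python
from math import log2

# Depths in a balanced yes/no tree over 26 letters:
# 6 letters are resolved in 4 questions, 20 letters in 5.
tree_average = (6 * 4 + 20 * 5) / 26
print(tree_average)   # 4.769... questions per letter, the achievable average
print(log2(26))       # 4.700... bits per letter, the theoretical lower bound
```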

  • The average calculation isn't correct though. Let's say we have three letters
    'A' 'B' 'C'
    And each message is only one letter long. So let's say we represent them like this:
    A:0
    B:00
    C:01
    We need 2 bits for B, 2 bits for C, and only 1 bit for A. Therefore the average # of bits is (1 + 2 + 2) / 3 choices, or 1.666666
    bits per letter on average.
    Not log2(3), which gives 1.585….
    …?
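One caveat about this example: the proposed code is ambiguous, since the string "00" could mean "AA" or "B". Here is a sketch using a uniquely decodable (prefix-free) code with the same lengths, chosen purely for illustration:

```python
from math import log2

# A prefix-free code with lengths {1, 2, 2}: no codeword is a
# prefix of another, so every bit string decodes one way only.
code = {'A': '0', 'B': '10', 'C': '11'}

average = sum(len(bits) for bits in code.values()) / len(code)
print(average)   # 1.666... bits per letter for this code
print(log2(3))   # 1.584... bits, reachable only in the limit of
                 # encoding long blocks of letters at once
```

So the 1.67 average is correct for one-letter-at-a-time codes; log2(3) is the limit approached by encoding longer and longer blocks.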
