Imagine 2 machines. They both output messages from an alphabet of A, B, C, or D. Machine 1 generates each symbol randomly, and they all occur 25% of the time. Machine 2 generates symbols according to the following probabilities. Now, which machine is producing more information? Claude Shannon cleverly rephrased the question: if you had to predict the next symbol from each machine, what is the minimum number of yes-or-no questions you would expect to ask?

Let's look at Machine 1. The most efficient way is to pose a question which divides the possibilities in half. For example, our first question could be whether the symbol is one of any 2 symbols, such as "Is it A or B?", since there is a 50% chance of A or B and a 50% chance of C or D. After getting the answer, we can eliminate half of the possibilities, and we will be left with 2 symbols, both equally likely. So we simply pick one, such as "Is it A?", and after this second question we will have correctly identified the symbol. So we can say the uncertainty of Machine 1 is 2 questions per symbol.

Now, what about Machine 2? As with Machine 1, we could ask two questions to determine the next symbol. However, this time the probability of each symbol is different, so we can ask our questions differently. Here A has a 50% chance of occurring, and all other letters add up to 50%. So we could start by asking, "Is it A?" If it is A, we are done: only one question in this case. Otherwise, we are left with 2 equally likely outcomes: D, or B and C. So we could ask, "Is it D?" If yes, we are done with 2 questions. Otherwise, we have to ask a third question to identify which of the last 2 symbols it is. On average, how many questions do you expect to ask to determine a symbol from Machine 2?

This can be explained nicely with an analogy. Let's assume instead we want to build Machine 1 and Machine 2, and we can generate symbols by bouncing a disk off a peg into 1 of 2 equally likely directions. Based on which way it falls, we can generate a symbol. So with Machine 1, we need to add a second level, or a second bounce, so that we have 2 bounces, which lead to 4 equally likely outcomes. Based on where the disk lands, we output A, B, C, or D.

Now, Machine 2. In this case, the first bounce leads to either an A, which occurs 50% of the time, or else to a second bounce, which then can either output a D, which occurs 25% of the time, or else lead to a third bounce, which then leads to either B or C, each occurring 12.5% of the time. So now we just take a weighted average as follows: the expected number of bounces is the probability of symbol A times 1 bounce, plus the probability of B times 3 bounces, plus the probability of C times 3 bounces, plus the probability of D times 2 bounces. And this works out to 1.75 bounces.

Now, notice the connection between yes-or-no questions and fair bounces. The expected number of questions is equal to the expected number of bounces. So Machine 1 requires 2 bounces to generate a symbol, while guessing an unknown symbol requires 2 questions. Machine 2 requires 1.75 bounces, and we need to ask 1.75 questions on average. Meaning, if we need to guess 100 symbols from both machines, we can expect to ask 200 questions for Machine 1 and 175 questions for Machine 2. So this means that Machine 2 is producing less information, because there is less uncertainty, or surprise, about its output. And that's it.

Claude Shannon calls this measure of average uncertainty "entropy," and he uses the letter H to represent it. The unit of entropy Shannon chooses is based on the uncertainty of a fair coin flip, and he calls this "the bit," which is equivalent to a fair bounce. We can arrive at the same result using our bounce analogy. Entropy, or H, is the summation, for each symbol, of the probability of that symbol times the number of bounces.

Now, the difference is: how do we express the number of bounces in a more general way? As we've seen, the number of bounces depends on how far down the tree we are. We can simplify this by saying that the number of bounces equals the logarithm, base 2, of the number of outcomes at that level. And the number of outcomes at a level is also based on probability: the number of outcomes at a level equals 1 divided by the probability of that outcome. So the number of bounces actually equals the logarithm, base 2, of 1 over the probability of that symbol, which gives us our final equation. Entropy, or H, is the summation, for each symbol, of the probability of that symbol times the logarithm, base 2, of 1 over the probability of that symbol. Shannon writes this slightly differently, inverting the expression inside the logarithm, which causes us to add a negative sign, though both formulas give the same results.

So let's summarize. Entropy is maximum when all outcomes are equally likely. Any time you move away from equally likely outcomes, or introduce predictability, the entropy must go down. Now, the fundamental idea is that if the entropy of an information source drops, it means we can ask fewer questions to guess the outcome. And thanks to Shannon, the bit, which is the unit of entropy, is adopted as our quantitative measure of information, or measure of surprise.
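The arithmetic above can be checked with a short sketch. This is an illustration, not from the video: the probabilities (Machine 2: A 50%, D 25%, B and C 12.5% each) and the question-tree depths are taken from the transcript, while the function name `entropy` is my own.

```python
from math import log2

def entropy(probs):
    """Shannon entropy H = sum over symbols of p * log2(1/p), in bits."""
    return sum(p * log2(1 / p) for p in probs if p > 0)

# Machine 1: A, B, C, D each occur 25% of the time.
machine1 = [0.25, 0.25, 0.25, 0.25]

# Machine 2: A = 50%, B = 12.5%, C = 12.5%, D = 25%.
machine2 = [0.5, 0.125, 0.125, 0.25]

# Expected bounces (= expected questions) for Machine 2, as a weighted
# average of each symbol's depth in the question tree:
# A is found in 1 question, D in 2, B and C in 3.
expected_bounces = 0.5 * 1 + 0.125 * 3 + 0.125 * 3 + 0.25 * 2

print(entropy(machine1))   # 2.0 bits
print(entropy(machine2))   # 1.75 bits
print(expected_bounces)    # 1.75
```

The expected question count matches the entropy here because every question in Machine 2's tree splits the remaining probability exactly in half; for distributions where that is not possible, entropy is a lower bound on the average number of questions.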

## RimstarOrg says:

Very clear explanation. Thanks.

## Javier Perez de Rojas says:

Yay! New video 🙂

## Janko Kandić says:

Good to have you back 🙂

## Diogo Canina says:

Niiiiice!! Thank you, i was waiting for the longest time!

## hl2mukkel says:

HE IS ALIVE! =D WOOHOO

## Paul Miller says:

There is an arithmetical error at 3:47 (should be p_D*2)

## taylor8294 says:

Great, great video

## midevil656 says:

Could never understand the information entropy unit in class, but you nailed it here. Thanks!

## 6san6sei6 says:

i simply love your videos.

## KenCubed says:

I am greatly enjoying these!

## Keifker says:

Great video! Glad to see you are back

## 5ystemError says:

Thanks for uploading a new video!!! I just started watching these a few days ago and was bummed out when I saw you hadn't uploaded one in a little while. These are awesome.

## John A says:

To whoever runs ArtOfTheProblem, I want to personally thank you for all of the time you have spent making these videos. I came across one of them on a math forum a few months ago and I proceeded to consume every video you have made. Your videos led me to absolutely fall in love with information theory and persuaded me to open up a major in computer science. Claude Shannon has replaced Richard Feynman as my favorite scientist. I can't thank you enough, these videos literally changed the path I'm taking through life.

## Diego Mesa says:

Thank you for continuing! These are fantastic!

## Kazimieras Celiešius says:

It's great to see a new video after a while 🙂 thank you!

## viralshield says:

superb

## Steven Hollingsworth says:

Love your videos. Thank you.

## completelymindfucked says:

first video since i subscribed. it was awesome

## Sashank Dara says:

Just awesome !

## Steven Hess says:

Beautiful work.

## Immac (RockLeet) says:

SO entropy is good for encryption but bad for communication right?

## Richard Schwinn says:

1:26 "is it C?"

## HSAdestroy says:

Why don't we ask those two questions from Machine 1 for Machine 2?

## Aaron McDonald says:

This is a great explanation. This is the first time I've actually grasped the basic concept.

## Stephanie Rideout says:

I don't know about you, but I read #bounces as "hashtag bounces" and not "number of bounces". Oh look, youtube even made it a hashtag. Thanks, internet.

## pezaventura says:

These videos are amazing!

## Elizabeth Johnson says:

very nice.

## Chaithanya Ks says:

I am in <3 with information theory because of your videos. Thank you so much for making this. I wish I could explain to people the way you do. And btw can somebody explain to me in detail why #bounces are represented using log?

## 77Fortran says:

Incredible video, thank you! 🙂

## Julian Goulette says:

I've heard somewhere that randomness is the most compact way to store information because it lacks pattern

## XenoContact says:

How the fuck do you ask 1.75 questions?

## ethan ztz says:

b it

## quang nguyen says:

These videos are extremely informative and very well done! Thank you for all your hard work! 🙂

## Fernando Gonzalez says:

Why is the number of outcomes stated as 1/p??

If the probability was .1 then the number of outcomes would be 10 and if p was .9 the number of outcomes would be 10/9?? I didn't get that =/

But these videos are amazing, thank you for making such a good video!

## resistancefighter888 says:

I don't get it, if we need to ask 1.75 questions to guess the next symbol of machine 2, doesn't it mean it produces MORE information?

## Anthony Vance says:

Great video. Where can I find the cool music at the end?

## Teymur Azayev says:

The beginning of the video is misleading and implies that the second machine is not random. It is in fact random and the difference between them is that the first generates a uniform distribution whereas the second doesn't.

## Kal says:

Definitely one of my favorite explanations of all time

## CityOfSky CityOfSky says:

This is the most intuitive video I've ever seen about Shannon entropy, thank you!

## dangerlibya2010 says:

I'm studying computer science and I wish all my professors explained things just like you !! that would've saved a lot of time !

## aref smiley says:

it was great explanation. Thanks

## cara says:

Very nice intuitive explanation. Thank you!

## Qoooba95 says:

It is very well explained indeed but I still have certain doubt. I would

like to think about entropy in terms of random variables (like

wikipedia definiton with discrete random variable X where possible

events are values of X), but I fail to understand the following: how do

you determine the entropy of an English word for example? What is the

random variable in that case? I see that for the set of English

characters we could determine the entropy according to the formula

presented in the video as the X would represent the character that

appears as the message and they all have certain probability of

appearing so we have the probability distribution. But I've seen people

determining entropy of a word by calculating the probabilities of the

characters as appearance rates (ABA: 1/3 for B 2/3 for A) and

considering that the probability distribution… If anyone could address

that and shed some light I would be extremely grateful! Thanks!

EDIT:

I just realised that given a sequence of characters that appear one after another independently the entropy of such message considered as a value of a random vector would be a sum of entropies of random variables which make up for that vector (is that right?) so it kind of makes more sense to me and seems natural (every additional character provides the same amount of information per average) but I'm still puzzled with entropy of given English word… Hope someone could respond to it. Also, incredibly interesting topic! And this channel is just great!

## jinnycello says:

Very easy to understand. THank you!

## Raghav Devgon says:

Thanks for such videos 🙂

## ramana raju says:

Very good

## James Simpson says:

Very good video, thanks, but I see (from my UoM physics course notes) that the natural logarithm is used instead of the base-2 logarithm: is this choice arbitrary or does it vary to suit a given situation?

## Bart Ziengs says:

Very nice video and clear explanation!

## Madras Matinee says:

HIGHLY INTUITIVE !

## Jeffrey Bos says:

Thanks a lot, this was very useful!

## niharika rastogi says:

I like the creativity u put in explaining anything.

Simply wow.. 🙂

## Sang Phan says:

Great video, thank you. This make me realize how stupid the school is. They make beautiful things seem to be so dull and complex.

## Dongzhi Yang says:

Fantastic! Thank you

## Afrid Gowun says:

This explains nothing. It's only clear for those who have already been confused about it for years, but not for me, who never thought about it and was just starting to grasp its meaning. Sorry, but the explanation is so fast that it couldn't hold my attention, and this is not a work of art, because it left me confused.

## Chen Phillis says:

very clear explanation! thank you

## Kevin McInerney says:

At @3:50 it should be a …. + 0.25 x 2 and not ….+ 0.25 X 4

## Darshan says:

Great video!

## TableRaw says:

wow what a video 😮 thank you

## Kamal Gurnani says:

Very nice explanation! thanks for the help 🙂

## Spandan Madan says:

To anyone who did think of this, why would you ask questions such as "AB v/s CD" in the first case, and "A v/s BCD" in the second case? The reason is that this ensures that both answers can occur with 50% probability. If at every step you ask a question whose two answers are equally probable, the AVERAGE number of times you'll have to ask questions (in other words, follow that tree structure of questions/answers) will be minimum. Why? You'll need a little bit of basic computer science experience to think this one through, maybe a good reason to take a course on data structures!

More interestingly, this is also the reason why binary search is more efficient than splitting a sorted list into 3 parts. It's a simple proof, I encourage you to think it through 🙂

## Marcin Kovalevskij says:

Thank you for all your hard work making these videos. Every video is made putting attention to the smallest detail. And that intro sound…

## Sirus Das says:

This the best explanation on the INTERNET! Amazing, Awesome and Thank You!

## Heidy Hazem says:

great (y)

## profie24 says:

Wonderful

## MauricioMartinez0707 says:

You are a god thank you

## devin cope says:

wait… what about a and d or a and c… so on?

## Sai Aung K. Oo says:

This Shannon guy is way ahead of his time. The father of data-compression in times when there were no Internet, computers or even binary-files.

## Thomas Bingel says:

Recommendable

## sovan sam says:

Thank you a lot man… This clear explanation got me to the point.

## ENXJ says:

Thank you so much for making this! If I ever become a lecturer, I will show your video to my students. This was so clear…

## N. M. says:

Please could anybody explain the equivalence of the two formulas at 6:21? THANKS

## Dzinuljak1 says:

great!

## František Sabovčik says:

This is indeed art, but the tearing of the page in the end was painful. : )

## General of your mom says:

Not clear how Markov chains are related to this. Why were Markov chains introduced?

## tmstani23 says:

Great video and super interesting topic. I think this definition of information is super counter-intuitive compared to the everyday common sense use of the word "information". People often think of information as a source of knowledge or having some inherent use. It seems like the common-sense definition of information is more akin to education and implies some utility. Whereas this definition of information seems to remain indifferent to its utility and is closer to complexity or an implied amount of substance. This makes sense since entropy and information are increasing in the wild. It is fascinating that information can be proven to be entropic. I wonder if there is a limit to information or if entropy increasing means the universe is infinite in possibilities? Maybe we can only observe an information state but information itself is infinite.

## Soumyadeep Roy says:

Extremely intuitive explanation, so well explained that I was wondering why the heck I couldn't understand it before!

## Mudar Saied says:

Hei, we did not count the probability of CD in the first machine? The first question was is it AB? then you continued only with Yes as an answer? What if it was N0?