Lecture 6: Noisy Channel Coding (I): Inference and Information Measures for Noisy Channels

Lecture 6: Noisy Channel Coding (I): Inference and Information Measures for Noisy Channels

welcome again today I’ll recap what we’ve done in data compression and then we’ll get back to the central topic if you like of the information theory course which is noisy channel coding which we looked at in lecture 1 let me start with a little question I set for you last time the question is about this ensemble ABCD with probability to half a quart at eight and eight and with the symbol code shown there and the question is if we are encoding from this source and then we reach into the encoded stream and pluck out one bit what’s the probability that it’s a one please talk to your neighbor about your answer to this question okay well let’s have a quick vote there are three options there you’ve worked out p 1 I hope you’ve talked to your neighbor about it and I’m going to ask you is your people on less than half half or bigger than a half and you can also vote for Zed which is I don’t know votes for Zed I don’t know okay votes for a p1 is less than a half okay a few votes there for a vote sir be p1 is equal to half 11 vote for be votes per seat p 1 is big enough everyone else except the people who okay we’ve got one there and no one voted for Zed so everyone else must be asleep already okay well we’ve got a range of answers there people from community a who’ve got p 1 less than a half could a volunteer take us through the argument what is your p1 and what was your method for calculating p1 since you’re in the majority okay who voted for a all right what was what was your p1 what was the value anyone unknown okay well can I help out let’s talk about the fraction of the code word that is made of ones and see what happens if we look at that so coded a the code word for a which happens half a time has got 0 ones in it code with B’s half ones Code Red Sea is to third ones and D is a hundred percent one so I defined a thing called fi it’s the fraction of the code word which is ones so a lot of people are voting for a but you’re being very shy about saying why and I’m going to now give a possible argument that might give an answer that’s in the a camp so we pick a code word first of these probabilities then we pick a random bit to see if it’s a one or a zero and these are the probabilities of it being a one or a zero so here’s a calculation probability of getting a warm I’ll put a question mark on it cuz i want to check if I’ve got your reasoning right probability of getting a one is a sum over all the letters in the alphabet probability of picking that letter times the probability of it turning out to be a one when you pick a random bit from the code word yeah is that the sort of reasoning that the a community went thought your letter nod if talking is different difficult ok we’re getting some nods alright so we add that up and we’ve got a half times zero plus a quarter times r + + 8 times two-thirds less than eight times 1 alright which gives us 18 + + 8 + + 8 times two-thirds which is 18 times two and two-thirds two and two-thirds is 8 over 3 which has given us one third okay does anyone who voted for a think that this is roughly why they voted for a hands up if that’s so okay we were getting somewhere so there’s a candidate answer there of a third but not everyone agreed we’ve got some B’s and C’s does anyone want to challenge this and say no it’s wrong here’s an argument why that’s wrong anyone or has everyone gone over to the a side yeah Martin okay so Martin just said if you reach into the stream and some of the code words are long and some are short and you reach in and pick around a bit you’re more likely to hit one of the long code words all right so this isn’t the correct way to pick a random bit from the stream we need to take into account the lengths so that’s a good argument okay so this is probably wrong at least it’s the wrong method it may not be the wrong answer so length matter length of code words matter any other argument for an alternative answer okay let me give you one other argument for this and then we’ll do the calculation again what’s the information content of this probability here answer one bit to bits three bits and three bits what’s the ideal encoded lengths for probabilities the ideal encoded lengths are the shannon information contents which are all from now on i will just call them the information content what are the length of the code words length I is 1 2 3 and 3 so the length match the information contents so this is a case where we have a symbol code with perfectly matched code lengths to the probabilities that means you can’t compress any better than this because we have our source coding theorem that says you can compress down to the entropy and if the expect if the length of your code words match the information contents it’s unbeatable good and this is a valid code it’s a prefix code so this this will work now if we use this code to encode stuff from the source and if it were the case that the bits coming out after encoding were not 5050 zeros and ones completely random independent unrelated to each other if they were not 5050 then we could compress it some more because shannon says if the probabilities are 5050 we can make codes you know with dust bin bags full of lottery tickets and all that sort of stuff we could do further compression but you can’t compress anymore because we’ve already got it down to the entropy therefore without doing any calculation at all it must be the case that p1 is a half so b is the right answer because it’s perfectly compressed and if it’s literally compressed it’ll look random ok so the right answer is B so what went wrong with this calculation well let’s just redo it and do the calculation in a way that takes into account lengths legs think about it this way we’ve got a great big bucket we’re going to put bits into the bucket each time you pick a codeword some more bits go in the bucket and some of those are ones and we’re interested in the ratio in this bucket what fraction of them are 1s so let’s give ourselves an extra column in here which is not the fraction that are 1s but the actual number that are 1s so the number of 1s in this code word is zero and this one is one and this one it’s two and in this one it’s three so on average when I pick another code word how many new bits go into the code word we’ll put that down stairs and the answer is sum over i p_i l_i and on average how many 1s go into the bucket well that’s sum over i p_i n_i okay how does that differ from what we did over here well f_i is the ratio of n_i over l_i so you can see we’re using a different thing upstairs and to compensate for that we’re using something different downstairs okay so this is the average number of 1s that go in the bucket whenever you pick a new code word this is the average number of bits that go in the ratio of those is going to be the fraction of bits that I want if we do that we have the upstairs thing is a half times zero plus a quarter x 1 + + 8 x 2 + + 8 times 3 and downstairs it’s our old friend the expected length which is the entropy of this probability distribution which is is it not 7 over 4 i’m doing this from memory and what’s up says that’s the quarter and that’s 38 and that’s a quarter so in eighth we’ve got two plus two plus three which is seven over 80 verse 7 over 4 which is half so that’s the long way to get to the answer that a sneaky information theory based answer is just to take it’s perfectly compressed case closed is fifty-fifty and it’s not only fifty fifty it’s got no correlations in it at all that there can be no correlations in the encode eating stream because otherwise we could use the information theory argument and say all those correlations could be exploited those dependencies any questions so we love information theory it provides you with very powerful very brief arguments for things that mean that you don’t need to do calculations anymore less time let me just recap what we did we introduced symbol codes or rather the time before last three interview symbol codes and symbol codes are widely used we have a Huffman algorithm that gives you optimal symbol codes and their expected length are within one bit per character of the entropy but we argued in the previous lecture we could go that doesn’t actually wrap up compression we don’t like symbol codes because being within one bit per character is not very good if the actual entropy per character is something small like 0.1 bits and we argued that often that will be the case if say we’re compressing English we play the game that gave us the idea that you could use identical twins at the sender and the receiver in order to make a compressor and then we looked at arithmetic coding and arithmetic coding is this way of thinking about a binary file as defining a single real number between 0 and 1 to an extremely large precision and any source file can be represented in the same way as in corresponding to an interval on the real line between 0 and 1 and I argued that when you do after Matic coding you will get compression down to within two bits of the information content of the entire file so this is in terms of its overhead this plus two is potentially about a factor of n smaller than the overhead you yet if you use symbol codes so arithmetic coding is in almost all cases far far better than symbol codes and I just wanted to flesh out how you’d use arithmetic coding in practice I I talked about the idea of an Oracle that being a piece of software which uses some procedure to predict the next character given all the characters x1 through xt minus 1 that have been seen and I wanted to just give you an example of how this works in real state-of-the-art compressors so here is a fragment of jane austen’s emma this is the last 50 or so characters and the context is alarmed at the BR 0 and the question is what comes next so how do real compression algorithms work well a fairly near state-of-the-art compressor that’s very widely used is a compressor called ppm and here’s how ppm works it’s called prediction by partial match and it’s slightly ad-hoc but there’s a bunch of theoretical justifications and improvements we can make to it but let me just describe the ad hoc method here’s the idea we look at the context of the last five characters of the document to define the context for the next character then we look at the entire file of everything that we’ve seen so far and we say have we seen that context before in the entire file so this is how we’re going to then predict what happens next we go and find all of those occurrences and here they are shown in blue so II space PR 0 has happened one two three four five times already in Jane Austen’s Emma and this shows what happened next there was a VIII IV and orp and an S so a first stab would be to say let’s use these six gram statistics the raw 6pm statistics to say all right there’s a 2 60 to 1 how many were the one two three four five is 75 times so there’s a 256 chance of getting a V next there’s a one with ants of 0 a 1 5th Chancellor p and a wampa the chance of an S but it also haven’t seen that much data so we better allow some probability for other things happening how do we allow for other things happening well the PPM approach is to say let’s back off to shorter length contexts as well and say what happened in context space PR 0 PR 0 r 0 0 and in any context at all and fold in those other one two three four five possibilities in addition to the six Graham statistics and wait them together in a smartish way and that’s how ppm works and it gives you extremely good compression state-of-the-art on most files it doesn’t do as well as humans do human audiences like yourselves can predict English a lot better and clearly it doesn’t understand grammar and dictionaries and and so forth but it does a lot of learning of raw statistics and that that gets you a long way towards good compression so that’s ppm okay so the question was does it learn the six grams statistics of the document of the entire document and then compress it or does it do it on the fly and the important answer is it’s just like you guys it the identical twin audiences it does things on the fly so when we’re asked to predict in this context we then say how often has a space PR 0 happened already before now we don’t look ahead into the future so the predictions are based only on what we have already seen okay any other questions there’s a couple more things I want to mention about arithmetic coding one is as I showed you last time we can make other uses of arithmetic coding we can make an extremely efficient writing system and the one I showed you was called dascha that’s based on the idea that writing involves making gestures of some sort maybe wiggling fingers or scribbling with a stick and we want to turn those gestures into text as efficiently as possible and compression is all about taking text and turning it into a bit string which in arithmetic coding is views viewed as a real number and so we can make a very good writing system by saying okay let’s have the gesture just define a real number by steering our spaceship in a two-dimensional world and that will make us if we turn it on its head a very efficient writing system so you make your gesture and outcomes text so that’s the dasher concept and I showed you that and it’s free software and help is always welcome to make dash a better and there’s another application for our automatic coding I wanted to mention and this is the idea of efficiently generating random samples let me give you an example let’s say I asked you to simulate a bent coin and I said okay let’s fix pa2 I don’t know 0.01 and PB is the rest please simulate 10 million tosses of this bent coin and I will provide you with random bits a random number generator the question is how many bits will you need to use to generate 10 million tosses of this coin let’s say n from now on instead of 10 million a standard way of solving this problem is to say oh well i’ll use the random bits you can provide me with to make real numbers between 0 & 1 so i’ll generate things that i’ll think of as real numbers uniformly distributed between 0 & 1 that will ghazal up 32 bits per real number if it’s a standard random number generator so we read in 32 bits interpret them as defining a number between 0 & 1 then we look to see if u is less than PA and if it is then we split out an A and otherwise we spit out to be this method will cost you 32 bits per character alright can we do better and do we care the answer is yes and maybe so yes we can do much better because we could take an arithmetic coder and think about it this way let’s wipe the board imagine sticking a pin into the world of an arithmetic coder what happens if you stick a pin uniformly at random into this line that goes from 0 2 to 1 or into or into one of those lines what comes out what’s the probability of what comes out if you stick in a pin and use it to select a string of text have a chat to your neighbor okay what I’ve drawn on the board here is a couple of different arithmetic coding worlds that run from 0 to 1 this is 1 for some source within alphabet with characters ABCD with different probabilities and what we noticed last time was the arithmetic coding method gives you intervals between zero and one whose size is the probability of the string equalling that particular string probability that x is a followed by c is the size of this little interval here here I’ve drawn the one for the bent coin this is ninety-nine percent and this is one percent and then we recurse and subdivide this in the same way and we get this is the little bit for a be in here and then here’s a bit for a a B and and so forth alright if we come along and stick in a pin the probability that the pin will land inside this interval here is this probability so the probability that if you then read out the string we’ve got the probability it will begin AC is the probability that a string begins with AC so sticking in a pin at random is a way of picking a string from the probability distribution of your language model okay so what we can do with bent coins is make the arithmetic decoder encoder and decoder for the bent coin then bring along random bits random notes and ones so as to define where a pin goes in the the pin you’re sticking a pin between 0 and 1 you do that by having an infinite source of zeros and ones to define where we are on the line and so you start guzzling up those zeros and ones find out where the pin is to the first bit to bits three roots come over and start reading out bits with your decoder reading out A’s and B’s with your decoder how many bits is that going to need well on average it’s just like compression whatever the binary entropy of naught point naught one is point one or something like that that’s on average how many bits you will need so in contrast to needing 32 bits per coin toss you’ll need this many bits so the arithmetic coding method for generating random samples for the bent coin is going to need binary entropy of point 0 1 bits per coin toss which is smaller than one quite a lot smaller than 1 so the it’s going to be more than a fact of 32 times more efficient with your random bits and the random bits are a valuable thing you may well want to do this it may strike you as elegant and it may actually save you money and I want to close this discussion by just giving you a little bit of evidence that we do actually care about saving random bits because random bits are not free if you want random bits you have to get them from somewhere and if what you’re doing has any sort of principal purpose you need to verify a good random bits and so here’s a CD that I received in the post a few years ago and the CD came with a message saying that this CD contains random bits and they’ve been generated thanks to the financial support of a couple of US government grants and they’re distributed under that grant from Florida State University to academics all around the world who want random bits so random bits are valuable and this is proof that serious researchers and funders food random bits as being important enough to actually spend US government money on so that’s an argument for not wanting to waste random bits any questions okay we have now feel yes so the question is are there other ways of thinking why this is a bad idea to do with the precision of the numbers you need so to make this decision about is it an A or a B often you only need the first bit to already tell you okay we don’t need to look at the next 31 bits the first bit has already made the decision so we’re wasting bits there and conversely actually if you look in detail at the probability that this will spit out a 1 it’s not going to be exactly one percent is it it’ll be some ratio of sort of n let’s call it m / 22 32 where m is some integer and so it’s not actually going to perfectly nail exact reckon it’ll be within 11 part into to the 32 which is pretty good but it won’t give you the right exactly the right answer so yes it is wasteful because it’s using loads of bits to make a decision about a single bit and arithmetic coding is better any other questions okay so we’re going to now move on to the topic of noisy channels and noisy channels have inputs and outputs and one of the things we want to do with noisy channels is infer what the input might have been given the output of the channel so a theme of noisy channels is going to be we need to understand inference and once we’re clear on how to do inference we’ll also be talking about how to measure information content in the context of noisy channels so we know how to measure information content for a single random variable like tossing a bent coin or drawing a random letter from English it’s all about log 1 over P and that’s it but for noisy channels it’s all a little bit more interesting because we’ve got inputs we’ve got outputs we’re interested in questions like how much information does seeing this output convey about the input so we want to learn how to do that and start off with I just want to give you a little puzzle to think about and this it doesn’t involve a noisy channel it just involves three cards and here are the three cards and I’ll introduce you to them one at a time here’s a card which is white and white on the front and the reverse okay here’s a card that’s brown on the front and the reverse H and here’s a card that’s brown on one side and white on the other so we’ve got three cards and now what I’m going to do is I’ll shuffle them and turn them over randomly and hide them and draw one card at random and plunk it down in front of you so you can see it and then I’m going to ask you about the other side of the card all right so I shuffle and randomized are not doing any magic trick so anything has to everything you see is what’s happening so we shuffle them up and then I randomly grab a cart and plunk it down like that all right and I show you the front and one side of the card and it’s white and the question is what’s on the other side the question is given that you’re now seeing a white face what’s the probability that the other face this car is white please turn to your neighbor and I’ll have a chat about that question okay folks for Z which means I don’t know okay we’ve got a few their nose votes for the probability is less than a half anyone no votes it’s bigger than a half a few huh seven ish and votes for see it’s half its 5050 one two three four five six grand ok so we have some clear uncertainty about this grant I like it when that happens so let’s have an argument from the second community why is it 5050 anyone yeah good so the answer I just heard was by logic it’s either this card or that card that’s sitting down there on the desk and we don’t have any additional information so it must be 5050 okay and we’ve got a load of people who disagree with you so would one of them like to give an argument why that compelling argument of logic was wrong yes okay so what I just heard was the three possible ways of getting a white thing facing up you’re either looking at that side or that side or that side and all three of those possibilities are equally likely so either this hypothesis or that one or that one and one we turn it out turn it over we’ll find out whether the other side is this or the other side is this or is that so that means it’s two-thirds yeah two-thirds chance here’s what that argument steered towards good does anyone want to add anything change your mind give another argument look ray yeah okay that’s a nice argument so the argument said think about the other side of the card which is either going to end up being white or brown and right at the beginning before i bought out the cut I could have asked you please place a bet is the the face that you can’t see going to be white or brown and we would say definitely 5050 okay and now the question is we see the front will that make us change our bet at all surely it must make a difference and that’s a very credible argument that there is a dependence between the front on the back so surely you shouldn’t be betting 5050 anymore it doesn’t this argument doesn’t tell you what the right answer is but it does argue definitely not 5050 because we’ve just learned something we know something that we didn’t know and there is a dependence and that’s a good argument to all right so let me give you a suggestion always write down the probability or everything and this is a copyright Steve doll who is a professor over in the building next door write down the probability of Everett of everything all the variables in the problem then you can condition on what has happened and you will be able to deduce by the simple laws of probability what the answer to your question is so in this case everything is the front face and the reverse space and before I did the shuffling and drew the card out the probability of the front face being white or black and the reverse face being white or black the probability of this pair of variables being white and white with one plus the probability of it being Brandon brown was one-third and the probability of the other two things is a third because you get that like the third car the black white card this chap coming along and it’s 5050 which way he’ll go also the sixth chance there mystics chance that that is the joint probability of everything and now with this joint probability we can condition on data and anything else you want to throw in and data is what we’re given and what we’re given is we’re in this world so today the white face came up so we condition on front being white and we can then work out the probability that the reverse is white for example by reno izing this probability distribution and we get to this okay this may not convince people who still really think it’s 5050 dammit it’s 5050 how can you say it’s two-thirds that’s rubbish and so I’ve got one final argument that I hope will help and then once this argument has been accepted maybe it will compel you to agree that it’s a good idea to write down the probability of everything so here’s the final argument and it’s not that the reverse was indeed white that happened to be true but it didn’t have to be the argument goes like this imagine that we play this game lots of times and instead of asking what’s the probability the other face is white I always ask you the question what’s the probability that the other face is the same as the one you’re seeing now okay so mmm-hmm this one over st. this one they’re the same this one they’re not the same so two-thirds of the time the face will be the same on the back as it is on the front so when you see a white there’s a two-thirds chance at the back is white and when you see a brown there’s a two-thirds chance to the back is brown okay so that’s not based on probability theory and I think it’s a fairly compelling argument that the correct answer is two thirds and hopefully that’s convinced you that we should use probability theory because if you can’t actually reliably solve this exercise involving just one random variable which is which card and another end of variable which is way up it went so it’s a two random barrel problem if a smart audience of graduates from Cambridge can’t reliably solve this problem just say oh yeah I know it’s fine I’ve got the answer that really shows you yeah inference is a little bit tricky inference isn’t totally straightforward humans are often very good at inference but if you want to be sure you’re getting it right use probability theory because probability theory will always get you the right answer okay all right so let’s now talk some more about noisy channels and actually let me just remind myself what comes up let’s talk through through this what where we’re heading is we’re going to talk a bit bit more about inference and we’re also going to talk about information measures for noisy channels our classic noisy channel is the binary symmetric Channel up there top right which is obscured by a wireless icon okay go away poof okay the binary symmetric channel with input 0 and 1 and it flips a fraction f of the bits so we’ll talk a lot about that channel in due course but we want to understand coding theory for any noisy channel and whenever we’re dealing with noisy channels we need to know how to do inference and I used to introduce people to inference with a different puzzle instead of the three cards I would show people the three Doors problem where the game show host says here’s the rules of the game I’m going to hide a very desirable prize here a national rail pass behind one of these three doors the game show host always explains the rules first before the game happens and the rules out I will hide the prize behind one of these doors then I will ask you the player to choose a door or you have a free choice and you choose it by naming it and we don’t open it then I the game show host guarantee that I will open another of the doors not the one you chose and I guarantee when I do that promise me I promise you trust me that the prize will not be revealed at that stage so then the prize is clearly either behind do one or door to in this example shown here door one being the one you chosen door to being the other door that you didn’t open and then the player gets the chance to stick or switch if you want and then you’ll get what’s behind the final door you end up at after either sticking or switching and the options for this puzzle then worked like this that was the rules being explained now now you go ahead and play the game the player chooses door one the host says in accordance with the rules i will now open either door two or three and will not reveal the price and as promised he opened the door is still three it doesn’t reveal the price and the question is should you stick should you switch or does it make no difference and i used to use this as my introductory example on probability theory because people would argue very hotly about in say well it’s either behind door one door one or two it’s 5050 yeah this md door doesn’t make any difference it’s 5050 and other people would say oh no no no you should switch and some people would say you should stick and it used to be contentious but unfortunately most people have heard of this puzzle and so it doesn’t really work anymore but let’s have a quick vote votes for a votes for be you should switch votes will see it makes no difference so you see it’s a useless educational device it’s been ruined because everyone’s gone and talked about it and the really annoying thing is they’ve talked about it in a way such that no one has actually learnt anything and you are my proof of that because a moment ago you voted for a lot of you you’re not all the proof but this lot yep this controversy here between B and C proves that add educational disaster has happened the three Doors puzzle which is a fantastic puzzle it’s really educational has been ruined because everyone now knows the answer but they don’t understand and it ABC here these are exactly equivalent to each other the three cards and the three doors they are the same problem as each other and yet I showed you the same problem and you didn’t get it right even though you’d allegedly learnt the answer so learning isn’t about memorizing things it’s about understanding so that’s my little rant on the three Doors so the message people should be getting from the three Doors problem is don’t memorize the answer to a stupid puzzle because then you’ll be useless at solving future inference problems instead learn the message that you should use probability theory and then you will be equipped to solve any puzzle of this type in the future ok rant over so what we do now let’s talk about noisy channels what we’re going to do with noisy channels is they’re always going to have an input and an output and a channel defines a set of conditional distributions if you conditioned on an input the channel defines what the probability of the output is going to be the channel doesn’t define a Joint Distribution it just defines a set of conditional distributions and we can only actually do inference if we’ve got the probability of everything so we need a joint distribution so I’m going to run through a very simple example where I assume I’ve got a joint distribution okay where are we yeah I’ll write down a joint distribution and I’ll define some information measures for that joint distribution then when we move on to channels we’ll have to get a joint distribution by adding to the conditional distributions defined by the channel a distribution on the inputs so that’s how its we’re going to join up I’ll start with a joint distribution or a joint ensemble and here’s an example of one and i’ll define all the different information measures we can make for this on joint ensemble so i’m going to tell you the joint probability of two random variables x and y which are dependent random variables a bit like the front and the back of the three cards so the first variable x can take on values 1 2 3 or 4 y can take on values one two three or four and the joint probabilities are 18 1 16 1 30 to 1 30 tooth 1 16 18 1 30 to 1 30 tooth 16th all across and then a quater & 0 and 0 and 0 from a joint distribution you can always define marginals and marginal probabilities are written in the margins and they are obtained by adding up and here we have a quarter a quarter a quarter and a quarter as it happens for the probability of Y and then in the bottom margin we have a half one quarter 18 18 now we already defined for a single random variable how to measure its entropy and so when we’ve got a joint ensemble there’s a bunch of straightforward things we can do with that entropy definition we can write down what the marginal entropy of X is and we can write down what the marginal entropy of y is for example so I’m defining for youth by example what marginal entropy means muzzle entry of X is the entropy of this distribution here and that is 7 over 4 bits the marginal entropy of why is the entropy of this distribution and that’s two bits and we can define the entropy of the pair X Y so you could think of the pair X Y as being a single random variable that takes on these 16 possible values with the 16 probabilities written half and you can say what’s the entropy of this distribution here so we just do normal sum over all pairs X Y Z of XY log base 2 1 over P of X and Y ok and that comes out to 27 / 8 bits for this joint distribution and those numbers I’ve just written down these three numbers have the property that if you show them graphically and draw H of X that way and h of x and y this way and then you try and make H of Y nudge up against this right hand side here you find they don’t quite match up so H of X and H and of Y added together are a bit bigger than h of x and y for this example and this is actually always true that these guys will always overlap or they might just up a butt up against each other and exactly add up to the joint entropy and they only ever do that if x and y are independent random variables so the size of this overlap here is a measure of how much dependence there is between the random variables let me define a few more things we can do with the Joint Distribution we can define conditional probabilities for example p of x given Y is defined to be the joint probability of x and y divided by probability of Y and we can write it out for all 16 values of X&Y so here is p of x given y 4 y going from 1 2 3 4 and x going 1 2 3 4 so if we conditioned on why being one then we can read out that top row probability and normalize it and we get a half a quarter 1818 the next row goes a quarter a half 18 18 when we normalize it when we normalize the third row they’re all equal so we get a quarter a quarter a quarter on quarter when we normalize the bottom row we get 1000 so those for probability vectors are the four possible conditional distributions and for every one of those probability distributions we can write down its entropy so we can write down the entropy of x given that Y is equal to 1 we can write an epi of x given that y equals 2 and so forth and the entropy of this guy here this distribution it’s fairly familiar it’s seven over for this 17 / 4 as well this one and we have the uniform distribution is two bits and the entropy of one and three zeros is zero bits cuz it’s a certain just ution we know condition on why being for that X is definitely one so what do we think about that well we notice that it’s possible for the conditional entropy of x given a particular value of y to be the same as what we started with 7 over 4 or the same or bigger or smaller so you when you get data about why it’s possible for your uncertainty about X to either go up stay the same or get less so that’s an interesting observation and something we can do with all of these who is we could say on average when we learn why how much information content will X then have so what’s the average entropy of x given Y so I’m now defining for you a thing called the conditional entropy all of these were conditional entropy zaz well there were four particular conditional entropy s for particular values of Y and then this is the really important one the conditional entropy which is the average averaging over all possible values of Y waiting together the entropy of x given that value of y notice I’m carefully consistently using little white amine a particular value and capital y is the name of the random variable itself okay and that adds up to 11 / 8 bits similarly we can work out what the entropy of Y given X is similarly go through the calculation and you get 13 / 8 bits now in this case I pointed out that the entropy of X when you learn why could stay the same could go up or could go down where we worked out the conditional entropy we got 11 / 8 bits and that is smaller than 7 over 4 and let me now tell you a theorem which is that it is always the case that the conditional entropy with capital letters is less than or equal to the marginal entropy and finally it can all be glued together like this in a single diagram so this is not at all obvious but it’s true that H of Y given X fits here and h of x given Y it’s here and thus we have rather nice word stories that we can tell which is the total information that you get when you learn x and y is the sum of the information you get when you learn y plus the amount of information you then get when having already learned why you then go on to learn X or the information content of x and y on average is the information content of learning x by itself then already knowing X learning why ok so the sum of these two is this thing here and this overlap here as I said is a measure of dependence between these variables the relationship I just said about this Plus this equaling that some people call it the chain rule for entropy the final thing I want to do is define this creature half this measure of dependence between x and y is going to be a lot of fun when we play with channels it’s going to be the most important thing we want to know and it’s called the mutual information between the random variables x and y we call it I of X semicolon why and it’s what it shows in the picture so you can pick any way of obtaining this from the picture for example it’s the difference between H of X and H of x given Y or it’s the difference between H of Y and H of Y given X and that’s the mutual information what we’re going to do in the next lecture is glue this together we’re the channel a channel is a set of conditional distributions so for the binary symmetric channel it doesn’t specify how often you should use a zero and the one it just says if you send a zero there’s a ninety percent chance you’ll get a zero out if you send a one of the ninety percent chance you’ll get a one app so the channel defines conditional distributions and then we can turn those conditional distributions if we want to into a joint distribution on input and output by inventing a probability distribution for the input then you can compute all of these information measures for the Joint Distribution you’ve defined and when we’ve done that we can start talking about the theory of how to communicate reliably over noisy channels are there any questions okay thanks very much for coming and see you next week


5 thoughts on “Lecture 6: Noisy Channel Coding (I): Inference and Information Measures for Noisy Channels”

Leave a Reply

Your email address will not be published. Required fields are marked *