Become a fan of Slashdot on Facebook

 



Forgot your password?
typodupeerror
×
News

David Huffman is Dead 93

etphone writes "One of the Gods of information theory, David Huffman, has passed on: here's the official the press release. Damn, he was a good professor too..."
This discussion has been archived. No new comments can be posted.

David Huffman is Dead

Comments Filter:
  • I was lucky enough to have a class from Huffman and hear the story from him personally.

    A few more details...

    Nobody in the class but Huffman chose the term paper because they new it was a trap. The professors assigned Huffman a problem that many people believed was unproveable, though it could not be proved unproveable. It was a problem that they had worked on very hard and not been able to solve.

    The assignment was to find a way to generate minimum spanning trees, given an arbitrary set of bytes. Now, humans could do this with brute force and then checking their results to test if the spanning tree was actually the minimum spanning tree, but it wasn't a very scientific process.

    As the due date for the term paper came, Huffman still couldn't figure it out. He gave up, decided to bite the bullet and try his luck at the test he hadn't studied for. He told his wife to throw his papers away. She did, and as he was leaving he looked in the trash and saw his notes and drawings of minimum spanning trees **upside down**... and the rest is the history.

    I heard somewhere that he got his Masters and PhD at the same time because of this, but I'm not sure about that one.
  • I had the priveledge of taking cmp80D, "Introduction to Cybernetics" (no, not Dianetics, Cybernetics) directly from Huffman at UCSC.

    First of all, every class he ever tought was at 8:00am in the morning. For a Santa Cruzian, that's the equivalent of 4:00am since we normally don't wake up till 12 ;o)

    For the first two weeks of class, he was an absolute teddy bear. That was good, since the class was open to all majors (Modern Feminism and History of Conciousness included). Two weeks into the class, someone fell asleep in the front row. He wasn't a teddy bear anymore. From that point on, the entire class was in fear for their life whenever he turned around quickly.

    The class was very hard. Homework was due every single class and there was a quiz almost every single class. Every class introduced brand new material. Miss a single class and you're totally screwed. Of course, after that class, most other CS theory courses were review ;O)

    The class wasn't easy, but man could he teach.
  • Huffman coding is optimal only if the probabilities of your alphabet all happen to be (1/2^N) for integer Ns. The problem with Huffman coding is that it can't allocate a fraction of a bit.

    Arithmetic coding achieves the equivalent of allocating fractions of bits and truely achieves the optimal compression in cases where the distribution of each symbol is not correlated to previous symbols.

    In most cases, though, the difference between Huffman and Arithmetic coding are negligible.

  • And those of us who did know him personally don't need to explain that it is a sad thing that he died... Of course it is! But David lived a long life (74) and did much with it; I don't think he left feeling like he wasted his time.

    Believe me (to the original poster), wishing I had been able to talk to him one last time may sound selfish, but learning that he had passed hit me like a ton of bricks. Many others who knew him no doubt feel the same way.
  • I think that's the first thing i've _ever_ seen on slashdot that made me cry.

    I took two classes from him. Barely passed one, failed the other --- his classes were incredibly difficult, but I learned more from them than from just about any other computer-related classes I took in school. For all that his classes were
    hard, he was a great guy, too: i remember
    getting a ride to class from him (until he
    had a flat), and the class buying pizza and
    delivering it to a study session, to his surprise,
    the night before the final.

    I'll miss him. More than I would have expected.
  • What you describe is probably feasible (although the increase in complexity may not be worth it for an RLE code).

    The LZH algorithm (Lempil-Ziv-Huffman) basically does this, though. The LZ77 part compresses by creating "tokens" which point back into a window of recently seen symbols, which is kind of like RLE except the "run" can be a mixture of symbols. These "tokens" then have a non-uniform distribution, which can be further compressed by Huffman codes (ie, give the more common tokens a shorter codelength). So the basic idea (tacking a Huffman (or arithmetic) coding step onto a front end pattern matching compressor), is a sound one.

    Actually, I think this is the technique that gzip and PKZip use as well (Maybe LZH is slightly different) Anyone?
  • Strolled over to my bookshelf and randomly chose a half dozen algorithm oriented texts. Davids work appear's as major chapters in all of them.

    It is sad that those that built the swings and see-saws in our playground are leaving us.

    David you will be missed, but never forgotten.

    mitd
    "Just an old-fart coder"

  • That's why I ALWAYS sat in the back row!

  • It is optimal for so-called instantaneous coding (where you know a symbol has been output as soon as you see its last bit).

    Interesting thing is that previous attempts at solving this problem relied on trying at each stage to seperate the characters into two sets of equal probability, then seperating these sets in two and so-on. Not so easy. It takes a long time and still doesn't produce very good results. Huffman's insight (obvious in hindsight) was to do it bottom-up. Start with the least frequent characters, unify them into a new character and repeat.

    Ah, the world needs more "lazy" folks like this.

  • Remember, a proof in computer science is like a proof in math.

    I know. Nevertheless, I thought I had a rather trivial counter-example. Fortunately, Gargle show my mistaken assumption.


    ---
  • Compression using the Huffman algorithm is commonly referred to as huffing [tuxedo.org]. Decompression was naturally dubbed puffing [tuxedo.org].

  • I've studied physics + maths at university for 3 years. Then I did medicine. I now a practicing medical doctor. Part time I'm studing comp-sci. I've been exposed to the theories of many great people in many fields...

    But this guy is one of the brightest in all the fields I've studied. This is a day to grieve for a great mind.

    Elivs

    modirators feel free to lower this comment.
  • Some names, that should/will be remebered (in somewhat chronological order - please correct me):
    • 1969 Thompson & Ritchie (Unix operating system)
    • 1971 Kernighan & Ritchie (C programming language)
    • 1975 Bill Gates (PC Operating Systems and Applications)
    • 1991 Tim Berners Lee & Dan Connolly (Cern, HTTP and HTML specification)
    • 1991 Linus Torvalds (Linux operating system)
    • 199? Marc Andreesen, Rob McCool, Eric Bina, Jamie Zawinsky et al. (NCSA Mosaic/Netscape browser and webserver)
    • 19?? (please add whoever you think deserves)
    :-)
    ms
  • As I read throught the bulk of the users' comments, it strikes me that only one made a mention of feeling sorry or bad that he died (and that was only for selfish reasons).

    One of the greatest men in the CS field died and all you guys can talk about is optimality of Huffman encoding.

    Ask yourself these questions... A) What did I ever make in grad-school that is still being used today B) Why should I care that he died?

    His death represents a signifigant turning point in the CS field; in some ways, it's starting the second generation..

    Any CS person from now on won't be able to say that they were alive at the same time Dr. Huffman was.

    And that makes me sad.

  • As an outgrowth of his work on the mathematical properties of "zero curvature" surfaces, Huffman developed his own techniques for folding paper into unusual sculptured shapes.

    You could you tell a bit about this one? I can only imagine flat surfaces here, but that wouldn't be worth mentioning, or?

  • Yup...
    I remember being introduced to Huffman compression in CS classes in Edinburgh...

    Can't claim to have known the guy, but may he RIP, and condolances to those who did know him.

    ~Tim
  • Most of us didn't know him personally, so we have no reason to feel sad at his death.

  • Just use FreeZip - no registration needed.

  • One of the reasons why I got into computer science and not just computers was to learn about the underlying theory behind the most common applications of computers. I always enjoy hearing stories told by the people who helped to create the science that I study.

    Last night in a crash course for an exam I came accross Huffman encoding. Needless to say it was much more interesting then what I was supposed to be studying :p. I specifically remember thinking just how big an impact such a technique had on CS. I also remember thinking that in all likelihood Huffman, like all the greats seem to have done, had already passed on. Damn I wish my insight wasn't correct.

    Seems like another legend has passed on and I won't be able to hear some great stories...

    rip Huffman, and thanks. This is why I read /..

  • I think that in retrospect the technical acheivements of Torvalds will be ignored. After all, at a technical level, he did nothing for computer science; he only copied a lot of people who came before him.

    True, he hasn't done much as far as technicall innovation (at least as far as I know), but his work was certainly a catalyst for bringing powerful, unix-style computing to the desktop, which has spawned numerous advances in computer science and education.

    And of course, let's not forget RMS. A lot of people don't agree with his views, but you can't deny that the FSF has greatly bettered the world of computing... remember, you likely wouldn't have Linux without GNU.

    ------
  • Good god, man! Ever hear of Huffman Coding? A Fucking brilliant and obvious ( as all great things are ) means of tokenising sequences of strings for later enlargement. It's as obvious as E=MC2. BUT nobody looked at the obvious so directly as he (and Turing, and Einstein, and possibly Feynman ) and his peers.
    I'm explaining badly here, so I'll bow out.
    Affiliations, politics, etc can darken a man/woman's vision. Huffman's vision at one early point in his life was extremely pure.
    That should be celebrated.
    BTW I'm 45 and my hair raises at the very mention of his name.
    Fare Well, old friend...
    Pete.
  • Hmmm... Huffman coding is rather obvious, but I fail to see how e=mc^2 was ever obvious. Remarkably simple and elegant, yes.
  • Didn't Shannon invent data compression in the Forties? IIRC, Shannon-Fano codes were a slightly clumsier version of Huffman coding which preceed Huffman coding by a few years.

    Please correct me if I'm wrong.
    --
  • I hade a class whith him last year. Where he wasn't the best teachear for every one I still thin he was a great guy and his preseance will be greatly missed around UCSC. He was the on CS profeser I knew who never rilly used his computer which was only a old mac se anny way.
  • Who is David Huffman..? Good god man, read the article.

    From it:

    Huffman is probably best known for the development of the Huffman Coding Procedure, the result of a term paper he wrote while a graduate student at the Massachusetts Institute of Technology (MIT). "Huffman Codes" are used in nearly every application that involves the compression and transmission of digital data, such as fax machines, modems, computer networks, and high-definition television.

    ------
  • by Anonymous Coward
    Perhaps if you bothered to read the article before shooting for first post, you'd recognize that Huffman, by force of genius, is one of those people who made modern computing possible. Man, any geek worth his salt takes the time to learn who's shoulders he/she is standing on. May he rest in discrete mathematical bliss.
  • by Sloppy ( 14984 ) on Friday October 08, 1999 @07:28PM (#1627798) Homepage Journal

    You're thinking of Phil Katz, although he himself stood on the shoulders of giants.

    Huffman code goes back to the 50s or 60s. Put very simply it's a way of representing information where the length of the code is short for more frequently occuring elements, and long for infrequently occuring elements.

    Nobody uses Huffman compression alone these days, but it is still very popular as a back end after a dictionary compressor. Such an approach is used by Zip's deflation, LZH (where the "H" is you-know-who), etc.


    ---
  • by MalcolmT ( 1868 ) on Friday October 08, 1999 @07:35PM (#1627799)
    I remember reading about Huffman encoding back in high school and it is probably largely responsible for me moving on from "computers as a game thingie" to "computers as something to program". The following is my recollection of how this term paper came about - although the details may be a little wrong, since the articles where I've read it are in boxes in the garage and I can't be bothered going down to hunt them out at the moment. :-)

    Anyway, just thought I'd share my memories about them man.

    Huffman and a few others were taking a graduate course from a professor who gave them a choice about how to be assessed. Either they could sit an end-of-term exam (there may have been some assignments involved throughout the semester as well), or they could write a term paper. Each student could make their own choice.

    Most of the class members chose the end-of-term exam, but Huffman (he admitted later) was a slightly lazy student and so decided on the term paper, thinking he could knock it off in a couple of weeks and get an easy credit. Unfortunately (for him, luckily for computer science) he kept putting off the work and suddenly realised he was running out of time to write the paper and couldn't even think of a topic. To make matters worse, he had missed (or not concentrated in) so many of the lectures that the option of renegging and doing the exam was no longer really open to him.

    In desperation, he asked the professor for a suggested topic. The professor (I wish I knew his name - he deserves a place in history as well!) posed a problem about compressing data. Huffman struggled with the topic for quite a while but eventually (quite close to the deadline, IIRC) came across a very elegant solution that worked beautifully. He was even able to prove that his solution was optimal, in the sense that no better byte-by-byte compression method was possible (this doesn't include things like LZW, etc, which use run-length compression techniques).

    The rest, as they say, is history. As an aside, Huffman passed the course with top marks, since the professor had neglected to inform Huffman that he (the professor) and a colleague had already put extensive work into the problem and failed to solve it satisfactorily.

    Just goes to show, sometimes the best work *is* done under pressure. :-)

    (OK, let the error-pointer-outers go to work. Where did I mess up?)
  • He will be missed, I liked PKUNZIP, it didn't have a stupid shareware message like Winzip.
  • by layne ( 15501 ) on Friday October 08, 1999 @07:35PM (#1627801)
    of Dr. Huffman doing what he loved can be found here [sgi.com].

    Dr. Huffman was another modest genius whose work doesn't fit in the hoary pigeonholes of Nobel candidates but is as ubiquitous as the DSP. It's no surprise he seems anonymous; these are the sort of stories that make me so often grateful for /.

    Thanks David.
  • Who on earth puts out a PRESS RELEASE when someone dies? A news article maybe, but a press release would make it seem like a happy thing, which it isn't. Goodbye PKUNZIP :(
  • Man, any geek worth his salt takes the time to learn who's shoulders he/she is standing on.

    ...Not to mention putting such obvious flamebate out without logging out and becoming an anonymous coward :-) Hey wait a minute, you're an ac with a more useful post. Yours is at least not flamebait. Well anyway, flame the "first post" guy away! lol

    Patrick Barrett
    Yebyen@adelphia.net

  • Californians
  • It's sad that more people don't know how much David Huffman shaped today's world. Here's a prime example of someone that will pass on, and no one will notice... they're too busy using the machines that he helped create.
  • It seems the computer industry has reached an age where pioreers and educators are beginning to pass away. Postel, Stevens and now Huffman. These are truly sad times. My condolanses to his family.

    It does make you wounder, who will be the "heros" of tomorrow?

    B. Johannessen

  • gnu's "tar zxvf" all the way baby! GO GNU! I wonder if Huffman was an Open Source advocate... anyone have a clue? I wanna know.

    Patrick Barrett
    Yebyen@adelphia.net

  • He was even able to prove that his solution was optimal, in the sense that no better byte-by-byte compression method was possible (this doesn't include things like LZW, etc, which use run-length compression techniques). ...[snip]... (OK, let the error-pointer-outers go to work. Where did I mess up?)

    I'm skeptical about the "proof" part of the story, since Huffman coding is really just a quick-n-dirty (but much easier to implement) approximation of arithmetic coding. Arithmetic coding will match or beat Huffman every single time, if compression ratio is all you care about. I'm not saying Huffman coding isn't extremely cool, but it's certainly not optimal. It sure is elegant, though.


    ---
  • Given an input alphabet of fixed length, Huffman coding can be *proven* to produce optimally short output sequences. Translation: you're wrong.
  • Press releases are normal for people who were significant in their fields. Almost all politicians, for example, have press releases issued when they die, regardless of whether you've heard of them before or not. Likewise with actors, musicians, etc. It is not uncommon for a press release when, say, someone who won a Nobel dies.

    If, say, your father had died, would you let some hack reporter write his obituary if you could do it yourself?

  • by Anonymous Coward
    I've never heard this story in a CS context. Something like this did happend in the field of operations research: George Dantzig, while a young graduate student at Berekely several decades ago, walked in late to a statistics class. He copied down the problems on the board, assuming they were for homework. He solved the general linear programming problem, inventing the simplex algorithm to do so. After that, my memory for the details gets fuzzy. (1) He either remarked to the professor (Jerzey Neymann?) that one of the problems was much harder than he expected, and he couldn't solve the other at all. The professor responded that the problems weren't for homework, but the two most important open problems in the field. (2) Or he silently turned in his homework and, weeks later, the professor stumbled across the assignment, recognizing Dantzig's accomplishment. It's a great anecdote. You can find it recounted in the collection "More Mathematical People," a fabulous book. Be sure to check out the chapter on Bill Gosper too.
  • In the geek/male world I think one of the highest complements to a person is to have their work talked about, criticized, etc.

    For me it's the same as erecting a monument.

    At least these comments weren't about PKzip 8(.
  • He was even able to prove that his solution was optimal

    Not quite. The Huffman algorithm has been proven to produce optimal prefix code, which are codes such that no codeword is also a prefix of another codeword. According to my algorithms book, Huffman compression saves 20% to 90%.
  • The above description is how I remember him telling us how he came across the solution. I was lucky that the teacher for our data compression class was out the first week of school. Prof. Huffman subbed and taught us about Huffman codes (imagine that). It is amazing how simple of a concept it is. He was a very intelligent guy and a very demanding teacher. I also had him for the undergraduate signal analysis class (about a 90% failure rate as I recall). I only lasted 4 weeks. That was the hardest class I have ever taken. You had to know calculus very good to have a chance to keep up with the rest of the material (and I hadn't taken calculus in over a year). People will probably start passing that class now but I dought they will learn as much. I knew people who took it knowing they would fail. But they knew they would learn more than in most other classes anyway. He will be missed.
  • Compression is made up of modelling and coding. The coding step transforms a sequence of probability values between 0 and 1 to a bit stream which must obviously have the property that you can recreate the original probabilities from it. Huffman coding is optimal iff each probability can be represented as 2^(-k), k being >= 1. This is a very rare case, all prob. values must be 1/2, 1/4, 1/8 etc. So, other coders like arithmetic coding beat Huffman most of the time. Unfortunately, there are nasty patents on ar. coding, that's why JPEG files using arithmetic coding (they're possible) are not in general use although they give about 5 to 10 percent improvement in compression over the widely used Huffman JPEG's.
  • Well, you'd get slightly more optimal compression for the runlengths to have a separate tree, and you'd also be able to have runlengths >256. Saves a few bytes. :) And the position of the RLE code will certainly depend on what sort of data it is. For an image with lots of solid colors, it would be nearer the top, but for your average textfile it'd be closer to the bottom.

    And yeah, DH has the advantage where the tree starts out with all leaves being at the lowest level (basically being binary code, heh)... I guess a priority heap would be a good implementation to use, yeah... I just figured one would rebuild the tree after each character. It doesn't take *that* long. :) (Of course, for a large file that'd be real painful.) By removing the initial tree, however, you do lose the implicit encryption in the general case, but there's no reason you couldn't have a different starting tree, and then you could also use trees tuned for different applications and get a bit more compression anyway. Even with non-optimized initial trees (i.e. all characters at the bottom), you have about 8.578E506 initial trees to choose from, though admittedly that would start out with just your basic one-to-one replacement thing, and if anyone wanted to crack the code it'd be easy to just assume the trivial, ordered tree, decompress to trivially-compressed plaintext, and then work out the substitutions later. Ohwell.
    ---
    "'Is not a quine' is not a quine" is a quine.

  • Bill Joy (BSD, csh, vi, curses, Sun Microsystems) Kirk McKusick (BSD) Donald Knuth (TeX, The Art of Computer Programming) Stephen Wolfram (Wolfram Research) Steve Wozniak and Steve Jobs (Breakout, my favorite arcade game, Apple) Seymore Cray (Cray Research) Vint Cerf, Jon Postel (Arpanet) And I am going to go out on a limb, but I think that in retrospect the technical acheivements of Torvalds will be ignored. After all, at a technical level, he did nothing for computer science; he only copied a lot of people who came before him.
  • It isn't very well known that Huffman created some very ingenious math to describe the folding phenomenon of the paper. Unfortunately, he fell ill before he was able to publish any of it. Colleagues at Santa Cruz are looking into publishing his results on his behalf.

    These aren't simple, obvious equations or paper folding for a hobby, these are the elegant works of beautiful mathematics that only a genius like Huffman could come up with. Huffman will be missed at Santa Cruz, and by the rest of the pepole in the world that will now never get to know him.
  • For those interested in exactly how Huffman encoding works, click here [gate.net] for a brief and wonderfully lucid account. (This is my private archive of a long-dead web page by John Morris).

    Joe

  • I was also one of Huffman's students. He was certainly a brilliant individual, there's no doubt about that. However, he was dismal as a teacher and students generally hated his classes. It was clear that he also hated having to teach us. You could learn a lot from him, but it was like fighting a war.

    Nevertheless, he'll be missed.
  • Once you understand michelson-moreley (sp?) most of special and general relativity are obvious. Not *easy* mind you, but *obvious*. I don't easily remember the emc2 connection at this moment but it's nothing next to the initial insights.
    If you'd attempt a true (even spam-camouflaged) e-mail, we'd discuss it, but since you won't, I can't illuminate you further. You lose.
  • Huffman did a lot of good work, but to me he was an arrogant bastard. I had him for three classes at UCSC. You could open notes from several years earlier and get the same information. But that's not what caused me to dislike him.
    First midterm of my first class of college, I was asked to pick up my midterm from his office. Strange, but I didn't know any better. I was asked to think about what I did during the test very carefully. Even stranger. I told him to the best of my ability. "Came in, took test, handed test to you, left." I was then accused of lying and that it would go easier on me if I named accomplices. After several iterations of this, I was finally accused of cheating. He displayed the evidence. Two tests with my name on it. Similar printing styles. I was able to identify which one was mine. Neither one was a stellar performance of scoring. High 70's and low 80's. More accusations and since I had no clue on how this happened, he figured I was protecting someone. He sent me out, saying he was going to get me thrown out of college. It's a good thing I ran into a TA who mentioned seeing duplicate homeworks. She suggested checking the registration office. The next morning I find out I have a name in common with someone else at UCSC. He happens to be auditing CIS 10. I go to Huffman and explain my find. I was grudgingly allowed back into class while he checked this out.
    So that's my memory of Huffman. He nearly turned me complety off computer science. Hopefully, my story wasn't common to most students.
  • and an article [plover.com] with a good exposition. Bricolage . . . fine word for this eulogy.
  • Makes many exercises in Origami look clumsy by comparison. Some of those folds he did are an amazing piece of art unto themselves- they're unbelievably beautiful. If you haven't followed up that link, http://www.sgi.com/grafica/huffman/ [sgi.com] do so now- if you like geometric artforms, you're in for a treat.
  • Not at all. For every decodable code, there's an equivalent prefix free code. So it's sufficient to consider only prefix free codes. So the Huffman code is optimal among all decodable codes. Note that you can actually encode more than one symbol at a time
    i.e. if you've symbols A & B, you can code it as AA,AB,BA,BB. By coding more and more symbols at a time, you can eventually reach the limit determined by the entropy of your signal source.

  • Hmm.. I haven't seen the proof, but I can show you a very simple counter-example to right here. Got a minute?

    Suppose we have an alphabet of only 3 characters, and each character occurs with equal frequency. This is a nasty example, since neither arithmetic or Huffman are going to compress it well.

    Look what happens when you assign the Huffman codes: you're going to have one 1-bit code, and two 2-bit codes. On average, characters are going to be represented by 5/3 bits (about 1.666 bits).

    With arithmetic compression, characters are going to be represented by log2(3) bits (about 1.584 bits). log2(3) is less than 5/3. QED.


    ---
  • Ah, Gargle's note below shows me my mistake. I ass/u/me'd a situation where you only encode one character per symbol. By encoding larger chunks, Huffman can approach the log2(3) case. Ok, you got me. I was wrong. :-)


    ---
  • Man, how do you do that? Those curved folds rock. I'd put those on my wall. Anybody know any books on how to do that stuff? My brother rocks at origami but I've never seen him do curved stuff like that before.
  • It does make you wounder, who will be the "heros" of tomorrow?

    Why... Rob Malda, of course. ;)

    On another note, Bill Gates has become a sort of "nerd hero" to those who don't know any better. Anyone who isn't a real techie would consider Bill Gates to be the quintessence of "geek" and "nerd".

    And, of course, Linus, RMS, etc.

    Those will be the best known, of course. The rest, like in every other major area, will rarely be remembered.

    --

  • http://www.sgi.com/grafica/huffman/
  • that post is
    1. Tastless
    2. Incorrect, space "savings" must be less than 100%, otherwise you have nothing left.

    -----
  • I know a bit about Huffman encoding.
    Is it appropriate (or has this been done) to combine Huffman and run length encoding?
    Like reserve a huffman symbol for specifying that "the next symbol" will repeat for a specified amount of time. You'd prolly reuse all the symbols that you've already generated for all 256 bytes to specify the length of the "run".
    So if you encounter the RLE symbol (say "11") you'd know the next symbol after that specifies run length, and the one after that is the symbol that is running that length. It would be pretty efficient especially if the document didn't have too many characters that repeated over 256 times. Maybe a header to specify how many symbols represent a length for files that repeat too much (might be overkill)..
    Am I stating the obvious? Cause I really think it would be pretty cool :)

    Huffman encoding is darn elegant.
  • ok, this first post stuff it getting a tad bit old.

    -----
  • What do you mean you don't believe the proof that it's optimal? You think there's an error in the proof? If you insist, I can dig out the proof and post it here, perhaps you can show us the error?

    Remember, a proof in computer science is like a proof in math.

    If you are just complaining that Huffman coding is not always the best compression method, well, nobody ever claimed it was.

    What it is, is a greedy algorithm which constructs a provably optimal "prefix code". Furthermore, it is possible to show that the optimal compression achievable by a character code can always be achieved with a prefix code.
    A character code is one in which each byte (or other symbol) of the input is translated to a single symbol of output. The compression achieved in Huffman coding is optimal for character codes.

    LZW, arithmetic coding, and other compression algorithms are not character codes.

    Torrey Hoffman (Azog)
  • You can feel sad if you want.

    You should not tell us that we should feel sad.

    That is not your place or right to command.

    The greatest eulogy is that we are discussing his life and work. A remarkable man has passed and now we discuss his accomplishments on an Internet bulletin board something that did not exist or was even considered a possibility when Huffman was born.

    Huffman's passing does not represent anything in the CS field; it is his LIFE which does!

    I am not sad he is dead.

    I am happy that he once lived!
  • by Azog ( 20907 ) on Saturday October 09, 1999 @08:01AM (#1627844) Homepage
    Since several people have posted on "optimality", without knowing exactly how Huffman coding works and how it is optimal, I thought I'd post an explanation. This is based on the chapter in "Introduction to Algorithms" by Corman, Leiserson, and Rivest. (An excellent undergrad textbook.)

    First, definitions:

    A character code is a code in which each symbol of input is translated to one symbol of output. (Compression is achieved in character codes by using variable length codes. Short code symbols are used for frequently occuring inputs.)

    A prefix code is a code in which no symbol is a prefix of another symbol. For example, if you have "10" as a symbol, then you can't have "101" as a symbol in a prefix code. Prefix codes are nice because they greatly simplify encoding and decoding.

    It is possible to prove that the optimal data compression achievable by a character code can always be achieved with a prefix code. Furthermore, it is possible to represent codes as a binary tree, in which a symbol "1" represents the left branch, and "0" the right branch. So "101" would be left, right, left.

    It is easy to prove that an optimal code for any particular file is represented by a full binary tree, i.e. one in which each node has two leaves. (This is the first exercise of the chapter).

    Now, Huffman coding is a greedy algorithm that constructs an optimal prefix code. The algorithm builds the tree T corresonding to the optimal code in a bottom-up manner.

    Pseudocode:
    Assume C is a set of n characters, and each c in C is an object with a frequency f(c). A priority queue Q, keyed on f, is used to identify the two least-frequent objects to merge together. The result of the merger of two objects is a new object whose frequency is the sum of the frequencies of the two objects that were merged:

    HUFFMAN(C)
    n = size of C
    Q = C
    for i = 1 to (n - 1)
    {
    z = new node()

    x = z.left = Q.ExtractMin()
    y = z.right = Q.ExtractMin()
    f(z) = f(x) + f(y)
    Q.Insert(z)
    }
    return Q.ExtractMin()

    And that's it!

    To prove this is correct, prove that the problem of determining an optimal prefix code: (a) exhibits the greedy-choice property, and (b) exhibits the optimal substructure property

    From these two lemmas, the theorem
    "Huffman coding produces es an optimal prefix code" is trivial to prove. The corollary is that Huffman encoding produces an optimal character code.

    Torrey Hoffman (Azog)
  • As a math major also taking CS classes from Huffman, I had real issues with his concept of "intuitive proofs". Personally, I had doubts whether he consistently knew the math well enough to do formal proofs. Actually, much of the early info theory work was relatively simple mathematics where the famous names of CS were the first to pose the particular problems and were able to solve them. Sometimes great work is the result of solving a very difficult problem (ex. Fermat's Last Theorem by Wiles) and sometimes by finding new topics (Huffman). David Huffman was a nice man that will be remembered for a very long time for having discovered something fundamental to CS.
  • Well, I think I'd take exception to the statement that "it was clear he hated having to teach us". After all, the guy was teaching well into his 70's, and I don't think he had to. I asked him once whether he was employed as a researcher, and had to do teaching as well, and he said, "No, I'm employed to teach".

    I do believe he was generally a bit annoyed by the many students who were unprepared to take his classes (in his mind). He was demanding, to be sure (I generally spent ~ 15 hours each week on his homework alone; but I was woefully unprepared :), but I think he honestly did enjoy it when he saw his students begin to "get it." I know many didn't like his methods, but I enjoyed how he did NOT teach directly from a book, and instead, prepared class notes and assignments for each lecture; clearly he wasn't just phoning it in.

    BTW. I'm class of '93; when were you there?
  • Yes, so you code more than one symbol at a time.
    An example:

    Symbols A, B. P(A)=0.8, P(B)=0.2
    Code A=0, Code B=1, Expected length=1

    Code 2 symbols at a time:
    P(AA)=0.64, P(AB)=0.16, P(BA)=0.16, P(BB)=0.04
    Code AA=0, AB=10, BA=110, BB=111
    Expected length=0.78

    Code more symbols at a time if you want (but returns are diminishing)

    The shortest possible expected length is given by the entropy which is equal to -P(A)logP(A)-P(B)logP(B) =0.72

    I'm not sure how arithmetic coding works, but I highly doubt that it gives you the theoretically optimal coding just like that.
  • That's actually a pretty neat idea... you'd probably want a separate Huffman tree for runlengths though, rather than using a fixed-size runlength size. I don't think it'd be really that useful in real-world data, but it wouldn't hurt any on top of normal Huffman (except by having an additional entry in the tree and a second, albeit small, tree). However, I think that other schemes such as dynamic Huffman (where the tree is modified as it goes based on the changing frequencies) would likely work out better on average. Perhaps dynamic runlength Huffman would be good... and of course, it's all lossless, for that extra-specialness. :)
    ---
    "'Is not a quine' is not a quine" is a quine.
  • by ChadN ( 21033 ) on Friday October 08, 1999 @09:06PM (#1627850)
    Well, I was fortunate enough to learn David Huffman's compression coding technique from the man himself. While at UC Santa Cruz, I took every course that he taught, and I think I learned more in those 4 classes than nearly all my other technical courses combined. Professor Huffman really was a genius, and taught me some amazing things about insight into a problem. I had no idea he was ill, and had been planning to talk to him about a few things (such as publishing his class notes and lectures; his problem sets were EXCELLENT teaching tools). He also wrote a recommendation for me to get into graduate school, and I never properly thanked him for it, which I deeply regret.

    In any case, the story above seems mostly accurate; Huffman's teacher was in fact Richard Fano, who along with Claude Shannon, was one of the early architects of "information theory". Every aspect of our modern life, from our use of CD players, to wireless phones (to name but a few) came as a result of these ideas. In fact, Shannon-Fano compression coding is a related form that was developed before Hufmann's technique. However, since Shannon and Fano knew it wasn't "optimal" (a much abused word), and spent much time trying to come up with an optimal algorithm, they thought it was impossible. So, by giving the assignment to Huffman as a project, he hoped to show him how such as easy sounding question was in fact quite difficult (or impossible).

    Well, Professor Huffman was brilliant at boiling problems down to their most basic nature, but after about a week of thinking about it on and off, he realized that he probably wouldn't be able to figure it out, and that it was a much harder problem than it at first seemed. So he crumbled up all his notes and began to prepare for the final, but as he tossed his notes into the garbage can, he says he had a moment of clarity, and suddenly realized the process was actually quite simple.

    So when the time came to present his "project", Professor Fano called on David to discuss his method of producing minimum redundancy codes, assuming of course that David would have come up with a non-optimal method, or have to admit that it seemed very hard or impossible. Instead, Huffman went to the chalkboard and gave a quick explanation of how to produce what we now call "Huffman codes", then sat down. Fano apparently slapped his forehead in amazement and said (in French) something like, "It CAN'T be that easy!". But it was, and Huffman, never having been told that this problem was "impossible", dutifully solved it.

    Huffman's achievements go well beyond his coding technique, and are well worth looking up (among other things, he produced some novelty papers about optical illusions that have become very useful in machine vision circles). I can tell you, as a former student, that he was both loved and loathed, but as someone who was willing to put in the work, his classes were incredibly enlightening.
  • Just a quick note about arithmetic coding: It can (essentially) produce codes of fractional length, unlike Huffman codes, so in the simple model above where only the frequency of single characters is modelled, it would almost certainly beat a similarly modelled Huffman code. The trick is that it holds more state information about the code, and can "delay" outputting bits until the fractional codelength fills up the next bit completely. Thus, it can produce shorter length codes than Huffman (using simpler statistical modelling), but you have to wait an indefinite amount of time for the next bit to be output.

    It can be shown to be "theoretically optimal" in this sense as well. I believe Glen Langdon has shown that both techniques are eqipotent, in that, given sufficiently powerful modelling techniques, neither can outdo the other as a coding method (ie. they both can approach the entropy limits). However, as a practical method of coding, they each have their different strengths and weaknesses, and their different applications.
  • With the "H" standing for Huffman.

    I had some demonstration C-source code for various compression methods once, but I never could really understand it at the time (probably because I was 12 then).

    Although as I understand it other people's work was used as well, the work done by these compression pioneers was cruical to the rapid flow of information we have today.
  • I was thinking that you'd reuse the codes you've already made for the bytes for the runlengths - it would prove easier and i don't think it be much less efficient.
    Is that what you were thinking - or does the second tree do something else?
    Like you say, this run length thing would put an additional entry on the tree ...most probably somewhere near the top(?)...but i guess the savings from RLE would negate the loss :)

    Dynamic Huffman would actually be good for large files since you wouldn't have to parse the whole file first :) I guess it's implemented like a priority heap. Dynamic Runlength Huffman does actually sound pretty cool :) - I might actually try to write something in Java to play around with the idea - I've never written a compression routine before - will be fun ;).
  • I really feel as though the nerd comunity (I prefer to think of us as geeks) is losing our role models quickly these days.

    I just want to say that we all need to spend time studying the likes of Huffman, Greenspun, and Minsky, Mandelbrot, so that good ideas can stay alive.

    Thanks.

    David

"An idealist is one who, on noticing that a rose smells better than a cabbage, concludes that it will also make better soup." - H.L. Mencken

Working...