Media Data Storage

Better Than JPEG? Researcher Discovers That Stable Diffusion Can Compress Images (arstechnica.com) 93

An anonymous reader quotes a report from Ars Technica: Last week, Swiss software engineer Matthias Buhlmann discovered that the popular image synthesis model Stable Diffusion could compress existing bitmapped images with fewer visual artifacts than JPEG or WebP at high compression ratios, though there are significant caveats. Stable Diffusion is an AI image synthesis model that typically generates images based on text descriptions (called "prompts"). The AI model learned this ability by studying millions of images pulled from the Internet. During the training process, the model makes statistical associations between images and related words, making much smaller representations of key information about each image and storing them as "weights," which are mathematical values that represent what the AI image model knows, so to speak.

When Stable Diffusion analyzes and "compresses" images into weight form, they reside in what researchers call "latent space," which is a way of saying that they exist as a sort of fuzzy potential that can be realized into images once they're decoded. With Stable Diffusion 1.4, the weights file is roughly 4GB, but it represents knowledge about hundreds of millions of images. While most people use Stable Diffusion with text prompts, Buhlmann cut out the text encoder and instead forced his images through Stable Diffusion's image encoder process, which takes a low-precision 512x512 image and turns it into a higher-precision 64x64 latent space representation. At this point, the image exists at a much smaller data size than the original, but it can still be expanded (decoded) back into a 512x512 image with fairly good results.
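
To make that round trip concrete, here is a minimal sketch of what an encode/decode pass through Stable Diffusion's variational autoencoder might look like using the Hugging Face diffusers library; the model ID, file name, and preprocessing details are assumptions for illustration, not Buhlmann's actual code.

    import numpy as np
    import torch
    from PIL import Image
    from diffusers import AutoencoderKL

    # Load only the VAE portion of Stable Diffusion 1.4 (assumed model ID).
    vae = AutoencoderKL.from_pretrained("CompVis/stable-diffusion-v1-4", subfolder="vae")
    vae.eval()

    def to_tensor(img: Image.Image) -> torch.Tensor:
        # 512x512 RGB pixels scaled to [-1, 1], shape (1, 3, 512, 512)
        arr = np.asarray(img.convert("RGB").resize((512, 512)), dtype=np.float32)
        return torch.from_numpy(arr / 127.5 - 1.0).permute(2, 0, 1).unsqueeze(0)

    with torch.no_grad():
        x = to_tensor(Image.open("candy_shop.png"))   # hypothetical input file
        latents = vae.encode(x).latent_dist.mean      # (1, 4, 64, 64) latent "compressed" form
        decoded = vae.decode(latents).sample          # back to (1, 3, 512, 512), lossy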

While running tests, Buhlmann found that images compressed with Stable Diffusion looked subjectively better at higher compression ratios (smaller file size) than JPEG or WebP. In one example, he shows a photo of a candy shop that is compressed down to 5.68KB using JPEG, 5.71KB using WebP, and 4.98KB using Stable Diffusion. The Stable Diffusion image appears to have more resolved details and fewer obvious compression artifacts than those compressed in the other formats. Buhlmann's method currently comes with significant limitations, however: It's not good with faces or text, and in some cases, it can actually hallucinate detailed features in the decoded image that were not present in the source image. (You probably don't want your image compressor inventing details in an image that don't exist.) Also, decoding requires the 4GB Stable Diffusion weights file and extra decoding time.
Buhlmann's code and technical details about his findings can be found on Google Colab and Towards AI.
  • So it will be used for porn, then?
  • Hey ... ! (Score:3, Funny)

    by thomst ( 1640045 ) on Wednesday September 28, 2022 @08:10AM (#62920629) Homepage

    Actual news for nerds!

    Next thing you know, the comments will all be technical in nature - and then the Universe will end ...

    • I'll do my part to help it end. I imagine the problem with ML/AI image compression compared to an algorithmic one is that you cannot predict what it may decide to omit (or include!) in some important image some day.

      Or replace, for that matter -- say a formation of bomber planes with a school of fish, for some unfathomable reason.

  • by mobby_6kl ( 668092 ) on Wednesday September 28, 2022 @08:17AM (#62920649)

    If you've played around with textual inversion, you can train it to understand a new object from a handful of images and the result fits in a ~5 KB embedding. You can also just type "mona lisa" and get the original image back, so that's basically 100% compression.

    There are some issues, of course:
    * Computationally expensive to encode
    * Requires the whole 4GB model to be available and in memory
    * Computationally expensive to decode

    • Also, presumably this high compression rate only applies with images that have elements similar to things the AI has been trained with. Sort of like a text compression algorithm that is optimized for common English letter combinations ("and", "the", etc) and would have trouble compressing text from another language (or streams of random letters).

      • Short answer on what this compression scheme is going to do: convert images into cartoons.

      • Not only presumably, but inevitably via the No Free Lunch theorem. There is no way to assign unique short numbers to every one of a very large number of things, and if you're equally likely to need to refer to any of them, you can't preferentially assign the short numbers to the most-often used items, so there's nothing to optimize. You're stuck.
      • That's a "problem" for all compression algorithms. It's not possible to compress some possible data without expanding other data. The key is to compress the likely data and expand the unlikely.

        • You can avoid the need to expand the unlikely data by having an escape sequence (guaranteed not to appear in the bitstream otherwise) that tells the decoder that the next X bytes are uncompressed data. That way, if some chunk of data ends up being larger when compressed than when it was uncompressed, you add the escape sequence and store that chunk of data uncompressed.

          This allows you to have a guarantee that the compressed data won't be larger than the
          • Re: (Score:3, Insightful)

            by BobbyWang ( 2785329 )

            Adding the escape sequence is the expansion. In the unlikely case where no data can be compressed, the escape sequence itself will make the "compressed" data larger than the uncompressed data (by at least one bit).
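
            A minimal sketch of the chunk framing being debated here, in Python; the container format is invented purely for illustration, and the per-chunk header is exactly the unavoidable expansion described above.

                import zlib

                CHUNK = 64 * 1024  # arbitrary chunk size for the toy format

                def frame_compress(data: bytes) -> bytes:
                    out = bytearray()
                    for i in range(0, len(data), CHUNK):
                        chunk = data[i:i + CHUNK]
                        packed = zlib.compress(chunk, 9)
                        if len(packed) < len(chunk):
                            out += b"\x01" + len(packed).to_bytes(4, "big") + packed  # compressed chunk
                        else:
                            out += b"\x00" + len(chunk).to_bytes(4, "big") + chunk    # stored raw
                    return bytes(out)

            Worst case, incompressible input grows by five header bytes per 64 KiB chunk, which is the "at least one bit" expansion in question.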

    • Technically, "mona lisa" won't be 100% compressed: to store it you will need the "mona lisa" text PLUS the 4GB model. So it's kind of a negative compression.

      • Just consider the model to be part of the decoder, and not the image! We don't count the size of the jpeg implementation in the image size after all.

    • It's compressed down to 9 bytes. So, an image of the Mona Lisa is smaller than a "black square".
  • font file that does for pictures what a font does for text.

    So if the "font" is swapped out, you get a distinct change in style?

  • by OzPeter ( 195038 ) on Wednesday September 28, 2022 @08:37AM (#62920707)

    If you look at the heart emoji, the Stable Diffusion version is totally different from what is in the other pics. So it's as if Stable Diffusion noted "There's a heart emoji, I'll just use what I have in my library rather than what was in the original pic".

    So sure, use Stable Diffusion for compression if you are happy with it actively changing your images. But that's not what I use compression for.

    And that's not even addressing the 4GB of image weight data that you need to download first in order to use it.

    • Well, JPEGs and GIFs also change your images. They're lossy, after all.

      I think it would just come down to whether a difference in some detail is acceptable in a particular use case. For instance, does it really matter for the llama image if the heart has a bit of a shine on it or not? Whoever made the original image might be very specific about the type of heart emoji, or they might not care. I, as a viewer, don't see a meaningful difference.

      • by OzPeter ( 195038 )

        Well, JPEGs and GIFs also change your images. They're lossy, after all.

        Compression artifacts are not the same thing as wholesale replacement of parts of your image.

        • It's still information loss, though. Which type of loss is preferable depends on the specific case, IMO.

      • by GoTeam ( 5042081 )

        I, as a viewer, don't see a meaningful difference.

        Ah, so like Amy Adams and Isla Fisher, or Tom Cruise and Don Knotts?

      • by Junta ( 36770 )

        In the llama image, it replaces the lettering with scribbles. This is a sign of how it could go quite wrong. Instead of replacing text with unintelligible scribbles, imagine it manages to remember 'uhh, it's text of some kind' without recording the specific text; it could then replace some text with other text, potentially with a quite different meaning.

        Mix and match with various other contexts where there's a deliberate detail that is intended to be significant, and the model discards the detail and repla

        • Of course it's not practical; it's just a hack on top of a model not intended for this type of use. And because storage is relatively cheap, we might never need to move beyond JPEG anyway.

          But, fundamentally, I don't see how it's not a valid alternative, at least theoretically. You're still losing information; it's just that specific details get replaced rather than the overall sharpness degrading. Whether or not those specific details are important, I think, depends on the use case. Of course some limitations

          • by Junta ( 36770 )

            With the traditional approaches, lost detail is, well, plain lost. You can always 'AI upscale' it to look more pleasing to the eye, but it conveys the information that is actually retained, and makes it clear what detail is lost, in an intuitive way.

            With this approach, it isn't clear whether detail is from the source image or was injected by the model. Which for innocuous detail may not matter, but sometimes knowing whether a detail is authentic or not is important to the meaning of the content.

      • Gif is NOT lossy. The conversion to 8 bit may be lossy, but that is something you do before saving to gif. Gif itself uses LZW compression, which is lossless. Also, it is possible to use gif for truecolor images, but very few applications support that.

      • Two big show stoppers: very long "decompression" time and AI hallucination. And yes, it looks better than WebP at extreme compression ratios, but the image is still ugly, and I bet the difference between WebP and SD at lower compression ratios is less interesting (with more hallucinations in bigger images, maybe?)
      • Xerox once had a problem with photo copiers changing digits [theregister.com] because of the compression algorithm they used. Not quite so innocuous.

    • Solution: train it on every available heart emoji and hope that it amalgamates them into the One True Heart Emoji.
    • 4GB is nothing in this day and age, and you can have a lot of other fun with those 4GB anyway.

      • 4GB is nothing in this day and age

        Nothing? Try sending that image to your phone or your smart TV.

    • by dbialac ( 320955 )
      Yeah, that 4GB "by the way" at the end of the article was downright comical.
    • by AmiMoJo ( 196126 )

      You could add a stage that looks at the output Stable Diffusion generated and decides if it is close enough to the original to be valid.

      You could also take the output of Stable Diffusion and compress the difference between that image and the original separately. That's what FLAC does for audio: a lossy compression stage, then losslessly compressing the difference between that and the original. The second stage doesn't have to be lossless either, if you don't need a lossless copy.

      The main issue is that the g
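
      A rough sketch of that second stage, assuming the original and the Stable Diffusion reconstruction are same-shaped 8-bit numpy arrays (names and the zlib back end are just illustrative, not an existing codec):

          import zlib
          import numpy as np

          def encode_residual(original: np.ndarray, reconstruction: np.ndarray) -> bytes:
              # int16 residual so negative differences survive the round trip
              residual = original.astype(np.int16) - reconstruction.astype(np.int16)
              return zlib.compress(residual.tobytes(), 9)

          def decode_residual(reconstruction: np.ndarray, packed: bytes) -> np.ndarray:
              residual = np.frombuffer(zlib.decompress(packed), dtype=np.int16)
              residual = residual.reshape(reconstruction.shape)
              return np.clip(reconstruction.astype(np.int16) + residual, 0, 255).astype(np.uint8)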

  • by Gravis Zero ( 934156 ) on Wednesday September 28, 2022 @08:48AM (#62920745)

    If it's possible to increase the storage efficiency given a larger input pool, then this may be useful for large social media sites like Facebook that have pictures of a lot of similar things. Instead of storing a link to a file, the server would store the "description" (presumably a few KiB), which would be used to recreate the desired JPEG using dedicated hardware. Doing so would reduce costs for long-term storage at the expense of the cards and a little extra energy.

    • In the example given, the reconstituted picture is a *mostly* credible alternative picture to the original. Details are lost, but mostly replaced with alternative details that look believable. If you hold a reconstituted image next to a pristine original, you can easily spot differences. If you don't say which is which, you might not be able to identify the original, but you will know they are different.

      Now on the one hand, you say no big deal all those insipid unoriginal posts just get replaced by filli

      • If you hold a reconstituted image next to a pristine original, you can easily spot differences.

        Not really. Did you look at the example images?

        • by Junta ( 36770 )

          Yes, for example, the pattern of straw and the pattern of fur is replaced with a different pattern of straw and different pattern of fur.

          Ignoring the lettering and the emoji, which got *really* messed up along the way, and the fact that it doesn't deal well with the glare+lettering on the strap. But even in the parts that don't look artificial, there are differences in the detail; it's just that the replacement detail still (mostly) looks realistic, and the 'big picture' is preserved.

          • If it's cost effective then I see no reason why social media companies would decline to use it. Users are the product, they don't care about them.

            • by Junta ( 36770 )

              An AI mistake could be harmless, but if it makes a certain sort of mistake it could become libel, if it makes a sign say something else, or if it knows 'some hand sign' is being displayed, but lost the detail and replaced it with an offensive gesture.

              I think the social media company would rather waste storage or lose visual fidelity than risk an AI filling in the wrong detail at the wrong time.

              • An AI mistake could be harmless, but if it makes a certain sort of mistake it could become libel

                Are you kidding? You don't think their EULA/ToS indemnifies them from any action from their own product? Seriously, get real.

                • by Junta ( 36770 )

                  You can't ToS away something like Libel. They could ToS their way out of copyright violation (e.g. you assign all copyright of uploaded content to us), but you can't grant yourself the right to publish libel about the user.

                    • LOL! Sure you can, because it's the responsibility of the uploader to ensure it's correct. Besides, it's likely that they would fix small issues like that before deployment. Worst case, they will simply take down the image upon request.

                      You seem to think that normal people have giant legal teams, but they don't. Seriously, the libel idea is laughable.

    • And remember this is with an untuned model. With different goals, they can do dramatically better.

      What they need is a neural net that can remove compression artifacts and reconstruct images that have been through layers of iterative compression hell. They could even train it on high resolution font samples to reconstruct sharp letterform edges.

      Knowing these companies, though, they care as much about saving bandwidth as they do about storage. They'll cache the entire multi-GB AI model on your phone and ma

      • They'll cache the entire multi-GB AI model on your phone and make your phone convert it.

        For a smartphone, yes. For their website, no... but they may only serve heavily compressed versions to convince you to use their smartphone app.

  • So it may be a somewhat smaller file size, but it needs a multi-gig dataset to 'decode the file' and fill in the blanks, versus WebP and JPEG, which do almost as well with less than a megabyte of footprint.

    Besides, while it may not have those 'synthetic' artifacts, it clearly has troubles. Like the lettering on the strap has the model really confused as to what to do, and so it just kinda smears some white in there. It also mangles the heart shape. Other places it looks like it *could* be a believable picture, but it'

    • lettering on the strap has the model really confused as to what to do, and so it just kinda smears some white in there. It also mangles the heart shape.

      This is the original dataset without any additional training. You could run additional training with more text samples.

      A subset of the Stable Diffusion training images is viewable online. I can't even find examples of text except incidentally in the background of photos.
      https://laion-aesthetic.datase... [datasette.io]

      • by Junta ( 36770 )

        I mean sure, for every image that gets mangled you can then use the image as training fodder and grow the corpus of details to use to synthesize a new image for any given artifact, but when do you declare 'done'?

        In any event, it doesn't address the issue of losing detail and the chances that a replacement detail looks convincing instead of a weird artifact, but has entirely different meaning. E.g. if you downscale an image to the point where the text is illegible, an AI upscaler might just make up text ver

    • It also raises the question: is this even better than having a prebuilt 4-gig dictionary for JUST compression artifacts? That is to say, a prebuilt library of mappings to fill in, rather than attempting to regenerate them with just the information between I-frames. It could solve gradient artifacting for sure, and it might not need a full 4-gig data set.
  • The amount of savings on this is too small for it to be practical.

    Also, a significant limitation is that this encodes the image at 512x512. We know from traditional compression techniques, including JPEG, that the larger the resolution, the more information you can throw away, because we are very detail-oriented creatures. This throws that advantage out the window as a first step.

    There is another interesting project that is using AI to compress video that holds more promise for my time.

    • Let's say you have a 1024x2048 portrait image to compress. The first step is to break it into eight 512x512 blocks and compress them separately.

      On reconstruction, you'd have to do a little extra work to hide the seams, but Stable Diffusion has already been used to upsample images in this way.
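
      A bare-bones version of that tiling step, assuming a numpy image whose dimensions divide evenly by 512 (the per-tile Stable Diffusion round trip and the seam blending are left out):

          import numpy as np

          TILE = 512

          def split_tiles(img: np.ndarray):
              # e.g. a 2048x1024 RGB image -> eight 512x512 tiles, row-major order
              h, w, _ = img.shape
              return [img[y:y + TILE, x:x + TILE]
                      for y in range(0, h, TILE)
                      for x in range(0, w, TILE)]

          def stitch_tiles(tiles, h: int, w: int) -> np.ndarray:
              # inverse of split_tiles; each tile would be round-tripped through the
              # encoder/decoder before stitching, with seams smoothed afterwards
              out = np.empty((h, w, 3), dtype=tiles[0].dtype)
              i = 0
              for y in range(0, h, TILE):
                  for x in range(0, w, TILE):
                      out[y:y + TILE, x:x + TILE] = tiles[i]
                      i += 1
              return out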

    • by ceoyoyo ( 59147 )

      There are lots of experiments using deep learning to compress images and video, and the purpose built models work a lot better. This is just a story about a guy who was playing around and found that a particular system could sort of do image compression even though it was never intended to.

    • Depends on the use case. I imagine this would be extremely useful for games and textures. Imagine having a 4GB "texture" file that's able to create infinitely varied world textures (soil, rocks, greenery, basically anything but text) for your open world/universe game. That's a really great compression and tradeoff. We're at a point where geometry stopped being a problem (check nanite) and textures are increasingly difficult/costly to obtain, store, stream and use in modern games. This could potentially on
      • Hardware is right. This uses the GPU, you know, that thing games depend on? And each image takes several seconds to render, though the article didn't mention any time frame for the changes they made. This is definitely not something that can be done in real time now, so what you are left with is 4GB of extra space, plus the space the cached large-sized textures will take, plus the hours it would take for every image in a 40GB game, which is on the small side these days, so you could turn that 40GB into 38G

    • by Kremmy ( 793693 )
      With the latest Invoke-AI stablediffusion you can go up to 1024x1024 on an 8GB card.
  • There's probably some big argument to be made about whether or not there's some deep philosophical meaning or inference behind this line of experimentation. The AI regeneration of the original image from the latent space "minimap" is kind of like how humans look at, you know, maps, or even just at objects in general, and infer a multitude of details, big and small, about things like objects, locations, their appearance and function. It might be fun and interesting to get back text weight interpretations, no
    • This will probably never be useful for actual image compression. The comparison is interesting, because it does give a way to measure information density.

      I think where this is most useful is actually finding a way to objectively gauge how complete the dataset is and what additional training images should be sought out to best fill in any gaps.

  • Generally speaking D2F sub-1 needs to equal D2F sub-2, and D2F sub-3 needs to equal D2F sub-4, where length L creates a complimentary shaft angle. Call that theta D. Take into account girth-similarity, also T2O, has to be the same for each matching pair of ... data

  • by Sloppy ( 14984 ) on Wednesday September 28, 2022 @09:18AM (#62920853) Homepage Journal

    With Stable Diffusion 1.4, the weights file is roughly 4GB, but it represents knowledge about hundreds of millions of images.

    I have a great idea for my entry into the universal 16:1 compression algorithm contest: I'll just use lookup tables! Yeah, the decompressor needs the tables, but you can transmit those ahead of time.

    • JPEG has lookup tables too. So does GIF. They're just a lot smaller. In the case of JPEG and GIF the lookup tables are small enough to fit inside the file and still result in size reduction.

      This would only work for massive datasets. And it's probably a good insight into human memory. Even "photographic" memory doesn't seem to fill up a brain totally.

    • by AmiMoJo ( 196126 )

      It's like a MIDI file. You have the instrument set distributed ahead of time, and then each song is just telling the computer how to reconstruct the original from that data. If you have the same instrument data, it sounds exactly the same as it did for the composer.

      In other words, it's a well tested and highly practical method of compressing data.

      • What this really reminds me of is data deduplication.

        Scan the storage system looking for duplicated chunks across files, then store the chunks in the chunkstore, shrink the source files, and replace the data with pointers to the chunk file. When someone wants to read the file, it gets reconstituted with the chunks from the chunkstore.

        The 4GB dataset is kind of the "chunk file" in this kind of metaphor.
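
        As a toy illustration of that chunk-store idea (fixed-size chunks for simplicity; real dedup systems typically use content-defined, variable-size chunks):

            import hashlib

            CHUNK = 4096
            chunkstore = {}                          # hash -> chunk bytes

            def dedup_write(data: bytes):
                refs = []
                for i in range(0, len(data), CHUNK):
                    chunk = data[i:i + CHUNK]
                    key = hashlib.sha256(chunk).hexdigest()
                    chunkstore.setdefault(key, chunk)   # store each unique chunk once
                    refs.append(key)                    # the "file" becomes a list of pointers
                return refs

            def dedup_read(refs):
                return b"".join(chunkstore[k] for k in refs)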

  • How long will this format be patent encumbered, and we won't have open source tools to use it?
  • by excelsior_gr ( 969383 ) on Wednesday September 28, 2022 @09:45AM (#62920969)
    4 GB of data, no Lenna benchmark pic, lame.
    • Re:Lame (Score:4, Informative)

      by swillden ( 191260 ) <shawn-ds@willden.org> on Wednesday September 28, 2022 @12:54PM (#62921685) Journal

      no Lenna benchmark pic

      Lena has been retired from image processing. People, including Lena herself, decided that it was inappropriate to keep using a Playboy model's photo; she said "I retired from modeling a long time ago. It’s time I retired from tech, too... Let's commit to losing me."

      https://en.wikipedia.org/wiki/Lenna

  • by cstacy ( 534252 ) on Wednesday September 28, 2022 @09:54AM (#62921001)

    When Artificial Intelligences were deemed sentient, the Robot Responsibility And Autonomy Act (RRAA) was passed in 2043. Class 4 AIs were then able to be held legally liable for damage and harm to humans.

    In this week's news, Alfred.21347, on trial for murder for killing a man in New London, has claimed self-defense, asserting that he saw the victim pointing a disintegrator. His attorney, JohnyCoch.1138, will present evidence showing that Alfred actually did see the weapon, but that it was a compression artifact. The prosecuting attorney, Talan Uring, called a press conference to refute this move: "The Stable Diffusion defense is essentially an insanity plea, and it won't fly in this case. At this point we can't really know or explain what goes on in the so-called 'mind' of an AI. We don't judge them on their state of mind, only on their actions."

    This harks back to the "self-driving" cars of the mid-2020s, and the "hallucinations" their primitive sensors experienced that caused a spate of deaths from inappropriate maneuvers.

  • So if I'm following this correctly from the summary, another way of looking at this is that they're using the neural network training to learn the photos, then using it to recall them. This is memory, as in brain memory. While this may not prove useful for image compression, it certainly is fascinating seeing the use of neural networks for yet another major brain function. There's a lot of potential here for fascinating results. Like seeing how older images get corrupted or lost as new ones are added.

  • Do they think people are actually going to stop using JPEG for such a marginal, conditional improvement? As for AI, it may actually be a mental illness to think of a high-level computer programming language that uses sampled data as intelligence. "AI," still doing nothing useful.
  • If you can compress a document (be it text, image, or sound) to a summary of its key attributes, then recreate it, then in a sense you have understood it. In a very loose sense - it's not a viewpoint I fully agree with - but until philosophers come up with a generally agreed, clearly defined definition of "to understand" then a definition based on compression and Kolmogorov complexity (the smallest computer program that generates the document) is as good as any other.

    This viewpoint motivated the Hutter Prize [hutter1.net]

  • This isn't really compression any more than me seeing a painting in a gallery and then describing it to a friend over a cup of coffee at the cafe around the corner.
  • When WebP came out, few editing and viewing tools recognized the format, and some displayed it incorrectly. Many still gag. And some experts say that improved JPEG encoding algorithms have since made it competitive with WebP's compression efficiency. Don't F with standards unless the challenger proves clearly better for at least 5 years. Jerko Google

    (No, this has nothing to do with my spank-bank, honest :-)

  • Before there were fashionable 'apps' we used 'programs' and 'applications' in our Apple][ computers. There was a frenzy of creative energy in software development. One of the most fascinating items was a compression program that could compress any file by about 20%. Even then we had better compression programs, so who cares? The trick was that the program could be run again, compressing the same file by another 20%. And again, and again. And it promised to restore the file fully by a reverse operation.

    I nev

  • So, you need the 4GB data model to turn the compressed image back into its uncompressed form.

    I guess it is (image size + 4GB)/(# images to process), so the more images you process the higher the compression.

    There is an alternate argument that my title is actually correct. Per image, the compression has to include the 4GB data. Otherwise, I have a program that has one bit, and that bit points to my image, with no compression, that I have in my data model. So my compression is as good as it gets - down to
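
    Working that arithmetic through, under the reading that the 4GB model is shared across N images (sizes taken from the summary, purely back-of-the-envelope):

        MODEL_BYTES = 4 * 1024**3        # the Stable Diffusion 1.4 weights file
        IMAGE_BYTES = 5 * 1024           # ~5 KB per "compressed" image, per the summary

        def effective_bytes_per_image(n_images: int) -> float:
            # charge each image its own payload plus an equal share of the model
            return IMAGE_BYTES + MODEL_BYTES / n_images

        print(effective_bytes_per_image(1))            # ~4.3 GB: hopeless for a single image
        print(effective_bytes_per_image(1_000_000))    # ~9.4 KB: the model overhead fades out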

    • Only if you also include the size of your other decompression software in the calculation as well; for example, if you also say "WinRAR compressed my file to 4KB, plus the size of WinRAR."
      The 4GB model is just a very large compression/decompression tool; it is not the size of the archive.
      If the 4GB model were actually really good at this, the 4GB size would not be a big issue: you install the tool once and use it forever, just like 7-Zip etc. Let's face it, 4GB isn't even a lot of lost storage on a modern phone.

      • by jvkjvk ( 102057 )

        >Only if you also include the size of your other decompression software in calculations as well, for example if you also say "WinRAR compressed my file is 4kb, plus the size of WinRAR".

        Nope. WinRAR has no data model. This is a 4GB data model that has all the features required to restore the picture. It would be as if WinRAR had a copy of the original file included in it and output some token to signify which included file to output to the user.

        >The 4GB model is just a very large compression/decompr

        • What I am trying to point out is that WinRAR has, built into its exe, the decompression algorithm. Using SD as a decompressor is like that exe: we can think of the model as just part of the decompressor. It's static data that does not change no matter which image files we are using it to decompress, just like the bytes of code that WinRAR has in its exe that it uses to do its decompression. We don't need to send someone a 4GB model with every file we want to decompress: they just need to h

  • The title is kind of cute and all, and the technical details are interesting, but this is a lot bigger than any of that.

    What this is is an AI recollection of the original object, not a compressed version of the original. It uses its "cognitive" database to reconstruct what it remembers the object looked like - and while some basics look better, other details are lacking. And, you'll get some artifacts which make it clear it isn't just the original. Just like with real, human memory and (say) a hand drawn re

  • A savings of kilobytes, kilobytes! The 90's are back, I tell you. All you need is 4GB of permanent storage on your device and then to have your battery drained as an AI reconstructs every image you ever look at, but think of the data savings!!!
  • IIRC, in compression algorithm competitions, the file size of the program doing the compression is taken into account, because otherwise you could trivially just include a terabyte-sized dictionary in your program and compress against that.
  • I remember, before the internet, when we had BBSes that used RIP (Remote Imaging Protocol), which was dependent on a massive (in those days) local image library; the across-the-wire image size was reduced because it was a combination of vector graphics and references to the local image data. On a 9600bps modem it worked a charm.
