
19-Year-Old's Supercomputer Chip Startup Gets DARPA Contract, Funding

An anonymous reader writes: 19-year-old Thomas Sohmers, who launched his own supercomputer chip startup back in March, has won a DARPA contract and funding for his company. Rex Computing is currently finishing up the architecture of its final verified RTL, which is expected to be completed by the end of the year. The new Neo chips will be sampled next year, before moving into full production in mid-2017. The Platform reports: "In addition to the young company’s first round of financing, Rex Computing has also secured close to $100,000 in DARPA funds. The full description can be found midway down this DARPA document under 'Programming New Computers,' and has, according to Sohmers, been instrumental as they start down the verification and early tape-out process for the Neo chips. The funding is designed to target the automatic scratchpad memory tools, which, according to Sohmers, is the 'difficult part and where this approach might succeed where others have failed is the static compilation analysis technology at runtime.'"
  • good for him (Score:5, Insightful)

    by turkeydance ( 1266624 ) on Wednesday July 22, 2015 @07:24PM (#50164895)
    mean it.
  • by jonwil ( 467024 ) on Wednesday July 22, 2015 @07:33PM (#50164957)

    Not sure what's more impressive: the fact that a 19 year old is able to get DARPA funding, or the fact that a 19 year old (and his team, presumably) is about to go into mass production with a fairly fancy looking custom microprocessor on a 28nm fab process.

    • by alvinrod ( 889928 ) on Wednesday July 22, 2015 @07:55PM (#50165039)
      I was a little curious about that as well, and one of the linked articles from TFA [theplatform.net] says that this kid was at MIT at 13. I'll go ahead and guess that he's really into and good at microprocessor design. The article I've linked also talks about some of the design decisions for the chip he's making; I'd be interested in hearing from someone with a background in the field about those.
      • by Anonymous Coward

        I don't have a background in microprocessor design: I've only designed a very simple one as an assignment, but I've been following the industry pretty closely.

        From what I can tell, his design looks like it might be flops-per-watt comparable to GPUs, but with different memory abstractions that result in similar limitations. I suspect that if you write custom code for it, and have the right kind of problem, it will do significantly better than available options, but in the general case and/or non-specialized code it won't do anything much better than a GPU, though it may be competitive.

        • by trsohmers ( 2580479 ) on Wednesday July 22, 2015 @09:25PM (#50165343)
          I'm hugely biased as I am the founder of the referenced startup, but I figured I would point out a few key things:
          1. When it comes to FLOPs per watt, we are actually aiming at a 10 to 25x increase over existing systems... The best GPUs (before you account for the power usage of the CPU required to operate them) get almost 6 double precision GFLOPs per watt, while our chip is aiming for 64.
          2. When it comes to being better than a GPU for applications, you have to remember GPUs have abysmal memory bandwidth (due to being limited by PCIe's 16GB/s to the CPU) after you run out of data in the relatively small memory on the GPU. Due to this, GPUs are really only good at level 3 BLAS applications (matrix-matrix based... basically the things you would imagine GPUs were designed for, which are related to image/video processing). It just so happened that when the GPGPU craze started ~6/7 years ago, they had enough of an advantage over CPUs that they made sense for some other applications, but in actuality GPUs do so much worse on level 1 and level 2 BLAS apps compared to the latest CPUs that GPUs are really starting to lose their advantage (and I think will be dying out when it comes to anything other than what they were originally designed for, plus some limited heavy matrix workloads... but then again, I'm biased).
          3. Programming is the biggest difficulty, and will make or break our company and processor. The DARPA grant is specifically for continued research and work on our development tools, which are intended to automate the unique features of our memory system. We have some papers in the works and will be talking publicly about our *very* cool software in the next couple of months.
          4. As for your mention of the Mill and running existing code well, I had a pretty good laugh. Let me preface this by saying that I find stack machines academically interesting and fun to think about, and I don't discredit the Mill team entirely; I think it is a good thing they exist. With that being said, they have had barely functioning compilers for years (which they refuse to release publicly), and stack machines are notorious for having HORRIBLE support for languages like C. The fact that Out Of The Box Computing (the creators of the Mill) have been around for over 10 years and have given nothing but talks with powerpoints (though they clearly are very intelligent and have an interesting architecture) says a lot about their future viability. I hate to be a downer like that, especially since I have found Ivan's talks interesting and he is a nice and down to earth guy, but I highly doubt they will ever have a chip come out. I'll restate my obvious biases for the previous statement.
          Feel free to ask any other questions.
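          A rough, generic illustration of the level 1 vs. level 3 BLAS distinction drawn above (plain C, textbook routines, nothing Rex-specific): a level 1 AXPY does about 2n flops over about 3n memory accesses, so it is bandwidth-bound, while a level 3 GEMM does about 2n^3 flops over about 3n^2 elements, so with reuse it can be kept fed from fast on-chip memory.

              #include <stddef.h>

              /* Level 1 BLAS: y = a*x + y.  ~2n flops over ~3n memory accesses, so the
               * memory system, not the ALUs, sets the speed limit. */
              void axpy(size_t n, double a, const double *x, double *y) {
                  for (size_t i = 0; i < n; i++)
                      y[i] += a * x[i];
              }

              /* Level 3 BLAS: C += A*B (naive form).  ~2n^3 flops over ~3n^2 elements,
               * so each element can be reused roughly n times from fast local memory;
               * the kind of workload GPUs and dense accelerators are built around. */
              void gemm(size_t n, const double *A, const double *B, double *C) {
                  for (size_t i = 0; i < n; i++)
                      for (size_t k = 0; k < n; k++)
                          for (size_t j = 0; j < n; j++)
                              C[i*n + j] += A[i*n + k] * B[k*n + j];
              }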
          • Re: (Score:3, Funny)

            by PopeRatzo ( 965947 )

            I'm hugely biased as I am the founder of the referenced startup, but I figured I would point out a few key things:

            Thomas, you are awesome.

            Enjoy your success. I see from your bio that in your "free time" you like to play guitar. I hope you've bought yourself a good one (or six).

          • Re: (Score:3, Interesting)

            by Anonymous Coward

            Thanks for the response!

            I should have noticed your numbers were for double precision flops, so my numbers were way off. Thanks for the correction. I bet you are IEEE compliant too (Darn GPUs...).

            Your design is intended specifically for parallel workloads with localized or clustered data access, correct? (I realize this includes most supercomputer jobs.) It sounds like similar constraints to those you have with GPUs, but if met properly, the performance should be much better/more efficient and more scalable.

            • by trsohmers ( 2580479 ) on Thursday July 23, 2015 @12:06AM (#50165825)
              1. We are IEEE compliant, but I'm not a fan of it TBH, as it has a ridiculous number of flaws (see the small illustration below this comment)... Check out unum and the new book "The End of Error" by John Gustafson (and also search for Gustafson's Law, the counterargument to the more famous Amdahl's Law), which goes over all of them and proposes a floating point format that is superior in *every* measure.
              2. The first issue we get around primarily by having ridiculous bandwidth (288 to 384GB/s aggregate chip-to-chip bandwidth)... we'll have more info out on that in the coming months. When it comes to memory movement, that's the big difficulty and what a big portion of our DARPA work is focused on, but a number of unique features of our network on chip (statically routed, non-blocking, single cycle latency between routers, etc.) help a lot with allowing the compiler to *know* that it can push things around in a given time, while inserting a minimal number of NOPs. There is a lot of work, and it will not be perfect with our first iteration, but the initial customers we are working with do not require perfect auto-optimization to begin with.
              3. If you think of each core as being a quad-issue in-order RISC core (think on the performance order of an ARM Cortex A9 or potentially A15, but using a lot less power and being 64 bit), you can have one fully independent and isolated application on each core. That's one of the very nice things about a true MIMD/MPMD architecture. So we do fantastic with things that parallelize well, but you can also use our cores to run a lot of independent programs decently well.
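              A minimal, generic illustration of the kind of IEEE 754 behavior the unum critique is aimed at (this sketch is not taken from the comment above or from Gustafson's book): rounding makes double addition non-associative and lets large values silently absorb small ones.

                  #include <stdio.h>

                  int main(void) {
                      /* Rounding makes IEEE 754 double addition non-associative. */
                      double a = (0.1 + 0.2) + 0.3;
                      double b = 0.1 + (0.2 + 0.3);
                      printf("associative? %d\n", a == b);            /* prints 0: the sums differ */

                      /* Small terms are absorbed entirely next to large ones. */
                      double big = 1e16;
                      printf("absorbed?    %d\n", (big + 1.0) == big); /* prints 1 */
                      return 0;
                  }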
          • The important question is - how does it perform on the Cycles (Blender's render engine) benchmarks :)

            http://blenderartists.org/foru... [blenderartists.org]

            https://www.blender.org/downlo... [blender.org]

          • by captnjohnny1618 ( 3954863 ) on Wednesday July 22, 2015 @11:00PM (#50165681)
            I'm burning some mod points to post this under my username, but it's totally worth it. THIS is the kind of article that should be on Slashdot!

            Can you elaborate on the programming structure/API you guys are envisioning for this? (It's cool if you can't, I'd understand :-D.) What particular types of problems are you targeting your chips to solve, or to what areas do you envision your chips being especially well suited? And who do you think has done the best nitty-gritty write-up about the project so far? I'd love to hear what you think is the best technical description publicly available. Can't wait to learn more as the project grows.

            Although I'm not a programmer or CS person by training, I do GPGPU programming (although not BLAS-based stuff) almost exclusively for my research and enjoy it, because once you understand the differences between the GPU and CPU it just becomes a question of how best to parallelize your algorithm. It'd be AMAZING to see the memory bandwidth and power usage specs you guys are working towards under a programming structure similar to what we currently see with something like CUDA or OpenCL. Any plans for something like that, or am I betraying my hobbyist computing status?

            Finally, if you ever need any applications testing, specifically in the medical imaging field, feel free to get in touch. ;-)
            • by trsohmers ( 2580479 ) on Thursday July 23, 2015 @12:12AM (#50165835)
              1. My personal favorite programming models for our sort of architecture would be PGAS/SPMD style, with the latter being the basis for OpenMP. PGAS gives a lot more power in describing, and efficiently having, shared memory in an application with multiple memory regions. Since every one of our cores has 128KB of scratchpad memory, and all of those memories are part of a global flat address space, every core can access any other core's memory as if it is part of one giant contiguous memory region (a toy model of that addressing is sketched below this comment). That does cause some issues with memory protection, but that is a sacrifice you make for this sort of efficiency and power (we have some plans on how to address that with software... more news on that in the future). The other nice programming model we see is the Actor model... so think Erlang, but potentially also some CSP-like stuff with Go in the future (and yes, I do realize they are competing models).
              If you want to get the latest info as it comes out, sign up for our mailing list on our website!
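              A toy model of the flat, physically addressed scratchpad space described above, assuming the per-core scratchpads are simply concatenated into one linear address space (the helper names and that layout assumption are illustrative; only the 128KB-per-core figure comes from the comment):

                  #include <stdint.h>
                  #include <stdio.h>

                  #define SCRATCHPAD_BYTES (128u * 1024u)  /* 128KB per core, per the comment above */

                  /* Hypothetical flat layout: global address = core * 128KB + offset,
                   * so a "remote" access is just an ordinary load at a different base. */
                  static uint32_t core_of(uint64_t gaddr)   { return (uint32_t)(gaddr / SCRATCHPAD_BYTES); }
                  static uint32_t offset_of(uint64_t gaddr) { return (uint32_t)(gaddr % SCRATCHPAD_BYTES); }
                  static uint64_t gaddr_of(uint32_t core, uint32_t offset) {
                      return (uint64_t)core * SCRATCHPAD_BYTES + offset;
                  }

                  int main(void) {
                      uint64_t a = gaddr_of(7, 0x100);  /* byte 0x100 in core 7's scratchpad */
                      printf("core=%u offset=0x%x\n", core_of(a), offset_of(a));
                      return 0;
                  }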
              • Interesting that you mentioned CSP. When I read up on your architecture, its close relative, Functional Reactive Programming (well... something inspired by FRP...), came to mind. It leads to easy programming and a relatively straightforward, direct mapping of FRP nodes to cores, and of event streams to communication among cores. Very good isolation.

          • by godrik ( 1287354 ) on Wednesday July 22, 2015 @11:29PM (#50165743)

            I like the idea of "reinventing the computer for performance". Trying to get rid of the overhead caused by virtual memory has attracted quite a bit of attention recently, so the idea is definitely sound.
            A few questions:
            -Are there any more details I can read anywhere? I could not really see any details past the "slightly technical PR" on http://www.rexcomputing.com/in... [rexcomputing.com]
            -Do you plan on presenting your work at SuperComputing?
            -You mention BLAS3 kernels, so I assume you mean dense BLAS3 kernels. From what I see, people are no longer really interested in dense linear algebra. Most of the applications I see nowadays are sparse. Can your architecture deal with that?
            -The chip and architecture seem to essentially be based on a 2D mesh network; can it be extended to more dimensions? I was under the impression that this would cause high latency in physical simulations, because you cannot easily project a 3D space onto a 2D space without introducing large distance discrepancies. (Which is why BG/Q uses a 5D torus network.)
            Keep us apprised!
            Cheers

            • by trsohmers ( 2580479 ) on Wednesday July 22, 2015 @11:53PM (#50165797)
              This is a bit old and has some inaccuracies, so I hesitate to share it, but since you can find it if you dig deep enough... here it is: http://rexcomputing.com/REX_OC... [rexcomputing.com]
              Couple quick things: Our instruction encoding is a bit different from what is on the slide; we've brought it down to a 128-bit VLIW (32 bits per functional unit operation; a rough sketch of such a bundle layout follows below this comment), and there are some pipeline particulars we are not talking about publicly yet. We have also moved all of our compiler and toolchain development to be based on LLVM (and thus the really dense slides in there talking about GCC are mostly irrelevant).
              As mentioned in the presentation, we have some ideas for expanding the 2D mesh on the chip, including having it become a 2D torus... our chip-to-chip interconnect allows a lot more interesting geometries, and we are working on one with a university research lab that features a special 50-node configuration with a maximum of 2 hops between nodes. Our 96GB/s of chip-to-chip bandwidth per side is also a big thing differentiating us from other chips (with the big sacrifice being the very short distance we need between chips, and a lot of constraints on packaging and the motherboard). We'll have more news on this in the future.
              When it comes to sparse and dense computations, we are mostly focusing on the dense ones to start (FFTs are beautiful on our architecture), but we are capable of doing well with sparse workloads; while those developments are in the pipeline, they will take a lot more compiler development effort.
              We actually had a booth in the emerging technologies exhibition at the Supercomputing Conference in 2014, and hope to have a presence again this year.
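              For readers who have not met VLIW encodings before, here is a purely hypothetical sketch of a 128-bit bundle with four 32-bit operation slots; the struct name, field names, and which functional unit owns which slot are invented, and only the 128-bit/32-bit split comes from the comment above.

                  #include <stdint.h>

                  /* Hypothetical 128-bit VLIW bundle: four 32-bit operations issued together.
                   * The slot-to-functional-unit assignment below is made up for illustration. */
                  typedef struct {
                      uint32_t alu_op;     /* integer ALU operation     */
                      uint32_t fpu_op;     /* floating-point operation  */
                      uint32_t mem_op;     /* load/store operation      */
                      uint32_t branch_op;  /* branch/control op, or NOP */
                  } vliw_bundle_t;

                  _Static_assert(sizeof(vliw_bundle_t) == 16, "a bundle must be exactly 128 bits");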
          • by Brannon ( 221550 ) on Wednesday July 22, 2015 @11:34PM (#50165763)

            Please explain to me simply how you get 10x in compute efficiency over GPUs--these chips are already fairly optimal at general purpose flops per watt because they run at low voltage and fill up the die with arithmetic.

            GPUs have excellent memory bandwidth to their video RAM (GDDR*); they have poor IO latency & bandwidth (PCIe limited), which is the main reason they don't scale well.

            We've heard the VLIW "we just need better compilers" line several times before.

            Thus far this sounds like a truly excellent high school science fair project, or a slightly above average college engineering project. It is miles away from passing an industrial smell test.

            • by Anonymous Coward

              It's not hard to get 10x if you make something that is essentially unprogrammable and incompatible with all existing software. (Think DSPs without the vendor libraries, stream computing, FPGAs, etc.) Since so much of the cost of these systems is in software development, incompatibility is pretty much a dead-end for anything that isn't military-specific.

          • As somebody in the VLSI field, I am happy that somebody broke out of the monopoly/duopoly of the established players. We are moving towards a single/double vendor for everything from mobiles to laptop processors to desktop processors. Having little choice also harms progress.

            The other thing which excites me is that you are going towards a completely new architecture. This is what innovation is about!

            Hopefully, your success will inspire others also to take the plunge.

          • by Arkh89 ( 2870391 )

            When it comes to being better than a GPU for applications, you have to remember GPUs have abysmal memory bandwidth (due to being limited by PCIe's 16GB/s to the CPU) after you run out of data in the relatively small memory on the GPU. Due to this, GPUs are really only good at level 3 BLAS applications (matrix-matrix based... basically things you would imagine GPUs were designed for, which are related to image/video processing).

            This is only true if your problem does not fit in the VRAM (which is getting over 10GB nowadays). If it does, you'll be 8x to 12x faster than any brand new CPU for any element-wise operation. Also, it is much more common to find an easy way to cut the problem up nicely than not.
            That being said, do you know how much embedded RAM you will be proposing your architecture with (even a rough projection)?

          • Re: (Score:3, Interesting)

            by __rze__ ( 2550872 )
            Hi Thomas,

            I found this extremely intriguing, as I am currently writing up my dissertation on high-GFLOPS/W 3-D layered reconfigurable architectures. I am also of the opinion that memory handling is the key, as it is the only way to resolve the von Neumann bottleneck problem. Many processing elements with no means to feed them are useless. In my design I am using reconfigurability and flexibility to gain energy efficiency (my architectural range allows 111 GFLOPs/W in some configurations).

            I am also conce

          • Re: (Score:2, Funny)

            and stack machines are notorious for having HORRIBLE support for languages like C

            Which is what makes them so awesome. It's like a door that filters out undesirable drunken retards before they even enter your house.

          • When it comes to being better than a GPU for applications, you have to remember GPUs have abysmal memory bandwidth (due to being limited by PCIe's 16GB/s to the CPU)

            That's a somewhat odd claim. One of the reasons that computations on GPUs are fast is that they have high memory bandwidth. Being hampered by using the same DRAM as the CPU is one of the reasons that integrated GPUs perform worse. If you're writing GPU code that's doing anything other than initial setup over PCIe, then you're doing it badly wrong.

            That said, GPU memory controllers tend to be highly specialised. The nVidia ones have around 20 different streaming modes for different access patterns (I t

            • That's less of an issue if your throughput comes from thread-level parallelism. There are some experimental architectures floating around that get very good i-cache usage and solid performance from a stack-based ISA and a massive number of hardware threads.

              The other day, I had the mildly insane idea that perhaps our abilities to explore the architectural space are limited by all existing architectures having been painstakingly handcrafted. Thus, if it were possible somehow to parametrically generate an architecture, and then synthesize a code generator backend for a compiler and a suitable hardware implementation, we might be able to discover some hidden gems in the largely unexplored universe of machine architectures. But it sounds like a pipe dream to me...

              • Tensilica (now owned by Cadence) does a shitty version of this already, which generates (restricted in scope) RTL and a compiler backend from a description of what you want. Synopsys also has a smaller version that does that in a larger scope than Tensilica, using some high level synthesis (which I think is basically pseudoscience when it comes to hardware design) along with SystemVerilog stuff. We actually prototyped the hardware generation part of what you are saying, and it works pretty decently without
          • you have to remember GPUs have abysmal memory bandwidth (due to being limited by PCIe's 16GB/s to the CPU)

            Uhm, no. GPUs have massive bandwidth to THEIR memory. You're talking about lower speeds to the memory of A DIFFERENT PROCESSOR, so essentially you're comparing using the PCI bus as a network with your own direct memory access. These are two different things. GPUs can have far more memory than the systems they are attached to, and nVidia has certainly used this as a selling point for their GPGPU stuff. If your GPU is using system memory over the PCI bus, you fucked up your hardware purchase. When you thi

          • Thomas; your web site mentions LLVM as your development language environment. Can I assume C/C++ is the main language? Is your tool chain highly customized? Would seem to be necessary to support a highly parallel machine with a new architecture. Are there any details on your tool chain you might be willing to share, or do I need to apply, get hired, and then find out?
            • It is a lot easier to talk in person or over phone/skype/email about the specifics, but what I can say right now is that you are correct in your base assumption, though I would not say it is highly customized... The base of it borrows from the number of LLVM improvements targeted at VLIW systems over the past couple of years (which work even better for us as we are a much more simplified/relaxed VLIW), and extends functionality, but most of our custom work is meant to extend beyond the base backend. Technically, if you
          • Since your chips are so small, have you considered moving to BJT TTL circuits to ramp up frequency to the 50-100 GHz range while using lower power to improve your GFLOP/W rating?
        • From what I can tell, his design looks like it might be flops-per-watt comparable to GPUs, but with different memory abstractions that result in similar limitations. I suspect that if you write custom code for it, and have the right kind of problem, it will do significantly better than available options, but in the general case and/or non-specialized code it won't do anything much better than a GPU, though it may be competitive.

          In other words, very much like Chuck Moore's Forth cores. ;-) Which is fine, there's quite a range of applications for hardware like that. Especially in the military.

      • by metlin ( 258108 )

        He's a Thiel Fellow, and clearly that model is working for kids like him who are super gifted and for whom the current college education model would be absurd.

        This 17-Year-Old Dropped Out Of High School For Peter Thiel And Built A Game-Changing New Kind Of Computer [businessinsider.com]

        Pretty awesome, if you ask me!

      • I'll go ahead guess that he's really into and good at microprocessor design

        What, a 19 year old who's designed his own supercomputer chip and received $100K DARPA funding? You're really going out on a limb there.

    • Both are pretty damn impressive, although for chip manufacturing, $100,000 isn't exactly a lot of money. Consider what a single engineer earns per year. Apparently, they have six people, and hiring a seventh. I guess that means they must have some other funding coming in from somewhere, as he talked about how "supercomputing" is a poisoned word among Silicon Valley VC firms.

      I hope to hear interesting things about this young man and his company in the future.

      • by trsohmers ( 2580479 ) on Wednesday July 22, 2015 @09:08PM (#50165277)
        This is the founder of the startup in the article. We have actually just raised $1.25 in venture funding, which is mentioned in the article. Thanks, and I hope we will be bringing more news soon.
        • by avgjoe62 ( 558860 ) on Wednesday July 22, 2015 @09:12PM (#50165287)
          $1.25? Contact me and I'll double, hell why not triple your funding! :)
        • Ah, I missed it right in the first paragraph. Good work getting the initial funding and your company off the ground. Mine isn't nearly so ambitious, but I can sympathize with the headaches of getting a new business started.

          I'm a software guy, and know only the theoretical basics about the hardware I program for, but the notion of putting more of the complexity into the compiler instead of the chip is interesting. I wonder if this technology requires new approaches to languages and compilers, or whether i

          • by robi5 ( 1261542 )

            Maybe HotSpot / V8 types of optimizations would work well, since in running code the actual patterns emerge. This is a great talk on the cost of virtual memory, the future of JS and more :-) https://www.destroyallsoftware... [destroyallsoftware.com]

          • In either case, it sounds like a hell of a challenge, as (if I understand correctly) you'd presumably need to pre-evaluate logic flow and track how resources are accessed in order to embed the proper data cache hints. However, those sort of access patterns can change depending on the state of program data, even within the same code. Or, I suppose you could "tune" the program for optimal execution through multiple pass evaluation runs, embedding data access hints in a second pass after seeing the results of a "profiling" run.

            Sounds like LuaJIT on steroids.

        • by Anonymous Coward

          "We have actually just raised $1.25 in venture funding"

          Looks like you shouldn't use your CPU design to run your finances there, Sparky.

        • I'm hoping that there's a million missing there. Are you just planning on selling IP cores? When I talked to a former Intel Chief Architect a few years ago (hmm, about 10 years ago now), he was looking at creating a startup and figured that $60m was about the absolute minimum to bring something to market. From talking to colleagues on the lowRISC project and at ARM, $1-2m is just enough to produce a prototype on a modern process, but won't get you close to mass production. Do you plan on raising more mo
        • by KGIII ( 973947 )

          You may be too busy but are you accepting private VC funding and, if so, whom would one contact to do so? I might have a soft-spot for MIT alumni.

    • "Virtual Memory translation and paging are two of the worst decisions in computing history"

      "Introduction of hardware managed caching is what I consider 'The beginning of the end'"

      ---

      These comments belie a fairly child-like understanding of computer architecture.

      • "Virtual Memory translation and paging are two of the worst decisions in computing history"

        "Introduction of hardware managed caching is what I consider 'The beginning of the end'"

        ---

        These comments belie a fairly child-like understanding of computer architecture.

        He's young, and he displays much more talent than people twice his age. What's your problem anyway?

        • by Anonymous Coward

          Talent is not the same thing as experience. Being able to do something does not mean it is a good idea to do it. So many people have tried this approach to improving efficiency (MIT RAW, Stanford Imagine, Stanford Smart Memories) and have run into such serious problems (compilers, libraries, ecosystem, system-level support) that unless he has solutions for those problems, starting it again is not a smart idea.

          • Talent is not the same thing as experience.

            I'm in agreement - experience counts for a lot when doing something new.

            Being able to do something does not mean it is a good idea to do it.

            I'm in agreement with this as well.

            So many people have tried this approach to improving efficiency (MIT RAW, Stanford Imagine, Stanford Smart Memories) and have run into such serious problems (compilers, libraries, ecosystem, system-level support) that unless he has solutions for those problems, starting it again is not a smart idea.

            It is highly unlikely that this will go anywhere (so, yeah - agreement again)... BUT... he is displaying a great deal of talent for his age. The lessons he learns from this failure[1] will be more valuable than the lessons learned in succeeding at a less difficult task.

            As I understand it, he proposes removing the hardware cache and instead using the compiler to prefetch values from memory. He says the

            • Whether he can actually produce a compiler that will insert the necessary memory fetch instructions at compile time in an efficient manner remains to be seen

              That's not the hard bit of the problem. Compiler-aided prefetching is fairly well understood. The problem is the eviction. Having a good policy for when data won't be referenced in the future is hard. A simple round-robin policy on cache lines works okay, but part of the reason that modern caches are complex is that they try to have more clever eviction strategies. Even then, most of the die usage by caches is the SRAM cells - the controller logic is tiny in comparison.
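              For anyone wondering what "compiler-aided prefetching" looks like in practice, here is a generic example using GCC/Clang's __builtin_prefetch; the prefetch distance is an arbitrary illustrative value, and nothing here is specific to Rex's toolchain.

                  /* Issue a prefetch a fixed distance ahead of use so the memory latency
                   * overlaps with computation.  Deciding what to *evict* to make room,
                   * which the parent comment calls the hard part, is not shown here. */
                  #define PREFETCH_DISTANCE 16

                  double sum_with_prefetch(const double *x, long n) {
                      double s = 0.0;
                      for (long i = 0; i < n; i++) {
                          if (i + PREFETCH_DISTANCE < n)
                              __builtin_prefetch(&x[i + PREFETCH_DISTANCE], 0, 1);  /* read, low temporal locality */
                          s += x[i];
                      }
                      return s;
                  }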

                • Whether he can actually produce a compiler that will insert the necessary memory fetch instructions at compile time in an efficient manner remains to be seen

                That's not the hard bit of the problem. Compiler-aided prefetching is fairly well understood.

                I honestly thought that was the difficult part; it's halting-problem hard, if I understand correctly. If you cannot predict whether a program will ever reach the end-state, then you cannot predict if it will ever reach *any* particular state. To know whether to prefetch something requires you to have knowledge about the program's future state.

                To my knowledge, prediction of program state only works if you're predicting a *very* short time into the future (say, no more than a hundred instructions). If you're li

                • Prefetching in the general case is non-computable, but a lot of accesses are predictable. If the stack is in the scratchpad, then you're really only looking at heap accesses and globals for prefetching. Globals are easy to statically hint and heap variables are accessed by pointers that are reachable. It's fairly easy for each function that you might call to emit a prefetch version that doesn't do any calculation and just loads the data, then insert a call to that earlier. You don't have to get it right

                  • One of the things that doesn't seem to be getting through in most of the media articles is how our memory system is actually set up. I'll try to describe it briefly here, starting from the single core.

                    At a single core, we have a 128KB multibanked scratchpad memory, which you can think of as just like an L1 cache but smaller and lower latency. We have one cycle latency for a load/store from your registers to or from the scratchpad, and the same latency from the scratchpad to/from the cores router. That sc
                    • At a single core, we have a 128KB multibanked scratchpad memory, which you can think of as just like an L1 cache but smaller and lower latency. We have one cycle latency for a load/store from your registers to or from the scratchpad

                      Note that a single-cycle latency for L1 is not that uncommon in in-order pipelines - the Cortex A7, for example, has single-cycle access to L1.

                      That scratchpad is physically addressed, and does not have a bunch of extra (and in our opinion, wasted) logic to handle address translations,

                      The usual trick for this is to arrange your cache lines such that your L1 is virtually indexed and physically tagged, which means that you only need the TLB lookup (which can come from a micro-TLB) on the response. If you look at the cache design on the Cortex A72, it does a few more tricks that let you get roughly the same power as a direct-mapped L1 (which has ver
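                      A quick generic sanity check of why a virtually indexed, physically tagged L1 can start its lookup before translation finishes: as long as the bits used to pick the set fall inside the untranslated page offset, the index is identical for the virtual and physical address. The sizes below are illustrative, not taken from any chip in this thread.

                          #include <stdio.h>

                          int main(void) {
                              unsigned page_bytes  = 4096;       /* 4KB pages                */
                              unsigned cache_bytes = 32 * 1024;  /* 32KB L1, illustrative    */
                              unsigned ways        = 8;          /* 8-way set associative    */
                              unsigned index_span  = cache_bytes / ways;  /* bytes covered by index+offset bits */

                              printf("index+offset bits span %u bytes per way: %s\n", index_span,
                                     index_span <= page_bytes
                                         ? "within the page offset, so VIPT needs no alias handling"
                                         : "exceeds the page offset, so aliasing must be handled");
                              return 0;
                          }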

            • The primary benefit of caches for HPC applications is *bandwidth filtering*. You can have much higher bandwidth to your cache (TB/s, pretty easily) than you can ever get off-chip--and it is substantially lower power. It requires blocking your application so that its working set fits in cache (a minimal blocking sketch is below this comment).

              He's pulling out quotes from Cray (I used to work there) about how caches just get in the way--and they did, 30 years ago when there were very few HPC applications whose working set could fit in cache. It's a very
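              A minimal sketch of the blocking idea referenced above, in generic C: each tile of the matrices is reused many times while it sits in cache, so most of the traffic never reaches DRAM. The tile size is illustrative (three 64x64 double tiles are roughly 96KB, sized for a typical L2 rather than for any particular chip in this thread).

                  #define TILE 64  /* three TILE x TILE double tiles ~= 96KB of working set */

                  /* Blocked matrix multiply: the (i,j,k) tiles are reused from cache, so the
                   * cache acts as a bandwidth filter between the cores and DRAM. */
                  void gemm_blocked(int n, const double *A, const double *B, double *C) {
                      for (int ii = 0; ii < n; ii += TILE)
                          for (int kk = 0; kk < n; kk += TILE)
                              for (int jj = 0; jj < n; jj += TILE)
                                  for (int i = ii; i < ii + TILE && i < n; i++)
                                      for (int k = kk; k < kk + TILE && k < n; k++)
                                          for (int j = jj; j < jj + TILE && j < n; j++)
                                              C[i*n + j] += A[i*n + k] * B[k*n + j];
                  }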

              • Except we do have "caches"... just not hardware managed ones, so we call them scratchpads. Our "L1" equivalent is 128KB per core (2 to 4x more than Intel), with 1 cycle latency (compared to 4 cycle latency for Intel) and lower power (as little as 2/5ths the power usage). We do have that TB/s for every core to its local scratchpad, and our total aggregate bandwidth on the network on chip (cores to cores) is 8TB/s.

                I'll throw a Seymour Cray quote here for fun... "You can't fake memory bandwidth
      • by gnupun ( 752725 )

        "Virtual Memory translation and paging are two of the worst decisions in computing history"

        In the old days, and even with current CPUs, one CPU can run multiple processes. But if CPUs were small enough and cheap enough, one program would run on multiple CPUs. Why would you need memory protection (virtual memory translation) if only a small portion of one program is running on one CPU? Answer: you don't.

        So TL;DR, he could be right, but only for systems with a huge number of weak/limited CPUs.

      • "Virtual Memory translation and paging are two of the worst decisions in computing history"

        He's not completely wrong there. Paging is nice for operating systems isolating processes and for enabling swapping, but it's horrible to implement in hardware and it's not very useful for userland software. Conflating translation with protection means that the OS has to be on the fast path for any userland changes and means that the protection granule and translation granule have to be the same size. The TLB needs to be an associative structure that can return results in a single cycle, which makes it h

      • by tomhath ( 637240 )
        His comments indicate vision. Decades ago it was necessary to have caching and virtual memory, but with modern chip design he sees that it's no longer needed; instead of trying to fix yesterday's problem with yesterday's solution, let's move on to solving the problem as if there had never been a need for caching and virtual memory in the first place.
      • by Anonymous Coward

        If you never reinvent the wheel, you'll never invent the tire. I say we let him down the rabbit hole and see if he comes back with anything new.

  • Only $100k? (Score:5, Informative)

    by afidel ( 530433 ) on Wednesday July 22, 2015 @08:18PM (#50165121)

    That doesn't go very far in the microprocessor world. I worked for Cisco back in the early 00s, and even back then tape-out costs were approaching $1M for a 5-layer mask; today, with sub-wavelength masks and chips using 12+ layers, it must be tremendously expensive to spin a chip.

    • by Anonymous Coward

      If you read the article, it says they raised $1.25 million in addition to the DARPA SBIR, which is just for software, and goes into the costs involved. For getting prototypes, it says they only need a couple hundred thousand dollars. I bet they are going to raise their next round after they have prototypes.

      • A couple of hundred thousand dollars will get you a prototype, not prototypes - experienced chip designers sometimes get something that works first time (and are deservedly incredibly smug about it). More commonly, you go through at least 2-3 iterations.
    • You can reduce the cost dramatically for a prototype - it's called a "shuttle run", where you share the mask costs with a group of other companies who put their chips on the same wafers. You can't go into mass production with this of course.

      Plus, they have a lot more than $100k in total.
    • That doesn't go very far in the microprocessor world. I worked for Cisco back in the early 00s, and even back then tape-out costs were approaching $1M for a 5-layer mask; today, with sub-wavelength masks and chips using 12+ layers, it must be tremendously expensive to spin a chip.

      That's just DARPA's award. He mentioned another $1.2M or so in VC funding in a different comment.

  • The final project of this VLSI elective course I took required each team to build three logical modules that would work together. I was responsible for the control and integration portion bringing together all the logical modules. I spent an entire sleepless night sorting out the issues. Our team was the only one that had a functioning chip (simulated) in the end. The lecturer wasn't surprised - most chips of any reasonable complexity require A LOT of painstaking (e.g. efficient routing, interference) work

    • Re: (Score:2, Funny)

      by Anonymous Coward

      An ENTIRE sleepless night? Wow. Sounds TOUGH. —said no MIT grad ever.

  • 19 (Score:5, Funny)

    by PopeRatzo ( 965947 ) on Wednesday July 22, 2015 @09:51PM (#50165425) Journal

    When I was 19, my main achievement was building a bong out of a milk jug.

    • by Anonymous Coward

      Which is less economically wasteful than flushing $1.25 million down the drain on yet another VLIW chip that is claimed to change the world, etc.

    • But I bet it was a nice bong, that brought joy to many.
      • But I bet it was a nice bong, that brought joy to many.

        Oh, it was a sweet bong. I got a contract from DARPA to make more, but I encountered some problems in the manufacturing stage because I was too busy watching Cartoon Network.

    • obviously, with mindblowing results
  • Parallella...
  • by Anonymous Coward

    "Why is there yogurt in this hat?" "I can explain that. It used to be milk, and well, time makes fools of us all."

    I truly hope this approach pans out and advances chip design, but if it doesn't, it will be another publicly available learning tool for the next small team to learn from. It's easy to say that it won't work and that it is going down the same path as previous attempts, but they might have something that does work and is worth a lot of money. If you don't like it, don't invest. If you think it

  • by Melkhior ( 169823 ) on Thursday July 23, 2015 @02:01AM (#50166051)
    Cue this old joke...
    - How many hardware engineers does it take to change a light bulb?
    - None, we'll fix it in software.

    Doing stuff in software to make the hardware easier has been tried before (and before this kid was born, which is perhaps why he thinks this is new). It failed. The Transputer, i960, i432, Itanium, MTA, Cell, and a slew of others I don't remember...

    As for the grid, nice, but not exactly new. Tilera, Adapteva, KalRay, ...
    • Yeah, yawn. Just last week I was bored and looked up transputers on eBay. You can still buy them. I think their programming language was OCCAM, no? I wish I had all the tools available to kids these days, but a 16K computer took two years of savings.
  • Most 19 year olds' idea of achievement is not puking on the front doorstep after a particularly brutal night out boozing. For all you doubters: can we see how this chip performs in the wild before passing judgement, please? To Thomas: will the chip ever see a retail shelf, in, say, a personal supercomputer like the NVidia Tesla?

  • Darned overloaded abbreviations. RTL has priority; it means Resistor-Transistor Logic.
