19-Year-Old's Supercomputer Chip Startup Gets DARPA Contract, Funding
An anonymous reader writes: 19-year-old Thomas Sohmers, who launched his own supercomputer chip startup back in March, has won a DARPA contract and funding for his company. Rex Computing is currently finishing up the architecture of its final verified RTL, which is expected to be completed by the end of the year. The new Neo chips will be sampled next year, before moving into full production in mid-2017. The Platform reports: "In addition to the young company's first round of financing, Rex Computing has also secured close to $100,000 in DARPA funds. The full description can be found midway down this DARPA document under 'Programming New Computers,' and has, according to Sohmers, been instrumental as they start down the verification and early tape-out process for the Neo chips. The funding is designed to target the automatic scratchpad memory tools, which, according to Sohmers, is the 'difficult part and where this approach might succeed where others have failed is the static compilation analysis technology at runtime.'"
good for him (Score:5, Insightful)
Not sure whats more impressive... (Score:5, Insightful)
Not sure what's more impressive: the fact that a 19-year-old is able to get DARPA funding, or the fact that a 19-year-old (and his team, presumably) is about to go into mass production with a fairly fancy-looking custom microprocessor on a 28nm fab process.
Re:Not sure whats more impressive... (Score:5, Informative)
Re: (Score:1)
I don't have a background in microprocessor design: I've only designed a very simple one as an assignment, but I've been following the industry pretty closely.
From what I can tell, his design looks like it might be comparable to GPUs in flops per watt, but with different memory abstractions that result in similar limitations. I suspect that if you write custom code for it, and have the right kind of problem, it will do significantly better than available options, but in the general case and/or non-specialized code it won't do anything much better than a GPU, though it may be competitive.
Re:Not sure whats more impressive... (Score:5, Informative)
Re: (Score:3, Funny)
Thomas, you are awesome.
Enjoy your success. I see from your bio that in your "free time" you like to play guitar. I hope you've bought yourself a good one (or six).
Re: (Score:3, Interesting)
Thanks for the response!
I should have noticed your numbers were for double precision flops, so my numbers were way off. Thanks for the correction. I bet you are IEEE compliant too (Darn GPUs...).
Your design is intended specifically for parallel workloads with localized or clustered data access, correct? (I realize this includes most supercomputing jobs.) It sounds like constraints similar to those you have with GPUs, but if they are met properly, the performance should be much better/more efficient and more scalable.
Re:Not sure whats more impressive... (Score:5, Informative)
2. The first thing we get around primarily by having ridiculous bandwidth (288 to 384GB/s aggregate chip-to-chip bandwidth)... we'll have more info out on that in the coming months. When it comes to memory movement, that's the big difficulty and what a big portion of our DARPA work is focused on, but a number of unique features of our network on chip (statically routed, non-blocking, single-cycle latency between routers, etc.) help a lot with allowing the compiler to *know* that it can push things around in a given time while only having to insert a minimal number of NOPs (a sketch of this style of scheduled data movement follows below). There is a lot of work ahead, and it will not be perfect in our first iteration, but the initial customers we are working with do not require perfect auto-optimization to begin with.
3. If you think of each core as a quad-issue, in-order RISC core (think on the performance order of an ARM Cortex A9 or potentially A15, but using a lot less power and being 64-bit), you can have one fully independent and isolated application on each core. That's one of the very nice things about a true MIMD/MPMD architecture. So we do fantastic with things that parallelize well, but you can also use our cores to run a lot of independent programs decently well.
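To make the "compiler knows it can push things around in a given time" idea above concrete, here is a minimal C sketch of double-buffered streaming through a per-core scratchpad, the kind of data movement a static compiler could schedule on hardware with known latencies. The spm_dma_get/spm_dma_wait intrinsics, the tile size, and the buffer layout are hypothetical placeholders (stubbed with memcpy so the sketch compiles), not Rex's actual API.

```c
#include <stddef.h>
#include <string.h>

#define TILE 4096                     /* elements per tile; sized to fit the scratchpad */

static float buf[2][TILE];            /* stand-in for two tiles of per-core scratchpad */

/* Hypothetical DMA intrinsics, stubbed with memcpy so this compiles and runs. */
static void spm_dma_get(float *dst, const float *src, size_t n) { memcpy(dst, src, n * sizeof *src); }
static void spm_dma_wait(float *dst) { (void)dst; /* real hardware: block until the transfer lands */ }

float sum_stream(const float *dram, size_t n_tiles)
{
    float acc = 0.0f;
    spm_dma_get(buf[0], dram, TILE);                  /* prime the first tile */
    for (size_t t = 0; t < n_tiles; ++t) {
        size_t cur = t & 1, nxt = cur ^ 1;
        if (t + 1 < n_tiles)                          /* start fetching the next tile early */
            spm_dma_get(buf[nxt], dram + (t + 1) * TILE, TILE);
        spm_dma_wait(buf[cur]);                       /* fixed, compile-time-known latency here */
        for (size_t i = 0; i < TILE; ++i)             /* compute overlaps the in-flight copy */
            acc += buf[cur][i];
    }
    return acc;
}
```

On a statically routed design, the compiler could in principle replace the wait with a known cycle count and pad with NOPs instead.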
Re: (Score:3)
The important question is: how does it perform on the Cycles (Blender's render engine) benchmarks? :)
http://blenderartists.org/foru... [blenderartists.org]
https://www.blender.org/downlo... [blender.org]
Re: (Score:2)
Re:Not sure whats more impressive... (Score:5, Interesting)
Can you elaborate on the programming structure/API you guys are envisioning for this? (It's cool if you can't; I'd understand.)
Although I'm not a programmer or CS person by training, I do GPGPU programming (although not BLAS-based stuff) almost exclusively for my research, and I enjoy it because once you understand the differences between the GPU and CPU it just becomes a question of how best to parallelize your algorithm. It'd be AMAZING to see the memory bandwidth and power usage specs you guys are working towards under a programming structure similar to what we currently have with something like CUDA or OpenCL. Any plans for something like that, or am I betraying my hobbyist computing status?
Finally, if you ever need any applications testing, specifically in the medical imaging field, feel free to get in touch.
Re:Not sure whats more impressive... (Score:5, Interesting)
If you want to get the latest info as it comes out, sign up for our mailing list on our website!
Re: Not sure whats more impressive... (Score:2)
Interesting that you mentioned CSP. When I read up on your architecture, its close relative, Functional Reactive Programming (well... inspired by FRP...), came to mind. It leads to easy programming and a relatively straightforward, direct mapping of FRP nodes to cores and of event streams to communication among cores. Very good isolation.
Re:Not sure whats more impressive... (Score:4, Interesting)
I like the idea of "reinventing the computer for performance". Trying to get rid of the overhead caused by virtual memory has attracted quite a bit of attention recently, so the idea is definitely sound.
A few questions:
-Are there any more details I can read anywhere? I could not really see anything past the "slightly technical PR" on http://www.rexcomputing.com/in... [rexcomputing.com]
-Do you plan on presenting your work at SuperComputing?
-You mention BLAS3 kernels, so I assume you mean dense BLAS3 kernels. From what I see, people are no longer really interested in dense linear algebra; most of the applications I see nowadays are sparse. Can your architecture deal with that?
-The chip and architecture seem to be essentially based on a 2D mesh network; can it be extended to more dimensions? I was under the impression that this would cause high latency in physical simulations, because you cannot easily project a 3D space onto a 2D space without introducing large distance discrepancies. (Which is why BG/Q uses a 5D torus network.)
Keep us apprised!
Cheers
Re:Not sure whats more impressive... (Score:4, Informative)
A couple of quick things: our instruction encoding is a bit different from what is on the slide; we've brought it down to a 128-bit VLIW (32 bits per functional unit operation), and there are some pipeline particulars we are not talking about publicly yet. We have also moved all of our compiler and toolchain development to be based on LLVM (so the really dense slides in there talking about GCC are mostly irrelevant).
As mentioned in the presentation, we have some ideas for expanding the 2D mesh on the chip, including having it become a 2D torus... our chip-to-chip interconnect allows a lot more interesting geometries, and we are working on one with a university research lab that features a special 50-node configuration with a maximum of 2 hops between nodes. Our 96GB/s chip-to-chip bandwidth per side is also a big thing differentiating us from other chips (with the big sacrifice being the very short distance we need between chips, and a lot of constraints on packaging and the motherboard). We'll have more news on this in the future.
When it comes to sparse and dense computations, we are mostly focusing on the dense ones to start (FFTs are beautiful on our architecture), but we are capable of doing well with sparse workloads; those developments are in the pipeline, and they will take a lot more compiler development effort.
We actually had a booth in the emerging technologies exhibition at the Supercomputing Conference in 2014, and hope to have a presence again this year.
I'm a pro in the field. This doesn't scan. (Score:5, Interesting)
Please explain to me simply how you get 10x in compute efficiency over GPUs--these chips are already fairly optimal at general purpose flops per watt because they run at low voltage and fill up the die with arithmetic.
GPUs have excellent memory bandwidth to their video RAM (GDDR*), but they have poor IO latency & bandwidth (PCIe-limited), which is the main reason they don't scale well.
We've heard the VLIW "we just need better compilers" line several times before.
Thus far this sounds like a truly excellent high school science fair project, or a slightly above average college engineering project. It is miles away from passing an industrial smell test.
Re: (Score:1)
It's not hard to get 10x if you make something that is essentially unprogrammable and incompatible with all existing software. (Think DSPs without the vendor libraries, stream computing, FPGAs, etc.) Since so much of the cost of these systems is in software development, incompatibility is pretty much a dead end for anything that isn't military-specific.
All the best (Score:2)
As somebody in the VLSI field, I am happy that somebody broke out of the monopoly/duopoly of the established players. We are moving towards a single or double vendor for everything from mobiles to laptop processors to desktop processors. Having little choice also harms progress.
The other thing which excites me is that you are going towards a completely new architecture. This is what innovation is about!
Hopefully, your success will inspire others also to take the plunge.
Re: (Score:2)
When it comes to being better than a GPU for applications, you have to remember GPUs have abysmal memory bandwidth (due to being limited by PCIe's 16GB/s to the CPU) after you run out of data in the relatively small memory on the GPU. Due to this, GPUs are really only good at level 3 BLAS applications (matrix-matrix based... basically things you would imagine GPUs were designed for, which are related to image/video processing).
This is only true if your problem does not fit in the VRAM (which is getting over 10GB nowadays). If it does fit, you'll be 8x to 12x faster than any brand-new CPU for any element-wise operation. Also, it is much more common than not to find an easy way to cut the problem up nicely.
That being said, do you know how much embedded RAM you will be proposing with your architecture (even a rough projection)?
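To put rough numbers on the element-wise vs. matrix-matrix (BLAS3) distinction being drawn in this exchange, here is a back-of-the-envelope sketch in C; the arithmetic-intensity figures are generic textbook estimates, not measurements of any particular GPU or of Rex's chip.

```c
/* Arithmetic intensity (flops per byte of DRAM traffic), back of the envelope:
 *
 *   Element-wise, c[i] = a[i] + b[i] on 4-byte floats:
 *     1 flop per element, ~12 bytes moved (two loads + one store)
 *     => ~0.08 flop/byte: limited by memory bandwidth, not by the ALUs.
 *
 *   Dense n x n matrix multiply (BLAS3):
 *     ~2*n^3 flops on ~3*n^2 * 4 bytes of data
 *     => intensity grows like n/6 flop/byte: with blocking it becomes
 *        compute-bound, which is why GEMM-like kernels map so well onto GPUs.
 */
#include <stddef.h>

void vec_add(float *c, const float *a, const float *b, size_t n)
{
    for (size_t i = 0; i < n; ++i)
        c[i] = a[i] + b[i];          /* streams roughly 12 bytes per flop */
}
```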
Re: (Score:3, Interesting)
I found this extremely intriguing, as I am currently writing up my dissertation on high-GFLOPS/W 3-D layered reconfigurable architectures. I am also of the opinion that memory handling is the key, as it is the only way to resolve the von Neumann bottleneck problem. Many processing elements with no means to feed them are useless. In my design I am using reconfigurability and flexibility to gain energy efficiency (my architectural range allows 111 GFLOPs/W in some configurations).
I am also conce
Re: (Score:2, Funny)
and stack machines are notorious for having HORRIBLE support for languages like C
Which is what makes them so awesome. It's like a door that filters out undesirable drunken retards before they even enter your house.
Re: (Score:3)
When it comes to being better than a GPU for applications, you have to remember GPUs have abysmal memory bandwidth (due to being limited by PCIe's 16GB/s to the CPU)
That's a somewhat odd claim. One of the reasons that computations on GPUs are fast is that they have high memory bandwidth. Being hampered by using the same DRAM as the CPU is one of the reasons that integrated GPUs perform worse. If you're writing GPU code that's doing anything other than initial setup over PCIe, then you're doing it badly wrong.
That said, GPU memory controllers tend to be highly specialised. The nVidia ones have around 20 different streaming modes for different access patterns (I t
Re: (Score:2)
That's less of an issue if your throughput comes from thread-level parallelism. There are some experimental architectures floating around that get very good i-cache usage and solid performance from a stack-based ISA and a massive number of hardware threads.
The other day, I had the mildly insane idea that perhaps our ability to explore the architectural space is limited by all existing architectures having been painstakingly handcrafted. Thus, if it were possible somehow to parametrically generate an architecture, and then synthesize a code-generator backend for a compiler and a suitable hardware implementation, we might be able to discover some hidden gems in the largely unexplored universe of machine architectures. But it sounds like a pipe dream to me...
Re: (Score:2)
Re: (Score:1)
you have to remember GPUs have abysmal memory bandwidth (due to being limited by PCIe's 16GB/s to the CPU)
Uhm, no. GPUs have massive bandwidth to THEIR memory. You're talking about lower speeds to the memory of A DIFFERENT PROCESSOR, so essentially you're comparing using the PCI bus as a network against direct memory access. These are two different things. GPUs can have far more memory than the systems they are attached to, and nVidia has certainly used this as a selling point for their GPGPU stuff. If your GPU is using system memory over the PCI bus, you fucked up your hardware purchase. When you thi
Re: (Score:2)
Re: (Score:2)
Re: (Score:1)
Re:Not sure whats more impressive... (Score:5, Funny)
Re:Not sure whats more impressive... (Score:5, Funny)
Re: (Score:2)
This is probably the best reply I've ever read by the subject of an article posted to Slashdot.
Re: (Score:2)
Whatever. You ever kiss a girl? You've tossed your youth away to build next year's landfill. Good for you.
I'd imagine that a 19 year old as impressive as he is pretty much has the sexual choice of anyone (male or female) that he wants.
Re: (Score:2)
From what I can tell, his design looks like it might be comparable to GPUs in flops per watt, but with different memory abstractions that result in similar limitations. I suspect that if you write custom code for it, and have the right kind of problem, it will do significantly better than available options, but in the general case and/or non-specialized code it won't do anything much better than a GPU, though it may be competitive.
In other words, very much like Chuck Moore's Forth cores. ;-) Which is fine, there's quite a range of applications for hardware like that. Especially in the military.
Re: (Score:3)
He's a Thiel Fellow, and clearly that model is working for kids like him who are super gifted and for whom the current college education model would be absurd.
This 17-Year-Old Dropped Out Of High School For Peter Thiel And Built A Game-Changing New Kind Of Computer [businessinsider.com]
Pretty awesome, if you ask me!
Re: (Score:2)
I'll go ahead and guess that he's really into, and good at, microprocessor design
What, a 19 year old who's designed his own supercomputer chip and received $100K DARPA funding? You're really going out on a limb there.
Re: (Score:2)
Both are pretty damn impressive, although for chip manufacturing, $100,000 isn't exactly a lot of money. Consider what a single engineer earns per year. Apparently, they have six people and are hiring a seventh. I guess that means they must have some other funding coming in from somewhere, as he talked about how "supercomputing" is a poisoned word among Silicon Valley VC firms.
I hope to hear interesting things about this young man and his company in the future.
Re:Not sure whats more impressive... (Score:5, Informative)
Re:Not sure whats more impressive... (Score:5, Funny)
Re: (Score:2)
Hell, at that price I'd be able to fund the whole project in Dogecoins!
Re: (Score:2)
Ah, I missed it right in the first paragraph. Good work getting the initial funding and your company off the ground. Mine isn't nearly so ambitious, but I can sympathize with the headaches of getting a new business started.
I'm a software guy, and know only the theoretical basics about the hardware I program for, but the notion of putting more of the complexity into the compiler instead of the chip is interesting. I wonder if this technology requires new approaches to languages and compilers, or whether i
Re: (Score:2)
Maybe HotSpot/V8-style optimizations would work well, since the actual patterns emerge in running code. This is a great talk on the cost of virtual memory, the future of JS, and more :-) https://www.destroyallsoftware... [destroyallsoftware.com]
Re: (Score:2)
In either case, it sounds like a hell of a challenge, as (if I understand correctly) you'd presumably need to pre-evaluate logic flow and track how resources are accessed in order to embed the proper data cache hints. However, those sorts of access patterns can change depending on the state of program data, even within the same code. Or, I suppose, you could "tune" the program for optimal execution through multiple evaluation passes, embedding data access hints in a second pass after seeing the results of a "profiling" run.
Sounds like LuaJIT on steroids.
Re: (Score:1)
"We have actually just raised $1.25 in venture funding"
Looks like you shouldn't use your CPU design to run your finances there, Sparky.
Re: (Score:2)
Re: (Score:1)
You may be too busy, but are you accepting private VC funding and, if so, whom would one contact to do so? I might have a soft spot for MIT alumni.
The 19 year old is a lunatic (Score:1)
"Virtual Memory translation and paging are two of the worst decisions in computing history"
"Introduction of hardware managed caching is what I consider 'The beginning of the end'"
---
These comments betray a fairly child-like understanding of computer architecture.
Re: (Score:2)
"Virtual Memory translation and paging are two of the worst decisions in computing history"
"Introduction of hardware managed caching is what I consider 'The beginning of the end'"
---
These comments betray a fairly child-like understanding of computer architecture.
He's young, and he displays much more talent than people twice his age. What's your problem anyway?
Re: (Score:1)
Talent is not the same thing as experience. Being able to do something does not mean it is a good idea to do it. So many people have tried this approach to improving efficiency (MIT RAW, Stanford Imagine, Stanford Smart Memories) and have run into such serious problems (compilers, libraries, ecosystem, system-level support) that unless he has solutions for those problems, starting it again is not a smart idea.
Re: (Score:2)
Talent is not the same thing as experience.
I'm in agreement - experience counts for a lot when doing something new.
Being able to do something does not mean it is a good idea to do it.
I'm in agreement with this as well.
So many people have tried this approach to improving efficiency (MIT RAW, Stanford Imagine, Stanford Smart Memories) and have run into such serious problems (compilers, libraries, ecosystem, system-level support) that unless he has solutions for those problems, starting it again is not a smart idea.
It is highly unlikely that this will go anywhere (so, yeah - agreement again)... BUT... he is displaying a great deal of talent for his age. The lessons he learns from this failure[1] will be more valuable than the lessons learned in succeeding at a less difficult task.
As I understand it, he proposes removing the hardware cache and instead using the compiler to prefetch values from memory. He says the
Re: (Score:2)
Whether he can actually produce a compiler that will insert the necessary memory fetch instructions at compile time in an efficient manner remains to be seen
That's not the hard bit of the problem. Compiler-aided prefetching is fairly well understood. The problem is the eviction: having a good policy for when data won't be referenced in the future is hard. A simple round-robin policy on cache lines works okay, but part of the reason that modern caches are complex is that they try to have more clever eviction strategies. Even then, most of the die usage by caches is the SRAM cells - the controller logic is tiny in comparison.
Re: (Score:2)
Whether he can actually produce a compiler that will insert the necessary memory fetch instructions at compile time in an efficient manner remains to be seen
That's not the hard bit of the problem. Compiler-aided prefetching is fairly well understood.
I honestly thought that was the difficult part; it's halting-problem hard, if I understand correctly. If you cannot predict whether a program will ever reach the end state, then you cannot predict whether it will ever reach *any* particular state. Knowing whether to prefetch something requires knowledge of the program's future state.
To my knowledge, prediction of program state only works if you're predicting a *very* short time into the future (say, no more than a hundred instructions). If you're li
Re: (Score:3)
Prefetching in the general case is non-computable, but a lot of accesses are predictable. If the stack is in the scratchpad, then you're really only looking at heap accesses and globals for prefetching. Globals are easy to hint statically, and heap variables are accessed by pointers that are reachable. It's fairly easy, for each function that you might call, to emit a prefetch version that doesn't do any calculation and just loads the data, then insert a call to that earlier. You don't have to get it right
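A minimal sketch of the "emit a prefetch-only version of the callee" idea described above, using GCC/Clang's __builtin_prefetch; the list-of-payloads workload and the prefetch depth are made up for illustration and are not taken from any real compiler's output.

```c
/* walk_sum() does the real work; walk_prefetch() is its data-only shadow,
 * hoisted ahead of the real call so the payload lines are (hopefully) resident.
 * __builtin_prefetch is a real GCC/Clang builtin; the rest is illustrative. */
struct payload { double vals[64]; };                    /* large records sitting in DRAM */
struct node    { struct payload *data; struct node *next; };

static void walk_prefetch(const struct node *n, int ahead)
{
    while (n && ahead--) {
        __builtin_prefetch(n->data, 0 /* read */, 1 /* modest reuse expected */);
        n = n->next;                 /* pointer chasing limits how far ahead this helps */
    }
}

static double walk_sum(const struct node *n)
{
    double sum = 0.0;
    for (; n; n = n->next)
        for (int i = 0; i < 64; ++i)
            sum += n->data->vals[i];
    return sum;
}

double process(const struct node *list)
{
    walk_prefetch(list, 16);         /* the compiler-emitted "prefetch version", called early */
    /* ...independent work could overlap with the in-flight prefetches here... */
    return walk_sum(list);           /* the actual computation */
}
```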
Re: (Score:3)
At a single core, we have a 128KB multibanked scratchpad memory, which you can think of as just like an L1 cache but smaller and lower latency. We have one cycle latency for a load/store from your registers to or from the scratchpad, and the same latency from the scratchpad to/from the core's router. That sc
Re: (Score:2)
At a single core, we have a 128KB multibanked scratchpad memory, which you can think of as just like an L1 cache but smaller and lower latency. We have one cycle latency for a load/store from your registers to or from the scratchpad
Note that a single-cycle latency for L1 is not that uncommon in in-order pipelines - the Cortex A7, for example, has single-cycle access to L1.
That scratchpad is physically addressed, and does not have a bunch of extra (and in our opinion, wasted) logic to handle address translations,
The usual trick for this is to arrange your cache lines such that your L1 is virtually indexed and physically tagged, which means that you only need the TLB lookup (which can come from a micro-TLB) on the response. If you look at the cache design on the Cortex A72, it does a few more tricks that let you get roughly the same power as a direct-mapped L1 (which has ver
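For readers unfamiliar with the trick: virtually-indexed/physically-tagged works because the set-index bits are taken entirely from the page offset, which is identical in the virtual and physical address. A small worked example with generic textbook numbers (not specific to the A7 or A72):

```c
/* VIPT sizing rule: cache_size / associativity <= page_size,
 * so the index bits never need translating before the lookup starts. */
#include <assert.h>

int main(void)
{
    const unsigned page = 4u * 1024;      /* 4 KB pages  */
    const unsigned l1   = 32u * 1024;     /* 32 KB L1    */

    assert(l1 / 8 <= page);  /* 8-way: 4 KB per way, index fits in the page offset -> plain VIPT   */
    assert(l1 / 4 >  page);  /* 4-way: 8 KB per way, index would need translation (or alias
                                handling) before indexing the cache                                */
    return 0;
}
```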
His entire premise is wrong. (Score:2)
The primary benefit of caches for HPC applications is *bandwidth filtering*. You can have much higher bandwidth to your cache (TB/s, pretty easily) than you can ever get to off-chip memory--and it is substantially lower power. It requires blocking your application so that its working set fits in cache.
He's pulling out quotes from Cray (I used to work there) about how caches just get in the way--and they did, 30 years ago when there were very few HPC applications whose working set could fit in cache. It's a very
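A generic illustration of the blocking the parent describes (standard loop tiling, not tied to any particular chip): the blocked version reuses each tile many times while it sits in cache, so DRAM traffic per flop drops by roughly the tile size.

```c
/* Blocked (tiled) matrix multiply; assumes C starts zeroed.
 * Each BS x BS tile of A, B and C is reused ~BS times while resident in cache,
 * which is the "bandwidth filtering" effect: DRAM sees roughly 1/BS of the
 * traffic a naive triple loop would generate. */
#define N  1024
#define BS 64                          /* pick BS so three BSxBS float tiles fit in cache */

void matmul_blocked(const float A[N][N], const float B[N][N], float C[N][N])
{
    for (int ii = 0; ii < N; ii += BS)
        for (int kk = 0; kk < N; kk += BS)
            for (int jj = 0; jj < N; jj += BS)
                for (int i = ii; i < ii + BS; ++i)         /* work stays inside  */
                    for (int k = kk; k < kk + BS; ++k)     /* the three resident */
                        for (int j = jj; j < jj + BS; ++j) /* tiles              */
                            C[i][j] += A[i][k] * B[k][j];
}
```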
Re: (Score:2)
I'll throw a Seymour Cray quote here for fun... "You can't fake memory bandwidth
Pretty much everything you said is gibberish. (Score:2)
Congratulations for tricking someone into giving you money. Good luck with your impending disaster.
Re: (Score:2)
"Virtual Memory translation and paging are two of the worst decisions in computing history"
In the old days and even with current CPUs, one CPU can run multiple processes. But if CPUs were small enough and cheap enough, one program would run on multiple CPUs. Why would you need memory protection (virtual memory translation) if only a small portion of one program is running on one CPU? Answer: you don't.
So, TL;DR: he could be right, but only for systems with a huge number of weak/limited CPUs.
Re: (Score:2)
"Virtual Memory translation and paging are two of the worst decisions in computing history"
He's not completely wrong there. Paging is nice for operating systems isolating processes and for enabling swapping, but it's horrible to implement in hardware and it's not very useful for userland software. Conflating translation with protection means that the OS has to be on the fast path for any userland changes and means that the protection granule and translation granule have to be the same size. The TLB needs to be an associative structure that can return results in a single cycle, which makes it h
Re: (Score:2)
Re: (Score:1)
If you never reinvent the wheel, you'll never invent the tire. I say we let him down the rabbit hole and see if he comes back with anything new.
Re: (Score:2)
Only $100k? (Score:5, Informative)
That doesn't go very far in the microprocessor world. I worked for Cisco back in the early 2000s, and even back then tape-out costs were approaching $1M for a 5-layer mask; today, with sub-wavelength masks and chips using 12+ layers, it must be tremendously expensive to spin a chip.
Re: (Score:1)
If you read the article, it says they raised $1.25 million in addition to the DARPA SBIR, which is just for software, and it goes into the costs involved. For getting prototypes, it says they only need a couple hundred thousand dollars. I bet they are going to raise their next round after they have prototypes.
Re: (Score:2)
Re: (Score:1)
Plus, they have a lot more than $100k in total.
Re: (Score:2)
That doesn't go very far in the microprocessor world. I worked for Cisco back in the early 2000s, and even back then tape-out costs were approaching $1M for a 5-layer mask; today, with sub-wavelength masks and chips using 12+ layers, it must be tremendously expensive to spin a chip.
That's just DARPA's award. He mentioned another $1.2M or so in VC funding in a different comment.
Re: (Score:2)
VLSI is hard (Score:2)
The final project of a VLSI elective course I took required each team to build three logical modules that would work together. I was responsible for the control and integration portion, bringing together all the logical modules. I spent an entire sleepless night sorting out the issues. Our team was the only one that had a functioning (simulated) chip in the end. The lecturer wasn't surprised - most chips of any reasonable complexity require A LOT of painstaking work (e.g. efficient routing, avoiding interference)
Re: (Score:2, Funny)
An ENTIRE sleepless night? Wow. Sounds TOUGH. —said no MIT grad ever.
19 (Score:5, Funny)
When I was 19, my main achievement was building a bong out of a milk jug.
Re: (Score:1)
Which is less economically wasteful than flushing $1.25 million down the drain on yet another VLIW chip that is claimed to change the world, etc.
Re: (Score:2)
Re: (Score:2)
Oh, it was a sweet bong. I got a contract from DARPA to make more, but I encountered some problems in the manufacturing stage because I was too busy watching Cartoon Network.
Re: (Score:2)
Re: (Score:2)
True. I can get plenty of chips for not much money at the grocery store, and I'd guess they're tastier than the ones trsohmers is working on.
Parallella... (Score:1)
Futurama quote of the day. (Score:1)
"Why is there yogurt in this hat?" "I can explain that. It used to be milk, and well, time makes fools of us all."
I truly hope this approach pans out and advances chip design, but if it doesn't, it will be another publicly available learning tool for the next small team to learn from. It's easy to say that it won't work and that it is going down the same path as previous attempts, but they might have something that does work and is worth a lot of money. If you don't like it, don't invest. If you think it
(old fart)been tried before(/old fart) (Score:4, Insightful)
- How many hardware engineers does it take to change a light bulb?
- None, we'll fix it in software.
Doing stuff in software to make the hardware easier has been tried before (and before this kid was born, which is perhaps why he thinks this is new). It failed. Transputer, i960, i432, Itanium, MTA, Cell, and a slew of others I don't remember...
As for the grid, nice, but not exactly new. Tilera, Adapteva, KalRay,
Re: (Score:2)
Colour me impressed (Score:2)
Most 19-year-olds' idea of achievement is not puking up on the front doorstep after a particularly brutal night out boozing. For all you doubters: can we see how this chip performs in the wild before passing judgement, please? To Thomas: will the chip ever see a retail shelf in, say, a personal supercomputer like the NVidia Tesla?
Old-timer here (Score:2)
Re: (Score:1)
I'd say that's a fair sign of success. Despite the sense of jealousy, nobody can think of anything bad to say.
Re:Half an hour, two comments (Score:5, Informative)
Re:By Neruos (Score:4, Interesting)
Re: By Neruos (Score:1)
From what I've been able to read it doesn't look that different from other projects like Tilera or Kalray MPPA.
Re: By Neruos (Score:5, Informative)
Re: (Score:2)
The biggest thing is what we have tried to emphasize, which is the fact that we have an entirely different memory system that does away with the hardware managed cache hierarchy. The rest of the really interesting stuff we have not publicly disclosed (yet), but I can tell you that it is very different from both Kalray and Tilera.
You might have answered this already, but I'm not very good at reading walls of text, so apologies if this is a repeat: the hardware-managed cache design is popular for a reason - it provides a speed boost. If you remove it, what do you propose to replace it with? (Unless you have a design that makes a hardware-managed cache redundant. What do you do then? Have software manage the cache?)
Re: (Score:2)
Re: (Score:3)
2. Already have standard cells and memory compilers. We are not amateurs.
3. We actually have solid state physics and fabrication experience, and understand the physical constraints of wire and gate delays, leakage, etc. All of those play