
The Story Behind a Failed HPC Startup

Posted by kdawson on Tuesday November 03, @06:25PM
from the build-it-and-they-will-come-if-you-don't-run-out-of-money-first dept.
supercomputing
business
earth
power
news
jbrodkin writes "SiCortex had an idea that it thought would take the supercomputing world by storm — build the most energy-efficient HPC clusters on the planet. But the recession, and the difficulties of penetrating a market dominated by Intel-based machines, proved to be too much for the company to handle. SiCortex ended up folding earlier this year, and its story may be a cautionary tale for startups trying to bring innovation to the supercomputing industry."
  • 1 down (Score:3, Informative)

    by Locke2005 (849178) on Tuesday November 03, @06:31PM (#29970688)
    Lightfleet [lightfleet.com] soon to follow. How is the company that was using Transmeta chips doing?
    • Re: (Score:3, Insightful)

      Orion? Long gone.

      http://www.theregister.co.uk/2006/02/14/orion_shuts_down/ [theregister.co.uk]

      The weird thing here is that the Register quotes Bill Gates as calling Orion's deskside supercomputers part of a "key trend". Now, I've always thought Bill's understanding of the marketplace was overrated. But you'd think that somebody whose immense fortune comes almost entirely from the triumph of commodity processors would know that this kind of effort is doomed.

      Some people are just in love with these fancy RISC architectures and

      • Yes, Orion, thanks, I couldn't remember the name. Insisting on basing an HPC on a chip that achieves low power by throttling back performance under high demand probably wasn't a smart choice either. Although at least with an Orion, you could develop your app on your desktop PC.
        • You're thinking of Transmeta's LongRun technology, which reduces clock speed (and thus power consumption) when the system isn't working hard. Similar features are actually quite standard in the current generation of CPUs. There's no impact on performance, because when the system's busy, it's always running at maximum speed. There is some hassle with system software that gets confused when the CPU it's running on goes into idle mode.

          Most of the supposed power savings for a Transmeta cpu comes from the fact th
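Clock throttling of this kind pays off because of the usual first-order CMOS model: dynamic power scales roughly with capacitance x voltage^2 x frequency, and a lower clock normally permits a lower supply voltage as well. A minimal sketch of that relationship; the 20% scaling factor is a hypothetical example, not a LongRun or Transmeta figure, and the model ignores static (leakage) power entirely.

```python
def dynamic_power(capacitance: float, voltage: float, frequency: float) -> float:
    """First-order CMOS dynamic power: P ~ C * V^2 * f (leakage ignored)."""
    return capacitance * voltage ** 2 * frequency


def relative_power(scale: float) -> float:
    """Power ratio when supply voltage and clock are both scaled by `scale`."""
    baseline = dynamic_power(1.0, 1.0, 1.0)
    scaled = dynamic_power(1.0, scale, scale)
    return scaled / baseline


# Hypothetical: cutting voltage and clock by 20% each drops dynamic power to
# roughly half, while the clock (and so peak throughput) only drops to 80%.
print(f"{relative_power(0.8):.3f}")
```

The cubic dependence is why throttling when idle is nearly free, and also why a chip that runs slowly all the time can look very efficient per watt.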

    • I intensely dislike saying bad things about companies I've worked for, but there's no harm in simply saying outright that Lightfleet is (for all practical purposes) clinically dead. It is possible that it could be revived, I suppose. Some of the early design work was ingenious and has a lot of merit. At this time, though, Count Dracula has better vital signs.

  • Fool's errand (Score:4, Insightful)

    by Locke2005 (849178) on Tuesday November 03, @06:40PM (#29970864)
    In a blog post after SiCortex shut down, Reilly says he believes there is still room for non-x86 machines in the HPC market. He is wrong. Much more money is being spent every year on improving x86 chips than all the competitors combined. Basing a supercomputer on MIPS was short-sighted; even if it offers a price/performance or power/performance advantage now, in a couple years it won't, because x86 is being improved at a much faster rate. Where is Sequent now? The only way to build a successful desktop HPC company is to be able to do system design turns as fast as new x86 generations come out and ship soon after the new CPUs become widely available, e.g. a complete new product every 6 months. That requires partnership with either Intel or AMD, not use of a MIPS chip that no one is spending R&D resources on anymore.
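The argument rests on compounding: a fixed lead shrinks by the ratio of the two annual improvement rates every year. A small sketch of that arithmetic; the 2x starting advantage and the 40%/10% yearly improvement rates are hypothetical numbers chosen for illustration, not figures from the post or the article.

```python
import math


def years_until_parity(initial_advantage: float, rival_growth: float, own_growth: float) -> float:
    """Years until a rival improving at `rival_growth` per year (0.40 = 40%)
    erases an `initial_advantage` (2.0 = twice as good) held by a product
    improving at `own_growth` per year."""
    if rival_growth <= own_growth:
        return math.inf  # the rival never catches up
    return math.log(initial_advantage) / math.log((1 + rival_growth) / (1 + own_growth))


# Hypothetical: a 2x price/performance lead, x86 improving 40%/yr, the niche chip 10%/yr.
print(f"{years_until_parity(2.0, 0.40, 0.10):.1f} years")
```

Under those made-up rates the lead is gone in a little under three years, which is the "couple years" window the comment describes; a faster-improving niche part stretches it out, a bigger x86 budget shortens it.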
    • Unfortunately, that's not sufficient. You also need the US government to throw you tens of millions in
      'research contracts' that don't amount to anything, and to agree to buy your overpriced
      machines even though they don't really do anything useful.

      Even then, it's a pretty difficult market.

    • Basing a supercomputer on MIPS was short-sighted; even if it offers a price/performance or power/performance advantage now, in a couple years it won't, because x86 is being improved at a much faster rate. Where is Sequent now?

      Uh, Sequent never used MIPS chips. In fact, the vast majority of the systems that they sold were Intel-based.

      Maybe you mean SGI? Their problems seemed to coincide with their moves to Intel chips (SGI PCs that flopped and then later the wholesale move to IA-64). Not to say that the problems were caused by those moves - maybe they were, maybe they weren't - but it certainly didn't make the problems go away.

      • Sorry about the confusing jump there. I was referring to SiCortex's use of MIPS, not Sequent. Sequent was always x86-based; their weakness was lagging the release of the latest x86 chips by about a year in their products. Lightfleet also tried, then discarded, the Broadcom/SiByte processor, because its floating-point unit quite simply did not perform as advertised.
        • I think IBM's entry into multiprocessor Unix machines with the RS/6000 was the death knell for Sequent and Pyramid. They both had their day in the Sun. Now Sun is gone too.

    • He is wrong.

      Have you looked at the top 100 recently? x86 is certainly not the only game in town.

      • Yeah, it's only 95%...
      • GPU's? (Score:2, Interesting)

        My next cluster is going to be based around Teslas. GPUs are the future. It takes 100,000 x86 cores to get a petaflop. You can get there with 25,000 if you use Cells (5K x86, 20K Cells). You can do the same thing with 10K if you use GPUs (5K x86, 5K Teslas). Guess what the cheapest option is? They might not be the most energy efficient, but haven't we learned the problem with custom chips in the HPC market? That's why we went to clusters in the first place.
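Those three configurations imply a per-device throughput that is easy to back out. A minimal sketch that takes the comment's counts at face value and assumes each configuration reaches exactly 1 PFLOPS, with every x86 core contributing the same amount in all three; the comment does not say whether these are peak or sustained, single- or double-precision figures.

```python
PETAFLOP = 1e15


def implied_x86_core_rate(x86_cores: int) -> float:
    """FLOPS per x86 core if `x86_cores` alone reach one petaflop."""
    return PETAFLOP / x86_cores


def implied_accelerator_rate(x86_cores: int, accelerators: int, x86_rate: float) -> float:
    """FLOPS each accelerator must supply once the x86 share is subtracted."""
    return (PETAFLOP - x86_cores * x86_rate) / accelerators


x86 = implied_x86_core_rate(100_000)
cell = implied_accelerator_rate(5_000, 20_000, x86)   # 5K x86 + 20K Cells
tesla = implied_accelerator_rate(5_000, 5_000, x86)   # 5K x86 + 5K Teslas
print(f"x86 core: {x86 / 1e9:.1f} GFLOPS, Cell: {cell / 1e9:.1f} GFLOPS, Tesla: {tesla / 1e9:.1f} GFLOPS")
```

Under those assumptions each x86 core is worth 10 GFLOPS, each Cell about 47.5 GFLOPS and each Tesla about 190 GFLOPS, which is exactly where the precision of the quoted figures starts to matter.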
        • Re:GPU's? (Score:4, Informative)

          by peawee03 (714493) on Wednesday November 04, @12:26AM (#29974010)
          Currently, Teslas are the single-precision future. All my work is in double precision (64-bit), which is where most GPUs are much, much slower. IIRC, the next-generation GPUs are going to have respectable double-precision performance, but they're way down the road; hopefully I'll have moved on to a job where it doesn't matter by then. Hell, I consider it a victory when I've gotten a code translated from FORTRAN 77 to Fortran 95. GPUs? I'll wait until next decade. More normal cores are low-hanging fruit I can use with any MPI code *now*.
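A concrete illustration of why double-precision codes cannot simply be moved onto single-precision hardware: an IEEE-754 single has a 24-bit significand, so above 2^24 it can no longer represent every integer and small increments are silently lost. The sketch uses Python's `struct` module to round a double to the nearest single-precision value; it demonstrates the number formats only and says nothing about any particular GPU's speed.

```python
import struct


def to_f32(x: float) -> float:
    """Round a Python float (an IEEE-754 double) to the nearest single-precision value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


big = 2.0 ** 24  # 16,777,216: past this point float32 cannot hold every integer

# In single precision the +1 rounds away; in double precision it survives.
print("float32 keeps +1:", to_f32(big + 1.0) != big)
print("float64 keeps +1:", (big + 1.0) != big)
```

Long-running accumulations hit the same wall far sooner than 2^24 when the increments are small relative to the running total, which is why many simulation codes insist on 64-bit arithmetic.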
    • Re: (Score:3, Insightful)

      Right now x86 has only two viable competitors.

      - ARM
      - Whatever IBM can design (but IBM's stuff is expensive)

      ARM CPUs tend to be cheap, power efficient, and pack a ton of performance for the price - and the company has enough cash to keep developing for years and years. Other companies fab, so that lets them keep focused on what they're good at. It's a relationship that mirrors GPU makers - ATI/nVidia/TSMC. However, ARM has a very low performance cap compared to x86, so that limits usage scenarios. Good for lo

      • Cell architecture is good for specific problem sets, ones that look a lot like video streaming. It is by no means a general-purpose HPC architecture.

        Everybody knows x86 is a poor foundation for high-speed computing, but the software, tools, and R&D budget to keep improving it are there. Specialty processors can offer temporary advantages, but within a few years Intel engineers will do their best to integrate any new ideas into their latest designs. Since recovering from the Whitehall fiasco and getting
    • Basing a supercomputer on MIPS was short-sighted; even if it offers a price/performance or power/performance advantage now, in a couple years it won't, because x86 is being improved at a much faster rate.

      It wasn't even MIPS. From TFA:

      But SiCortex went against conventional wisdom by building its own processors and this decision limited the company's market to early adopters, Conway says. In building its chips, SiCortex obtained intellectual property from several vendors, including MIPS Technologies, and tweaked the design to meet its own needs.

      An HPC start-up going into the microprocessor design business now? That really is a fool's errand. Mind you, that's sort of how the ARM processor came to be, but that was a loooonnng time ago.

      • It *was* MIPS. They licensed core designs and used the MIPS ISA. But they had a custom 6-to-a-die design with some specialist MPI and fabric circuitry of their own design.
    • > Much more money is being spent every year on improving x86 chips than all the competitors combined.

      By your logic, General Motors should be crushing Ferrari in the supercar market. After all, GM spends much more on their car development than Ferrari does.
    • Of course, there are all those GPUs offering orders of magnitude more GFLOPS than any x86 and being improved at a much faster rate than x86.
  • by Wrath0fb0b (302444) on Tuesday November 03, @06:51PM (#29971020)

    ... is almost always wrong. As one of the principals on a large-ish (not large by world standards, 1000 cores, mainly Nehalem so approximately 100 GFLOPS) cluster, I've been very pleased that we've done things as simply as possible. Sun Grid Engine and ROCKS running on commodity 1Us delivers an economical and effective solution (no, I don't work for Sun).

    Most importantly, the environment does not unduly restrict what kind of compute jobs can be run. If it can be compiled on *nix, we can probably run it. We lose to specialized hardware (GPU-based, Cell-based, ... ) in raw throughput but we make up for it in both initial price and ease of deployment. We don't even have a dedicated admin for the cluster -- we had one to set it up and he did such a good job we haven't needed to hire a replacement!

    Ultimately, I feel like it's not worth paying extra in hardware and software-dev costs to save a few dollars on cooling and power. Sure, you get the credibility of running a "green" cluster (never mind that you have to pay to feed and house those extra developers, which should legitimately come out of your carbon budget), but you end up with a far less useful product.

    Long Live X86(_64)!
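For anyone sizing a similar cluster, theoretical peak is conventionally estimated as cores x clock x floating-point operations per cycle per core. A sketch assuming a 2.66 GHz clock and 4 double-precision FLOPs per cycle per Nehalem core (one 2-wide SSE add plus one 2-wide SSE multiply); both values are assumptions, not figures from the post, and sustained application performance is always some fraction of this peak.

```python
def peak_flops(cores: int, clock_hz: float, flops_per_cycle: int) -> float:
    """Theoretical peak = cores x clock x FLOPs issued per cycle per core."""
    return cores * clock_hz * flops_per_cycle


# Assumed: 1000 cores at 2.66 GHz, 4 double-precision FLOPs/cycle each.
peak = peak_flops(1000, 2.66e9, 4)
print(f"theoretical peak ~ {peak / 1e12:.1f} TFLOPS")
```

Under these assumptions the theoretical peak comes out near 10 TFLOPS, roughly a hundred times the figure quoted in the post, so that figure is worth checking against the cluster's actual clock speed and instruction mix.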

    • by Gorobei (127755) on Tuesday November 03, @09:41PM (#29972734)

      Exactly right. I've got >10K cores and >10M LOC. "Hardware fault" typically means a datacenter caught fire, or was flooded, or an undersea cable got cut.

      If someone pitches a cheaper solution (e.g. power savings), I'm happy to listen for 10 minutes. Then I just want to know how fast I can see results: a dev costs $50K/month here, so I'll give it a week or two. If you don't have a test farm ready to go with full compilers, a data security plan, etc., I'm just going to reject it. If you can get traction with universities, great, come back and pitch again in a year.

      • That's what they designed: it's basically a bog-ordinary Linux-with-MPI cluster in a box. They had a custom internal fabric that was far more efficient than ordinary switches and even had on-die MPI accelerators. They also shipped with compilers for C, C++ and Fortran.

        It was meant to be a drop-in replacement for room-sized clusters for a fraction of the space and heat. Basically what killed them was cashflow.

        • Re: (Score:3, Interesting)

          Yep, cashflow is a bitch: if I need to spend $25K to even look at the product, and they need $20M to run a demo datacenter, they need something like $100M in capital to avoid dying on the vine :(

      • Just out of curiosity, how much of that $50K/month is salary? And how does the rest break down?
    • By your logic, General Motors should be crushing Ferrari. After all, GM spends much more on their car development than Ferrari does.

  • x86 is certainly entrenched on the desktop, but in supercomputing? In the top 10, it's maybe half x86. There's a strong showing from Power (BlueGene), and of course the #1 spot is held by an x86/Cell hybrid (which gets most of its FLOPS from the SPUs, not the PPC or x86 cores).

    Hardly entrenched.

    Looking further down, there is mainly x86, but still a strong showing from Power (IBM), and also SPARC, NEC's vector processor (kind of PPC), Itanium and a few random others.

    So, the top 100 is dominated by AMD, Intel and IBM in roughly equa

    • Re: (Score:3, Informative)

      First, the HPC world has a lot of commodity computers, but it also has a lot of very special-purpose computers.

      Second, the odds of someone buying an HPC machine and then running pre-compiled generically-optimized code on it is virtually zero.

      Third, HPC computers (as compared to loosely-clustered "pile-of-PCs" systems) are expensive and almost invariably use components that aren't "run-of-the-mill" (such as Infiniband or SCI for the interconnect).

      In consequence, not only is the ix86 not "entrenched", it can'

  • Commodity (Score:2, Interesting)

    FTFA:

    "It is possible for a small company to compete in the computer systems business," Reilly wrote. "There are some who will say that nobody can compete against 'commodity manufacturers.' Ignore them. ... There are only two true commodities in the computer business: DRAMs and wafer area. Everybody pretty much pays the same price for DRAMs. Wafer area is what you make of it. If you insist on building giant 100W chips, life will be tough. But if you use the silicon wafer area for something new, different an

  • by timmarhy (659436) on Tuesday November 03, @07:49PM (#29971760)
    These guys failed in a very typical geeky fashion. They understood the technology but not the business, and at the end of the day your customers need a business case to use your services. It's the tail attempting to wag the dog.
  • Points made above about non-x86 processors being doomed aside, SiCortex had an interesting interconnect design. Their Kautz-graph-based interconnect was (at least to me) fairly innovative.

        Personally, I'm sorry to see them go. We never had a chance to benchmark our software on their system, but I suspected it might have performed very well per dollar. Even if the underlying system disappears, their interconnect ideas may survive.
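A Kautz graph has a compact definition: vertices are strings of a fixed length over an alphabet of degree + 1 symbols with no two adjacent symbols equal, and each vertex links to the strings obtained by dropping its first symbol and appending any symbol that differs from its last. The sketch below builds that graph and measures its diameter by breadth-first search. It shows the general graph property (fixed degree per node, diameter equal to the string length), not SiCortex's actual routing or wiring, and the parameters are chosen purely for illustration.

```python
from collections import deque
from itertools import product


def kautz_vertices(degree: int, length: int) -> list[tuple[int, ...]]:
    """Strings of `length` symbols over degree + 1 symbols, no two adjacent equal."""
    return [s for s in product(range(degree + 1), repeat=length)
            if all(a != b for a, b in zip(s, s[1:]))]


def kautz_neighbors(v: tuple[int, ...], degree: int) -> list[tuple[int, ...]]:
    """Out-links: drop the first symbol, append any symbol different from the last."""
    return [v[1:] + (c,) for c in range(degree + 1) if c != v[-1]]


def bfs_depth(src: tuple[int, ...], degree: int) -> tuple[int, int]:
    """(largest hop count from `src`, number of vertices reached)."""
    dist = {src: 0}
    queue = deque([src])
    while queue:
        u = queue.popleft()
        for w in kautz_neighbors(u, degree):
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    return max(dist.values()), len(dist)


def diameter(degree: int, length: int) -> tuple[int, int]:
    """Return (vertex count, diameter) of the Kautz graph."""
    vertices = kautz_vertices(degree, length)
    worst = 0
    for v in vertices:
        depth, reached = bfs_depth(v, degree)
        if reached != len(vertices):
            raise ValueError("graph is not strongly connected")
        worst = max(worst, depth)
    return len(vertices), worst


print(diameter(3, 4))  # vertex count and diameter for degree 3, strings of length 4
```

With degree 3 and length 4 this gives 108 nodes, each with only three outgoing links, yet every node reaches every other in at most four hops. The vertex count grows as (degree + 1) x degree^(length - 1) while the diameter grows only linearly with length, which is what makes the topology attractive for a large, tightly packed machine.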

  • by MarkvW (1037596) on Tuesday November 03, @08:45PM (#29972274)

    Somebody is going to crack the market--and it won't be one of the people who sit at home and cry in their beer about how Intel rules the world and that nobody has any hope of success!!

    Thank goodness for the entrepreneurs who spit on lassitude and take their shot! Those Wozniaks are the people who end up delivering really cool stuff for the rest of humanity, and leave the conventional-wisdom people in the dust.

  • by labradore (26729) on Tuesday November 03, @10:44PM (#29973158)
    They were ahead of schedule on the way to profitability. They lost funding for next-gen equipment development because one of their VCs was overextended (read: losing too much money on other risky ventures) and decided to pull out. The risk with a company like that may be high, but once you get enough profitability, you can fund further product development internally. They had sold about twenty $1.5M machines in about a year's time on the market. They said they were about 1.5 years from profitability, so I'm guessing that they were expecting to sell another 75 or 100 top-end machines to get to break-even. At that rate, they were probably spending less than $20M a year on development. I'm guessing that they burned up $100M to get where they got. In the overall scheme of things, that's not a big bet. If they had managed to develop 20,000- to 50,000-node machines and increase the output per core within 3 years, that is something that would have done more than fill a niche. They probably would have developed some game-changing technology in the bargain: stuff that Intel and Google might just be interested in.

    To be clear: this was not a failure due to the economics of competing against Intel/x86. This was a failure due to not being lucky. It takes sustained funding to make your way from start-up to profit in most technical businesses. HPC is more technical and thus more expensive than most.
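The revenue side of that estimate, multiplied out. Only the machine price and unit counts come from the comment; the 75-100 break-even range is the commenter's own guess, and this sketch simply restates it in dollars.

```python
MACHINE_PRICE = 1.5e6  # dollars per top-end machine, as stated in the comment


def revenue(units: int, price: float = MACHINE_PRICE) -> float:
    """Gross revenue from selling `units` machines at `price` each."""
    return units * price


sold_so_far = revenue(20)                     # about a year on the market
to_break_even = (revenue(75), revenue(100))   # commenter's guessed further sales
print(f"sold so far: ${sold_so_far / 1e6:.0f}M")
print(f"further sales guessed for break-even: ${to_break_even[0] / 1e6:.1f}M to ${to_break_even[1] / 1e6:.1f}M")
```

That puts roughly $30M of sales behind them and another $112.5M to $150M in front, which is why the loss of a single funding round mattered so much.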

    • Yep, purely a business failure due to the crappy timing of the GFC, rather than market trends per se.

      What really pisses me off is that Sun bought MySQL for a billion dollars when they probably could have gotten SiCortex for a fraction of that.
  • by Iphtashu Fitz (263795) on Tuesday November 03, @11:19PM (#29973392)

    I work as a sysadmin at a Boston-based university, and one of my jobs is managing an HPC cluster. We actually had SiCortex come give us a demo of one of their systems a little over a year ago and were rather impressed from a basic technology standpoint. However, the biggest drawback we saw, and a significant one, was that their cluster wasn't x86-based. We run a number of well-known commercial apps on our cluster like Matlab, Mathematica, Fluent, Abaqus, and many others. Without those vendors all actively supporting MIPS, SiCortex was simply a non-starter for us when we were researching our next-generation cluster. And by actively I mean rolling out MIPS versions of their products on a schedule comparable to their x86 product releases. Having to wait 6 months or more for MIPS versions simply isn't acceptable. If they could have gotten firm commitments from those commercial vendors, then we might have pursued SiCortex, but that simply wasn't the case. Even the inability to run a standard commercial Linux distro was a huge drawback, since many commercial software vendors specifically require a commercial distro like Red Hat or SUSE if you're trying to get support from them.

  • That Sun pissed away one billion dollars on MySQL instead of buying out SiCortex. Smart move, Sun!

  • SiCortex's failure (Score:3, Interesting)

    by RzUpAnmsCwrds (262647) on Wednesday November 04, @05:44AM (#29975982)

    Having actually used a SiCortex machine, I can tell you that the problem wasn't the VC, or the compilers, or even really the hardware.

    The problem was the market.

    There are two types of x86-based small clusters (the market that SiCortex was aiming for): clusters with Gigabit Ethernet and clusters with expensive interconnects (Myrinet, InfiniBand, or 10G Ethernet).

    Gigabit Ethernet clusters do a good job with problems that are embarrassingly parallel (or at least have minimal communication demands). $150k gets you 300 Nehalem cores and a lot of memory. SiCortex fails here because its competing product (the SC1458) is much more expensive and much slower. The fact that the SC1458 uses less power (around 5 kW instead of 10 kW) is impressive, but unless you're very power- or cooling-constrained, it's simply more cost-effective to deal with the extra power and cooling cost.
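Whether a 5 kW saving matters is a short calculation. A sketch with hypothetical inputs: the electricity price of $0.10/kWh and the facility overhead multiplier (PUE) of 2.0 for cooling and power distribution are assumptions, not figures from the comment, so substitute local numbers.

```python
HOURS_PER_YEAR = 24 * 365


def annual_energy_cost(power_kw: float, price_per_kwh: float, pue: float) -> float:
    """Yearly electricity cost of a constant IT load, scaled by PUE
    (total facility power / IT power) to cover cooling and distribution."""
    return power_kw * HOURS_PER_YEAR * pue * price_per_kwh


# Comment's figures: ~10 kW for the x86 cluster vs ~5 kW for the SiCortex box.
# Assumed: $0.10/kWh and PUE 2.0.
savings = annual_energy_cost(10, 0.10, 2.0) - annual_energy_cost(5, 0.10, 2.0)
print(f"annual savings: ${savings:,.0f}")
```

Under those assumptions the difference is under $9,000 a year, small next to a $150k purchase price unless the machine room is already at its power or cooling limit.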

    SiCortex hardware was more cost effective against clusters with expensive interconnects. The problem is, the people who buy clusters with expensive interconnects do so because their problem is interconnect heavy. Unfortunately, despite all of the cool CS behind SiCortex's interconnect, the fact is that it just didn't do that well against InfiniBand. That's partly because the SiCortex system has more nodes, which means that more messages have to use the interconnect. It's partly because for very small clusters, it's possible to use a single IB switch that connects every node to every other node. And it's partly because SiCortex didn't have the kind of mature hardware/software stack that someone like Mellanox has.

    So, there you have it. For the problems that ran well on SiCortex hardware, you could get the same performance at dramatically lower cost using Gigabit Ethernet. For the problems that require an expensive interconnect, the SiCortex approach of "more, smaller nodes" results in dramatically more overhead compared with the "fewer, faster nodes" strategy.
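One way to see why "more, smaller nodes" costs more communication is the surface-to-volume effect. A sketch assuming a hypothetical workload the comment does not name: a 3D nearest-neighbour stencil on an n^3 grid split into p = k^3 equal cubes with periodic boundaries, where each node computes on its cube's volume but must exchange one layer of cells on each of its six faces every step.

```python
def halo_stats(n: int, k: int) -> tuple[int, int, float]:
    """For an n^3 grid cut into k^3 cubes (n divisible by k), return
    (cells computed per node, halo cells exchanged per node, compute/comm ratio)."""
    side = n // k
    compute = side ** 3
    halo = 6 * side ** 2  # one face layer per neighbour, six neighbours
    return compute, halo, compute / halo


for k in (4, 8):  # 64 nodes vs 512 nodes on the same 512^3 problem
    compute, halo, ratio = halo_stats(512, k)
    print(f"{k ** 3:4d} nodes: {compute:>9,d} cells, {halo:>7,d} halo cells, ratio {ratio:.1f}")
```

In this model, doubling k (eight times as many nodes) halves each node's compute-to-communication ratio and doubles the total halo traffic crossing the interconnect, so a design built from many slower nodes needs a proportionally better network just to stand still.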

    • Re: (Score:3, Interesting)

      Don't be unlucky. At least, that's what the story is about.

      More seriously, it looks like they were trying for high end supercomputing. There's probably a lot more money in smaller supercomputing clusters, but then they'd get hit hard by the proprietary structure of their hardware.
      • Re:Lesson learned (Score:4, Insightful)

        by jd (1658) on Tuesday November 03, @09:19PM (#29972552)

        Having worked in one HPC startup (Lightfleet), I can say that one of the biggest dangers any startup faces is its own management. Good ideas don't make themselves into good products or turn themselves into good profits by selling. Good ideas don't even make it easier - you only have to look at how many products are both defective by design AND sell astronomically well to see that.

        I can't speak for SiCortex's case, but it looks to me like they had a great idea but lacked the support system needed to get very far in the market. It's not a unique story - Inmos didn't fail on technological grounds. Transmeta probably didn't, either.

        Really, it would be great if some effort could go into examining the inventions of the past to see which ideas are worth trying to recreate. For example, would there be any value in content-addressable memory? Cray got an MPI stack into RAM, but could some sort of hardware message passing be useful in general? Although SCI and InfiniBand are not extinct, they're not prospering too well either - could they be redone in a way that didn't hurt performance but did bring them into the mass market?

        Then there are all sorts of ideas that have died (or are dying - Netcraft confirms it) that probably should be dead. Bulk Synchronous Parallel is fading, distributed shared memory is now only available in spiritualist workshops, CORBA was mortally wounded by its own specification committee, and parallel languages like PARLOG and UPC are not running rampant even though there are huge problems with getting programs to run well on SMP and/or multicore systems.

        • VCs often fail when economies go bad. Same thing happened in 2000-2001. To be honest, I'd have expected the VCs to fail sooner, say in winter 2008 or spring 2009. They actually got some good mileage out of the VCs.
    • Re: (Score:3, Insightful)

      The thing is, industries like these are already really, really dominated by single players and everyone uses them. It's the same with Windows too - its own market share keeps it having that market share. In the airplane industry, all the European companies had to merge so that they could compete with Boeing.

      When something becomes like a standard, it's really hard to break in.

      • Single player? Have you looked at the top 100? It's equal parts Intel (x86), AMD (x86) and IBM (Power related), with a smattering of others: Cell - mostly SPU, Itanium, SPARC, NEC and others.

        There's certainly no dominant player and not even much of a dominant instruction set. The thing is, supercomputers are so expensive and unique that porting to a different instruction set is usually the least of the work, except for Roadrunner, which is fast but rather hard to use.

      • Re:Lesson learned (Score:5, Insightful)

        by Jacques Chester (151652) on Tuesday November 03, @11:55PM (#29973702)

        They didn't die because their customers abandoned them for something cheaper. They died because they had a cashflow crisis due to investors pulling out of a planned round of fundraising. They had millions of dollars of sales in the pipeline.

        The lesson isn't "Don't compete with Intel", it's "When you run out of money, you're out of business". Or perhaps, "The financial crisis killed lots of otherwise sound businesses". Luck, as the OP pointed out, played a large part.

    • Re: (Score:2, Insightful)

      Lesson learned: there is no market for proprietary CPUs on MPP supercomputers. It's gone. If Cray and SGI couldn't do it, how are a couple guys from DEC and Novell going to pull it off?
      It's always sad when someone's dream fails, but come on, guys. You're pursuing a 15-years-ago market, just like DEC and Novell did when they died (okay, Novell exists, but it is irrelevant).

      Supercomputers are commodity processors increasingly in commodity boxes running commodity open-source software. A supercomputer running sl

    • by Wrath0fb0b (302444) on Tuesday November 03, @07:03PM (#29971196)

      Why not use something based on the Atom chip but massively parallel.

      You are probably one of those guys that thinks that if you can get 36 women working together on making a baby, it will be ready in 1 week.

      Not all problems can scale out to many cpus (or wombs, for that matter). Threading overhead, network latency/bandwidth, mutual exclusion (or the overhead on atomic data types) all conspire to defeat attempts to scale. This is, of course, if your problem is one that is even amenable to straightforward parallelization in the first place -- many problems (for instance, Monte Carlo lattice simulations) are excruciating to scale to even 2 cpus.
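The usual way to quantify that limit is Amdahl's law: if a fraction s of the run is inherently serial, the ideal speedup on n processors is 1 / (s + (1 - s) / n), which can never exceed 1/s no matter how many cores (or wombs) are added. A minimal sketch; it ignores the threading, network and locking overheads listed above, so real speedups come out lower still.

```python
def amdahl_speedup(serial_fraction: float, processors: int) -> float:
    """Ideal speedup on `processors` cores when `serial_fraction` of the work cannot be parallelized."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / processors)


# With 10% serial work the curve flattens quickly and never passes 1 / 0.10 = 10x.
for n in (2, 8, 64, 1024):
    print(n, round(amdahl_speedup(0.10, n), 2))
```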

      In my own (informal) tests on our HPC cluster (x64, Linux, see my post above for details), I concluded that you need to be able to discretize your work into independent (and NONBLOCKING) chunks of ~5 ms in order to make spawning a pthread worth it. Of course, "worth it" is a relative term -- some people would be glad to double the CPU time required for a 25% reduction in wall-clock time while others might not, so I'll concede that my measurement is biased. IIRC, I required a net efficiency (versus the single-core version) of no worse than 85%, i.e. spend less than 15% of your CPU time dealing with thread overhead or waiting for a mutex. This was for 8 cores on the same motherboard by the way, if you are spawning MPI jobs over a network socket, expect much much worse.
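The two numbers in that rule of thumb are tied together by a simple model: if each chunk does w of useful work and costs o of overhead (spawning, synchronization, waiting on a mutex), efficiency is w / (w + o). A sketch under that assumed model, which treats overhead as a fixed cost per chunk; it is an illustration, not the commenter's measurement method.

```python
def efficiency(work_ms: float, overhead_ms: float) -> float:
    """Fraction of CPU time spent on useful work for one chunk."""
    return work_ms / (work_ms + overhead_ms)


def max_overhead(work_ms: float, target_efficiency: float) -> float:
    """Largest per-chunk overhead that still meets `target_efficiency`."""
    return work_ms * (1.0 - target_efficiency) / target_efficiency


def min_chunk(overhead_ms: float, target_efficiency: float) -> float:
    """Smallest chunk of work that amortizes `overhead_ms` down to `target_efficiency`."""
    return overhead_ms * target_efficiency / (1.0 - target_efficiency)


budget = max_overhead(5.0, 0.85)  # ~5 ms chunks at >= 85% efficiency
print(f"overhead budget per 5 ms chunk: {budget:.2f} ms")
print(f"efficiency check: {efficiency(5.0, budget):.2f}")
```

Read the other way round, `min_chunk` says that at 85% efficiency every chunk must carry about 5.7 times its own overhead in real work, so anything that inflates overhead (a network hop instead of a shared-memory handoff) pushes the minimum chunk size up proportionally.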

      • You are probably one of those guys that thinks that if you can get 36 women working together on making a baby, it will be ready in 1 week.

        It'd certainly be fun trying, though. ;)

        Not all problems can scale out to many cpus (or wombs, for that matter). Threading overhead, network latency/bandwidth, mutual exclusion (or the overhead on atomic data types) all conspire to defeat attempts to scale.

        It's not my skill set, but I remember years ago seeing a fascinating show on how blindly adding more resources can make something SLOWER. To translate the case study (involving editing individual segments for a news show on limited editing equipment) into geek speak, they demonstrated that unless you do things right, you might wind up with cores 2-8 zipping through their parts just to wait for core 1, which has unceremoniously had all the long tasks scheduled to run on

      • Low-power CPUs could be useful if you've the bandwidth. Ultimately, though, you're limited to how parallelizable the problem is - a fundamentally sequential problem will remain a fundamentally sequential problem, for example.

        If the problem can be parallelized, you're then limited by the nature of the CPU vs. the nature of the problem. A problem that is essentially SIMD is going to do great on a cluster of identical processors. A problem that is essentially MIMD is not. MIMD problems will always do better on

      • >This was for 8 cores on the same motherboard by the way, if you are spawning MPI jobs over a network socket, expect much much worse.

        They thought of that in the SiCortex design. They use a custom internal fabric to reduce the maximum hop distance to 3 between any pair of chips; and furthermore they included on-die MPI-handling logic to take that overhead out of the CPUs.
    • (FLOPS/W)/latency
      For non-embarrassingly-parallel jobs, it won't matter how efficiently you can compute if you can't communicate the results between nodes.
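The communication side can be made concrete with the common latency-bandwidth (alpha-beta) model, in which sending n bytes takes alpha + n / beta. The link parameters below are hypothetical round numbers for a commodity link versus a low-latency HPC interconnect, not measurements of any product.

```python
def message_time(n_bytes: int, latency_s: float, bandwidth_bytes_per_s: float) -> float:
    """Alpha-beta model: fixed per-message latency plus size divided by bandwidth."""
    return latency_s + n_bytes / bandwidth_bytes_per_s


# Hypothetical links: (latency in seconds, bandwidth in bytes/second).
links = {"commodity": (50e-6, 125e6), "low-latency": (2e-6, 1.5e9)}

for size in (8, 1_000_000):  # one double vs. a 1 MB block
    for name, (alpha, beta) in links.items():
        print(f"{size:>9,d} B over {name:>11}: {message_time(size, alpha, beta) * 1e6:9.1f} us")
```

For small messages the latency term dominates almost completely, so a code that exchanges many small updates is bound by latency long before FLOPS or FLOPS/W come into play.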