
 



The Almighty Buck Cloud Hardware

UC Berkeley Launches SkyPilot To Help Navigate Soaring Cloud Costs 22

Researchers at U.C. Berkeley's Sky Computing Lab have launched SkyPilot, an open source framework for running ML and Data Science batch jobs on any cloud, or multiple clouds, with a single cloud-agnostic interface. Datanami reports: SkyPilot uses an algorithm to determine which cloud zone or service provider is the most cost-effective for a given project. The program considers a workload's resource requirements (whether it needs CPUs, GPUs, or TPUs) and then automatically determines which locations (zone/region/cloud) have available compute resources to complete the job before sending it to the least expensive option to execute. The solution automates some of the more challenging aspects of running workloads on the cloud. SkyPilot's makers say the program can reliably provision a cluster with automatic failover to other locations if capacity or quota errors occur, it can sync user code and files from local or cloud buckets to the cluster, and it can manage job queueing and execution. The researchers claim this comes with substantially reduced costs, sometimes by more than 3x.

SkyPilot developer and postdoctoral researcher Zongheng Yang said in a blog post that the growing trend of multi-cloud and multi-region strategies led the team to build SkyPilot, calling it an "intercloud broker." He notes that organizations are strategically choosing a multi-cloud approach for higher reliability, avoiding cloud vendor lock-in, and stronger negotiation leverage, to name a few reasons. To save costs, SkyPilot leverages the large price differences between cloud providers for similar hardware resources. Yang gives the example of Nvidia A100 GPUs, and how Azure currently offers the cheapest A100 instances, but Google Cloud and AWS charge a premium of 8% and 20% for the same computing power. For CPUs, some price differences can be over 50%. [...]

The project has been under active development for over a year in Berkeley's Sky Computing Lab, according to Yang, and is being used by more than 10 organizations for use cases including GPU/TPU model training, distributed hyperparameter tuning, and batch jobs on CPU spot instances. Yang says users are reporting benefits including reliable provisioning of GPU instances, queueing multiple jobs on a cluster, and concurrently running hundreds of hyperparameter trials.
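
The selection behavior described in the summary (filter by resource requirement, try the cheapest location first, fail over on capacity or quota errors) can be sketched roughly as follows. This is a minimal illustration, not SkyPilot's actual API or algorithm: the names Offer, CapacityError, provision_cheapest, and fake_provision are hypothetical, and the prices are relative units taken from the 8% and 20% A100 premiums quoted above, not real price-list data.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Optional


@dataclass(frozen=True)
class Offer:
    """One candidate location for a job (hypothetical structure)."""
    cloud: str
    accelerator: str
    relative_price: float  # 1.00 = cheapest provider in the article's A100 example


class CapacityError(Exception):
    """Raised by a provisioner when a location has no capacity or quota."""


def provision_cheapest(offers: Iterable[Offer], accelerator: str,
                       provision: Callable[[Offer], None]) -> Optional[Offer]:
    """Try matching offers from cheapest to most expensive; return the first that works."""
    candidates = sorted(
        (o for o in offers if o.accelerator == accelerator),
        key=lambda o: o.relative_price,
    )
    for offer in candidates:
        try:
            provision(offer)
            return offer
        except CapacityError:
            continue  # fail over to the next-cheapest location
    return None


# Relative A100 prices from the summary: Azure baseline, Google +8%, AWS +20%.
OFFERS = [
    Offer("aws", "A100", 1.20),
    Offer("azure", "A100", 1.00),
    Offer("gcp", "A100", 1.08),
]

attempts: List[str] = []


def fake_provision(offer: Offer) -> None:
    """Stand-in provisioner (assumption): pretends Azure is out of capacity."""
    attempts.append(offer.cloud)
    if offer.cloud == "azure":
        raise CapacityError("no A100 capacity")


chosen = provision_cheapest(OFFERS, "A100", fake_provision)
print(chosen, attempts)
```

In this sketch the cheapest offer is tried first and the job lands on the next-cheapest provider once the first attempt raises a capacity error. A real broker would also have to account for regions, zones, quotas, and data transfer costs.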
This discussion has been archived. No new comments can be posted.

  • by darkain ( 749283 ) on Tuesday December 13, 2022 @07:19PM (#63128652) Homepage

    Standards :) https://xkcd.com/927/ [xkcd.com]

  • by timeOday ( 582209 ) on Tuesday December 13, 2022 @07:19PM (#63128658)
    If the providers are committed to supporting open protocols then this could work sustainably. But if this is the equivalent of a shopping bot that uses web scraping to find the best deals among online shops that haven't signed up to be part of it, it will require continual re-work and quickly fall into disuse. The basic idea has been floating around and tried many times, but it has been confounded by either shallow means (like CAPTCHA) or deeper means like retailer-specific SKUs with products that differ trivially, or package deals / add-ons that complicate the value proposition. Retailers generally really want to 'differentiate their offerings,' i.e. prevent a friction-free, apples-to-apples race to the bottom that eliminates profit. Not sure why clouds would be different... nobody wants to be commoditized.
    • Re:Provider buy-in? (Score:4, Interesting)

      by fuzzyfuzzyfungus ( 1223518 ) on Tuesday December 13, 2022 @08:10PM (#63128758) Journal
      Vendors will definitely try to sell you on their proprietary abstraction layers or play various pricing games that encourage you to enter but not to leave; but their desire to obfuscate pricing is somewhat blunted by the fact that their more serious customers expect to be able to get price data programmatically (and it's practically mandatory for something like AWS spot instances). You will need to parse the various instance types into their component hardware in order to make comparisons, and there might be some factors that don't show up on the spec sheet that need to be benchmarked; but customers who are big enough to matter would be distinctly upset if they got rid of the price list query API and replaced it with a CAPTCHA.
      • AWS and Google bill separately for each item that is not part of the basic VM price.
        Data in and data out: an added charge
        Basic monitoring / logs viewer: added fee
        Static IP: added fee
        Storage: added fee
        Image templates: you need to pay for the storage.

        • All that is definitely true; but, for the purposes of an expert system being put together collaboratively for the use of people doing reasonably hardcore HPC stuff, having some complexity in the price calculation is a relatively minor issue unless there are elements that either do not have prices that can be scraped or pulled programmatically, or components that are particularly unpredictable.

          In this case it looks like SkyPilot has a storage component that does some abstraction of data handling(whether r
  • Sounds like a great idea. I wonder if it supports free services for tiny jobs

  • by Drew M. ( 5831 ) on Tuesday December 13, 2022 @07:43PM (#63128708) Homepage

    Only because I do this for my own company, just wait until you see how much cheaper it is to host your own machines in a datacenter.

    Here's a quick summary of CPU costs with Passmark scores factored in from the spreadsheet I created to model it all. We use Dell R7525 Epyc 7282 128GB servers at a physical datacenter, at a cost of about $100/month per server including power and 250Mbps bandwidth. These were about the most cost-effective servers I could find per Passmark unit. All costs are factored over a 5-year timespan including hosting, power and bandwidth usage for our specific application.

    Here are the comparisons to the cloud, in order from cheapest to most expensive:
    Google n2d-highcpu-32-preempt us-central-1 is 2.8x more expensive than our hardware solution (we use these preempt instances to burst into the cloud to handle peaks) For some weird reason AMD in us-central-1 is crazy cheap compared to other zones
    Amazon c5.2xlarge-spot (Variable pricing) generally comes in this area, but because the pricing is variable, it varies
    Oracle Cloud E4 32core 128G is 10.5x more expensive than our hardware solution
    Google n2d-highcpu-32 us-central-1 is 17x more expensive than our hardware solution
    Azure F32s V2 est Xeon 8168 is 17.3x more expensive than our hardware solution
    Amazon c5.2xlarge is 17.3x more expensive than our hardware solution
    Digital Ocean 8VCPUs is 20.7x more expensive than our hardware solution

    Also of note, bandwidth for most of the cloud providers is on the order of 50-100x more expensive than what a datacenter will charge you.

    • by StormReaver ( 59959 ) on Tuesday December 13, 2022 @08:03PM (#63128750)

      ...just wait until you see how much cheaper it is to host your own machines in a datacenter.

      That's what many of us here on Slashdot said was inevitable once the whole "host your data on someone else's computer" craze kicked into high gear. The only question is whether those who were seduced by false promises have become too inextricably dependent on the "someone else's computer" API's to unshackle themselves from them.

    • I'm having trouble following your math. I went to Dell's web site and priced out a machine that seems close to your specs, https://www.dell.com/en-us/sho... [dell.com] Total cost was $6,500 or so, just for the hardware. That would already be over $100 a month, not counting hosting fees. What did I miss?

      And I'm curious what costs you found on the various cloud providers.

      • by Drew M. ( 5831 )

        The $100 a month is the cost of hosting at a datacenter. The machine costs $5185 last I checked. Take that over 5 years with the total passmark score of 31510 and compare it with cloud costs. I've got a whole spreadsheet of it here, but probably can't share it.

        We can run the math on one. I can tell you that a Google n2d-highcpu-32 costs $8700 a year to run and is close to half the CPU power of the above machine. Total two of those up over 5 years to match and you get $87000, which is obviously a lot more tha
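
The arithmetic in this comment can be reproduced with a short sketch. It uses only the figures quoted in the thread ($5185 machine, $100/month hosting, $8700/year per n2d-highcpu-32, two instances to match one server, 5 years). The function and variable names are hypothetical, and it does not attempt to reproduce the Passmark-weighted multipliers from the earlier list, which come from a spreadsheet that was not shared.

```python
# Figures quoted in the thread (USD).
MACHINE_PRICE = 5185        # one-time hardware cost
HOSTING_PER_MONTH = 100     # datacenter hosting, including power and bandwidth
CLOUD_PER_YEAR = 8700       # one Google n2d-highcpu-32 instance
INSTANCES_TO_MATCH = 2      # "close to half the CPU power" of the owned machine
YEARS = 5


def owned_total(machine_price: int, hosting_per_month: int, years: int) -> int:
    """Hardware purchase plus hosting fees over the whole period."""
    return machine_price + hosting_per_month * 12 * years


def cloud_total(per_year: int, instances: int, years: int) -> int:
    """Cost of running enough cloud instances to match one owned server."""
    return per_year * instances * years


owned = owned_total(MACHINE_PRICE, HOSTING_PER_MONTH, YEARS)
cloud = cloud_total(CLOUD_PER_YEAR, INSTANCES_TO_MATCH, YEARS)
ratio = cloud / owned

print(f"owned: ${owned}, cloud: ${cloud}, ratio: {ratio:.1f}x")
```

With these inputs the owned server totals $11185 over five years against $87000 in the cloud, a ratio of roughly 7.8x before any Passmark adjustment.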

        • by tomz16 ( 992375 )

          Well yeah? Renting is always more expensive than owning.

          That being said, the value proposition of "cloud" hosting has always been near-infinite on-demand scaling (e.g. Christmas is coming and our new product just went viral, let's ramp up capacity 1,000X for the next month alone).... or... the grant check just cleared, fire up the ML instances for the next 6 months until the mid-term report for more funding is due, etc. It's a "pay for only what you eat" model.

          The problem has (IMHO) become dumb-dumbs rep

        • That does help with the math, thanks.

          There are some differences that offset some of the extra cloud cost.

          - Typical configurations in the cloud offer triple redundancy, where your data is stored in three data centers, one of which is in a different geographic location. That redundancy typically comes with the package.
          - That cloud server (unless you specifically block it) is maintained for you, including OS patches. On-site, you've got to pay somebody to do that.
          - If your on-site server is generally used at

          • by zlives ( 2009072 )

            Cloud ROI depends a lot on not having the personnel to manage the on-prem because "someone else" is doing it and that's built into the price. Redundancy you do have to pay for, but it is readily available.
            The problem is that cloud is not simple and you still need experts to maintain it, and those experts cost more.

          • by Drew M. ( 5831 )

            Hosted services are even more insanely priced, I was just referring to VMs. Some of the services I've calculated are closer to 100x more expensive in the cloud than to do it yourself even when including redundancy. You have to ask yourself if that database is worth $1000 a year yourself or $100000 a year in the cloud. That extra $99k is half an engineer salary and we're just talking about 1 database. Generally Kubernetes is about double the cost of non Kubernetes in GCP, so we're approaching 45x more expens

            • If the work being done on a single on-prem server costs $100K in the cloud, you're doing it wrong. You have to think differently when moving to SaaS to optimize costs. I know this first-hand because, in the last year, we set up multiple Azure databases, VMs, and ADF jobs to sync data across various systems. Initially, our bill was $20K per month. But by taking a closer look at where the money was going, we were able to reduce that to $2K per month for the same amount of work, without sacrificing throughput

    • by CEC-P ( 10248912 )
      But dreeeewwwww, their clouds can't be damaged by storms and theft! Well, except for all the AWS and Google outages that keep happening lol.
  • Like with every cross-platform development system, you'll be stuck with the lowest common denominator of what the various providers offer. And if something goes wrong, debugging is that much more difficult because there is a layer of abstraction to deal with.

  • How high can you fly?

  • I forget where I heard this law in the 90s, but it basically said application programs will bloat and consume as much CPU as there is in the machine. It is just human nature: developers won't bother to optimize if the program runs "fast enough", and with more CPU, more bloat is tolerated.

    So guess what happens when you have essentially unlimited CPU in the cloud? The cost rises with every program update and eventually it costs more than you can afford.

    Companies using IBM Mainframes learned this long ago. Controlling p
