UC Berkeley Launches SkyPilot To Help Navigate Soaring Cloud Costs
Researchers at UC Berkeley's Sky Computing Lab have launched SkyPilot, an open-source framework for running ML and data science batch jobs on any cloud, or across multiple clouds, through a single cloud-agnostic interface. Datanami reports: SkyPilot uses an algorithm to determine which cloud zone or service provider is the most cost-effective for a given project. The program considers a workload's resource requirements (whether it needs CPUs, GPUs, or TPUs), automatically determines which locations (zone/region/cloud) have the compute available to complete the job, and then sends it to the least expensive option to execute. The solution automates some of the more challenging aspects of running workloads on the cloud. SkyPilot's makers say the program can reliably provision a cluster, with automatic failover to other locations if capacity or quota errors occur; sync user code and files from local machines or cloud buckets to the cluster; and manage job queueing and execution. The researchers claim this substantially reduces costs, sometimes by more than 3x.
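The placement logic described above can be sketched roughly as follows. This is an illustrative toy, not SkyPilot's actual code; the catalog entries and prices are hypothetical placeholders.

```python
# Illustrative sketch (not SkyPilot's actual code): pick the cheapest
# location that can satisfy a job's resource request, with the remaining
# feasible locations serving as the failover order if provisioning hits
# capacity or quota errors.

# Hypothetical hourly on-demand prices per (cloud, region) for one GPU.
CATALOG = [
    {"cloud": "azure", "region": "eastus",      "accelerator": "A100", "price": 3.40},
    {"cloud": "gcp",   "region": "us-central1", "accelerator": "A100", "price": 3.67},
    {"cloud": "aws",   "region": "us-east-1",   "accelerator": "A100", "price": 4.10},
    {"cloud": "gcp",   "region": "us-central1", "accelerator": "V100", "price": 2.48},
]

def placement_order(accelerator):
    """Return feasible locations sorted cheapest-first; the first entry is
    tried first, and the rest are the failover order."""
    feasible = [e for e in CATALOG if e["accelerator"] == accelerator]
    return sorted(feasible, key=lambda e: e["price"])

order = placement_order("A100")
```

With the placeholder prices above, the broker would try Azure first, then fail over to Google Cloud and AWS in price order, mirroring the 8%/20% premium example from the article.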
SkyPilot developer and postdoctoral researcher Zongheng Yang said in a blog post that the growing trend of multi-cloud and multi-region strategies led the team to build SkyPilot, calling it an "intercloud broker." He notes that organizations strategically choose a multi-cloud approach for higher reliability, to avoid cloud vendor lock-in, and for stronger negotiating leverage, among other reasons. To save costs, SkyPilot exploits the large price differences between cloud providers for similar hardware. Yang cites Nvidia A100 GPUs as an example: Azure currently offers the cheapest A100 instances, while Google Cloud and AWS charge premiums of 8% and 20%, respectively, for the same computing power. For CPUs, some price differences can exceed 50%. [...]
The project has been under active development for over a year in Berkeley's Sky Computing Lab, according to Yang, and is being used by more than 10 organizations for use cases including GPU/TPU model training, distributed hyperparameter tuning, and batch jobs on CPU spot instances. Yang says users are reporting benefits including reliable provisioning of GPU instances, queueing multiple jobs on a cluster, and concurrently running hundreds of hyperparameter trials.
Standards (Score:3)
Standards :) https://xkcd.com/927/ [xkcd.com]
Provider buy-in? (Score:3)
Re:Provider buy-in? (Score:4, Interesting)
AWS and Google bill for each part on their own (Score:2)
AWS and Google bill separately for each part that is not included in the basic VM price.
Like data in and data out: that is an added charge
basic monitoring / log viewer: added fee
static IP: added fee
storage: added fee
image templates: you pay for their storage.
Re: (Score:2)
In this case it looks like SkyPilot has a storage component that does some abstraction of data handling(whether r
It's DQSUUCP (Score:2)
Sounds like a great idea. I wonder if it supports free services for tiny jobs
All cloud costs are insane (Score:5, Insightful)
I do this for my own company, so just wait until you see how much cheaper it is to host your own machines in a datacenter.
Here's a quick summary of CPU costs, with Passmark scores factored in, from the spreadsheet I created to model it all. We use Dell R7525 Epyc 7282 128GB servers at a physical datacenter at a cost of about $100/month per server, including power and 250Mbps bandwidth. These were about the most cost-effective servers I could find per Passmark unit. All costs are figured over a 5-year timespan, including hosting, power, and bandwidth usage for our specific application.
Here's the comparisons to the cloud in order of cheapest to most expensive:
Google n2d-highcpu-32-preempt us-central1 is 2.8x more expensive than our hardware solution (we use these preemptible instances to burst into the cloud to handle peaks). For some weird reason, AMD in us-central1 is crazy cheap compared to other zones
Amazon c5.2xlarge-spot (variable pricing) generally comes in around this level, but because the pricing is variable, it varies
Oracle Cloud E4 32core 128G is 10.5x more expensive than our hardware solution
Google n2d-highcpu-32 us-central-1 is 17x more expensive than our hardware solution
Azure F32s V2 est Xeon 8168 is 17.3x more expensive than our hardware solution
Amazon c5.2xlarge 17.3x more expensive than our hardware solution
Digital Ocean 8VCPUs 20.7x more expensive than our hardware solution
Also of note, bandwidth for most of the cloud providers is on the order of 50-100x more expensive than what a datacenter will charge you.
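The poster's normalization can be sketched as 5-year total cost divided by Passmark score, with each option expressed as a multiple of the on-prem baseline. The on-prem figures ($5,185 server, $100/month hosting, Passmark 31510) appear later in the thread; the cloud numbers here are hypothetical placeholders, and the result will not reproduce the poster's exact multiples, since their spreadsheet also includes application-specific power and bandwidth costs.

```python
# Sketch of the cost normalization used in the comparison above:
# 5-year total cost divided by Passmark score, expressed as a multiple
# of the on-prem baseline. Cloud inputs are hypothetical placeholders.

MONTHS = 60  # 5-year horizon

def five_year_cost_per_passmark(upfront, monthly, passmark):
    """Total cost over 5 years, normalized by CPU benchmark score."""
    return (upfront + monthly * MONTHS) / passmark

# On-prem baseline: $5,185 server + $100/month hosting, Passmark 31510.
on_prem = five_year_cost_per_passmark(upfront=5185, monthly=100, passmark=31510)

# Hypothetical cloud instance: no upfront cost, $725/month ($8,700/year),
# assumed to score roughly half the Passmark of the on-prem box.
cloud = five_year_cost_per_passmark(upfront=0, monthly=725, passmark=31510 / 2)

multiple = cloud / on_prem  # how many times pricier per Passmark unit
```

With these placeholder inputs the cloud option comes out several times more expensive per Passmark unit; plugging in real quotes for each provider reproduces the kind of ranking shown in the list above.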
Re:All cloud costs are insane (Score:4, Interesting)
...just wait until you see how much cheaper it is to host your own machines in a datacenter.
That's what many of us here on Slashdot said was inevitable once the whole "host your data on someone else's computer" craze kicked into high gear. The only question is whether those who were seduced by false promises have become too inextricably dependent on the "someone else's computer" APIs to unshackle themselves from them.
Re: (Score:2)
I'm having trouble following your math. I went to Dell's web site and priced out a machine that seems close to your specs, https://www.dell.com/en-us/sho... [dell.com] Total cost was $6,500 or so, just for the hardware. That would already be over $100 a month, not counting hosting fees. What did I miss?
And I'm curious what costs you found on the various cloud providers.
Re: (Score:3)
The $100 a month is the cost of hosting at a datacenter. The machine costs $5185 last I checked. Take that over 5 years with the total passmark score of 31510 and compare it with cloud costs. I've got a whole spreadsheet of it here, but probably can't share it.
We can run the math on one. I can tell you that a Google n2d-highcpu-32 costs $8700 a year to run and is close to half the CPU power of the above machine, total two of those up over 5 years to match and you get $87000 which is obviously a lot more tha
Re: (Score:2)
Well yeah? Renting is always more expensive than owning.
That being said, the value proposition of "cloud" hosting has always been near-infinite on-demand scaling (e.g., Christmas is coming and our new product just went viral; let's ramp up capacity 1,000x for the next month alone)... or the grant check just cleared; fire up the ML instances for the next 6 months until the mid-term report for more funding is due, etc. It's a "pay for only what you eat" model.
The problem has (IMHO) become dumb-dumbs rep
Re: (Score:2)
That does help with the math, thanks.
There are some differences that offset some of the extra cloud cost.
- Typical cloud configurations offer triple redundancy, where your data is stored in three data centers, one of which is in a different geographic location. That redundancy typically comes with the package.
- That cloud server (unless you specifically block it) is maintained for you, including OS patches. On-site, you've got to pay somebody to do that.
- If your on-site server is generally used at
Re: (Score:2)
Cloud ROI depends a lot on not needing the personnel to manage the on-prem gear, because "someone else" is doing it and that's built into the price. Redundancy you do have to pay for, but it is readily available.
The problem is that cloud is not simple and you still need experts to maintain, and those experts are a higher cost.
Re: (Score:2)
Hosted services are even more insanely priced, I was just referring to VMs. Some of the services I've calculated are closer to 100x more expensive in the cloud than to do it yourself even when including redundancy. You have to ask yourself if that database is worth $1000 a year yourself or $100000 a year in the cloud. That extra $99k is half an engineer salary and we're just talking about 1 database. Generally Kubernetes is about double the cost of non Kubernetes in GCP, so we're approaching 45x more expens
Re: (Score:2)
If the work being done on a single on-prem server costs $100K in the cloud, you're doing it wrong. You have to think differently when moving to SaaS to optimize costs. I know this first-hand because, in the last year, we set up multiple Azure databases, VMs, and ADF jobs to sync data across various systems. Initially, our bill was $20K per month. But by taking a closer look at where the money was going, we were able to reduce that to $2K per month for the same amount of work, without sacrificing throughput
Re: (Score:1)
Least common denominators (Score:2)
Like with every cross-platform development system, you'll be stuck with the least common denominator of what the various providers offer. And if something goes wrong, debugging is that much more difficult because there's an extra layer of abstraction to deal with.
Sky Pilot (Score:2)
How high can you fly?
Re: (Score:2)
The young soldier so ill looks at the Sky Pilot, remembers the words: Thou shalt not kill.
Programs consume as much as you give it (Score:2)
I forget where I heard this law in the '90s, but it basically said that application programs will bloat to consume as much CPU as the machine has. It's just human nature: developers won't bother to optimize if the program runs "fast enough," and with more CPU, more bloat is tolerated.
So guess what happens when you have essentially unlimited CPU in the cloud? The cost rises with every program update and eventually costs more than you can afford.
Companies using IBM Mainframes learned this long ago. Controlling p