They link to various other projects that do cloud-y-HPC:
* AWS ParallelCluster [AWS]
* Cluster in the cloud [AWS, GCP, Oracle]
* Elasticluster [AWS, GCP, OpenStack]
* Google Cluster Toolkit [GCP]
* illume-v2 [OpenStack]
* NVIDIA DeepOps [Ansible playbooks only]
* StackHPC Ansible Role OpenHPC [Ansible Role for OpenStack]
Nvidia also offers free licenses for their Base Command Manager (BCM, formerly Bright Cluster Manager); pay for enterprise support, or hit up the forums:
Thanks for this! I went looking for something similar a while back and found nothing much. I'm guessing that the alternative to this tidy modern repository is a gigantic broken pile of ansible/chef/puppet that hasn't been touched in 10 years.
Even surprisingly popular distributed-systems stuff is always really bad about "follow this 10 step copy/paste to deploy to EKS" but that's also obnoxious. In the first place, people want to see something basically working on small scale first to check if it's abandonware. But even after that.. local prototyping without first setting up multiple repositories, then shipping multiple modified container images, and already having CI/CD for all of the above is really nice to have.
> I'm guessing that the alternative to this tidy modern repository is a gigantic broken pile of ansible/chef/puppet that hasn't been touched in 10 years.
Not quite sure how well you looked, but there are a bunch of deployment systems for HPC, Ansible or otherwise:
I have worked 100% in 3 comparable systems over the past 10 years. Can you access with ssh?
I find it super fluid to work on the HPC directly to develop methods for huge datasets by using vim to code and tmux for sessions. I focus on printing detailed log files constantly with lots of debugs and an automated monitoring script to print those logs in realtime; a mixture of .out .err and log.txt.
Interesting, I've been dealing with replacing a few on-prem HPC clusters lately. One of the things we've been looking at is OpenOnDemand. How does this compare to that? Is this primarily targeted at cluster development or can I really just make an arbitrarily large production HPC cluster with it?
Don’t you still need the HPC cluster with OpenOnDemand? I thought it was a web interface to use HPC resources.
But this still runs on a single computer, so you wouldn’t use this to deploy a production cluster. This would be for testing in a virtual multi-node-ish setup.
* https://github.com/ComputeCanada/magic_castle
They link to various other projects that do cloud-y-HPC:
* AWS ParallelCluster [AWS]
* Cluster in the cloud [AWS, GCP, Oracle]
* Elasticluster [AWS, GCP, OpenStack]
* Google Cluster Toolkit [GCP]
* illume-v2 [OpenStack]
* NVIDIA DeepOps [Ansible playbooks only]
* StackHPC Ansible Role OpenHPC [Ansible Role for OpenStack]
Nvidia also offers free licenses for their Base Command Manager (BCM, formerly Bright Cluster Manager); pay for enterprise support, or hit up the forums:
* https://www.nvidia.com/en-us/data-center/base-command-manage...
* http://support.brightcomputing.com/manuals/10/
* http://support.brightcomputing.com/manuals/11/
Even surprisingly popular distributed-systems stuff is always really bad about "follow this 10 step copy/paste to deploy to EKS" but that's also obnoxious. In the first place, people want to see something basically working on small scale first to check if it's abandonware. But even after that.. local prototyping without first setting up multiple repositories, then shipping multiple modified container images, and already having CI/CD for all of the above is really nice to have.
Not quite sure how well you looked, but there are a bunch of deployment systems for HPC, Ansible or otherwise:
* https://old.reddit.com/r/HPC/comments/1p4a3fq/what_imaging_s...
* My comment listing a bunch: https://news.ycombinator.com/item?id=46037792
I have worked 100% in 3 comparable systems over the past 10 years. Can you access with ssh?
I find it super fluid to work on the HPC directly to develop methods for huge datasets by using vim to code and tmux for sessions. I focus on printing detailed log files constantly with lots of debugs and an automated monitoring script to print those logs in realtime; a mixture of .out .err and log.txt.
But this still runs on a single computer, so you wouldn’t use this to deploy a production cluster. This would be for testing in a virtual multi-node-ish setup.