This document provides a minimal set of instructions on how to get NDT running on an Amazon Web Services HPC cluster. To keep these instructions as simple as possible, many common practices, especially with respect to security, have been left out. Use at your own risk. These instructions were last verified to work on 2019-06-18.
Different commands need to be run on different machines and possibly by different users. Which machine (and possibly user) a command should be run as is indicated by the sample command prompt:
local$
: Regular user on your local machine.

ec2$
: The ec2-user on the master node of a cluster.

Wherever an IP address would appear in an example, the address has been replaced with a.b.c.d. For input values, a note is provided to indicate which IP address a.b.c.d should be replaced with.
Since pip installs executables to an odd location, update the PATH to include this location.
local$ export PATH=~/.local/bin:"$PATH"
To make this change persistent, runtime configuration files will need to be updated as well (e.g., in bash, add export PATH=~/.local/bin:"$PATH" to the end of ~/.bashrc).
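For example, in bash the line can be appended directly to ~/.bashrc and reloaded (a minimal sketch, assuming bash is your login shell):

local$ echo 'export PATH=~/.local/bin:"$PATH"' >> ~/.bashrc
local$ source ~/.bashrc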
Cluster setup will be managed with the pcluster command line interface (CLI), which can be installed via pip.
local$ pip install --user --upgrade awscli
local$ pip install --user --upgrade aws-parallelcluster
Note: The utility formerly known as CfnCluster was renamed to pcluster.
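Optionally, confirm that both tools installed correctly and are on the PATH:

local$ aws --version
local$ pcluster version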
Before a cluster can be set up, a user must be created with permissions to create the cluster. In the IAM console, create a new user (e.g., test_user) with Programmatic access enabled and the AdministratorAccess policy attached, and record the access key ID and secret access key when they are displayed. If you failed to record or otherwise misplaced the secret key, a new access key can be created for the user.
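If the AWS CLI has already been configured for a user with IAM permissions, a replacement access key can also be created from the command line (a sketch; test_user is the example user name from above):

local$ aws iam create-access-key --user-name test_user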
In order to log in to a cluster via ssh, a pair of ssh keys is needed. To create such a pair:
1. In the EC2 console, create a key pair named TestClusterSshKey and download it (the browser may save the file with an extra .txt extension).
2. Move the downloaded key into ~/.ssh (i.e., mv ~/Downloads/TestClusterSshKey.pem.txt ~/.ssh/TestClusterSshKey.pem).
3. Restrict the key's permissions (i.e., chmod 400 ~/.ssh/TestClusterSshKey.pem).
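Alternatively, once the AWS CLI has been configured (see below), the key pair can be created without the console. This sketch uses the standard create-key-pair call and writes the private key straight into ~/.ssh:

local$ aws ec2 create-key-pair --key-name TestClusterSshKey \
    --query KeyMaterial --output text > ~/.ssh/TestClusterSshKey.pem
local$ chmod 400 ~/.ssh/TestClusterSshKey.pem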
Next, the CLI needs to be configured.
local$ aws configure
AWS Access Key ID [None]: <Access key ID from user creation>
AWS Secret Access Key [None]: <Secret access key from user creation>
Default region name [None]: us-east-2
Default output format [None]: text
If all went well, this should have created ~/.aws/config and ~/.aws/credentials.
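For reference, with the answers above, ~/.aws/config should contain something like the following (the credentials file holds the access key ID and secret access key in a similar [default] block):

[default]
region = us-east-2
output = text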
Configure a cluster template by running pcluster configure.
The process should look something like this:
local$ pcluster configure
Cluster Template [default]: default
Acceptable Values for AWS Region ID:
eu-north-1
ap-south-1
eu-west-3
eu-west-2
eu-west-1
ap-northeast-2
ap-northeast-1
sa-east-1
ca-central-1
ap-southeast-1
ap-southeast-2
eu-central-1
us-east-1
us-east-2
us-west-1
us-west-2
AWS Region ID []: us-east-2
VPC Name [public]: public
Acceptable Values for Key Name:
TestClusterSshKey
Key Name []: TestClusterSshKey
Acceptable Values for VPC ID:
vpc-dfd9c6b7
VPC ID []: vpc-dfd9c6b7
Acceptable Values for Master Subnet ID:
subnet-5accb120
subnet-baf1f6d2
subnet-68ad0924
Master Subnet ID []: subnet-5accb120
Configure compute instance types:

1. Open ~/.parallelcluster/config in a text editor.
2. Find the [cluster default] section.
3. Set the compute_instance_type value (e.g., add a line compute_instance_type = t2.micro).
4. Optionally, set initial_queue_size and/or max_queue_size (e.g., add a line max_queue_size = 10).

Note: If you need more than the default number of nodes (10), go to the EC2 Service Limits page to request an increase to the limits for the instance type being used.
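After these edits, the cluster section of ~/.parallelcluster/config might look something like this (key_name and vpc_settings reflect the answers given to pcluster configure; the instance type and queue size are just the example values from above):

[cluster default]
key_name = TestClusterSshKey
vpc_settings = public
compute_instance_type = t2.micro
max_queue_size = 10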
Actually create a cluster:
local$ pcluster create my-test-cluster
Note: This will take a surprisingly long time (approximately 5-10 minutes).
Verify that the cluster exists by listing the current clusters (optional):
local$ pcluster list
Log in to the cluster:
local$ pcluster ssh my-test-cluster -i ~/.ssh/TestClusterSshKey.pem
The authenticity of host 'a.b.c.d (a.b.c.d)' can't be established.
ECDSA key fingerprint is SHA256:VGhpcyBpcyBub3QgYSByZWFsIGZpbmdlcnByaW50LiAK.
Are you sure you want to continue connecting (yes/no)? yes
If prompted about the authenticity of the host, answer yes, as shown above.
Clone NDT:
ec2$ cd /shared
ec2$ git clone https://github.com/doing-science-to-stuff/ndt.git
Compile NDT:
ec2$ cd ndt
ec2$ cmake . && make
If any additional files are needed (e.g., updated scene files or YAML files), they can be copied to the cluster with scp, using the master node's public IP address.
local$ scp -i ~/.ssh/TestClusterSshKey.pem \
./path_to_local_file \
ec2-user@a.b.c.d:/shared/ndt/path_to_remote_file
Replace a.b.c.d with the public IP address from the instances table.
To simplify setting up clusters in the future, it is possible to take a snapshot of the current state of the /shared volume on the cluster. The snapshot can then be used as the starting point for future cluster instances.
To create a snapshot of the cluster:
1. In the EC2 console, locate the EBS volume that backs the cluster's /shared directory.
2. Create a snapshot of that volume, giving it a description (e.g., NDT Compiled and Ready to Run).

The cluster uses a queueing system to manage jobs. To submit a job to the queueing system, a submission script that describes what is to be run must be written.
Using a text editor (e.g., nano or vim), create a file named example_job.sh. In that file, add the following text:
#!/bin/sh
# Run the job from the directory it was submitted from.
#$ -cwd
# Name the job "ndt".
#$ -N ndt
# Request 10 slots in the "mpi" parallel environment.
#$ -pe mpi 10
# Merge standard error into standard output.
#$ -j y
/usr/lib64/mpich/bin/mpirun -np $NSLOTS ./ndt -b r -f 3 -d 4 -s scenes/hypercube.so
In this file, the 10 in the line #$ -pe mpi 10 specifies how many slots (cores) the job will use.
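The slot count can also be overridden at submission time without editing the script, since qsub command-line options take precedence over the embedded #$ directives:

ec2$ qsub -pe mpi 20 example_job.sh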
Submit the job:
ec2$ qsub example_job.sh
Note: It may complain that "ec2-user's job is not allowed to run in any queue". This can be ignored.
The status of the submitted job can be tracked with the qstat command.
ec2$ qstat
job-ID prior name user state submit/start at queue slots ja-task-ID
------------------------------------------------------------------------------------------------
1 0.55500 ndt ec2-user qw 06/18/2019 05:28:02 10
The state qw means the job is waiting for enough resources to be available to run the job.
If qstat doesn't produce any output, there are no jobs in the queue, which means either qsub failed or the job started and has already terminated.
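One way to tell the two cases apart (assuming the job name ndt from the script above): if the job ran at all, an output file will have been created.

ec2$ ls ndt.o*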
Progress on adding execution hosts to the cluster can be tracked with the qconf command.
ec2$ qconf -sh
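The standard qhost command gives a complementary per-host view (load, memory), which can help confirm that compute nodes have actually joined the cluster:

ec2$ qhost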
When the job starts, the state will change to r.
Once the job completes, it will no longer show up in qstat output.
Standard out is redirected to a file; progress of the running job can be tracked by tailing this file. Since the job-ID shown by qstat is 1, the filename will be ndt.o1.
ec2$ tail -f ndt.o1
To stop monitoring the progress, press ctrl+c. This will not affect the running job; it will only stop displaying further output.
If you need to stop a running job, you can do so with the qdel command.
ec2$ qdel 1
Where 1 is the job-ID for the running job, as reported by qstat.
Jobs that terminate normally do not need to be deleted.
Once a rendering job is complete, the resulting images will need to be retrieved from the cluster before the cluster is deleted.
Warning: Any output not captured in a snapshot or transferred off of the cluster will be destroyed when the cluster is deleted.
Using the public IP address, files can be transferred using scp.
local$ mkdir results
local$ scp -r -i ~/.ssh/TestClusterSshKey.pem \
ec2-user@a.b.c.d:/shared/ndt/images \
./results
Replace a.b.c.d with the public IP address from the instances table.
Once you are finished running jobs on the cluster and have collected all of your output files, log out of the master node with the logout command.
ec2$ logout
Connection to a.b.c.d closed.
When you are done with the cluster, it can be torn down with the pcluster command.
local$ pcluster delete my-test-cluster
Verify that the cluster was deleted by listing the current clusters (optional):
local$ pcluster list
If you created a snapshot before deleting the cluster, you can use that snapshot as the starting point for future cluster instances.
Add an ebs section to ~/.parallelcluster/config:
[ebs snapshot_name]
ebs_snapshot_id = snap-XXXXXXXXXXXXXXXX
Where snap-XXXXXXXXXXXXXXXX is the actual ID of the snapshot.
Then add an ebs_settings entry to the [cluster default] section, set to the label of the ebs section (not the snapshot ID):

ebs_settings = snapshot_name
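Putting the two pieces together, the relevant parts of ~/.parallelcluster/config would look something like this (the section label snapshot_name and the snapshot ID are examples):

[cluster default]
ebs_settings = snapshot_name

[ebs snapshot_name]
ebs_snapshot_id = snap-XXXXXXXXXXXXXXXX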
At this point, rerunning pcluster create cluster-name will create a cluster such that /shared is populated with the contents of the snapshot, eliminating the need to refetch the source code and compile it.
As with any complex process, things may go wrong along the way. This section provides a list of potential problems that may occur and suggestions on how to fix them.
python get-pip.py produces the error message:
ERROR: Could not install packages due to an EnvironmentError: [Errno 13] Permission denied: '/usr/local/lib/python2.7/dist-packages/pip-19.1.1.dist-info'
Consider using the `--user` option or check the permissions.
- Rerun the command with the --user option (i.e., python get-pip.py --user).

While creating a user, Programmatic access was not checked.

- Check the Programmatic access box in the Access type section below the User name field.

While creating a user, the following warning appears:

This user has no permissions
You haven't given this user any permissions. This means that the user has no access to any AWS service or resource. Consider returning to the previous step and adding some type of permissions.

- Click the Previous button (twice?) to get back to the Set permissions screen.
- Select Attach existing policies directly.
- Check AdministratorAccess.
- Click the Next: Tags button.

aws produces the error message:
aws: command not found
- awscli is not installed. Install awscli with pip install --user --upgrade awscli.
- aws is not in your PATH. Run export PATH=~/.local/bin:"$PATH" and update runtime configuration files.

pcluster produces the error message:
pcluster: command not found
- aws-parallelcluster is not installed. Install aws-parallelcluster with pip install --user --upgrade aws-parallelcluster.
- pcluster is not in your PATH. Run export PATH=~/.local/bin:"$PATH" and update runtime configuration files.

pcluster configure produces the error message:
Failed with error: You must specify a region.
Hint: please check your AWS credentials.
- AWS has not been configured yet. Run aws configure.
- Default region was not specified when configuring AWS. Rerun aws configure and be sure to give an answer to Default region name (e.g., us-east-2).

pcluster configure produces the error message:
Failed with error: An error occurred (AuthFailure) when calling the DescribeRegions operation: AWS was not able to validate the provided access credentials
Hint: please check your AWS credentials.
- The access key or secret access key entered during aws configure is likely invalid. Rerun aws configure with valid values (creating a new access key for the user if necessary).

pcluster configure produces the error message:
Failed with error: Unable to locate credentials
Hint: please check your AWS credentials.
- AWS has not been configured yet. Run aws configure and be sure to provide values for AWS Access Key ID and AWS Secret Access Key.

pcluster configure produces the error message:
Failed with error: An error occurred (UnauthorizedOperation) when calling the DescribeRegions operation: You are not authorized to perform this operation.
Hint: please check your AWS credentials.
The user does not have AdministratorAccess permissions. To grant them:

- In the IAM console, select the user and open the Permissions tab.
- Click the Add permissions button.
- Select Attach existing policies directly.
- Check AdministratorAccess.
- Click the Next: Review button at the bottom of the page.
- Click the Add permissions button at the bottom of the page.

pcluster configure produces the error message:
ERROR: The value (TestClusterShKey) is not valid
Please select one of the Acceptable Values listed above.
- The value was mistyped (TestClusterShKey instead of TestClusterSshKey). Enter one of the Acceptable Values exactly as listed.

A pcluster command produces the error message:
ERROR: Default config ~/.parallelcluster/config not found.
You can copy a template from here: ~/.local/lib/python2.7/site-packages/pcluster/examples/config
- aws-parallelcluster has not been configured yet. Run pcluster configure.

pcluster ssh my-test-cluster produces the error message:
ec2-user@a.b.c.d: Permission denied (publickey,gssapi-keyex,gssapi-with-mic).
- The ssh key was not specified. Rerun with the -i flag (e.g., pcluster ssh my-test-cluster -i ~/.ssh/TestClusterSshKey.pem).

pcluster ssh my-test-cluster -i ~/.ssh/TestClusterSshKey.pem produces the error message:
ec2-user@a.b.c.d: Permission denied (publickey,gssapi-keyex,gssapi-with-mic).
- The key provided may not be the key pair the cluster was created with; check the key_name setting in ~/.parallelcluster/config against the key file being used.

pcluster ssh my-test-cluster -i ~/.ssh/TestClusterSshKey.pem produces the error message:
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
@ WARNING: UNPROTECTED PRIVATE KEY FILE! @
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
Permissions 0644 for '~/.ssh/TestClusterSshKey.pem' are too open.
It is required that your private key files are NOT accessible by others.
This private key will be ignored.
Load key "~/.ssh/TestClusterSshKey.pem": bad permissions
ec2-user@a.b.c.d: Permission denied (publickey,gssapi-keyex,gssapi-with-mic).
- The key file's permissions are too permissive. Restrict them (i.e., chmod 400 ~/.ssh/TestClusterSshKey.pem).

pcluster ssh my-test-cluster -i ~/.ssh/TestClusterSshKey.pem produces the error message:
Load key "~/.ssh/TestClusterSshKey.pem": Permission denied
ec2-user@a.b.c.d: Permission denied (publickey,gssapi-keyex,gssapi-with-mic).
- The key file cannot be read by your user. Make sure you own the file and that the owner can read it (e.g., sudo chown "$USER" ~/.ssh/TestClusterSshKey.pem, then chmod 400 ~/.ssh/TestClusterSshKey.pem).