High Performance Computing
The following frequently asked questions are for SPMD: Domain Decomposition Method (DDM).
How many MPI processes (-np) and threads (-nt) per MPI process should I use for DDM runs?
This depends on whether DDM is run on a cluster of separate machines or on a shared memory machine with multiple processors/sockets.
To run parallel MPI processes, distributed memory (with parallel access) is essential. If a single node contains multiple sockets (each with a single processor), then theoretically, an equivalent number of MPI processes (equal to the number of sockets) can be run on the node, provided sufficient RAM is available to handle all MPI processes simultaneously in parallel. However, if sufficient distributed memory is not available in the RAM, then it is typically more efficient to use Shared Memory Parallelization (SMP) instead of DDM and use multiple cores within the node in parallel via the -nt run option.
When each node has only sufficient RAM to execute a single serial OptiStruct run, activate SMP on each node by splitting the run into multiple threads (using more than four threads is usually not effective for such nodes, so -nt=4 is a practical upper limit).
Example:
- Insufficient RAM: optistruct <inputfile> -ddm -np 4 -nt 4
- Sufficient RAM: optistruct <inputfile> -ddm -np 8 -nt 4
When RAM allows, it is recommended to use the -core in run option and to limit the number of MPI processes (-np) to make sure OptiStruct runs in the in-core mode. The number of MPI processes (-np) is usually dictated by the memory demand. Once -np is set, you can determine the number of threads per MPI process based on the total number of cores available. A generally cautious method is to set the number of threads per MPI process (-nt) equal to the number of cores per socket, and to set the number of MPI processes per machine equal to the number of sockets in each machine. You can extrapolate from this to the cluster environment. For example, if one machine in a cluster is equipped with two typical Intel Xeon Ivy Bridge CPUs, you can set two MPI processes per machine and 12 threads per MPI process (-nt=12), since a typical Ivy Bridge CPU consists of 12 cores.
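For illustration, a sketch of such a run on a single stand-alone dual-socket machine (12 cores per socket) with enough RAM for the in-core mode, using only the run options discussed above (the input file name is a placeholder):
optistruct <inputfile> -ddm -np 2 -nt 12 -core in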
Starting from version 2018.0, OptiStruct also allows setting the -cores run option, wherein you are only required to specify the total number of cores available for your run (regardless of whether this is a cluster run or a single node run). OptiStruct will automatically assign -np and -nt based on the total number of cores specified via -cores.
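For example, a sketch of a DDM run on a 24-core machine where only the total core count is given and OptiStruct determines -np and -nt itself (the input file name is a placeholder; whether -ddm must be given explicitly alongside -cores can depend on the OptiStruct version):
optistruct <inputfile> -ddm -cores 24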
Will DDM use less memory for each MPI process than in the serial run?
Yes, memory per MPI process for a DDM solution is significantly reduced compared to serial runs. DDM is designed for extremely large models on machine clusters. The scaling of out-of-core mode on multiple MPI processes is very good because the total I/O is distributed across processes and the smaller per-process I/O is better cached by system memory.
Will DDM use less disk space for each MPI process than in the serial run?
Yes. Disk space usage is also distributed.
Can DDM be used in Normal Modes Analysis and Dynamic Analysis?
Yes, refer to the Supported Solution Sequences for DDM Level 2 Parallelization (Geometric Partitioning) section. Both DDM levels (1 and 2) are supported for Direct Frequency Response Analysis, whereas geometric partitioning (level 2) is generally supported for most solutions.
Can DDM be used in Nonlinear Analysis?
Yes, see Supported Solution Sequences for DDM Level 2 Parallelization (Geometric Partitioning). DDM level 1 (task-based parallelization) is not supported for Nonlinear Analysis; however, geometric partitioning via DDM level 2 is generally supported for most solutions.
Can DDM be used in Optimization runs?
Yes, DDM can be used in Analysis and Optimization. For details, refer to Supported Solution Sequences for DDM Level 2 Parallelization (Geometric Partitioning).
Can DDM be used if I have only one subcase?
Yes, the solver utilizes multiple processors/sockets/machines to perform matrix factorizations and analysis.
If I have multiple subcases, should I use DDM?
Yes, DDM is applicable to multiple subcases as well. The run may be even more efficient if the multiple subcases are supported by DDM level 1 (task-based parallelization). Again, this depends on the available memory and disk-space resources.
How do I run OptiStruct DDM over a LAN?
It is possible to run OptiStruct DDM over a LAN. Follow the corresponding MPI manual to set up the working directories on each node where OptiStruct SPMD is launched.
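For illustration, most MPI implementations take a machine file that lists the participating LAN nodes, one per line; a minimal sketch follows (the hostnames are hypothetical, and the exact file format and launch procedure are defined by your MPI manual):
node1
node2
node3
node4
A writable working directory must then exist on each of these nodes before the run is launched.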
Is it better to run on a cluster of separate machines or on shared memory machine(s) with multiple CPUs?
There is no single answer to this question. If the computer has sufficient memory to run all tasks in-core, expect faster solution times, as MPI communication is not slowed down by the network speed. But if the tasks have to run out-of-core, then computations are slowed down by disk read/write delays. Multiple tasks on the same machine may compete for disk access and (in extreme situations) even result in wall clock times slower than those of serial (non-MPI) runs.