Understanding GPU Usage & Influencing Job Scheduling with Altair PBS Professional

This post was co-authored by Scott Suchyta, Hiroji Kubo, and Kumar Vaibhav Pradipkumar Saxena at Altair.  

Understanding the operational status of GPUs and incorporating GPU health into the job scheduler's decisions helps ensure optimal job placement for users. It also helps administrators understand how GPU resources are being used so they can plan future resource allocation.

Background

The oldest academic supercomputing center in Japan, the University of Tokyo, is making plans for a future exascale system that can manage HPC and deep learning applications. As you would expect, current supercomputing systems are executing more traditional engineering, earth sciences, energy sciences, materials, and physics applications. The site has seen a growing demand for executing biology, biomechanics, biochemistry, and deep learning applications. The new applications require computational accelerators, and the site has invested in NVIDIA® Tesla® P100 GPUs to increase utilization and productivity for engineers and scientists. This is the first time that the Information Technology Center (ITC) at the University of Tokyo has adopted a computational accelerator in a supercomputer.

Referring to Figure 1, Reedbush is a supercomputer with three subsystems: Reedbush-U, which comprises only CPU nodes; Reedbush-H, which comprises nodes with two GPUs mounted as computational accelerators; and Reedbush-L, which comprises nodes with four GPUs mounted. These subsystems can be operated as independent systems.

Figure 1: The Reedbush system (source: https://www.cc.u-tokyo.ac.jp/en/supercomputer/reedbush/system.php)

With all the computational power in the three systems, the University of Tokyo's ITC required a solution that would provide its users with a robust, resilient, and power-efficient environment to ensure that their designs and scientific discoveries run to completion. Hewlett Packard Enterprise (HPE) and Altair collaborated on the project, which entailed integrating GPU monitoring based on the NVIDIA Data Center GPU Manager (DCGM) with Altair PBS Professional™ on the Reedbush supercomputer.

The How

HPE collaborated with Altair to develop GPU monitoring and workload power management capabilities within PBS Professional. The solution uses NVIDIA DCGM, a low-overhead tool suite that performs a variety of functions on each host system, including active health monitoring, diagnostics, system validation, policies, power and clock management, group configuration, and accounting.
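As a quick illustration of what DCGM exposes on a node, the snippet below is a minimal sketch (not part of the ITC integration) that calls the dcgmi command-line tool from Python to list the GPUs DCGM can see and to enable health watches. The exact dcgmi flags can vary between DCGM versions.

    # Minimal sketch: querying DCGM from Python via the dcgmi CLI.
    # Assumes the DCGM host engine (nv-hostengine) is already running on the
    # node; dcgmi flags may differ between DCGM versions.
    import subprocess

    def dcgmi(*args):
        """Run a dcgmi subcommand and return (exit_code, output)."""
        proc = subprocess.Popen(["dcgmi"] + list(args),
                                stdout=subprocess.PIPE, stderr=subprocess.STDOUT)
        out = proc.communicate()[0].decode("utf-8", "replace")
        return proc.returncode, out

    # List the GPUs DCGM can see on this host.
    rc, listing = dcgmi("discovery", "-l")
    print(listing)

    # Enable health watches for all subsystems on GPU group 0 (the default
    # group of all GPUs), so later health checks have data to report on.
    dcgmi("health", "-g", "0", "-s", "a")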

PBS Professional and NVIDIA DCGM Integration

The PBS Professional and NVIDIA DCGM integration includes the following benefits and functionalities:

  • Increase system resiliency
  • Automatically monitor node health
  • Automatically run diagnostics on GPUs
  • Reduce the risk of jobs failing due to GPU errors
  • Prevent jobs from running on nodes with GPU environment errors
  • Optimize job scheduling through GPU load and health monitoring
  • Provide node health information to help administrators and users understand how jobs are being affected
  • Record GPU usage for future planning

The integration relies on a few of the PBS Professional plugin events (a.k.a. hooks), as shown in Figure 3 and Figure 4 below. I won't cover all of these hook events in this article, but I recommend reviewing the PBS Professional Administrator Guide for the full list.

Figure 3 Admission Control and Management Plugins

Figure 4 Job Execution Plugins

A little background on hooks: hooks are custom scripts that PBS Professional runs at specific points in the life cycle of a job. Each type of event has a corresponding type of hook, which can accept, reject, or modify the behavior of the user's workflow. This integration uses the execjob_begin, execjob_epilogue, execjob_end, and exechost_periodic hook events.
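For readers new to hooks, the skeleton below is a hedged sketch of how such a hook might be wired up. The hook name, file name, and structure are illustrative (not the actual ITC/HPE implementation), and it assumes the event-type constants exposed by the pbs hook module.

    # Hedged skeleton of a multi-event hook (names are illustrative).
    # Registration is done once with qmgr, e.g.:
    #   qmgr -c 'create hook dcgm_check'
    #   qmgr -c 'set hook dcgm_check event = "execjob_begin,execjob_epilogue,execjob_end,exechost_periodic"'
    #   qmgr -c 'import hook dcgm_check application/x-python default dcgm_check.py'
    import pbs

    e = pbs.event()

    if e.type == pbs.EXECHOST_PERIODIC:
        # Periodic, node-level DCGM health check (see the sketches below).
        pass
    elif e.type == pbs.EXECJOB_BEGIN:
        # Validate the GPUs assigned to this job and start GPU accounting.
        pass
    elif e.type in (pbs.EXECJOB_EPILOGUE, pbs.EXECJOB_END):
        # Collect per-job GPU usage and run a final health check.
        pass

    e.accept()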

System/GPU Validation & Health Checks

The integration calls the NVIDIA DCGM health checks and analyzes the results, which can be pass, warning, or failure. Ideally everything passes, but as we know, systems run into trouble. When the integration detects a warning or a failure, it triggers the following actions:

1. Offline the node so it is no longer eligible to accept new jobs
2. Record the failure in the daemon log
3. Set a time-stamped comment on the node
4. Record the failure in the user's ER (error output) file

The difference between a warning and a failure is that on a warning, the integration allows the job to continue executing; on a failure, it requeues the job and lets the scheduler identify healthy nodes.
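The sketch below shows one hedged way this handling could look inside an exechost_periodic hook. The dcgmi flags, the string matching on its output, and the split of work between the periodic and execjob hooks are assumptions for illustration, not the integration's actual code; it also assumes the pbs hook module's vnode_list, ND_OFFLINE state, and logmsg interfaces.

    # Hedged sketch of the warning/failure handling in an exechost_periodic
    # hook. Health watches are assumed to have been enabled already
    # (dcgmi health -g 0 -s a); no Python 3-only syntax is used because the
    # embedded hook interpreter varies by PBS release.
    import pbs
    import subprocess
    import time

    e = pbs.event()
    me = pbs.get_local_nodename()

    proc = subprocess.Popen(["dcgmi", "health", "-g", "0", "-c"],
                            stdout=subprocess.PIPE, stderr=subprocess.STDOUT)
    out = proc.communicate()[0].decode("utf-8", "replace")

    if proc.returncode != 0 or "Error" in out:          # treat as a failure
        vnode = e.vnode_list[me]
        vnode.state = pbs.ND_OFFLINE                    # 1. offline the node
        pbs.logmsg(pbs.LOG_DEBUG,                       # 2. record in the daemon log
                   "DCGM health failure on %s: %s" % (me, out.strip()))
        vnode.comment = "%s: DCGM health check failed" % time.ctime()   # 3. node comment
        # 4. The execjob hooks append the failure to the job's ER file and
        #    requeue running jobs (e.g. with pbs.event().job.rerun()).
    elif "Warning" in out:
        # Warnings are logged, but jobs are allowed to keep running.
        pbs.logmsg(pbs.LOG_DEBUG, "DCGM health warning on %s" % me)

    e.accept()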

In addition to health checks, the integration can also run the diagnostics supported by NVIDIA DCGM, including software deployment tests, stress tests, and hardware checks. When the integration detects a failing diagnostic check, it triggers the same actions (a short sketch follows the list below):

1. Offline the node so it is no longer eligible to accept new jobs
2. Record the failure in the daemon log
3. Set a time-stamped comment on the node
4. Record the failure in the user's ER (error output) file
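A diagnostic pass can be sketched the same way; the dcgmi invocation and output parsing below are assumptions rather than the integration's actual logic.

    # Sketch of a diagnostic pass with dcgmi diag; -r 1 runs the quick
    # software/deployment checks, while levels 2 and 3 add longer stress and
    # hardware tests. The output parsing is an assumption.
    import subprocess

    def gpu_diag_ok(level=1):
        proc = subprocess.Popen(["dcgmi", "diag", "-r", str(level)],
                                stdout=subprocess.PIPE, stderr=subprocess.STDOUT)
        out = proc.communicate()[0].decode("utf-8", "replace")
        return proc.returncode == 0 and "Fail" not in out

    # A failing diagnostic triggers the same four actions as a health failure:
    # offline the node, log it, comment the vnode, and record it for the user.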

Per-Job GPU Usage

Once the system and its GPUs pass the initial tests, the integration begins tracking the job's GPU usage so it can be recorded in the PBS Professional accounting logs when the job terminates.
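DCGM's job-statistics feature is a natural fit for this. The hedged sketch below starts a DCGM stats recording keyed by the PBS job ID when the job begins and collects the report when it ends; the specific dcgmi stats flags and the commented resources_used key are assumptions for illustration, not the integration's actual accounting path.

    # Hedged sketch of per-job GPU accounting using DCGM job statistics.
    import pbs
    import subprocess

    def dcgmi_stats(*args):
        proc = subprocess.Popen(["dcgmi", "stats"] + list(args),
                                stdout=subprocess.PIPE, stderr=subprocess.STDOUT)
        return proc.communicate()[0].decode("utf-8", "replace")

    e = pbs.event()
    jobid = e.job.id

    if e.type == pbs.EXECJOB_BEGIN:
        dcgmi_stats("-g", "0", "-e")      # enable stats collection on the GPU group
        dcgmi_stats("-s", jobid)          # start a recording keyed by the PBS job id
    elif e.type == pbs.EXECJOB_END:
        dcgmi_stats("-x", jobid)          # stop the recording
        report = dcgmi_stats("-v", "-j", jobid)   # fetch the per-job usage report
        pbs.logmsg(pbs.LOG_DEBUG, "DCGM job stats for %s:\n%s" % (jobid, report))
        # Summarized values could then be stored in a custom resource so they
        # show up in qstat -xf and the accounting log, e.g.:
        #   e.job.resources_used["gpu_summary"] = summarize(report)   # hypothetical

    e.accept()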

The Results

As they say, “The proof of the pudding is in the eating.” To verify that the integration was working, the site submitted hundreds of GPU-enabled High-Performance LINPACK (HPL) jobs to PBS Professional to exercise the health checks and GPU accounting.

In this test, the NVIDIA GPUs were used very efficiently. In addition, the HPE SGI Management Suite sets power resources for the jobs, which also improves efficiency from the viewpoint of power usage, both for the whole node and for the CPU usage rate. The example used only two nodes with two GPUs each, but the same monitoring and management has been confirmed to work for a job scaling to 120 nodes (the whole system).

Below is a snippet of qstat -xf output illustrating the GPU usage of one of the finished HPL jobs.

As you know, PBS Professional records a lot of information. Some may argue too much. To address this, ITC developed a custom command, called rbstat, to extract specific job details and simplify the output for users and administrators.
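rbstat itself is ITC's custom tool and is not reproduced here, but the general idea can be sketched as a thin wrapper that runs qstat -xf for a job and keeps only a handful of fields. The script and field names below are purely illustrative.

    # Purely illustrative sketch (not ITC's rbstat): print selected fields
    # from qstat -xf for a finished or running job.
    import subprocess
    import sys

    KEEP = ("Job_Name", "job_state", "queue", "resources_used.walltime")

    def job_summary(jobid):
        proc = subprocess.Popen(["qstat", "-xf", jobid], stdout=subprocess.PIPE)
        out = proc.communicate()[0].decode("utf-8", "replace")
        # qstat -f prints "    key = value" lines; long values wrap onto
        # continuation lines, which this simple parse ignores.
        for line in out.splitlines():
            if "=" in line:
                key = line.split("=", 1)[0].strip()
                if key in KEEP or key.startswith("resources_used."):
                    print(line.strip())

    if __name__ == "__main__":
        job_summary(sys.argv[1])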

Closing

The PBS Professional and NVIDIA DCGM integration has proven critical to ITC's infrastructure, ensuring that users' designs and scientific discoveries run to completion. In addition, the integration gives administrators and users more insight into the utilization of the system and its GPUs, knowledge that can inform future procurements.

Although this example focused on one aspect of integrating with NVIDIA DCGM, the same capability is hardware- and OS-agnostic and can be used with any system. I hope this example has given you some ideas about how to customize your site and meet evolving requirements and user demands for reporting on site-specific metrics. For more information, a copy of the hooks, or follow-up questions, please feel free to leave a comment below.

Acknowledgments

The authors would like to thank the Information Technology Center, the University of Tokyo, Hewlett Packard Enterprise, Altair, and NVIDIA Corporation, who supported the deployment of this integration.

Reference

Altair® PBS Professional® 13.1 Administrator’s Guide

Altair® PBS Professional® 13.0.500 Power Awareness Release Notes

GPU-Accelerated Computing Made Better with NVIDIA DCGM and PBS Professional® (https://www.altair.com/NewsDetail.aspx?news_id=11273)

NVIDIA Data Center GPU Manager (DCGM) Overview & Download (https://developer.nvidia.com/data-center-gpu-manager-dcgm)

NVIDIA Data Center GPU Manager Simplifies Cluster Administration, By Milind Kukanur, August 8, 2016 (https://devblogs.nvidia.com/nvidia-data-center-gpu-manager-cluster-administration/)

Failing to Optimize the Human-in-the-Loop at the Earliest Stages of Design is More Expensive than You Think – Part 3

This guest contribution on the Altair Blog is written by Steve Beck, President and CEO at SantosHuman. SantosHuman is a member of the Altair Partner Alliance.

For those who have just arrived, welcome. Post #1 can be found here and post #2 can be found here. For everyone else, thank you for following this series and for coming back for the third and final post.

The opening statement in this series was that most design processes, whether intentional or not, effectively prioritize product capability over usability. The considerable cost of failing to prioritize usability was then shown through client engagement examples from four different industries.

All examples presented in the first two posts would have benefited significantly if design teams had the ability to evaluate the human-in-the-loop in ways that could inform and support product development decisions. That has been the promise of digital human modeling for decades. Yet examples just like those presented continue to occur, and they are not only common but pervasive throughout most industries.

This final post focuses on why and, of course, provides a solution.

The Problem
Most commercially available digital human models are really just virtual mannequins. Like the mannequins found in department stores, virtual mannequins must have their joints individually rotated into place until some recognizable human activity is achieved. Manipulating mannequin joints within a computer environment is tedious, non-intuitive, time-consuming, and subjective. It can also be quite frustrating, not only because non-human-looking results are a frequent outcome, but also because every design change requires the entire process to be repeated.

Companies that provide virtual mannequins have worked hard to mitigate this frustration by including the ability to leverage pre-recorded snapshots of human activity, primarily in the form of motion capture data. Using motion capture data to drive virtual mannequin postures does circumvent the need to interactively manipulate their joints, but that data is expensive to acquire and time-consuming to process. In fact, recent estimates from one of our automotive clients indicated an internal motion capture budget of over $30,000 per subject, per motion capture study.

But the real problem with using pre-recorded data of any kind in design is that it’s inflexible. It cannot respond to change. It can only be used as acquired. Any design change that potentially affects human interactivity requires the acquisition of more data. This is great news for companies in motion capture-related businesses, but it’s a nightmare for design teams and their budgets and deadlines. Unfortunately, this contributes to an even bigger problem.

Because virtual mannequin joints must either be manipulated manually or driven by pre-recorded data, virtual mannequins can really only react to an existing design. This means significant resources must first be expended to bring a design to a relatively high level of maturity before a virtual mannequin can be deployed. In other words, the use of a virtual mannequin requires a rather long list of traditional engineering efforts to be completed first. Consequently, at the point when human-centric evaluations can finally occur, any indicated need for change will be in direct conflict with all the resources already expended.

This is almost the same situation design teams were in before virtual mannequins existed, when product evaluations could only be accomplished through trial and error, physical prototypes, and focus groups. While the need for physical prototypes may be reduced, human-centric evaluations still occur too late to be effective. What is most ironic is that the usability of your products by your customers—those who ultimately determine your product's success in the market—is effectively being treated as if it is among the least important of your product's design criteria.

Why are outcomes like those presented in this series so common?  Because traditional design processes do not allow human-in-the-loop evaluations to occur until late in a product’s development cycle when change is no longer a realistic option.

The Solution
To be clear, Santos® technologies offer significant advantages in these traditional workflows, which appear to be pervasive throughout most industries. Santos® predictive models are fast, flexible, objective, and, of course, predictive. Because they're predictive, they provide a fair amount of autonomy, which makes them easier to use and easier to use correctly.

However, the real value of Santos® virtual human-in-the-loop solutions lies in the unique ability to predict human physical behavior and performance while taking into consideration the human-centric challenges we must all deal with every day in the physical world. These challenges include:

  • Simultaneously achieving multiple and competing task goals
  • Mitigating limitations in strength, flexibility, and fatigue
  • Optimizing grasp strategies
  • Ensuring we can see what we’re doing
  • Remaining in balance and avoiding collisions in spite of external forces that may be acting upon us
  • Trying not to get hurt

A truly predictive model makes trade-off analysis (the evaluation of what-if scenarios) possible. Trade-off analysis is why predictive models are created and why they are so valuable. A truly predictive human model can provide the task-focused trade-off analyses your teams need to optimize the human-in-the-loop at the earliest stages of design—where change is not only most effective but still an option.

Watch this video for one example of how this is done.

Conclusions
Like many of the companies we work with, your company has probably been in business for a very long time. Your teams probably have hundreds if not thousands of employee-years' worth of experience using your existing design processes. And your revenues are likely in the millions if not billions of dollars per year. By all objective measures, your company is exceptionally good at what it does.

However, consider that your design teams already avoid the cost and uncertainty of trial and error in meeting:

  • Structural performance requirements through the use of Finite Element Analysis
  • Aerodynamic and thermal performance requirements through the use of Computational Fluid Dynamics
  • Mechanical system performance requirements through the use of Multi-Body Dynamics

So, why continue to incur the cost and uncertainty of meeting human-in-the-loop requirements through trial and error? Those humans-in-the-loop are your customers, and their positive feedback is the next level of competitive advantage.


Learn more about SantosHuman
