Microway joins the OpenPOWER Foundation

We’re excited to announce that Microway has joined the OpenPOWER Foundation as a Silver member. We are integrating the OpenPOWER technologies into our server systems and HPC clusters. We’re also offering our HPC software tools on OpenPOWER.

The collaboration between OpenPOWER members is going to bring exciting new possibilities to High Performance Computing, and to the IT industry in general. The OpenPOWER Foundation’s membership is impressive and spans a very broad range of interests and industries. Our efforts will focus on molding these technologies into performant and easy-to-use HPC systems. Our experts ensure that Microway systems “just work”, so expect nothing less from our OpenPOWER offerings.

Posted in Hardware

NVIDIA Tesla M40 24GB GPU Accelerator (Maxwell GM200) Up Close

NVIDIA has announced a new version of their popular Tesla M40 GPU – one with 24GB of high-speed GDDR5 memory. The name hasn’t really changed – the new GPU is named NVIDIA Tesla M40 24GB. If you are curious about the original version with less memory, we have a detailed examination of the original M40 GPU.

As support for GPUs grows – particularly in the exploding fields of Machine Learning and Deep Learning – there has been increasing need for large quantities of GPU memory. The Tesla M40 24GB provides the most memory available to date in a single-GPU Tesla card. The remaining specifications of the new M40 match those of the original, including 7 TFLOPS of single-precision floating point performance.

The Tesla M40 continues to be the only high-performance Tesla compute GPU based upon the “Maxwell” architecture. “Maxwell” provides excellent performance per watt, as evidenced by the fact that this GPU provides 7 TFLOPS within a 250W power envelope.

Maximum single-GPU memory and performance: Tesla M40 24GB GPU

Available in Microway NumberSmasher GPU Servers and GPU Clusters

Photo of the NVIDIA Tesla M40 24GB GPU Accelerator bottom edge

Specifications*

  • 3072 CUDA GPU cores (GM200)
  • 7.0 TFLOPS single; 0.21 TFLOPS double-precision
  • 24GB GDDR5 memory
  • Memory bandwidth up to 288 GB/s
  • PCI-E x16 Gen3 interface to system
  • Dynamic GPU Boost for optimal clock speeds
  • Passive heatsink design for installation in qualified GPU servers
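A few derived figures help put this list in context. The sketch below uses only numbers quoted in this post (7.0 and 0.21 TFLOPS, 24GB, 288 GB/s, and the 250W power envelope); the function name and the choice of metrics are illustrative, and all results are theoretical peaks rather than measurements.

```python
# Figures quoted in this post for the Tesla M40 24GB.
SP_TFLOPS = 7.0        # single-precision peak
DP_TFLOPS = 0.21       # double-precision peak
MEMORY_GB = 24         # GDDR5 capacity
BANDWIDTH_GBS = 288    # peak memory bandwidth, GB/s
POWER_WATTS = 250      # power envelope

def derived_metrics():
    """Return back-of-envelope ratios computed from the quoted peak figures."""
    return {
        # Peak single-precision throughput per watt.
        "sp_gflops_per_watt": SP_TFLOPS * 1000 / POWER_WATTS,
        # How much faster single precision is than double precision.
        "sp_to_dp_ratio": SP_TFLOPS / DP_TFLOPS,
        # Memory bytes available per single-precision FLOP at peak.
        "bytes_per_sp_flop": BANDWIDTH_GBS / (SP_TFLOPS * 1000),
        # Time to stream the whole memory once at peak bandwidth.
        "seconds_to_read_all_memory": MEMORY_GB / BANDWIDTH_GBS,
    }

if __name__ == "__main__":
    for name, value in derived_metrics().items():
        print(f"{name}: {value:.4f}")
```

The single-to-double ratio of roughly 33:1 is why this card is positioned for single-precision workloads such as Deep Learning.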

Posted in Benchmarking, Hardware

Intel Xeon E5-2600 v4 “Broadwell” Processor Review

Today we begin shipping Intel’s new Xeon E5-2600 v4 processors. They provide more CPU cores, more cache, faster memory access and more efficient operation. These are based upon the Intel microarchitecture code-named “Broadwell” – we expect them to be the HPC processors of choice.

Important changes in Xeon E5-2600 v4 include:

  • Up to 22 processor cores per CPU
  • Support for DDR4 memory speeds up to 2400MHz
  • Faster Floating Point Instruction performance
  • Improved parallelism in scheduling micro-operations
  • Improved performance for large data sets

Posted in Hardware

DDR4 Memory on Xeon E5-2600v3 with 3 DIMMs per channel

This week I had the opportunity to run the STREAM memory benchmark on a Microway 2U NumberSmasher server which supports up to 3 DIMMs per channel. In practice, this system is typically configured with 768GB or 1.5TB of DDR4 memory. A key goal of this benchmarking was to examine how RAM quantity and clock frequency affect memory bandwidth. When fully loading all three DIMMs per channel, the memory frequency defaults to 1600MHz. At two DIMMs per channel, the default memory frequency increases to 1866MHz. With one DIMM per channel, the frequency maxes out at 2133MHz.
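To see what these default speeds mean for bandwidth, here is a rough sketch of the theoretical peak per CPU socket. It assumes four memory channels per socket and a 64-bit data path per channel, and it treats the quoted MHz figures as transfer rates in MT/s; none of these assumptions are stated in the post, and measured STREAM results will be lower than these ceilings.

```python
# Assumptions (not from the post): 4 channels per socket, 8 bytes per transfer.
CHANNELS_PER_SOCKET = 4
BYTES_PER_TRANSFER = 8

# DIMMs per channel -> default memory speed quoted in the text (treated as MT/s).
DEFAULT_SPEED = {3: 1600, 2: 1866, 1: 2133}

def peak_bandwidth_gbs(mega_transfers_per_sec, channels=CHANNELS_PER_SOCKET):
    """Theoretical peak bandwidth in GB/s (1 GB = 1e9 bytes) for one socket."""
    bytes_per_sec = mega_transfers_per_sec * 1e6 * BYTES_PER_TRANSFER * channels
    return bytes_per_sec / 1e9

def bandwidth_table():
    """Map DIMMs-per-channel to theoretical peak GB/s per socket."""
    return {dpc: peak_bandwidth_gbs(speed) for dpc, speed in DEFAULT_SPEED.items()}

if __name__ == "__main__":
    for dpc, gbs in sorted(bandwidth_table().items()):
        print(f"{dpc} DIMM(s)/channel @ {DEFAULT_SPEED[dpc]}: {gbs:.1f} GB/s per socket")
```

Under these assumptions, a fully populated three-DIMM configuration gives up about a quarter of the peak bandwidth available with one DIMM per channel, which is the capacity-versus-speed trade-off this benchmarking examines.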

Posted in Benchmarking, Hardware

Accelerating Code with OpenACC and the NVIDIA Visual Profiler

Consisting of a set of compiler directives, OpenACC was created to accelerate code using the many streaming multiprocessors (SMs) present on a GPU. Similar to how OpenMP is used for accelerating code on multicore CPUs, OpenACC can accelerate code on GPUs. But OpenACC offers more, as it is compatible with multiple architectures and devices, including multicore x86 CPUs and NVIDIA GPUs.

Here we will examine some fundamentals of OpenACC by accelerating a small program consisting of iterations of simple matrix multiplication. Along the way, we will see how to use the NVIDIA Visual Profiler to identify parts of the code which call OpenACC compiler directives. Graphical timelines displayed by the NVIDIA Visual Profiler visually indicate where greater speedups can be achieved. For example, applications which perform excessive host-to-device data transfer (and vice versa) can be significantly improved by eliminating the excess transfers.
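The cost of redundant transfers can be estimated before profiling. The sketch below is a back-of-envelope model, not OpenACC code: it compares copying the matrices on every iteration against copying them once, as an OpenACC `data` region allows. The matrix size, iteration count, 4-byte elements, and 12 GB/s link rate are all illustrative assumptions.

```python
def matrix_bytes(n, bytes_per_element=4):
    """Size in bytes of an n x n matrix of single-precision values."""
    return n * n * bytes_per_element

def transfer_seconds(n, iterations, copy_every_iteration, link_gbs=12.0):
    """Estimated time spent moving data for repeated C = A * B multiplies.

    One round moves A and B to the device and C back to the host.
    If data stays resident on the device, only one round is needed.
    """
    bytes_per_round = 3 * matrix_bytes(n)
    rounds = iterations if copy_every_iteration else 1
    return rounds * bytes_per_round / (link_gbs * 1e9)

if __name__ == "__main__":
    n, iterations = 4096, 100  # hypothetical problem size
    naive = transfer_seconds(n, iterations, copy_every_iteration=True)
    resident = transfer_seconds(n, iterations, copy_every_iteration=False)
    print(f"copy every iteration: {naive:.3f} s")
    print(f"copy once:            {resident:.3f} s")
```

In this model the transfer time scales linearly with the iteration count when data is copied every time, which is the kind of pattern a profiler timeline makes visible.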

Posted in Development, Software, Test Drive

NVIDIA Tesla M40 12GB GPU Accelerator (Maxwell GM200) Up Close

With the release of Tesla M40, NVIDIA continues to diversify its professional compute GPU lineup. Designed specifically for Deep Learning applications, the M40 provides 7 TFLOPS of single-precision floating point performance and 12GB of high-speed GDDR5 memory. It works extremely well with the popular Deep Learning software frameworks and may also find its way into other industries that need single-precision accuracy.

The Tesla M40 is also notable for being the first Tesla GPU to be based upon NVIDIA’s “Maxwell” GPU architecture. “Maxwell” provides excellent performance per watt, as evidenced by the fact that this GPU provides 7 TFLOPS within a 250W power envelope.

Maximum single-GPU performance: Tesla M40 12GB GPU

Available in Microway NumberSmasher GPU Servers and GPU Clusters

Photo of the NVIDIA Tesla M40 12GB GPU Accelerator

Specifications

  • 3072 CUDA GPU cores (GM200)
  • 7.0 TFLOPS single; 0.21 TFLOPS double-precision
  • 12GB GDDR5 memory
  • Memory bandwidth up to 288 GB/s
  • PCI-E x16 Gen3 interface to system
  • Dynamic GPU Boost for optimal clock speeds
  • Passive heatsink design for installation in qualified GPU servers

As with all other modern Tesla GPUs, you should expect it to be able to max out the PCI-E 3.0 bus to achieve ~12GB/sec of data transfers between the system and each GPU.
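For context, the ~12GB/sec figure can be compared with the theoretical ceiling of a PCI-E 3.0 x16 link. The sketch below uses the standard PCI-E 3.0 signaling rate (8 GT/s per lane) and 128b/130b encoding; the function names are illustrative, and the remaining gap is attributed to protocol overhead only as a general expectation, not a measurement from this post.

```python
GT_PER_SEC_PER_LANE = 8.0   # PCI-E 3.0 raw signaling rate per lane
ENCODING_EFFICIENCY = 128 / 130  # 128b/130b line encoding
LANES = 16

def pcie3_peak_gbs(lanes=LANES):
    """Theoretical one-direction bandwidth in GB/s after line encoding."""
    gigabits_per_sec = GT_PER_SEC_PER_LANE * ENCODING_EFFICIENCY * lanes
    return gigabits_per_sec / 8  # bits -> bytes

def link_efficiency(measured_gbs, lanes=LANES):
    """Fraction of the theoretical peak achieved by a measured transfer rate."""
    return measured_gbs / pcie3_peak_gbs(lanes)

if __name__ == "__main__":
    print(f"theoretical peak: {pcie3_peak_gbs():.2f} GB/s")
    print(f"12 GB/s achieves: {link_efficiency(12.0):.0%} of peak")
```

A transfer rate near 12 GB/s therefore corresponds to roughly three quarters of the link's theoretical limit.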

Posted in Benchmarking, Hardware

Deep Learning Frameworks: A Survey of TensorFlow, Torch, Theano, Caffe, Neon, and the IBM Machine Learning Stack

The art and science of training neural networks from large data sets in order to make predictions or classifications has experienced a major transition over the past several years. Through popular and growing interest from scientists and engineers, this field of data analysis has come to be called deep learning. Put succinctly, deep learning is the ability of machine learning algorithms to acquire feature hierarchies from data and then persist those features within multiple non-linear layers which comprise the machine’s learning center, or neural network.

Two years ago, questions were mainly about what deep learning is, and how it might be applied to problems in science, engineering, and finance. Over the past year, however, interest has shifted from curiosity about what deep learning is to a focus on acquiring hardware and software in order to apply deep learning frameworks to specific problems across a wide range of disciplines.

Here we examine the current major deep learning frameworks and compare them across features such as the framework’s native language, multi-GPU support, and aspects of usability.

Posted in Software

Keras and Theano Deep Learning Frameworks

Here we will explore how to use the Theano and Keras Python frameworks for designing neural networks in order to accomplish specific classification tasks. In the process, we will see how Keras offers a great amount of leverage and flexibility in designing neural nets. In particular, we will examine two active areas of research: classification of textual and image data.

Posted in Hardware, Software, Test Drive

Caffe Deep Learning Tutorial using NVIDIA DIGITS on Tesla K80 & K40 GPUs

In this Caffe deep learning tutorial, we will show how to use DIGITS to train a classifier on a small image set. Along the way, we’ll see how to adjust run-time parameters, such as the learning rate and the number of training epochs, in order to tune the network’s performance. Other DIGITS features will be introduced, such as starting a training run using the network weights derived from a previous training run, and using a completed classifier from the command line.
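To make the interaction between epochs and learning rate concrete, here is a minimal sketch of a step-decay schedule of the form used by Caffe's `step` learning rate policy (base rate multiplied by gamma raised to the number of completed steps). The image count, batch size, base rate, gamma, and step size are hypothetical values chosen for illustration, not settings from the tutorial.

```python
import math

def iterations_per_epoch(num_images, batch_size):
    """Number of training iterations needed to see every image once."""
    return math.ceil(num_images / batch_size)

def step_lr(base_lr, gamma, step_size, iteration):
    """Step decay: multiply the rate by gamma after every step_size iterations."""
    return base_lr * gamma ** (iteration // step_size)

if __name__ == "__main__":
    # Hypothetical run: 1000 images, batch size 64, 30 epochs.
    per_epoch = iterations_per_epoch(1000, 64)
    total = per_epoch * 30
    step_size = per_epoch * 10  # drop the rate every 10 epochs
    for it in (0, step_size, total - 1):
        print(f"iteration {it}: lr = {step_lr(0.01, 0.1, step_size, it):.6f}")
```

Lowering the rate in steps lets training take large updates early and finer updates later, which is one of the adjustments explored when tuning a network.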

Posted in Benchmarking, Software, Test Drive

DDR4 RDIMM and LRDIMM Performance Comparison

Recently, while carrying out memory testing in our integration lab, our Lead Systems Integrator, Rick Warner, was able to clearly identify when it is appropriate to choose load-reduced DIMMs (LRDIMMs) and when it is appropriate to choose registered DIMMs (RDIMMs) for servers running large amounts of DDR4 RAM (i.e., 256 gigabytes and greater). The critical factors to consider are latency, speed, and capacity, along with your computing objectives with respect to each.

Misconceptions on Load Reduced DIMM Performance

Load-reduced DIMMs were built so that high-speed memory controllers in CPUs could drive larger quantities of memory. Thus, it’s often assumed that LRDIMMs will offer the best performance for memory-dense servers. This impression is strengthened by the fact that Intel’s guide for DDR4 memory population shows LRDIMMs running at a higher frequency than RDIMMs (e.g., 2133MHz vs 1866MHz). However, as we’ll show below, there are greater factors at play.

Posted in Benchmarking, Hardware