Oser Communications Group

Super Computer Show Daily Nov 18, 2014


System Monitoring
A fully IPMI 2.0-compliant internal system monitor provides system operation data, control and alarming functions, including fan operation, voltages and temperature at several points in the enclosure, as well as NVIDIA GPU telemetry. A remote interface is provided through redundant Ethernet ports on the rear of the enclosure, with command-line, web-based or external interface options supporting the popular SNMP and RMCP protocols.

One Stop Systems' CA16000 is available immediately. Call its application engineers for lead time and pricing, and to determine which accelerators best meet your specific requirements.

Visit One Stop Systems at booth #754. For more information, go to www.onestopsystems.com, call 760-745-9883 or email sales@onestopsystems.com.

ACCELERATING CLUSTERING ON MACHINE LEARNING HARDWARE
By Chris McCormick, Sr. R&D Engineer, Cognimem Technologies

Clustering takes unlabeled data and finds natural groupings within it. One of the most commonly used clustering techniques, K-Means clustering, maps nicely onto scalable parallel machine learning hardware. A variant of this technique called K-Medians takes advantage of the computational simplicity of the L1 distance metric.

Accelerating K-Medians with Cognimem Hardware
The primary computational load in performing K-Medians on a large dataset is the cluster assignment step, which requires calculating the distance between each cluster center and every data point. The Cognimem architecture is designed to accelerate exactly that task: calculating a large number of distances in parallel and sorting the distances to find nearest neighbors.

First, we load each of the 'K' cluster centers into the CM1K neurons (one per neuron), assigning each a unique category value to identify it. For each example vector in the dataset, we broadcast the vector to the network.
The CM1K then calculates the distance to each of the cluster centers in parallel. The nearest neuron identifies the cluster to which that data point is assigned.

After every data point has been assigned to a cluster, we calculate the new cluster locations by taking the median of each cluster's members. These two steps (cluster assignment and cluster movement) are repeated until the cluster assignments stop changing from one iteration to the next.

MNIST Dataset Example
The CM1K provides the most acceleration for problems where the value of 'K' is large. Training a Radial Basis Function Network (RBFN) is one such example: the prototypes for the RBF neurons are selected through K-Medians clustering.

The MNIST hand-written digit dataset is useful for benchmarking machine learning algorithms on classification performance. The dataset consists of centered and normalized images of hand-written digits (0 through 9), with a training set of 60,000 examples and a test set of 10,000 examples. Using an RBFN trained with roughly 8,000 neurons in the hidden layer, we achieved 97.46 percent accuracy on the test set.

Performance Measurements
To compare the performance of clustering on the CM1K versus a PC, we measure the time taken to cluster the 60,000 MNIST training vectors into various numbers of clusters 'K' on a PC. Our PC implementation uses K-Means clustering implemented efficiently in MATLAB with matrix multiplication operations (which MATLAB performs very efficiently using SIMD instructions and multiple threads), running on an Intel Core i7-4770 at 3.4GHz. With 8,192 clusters, the PC requires 68.24µs per input vector. For comparison, the CM1K can find the nearest cluster center for an input vector in 11µs for one chip, or 22µs when chaining multiple chips together (e.g., handling 8,192 clusters requires 8 CM1Ks operating in parallel).

Visit Cognimem Technologies at booth #3744.
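The two-step loop described above — L1 nearest-neighbor assignment, then a per-dimension median update — also maps directly onto a conventional software implementation. Below is a minimal NumPy sketch of K-Medians for illustration; it is not Cognimem's API, and the function name and defaults are our own.

```python
import numpy as np

def k_medians(data, k, max_iter=100, rng=None):
    """Cluster `data` (n x d array) into k clusters using the L1 metric.

    Assignment step: each point goes to the nearest center by L1 distance
    (the step the CM1K parallelizes across its neurons).
    Update step: each center moves to the per-dimension median of its members.
    """
    rng = np.random.default_rng(rng)
    # Initialize centers from k distinct data points.
    centers = data[rng.choice(len(data), size=k, replace=False)].astype(float)
    assignments = np.full(len(data), -1)
    for _ in range(max_iter):
        # L1 distance from every point to every center: shape (n, k).
        dists = np.abs(data[:, None, :] - centers[None, :, :]).sum(axis=2)
        new_assignments = dists.argmin(axis=1)
        if np.array_equal(new_assignments, assignments):
            break  # assignments stable from one iteration to the next
        assignments = new_assignments
        for j in range(k):
            members = data[assignments == j]
            if len(members):
                centers[j] = np.median(members, axis=0)
    return centers, assignments
```

The median (rather than the mean) is the natural partner of the L1 metric, since the median of a set minimizes the sum of absolute deviations.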
For more information, visit www.cognimem.com, call 916-358-9485 or email info@cognimem.com.

OSS FEATURES NVIDIA TESLA GPUS IN FIRST PCIE GEN3 HIGH-DENSITY COMPUTE ACCELERATOR FOR HPC APPLICATIONS

One Stop Systems Inc. (OSS), a leader in PCI Express® (PCIe®) expansion technology, is shipping the first PCIe 3.0 expansion appliance that supports up to 16 high-end accelerator boards from a single server or multiple servers. The 3U High Density Compute Accelerator (CA16000) provides up to 73.3 TFLOPS of computational power using NVIDIA® Tesla® K10 GPU accelerators. The CA16000 is a complete appliance, solving integration issues and making installation easy. The user simply connects the cable or cables to the host server(s) and has hundreds or thousands of additional compute cores readily available.

"One Stop Systems has consistently produced ground-breaking products in the PCIe expansion market, and the PCIe Gen3 High Density Compute Accelerator sets the bar even higher," said Steve Cooper, CEO of One Stop Systems. "OSS accelerators have had a natural evolution from the Gen2 products first introduced in 2009. Since then we've continued to lead the industry, producing even faster computational products. The Gen3 CA16000 utilizes the latest silicon available along with expert design capabilities honed in designing PCI Express expansion products. The result is the fastest computational device on the market today and the easiest to install and use. Because we have already configured the system with cutting-edge NVIDIA Tesla GPU accelerators, the time to get systems up and running is greatly minimized. This product is the dream of any data center manager."

"NVIDIA Tesla GPUs are the highest-performance, most energy-efficient accelerators ever developed for HPC and supercomputing customers," said Sumit Gupta, General Manager of the Tesla Accelerated Computing Business at NVIDIA.
"One Stop Systems is providing customers with a differentiated, high-density system with world-class HPC performance to solve their most difficult computational challenges."

Easy Installation
Installation is made easier by the modularity of the appliance. The CA16000 consists of the rackmount chassis, three modular power supplies, four pre-loaded GPU canisters and the front bezel. Each canister is preloaded with four NVIDIA Tesla GPU accelerators. Ample cooling is provided by a fan located on the front of each canister plus four exhaust fans mounted on the rear of the enclosure. The canisters and power supplies slide into the front of the chassis, and the front bezel snaps into place. One to four servers are then cabled to the rear of the enclosure. Each connection operates at aggregated Gen3 speeds of 128Gb/s.

OFFERED FOR YOUR PREPRANDIAL EDIFICATION

David Abramson of the University of Queensland will discuss some of the vexing problems of debugging software running on supercomputers that exploit massive parallelism in a talk titled "It Was Working Until I Changed...," scheduled for 11:15 a.m. to noon on Tuesday, November 18 in the New Orleans Theater. According to his abstract for the talk, "This process becomes even more difficult in supercomputers that exploit massive parallelism because the state is distributed across processors, and additional failure modes (such as race and timing errors) can occur."

Abramson will discuss a debugging strategy called "relative debugging," which allows a user to compare the run-time state of two executing programs, one being a working "reference" code and the other a test version. His discussion will review the basic ideas of relative debugging and will give examples of how it can be applied to debugging supercomputing applications.
If that's not your cup of tea, you could head down to room 393-395 at 11:30 to hear a finalist for the conference's best student paper award, "A Volume Integral Equation Stokes Solver for Problems with Variable Coefficients," presented by authors Dhairya Malhotra, Amir Gholami and George Biros from the University of Texas at Austin. According to their abstract, the paper presents a novel numerical scheme for solving the Stokes equation with variable coefficients in the unit box, based on a volume integral equation formulation. The session will be chaired by Justin Luitjens of NVIDIA.

At the same time next door in room 391-392, Suzanne Rivoire from Sonoma State University will be chairing a session in which authors Niladrish Chatterjee, Mike O'Connor, Gabriel H. Loh, Nuwan Jayasena and Rajeev Balasubramonian offer their paper titled "Managing DRAM Latency Divergence in Irregular GPGPU Applications." The authors propose solutions that yield a 10.1 percent performance improvement for irregular GPGPU workloads relative to a throughput-optimized GPU memory controller.

AN INDEPENDENT PUBLICATION NOT AFFILIATED WITH SC

Lee M. Oser, CEO and Editor-in-Chief
Kim Forrester, Paul Harris, Associate Publishers
Lorrie Baumann, Editorial Director
Jeanie Catron, JoEllen Lowry, Associate Editors
Yasmine Brown, Vicky Glover, Graphic Designers
Mary Procida, Caitlyn Roach, Customer Service Managers
Enrico Cecchi, European Sales

Super Computer Show Daily is published by Oser Communications Group. ©2014 All rights reserved. Executive and editorial offices at 1877 N. Kolb Road, Tucson, AZ 85715; 520.721.1300; Fax: 520.721.6300; www.oser.com. European offices located at Lungarno Benvenuto Cellini, 11, 50125 Florence, Italy.
