Distributed System

A distributed system is a group of computers working together so as to appear as a single computer to the end user. These machines share state, operate concurrently, and can fail independently without affecting the uptime of the system as a whole.

High Performance Computing

High-performance computing (HPC) is the use of supercomputers and parallel processing techniques to solve complex computational problems. HPC technology focuses on developing parallel processing algorithms and systems by combining system administration with parallel computational techniques.

LUSTRE Filesystem

Lustre is a distributed parallel filesystem with a scale-out architecture. Metadata services and storage are kept separate from data services and storage.

Cluster Computing

Technical Specifications

  • A Master Node

     
     

  • InfiniBand

    TrueScale network (QLogic), QDR at 40 Gb/s

  • 4 Phi Nodes

    1 coprocessor 7120P with 61 cores at 1.238 GHz and 16 GB RAM, plus 2 processors E5-2670 v2 with 10 cores each at 2.5 GHz and 128 GB RAM DDR3 at 1600 MHz

  • 24 Intel nodes

    2 processors E5-2670 v2 with 10 cores each at 2.5 GHz and 128 GB RAM DDR3 at 1600 MHz

  • LUSTRE Filesystem

    A Lustre storage system with 34 TB for home and 27 TB for data

  • 5 AMD nodes

    2 processors AMD 6378 with 16 cores each at 2.4 GHz and 64 GB RAM DDR3 at 1600 MHz
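
To get a feel for the overall size of the cluster, the figures above can be added up. The following is a small sketch in Python, assuming that the core and RAM figures are per node and leaving out the master node, for which no figures are listed. The names `NODE_GROUPS` and `cluster_totals` are made up for this illustration.

```python
# Node groups as listed in the specifications above.
# Assumption: core and RAM figures are per node. The master node is
# excluded because no figures are given for it.
NODE_GROUPS = {
    "phi":   {"nodes": 4,  "sockets": 2, "cores_per_socket": 10, "ram_gb": 128},
    "intel": {"nodes": 24, "sockets": 2, "cores_per_socket": 10, "ram_gb": 128},
    "amd":   {"nodes": 5,  "sockets": 2, "cores_per_socket": 16, "ram_gb": 64},
}

# Each Phi node additionally carries one 7120P coprocessor with 61 cores.
PHI_COPROCESSOR_CORES = 61


def host_cores(group):
    """CPU cores provided by the host processors of one node group."""
    return group["nodes"] * group["sockets"] * group["cores_per_socket"]


def cluster_totals(groups=NODE_GROUPS):
    """Sum host cores, host RAM and coprocessor cores over all groups."""
    return {
        "host_cores": sum(host_cores(g) for g in groups.values()),
        "host_ram_gb": sum(g["nodes"] * g["ram_gb"] for g in groups.values()),
        "coprocessor_cores": groups["phi"]["nodes"] * PHI_COPROCESSOR_CORES,
    }


if __name__ == "__main__":
    print(cluster_totals())
```

Under these assumptions the compute nodes provide 720 host cores, 3904 GB of host RAM, and 244 coprocessor cores.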

Tools

Development

On large-scale computers, many users must share the available resources. Because of this, you cannot simply log on to one of these systems, upload your programs, and start running them. Instead, your programs (called batch jobs) have to "get in line" and wait their turn. There is more than one of these lines (called queues) to choose from, and some queues have a higher priority than others (like the express checkout at the grocery store). The queues available to you are determined by the projects you are involved with.

The jobs in the queues are managed and controlled by a batch queuing system. Without it, users could overload the systems, causing severe performance degradation. The queuing system runs your job as soon as it can while still honoring the following:

  • Meeting your resource requests
  • Not overloading systems
  • Running higher priority jobs first
  • Maximizing overall throughput
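
The interplay of these goals can be illustrated with a toy scheduling pass. This is only a sketch of the general idea, not the algorithm an actual batch system uses: the function `schedule`, the job tuples, and the single "free cores" resource are all simplifying assumptions made for this example.

```python
import heapq


def schedule(jobs, free_cores):
    """Run one scheduling pass over the waiting jobs.

    jobs: list of (name, priority, cores) tuples in submission order;
          a larger priority value means "more urgent".
    free_cores: number of cores currently idle.
    Returns (started, waiting) as two lists of job names.
    """
    # Order by priority (highest first), then by submission order.
    heap = [
        (-priority, order, name, cores)
        for order, (name, priority, cores) in enumerate(jobs)
    ]
    heapq.heapify(heap)

    started, waiting = [], []
    while heap:
        _, _, name, cores = heapq.heappop(heap)
        if cores <= free_cores:
            # The request can be met without overloading the system.
            free_cores -= cores
            started.append(name)
        else:
            # Not enough room: the job keeps waiting. Smaller jobs further
            # down the line may still start, which keeps cores busy.
            waiting.append(name)
    return started, waiting


if __name__ == "__main__":
    jobs = [("a", 1, 20), ("b", 5, 20), ("c", 1, 40), ("d", 3, 20)]
    print(schedule(jobs, free_cores=40))
```

With 40 free cores, the two higher-priority jobs `b` and `d` start first and use all available cores, so `a` and `c` stay in the queue.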

We use the PBS Professional queuing system. The PBS module should be loaded automatically for you at login, allowing you access to the PBS commands.
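
A batch job is usually described by a small shell script whose `#PBS` comment lines carry the resource requests. The sketch below generates such a script from Python. The helper `make_pbs_script`, the queue name `workq`, the job name, and the command are placeholders chosen for illustration; the queues and resource limits that apply to you depend on your project.

```python
def make_pbs_script(job_name, queue, ncpus, walltime, command, nodes=1):
    """Build the text of a minimal PBS job script.

    walltime uses the HH:MM:SS format. All values here are examples;
    check which queues and limits apply to your project.
    """
    lines = [
        "#!/bin/bash",
        f"#PBS -N {job_name}",                    # job name
        f"#PBS -q {queue}",                       # target queue
        f"#PBS -l select={nodes}:ncpus={ncpus}",  # nodes and cores per node
        f"#PBS -l walltime={walltime}",           # maximum run time
        "cd $PBS_O_WORKDIR",                      # directory qsub was run from
        command,
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    script = make_pbs_script(
        job_name="example_job",   # hypothetical job name
        queue="workq",            # hypothetical queue name
        ncpus=20,
        walltime="01:00:00",
        command="./my_program",   # hypothetical executable
    )
    print(script)
```

Saved to a file such as `job.pbs`, the script would be submitted with `qsub job.pbs`, and `qstat` can then be used to check the state of the job.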

Stats

Ganglia

Usage

Tutorials

PDF

Manuals