GPU Tesla K40m

The lamb cluster has 8 nodes (n1, n2, n3, n4, n35, n36, n37, n38), each with an NVIDIA Tesla K40m card. The card has 2880 CUDA cores and 12 GB of GDDR5 memory, and its CUDA compute capability is 3.5.

TensorFlow 2.3.4 and PyTorch 2.7.0 are also installed on the cluster, for use on these 8 GPU nodes.

To list the development software installed on the cluster, run module avail:

module avail

faststructure        julia-1.11.0     nvidia-cuda11.8        stacks-ipyrad-plink-structure
gcc-9.4.0            matlab-R2017b    mpi/2021.7.1           tensorflow2.3.4-python3.8
gmt-6.2.0            matlab-R2023b    pytorch2.7-python3.9   R-4.3.3
:
:

To get information about the GPU card and see which processes are running on it:

ssh n1 nvidia-smi
ssh n2 nvidia-smi
ssh n3 nvidia-smi
:
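
The per-node checks above can also be scripted. What follows is a minimal sketch, not part of the cluster's tooling: it assumes passwordless ssh to the nodes listed at the top of this page and uses the CSV query mode of nvidia-smi. The helper names (build_command, parse_line, query_node) are hypothetical.

```python
import subprocess

# Node names taken from the list at the top of this page.
GPU_NODES = ["n1", "n2", "n3", "n4", "n35", "n36", "n37", "n38"]

# Fields requested from nvidia-smi, in the order they are parsed below.
QUERY_FIELDS = "name,memory.used,memory.total,utilization.gpu"


def build_command(node):
    """Return the ssh + nvidia-smi command line for one node."""
    return [
        "ssh", "-o", "BatchMode=yes", node,
        "nvidia-smi",
        f"--query-gpu={QUERY_FIELDS}",
        "--format=csv,noheader,nounits",
    ]


def parse_line(line):
    """Parse one CSV line of nvidia-smi output into a dict (memory in MiB)."""
    name, used, total, util = [field.strip() for field in line.split(",")]
    return {
        "name": name,
        "memory_used_mib": int(used),
        "memory_total_mib": int(total),
        "utilization_pct": int(util),
    }


def query_node(node, timeout=15):
    """Run nvidia-smi on a node and return one dict per GPU found there."""
    result = subprocess.run(
        build_command(node),
        capture_output=True, text=True, timeout=timeout, check=True,
    )
    return [parse_line(line) for line in result.stdout.splitlines() if line.strip()]
```

On a login node this could be used as `for node in GPU_NODES: print(node, query_node(node))`. A field that the driver reports as not available would make int() raise a ValueError, so a production script would need to handle that case.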

TensorFlow

TensorFlow is an open-source library for building machine learning (ML) models with Python.

Using TensorFlow with PBS

Create the Python file:

test-tensorflow.py

from __future__ import print_function
'''
Basic Multi GPU computation example using TensorFlow library.
Author: Aymeric Damien
Project: https://github.com/aymericdamien/TensorFlow-Examples/
'''

'''
This tutorial requires your machine to have 1 GPU
"/cpu:0": The CPU of your machine.
"/gpu:0": The first GPU of your machine
'''
import os
os.environ['TF_CPP_MIN_LOG_LEVEL']='3'
import numpy as np
import datetime
import tensorflow.compat.v1 as tf

tf.disable_v2_behavior()

# Processing Units logs
log_device_placement = True

# Num of multiplications to perform
n = 10

'''
Example: compute A^n + B^n (this adapted version runs only the single-GPU case).
Reference results from the original example, on 8 cores with 2 GTX-980:
* Single GPU computation time: 0:00:11.277449
* Multi GPU computation time: 0:00:07.131701
'''
# Create random large matrix
A = np.random.rand(10000, 10000).astype('float32')
B = np.random.rand(10000, 10000).astype('float32')

# Lists that collect the result ops (c2 is unused in this single-GPU version)
c1 = []
c2 = []

def matpow(M, n):
    if n <= 1:  # Base case: M^1 = M (also guards against n < 1)
        return M
    else:
        return tf.matmul(M, matpow(M, n-1))

'''
 Single GPU computing
'''
with tf.device('/gpu:0'):
    a = tf.placeholder(tf.float32, [10000, 10000])
    b = tf.placeholder(tf.float32, [10000, 10000])
    # Compute A^n and B^n and store results in c1
    c1.append(matpow(a, n))
    c1.append(matpow(b, n))

with tf.device('/cpu:0'):
   sum = tf.add_n(c1) #Addition of all elements in c1, i.e. A^n + B^n

t1_1 = datetime.datetime.now()
with tf.Session(config=tf.ConfigProto(log_device_placement=log_device_placement)) as sess:
   # Run the op.
   sess.run(sum, {a:A, b:B})
t2_1 = datetime.datetime.now()

print("Single GPU computation time: " + str(t2_1-t1_1))
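
As a rough sizing check for the example above, the following sketch works out how much memory the two input matrices need and how many floating-point operations a single matrix product costs. It assumes the usual 2·N³ operation count for a dense N×N product and reads the card's 12 GB as 12·10⁹ bytes. Intermediate results held by TensorFlow are not counted, so the real footprint is larger.

```python
N = 10000                 # matrix side used in test-tensorflow.py
BYTES_PER_FLOAT32 = 4     # size of one float32 element

# Memory needed by one N x N float32 matrix, and by both inputs A and B.
matrix_bytes = N * N * BYTES_PER_FLOAT32
inputs_bytes = 2 * matrix_bytes

# Assumption: a dense N x N matrix product costs about 2 * N^3 operations
# (N^3 multiplications plus roughly N^3 additions).
flops_per_matmul = 2 * N ** 3

# Assumption: 12 GB interpreted as 12 * 10^9 bytes.
GPU_MEMORY_BYTES = 12 * 10 ** 9
matrices_that_fit = GPU_MEMORY_BYTES // matrix_bytes

print(f"one matrix:        {matrix_bytes / 1e6:.0f} MB")
print(f"both inputs:       {inputs_bytes / 1e6:.0f} MB")
print(f"flops per matmul:  {flops_per_matmul:.1e}")
print(f"matrices that fit: {matrices_that_fit}")
```

The estimate suggests that the inputs themselves are small compared with the card's memory, and that most of the run time goes into the chain of matrix products.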

Create the PBS script, pointing it at the gpu queue, which contains the nodes that have a Tesla K40m card:

tensorflow.pbs

#!/bin/bash
#PBS -N tensorflow
#PBS -q gpu
#PBS -l nodes=1
#PBS -o tensorflow.out
#PBS -e tensorflow.err

module load tensorflow2.3.4-python3.8

cd $PBS_O_WORKDIR
python test-tensorflow.py

Submit the script:

$ qsub tensorflow.pbs
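
The script and the submission step above can also be driven from Python, which may help when many similar jobs are needed. This is only a sketch: render_pbs and submit are hypothetical helpers, the directives mirror the ones in tensorflow.pbs, and submit assumes that qsub prints the job identifier on standard output.

```python
import subprocess


def render_pbs(name, module, command, queue="gpu", nodes=1):
    """Build the text of a PBS script with the same layout as tensorflow.pbs."""
    lines = [
        "#!/bin/bash",
        f"#PBS -N {name}",
        f"#PBS -q {queue}",
        f"#PBS -l nodes={nodes}",
        f"#PBS -o {name}.out",
        f"#PBS -e {name}.err",
        "",
        f"module load {module}",
        "",
        "cd $PBS_O_WORKDIR",
        command,
    ]
    return "\n".join(lines) + "\n"


def submit(path):
    """Submit a script with qsub and return what qsub printed (the job id)."""
    result = subprocess.run(
        ["qsub", path], capture_output=True, text=True, check=True
    )
    return result.stdout.strip()


# Reproduce tensorflow.pbs from the values used on this page.
script = render_pbs(
    name="tensorflow",
    module="tensorflow2.3.4-python3.8",
    command="python test-tensorflow.py",
)
print(script)
```

To use it, write the rendered text to a file (for example tensorflow.pbs) and pass that path to submit on a node where qsub is available.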

Then check the output in the file tensorflow.out:

$ cat tensorflow.out

Single GPU computation time: 0:00:21.743607
MatMul: (MatMul): /job:localhost/replica:0/task:0/device:GPU:0
MatMul_1: (MatMul): /job:localhost/replica:0/task:0/device:GPU:0
:
:
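
To confirm at a glance that the operations landed on the GPU, the placement lines can be counted per device. The sketch below is an illustration under one assumption: that every placement line has the "name: (type): device" shape shown in the output above. The helper names are hypothetical.

```python
import re
from collections import Counter

# Matches lines such as:
#   MatMul: (MatMul): /job:localhost/replica:0/task:0/device:GPU:0
PLACEMENT = re.compile(
    r"^(?P<op>[^:\s]+): \((?P<kind>\w+)\): (?P<device>/job:\S+)$"
)


def parse_placements(lines):
    """Return (op name, op type, device) for every placement line found."""
    placements = []
    for line in lines:
        match = PLACEMENT.match(line.strip())
        if match:
            placements.append((match["op"], match["kind"], match["device"]))
    return placements


def ops_per_device(placements):
    """Count operations per device, keeping only the 'GPU:0'-style suffix."""
    return Counter(
        device.rsplit("/device:", 1)[-1] for _, _, device in placements
    )


# The lines shown in tensorflow.out above.
sample = [
    "Single GPU computation time: 0:00:21.743607",
    "MatMul: (MatMul): /job:localhost/replica:0/task:0/device:GPU:0",
    "MatMul_1: (MatMul): /job:localhost/replica:0/task:0/device:GPU:0",
]
counts = ops_per_device(parse_placements(sample))
print(counts)
```

For a real run, the lines could be read from the job's output file instead of the sample list.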

PyTorch

PyTorch is an open-source deep learning framework for building neural networks. It combines the Torch machine learning (ML) library with a high-level Python API.

Using PyTorch with PBS

The version to use is PyTorch 2.7.0. Check that it is installed and that it can see the GPU:

$ conda list |grep torch
$ python -c 'import torch; print(torch.cuda.is_available());'

 True     <-- this must appear
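
The one-line check above only reports whether CUDA is usable. A slightly longer sketch can also report which device PyTorch sees, so it can be compared with the specifications listed at the top of this page. The function name describe_cuda is hypothetical, and the import is guarded so that the script also runs where PyTorch is not installed.

```python
def describe_cuda():
    """Collect basic facts about the PyTorch installation and the first GPU."""
    try:
        import torch
    except ImportError:
        # PyTorch is not available in this environment.
        return {"torch_installed": False}

    info = {
        "torch_installed": True,
        "torch_version": torch.__version__,
        "cuda_available": torch.cuda.is_available(),
    }
    if info["cuda_available"]:
        # Name and (major, minor) compute capability of device 0.
        info["device_name"] = torch.cuda.get_device_name(0)
        info["compute_capability"] = torch.cuda.get_device_capability(0)
    return info


print(describe_cuda())
```

Run it on a GPU node after loading the PyTorch module. On a login node without a GPU, cuda_available would be expected to be False.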

Test run on a single GPU.

Write the Python program that uses PyTorch and CUDA (polynomial.py) in a directory that is not named pytorch, for example test.

import torch
import math

class LegendrePolynomial3(torch.autograd.Function):
    """
    We can implement our own custom autograd Functions by subclassing
    torch.autograd.Function and implementing the forward and backward passes
    which operate on Tensors.
    """

    @staticmethod
    def forward(ctx, input):
        """
        In the forward pass we receive a Tensor containing the input and return
        a Tensor containing the output. ctx is a context object that can be used
        to stash information for backward computation. You can cache arbitrary
        objects for use in the backward pass using the ctx.save_for_backward method.
        """
        ctx.save_for_backward(input)
        return 0.5 * (5 * input ** 3 - 3 * input)

    @staticmethod
    def backward(ctx, grad_output):
        """
        In the backward pass we receive a Tensor containing the gradient of the loss
        with respect to the output, and we need to compute the gradient of the loss
        with respect to the input.
        """
        input, = ctx.saved_tensors
        return grad_output * 1.5 * (5 * input ** 2 - 1)


dtype = torch.float
# device = torch.device("cpu")  # Uncomment this (and comment out the next line) to run on CPU
device = torch.device("cuda:0")  # Run on the first GPU

# Create Tensors to hold input and outputs.
# By default, requires_grad=False, which indicates that we do not need to
# compute gradients with respect to these Tensors during the backward pass.
x = torch.linspace(-math.pi, math.pi, 2000, device=device, dtype=dtype)
y = torch.sin(x)

# Create random Tensors for weights. For this example, we need
# 4 weights: y = a + b * P3(c + d * x), these weights need to be initialized
# not too far from the correct result to ensure convergence.
# Setting requires_grad=True indicates that we want to compute gradients with
# respect to these Tensors during the backward pass.
a = torch.full((), 0.0, device=device, dtype=dtype, requires_grad=True)
b = torch.full((), -1.0, device=device, dtype=dtype, requires_grad=True)
c = torch.full((), 0.0, device=device, dtype=dtype, requires_grad=True)
d = torch.full((), 0.3, device=device, dtype=dtype, requires_grad=True)

learning_rate = 5e-6
for t in range(2000):
    # To apply our Function, we use Function.apply method. We alias this as 'P3'.
    P3 = LegendrePolynomial3.apply

    # Forward pass: compute predicted y using operations; we compute
    # P3 using our custom autograd operation.
    y_pred = a + b * P3(c + d * x)

    # Compute and print loss
    loss = (y_pred - y).pow(2).sum()
    if t % 100 == 99:
        print(t, loss.item())

    # Use autograd to compute the backward pass.
    loss.backward()

    # Update weights using gradient descent
    with torch.no_grad():
        a -= learning_rate * a.grad
        b -= learning_rate * b.grad
        c -= learning_rate * c.grad
        d -= learning_rate * d.grad

        # Manually zero the gradients after updating weights
        a.grad = None
        b.grad = None
        c.grad = None
        d.grad = None

print(f'Result: y = {a.item()} + {b.item()} * P3({c.item()} + {d.item()} x)')
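
The backward method above relies on the derivative of P3(x) = 0.5 (5x³ - 3x) being 1.5 (5x² - 1). The following sketch checks that formula numerically with a central difference in plain Python, without PyTorch. The function names and the sample points are chosen only for illustration.

```python
def p3(x):
    """Legendre polynomial of degree 3, as in LegendrePolynomial3.forward."""
    return 0.5 * (5 * x ** 3 - 3 * x)


def p3_grad(x):
    """Analytic derivative, the factor used in LegendrePolynomial3.backward."""
    return 1.5 * (5 * x ** 2 - 1)


def central_difference(f, x, h=1e-5):
    """Numerical derivative of f at x using a symmetric finite difference."""
    return (f(x + h) - f(x - h)) / (2 * h)


for x in (-1.0, -0.3, 0.0, 0.5, 1.0):
    numeric = central_difference(p3, x)
    analytic = p3_grad(x)
    # The two values should agree up to the finite-difference error.
    assert abs(numeric - analytic) < 1e-6
    print(f"x={x:+.1f}  analytic={analytic:+.6f}  numeric={numeric:+.6f}")
```

PyTorch also provides torch.autograd.gradcheck for testing custom functions of this kind against numerical gradients.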

Create the PBS script (pytorch.pbs):

#!/bin/bash
#PBS -N pytorch
#PBS -q gpu
#PBS -l nodes=1
#PBS -o pytorch.out
#PBS -e pytorch.err

module load pytorch2.7-python3.9

cd $PBS_O_WORKDIR
python polynomial.py

Submit the script:

$ qsub pytorch.pbs

Running the nvidia-smi command on the assigned node confirms that the job is running on a GPU:

$ ssh n1 nvidia-smi

The output is written to the file pytorch.out:

$ cat pytorch.out

209.95834350585938
144.66018676757812
100.70249938964844
71.03520202636719
50.978511810302734
37.40313720703125
28.20686912536621
21.973186492919922
17.745729446411133
14.877889633178711
12.931766510009766
11.610918998718262
10.714248657226562
10.105475425720215
9.692106246948242
9.411375045776367
9.220745086669922
9.091285705566406
9.003361701965332
8.94364070892334
 Result: y = 1.2777713782885503e-11 + -2.208526849746704 * P3(-2.5764071431844116e-10 + 0.2554861009120941 x)
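
As a sanity check on the result line above, the fitted curve can be evaluated against sin(x) in plain Python. This sketch copies the four coefficients from the output and rebuilds the same 2000 evenly spaced points in double precision. Because the job ran in float32, the sum of squared errors would be expected to land close to the last loss values printed above, but not to match them digit for digit.

```python
import math

# Coefficients copied from the "Result" line in pytorch.out.
A = 1.2777713782885503e-11
B = -2.208526849746704
C = -2.5764071431844116e-10
D = 0.2554861009120941


def p3(x):
    """Legendre polynomial of degree 3."""
    return 0.5 * (5 * x ** 3 - 3 * x)


def fitted(x):
    """The fitted model y = a + b * P3(c + d * x)."""
    return A + B * p3(C + D * x)


# Same grid as torch.linspace(-pi, pi, 2000), built in double precision.
POINTS = 2000
xs = [-math.pi + 2 * math.pi * i / (POINTS - 1) for i in range(POINTS)]

# Sum of squared errors against sin(x), the loss used in polynomial.py.
sse = sum((fitted(x) - math.sin(x)) ** 2 for x in xs)
print(f"sum of squared errors: {sse:.4f}")
```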