GPU Tesla K40m¶
The lamb cluster has 8 nodes (n1, n2, n3, n4, n35, n36, n37, n38) equipped with an NVIDIA Tesla K40m card. This card has 2880 CUDA cores and 12 GB of GDDR5 memory, with CUDA compute capability 3.5.
TensorFlow 2.3.4 and PyTorch 2.7.0 are also installed on the cluster for use on the 8 nodes that have a GPU card.
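These properties can also be queried from Python on any of the GPU nodes. The following is a minimal, illustrative sketch (it assumes the pytorch2.7-python3.9 module is already loaded and the command is run on a GPU node):
import torch
# Query the properties of the first CUDA device on the node
props = torch.cuda.get_device_properties(0)
print("Name:", props.name)                                     # expected: Tesla K40m
print("Compute capability:", f"{props.major}.{props.minor}")   # expected: 3.5
print("Memory (GB):", round(props.total_memory / 1024**3, 1))  # ~12 GB GDDR5
print("Multiprocessors:", props.multi_processor_count)         # 15 SMs (15 x 192 = 2880 CUDA cores)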
To see the list of development software installed on the cluster, use module avail:
module avail
faststructure julia-1.11.0 nvidia-cuda11.8 stacks-ipyrad-plink-structure
gcc-9.4.0 matlab-R2017b mpi/2021.7.1 tensorflow2.3.4-python3.8
gmt-6.2.0 matlab-R2023b pytorch2.7-python3.9 R-4.3.3
:
:
To get information about the GPU card and see which processes are running on it:
ssh -X n1 nvidia-smi
ssh n2 nvidia-smi
ssh n3 nvidia-smi
:
TensorFlow¶
TensorFlow is an open-source library for building machine learning (ML) models with Python.
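As a quick, illustrative check that this build detects the GPU (assuming the tensorflow2.3.4-python3.8 module is loaded and the code is run on a GPU node), something like the following can be used:
import tensorflow as tf
# List the GPUs visible to TensorFlow; a K40m node should report one physical GPU
gpus = tf.config.list_physical_devices('GPU')
print("TensorFlow", tf.__version__, "- visible GPUs:", gpus)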
Using TensorFlow + PBS¶
Create the Python file:
test-tensorflow.py
from __future__ import print_function
'''
Basic single GPU computation example using the TensorFlow library.
Adapted from the multi GPU example by Aymeric Damien.
Project: https://github.com/aymericdamien/TensorFlow-Examples/
'''
'''
This tutorial requires your machine to have 1 GPU
"/cpu:0": The CPU of your machine.
"/gpu:0": The first GPU of your machine.
'''
import os
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '3'

import datetime

import numpy as np
import tensorflow.compat.v1 as tf

tf.disable_v2_behavior()

# Log on which device each operation is placed
log_device_placement = True

# Number of multiplications to perform
n = 10

'''
Example: compute A^n + B^n on the GPU.
Reference timings from the original example (8 cores with 2 GTX-980):
 * Single GPU computation time: 0:00:11.277449
 * Multi GPU computation time: 0:00:07.131701
'''

# Create large random matrices
A = np.random.rand(10000, 10000).astype('float32')
B = np.random.rand(10000, 10000).astype('float32')

# List that will hold the graph nodes with the results
c1 = []

def matpow(M, n):
    # Recursive matrix power; base case for n < 1
    if n < 1:
        return M
    else:
        return tf.matmul(M, matpow(M, n - 1))

'''
Single GPU computing
'''
with tf.device('/gpu:0'):
    a = tf.placeholder(tf.float32, [10000, 10000])
    b = tf.placeholder(tf.float32, [10000, 10000])
    # Compute A^n and B^n and store the results in c1
    c1.append(matpow(a, n))
    c1.append(matpow(b, n))

with tf.device('/cpu:0'):
    total = tf.add_n(c1)  # Addition of all elements in c1, i.e. A^n + B^n

t1_1 = datetime.datetime.now()
with tf.Session(config=tf.ConfigProto(log_device_placement=log_device_placement)) as sess:
    # Run the op.
    sess.run(total, {a: A, b: B})
t2_1 = datetime.datetime.now()

print("Single GPU computation time: " + str(t2_1 - t1_1))
Create the PBS script, referencing the gpu queue, which contains the 8 nodes equipped with a Tesla K40m card:
tensorflow.pbs
#!/bin/bash
#PBS -N tensorflow
#PBS -q gpu
#PBS -l nodes=1
#PBS -o tensorflow.out
#PBS -e tensorflow.err
module load tensorflow2.3.4-python3.8
cd $PBS_O_WORKDIR
python test-tensorflow.py
Submit the script:
$ qsub tensorflow.pbs
Then review the output in the tensorflow.out file:
$ cat tensorflow.out
Single GPU computation time: 0:00:21.743607
MatMul: (MatMul): /job:localhost/replica:0/task:0/device:GPU:0
MatMul_1: (MatMul): /job:localhost/replica:0/task:0/device:GPU:0
:
:
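For reference only, the same A^n + B^n computation could also be written in native TensorFlow 2 style (eager execution, without tensorflow.compat.v1). This is an illustrative sketch, not part of the example above:
import datetime
import numpy as np
import tensorflow as tf

n = 10
A = np.random.rand(10000, 10000).astype('float32')
B = np.random.rand(10000, 10000).astype('float32')

def matpow(M, n):
    # M^n by repeated matrix multiplication (n >= 1)
    R = M
    for _ in range(n - 1):
        R = tf.matmul(R, M)
    return R

t0 = datetime.datetime.now()
with tf.device('/GPU:0'):
    an = matpow(tf.constant(A), n)
    bn = matpow(tf.constant(B), n)
with tf.device('/CPU:0'):
    total = an + bn
print("Single GPU computation time:", datetime.datetime.now() - t0)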
PyTorch¶
PyTorch is an open-source deep learning framework used to build neural networks, combining the Torch machine learning (ML) library with a high-level Python-based API.
Using PyTorch + PBS¶
The version to use is PyTorch 2.7.0:
$ conda list |grep torch
$ python -c 'import torch; print(torch.cuda.is_available());'
True <-- this is what should appear
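Beyond torch.cuda.is_available(), a minimal sanity check such as the following (illustrative only) confirms that a computation actually runs on the CUDA device:
import torch
# Small matrix multiplication on the GPU
x = torch.rand(1000, 1000, device="cuda")
y = x @ x
torch.cuda.synchronize()         # wait for the GPU kernel to finish
print("Matmul OK on", y.device)  # expected: cuda:0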
Test run with a single GPU.
- Create the Python program that uses PyTorch and CUDA (polynomial.py) in a directory not named pytorch, for example test. (An optional accuracy check for this script is sketched at the end of this section.)
import torch
import math
class LegendrePolynomial3(torch.autograd.Function):
    """
    We can implement our own custom autograd Functions by subclassing
    torch.autograd.Function and implementing the forward and backward passes
    which operate on Tensors.
    """

    @staticmethod
    def forward(ctx, input):
        """
        In the forward pass we receive a Tensor containing the input and return
        a Tensor containing the output. ctx is a context object that can be used
        to stash information for backward computation. You can cache arbitrary
        objects for use in the backward pass using the ctx.save_for_backward method.
        """
        ctx.save_for_backward(input)
        return 0.5 * (5 * input ** 3 - 3 * input)

    @staticmethod
    def backward(ctx, grad_output):
        """
        In the backward pass we receive a Tensor containing the gradient of the loss
        with respect to the output, and we need to compute the gradient of the loss
        with respect to the input.
        """
        input, = ctx.saved_tensors
        return grad_output * 1.5 * (5 * input ** 2 - 1)
dtype = torch.float
# device = torch.device("cpu")   # uncomment this line (and comment the next one) to run on CPU
device = torch.device("cuda:0")  # run on the first GPU
# Create Tensors to hold input and outputs.
# By default, requires_grad=False, which indicates that we do not need to
# compute gradients with respect to these Tensors during the backward pass.
x = torch.linspace(-math.pi, math.pi, 2000, device=device, dtype=dtype)
y = torch.sin(x)
# Create random Tensors for weights. For this example, we need
# 4 weights: y = a + b * P3(c + d * x), these weights need to be initialized
# not too far from the correct result to ensure convergence.
# Setting requires_grad=True indicates that we want to compute gradients with
# respect to these Tensors during the backward pass.
a = torch.full((), 0.0, device=device, dtype=dtype, requires_grad=True)
b = torch.full((), -1.0, device=device, dtype=dtype, requires_grad=True)
c = torch.full((), 0.0, device=device, dtype=dtype, requires_grad=True)
d = torch.full((), 0.3, device=device, dtype=dtype, requires_grad=True)
learning_rate = 5e-6
for t in range(2000):
    # To apply our Function, we use Function.apply method. We alias this as 'P3'.
    P3 = LegendrePolynomial3.apply

    # Forward pass: compute predicted y using operations; we compute
    # P3 using our custom autograd operation.
    y_pred = a + b * P3(c + d * x)

    # Compute and print loss
    loss = (y_pred - y).pow(2).sum()
    if t % 100 == 99:
        print(t, loss.item())

    # Use autograd to compute the backward pass.
    loss.backward()

    # Update weights using gradient descent
    with torch.no_grad():
        a -= learning_rate * a.grad
        b -= learning_rate * b.grad
        c -= learning_rate * c.grad
        d -= learning_rate * d.grad

        # Manually zero the gradients after updating weights
        a.grad = None
        b.grad = None
        c.grad = None
        d.grad = None
print(f'Result: y = {a.item()} + {b.item()} * P3({c.item()} + {d.item()} x)')
- Create the PBS script (pytorch.pbs):
#!/bin/bash
#PBS -N pytorch
#PBS -q gpu
#PBS -l nodes=1
#PBS -o pytorch.out
#PBS -e pytorch.err
module load pytorch2.7-python3.9
cd $PBS_O_WORKDIR
python polynomial.py
- Submit the script:
$ qsub pytorch.pbs
- Running the nvidia-smi command on the assigned node, you can verify that the job is running on a GPU:
$ ssh n1 nvidia-smi
- The output can be seen in the pytorch.out file:
$ cat pytorch.out
99 209.95834350585938
199 144.66018676757812
299 100.70249938964844
399 71.03520202636719
499 50.978511810302734
599 37.40313720703125
699 28.20686912536621
799 21.973186492919922
899 17.745729446411133
999 14.877889633178711
1099 12.931766510009766
1199 11.610918998718262
1299 10.714248657226562
1399 10.105475425720215
1499 9.692106246948242
1599 9.411375045776367
1699 9.220745086669922
1799 9.091285705566406
1899 9.003361701965332
1999 8.94364070892334
Result: y = 1.2777713782885503e-11 + -2.208526849746704 * P3(-2.5764071431844116e-10 + 0.2554861009120941 x)
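Optionally, a few lines like the following could be appended at the end of polynomial.py to quantify how well the fitted polynomial approximates sin(x). This is an illustrative addition (it reuses the variables a, b, c, d, x, y and P3 defined in the script) and is not part of the original example:
# Maximum absolute error of the fitted model over the training points
with torch.no_grad():
    max_err = (a + b * P3(c + d * x) - y).abs().max().item()
print("Max absolute error vs sin(x):", max_err)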