Sunday, July 5, 2026

High performance computing - A self-scheduling algorithm

Setting up HPC for a data science and cloud project: a self-scheduling algorithm with parallel programming

Abstract

It has been a while since I last posted in this forum. Today, I prepared a schedule related to High-Performance Computing (HPC) and reviewed a recent project I completed for an HPC and Data Science course. The project focused on using HPC tools to address scheduling problems in cloud environments. I also learned that similar scheduling problems exist in battery pack management, as my project teammate pointed out at the time.

I now have a solid understanding of how to manage scheduling in cloud projects, especially for workloads that depend on time series. The study analyzes such a spiky workload, with jobs launched through mpirun. We added a transformation step to the training dataset to normalize the spiky load. So that the self-scheduling algorithm could handle this load effectively, we monitored the MPI threads to keep resource usage smooth.

Introduction

The objective is to apply a self-scheduling algorithm to real-world applications. The inputs are a square matrix and a set of constraints, together with the system configuration for hybrid setups. The output is an optimized vector that demonstrates the self-scheduling options.

Literature review


The self-scheduling algorithm can be applied for real-time analysis in data science, specifically for:
    1. Scheduling charge and discharge cycles of battery cells within a battery pack.
    2. Managing resources such as memory, processors, and storage for cloud elasticity.
    3. Introducing a novel programming technique determined during the project, utilizing the MPI (Message Passing Interface) package.

IBM_torc_py Architecture

All remote operations are conducted asynchronously through a dedicated server thread. This thread is responsible for the following tasks:

  • Inserting incoming tasks into the local queue of the process
  • Receiving completed tasks along with their results
  • Handling task-stealing requests

The internal architecture of torcpy is illustrated in the accompanying figure. Tasks (also known as futures) are created with the `submit` call and finalized with the `wait` call. For more details, see https://github.com/IBM/torc_py
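To make the three responsibilities of the server thread concrete, here is a minimal single-process sketch. It is not torcpy's implementation: the message kinds (`"task"`, `"result"`, `"steal"`, `"stop"`), the `server_loop` function, and the use of `queue.Queue` in place of MPI messages are all assumptions made for illustration.

```python
import queue
import threading


def server_loop(inbox, local_queue, results):
    """Hypothetical server thread: handle one message at a time until 'stop'."""
    while True:
        kind, payload = inbox.get()
        if kind == "task":
            # Incoming task: insert it into the local queue of the process.
            local_queue.put(payload)
        elif kind == "result":
            # Completed task: record its result under the task id.
            task_id, value = payload
            results[task_id] = value
        elif kind == "steal":
            # Task-stealing request: hand over a queued task if one exists.
            reply = payload
            try:
                reply.put(local_queue.get_nowait())
            except queue.Empty:
                reply.put(None)  # nothing available to steal
        elif kind == "stop":
            break


def demo():
    """Send one message of each kind and return what the server produced."""
    inbox, local_queue, results = queue.Queue(), queue.Queue(), {}
    server = threading.Thread(target=server_loop,
                              args=(inbox, local_queue, results))
    server.start()
    reply = queue.Queue()
    inbox.put(("task", ("square", 8.0)))
    inbox.put(("result", (0, 64.0)))
    inbox.put(("steal", reply))   # succeeds: one task is queued
    inbox.put(("steal", reply))   # fails: the queue is now empty
    inbox.put(("stop", None))
    server.join()
    return results, reply.get(), reply.get()
```

Because a single thread drains the inbox in order, the worker threads never have to block on remote communication, which is the point of handling remote operations asynchronously.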

Pseudo-code

Package steps based on torc

  1. Init

# torcpy execution starts, and MPI initialization happens

  2. Submit

# Switching to master-worker

>>Assign the calculations to workers via Submit

work inp=8.000, out=64.000 ...on node 0 worker 0 thread 139725221726080

  3. Receive, create, execute

    1. Received: 8.0^2=64.000

    2. Elapsed time for col1 = 47.07 s

    3. Elapsed time for col2 = 48.07 s

    4. TORCPY: node[0]: created=94, executed=94
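The init, submit, and receive steps above follow the familiar futures pattern. The sketch below reproduces that flow with Python's standard `concurrent.futures` module as a stand-in, so it runs without MPI. It is an illustration only, not the project code; the `run` function and its `num_workers` parameter are hypothetical names.

```python
from concurrent.futures import ThreadPoolExecutor, wait


def work(x):
    """Square the input, mirroring the x^2 task shown in the log line above."""
    return x ** 2


def run(inputs, num_workers=2):
    # Stand-in for the init step: start a pool of worker threads.
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        # Submit: hand each calculation to whichever worker is free.
        futures = [pool.submit(work, x) for x in inputs]
        # Wait: block until every submitted task has finished.
        wait(futures)
        # Receive: collect the results in submission order.
        return [f.result() for f in futures]


if __name__ == "__main__":
    data = [8.0, 3.0]
    for x, y in zip(data, run(data)):
        print("Received: {}^2={:.3f}".format(x, y))
```

In the actual project the same shape is expressed with torcpy's `submit` and `wait` calls, with tasks distributed across MPI processes rather than a local thread pool.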


Self-Scheduler wrapper

  • Init

  • Submit 

    • Work x^2

    • T x T Transpose

  • Receive, create, & Execute

    • node_id(): return the rank of the calling MPI process

    • worker_id(): return the global worker thread ID 
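
As a small illustration of how a global worker ID can relate to the node rank, the sketch below assumes that every node runs the same number of workers and that IDs are numbered contiguously node by node. That layout is an assumption, although it is consistent with the sample output later in this post, where node 1 hosts workers 5 to 9. The function name and parameters are hypothetical.

```python
def global_worker_id(node_id, local_worker, workers_per_node):
    """Map (node rank, local worker index) to a global worker ID.

    Assumes equal worker counts on every node and contiguous numbering.
    """
    if not 0 <= local_worker < workers_per_node:
        raise ValueError("local_worker must be in [0, workers_per_node)")
    return node_id * workers_per_node + local_worker
```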

Methodology

The method uses matrix-vector multiplication to improve vector scheduling, leveraging both MPI and OpenMP. It starts with MPI-level optimization across processes, then applies coarse-grained optimization with OpenMP within each process.
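
A minimal sketch of the self-scheduling idea applied to matrix-vector multiplication: each row is a unit of work, and idle workers pull the next row themselves instead of receiving a fixed share up front. Python threads stand in here for MPI processes and OpenMP threads, so this only shows the scheduling pattern, not the performance of the hybrid setup. The function name and `num_workers` parameter are assumptions.

```python
import queue
import threading


def self_scheduled_matvec(matrix, vector, num_workers=2):
    """Compute matrix * vector with workers that claim rows on demand."""
    rows = queue.Queue()
    for i in range(len(matrix)):
        rows.put(i)
    result = [0.0] * len(matrix)

    def worker():
        while True:
            try:
                i = rows.get_nowait()  # claim the next unprocessed row
            except queue.Empty:
                return                 # no rows left, so this worker retires
            # Each row index is claimed once, so writes never collide.
            result[i] = sum(a * x for a, x in zip(matrix[i], vector))

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return result
```

A fast worker simply claims more rows than a slow one, which is why this pattern copes with uneven or spiky per-row cost better than a static split.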




HTTP request dataset

 

| No of nodes | No of workers | Elapsed time in seconds | No of Records |
|---|---|---|---|
| 1 | 2 | 15/0.0 | 30/7 |
| 2 | 2 | 8/0.0 | 30/7 |
| 3 | 2 | 5/0.0 | 30/7 |
| 2 | 1 | 15/1 | 30/7 |
| 6 | 1 | 5/1 | 30/7 |
| 1 | 1 | 30/1 | 30/7 |
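
The scaling in the HTTP request table can be summarized as speedup and parallel efficiency relative to the 1-node, 1-worker run. The sketch below makes two assumptions that the table does not state: the first value in each elapsed-time cell is the runtime in seconds, and the worker count is per node, so total workers = nodes x workers. The function name is hypothetical.

```python
def speedup_table(runs, baseline_seconds):
    """Return (nodes, workers, speedup, efficiency) for each run.

    speedup    = baseline time / run time
    efficiency = speedup / total workers (assumed to be nodes * workers)
    """
    rows = []
    for nodes, workers, seconds in runs:
        speedup = baseline_seconds / seconds
        efficiency = speedup / (nodes * workers)
        rows.append((nodes, workers, speedup, efficiency))
    return rows


# (nodes, workers, elapsed seconds): first value of each elapsed-time cell,
# assumed to be the runtime in seconds.
RUNS = [(1, 2, 15), (2, 2, 8), (3, 2, 5), (2, 1, 15), (6, 1, 5), (1, 1, 30)]

if __name__ == "__main__":
    for nodes, workers, s, e in speedup_table(RUNS, baseline_seconds=30):
        print("nodes={} workers={} speedup={:.2f} efficiency={:.2f}".format(
            nodes, workers, s, e))
```

Under these assumptions, the 6-node run reaches a speedup of 6 over the single-worker baseline, and the other configurations stay close to linear scaling.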


Sample Output 

TORCPY: main starts
work inp=11.000, out=121.000 ...on node 0 worker 0 thread 140230837819264
work inp=16.000, out=256.000 ...on node 0 worker 1 thread 140229995116288
work inp=3.000, out=9.000 ...on node 0 worker 3 thread 140229978330880
work inp=3.000, out=9.000 ...on node 1 worker 6 thread 140336443447040
work inp=1.000, out=1.000 ...on node 1 worker 5 thread 140337286150016
work inp=4.000, out=16.000 ...on node 1 worker 7 thread 140336435054336
work inp=5.000, out=25.000 ...on node 1 worker 8 thread 140336426661632
work inp=9.000, out=81.000 ...on node 1 worker 9 thread 140336418268928


Elapsed time=3.05 s

TORCPY: node[0]: created=30, executed=15

TORCPY: node[1]: created=0, executed=15


 

| No of nodes | No of workers | Elapsed time in seconds | Rows/no of data records |
|---|---|---|---|
| 1 | 1 | 47 | 94 |
| 1 | 2 | 15 | 30 |


Result discussion
The evaluation of the self-scheduling algorithm shows clear performance gains from hybrid parallel programming. We scaled the setup from 1 to 6 nodes and varied the number of worker threads, and the elapsed time dropped accordingly. In the HTTP request dataset benchmark, for example, the elapsed runtime for a batch of 30 records fell from 30 seconds (1 node, 1 worker) to 5 seconds (6 nodes, 1 worker).

The torc_py framework handled the asynchronous master-worker task distribution. The node execution logs show the tasks balanced across the available workers: node[0] created all 30 tasks, and node[0] and node[1] each executed 15 of them. This smooths the spiky resource usage typically associated with heavy time-series workloads.

These results suggest that the self-scheduling wrapper balances computation against communication overhead. The framework is therefore a viable option for real-time analysis in data science applications, from maintaining cloud elasticity during unpredictable demand spikes to managing charge/discharge cycles within battery packs.
Future work
Sample output for the battery pack dataset (94/30 records), together with the required training data, will be produced for tests with differing thread counts.

Conclusion
In conclusion, a self-scheduling algorithm can address real-world applications by optimizing the processing of a square matrix under the specified constraints, including the system configuration for hybrid systems. The result is an optimized vector that improves the efficiency and performance of the system.
