Fine-tuning Machine Learning for speed in JS, Python, C & WebASM: Creating a parallel multiplication Grid NPU Simulation for a 4x4 Grid
// Creating a parallel multiplication Grid NPU Simulation for 4x4 Grid 11:13 03/10/2024 (c)RS
// Reference NPU Document in https://is.gd/DictionarySortJS
#include <mpi.h>
#include <stdio.h>

#define NUM_ROWS 4
#define NUM_COLS 4

int main(int argc, char* argv[]) {
    int rank, size, i;
    int data[NUM_ROWS][NUM_COLS];   // the 4x4 grid; each rank works on one row
    int factor;
    int local_results[NUM_COLS];    // this rank's multiplied row

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    // Initialize data and factor
    // ...

    // Broadcast factor to all processors
    MPI_Bcast(&factor, 1, MPI_INT, 0, MPI_COMM_WORLD);

    // Perform local multiplication on this rank's row
    for (i = 0; i < NUM_COLS; i++) {
        local_results[i] = data[rank][i] * factor;
    }

    // Gather results from all processors
    // ...

    // Print results
    // ...

    MPI_Finalize();
    return 0;
}
// # End code
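To build and run the skeleton above with a standard MPI toolchain (exact command names vary by distribution, and the source file name here is only an example), the usual invocation looks like:
mpicc npu_grid.c -o npu_grid
mpirun -np 4 ./npu_grid
One process per grid row, so -np 4 matches NUM_ROWS.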
Creating a parallel multiplication that loads and runs in the most efficient manner, based on a 4x4 grid:
[ ]1a, [ ]2a, [ ]3a, [ ]4a
[ ]1b, [ ]2b, [ ]3b, [ ]4b
[ ]1c, [ ]2c, [ ]3c, [ ]4c
[ ]1d, [ ]2d, [ ]3d, [ ]4d
Understanding the Grid:
Each row represents a different set of data to be multiplied.
Each column represents a different factor to be multiplied by.
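As a minimal sketch (the cell values 11..44 are hypothetical, chosen only to make the mapping visible), the grid can be laid out in C with rows a-d as the first index and columns 1-4 as the second:
int data[NUM_ROWS][NUM_COLS] = {
    { 11, 12, 13, 14 },   // row a: cells 1a, 2a, 3a, 4a
    { 21, 22, 23, 24 },   // row b: cells 1b, 2b, 3b, 4b
    { 31, 32, 33, 34 },   // row c: cells 1c, 2c, 3c, 4c
    { 41, 42, 43, 44 }    // row d: cells 1d, 2d, 3d, 4d
};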
Parallel Multiplication Approach:
Load Data:
Assign each row to a separate processor or thread.
Load the data for each row into the corresponding processor's local memory.
Broadcast Factor:
Broadcast the multiplication factor to all processors or threads.
Parallel Multiplication:
Each processor or thread performs the multiplication operation on its assigned row using the broadcasted factor.
Store Results:
Store the results of each multiplication in the corresponding processor's local memory.
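Putting the four steps together, here is a minimal end-to-end sketch in C with MPI. It assumes exactly 4 ranks (one per grid row) and uses example values for the grid and the factor; the real initialization and output depend on the surrounding application.

#include <mpi.h>
#include <stdio.h>

#define NUM_ROWS 4
#define NUM_COLS 4

int main(int argc, char* argv[]) {
    int rank, size, i;
    int data[NUM_ROWS][NUM_COLS];        // full grid, only filled on rank 0
    int row[NUM_COLS];                   // the one row this rank works on
    int local_results[NUM_COLS];
    int results[NUM_ROWS][NUM_COLS];     // gathered back on rank 0
    int factor = 3;                      // example factor; only rank 0's value matters after the broadcast

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (size != NUM_ROWS) {              // this sketch assumes one rank per grid row
        if (rank == 0) fprintf(stderr, "run with 4 processes\n");
        MPI_Finalize();
        return 1;
    }

    if (rank == 0) {                     // Load Data: rank 0 owns the full grid (example values 1..16)
        int r, c;
        for (r = 0; r < NUM_ROWS; r++)
            for (c = 0; c < NUM_COLS; c++)
                data[r][c] = r * NUM_COLS + c + 1;
    }

    // Load Data: hand one row to each rank
    MPI_Scatter(data, NUM_COLS, MPI_INT, row, NUM_COLS, MPI_INT, 0, MPI_COMM_WORLD);

    // Broadcast Factor
    MPI_Bcast(&factor, 1, MPI_INT, 0, MPI_COMM_WORLD);

    // Parallel Multiplication on the local row
    for (i = 0; i < NUM_COLS; i++)
        local_results[i] = row[i] * factor;

    // Store Results: collect every rank's row back on rank 0 and print
    MPI_Gather(local_results, NUM_COLS, MPI_INT, results, NUM_COLS, MPI_INT, 0, MPI_COMM_WORLD);

    if (rank == 0) {
        int r, c;
        for (r = 0; r < NUM_ROWS; r++) {
            for (c = 0; c < NUM_COLS; c++)
                printf("%d ", results[r][c]);
            printf("\n");
        }
    }

    MPI_Finalize();
    return 0;
}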
Efficient Implementation:
Data Distribution:
Distribute the data across processors or threads in a way that minimizes communication overhead.
For example, if the data is already distributed across multiple memory modules, assign each row to a processor that has local access to its corresponding data.
Factor Broadcasting:
Use efficient broadcasting mechanisms to minimize communication latency.
For example, use tree-based broadcasting or collective operations provided by MPI or other parallel programming frameworks.
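For illustration only, a hand-rolled binomial-tree broadcast of the factor from rank 0 might look like the fragment below. This is not part of the original code, and in practice MPI_Bcast already uses a tree or similar algorithm internally; the fragment assumes mpi.h is included and MPI has been initialized.

void tree_bcast_factor(int *factor, int rank, int size) {
    int mask;
    for (mask = 1; mask < size; mask <<= 1) {
        if (rank < mask) {
            // this rank already holds the value; forward it one level down the tree
            int partner = rank + mask;
            if (partner < size)
                MPI_Send(factor, 1, MPI_INT, partner, 0, MPI_COMM_WORLD);
        } else if (rank < (mask << 1)) {
            // first round in which this rank participates: receive the value
            MPI_Recv(factor, 1, MPI_INT, rank - mask, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        }
    }
}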
Load Balancing:
Ensure that each processor or thread has a roughly equal amount of work to do.
This can be achieved by dividing the data into equal-sized chunks or by using dynamic load balancing techniques.
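As a sketch of the static, equal-chunk case (the variable names are illustrative, not from the original code, and the fragment assumes the declarations from the program above plus a full results array): when NUM_ROWS is not an exact multiple of the number of ranks, the leftover rows can be spread over the first few ranks so no rank does more than one extra row.

int base      = NUM_ROWS / size;                   // rows every rank gets
int extra     = NUM_ROWS % size;                   // leftover rows
int my_rows   = base + (rank < extra ? 1 : 0);     // first 'extra' ranks take one more
int first_row = rank * base + (rank < extra ? rank : extra);

for (i = first_row; i < first_row + my_rows; i++)
    for (j = 0; j < NUM_COLS; j++)
        results[i][j] = data[i][j] * factor;       // each rank multiplies only its own rows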
Memory Access Patterns:
Optimize memory access patterns to avoid cache misses and improve performance.
For example, use techniques like loop unrolling or data prefetching.
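A minimal sketch of unrolling the inner multiply loop by four (NUM_COLS is 4 here, so one unrolled pass covers a whole row; optimizing compilers often do this automatically at -O2/-O3):

for (i = 0; i < NUM_COLS; i += 4) {
    local_results[i]     = data[rank][i]     * factor;
    local_results[i + 1] = data[rank][i + 1] * factor;
    local_results[i + 2] = data[rank][i + 2] * factor;
    local_results[i + 3] = data[rank][i + 3] * factor;
}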
Synchronization:
Minimize synchronization overhead between processors or threads.
Use asynchronous communication or non-blocking operations whenever possible.
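As a sketch of the non-blocking style (assuming an MPI-3 implementation, since MPI_Ibcast is an MPI-3 collective), the factor broadcast can be overlapped with local work that does not depend on it, and synchronized only where the value is actually needed:

MPI_Request req;
MPI_Ibcast(&factor, 1, MPI_INT, 0, MPI_COMM_WORLD, &req);

// ... independent local work here, e.g. initializing this rank's row of data ...

MPI_Wait(&req, MPI_STATUS_IGNORE);   // factor is guaranteed valid only after the wait
for (i = 0; i < NUM_COLS; i++)
    local_results[i] = data[rank][i] * factor;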
(c)Rupert S
Genuinely good JS + Python & configuration work, Windows, Linux, ARM
ML tensor + ONNX Learner libraries & files
Model examples in models folder
https://is.gd/DictionarySortJS
https://is.gd/UpscaleWinDL
https://is.gd/HPC_HIP_CUDA
https://is.gd/OpenStreamingCodecs
https://is.gd/AMDPro2024PolarisCombined
The perfect Proposal RS
Best mini models by far
AMD Llama 135m https://huggingface.co/amd/AMD-Llama-135m
MS PHI 4K Model https://huggingface.co/microsoft/Phi-3-mini-4k-instruct