Refereed International Conference Poster

PIM-Weaver: Compiler-Driven Adaptive Tensor Management with PIM-aware Data Layout [abstract]
Heelim Choi and Hanjun Kim
Student Research Competition (SRC), the 23rd ACM International Symposium on Code Generation and Optimization (CGO), March 2025.

Tensors are widely used to manage extensive data across applications, from artificial intelligence to high-performance computing. Tensor operations often handle large volumes of data and therefore inherently struggle with data movement overhead. Processing-in-memory (PIM) architectures are designed to mitigate the performance bottleneck caused by data movement between memory and processors in modern computing systems. Despite PIM's potential, its effectiveness is limited by unique challenges in tensor management, such as mismatched dimensions between tensors and memory layouts and the high overhead associated with layout transformation in PIM implementations. This work highlights several critical challenges in managing tensors for PIM systems. First, efficiently partitioning tensors across PIM memory banks remains difficult, as it requires balancing a wide range of constraints, including data locality, kernel correctness, workload balancing, and inter-bank communication cost. Second, aligning tensor layouts with kernel-specific access patterns, such as tiled layouts for convolutional layers or row-major layouts for matrix multiplication, is challenging because the transformation overhead must be minimized. Third, managing inter-layer alignment is particularly complex, as smooth transitions between layers, such as pipelines, require careful coordination to avoid accumulating high layout rearrangement costs. In addition, efficiently managing the complex metadata that changes with diverse layouts is also challenging. Existing approaches to partitioning and layout optimization for PIM typically rely on layer-specific, heuristic methods that require in-depth hardware knowledge. These methods often fail to scale to complex workloads, resulting in inefficiencies such as excessive inter-bank communication, high rearrangement overheads, and metadata changes induced by kernel operations.
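To make the layout-transformation challenge concrete, the sketch below converts a row-major matrix into a tiled layout (contiguous tiles, as a convolution kernel might prefer) and back. It is an illustrative example, not code from PIM-Weaver; the function names and tile shapes are assumptions, and the extra `.copy()` calls stand in for the physical data movement that such a rearrangement would cost on real hardware.

```python
import numpy as np

def to_tiled(mat, th, tw):
    """Rearrange a row-major matrix into contiguous (th x tw) tiles."""
    H, W = mat.shape
    assert H % th == 0 and W % tw == 0
    # (H//th, th, W//tw, tw) -> (H//th, W//tw, th, tw): each tile becomes contiguous
    return mat.reshape(H // th, th, W // tw, tw).swapaxes(1, 2).copy()

def to_row_major(tiled):
    """Invert to_tiled: scatter contiguous tiles back into row-major order."""
    nth, ntw, th, tw = tiled.shape
    return tiled.swapaxes(1, 2).reshape(nth * th, ntw * tw).copy()

mat = np.arange(16).reshape(4, 4)       # toy 4x4 row-major tensor
tiled = to_tiled(mat, 2, 2)             # tiled[0, 0] is the top-left 2x2 tile
assert np.array_equal(to_row_major(tiled), mat)  # round trip preserves data
```

Every element moves during each conversion, which is why the abstract emphasizes minimizing such rearrangements between layers with different preferred layouts.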
The challenges are further exacerbated in application pipelines with diverse layers, which require frequent transformations and layer-specific layouts, ultimately degrading performance. To address these challenges, this paper introduces:

- A unified tensor metadata framework: encapsulates partitioning rules, overlap configurations, and layout mappings, enabling dynamic adaptability.
- Compiler-driven rearrangement mechanisms: optimize tensor layouts for kernel-specific requirements while minimizing communication cost.
- A communication-aware layout strategy: aligns tensor partitions with processing-unit constraints to reduce inter-bank dependencies.

PIM-Weaver is evaluated on layer-wise kernels such as convolution and matrix multiplication, and on multi-layer models such as VGGNet and BERT. The benchmarks demonstrate that metadata-driven adaptive tensor management reduces inter-bank communication, improves processing-unit utilization, and minimizes rearrangement overhead. As a result, PIM-Weaver achieves up to a 15% end-to-end latency gain for multi-layer applications over state-of-the-art PIM-optimized, library-based implementations. In conclusion, this framework establishes a scalable approach to efficient tensor processing in memory-intensive applications using PIM, and emphasizes the importance of aligning tensor layouts with PIM constraints and kernel-specific requirements. By providing a structured framework for adaptive tensor management, we aim to set a foundation for future research in efficient tensor processing across PIM systems.