Search Results for author: Youngeun Kwon

Found 7 papers, 0 papers with code

Training Personalized Recommendation Systems from (GPU) Scratch: Look Forward not Backwards

no code implementations10 May 2022 Youngeun Kwon, Minsoo Rhu

Prior work proposed caching frequently accessed embeddings inside GPU memory as a means to filter down the embedding-layer traffic to CPU memory, but this paper observes several limitations with such a cache design.

Recommendation Systems
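The caching idea the abstract refers to can be illustrated with a minimal sketch. This is a hypothetical LRU embedding cache, not the paper's actual design: hot embedding rows live in a small fast tier (standing in for GPU memory), and misses fall back to the full table (standing in for CPU memory). The class name and structure are illustrative assumptions.

```python
from collections import OrderedDict
import numpy as np

class EmbeddingCache:
    """Hypothetical LRU cache over embedding rows (illustrative sketch only):
    hot rows live in a small fast tier ("GPU memory"); misses read from the
    full table ("CPU memory")."""

    def __init__(self, table: np.ndarray, capacity: int):
        self.table = table          # full embedding table (CPU-memory tier)
        self.capacity = capacity    # number of rows that fit in the fast tier
        self.cache = OrderedDict()  # row_id -> embedding vector
        self.hits = 0
        self.misses = 0

    def lookup(self, row_id: int) -> np.ndarray:
        if row_id in self.cache:
            self.hits += 1
            self.cache.move_to_end(row_id)  # mark as most recently used
            return self.cache[row_id]
        self.misses += 1
        vec = self.table[row_id]            # traffic to the CPU-memory tier
        self.cache[row_id] = vec
        if len(self.cache) > self.capacity:
            self.cache.popitem(last=False)  # evict least recently used row
        return vec

# Skewed access pattern: repeated lookups of a few hot rows mostly hit.
table = np.random.rand(1000, 8).astype(np.float32)
cache = EmbeddingCache(table, capacity=64)
for rid in [3, 7, 3, 3, 7, 42]:
    cache.lookup(rid)
print(cache.hits, cache.misses)  # → 3 3
```

A cache like this helps only while the access distribution stays skewed toward a small hot set; the paper's observation is that such designs have limitations in practice.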

Tensor Casting: Co-Designing Algorithm-Architecture for Personalized Recommendation Training

no code implementations25 Oct 2020 Youngeun Kwon, Yunjae Lee, Minsoo Rhu

Personalized recommendations are one of the most widely deployed machine learning (ML) workloads serviced from cloud datacenters.

Centaur: A Chiplet-based, Hybrid Sparse-Dense Accelerator for Personalized Recommendations

no code implementations12 May 2020 Ranggi Hwang, Taehun Kim, Youngeun Kwon, Minsoo Rhu

Personalized recommendations are the backbone machine learning (ML) algorithm powering several important application domains (e.g., ads, e-commerce) serviced from cloud datacenters.

NeuMMU: Architectural Support for Efficient Address Translations in Neural Processing Units

no code implementations15 Nov 2019 Bongjoon Hyun, Youngeun Kwon, Yujeong Choi, John Kim, Minsoo Rhu

To satisfy the compute and memory demands of deep neural networks, neural processing units (NPUs) are being widely utilized to accelerate deep learning algorithms.

Management

Translation

TensorDIMM: A Practical Near-Memory Processing Architecture for Embeddings and Tensor Operations in Deep Learning

no code implementations8 Aug 2019 Youngeun Kwon, Yunjae Lee, Minsoo Rhu

Recent studies from several hyperscalers pinpoint embedding layers as the most memory-intensive deep learning (DL) algorithm deployed in today's datacenters.

Recommendation Systems

Beyond the Memory Wall: A Case for Memory-centric HPC System for Deep Learning

no code implementations18 Feb 2019 Youngeun Kwon, Minsoo Rhu

As the models and datasets used to train deep learning (DL) models scale, system architects face new challenges, one of which is the memory capacity bottleneck: the limited physical memory inside the accelerator device constrains the algorithms that can be studied.
