Wen-mei Hwu

Q: What Schools Are Affiliated With Wen-mei Hwu?

Wen-mei Hwu is affiliated with the following schools: University of California, Berkeley; Stanford University; University of Illinois Urbana-Champaign; and National Taiwan University.

Wen-mei Hwu's AcademicInfluence.com Rankings

Computer Science: #826 World Rank, #857 Historical Rank, #449 USA Rank
Parallel Computing: World Rank, Historical Rank, USA Rank (values not shown)
Computer Architecture: World Rank, Historical Rank, USA Rank (values not shown)
Database: #1964 World Rank, #2062 Historical Rank, #473 USA Rank


Engineering: #664 World Rank, #1073 Historical Rank, #290 USA Rank
Electrical Engineering: #758 World Rank, #824 Historical Rank, #195 USA Rank



Wen-mei Hwu's Degrees

PhD Electrical Engineering University of California, Berkeley
Masters Electrical Engineering Stanford University
Bachelors Electrical Engineering National Taiwan University

Why Is Wen-mei Hwu Influential?


According to Wikipedia, Wen-mei Hwu is the Walter J. Sanders III-AMD Endowed Chair professor in Electrical and Computer Engineering in the Coordinated Science Laboratory at the University of Illinois at Urbana-Champaign. His research is on compiler design, computer architecture, computer microarchitecture, and parallel processing. He is a principal investigator for the petascale Blue Waters supercomputer, co-director of the Universal Parallel Computing Research Center, and principal investigator for the first NVIDIA CUDA Center of Excellence at UIUC. At the Illinois Coordinated Science Lab, Hwu leads the IMPACT Research Group and is director of the OpenIMPACT project, which has delivered new compiler and computer architecture technologies to the computer industry since 1987. From 1997 to 1999, Hwu served as the chairman of the Computer Engineering Program at Illinois. Since 2009, Hwu has served as chief technology officer at MulticoreWare Inc., leading the development of compiler tools for heterogeneous platforms. The OpenCL compilers developed by his team at MulticoreWare are based on the LLVM framework and have been deployed by leading semiconductor companies. In 2020, Hwu retired after serving 33 years at the University of Illinois at Urbana-Champaign. Currently, Hwu is a Senior Distinguished Research Scientist at NVIDIA Research and Emeritus Professor at the University of Illinois at Urbana-Champaign.


Wen-mei Hwu's Published Works

[Chart: number of citations in a given year to any of this author's works]

[Chart: total number of citations to the works the author published in a given year, highlighting publication of the author's most important works]

Published Works

Optimization principles and application performance evaluation of a multithreaded GPU using CUDA (2008) (997)
A power controlled multiple access protocol for wireless packet networks (2001) (722)
Parboil: A Revised Benchmark Suite for Scientific and Commercial Throughput Computing (2012) (710)
POSITION STATEMENT. (1995) (602)
The superblock: An effective technique for VLIW and superscalar compilation (1993) (371)
Accelerating advanced MRI reconstructions on GPUs (2008) (331)
An adaptive performance modeling tool for GPU architectures (2010) (309)
Program optimization space pruning for a multithreaded GPU (2008) (304)
GPU Computing Gems Emerald Edition (2011) (260)
CUDA-Lite: Reducing GPU Programming Complexity (2008) (260)
IMPACT: an architectural framework for multiple-instruction-issue processors (1991) (246)
GPU clusters for high-performance computing (2009) (244)
Using profile information to assist classic code optimizations (1991) (242)
Achieving High Instruction Cache Performance With An Optimizing Compiler (1989) (241)
An effective GPU implementation of breadth-first search (2010) (234)
PUMA: A Programmable Ultra-efficient Memristor-based Accelerator for Machine Learning Inference (2019) (230)
DNNBuilder: an Automated Tool for Building High-Performance DNN Hardware Accelerators for FPGAs (2018) (225)
MCUDA: An Efficient Implementation of CUDA Kernels for Multi-core CPUs (2008) (221)
Programming Massively Parallel Processors (2013) (209)
An asymmetric distributed shared memory model for heterogeneous parallel systems (2010) (202)
A comparison of full and partial predicated execution support for ILP processors (1995) (200)
IMPACT: an architectural framework for multiple-instruction-issue processors (1991) (188)
Run-Time Adaptive Cache Hierarchy Management via Reference Analysis (1997) (180)
FCUDA: Enabling efficient compilation of CUDA kernels onto FPGAs (2009) (179)
Profile‐guided automatic inline expansion for C programs (1992) (178)
Trimaran: An Infrastructure for Research in Instruction-Level Parallelism (2004) (178)
Modular interprocedural pointer analysis using access paths: design, implementation, and evaluation (2000) (176)
Checkpoint repair for out-of-order execution machines (1987) (162)
Differential Treatment for Stuff and Things: A Simple Unsupervised Domain Adaptation Method for Semantic Segmentation (2020) (151)
HPS, a new microarchitecture: rationale and introduction (1985) (149)
A hardware-driven profiling scheme for identifying program hot spots to support runtime optimization (1999) (148)
Adaptive Cache Management for Energy-Efficient GPU Computing (2014) (139)
Program optimization carving for GPU computing (2008) (137)
Integrated predicated and speculative execution in the IMPACT EPIC architecture (1998) (137)
BLESS: Bloom filter-based error correction solution for high-throughput sequencing reads (2014) (132)
Inline function expansion for compiling C programs (1989) (131)
Dynamic memory disambiguation using the memory conflict buffer (1994) (131)
FPGA/DNN Co-Design: An Efficient Design Methodology for IoT Intelligence on the Edge (2019) (131)
GPU acceleration of cutoff pair potentials for molecular modeling applications (2008) (125)
Run-time spatial locality detection and optimization (1997) (125)
Checkpoint Repair for High-Performance Out-of-Order Execution Machines (1987) (122)
Sentinel scheduling: a model for compiler-controlled speculative execution (1992) (121)
Characterizing the impact of predicated execution on branch prediction (1994) (120)
GPU Computing Gems Jade Edition (2011) (120)
Region-based compilation: an introduction and motivation (1995) (115)
Run-Time Cache Bypassing (1999) (112)
Data access microarchitectures for superscalar processors with compiler-assisted data prefetching (1991) (110)
Data Layout Transformation Exploiting Memory-Level Parallelism in Structured Grid Many-Core Applications (2010) (103)
Java bytecode to native code translation: the Caffeine prototype and preliminary results (1996) (100)
Trace Selection For Compiling Large C Application Programs To Microcode (1988) (98)
A framework for balancing control flow and predication (1997) (97)
Reverse If-Conversion (1993) (95)
QP: A Heterogeneous Multi-Accelerator Cluster (2011) (93)
Superblock formation using static program analysis (1993) (93)
Bottom-Up and Top-Down Context-Sensitive Summary-Based Pointer Analysis (2004) (91)
SPGNet: Semantic Prediction Guidance for Scene Parsing (2019) (90)
Comparing Software And Hardware Schemes For Reducing The Cost Of Branches (1989) (87)
Efficient compilation of fine-grained SPMD-threaded programs for multicore CPUs (2010) (87)
Sentinel scheduling: a model for compiler-controlled speculative execution (1993) (85)
The Concurrency Challenge (2008) (83)
Compiler-directed dynamic computation reuse: rationale and initial results (1999) (82)
The Effect of Code Expanding Optimizations on Instruction Cache Design (1993) (80)
Thousand-Core Chips [Roundtable] (2008) (76)
Unrolling-based optimizations for modulo scheduling (1995) (75)
Compiler technology for future microprocessors (1995) (73)
Real-time in vivo computed optical interferometric tomography (2013) (73)
DL: A data layout transformation system for heterogeneous computing (2012) (72)
Implicitly Parallel Programming Models for Thousand-Core Microprocessors (2007) (71)
Chai: Collaborative heterogeneous applications for integrated-architectures (2017) (70)
Reinforcement Learning Based Text Style Transfer without Parallel Training Corpus (2019) (70)
GPU computing gems (2011) (69)
SkyNet: a Hardware-Efficient Method for Object Detection and Tracking on Embedded Systems (2019) (69)
Compute Unified Device Architecture Application Suitability (2009) (69)
High performance computation and interactive display of molecular orbitals on GPUs and multi-core CPUs (2009) (69)
A study of the energy saving and capacity improvement potential of power control in multi-hop wireless networks (2001) (68)
Run-time Adaptive Cache Hierarchy Via Reference Analysis (1997) (68)
Critical issues regarding HPS, a high performance microarchitecture (1985) (66)
Combining Trace Sampling with Single Pass Methods for Efficient Cache Simulation (1998) (66)
XMalloc: A Scalable Lock-free Dynamic Memory Allocator for Many-core Machines (2010) (64)
Transmission power control for multiple access wireless packet networks (2000) (63)
A hardware mechanism for dynamic extraction and relayout of program hot spots (2000) (63)
A Scalable Tridiagonal Solver for GPUs (2011) (62)
The Importance of Prepass Code Scheduling for Superscalar and Superpipelined Processors (1995) (62)
Efficient Pattern-Based Time Series Classification on GPU (2012) (61)
A scalable, numerically stable, high-performance tridiagonal solver using GPUs (2012) (61)
An Architectural Framework for Runtime Optimization (2001) (60)
SPEC ACCEL: A Standard Application Suite for Measuring Hardware Accelerator Performance (2014) (60)
Beating in-order stalls with "flea-flicker" two-pass pipelining (2006) (59)
Accelerating reduction and scan using tensor core units (2018) (55)
EDD: Efficient Differentiable DNN Architecture and Implementation Co-search for Embedded AI Solutions (2020) (54)
Interpretable and Globally Optimal Prediction for Textual Grounding using Image Concepts (2018) (54)
Benchmark characterization (1991) (53)
Long time-scale simulations of in vivo diffusion using GPU hardware (2009) (51)
Efficient performance evaluation of memory hierarchy for highly multithreaded graphics processors (2012) (51)
Importance of heap specialization in pointer analysis (2004) (50)
Compiler code transformations for superscalar-based high-performance systems (1992) (50)
HMDES Version 2.0 Specification (1996) (50)
Algorithm and Data Optimization Techniques for Scaling to Massively Threaded Systems (2012) (48)
Proceedings of the international symposium on Code generation and optimization: feedback-directed and runtime optimization (2003) (46)
Multilevel Granularity Parallelism Synthesis on FPGAs (2011) (45)
Sentinel scheduling for VLIW and superscalar processors (1992) (44)
Optimizing NET Compilers for Improved Java Performance (1997) (44)
Benchmark characterization for experimental system evaluation (1990) (44)
Platform choices and design demands for IoT platforms: cost, power, and performance tradeoffs (2016) (43)
PANTHER: A Programmable Architecture for Neural Network Training Harnessing Energy-Efficient ReRAM (2019) (43)
SpaceJMP: Programming with Multiple Virtual Address Spaces (2016) (43)
The benefit of predicated execution for software pipelining (1993) (42)
FlatFlash: Exploiting the Byte-Accessibility of SSDs within a Unified Memory-Storage Hierarchy (2019) (42)
Acceleration of the Pair-HMM Algorithm for DNA Variant Calling (2016) (41)
HPSm, a high performance restricted data flow architecture having minimal functionality (1986) (40)
A software based approach to achieving optimal performance for signature control flow checking (1990) (40)
Vacuum packing: extracting hardware-detected program phases for post-link optimization (2002) (40)
Data relocation and prefetching for programs with large data sets (1994) (39)
Alleviating Semantic-level Shift: A Semi-supervised Domain Adaptation Method for Semantic Segmentation (2020) (39)
More IMPATIENT: A gridding-accelerated Toeplitz-based strategy for non-Cartesian high-resolution 3D MRI on GPUs (2013) (38)
Register Connection: A New Approach To Adding Registers Into Instruction Set Architectures (1993) (38)
Optimization and architecture effects on GPU computing workload performance (2012) (38)
"Flea-flicker" multipass pipelining: an alternative to the high-power out-of-order offense (2005) (37)
Application-Transparent Near-Memory Processing Architecture with Memory Channel Network (2018) (37)
Speculative hedge: regulating compile-time speculation against profile variations (1996) (37)
Modulo scheduling of loops in control-intensive non-numeric programs (1996) (37)
A study of the cache and branch performance issues with running Java on current hardware platforms (1997) (36)
DeepStore: In-Storage Acceleration for Intelligent Queries (2019) (36)
CUBA: an architecture for efficient CPU/co-processor data communication (2008) (36)
Heterogeneous System Architecture: A New Compute Platform Infrastructure (2015) (35)
DNNExplorer: A Framework for Modeling and Exploring a Novel Paradigm of FPGA-based DNN Accelerator (2020) (34)
Toward Application-Aware Security and Reliability (2007) (33)
Field-testing IMPACT EPIC research results in Itanium 2 (2004) (33)
The program decision logic approach to predicated execution (1999) (33)
Accelerator Architectures (2008) (33)
Comparing static and dynamic code scheduling for multiple-instruction-issue processors (1991) (33)
What is ahead for parallel computing (2014) (33)
Three Architectural Models for Compiler-Controlled Speculative Execution (1995) (33)
Enhancing loop buffering of media and telecommunications applications using low-overhead predication (2001) (33)
Large Graph Convolutional Network Training with GPU-Oriented Data Communication Architecture (2021) (33)
Performance Analysis and Tuning for General Purpose Graphics Processing Units (GPGPU) (2012) (32)
Direct Numerical Simulation of Turbulent Flow in a Square Duct using a Graphics Processing Unit (GPU) (2010) (31)
Speculative execution exception recovery using write-back suppression (1993) (31)
KLAP: Kernel launch aggregation and promotion for optimizing dynamic parallelism (2016) (31)
Programming Massively Parallel Processors, Third Edition: A Hands-on Approach (2016) (31)
MemXCT: memory-centric X-ray CT reconstruction with massive parallelization (2019) (30)
A new framework for debugging globally optimized code (1999) (30)
Accurate and efficient predicate analysis with binary decision diagrams (2000) (30)
Branch recovery with compiler-assisted multiple instruction retry (1992) (30)
The parallelization of video processing (2009) (30)
Impatient MRI: Illinois Massively Parallel Acceleration Toolkit for image reconstruction with enhanced throughput in MRI (2011) (30)
Implementing neural machine translation with bi-directional GRU and attention mechanism on FPGAs using HLS (2019) (29)
Parallel implementation of Multi-dimensional Ensemble Empirical Mode Decomposition (2011) (28)
Architectural support for compiler-synthesized dynamic branch prediction strategies: Rationale and initial results (1997) (28)
Energy saving and capacity improvement potential of power control in multi-hop wireless networks (2003) (28)
Automatic Generation of Warp-Level Primitives and Atomic Instructions for Fast and Portable Parallel Reduction on GPUs (2019) (28)
Scalar program performance on multiple-instruction-issue processors with a limited number of registers (1992) (28)
Region-based compilation (1996) (27)
Proceedings of the 25th annual international symposium on Microarchitecture (1992) (27)
Improving static branch prediction in a compiler (1998) (27)
Heterogeneous Computing Meets Near-Memory Acceleration and High-Level Synthesis in the Post-Moore Era (2017) (26)
Using Profile Information to Assist Advanced Compiler Optimization and Scheduling (1992) (25)
How GPUs Can Improve the Quality of Magnetic Resonance Imaging (2011) (25)
A Tiling-Scheme Viterbi Decoder in Software Defined Radio for GPUs (2011) (25)
Analysis and Modeling of Collaborative Execution Strategies for Heterogeneous CPU-FPGA Architectures (2019) (25)
Exploiting parallel microprocessor microarchitectures with a compiler code generator (1988) (24)
Adaptive Cache Bypass and Insertion for Many-core Accelerators (2014) (24)
Efficient compilation of CUDA kernels for high-performance computing on FPGAs (2013) (23)
In-place transposition of rectangular matrices on accelerators (2014) (23)
Locality-centric thread scheduling for bulk-synchronous programming models on CPU architectures (2015) (23)
High-throughput Ant Colony Optimization on graphics processing units (2018) (23)
Run-time adaptive cache management (1998) (22)
An Architectural Framework for Run-Time Optimization (2001) (22)
PUMA (2019) (22)
BLESS 2: accurate, memory-efficient and fast error correction method (2016) (22)
Performance Analysis and Tuning for General Purpose Graphics Processing Units (2012) (22)
EMOGI: Efficient Memory-access for Out-of-memory Graph-traversal In GPUs (2020) (21)
Three Superblock Scheduling Models for Superscalar and Superpipelined Processors (1991) (21)
Collaborative (CPU + GPU) algorithms for triangle counting and truss decomposition on the Minsky architecture: Static graph challenge: Subgraph isomorphism (2017) (21)
Mixed Precision Quantization for ReRAM-based DNN Inference Accelerators (2021) (21)
Collaborative (CPU + GPU) Algorithms for Triangle Counting and Truss Decomposition (2018) (21)
Automatic Parallelization of Kernels in Shared-Memory Multi-GPU Nodes (2015) (20)
HPSm, a high performance restricted data flow architecture having minimal functionality (1998) (20)
XSP: Across-Stack Profiling and Analysis of Machine Learning Models on GPUs (2019) (20)
Efficient Instruction Sequencing with Inline Target Insertion (1992) (20)
PaRe: A Paper-Reviewer Matching Approach Using a Common Topic Space (2019) (20)
The Susceptibility of Programs to Context Switching (1994) (20)
Comparison based sorting for systems with multiple GPUs (2013) (19)
Accelerating iterative field-compensated MR image reconstruction on GPUs (2010) (19)
HPSm: exploiting concurrency to achieve high performance in a single-chip microarchitecture (1987) (19)
Hardware-Software Co-Design for an Analog-Digital Accelerator for Machine Learning (2018) (19)
Tolerating data access latency with register preloading (1992) (19)
NAIS: Neural Architecture and Implementation Search and its Applications in Autonomous Driving (2019) (18)
Compiler-Assisted Multiple Instruction Retry (1991) (18)
Compile-time memory disambiguation for C programs (2000) (17)
High-performance CUDA kernel execution on FPGAs (2009) (17)
Run-time generation of HPS microinstructions from a VAX instruction stream (1986) (17)
Update on k-truss Decomposition on GPU (2019) (17)
Collaborative Computing for Heterogeneous Integrated Systems (2017) (17)
SkyNet: A Champion Model for DAC-SDC on Low Power Object Detection (2019) (16)
Efficient kernel synthesis for performance portable programming (2016) (16)
Automatic Discovery of Coarse-Grained Parallelism in Media Applications (2007) (16)
A Guide for Implementing Tridiagonal Solvers on GPUs (2014) (16)
FPGA accelerated DNA error correction (2015) (15)
Petascale XCT: 3D Image Reconstruction with Hierarchical Communications on Multi-GPU Nodes (2020) (15)
TrIMS: Transparent and Isolated Model Sharing for Low Latency Deep Learning Inference in Function-as-a-Service (2018) (15)
PyLog: An Algorithm-Centric Python-Based FPGA Programming and Synthesis Flow (2021) (15)
In-Place Data Sliding Algorithms for Many-Core Architectures (2015) (15)
Systematic compilation for predicated execution (2000) (14)
Evaluating Characteristics of CUDA Communication Primitives on High-Bandwidth Interconnects (2019) (14)
DNNBuilder (2018) (14)
Triangle Counting and Truss Decomposition using FPGA (2018) (14)
Hardware support for dynamic activation of compiler-directed computation reuse (2000) (14)
CIGAR: Application Partitioning for a CPU/Coprocessor Architecture (2007) (14)
Compiler-Assisted Multiple Instruction Rollback Recovery Using a Read Buffer (1993) (14)
In-Place Matrix Transposition on GPUs (2016) (13)
High-Performance Computing with Accelerators (2010) (13)
Beating in-order stalls with "flea-flicker" two-pass pipelining (2003) (13)
Accelerating Sparse Deep Neural Networks on FPGAs (2019) (13)
Tolerating First Level Memory Access Latency in High-Performance Systems (1992) (13)
Scalable SIMD-parallel memory allocation for many-core machines (2013) (13)
Tangram: a High-level Language for Performance Portable Code Synthesis (2015) (13)
Performance Portability in Accelerated Parallel Kernels (2013) (13)
Modulo schedule buffers (2001) (12)
An Architecture Framework for Introducing Predicated Execution into Embedded Microprocessors (1999) (12)
Rebooting the Data Access Hierarchy of Computing Systems (2017) (12)
Benchmark characterization (1991) (12)
Accelerating MR image reconstruction on GPUs (2009) (12)
Compiler-directed early load-address generation (1998) (12)
Optimization of Machine Descriptions for Efficient Use (1996) (12)
Code reordering and speculation support for dynamic optimization systems (2001) (12)
Update on Triangle Counting on GPU (2019) (12)
Pseudo-IoU: Improving Label Assignment in Anchor-Free Object Detection (2021) (12)
A Bi-Directional Co-Design Approach to Enable Deep Learning on IoT Devices (2019) (12)
Enhancing the Usability and Utilization of Accelerated Architectures via Docker (2015) (12)
The Partial Reverse If-Conversion Framework for Balancing Control Flow and Predication (1999) (11)
SAVI objects: sharing and virtuality incorporated (2017) (11)
TIGER: tiled iterative genome assembler (2012) (11)
Sparse regularization in MRI iterative reconstruction using GPUs (2010) (11)
At-Scale Sparse Deep Neural Network Inference With Efficient GPU Implementation (2020) (11)
Compiler-Based Multiple Instruction Retry (1995) (11)
WebGPU: A Scalable Online Development Platform for GPU Programming Courses (2016) (11)
Triolet: a programming system that unifies algorithmic skeleton interfaces for high-performance cluster computing (2014) (10)
Efficient and Scalable Workflows for Genomic Analyses (2016) (10)
The Effect of Compiler Optimizations on Available Parallelism in Scalar Programs (1991) (10)
Analytical Performance Prediction for Evaluation and Tuning of GPGPU Applications (2009) (10)
DESIGN CHOICES FOR THE HPSm MICROPROCESSOR CHIP. (1987) (10)
Design evaluation of OpenCL compiler framework for Coarse-Grained Reconfigurable Arrays (2012) (10)
Generalize or Die: Operating Systems Support for Memristor-Based Accelerators (2017) (10)
PyTorch-Direct: Enabling GPU Centric Data Access for Very Large Graph Neural Network Training with Irregular Accesses (2021) (9)
C COMPILER FOR HPS I, A HIGHLY PARALLEL EXECUTION ENGINE. (1986) (9)
Throughput-oriented kernel porting onto FPGAs (2013) (9)
Control flow optimization for supercomputer scalar processing (1989) (9)
Multi-GPU Implementation for Iterative MR Image Reconstruction with Field Correction (2011) (9)
Interpretable Visual Reasoning via Induced Symbolic Space (2020) (9)
A brief survey of benchmark usage in the architecture community (1991) (9)
DySel: Lightweight Dynamic Selection for Kernel-based Data-parallel Programming Model (2016) (9)
Interactive Source-Level Debugging of Optimized Code (1999) (9)
EcoG: A Power-Efficient GPU Cluster Architecture for Scientific Computing (2011) (9)
Profile-assisted instruction scheduling (1994) (9)
An Empirical Study of Function Pointers Using SPEC Benchmarks (1999) (9)
Automatic execution of single-GPU computations across multiple GPUs (2014) (9)
Forward semantic: a compiler-assisted instruction fetch method for heavily pipelined processors (1989) (9)
TrIMS: Transparent and Isolated Model Sharing for Low Latency Deep Learning Inference in Function as a Service Environments (2018) (9)
GPU-Accelerated Gridding for Rapid Reconstruction of Non-Cartesian MRI (2010) (8)
A Fast and Massively-Parallel Inverse Solver for Multiple-Scattering Tomographic Image Reconstruction (2018) (8)
Characterization of Repeating Data Access Patterns in Integer Benchmarks (2001) (8)
Introduction to Predicate Execution (1998) (7)
GPU-SM: shared memory multi-GPU programming (2015) (7)
Efficient Methods for Mapping Neural Machine Translator on FPGAs (2021) (7)
Exploiting More Parallelism from Applications Having Generalized Reductions on GPU Architectures (2010) (7)
An efficient framework for performing execution-constraint-sensitive transformations that increase instruction-level parallelism (1997) (7)
An experimental single-chip data flow CPU (1990) (7)
Code coverage and input variability: effects on architecture and compiler research (2002) (7)
Analysis and Optimization of I/O Cache Coherency Strategies for SoC-FPGA Device (2019) (7)
An Analytical Approach to Scheduling Code for Superscalar and VLIW Architectures (1994) (7)
Runtime and Architecture Support for Efficient Data Exchange in Multi-Accelerator Applications (2015) (6)
Accelerating Fourier and Number Theoretic Transforms using Tensor Cores and Warp Shuffles (2021) (6)
Transitioning HPC software to exascale heterogeneous computing (2015) (6)
Benanza: Automatic μBenchmark Generation to Compute "Lower-bound" Latency and Inform Optimizations of Deep Learning Models on GPUs (2019) (6)
Illinois ECE 498AL: Programming Massively Parallel Processors (2009) (6)
Aquarius (1987) (6)
Operating System Interfaces: Bridging the Gap Between CPU and FPGA Accelerators (2006) (6)
HPS IMPLEMENTATION OF VAX; INITIAL DESIGN AND ANALYSIS. (1986) (6)
Executing Nested Parallel Loops on Shared-Memory Multiprocessors (1992) (6)
NUMA-Aware Data-Transfer Measurements for Power/NVLink Multi-GPU Systems (2018) (6)
Parallel solutions of inverse multiple scattering problems with born-type fast solvers (2016) (6)
Introduction to predicated execution (1998) (6)
Open Relation Modeling: Learning to Define Relations between Entities (2021) (6)
Exploring HW/SW Co-Design for Video Analysis on CPU-FPGA Heterogeneous Systems (2022) (6)
Near-Memory and In-Storage FPGA Acceleration for Emerging Cognitive Computing Workloads (2019) (6)
Across-Stack Profiling and Characterization of Machine Learning Models on GPUs (2019) (6)
Using GPUs to Accelerate Advanced MRI Reconstruction with Field Inhomogeneity Compensation (2011) (6)
Application of Compiler-Assisted Rollback Recovery to Speculative Execution Repair (1994) (6)
Xprof: profiling the execution of X Window programs (1992) (6)
Region-based compilation: Introduction, motivation, and initial experience (1997) (6)
MLModelScope: Evaluate and Measure ML Models within AI Pipelines (2018) (6)
A study of code reuse and sharing characteristics of Java applications (1998) (6)
Performance insights on executing non-graphics applications on CUDA on the NVIDIA GeForce 8800 GTX (2007) (6)
A Practical Interprocedural Pointer Analysis Framework (1980) (6)
Large inverse-scattering solutions with DBIM on GPU-enabled supercomputers (2017) (6)
Frustrated with Replicating Claims of a Shared Model? A Solution (2018) (6)
Effective Algorithm-Accelerator Co-design for AI Solutions on Edge Devices (2020) (5)
On tuning the microarchitecture of an HPS implementation of the VAX (1987) (5)
Revisiting Online Autotuning for Sparse-Matrix Vector Multiplication Kernels on Next-Generation Architectures (2017) (5)
Snoopy cache test-and-test-and-set without excessive bus contention (1990) (5)
Illinois ECE 498AL: Programming Massively Parallel Processors, Lecture 13: Reductions and their Implementation (2009) (5)
Accelerator Architectures A Ten-Year Retrospective (2018) (5)
A Software-Oriented Floating-Point Format for Enhancing Automotive Control Systems (1998) (5)
Tearing Down the Memory Wall (2020) (5)
A study of the effects of compiler-controlled speculation on instruction and data caches (1995) (5)
Static program analysis to enhance profile independence in instruction-level parallelism compilation (1998) (5)
Tolerating Cache-Miss Latency with Multipass Pipelines (2006) (5)
FCUDA-HB: Hierarchical and Scalable Bus Architecture Generation on FPGAs With the FCUDA Flow (2016) (5)
Hardware support for dynamic activation of compiler-directed computation reuse (2000) (4)
An execution profiler for window‐oriented applications (1993) (4)
A programming system for future proofing performance critical libraries (2016) (4)
Chapter 9 – Parallel Patterns: Prefix Sum: An Introduction to Work Efficiency in Parallel Algorithms (2012) (4)
Matching on-chip data storage to telecommunication and media application properties (2004) (4)
Advanced MRI reconstruction toolbox with accelerating on GPU (2011) (4)
Node-Aware Stencil Communication for Heterogeneous Supercomputers (2020) (4)
MLModelScope: A Distributed Platform for Model Evaluation and Benchmarking at Scale (2019) (4)
On tuning the microarchitecture of an HPS implementation of the VAX (1988) (4)
FFT blitz: the tensor cores strike back (2021) (4)
AN IMPLEMENTATION OF GURPR*: A SOFTWARE PIPELINING ALGORITHM BY JOHN WILLIAM BOCKHAUS (1992) (4)
RAI: A Scalable Project Submission System for Parallel Programming Courses (2017) (4)
Reducing Cache Misses in Numerical Applications Using Data Relocation and Prefetching. (1995) (4)
A systematic approach to delivering instruction-level parallelism in EPIC systems (2005) (4)
History of GPU Computing (2013) (4)
A Feature Taxonomy and Survey of Synchronization Primitive Implementations (1991) (4)
DEER: Descriptive Knowledge Graph for Explaining Entity Relationships (2022) (4)
AccDNN: An IP-Based DNN Generator for FPGAs (2018) (4)
Graph Neural Network Training and Data Tiering (2022) (3)
An Efficient GPU Implementation Technique for Higher-Order 3D Stencils (2019) (3)
Many-core parallel computing - Can compilers and tools do the heavy lifting? (2009) (3)
BaM: A Case for Enabling Fine-grain High Throughput GPU-Orchestrated Access to Storage (2022) (3)
HyKernel: A Hybrid Selection of One/Two-Phase Kernels for Triangle Counting on GPUs (2021) (3)
Data Layout Transformation Exploiting Memory-Level Parallelism in Structured Grid Many-Core Applications (2011) (3)
K-Clique Counting on GPUs (2021) (3)
The Design and Implementation of a Scalable Deep Learning Benchmarking Platform (2019) (3)
Implementing a GPU programming model on a Non-GPU accelerator architecture (2010) (3)
Exploring Semantic Capacity of Terms (2020) (3)
FlatFlash (2019) (3)
Program decision logic optimization using predication and control speculation (2001) (3)
Measuring Fine-Grained Domain Relevance of Terms: A Hierarchical Core-Fringe Approach (2021) (3)
Data Layout Transformation for Structured-Grid Codes on GPU (2009) (3)
Single-Pass Memory System Evaluation for Multiprogramming Workloads (1990) (3)
Context-sensitive pointer analysis based on procedural summaries (2004) (3)
Multi-tier Dynamic Vectorization for Translating GPU Optimizations into CPU Performance (2015) (3)
Experiments with HPS, a Restricted Data Flow Microarchitecture for High Performance Computers (1986) (3)
MLHarness: A Scalable Benchmarking System for MLCommons (2021) (3)
MemXCT: Design, Optimization, Scaling, and Reproducibility of X-ray Tomography Imaging (2021) (3)
MemXCT (2019) (3)
Common Bonds: MIPS, HPS, Two-Level Branch Prediction, and Compressed Code RISC Processor (2016) (2)
A Simulation Study of Simultaneous Vector Prefetch Performance in Multiprocessor Memory Subsystems (Extended Abstract) (1989) (2)
Systematic prototyping of superscalar computer architectures (1992) (2)
High-speed Interferometric Synthetic Aperture Microscopy on a Graphics Processing Unit (2012) (2)
The Design and Implementation of a Scalable DL Benchmarking Platform (2019) (2)
Illinois ECE 498AL: Programming Massively Parallel Processors, Lecture 12: Structuring Parallel Algorithms (2009) (2)
SpaceJMP (2016) (2)
The GPU Revolution at Work (2011) (2)
New Trends in Computational Electromagnetics (2019) (2)
Trusted ILLIAC - A Configurable Application-Aware High-Performance Platform for Trustworthy Computing (2007) (2)
PRO-GAGE: A High Performance Compact GAGE Hash Function Processor for Small Space Technology (2021) (2)
FReaC Cache: Folded-logic Reconfigurable Computing in the Last Level Cache (2020) (2)
Understanding Jargon: Combining Extraction and Generation for Definition Modeling (2021) (2)
Thoughts on massively-parallel heterogeneous computing for solving large problems (2017) (2)
Fast MR Image Reconstruction using Graphics Processing Units (2007) (2)
A New Data-Location Tracking Scheme for the Recovery of Expected Variable Values (1998) (2)
An efficient GPU implementation and scaling for higher-order 3D stencils (2021) (2)
A New Breakpoint Implementation Scheme for Debugging Globally Optimized Code (1998) (2)
HPSm2: A refined single-chip microengine (1988) (2)
A Compiler Framework for Optimizing Dynamic Parallelism on GPUs (2022) (2)
Optimized Data Transfers Based on the OpenCL Event Management Mechanism (2015) (2)
Scalable parallel DBIM solutions of inverse-scattering problems (2017) (2)
Performance Implications of Synchronization Support for Parallel Fortran Programs (1991) (2)
Compaction Algorithm for Precise Modular Context-Sensitive Points-to Analysis (2003) (2)
Advances in Benchmarking Techniques: New Standards and Quantitative Metrics (1995) (2)
EMOGI (2020) (2)
Chapter 5 – Performance considerations (2017) (2)
Seeing the invisible: Limited-view imaging with multiple-scattering reconstruction (2018) (2)
DNNExplorer (2020) (2)
Rapid computation of sodium bioscales using gpu‐accelerated image reconstruction (2013) (2)
Iteration Disambiguation for Parallelism Identification in Time-Sliced Applications (2007) (2)
Introduction to data parallelism and CUDA C (2013) (2)
DLSpec: A Deep Learning Task Exchange Specification (2020) (2)
Safer Illinois and RokWall: Privacy Preserving University Health Apps for COVID-19 (2021) (2)
Corezilla: Build and Tame the Multicore Beast? (2007) (2)
RECO-HCON: A High-Throughput Reconfigurable Compact ASCON Processor for Trusted IoT (2022) (2)
The application of compiler-assisted multiple-instruction retry to VLIW architectures (1994) (1)
Ensuring Critical Data Integrity via Information Flow Signatures (2007) (1)
Hardware-compiler co-design for adjustable data power savings (2009) (1)
Applying Scalable Interprocedural Pointer Analysis to Embedded Applications (2004) (1)
Hardware Support for Dynamic Management of Compiler-Directed Computation Reuse. (2000) (1)
Data-Parallel Execution Model (2013) (1)
Rethinking computer architecture for throughput computing (2013) (1)
A Simple Non-i.i.d. Sampling Approach for Efficient Training and Better Generalization (2020) (1)
Improved Superblock optimization in GCC (2006) (1)
Parallel K-clique counting on GPUs (2021) (1)
State Space Search (2011) (1)
Challenges and Pitfalls of Reproducing Machine Learning Artifacts (2019) (1)
Retrospective: IMPACT: an architectural framework for multiple-instruction issue (1998) (1)
Enabling GPU Support for the COMPSs-Mobile Framework (2017) (1)
System Synthesis and Automated Verification: Design Demands for IoT Devices (2016) (1)
TEMPI: An Interposed MPI Library with a Canonical Representation of CUDA-aware Datatypes (2020) (1)
TIGER: tiled iterative genome assembler (2012) (1)
Exploiting horizontal and vertical concurrency via the HPSm microprocessor (1988) (1)
Parallel Programming Models for Thousand-Core Microprocessors (2007) (1)
Supercomputing for Full-Wave Tomographic Image Reconstruction in Near-Real Time (2018) (1)
Proceedings of the 2016 International Conference on Parallel Architectures and Compilation (2016) (1)
Chapter 5 – CUDA Memories (2013) (1)
Micro-GAGE: A Low-power Compact GAGE Hash Function Processor for IoT Applications (2020) (1)
Optimization of tele-immersion codes (2009) (1)
Parallel Patterns: Sparse Matrix–Vector Multiplication: An Introduction to Compaction and Regularization in Parallel Algorithms (2013) (1)
PARALLELIZATION OF VIDEO PROCESSING : FROM PROGRAMMING MODELS TO APPLICATIONS (2009) (1)
Edge Crypt-Pi: Securing Internet of Things with Light and Fast Crypto-Processor (2020) (1)
clMPI: An OpenCL Extension for Interoperation with the Message Passing Interface (2013) (1)
Challenges and Pitfalls of Machine Learning Evaluation and Benchmarking (2019) (1)
Breaking Down the Memory Wall for Scalable Microprocessor Platforms (2004) (1)
Application of compiler-assisted multiple instruction rollback recovery to speculative execution (1993) (1)
Exploiting horizontal and vertical concurrency via the HPSm microprocessor (1987) (1)
PARSEC: PARallel Subgraph Enumeration in CUDA (2022) (1)
Programming Massively Parallel Processors with CUDA (audio course) (2011) (1)
Superscalar Processors (2011) (1)
Combining Sampling with Single-Pass Techniques for Efficient Cache Simulation (1991) (1)
Innovative applications and technology pivots - a perfect storm in computing (2017) (1)
CUDA Dynamic Parallelism (2013) (1)
HPS, a new microarchitecture: introduction and rationale (2000) (1)
Performance Analysis of Computer Vision with Machine Learning Algorithms on Raspberry Pi 3 (2020) (1)
Interferometric Synthetic Aperture Microscopy with Computational Adaptive Optics for High-Resolution Tomography of Scattering Tissue (2012) (1)
Incremental compiler transformations for multiple instruction retry (1994) (1)
Graph Neural Network Training with Data Tiering (2021) (1)
Can Language Models Be Specific? How? (2022) (1)
Conclusion and Future Outlook (2013) (1)
Scaling Analysis of a Hierarchical Parallelization of Large Inverse Multiple-Scattering Solutions (2017) (1)
Supplementary Materials SPGNet: Semantic Prediction Guidance for Scene Parsing (2019) (0)
Iterative Modulo Scheduling (2018) (0)
Extending HLS with High-Level Descriptive Language for Configurable Algorithm-Level Spatial Structure Design (2021) (0)
Editor's Introduction (2004) (0)
Chapter 21 – Conclusion and outlook (2017) (0)
Special-Purpose Machines (2011) (0)
Advancing Computing Infrastructure for Very Large-Scale Deep Learning at C3SR (2020) (0)
Submission-Aware Reviewer Profiling for Reviewer Recommender System (2022) (0)
Revisiting Online Autotuning for Sparse-Matrix Vector Multiplication Kernels on High-Performance Accelerators (2018) (0)
Sentinel Scheduling: A Model for Compiler-Controlled Speculative Execution (1993) (0)
The design and implementation of the Wolfram Language compiler (2020) (0)
Illinois ECE 498AL: Programming Massively Parallel Processors, Lecture 4: CUDA Threads - Part 2 (2009) (0)
Objects : Sharing and Virtuality Incorporated (2017) (0)
Supernode Partitioning (2011) (0)
EPHAM 2009 Exploiting Parallelism Using GPUs and Other Hardware-Assisted Methods (2009) (0)
Panel Statement (2011) (0)
Compiler Assisted Recovery For Fault-Tolerant Highly Parallel Multiprocessor Architectures (1992) (0)
Application Acceleration with the Explicitly Parallel Operations System - the EPOS Processor (2008) (0)
PIGEON: Optimizing CUDA Code Generator for End-to-End Training and Inference of Relational Graph Neural Networks (2023) (0)
Parallelizing Maximal Clique Enumeration on GPUs (2022) (0)
Session details: Emerging technologies and interconnect (2010) (0)
Raising the level of many-core programming with compiler technology - meeting a grand challenge (2010) (0)
On extracting coarse-grained function parallelism from C programs (2006) (0)
IPDPS 2011 Wednesday 25th Year Panel: What's ahead? (2011) (0)
From the guest editors (2007) (0)
Relationship Between Facial Recognition, Color Spaces, and Basic Image Manipulation (2020) (0)
Illinois ECE 498AL: Programming Massively Parallel Processors, Lecture 3: CUDA Threads, Tools, Simple Examples (2009) (0)
From the Guest Editors (2016) (0)
Future of computing: hardware versus software (2011) (0)
Run-time optimization architecture (2002) (0)
DLBricks: Composable Benchmark Generation to Reduce Deep Learning Benchmarking Effort on CPUs (2019) (0)
Micro-21 from the program chair (1989) (0)
Illinois ME 498 Introduction of Nano Science and Technology, Lecture 7: Basics of Solid Mechanics in Nanostructures (2009) (0)
xER: An Explainable Model for Entity Resolution using an Efficient Solution for the Clique Partitioning Problem (2021) (0)
Illinois ECE 498AL: Programming Massively Parallel Processors, Lecture 5: CUDA Memories (2009) (0)
The code size advantage of predicted execution for software pipelining (1993) (0)
Efficient Inference on GPUs for the Sparse Deep Neural Network Graph Challenge 2020 (2020) (0)
Illinois ECE 498AL: Programming Massively Parallel Processors, Lecture 15: Kernel and Algorithm Patterns for CUDA (2009) (0)
for Reliable and High-Performance Computing, University of Illinois, Urbana, IL 61801 (1992) (0)
A Study on Soft-Core Processor Configurations for Embedded Cryptography Applications (2020) (0)
COMPARISON OF SEVERAL EVOLVING (UNIVERSITY) SUPERCOMPUTER ARCHITECTURES. (1984) (0)
Illinois ECE 498AL: Programming Massively Parallel Processors, Lecture 7: GPU as part of the PC Architecture (2009) (0)
SCOPE: C3SR Systems Characterization and Benchmarking Framework (2018) (0)
Semi-Coherent DMA: An Alternative I/O Coherency Management for Embedded Systems (2018) (0)
Keynote: Architecture and software for emerging low-power systems (2017) (0)
I OF SYNCHRONIZATION SUPPORT FOR PARALLEL FORTRAN PROGRAMS (0)
Combining Trace Sampling with Single Pass Methods for Efficient Cache Simulation (0)
Session details: Corezilla: build and tame the multicore beast (2007) (0)
Scalable, Precise Context-Sensitive Top-Down Process for Modular Points-to Analysis (2003) (0)
Illinois ECE 498AL: Programming Massively Parallel Processors, Lecture 14: Application Case Study - Quantitative MRI Reconstruction (2009) (0)
Illinois ECE 498AL: Programming Massively Parallel Processors, Lecture 10: Control Flow (2009) (0)
Memory and data locality (2017) (0)
HPS papers: A retrospective (2016) (0)
Triangle Counting and Truss Decomposition using FPGA Static Graph Challenge : Subgraph Isomorphism (2018) (0)
Fast CUDA-Aware MPI Datatypes without Platform Support (2020) (0)
Parallel patterns—parallel histogram computation: An introduction to atomic operations and privatization (2017) (0)
An introduction to OpenCL™ (2012) (0)
Chapter 7 – Compiler Technology (2016) (0)
The Application of Compiler-Assisted Multiple Instruction Retry (0)
Program committee (2018) (0)
Dynamic Tracking of Information Flow Signatures for Security Checking (2007) (0)
Code Reordering and Speculation Support for Dynamic Optimization System (2001) (0)
Vertext: An End-to-end AI Powered Conversation Management System for Multi-party Chat Platforms (2020) (0)
Application case study—non-Cartesian magnetic resonance imaging: An introduction to statistical estimation methods (2017) (0)
Illinois ME 498 Introduction of Nano Science and Technology, Lecture 6: Basics of Transport in Nanostructures (2009) (0)
Parallel programming and computational thinking (2013) (0)
Mapping high-level programming languages to OpenCL 2.0 (2015) (0)
The application of multiple instruction retry to VLIW architectures using compiler generated hazard-free code (1993) (0)
Yale Patt Recognized with Four Papers Among Inaugural Micro Test of Time Award (2017) (0)
Read buffer optimizations to support compiler-assisted multiple instruction retry (1993) (0)
Server Farm (2011) (0)
CUDA application development (2008) (0)
Illinois ECE 498AL: Programming Massively Parallel Processors, Lecture 8: Threading Hardware in G80 (2009) (0)
Parallel Patterns: Convolution (2013) (0)
A Code Optimization Framework for Performance Portability of GPU Kernels onto Custom Accelerators (2012) (0)
Illinois ECE 498AL: Programming Massively Parallel Processors, Lecture 2: The CUDA Programming Model (2009) (0)
PhraseScope: An Effective and Unsupervised Framework for Mining High Quality Phrases (2021) (0)
Speculative Execution and Compiler-Assisted Multiple Instruction Recovery (1994) (0)
Foreword to the Special Issue (1998) (0)
Floating-point considerations (2013) (0)
Distribution Environment Universal Software Distribution Language Optimizing Java Bytecode Translator Stack to Register Mapping Java Memory Organization Java Verification Costs Java Garbage Collection Java Cache Performance Java Array Index Bounds Checking (1998) (0)
AsHES 2016 Keynote (2016) (0)
Graviton: A Reconfigurable Memory-Compute Fabric for Data Intensive Applications (2021) (0)
Efficient Instruction Sequencing with Inline Target Insertion (0)
Guest Editors' Introduction (1996) (0)
Memory Profiling for 3G Domain DSP Development (2003) (0)
Application case study: Advanced MRI reconstruction (2012) (0)
Word Prior Detection Segmentation Input " The left guy " Image : Query : a guy left the youth Energy (2017) (0)
Chapter 6 – Numerical considerations (2017) (0)
A Retrospective Recount of Computer Architecture Research with a Data-Driven Study of Over Four Decades of ISCA Publications (2019) (0)
Illinois ECE 498AL: Programming Massively Parallel Processors, Lecture 9: Memory Hardware in G80 (2009) (0)
Chapter 8 – Parallel Patterns: Convolution: With an Introduction to Constant Memory and Caches (2012) (0)
Illinois ECE 498AL: Programming Massively Parallel Processors, Lecture 1: Introduction (2009) (0)
Multiple-pass pipelining: enhancing in-order microarchitectures to out-of-order performance (2005) (0)
Compiler-Assisted Signature Monitoring (1990) (0)
PRACE Autumn School 2010 - CUDA Programming, Pt. 2 (2010) (0)
Making Parallel Programming Easy: Research Contributions from Illinois (2013) (0)
RECO-DryGASCON: Re-configurable Lightweight DryGASCON Engine (2020) (0)
TEMPI: An Interposed MPI Library with Canonical Representation of MPI Datatypes. (2021) (0)
WHAT'S AHEAD? (2011) (0)
Parallel patterns: sparse matrix computation: An introduction to data compression and regularization (2017) (0)
Scalable SIMD-parallel memory allocation for many-core machines (2011) (0)
Illinois ECE 498AL: Programming Massively Parallel Processors, Lecture 6: CUDA Memories - Part 2 (2009) (0)
Efficient Instruction Sequencing with Inline Target Insertion (1990) (0)
Editors' Introduction (2004) (0)
Design of a power-efficient ARM processor with a timing-error detection and correction mechanism (2016) (0)
Center for Reliable and High-Performance Computing COMPILER-ASSISTED SIGNATURE MONITORING (2017) (0)
IGB: Addressing The Gaps In Labeling, Features, Heterogeneity, and Size of Public Graph Datasets for Deep Learning Research (2023) (0)
Chapter 17 – Parallel programming and computational thinking (2017) (0)
DKG: A Descriptive Knowledge Graph for Explaining Relationships between Entities (2022) (0)
The Future of Computer Architecture Research: An Industrial Perspective (2005) (0)
Comparative performance evaluation of multi-GPU MLFMA implementation for 2-D VIE problems (2017) (0)
Gelato Federation: Global Partnerships Advancing Itanium (2007) (0)
Chapter 6 – Performance Considerations (2013) (0)
Message from the Program Chair (2021) (0)
MLModelScope: Evaluate and Introspect Cognitive Pipelines (2019) (0)
Parallel patterns: convolution: An introduction to stencil computation (2017) (0)
Visualization and Analysis of GPU Summer School Applicants and Participants (2008) (0)
Illinois ECE 498AL: Programming Massively Parallel Processors, Lecture 11: Floating Point Considerations (2009) (0)
GPU-Initiated On-Demand High-Throughput Storage Access in the BaM System Architecture (2022) (0)

Other Resources About Wen-mei Hwu

en.wikipedia.org
