Stephen W. Keckler
American computer scientist
Stephen W. Keckler's AcademicInfluence.com Rankings
Computer Science, Engineering
Stephen W. Keckler's Degrees
- PhD Computer Science Massachusetts Institute of Technology (MIT)
- Masters Computer Science Massachusetts Institute of Technology (MIT)
- Bachelors Electrical Engineering Stanford University
Why Is Stephen W. Keckler Influential?
According to Wikipedia, Stephen W. Keckler is an American computer scientist and the current Vice President of Architecture Research at NVIDIA. Keckler received a BS in electrical engineering from Stanford University in 1990 and an MS and PhD in computer science from MIT in 1992 and 1998, respectively. He then joined the faculty at the University of Texas at Austin, where he served from 1998 to 2012. He joined NVIDIA in 2009. In 2003, he received the ACM Grace Murray Hopper Award for his work in leading the TRIPS architecture research group. He became an ACM Senior Member in 2006 and an ACM Fellow in 2011.
Stephen W. Keckler's Published Works
- Modeling the effect of technology trends on the soft error rate of combinational logic (2002) (1563)
- SCNN: An accelerator for compressed-sparse convolutional neural networks (2017) (863)
- An adaptive, non-uniform cache structure for wire-delay dominated on-chip caches (2002) (786)
- Clock rate versus IPC: the end of the road for conventional microarchitectures (2000) (733)
- GPUs and the Future of Parallel Computing (2011) (581)
- Exploiting ILP, TLP, and DLP with the polymorphous TRIPS architecture (2003) (551)
- Research Challenges for On-Chip Interconnection Networks (2007) (495)
- Regional congestion awareness for load balance in networks-on-chip (2008) (403)
- Scaling to the end of silicon with EDGE architectures (2004) (370)
- A NUCA Substrate for Flexible CMP Cache Sharing (2007) (350)
- Understanding Error Propagation in Deep Learning Neural Network (DNN) Accelerators and Applications (2017) (310)
- vDNN: Virtualized deep neural networks for scalable, memory-efficient neural network design (2016) (293)
- The optimal logic depth per pipeline stage is 6 to 8 FO4 inverter delays (2002) (283)
- Energy-efficient mechanisms for managing thread context in throughput processors (2011) (264)
- Measuring Experimental Error in Microprocessor Simulation (2001) (246)
- Timeloop: A Systematic Approach to DNN Accelerator Evaluation (2019) (229)
- Transparent Offloading and Mapping (TOM): Enabling Programmer-Transparent Near-Data Processing in GPU Systems (2016) (220)
- Express Cube Topologies for on-Chip Interconnects (2009) (200)
- Exploring the design space of future CMPs (2001) (198)
- Implementation and Evaluation of On-Chip Network Architectures (2006) (192)
- A design space evaluation of grid processor architectures (2001) (188)
- Simba: Scaling Deep-Learning Inference with Multi-Chip-Module-Based Architecture (2019) (186)
- Kilo-NOC: A heterogeneous network-on-chip architecture for scalability and service guarantees (2011) (167)
- Composable Lightweight Processors (2007) (166)
- Distributed Microarchitectural Protocols in the TRIPS Prototype Processor (2006) (149)
- Scalable hardware memory disambiguation for high-ILP processors (2003) (148)
- The impact of delay on the design of branch predictors (2000) (146)
- Netrace: dependency-driven trace-based network-on-chip simulation (2010) (143)
- Compressing DMA Engine: Leveraging Activation Sparsity for Training Deep Neural Networks (2017) (138)
- The M-machine multicomputer (1997) (138)
- Page Placement Strategies for GPUs within Heterogeneous Memory Systems (2015) (133)
- Preemptive Virtual Clock: A flexible, efficient, and cost-effective QOS scheme for networks-on-chip (2009) (130)
- Hardware support for fast capability-based addressing (1994) (128)
- Static energy reduction techniques for microprocessor caches (2001) (122)
- Fine-Grained DRAM: Energy-Efficient DRAM for Extreme Bandwidth Systems (2017) (117)
- On-Chip Interconnection Networks of the TRIPS Chip (2007) (114)
- Scaling the Power Wall: A Path to Exascale (2014) (114)
- SASSIFI: An architecture-level fault injection tool for GPU application resilience evaluation (2017) (113)
- Nonuniform Cache Architectures for Wire-Delay Dominated On-Chip Caches (2003) (110)
- Unifying Primary Cache, Scratch, and Register File Memories in a Throughput Processor (2012) (110)
- Towards high performance paged memory for GPUs (2016) (101)
- Thermal response to DVFS: analysis with an Intel Pentium M (2007) (92)
- The M-Machine multicomputer (1995) (91)
- Exploiting fine-grain thread level parallelism on the MIT multi-ALU processor (1998) (88)
- TRIPS: A polymorphous architecture for exploiting ILP, TLP, and DLP (2004) (88)
- Flexible software profiling of GPU architectures (2015) (88)
- Exploiting microarchitectural redundancy for defect tolerance (2003) (80)
- Anatomy of GPU Memory System for Multi-Application Execution (2015) (77)
- A compile-time managed multi-level register file hierarchy (2011) (75)
- MAGNet: A Modular Accelerator Generator for Neural Networks (2019) (71)
- NVBit: A Dynamic Binary Instrumentation Framework for NVIDIA GPUs (2019) (71)
- Unlocking bandwidth for GPUs in CC-NUMA systems (2015) (71)
- An evaluation of the TRIPS computer system (2009) (70)
- Architecting an Energy-Efficient DRAM System for GPUs (2017) (69)
- Convergence and scalarization for data-parallel architectures (2013) (65)
- A case for toggle-aware compression for GPU systems (2016) (65)
- ML-Based Fault Injection for Autonomous Vehicles: A Case for Bayesian Fault Injection (2019) (63)
- Implementation and Evaluation of a Dynamically Routed Processor Operand Network (2007) (60)
- Application-aware Memory System for Fair and Efficient Execution of Concurrent GPGPU Applications (2014) (58)
- Late-binding: enabling unordered load-store queues (2007) (57)
- Static placement, dynamic issue (SPDI) scheduling for EDGE architectures (2004) (55)
- Realistic Workload Characterization and Analysis for Networks-on-Chip Design (2009) (51)
- Buffets: An Efficient and Composable Storage Idiom for Explicit Decoupled Data Orchestration (2019) (51)
- A NUCA substrate for flexible CMP cache sharing (2005) (50)
- Universal mechanisms for data-parallel architectures (2003) (49)
- Multicore Processors and Systems (2009) (49)
- On-chip MRAM as a High-Bandwidth, Low-Latency Replacement for DRAM Physical Memories (2004) (46)
- Exploiting microarchitectural redundancy for defect tolerance (2003) (45)
- A wire-delay scalable microprocessor architecture for high performance systems (2003) (45)
- Concurrent Event Handling through Multithreading (1999) (44)
- A comparative analysis of microarchitecture effects on CPU and GPU memory system behavior (2014) (42)
- Selective GPU caches to eliminate CPU-GPU HW cache coherence (2016) (40)
- Dataflow Predication (2006) (40)
- A variable warp size architecture (2015) (40)
- Optimizing Software-Directed Instruction Replication for GPU Error Detection (2018) (39)
- A 0.11 pJ/Op, 0.32-128 TOPS, Scalable Multi-Chip-Module-based Deep Neural Network Accelerator with Ground-Reference Signaling in 16nm (2019) (39)
- A Hierarchical Thread Scheduler and Register File for Energy-Efficient Throughput Processors (2012) (39)
- Technology Independent Area and Delay Estimations for Microprocessor Building Blocks (2001) (38)
- The Effect of Technology Scaling on Microarchitectural Structures (2000) (33)
- A 0.32–128 TOPS, Scalable Multi-Chip-Module-Based Deep Neural Network Inference Accelerator With Ground-Referenced Signaling in 16 nm (2020) (33)
- A characterization of speech recognition on modern computer systems (2001) (33)
- GPU Computing Pipeline Inefficiencies and Optimization Opportunities in Heterogeneous CPU-GPU Processors (2015) (33)
- Evaluation and optimization of multicore performance bottlenecks in supercomputing applications (2011) (32)
- Multitasking workload scheduling on flexible-core chip multiprocessors (2008) (31)
- Virtualizing Deep Neural Networks for Memory-Efficient Neural Network Design (2016) (30)
- SNAP: A 1.67-21.55 TOPS/W Sparse Neural Acceleration Processor for Unstructured Sparse Deep Neural Network Inference in 16nm CMOS (2019) (30)
- SCNN (2017) (29)
- Exploring the Design Space of SPMD Divergence Management on Data-Parallel Architectures (2014) (28)
- SASSIFI: Evaluating Resilience of GPU Applications (2015) (27)
- Segment gating for static energy reduction in networks-on-chip (2009) (27)
- Buddy Compression: Enabling Larger Memory for Deep Learning and HPC Workloads on GPUs (2019) (27)
- Microprocessor pipeline energy analysis (2003) (25)
- Kayotee: A Fault Injection-based System to Assess the Safety and Reliability of Autonomous Vehicles to Faults and Errors (2019) (25)
- NVBitFI: Dynamic Fault Injection for GPUs (2021) (23)
- Routed inter-ALU networks for ILP scalability and performance (2003) (23)
- Toggle-Aware Compression for GPUs (2015) (22)
- Proceedings of the 36th annual international symposium on Computer architecture (2009) (22)
- Stitch-X: An Accelerator Architecture for Exploiting Unstructured Sparsity in Deep Neural Networks (2018) (22)
- On-Chip Networks for Multicore Systems (2009) (22)
- Making Convolutions Resilient Via Algorithm-Based Error Detection Techniques (2020) (21)
- Power, Performance, and Thermal Management for High-Performance Systems (2007) (21)
- Critical path analysis of the TRIPS architecture (2006) (21)
- 21st century digital design tools (2013) (20)
- An Efficient, Protected Message Interface (1998) (19)
- Arbitrary Modulus Indexing (2014) (19)
- A Technology-Scalable Architecture for Fast Clocks and High ILP (2001) (18)
- Designing Efficient Heterogeneous Memory Architectures (2015) (18)
- HarDNN: Feature Map Vulnerability Evaluation in CNNs (2020) (17)
- A patch memory system for image processing and computer vision (2016) (17)
- Design and Implementation of the TRIPS Primary Memory System (2006) (17)
- SimpleScalar Simulation of the PowerPC Instruction Set Architecture (2001) (16)
- SNAP: An Efficient Sparse Neural Acceleration Processor for Unstructured Sparse Deep Neural Network Inference (2021) (15)
- Scalable selective re-execution for EDGE architectures (2004) (15)
- Recent extensions to the SimpleScalar tool suite (2004) (14)
- Topology-Aware quality-of-service support in highly integrated chip multiprocessors (2010) (14)
- Generating and Characterizing Scenarios for Safety Testing of Autonomous Vehicles (2021) (14)
- Scalable On-Chip Interconnect Topologies (2008) (14)
- A QoS-Enabled On-Die Interconnect Fabric for Kilo-Node Chips (2012) (13)
- Exploiting criticality to reduce bottlenecks in distributed uniprocessors (2011) (13)
- Decomposing memory performance: data structures and phases (2006) (12)
- SwapCodes: Error Codes for Hardware-Software Cooperative GPU Pipeline Error Detection (2018) (12)
- Counting Dependence Predictors (2008) (12)
- End-to-end validation of architectural power models (2009) (11)
- Combining Hyperblocks and Exit Prediction to Increase Front-End Bandwidth and Performance (2002) (11)
- Evolution of the Graphics Processing Unit (GPU) (2021) (11)
- Reconciling performance and programmability in networking systems (2007) (11)
- How to implement effective prediction and forwarding for fusable dynamic multicore architectures (2013) (11)
- High performance dense linear algebra on a spatially distributed processor (2008) (11)
- Characterizing the SPHINX Speech Recognition System (2001) (11)
- Scaling Power and Performance via Processor Composability (2014) (11)
- Structurally Sparsified Backward Propagation for Faster Long Short-Term Memory Training (2018) (10)
- Software Infrastructure and Tools for the TRIPS Prototype (2007) (10)
- NUCA: A Non-Uniform Cache Access Architecture for Wire-Delay Dominated On-Chip Caches (2003) (10)
- Modeling the Impact of Device and Pipeline Scaling on the Soft Error Rate of Processor Elements (2002) (10)
- TRIPS: A distributed explicit data graph execution (EDGE) microprocessor (2007) (9)
- Optimizing Selective Protection for CNN Resilience (2021) (9)
- Training Long Short-Term Memory With Sparsified Stochastic Gradient Descent (2017) (9)
- Measuring the Radiation Reliability of SRAM Structures in GPUs Designed for HPC (2014) (9)
- Software-Directed Techniques for Improved GPU Register File Utilization (2018) (8)
- Characterizing and Mitigating Soft Errors in GPU DRAM (2021) (8)
- Errata on "Measuring Experimental Error in Microprocessor Simulation" (2002) (8)
- The design and implementation of the TRIPS prototype chip (2005) (7)
- GPU computing and the road to extreme-scale parallel systems (2011) (7)
- CLARA: Circular Linked-List Auto and Self Refresh Architecture (2016) (7)
- GPU snapshot: checkpoint offloading for GPU-dense systems (2019) (7)
- A real-time energy-efficient superpixel hardware accelerator for mobile computer vision applications (2016) (7)
- A 0.11 pJ/Op, 0.32-128 TOPS, Scalable Multi-Chip-Module-Based Deep Neural Network Accelerator Designed with a High-Productivity VLSI Methodology (2019) (7)
- Instruction scheduling for emerging communication-exposed architectures (2004) (7)
- Simba: scaling deep-learning inference with chiplet-based architecture (2021) (6)
- Assessment of MRAM Technology Characteristics and Architectures (2001) (6)
- Exploiting Temporal Data Diversity for Detecting Safety-critical Faults in AV Compute Systems (2022) (6)
- Multicore Optimization for Ranger (2009) (6)
- Speculative reconvergence for improved SIMT efficiency (2020) (6)
- GPU Domain Specialization via Composable On-Package Architecture (2021) (5)
- The effects of explicitly parallel mechanisms on the multi-ALU processor cluster pipeline (1998) (5)
- Toggle-Aware Bandwidth Compression for GPUs (2015) (5)
- Polymorphous architectures: a unified approach for extracting concurrency of different granularities (2006) (5)
- An Analytical Model for Hardened Latch Selection and Exploration (2016) (5)
- Scalable hardware memory disambiguation for high ILP processors (2003) (5)
- The Importance of Locality in Scheduling and Load Balancing for Multiprocessors (1994) (4)
- Fast thread communication and synchronization mechanisms for a scalable single chip multiprocessor (1998) (4)
- TRIPS Intermediate Language (TIL) Manual (2005) (4)
- NVBit (2019) (4)
- Making Sense of Performance Counter Measurements on Supercomputing Applications (2010) (4)
- An Evaluation of the TRIPS Computer System (Extended Technical Report) (2008) (4)
- Power and Performance Optimization: A Case Study with the Pentium M Processor (2006) (4)
- A Coupled Multi-ALU Processing Node for a Highly Parallel Computer (1992) (4)
- Architecture and Implementation of the TRIPS Processor (2007) (3)
- Hybrid Operand Communication for Dataflow Processors (2009) (3)
- Massively Multithreaded Computing Systems (2012) (3)
- Analysis of the TRIPS prototype block predictor (2009) (3)
- Selective Re-Execution and its Implications for Value Speculation (3)
- TRIPS Application Binary Interface (ABI) Manual (2005) (3)
- Breaking the GOP/Watt Barrier with EDGE Architectures (2005) (3)
- Rethinking caches for throughput processors: technical perspective (2014) (3)
- Tera-Op Reliable Intelligently Adaptive Processing System (TRIPS) (2004) (3)
- Coordinated power, energy, and temperature management (2007) (3)
- Suraksha: A Framework to Analyze the Safety Implications of Perception Design Choices in AVs (2021) (3)
- Multitasking workload scheduling on flexible core chip multiprocessors (2008) (3)
- Proceedings of the 41st annual international symposium on Computer architecture (2014) (3)
- Coordinated Power, Energy, and Temperature Management for High-Performance Microprocessors (2004) (2)
- Exposing Memory Access Patterns to Improve Instruction and Memory Efficiency in GPUs (2018) (2)
- Material Properties of Laser Powder Bed Fusion Processed 316L Stainless Steel (2018) (2)
- Simba (2019) (2)
- Estimating Silent Data Corruption Rates Using a Two-Level Model (2020) (2)
- On the Trend of Resilience for GPU-Dense Systems (2019) (2)
- The Memory Behavior of Data Structures in C SPEC CPU 2000 Benchmarks (2006) (2)
- Exploiting Slack for Low Overhead Soft Error Reliability (2007) (2)
- 2nd Workshop on Chip Multiprocessor Memory Systems and Interconnects (CMP-MSI), 2008 (2008) (2)
- Partition the Banks, not the Functionality, of Large-Window Load/Store Queues (2006) (2)
- Network-on-chip implementation and performance improvement through workload characterization and congestion awareness (2008) (2)
- An Adaptive Cache Structure for Future High-Performance Systems (2002) (2)
- Tera-OP Reliable Intelligently Adaptive Processing System (TRIPS) Implementation (2008) (2)
- Architecture for High Performance Systems (2003) (1)
- Sharing Speculation: A Mechanism for Low-Latency Access to Falsely Shared Data (2003) (1)
- Buffets (2019) (1)
- 36th International Symposium on Computer Architecture (ISCA 2009), June 20-24, 2009, Austin, TX, USA (2009) (1)
- A Characterization of High Performance DSP Kernels on the TRIPS Architecture (2006) (1)
- Power and Thermal Characteristics of a Pentium M System (2007) (1)
- A Temperature-Aware Power Estimation Methodology (2008) (1)
- M-Machine Microarchitecture v1.1 (1993) (1)
- Critical Path Analysis of the TRIPS Architecture (ISPASS 2006) (2006) (1)
- Saving PAM4 Bus Energy with SMOREs: Sparse Multi-level Opportunistic Restricted Encodings (2022) (1)
- Compiler-assisted Hybrid Operand Communication (2009) (1)
- Energy-Efficient Data Compression for GPU Memory Systems (2014) (1)
- Techniques to improve the hard and soft error reliability of distributed architectures (2007) (1)
- Evaluation and Optimization of Signal Processing Kernels on the TRIPS Architecture (2006) (1)
- Suraksha: A Quantitative AV Safety Evaluation Framework to Analyze Safety Implications of Perception Design Choices (2021) (1)
- ReFLEX: Block Atomic Execution on Conventional ISA Cores (2010) (1)
- Maximizing Area Efficiency for Single-Chip Server Processors (2001) (1)
- The M-Machine Multicomputer (1995) (1)
- Fault Aware Instruction Placement for Static Architectures (2005) (1)
- GPU Subwarp Interleaving (2022) (1)
- Cooperative Profile Guided Optimizations (2021) (1)
- A Routing Network for the Grid Processor Architecture (2003) (1)
- Augmenting Legacy Networks for Flexible Inference (2022) (0)
- The future of multi-core technologies (2007) (0)
- Composable Multicore Chips (2009) (0)
- Microprocessor Pipeline Energy Analysis: Speculation and Over-Provisioning (2003) (0)
- Implementation of the Control Unit in the TRIPS Prototype Processor (2006) (0)
- Charles R. (Chuck) Moore (1961 - 2012) (2012) (0)
- Measuring experimental error in microprocessor simulation (2001) (0)
- Coordinated Management: Power, Performance, Energy, and Temperature (2005) (0)
- Method, system and medium for providing a distributed machine-processable predicate prediction (2010) (0)
- Increasing Interconnection Network Throughput with Virtual Channels (2015) (0)
- Author retrospective for a NUCA substrate for flexible CMP cache sharing (2014) (0)
- The TRIPS OPN: A Processor Integrated NoC for Operand Bypass (2010) (0)
- Feature Map Vulnerability Evaluation in CNNs (2020) (0)
- Analysis of the TRIPS Architecture by Dipak Chand Boyed (2004) (0)
- Power Management in High Performance Systems (2003) (0)
- Researching Novel Systems: To Instantiate, Emulate, Simulate, or Analyticate? (2007) (0)
- Zhuyi (2022) (0)
- 2014 International Symposium on Computer Architecture Influential Paper Award; 2014 Maurice Wilkes Award Given to Ravi Rajwar (2014) (0)
- Microprocessor Pipeline Energy Analysis (2003) (0)
- Continuous Optimization and Coordinated Power Management (2005) (0)
- Chapter 2 On-Chip Networks for Multicore Systems (2017) (0)
- GPU Chip Select Register File Pending Warps ALUs Cache/Scratch Banks (32 x 2 KB) (2013) (0)
- The Authors' Model of Energy, Bandwidth, and Latency for DRAM Technologies Enables Exploration of Memory Hierarchies That Combine Heterogeneous Memory Technologies with Different Attributes (2015) (0)
- GPUs and the Future of Parallel Computing (2011) (0)
- DEPENDENCE PREDICTION IN A MEMORY SYSTEM (2010) (0)
- Enabling and Accelerating Dynamic Vision Transformer Inference for Real-Time Applications (2022) (0)
- Zhuyi: perception processing rate estimation for safety in autonomous vehicles (2022) (0)
- Unique Chips and Systems (2007) (0)
Other Resources About Stephen W. Keckler
What Schools Are Affiliated With Stephen W. Keckler?
Stephen W. Keckler is affiliated with the following schools: