#3099 Overall Influence

American academic

Jack J. Dongarra ForMemRS is an American University Distinguished Professor of Computer Science in the Electrical Engineering and Computer Science Department at the University of Tennessee. He is also a Distinguished Research Staff member in the Computer Science and Mathematics Division at Oak Ridge National Laboratory, holds a Turing Fellowship in the School of Mathematics at the University of Manchester, and is an adjunct professor in the Computer Science Department at Rice University. He served as a faculty fellow at 's institute for advanced study. Dongarra is the founding director of the Innovative Computing Laboratory.

Source: Wikipedia

- Templates for the Solution of Linear Systems: Building Blocks for Iterative Methods
- PVM
- LINPACK Users' Guide
- A set of level 3 basic linear algebra subprograms
- Automated empirical optimizations of software and the ATLAS project
- Matrix Eigensystem Routines — EISPACK Guide
- An extended set of FORTRAN basic linear algebra subprograms
- Open MPI: Goals, Concept, and Design of a Next Generation MPI Implementation
- The International Exascale Software Project roadmap
- The LINPACK Benchmark: past, present and future
- Automatically Tuned Linear Algebra Software
- Matrix Eigensystem Routines — EISPACK Guide Extension
- A class of parallel tiled linear algebra algorithms for multicore architectures
- Exascale computing and big data
- NetSolve: a Network-Enabled Server for Solving Computational Science Problems
- Numerical Linear Algebra for High-Performance Computers
- Towards dense linear algebra for hybrid GPU accelerated manycore systems
- Distribution of mathematical software via electronic mail
- The GrADS Project: Software Support for High-Level Grid Application Development
- Numerical linear algebra on emerging architectures: The PLASMA and MAGMA projects
- Implementing Linear Algebra Algorithms for Dense Matrices on a Vector Pipeline Machine
- Chebyshev tau-QZ algorithm methods for calculating spectra of hydrodynamic stability problems
- A Fully Parallel Algorithm for the Symmetric Eigenvalue Problem
- From CUDA to OpenCL: Towards a performance-portable solution for multi-platform GPU programming
- DAGuE: A generic distributed DAG engine for High Performance Computing
- S12: The HPC Challenge (HPCC) benchmark suite
- Algorithm 679: A set of level 3 basic linear algebra subprograms: model implementation and test programs
- Algorithm-based fault tolerance applied to high performance computing
- Algorithm 656: an extended set of basic linear algebra subprograms: model implementation and test programs
- Squeezing the most out of an algorithm in CRAY FORTRAN
- FT-MPI: Fault Tolerant MPI, Supporting Dynamic Applications in a Dynamic World
- Dense linear algebra solvers for multicore with GPU accelerators
- PaRSEC: Exploiting Heterogeneity to Enhance Scalability
- Post-failure recovery of MPI communication capability
- An Improved MAGMA GEMM for Fermi Graphics Processing Units
- Matrix Market: a web resource for test matrix collections
- ScaLAPACK: a scalable linear algebra library for distributed memory concurrent computers
- Accelerating scientific computations with mixed precision algorithms
- Collecting Performance Data with PAPI-C
- Self adaptivity in Grid computing
- Software Libraries for Linear Algebra Computations on High Performance Computers
- Design and Implementation of the ScaLAPACK LU, QR, and Cholesky Factorization Routines
- Performance analysis of MPI collective operations
- Condition Numbers of Gaussian Random Matrices
- Keeneland: Bringing Heterogeneous GPU Computing to the Computational Science Community
- Block reduction of matrices to condensed forms for eigenvalue computations
- Integrated PVM Framework Supports Heterogeneous Network Computing
- Parallel tiled QR factorization for multicore architectures
- Mixed Precision Iterative Refinement Techniques for the Solution of Dense Linear Systems
- Overview of GridRPC: A Remote Procedure Call API for Grid Computing
- PUMMA: Parallel universal matrix multiplication algorithms on distributed memory concurrent computers
- A Note on Auto-tuning GEMM for GPUs
- Flexible Development of Dense Linear Algebra Algorithms on Massively Parallel Architectures with DPLASMA
- Unrolling loops in Fortran
- Improving the Accuracy of Computed Eigenvalues and Eigenvectors
- A message passing standard for MPP and workstations
- Autotuning GEMM Kernels for the Fermi GPU
- Scheduling workflow applications on processors with different capabilities
- Bi-objective scheduling algorithms for optimizing makespan and reliability on heterogeneous systems
- On some parallel banded system solvers
- SRS: A Framework for Developing Malleable and Migratable Parallel Applications for Distributed Systems
- Performance Analysis of MPI Collective Operations
- Graphical development tools for network-based concurrent supercomputing
- Preface: Basic Linear Algebra Subprograms Technical (BLAST) Forum Standard
- Experiments with Scheduling Using Simulated Annealing in a Grid Environment
- QR Factorization on a Multicore Node Enhanced with Multiple GPU Accelerators
- An Evaluation of User-Level Failure Mitigation Support in MPI
- A performance oriented migration framework for the grid
- Exploiting the Performance of 32 bit Floating Point Arithmetic in Obtaining 64 bit Accuracy (Revisiting Iterative Refinement for Linear Systems)
- Dynamic task scheduling for linear algebra algorithms on distributed-memory multicore systems
- HARNESS: a next generation distributed virtual machine
- A Parallel Divide and Conquer Algorithm for the Symmetric Eigenvalue Problem on Distributed Memory Architectures
- Algorithm-based fault tolerance for dense matrix factorizations
- Numerical Libraries and the Grid
- Accelerating the reduction to upper Hessenberg, tridiagonal, and bidiagonal forms through hybrid GPU-based computing
- Enabling and scaling matrix computations on heterogeneous multi-core and multi-GPU systems
- The Impact of Multicore on Math Software
- NetSolve
- The International Exascale Software Project: a Call To Cooperative Action By the Global High-Performance Community
- Using Mixed Precision for Sparse Matrix Computations to Enhance the Performance while Achieving 64-bit Accuracy
- A tool to aid in the design, implementation, and understanding of matrix algorithms for parallel processors
- A metascheduler for the Grid
- Computer benchmarking: Paths and pitfalls: The most popular way of rating computer performance can confuse as well as inform; avoid misunderstanding by asking just what the benchmark is measuring
- Scheduling dense linear algebra operations on multicore processors
- Towards Efficient MapReduce Using MPI
- Implementation of some concurrent algorithms for matrix factorization
- Standards for graph algorithm primitives
- DAGuE: A Generic Distributed DAG Engine for High Performance Computing
- Scalability Issues Affecting the Design of a Dense Linear Algebra Library
- Performance, Design, and Autotuning of Batched GEMM for GPUs
- Optimizing matrix multiplication for a short-vector SIMD architecture – CELL processor
- Fault tolerant high performance computing by a coding approach
- Multiprocessing linear algebra algorithms on the CRAY X-MP-2: Experiences with small granularity
- Harnessing GPU Tensor Cores for Fast FP16 Arithmetic to Speed up Mixed-Precision Iterative Refinement Solvers
- Adaptive Scheduling for Task Farming with Grid Middleware
- Toward a new metric for ranking high performance computing systems.
- Implementation of mixed precision in solving systems of linear equations on the Cell processor
- Solving banded systems on a parallel processor
- HARNESS and fault tolerant MPI
- High-performance computing systems: Status and outlook
- High-performance conjugate-gradient benchmark: A new metric for ranking high-performance computing systems
- Fault-Tolerant Matrix Operations for Networks of Workstations Using Diskless Checkpointing
- Reduction to condensed form for the Eigenvalue problem on distributed memory architectures
- Self-adapting software for numerical linear algebra and LAPACK for clusters
- Process Fault Tolerance: Semantics, Design and Applications for High Performance Computing
- Unified model for assessing checkpointing protocols at extreme-scale
- The LINPACK Benchmark: An explanation
- Accelerating Numerical Dense Linear Algebra Calculations with GPUs
- Energy Footprint of Advanced Dense Numerical Linear Algebra Using Tile Algorithms on Multicore Architectures
- CIFTS: A Coordinated Infrastructure for Fault-Tolerant Systems
- Batched matrix computations on hardware accelerators based on GPUs
- Telescoping Languages: A Strategy for Automatic Generation of Scientific Problem-Solving Systems from Annotated Libraries
- Scheduling in the Grid Application Development Software Project
- Autotuning in High-Performance Computing Applications
- Netlib and NA-Net: Building a Scientific Computing Community
- Redesigning the message logging model for high performance
- Innovations of the NetSolve Grid Computing System
- Iterative Sparse Triangular Solves for Preconditioning
- The Design and Performance of Batched BLAS on Modern High-Performance Computing Systems
- Performance of various computers using standard linear equations software in a FORTRAN environment
- A proposal for a set of parallel basic linear algebra subprograms
- Improving the Performance of CA-GMRES on Multicores with Multiple GPUs
- Performance of various computers using standard linear equations software
- The design of a parallel dense linear algebra software library: Reduction to Hessenberg, tridiagonal, and bidiagonal form
- Squeezing the most out of eigenvalue solvers on high-performance computers
- High-performance high-resolution semi-Lagrangian tracer transport on a sphere
- LU factorization for accelerator-based systems
- QR factorization of tall and skinny matrices in a grid computing environment
- Highly Scalable Self-Healing Algorithms for High Performance Scientific Computing
- Implementing Dense Linear Algebra Algorithms Using Multitasking on the CRAY X-MP-4 (or Approaching the Gigaflop)
- A proposal for an extended set of Fortran Basic Linear Algebra Subprograms
- Building and Using a Fault-Tolerant MPI Implementation
- Recent Developments in GridSolve
- A comparison of search heuristics for empirical code optimization
- Reasons for a pessimistic or optimistic message logging protocol in MPI uncoordinated failure, recovery
- A Step towards Energy Efficient Computing: Redesigning a Hydrodynamic Application on CPU-GPU
- A proposal for a set of level 3 basic linear algebra subprograms
- Performance of various computers using standard linear equations software in a Fortran environment
- Algorithmic redistribution methods for block-cyclic decompositions
- Robust task scheduling in non-deterministic heterogeneous computing systems
- Algorithmic Issues on Heterogeneous Computing Platforms
- Optimizing symmetric dense matrix-vector multiplication on GPUs
- Introduction to the HPCChallenge Benchmark Suite
- The design and implementation of the parallel out-of-core ScaLAPACK LU, QR, and Cholesky factorization routines
- State-of-the-art eigensolvers for electronic structure calculations of large scale nano-systems
- A collection of parallel linear equations routines for the Denelcor HEP
- Accelerating collaborative filtering using concepts from high performance computing
- LU Factorization of Small Matrices: Accelerating Batched DGETRF on the GPU
- Mixed-Precision Cholesky QR Factorization and Its Case Studies on Multicore CPU with Multiple GPUs
- Parallel reduction to condensed forms for symmetric eigenvalue problems using aggregated fine-grained and memory-aware kernels
- A scalable framework for heterogeneous GPU-based clusters
- Algorithm-based fault tolerance for dense matrix factorizations
- Self-Adapting Numerical Software for Next Generation Applications
- A parallel algorithm for the reduction of a nonsymmetric matrix to block upper-Hessenberg form
- Algorithmic bombardment for the iterative solution of linear systems: A poly-iterative approach
- Recent trends in the marketplace of high performance computing
- EZTrace: A Generic Framework for Performance Analysis
- Implementation and Tuning of Batched Cholesky Factorization and Solve for NVIDIA GPUs
- Numerical Considerations in Computing Invariant Subspaces
- An evaluation of User-Level Failure Mitigation support in MPI
- Incomplete Sparse Approximate Inverses for Parallel Preconditioning
- A comparative study of automatic vectorizing compilers
- The quest for petascale computing
- Algorithm-based diskless checkpointing for fault tolerant matrix operations
- Kernel Assisted Collective Intra-node MPI Communication among Multi-Core and Many-Core CPUs
- Two-Stage Tridiagonal Reduction for Dense Symmetric Matrices Using Tile Algorithms on Multicore Architectures
- The PlayStation 3 for High-Performance Scientific Computing
- Scalable Tile Communication-Avoiding QR Factorization on Multicore Cluster Systems
- The impact of paravirtualized memory hierarchy on linear algebra computational kernels and software
- Dynamic task discovery in PaRSEC
- Performance of various computers using standard linear equations software
- Parallel matrix transpose algorithms on distributed memory concurrent computers
- Hierarchical QR factorization algorithms for multi-core clusters
- Process Distance-Aware Adaptive MPI Collective Communications
- Performance Portability of a GPU Enabled Factorization with the DAGuE Framework
- A Class of Hybrid LAPACK Algorithms for Multicore and GPU Architectures
- Preface: Basic Linear Algebra Subprograms Technical (BLAST) Forum Standard
- NetSolve: Past, Present, and Future – A Look at a Grid Enabled Server
- Tridiagonalization of a dense symmetric matrix on multiple GPUs and its application to symmetric eigenvalue problems
- Adaptive precision in block-Jacobi preconditioning for iterative sparse linear system solvers
- A portable environment for developing parallel FORTRAN programs
- The marketplace of high-performance computing
- NetSolve: Grid enabling scientific computing environments
- The Netlib Mathematical Software Repository
- A Fast Batched Cholesky Factorization on a GPU
- Tile QR factorization with parallel panel processing for multicore architectures
- Unified Development for Mixed Multi-GPU and Multi-coprocessor Environments Using a Lightweight Runtime Environment
- Hierarchical DAG Scheduling for Hybrid Distributed Systems
- Analysis of dynamically scheduled tile algorithms for dense linear algebra on multicore architectures
- Achieving numerical accuracy and high performance using recursive tile LU factorization with partial pivoting
- Implementation and Usage of the PERUSE-Interface in Open MPI
- Dynamic Reconfiguration and Virtual Machine Management in the Harness Metacomputing System
- LINPACK Benchmark
- Binomial Graph: A Scalable and Fault-Tolerant Logical Network Topology
- Profiling high performance dense linear algebra algorithms on multicore architectures for power and energy efficiency
- The eigenvalue problem for Hermitian matrices with time reversal symmetry
- Parallel loops — A test suite for parallelizing compilers: Description and example results
- Static tiling for heterogeneous computing platforms
- HierKNEM: An Adaptive Framework for Kernel-Assisted and Topology-Aware Collective Communications on Many-core Clusters
- A Parallel Algorithm for the Nonsymmetric Eigenvalue Problem
- Divide and Conquer on Hybrid GPU-Accelerated Multicore Systems
- The Singular Value Decomposition: Anatomy of Optimizing an Algorithm for Extreme Scale
- Accelerating Linear System Solutions Using Randomization Techniques
- Investigating half precision arithmetic to accelerate dense linear system solvers
- QR Factorization for the Cell Broadband Engine
- TOP500 Supercomputer sites 11/2000
- PVM and HeNCE: Tools for Heterogeneous Network Computing
- A block-asynchronous relaxation method for graphics processing units
- MPI collective algorithm selection and quadtree encoding
- Numerical linear algebra algorithms and software
- High Performance Dense Linear System Solver with Soft Error Resilience
- Exploring New Architectures in Accelerating CFD for Air Force Applications
- A Parallel Implementation of the Nonsymmetric QR Algorithm for Distributed Memory Architectures
- The Virtual Instrument: Support for Grid-Enabled MCell Simulations
- Request Sequencing: Optimizing Communication for the Grid
- A Framework for Batched and GPU-Resident Factorization Algorithms Applied to Block Householder Transformations
- A Checkpoint-on-Failure Protocol for Algorithm-Based Recovery in Standard MPI
- Porting the PLASMA Numerical Library to the OpenMP Standard
- Solving the secular equation including spin orbit coupling for systems with inversion and time reversal symmetry
- Self-healing network for scalable fault-tolerant runtime environments
- One-sided Dense Matrix Factorizations on a Multicore with Multiple GPU Accelerators
- High Performance Dense Linear System Solver with Resilience to Multiple Soft Errors
- Performance of various computers using standard linear equations software in a Fortran environment
- A novel hybrid CPU–GPU generalized eigensolver for electronic structure calculations based on fine-grained memory aware tasks
- DOE Advanced Scientific Computing Advisory Subcommittee (ASCAC) Report: Top Ten Exascale Research Challenges
- Message-passing performance of various computers
- A proposal for a user-level, message passing interface in a distributed memory environment
- Java access to numerical libraries
- Developing numerical libraries in Java
- High-Performance Heterogeneous Computing
- Correlated set coordination in fault tolerant message logging protocols for many-core clusters
- A survey of recent developments in parallel implementations of Gaussian elimination
- Computing the conditioning of the components of a linear least-squares solution
- Asynchronous Iterative Algorithm for Computing Incomplete Factorizations on GPUs
- Fault Tolerance Techniques for High-Performance Computing
- Accelerating GPU Kernels for Dense Linear Algebra
- Biological sequence alignment on the computational grid using the GrADS framework
- Improvement of parallelization efficiency of batch pattern BP training algorithm using Open MPI
- DARPA's HPCS Program: History, Models, Tools, Languages
- Deploying fault tolerance and task migration with NetSolve
- Design for a Soft Error Resilient Dynamic Task-Based Runtime
- High performance matrix inversion based on LU factorization for multicore architectures
- Recent Enhancements to PVM
- Investigating power capping toward energy‐efficient scientific applications
- Scalable Fault Tolerant Protocol for Parallel Runtime Environments
- Review of Performance Analysis Tools for MPI Parallel Programs
- Tools to aid in the analysis of memory access patterns for FORTRAN programs
- Programming methodology and performance issues for advanced computer architectures
- Soft error resilient QR factorization for hybrid system with GPGPU
- Key concepts for parallel out-of-core LU factorization
- PVM: Parallel Virtual Machine: A Users' Guide and Tutorial for Networked Parallel Computing
- Sunway TaihuLight supercomputer makes its appearance
- Analytical modeling and optimization for affinity based thread scheduling on multicore systems
- A Comprehensive Study of Task Coalescing for Selecting Parallelism Granularity in a Two-Stage Bidiagonal Reduction
- The TOP500 List and Progress in High-Performance Computing
- Race to Exascale
- PTG: An Abstraction for Unhindered Parallelism
- High-performance bidiagonal reduction using tile algorithms on homogeneous multicore architectures
- ADAPT
- Composing resilience techniques: ABFT, periodic and incremental checkpointing
- Distributed Probabilistic Model-Building Genetic Algorithm
- The Design of Fast and Energy-Efficient Linear Solvers: On the Potential of Half-Precision Arithmetic and Iterative Refinement Techniques
- Efficient Pattern Search in Large Traces Through Successive Refinement
- Key concepts for parallel out-of-core LU factorization
- Power monitoring with PAPI for extreme scale architectures and dataflow-based programming models
- Overlapping Computation and Communication for Advection on Hybrid Parallel Computers
- A Parallel Tiled Solver for Dense Symmetric Indefinite Systems on Multicore Architectures
- Distribution of mathematical software via electronic mail
- Algorithm-Based Fault Tolerance for Dense Matrix Factorizations, Multiple Failures and Accuracy
- Corrigenda: “An Extended Set of FORTRAN Basic Linear Algebra Subprograms”
- Scalable Networked Information Processing Environment (SNIPE)
- Acceleration of GPU-based Krylov solvers via data transfer reduction
- HPCG Benchmark Technical Specification
- IML++ v.1.2 iterative methods library, reference guide
- Experiences in autotuning matrix multiplication for energy minimization on GPUs
- A Comparison of Parallel Solvers for Diagonally Dominant and General Narrow-Banded Linear Systems II
- Polyhedron Model
- MPI_Connect managing heterogeneous MPI applications interoperation and process control
- Preconditioned Krylov solvers on GPUs
- Block-asynchronous Multigrid Smoothers for GPU-accelerated Systems
- A new metric for ranking high-performance computing systems
- A Scalable Checkpoint Encoding Algorithm for Diskless Checkpointing
- L2 Cache Modeling for Scientific Applications on Chip Multi-Processors
- Towards Achieving Performance Portability Using Directives for Accelerators
- Performance of various computers using standard linear equations software in a Fortran environment
- Rectangular full packed format for Cholesky's algorithm
- Practical scalable consensus for pseudo-synchronous distributed systems
- PUMMA: Parallel Universal Matrix Multiplication Algorithms on distributed memory concurrent computers
- Enabling interactive and collaborative oil reservoir simulations on the Grid
- Numerically Stable Real Number Codes Based on Random Matrices
- Performance Instrumentation and Measurement for Terascale Systems
- Heterogeneous MPI application interoperation and process management under PVMPI
- GridSolve: The Evolution of A Network Enabled Solver
- Open MPI’s TEG Point-to-Point Communications Methodology: Comparison to Existing Implementations
- Retrospect: Deterministic Replay of MPI Applications for Interactive Distributed Debugging
- Implementing Linear Algebra Routines on Multi-core Processors with Pipelining and a Look Ahead
- Portable HPC Programming on Intel Many-Integrated-Core Hardware with MAGMA Port to Xeon Phi
- Selected results from the ParkBench Benchmark
- Updating incomplete factorization preconditioners for model order reduction
- Using Jacobi iterations and blocking for solving sparse triangular systems in incomplete factorization preconditioning
- With Extreme Computing, the Rules Have Changed
- A Guide for Achieving High Performance with Very Small Matrices on GPU: A Case Study of Batched LU and Cholesky Factorizations
- Automatic blocking of QR and LU factorizations for locality
- Tools and techniques for performance: Exploiting the performance of 32 bit floating point arithmetic in obtaining 64 bit accuracy (revisiting iterative refinement for linear systems)
- Algorithm 710: FORTRAN subroutines for computing the eigenvalues and eigenvectors of a general matrix by reduction to general tridiagonal form
- SLATE
- The design of linear algebra libraries for high performance computers
- Message-passing performance of various computers
- visPerf: Monitoring Tool for Grid Computing
- Accurate Cache and TLB Characterization Using Hardware Counters
- On Using Incremental Profiling for the Performance Analysis of Shared Memory Parallel Applications
- Correlated Set Coordination in Fault Tolerant Message Logging Protocols
- Evaluation of the HPC Challenge Benchmarks in Virtualized Environments
- A Class of Communication-avoiding Algorithms for Solving General Dense Linear Systems on CPU/GPU Parallel Machines
- A Parallel Solver for Incompressible Fluid Flows
- Scalable networked information processing environment (SNIPE)
- Multithreading for synchronization tolerance in matrix factorization
- Power-aware computing: Measurement, control, and performance analysis for Intel Xeon Phi
- Efficient parallelization of batch pattern training algorithm on many-core and cluster architectures
- Fast Batched Matrix Multiplication for Small Sizes Using Half-Precision Arithmetic on GPUs
- Optimizing Krylov Subspace Solvers on Graphics Processing Units
- Improving Performance of GMRES by Reducing Communication and Pipelining Global Collectives
- Improving the Accuracy of Computed Singular Values
- Toward a High Performance Tile Divide and Conquer Algorithm for the Dense Symmetric Eigenvalue Problem
- Communication-Avoiding Symmetric-Indefinite Factorization
- Matrix product on heterogeneous master-worker platforms
- LAPACK++
- Towards batched linear solvers on accelerated hardware platforms
- Energy efficiency and performance frontiers for sparse computations on GPU supercomputers
- Batched Gauss-Jordan Elimination for Block-Jacobi Preconditioner Generation on GPUs
- Towards an Accurate Model for Collective Communications
- Efficient Support for Matrix Computations on Heterogeneous Multi-core and Multi-GPU Architectures
- The design and implementation of the parallel out-of-core ScaLAPACK LU, QR and Cholesky factorization routines
- Automatic analysis of inefficiency patterns in parallel applications
- OMPIO: A Modular Software Architecture for MPI I/O
- Enhancing Parallelism of Tile Bidiagonal Transformation on Multicore Architectures Using Tree Reduction
- Beyond the CPU: Hardware Performance Counter Monitoring on Blue Gene/Q
- Leading Edge Hybrid Multi-GPU Algorithms for Generalized Eigenproblems in Electronic Structure Calculations
- Linear algebra on high performance computers
- An asynchronous algorithm on the NetSolve global computing system
- QCG-OMPI: MPI applications on grids
- GrADSolve—a grid-based RPC system for parallel computing with application-level scheduling
- Kernel-assisted and topology-aware MPI collective communications on multicore/many-core platforms
- Clusters and computational grids for scientific computing
- PaRSEC in Practice: Optimizing a Legacy Chemistry Application through Distributed Task-Based Execution
- Anatomy of a globally recursive embedded LINPACK benchmark
- Heterogeneous Streaming
- Failure Detection and Propagation in HPC systems
- Performance Study of LU Factorization with Low Communication Overhead on Multiprocessors
- Performance of various computers using standard linear equations software in a Fortran environment
- Soft error resilient QR factorization for hybrid system with GPGPU
- GPU-Aware Non-contiguous Data Movement In Open MPI
- Numerical libraries and the grid
- JLAPACK – Compiling LAPACK FORTRAN to Java
- Recursive Approach in Sparse Matrix LU Factorization
- Parallel matrix transpose algorithms on distributed memory concurrent computers
- PB-BLAS: a set of parallel block basic linear algebra subprograms
- Extending the scope of the Checkpoint-on-Failure protocol for forward recovery in standard MPI
- A scalable approach to solving dense linear algebra problems on hybrid CPU-GPU systems
- A Scalable Approach to MPI Application Performance Analysis
- Self-Adapting Numerical Software and Automatic Tuning of Heuristics
- TEG: A High-Performance, Scalable, Multi-network Point-to-Point Communications Methodology
- Scalability Analysis of the SPEC OpenMP Benchmarks on Large-Scale Shared Memory Multiprocessors
- Power profiling of Cholesky and QR factorizations on distributed memory systems
- Fast Cholesky factorization on GPUs for batch and native modes in MAGMA
- Looking back at dense linear algebra software
- An efficient distributed randomized algorithm for solving large dense symmetric indefinite linear systems
- Middleware for the use of storage in communication
- LU, QR, and Cholesky factorizations: Programming model, performance analysis and optimization techniques for the Intel Knights Landing Xeon Phi
- Hierarchical QR Factorization Algorithms for Multi-core Cluster Systems
- Search Space Generation and Pruning System for Autotuners
- Strengthening compute and data intensive capacities of Armenia
- Randomized algorithms to update partial singular value decomposition on a hybrid CPU/GPU cluster
- Programming tools and environments
- PLASMA
- Algorithm 589: SICEDR: A FORTRAN Subroutine for Improving the Accuracy of Computed Matrix Eigenvalues
- Evaluating Dynamic Communicators and One-Sided Operations for Current MPI Libraries
- On the performance and energy efficiency of sparse linear algebra on GPUs
- CRPC Research Into Linear Algebra Software for High Performance Computers
- Preface
- Tiling on systems with communication/computation overlap
- Scalable Fault Tolerant MPI: Extending the Recovery Algorithm
- Evaluating the Performance of MPI-2 Dynamic Communicators and One-Sided Communication
- Locality and Topology Aware Intra-node Communication among Multicore CPUs
- A Scalable High Performant Cholesky Factorization for Multicore with GPU Accelerators
- Towards an Efficient Tile Matrix Inversion of Symmetric Positive Definite Matrices on Multicore Architectures
- Reducing the Amount of Pivoting in Symmetric Indefinite Systems
- A Hybridization Methodology for High-Performance Linear Algebra Software for GPUs
- The use of bulk states to accelerate the band edge state calculation of a semiconductor quantum dot
- Data through the Computational Lens
- Accelerating the SVD bi-diagonalization of a batch of small matrices using GPUs
- Accelerating the SVD two stage bidiagonal reduction and divide and conquer using GPUs
- Performance of asynchronous optimized Schwarz with one-sided communication
- Multi-GPU Implementation of LU Factorization
- Performance Tuning and Optimization Techniques of Fixed and Variable Size Batched Cholesky Factorization on GPUs
- Why is it Hard to Describe Properties of Algorithms?
- HARNESS fault tolerant MPI design, usage and performance issues
- Using agent-based software for scientific computing in the NetSolve system
- Reliability Analysis of Self-Healing Network using Discrete-Event Simulation
- Out of memory SVD solver for big data
- Revisiting the Double Checkpointing Algorithm
- Efficiency of General Krylov Methods on GPUs: An Experimental Study
- ParILUT: A New Parallel Threshold ILU Factorization
- Feedback-directed thread scheduling with memory considerations
- CPU-GPU hybrid bidiagonal reduction with soft error resilience
- Scaling up matrix computations on shared-memory manycore systems with 1000 CPU cores
- clMAGMA
- Plan B
- Efficient implementation of quantum materials simulations on distributed CPU-GPU systems
- Novel HPC techniques to batch execution of many variable size BLAS computations on GPUs
- Load-balancing Sparse Matrix Vector Product Kernels on GPUs
- HeNCE: A Heterogeneous Network Computing Environment
- Conjugate-gradient eigenvalue solvers in computing electronic properties of nanostructure architectures
- A proposal for a user-level, message passing interface in a distributed memory environment
- Packed storage extension for ScaLAPACK
- Templates for the Solution of Linear Systems: Building Blocks for Iterative Methods.
- Harnessing the Computing Continuum for Programming Our World
- Performance optimization of Sparse Matrix-Vector Multiplication for multi-component PDE-based applications using GPUs
- Non-GPU-resident symmetric indefinite factorization
- Stochastic Performance Prediction for Iterative Algorithms in Distributed Environments
- ACCT: Automatic Collective Communications Tuning
- Tuned: An Open MPI Collective Communications Component
- Power Management and Event Verification in PAPI
- Domain Overlap for Iterative Sparse Triangular Solves on GPUs
- GrADSolve – RPC for High Performance Computing on the Grid
- Parallel Tiled QR Factorization for Multicore Architectures
- Performance Instrumentation and Compiler Optimizations for MPI/OpenMP Applications
- Decision Trees and MPI Collective Algorithm Selection Problem
- A Holistic Approach for Performance Measurement and Analysis for Petascale Applications
- An Implementation of the Tile QR Factorization for a GPU and Multiple CPUs
- Paravirtualization effect on single- and multi-threaded memory-intensive linear algebra software
- Application-Level Tools
- Comparing the performance of rigid, moldable and grid-shaped applications on failure-prone HPC platforms
- Factorization and Inversion of a Million Matrices using GPUs: Challenges and Countermeasures
- Processes Distribution of Homogeneous Parallel Linear Algebra Routines on Heterogeneous Clusters
- NetSolve: a network-enabled solver; examples and users
- MAGMA embedded: Towards a dense linear algebra library for energy efficient extreme computing
- Assessing the Impact of ABFT and Checkpoint Composite Strategies
- MIAMI: A framework for application performance diagnosis
- Performance Analysis of Tile Low-Rank Cholesky Factorization Using PaRSEC Instrumentation Tools
- On the Convergence of Computational and Data Grids
- Software distribution using Xnetlib
- Parallel reduction to hessenberg form with algorithm-based fault tolerance
- Visualizing execution traces with task dependencies
- Stability and Performance of Various Singular Value QR Implementations on Multicore CPU with a GPU
- High-performance Cholesky factorization for GPU-only execution
- Extreme-Scale Task-Based Cholesky Factorization Toward Climate and Weather Prediction Applications
- An Iterative Solver Benchmark
- HPC Programming on Intel Many-Integrated-Core Hardware with MAGMA Port to Xeon Phi
- The Problem With the Linpack Benchmark 1.0 Matrix Generator
- A failure detector for HPC platforms
- Distributed-memory lattice H-matrix factorization
- A look at scalable dense linear algebra libraries
- Automatic translation of Fortran to JVM bytecode
- Parallel IO Support for Meta-computing Applications: MPI_Connect IO Applied to PACX-MPI
- Towards an Accurate Model for Collective Communications
- Fault Tolerant MPI for the HARNESS Meta-computing System
- More on Scheduling Block-Cyclic Array Redistribution
- HDF5
- MagmaDNN: Towards High-Performance Data Analytics and Machine Learning for Data-Driven Scientific Computing
- Optimal Checkpointing Period: Time vs. Energy
- A Framework for Out of Memory SVD Algorithms
- Design of Interactive Environment for Numerically Intensive Parallel Linear Algebra Calculations
- Optimizing Memory-Bound SYMV Kernel on GPU Hardware Accelerators
- Multi-criteria Checkpointing Strategies: Response-Time versus Resource Utilization
- Big Data Meets Computational Science, Preface for ICCS 2014
- Computational Science at the Gates of Nature, Preface for ICCS 2015
- Numerical algorithms for high-performance computational science
- POSTER: Utilizing dataflow-based execution for coupled cluster methods
- Dynamically Balanced Synchronization-Avoiding LU Factorization with Multicore and GPUs
- Performance-Portable Autotuning of OpenCL Kernels for Convolutional Layers of Deep Neural Networks
- Batched Generation of Incomplete Sparse Approximate Inverses on GPUs
- Generic Matrix Multiplication for Multi-GPU Accelerated Distributed-Memory Platforms over PaRSEC
- Solving Linear Diophantine Systems on Parallel Architectures
- On The Implementation Of A Fully Parallel Algorithm For The Symmetric Eigenvalue Problem
- Deploying Parallel Numerical Library Routines to Cluster Computing in a Self Adapting Fashion
- Optimization for performance and energy for batched matrix computations on GPUs
- Performance of random sampling for computing low-rank approximations of a dense matrix on GPUs
- Tuning stationary iterative solvers for fault resilience
- Adaptive precision solvers for sparse linear systems
- Towards Continuous Benchmarking
- Scheduling Two-Sided Transformations Using Tile Algorithms on Multicore Architectures
- High Performance Development for High End Computing With Python Language Wrapper (PLW)
- A look back on 30 years of the Gordon Bell Prize
- A Block-Asynchronous Relaxation Method for Graphics Processing Units
- Preliminary Results of Autotuning GEMM Kernels for the NVIDIA Kepler Architecture - GeForce GTX 680
- SmartGridRPC: The new RPC model for high performance Grid computing
- NetBuild: transparent cross-platform access to computational software libraries
- Hash Functions for Datatype Signatures in MPI
- MPI Collective Algorithm Selection and Quadtree Encoding
- Taskers and General Resource Managers: PVM supporting DCE Process Management
- TBB (Intel Threading Building Blocks)
- Task Graph Scheduling
- Behavioral Equivalences
- Providing Uniform Dynamic Access to Numerical Software
- Dense Symmetric Indefinite Factorization on GPU Accelerated Architectures
- Optimized Batched Linear Algebra for Modern Architectures
- Fast and Small Short Vector SIMD Matrix Multiplication Kernels for the Synergistic Processing Element of the CELL Processor
- Dodging the Cost of Unavoidable Memory Copies in Message Logging Protocols
- From Serial Loops to Parallel Execution on Distributed Systems
- GPU-Accelerated Asynchronous Error Correction for Mixed Precision Iterative Refinement
- LAPACK: A Linear Algebra Library for High-Performance Computers
- High performance linear algebra package for FORTRAN 90
- The Component Structure of a Self-Adapting Numerical Software System
- Batched one-sided factorizations of tiny matrices using GPUs: Challenges and countermeasures
- Mixing LU and QR factorization algorithms to design high-performance dense linear algebra solvers
- Computation at the Frontiers of Science, Preface for ICCS 2013
- Variable-Size Batched Gauss-Huard for Block-Jacobi Preconditioning
- Optimizing the SVD Bidiagonalization Process for a Batch of Small Matrices
- Computational Benefit of GPU Optimization for the Atmospheric Chemistry Modeling
- Mixed-precision iterative refinement using tensor cores on GPUs to accelerate solution of linear systems
- Access-averse framework for computing low-rank matrix approximations
- On Scalability for MPI Runtime Systems
- Fault tolerant matrix operations for networks of workstations using multiple checkpointing
- Variable-Size Batched LU for Small Matrices and Its Integration into Block-Jacobi Preconditioning
- Self Adaptive Application Level Fault Tolerance for Parallel and Distributed Computing
- Implementing a Blocked Aasen's Algorithm with a Dynamic Scheduler on Multicore Architectures
- Performance of Hierarchical-matrix BiCGStab Solver on GPU Clusters
- ParILUT - A Parallel Threshold ILU for GPUs
- On the Development of Variable Size Batched Computation for Heterogeneous Parallel Architectures
- Optimal Cooperative Checkpointing for Shared High-Performance Computing Platforms
- PB-BLAS: a set of Parallel Block Basic Linear Algebra Subprograms
- Analysis and Design Techniques towards High-Performance and Energy-Efficient Dense Linear Solvers on GPUs
- Gordon Bell prize lectures
- Optimization and performance evaluation of the IDR iterative Krylov solver on GPUs
- PAPI software-defined events for in-depth performance analysis
- Summary of Software for Linear Algebra Freely Available on the Web
- Empirical Performance Tuning of Dense Linear Algebra Software
- Truss Structural Optimization using NetSolve System
- Performance and reliability trade-offs for the double checkpointing algorithm
- TOP500 Supercomputers for June 2002
- Solving dense symmetric indefinite systems using GPUs
- Reducing the amount of out‐of‐core data access for GPU‐accelerated randomized SVD
- Comparison of Nonlinear Conjugate-Gradient Methods for Computing the Electronic Properties of Nanostructure Architectures
- ParkBench: Methodology, relations and results
- ScaLAPACK tutorial
- Bisimulation
- BLAS (Basic Linear Algebra Subprograms)
- PaToH (Partitioning Tool for Hypergraphs)
- Dense Linear Algebra on Accelerated Multicore Hardware
- heFFTe: Highly Efficient FFT for Exascale
- Investigating the Benefit of FP16-Enabled Mixed-Precision Solvers for Symmetric Positive Definite Matrices Using GPUs
- Accelerating Computation of Eigenvectors in the Dense Nonsymmetric Eigenvalue Problem
- Mixed-Precision Orthogonalization Scheme and Adaptive Step Size for Improving the Stability and Performance of CA-GMRES on GPUs
- On the Design, Development, and Analysis of Optimized Matrix-Vector Multiplication Routines for Coprocessors
- Accelerating NWChem Coupled Cluster Through Dataflow-Based Execution
- Task-Based Cholesky Decomposition on Knights Corner Using OpenMP
- Prospectus for the Next LAPACK and ScaLAPACK Libraries
- A Fully Empirical Autotuned Dense QR Factorization for Multicore Architectures
- Reducing the Time to Tune Parallel Dense Linear Algebra Routines with Partial Execution and Performance Modeling
- Templates for linear algebra problems
- Special section: Grid computing and the message passing interface
- The art of computational science: Bridging gaps – forming alloys
- Science at the intersection of data, modelling, and computation
- Computational Science in the Interconnected World: Selected papers from 2019 International Conference on Computational Science
- Matrix multiplication on batches of small matrices in half and half-complex precisions
- Assessing the cost of redistribution followed by a computational kernel: Complexity and performance results
- Variable-size batched Gauss–Jordan elimination for block-Jacobi preconditioning on graphics processors
- Data through the Computational Lens, Preface for ICCS 2016
- The Art of Computational Science, Bridging Gaps – Forming Alloys. Preface for ICCS 2017
- Impacts of Multi-GPU MPI Collective Communications on Large FFT Computation
- Scalable linear algebra software libraries for distributed memory concurrent computers
- Request Sequencing: Enabling Workflow for Efficient Problem Solving in GridSolve
- Towards numerical benchmark for half-precision floating point arithmetic
- Parallel Simulation of Superscalar Scheduling
- Virtual Systolic Array for QR Decomposition
- Designing LU-QR Hybrid Solvers for Performance and Stability
- Tridiagonalization of a Symmetric Dense Matrix on a GPU Cluster
- New Algorithm for Computing Eigenvectors of the Symmetric Eigenvalue Problem
- Hessenberg Reduction with Transient Error Resilience on GPU-Based Hybrid Architectures
- Autotuning batch Cholesky factorization in CUDA with interleaved layout of matrices
- Autotuning Numerical Dense Linear Algebra for Batched Computation With GPU Hardware Accelerators
- Evaluating the Performance of NVIDIA’s A100 Ampere GPU for Sparse and Batched Computations
- Poster: Matrices over Runtime Systems at Exascale
- Deflation Strategies to Improve the Convergence of Communication-Avoiding GMRES
- Location-independent naming for virtual distributed software repositories
- BlackjackBench
- Technologies for repository interoperation and access control
- Mixed-precision block gram Schmidt orthogonalization
- Weighted dynamic scheduling with many parallelism grains for offloading of numerical workloads to multiple varied accelerators
- GPU-accelerated co-design of induced dimension reduction
- Least squares solvers for distributed-memory machines with GPU accelerators
- Massively Parallel Automated Software Tuning
- Preface to the special issue on the basic linear algebra subprograms (BLAS)
- Comparison of the CRAY X-MP-4, Fujitsu VP-200, and Hitachi S-810/20
- Numerical Libraries and Tools for Scalable Parallel Cluster Computing
- Recent Advances in Parallel Virtual Machine and Message Passing Interface
- Trace-based performance analysis for the petascale simulation code FLASH
- A survey of numerical linear algebra methods utilizing mixed-precision arithmetic
- Dense Linear Algebra for Hybrid GPU-Based Systems
- BLAS for GPUs
- Post-exascale supercomputing: research opportunities abound
- Reduction to condensed form for the eigenvalue problem on distributed memory architectures
- LU Factorization with Partial Pivoting for a Multi-CPU, Multi-GPU Shared Memory System
- Computing the eigenvalues and eigenvectors of a general matrix by reduction to general tridiagonal form
- Evaluation of dataflow programming models for electronic structure theory
- Trends in High-Performance Computing
- Eigenvalue Computation with NetSolve Global Computing System
- The Impact of Multicore on Math Software and Exploiting Single Precision Computing to Obtain Double Precision Results
- A Grid Computing Environment for Enabling Large Scale Quantum Mechanical Simulations
- Self-Adapting Software for Numerical Linear Algebra Library Routines on Clusters
- A proposal for a Fortran 90 interface for LAPACK
- Providing access to high performance computing technologies
- LAPACK
- TOP500
- Livermore Loops
- Speculation, Thread-Level
- Bulk Synchronous Parallelism (BSP)
- ParaMETIS
- Parallel Random Access Machines (PRAM)
- PGAS (Partitioned Global Address Space) Languages
- Parallel Ocean Program (POP)
- Parallel Skeletons
- Spiral
- SPMD Computational Model
- BSP (Bulk Synchronous Parallelism)
- Speculative Parallelization of Loops
- Trace Theory
- Titanium
- Brent’s Theorem
- Self-Healing Network for Scalable Fault Tolerant Runtime Environments
- High Performance Linear Algebra Package - LAPACK90
- Algorithm Design for Different Computer Architectures
- Overview of High Performance Computers
- Case studies on the development of ScaLAPACK and the NAG Numerical PVM Library
- Counter Inspection Toolkit: Making Sense Out of Hardware Performance Events
- Linear Systems Solvers for Distributed-Memory Machines with GPU Accelerators
- Self-adaptive Multiprecision Preconditioners on Multicore and Manycore Architectures
- Self-healing in Binomial Graph Networks
- An Overview of High Performance Computing and Challenges for the Future
- Impact of Kernel-Assisted MPI Communication over Scientific Applications: CPMD and FFTW
- Weighted Block-Asynchronous Iteration on GPU-Accelerated Systems
- Implementing a Systolic Algorithm for QR Factorization on Multicore Clusters with PaRSEC
- Block reduction of matrices to condensed forms for eigenvalue computations
- Special section: Applications of distributed and grid computing
- Fine-grained bit-flip protection for relaxation methods
- Foreword
- Preface
- Performance Analysis and Optimisation of Two-sided Factorization Algorithms for Heterogeneous Platform
- High Performance Computers and Algorithms From Linear Algebra
- The Design and Implementation of the Reduction Routines in ScaLAPACK
- Scheduling Block-Cyclic Array Redistribution
- National HPCC Software Exchange (NHSE)
- NanoPSE: Nanoscience Problem Solving Environment for atomistic electronic structure of semiconductor nanostructures
- A fast algorithm for the symmetric eigenvalue problem
- Sampling algorithms to update truncated SVD
- Variable-Size Batched Condition Number Calculation on GPUs
- Flexible Linear Algebra Development and Scheduling with Cholesky Factorization
- Performance analysis and acceleration of explicit integration for large kinetic networks using batched GPU computations
- Increasing Accuracy of Iterative Refinement in Limited Floating-Point Arithmetic on Half-Precision Accelerators
- Design, Optimization, and Benchmarking of Dense Linear Algebra Algorithms on AMD GPUs
- Mixed-Tool Performance Analysis on Hybrid Multicore Architectures
- Revisiting Matrix Product on Master-Worker Platforms
- Design and Implementation of a Large Scale Tree-Based QR Decomposition Using a 3D Virtual Systolic Array and a Lightweight Runtime
- Hybrid Multi-elimination ILU Preconditioners on GPUs
- Software-Defined Events through PAPI
- The design of scalable software libraries for distributed memory concurrent computers
- From High-Level Specification to High-Performance Code
- The 30th Anniversary of the Supercomputing Conference: Bringing the Future Closer—Supercomputing History and the Immortality of Now
- Evaluation of Programming Models to Address Load Imbalance on Distributed Multi-Core CPUs: A Case Study with Block Low-Rank Factorization
- Optimal Routing in Binomial Graph Networks
- Towards Half-Precision Computation for Complex Matrices: A Case Study for Mixed Precision Solvers on GPUs
- Parallel matrix transpose algorithms on distributed memory concurrent computers
- ScaLAPACK++: an object oriented linear algebra library for scalable systems
- Symmetric Indefinite Linear Solver Using OpenMP Task on Multicore Architectures
- Parallel Linear Algebra Software
- Poster
- Level-3 Cholesky Factorization Routines Improve Performance of Many Cholesky Algorithms
- Toward a scalable multi-GPU eigensolver via compute-intensive kernels and efficient communication
- Towards batched linear solvers on accelerated hardware platforms
- Flexible batched sparse matrix-vector product on GPUs
- Active netlib
- Computing Low-Rank Approximation of a Dense Matrix on Multicore CPUs with a GPU and Its Application to Solving a Hierarchically Semiseparable Linear System of Equations
- Accelerating NWChem Coupled Cluster through dataflow-based execution
- HPC Challenge: Design, History, and Implementation Highlights
- Implementing Matrix Multiplication on the Cell B. E.
- Checkpointing Strategies for Shared High-Performance Computing Platforms
- High Performance Computing (HPC) Challenge (HPCC) Benchmark Suite Development
- Coordinated Fault Tolerance for High-Performance Computing
- Architecture-Aware Algorithms for Scalable Performance and Resilience on Heterogeneous Architectures
- Matrix Algebra for GPU and Multicore Architectures (MAGMA) for Large Petascale Systems
- DOE Advanced Scientific Advisory Committee (ASCAC): Workforce Subcommittee Letter
- Improving the accuracy of computed matrix eigenvalues
- LINPACK working note number 3: Fortran BLAS timing
- High performance computing: Clusters, constellations, MPPs, and future directions
- TOP500 Supercomputers for June 2003
- Reliability and Performance Models for Grid Computing
- Benchmarks to supplant export "FPDR" calculations
- Message-Passing Software Systems
- High Performance Computing and Trends: Connecting Computational Requirements with Computing Resources
- High Performance Computing, Computational Grid, and Numerical Libraries
- LAPACK for Fortran90 compiler
- Changing technologies of HPC
- Block-cyclic array redistribution on networks of workstations
- High performance linear algebra package LAPACK90
- Hypergraph Partitioning
- Bitonic Sorting, Adaptive
- Spanning Tree, Minimum Weight
- Bioinformatics
- Logic Languages
- Sisal
- Transactional Memories
- Bitonic Sort
- Linear Algebra, Numerical
- Petri Nets
- Petascale Computer
- Shared-Memory Multiprocessors
- SPAI (SParse Approximate Inverse)
- Space-Filling Curves
- Linear Algebra Software
- ScaLAPACK
- PLASMA
- Benchmarks
- HPC Challenge Benchmark
- Linear Least Squares and Orthogonal Factorization
- Social Networks
- Layout, Array
- PASM Parallel Processing System
- Bandwidth-Latency Models (BSP, LogP)
- Banerjee’s Dependence Test
- Parallelization, Automatic
- SCI (Scalable Coherent Interface)
- Barnes-Hut
- Barriers
- Pi-Calculus
- Half Vector Length
- Laws
- Load Balancing
- Locks
- LU Factorization
- Parallel Prefix Algorithms
- Profiling
- Programming Languages
- Programming Models
- Scalability
- Scatter
- Semaphores
- Sequential Consistency
- SHMEM
- Special-Purpose Machines
- Speculation
- Speedup
- SSE
- Superlinear Speedup
- Symmetric Multiprocessors
- Tracing
- Locality of Reference and Parallel Processing
- BSP
- LogP Bandwidth-Latency Model
- Basic Linear Algebra Subprograms (BLAS)
- Prolog
- Lisp, Connection Machine
- HEP, Denelcor
- Singular-Value Decomposition (SVD)
- Haskell
- High Performance Fortran (HPF)
- Libraries, Numerical
- Profiling with OmpP, OpenMP
- Partitioned Global Address Space (PGAS) Languages
- Pthreads (POSIX Threads)
- Sparse Iterative Methods, Preconditioners for
- Tasks
- Threads
- Languages
- Scan, Reduce and
- Scalable Coherent Interface (SCI)
- System on Chip (SoC)
- Thread-Level Speculation
- Scheduling
- Task Mapping, Topology Aware
- Pentium
- Beowulf-Class Clusters
- Beowulf Clusters
- Linux Clusters
- PC Clusters
- Personalized All-to-All Exchange
- Transpose
- Thread Level Speculation (TLS) Parallelization
- Harmful Shared-Memory Access
- Single System Image
- Hierarchical Data Format
- Parallelization
- TAU Performance System®
- Tuning and Analysis Utilities
- Polyhedra Scanning
- Software Autotuning
- Performance Metrics
- Law of Diminishing Returns
- Strong Scaling
- Scaled Speedup
- Little’s Lemma
- Little’s Principle
- Little’s Result
- Little’s Theorem
- Brent’s Law
- Partitioning Tool for Hypergraphs (PaToH)
- Small-World Network Analysis and Partitioning (SNAP) Framework
- Sparse Gaussian Elimination
- Particle Dynamics
- Particle Methods
- System Integration
- Tera MTA
- Prefix
- Bitonic Sorting Network
- Linear Equations Solvers
- Place-Transition Nets
- Sparse Approximate Inverse Matrix
- Bisimilarity
- Bisimulation Equivalence
- Linear Regression
- Speculative Multithreading (SM)
- Speculative Parallelization
- Speculative Run-Time Parallelization
- Speculative Threading
- Speculative Thread-Level Parallelization
- Thread-Level Data Speculation (TLDS)
- TLS
- Hazard (in Hardware)
- HPF (High Performance Fortran)
- LBNL Climate Computer
- Tensilica
- Lock-Free Algorithms
- Parallel Communication Models
- HPS Microarchitecture
- Parallel Operating System
- Processor Allocation
- Parallelization, Basic Block
- Heterogeneous Element Processor
- Horizon
- Server Farm
- The High Performance Substrate
- Loop Nest Parallelization
- High-Level I/O Library
- Pnetcdf
- Blue CHiP Project
- Programmable Interconnect Computer
- TStreams
- State Space Search
- PRAM (Parallel Random Access Machines)
- SIMD Extensions
- SIMD ISA
- Performance Measurement
- Hang
- Stalemate
- Linda
- Process Synchronization
- High-Performance I/O
- Program Graphs
- HT3.10
- HT
- PCIe
- PCI-E
- PCI-Express
- Backpressure
- Butterfly
- Blue CHiP
- Hypercube
- Torus
- Logarithmic-Depth Sorting Network
- SPEC HPC2002
- SPEC HPC96
- SPEC MPI2007
- SPEC OMP2001
- Parallel I/O Library (PIO)
- Terrestrial Ecosystem Modeling
- Blue Gene/L
- Blue Gene/P
- Blue Gene/Q
- Petaflop Barrier
- Large-Scale Analytics
- Phylogenetic Inference
- Process Calculi
- Process Description Languages
- Position Tree
- Processor Arrays
- Systolic Architecture
- Bus: Shared Channel
- Point-to-Point Switch
- Shared Interconnect
- Shared-Medium Network
- Switched-Medium Network
- SIMD (Single Instruction, Multiple Data) Machines
- Thick Ethernet
- Thin Ethernet
- Promises
- LANai
- Partial Computation
- Theory of Mazurkiewicz-Traces
- Shared Virtual Memory
- Polytope Model
- Blocking
- Hyperplane Partitioning
- Preconditioners for Sparse Iterative Methods
- Loop Blocking
- Loop Tiling
- Supernode Partitioning
- Behavioral Relations
- Parallel Prefix Sums
- Prefix Reduction
- Stream Processing
- Total Exchange
- Problem Architectures
- Parallelization, Loop Nest
- Trace Scheduling
- Synchronization
- Performance Analysis Tools
- Loops, Parallel
- Periscope
- Polaris
- Topology Aware Task Mapping
- PLAPACK
- Peer-to-Peer
- Parallel Computing
- Superscalar Processors
- Path Expressions
- SIGMA-1
- Semantic Independence
- Broadcast
- Switching Techniques
- Switch Architecture
- HyperTransport
- PCI Express
- Pipelining
- Perfect Benchmarks
- SPEC Benchmarks
- Branch Predictors
- Terrestrial Ecosystem Carbon Modeling
- Homology to Sequence Alignment, From
- Hypercubes and Meshes
- PERCS System Architecture
- Latency Hiding
- Parafrase
- PARSEC Benchmarks
- Phylogenetics
- POSIX Threads (Pthreads)
- Processes, Tasks, and Threads
- Processors-in-Memory
- Process Algebras
- Prolog Machines
- Protein Docking
- PVM (Parallel Virtual Machine)
- Suffix Trees
- Systems Biology, Network Inference in
- Systolic Arrays
- Buses and Crossbars
- Transactions, Nested
- Software Distributed Shared Memory
- Power Wall
- SoC (System on Chip)
- Sorting
- Load Balancing, Distributed Memory
- Sparse Direct Methods
- Tiling
- Scan for Distributed Memory, Message-Passing Systems
- Parallelism Detection in Nested Loops, Optimal
- SWARM: A Parallel Programming Framework for Multicore Processors
- Stream Programming Languages
- Bernstein’s Conditions
- *Lisp
- Parallel Tools Platform
- PMPI Tools
- TAU
- Scalasca
- Scheduling Algorithms
- Little’s Law
- Hybrid Programming With SIMPLE
- PETSc (Portable, Extensible Toolkit for Scientific Computation)
- SPIKE
- PARDISO
- libflame
- SNAP (Small-World Network Analysis and Partitioning) Framework
- SuperLU
- Developing an Architecture to Support the Implementation and Development of Scientific Computing Applications
- A Look at the Evolution of Mathematical Software for Dense Matrix Problems over the Past Fifteen Years
- Do Moldable Applications Perform Better on Failure-Prone HPC Platforms?
- Hands-On Research and Training in High Performance Data Sciences, Data Analytics, and Machine Learning for Emerging Environments
- Integrating Deep Learning in Domain Sciences at Exascale
- Improving the Performance of the GMRES Method Using Mixed-Precision Techniques
- Heterogenous Acceleration for Linear Algebra in Multi-coprocessor Environments
- With Extreme Scale Computing the Rules Have Changed
- Techniques for Solving Large-Scale Graph Problems on Heterogeneous Platforms
- Accelerating the Conjugate Gradient Algorithm with GPUs in CFD Simulations
- High Performance Computing Trends, Supercomputers, Clusters and Grids
- Fault Tolerance in Message Passing and in Action
- Present and Future Supercomputer Architectures
- High Performance Computing Trends and Self Adapting Numerical Software
- A Scalable Non-blocking Multicast Scheme for Distributed DAG Scheduling
- Scalable Runtime for MPI: Efficiently Building the Communication Infrastructure
- Programming the LU Factorization for a Multicore System with Accelerators
- Evolution of the HPC Market
- Workshop 16: Performance evaluation and prediction
- Digital software and data repositories for support of scientific computing
- Constructing numerical software libraries for high-performance computing environments
- Industrial application areas of high-performance computing
- Deploying fault-tolerance and task migration with NetSolve
- Editorial introduction to the special issue on computational linear algebra and sparse matrix computations
- Enabling workflows in GridSolve: request sequencing and service trading
- Numerical linear algebra algorithms and software
- Selected numerical algorithms
- Special section: Cluster and computational grids for scientific computing
- Translational process: Mathematical software perspective
- 20 years of computational science: Selected papers from 2020 International Conference on Computational Science
- Preface
- Empowering Science through Computing, Preface for ICCS 2012
- Heterogeneous Network-Based Concurrent Computing Systems
- Scaling point set registration in 3D across thread counts on multicore and hardware accelerator platforms through autotuning for large scale analysis of scientific point clouds
- A Jaccard Weights Kernel Leveraging Independent Thread Scheduling on GPUs
- Proposal of MPI Operation Level Checkpoint/Rollback and One Implementation
- Using Arm Scalable Vector Extension to Optimize OPEN MPI
- Designing algorithms in linear algebra for different computer architectures
- HAN: a Hierarchical AutotuNed Collective Communication Framework
- Flexible Data Redistribution in a Task-Based Runtime System
- Fault tolerant matrix operations using checksum and reverse computation
- Optimizing GPU Kernels for Irregular Batch Workloads: A Case Study for Cholesky Factorization
- Progressive Optimization of Batched LU Factorization on GPUs
- Scalable Data Generation for Evaluating Mixed-Precision Solvers
- Architecture-aware Algorithms and Software for Peta and Exascale Computing
- IPDPS 2011 Tuesday 25th Year Panel - Looking back
- Bidiagonalization and R-Bidiagonalization: Parallel Tiled Algorithms, Critical Paths and Distributed-Memory Implementation
- Matrix Powers Kernels for Thick-Restart Lanczos with Explicit External Deflation
- Leveraging PaRSEC Runtime Support to Tackle Challenging 3D Data-Sparse Matrix Problems
- Distributed-memory multi-GPU block-sparse tensor contraction for electronic structure
- HCW 2013 Keynote Talk
- EduPar Keynote
- Communication Avoiding 2D Stencil Implementations over PaRSEC Task-Based Runtime
- Asynchronous SGD for DNN training on Shared-memory Parallel Architectures
- Revisiting Credit Distribution Algorithms for Distributed Termination Detection
- Structure-Aware Linear Solver for Realtime Convex Optimization for Embedded Systems
- Optimal Routing in Binomial Graph Networks
- Abstract: Matrices Over Runtime Systems at Exascale
- Abstract: A Novel Hybrid CPU-GPU Generalized Eigensolver for Electronic Structure Calculations Based on Fine Grained Memory Aware Tasks
- Poster: A Novel Hybrid CPU-GPU Generalized Eigensolver for Electronic Structure Calculations Based on Fine Grained Memory Aware Tasks
- Replacing Pivoting in Distributed Gaussian Elimination with Randomized Techniques
- High-Order Finite Element Method using Standard and Device-Level Batch GEMM on GPUs
- Accelerating Geostatistical Modeling and Prediction With Mixed-Precision Computations: A High-Productivity Approach with PaRSEC
- Accelerating Restarted GMRES with Mixed Precision Arithmetic
- A New Approach to Scientific Computation (Ulrich W. Kulisch and Willard L. Miranker, eds.)
- Changes in Dense Linear Algebra Kernels: Decades-Long Perspective
- Recent trends in high performance computing
- Revisiting Matrix Product on Master-Worker Platforms
- Preface
- Improved Runtime and Transfer Time Prediction Mechanisms in a Network Enabled Servers Middleware
- Guest Editors Note
- Guest Editors' Note: Special Issue on Clusters, Clouds, and Data for Scientific Computing
- Design and Implementation of a Large Scale Tree-Based QR Decomposition Using a 3D Virtual Systolic Array and a Lightweight Runtime
- Guest Editors’ Note: Special Issue on Clusters, Clouds and Data for Scientific Computing
- An update notice on the level 3 BLAS
- HPC challenge: The 2006 HPC challenge awards
- Poster reception: Targeting multi-core architectures for linear algebra applications
- LAPACK is now available
- BlackjackBench
- Panel
- Distributed information management in the National HPCC Software Exchange
- With Extreme Scale Computing the Rules Have Changed
- Using Advanced Vector Extensions AVX-512 for MPI Reductions
- A Set of Batched Basic Linear Algebra Subprograms and LAPACK Routines
- Active netlib
- PDS: A Performance Database Server
- Preface
- Preface
- Special Issue on Tools in the ACTS Collection 2004
- Preface
- Editorial
- Selected papers of the Workshop on Clusters, Clouds and Grids for Scientific Computing (CCGSC)
- Introduction to the Special Issue
- Introduction for August Special Issue CCDSC
- Guest Editor’s Note: Special Issue on Clusters, Clouds and Data for Scientific Computing
- Guest editors’ note
- Guest editors’ note: Special issue on clusters, clouds, and data for scientific computing
- MAGMA templates for scalable linear algebra on emerging architectures
- Efficient exascale discretizations: High-order finite element methods
- Algorithm Design for Large-Scale Computations
- Book Reviews: The Connection Machine
- Advanced Computing Research Facility, Mathematics and Computer Science Division, Argonne National Laboratory
- Editorial
- Preface To the Special Issue
- Clusters and Computational Grids for Scientific Computing
- The Semantic Conference Organizer
- Advanced Architecture Computers
- Graphics tools for developing high-performance algorithms
- Keeneland: Computational Science Using Heterogeneous GPU Computing
- BLAS
- Lapack
- Summary of Software for Linear Algebra Freely Available on the Web
- Lapack
- Blas
- Prospectus for a Dense Linear Algebra Software Library
- Disaster Survival Guide in Petascale Computing
- Implementing Matrix Factorizations on the Cell B. E.
- BLAS
- LAPACK
- Summary of Software for Linear Algebra Freely Available on the Web
- Transparent Cross-Platform Access to Software Services Using GridSolve and GridRPC
- Task-based Cholesky decomposition on Xeon Phi architectures using OpenMP
- Task based Cholesky decomposition on Xeon Phi architectures using OpenMP
- Evaluation of directive-based performance portable programming models
- Software Distribution Using XNetlib.
- Benchmarking and Analysis of High Productivity Computing (HPCS)
- High Productivity Computing Systems (HPCS) Library Study Effort
- Blackjack
- Netlib services and resources
- Software distribution using xnetlib
- Predicting the Electronic Properties of 3D, Million-atom Semiconductor nanostructure Architectures
- Minimizing System Noise Effects For Extreme-Scale Scientific Simulation Through Function Delegation
- Institute for Sustained Performance, Energy, and Resilience (SuPER)
- Extreme-scale Algorithms and Solver Resilience
- PaRSEC: A Software Framework for Performance and Productivity on Hybrid, Manycore Platforms
- ASCR@40: Highlights and Impacts of ASCR’s Programs.
- LINPACK working note No. 9: preliminary LINPACK user's guide. [In FORTRAN]
- Numerical considerations in computing invariant subspaces
- LINPACK working note No. 13: implementation guide for LINPACK
- TOP500 Supercomputers for June 2005
- TOP500 Supercomputers for November 2004
- TOP500 Supercomputers for November 2003
- TOP500 Supercomputers for November 2002
- TOP500 Supercomputers for June 2004
- TOP500 Sublist for November 2001
- 17th Edition of TOP500 List of World's Fastest Supercomputers Released
- Project-Based Research and Training in High Performance Data Sciences, Data Analytics, and Machine Learning
- LINPACK User's Guide.
- Parallel Processing for Scientific Computing.
- Solving Linear Systems on Vector and Shared Memory Computers.
- Parallel Processing for Scientific Computing.
- Reliability and Performance Models for Grid Computing
- Linear algebra - software issues

University of Manchester

Public research university in Manchester, England

Illinois Institute of Technology

American university

University of Tennessee

Public university in Knoxville, Tennessee, United States

Rice University

University in Houston, Texas, USA

University of New Mexico

Public research university in Albuquerque, New Mexico, United States

#230 World Rank

Computer Science

#863 World Rank

Mathematics

#1076 World Rank

Engineering
