Matei Zaharia

Q: What Schools Are Affiliated With Matei Zaharia?

Matei Zaharia is affiliated with the following schools: University of California, Berkeley; University of Waterloo; Stanford University; Massachusetts Institute of Technology.

Matei Zaharia's AcademicInfluence.com Rankings

Computer Science: World Rank #526; Historical Rank #546
Big Data: World Rank (not shown); Historical Rank (not shown)
Cloud Computing: World Rank (not shown); Historical Rank (not shown)
Database: World Rank #630; Historical Rank #660

Why Is Matei Zaharia Influential?

According to Wikipedia, Matei Zaharia is a Romanian-Canadian computer scientist, educator, and the creator of Apache Spark. As of April 2022, Forbes ranked him and Ion Stoica as the third-richest people in Romania, with a net worth of $1.6 billion.

Matei Zaharia's Published Works

[Chart: number of citations in a given year to any of this author's works]

[Chart: total number of citations to the works this author published in a given year, highlighting the author's most important publications]

Published Works (title, publication year, citation count)

A view of cloud computing (2010) (9961)
Above the Clouds: A Berkeley View of Cloud Computing (2009) (7317)
Spark: Cluster Computing with Working Sets (2010) (5190)
Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing (2012) (4272)
Improving MapReduce Performance in Heterogeneous Environments (2008) (1905)
Mesos: A Platform for Fine-Grained Resource Sharing in the Data Center (2011) (1828)
MLlib: Machine Learning in Apache Spark (2015) (1637)
Delay scheduling: a simple technique for achieving locality and fairness in cluster scheduling (2010) (1588)
Spark SQL: Relational Data Processing in Spark (2015) (1249)
Apache Spark (2016) (1246)
Dominant Resource Fairness: Fair Allocation of Multiple Resource Types (2011) (1185)
Discretized streams: fault-tolerant streaming computation at scale (2013) (1075)
On the Opportunities and Risks of Foundation Models (2021) (938)
Apache Spark: a unified engine for big data processing (2016) (740)
Managing data transfers in computer clusters with orchestra (2011) (629)
Sparrow: distributed, low latency scheduling (2013) (629)
ColBERT: Efficient and Effective Passage Search via Contextualized Late Interaction over BERT (2020) (534)
Discretized Streams: An Efficient and Fault-Tolerant Model for Stream Processing on Large Clusters (2012) (532)
Shark: SQL and rich analytics at scale (2012) (467)
A cloud-compatible bioinformatics pipeline for ultrarapid pathogen identification from next-generation sequencing of clinical samples (2014) (412)
Job Scheduling for Multi-User MapReduce Clusters (2009) (403)
PipeDream: generalized pipeline parallelism for DNN training (2019) (393)
Tachyon: Reliable, Memory Speed Storage for Cluster Computing Frameworks (2014) (354)
Learning Spark: Lightning-Fast Big Data Analytics (2015) (325)
Beyond Data and Model Parallelism for Deep Neural Networks (2018) (315)
Low-cost communication for rural internet kiosks using mechanical backhaul (2006) (284)
DAWNBench: An End-to-End Deep Learning Benchmark and Competition (2017) (272)
Faster and More Accurate Sequence Alignment with SNAP (2011) (259)
NoScope: Optimizing Neural Network Queries over Video at Scale (2017) (251)
Vuvuzela: scalable private messaging resistant to traffic analysis (2015) (240)
Multi-resource fair queueing for packet processing (2012) (210)
Accelerating the Machine Learning Lifecycle with MLflow (2018) (204)
Choosy: max-min fair sharing for datacenter jobs with constraints (2013) (193)
MLPerf Training Benchmark (2019) (189)
Fast and Interactive Analytics over Hadoop Data with Spark (2012) (175)
TASO: optimizing deep learning computation with automatic generation of graph substitutions (2019) (167)
Shark: fast data analysis using coarse-grained distributed memory (2012) (149)
Structured Streaming: A Declarative API for Real-Time Applications in Apache Spark (2018) (145)
NoScope: Optimizing Deep CNN-Based Queries over Video Streams at Scale (2017) (143)
From Laptop to Lambda: Outsourcing Everyday Jobs to Thousands of Transient Functional Containers (2019) (139)
Improving the Accuracy, Scalability, and Performance of Graph Neural Networks with Roc (2020) (129)
Sparse GPU Kernels for Deep Learning (2020) (117)
Scaling Spark in the Real World: Performance and Usability (2015) (110)
Weld: A Common Runtime for High Performance Data Analytics (2016) (108)
Stadium: A Distributed Metadata-Private Messaging System (2017) (108)
Voodoo - A Vector Algebra for Portable Database Performance on Modern Hardware (2016) (107)
Selection Via Proxy: Efficient Data Selection For Deep Learning (2019) (104)
Very low-cost internet access using KioskNet (2007) (100)
Above the Clouds: A View of Cloud Computing (2009) (100)
ColBERTv2: Effective and Efficient Retrieval via Lightweight Late Interaction (2021) (97)
Analysis of DAWNBench, a Time-to-Accuracy Machine Learning Performance Benchmark (2018) (97)
Matrix Computations and Optimization in Apache Spark (2015) (95)
Making caches work for graph analytics (2016) (93)
Efficient Large-Scale Language Model Training on GPU Clusters Using Megatron-LM (2021) (93)
GraphFrames: an integrated API for mixing graph and relational queries (2016) (91)
Resilient Distributed Datasets (2016) (91)
Heterogeneity-Aware Cluster Scheduling Policies for Deep Learning Workloads (2020) (89)
Splinter: Practical Private Queries on Public Data (2017) (86)
An Architecture for Fast and General Data Processing on Large Clusters (2016) (82)
Memory-Efficient Pipeline-Parallel DNN Training (2020) (80)
Gossip‐based search selection in hybrid peer‐to‐peer networks (2008) (77)
Scaling the mobile millennium system in the cloud (2011) (77)
Design and implementation of the KioskNet system (2007) (72)
BlazeIt: Optimizing Declarative Aggregation and Limit Queries for Neural Network-Based Video Analytics (2018) (71)
Evaluating End-to-End Optimization for Data Analytics Applications in Weld (2018) (68)
SparkR: Scaling R Programs with Spark (2016) (67)
FairRide: Near-Optimal, Fair Cache Sharing (2016) (61)
ObliDB: Oblivious Query Processing for Secure Databases (2017) (58)
Relevance-guided Supervision for OpenQA with ColBERT (2020) (56)
Efficient Large-Scale Language Model Training on GPU Clusters (2021) (56)
Optimizing DNN Computation with Relaxed Graph Substitutions (2019) (55)
Discretized Streams: A Fault-Tolerant Model for Scalable Stream Processing (2012) (55)
Provenance Analysis for Missing Answers and Integrity Repairs (2018) (53)
MISTIQUE: A System to Store and Query Model Intermediates for Model Diagnosis (2018) (53)
DIFF (2018) (51)
Advances, challenges and opportunities in creating data for trustworthy AI (2022) (47)
Cloud Terminal: Secure Access to Sensitive Applications from Untrusted Systems (2012) (46)
LIT: Learned Intermediate Representation Training for Model Compression (2019) (45)
Spark: The Definitive Guide: Big Data Processing Made Simple (2018) (45)
Filter Before You Parse: Faster Analytics on Raw Data with Sparser (2018) (45)
ICTD for healthcare in Ghana: Two parallel case studies (2009) (40)
A Common Substrate for Cluster Computing (2009) (40)
The Datacenter Needs an Operating System (2011) (39)
BlazeIt: Fast Exploratory Video Queries using Neural Networks (2018) (39)
Developments in MLflow: A System to Accelerate the Machine Learning Lifecycle (2020) (38)
Express: Lowering the Cost of Metadata-hiding Communication with Cryptographic Privacy (2019) (38)
Model Assertions for Monitoring and Improving ML Models (2020) (37)
Accelerating Deep Learning Workloads Through Efficient Multi-Model Execution (2018) (37)
A Common Runtime for High Performance Data Analysis (2017) (37)
Large-Scale Estimation in Cyberphysical Systems Using Streaming Data: A Case Study With Arterial Traffic Estimation (2013) (36)
Lakehouse: A New Generation of Open Platforms that Unify Data Warehousing and Advanced Analytics (2021) (35)
Finding Content in File-Sharing Networks When You Can't Even Spell (2007) (35)
Model Assertions for Debugging Machine Learning (2018) (35)
An Oblivious General-Purpose SQL Database for the Cloud (2017) (33)
DIFF: a relational interface for large-scale data explanation (2018) (31)
Challenges and Opportunities in DNN-Based Video Analytics: A Demonstration of the BlazeIt Video Query Engine (2019) (30)
Infrastructure for Usable Machine Learning: The Stanford DAWN Project (2017) (27)
Tachyon: Memory Throughput I/O for Cluster Computing Frameworks (2013) (27)
Jointly Optimizing Preprocessing and Inference for DNN-based Visual Analytics (2020) (26)
Machine Learning to Classify Intracardiac Electrical Patterns During Atrial Fibrillation (2020) (26)
Yggdrasil: An Optimized System for Training Deep Decision Trees at Scale (2016) (24)
Reliable, Memory Speed Storage for Cluster Computing Frameworks (2014) (23)
PipeDream (2019) (23)
Arthur: Rich Post-Facto Debugging for Production Analytics Applications (2013) (22)
MLSys: The New Frontier of Machine Learning Systems (2019) (21)
ColBERT (2020) (21)
Hindsight: Posterior-guided training of retrievers for improved open-ended generation (2021) (21)
Sparrow: Scalable Scheduling for Sub-Second Parallel Jobs (2013) (20)
Fast and Optimal Scheduling Over Multiple Network Interfaces (2007) (20)
Introduction to Spark 2.0 for Database Researchers (2016) (20)
SysML: The New Frontier of Machine Learning Systems (2019) (20)
Dominant Resource Fairness: Fair Allocation of Heterogeneous Resources in Datacenters (2010) (19)
To Index or Not to Index: Optimizing Exact Maximum Inner Product Search (2017) (19)
Nexus: A Common Substrate for Cluster Computing (2009) (18)
Fleet: A Framework for Massively Parallel Streaming on FPGAs (2020) (18)
Weld: Rethinking the Interface Between Data-Intensive Applications (2017) (18)
Baleen: Robust Multi-Hop Reasoning at Scale via Condensed Retrieval (2021) (17)
ObliDB: Oblivious Query Processing using Hardware Enclaves (2017) (16)
Privacy Preserving Ranked Multi-Keyword Search for Multiple Data Owners in Cloud Computing (2016) (16)
Task-agnostic Indexes for Deep Learning-based Queries over Unstructured Data (2020) (16)
FrugalML: How to Use ML Prediction APIs More Accurately and Cheaply (2020) (16)
Contracting Wide-area Network Topologies to Solve Flow Problems Quickly (2020) (15)
What can Data-Centric AI Learn from Data and ML Engineering? (2021) (15)
Solving Large-Scale Granular Resource Allocation Problems Efficiently with POP (2021) (15)
PLAID: An Efficient Engine for Late Interaction Retrieval (2022) (15)
A Policy-Oriented Architecture for Opportunistic Communication on Multiple Wireless Networks (2006) (14)
POSH: A Data-Aware Shell (2020) (14)
Optimally Designing Games for Cognitive Science Research (2012) (14)
Willump: A Statistically-Aware End-to-end Optimizer for Machine Learning Inference (2019) (13)
Demonstrate-Search-Predict: Composing retrieval and language models for knowledge-intensive NLP (2022) (13)
Machine Learned Cellular Phenotypes Predict Outcome in Ischemic Cardiomyopathy (2020) (13)
Machine Learned Cellular Phenotypes in Cardiomyopathy Predict Sudden Death (2020) (12)
Analysis and Exploitation of Dynamic Pricing in the Public Cloud for ML Training (2020) (12)
LIT: Block-wise Intermediate Representation Training for Model Compression (2018) (12)
Breakfast of champions: towards zero-copy serialization with NIC scatter-gather (2021) (12)
Optimizing data-intensive computations in existing libraries with split annotations (2018) (12)
Approximate selection with guarantees using proxies (2020) (12)
DBOS: A Proposal for a Data-Centric Operating System (2020) (11)
Optimizing Cache Performance for Graph Analytics (2016) (11)
Batch Sampling: Low Overhead Scheduling for Sub-Second Parallel Jobs (2012) (10)
A Thunk to Remember: make -j1000 (and other jobs) on functions-as-a-service infrastructure (2017) (10)
Gossip-based search selection in hybrid peer-to-peer networks: Research Articles (2008) (10)
Performance and Scalability of Broadcast in Spark (2010) (9)
Similarity Search for Efficient Active Learning and Search of Rare Concepts (2020) (9)
DBOS: A DBMS-oriented Operating System (2021) (9)
Select Via Proxy: Efficient Data Selection For Training Deep Networks (2018) (9)
Hypervisors as a Foothold for Personal Computer Security: An Agenda for the Research Community (2012) (8)
Optimally designing games for behavioural research (2014) (7)
Overlook: Differentially Private Exploratory Visualization for Big Data (2020) (7)
Did the Model Change? Efficiently Assessing Machine Learning API Shifts (2021) (7)
Photon: A Fast Query Engine for Lakehouse Systems (2022) (7)
Offload Annotations: Bringing Heterogeneous Computing to Existing Libraries and Workloads (2020) (7)
Large-Scale Online Expectation Maximization with Spark Streaming (2012) (7)
linalg: Matrix Computations in Apache Spark (2015) (6)
Enabling Innovation Below the Communication API (2009) (6)
DIY Hosting for Online Privacy (2017) (6)
TASO (2019) (6)
Design and Implementation of the KioskNet System (Extended Version) (2007) (6)
Future Directions for Parallel and Distributed Computing: SPX 2019 Workshop Report (2019) (5)
FrugalMCT: Efficient Online ML API Selection for Multi-Label Classification Tasks (2021) (5)
Challenges and Opportunities for Autonomous Vehicle Query Systems (2021) (5)
Large Scale Estimation in Cyberphysical Systems using Streaming Data: a Case Study with Smartphone Traces (2012) (5)
Adaptive Peer-to-Peer Search (2004) (5)
Lessons from Large-Scale Software as a Service at Databricks (2019) (5)
VIVA: An End-to-End System for Interactive Video Analytics (2022) (4)
Estimating and Explaining Model Performance When Both Covariates and Labels Shift (2022) (4)
Mesos: Flexible Resource Sharing for the Cloud (2011) (4)
Extricating IoT Devices from Vendor Infrastructure with Karl (2022) (4)
Allocation of fungible resources via a fast, scalable price discovery method (2021) (4)
A demonstration of willump (2019) (4)
Exploiting Programmatic Behavior of LLMs: Dual-Use Through Standard Security Attacks (2023) (3)
Finding Label and Model Errors in Perception Data With Learned Observation Assertions (2022) (3)
Outsourcing Everyday Jobs to Thousands of Cloud Functions with gg (2019) (3)
What's Changing in Big Data? (2016) (3)
ObliDB (2019) (3)
Abstract 14675: Developing Convolutional Neural Networks for Deep Learning of Ventricular Action Potentials to Predict Risk for Ventricular Arrhythmias (2019) (2)
An Efficient Oblivious Database for the Public Cloud (2017) (2)
MegaBlocks: Efficient Sparse Training with Mixture-of-Experts (2022) (2)
Concurrency Control (2009) (2)
A Progress Report on DBOS: A Database-oriented Operating System (2022) (2)
SiRen: Leveraging Similar Regions for Efficient and Accurate Variant Calling (2015) (2)
Apiary: A DBMS-Backed Transactional Function-as-a-Service Framework (2022) (2)
TASTI: Semantic Indexes for Machine Learning-based Queries over Unstructured Data (2020) (2)
SiRen: Leveraging Similar Regions for Efficient & Accurate Variant Calling (2015) (2)
Analysis and Simulation of a System for Free-Text Search in Peer-to-Peer Networks (2004) (2)
A Polystore Based Database Operating System (DBOS) (2020) (1)
Distributed Pipeline for Genomic Variant Calling (2012) (1)
Spectral Lower Bounds on the I/O Complexity of Computation Graphs (2019) (1)
Splitability Annotations: Optimizing Black-Box Function Composition in Existing Libraries (2018) (1)
Fleet (2020) (1)
Moving Beyond Downstream Task Accuracy for Information Retrieval Benchmarking (2022) (1)
A New Communications API (2009) (1)
designing games for behavioural research (2014) (1)
HAPI: A Large-scale Longitudinal Dataset of Commercial ML API Predictions (2022) (1)
Finding Label Errors in Autonomous Vehicle Data With Learned Observation Assertions (2021) (1)
BlazeIt: An Optimizing Query Engine for Video at Scale (Extended Abstract) (2018) (1)
Data-Parallel Actors: A Programming Model for Scalable Query Serving Systems (2022) (1)
BlazeIt (2019) (1)
Proof: Accelerating Approximate Aggregation Queries with Expensive Predicates (2021) (1)
DistIR: An Intermediate Representation and Simulator for Efficient Neural Network Distribution (2021) (1)
Don't Give Up on Large Optimization Problems; POP Them! (2021) (1)
Optimizing Video Analytics with Declarative Model Relationships (2022) (1)
Fuzzy set intersection based paired-end short-read alignment (2021) (1)
Tachyon (2019) (1)
Author Correction: Advances, challenges and opportunities in creating data for trustworthy AI (2022) (1)
Boosting the Performance of MapReduce by Better Resource Utilization in Cluster (2020) (0)
Optimally Designing Games for Cognitive Science Research - eScholarship (2012) (0)
Clamor: Extending Functional Cluster Computing Frameworks with Fine-Grained Remote Memory Access (2021) (0)
Weld: Rethinking the Interface Between Data-Intensive Libraries (2017) (0)
Cloud Data Systems: What are the Opportunities for the Database Research Community? (2022) (0)
Exploring the Use of Learning Algorithms for Efficient Performance Profiling (2018) (0)
Research Statement Data Analytics Systems (2013) (0)
source computing framework unifies streaming, batch, and interactive big data workloads to unlock new applications (2016) (0)
Improving Hybrid Keyword-Based Search (0)
Interface to 'MLflow' [R package mlflow version 1.10.0] (2020) (0)
B-PO01-081 MACHINE LEARNING OF THE ELECTROCARDIOGRAM IDENTIFIES CARDIAC WALL MOTION ABNORMALITIES BEYOND THE Q WAVE (2021) (0)
Transactions Make Debugging Easy (2022) (0)
Spark SQL Resilient Distributed Datasets Spark JDBC Console User Programs (Java, Scala, Python) Catalyst Optimizer DataFrame API (2015) (0)
Don't Hate the Player, Hate the Game: Safety and Utility in Multi-Agent Congestion Control (2021) (0)
DIFF: a relational interface for large-scale data explanation (2020) (0)
Designing Production-Friendly Machine Learning (2021) (0)
Parallelism-Optimizing Data Placement for Faster Data-Parallel Computations (2022) (0)
Lecturer: Alistair Sinclair Based on scribe notes by: (0)
Snippet from the Black Scholes options pricing benchmark implemented using Intel MKL (2019) (0)
plaid (2022) (0)
Models Built over RDDs (2016) (0)
Willump: Statistically-Aware Optimizations for Fast Machine Learning Inference (2019) (0)
To Index or Not to Index: Optimizing Maximum Inner Product Search (2018) (0)
SimDex: Exploiting Model Similarity in Exact Matrix Factorization Recommendations (2017) (0)
Automated Lower Bounds on the I/O Complexity of Computation Graphs (2019) (0)
A proposal for adaptive routing using neural prediction in a distributed environment (2001) (0)
Generality of RDDs (2016) (0)
Analysis of the Time-To-Accuracy Metric and Entries in the DAWNBench Deep Learning Benchmark (2018) (0)
Cloud data systems (2022) (0)
PPMLP 2020: Workshop on Privacy-Preserving Machine Learning In Practice (2020) (0)
A MORPHOLOGICAL OPERATION-BASED APPROACH TO AUTOMATICALLY SEPARATE AND LABEL LEFT ATRIUM BODY AND PULMONARY VEINS (2022) (0)
Clamor (2021) (0)
Toward Compact Parameter Representations for Architecture-Agnostic Neural Network Compression (2021) (0)
Big Data Platforms for Data Analytics (2018) (0)
What is MapReduce used for? (2012) (0)
GraphFrames (2016) (0)
DBOS (2021) (0)
Abstract 311: Machine Learning of the Electrocardiogram to Detect Regional Structural Abnormalities of the Heart (2020) (0)
PREDICTING SUDDEN CARDIAC DEATH BY MACHINE LEARNING OF VENTRICULAR ACTION POTENTIALS (2020) (0)
BIGGIE: A Distributed Pipeline for Genomic Variant Calling (2012) (0)
Overlook: DIFFERENTIALLY PRIVATE EXPLORATORY VISUALIZATION (2022) (0)