Matei Zaharia
#5,542 Most Influential Person Now
Romanian-Canadian computer scientist and engineer
Matei Zaharia's AcademicInfluence.com Rankings
Matei Zaharia's Computer Science Degrees
- Computer Science: World Rank #526, Historical Rank #546
- Big Data: World Rank #4, Historical Rank #4
- Cloud Computing: World Rank #6, Historical Rank #6
- Database: World Rank #630, Historical Rank #660
Why Is Matei Zaharia Influential?
According to Wikipedia, Matei Zaharia is a Romanian-Canadian computer scientist, educator, and the creator of Apache Spark. As of April 2022, Forbes ranked him and Ion Stoica as the 3rd-richest people in Romania with a net worth of $1.6 billion.
Matei Zaharia's Published Works
- A view of cloud computing (2010) (9961)
- Above the Clouds: A Berkeley View of Cloud Computing (2009) (7317)
- Spark: Cluster Computing with Working Sets (2010) (5190)
- Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing (2012) (4272)
- Improving MapReduce Performance in Heterogeneous Environments (2008) (1905)
- Mesos: A Platform for Fine-Grained Resource Sharing in the Data Center (2011) (1828)
- MLlib: Machine Learning in Apache Spark (2015) (1637)
- Delay scheduling: a simple technique for achieving locality and fairness in cluster scheduling (2010) (1588)
- Spark SQL: Relational Data Processing in Spark (2015) (1249)
- Apache Spark (2016) (1246)
- Dominant Resource Fairness: Fair Allocation of Multiple Resource Types (2011) (1185)
- Discretized streams: fault-tolerant streaming computation at scale (2013) (1075)
- On the Opportunities and Risks of Foundation Models (2021) (938)
- Apache Spark: a unified engine for big data processing (2016) (740)
- Managing data transfers in computer clusters with orchestra (2011) (629)
- Sparrow: distributed, low latency scheduling (2013) (629)
- ColBERT: Efficient and Effective Passage Search via Contextualized Late Interaction over BERT (2020) (534)
- Discretized Streams: An Efficient and Fault-Tolerant Model for Stream Processing on Large Clusters (2012) (532)
- Shark: SQL and rich analytics at scale (2012) (467)
- A cloud-compatible bioinformatics pipeline for ultrarapid pathogen identification from next-generation sequencing of clinical samples (2014) (412)
- Job Scheduling for Multi-User MapReduce Clusters (2009) (403)
- PipeDream: generalized pipeline parallelism for DNN training (2019) (393)
- Tachyon: Reliable, Memory Speed Storage for Cluster Computing Frameworks (2014) (354)
- Learning Spark: Lightning-Fast Big Data Analytics (2015) (325)
- Beyond Data and Model Parallelism for Deep Neural Networks (2018) (315)
- Low-cost communication for rural internet kiosks using mechanical backhaul (2006) (284)
- DAWNBench: An End-to-End Deep Learning Benchmark and Competition (2017) (272)
- Faster and More Accurate Sequence Alignment with SNAP (2011) (259)
- NoScope: Optimizing Neural Network Queries over Video at Scale (2017) (251)
- Vuvuzela: scalable private messaging resistant to traffic analysis (2015) (240)
- Multi-resource fair queueing for packet processing (2012) (210)
- Accelerating the Machine Learning Lifecycle with MLflow (2018) (204)
- Choosy: max-min fair sharing for datacenter jobs with constraints (2013) (193)
- MLPerf Training Benchmark (2019) (189)
- Fast and Interactive Analytics over Hadoop Data with Spark (2012) (175)
- TASO: optimizing deep learning computation with automatic generation of graph substitutions (2019) (167)
- Shark: fast data analysis using coarse-grained distributed memory (2012) (149)
- Structured Streaming: A Declarative API for Real-Time Applications in Apache Spark (2018) (145)
- NoScope: Optimizing Deep CNN-Based Queries over Video Streams at Scale (2017) (143)
- From Laptop to Lambda: Outsourcing Everyday Jobs to Thousands of Transient Functional Containers (2019) (139)
- Improving the Accuracy, Scalability, and Performance of Graph Neural Networks with Roc (2020) (129)
- Sparse GPU Kernels for Deep Learning (2020) (117)
- Scaling Spark in the Real World: Performance and Usability (2015) (110)
- Weld: A Common Runtime for High Performance Data Analytics (2016) (108)
- Stadium: A Distributed Metadata-Private Messaging System (2017) (108)
- Voodoo - A Vector Algebra for Portable Database Performance on Modern Hardware (2016) (107)
- Selection Via Proxy: Efficient Data Selection For Deep Learning (2019) (104)
- Very low-cost internet access using KioskNet (2007) (100)
- Above the Clouds: A View of Cloud Computing (2009) (100)
- ColBERTv2: Effective and Efficient Retrieval via Lightweight Late Interaction (2021) (97)
- Analysis of DAWNBench, a Time-to-Accuracy Machine Learning Performance Benchmark (2018) (97)
- Matrix Computations and Optimization in Apache Spark (2015) (95)
- Making caches work for graph analytics (2016) (93)
- Efficient Large-Scale Language Model Training on GPU Clusters Using Megatron-LM (2021) (93)
- GraphFrames: an integrated API for mixing graph and relational queries (2016) (91)
- Resilient Distributed Datasets (2016) (91)
- Heterogeneity-Aware Cluster Scheduling Policies for Deep Learning Workloads (2020) (89)
- Splinter: Practical Private Queries on Public Data (2017) (86)
- An Architecture for Fast and General Data Processing on Large Clusters (2016) (82)
- Memory-Efficient Pipeline-Parallel DNN Training (2020) (80)
- Gossip‐based search selection in hybrid peer‐to‐peer networks (2008) (77)
- Scaling the mobile millennium system in the cloud (2011) (77)
- Design and implementation of the KioskNet system (2007) (72)
- BlazeIt: Optimizing Declarative Aggregation and Limit Queries for Neural Network-Based Video Analytics (2018) (71)
- Evaluating End-to-End Optimization for Data Analytics Applications in Weld (2018) (68)
- SparkR: Scaling R Programs with Spark (2016) (67)
- FairRide: Near-Optimal, Fair Cache Sharing (2016) (61)
- ObliDB: Oblivious Query Processing for Secure Databases (2017) (58)
- Relevance-guided Supervision for OpenQA with ColBERT (2020) (56)
- Efficient Large-Scale Language Model Training on GPU Clusters (2021) (56)
- Optimizing DNN Computation with Relaxed Graph Substitutions (2019) (55)
- Discretized Streams: A Fault-Tolerant Model for Scalable Stream Processing (2012) (55)
- Provenance Analysis for Missing Answers and Integrity Repairs. (2018) (53)
- MISTIQUE: A System to Store and Query Model Intermediates for Model Diagnosis (2018) (53)
- DIFF (2018) (51)
- Advances, challenges and opportunities in creating data for trustworthy AI (2022) (47)
- Cloud Terminal: Secure Access to Sensitive Applications from Untrusted Systems (2012) (46)
- LIT: Learned Intermediate Representation Training for Model Compression (2019) (45)
- Spark: The Definitive Guide: Big Data Processing Made Simple (2018) (45)
- Filter Before You Parse: Faster Analytics on Raw Data with Sparser (2018) (45)
- ICTD for healthcare in Ghana: Two parallel case studies (2009) (40)
- A Common Substrate for Cluster Computing (2009) (40)
- The Datacenter Needs an Operating System (2011) (39)
- BlazeIt: Fast Exploratory Video Queries using Neural Networks (2018) (39)
- Developments in MLflow: A System to Accelerate the Machine Learning Lifecycle (2020) (38)
- Express: Lowering the Cost of Metadata-hiding Communication with Cryptographic Privacy (2019) (38)
- Model Assertions for Monitoring and Improving ML Models (2020) (37)
- Accelerating Deep Learning Workloads Through Efficient Multi-Model Execution (2018) (37)
- A Common Runtime for High Performance Data Analysis (2017) (37)
- Large-Scale Estimation in Cyberphysical Systems Using Streaming Data: A Case Study With Arterial Traffic Estimation (2013) (36)
- Lakehouse: A New Generation of Open Platforms that Unify Data Warehousing and Advanced Analytics (2021) (35)
- Finding Content in File-Sharing Networks When You Can't Even Spell (2007) (35)
- Model Assertions for Debugging Machine Learning (2018) (35)
- An Oblivious General-Purpose SQL Database for the Cloud (2017) (33)
- DIFF: a relational interface for large-scale data explanation (2018) (31)
- Challenges and Opportunities in DNN-Based Video Analytics: A Demonstration of the BlazeIt Video Query Engine (2019) (30)
- Infrastructure for Usable Machine Learning: The Stanford DAWN Project (2017) (27)
- Tachyon: Memory Throughput I/O for Cluster Computing Frameworks (2013) (27)
- Jointly Optimizing Preprocessing and Inference for DNN-based Visual Analytics (2020) (26)
- Machine Learning to Classify Intracardiac Electrical Patterns During Atrial Fibrillation (2020) (26)
- Yggdrasil: An Optimized System for Training Deep Decision Trees at Scale (2016) (24)
- Reliable, Memory Speed Storage for Cluster Computing Frameworks (2014) (23)
- PipeDream (2019) (23)
- Arthur: Rich Post-Facto Debugging for Production Analytics Applications (2013) (22)
- MLSys: The New Frontier of Machine Learning Systems (2019) (21)
- ColBERT (2020) (21)
- Hindsight: Posterior-guided training of retrievers for improved open-ended generation (2021) (21)
- Sparrow: Scalable Scheduling for Sub-Second Parallel Jobs (2013) (20)
- Fast and Optimal Scheduling Over Multiple Network Interfaces (2007) (20)
- Introduction to Spark 2.0 for Database Researchers (2016) (20)
- SysML: The New Frontier of Machine Learning Systems (2019) (20)
- Dominant Resource Fairness: Fair Allocation of Heterogeneous Resources in Datacenters (2010) (19)
- To Index or Not to Index: Optimizing Exact Maximum Inner Product Search (2017) (19)
- Nexus: A Common Substrate for Cluster Computing (2009) (18)
- Fleet: A Framework for Massively Parallel Streaming on FPGAs (2020) (18)
- Weld: Rethinking the Interface Between Data-Intensive Applications (2017) (18)
- Baleen: Robust Multi-Hop Reasoning at Scale via Condensed Retrieval (2021) (17)
- ObliDB: Oblivious Query Processing using Hardware Enclaves (2017) (16)
- Privacy Preserving Ranked Multi-Keyword Search for Multiple Data Owners in Cloud Computing (2016) (16)
- Task-agnostic Indexes for Deep Learning-based Queries over Unstructured Data (2020) (16)
- FrugalML: How to Use ML Prediction APIs More Accurately and Cheaply (2020) (16)
- Contracting Wide-area Network Topologies to Solve Flow Problems Quickly (2020) (15)
- What can Data-Centric AI Learn from Data and ML Engineering? (2021) (15)
- Solving Large-Scale Granular Resource Allocation Problems Efficiently with POP (2021) (15)
- PLAID: An Efficient Engine for Late Interaction Retrieval (2022) (15)
- A Policy-Oriented Architecture for Opportunistic Communication on Multiple Wireless Networks (2006) (14)
- POSH: A Data-Aware Shell (2020) (14)
- Optimally Designing Games for Cognitive Science Research (2012) (14)
- Willump: A Statistically-Aware End-to-end Optimizer for Machine Learning Inference (2019) (13)
- Demonstrate-Search-Predict: Composing retrieval and language models for knowledge-intensive NLP (2022) (13)
- Machine Learned Cellular Phenotypes Predict Outcome in Ischemic Cardiomyopathy. (2020) (13)
- Machine Learned Cellular Phenotypes in Cardiomyopathy Predict Sudden Death (2020) (12)
- Analysis and Exploitation of Dynamic Pricing in the Public Cloud for ML Training (2020) (12)
- LIT: Block-wise Intermediate Representation Training for Model Compression (2018) (12)
- Breakfast of champions: towards zero-copy serialization with NIC scatter-gather (2021) (12)
- Optimizing data-intensive computations in existing libraries with split annotations (2018) (12)
- Approximate selection with guarantees using proxies (2020) (12)
- DBOS: A Proposal for a Data-Centric Operating System (2020) (11)
- Optimizing Cache Performance for Graph Analytics (2016) (11)
- Batch Sampling: Low Overhead Scheduling for Sub-Second Parallel Jobs (2012) (10)
- A Thunk to Remember: make -j1000 (and other jobs) on functions-as-a-service infrastructure (2017) (10)
- Gossip-based search selection in hybrid peer-to-peer networks: Research Articles (2008) (10)
- Performance and Scalability of Broadcast in Spark (2010) (9)
- Similarity Search for Efficient Active Learning and Search of Rare Concepts (2020) (9)
- DBOS: A DBMS-oriented Operating System (2021) (9)
- Select Via Proxy: Efficient Data Selection For Training Deep Networks (2018) (9)
- Hypervisors as a Foothold for Personal Computer Security: An Agenda for the Research Community (2012) (8)
- Optimally designing games for behavioural research (2014) (7)
- Overlook: Differentially Private Exploratory Visualization for Big Data (2020) (7)
- Did the Model Change? Efficiently Assessing Machine Learning API Shifts (2021) (7)
- Photon: A Fast Query Engine for Lakehouse Systems (2022) (7)
- Offload Annotations: Bringing Heterogeneous Computing to Existing Libraries and Workloads (2020) (7)
- Large-Scale Online Expectation Maximization with Spark Streaming (2012) (7)
- linalg: Matrix Computations in Apache Spark (2015) (6)
- Enabling Innovation Below the Communication API (2009) (6)
- DIY Hosting for Online Privacy (2017) (6)
- TASO (2019) (6)
- Design and Implementation of the KioskNet System (Extended Version) (2007) (6)
- Future Directions for Parallel and Distributed Computing: SPX 2019 Workshop Report (2019) (5)
- FrugalMCT: Efficient Online ML API Selection for Multi-Label Classification Tasks (2021) (5)
- Challenges and Opportunities for Autonomous Vehicle Query Systems (2021) (5)
- Large Scale Estimation in Cyberphysical Systems using Streaming Data: a Case Study with Smartphone Traces (2012) (5)
- Adaptive Peer-to-Peer Search (2004) (5)
- Lessons from Large-Scale Software as a Service at Databricks (2019) (5)
- VIVA: An End-to-End System for Interactive Video Analytics (2022) (4)
- Estimating and Explaining Model Performance When Both Covariates and Labels Shift (2022) (4)
- Mesos: Flexible Resource Sharing for the Cloud (2011) (4)
- Extricating IoT Devices from Vendor Infrastructure with Karl (2022) (4)
- Allocation of fungible resources via a fast, scalable price discovery method (2021) (4)
- A demonstration of willump (2019) (4)
- Exploiting Programmatic Behavior of LLMs: Dual-Use Through Standard Security Attacks (2023) (3)
- Finding Label and Model Errors in Perception Data With Learned Observation Assertions (2022) (3)
- Outsourcing Everyday Jobs to Thousands of Cloud Functions with gg (2019) (3)
- What's Changing in Big Data? (2016) (3)
- ObliDB (2019) (3)
- Abstract 14675: Developing Convolutional Neural Networks for Deep Learning of Ventricular Action Potentials to Predict Risk for Ventricular Arrhythmias (2019) (2)
- An Efficient Oblivious Database for the Public Cloud (2017) (2)
- MegaBlocks: Efficient Sparse Training with Mixture-of-Experts (2022) (2)
- Concurrency Control (2009) (2)
- A Progress Report on DBOS: A Database-oriented Operating System (2022) (2)
- SiRen: Leveraging Similar Regions for Efficient and Accurate Variant Calling (2015) (2)
- Apiary: A DBMS-Backed Transactional Function-as-a-Service Framework (2022) (2)
- TASTI: Semantic Indexes for Machine Learning-based Queries over Unstructured Data (2020) (2)
- SiRen: Leveraging Similar Regions for Efficient & Accurate Variant Calling (2015) (2)
- Analysis and Simulation of a System for Free-Text Search in Peer-to-Peer Networks (2004) (2)
- A Polystore Based Database Operating System (DBOS) (2020) (1)
- Distributed Pipeline for Genomic Variant Calling (2012) (1)
- Spectral Lower Bounds on the I/O Complexity of Computation Graphs (2019) (1)
- Splitability Annotations: Optimizing Black-Box Function Composition in Existing Libraries (2018) (1)
- Fleet (2020) (1)
- Moving Beyond Downstream Task Accuracy for Information Retrieval Benchmarking (2022) (1)
- A New Communications API (2009) (1)
- designing games for behavioural research (2014) (1)
- HAPI: A Large-scale Longitudinal Dataset of Commercial ML API Predictions (2022) (1)
- Finding Label Errors in Autonomous Vehicle Data With Learned Observation Assertions (2021) (1)
- BlazeIt: An Optimizing Query Engine for Video at Scale (Extended Abstract) (2018) (1)
- Data-Parallel Actors: A Programming Model for Scalable Query Serving Systems (2022) (1)
- BlazeIt (2019) (1)
- Proof: Accelerating Approximate Aggregation Queries with Expensive Predicates (2021) (1)
- DistIR: An Intermediate Representation and Simulator for Efficient Neural Network Distribution (2021) (1)
- Don't Give Up on Large Optimization Problems; POP Them! (2021) (1)
- Optimizing Video Analytics with Declarative Model Relationships (2022) (1)
- Fuzzy set intersection based paired-end short-read alignment (2021) (1)
- Tachyon (2019) (1)
- Author Correction: Advances, challenges and opportunities in creating data for trustworthy AI (2022) (1)
- Boosting the Performance of MapReduce by Better Resource Utilization in Cluster (2020) (0)
- Optimally Designing Games for Cognitive Science Research - eScholarship (2012) (0)
- Clamor: Extending Functional Cluster Computing Frameworks with Fine-Grained Remote Memory Access (2021) (0)
- Weld: Rethinking the Interface Between Data-Intensive Libraries (2017) (0)
- Cloud Data Systems: What are the Opportunities for the Database Research Community? (2022) (0)
- Exploring the Use of Learning Algorithms for Efficient Performance Profiling (2018) (0)
- Research Statement Data Analytics Systems (2013) (0)
- source computing framework unifies streaming, batch, and interactive big data workloads to unlock new applications (2016) (0)
- Improving Hybrid Keyword-Based Search (0)
- Interface to 'MLflow' [R package mlflow version 1.10.0] (2020) (0)
- B-PO01-081 MACHINE LEARNING OF THE ELECTROCARDIOGRAM IDENTIFIES CARDIAC WALL MOTION ABNORMALITIES BEYOND THE Q WAVE (2021) (0)
- Transactions Make Debugging Easy (2022) (0)
- Spark SQL Resilient Distributed Datasets Spark JDBC Console User Programs (Java, Scala, Python) Catalyst Optimizer DataFrame API (2015) (0)
- Don't Hate the Player, Hate the Game: Safety and Utility in Multi-Agent Congestion Control (2021) (0)
- DIFF: a relational interface for large-scale data explanation (2020) (0)
- Designing Production-Friendly Machine Learning (2021) (0)
- Parallelism-Optimizing Data Placement for Faster Data-Parallel Computations (2022) (0)
- Lecturer: Alistair Sinclair Based on scribe notes by: (0)
- Snippet from the Black Scholes options pricing benchmark implemented using Intel MKL (2019) (0)
- plaid (2022) (0)
- Models Built over RDDs (2016) (0)
- Willump: Statistically-Aware Optimizations for Fast Machine Learning Inference (2019) (0)
- To Index or Not to Index: Optimizing Maximum Inner Product Search (2018) (0)
- SimDex: Exploiting Model Similarity in Exact Matrix Factorization Recommendations (2017) (0)
- Automated Lower Bounds on the I/O Complexity of Computation Graphs (2019) (0)
- A proposal for adaptive routing using neural prediction in a distributed environment (2001) (0)
- Generality of RDDs (2016) (0)
- Analysis of the Time-To-Accuracy Metric and Entries in the DAWNBench Deep Learning Benchmark (2018) (0)
- Cloud data systems (2022) (0)
- PPMLP 2020: Workshop on Privacy-Preserving Machine Learning In Practice (2020) (0)
- A MORPHOLOGICAL OPERATION-BASED APPROACH TO AUTOMATICALLY SEPARATE AND LABEL LEFT ATRIUM BODY AND PULMONARY VEINS (2022) (0)
- Clamor (2021) (0)
- Toward Compact Parameter Representations for Architecture-Agnostic Neural Network Compression (2021) (0)
- Big Data Platforms for Data Analytics (2018) (0)
- What is MapReduce used for? (2012) (0)
- GraphFrames (2016) (0)
- DBOS (2021) (0)
- Abstract 311: Machine Learning of the Electrocardiogram to Detect Regional Structural Abnormalities of the Heart (2020) (0)
- PREDICTING SUDDEN CARDIAC DEATH BY MACHINE LEARNING OF VENTRICULAR ACTION POTENTIALS (2020) (0)
- BIGGIE: A Distributed Pipeline for Genomic Variant Calling (2012) (0)
- Overlook: DIFFERENTIALLY PRIVATE EXPLORATORY VISUALIZATION (2022) (0)
Other Resources About Matei Zaharia
What Schools Are Affiliated With Matei Zaharia?
Matei Zaharia is affiliated with the following schools: