Publications by category in reverse chronological order. Generated by jekyll-scholar.
2026
Submission
Trident: Adaptive Scheduling for Heterogeneous Multimodal Data Pipeline
Ding Pan, Zhuangzhuang Zhou, Long Qian, and Binhang Yuan
In Submission, 2026
2025
arXiv
Meteion: Fast and Efficient Serverless Workflows for Latency-Critical Interactive Applications
Zhuangzhuang Zhou, Yanqi Zhang, and Christina Delimitrou
In arXiv, 2025
2024
ASPLOS
Characterizing a Memory Allocator at Warehouse Scale
Zhuangzhuang Zhou, Vaibhav Gogte, Nilay Vaish, Chris Kennelly, Patrick Xia, Svilen Kanev, Tipp Moseley, Christina Delimitrou, and Parthasarathy Ranganathan
In Proceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), La Jolla, CA, USA, 2024
Memory allocation constitutes a substantial component of warehouse-scale computation. Optimizing the memory allocator not only reduces the datacenter tax, but also improves application performance, leading to significant cost savings. We present the first comprehensive characterization study of TCMalloc, a memory allocator used by warehouse-scale applications in Google’s production fleet. Our characterization reveals a profound diversity in the memory allocation patterns, allocated object sizes and lifetimes, for large-scale datacenter workloads, as well as in their performance on heterogeneous hardware platforms. Based on these insights, we optimize TCMalloc for warehouse-scale environments. Specifically, we propose optimizations for each level of its cache hierarchy that include usage-based dynamic sizing of allocator caches, leveraging hardware topology to mitigate inter-core communication overhead, and improving allocation packing algorithms based on statistical data. We evaluate these design choices using benchmarks and fleet-wide A/B experiments in our production fleet, resulting in a 1.4% improvement in throughput and a 3.4% reduction in RAM usage for the entire fleet. For the applications with the highest memory allocation usage, we observe up to 8.1% and 6.3% improvement in throughput and memory usage respectively. At our scale, even a single percent CPU or memory improvement translates to significant savings in server costs.
@inproceedings{tcmalloc,author={Zhou, Zhuangzhuang and Gogte, Vaibhav and Vaish, Nilay and Kennelly, Chris and Xia, Patrick and Kanev, Svilen and Moseley, Tipp and Delimitrou, Christina and Ranganathan, Parthasarathy},title={Characterizing a Memory Allocator at Warehouse Scale},year={2024},isbn={9798400703867},publisher={Association for Computing Machinery},address={New York, NY, USA},url={https://doi.org/10.1145/3620666.3651350},doi={10.1145/3620666.3651350},booktitle={Proceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS)},pages={192-206},numpages={15},keywords={datacenter, warehouse-scale computing, memory allocator, memory management},location={La Jolla, CA, USA},series={ASPLOS '24}}
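The "usage-based dynamic sizing of allocator caches" mentioned in the abstract can be illustrated with a toy sketch. None of this is TCMalloc's actual code: the `SizedCache` class, its fields, and the 1.25x slack factor are all invented for illustration.

```python
# Toy illustration of usage-based cache sizing (NOT TCMalloc code):
# periodically resize a per-CPU/per-thread cache toward its recently
# observed peak demand plus some slack, reclaiming unused capacity.

class SizedCache:
    def __init__(self, capacity):
        self.capacity = capacity   # bytes the cache may hold
        self.used = 0              # bytes currently cached
        self.peak_used = 0         # peak demand in the current interval

    def allocate(self, n):
        # Record demand so the resizer can observe it later.
        self.used = min(self.capacity, self.used + n)
        self.peak_used = max(self.peak_used, self.used)

    def release(self, n):
        self.used = max(0, self.used - n)

    def resize_to_usage(self, slack=1.25):
        """Move capacity toward peak demand * slack, then start a new
        observation interval. Returns the new capacity."""
        self.capacity = max(1, int(self.peak_used * slack))
        self.peak_used = self.used
        return self.capacity

cache = SizedCache(capacity=1024)
cache.allocate(100)
cache.release(50)
print(cache.resize_to_usage())  # 125: capacity shrinks toward observed peak (100 * 1.25)
```

The real allocator operates on free lists of size classes and fleet-measured statistics; this sketch only captures the shrink-to-demand feedback loop.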
HPCA
Ursa: Lightweight Resource Management for Cloud-Native Microservices
Yanqi Zhang, Zhuangzhuang Zhou, Sameh Elnikety, and Christina Delimitrou
In the 30th IEEE International Symposium on High-Performance Computer Architecture (HPCA), 2024
Resource management for cloud-native microservices has attracted a lot of recent attention. Previous work has shown that machine learning (ML)-driven approaches outperform traditional techniques, such as autoscaling, in terms of both SLA maintenance and resource efficiency. However, ML-driven approaches also face challenges including lengthy data collection processes and limited scalability. We present Ursa, a lightweight resource management system for cloud-native microservices that addresses these challenges. Ursa uses an analytical model that decomposes the end-to-end SLA into per-service SLAs, and maps each per-service SLA to an individual resource allocation for the corresponding microservice tier. To speed up the exploration process and avoid prolonged SLA violations, Ursa explores each microservice individually, and swiftly stops exploration if latency exceeds its SLA. We evaluate Ursa on a set of representative end-to-end microservice topologies, including a social network, a media service, and a video processing pipeline, each consisting of multiple classes and priorities of requests with different SLAs, and compare it against two representative ML-driven systems, Sinan and Firm. Compared to these ML-driven approaches, Ursa provides significant advantages: it shortens the data collection process by more than 128X, and its control plane is 43X faster. At the same time, Ursa does not sacrifice resource efficiency or SLAs. During online deployment, Ursa reduces the SLA violation rate by 9.0% to 49.9%, and reduces CPU allocation by up to 86.2% compared to ML-driven approaches.
@inproceedings{ursa,author={Zhang, Yanqi and Zhou, Zhuangzhuang and Elnikety, Sameh and Delimitrou, Christina},booktitle={The 30th IEEE International Symposium on High-Performance Computer Architecture (HPCA)},title={Ursa: Lightweight Resource Management for Cloud-Native Microservices},year={2024},volume={},number={},pages={954-969},keywords={Analytical models;Social networking (online);Scalability;Pipelines;Microservice architectures;Process control;Maintenance engineering;Microservices;Resource management;SLA},doi={10.1109/HPCA57654.2024.00077}}
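Ursa's key idea, decomposing the end-to-end SLA into per-service SLAs, can be sketched in a deliberately simplified form. The proportional split below is an assumption made purely for illustration; Ursa's actual analytical model is the one described in the paper, and all tier names and numbers here are invented.

```python
# Simplified sketch (not Ursa's model): split an end-to-end latency SLA
# into per-tier budgets in proportion to each tier's measured baseline
# latency, so tiers that dominate latency receive larger budgets.

def decompose_sla(end_to_end_sla_ms, baseline_latency_ms):
    """Return a per-tier latency budget summing to the end-to-end SLA."""
    total = sum(baseline_latency_ms.values())
    return {tier: end_to_end_sla_ms * lat / total
            for tier, lat in baseline_latency_ms.items()}

budgets = decompose_sla(200.0, {"frontend": 10.0, "logic": 30.0, "storage": 60.0})
print(budgets)  # {'frontend': 20.0, 'logic': 60.0, 'storage': 120.0}
```

Once each tier has its own budget, it can be sized independently, which is what lets a system like Ursa explore each microservice on its own instead of searching the joint allocation space.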
2023
ASPLOS
AQUATOPE: QoS-and-Uncertainty-Aware Resource Management for Multi-stage Serverless Workflows
Zhuangzhuang Zhou, Yanqi Zhang, and Christina Delimitrou
In Proceedings of the 28th ACM International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), Vancouver, BC, Canada, 2023
Multi-stage serverless applications, i.e., workflows with many computation and I/O stages, are becoming increasingly representative of FaaS platforms. Despite their advantages in terms of fine-grained scalability and modular development, these applications are subject to suboptimal performance, resource inefficiency, and high costs to a larger degree than previous simple serverless functions. We present Aquatope, a QoS-and-uncertainty-aware resource scheduler for end-to-end serverless workflows that takes into account the inherent uncertainty present in FaaS platforms, and improves performance predictability and resource efficiency. Aquatope uses a set of scalable and validated Bayesian models to create pre-warmed containers ahead of function invocations, and to allocate appropriate resources at function granularity to meet a complex workflow’s end-to-end QoS, while minimizing resource cost. Across a diverse set of analytics and interactive multi-stage serverless workloads, Aquatope significantly outperforms prior systems, reducing QoS violations by 5X, and cost by 34% on average and up to 52% compared to other QoS-meeting methods.
@inproceedings{aquatope,author={Zhou, Zhuangzhuang and Zhang, Yanqi and Delimitrou, Christina},title={AQUATOPE: QoS-and-Uncertainty-Aware Resource Management for Multi-stage Serverless Workflows},year={2023},isbn={9781450399159},publisher={Association for Computing Machinery},address={New York, NY, USA},url={https://doi.org/10.1145/3567955.3567960},doi={10.1145/3567955.3567960},booktitle={Proceedings of the 28th ACM International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS)},pages={1-14},numpages={14},keywords={serverless computing, resource management, resource efficiency, resource allocation, quality of service, machine learning for systems, function-as-a-service, datacenter, Cloud computing},location={Vancouver, BC, Canada},series={ASPLOS 2023}}
TCAD
HEDALS: Highly Efficient Delay-Driven Approximate Logic Synthesis
Chang Meng, Zhuangzhuang Zhou, Yue Yao, Shuyang Huang, Yuhang Chen, and Weikang Qian
In IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems (TCAD), 2023
Approximate computing is an emerging paradigm for error-tolerant applications. By introducing a reasonable amount of inaccuracy, both the area and delay of a circuit can be reduced significantly. To produce approximate circuits automatically, many approximate logic synthesis (ALS) algorithms have been proposed. However, they mainly focus on area reduction and are not optimal in reducing the circuit delay. In this article, we propose HEDALS, a Highly Efficient Delay-driven ALS framework, which supports various types of local approximate changes (LACs), circuit representations, and average error metrics. To reduce delay, HEDALS builds a critical error graph (CEG) consisting of nodes on the critical paths and error information, and finds an optimized set of LACs in the CEG by either a maximum flow-based method or a priority cut-based method. The resulting set of LACs is applied to shorten all critical paths simultaneously so that the circuit delay is reduced. Moreover, the simultaneous application of multiple LACs makes HEDALS extremely fast. Compared to a state-of-the-art method, on average, HEDALS reduces the circuit delay by 32.3% while being 167× faster. The code of HEDALS is open-source.
@article{hedals,author={Meng, Chang and Zhou, Zhuangzhuang and Yao, Yue and Huang, Shuyang and Chen, Yuhang and Qian, Weikang},title={HEDALS: Highly Efficient Delay-Driven Approximate Logic Synthesis},year={2023},issue_date={Nov. 2023},publisher={IEEE Press},volume={42},number={11},issn={0278-0070},url={https://doi.org/10.1109/TCAD.2023.3268221},doi={10.1109/TCAD.2023.3268221},journal={IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems (TCAD)},pages={3491-3504},numpages={14}}
2021
ASPLOS
Sinan: ML-based and QoS-aware resource management for cloud microservices
Yanqi Zhang, Weizhe Hua, Zhuangzhuang Zhou, G. Edward Suh, and Christina Delimitrou
In Proceedings of the 26th ACM International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), Virtual, USA, 2021
Cloud applications are increasingly shifting from large monolithic services, to large numbers of loosely-coupled, specialized microservices. Despite their advantages in terms of facilitating development, deployment, modularity, and isolation, microservices complicate resource management, as dependencies between them introduce backpressure effects and cascading QoS violations. We present Sinan, a data-driven cluster manager for interactive cloud microservices that is online and QoS-aware. Sinan leverages a set of scalable and validated machine learning models to determine the performance impact of dependencies between microservices, and allocate appropriate resources per tier in a way that preserves the end-to-end tail latency target. We evaluate Sinan both on dedicated local clusters and large-scale deployments on Google Compute Engine (GCE) across representative end-to-end applications built with microservices, such as social networks and hotel reservation sites. We show that Sinan always meets QoS, while also maintaining high cluster utilization, in contrast to prior work which leads to unpredictable performance or sacrifices resource efficiency. Furthermore, the techniques in Sinan are explainable, meaning that cloud operators can derive insights from the ML models on how to better deploy and design their applications to reduce unpredictable performance.
@inproceedings{sinan,author={Zhang, Yanqi and Hua, Weizhe and Zhou, Zhuangzhuang and Suh, G. Edward and Delimitrou, Christina},title={Sinan: ML-based and QoS-aware resource management for cloud microservices},year={2021},isbn={9781450383172},publisher={Association for Computing Machinery},address={New York, NY, USA},url={https://doi.org/10.1145/3445814.3446693},doi={10.1145/3445814.3446693},booktitle={Proceedings of the 26th ACM International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS)},pages={167-181},numpages={15},keywords={Cloud computing, cluster management, datacenter, machine learning for systems, microservices, quality of service, resource efficiency, resource management, resource allocation, tail latency},location={Virtual, USA},series={ASPLOS '21}}
2020
GLSVLSI
Reliability-Enhanced Circuit Design Flow Based on Approximate Logic Synthesis
Zuodong Zhang, Runsheng Wang, Zhe Zhang, Ru Huang, Chang Meng, Weikang Qian, and Zhuangzhuang Zhou
In Proceedings of the 2020 Great Lakes Symposium on VLSI (GLSVLSI), Virtual Event, China, 2020
With the downscaling of CMOS technology, the circuit design margin becomes increasingly tight due to the wider guardband required to counteract more severe transistor aging and variations. Thus, reliability-enhanced circuit design is urgently needed to reduce the guardband. In this paper, a reliability-enhanced design framework based on approximate logic synthesis is proposed to completely eliminate the aging guardband. It includes two key parts: first, a forward reliability simulation flow supporting statistical static timing analysis (SSTA) estimates the path failure rates after aging; if the timing constraints are not satisfied, a backward delay-driven approximate logic synthesis flow then applies approximate local changes on the critical paths to reduce the delay until the reliability requirement is satisfied and no aging guardband is needed. The results show that the approximate circuit has a smaller aged delay than the original circuit, so the path failure rates are significantly decreased. The proposed design flow thus converts timing errors that have a fatal impact on applications into negligible errors on low-significance bits, improving the resilience of circuits and providing a new perspective on reliability-enhanced design at the nanoscale.
@inproceedings{reliability,author={Zhang, Zuodong and Wang, Runsheng and Zhang, Zhe and Huang, Ru and Meng, Chang and Qian, Weikang and Zhou, Zhuangzhuang},title={Reliability-Enhanced Circuit Design Flow Based on Approximate Logic Synthesis},year={2020},isbn={9781450379441},publisher={Association for Computing Machinery},address={New York, NY, USA},url={https://doi.org/10.1145/3386263.3406926},doi={10.1145/3386263.3406926},booktitle={Proceedings of the 2020 on Great Lakes Symposium on VLSI (GLSVLSI)},pages={71-76},numpages={6},keywords={statistical static timing analysis (SSTA), reliability-enhanced design, negative bias temperature instability (NBTI), logic synthesis, guardband, circuit reliability simulation, approximate computing, aging},location={Virtual Event, China},series={GLSVLSI '20}}
2018
ICCAD
DALS: delay-driven approximate logic synthesis
Zhuangzhuang Zhou, Yue Yao, Shuyang Huang, Sanbao Su, Chang Meng, and Weikang Qian
In 2018 IEEE/ACM International Conference on Computer-Aided Design (ICCAD), San Diego, California, 2018
Approximate computing is an emerging paradigm for error-tolerant applications. By introducing a reasonable amount of inaccuracy, both the area and delay of a circuit can be reduced significantly. To synthesize approximate circuits automatically, many approximate logic synthesis (ALS) algorithms have been proposed. However, they mainly focus on area reduction and are not optimal in reducing the delay of the circuits. In this paper, we propose DALS, a delay-driven ALS framework. DALS works on the AND-inverter graph (AIG) representation of a circuit. It supports a wide range of approximate local changes and some commonly-used error metrics, including error rate and mean error distance. In order to select an optimal set of nodes in the AIG to apply approximate local changes, DALS establishes a critical error network (CEN) from the AIG and formulates a maximum flow problem on the CEN. Our experimental results on a wide range of benchmarks show that DALS produces approximate circuits with significantly reduced delays.
@inproceedings{dals,author={Zhou, Zhuangzhuang and Yao, Yue and Huang, Shuyang and Su, Sanbao and Meng, Chang and Qian, Weikang},title={DALS: delay-driven approximate logic synthesis},year={2018},isbn={9781450359504},publisher={Association for Computing Machinery},address={New York, NY, USA},url={https://doi.org/10.1145/3240765.3240790},doi={10.1145/3240765.3240790},booktitle={2018 IEEE/ACM International Conference on Computer-Aided Design (ICCAD)},articleno={86},numpages={7},keywords={approximate computing, approximate logic synthesis, delay optimization, timing optimization},location={San Diego, California},series={ICCAD '18}}
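Both DALS and HEDALS reduce the selection of approximate local changes to a maximum flow computation (on the critical error network and critical error graph, respectively). The CEN/CEG construction is paper-specific, so the sketch below shows only the generic max-flow building block, Edmonds-Karp with BFS augmenting paths, on a made-up four-node graph; nothing here reproduces the papers' actual graph construction.

```python
# Generic Edmonds-Karp max-flow on a small invented graph. DALS solves
# max flow on a critical error network derived from the AIG; this shows
# only the max-flow step itself, not the CEN construction.
from collections import deque

def max_flow(capacity, source, sink):
    """Return the max source->sink flow; mutates `capacity` in place
    (it becomes the residual graph)."""
    flow = 0
    while True:
        # BFS for an augmenting path in the residual graph.
        parent = {source: None}
        queue = deque([source])
        while queue and sink not in parent:
            u = queue.popleft()
            for v, cap in capacity.get(u, {}).items():
                if cap > 0 and v not in parent:
                    parent[v] = u
                    queue.append(v)
        if sink not in parent:
            return flow  # no augmenting path left
        # Recover the path and its bottleneck capacity.
        path, v = [], sink
        while parent[v] is not None:
            path.append((parent[v], v))
            v = parent[v]
        bottleneck = min(capacity[u][v] for u, v in path)
        # Push flow: decrease forward residuals, increase reverse ones.
        for u, v in path:
            capacity[u][v] -= bottleneck
            capacity.setdefault(v, {}).setdefault(u, 0)
            capacity[v][u] += bottleneck
        flow += bottleneck

cap = {"s": {"a": 3, "b": 2}, "a": {"t": 2}, "b": {"t": 3}}
print(max_flow(cap, "s", "t"))  # 4
```

In the papers, the value of the cut exposed by this computation identifies a set of nodes whose approximation shortens all critical paths at once.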