I am Yihao Zhang (张益豪), a first-year PhD student in Applied Mathematics at the School of Mathematical Sciences, Peking University. I conduct research under the guidance of Professor Meng Sun as a member of his research group.

I received my bachelor’s degree (Data Science, Math) from Peking University in 2024. From October 2023 to May 2024, I was a visiting research assistant at Singapore Management University, supervised by Jun Sun.

My research interests include mechanistic interpretability of large language models, formal methods, model checking, AI model explanation, verification, and safety, and AI-aided automatic verification. I have published multiple papers at international conferences (see Google Scholar for citation counts).

Currently, I am following work on representations and interpretability in LLMs. If you are interested in these lines of research, please feel free to contact me. I am also interested in formal methods for quantum computing and in formalizing causality in machine learning. My email: jekyllzhang@gmail.com.

🔥 News

📝 Selected Papers

(*: Equal Contribution; ${}^\dagger$: Corresponding Author)

ICML 2024

On the Duality Between Sharpness-Aware Minimization and Adversarial Training (ICML 2024)

Yihao Zhang*, Hangzhou He*, Jingyu Zhu*, Huanran Chen, Yifei Wang, Zeming Wei${}^\dagger$

Adversarial Training (AT), which adversarially perturbs the input samples during training, has been acknowledged as one of the most effective defenses against adversarial attacks, yet suffers from a fundamental tradeoff that inevitably decreases clean accuracy. Instead of perturbing the samples, Sharpness-Aware Minimization (SAM) perturbs the model weights during training to find a flatter loss landscape and improve generalization. However, as SAM is designed for better clean accuracy, its effectiveness in enhancing adversarial robustness remains unexplored. In this work, considering the duality between SAM and AT, we investigate the adversarial robustness derived from SAM. Intriguingly, we find that using SAM alone can improve adversarial robustness. To understand this unexpected property of SAM, we first provide empirical and theoretical insights into how SAM can implicitly learn more robust features, and then conduct comprehensive experiments to show that SAM can improve adversarial robustness notably without sacrificing any clean accuracy, shedding light on the potential of SAM to be a substitute for AT when accuracy is the higher priority. Code is available at the link below.

[pdf] [arxiv] [code]

NeurIPS 2024

Towards General Conceptual Model Editing via Adversarial Representation Engineering (NeurIPS 2024)

Yihao Zhang, Zeming Wei, Jun Sun${}^\dagger$, Meng Sun${}^\dagger$

As Large Language Models (LLMs) have achieved remarkable success, understanding and controlling their complex internal mechanisms has become an urgent problem. Recent research has attempted to interpret their behaviors through the lens of inner representations. However, developing practical and efficient methods for applying these representations to general and flexible model editing remains challenging. In this work, we explore how to use representation engineering methods to guide the editing of LLMs by deploying a representation sensor as an oracle. We first identify the importance of a robust and reliable sensor during editing, then propose an Adversarial Representation Engineering (ARE) framework to provide a unified and interpretable approach for conceptual model editing without compromising baseline performance. Experiments on multiple model editing paradigms demonstrate the effectiveness of ARE in various settings. Code and data are available at the link below.

[pdf] [arxiv] [code]

arXiv Preprint

Automata Extraction from Transformers (Preprint)

Yihao Zhang, Zeming Wei, Meng Sun${}^\dagger$

In modern machine learning (ML) systems, Transformer-based architectures have achieved milestone success across a broad spectrum of tasks, yet understanding their operational mechanisms remains an open problem. To improve the transparency of ML systems, automata extraction methods, which interpret stateful ML models as automata, typically through formal languages, have proven effective for explaining the mechanism of recurrent neural networks (RNNs). However, few works have applied this paradigm to Transformer models. In particular, understanding their processing of formal languages and identifying their limitations in this area remains unexplored. In this paper, we propose an automata extraction algorithm specifically designed for Transformer models. Treating the Transformer model as a black-box system, we track the transformation of its internal latent representations during operation, and then use classical pedagogical approaches like the L* algorithm to interpret them as deterministic finite-state automata (DFA). Overall, our study reveals how the Transformer model comprehends the structure of formal languages, which not only enhances the interpretability of Transformer-based ML systems but also marks a crucial step toward a deeper understanding of how ML systems process formal languages. Code and data are available at the link below.

[pdf] [arxiv] [code]

Weighted automata extraction and explanation of recurrent neural networks for natural language tasks (JLAMP, Vol 136)

Zeming Wei, Xiyue Zhang, Yihao Zhang, Meng Sun${}^\dagger$

Recurrent Neural Networks (RNNs) have achieved tremendous success in processing sequential data, yet understanding and analyzing their behaviours remains a significant challenge. To this end, many efforts have been made to extract finite automata from RNNs, which are more amenable to analysis and explanation. However, existing approaches to model extraction, such as exact learning and compositional approaches, have limitations in either scalability or precision. In this paper, we propose a novel framework of Weighted Finite Automata (WFA) extraction and explanation to tackle these limitations for natural language tasks. First, to address the transition sparsity and context loss problems we identified in WFA extraction for natural language tasks, we propose an empirical method to complement missing rules in the transition diagram, and adjust transition matrices to enhance the context-awareness of the WFA. We also propose two data augmentation tactics to track more dynamic behaviours of the RNN, which further allows us to improve the extraction precision. Based on the extracted model, we propose an explanation method for RNNs, including a word embedding method, Transition Matrix Embeddings (TME), and a TME-based task-oriented explanation for the target RNN. Our evaluation demonstrates the advantage of our method in extraction precision over existing approaches, and the effectiveness of the TME-based explanation method in applications to pretraining and adversarial example generation.

[pdf] [arxiv] [code]

Other Publications

(*: Equal Contribution; ${}^\dagger$: Corresponding Author)

MILE: A Mutation Testing Framework of In-Context Learning Systems (SETTA 2024)

Zeming Wei, Yihao Zhang, Meng Sun${}^\dagger$

[pdf] [arxiv] [code]

MedTiny: Enhanced Mediator Modeling Language for Scalable Parallel Algorithms (QRS 2023)

Xiangyu Li, Yihao Zhang, Xiaokun Luan, Xiaoyong Xue, Meng Sun${}^\dagger$


Sharpness-aware minimization alone can improve adversarial robustness (ICML 2023 AdvML-Frontiers Workshop)

Zeming Wei*${}^\dagger$, Jingyu Zhu*, Yihao Zhang*

[pdf] [arxiv] [code]

Using Z3 for Formal Modeling and Verification of FNN Global Robustness (SEKE 2023)

Yihao Zhang, Zeming Wei, Xiyue Zhang, Meng Sun${}^\dagger$

[pdf] [arxiv] [code]

Boosting jailbreak attack with momentum (ICLR 2024 R2-FM Workshop)

Yihao Zhang*, Zeming Wei*${}^\dagger$

[pdf] [arxiv] [code]

Exploring the robustness of in-context learning with noisy labels (ICLR 2024 R2-FM Workshop)

Chen Cheng*, Xinzhi Yu*, Haodong Wen*, Jinsong Sun, Guanzhang Yue, Yihao Zhang, Zeming Wei${}^\dagger$

[pdf] [arxiv] [code]

🎖 Honors and Awards

  • Huaixin Bachelor (怀新学士, Honours Degree), 2024
  • Selected for the Elite Program (拔尖计划, Graduate) in the School of Mathematical Sciences, Peking University.
  • University Scholarship, Peking University, 2023
  • Second prize, Chinese Mathematics Competitions for Undergraduates (Beijing Division), 2023

📖 Education

  • 2024.09 - 2028.06 (expected), PhD Student, School of Mathematical Sciences, Peking University.
  • 2023.10 - 2024.05, Research Assistant, School of Computing and Information Systems, Singapore Management University.
  • 2020.09 - 2024.06, Undergraduate Student, School of Mathematical Sciences, Peking University.

💬 Talks

💻 Projects

  • 2023-2025, Study on the interpretability of large language model architectures and algorithms, Program Director.
  • 2022-2025, Trustworthiness guarantees for deep learning systems, Member.

🔗 Links