Hi! I am a fourth-year PhD student at the University of Washington (UW), advised by Prof. Mari Ostendorf and Prof. Noah A. Smith. I also collaborate closely with Prof. Ranjay Krishna. I am currently a research scientist intern at Meta GenAI, building better LLaMAs. Previously, I interned at the Allen Institute for AI (AI2), Google Research, and Tencent AI Lab.

My research primarily focuses on building multimodal models that can understand, reason, and generate across many modalities (text, image, video, ...). I am also interested in building powerful multimodal agents with these models.

Before UW, I graduated from the University of Chicago in 2021 with a B.S. in Mathematics, Computer Science, and Economics, where I was fortunate to be advised by Prof. Karen Livescu at the Toyota Technological Institute at Chicago (TTIC).

Publications

My most recent publications are on Semantic Scholar and Google Scholar.
* indicates equal contribution

Visual Sketchpad: Sketching as a Visual Chain of Thought for Multimodal Language Models
Yushi Hu*, Weijia Shi*, Xingyu Fu, Dan Roth, Mari Ostendorf, Luke Zettlemoyer, Noah A. Smith, Ranjay Krishna
NeurIPS 2024
[paper] [code] [project page]

Decoding-Time Language Model Alignment with Multiple Objectives
Ruizhe Shi, Yifang Chen, Yushi Hu, Alisa Liu, Hannaneh Hajishirzi, Noah A. Smith, Simon Du
NeurIPS 2024
[paper]

BLINK: Multimodal Large Language Models Can See but Not Perceive
Xingyu Fu*, Yushi Hu*, Bangzheng Li, Yu Feng, Haoyu Wang, Xudong Lin, Dan Roth, Noah A. Smith, Wei-Chiu Ma, Ranjay Krishna
ECCV 2024
[paper] [project page] [code] [HF data]
TLDR: We introduce BLINK, a new benchmark for multimodal LLMs that focuses on core visual perception abilities not covered by existing evaluations.

Training Language Models to Generate Text with Citations via Fine-grained Rewards
Chengyu Huang, Zeqiu Wu, Yushi Hu, Wenya Wang
ACL 2024
[paper]
TLDR: Using fine-grained rewards to train LLMs to generate text with citations.

Visual Program Distillation: Distilling Tools and Programmatic Reasoning into Vision-Language Models
Yushi Hu, Otilia Stretcu, Chun-Ta Lu, Krishnamurthy Viswanathan, Kenji Hata, Enming Luo, Ranjay Krishna, Ariel Fuxman
CVPR 2024 (Oral)
[paper] [project page]
TLDR: Using LLM-generated code + vision tools to generate high-quality multimodal chain-of-thought reasoning data for large multimodal model (LMM) training. The resulting models, PaLI-3-VPD (5B) and PaLI-X-VPD (55B), set a new SOTA on many vision-language tasks.

DreamSync: Aligning Text-to-Image Generation with Image Understanding Feedback
Jiao Sun*, Deqing Fu*, Yushi Hu*, Su Wang, Royi Rassin, Da-Cheng Juan, Dana Alon, Charles Herrmann, Sjoerd van Steenkiste, Ranjay Krishna, Cyrus Rashtchian
Preprint 2023
[paper]
TLDR: Using TIFA as the reward model for text-to-image generation models. Simple rejection-sampling fine-tuning improves both text-to-image faithfulness and image aesthetics.

Davidsonian Scene Graph: Improving Reliability in Fine-Grained Evaluation for Text-Image Generation
Jaemin Cho, Yushi Hu, Roopal Garg, Peter Anderson, Ranjay Krishna, Jason Baldridge, Mohit Bansal, Jordi Pont-Tuset, Su Wang
ICLR 2024
[paper] [project page] [code & data]
TLDR: An improved and more reliable version of TIFA for text-to-image evaluation, based on Davidsonian semantics and scene graphs.

Fine-Grained Human Feedback Gives Better Rewards for Language Model Training
Zeqiu Wu*, Yushi Hu*, Weijia Shi, Nouha Dziri, Alane Suhr, Prithviraj Ammanabrolu, Noah A. Smith, Mari Ostendorf, Hannaneh Hajishirzi
NeurIPS 2023 (Spotlight)
[paper] [project page] [code & data]
TLDR: Feedback in current RLHF is an overall preference, which conveys limited information. We introduce Fine-Grained RLHF, which trains LMs with explicit feedback like "sentence 1 is not factual" and "sentence 2 is toxic". We show that Fine-Grained RLHF is more effective and enables customizing LMs for specific needs.

TIFA: Accurate and Interpretable Text-to-Image Faithfulness Evaluation with Question Answering
Yushi Hu, Benlin Liu, Jungo Kasai, Yizhong Wang, Mari Ostendorf, Ranjay Krishna, Noah A. Smith
ICCV 2023
[paper] [project page] [code & data] [poster]
TLDR: Fine-grained and accurate evaluation of synthesized images using image-to-text models (e.g., GPT-4, BLIP-2) and large language models (e.g., GPT-3.5). More accurate than CLIP!

PromptCap: Prompt-Guided Task-Aware Image Captioning
Yushi Hu*, Hang Hua*, Zhengyuan Yang, Weijia Shi, Noah A. Smith, Jiebo Luo
ICCV 2023
[paper] [project page] [Huggingface Checkpoint] [poster]
TLDR: A captioning model controlled by natural language instructions. A simple and effective visual front end for LLMs such as GPT-3 and ChatGPT.

One Embedder, Any Task: Instruction-Finetuned Text Embeddings
Hongjin Su*, Weijia Shi*, Jungo Kasai, Yizhong Wang, Yushi Hu, Mari Ostendorf, Wen-tau Yih, Noah A. Smith, Luke Zettlemoyer, Tao Yu
ACL 2023
[paper] [project page] [checkpoint]
TLDR: Instruction-finetuned text embeddings that achieve SOTA on retrieval, semantic similarity, and other tasks. Open-source, and better than OpenAI embeddings!

Binding Language Models in Symbolic Languages
Zhoujun Cheng*, Tianbao Xie*, Peng Shi, Chengzu Li, Rahul Nadkarni, Yushi Hu, Caiming Xiong, Dragomir Radev, Mari Ostendorf, Luke Zettlemoyer, Noah A Smith, Tao Yu
ICLR 2023 (Spotlight)
[paper] [project page]
TLDR: Combining GPT-3 with Python and SQL. An early exploration of the ideas behind Toolformer and ChatGPT plugins.

Unsupervised Learning of Hierarchical Conversation Structure
Bo-Ru Lu, Yushi Hu, Hao Cheng, Noah A. Smith, Mari Ostendorf
EMNLP 2022
[paper]
TLDR: Learning common dialogue structure from a large collection of customer-service dialogues.

In-Context Learning for Few-Shot Dialogue State Tracking
Yushi Hu, Chia-Hsuan Lee, Tianbao Xie, Tao Yu, Noah A. Smith, Mari Ostendorf
EMNLP 2022
[paper] [code] [bibtex]
TLDR: The first paper to show that GPT-3 is surprisingly good at dialogue understanding tasks.

Acoustic Span Embeddings for Multilingual Query-by-Example Search
Yushi Hu, Shane Settle, Karen Livescu
IEEE Spoken Language Technology Workshop (SLT 2021)
[paper] [code] [bibtex] [slides]

Multilingual Jointly Trained Acoustic and Written Word Embeddings
Yushi Hu, Shane Settle, Karen Livescu
Interspeech 2020
[paper] [code] [bibtex] [slides] [video] [demo]

Freestanding Ferroelectric Bubble Domains
Saidur R Bakaul, Sergei Prokhorenko, Qi Zhang, Yousra Nahas, Yushi Hu, Amanda Petford-Long, Laurent Bellaiche, Nagarajan Valanoor
Advanced Materials, 2021
[paper]

Work Experience

  • Student Researcher, Allen Institute for AI, Seattle, WA, 2024
  • Student Researcher, Google Research, Mountain View, CA, summer 2023
  • Research Intern, Tencent AI Lab (Seattle), Bellevue, WA, summer 2022
  • Research Assistant, Toyota Technological Institute at Chicago (TTIC), Chicago, IL, 2019 - 2021
  • Research Intern, Argonne National Laboratory, Lemont, IL, summer 2018