BloomIntent:
Automating Search Evaluation with LLM-Generated Fine-Grained User Intents
Yoonseo Choi, Eunhye Kim, Hyunwoo Kim, Juho Kim (KAIST)
Donghyun Park, Honggu Lee, Jin Young Kim (NAVER)
Note
Yoonseo Choi is on the job market! Her thesis research is about Generative Proxies: interactive, data-grounded representations of users' diverse perspectives and goals, distilled from existing user data with LLMs. Generative Proxies enable understanding of intended users' needs without synchronous user sessions and make that understanding immediately usable for task completion, such as ideation (Proxona), evaluation (BloomIntent), and rapid prototyping (work in progress). She is interested in industry positions, so let's have a chat! 🗣️

30-sec Video Preview

Abstract

If 100 people issue the same search query, they may have 100 different goals. While existing work on user-centric AI evaluation highlights the importance of aligning systems with fine-grained user intents, current search evaluation methods struggle to represent and assess this diversity. We introduce BloomIntent, a user-centric search evaluation method that uses user intents as the evaluation unit. BloomIntent first generates a set of plausible, fine-grained search intents grounded on taxonomies of user attributes and information-seeking intent types. Then, BloomIntent provides an automated evaluation of search results against each intent powered by large language models. To support practical analysis, BloomIntent clusters semantically similar intents and summarizes evaluation outcomes in a structured interface. With three technical evaluations, we showed that BloomIntent generated fine-grained, evaluable, and realistic intents and produced scalable assessments of intent-level satisfaction that achieved 72% agreement with expert evaluators. In a case study (N=4), we showed that BloomIntent supported search specialists in identifying intents for ambiguous queries, uncovering underserved user needs, and discovering actionable insights for improving search experiences. By shifting from query-level to intent-level evaluation, BloomIntent reimagines how search systems can be assessed---not only for performance but for their ability to serve a multitude of user goals.

From Queries to Fine-grained Intents

BloomIntent Pipeline

1. Original Query

Everything begins with a user's query — often short and underspecified. For example, 'Hawaii honeymoon' could mean hotels, activities, or travel reviews.

2. Background Knowledge

To ground our process in up-to-date information, we retrieve relevant context from search engines (e.g., Google, Naver). This ensures generated intents reflect the real web, not just model priors.
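To make this step concrete, here is a minimal sketch of the grounding logic. `fetch_search_snippets` is a hypothetical stub, not part of BloomIntent's released code; wire it to whichever engine API you have access to.

```python
# Minimal grounding sketch. `fetch_search_snippets` is a hypothetical
# stub (not BloomIntent's code): connect it to a real search client.

def fetch_search_snippets(query: str, k: int = 10) -> list[str]:
    """Return the top-k result titles/snippets for `query` (stub)."""
    raise NotImplementedError("plug in a Google/Naver search client here")

def build_grounding_context(query: str, k: int = 10) -> str:
    """Join live snippets into one context block so that generated
    intents reflect the current web, not just model priors."""
    snippets = fetch_search_snippets(query, k)
    return "\n".join(f"- {s}" for s in snippets)
```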

3. Expanded Queries

We expand the original query into realistic reformulations, incorporating user attributes such as budget sensitivity, expertise, or content preferences. This captures the diversity of how real users refine their searches.
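As one illustration of how attribute-conditioned expansion might look, the sketch below prompts an LLM with the grounding context plus a small attribute table. `complete` stands in for any text-completion call, and the attributes shown are an illustrative slice, not the paper's full user-attribute taxonomy.

```python
# Illustrative sketch only: `complete` is a hypothetical LLM call, and
# USER_ATTRIBUTES is an example slice, not the paper's full taxonomy.

USER_ATTRIBUTES = {
    "budget sensitivity": ["budget-conscious", "luxury-oriented"],
    "expertise": ["first-time traveler", "frequent traveler"],
    "content preference": ["wants reviews", "wants video guides"],
}

def expand_query(query: str, context: str, complete) -> list[str]:
    """Ask the LLM for one realistic reformulation per attribute value."""
    prompt = (
        f"Search query: {query}\n"
        f"Web context:\n{context}\n\n"
        "For each user attribute value below, write one realistic "
        "reformulation of the query that such a user might issue. "
        "Return one reformulation per line.\n"
        + "\n".join(f"- {attr}: {', '.join(vals)}"
                    for attr, vals in USER_ATTRIBUTES.items())
    )
    return [line.strip("- ").strip()
            for line in complete(prompt).splitlines() if line.strip()]
```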

4. Intent Types

Each expanded query is mapped to one or more intent types from an established taxonomy (e.g., compare, plan, purchase). This gives structure to otherwise messy reformulations.
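A sketch of the mapping step, reusing the same hypothetical `complete` helper; the intent-type list here is an abbreviated illustration rather than the exact taxonomy used in the paper.

```python
# Hedged sketch: INTENT_TYPES is an abbreviated, illustrative slice of
# an information-seeking intent taxonomy, not the paper's exact list.

INTENT_TYPES = ["locate", "compare", "plan", "purchase", "learn", "verify"]

def classify_intent_types(expanded_query: str, complete) -> list[str]:
    """Map one reformulated query onto one or more intent types."""
    prompt = (
        f"Reformulated query: {expanded_query}\n"
        f"Which of these intent types apply: {', '.join(INTENT_TYPES)}?\n"
        "Answer with a comma-separated subset."
    )
    answer = complete(prompt).lower()
    return [t for t in INTENT_TYPES if t in answer]
```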

5. Final Intents

The result is a set of clear, sentence-level intent statements — realistic, fine-grained user goals that evaluators can directly judge against search results.
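Putting the pieces together, a hedged end-to-end sketch: one call turns an expanded query plus intent type into a sentence-level intent statement, and a second LLM-as-a-judge call rates a result page against it. The labels and prompt wording are illustrative, not the paper's exact rubric.

```python
# End-to-end sketch under the same assumptions as above: `complete` is
# a hypothetical LLM call; labels and prompts are illustrative.

def write_intent_statement(expanded_query: str, intent_type: str,
                           complete) -> str:
    """Produce one sentence-level, judgeable intent statement."""
    prompt = (
        f"Query: {expanded_query}\nIntent type: {intent_type}\n"
        "Write one sentence describing this user's goal, e.g. "
        "'Compare all-inclusive Hawaii honeymoon packages by price.'"
    )
    return complete(prompt).strip()

def judge_intent(intent: str, serp_text: str, complete) -> str:
    """LLM-as-a-judge: does the result page satisfy this intent?"""
    prompt = (
        f"User intent: {intent}\nSearch results:\n{serp_text}\n"
        "Answer 'satisfied', 'partially satisfied', or 'unsatisfied', "
        "then give a one-sentence explanation."
    )
    return complete(prompt)
```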

Interactive Demo

Explore Intents, Clusters, and Evaluation Results

Experience BloomIntent firsthand! Select a query to explore how our system generates fine-grained user intents, clusters them semantically, and evaluates search results against each intent. This interactive demo showcases the key components of our interface: query selection, intent cluster distribution, and detailed intent analysis with natural language explanations.
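For readers curious how the semantic clustering shown in the demo could be reproduced, here is one plausible implementation using sentence embeddings and agglomerative clustering; the embedding model and distance threshold are assumptions, not the paper's exact configuration.

```python
# One plausible clustering recipe (not the paper's exact method):
# embed intent sentences, then group them by cosine distance.
# Requires sentence-transformers and scikit-learn >= 1.2.
from collections import defaultdict
from sentence_transformers import SentenceTransformer
from sklearn.cluster import AgglomerativeClustering

def cluster_intents(intents: list[str]) -> dict[int, list[str]]:
    """Group semantically similar intent statements."""
    embeddings = SentenceTransformer("all-MiniLM-L6-v2").encode(intents)
    labels = AgglomerativeClustering(
        n_clusters=None, distance_threshold=0.4,  # assumed threshold
        metric="cosine", linkage="average",
    ).fit_predict(embeddings)
    clusters = defaultdict(list)
    for intent, label in zip(intents, labels):
        clusters[label].append(intent)
    return dict(clusters)
```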

Insights from Case Study

How do search specialists perceive and use intent-based evaluation in real-world workflows?

Fine-grained intents reveal hidden user needs

BloomIntent helped experts understand why a query failed. Instead of just seeing low satisfaction scores, they could pinpoint missing elements — like richer comparisons, specific content formats, or overlooked user goals.

Automated evaluation is useful for triage

Participants said they wouldn't rely on automation alone, but they valued it as a quick, cost-effective way to spot low-performing queries. Automated judgments guided where to focus deeper human review.
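As a toy illustration of that triage workflow (the label name and threshold below are assumptions, not values from the study):

```python
# Toy triage sketch: given per-intent judgments, surface queries worth
# deeper human review. Label name and threshold are assumptions.

def flag_for_review(judgments: dict[str, list[str]],
                    threshold: float = 0.5) -> list[str]:
    """`judgments` maps each query to its intent-level labels;
    return queries whose satisfaction rate falls below `threshold`."""
    flagged = []
    for query, labels in judgments.items():
        satisfied = sum(l == "satisfied" for l in labels) / len(labels)
        if satisfied < threshold:
            flagged.append(query)
    return flagged
```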

Intent clusters turn signals into actionable insights

Reviewing dozens of intents individually was overwhelming. Clustering similar intents made it easier to see patterns — like groups about price comparisons or regulatory differences — and propose concrete improvements.

Beyond human imagination

Experts appreciated that BloomIntent generated intents they wouldn't have considered themselves, especially in domains outside their expertise. This broadened their perspective and uncovered underserved needs.

Bibtex

@inproceedings{uist25-bloomintent,
      author = {Choi, Yoonseo and Kim, Eunhye and Kim, Hyunwoo and Park, Donghyun and Lee, Honggu and Kim, Jin Young and Kim, Juho},
      title = {BloomIntent: Automating Search Evaluation with LLM-Generated Fine-Grained User Intents},
      year = {2025},
      isbn = {9798400720376},
      publisher = {Association for Computing Machinery},
      address = {New York, NY, USA},
      url = {https://doi.org/10.1145/3746059.3747677},
      doi = {10.1145/3746059.3747677},
      booktitle = {Proceedings of the 38th Annual ACM Symposium on User Interface Software and Technology},
      articleno = {177},
      numpages = {34},
      keywords = {Evaluation method, Query understanding, Intent diversification, LLM-as-a-judge},
      location = {Busan, Republic of Korea},
      series = {UIST '25}}