no code implementations • 3 May 2024 • Negar Arabzadeh, Siqing Huo, Nikhil Mehta, Qingyun Wu, Chi Wang, Ahmed Awadallah, Charles L. A. Clarke, Julia Kiseleva
The rapid development of Large Language Models (LLMs) has led to a surge in applications that facilitate collaboration among multiple agents, assisting humans in their daily tasks.
no code implementations • 11 Apr 2024 • Marwah Alaofi, Negar Arabzadeh, Charles L. A. Clarke, Mark Sanderson
We resolve this apparent circularity in two ways: 1) by viewing LLM-based assessment as a form of "slow search", where a slower IR system is used for evaluation and training of a faster production IR system; and 2) by recognizing a continuing need to ground evaluation in human assessment, even if the characteristics of that human assessment must change.
2 code implementations • 5 Apr 2024 • Negar Arabzadeh, Charles L. A. Clarke
Given that Gen-IR systems do not generate responses from a fixed set, we assume that methods for Gen-IR evaluation must largely depend on LLM-generated labels.
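As one illustration of what LLM-generated labeling might look like for Gen-IR evaluation, the sketch below asks an LLM judge to grade a generated response on a small relevance scale. The OpenAI client, model name, prompt wording, and 0-3 scale are all illustrative assumptions, not the paper's protocol.

```python
# A hedged sketch of LLM-based relevance labeling for Gen-IR evaluation.
# The OpenAI client, model name, prompt, and 0-3 scale are illustrative
# assumptions, not the protocol from the paper.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def llm_relevance_label(query: str, response: str) -> str:
    """Ask an LLM judge to grade a generated response for a query."""
    prompt = (
        "Rate how well the response answers the query on a scale of "
        "0 (not relevant) to 3 (perfectly relevant). "
        "Reply with a single digit.\n"
        f"Query: {query}\nResponse: {response}"
    )
    completion = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
    )
    return completion.choices[0].message.content.strip()

print(llm_relevance_label("best hikes near Banff",
                          "The Plain of Six Glaciers trail is a popular option."))
```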
no code implementations • 31 Jan 2024 • Negar Arabzadeh, Charles L. A. Clarke
The rapid advancement of natural language processing, information retrieval (IR), computer vision, and other technologies has created significant challenges for evaluating the performance of these systems.
no code implementations • 9 Jan 2024 • Negar Arabzadeh, Amin Bigdeli, Charles L. A. Clarke
In the second, we compare generated answers to the top results retrieved by a diverse set of retrieval models, ranging from traditional approaches to advanced methods, allowing us to measure improvements without human judgments.
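A minimal sketch of this second approach follows, scoring a generated answer against the top-ranked passages from a retrieval model. The Jaccard token-overlap similarity and helper names are placeholder assumptions, not the paper's actual matcher.

```python
# A minimal sketch (not the authors' exact protocol): score a generated
# answer against the top passages from a retrieval model, using Jaccard
# token overlap as a placeholder similarity function.
def overlap_similarity(a: str, b: str) -> float:
    """Jaccard overlap of lowercased whitespace tokens."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0

def answer_vs_retrieved(generated_answer: str, top_passages: list[str]) -> float:
    """Best similarity between the generated answer and any top passage."""
    return max(overlap_similarity(generated_answer, p) for p in top_passages)

# A higher score suggests the generated answer agrees with what strong
# retrieval models surface, without requiring human judgments.
score = answer_vs_retrieved(
    "The Amazon is the largest rainforest on Earth.",
    ["The Amazon rainforest is the largest tropical rainforest on Earth.",
     "Rainforests are dense forests that receive heavy rainfall."],
)
print(f"agreement score: {score:.2f}")
```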
no code implementations • 20 Sep 2023 • Siqing Huo, Negar Arabzadeh, Charles L. A. Clarke
After presenting a question to an LLM and receiving a generated answer, we query the corpus with the question concatenated with the generated answer; a sketch of this step follows below.
Generative Question Answering • Open-Domain Question Answering • +1
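The sketch below illustrates the question-plus-answer querying step, using the rank_bm25 package as a stand-in for the corpus retrieval system; the three-document corpus and the evidence-inspection step are illustrative assumptions.

```python
# A sketch of the question + generated-answer querying step, using the
# rank_bm25 package as a stand-in for the corpus retrieval system; the
# three-document corpus is an illustrative assumption.
from rank_bm25 import BM25Okapi

corpus = [
    "The Amazon rainforest is the largest tropical rainforest on Earth.",
    "BM25 is a classic sparse retrieval scoring function.",
    "Ottawa is the capital city of Canada.",
]
bm25 = BM25Okapi([doc.lower().split() for doc in corpus])

question = "What is the capital of Canada?"
generated_answer = "The capital of Canada is Ottawa."

# Query with the question concatenated with the generated answer, then
# inspect the top-ranked passage as candidate supporting evidence.
query_tokens = (question + " " + generated_answer).lower().split()
evidence = bm25.get_top_n(query_tokens, corpus, n=1)
print("candidate supporting evidence:", evidence[0])
```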
no code implementations • 23 Jun 2023 • Siqing Huo, Negar Arabzadeh, Charles L. A. Clarke
After presenting a question to an LLM and receiving a generated answer, we query the corpus with the question concatenated with the generated answer (the same step sketched above).
1 code implementation • 11 May 2023 • Ehsan Kamalloo, Nouha Dziri, Charles L. A. Clarke, Davood Rafiei
The recent success of large language models (LLMs) for QA aggravates lexical matching failures since candidate answers become longer, thereby making matching with the gold answers even more challenging.
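The toy example below shows this failure mode: a verbose but correct LLM answer scores zero under exact match and low under token F1 against a short gold answer. The normalization and metrics are standard textbook versions, not the paper's exact evaluation code.

```python
# A toy illustration of the failure mode: a verbose but correct LLM answer
# scores zero under exact match and low under token F1 against a short gold
# answer. These are standard textbook metrics, not the paper's code.
import re

def tokens(s: str) -> list[str]:
    return re.findall(r"\w+", s.lower())

def exact_match(pred: str, gold: str) -> bool:
    return tokens(pred) == tokens(gold)

def token_f1(pred: str, gold: str) -> float:
    p, g = tokens(pred), tokens(gold)
    common = sum(min(p.count(t), g.count(t)) for t in set(g))
    if common == 0:
        return 0.0
    precision, recall = common / len(p), common / len(g)
    return 2 * precision * recall / (precision + recall)

gold = "Ottawa"
llm_answer = "The capital of Canada is Ottawa, located on the Ottawa River."
print(exact_match(llm_answer, gold))        # False, despite being correct
print(f"{token_f1(llm_answer, gold):.2f}")  # 0.17: penalized for length
```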
no code implementations • 9 Aug 2022 • Negar Arabzadeh, Mahsa Seifikar, Charles L. A. Clarke
While the research community has paid substantial attention to predicting query ambiguity in traditional search contexts, relatively little attention has been paid to predicting when this ambiguity is sufficient to warrant clarification in conversational systems.
no code implementations • 9 Aug 2022 • Dahlia Shehata, Negar Arabzadeh, Charles L. A. Clarke
In this work, we propose boosting the performance of sparse retrievers by expanding both the queries and the documents with linked entities in two formats for the entity names: 1) explicit and 2) hashed.
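A minimal sketch of the two expansion formats follows, assuming entity linking has already produced the entity names; MD5 and the 12-character truncation are illustrative choices. The hashed variant collapses each entity name into a single opaque token, so a sparse retriever matches the entity as one unit rather than word by word.

```python
# A minimal sketch of the two entity-expansion formats, assuming entity
# linking has already produced the entity names; MD5 and the 12-character
# truncation are illustrative choices, not the paper's exact scheme.
import hashlib

def expand_explicit(text: str, entities: list[str]) -> str:
    """Append entity names verbatim."""
    return text + " " + " ".join(entities)

def expand_hashed(text: str, entities: list[str]) -> str:
    """Append one opaque hash token per entity name."""
    hashed = [hashlib.md5(e.encode()).hexdigest()[:12] for e in entities]
    return text + " " + " ".join(hashed)

query = "who wrote the origin of species"
linked_entities = ["Charles Darwin", "On the Origin of Species"]
print(expand_explicit(query, linked_entities))
print(expand_hashed(query, linked_entities))
```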
2 code implementations • 21 Apr 2022 • Xinyi Yan, Chengxi Luo, Charles L. A. Clarke, Nick Craswell, Ellen M. Voorhees, Pablo Castells
Based on these simulations, one algorithm stands out for its potential.
no code implementations • 22 Sep 2021 • Negar Arabzadeh, Xinyi Yan, Charles L. A. Clarke
These hybrid retrievers combine low-cost, exact-matching sparse retrievers with dense retrievers to bridge the semantic gap between queries and documents.
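One common way such hybrids fuse evidence is score interpolation; the sketch below min-max normalizes each retriever's scores per query and takes a convex combination. The alpha value and the normalization scheme are illustrative assumptions, not necessarily the fusion these papers evaluate.

```python
# A sketch of one common sparse-dense fusion: min-max normalize each
# retriever's scores per query, then interpolate. The alpha value and the
# normalization scheme are illustrative, not the papers' specific method.
def minmax(scores: dict[str, float]) -> dict[str, float]:
    lo, hi = min(scores.values()), max(scores.values())
    span = (hi - lo) or 1.0
    return {doc: (s - lo) / span for doc, s in scores.items()}

def hybrid_rank(sparse: dict[str, float], dense: dict[str, float],
                alpha: float = 0.5) -> list[tuple[str, float]]:
    s, d = minmax(sparse), minmax(dense)
    fused = {doc: alpha * s.get(doc, 0.0) + (1 - alpha) * d.get(doc, 0.0)
             for doc in set(s) | set(d)}
    return sorted(fused.items(), key=lambda kv: kv[1], reverse=True)

sparse_scores = {"d1": 12.3, "d2": 9.8, "d3": 4.1}    # e.g., BM25 scores
dense_scores = {"d1": 0.62, "d3": 0.71, "d4": 0.58}   # e.g., dot products
print(hybrid_rank(sparse_scores, dense_scores, alpha=0.4))
```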
1 code implementation • 31 Aug 2021 • Negar Arabzadeh, Alexandra Vtyurina, Xinyi Yan, Charles L. A. Clarke
To test this observation, we employed crowdsourced workers to make preference judgments between the top item returned by a modern neural ranking stack and a judged relevant item.
2 code implementations • 22 Jul 2020 • Charles L. A. Clarke, Alexandra Vtyurina, Mark D. Smucker
To measure the performance of a ranker, we compare its output to this preferred ordering by applying a rank similarity measure. We demonstrate the practical feasibility of this approach by crowdsourcing partial preferences for the TREC 2019 Conversational Assistance Track, replacing NDCG with a new measure named "compatibility".
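Rank-biased overlap (RBO) is one standard rank similarity measure in this spirit; the sketch below computes an unextrapolated, finite-prefix RBO between a run and a preferred ordering. The paper's "compatibility" measure builds on this idea but differs in details, so treat this as a simplified illustration rather than its definition.

```python
# A sketch of one standard rank similarity measure in this spirit:
# unextrapolated, finite-prefix rank-biased overlap (RBO). The paper's
# "compatibility" measure builds on this idea but differs in details.
def rbo(run: list[str], ideal: list[str], p: float = 0.9) -> float:
    """Top-weighted agreement between a run and a preferred ordering."""
    depth = max(len(run), len(ideal))
    score, seen_run, seen_ideal = 0.0, set(), set()
    for d in range(1, depth + 1):
        if d <= len(run):
            seen_run.add(run[d - 1])
        if d <= len(ideal):
            seen_ideal.add(ideal[d - 1])
        # Overlap of the two depth-d prefixes, discounted by p^(d-1).
        score += (p ** (d - 1)) * len(seen_run & seen_ideal) / d
    return (1 - p) * score

# Agreement near the top of the ranking counts more than agreement deeper down.
print(f"{rbo(['a', 'b', 'c', 'd'], ['a', 'c', 'b', 'e']):.3f}")
```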