Friday, February 17, 2012

Evaluation

Accuracy = (relevant items retrieved + nonrelevant items not retrieved) / (all items)

Okay, I see why the fact that there are usually way more nonrelevant items would make that an impractical measure.
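Working through a toy example (my own made-up numbers, not from the reading) makes the problem obvious: with a huge collection and only a handful of relevant documents, even a system that retrieves nothing scores near-perfect accuracy, because the nonrelevant items it correctly "didn't retrieve" dominate the numerator.

```python
total_docs = 1_000_000
relevant = 100

relevant_retrieved = 0          # the useless system retrieves nothing at all
nonrelevant_not_retrieved = total_docs - relevant

# accuracy = (relevant retrieved + nonrelevant not retrieved) / all items
accuracy = (relevant_retrieved + nonrelevant_not_retrieved) / total_docs
print(accuracy)  # 0.9999 -- near-perfect, despite returning nothing
```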




I don't really understand R-precision. If we need to already have a set of known relevant documents for a query, then how is it really helpful? None of what the text said made sense to me under the assumption that Rel is a set of known relevant documents. "If ... we examine the top |Rel| results of a system, and find that r are relevant"-- What?! If they're part of Rel, then they're relevant! Why are we examining them? Why is the result of that examination meaningful if it's tautologically true?
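Writing it out as a toy sketch helped me untangle this (example data is entirely made up): Rel comes from the relevance judgments, not from the system's output, so the system's top |Rel| results are not necessarily drawn from Rel at all -- only |Rel| of them could possibly be relevant, and we count how many actually are.

```python
rel = {"d1", "d4", "d7"}                   # judged-relevant docs, so |Rel| = 3
ranking = ["d4", "d2", "d7", "d1", "d9"]   # the system's ranked output

top = ranking[:len(rel)]                   # examine the top |Rel| results
r = sum(1 for d in top if d in rel)        # r of them turn out to be relevant
r_precision = r / len(rel)
print(r_precision)  # 2/3: d4 and d7 are in Rel, but d2 is not
```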

I'm curious about the kappa statistic and how it's modified when there are different groups of judges. They say that a kappa value between 0.67 and 0.8 is usually seen as "fair", but what if there's a group of judges that are all European law professionals, and then another group that's American junior high students, and then a third group that's rural African villagers? Have there been studies determining how much agreement there is between very disparate groups?

Accepting the idea that all judges are part of the same single pool, why is it 0.67 to 0.8? Is higher than 0.8 NOT fair? Why isn't it just "greater than 0.67"?
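To pin down what those numbers mean, here's a sketch of the kappa calculation as I understand it from the chapter (the judgment counts are invented): kappa corrects the raw agreement rate between two judges for the agreement we'd expect by chance from the pooled marginals.

```python
# Made-up counts for two judges labeling 400 documents relevant/nonrelevant.
yes_yes, yes_no, no_yes, no_no = 300, 20, 10, 70
total = yes_yes + yes_no + no_yes + no_no

p_agree = (yes_yes + no_no) / total                  # observed agreement
# pooled marginal probabilities of "relevant" and "nonrelevant"
p_rel = (2 * yes_yes + yes_no + no_yes) / (2 * total)
p_nonrel = (2 * no_no + yes_no + no_yes) / (2 * total)
p_chance = p_rel ** 2 + p_nonrel ** 2                # agreement by chance
kappa = (p_agree - p_chance) / (1 - p_chance)
print(round(kappa, 3))  # 0.776 -- inside the 0.67-0.8 "fair" band
```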

Taking marginal relevance into account seems extremely useful, but also extremely difficult. It would require the system to correctly judge how similar the information in two different documents is. Also, a misjudgment would lead to greatly decreased recall: if a retrieval system judged two types of documents to contain the same information when they in fact don't, then the second type of document might be left out of the retrieved results entirely, or tucked into a "see others like this" corner.

As usual, user studies are expensive and time-consuming. I bet one of the most useful things an information science professional could do is find the most effective ways to integrate information retrieval systems so that the data we would want from a user study is gathered organically when somebody uses the system. Big money-saver. I wonder how often, when a new type of system is being developed, companies slip little tests into already-in-use systems to gauge this without having to go through the expense of user studies (or at least get away with fewer of them)?

-->Ha, the very next section answers me that yes, they do! It's called A/B testing. Good to know my thought processes are on the right track.

R-precision aside, I found the material this week much more straightforward than the past couple of weeks. Determining user utility and creating dynamic summaries feel like common sense the way that judging whether a query would generate a document and vice versa doesn't.


Muddiest Point: I guess it would be the process of inferring a language model for a document. I think maybe I'm thinking of a "language model" as a more complicated thing than it actually is, but I'd really like to see a simple example of exactly what a language model would be for a short document.
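Trying to answer my own question with the simplest version I can imagine (my own toy document, and assuming the basic unigram case): a language model for a short document would just be the maximum-likelihood probability of each term, i.e. its count divided by the document length, and a query's probability is the product of its terms' probabilities.

```python
from collections import Counter

doc = "the cat sat on the mat"
tokens = doc.split()
counts = Counter(tokens)

# unigram model: P(term | doc) = count(term) / document length
model = {term: n / len(tokens) for term, n in counts.items()}
print(model["the"])  # 2/6 of the tokens are "the"

# probability the model assigns to a query, e.g. "cat mat":
query_prob = model["cat"] * model["mat"]  # (1/6) * (1/6)
```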
