Friday, February 10, 2012

Probabilistic Information Retrieval

Chapter 11: For the first half or so I was diligent about working out all the math and understanding all the most technical details, but as it went on I shifted toward a more conceptual understanding. I have a strong math background but not very much statistics, so I'm not very familiar with the conventions.

Having come from very theoretical math, I find the attitude of "this assumption isn't true, but in practice if we use it the results are fine, so we'll use it" quite a novelty! It makes sense, though. As was said in an earlier reading, when we've got a human user, they aren't really looking for 100% technical accuracy. They want something that will fulfill their information needs. If an overly simplistic view of the character of a document-- in this case, treating the terms in a document as independent, as in the Binary Independence Model-- doesn't bring results that are noticeably less useful for the user, then why make it more complicated? We don't actually have to model all the complexities of the world and meaning and how language is interrelated etc. etc.-- we just need enough to make a reasonable prediction of whether or not a document is relevant for a given query (or, perhaps more specifically, a given user's query).
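To make the payoff of the independence assumption concrete, here's a minimal sketch (with made-up pt and ut values, not numbers from the book) of how independence lets the retrieval status value decompose into a simple sum of per-term log-odds weights:

```python
import math

# Toy per-term probabilities (hypothetical values, not from the text):
# p_t = P(term present | relevant), u_t = P(term present | non-relevant)
probs = {
    "probabilistic": (0.8, 0.3),
    "retrieval":     (0.6, 0.4),
}

def rsv(doc_terms, query_terms):
    """Retrieval Status Value under the Binary Independence Model.
    Because terms are assumed independent, the log of the relevance odds
    decomposes into a sum of per-term weights for matching query terms."""
    score = 0.0
    for t in query_terms:
        if t in doc_terms and t in probs:
            p, u = probs[t]
            score += math.log(p * (1 - u) / (u * (1 - p)))
    return score

print(rsv({"probabilistic", "retrieval", "model"}, ["probabilistic", "retrieval"]))
```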

(And, yes, the "smoothing" principle also threw me for a bit of a loop. It's actually kind of liberating!)

However, the reading also pointed out that some of the assumptions can be harmful, so it's important to look at them carefully. Assuming that the relevance of each document is independent of the relevance of other documents could create many overlaps, and the user could be faced with slightly different versions of the same document over and over again-- something sure to frustrate an information seeker!

I was very interested in what happens when we stop assuming that terms that *aren't* in the query cannot be indicators of whether or not a document is relevant. Then pt and ut can represent any term appearing in the documents, not just query terms.

As you could probably guess from previous posts, I was excited to see more and more user interaction used to refine models. That's what I'm talkin' 'bout! The iterative updating process always seems elegant to me. The process is relatively simple: start with a high value of κ, since you're starting with very little (or no) feedback. Then, as users make judgments as to whether or not documents are relevant to a given query, record the relevant/non-relevant judgment for each document and which terms are in that document. Now, with pt and ut ever so slightly more refined, use those values for the next iteration, and decrease κ a little. The more iterations are run, the smaller κ gets and the more precise pt and ut become. My guess is that the relevance judgment precision is asymptotic and κ will approach 0, at least in theory.
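Here's a small sketch of one round of that iterative update as I understand it from the chapter, pt(k+1) = (|VRt| + κ·pt(k)) / (|VR| + κ); the judgment-batch counts and the starting κ below are made-up toy values:

```python
def update_pt(p_prev, vr_size, vr_t_size, kappa):
    """One round of relevance-feedback updating:
    p_t^(k+1) = (|VR_t| + kappa * p_t^(k)) / (|VR| + kappa).
    kappa weights the prior estimate: with little feedback a large kappa
    keeps p_t near its prior, and shrinking kappa lets the accumulated
    user judgments dominate."""
    return (vr_t_size + kappa * p_prev) / (vr_size + kappa)

p = 0.5          # uninformative starting estimate for p_t
kappa = 16.0     # hypothetical starting weight, decreased each round
for vr, vr_t in [(10, 7), (20, 15), (40, 31)]:  # toy judgment batches
    p = update_pt(p, vr, vr_t, kappa)
    kappa /= 2   # trust the prior less as evidence accumulates
    print(round(p, 3))
```

With each batch, p drifts from the neutral 0.5 toward the observed fraction of relevant documents containing the term, which matches the asymptotic behavior I guessed at above.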

In practice, I'm sure that the system gets conflicting information about whether a document belongs in VR or VNR. Different users find different documents useful, and are probably searching for slightly different things even if their query terms are the same or similar. The answer for this seems fairly straightforward; we are, after all, talking about the probability that a given document is relevant. Just use some model to take into account how likely "most users" are to think that a document is relevant given how many previous users labeled it VR and how many labeled it VNR. However, the system can no longer be Boolean (binary).

That idea segued quite nicely into the next section! Just as they have the tuning parameter k1 to calibrate how important term frequency is for the relevance outcome, there could be another parameter, let's call it g1, to calibrate how important the accumulated user relevance judgments are for the relevance outcome. Similarly to k1, for g1=0 it would be a binary model, and for a large value of g1 the percentage of users who judged it relevant would be the main criterion used.
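Just to see how a g1 like that could behave, here's a sketch modeled on the shape of the k1 term-frequency saturation formula -- entirely my own extrapolation, not something from the book:

```python
def judgment_weight(frac_relevant, g1):
    """Hypothetical calibration of accumulated user judgments, mirroring
    the shape of the k1 term-frequency saturation component:
    g1 = 0 collapses to a binary signal (any relevant vote counts fully),
    while a large g1 approaches the raw fraction of users who judged the
    document relevant."""
    r = frac_relevant
    if r == 0:
        return 0.0
    return ((g1 + 1) * r) / (g1 + r)

# 60% of users judged the document relevant; vary the calibration:
for g1 in (0.0, 1.0, 10.0):
    print(g1, round(judgment_weight(0.6, g1), 3))
```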


Chapter 12: Language Models for Information Retrieval

The idea is to build a "probabilistic language model Md" out of each document, and then rank documents by how likely each document's model Md is to generate the query.
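A minimal sketch of that query-likelihood idea, using plain maximum-likelihood estimates and toy tokens of my own (no smoothing yet):

```python
from collections import Counter

def query_likelihood(doc_tokens, query_tokens):
    """Maximum-likelihood query likelihood: P(q|Md) is the product over
    query terms of tf(t, d) / |d|.  Unsmoothed, so any query term missing
    from the document zeroes the whole product -- which is exactly the
    problem smoothing addresses."""
    counts = Counter(doc_tokens)
    length = len(doc_tokens)
    prob = 1.0
    for t in query_tokens:
        prob *= counts[t] / length
    return prob

doc = "language models generate queries and language models rank".split()
print(query_likelihood(doc, ["language", "models"]))
```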

I hadn't thought of documents as generating queries, so this is a new framework for me!

I'm having trouble understanding the likelihood ratio. If you have Model 1 and Model 2, do you divide Model 1 by Model 2, or Model 2 by Model 1? The outcome won't be the same, so I can't understand how both of them (that is, M2/M1 and M1/M2) could be equally correct. I also don't really understand why it's necessary to use two different models. I guess it smooths the outcome set, but that doesn't seem like enough of a reason unless it's much easier to create a model than I think, or there are much worse costs for only using a single model than I realize.
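To convince myself that the direction of the ratio is at least partly a convention, here's a toy sanity check (my own, not from the book): taking the reciprocal of every document's ratio changes the numbers but not the information, so ranking by descending M1/M2 induces the same ordering as ranking by ascending M2/M1:

```python
# Hypothetical per-document likelihoods under two models (toy numbers)
likelihoods = {"d1": (0.30, 0.10), "d2": (0.20, 0.40), "d3": (0.25, 0.25)}

ratio_12 = {d: m1 / m2 for d, (m1, m2) in likelihoods.items()}
ratio_21 = {d: m2 / m1 for d, (m1, m2) in likelihoods.items()}

rank_a = sorted(ratio_12, key=ratio_12.get, reverse=True)  # descending M1/M2
rank_b = sorted(ratio_21, key=ratio_21.get)                # ascending  M2/M1

print(rank_a, rank_b)  # the two conventions induce the same ordering
```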

It's interesting how term smoothing isn't just there to decrease the problems of outliers and insufficient data, but can in fact implement major parts of the term weighting component. I'm surprised it's still referred to as "smoothing" in that case; it seems like a bit of a misnomer.
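Here's a sketch of linear (Jelinek-Mercer-style) smoothing, P(t|d) = λ·P(t|Md) + (1-λ)·P(t|Mc), with toy tokens of my own, showing how mixing in the collection model keeps terms absent from a document from zeroing everything out -- and, since common terms have a high collection probability everywhere, how it doubles as a kind of term weighting:

```python
from collections import Counter

def smoothed_prob(term, doc_tokens, collection_tokens, lam=0.5):
    """Jelinek-Mercer (linear) smoothing:
    P(t|d) = lam * P_mle(t|Md) + (1 - lam) * P(t|Mc).
    A term missing from the document still gets probability mass from the
    collection model; rare terms that DO match a document stand out more
    than common ones, so smoothing acts like term weighting."""
    p_doc = Counter(doc_tokens)[term] / len(doc_tokens)
    p_col = Counter(collection_tokens)[term] / len(collection_tokens)
    return lam * p_doc + (1 - lam) * p_col

collection = "a a a a rare a a a".split()  # toy collection: "rare" appears once
doc = "a a".split()                        # document without "rare"

print(smoothed_prob("rare", doc, collection))  # nonzero despite absence from doc
print(smoothed_prob("a", doc, collection))
```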

A document that doesn't contain a query term q should still be a possibility, but the probability should be less than or equal to the probability of the term appearing by chance in every other document! That's a very neat solution, and seems reasonable: the likelihood that a query term just happened to be left out of the document by chance is <= the likelihood that a query term happened to appear in a random document. I'd set the probability for P(d,q) equal to

the sum of the probability of q appearing in each of the non-relevant documents
_______________________________________________________________________________
the total number of non-relevant documents


But, of course, we can't use that because the whole point is that we don't know which documents are relevant or not relevant. So we just have to use ALL the documents, instead of just the non-relevant ones. That's why the value of P(d,q) should be LESS THAN that quotient, instead of equal.



Muddiest Points: The GSA's guest lecture was fairly straightforward and mostly just covered the material in that week's reading. I don't have any muddiest points.
