Accuracy = (relevant items retrieved + nonrelevant items not retrieved) / (all items)
Okay, I see why the fact that there are usually way more nonrelevant items would make that an impractical measure.
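A tiny sketch makes the problem vivid (all the numbers here are made up): with a huge collection and only a handful of relevant items, a system can miss half the relevant documents and still score nearly perfect accuracy.

```python
# Toy collection: 1,000,000 items, only 10 relevant; system retrieves 20.
total_items = 1_000_000
relevant = set(range(10))                          # items 0..9 are relevant
retrieved = set(range(5)) | set(range(100, 115))   # 5 relevant + 15 nonrelevant

tp = len(relevant & retrieved)    # relevant items retrieved
fn = len(relevant - retrieved)    # relevant items missed
fp = len(retrieved - relevant)    # nonrelevant items retrieved
tn = total_items - tp - fn - fp   # nonrelevant items not retrieved

accuracy = (tp + tn) / total_items
precision = tp / len(retrieved)
recall = tp / len(relevant)

print(accuracy)             # ~0.99998, despite recall being only 0.5
print(precision, recall)
```

The mountain of true negatives swamps everything else, which is exactly why precision and recall are used instead.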
I don't really understand R-precision. If we need to already have a set of known relevant documents for a query, then how is it really helpful? None of what it said seemed to make any sense with the assumption that Rel is known to be relevant documents. "If ... we examine the top |Rel| results of a system, and find that r are relevant"-- What?! If they're part of Rel, then they're relevant! Why are we examining them? Why is the result of that examination meaningful if it's tautologically true?
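Sketching it with made-up documents is how I'd try to untangle this: Rel comes from the human assessors, but the top |Rel| results come from the *system's* ranking, so the two sets needn't coincide at all.

```python
# Hypothetical judged-relevant set (from assessors) and a system's ranking.
rel = {"d1", "d4", "d7"}                        # |Rel| = 3 known relevant docs
ranking = ["d4", "d9", "d1", "d2", "d7", "d5"]  # system's ranked output

top = ranking[:len(rel)]               # examine the system's top |Rel| results
r = sum(1 for d in top if d in rel)    # how many of those happen to be in Rel
r_precision = r / len(rel)
print(r_precision)                     # 2 of the top 3 are relevant
```

Here the system put d9 in its top three instead of d7, so r = 2 < |Rel|; nothing tautological about the examination.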
I'm curious about the kappa statistic and how it's modified when there are different groups of judges. They say that a kappa value between 0.67 and 0.8 is usually seen as "fair", but what if there's a group of judges that are all European law professionals, and then another group that's American junior high students, and then a third group that's rural African villagers? Have there been studies determining how much agreement there is between very disparate groups?
Accepting the idea that all judges are part of the same single pool, why is it 0.67 to 0.8? Is higher than 0.8 NOT fair? Why isn't it just "greater than 0.67"?
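For concreteness, here's a minimal two-judge kappa on made-up relevance judgments (this is the simple Cohen form, where chance agreement uses each judge's own marginal rates; the textbook's pooled-marginals variant differs slightly):

```python
# Two judges' binary relevance judgments on 8 hypothetical documents.
j1 = [1, 1, 0, 1, 0, 0, 1, 1]   # judge 1: relevant (1) / nonrelevant (0)
j2 = [1, 0, 0, 1, 0, 1, 1, 1]   # judge 2

n = len(j1)
observed = sum(a == b for a, b in zip(j1, j2)) / n   # P(A): observed agreement
p1 = sum(j1) / n                                     # judge 1's "relevant" rate
p2 = sum(j2) / n
# P(E): agreement expected by chance if both judged at their marginal rates
expected = p1 * p2 + (1 - p1) * (1 - p2)
kappa = (observed - expected) / (1 - expected)
print(round(kappa, 3))
```

Kappa corrects raw agreement for what two coin-flipping judges would produce, which is why 75% raw agreement here only yields a kappa of about 0.47.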
Taking into account marginal relevance seems extremely useful, but also extremely difficult. It would require the system to correctly judge how similar the information is in two different documents. Also, a misjudgment would lead to much decreased recall. If a retrieval system judged two types of documents to have the same information when they in fact don't, then the second type of document might be left out of the retrieved documents entirely, or tucked into a "see others like this" corner.
As usual, user studies are expensive and time-consuming. I bet one of the most useful things an information science professional could do is find the most effective ways to integrate information retrieval systems so that the data we would want from a user study is gathered organically when somebody uses the system. Big money-saver. I wonder how often companies, when developing a new type of system, slip little tests into already-in-use systems to gauge this without having to go through the expense of full user studies (or at least get away with fewer of them)?
-->Ha, the very next section answers me that yes, they do! It's called A/B testing. Good to know my thought processes are on the right track.
R-precision aside, I found the material this week much more straightforward than the past couple of weeks. Determining user utility and creating dynamic summaries feel like common sense the way that judging whether a query would generate a document and vice versa doesn't.
Muddiest Point: I guess it would be the process of inferring a language model for a document. I think maybe I'm thinking of a "language model" as a more complicated thing than it actually is, but I'd really like to see a simple example of exactly what a language model would be for a short document.
Mel Knapp's Reading Notes and Muddiest Points
A blog created for INFSCI 2140: Information Storage and Retrieval
Friday, February 17, 2012
Friday, February 10, 2012
Probabilistic Information Retrieval
Chapter 11: For the first half or so I was diligent about trying to work out all the math and understand all the most technical details, but as it went on I started going for a more conceptual understanding. I have a strong math background but not very much statistics, so I'm not very familiar with the conventions.
Having come from very theoretical math, I find the attitude "this assumption isn't true, but in practice if we use it the results are fine, so we'll use it" quite a novelty! It makes sense though. As was said in an earlier reading, when we've got a human user, they aren't really looking for 100% technical accuracy. They want something that will fulfill their information needs. If an overly simplistic view of the character of a document-- in this case, making the terms independent in documents, as in the Binary Independence Model-- doesn't bring results that seem particularly less useful for the user, then why make it more complicated? We don't actually have to model all the complexities of the world and meaning and how language is interrelated etc etc-- we just need enough to make a reasonable prediction of whether or not a document is relevant for a given query (or, perhaps more specifically, a given user's query).
(And, yes, the "smoothing" principle also threw me for a bit of a loop. It's actually kind of liberating!)
However, the reading also pointed out that some of the assumptions can be harmful, so it's important to look at them carefully. Assuming that the relevance of each document is independent of the relevance of other documents could create many overlaps, and the user could be faced with slightly different versions of the same document over and over again-- something sure to frustrate an information seeker!
I was very interested in what we do when we stop assuming that terms that *aren't* in the query cannot be indicators of whether or not a document is relevant. So then pt and ut can represent any term appearing in the documents, not just query words.
As you could probably guess from previous posts, I was excited to see more and more user interaction used to refine models. That's what I'm talkin' 'bout! The iterative updating process always seems elegant to me. The process is relatively simple: Start with a high value of κ, since you're starting with very little (or no) feedback. Then, as users make judgments as to whether or not documents are relevant to a given query, record the relevant/non-relevant judgment for each document and which terms are in that document. Now, with pt and ut ever so slightly more refined, use those values for the next iteration, and decrease κ a little. The more iterations are run, the smaller κ gets and the more precise pt and ut become. My guess is that the relevance judgment precision is asymptotic and κ will approach 0, at least in theory.
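One update step, as I read the chapter's κ-smoothed estimate, mixes the observed feedback counts with the previous estimate (the feedback counts and κ schedule below are made up):

```python
# One kappa-smoothed update of p_t, the probability that term t appears
# in a relevant document: observed feedback blended with the prior estimate.
def update_pt(vr_t, vr, p_prev, kappa):
    """vr_t: judged-relevant docs containing t; vr: all judged-relevant docs."""
    return (vr_t + kappa * p_prev) / (vr + kappa)

p = 0.5  # start from an uninformative prior
for kappa, vr_t, vr in [(5.0, 3, 4), (4.0, 7, 10), (3.0, 15, 20)]:
    p = update_pt(vr_t, vr, p, kappa)   # kappa shrinks as feedback accumulates
    print(round(p, 3))
```

With small κ and lots of feedback the estimate is dominated by the observed fraction vr_t/vr, which matches the intuition that κ should shrink toward 0 as judgments pile up.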
In practice, I'm sure that the system gets conflicting information about whether a document belongs in VR or VNR. Different users find different documents useful, and are probably searching for slightly different things even if their query terms are the same or similar. The answer for this seems fairly straightforward; we are, after all, talking about the probability that a given document is relevant. Just use some model to take into account how likely "most users" are to think that a document is relevant given how many previous users labeled it VR and how many labeled it VNR. However, the system can no longer be Boolean (binary).
That idea segued quite nicely into the next section! Just as they have the tuning parameter k1 to calibrate how important term frequency is for the relevance outcome, there could be another parameter, let's call it g1, to calibrate how important the accumulated user relevance judgments are for the relevance outcome. Similarly to k1, for g1=0 it would be a binary model, and for a large value of g1 the percentage of users who judged it relevant would be the main criterion used.
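A sketch of that idea: the k1-style saturation function from the chapter, applied once to term frequency and once (with my hypothetical g1 knob, which is not in the book) to the fraction of users who judged the document relevant. All numbers here are invented.

```python
# Saturating weight used by the k1 tuning parameter: k = 0 gives a binary
# signal (1 whenever x > 0), large k makes the weight nearly linear in x.
def saturate(x, k):
    return (k + 1) * x / (k + x) if x > 0 else 0.0

tf_weight = saturate(4, k=1.2)               # term frequency contribution
frac_relevant = 18 / 25                      # 18 of 25 users judged it relevant
user_weight = saturate(frac_relevant, k=2.0) # hypothetical g1 = 2.0
print(round(tf_weight, 3), round(user_weight, 3))
```

At g1 = 0 the user signal collapses to "at least one user said relevant," and at large g1 it tracks the judged-relevant percentage almost linearly, which is the behavior described above.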
Chapter 12: Language Models for Information Retrieval
Building a "probabilistic language model Md" out of each document, and then ranking documents by how likely each model Md is to generate the query.
I hadn't thought of documents as generating queries, so this is a new framework for me!
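A minimal sketch of the query likelihood idea, using tiny toy documents (the first is the chapter's "click go the shears boys" example): build a maximum-likelihood unigram model per document and score by the probability of generating the query.

```python
# Query likelihood: score each document by how likely its MLE unigram
# model is to generate the query, term by term.
from collections import Counter

docs = {
    "d1": "click go the shears boys click click click".split(),
    "d2": "metal shears click here to buy".split(),
}
query = "shears click".split()

def p_query(doc_tokens):
    counts = Counter(doc_tokens)
    n = len(doc_tokens)
    prob = 1.0
    for term in query:
        prob *= counts[term] / n   # MLE: zero if any query term is absent!
    return prob

for d, toks in docs.items():
    print(d, p_query(toks))
```

The comment in the loop is the catch: a single missing query term zeroes out the whole document, which is exactly the problem smoothing exists to fix.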
I'm having trouble understanding the likelihood ratio. If you have Model 1 and Model 2, do you divide Model 1 by Model 2, or Model 2 by Model 1? The outcome won't be the same, so I can't understand how both of them (that is, M2/M1 and M1/M2) could be equally correct. I also don't really understand why it's necessary to use two different models. I guess it smooths the outcome set, but that doesn't seem like enough of a reason unless it's much easier to create a model than I think, or there are much worse costs for only using a single model than I realize.
It's interesting how term smoothing isn't just there to decrease the problems of outliers and insufficient data, but can in fact implement important parts of term weighting. I'm surprised it's still referred to as "smoothing" in that case; it seems like a bit of a misnomer.
A document that doesn't contain a query term q should still be a possibility, but the probability should be less than or equal to the probability of the term appearing by chance in every other document! That's a very neat solution, and seems reasonable: the likelihood that a query term just happened to be left out of the document by chance is <= the likelihood that a query term happened to appear in a random document. I'd set the probability for P(d,q) equal to
(the sum of the probability of q in each of the non-relevant documents) / (the total number of non-relevant documents)
But, of course, we can't use that because the whole point is that we don't know which documents are relevant or not relevant. So we just have to use ALL the documents, instead of just the non-relevant ones. That's why the value of P(d,q) should be LESS THAN that quotient, instead of equal.
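One standard way to get that behavior is linear interpolation ("Jelinek-Mercer") smoothing: mix the document model with a whole-collection model, so a term absent from the document keeps a small, collection-based probability. The counts below are made up.

```python
# Jelinek-Mercer smoothing: interpolate the document's MLE model with
# the collection model, controlled by the tuning parameter lam.
def smoothed_p(term, doc_counts, doc_len, coll_counts, coll_len, lam=0.5):
    p_doc = doc_counts.get(term, 0) / doc_len     # 0 if term absent from doc
    p_coll = coll_counts.get(term, 0) / coll_len  # collection-wide estimate
    return lam * p_doc + (1 - lam) * p_coll

doc = {"shears": 1}                       # 8-token toy document, no "click"
coll = {"shears": 5, "click": 40}         # 1000-token toy collection
print(smoothed_p("click", doc, 8, coll, 1000))   # nonzero despite absence
```

Since the collection contains mostly documents in which the term does not appear, the collection-based fallback is (roughly) the "appears by chance in a random document" ceiling described above.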
Muddiest Points: The GSA's guest lecture was fairly straightforward and mostly just covered the material in that week's reading. I don't have any muddiest points.
Friday, February 3, 2012
Term Weighting and the Vector Space Model
Wow, I knew that you needed math for computer science, but I thought that was more about understanding the way of thinking! To understand this reading, I had to bring out my Linear Algebra book. I took that seven years ago!
This reading said near the beginning that "the expensive component of this methodology is the labor-intensive assembly of user-generated relevance judgments from which to learn the weights". But for a service like Google, they can gather that information by seeing which documents people click on, or some function of how long they stay on each document. That has its own issues with how to weight these variables, and once again a large issue is the length of the document; an extremely relevant document that is extremely short should not be excluded just because the user only has to look for 5 seconds to obtain the information desired.
Thinking of documents as vectors was a new concept to me, and it was very difficult to grasp at first, but there are a lot of benefits to changing our conceptual model in this way. Once I got the hang of it, I realized that the similarity function is actually very straightforward. If a term doesn't exist in a document, it's given a value of 0, so if a term shows up in Doc1 and not Doc2, then the Doc1[i] value is multiplied by 0, since Doc2[i]=0. Thus only terms that show up in both documents count; their products are added together, and the sum is normalized by the documents' vector lengths.
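That description is short enough to write out directly; here's a minimal cosine similarity over two made-up term-count vectors:

```python
# Cosine similarity between documents as sparse term-count vectors:
# terms missing from either document contribute 0 to the dot product.
import math

def cosine(v1, v2):
    dot = sum(v1.get(t, 0) * v2.get(t, 0) for t in v1)  # shared terms only
    n1 = math.sqrt(sum(x * x for x in v1.values()))     # vector lengths
    n2 = math.sqrt(sum(x * x for x in v2.values()))
    return dot / (n1 * n2)

doc1 = {"car": 2, "insurance": 1, "auto": 3}
doc2 = {"car": 1, "insurance": 2}
print(round(cosine(doc1, doc2), 3))
```

Only "car" and "insurance" survive into the dot product; "auto" is multiplied by an implicit 0, just as described above.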
I worked out some of the math in my notebook in order to understand where the formulas came from, but I'm not sure how to put all that math notation onto this blog! It did come together very neatly, though.
The maximum tf normalization was an interesting idea. As the chapter stated, there are definitely times when it would go wrong. This is especially so in the case where one or more words are used very commonly in a particular discipline. It's discipline-specific, so it can't be a stopword for general use, but having that word set the maximum tf normalization would make the search service extremely inefficient for anybody who is part of that discipline.
One thing that occurred to me was to not use the absolute maximum, but to eliminate one or two outliers and choose the third or fourth most highly used term. The decrease in the average value of tfmax(d) could be mitigated by slightly lowering a, for example to a=0.35.
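To make the comparison concrete, here is the chapter's smoothed normalization ntf = a + (1-a)·tf/tfmax(d) next to my outlier-trimming variant (the frequency values are made up, and the trimmed variant is my own suggestion, not the book's):

```python
# Smoothed maximum tf normalization: a + (1 - a) * tf / tf_max.
def ntf(tf, tf_max, a=0.4):
    return a + (1 - a) * tf / tf_max

freqs = sorted([120, 95, 14, 12, 9, 7], reverse=True)  # made-up tf values

standard = ntf(12, tf_max=freqs[0])         # normalize by the true maximum
trimmed = ntf(12, tf_max=freqs[2], a=0.35)  # skip two outliers, lower a
print(round(standard, 3), round(trimmed, 3))
```

With two runaway outliers in the document, the standard form squashes an ordinary term down near a, while the trimmed form keeps it distinguishable.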
Muddiest Points: I had a lot of trouble understanding the pivoted normalized document length. I'm just not quite putting together the idea of probability as a function of document length. The graph made it look like the longer a document is, the more likely it is to be relevant, which doesn't sound right at all.
Saturday, January 21, 2012
It's amazing how obvious it is that a text was written only a few years ago. A lot of the things they explain are just intuitively obvious for someone who's used Google for the past 5+ years.
Also, for so many of the issues with programming searches, it's sick how often my brain just jumps to:
"Hasn't someone already done all of this? Isn't it open source? Yes? Okay, sweet. Let's move on!"
Or, perhaps slightly more relevantly:
"Why go through all these guesses about what human beings mean? Why not start with a simple model, and record what search terms people use for Google/Wikipedia/Dictionary.com/whatever, and what page/word/article they ultimately choose? After 100 or 1000 people have done this, and the agreement rate is over, say, 95%, they add that association into your indexing relations. No more guesswork!"
But of course, 4 or 5 years from now, a programmer or information scientist reading that suggestion would probably consider it outdated.
Is there an application available, probably as a browser extension, that would allow users to tag web sites with whatever labels they think are relevant? I think it would be tremendously useful. I need to think more about that. There must be inherent problems or it would be used popularly by now, right?
I'm not sure I get the point of biwords and their extensions. I thought that the process was pretty much standardized:
1. If it's in quotation marks, then it's a phrase.
2. If there are commas between the words, treat them as separate words (AFTER the first step, so anything inside the quotations wouldn't apply).
3. If it's not in quotation marks, first treat it as a phrase and return those results, then afterwards treat them as separate words and return those results, too.
Isn't that pretty much agreed upon? I don't think it's ever steered me wrong using a popular search engine before.
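The rules above are simple enough to sketch directly (this is just my reading of them, not any real engine's query parser):

```python
# Quote-handling sketch: quoted spans become phrases; everything left
# over is split on commas and whitespace into separate terms.
import re

def parse_query(q):
    phrases = re.findall(r'"([^"]+)"', q)    # step 1: pull out quoted phrases
    rest = re.sub(r'"[^"]*"', " ", q)        # drop quoted spans from the rest
    terms = [t for t in re.split(r"[,\s]+", rest) if t]  # steps 2-3
    return phrases, terms

print(parse_query('"vector space" model, ranking'))
```

The comma inside a quoted span would never reach the splitting step, which matches the "AFTER the first step" caveat in rule 2. (Rule 3's phrase-then-terms fallback for unquoted queries would be a second pass over `terms`.)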