Wow, I knew that you needed math for computer science, but I thought that was more about understanding the way of thinking! To understand this reading, I had to bring out my Linear Algebra book. I took that seven years ago!
This reading said near the beginning that "the expensive component of this methodology is the labor-intensive assembly of user-generated relevance judgments from which to learn the weights". But a service like Google can gather that information implicitly, by seeing which documents people click on, or some function of how long they stay on each document. That approach has its own issues of how to weight those signals, and once again a large issue is document length: an extremely relevant document that is extremely short should not be penalized just because the user only has to look at it for 5 seconds to find the information they want.
Thinking of documents as vectors was a new concept to me, and it was very difficult to grasp at first, but there are a lot of benefits to changing our conceptual model in this way. Once I got the hang of it, I realized that the similarity function was actually very straightforward, though at first I didn't understand it. If a term doesn't exist in a document, it's given a weight of 0, so if a term shows up in Doc1 but not Doc2, then the Doc1[i] value is multiplied by Doc2[i] = 0. Thus only terms that show up in both documents count toward the dot product: the products of their weights are summed, and the result is normalized by the vector lengths, which is the same as turning each document into a unit vector first.
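Here's a quick sketch of how I understand the similarity computation, using plain term-frequency weights and dictionaries for the vectors (the term names and counts are made up for illustration):

```python
from math import sqrt

def cosine_similarity(doc1, doc2):
    """Cosine similarity between two term-weight dictionaries.

    Terms absent from a document implicitly have weight 0, so only
    terms that appear in both documents contribute to the dot product.
    """
    dot = sum(w * doc2.get(term, 0.0) for term, w in doc1.items())
    norm1 = sqrt(sum(w * w for w in doc1.values()))
    norm2 = sqrt(sum(w * w for w in doc2.values()))
    if norm1 == 0 or norm2 == 0:
        return 0.0
    return dot / (norm1 * norm2)

doc1 = {"search": 3, "engine": 2, "ranking": 1}
doc2 = {"search": 1, "ranking": 2, "relevance": 4}
print(cosine_similarity(doc1, doc2))
```

Only "search" and "ranking" overlap here, so "engine" and "relevance" contribute nothing to the dot product, exactly as the zero-multiplication argument above says.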
I worked out some of the math in my notebook in order to understand where the formulas came from, but I'm not sure how to put all that math notation onto this blog! It did come together very neatly, though.
The maximum tf normalization was an interesting idea. As the chapter stated, there are definitely times when it would go wrong, especially when one or more words are used very commonly in a particular discipline. It's discipline-specific, so it can't be a stopword for generalized use, but letting it set the maximum tf would drag down the normalized weights of every other term, making the search service much less effective for anybody who is part of that discipline.
One thing that occurred to me was not to use the absolute maximum, but to eliminate one or two outliers and choose the third or fourth most highly used term instead. The decrease in the average value of tfmax(d) could be mitigated by slightly lowering a, for example to a = 0.35.
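Here's what I mean, as a small sketch: `a` is the smoothing constant from the chapter's formula ntf = a + (1 - a) * tf/tfmax, and the k parameter (my own addition, not from the chapter) picks the k-th highest term frequency as the normalizer instead of the absolute maximum:

```python
def smoothed_tf(tf, tf_counts, a=0.4, k=1):
    """Maximum-tf normalization: a + (1 - a) * tf / tf_kth.

    k=1 is the textbook version (divide by the single largest tf);
    k=3 skips the top two outliers, as suggested above.  The ratio is
    capped at 1 so the outliers themselves don't exceed the old maximum.
    """
    kth = sorted(tf_counts, reverse=True)[k - 1]
    return a + (1 - a) * min(tf / kth, 1.0)

# One discipline-specific word (tf=40) dominates this pretend document.
counts = [40, 38, 7, 5, 3, 1]
print(smoothed_tf(7, counts, a=0.4, k=1))   # 0.4 + 0.6 * 7/40 = 0.505
print(smoothed_tf(7, counts, a=0.35, k=3))  # 0.35 + 0.65 * 7/7 = 1.0
```

With k=1 the ordinary word gets squashed to 0.505 by the dominant term; with k=3 and a lowered a it recovers its full weight.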
Muddiest Points: I had a lot of trouble understanding pivoted normalized document length. I'm just not quite putting together the idea of relevance probability as a function of document length. The graph made it look like the longer a document is, the more likely it is to be relevant, which doesn't sound right at all.
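My best guess at the mechanics, which may be wrong: plain cosine normalization divides by the document's length, and the relevance data suggests that penalizes long documents too much, so the normalizer is "tilted" around a pivot point. Here the pivot and slope values are placeholders I made up:

```python
def pivoted_norm(doc_norm, pivot, slope=0.75):
    """Pivoted normalizer: (1 - slope) * pivot + slope * doc_norm.

    doc_norm is the old normalizer (e.g. the cosine length norm) and
    pivot is typically its average over the collection.  With slope < 1,
    documents longer than the pivot are divided by less than before
    (penalized less), shorter ones by slightly more.  At the pivot
    itself nothing changes.
    """
    return (1.0 - slope) * pivot + slope * doc_norm

print(pivoted_norm(100, pivot=100))  # at the pivot: unchanged, 100.0
print(pivoted_norm(200, pivot=100))  # long doc: 175.0 instead of 200
print(pivoted_norm(50, pivot=100))   # short doc: 62.5 instead of 50
```

If that's right, then the graph isn't saying long documents are inherently better; it's saying the judged-relevant documents in the test collection skew longer than plain cosine normalization assumes, and the slope is tuned to correct for that.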