3.6 Finding Duplicate News Articles (6:08)

Next we shall look at the problem of finding news articles that represent the same story. Since the same story may be used by many different sources, each with its own way of presenting it on a web page, the problem is not too different from finding similar documents in general. But there is a special way of shingling that works well when the differences are mostly in the ads associated with the article. We shall also talk about a simple bucketing method that works when the number of sets is not too great and the similarity expected for the underlying articles is quite high. So, we turn to the third interesting variant of LSH with another true story. A while ago, a group of political
scientists at Stanford asked for help from the CS department going
through a large repository of news articles to identify those
that were essentially duplicates. The problem was that many of the articles really came from the same source, but they could look quite different when published on the websites of different news services. They wanted to group web pages
whose underlying articles were the same or similar. This, by the way, is the same problem
faced every day by services such as Google News although they do the grouping
day by day rather than once and for all. Identifying the same underlying
article is tricky because each news source creates a page from the article
that has unique elements on that page. For example, they will put the newspaper’s
name and other text elements on the top and there will be links
to ads on the page. It’s also common for
one site to include links to related or interesting stories on its own site. In addition, it is common for a site not to place the entire article on its pages, especially if the article is long. They will leave off
the paragraphs at the end or even delete other paragraphs that
the editor finds less relevant. Now, the CS team had not heard of
locality sensitive hashing or minhashing. However, they invented a form of shingling
that is probably better than the standard approach we covered for those webpages
that are of the type we just described. And they invented a simple substitute for
LSH that worked adequately well for the scale of problem they were looking at. They partitioned the pages into
groups of similar length and they only compared pages in
the same group or nearby groups. After they had done all this, we happened to be talking in the hall and
I mentioned minhashing and LSH. They implemented these algorithms on that
data, and they found that minhashing plus LSH was better as long as the
similarity threshold was less than 80%. Actually, that is consistent
with what is known. When you're looking for very high Jaccard similarities, like 80 or 90%, there are indeed more efficient algorithms, and we're going to cover these before we leave the topic. Interestingly, the first time
they implemented minhashing, they got it wrong and
decided that the method was terrible. But the problem was that I
forgot to remind them to do the minhashing row by row,
where you compute the hash value for each row number once and for
all rather than once for each column. Remember that the rows correspond to the
shingles and the columns to the web pages. Since their data was naturally stored
by columns, that is by web, web pages, they needed to sort their set of shingle
webpage pairs to organize them by row, that is, by shingle. Once they did that, they got the positive
results I mentioned on the previous slide. Before leaving this topic, let me tell
you about the way these guys shingled web pages containing news articles. The key observation was that they needed
to give more weight to the articles themselves than to the ads and
other elements surrounding the article. That is, they did not want to identify as similar two pages from the same newspaper that had the same ads and other elements but different underlying stories. Their trick was based on stop words,
the common little words that we need in order to construct sentences properly,
but that do not convey much meaning. Typically the set of stop words includes
things like and, the, to and so on. Usually when analyzing text we ignore
stop words because they don’t tell us anything about the subject matter. But here the key observation
was that the stop words tell us whether we’re looking at the news article
from which we want to take our shingles or ads or other peripheral matters from
which we do not want to take shingles. For example,
ordinary prose would say something like, I recommend that you buy Sudzo for
your laundry. The likely stop words are shown in orange. I, that, you, for, your. Very common words. But in an ad we would just find
something abbreviated like buy Sudzo, which has no stop word at all. So they defined a shingle to be a stop
word followed by the next two words in the sentence, stop words or not. Thus, in the sentence on the slide, one shingle would be "I recommend that," the next would be "that you buy," and so on. Notice that there are relatively few shingles, and it is not guaranteed that each word is part of even one shingle. The reason this notion of shingle makes
sense is that it biases the set of shingles for
a page in favor of the news article. That is, suppose for simplicity that
all pages are half news article and half ads,
if you count by number of characters. If we have a second page with,
with the same article, but different ads,
we find that most of the shingles for both pages come from the article,
because that’s where the stop words are. So these two pages have
almost the same shingles, and therefore have very high
Jaccard similarity. Now consider two pages with the same ads but different articles. These will have low Jaccard similarity because, again, most of their shingles come from the articles, and these shingles would be mostly different for the two articles.
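To make the stop-word shingling concrete, here is a minimal sketch in Python. The stop-word list, the function names, and the toy ad texts are my own inventions, not from the lecture; a real implementation would use a much fuller stop-word list.

```python
# Sketch of stop-word shingling: a shingle is a stop word plus the next two
# words, stop words or not. The stop-word list here is illustrative only.
STOP_WORDS = {"i", "that", "you", "for", "your", "a", "an", "the", "and", "to", "of", "in"}

def stop_word_shingles(text):
    """Return the set of stop-word shingles of a text."""
    words = [w.strip(".,!?") for w in text.lower().split()]
    return {" ".join(words[i:i + 3])
            for i, w in enumerate(words[:-2]) if w in STOP_WORDS}

def jaccard(a, b):
    """Jaccard similarity of two sets."""
    return len(a & b) / len(a | b)

sentence = "I recommend that you buy Sudzo for your laundry."
print(sorted(stop_word_shingles(sentence)))
# ['for your laundry', 'i recommend that', 'that you buy', 'you buy sudzo']

# Two invented pages: the same article, different terse ads with no stop words.
page1 = stop_word_shingles(sentence + " Buy Sudzo now. Big sale.")
page2 = stop_word_shingles(sentence + " Cheap flights today. Call now.")
print(round(jaccard(page1, page2), 2))  # fairly high, because the article dominates
```

With a realistic stop-word list the bias toward the article is even stronger, since ordinary prose is dense with stop words while ad copy is not.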

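The earlier point about doing minhashing row by row can also be sketched. In this hypothetical version, the (shingle, page) pairs are sorted so that each row's hash values are computed once and then applied to every column in that row; the hash family and all names are illustrative assumptions, not the course's code.

```python
import random
from itertools import groupby
from operator import itemgetter

def minhash_signatures(shingle_page_pairs, num_hashes=50, seed=1):
    """Minhash done row by row: sort the pairs by shingle (row), compute each
    row's hash values once, then update the signature of every page in that row."""
    rng = random.Random(seed)
    p = (1 << 31) - 1  # a convenient large prime
    coeffs = [(rng.randrange(1, p), rng.randrange(p)) for _ in range(num_hashes)]
    row_id = {s: i for i, s in enumerate(sorted({s for s, _ in shingle_page_pairs}))}
    sig = {page: [p] * num_hashes for _, page in shingle_page_pairs}
    for shingle, group in groupby(sorted(shingle_page_pairs), key=itemgetter(0)):
        r = row_id[shingle]
        hashes = [(a * r + b) % p for a, b in coeffs]  # once per row, not per column
        for _, page in group:
            col = sig[page]
            for k, h in enumerate(hashes):
                if h < col[k]:
                    col[k] = h
    return sig

# Toy data: pages A and B hold the same shingles, so their signatures agree.
pairs = [("s1", "A"), ("s2", "A"), ("s1", "B"), ("s2", "B"), ("s3", "C")]
sig = minhash_signatures(pairs)
```

The sort is exactly the step the team was missing: it turns column-ordered data into row order, so each hash is evaluated once per row instead of once per matrix entry.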

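Finally, the political scientists' simple substitute for LSH, comparing only pages of similar length, might look something like this; the bucket width and all names are guesses, since the lecture gives no implementation details.

```python
from collections import defaultdict

def candidate_pairs(pages, bucket_width=500):
    """Bucket pages by length; emit only pairs from the same or adjacent buckets."""
    buckets = defaultdict(list)
    for name, text in pages.items():
        buckets[len(text) // bucket_width].append(name)
    pairs = set()
    for b, members in buckets.items():
        # Pair each page with later pages in its bucket and all of the next bucket.
        nearby = members + buckets.get(b + 1, [])
        for i, p in enumerate(members):
            for q in nearby[i + 1:]:
                pairs.add((min(p, q), max(p, q)))
    return pairs

# Toy pages: the two short ones share a bucket; the long one is never compared.
pages = {"a": "x" * 100, "b": "x" * 140, "c": "x" * 3000}
print(candidate_pairs(pages))  # {('a', 'b')}
```

This works because two pages can only be near-duplicates if their lengths are close, which is exactly the regime of high similarity thresholds where the group found their method competitive.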