NULL in dimensions: Number of documents vs rank in FTS

Dear all,
I am interested in using FTS for scientific investigation.
Specifically, I need to know exactly how often certain keywords occur
in documents in my collection. I know how many documents contain the
keyword, but how do I find out how often the keyword occurs in these
documents? I know the ranking formula used the inverse document
frequency to penalize the number of occurrences for keywords common in
my document collection and also penalizes longer documents, so I cannot
just use the rank as given by sql server. However, I am sure that the
FT index somehow contains the frequency with which each word occurs in
each document since it needs that to calculate the rank in the first
place. Is there any way I can get this information?
[As an example, I would like to know how often the terms 'bush NEAR
president', 'kerry NEAR senator', and 'terrorism' occur in my
collection of NYTimes articles so I can calculate co-occurrence scores]
Thanks in advance,
Wouter van Atteveldt
Free University Amsterdam
You are looking for the hit count property. This feature is available in
other Microsoft Search products but not in SQL FTS. You can use an inverted
file index to get this value. Search the web for implementations of this in
tsql.
Here is one -
http://www.n3labs.com/pdf/putz91usin...rted-DBMS.html
The ranking formula actually "penalizes" the contribution to words or tokens
which you are searching on which occur frequently or rarely in your entire
document collection or the frequency that words occur compared to the entire
count of words indexed. Words which slightly more than avereage - as defined
by Zipf's law, have greater "resolving" power and are ranked higher.
The formula also normalizes for length - so its sort of like an occurence
density. So if you have a large document with the word occuring twice in a
10,000 word document, this document will be ranked lower than a 1,000 word
document where the word occurs twice.
"wouter" <wouter@.2at.nl> wrote in message
news:1108296276.416611.195050@.z14g2000cwz.googlegr oups.com...
> Dear all,
> I am interested in using FTS for scientific investigation.
> Specifically, I need to know exactly how often certain keywords occur
> in documents in my collection. I know how many documents contain the
> keyword, but how do I find out how often the keyword occurs in these
> documents? I know the ranking formula used the inverse document
> frequency to penalize the number of occurrences for keywords common in
> my document collection and also penalizes longer documents, so I cannot
> just use the rank as given by sql server. However, I am sure that the
> FT index somehow contains the frequency with which each word occurs in
> each document since it needs that to calculate the rank in the first
> place. Is there any way I can get this information?
> [As an example, I would like to know how often the terms 'bush NEAR
> president', 'kerry NEAR senator', and 'terrorism' occur in my
> collection of NYTimes articles so I can calculate co-occurrence scores]
> Thanks in advance,
> Wouter van Atteveldt
> Free University Amsterdam
>
|||Dear Hilary,
Thanks for your answer. The article on using the BTree index to build
my own index is very interesting and very tempting. However, my
intuition tells me that I must be reinventing the wheel if I start
writing procedures to build, update, and query such an index and
especially if I start writing procedures to query the index using
boolean phrases or proximity terms. I think it would be very
interesting and insightful to write myself but it I'm sure other people
have written it already and probably more efficiently...
So my question is: If it is true that I can't retrieve the inverse
document frequency of a boolean query in SQL server*, can anyone advise
me on a good alternative? Is there a Document Retrieval engine that
communicates well with databases? Does anyone think I should try using
the API on the search engine directly either from a stand-alone program
or from a SQL Server function (esp. with 2005 supporting .NET code) ?
Any help appreciated!
Wouter
* which would be a shame since it is obvious that SQL server or at
least the search service 'knows' this number and the rest of my program
is built around an SQL server database anyway. O well, who said life is
easy anyway :-)
Hilary Cotter wrote:
> You are looking for the hit count property. This feature is available
in
> other Microsoft Search products but not in SQL FTS. You can use an
inverted
> file index to get this value. Search the web for implementations of
this in
> tsql.
> Here is one -
> http://www.n3labs.com/pdf/putz91usin...rted-DBMS.html
> The ranking formula actually "penalizes" the contribution to words or
tokens
> which you are searching on which occur frequently or rarely in your
entire
> document collection or the frequency that words occur compared to the
entire
> count of words indexed. Words which slightly more than avereage - as
defined
> by Zipf's law, have greater "resolving" power and are ranked higher.
> The formula also normalizes for length - so its sort of like an
occurence
> density. So if you have a large document with the word occuring twice
in a
> 10,000 word document, this document will be ranked lower than a 1,000
word[vbcol=seagreen]
> document where the word occurs twice.
> "wouter" <wouter@.2at.nl> wrote in message
> news:1108296276.416611.195050@.z14g2000cwz.googlegr oups.com...
occur[vbcol=seagreen]
the[vbcol=seagreen]
these[vbcol=seagreen]
in[vbcol=seagreen]
cannot[vbcol=seagreen]
the[vbcol=seagreen]
in[vbcol=seagreen]
first[vbcol=seagreen]
scores][vbcol=seagreen]
|||Hit Count - the property you are looking for ships with other Microsoft
Search products. It may ship with future versions of SQL Server - currently
in 2005, this property is not exposed.
Other search engine vendors that interact with SQL Server (SQL Turbo,
DTSearch) might.
Hilary Cotter
Looking for a SQL Server replication book?
http://www.nwsu.com/0974973602.html
Looking for a FAQ on Indexing Services/SQL FTS
http://www.indexserverfaq.com
"wouter" <wouter@.2at.nl> wrote in message
news:1108398341.423014.22460@.o13g2000cwo.googlegro ups.com...
> Dear Hilary,
> Thanks for your answer. The article on using the BTree index to build
> my own index is very interesting and very tempting. However, my
> intuition tells me that I must be reinventing the wheel if I start
> writing procedures to build, update, and query such an index and
> especially if I start writing procedures to query the index using
> boolean phrases or proximity terms. I think it would be very
> interesting and insightful to write myself but it I'm sure other people
> have written it already and probably more efficiently...
> So my question is: If it is true that I can't retrieve the inverse
> document frequency of a boolean query in SQL server*, can anyone advise
> me on a good alternative? Is there a Document Retrieval engine that
> communicates well with databases? Does anyone think I should try using
> the API on the search engine directly either from a stand-alone program
> or from a SQL Server function (esp. with 2005 supporting .NET code) ?
> Any help appreciated!
> Wouter
>
> * which would be a shame since it is obvious that SQL server or at
> least the search service 'knows' this number and the rest of my program
> is built around an SQL server database anyway. O well, who said life is
> easy anyway :-)
>
> Hilary Cotter wrote:
> in
> inverted
> this in
> tokens
> entire
> entire
> defined
> occurence
> in a
> word
> occur
> the
> these
> in
> cannot
> the
> in
> first
> scores]
>

Wednesday, March 21, 2012

Number of documents vs rank in FTS

No comments:

Post a Comment

NULL in dimensions

Blog Archive

About Me