Interview With Google's Director of Research
Cialti writes "Salon has a very interesting article with Monika Henzinger, Google's Director of Research, about their search technology and where they're going with it."
Don't tell me how hard you work. Tell me how much you get done. -- James J. Ling
Re:[ot]Google's data structure? (Score:1)
use strict is REMed out.
Re:Voice activated search engine (Score:1)
He did read the article. He said "I'm not sure that's not why they were working with BMW." Note the double negative, hence he is sure that is why they were working with BMW.
you want choices? (Score:1)
nah!
bestbet at shootybangbang.com [shootybangbang.com] bounces you straight to the *best bet*
sometimes it is smart, sometimes it is stupid
Re:Voice activated search engine (Score:2)
(car cuts driver off)
"Fuck you, asshole!"
(computer beeps)
[25,945 results found.]
Re:Actual Questions for Ask Jeeves (Score:1)
Actual Questions for Ask Jeeves (Score:2)
Happy reading, and remember, you're looking at the end of the human race.
Yikes, Zephyr Interactive? (Score:1)
Re:Prepositions need love too (Score:2)
Deja (Score:2)
Google also does Mac searches! (Score:1)
But the Google/Mac logo isn't as cool as the Google/Linux logo -- it contains all the fruity colors that Apple has largely abandoned.
Re:[ot]Google's data structure? (Score:2)
There are various ways to speed this up by compressing the arrays, hash joins, etc., but the basic idea is the same.
Re:[ot]Google's data structure? (Score:3)
As far as I can tell from their paper [nec.com], Google manages its web crawls the same way. It partitions the data into "barrels" and indexes each separately. Once the indices are built, they aren't updated. They also extend the hit lists to include word position and some other attributes for each hit.
Re:Prepositions need love too (Score:1)
Re:Prepositions need love too (Score:2)
The problem is that you +'ed it too much. If you search for +"+but +that +the +dread" [google.com] you'll notice that it gives you some warnings. Google's ignoring all of the +'s you added, because you're using some of them incorrectly. ("dread" is not a stop word, for example)
Instead, try searching for "but +that +the dread" [google.com]. Then you'll get what you're looking for.
Re:Voice activated search engine (Score:2)
That's just great. Now the cell-phone dolts in the SUVs will be using Google *at the same time* to check on their facts, *while* they are driving...
--
Re:Disturbing Search Requests (Score:1)
That site is absolutely hilarious! Thank you for the link.
Masturbation Techniques (Score:5)
Google absolutely blows away the competition, however it is humorous seeing entries in my log file related to people looking for masturbation tips (from the beginner level "How To" style queries, to full blown searches for advanced techniques). The page [yafla.com] in question is entitled "Hey Jerk : Get Off My Computer!" (and relates to pop-up ad windows) and I'm, uh, proud to see that it ranks #2 for searches for "jerk off technique" (I've had dozens of related hits appearing). While it is humorous seeing searching going a little off-track, I am very curious how many consumers know that each link you follow passes on where you came from, so for instance I see log entries like
200x-xx-xx xx:xx:xx xxx.xxx.xxx.xxx GET /rants/jerk/index.htm 200 5986 334 270 Mozilla/4.0+(compatible;+MSIE+5.0;+Windows+98;+DigExt) http://google.yahoo.com/bin/query?p=jerk+off&b=21&hc=0&hs=5
-or-
200x-xx-xx xx:xx:xx xxx.xxx.xxx.xxx GET /rants/jerk/index.htm 200 5986 437 1292 Mozilla/4.0+(compatible;+MSIE+5.0;+Windows+98;+DigExt;+sureseeker.com) http://www.google.com/search?q=guys+who+jerk+off
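The referrer leak described above is easy to see from the server side. A minimal sketch, assuming the search phrase travels in a `q` or `p` query parameter as in the two referrers quoted in the log lines; the function name and the sample URL in the usage note are made up for illustration:

```python
from typing import Optional
from urllib.parse import urlsplit, parse_qs

# Parameter names taken from the two referrers in the log excerpt:
# "q" (www.google.com) and "p" (google.yahoo.com). Other engines may differ.
QUERY_PARAMS = ("q", "p")


def search_terms(referrer: str) -> Optional[str]:
    """Return the search phrase embedded in a referrer URL, or None."""
    params = parse_qs(urlsplit(referrer).query)
    for name in QUERY_PARAMS:
        if name in params:
            # parse_qs has already turned '+' into spaces.
            return params[name][0]
    return None
```

Calling `search_terms("http://www.google.com/search?q=popup+ad+windows")` hands back the phrase the visitor typed, which is exactly what any site owner can read out of their logs.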
Re:Masturbation Techniques (Score:2)
Unfortunately, OmniWeb's JavaScript support is lacking in other areas, but that feature is brilliant, and their text display is the cleanest I've ever seen in any program. Linux users should get MacOS X just to rest their bad font weary eyes
D
----
Re:why I like google (Score:2)
http://www.google.com/windows/
doesn't work. Great job!
D
----
Voice activated search engine (Score:2)
Even out of the scope of a car - this feature would be awesome if it were integrated with cable (or satellite) and the TV room
Get me Gilligan's Island ... Click
Re:Voice activated search engine (Score:3)
Re:Perks (Score:2)
All search engines spider ahead of time and store; to do otherwise would take forever to get you any search results ("It's a terrible strain on the animators' wrists." :) My impression from the article was not that they generate whole searches ahead of time, but that they categorize by the individual search words, and then when you type in a query they generate the intersection of the pages on their many word lists. Then one miracle occurs, and ...
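That "one list of pages per word, then intersect" idea can be sketched in a few lines. This is only an illustration of the general technique, not Google's code; the page IDs, sample text and function names are invented:

```python
from collections import defaultdict


def build_index(pages):
    """Map each word to the set of page IDs that contain it."""
    index = defaultdict(set)
    for page_id, text in pages.items():
        for word in text.lower().split():
            index[word].add(page_id)
    return index


def search(index, query):
    """Return the pages that contain every word of the query."""
    words = query.lower().split()
    if not words:
        return set()
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result


pages = {
    1: "hail to the chief",
    2: "hail storm damages cars",
    3: "the chief of police",
}
index = build_index(pages)
```

Ranking (the "one miracle occurs" step) would then be applied to whatever survives the intersection.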
Caution: contents may be quarrelsome and meticulous!
Re:Regex: won't happen (Score:2)
But in the case where they would implement my ability to submit a RegEx, I could give them lots of flex on the time in return for the exact one page that I want. How hard could it possibly be?
(dodging)
Re:Prepositions need love too (Score:3)
Dumb question (?) (Score:2)
Search 1,346,966,000 web pages
and this number doesn't change?"
Re:Actual Questions for Ask Jeeves (Score:1)
It is too bad they took that away.
Re:Prepositions need love too (Score:3)
And, actually, that's not quite right, either. It's apparently always going to blow off your "the" (I just tried it). This is, alas, a seriously hard problem. What you were doing was looking for what actually amounts to a single chunk of information: the title of a fanfare played for the president. Unfortunately, the English version of the title is four words long although the title itself might in some cases act just like a single word (or noun phrase). So:
Yes, you might even pluralize it just like a noun. So that's one problem right there: search terms that really are tantamount to a single lexical item might be four or more words long, and might even be inflected. Ideally, you'd like to index separately these multi-word chunks, especially if you can prove they occur way more often than expected. So in your example, "hail" and "chief" co-occur on about 28,000 pages, while "hail" alone is on 510,000 and "chief" alone is on over 1,500,000. If Google indexes 1.5 billion pages (or so), and the terms were independent, then you'd expect something like 500 co-occurrences, and 28,000 is so outrageously out of line you would know that something is up.
Now, I'm guessing that *local* co-occurrence information is likely to eventually prove even handier in this regard. So, for example, "hail to" comes up 157,000 times, which is about 1/3 of all "hail" pages. That's very unlikely unless there's something systematic (and very possibly exploitable) going on.
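The independence argument is plain arithmetic and can be checked directly. A small sketch using only the page counts quoted in this comment (the 1.5 billion total is the poster's own round figure, and the variable names are mine):

```python
total_pages = 1_500_000_000   # "1.5 billion pages (or so)"
hail_pages = 510_000          # pages containing "hail"
chief_pages = 1_500_000       # pages containing "chief"
observed_both = 28_000        # pages containing both words
hail_to_pages = 157_000       # pages containing the phrase "hail to"

# If the two words were independent, the expected number of pages
# containing both is N * P(hail) * P(chief).
expected_both = total_pages * (hail_pages / total_pages) * (chief_pages / total_pages)

# How far the observed count is above that expectation.
lift = observed_both / expected_both

# Share of "hail" pages where "hail" is followed by "to".
hail_to_share = hail_to_pages / hail_pages

print(f"expected co-occurrences: {expected_both:.0f}")
print(f"observed / expected:     {lift:.1f}")
print(f"'hail to' share:         {hail_to_share:.2f}")
```

With these inputs the expectation comes out at roughly five hundred pages, so the observed 28,000 is dozens of times larger than chance would suggest.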
The big problem is that you can't really do much with function words alone, since they're just too staggeringly frequent. In running English text, the frequency of "the" is just about 70,000 per million. In other words, 7% of all English text consists of the definite article, and most web pages contain many distinct copies. You've got to kill that. Unfortunately, by omitting "the", you lose a lot of potentially useful information about definiteness of the noun phrase. In the "hail to the chief" example, the song title itself is just one example of a (somewhat) productive expression "hail to [definite-NP]", which has a specific kind of meaning implied (interestingly, usually sarcastic or abusive). Picking up on this could be very useful.
So suppose I typed into deja "bush mass-mooning Gothenburg". I'll get 9 hits. That's nice, but google might want to do more, and provide additional examples of president (or candidate) Bush being derided in public. Or maybe give me pages that refer to the same incident being described as the Swedish version of "hail to the chief".
So there is no doubt that function words need love, but I'd argue for a love that seeks to understand them and their weird little contributions to meaning rather than just a way to make sure you can nail a song title exactly.
the technology behind google (Score:2)
--sean
Re:[ot]Google's data structure? (Score:3)
http://www-db.stanford.edu/~backrub/google.html [stanford.edu].
Also, try a lookup for a bloom filter [google.com], which google uses, I think. Most search engines work by inverting the index, and then merging the lists. Taking the intersection of all the keywords gives you the membership, then you apply ranking to the membership. Pretty simple concept. I don't know of any search engines that use a trie, or use any form of stemming.
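For anyone who has not met a Bloom filter: it is a bit array that can say "definitely not present" or "possibly present" without storing the items themselves. Whether Google uses one is only this poster's guess. The sketch below is a generic textbook version; the bit count, the number of hashes and the SHA-256 based hashing are arbitrary choices of mine:

```python
import hashlib


class BloomFilter:
    """Tiny Bloom filter: no false negatives, occasional false positives."""

    def __init__(self, size_bits=1024, num_hashes=3):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((size_bits + 7) // 8)

    def _positions(self, item):
        # Derive several bit positions by salting one hash function.
        for salt in range(self.num_hashes):
            digest = hashlib.sha256(f"{salt}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))


seen = BloomFilter()
seen.add("http://www.google.com/linux")
seen.add("http://www.google.com/bsd")
```

A crawler could use something like this to skip URLs it has very probably fetched already, at the cost of occasionally skipping one it has not.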
-js
Re:Is she hot or not? (Score:1)
See a female who ain't your mother,
run in circles, sweat and stutter.
From the article:
"...people like my husband would get crazy. He just wants to find pages that have his words."
Lesbian? Not. Competent? Hell, yes!
Re:Smarter Searches (Score:2)
German queries at fireball.de (Score:2)
MP3 of that talk (Score:3)
Re:Masturbation Techniques (Score:2)
Re:MP3 of that talk (Score:1)
chris
Re:Yeah Suckah! (Score:5)
an interesting side note: they found that when one of the linux boxes stops working, it's more cost effective to replace it than to fix the problem (hardware, at least). google throws out a lot of good hardware because of that. the lecture hall was begging for a student donation program of some sort when the google guy mentioned that
chris
Send messages to the staff! (Score:5)
"Help, I'm stuck in here!!" is an obvious classic to try. If enough of us do it, it might even get noticed...
"Intelligence is the ability to avoid doing work, yet getting the work done".
Re:Smarter Searches (Score:2)
Re:[ot]Google's data structure? (Score:2)
I am curious as to what kind of implementation they are using; Google's roots would suggest some hacked form of Berkeley DB with lots of performance improvements.
Oh, well, just some guesswork... if I am close, I am expecting a job offer by the way
Re:Send messages to the staff! (Score:2)
Before I read your post, I had the same idea. I just sent one that said "Sorry, am I DOSing the Google lobby scroller?" Then, after reading this post [slashdot.org], I did a search for "jerk off technique."
Hope those scroller babies don't log IPs. It would look like I was so bored (at work right now) that I decided to SPAM their scroller, which had somehow gotten me into some kind of masturbatory mood.
< tofuhead >
--
Gnut (Score:1)
Sing it brother (Score:1)
Re:Dumb question (?) (Score:1)
Re:Google is still sloppy and second-rate. (Score:1)
I remember when... (Score:1)
I switched from dogpile to google. It was the day that I read on /. that you could search for "more evil than satan" on google and the first hit was www.microsoft.com [microsoft.com]. That was a great day.
Re:Smarter Searches (Score:2)
Re:Smarter Searches (Score:2)
Thanks for this suggestion. Although it is a good example of interaction between the engine and the user, it seems to be based on a simple spelling check. Rather, I was thinking more in terms of what Monika Henzinger referred to as a topic based query. For example, typing 'bicycle' and receiving a choice of 'bicycle repair', 'bicycle racing', 'bicycle sales', 'bicycle parts', 'bicycle touring', etc...
Re:Smarter Searches (Score:2)
Smarter Searches (Score:4)
This is interesting. I wonder if there might be a way for the engine to have a two way back-and-forth "conversation" with the user. IOW, if the engine interprets the query to have several possible meanings, a few multiple choice questions might clarify the meaning and narrow the search parameters. I think this could be more helpful than doing a blind guess of the user's intention.
Re:Send messages to the staff! (Score:1)
-Ben
Re:I remember when... (Score:1)
--
phone book function (Score:2)
I first noticed this function when searching for information on the professional work of someone who I was going to be working with - and the #1 thing google spat up was his home address and phone number. I know I could have found this almost immediately if I went actively looking for it, but it was a bit creepy anyway. I guess the reason I'm disturbed is that it wouldn't have occurred to me to go looking for that information, but once it was thrust in my face like that, I could immediately think of reasons it might be handy to have it.. In the event, I didn't copy it down anywhere, but, well, I could think of people who wouldn't hesitate to call me at 3am if they had my home number..
Fortunately google seems willing to at least let you opt out - http://www.google.com/help/pbremoval.html - which is fine for people who know about google and its more esoteric functions, but ain't going to help Jane Shmoe when she starts wondering why so many more people seem to know where she lives and what her home number is - people who wouldn't necessarily have gone looking for the information (that would be rude..) but who don't mind having it when it's 'handed' to them.
Re:[ot]Google's data structure? (Score:1)
Re:Prepositions need love too (Score:3)
I had the same problem yesterday when I was searching for "quotes about Shakespeare". "to be or not to be" (with quotes) pulls up the proper category, but the first result it comes up with is the GNU homepage, because GNU's not Unix!. The second link is to Am I Hot or Not, BTW...
Strangely enough, it warns about "or", and if I want to use it in a search, it must be in CAPS, but then how do I search for something in ORegon? For some reason, it says nothing about "not", so I don't know what's up with their search terms anymore.
--
Search Query (Score:1)
isn't Google always getting itself in the news? (Score:1)
Re:Yahoo took a much bigger leap - it licensed Goo (Score:1)
Re:Prepositions need love too (Score:1)
Well, actually you don't. AltaVista indexes every word, including "the". This helps it do exact phrase queries. For instance, try searching for "The Who".
Prepositions need love too (Score:4)
This makes sense on a general level, but when you try searching for a phrase embedded in quotation marks, it's frustrating to have Google decide which parts of a literal string to search for and which to ignore. If I had wanted it to ignore parts of it, I wouldn't have indicated that it was a literal phrase, dangnabbit!
It is possible to include words that you typed in the search phrase, but you have to add an Altavista-style '+' before it.
For example, searching for: "Hail to the chief" would ignore to and the. In order to actually search for the phrase (which I indicated that I wanted to do by surrounding it in quotation marks), I would have to type "Hail +to +the chief". Hardly user-friendly.
Oh, well.
Re:[ot]Google's data structure? (Score:1)
I figure it's something derived from a B-Tree (like a binary tree - but better for databases), distributed on a cluster of boxens (linux, right?)
I'm sure there's a hell of a lot more to it than that. A hell of a lot more. Hell, let's ask him.
begin question
Hey google guy, how is the webpage index data stored and retrieved? What data structures and what algorithms are used? How many boxes do you have for indexing?
end question
maybe he'll answer.
-Jon
Re:why *I* like google (Score:1)
http://www.google.com/bsd [google.com]
-jon
Re:[ot]Google's data structure? (Score:1)
BigFiles
BigFiles are virtual files spanning multiple file systems and are addressable by 64 bit integers. The allocation among multiple file systems is handled automatically. The BigFiles package also handles allocation and deallocation of file descriptors, since the operating systems do not provide enough for our needs. BigFiles also support rudimentary compression options.
4.2.2 Repository
Figure 2. Repository Data Structure
The repository contains the full HTML of every web page. Each page is compressed using zlib (see RFC1950). The choice of compression technique is a tradeoff between speed and compression ratio. We chose zlib's speed over a significant improvement in compression offered by bzip. The compression rate of bzip was approximately 4 to 1 on the repository as compared to zlib's 3 to 1 compression. In the repository, the documents are stored one after the other and are prefixed by docID, length, and URL as can be seen in Figure 2. The repository requires no other data structures to be used in order to access it. This helps with data consistency and makes development much easier; we can rebuild all the other data structures from only the repository and a file which lists crawler errors.
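A rough sketch of what "documents stored one after the other, prefixed by docID, length, and URL" could look like in practice. The exact field widths and their order are my assumption (the excerpt does not give them), and the helper names are hypothetical; only the use of zlib comes from the text:

```python
import struct
import zlib

# Assumed header layout: docID (8 bytes), URL length (4), compressed length (4).
HEADER = struct.Struct("<QII")


def pack_record(doc_id, url, html):
    """Serialize one page as header + URL + zlib-compressed HTML."""
    url_bytes = url.encode("utf-8")
    body = zlib.compress(html.encode("utf-8"))
    return HEADER.pack(doc_id, len(url_bytes), len(body)) + url_bytes + body


def read_records(blob):
    """Walk the repository sequentially, yielding (docID, URL, HTML)."""
    offset = 0
    while offset < len(blob):
        doc_id, url_len, body_len = HEADER.unpack_from(blob, offset)
        offset += HEADER.size
        url = blob[offset:offset + url_len].decode("utf-8")
        offset += url_len
        html = zlib.decompress(blob[offset:offset + body_len]).decode("utf-8")
        offset += body_len
        yield doc_id, url, html


repository = (
    pack_record(1, "http://www.google.com/linux", "<html>penguins</html>")
    + pack_record(2, "http://www.google.com/bsd", "<html>daemons</html>")
)
```

Because every record carries its own lengths, the file can be scanned from start to end with no other index, which is the property the excerpt relies on for rebuilding everything else.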
Document Index
The document index keeps information about each document. It is a fixed width ISAM (Index sequential access mode) index, ordered by docID. The information stored in each entry includes the current document status, a pointer into the repository, a document checksum, and various statistics. If the document has been crawled, it also contains a pointer into a variable width file called docinfo which contains its URL and title. Otherwise the pointer points into the URLlist which contains just the URL. This design decision was driven by the desire to have a reasonably compact data structure, and the ability to fetch a record in one disk seek during a search.
Additionally, there is a file which is used to convert URLs into docIDs. It is a list of URL checksums with their corresponding docIDs and is sorted by checksum. In order to find the docID of a particular URL, the URL's checksum is computed and a binary search is performed on the checksums file to find its docID. URLs may be converted into docIDs in batch by doing a merge with this file. This is the technique the URLresolver uses to turn URLs into docIDs. This batch mode of update is crucial because otherwise we must perform one seek for every link which assuming one disk would take more than a month for our 322 million link dataset.
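The checksum lookup described here is a sorted table plus a binary search. A minimal sketch, assuming CRC-32 as the checksum (the excerpt does not say which checksum is used) and ignoring collisions; the function names are mine:

```python
import bisect
import zlib


def checksum(url):
    # CRC-32 is a stand-in; the real checksum function is not specified.
    return zlib.crc32(url.encode("utf-8"))


def build_table(url_to_docid):
    """Return (checksum, docID) pairs sorted by checksum."""
    return sorted((checksum(url), doc_id) for url, doc_id in url_to_docid.items())


def lookup(table, url):
    """Binary-search the sorted table for the URL's checksum."""
    target = checksum(url)
    # (target,) sorts before any (target, docID) pair, so this lands on
    # the first entry whose checksum is >= target.
    i = bisect.bisect_left(table, (target,))
    if i < len(table) and table[i][0] == target:
        return table[i][1]
    return None


table = build_table({
    "http://www.google.com/linux": 101,
    "http://www.google.com/bsd": 102,
    "http://www.google.com/mac": 103,
})
```

The batch mode mentioned in the text would sort the incoming URL checksums too and merge the two sorted lists in a single pass instead of seeking once per link.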
Lexicon
The lexicon has several different forms. One important change from earlier systems is that the lexicon can fit in memory for a reasonable price. In the current implementation we can keep the lexicon in memory on a machine with 256 MB of main memory. The current lexicon contains 14 million words (though some rare words were not added to the lexicon). It is implemented in two parts -- a list of the words (concatenated together but separated by nulls) and a hash table of pointers. For various functions, the list of words has some auxiliary information which is beyond the scope of this paper to explain fully.
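The two-part lexicon (null-separated word list plus a hash table of pointers) can be illustrated like this. It is a toy under my own assumptions: CRC-32 as the hash, a fixed bucket count, no duplicate words, and none of the auxiliary information the excerpt mentions:

```python
import zlib


def build_lexicon(words, num_buckets=16):
    """Return (word list blob, hash buckets of offsets into the blob)."""
    blob = bytearray()
    buckets = [[] for _ in range(num_buckets)]
    for word in words:
        encoded = word.encode("utf-8")
        buckets[zlib.crc32(encoded) % num_buckets].append(len(blob))
        blob += encoded + b"\x00"   # words are separated by nulls
    return bytes(blob), buckets


def word_at(blob, offset):
    """Read the null-terminated word starting at offset."""
    end = blob.index(b"\x00", offset)
    return blob[offset:end]


def find(blob, buckets, word):
    """Return the word's offset in the blob, or None if it is absent."""
    encoded = word.encode("utf-8")
    for offset in buckets[zlib.crc32(encoded) % len(buckets)]:
        if word_at(blob, offset) == encoded:
            return offset
    return None


blob, buckets = build_lexicon(["hail", "to", "the", "chief"])
```

The hash table stores only small integers; the words themselves live once in the contiguous blob, which is what keeps the structure compact enough for main memory.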
Hit Lists
A hit list corresponds to a list of occurrences of a particular word in a particular document including position, font, and capitalization information. Hit lists account for most of the space used in both the forward and the inverted indices. Because of this, it is important to represent them as efficiently as possible. We considered several alternatives for encoding position, font, and capitalization -- simple encoding (a triple of integers), a compact encoding (a hand optimized allocation of bits), and Huffman coding. In the end we chose a hand optimized compact encoding since it required far less space than the simple encoding and far less bit manipulation than Huffman coding. The details of the hits are shown in Figure 3.
Our compact encoding uses two bytes for every hit. There are two types of hits: fancy hits and plain hits. Fancy hits include hits occurring in a URL, title, anchor text, or meta tag. Plain hits include everything else. A plain hit consists of a capitalization bit, font size, and 12 bits of word position in a document (all positions higher than 4095 are labeled 4096). Font size is represented relative to the rest of the document using three bits (only 7 values are actually used because 111 is the flag that signals a fancy hit). A fancy hit consists of a capitalization bit, the font size set to 7 to indicate it is a fancy hit, 4 bits to encode the type of fancy hit, and 8 bits of position. For anchor hits, the 8 bits of position are split into 4 bits for position in anchor and 4 bits for a hash of the docID the anchor occurs in. This gives us some limited phrase searching as long as there are not that many anchors for a particular word. We expect to update the way that anchor hits are stored to allow for greater resolution in the position and docIDhash fields. We use font size relative to the rest of the document because when searching, you do not want to rank otherwise identical documents differently just because one of the documents is in a larger font.
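The two-byte hit format maps naturally onto bit shifts. The field widths below come from the excerpt; the order of the fields inside the 16 bits is my assumption, as is clamping out-of-range positions to the largest value that fits in the field (4095 for plain hits, 255 for fancy hits):

```python
FANCY_FONT = 0b111       # font-size value reserved as the fancy-hit flag
MAX_PLAIN_POS = 0xFFF    # 12 bits of position
MAX_FANCY_POS = 0xFF     # 8 bits of position


def encode_plain_hit(capitalized, font_size, position):
    """Pack 1 capitalization bit, 3 font bits and 12 position bits."""
    if not 0 <= font_size < FANCY_FONT:
        raise ValueError("plain hits use font sizes 0..6")
    if position < 0:
        raise ValueError("position must be non-negative")
    position = min(position, MAX_PLAIN_POS)
    return (int(capitalized) << 15) | (font_size << 12) | position


def encode_fancy_hit(capitalized, hit_type, position):
    """Pack 1 capitalization bit, the 111 flag, 4 type bits, 8 position bits."""
    if not 0 <= hit_type < 16:
        raise ValueError("hit type must fit in 4 bits")
    if position < 0:
        raise ValueError("position must be non-negative")
    position = min(position, MAX_FANCY_POS)
    return (int(capitalized) << 15) | (FANCY_FONT << 12) | (hit_type << 8) | position


def decode_hit(hit):
    """Return (kind, capitalized, font size or hit type, position)."""
    capitalized = bool(hit >> 15)
    font_size = (hit >> 12) & 0b111
    if font_size == FANCY_FONT:
        return ("fancy", capitalized, (hit >> 8) & 0xF, hit & 0xFF)
    return ("plain", capitalized, font_size, hit & 0xFFF)
```

Both encoders always return a value below 65536, so each hit really does fit in two bytes.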
The length of a hit list is stored before the hits themselves. To save space, the length of the hit list is combined with the wordID in the forward index and the docID in the inverted index. This limits it to 8 and 5 bits respectively (there are some tricks which allow 8 bits to be borrowed from the wordID). If the length is longer than would fit in that many bits, an escape code is used in those bits, and the next two bytes contain the actual length.
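The escape-code trick for long hit lists can be sketched as follows. I am assuming the escape code is the all-ones value of the length field and that the two extra bytes are little-endian; neither detail is given in the excerpt:

```python
import struct


def encode_length(length, field_bits=8):
    """Return (value for the small length field, extra bytes to append)."""
    escape = (1 << field_bits) - 1          # assumed escape code: all ones
    if length < escape:
        return length, b""
    if length > 0xFFFF:
        raise ValueError("length does not fit in two bytes")
    return escape, struct.pack("<H", length)


def decode_length(field, extra=b"", field_bits=8):
    """Inverse of encode_length."""
    escape = (1 << field_bits) - 1
    if field != escape:
        return field
    return struct.unpack("<H", extra[:2])[0]
```

Short lists, which are the common case, pay nothing extra; only lists that overflow the 8-bit (forward index) or 5-bit (inverted index) field spend two more bytes.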
Forward Index
The forward index is actually already partially sorted. It is stored in a number of barrels (we used 64). Each barrel holds a range of wordID's. If a document contains words that fall into a particular barrel, the docID is recorded into the barrel, followed by a list of wordID's with hitlists which correspond to those words. This scheme requires slightly more storage because of duplicated docIDs but the difference is very small for a reasonable number of buckets and saves considerable time and coding complexity in the final indexing phase done by the sorter. Furthermore, instead of storing actual wordID's, we store each wordID as a relative difference from the minimum wordID that falls into the barrel the wordID is in. This way, we can use just 24 bits for the wordID's in the unsorted barrels, leaving 8 bits for the hit list length.
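Storing each wordID relative to the barrel's minimum is what lets it share a 32-bit word with the hit-list length. A sketch under the assumption that the 24-bit delta sits in the high bits and the 8-bit length in the low bits (the excerpt gives the widths but not the arrangement); the names are mine:

```python
def pack_entry(word_id, barrel_min, hit_count):
    """Pack (wordID - barrel minimum) in 24 bits and the hit count in 8."""
    delta = word_id - barrel_min
    if not 0 <= delta < (1 << 24):
        raise ValueError("wordID falls outside this barrel's 24-bit range")
    if not 0 <= hit_count < (1 << 8):
        raise ValueError("hit count does not fit in 8 bits")
    return (delta << 8) | hit_count


def unpack_entry(entry, barrel_min):
    """Recover the absolute wordID and the hit count."""
    return (entry >> 8) + barrel_min, entry & 0xFF


entry = pack_entry(word_id=1_000_123, barrel_min=1_000_000, hit_count=7)
```

Here `entry` is a single integer below 2**32, and `unpack_entry(entry, 1_000_000)` gives back the original pair.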
Inverted Index
The inverted index consists of the same barrels as the forward index, except that they have been processed by the sorter. For every valid wordID, the lexicon contains a pointer into the barrel that wordID falls into. It points to a doclist of docID's together with their corresponding hit lists. This doclist represents all the occurrences of that word in all documents.
An important issue is in what order the docID's should appear in the doclist. One simple solution is to store them sorted by docID. This allows for quick merging of different doclists for multiple word queries. Another option is to store them sorted by a ranking of the occurrence of the word in each document. This makes answering one word queries trivial and makes it likely that the answers to multiple word queries are near the start. However, merging is much more difficult. Also, this makes development much more difficult in that a change to the ranking function requires a rebuild of the index. We chose a compromise between these options, keeping two sets of inverted barrels -- one set for hit lists which include title or anchor hits and another set for all hit lists. This way, we check the first set of barrels first and if there are not enough matches within those barrels we check the larger ones.
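Keeping doclists sorted by docID makes multi-word queries a linear merge, and the two sets of barrels turn into a "small index first, big index as fallback" loop. A simplified sketch with made-up docIDs; real doclists would carry hit lists as well, and the `wanted` threshold is my own parameter:

```python
from functools import reduce


def intersect(a, b):
    """Intersect two docID lists that are each sorted ascending."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out


def search(words, fancy_index, full_index, wanted=10):
    """Try the title/anchor barrels first, then fall back to the full ones."""
    hits = []
    for index in (fancy_index, full_index):
        doclists = [index.get(word, []) for word in words]
        hits = reduce(intersect, doclists) if doclists else []
        if len(hits) >= wanted:
            break
    return hits


fancy_index = {"hail": [1, 4], "chief": [4, 9]}
full_index = {"hail": [1, 2, 4, 7], "chief": [2, 4, 9]}
```

The merge never looks at an element twice, so its cost is proportional to the combined length of the lists rather than their product.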
Northernlight (Score:2)
Re:Much like McDonalds (Score:1)
Re:Smarter Searches (Score:2)
I believe it was Altavista that had (and may still have, though I don't see any sign of it) something along these lines - after a query, it would also present an option to narrow the query by selecting some other key words that appeared in some of the pages. If I recall correctly this was not on the main query results pages, but there was a link to it.
For the example someone posted earlier where he gets a lot of hits from people looking for masturbation tips, using that option would present you with several groupings of words - one group might include "masturbate" and other terms likely to be found on that sort of pages, another group might include "network," "security," and "adware." Each group and each word within a group had a checkbox that could be used to select additional words to use in limiting the search.
I suspect that this was dropped for load reasons, though I could be wrong - it may be that people just didn't use it and they decided it wasn't worth the hassle.
-- fencepost
Here is the real google info... (Score:5)
I think we all could use more understanding of the topic. A link to the paper that started it all here [nec.com].
Re:More on language translation... (Score:2)
> and then back to English again)
And that's the catch. Most documents are readable after they've been put through the blender once. But two passes through the blender result in garbage.
The Fish is quite good for the one-way trips that it was designed for. A round trip ticket through the Fish is usually deadly.
--
its been said (Score:1)
everyone should check out that if you change your preferences on google having to do with language, one of the languages is Bork Bork Bork!, or the Swedish Chef's language.
also, what happened to searching for 666, the first entry it spat up was microsoft?
zero
Re:Smarter Searches (Score:1)
Re:[ot]Google's data structure? (Score:1)
read the article (Score:2)
"That's a filtered version, except that the filter doesn't work well in other languages. So we had people here from BMW, and they told me that there were some German queries that got through that shouldn't have.
[Note to self: Curse on Google only in foreign tongues.]"
Re:read the article (Score:2)
Re:why *I* like google (Score:2)
Mac only searches.. and a cool Mac logo!
http://www.google.com/mac [google.com]
AND...
US Government searches... and a "cool" US logo?
http://www.google.com/unclesam [google.com]
why I like google (Score:3)
http://www.google.com/linux [google.com]
-gerbik
Re:why I like google (Score:1)
Go under http://www.google.com/linux [google.com].
Try searching for "news".
Guess what comes up #2?
Ryan Finley
How to improve the timeliness of searches? (Score:1)
My question is, what are you doing to improve the timeliness of searches? Often, there is a conservative bias as older sites have more links to them. As I watch the results from my site get integrated, it seems that your processing cycle is about a month, making google not the SE of choice to research recent news events. I may also add that this seems like a bigger imperative given the recent acquisition of deja/usenet.
Keep up the good work (and don't ever sell out baby, no matter what riches the VC put in front of your nose).
SatireWire: interview with Jeeves (Score:2)
--
mrBlond (I don't email from Malaysia)
Re:read the article (Score:1)
--
Re:Disturbing Search Requests (Score:2)
--
Re:Disturbing Search Requests (Score:2)
It kinda makes you want to start checking those referer logs, eh? I once found that someone was looking for 'priceless pissing'. No clue how they ended up on my site!
$ grep google /usr/apache/logs/referer_log
--
Re:new search engine (Score:1)
In fact they seem to be claiming that they built most of Google. It's a pity their own web-site looks so bad though. Here is an excerpt: We also built a sophisticated server system to run the show and organized the site's starting database
Re:Prepositions need love too (Score:2)
Google always searches for pages containing all the words in your query, so you do not need to use + in front of words. [details] The word "or" was ignored in your query -- for search results including one term or another, use capitalized "OR" between words.[details] The following words are very common and were not included in your search: to be to be. [details]
That seems so pointy-haired-bossish.
Re:Why Google is my favorite search engine (Score:1)
The criticisms being made here about how Google omits certain words apply equally to their newsgroup searches. Very annoying. The advanced groups search lets you search for an "exact phrase". Or so it says. It doesn't let you search that way at all. They have done a pretty good job so far with deja's data, however. I missed it all being out there. I look forward to their improvements over time.
It's good to know... (Score:1)
I love google.. it's fast, gives lots of results and the page isn't cluttered with dozens of banner ads like some other *cough* search-engine-portal-wannabes *cough*.
Maybe someday I'll get to use my networking skills on that server farm they've got going there... ahhhh a guy can dream eh?
Re:[ot]Google's data structure? (Score:1)
That would work ok, except that the process of updating the lists would be very expensive. Indexing every word on the internet would be trivial, but keeping the addresses for those words in sorted order would be extremely non-trivial.
Imagine the word 'test' for example. You gotta believe that 'test' is on about a hundred million web pages, with more being added each day. That's one hundred million sorted addresses, probably taking up more than 700,000 disk blocks (100,000,000 addresses * ~30 bytes per address / 4096 bytes per block). Every time you add a new page with the word 'test' (or take one away), you have to update the list. That's a lot of disk block rearranging. Now multiply this by all the words on the web and you can see what a huge amount of rewriting has to be done. I don't think linear address lists would cut it.
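Working that estimate through with the same rough inputs (a hundred million pages, about 30 bytes per address, 4096-byte blocks, all of which are the poster's guesses rather than measured values):

```python
import math

pages_with_word = 100_000_000   # guessed number of pages containing 'test'
address_bytes = 30              # guessed size of one stored address
block_bytes = 4096              # disk block size

# Whole addresses that fit in one block, then blocks needed for the list.
addresses_per_block = block_bytes // address_bytes
blocks_needed = math.ceil(pages_with_word / addresses_per_block)

print(f"{addresses_per_block} addresses per block")
print(f"{blocks_needed:,} blocks for one word's address list")
```

That is on the order of three gigabytes of sorted addresses for a single common word, which is why in-place updates look so costly.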
Now they could have some kind of funky indexing scheme for all the addresses. But it's still freakin expensive to update them all. The article mentioned they update every 28 days. Does this mean they stop everything every 28 days to update, or does it mean that it takes 28 days to do an update? Regardless, this could mean that Google is always 28 days out of date. Another search engine that beat this number could potentially compete by saying they are more up to date.
You have to imagine that as the internet grows larger, that this is going to get even more time consuming.
Yahoo took a much bigger leap - it licensed Google (Score:2)
http://news.cnet.com/news/0-1005-200-5561996.ht
What do you expect, a monolith? (Score:2)
Google OTOH, is developing new technology. Most of that development is incremental -things get better and better. Until we actually find an alien monolith to give us all our science, this is how most advancements happen.
Re: weird google pages (was "why *I* like google") (Score:2)
www.google.com/palm [google.com] - Looks to be made for monochrome PDA browsers
www.google.com/ie [google.com] - For Pocket IE maybe?
wishus
---
method for increasing hits (Score:2)
A friend of mine (web developer) says that he's created a way to increase the hit count among all the sites he creates. He uses a server-side Perl scripts to determine if the Google bot is hitting a page, and includes links to *all* of the sites' homepages that they are hosting. So if he includes this script on every page of every site he hosts, then every page links to every site.
Does this work? I mean, they include (in plain English) something like "Here are some of the other sites we, [our web design firm], created and host" along with a short blurb. It sounds like it would work, right?
new search engine (Score:2)
It really says 'To fullfill their needs, we built a brand new searcg engine for Google.....'
[flash alert]
Re:isn't Google always getting itself in the news? (Score:2)
Really. Google uses a patented ranking algorithm, described by Page and Brin (Stanford graduate students who founded Google) in a paper titled The PageRank Citation Ranking: Bringing Order to the Web (1998) [nec.com]. The algorithm does very well at recognizing relevant documents. Last I looked, other search engines used mostly sets of hand-tuned hacks which did not do as well. Has this changed? I'd appreciate some references, refereed if possible.
~
More on language translation... (Score:3)
Will be complete and on the front of the L it will be reliable to translation service (as the BabelFish is same) a yet positively is thin method to Altavista. It was special and when you from one language also translate in different one thing, you in child one silence comfort ended to this, (and the that time English back mac tayn Great Britain from again under translate again in a Korean):
-S
Regex: won't happen (Score:2)
[ot]Google's data structure? (Score:3)
Okay, this is so off-topic it's not even funny.
Anybody have an inkling of a clue of the data structure that Google uses (or probably uses) to store all its words? I was just thinking that maybe it was some sort of balanced binary tree with each node containing a word, two pointers to the next two words further down the tree, and the root of a linked list of all the pages that word is contained in? I know binary search trees are supposed to be fast, but I was wondering if that'd be good enough for something with probably hundreds of thousands of words?
I'm assuming they're not using some sort of sql LIKE "%searchword%", I can't imagine any kind of cluster that could speed that process up, although I don't really know all that much about the process or what the main benefits of clustering are.
Anyway, hugely sorry for the offtopic post, it's just something that's been on the brain lately...
Re:[ot]Google's data structure? (Score:5)
Search 1,346,966,000 web pages (Score:2)
I emailed Google about it and they gave me some crap about it being too difficult....
What the mess...
Re:read the article (Score:2)
Uh huh, and maybe you should have read the trailing ;) before replying.
Re:phone book function (Score:2)
As for not having your phone number/address on the internet... that's why the phone companies are required by law to allow you to de-list. Without the internet, it takes me all of 5 minutes to drive to my local library, where they have phone books from around the world for the taking. Oh yes, and the white pages here only list first initial anyway :)
Much like McDonalds (Score:2)
One Hundred Billion Served! Could become as common as that evil Castaway DVD commercial that's repeated at least 50 times a night on TV.
Yeah Suckah! (Score:2)