Michael Burrows, Jeffrey A. Dean

Document treadmilling system and method for updating documents in a document repository and recovering storage space from invalidated documents
A tokenspace repository stores documents as a sequence of tokens. The tokenspace repository, as well as the inverted index for the tokenspace repository, uses a data structure that has a first end and a second end and allows for insertions at the second end and deletions from the front end. A document in the tokenspace repository is updated by inserting the updated version into the repository at the second end and invalidating the earlier version. Invalidated documents are not deleted immediately; they are identified in a garbage collection list for later garbage collection. The tokenspace repository is treadmilled to shift invalidated documents to the front end, at which point they may be deleted and their storage space recovered.
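The abstract's update and recovery cycle can be illustrated with a small sketch. This is a toy model under stated assumptions, not the patented implementation: the class name `TreadmillRepository`, its methods, and the per-document version counter are all hypothetical, and "treadmilling" is modeled here as re-appending still-valid documents from the front end to the back end so that invalidated ones reach the front and can be dropped.

```python
from collections import deque


class TreadmillRepository:
    """Toy tokenspace: a queue of (doc_id, version, tokens) entries.

    Hypothetical illustration only; names and layout are assumptions.
    """

    def __init__(self):
        self._entries = deque()   # left = front end, right = back (second) end
        self._live = {}           # doc_id -> version number of the valid copy
        self._garbage = set()     # (doc_id, version) pairs awaiting collection

    def put(self, doc_id, text):
        """Insert a new or updated document at the back end.

        An earlier version is not deleted; it is only recorded in the
        garbage collection list.
        """
        old = self._live.get(doc_id)
        version = 0 if old is None else old + 1
        if old is not None:
            self._garbage.add((doc_id, old))
        self._entries.append((doc_id, version, text.split()))
        self._live[doc_id] = version

    def treadmill(self, steps):
        """Process up to `steps` entries from the front end.

        Valid entries are moved to the back end; invalidated entries are
        dropped. Returns the number of tokens whose space was recovered.
        """
        reclaimed = 0
        for _ in range(min(steps, len(self._entries))):
            doc_id, version, tokens = self._entries.popleft()
            if (doc_id, version) in self._garbage:
                self._garbage.discard((doc_id, version))
                reclaimed += len(tokens)
            else:
                self._entries.append((doc_id, version, tokens))
        return reclaimed

    def documents(self):
        """Return (doc_id, version) for every valid entry, front to back."""
        return [(d, v) for d, v, _ in self._entries
                if (d, v) not in self._garbage]


repo = TreadmillRepository()
repo.put("a", "one two three")
repo.put("b", "four five")
repo.put("b", "four five six")   # update: old copy of "b" becomes garbage
freed = repo.treadmill(2)        # moves "a" to the back, then drops old "b"
```

A double-ended queue is used only because it makes both ends cheap to touch; a real repository would also have to keep its inverted index consistent as documents move, which this sketch ignores.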

2 thoughts on “Michael Burrows, Jeffrey A. Dean”

  1. shinichi Post author

    How often is the Google search index refreshed?

    https://www.quora.com/How-often-is-the-Google-search-index-refreshed

    **

    Bill Slawski

    The Google search index, Caffeine, is refreshed on an ongoing basis, in an incremental manner that at least one Google patent refers to as treadmilling. That patent is “Document treadmilling system and method for updating documents in a document repository and recovering storage space from invalidated documents” at http://patft.uspto.gov/n… invented by Michael Burrows and Jeffrey A. Dean. The abstract for the patent describes this incremental process as follows:

    “A tokenspace repository stores documents as a sequence of tokens. The tokenspace repository, as well as the inverted index for the tokenspace repository, uses a data structure that has a first end and a second end and allows for insertions at the second end and deletions from the front end. A document in the tokenspace repository is updated by inserting the updated version into the repository at the second end and invalidating the earlier version. Invalidated documents are not deleted immediately; they are identified in a garbage collection list for later garbage collection. The tokenspace repository is treadmilled to shift invalidated documents to the front end, at which point they may be deleted and their storage space recovered.”

    **

    Fred Showker

    Continually. It never stops. It’s refreshing the index right now — as we speak.

    However, with billions of pages to visit, it may take some time before the spider reaches them all.

    Plus, the process is slower because Google “favors” pages that get the best returns on Google AdWords. So some of those pet sites may get indexed every day, or every hour, depending on content frequency.

    Also remember that even when Google has indexed a page, you may never see it in a search result. Some pages Google won’t show, and if you’re not in Google, you don’t exist.

    We’ve tested a number of pages and can identify those which are indexed, but never served by Google.

    We do not know the reason for your asking, but I’m assuming you asked because a page you are interested in never shows up.

    **

    Michal Illich

    Every page has its own frequency of updates.
    High quality news sites might be updated every few minutes into the “fresh” index.
    On the other hand, low rank and outdated pages might not get re-crawled for a month or even more.
