How Google Tracks Webpage Changes


Google May Not Index Webpage Changes

Many sites on the Web contain elements that change on a regular basis, from advertisements that differ every time a page shows, to widgets that contain constantly updated information, to blog and news homepages that show new posts and articles hourly, daily, or weekly. Ecommerce sites add and remove products regularly, and display updated specials and features on their homepages. Sites that include or focus upon user-generated content may change constantly.

Search engines use web crawling programs to discover new pages and sites, and to index content that changes on pages they already know about, by following links from one page to another. A Google patent granted last week explores the potential problem of a crawler coming across a page that has changed only slightly, such as a small change in content or a different advertisement being displayed, and deciding whether it should reindex that whole page because of the slight change.

The patent also describes the process behind how Google might check on webpage changes, comparing a newer version of a page to an older version, and associating the older content with the newer content. I’m reminded of a Yahoo patent I wrote about a few years ago, in a post titled A Yahoo Approach to Avoid Crawling Advertisements and Session Tracking Links.

In the Yahoo patent I described in that post, we were told that the search engine might crawl pages a minute or so after a first crawl to see if there was content or links that changed between the first and second crawls, which might help the search engine identify those sections or links as advertisements, or as URLs that might contain unique session tracking parameters for different visitors. Yahoo might not crawl the newer URLs in those links, which it might consider to be “transient.”
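As a rough illustration of that two-crawl comparison, here is a minimal Python sketch. The function name and the sample URLs are hypothetical, and treating any link that differs between the two crawls as “transient” is a simplifying assumption rather than the method the Yahoo patent spells out.

```python
def split_transient_links(first_crawl_links, second_crawl_links):
    """Compare links found on two crawls of the same page made a short time apart.

    Links present in both crawls are treated as stable. Links that appear in
    only one of the two crawls are treated as transient (an assumption made
    for this sketch).
    """
    first = set(first_crawl_links)
    second = set(second_crawl_links)
    stable = first & second       # seen both times
    transient = first ^ second    # seen only once
    return stable, transient


# Hypothetical link lists from two crawls about a minute apart.
first = ["/about", "/products", "/ad?session=abc123"]
second = ["/about", "/products", "/ad?session=xyz789"]
stable, transient = split_transient_links(first, second)
```

In this sketch only the stable links would be candidates for the crawl queue, while the session-specific ad URLs would be set aside.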

In the Google patent, we’re told that if the difference in ages between the older content and the newer content isn’t that great, the search engine may continue to display the old content rather than updating its index to include the newer webpage changes as well. After some additional crawls, if Google sees those webpage changes reach a certain age, it may then index the newer version of the page with the new content.

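Updating search engine document index based on calculated age of changed portions in a document
Invented by Joachim Kupke and Jeff Cox
Assigned to Google Inc.
US Patent 8,001,462
Granted August 16, 2011
Filed: January 30, 2009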

Abstract

A system receives a document that includes new content and aged content, and compares the document with a prior version of the document that includes the aged content but not the new content. The system also separates the new content and the aged content based on the comparison, determines ages associated with the new content and the aged content, and determines whether the ages of the new content and the aged content are greater than or equal to an age threshold.

The system further calculates a checksum of the document based on the aged content when the age of the aged content is greater than or equal to the age threshold, and the age of the new content is less than the age threshold, and stores the calculated checksum.
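The steps in the abstract can be sketched in a few lines of Python. This is only an illustration under stated assumptions: splitting a page into text blocks, using SHA-256 as the checksum, and the 24-hour threshold are all choices made for the example, and every function name here is hypothetical. The abstract itself only says that content is compared, separated, aged, and checksummed.

```python
import hashlib

HOUR = 3600  # seconds


def track_first_seen(first_seen, blocks, now):
    """Keep the time each block on the page was first observed.

    Blocks already known keep their original timestamp; new blocks get `now`.
    """
    return {block: first_seen.get(block, now) for block in blocks}


def aged_checksum(blocks, first_seen, now, age_threshold):
    """Checksum only the blocks whose age has reached the threshold."""
    aged = [b for b in blocks if now - first_seen[b] >= age_threshold]
    return hashlib.sha256("\n".join(aged).encode("utf-8")).hexdigest()


def crawl(history, blocks, now, age_threshold=24 * HOUR):
    """One crawl: update block ages, then checksum the aged content."""
    history = track_first_seen(history, blocks, now)
    return history, aged_checksum(blocks, history, now, age_threshold)


# Hypothetical crawl sequence for a single page.
history, _ = crawl({}, ["Article text", "Ad: shoes"], now=0)
history, sum_1 = crawl(history, ["Article text", "Ad: hats"], now=48 * HOUR)
history, sum_2 = crawl(
    history, ["Article text", "New paragraph", "Ad: boots"], now=49 * HOUR
)
history, sum_3 = crawl(
    history, ["Article text", "New paragraph", "Ad: scarves"], now=80 * HOUR
)
```

In this sequence `sum_1` and `sum_2` match, because the rotating ads and the brand-new paragraph are all younger than the threshold, so nothing would signal a need to reindex. By the last crawl the new paragraph has been on the page for 31 hours, so it counts as aged content and `sum_3` differs.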

The focus of this patent seems aimed at keeping the search engine from reindexing pages after recrawling them when it finds only changes such as new advertisements being displayed or updated lists of related links. It makes sense for a search engine not to reindex the content of a page too quickly after those types of changes, since doing so could result in reindexing many pages where there really hasn’t been any substantive change.

I suspect that this process acts to throttle how quickly a search engine might update its index when it discovers new content on pages, regardless of whether those changes are slight changes to the content of a page or even possibly the posting of a new blog post. Since many pages on the Web do have components that might show new content every time they are crawled, allowing a certain amount of time to pass before reindexing the content of a page might make sense, especially when the age difference between the older content and the newer content isn’t that great.

If the new content is still present on a page after a certain passage of time (minutes, hours, or possibly even days), the page might then be indexed with the new content. The amount of time that a search engine may allow to pass before it will index changes might be based upon a historic view of how frequently some sites make changes to their pages.
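To make the idea of a history-based waiting period concrete, here is a small Python sketch. The formula (half of the median gap between past changes, with a one-minute floor) is purely an assumption for illustration; nothing above says how such a threshold would actually be derived, and the function and parameter names are hypothetical.

```python
from statistics import median


def age_threshold_from_history(change_times, fraction=0.5, floor=60.0):
    """Derive a waiting period (in seconds) from past change timestamps.

    Assumption for this sketch: wait a fixed fraction of the median gap
    between observed changes, but never less than `floor` seconds.
    """
    if len(change_times) < 2:
        return floor
    ordered = sorted(change_times)
    gaps = [later - earlier for earlier, later in zip(ordered, ordered[1:])]
    return max(floor, fraction * median(gaps))


# Hypothetical change histories, as seconds since the first observation.
hourly_news = [0, 3600, 7200, 10800]
weekly_shop = [0, 604800, 1209600]

news_threshold = age_threshold_from_history(hourly_news)
shop_threshold = age_threshold_from_history(weekly_shop)
```

Under these assumptions, a page that changes hourly would get a much shorter waiting period than a page that changes weekly.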

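I was also reminded of Google’s patent, Document Scoring Based on Document Inception Date, while reading this Computed Age patent. In that patent, we’re told that for some queries newer documents may be preferred, while for other queries older documents may be better results, and that the age of a document may be included in the ranking score for that document.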

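The Document Inception Date patent describes how an update frequency score and an update amount score can play a role in determining a score for the age associated with a document. The update frequency (UF) score may look at the number of changes made to a page over a period of time, while the update amount (UA) score may look at what those changes are. The patent tells us more about how a UA score might be calculated for webpage changes: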

UA may also be determined as a function of one or more factors, such as the number of “new” or unique pages associated with a document over a period of time. Another factor might include the ratio of the number of new or unique pages associated with a document over a period of time versus the total number of pages associated with that document. Yet another factor may include the amount that the document is updated over one or more periods of time (e.g., n% of a document’s visible content may change over a period t (e.g., last m months)), which might be an average value. A further factor might include the amount that the document (or page) has changed in one or more periods of time (e.g., within the last x days).

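According to one exemplary implementation, UA may be determined as a function of differently weighted portions of document content. For instance, content deemed to be unimportant if updated/changed, such as Javascript, comments, advertisements, navigational elements, boilerplate material, or date/time tags, may be given relatively little weight or even ignored altogether when determining UA. On the other hand, content deemed to be important if updated/changed (e.g., more often, more recently, more extensively, etc.), such as the title or anchor text associated with the forward links, could be given more weight than changes to other content when determining UA.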
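A weighted update amount along the lines of that passage could look something like the following Python sketch. The section names, the specific weights, and the weighted-average formula are all assumptions made for illustration; the patent text quoted above names the kinds of content but gives no numbers.

```python
# Hypothetical weights: important sections count more, ads are ignored.
SECTION_WEIGHTS = {
    "title": 3.0,
    "anchor_text": 2.0,
    "body": 1.0,
    "navigation": 0.1,
    "boilerplate": 0.1,
    "advertisement": 0.0,
}


def update_amount(changed_fraction_by_section, weights=SECTION_WEIGHTS):
    """Weighted average of how much each section of a page changed.

    `changed_fraction_by_section` maps a section name to the fraction of that
    section that changed (0.0 to 1.0). Unknown sections get a weight of 1.0.
    """
    total_weight = 0.0
    weighted_change = 0.0
    for section, fraction in changed_fraction_by_section.items():
        weight = weights.get(section, 1.0)
        total_weight += weight
        weighted_change += weight * fraction
    return weighted_change / total_weight if total_weight else 0.0


# Only the ad rotated: the score stays at zero.
ads_only = update_amount({"title": 0.0, "body": 0.0, "advertisement": 1.0})

# The title was rewritten: the score is high even though the body is unchanged.
title_change = update_amount({"title": 1.0, "body": 0.0, "advertisement": 0.0})
```

With these made-up weights, a page whose only change is a rotated ad scores 0.0, while a rewritten title alone scores 0.75.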

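The Document Inception Date patent focuses primarily on how changes to a page might affect that page’s rankings, where the freshness or age of a document might have a positive impact upon the ranking of that page. What’s interesting is the depth of the types of page changes it might consider, and how some changes might be of less concern than others. The Computed Age patent focuses more upon when a search engine might incorporate changes to a page into its index, and seems to perform much less analysis of the types of changes, which might make for faster decisions during the crawling of pages.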

Conclusion

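The Yahoo patent I mentioned above tells us how pages might be recrawled after a minute or so to see if any of the links listed have changed, to determine whether those links are “transient” links, or links that change on a regular basis, such as advertisements. Since those links would change every time the page is crawled, it might not make sense for Yahoo to add those URLs to its list of URLs to be crawled, since they are likely advertisements or links with session tracking IDs.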

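The goal of this Google patent seems to be very similar: to identify content that changes quickly, and webpage changes that don’t stick around for any length of time, as a way of identifying advertising or other content that isn’t related to the actual content of the page.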

Action Items

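Knowing that Google might be doing something like this could be helpful for a site such as an ecommerce store that wants to show specials or links to featured content on its pages. If those links are displayed and changed at random every time a search spider comes along, they might be ignored. If they change once a day or once a week, they’re less likely to be ignored. It’s worth checking whether Google is indexing webpage changes like those, rather than assuming they will be indexed automatically.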

Last updated June 23, 2019.


14 thoughts on “How Google Tracks Webpage Changes”

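  1. Hi Bill,

    Crawl frequency is interesting. I suppose it’s the product of a number of factors: storing large amounts of data, transmitting that data, sifting through it for relevance to find the right conditions for the right sites, and making sure it stays relevant to, and consistent with, the changes on those sites.

    Getting crawl frequency right and prioritizing the types of changes is essential to making sure that the right pages are indexed for the right search queries. I’ve tried to read the patent without interruption, but I’m definitely struggling on the uninterrupted part! The piece that seemed to jump out at me was the part about visible content: “n% of a document’s visible content may change over a period t”.

    As soon as I get a chance I’ll do my best to digest it. It’s always good when a patent for the Google search core product can add some understanding to how they might go about their bread & butter activity.

    nb: I like the little SEO by the Sea image above your picture, but it still doesn’t show up when I try to share on LinkedIn.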

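  2. Hi Tom,

    The way I read patents is to copy them into a text file, and then start deleting the parts that really don’t add anything, until I have something smaller and easier to understand. If I’m interrupted that way, it’s not as much of a loss.

    Crawl frequency is definitely one of those things that’s worth spending some time learning more about. One of the best starting points, if you haven’t seen it, is this paper, which likely describes some of the approaches Google was built upon when it started out:

    Efficient Crawling Through URL Ordering

    The little image above the photo is actually all text and CSS, rather than an image.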

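  3. Has anyone experienced rank drops when changing CMS and redirecting (301) from the old CMS to the new one on the same domain, usually with new URLs as well? Any advice on what to watch out for most would be much appreciated (such as the number of followed links on a page?)
    I’m pretty sure too many 301s will leak juice and lower rankings. The more obstacles Googlebot runs into, the worse your rankings will be. Does anyone agree?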

  4. Interesting. It seems to me this also relates to how Google values links (not just indexing). Just as they “devalue” (I think) footer links… I suppose if they can determine that links are not “core” to the page but some supplemental items, they could devalue the links in various ways? It seems to me Google gets better and better at figuring out how to programmatically do what I can do when I look at a page and evaluate it: those links there are not really that important, these ones here are very related to the topic of the page…

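  5. “...can look at a page and evaluate well those links there are not really that important.”

    It also depends on the country you’re in. In the US or UK, Google has to turn on more filters (like the one you mentioned), whereas in Australia or New Zealand there’s hardly anyone to filter, so footer links (and even black hat tricks like white font on a white background) can still work, until one day AU competition increases, Google raises its filter “criteria” count and… BAM! All dodgy & “India” SEO websites vanish from the first page, or even from Google’s index entirely. I call it the Panda slap =)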

  6. Hi Ron,

    I think it’s pretty common to see some drops in rankings when moving from one version of URLs to another, including some new URLs. Google doesn’t necessarily capture and follow 301 redirects at the time when it first sees them, but might schedule them for a later crawl, anywhere from days to weeks later. Google might also have to recalculate PageRank for your pages a number of times as it incrementally captures information about your new URLs for old pages and your URLs that are just new.

    It’s not a bad idea to do a little extra to attract some new URLs when you make a change like this, and to change the URLs on any other pages that you have some control over (like telecom and directory links and links from other sites that you may also have control over). It also really helps to make sure that you change all internal links to the new URLs instead of relying upon the redirects.

    An older post, but one that I think is still valid: Web Decay and Broken Links Can Be Bad for Your Site. At an SES conference, I asked a very similar question of a group of search engineers at a “Meet the Crawlers” session, and they noted that while a growing number of broken internal links and redirects might not be a strong signal, it was one they were looking at. If you also chain multiple internal redirects together, search engines will stop following them altogether once the chain starts to exceed two links.

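  7. Hi Ron,

    Interesting thought about Google possibly not applying as many filters in different locales. It sounds like a possibility, though not one I would want to rely upon too heavily. Google could turn one of those filters on overnight.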

  8. Very interesting to see that Google may be changing the way they cache pages. I very often refer to analytics and tweak the on-page content depending on a site’s rankings and landing terms. It will be interesting to see if minor changes like that get through the new cache system. If not, these minor changes will have to become more major.

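  9. This is an interesting change as part of Google’s house cleaning with backlinks, indexing, and so on. It seems very hard to keep up with all of the changes. I suppose that as a site owner it’s very important to maintain internal and external links, check for broken links, check for updated content, and so on. There’s a lot of valuable information on your blog. Thanks for being an SEO expert.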

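  10. Hi Jon,

    I’m not sure that this changes the way that Google caches pages, but it definitely shows that they are thinking about how to do things more economically, and have developed processes to help them decide when to update their index.

    I’ve been paying more attention to how and when they update certain information using Google’s “show last 24 hours” results, and I do sometimes see that they will add a new blog post to their index without updating the result for my homepage to show the new post there as well.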

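  11. Hi Julie,

    Thanks. One of the things I like about looking at patents is that sometimes they describe things that we’ve been seeing for a while, but haven’t had the vocabulary to discuss, or any idea of some of the processes behind them.

    This patent answered a question for me as to why sometimes, when we make small changes to a webpage, it might take a little longer than I would expect for them to get into Google’s index.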

Comments are closed.