11 Mar 2008

Avoiding duplicate content: a Google issue – Part 2


If you didn't have the chance to read my previous article about my concerns regarding duplicate content, please read it now. As you know, I am very worried (to the point of being paranoid) about duplicate content on my blog, since search engines like Google and Yahoo tend to give blogs with this issue a very poor PageRank, not to mention that the more duplicate content a blog has, the more poorly it gets indexed. This means that your beloved blog will be shown on page 12,549 of the search results. Since I want exactly the opposite, namely to be shown on the first pages of a search engine's results, you can guess how sensitive I am about this issue.

So, today I want to talk about another way (or trick) to avoid duplicate content, and it has to do with all those blogs that run on PHP. Honestly speaking, I think 99.99% of blogs are written in PHP nowadays, so this issue may well be YOUR issue too. But let's start from the beginning. Have you ever checked how Google indexed your blog's pages? No? Well, do it right away. Go to Google and type: site:www.yourblogname.com. Google will show you all the pages it has managed to index from your blog, including those that are duplicate content. In some cases duplicate pages are quite easy to spot, since they have a well-known web address with something attached to it that is completely foreign to you. One day, while browsing my indexed pages, I stumbled upon something that left me at a loss. As a matter of fact, I saw lots of pages with addresses like this: http://www.webtlk.com/…/3022334?PHPSESSID=1dcc03b2a4b. My first reaction was to say: what the heck is this? My second reaction was to browse the web looking for more information. This is what I found out.

You have to know that the above-mentioned address, with its embedded "?PHPSESSID=" code, carries a session ID that PHP appends when a visitor comes to your blog with cookies disabled in their browser. This is not an issue in itself, but it becomes a big problem when search engine spiders start crawling your nice blog with their cookies disabled as well. Those spiders will see your URLs with the session ID included and will start indexing them right away. These "bad addresses" point to pages that already have a good address, which means that a single page ends up with two or more URLs. That's how duplicate content is created. And that's the beginning of the end for your blog. Most of the time you won't even know about it; you will only realize something is wrong when your blog barely manages 20 visitors a day. Luckily for you, the solution is really easy and at hand. All you have to do is add these two lines to your .htaccess file (or, in php.ini, set the same two directives without the php_value prefix, e.g. session.use_only_cookies = 1):

# Store the session ID in a cookie only, and never rewrite it into URLs
php_value session.use_only_cookies 1
php_value session.use_trans_sid 0
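
If your host doesn't let you edit .htaccess or php.ini, the same two settings can usually be applied at runtime. The snippet below is just a minimal sketch of that idea (mine, not part of the original fix); the calls must run before session_start() on every page, for example in a shared include file:

<?php
// Minimal sketch: apply the same fix at runtime, before the session starts.
ini_set('session.use_only_cookies', 1); // accept session IDs from cookies only
ini_set('session.use_trans_sid', 0);    // never append PHPSESSID to links
session_start();
?>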

Either way, these settings tell PHP to store the session ID in a cookie only. If the browser does not have cookies enabled (like Googlebot), the session ID simply won't be appended to the URL anymore. This also means that any functionality relying on sessions (such as session-based logins) will stop working for visitors who have cookies disabled, so keep that in mind. Be extremely aware of duplicate content on your site, since it may get the blog penalized. A more effective way to watch your blog's indexed pages is to use the excellent Google Webmaster Tools, which gives you a whole set of tools to see how many pages Google has indexed, which ones are not indexed or not reachable, how many URLs your blog has, and whether your robots.txt is doing a good job or not.
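
If Google has already indexed some of your PHPSESSID addresses, you can also help it consolidate them onto the clean URLs. The snippet below is a hypothetical complement of my own (it is not part of the fix above): dropped at the very top of a page, it sends a permanent redirect to the clean address whenever a request still carries a PHPSESSID. It assumes PHPSESSID is the only query parameter, which is usually the case on a simple blog.

<?php
// Hypothetical complement: 301-redirect requests that still carry a
// PHPSESSID so search engines drop the duplicate URLs over time.
// Caution: this discards the whole query string, so it assumes
// PHPSESSID is the only parameter on the URL.
if (isset($_GET['PHPSESSID'])) {
    $clean = strtok($_SERVER['REQUEST_URI'], '?'); // path without the query string
    header('HTTP/1.1 301 Moved Permanently');
    header('Location: http://' . $_SERVER['HTTP_HOST'] . $clean);
    exit;
}
?>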



