16 Feb 2008




So, my quest for a perfect website continues, and one of the things I have discovered is an issue concerning duplicate content. It all started when, among the Google results for my website, I began seeing Web Talk links ending with weird strings such as ?wpcf7=json or ?wpcf7=json&wpcf7=json. There are even longer ones, all with the ?wpcf7=json code nested over and over. At that point I started looking for some information about this and, to tell you the truth, I didn’t find anything relevant apart from the good article you can see here. One thing is for sure: when Google’s bots crawl your website and discover duplicate content or pages with similar addresses, Google will penalise you sooner or later. But where does that string come from? Apparently from a WordPress plugin named Contact Form 7. This good plugin (I still have it) was using the code only during the AJAX submission (POST) process, but that was enough to generate the problem. The issue has now been resolved by the plugin’s author, Miyoshi, who states:

“Dexter and some people told me that there seems to be an SEO issue in Contact Form 7. The ‘?wpcf7=json’ code is used by Contact Form 7 only in the AJAX submitting (POST) process. I wonder why Google indexed such URLs even now. Anyway, I worked around the issue. Now Contact Form 7 doesn’t use ‘?wpcf7=json’, so I believe that kind of problem is fixed. But Google’s existing indexes are still there, I can’t do anything for that.”

So, the problem (as far as search engines are concerned) is still there, and Miyoshi can’t do anything about it. How can we solve the issue, then? Over the last two weeks, after a boost thanks to the All in One SEO Pack plugin, I have been experiencing a dramatic drop in the number of people visiting my blog, while at the same time more than 100 duplicated Web Talk pages have shown up in the search engine results. Am I being penalised by Google? I think so. More than 50 people have stopped visiting my website in the last few weeks. That is enough to worry me, especially for a blog like mine that gets around 150 visitors per day. I know that some fluctuation is pretty normal over the lifespan of a website, but I don’t want to live with the doubt that this weird string is somehow keeping my blog from taking off.

After reading Dexter’s comments and writing to him myself, I decided to apply a few robots.txt tricks to my website. As you may know, the robots.txt file sits in the root directory of your blog (you can reach it via FTP) and determines the way search engines and other bots crawl your blog. You can tweak a lot in here and literally force bots to behave in certain ways. Let’s see together how to compile this file.

When you open robots.txt for the first time, you will see something like this:

# BEGIN XML-SITEMAP-PLUGIN
Sitemap: http://www.mywebsitename.com/sitemap.xml.gz
# END XML-SITEMAP-PLUGIN

If you don’t use any sitemap file or plugin, the web address in the middle won’t be there. OK, now let’s compile the file, starting by selecting the bot we want to manage:

User-Agent: [Spider or Bot name]

If you don’t want to select a particular bot but want to include all of them, use this:

User-Agent: *
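
For instance, if you wanted to address only Google’s crawler rather than every bot out there, you could name it directly (Googlebot is the user-agent Google uses for web search):

User-Agent: Googlebot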

Now, let’s tell it what to do:

Disallow: [Directory or File Name or website address]

Disallow tells the bots (Google’s, for example) not to crawl a particular directory, file name, or web address on your blog.

Let’s see in which way we can use the Disallow command:

Disallow: /newsection/ This will exclude a whole section of your blog from being crawled.

Disallow: /private_file.html This will exclude the page private_file.html (and any URL beginning with that path) from being crawled.

Disallow: /*.gif$ This will exclude all files of a specific type (for example, .gif) from being crawled.

Disallow: /*? This will exclude any URL that includes a ? (more specifically, any URL that begins with your domain name, followed by any string, followed by a question mark, followed by any string).

Disallow: /?* This will exclude any URL where the ? comes right after your domain name (more specifically, any URL that begins with your domain name, followed by a question mark, followed by any string).
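
To put these pieces together, here is a sketch of what a complete record could look like. The bot name and the paths are only made-up examples: this would keep Google’s crawler out of a hypothetical /drafts/ section and away from .pdf files, while every other bot could still crawl them.

User-Agent: Googlebot
Disallow: /drafts/
Disallow: /*.pdf$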

In order to get rid of the duplicate content, this is how I compiled my robots.txt:

# BEGIN XML-SITEMAP-PLUGIN
Sitemap: http://www.webtlk.com/sitemap.xml.gz

User-agent: *
Disallow: /?wpcf7=json*
Disallow: /*?wpcf7=json

# END XML-SITEMAP-PLUGIN

The results should become visible within 3-4 weeks. I will keep you informed.





9 Comments to “Avoiding duplicate content: a Google issue”

  1. Dexter | Techathand.net Says:

    Thanks for the link, my friend. I do hope the trick will also work for you, like it did at my site. The json URLs will not be removed that easily, but Google will give them a low ranking once you implement the robots.txt.

  2. Web Talk Says:

    You are welcome, Dexter. You gave me some good tips about something I had no idea about!

  3. Duplicated Contents Problem, Be Careful! Says:

    [...] called: ?wpcf7=json which is exactly the same as the cnreviews homepage. According to WebTalk, this is a problem created by a WordPress plugin called “Contact Form 7” which we have [...]

  4. Wild Card Zheng Jie Eliminated from Wimbledon | Travel - CN Reviews Says:

    [...] called: cnreviews.com?wpcf7=json which is exactly the same as the cnreviews homepage. According to WebTalk, this is a problem created by a WordPress plugin called “Contact Form 7” which we have [...]

  5. Balisugar Says:

    Hi… I’m still having duplicate content and cannot find the answer. For example:

    http://www.balisugar.com/page/5/
    http://www.balisugar.com/page/5/?nggpage=3&pageid=714
    How can I fix this? Please have a look at my site!

    Plugins I have:
    Google XML Sitemaps
    Homepage Excerpts
    Permalink Redirect
    Platinum SEO Pack
    Robots Meta
    Smart Trailing Slash
    Wordpress Duplicate Content Cure
    WP Page Numbers
    Of course I have a robots.txt file.

    It looks like I need another way to fix the problem. Do I need a .htaccess redirect for that? Or maybe a WordPress hack? I’m getting very tired of trying to find the solution. Please help me if you can! Thanks

  6. Web Talk Says:

    From what I have gathered, it looks like you have or had a plugin called NextGEN Gallery (nggpage). I guess you have already uninstalled it, but maybe its links are still around on the internet. Try this in your robots.txt:

    Disallow: /?nggpage=*
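
    If that alone doesn’t catch everything, you could also mirror the two wpcf7 rules I used above, with something along these lines (just a sketch, so adjust the pattern to your own URLs):

    User-agent: *
    Disallow: /?nggpage=*
    # this second line should also catch the string when it comes after a path such as /page/5/
    Disallow: /*?nggpage=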

    Take into consideration that the result of this code in your robots.txt won’t take effect overnight; you have to wait around a month. Monitor the whole thing by using Google Webmaster Tools -> Tools -> Analyze robots.txt. This useful tool lets you check whether your robots.txt works correctly. After a few weeks you should see, under Google Webmaster Tools -> Overview -> URLs restricted by robots.txt, that the links containing the string ?nggpage= are blocked. Just be patient and everything will be all right!

  7. Balisugar Says:

    Oh, I see. Thank you very much for your help.

    I will link to your blog.

  8. Web Talk Says:

    thanks!

  9. NextGEN Gallery causing Page Not Found error and Duplicate content pages Says:

    [...] under the folder “wp” and hence the robots.txt rule starts with /wp/ Thanks to http://www.webtlk.com/2008/02/16/avoiding-duplicate-content-a-google-issue/ for the [...]

