Forum OpenACS Q&A: Approach to Google-optimizing

Posted by Joel Aufrecht on
I'm working on improving Greenpeace's Google visibility; part of the problem is that it's more work to find out what the visibility _is_ than to do things that change it. Here's the approach I have in mind. Does this look like the correct approach, and has anybody had good experiences with an existing tool that does something like this?

1) Measure current Google rankings
  - For 10 pages on the site, including 5 that will be changed and 5 that won't
  - For each page, do a Google search for the title, the first 5 words, the last 5 words, and 5 words from the middle of the main text
  - How do we measure Google ranking? Look through the search results and note the highest position at which the site appears? Can this be automated through the Google API, and what is acceptable use of Google? (See the sketch below.)

2) Roll out a change

3) Repeat measurements 1 day, 5 days, 10 days, 30 days, 60 days after change
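
For the counting step, here's a minimal Tcl sketch. It assumes I already have the ordered list of result URLs for a query (e.g. from the Google SOAP API's doGoogleSearch call, used within its terms of service); the proc name site_rank and its arguments are made up.

    # Sketch only: find the highest-ranked result that belongs to a given site.
    proc site_rank {site result_urls} {
        # site         e.g. "greenpeace.org"
        # result_urls  ordered list of result URLs for one query
        set pos 0
        foreach url $result_urls {
            incr pos
            # Pull the host out of the URL and compare case-insensitively.
            if {[regexp -nocase {^https?://([^/]+)} $url -> host]
                && [string match -nocase "*$site" $host]} {
                return $pos
            }
        }
        return -1   ;# site not found in the results we looked at
    }

    # Example: [site_rank greenpeace.org $urls] returns the 1-based position
    # of the first greenpeace.org result, or -1 if the site isn't there.

Re-running the same check after each change would give the before/after numbers.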

Planned changes, one at a time:
1) Put article titles into page <title> tags
2) Move article titles from <div> tags into <h?> heading tags
3) Implement pretty URLs (foo.org/article/1) and construct index pages that link to articles directly (current index pages show 10 at a time)
4) Add meta tags to form edit/add modes so that they don't get indexed

Posted by Guan Yang on
1) The Google toolbar for Windows provides PageRank information. I'm not sure if the Google API provides it. At least locating the exact search rank is possible through the Google API.

If you are willing to get your hands dirty, the "professionals" have some nasty tricks to increase Google rankings. One guy I know maintains a special front page for each of his client sites that is served only to Googlebot (based on IP address, I presume). This special page, which is sometimes (but often not) visible through the Google cache, contains links to popular sites, a search form (this is apparently a plus), as well as links to all of his other clients.

This way he can exploit googlejuice between his clients without explicitly linking (they're often completely unrelated commercial sites).

Posted by xx xx on
I guess it will be hard to measure your exact results. It may take weeks to months before your efforts pay off.

And what did you optimize if PageRank goes up from 8/10 to 9/10?

A nice reference for this is  http://selfpromotion.com/improve.t

Posted by Dave Bauer on
Great link. I was going to recommend selfpromotion.com. It has a whole series of very effective tools for analyzing keywords, search engine rankings, etc. It is very affordable, and you can use it very effectively to improve your web site's rankings as well as provide that service to your clients, if you are in that business.
Posted by Tilmann Singer on
Serving pages exclusively to Googlebot depending on IP address or User-Agent header does not sound like a good idea. That's called cloaking, and Googlebot is able to detect it (it has to be able to, otherwise cheating by creating huge fake link nets would work), probably by sending an anonymous bot around from an unknown IP address from time to time.
Posted by Guan Yang on
I'm not saying that cloaking is a good idea or at all moral. It's just an example of a technique that's widely used.
Posted by Tom Jackson on

You will screw yourself big time if you try to trick Googlebot. Here are some links; read these first. I'm launching a campaign to redo a few sites myself.

First, is the Greenpeace page you are interested in ranking actually in Google? After that point, page rank depends on the keywords you choose. You shouldn't judge yourself, or allow Greenpeace to judge you, based on how a page ranks, but you can ask yourself: what are the keywords I want Google to index? Are those words used on the page as a main theme? How relevant are other pages on the net when searching those keywords? Also, think about what users are typing into Google: you really want Google to return Greenpeace on subjects where they believe they have some authority, but you cannot choose what users will type.

Bottom line is there are no tricks, only sound writing skills and webmastering. Maybe one exception: hopefully a search for "Greenpeace" will return their home page...

Posted by Jade Rubick on
Joel, here are a couple of other suggestions:

- there is a really great application for Mac OS X called Advanced Web Ranking. It allows you to monitor your search results by keyword on hundreds of search engines, month by month or week by week, and it displays graphs and reports -- very effective for showing your client what you're trying to do.

http://www.apple.com/downloads/macosx/internet_utilities/advancedwebranking.html

- there was a great thread on openacs.org that discussed how to score better on Google; it referred to this great link:

http://wolfram.org/writing/howto/3.html

One of the most important things, it seems, is to have a lot of meaningful links, and to make the site useful to other people, so that other people will link to it.

After that, directory naming seems very important. I notice that my rubick.com pages are found iff I name the directories something meaningful.

So http://rubick.com:8002/openacs/ad_form

would be much better than

http://rubick.com:8002/openacs/notes_on_ad_form

because people would most likely search for openacs ad_form

Posted by Jeff Davis on
We should also try to set noindex,nofollow robots meta tags to reduce the number of duplicates returned from some of the applications. In particular, something like bug-tracker or photo-album returns the same content many times, and since Google will limit the number of pages it indexes when they have query variables, the duplicates reduce the depth of the indexing on the site.

My idea for bug-tracker would be to index pages only when no state variables are set, and for photo-album to mark the medium-size image pages noindex,nofollow so that the large images are not indexed.
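
Something like this in the page's Tcl script might do it. This is a sketch, not the actual bug-tracker code, and robots_meta is a hypothetical property that the site master template would have to render inside <head> (e.g. with @robots_meta;noquote@); the variable names are invented for illustration.

    # Sketch: emit a robots meta tag only when the request carries
    # filter/state query variables (i.e. it is a filtered, duplicate view).
    set state_vars {filter_status filter_assignee orderby page}

    set robots_meta ""
    foreach var $state_vars {
        if {[ns_queryget $var] ne ""} {
            # A filtered view: tell robots not to index or follow it.
            set robots_meta {<meta name="robots" content="noindex,nofollow">}
            break
        }
    }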

Posted by Tom Mizukami on
I remembered Eric having some good experience at getting OACS sites a high Google ranking. https://openacs.org/forums/message-view?message_id=107569
Posted by Joel Aufrecht on
Of course we don't want to do any cloaking. The quality of the content and whether or not people are linking to it is a bit out of scope of my core mission, which is to identify and solve any technical glitches that cause poor indexing. I've got Eric Wolfram's top item, fixing page titles, on the top of my list.

After reading all the notes and some of the linked items, I'm wondering:

  • Is it worth it to try to monitor results? greenpeace.org gets many Google hits every day, but not every page is hit every day, and some authors claim that pages go months without re-indexing. Maybe we should just make the obvious fixes and leave it alone, or check back in 6 months.
  • Should I put any effort into better pretty URLs, not just /article/145 but putting a keyword into the pretty URL? We do the foundation work for this in some parts of OpenACS, where short_name is a locally unique string suitable for a URL. This is certainly nicer for users, but how standard can we make it in OpenACS? Is it worth trying to retrofit this onto old apps that just have ids, by creating a short-name field and populating it?
  • Where else should we be setting noindex,nofollow? So far:
    • in edit and add mode of form-builder
    • in packages with duplicates. Are the duplicates a bigger problem than the possibility of not getting indexed at all, if we block some pages from indexing and the "intended to be indexed" pages don't get hit? Maybe we're better off trusting the search engines' ability to hide duplicates.
Posted by Dave Bauer on
Joel,

One simple way to get a more descriptive URL is to convert the item title into cr_items.name using util_text_to_url.
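
Roughly like this (a sketch; check your OpenACS version for the exact switches util_text_to_url takes, and the title string is just an example):

    set title "Greenpeace Ships Return to Rainbow Warrior Bay"
    set name  [util_text_to_url -text $title]
    # => something like "greenpeace-ships-return-to-rainbow-warrior-bay"

    # Store that as cr_items.name when the item is created, e.g. via
    # content::item::new ... -name $name ..., and the item's URL can then
    # use the readable name instead of a raw id.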

Posted by Jeff Davis on
I think the reason we should be more careful about noindex,nofollow is this statement from Google:

  "1. Reasons your site may not be included. Your pages are dynamically generated. We are able to index dynamically generated pages. However, because our web crawler can easily overwhelm and crash sites serving dynamic content, we limit the amount of dynamic pages we index."

Since they limit the amount of dynamic content they spider, you want to make what they do spider unique, to increase coverage (and to lower the burden on your own server).

Changing everything to have pretty URLs would remedy the spidering-scope issue (although it would still leave Google pulling down an order of magnitude too many pages for things like bug-tracker).
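
For reference, the usual way to get those pretty URLs in OpenACS is a .vuh virtual URL handler, using the request-processor procs rp_form_put and rp_internal_redirect. A rough sketch (the /article mount point, the package paths and the page names are made up here):

    # article/index.vuh -- serve /article/145 without exposing ?article_id=145
    set name [ad_conn path_info]    ;# whatever follows /article/

    if {$name eq ""} {
        rp_internal_redirect /packages/my-articles/www/index
        ad_script_abort
    }

    # Hand the URL fragment to the real page as if it were a query variable.
    rp_form_put article_id $name
    rp_internal_redirect /packages/my-articles/www/view
    ad_script_abort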

Posted by James Thornton on
As of the February update, I have noticed that outgoing links are more important to Google. Google has added more weight to what it deems authority pages, and it appears that Google identifies authority pages/sites by the number and quality of related incoming links, the number and quality of related outgoing links, the number of related pages on the site, and possibly how long the page/site has been online.

As discussed in "When Experts Agree: Using Non-Affiliated Experts to Rank Popular Topics", a paper written by two Google researchers, authority sites often link to other authority sites. This paper describes an algorithm for ranking "expert" sites.

A paper entitled "Authoritative Sources in a Hyperlinked Environment", written by Jon Kleinberg at Cornell, distinguishes between hubs and authority sites. Hubs have many outgoing links, ideally to related authority pages, and authority pages have many incoming links, ideally from related authority pages.
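
For reference, Kleinberg's hub and authority scores are defined by a pair of mutually reinforcing updates, iterated (with renormalization) until they converge:

    a(p) \leftarrow \sum_{q \to p} h(q), \qquad h(p) \leftarrow \sum_{p \to q} a(q)

where the first sum runs over the pages q that link to p: a page's authority score is the sum of the hub scores of the pages pointing at it, and its hub score is the sum of the authority scores of the pages it points to.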

"Improved Algorithms for Topic Distillation in a Hyperlinked Environment" describes a query-based approach that ranks the interconnectivity of pages linked to and from the other top 1000 results for a given query.

It is thought that Google has recently modified its algorithm to incorporate some or all of these techniques. In January I optimized a bank's site based on the algorithms discussed in these papers. The site launched in October, and I noticed substantial improvement in rankings after the Google February update.

It does not appear that modifying a page's outgoing links will have any immediate effect. It appears that site connectivity rankings are calculated once a month at the same time PageRank is calculated. In addition, the effects from tweaking the content of a page aren't as noticeable on a day to day basis. Until November, you could change the number of times you repeated a phrase in a page, and the next day you could notice a significant adjustment in the SERPs.

Also, in April 2003, Google acquired Applied Semantics. Recently it appears that it is more effective to use related keywords on a page/site than it is to optimize a page for a particular phrase. "Patterns in Unstructured Data" discusses the concept of latent semantic indexing. Use Google's Keyword Suggestion Tool to find a list of keywords Google identifies as related to your target phrase.

You can find links to all of the above papers, and ~40 others, on my website: Search Engine Research Papers.

Posted by Andrew Piskorski on
Wow, lots of good info above. Here are links to a few old discussion threads that might still be relevant: Feb. 2004, May 2003, and Sept. 2002.
Posted by Chris Davies on
There are a few other things that I think originally hurt Greenpeace:

1) Redirects -- if someone links to http://www.greenpeace.org/ and it redirects to something else, Google's engine used to treat the 302 as a 404 and then spider the resulting content. Not a huge problem until you realize that you get no PR transferred to the domain from the cache of links pointing at the site.

2) Keyworded URLs. Recently it seems that Google is penalizing .php, .phtml, .shtml, and .shtm as 'dynamic'. I've tested this numerous times with two clients, and every time we check the conclusion is the same. ? and & in the URL are also dynamic triggers, and one of my biggest pet peeves. Yes, if you can, put keywords in the directory path so that the pages have some chance at higher relevance.

3) HTTP 1.0 Googlebot requests without the Host header. I don't know whether Google still does this, but they used to have a bot that would do checks without sending a Host header. If I recall, Greenpeace's website pointed surfers to a non-existent host when that happened.

Other notes:

Cloaking. There are things you can do that will help Google that are not specifically cloaking. Yes, they do have some bot that checks whether the page looks similar and contains similar elements; however, you can unfold menus, present navbars that allow Google to spider more efficiently, etc.

Content location. I've had a theory for many years that Google seems to put more weight on the first 5120 bytes of a page. Thus, when you design a page that contains CSS, menus, headers, comments, etc., you are pushing the important page content 'lower' on the page as Google sees it. This in turn affects the relevance to other sites.

Keyword relevance. Google seems to take notice of particular phrases in the <a> container.

For instance, if you link to Nike as:

<a href="http://nike.com/">Nike</a>

you bump the keyword relevance for Nike. However, a better choice for keyword relevance might be:

<a href="http://nike.com/">Running Shoes</a>

A few other things I've learned along the way:
If at all possible, use no inline JavaScript or CSS; Google will try to index it as content. Use alt attributes that represent what is in the picture (rather than alt="picture1").

404s are the devil's bane. If you put content online, leave it there. Disk is cheap. :)

Just some random thoughts.

Posted by Dirk Gomez on
A good search engine should try to behave like an experienced web surfer. How do YOU rate a page?

You read the first few paragraphs and then decide whether it makes sense to continue through the rest, so you rate the first bytes higher.

You look at the URL and decide whether it is dodgy or trustworthy.

You look at the bold and big letters. Hence a search engine should rate h2 and h3 higher.

You don't care about meta tags, hence a good search engine will silently ignore them as well.

I wouldn't even be astonished if average response time per transferred byte were a metric. The slower the site, the worse it usually appears.

How much of the site appears to be original content, and to what extent is it just a metasite? Original content is a ton more interesting. E.g., the features section on Greenpeace links to a whole lot of different sites and gives the uninitiated bot the impression that the *major* navigation bar links to other sites. It knows that this is the major navigation bar because most sites that have links on the left use them for navigation.

Then: what do you want to be indexed? What are people looking for when they look for Greenpeace, greenpeace.org or some particular content? What would be ten search terms where Greenpeace should be ranked prominently? Which story or page seems to deserve a high ranking for any of these terms?

If we then look at the application (the page), we might ponder why it doesn't get the rating it may deserve.

(All of this is assumptions. Remember that Google said two years ago that they have more than 100 heuristics per page. :))

Posted by James Malzahn on
I am having a hell of a time getting my page to rank high in Google. I am listed on page 5 for the search "winnipeg data recovery". There are only a few sites that are even relevant. I have done a lot of research on optimizing my page for Google and nothing seems to help. If anyone could look at my page and give me some tips, please do so.

My site is http://www.winnipegdatarecovery.com and I would like to optimize for "winnipeg data recovery" and "winnipeg file recovery"

My email address is mailto:support@winnipegdatarecovery.com

PLEASE HELP!

Posted by Jarkko Laine on
James,

First, I don't know how this question relates to OpenACS. Second, DO NOT CROSS-POST. It kills the slightest desire to answer. Third, did you actually read the thread you first posted to (i.e. the one I'm answering in)? It contains a lot of good tips and links that should be helpful to you. There's no silver bullet.

Posted by Jarkko Laine on
Oh, and BTW, James,

Your page title goes "WinnipegDataRecovery.com offers Winnipeg a Data Recovery Solution for hard drives, Digital Camera's, CD's, ZIP and Floppy Disks, Password Recovery, File Repairs for Office Documents".

Wonder why it ranks #1 for "winnipegdatarecovery" but not for "winnipeg data recovery"? Finding the answer is left as an exercise ;)

Posted by James Malzahn on
I believe that if I had made the domain name www.winnipeg-data-recovery.com instead of www.winnipegdatarecovery.com it would have worked better, correct? As far as my title goes, I have changed it

from

"WinnipegDataRecovery.com offers Winnipeg a Data Recovery Solution for hard drives, Digital Camera's, CD's, ZIP and Floppy Disks, Password Recovery, File Repairs for Office Documents".

to

"Winnipeg Data Recovery . com offers Solution for hard drives, Digital Camera's, CD's, ZIP and Floppy Disks, Password Recovery, File Repairs for Office Documents"

Do you think this may help?

Posted by Sushubh Mittal on
I am pretty new to this community and was interested in this topic, since I like following Google stuff. And coming from India, I was obviously more interested in:
http://www.greenpeaceindia.org/

Regarding Google and this site, one thing immediately comes to mind: the use of images. Most of the links on this site that point to other major sections of the site are images, which I believe is totally unnecessary. The menus on the left and at the top could all be created using CSS. Images make it tough for Google to guess what the linked page is about, and they don't make adding and editing content easy!

I will be back with more as I get to study the site more! :)

Posted by Michael Schlenker on
Maybe take a look at Visitors to analyze your logs and see what you find: http://www.hping.org/visitors/
24: Google Sitemaps API (response to 15)
Posted by Andrew Piskorski on
Also read about Google's new Sitemaps feature - an API which lets you notify Google that dynamic content on your website has been updated.
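
A rough Tcl sketch of both halves (writing a minimal sitemap file and pinging Google). The schema namespace and ping URL below are the ones documented when Sitemaps launched, so verify them against Google's current docs; the proc name, example URLs and file path are made up.

    # Hypothetical sketch: write a minimal sitemap and tell Google it changed.
    proc write_sitemap {urls file} {
        set out {<?xml version="1.0" encoding="UTF-8"?>}
        append out "\n<urlset xmlns=\"http://www.google.com/schemas/sitemap/0.84\">\n"
        foreach url $urls {
            append out "  <url><loc>[ns_quotehtml $url]</loc></url>\n"
        }
        append out "</urlset>\n"
        set fd [open $file w]
        puts -nonewline $fd $out
        close $fd
    }

    # Write the map somewhere the web server can serve it ...
    write_sitemap {http://example.org/article/145 http://example.org/article/146} \
        /web/myserver/www/sitemap.xml

    # ... and notify Google that it changed.
    util_httpget "http://www.google.com/webmasters/sitemaps/ping?sitemap=[ns_urlencode http://example.org/sitemap.xml]"
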
25: Re: Google Sitemaps API (response to 24)
Posted by Carsten Clasohm on
I have done an implementation for OpenACS 5.1.5, which allows modules like lars-blogger to generate Google Sitemaps.

The infrastructure package can be found at

http://www.clasohm.com/prj/clasohm.com/browser/trunk/packages/google-sitemaps/

The code that generates the sitemap for lars-blogger is in

http://www.clasohm.com/prj/clasohm.com/file/trunk/packages/lars-blogger/tcl/sitemap-procs.tcl
http://www.clasohm.com/prj/clasohm.com/file/trunk/packages/lars-blogger/tcl/sitemap-init.tcl
http://www.clasohm.com/prj/clasohm.com/file/trunk/packages/lars-blogger/www/index.vuh

For this to work, you will also need the directory "google-sitemaps" in the server root, which must be writable by the Web server.

The generated sitemaps can be downloaded at [ad_url]/google-sitemaps/index.xml. See

http://www.clasohm.com/google-sitemaps/index.xml

for an example.

If you want to retrieve a copy of the code, you can do so with Subversion and the URL http://www.clasohm.com/svn/clasohm.com/trunk/

Posted by russ m on
Carsten wrote: "For this to work, you will also need the directory 'google-sitemaps' in the server root, which must be writable by the Web server. The generated sitemaps can be downloaded at [ad_url]/google-sitemaps/index.xml."

According to the Sitemaps docs, a sitemap can only refer to pages below it in the site hierarchy, so a map at http://example.com/google-sitemaps/index.xml will only be used for other pages below http://example.com/google-sitemaps/. So for this to actually help, it looks like the sitemap file needs to be placed directly in the site root directory.
Posted by Klyde Beattie on
I would like to point out that this is called cloaking, and Google may permanently remove you from their listings for it.