Tyranny of a Google vote

Image by Hajo de Reijger, politicallyillustrated.com

For the past year I’ve been seeing headlines and blog posts analyzing the effect of Google’s latest algorithm update, dubbed Panda. There was much talk of unfair relegation from the first page of Google search results, and general indignation from the SEO community.

As with any subject of which I have only cursory knowledge, I didn’t think much of it. I thought that as long as I didn’t engage in link-buying or what’s known as “black hat” tactics, the search engines would treat me fairly. What I didn’t realise was how subjective Google has become in ranking websites. I was particularly tripped up in the area of duplicate content.

Some of my articles are syndicated to DZone.com, a hub for tech bloggers. They approached me about a year ago asking if I’d like them to carry our content. It seemed like a good way to gain visibility, so I agreed.

Recently, I ran a Google search on the actual content that was syndicated and found the following results:

Title                          DZone rank   iheavy rank
Zero Downtime – What is it?    #21          not in first 5 pages
Deploying MySQL on EC2         #1           not in first 5 pages
Cloud Computing Use Cases      #5           not in first 5 pages

Now DZone.com’s PageRank is 6 while iheavy.com’s is 3. Google’s algorithm is probably weighting DZone’s PageRank higher, and serving up those results at the expense of the original. It could be that the algorithm can’t determine which copy is the original, but Google’s bots know full well when content is published, so Google knows the iheavy.com content was created first.

More likely, that knowledge cannot supersede the ranking algorithm itself, which is based on popularity.

What’s Original Content?

All of this raises the question of the value of original content. In this case we gave the site permission to carry our content, not knowing whether that would be good or bad for us. Nevertheless, it does seem to go against common sense, and perhaps against what Internet users’ intuition tells them: that they are not clicking through to the original creator of the content.

Can Google’s Algorithm Discern Original Work? Is there an incentive to do so?

The update brings up interesting questions about Google’s ranking algorithm. In a world where the popularity indicator is given the highest weight, will we find what we are looking for? In my case, I write articles related to my area of expertise: web architecture, scalability and general tech consulting matters. Of course I want people to find my site when they’re looking for solutions to problems relevant to them. Yet in Google’s calculations, popularity trumps provenance.

Many bigger sites are experiencing the same thing, and at an even larger scale. According to SEOmoz, Panda is changing the role of SEOs, turning them into web strategists. While traditional SEO methods of optimizing for keywords and putting out quality content still count, design and user experience, shareability and likeability (the “signals” that could predict a site’s popularity) are now influencing where your site ranks.

As an independent business owner with limited resources, my off-hours time is invested in writing better articles that appeal to people searching for MySQL or scalability consulting, and less in adding bells and whistles to raise the site’s popularity. But if that’s what Google favors, then I’ll probably have to rethink my approach.

However much I can squeeze out of a busy schedule, there’s not much chance this website can surpass a giant such as DZone in popularity.

The Internet is meant to be a place where the pint-sized can have a fair chance at making an impression. With the way search algorithms have evolved, things are looking more like a reality TV talent contest, where skill alone, without good looks and a nice smile, is just not enough to win the popular vote.

The Power of Habit by Charles Duhigg

Habits. We all have them. The good ones we celebrate, but the bad ones we struggle with. Duhigg’s book may introduce some ideas to those of us less familiar with the behavioral sciences, but it fails to effectively teach us how to form good habits and break bad ones.

The book is filled with stories from successful brands such as Pepsodent, which Duhigg credits with turning the brushing of teeth into a daily routine, and perhaps more tenuous ones about leaders such as Paul O’Neill, the CEO of Alcoa, who purportedly turned around the fortunes of an ailing organisation by changing its safety practices.

Duhigg frames habits as a loop of cue, routine and reward: first identify the habit, then, in a way that parallels the success of Alcoholics Anonymous, replace the routine while keeping the cue and reward. In discussing the success of AA and others, he brings up the importance of belief in the long-term success of habit change, referencing William James’ famous quote: “Believe that life is worth living, and your belief will help create that fact”.

Still, I couldn’t help thinking that for the average business manager it lacks actionable advice of the kind you might find in Jim Collins’ Good to Great or Chip Conley’s Peak. Those books also have excellent storytelling, but they break things down into a specific set of steps and attributes that an organization or individual can apply today.

Duhigg’s writing is easy to read, and that’s probably the book’s greatest strength. Yet with most of it grounded more in interesting anecdotes than in credible research, the examples unfortunately make for more entertaining reading than deep insight.

Bulletproofing MySQL replication with checksums

bulletproof glass

Also find Sean Hull’s ramblings on twitter @hullsean.

Are your MySQL replicas running well? You might not even know if they aren’t. One of the scariest things about MySQL replication is that it can drift out of sync with the master “silently”. No errors, no warnings.

  1. What and Why?

    MySQL’s replication solution evolved as a statement-based technology. Instead of sending actual block changes, MySQL just has to log committed transactions and reapply them on the slave side. This affords a wonderful array of topologies and different uses, but it has its drawbacks. The biggest occur when data does not get updated or changed in the same way on the slave. If you’re new to MySQL or coming from the Oracle world you might expect such drift to flag an error. But there are many scenarios in which MySQL will not flag an error:

    • mixed transactional and non-transactional tables
    • use of non-deterministic functions such as uuid()
    • stored procedures and functions
    • update with LIMIT clause

    There are others, but suffice it to say that if you want to rely on your slave being consistent, you need to check it! The uuid() case below shows how easily drift can happen.
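
    Here’s a small sketch (the table name is made up for illustration) of how a non-deterministic function like uuid() drifts under statement-based replication:

    -- assumes binlog_format=STATEMENT; MySQL may warn that the statement is unsafe,
    -- but it still replicates, and uuid() is re-evaluated independently on the slave
    CREATE TABLE test.tokens (
      id INT AUTO_INCREMENT PRIMARY KEY,
      token CHAR(36)
    ) ENGINE=InnoDB;

    INSERT INTO test.tokens (token) VALUES (uuid());

    -- run on both master and slave: the token values differ,
    -- yet SHOW SLAVE STATUS reports no replication error
    SELECT token FROM test.tokens;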

  2. The solution – mathematical checksums

    If you’re a seasoned Linux user, you’re probably familiar with the md5sum command. It creates a checksum of a file. You can run it on different servers to compare a file in a mathematically exact way. In fact rsync uses this technique to efficiently determine which files, or pieces of files, need to be copied across a network. That’s what makes it so fast!

    It turns out that MySQL can checksum tables too. However, were we to build our own solution, we might have trouble doing so manually, as table data is constantly in a state of flux.
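
    For reference, the built-in statement looks like this; the value it returns changes as soon as the underlying data changes, which is what makes a hand-rolled approach hard to coordinate across master and slave:

    -- MySQL’s built-in table checksum; the value will vary with the data
    CHECKSUM TABLE mysql.user;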

    Enter Percona’s pt-table-checksum tool, formerly part of Maatkit. Run it periodically against your master schemas, or the entire instance if you like. It will store checksums of all of your tables in a special checksum table. The data from this table will then propagate through replication to all of your connected slaves.

    The tool then has a check mode, which allows you to verify all the connected slaves are ok, or report the differences if it finds any.

  3. Step-by-step Setup

    First you’ll need to grab a copy of the Percona Toolkit. Note that if you previously installed Maatkit you may want to delete the old scripts to avoid confusion: mk-table-checksum if you used Maatkit, or pt-table-checksum if you have the 1.0 versions. You likely installed them using wget or a Perl Makefile, so you may need to go and remove them manually.

    Assuming you’ve already got the Percona repository installed, issue:

    $ yum install -y percona-toolkit

    I’ve found some of the Maatkit tools to be rather fussy about getting all the options right. The first thing to do, which will help simplify this, is to add a section to your local user’s .my.cnf file like this:

    [client]
    user=root
    password=myrootpassword

    That way the Percona tools will find credentials whenever they need to authenticate. We assume localhost for this example, so you should verify you can connect with the mysql client as root from localhost.

    Now let’s checksum the “mysql” system schema.

    $ pt-table-checksum --replicate=test.checksum --create-replicate-table --databases=mysql localhost

    Note the --create-replicate-table option. You only need this option the first time. From there the test.checksum table will exist.

    You should see some output that looks like this:

    TS ERRORS DIFFS ROWS CHUNKS SKIPPED TIME TABLE
    04-24T16:06:45 0 0 0 1 0 0.099 mysql.columns_priv
    04-24T16:06:45 0 0 32 1 0 0.100 mysql.db
    04-24T16:06:45 0 0 0 1 0 0.096 mysql.event
    04-24T16:06:45 0 0 0 1 0 0.096 mysql.func
    04-24T16:06:45 0 0 38 1 0 0.102 mysql.help_category
    04-24T16:06:45 0 0 452 1 0 0.106 mysql.help_keyword
    04-24T16:06:46 0 0 993 1 0 0.096 mysql.help_relation
    04-24T16:06:46 0 0 506 1 0 0.100 mysql.help_topic
    04-24T16:06:46 0 0 0 1 0 0.099 mysql.host
    04-24T16:06:46 0 0 0 1 0 0.104 mysql.ndb_binlog_index
    04-24T16:06:46 0 0 0 1 0 0.107 mysql.plugin
    04-24T16:06:46 0 1 1 1 0 0.115 mysql.proc
    04-24T16:06:46 0 0 0 1 0 0.186 mysql.procs_priv
    04-24T16:06:46 0 1 1 1 0 0.097 mysql.proxies_priv
    04-24T16:06:47 0 0 0 1 0 0.097 mysql.servers
    04-24T16:06:47 0 0 0 1 0 0.096 mysql.tables_priv
    04-24T16:06:47 0 0 0 1 0 0.098 mysql.time_zone
    04-24T16:06:47 0 0 0 1 0 0.097 mysql.time_zone_leap_second
    04-24T16:06:47 0 0 0 1 0 0.100 mysql.time_zone_name
    04-24T16:06:47 0 0 0 1 0 0.100 mysql.time_zone_transition
    04-24T16:06:47 0 0 0 1 0 0.095 mysql.time_zone_transition_type
    04-24T16:06:47 0 1 38 1 0 0.100 mysql.user

  4. How to check slaves

    Once you’ve collected all those fancy checksums for your tables, nicely timestamped, you’ll want to verify that your slaves are happily in sync. You can do that with the following command, also run on the master:

    $ pt-table-checksum --replicate=test.checksum --replicate-check-only --databases=mysql localhost

    If there are no differences you’ll see no output. If there is a difference it’ll look something like this:

    Differences on ip-10-15-27-19
    TABLE CHUNK CNT_DIFF CRC_DIFF CHUNK_INDEX LOWER_BOUNDARY UPPER_BOUNDARY
    mysql.user 1 1 1

    In our case you can see we created some users on the slaves accidentally, hence the differences. It illustrates how easy it is for differences to creep into your environment and also how easy it now is to find them!
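
    You can also inspect a slave directly by querying the checksum table it received through replication. Here’s a sketch along the lines of the query suggested in the pt-table-checksum documentation (column names may differ slightly between toolkit versions):

    -- run on a slave: list tables whose row counts or checksums differ from the master
    SELECT db, tbl, SUM(this_cnt) AS total_rows, COUNT(*) AS chunks
    FROM test.checksum
    WHERE master_cnt <> this_cnt
       OR master_crc <> this_crc
       OR ISNULL(master_crc) <> ISNULL(this_crc)
    GROUP BY db, tbl;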

  5. Special Cases

    Since one of my clients uses Drupal, they’ve had trouble replicating the semaphore table. This table is MyISAM, and unfortunately no one dares convert it to InnoDB. So from time to time some gunk builds up in there, and it fails on the slave. We could clean out the table, but we decided to just filter out this one table. Since Drupal doesn’t use fully qualified schema.table names in its code, only “use”, we have found this to be safe.

    However, the Percona Toolkit explicitly checks for replication filters and will not run. It stops with an error as follows:

    $ pt-table-checksum --replicate=test.checksum --databases=sean --ignore-tables=semaphore localhost
    04-24T15:59:29 Replication filters are set on these hosts:
      ip-10.15.27.19
        replicate_ignore_table = sean.semaphore
      ip-10-15-27-72
        replicate_ignore_table = sean.semaphore
      ip-10-15-27-18
        replicate_ignore_table = sean.semaphore
    Please read the --check-replication-filters documentation to learn how to solve this problem. at /usr/bin/pt-table-checksum line 6166.

The solution is the --nocheck-replication-filters option. Keep in mind that this sanity check is there for a reason, so be sure to skip the relevant tables both when building checksums and when checking them.
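
If you’re not sure which filters a given slave has configured, you can check directly with standard MySQL before deciding what to ignore:

-- run on each slave; the Replicate_Ignore_Table and related fields list any filters
SHOW SLAVE STATUS\G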

To build checksums skipping the semaphore table use this command:

$ pt-table-checksum --replicate=test.checksum --ignore-tables=prod.semaphore --nocheck-replication-filters localhost

Now you can check your slaves but ignore the semaphore table:

$ pt-table-checksum --replicate=test.checksum --replicate-check-only --ignore-tables=prod.semaphore --nocheck-replication-filters localhost

We also found a bug which prevented us from specifying multiple tables on the --ignore-tables line, so we used separate invocations for different schemas, like this:

$ pt-table-checksum --replicate=test.checksum --replicate-check-only --databases=prod1 --ignore-tables=prod1.semaphore --nocheck-replication-filters localhost

$ pt-table-checksum --replicate=test.checksum --replicate-check-only --databases=prod2 --ignore-tables=prod2.semaphore --nocheck-replication-filters localhost

  6. Crash Protection

    If you’ve used MySQL replication for any length of time, you’ve probably seen a server crash. MySQL replication can have trouble restarting if you’re using temporary tables, as they’ll be missing upon restart. Also, MySQL before 5.5 leaves syncing of the replication info files to the operating system, so they may be incorrect after a crash.

    1. Use MySQL 5.5 if possible

    There are some new parameters in 5.5 that protect the info files from a crash. These are a great addition, and will make your slave databases more bulletproof.

    sync_master_info = 1
    sync_relay_log = 1
    sync_relay_log_info = 1
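
    These are server settings, typically placed under the [mysqld] section of my.cnf. Once they’re in place, a quick way to confirm they took effect:

    -- check the current values of the replication sync settings
    SHOW GLOBAL VARIABLES LIKE 'sync_%';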

    2. Don’t use temporary tables

    After a restart they’ll simply be gone, so queries requiring or running against them will fail.

    The Percona guys’ new book, High Performance MySQL, third edition, suggests an alternative to temporary tables. Use a special schema to hold your temp data, but create the tables as normal permanent tables. Be sure your application creates them with unique names, using connection_id() for example. Lastly, have a cleanup process drop the tables periodically, based on closed connection ids.
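
    Here’s a minimal sketch of that pattern; the scratch schema and table names are made up for illustration:

    -- a dedicated schema holds per-connection work tables (names are illustrative)
    CREATE DATABASE IF NOT EXISTS scratch;

    -- the application builds a unique table name from its connection id,
    -- e.g. CONCAT('work_', CONNECTION_ID()), and creates a normal permanent table
    CREATE TABLE scratch.work_12345 (
      id INT PRIMARY KEY,
      payload VARCHAR(255)
    ) ENGINE=InnoDB;

    -- ... the application uses it just as it would a temporary table ...

    -- a periodic cleanup job compares each table’s numeric suffix against the ids in
    -- information_schema.processlist and drops tables belonging to closed connections
    DROP TABLE scratch.work_12345;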