Why I like Etsy’s site performance report

Etsy publishes a great tech blog titled Code As Craft.

Join 28,000 others and follow Sean Hull on twitter @hullsean.

I was recently sifting through some of their newer posts & stumbled upon their Q2 2015 Site Performance Report. It’s really in-depth, though not impossibly technical. Here’s what I liked.

1. Transparency to business & public

Show real performance to customers

The first thing that struck me while reading was the strong show of transparency. The blog is public, so this isn’t just an internal document shared within the company; it is shared with the wider world. True, as a technical post it may only appeal to a segment of readers, but it’s great nonetheless.

Show real performance to non-technical business units

I think this kind of analysis & summary also provides transparency to the business itself. Product teams, business operations & sales teams can all view what’s happening. Where are there problems? What is being done to address them?

Also: When hosting data on Amazon turns bloodsport

2. Highlighting change

Added pagination to the cart

One thing that popped out was the discussion of pagination changes that impacted page load times in the shopping cart. Load times there are particularly crucial, because that’s where customers can “abandon” an order out of frustration.

Illustrating performance impact to product decisions

When the product team is evaluating a new feature and can see how changes affect performance, it better *sells* what all those engineering resources are being used for.

Related: 5 reasons to move data to amazon redshift

3. Where we don’t have data

We can’t analyze what data we haven’t captured

The report highlights that data around the shopping cart is new. That’s great because it shows the value that collecting data offers, by providing new insights that were not available previously. This also pushes for more metrics collection & analysis as the business begins to see the value of all these gymnastics.

Read: Is Amazon too big to fail?

4. Product tradeoffs

The discussion around shopping cart performance also illustrates how the business makes product decisions. The engineering team can only build & write so much code. Deciding to spend time on pagination means time not spent on some other new feature. Which is more valuable? Selling new feature A in one corner of the product, which customers may spend real money on? Or speeding up page load times on page B?

Also: Is Apple betting against big data?

5. Cleaner data

At a Look & Tell event, I heard Lincoln Ritter talk about Data as a product to the business.

When you expose a performance report like this to the business, an iterative process begins to happen. The company gains insight from the report, makes better decisions, and thus can spend more energy, time & resources on clean data. Cleaner data in turn means better reports, which produce better decisions & so on.

Also: What is venue analytics & why is it important?

Get more. Grab our exclusive monthly Scalable Startups. We share tips and special content. Our latest: Why I don’t work with recruiters

When hosting data on Amazon turns bloodsport

There’s a strong trend to automation across the cloud. That’s a great thing for startups because it reduces operational headaches & lets them focus on building products.

But as that trend begins to touch the database tier, all sorts of complications emerge. Let’s take a look at some of the tradeoffs.

1. Database as a service trend

I was recently reading Baron Schwartz’s article on the trend to database as a service.

I work with a lot of venture backed startups & pay close attention to what’s happening in New York & SF. From where I’m standing I see a similar trend. As automation simplifies management across the application stack, from load balancers to web & search servers, the same advantages are moving to database management.

Also: How to automate MySQL analysis on Amazon RDS

2. How Amazon RDS helps

Amazon’s RDS offers firms a data solution for Oracle & SQL Server as well as MySQL. For those just starting, it offers a long list of advantages.

o quick push-button deployment in minutes
o standardized parameter settings that just work
o ability to scale up or down from the dashboard
o automated backups
o Multi-AZ so you can sleep at night

This brings a huge advantage to startups. Many have a team of developers but aren’t large enough to need an operations team and can’t afford a dedicated database administrator.

Amazon is obviously helping these firms raise the bar. And that’s a good thing.

Related: RDS or MySQL 10 use cases

3. How Amazon RDS hurts

As you get bigger, your needs will grow too. You’ll have tens of millions of customers, and with more customers comes an even higher bar. Zero downtime becomes critical. It’s then that Amazon’s solution starts to become frustrating.

Unpredictable upgrades

MySQL upgrades on RDS are a messy activity. Amazon will restart the instance, back up the instance, perform the upgrade, then restart again. Each of these restarts takes a few minutes. The whole operation may have you down for ten minutes. This becomes more frustrating when your hands are completely tied. You don’t know what will happen or when!

When you roll your own instance, an upgrade can be performed in a matter of seconds. No instance restarts are necessary and you can monitor the process to know exactly where you are. This is the kind of control you’re going to want if you have millions of customers relying on your site & uptime.

Unnecessary slow restarts

When you apply parameter changes on RDS, some require a MySQL restart. Amazon forces the whole server to restart, increasing this downtime from a few seconds (when you roll your own) to many minutes. And while some parameters can be changed online, Amazon can provoke some strange behavior that is not always predictable.

With the frequency of these types of changes, you’ll quickly grow tired and frustrated with RDS.

EBS Snapshots are not portable

As mentioned above, Amazon uses its standard filesystem snapshot technology to perform backups. While this works well, it can be slow & unpredictable in a multi-tenant environment.

When you roll your own, you can take advantage of xtrabackup, and perform hot backups against your database with zero downtime. This is a real godsend. What’s more, the backups are portable, and can be moved to any other server, even ones not hosted in Amazon’s cloud!

Promoting a read-replica is slow too!

One feature that Amazon touts is creating copies or “read replicas” of your data. These are great and can facilitate easy copying of data. However, promoting one again brings unnecessary restarts, which are slow.

When you roll your own, you can promote a read-replica or read-only slave in seconds. A few seconds can seem invisible to end users, while minutes will be perceived as a real outage or downtime.
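
To put rough numbers on that difference, here is a small Python sketch. The figures are illustrative assumptions only (one restart-driven event per 30-day month, about ten minutes versus about five seconds per event), not measurements from RDS or from any real site.

```python
SECONDS_PER_MONTH = 30 * 24 * 60 * 60  # assumes a 30-day month


def availability(downtime_seconds_per_event, events_per_month):
    """Return monthly availability as a percentage."""
    downtime = downtime_seconds_per_event * events_per_month
    return 100.0 * (1 - downtime / SECONDS_PER_MONTH)


# Hypothetical comparison: one maintenance event per month.
ten_minute_event = availability(600, 1)  # restart measured in minutes
five_second_event = availability(5, 1)   # promotion measured in seconds

print(f"ten-minute event:  {ten_minute_event:.4f}%")
print(f"five-second event: {five_second_event:.4f}%")
```

Under these assumptions a single ten-minute event already costs you "four nines" for the month, while a five-second event does not.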

Read: Is zero downtime even possible with RDS?

4. Is migration an option?

So what to do? As I mentioned above, there are real advantages to startups deploying their first database. It really does help. I would argue for many it can be a good place to start.

If you’re starting to outgrow RDS and are frustrated with the limitations, performance tuning headaches & unneeded downtime, luckily you have options.

Migrating off of RDS onto a server you manage yourself can be done in a number of ways.

o slave off of the master

Here you build a MySQL slave on a standard EC2 instance, with your RDS instance as the master. When the slave has caught up, bring your site down temporarily. Reset the slave & set it to read-write mode. Then point your webservers at your new EC2 instance and bring the site back up. If done carefully, 10 to 20 seconds of downtime should be plenty.

Don’t forget to run through the process with a fire drill first!
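
The steps above can be sketched as the SQL you would run on the new EC2 server. This is only a sketch: the function names, host, user and binlog coordinates are hypothetical, values are not escaped, and it uses the classic `CHANGE MASTER TO` / `START SLAVE` statements (newer MySQL releases rename these with replica/source terminology). Confirm the slave has fully caught up, for example with `SHOW SLAVE STATUS`, before running the cutover.

```python
def replica_setup_statements(master_host, repl_user, repl_password, log_file, log_pos):
    """SQL to run on the new EC2 MySQL server so it replicates from RDS.

    All arguments are placeholders supplied by the caller; nothing is escaped.
    """
    return [
        "CHANGE MASTER TO "
        f"MASTER_HOST='{master_host}', "
        f"MASTER_USER='{repl_user}', "
        f"MASTER_PASSWORD='{repl_password}', "
        f"MASTER_LOG_FILE='{log_file}', "
        f"MASTER_LOG_POS={int(log_pos)}",
        "START SLAVE",
    ]


def cutover_statements():
    """SQL to run on the EC2 server once the site is down and the slave is caught up."""
    return [
        "STOP SLAVE",                # stop applying changes from RDS
        "RESET SLAVE ALL",           # forget the old master entirely
        "SET GLOBAL read_only = 0",  # accept writes from the webservers
    ]


# Hypothetical values, for illustration only.
setup = replica_setup_statements(
    "mydb.example.rds.amazonaws.com", "repl", "secret", "mysql-bin.000123", 4
)
for statement in setup + cutover_statements():
    print(statement + ";")
```

Printing the statements rather than executing them makes it easy to rehearse the sequence during the fire drill.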

o dump & import

Another way to move your data is mysqldump. This option is slower & brings a lot more downtime, but may be necessary in some cases.
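
As a rough sketch of the dump & import route, the snippet below only builds and prints the command line. The hostnames, user and database name are made up, and it assumes credentials come from a MySQL option file rather than the command line. `--single-transaction` takes a consistent snapshot of InnoDB tables without locking them; the site still needs to stay read-only or offline while the dump loads on the new server.

```python
import shlex


def dump_command(host, user, database):
    """Build a mysqldump command that reads from the old (RDS) server."""
    return [
        "mysqldump",
        f"--host={host}",
        f"--user={user}",
        "--single-transaction",  # consistent InnoDB snapshot without table locks
        "--routines",            # include stored procedures and functions
        "--triggers",            # include triggers
        database,
    ]


def import_command(host, user, database):
    """Build a mysql client command that loads the dump into the new server."""
    return ["mysql", f"--host={host}", f"--user={user}", database]


# Hypothetical endpoints, for illustration only.
pipeline = (
    shlex.join(dump_command("mydb.example.rds.amazonaws.com", "admin", "shop"))
    + " | "
    + shlex.join(import_command("10.0.0.12", "admin", "shop"))
)
print(pipeline)
```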

Also: 5 Reasons to move data to Amazon Redshift

5. Speed: It’s the database

Fred Wilson says speed is the number one feature of a web application. If customers are frustrated & waiting, they may leave & not come back. On the web it can be everything.

Many firms are rushing to database as a service to simplify administration. While that’s wonderful at the beginning, as you grow performance will become more of a day-to-day concern. And when it does, the database is going to be big on your list of headaches.

Web application performance inevitably involves the database, and when it becomes a problem, your decision to choose database as a service may come into question. Don’t be afraid to bite the bullet and manage things yourself when that time comes.

Also: Is upgrading RDS like a shit-storm that will not end?

How 1and1 failed me

I manage this blog myself. Not just the content, but also the technology it runs on. The systems & servers are from a hosting company called 1and1.com. And recently I had some serious problems.

The publishing platform, WordPress, was a few versions out of date. Because of that, some vulnerabilities surfaced.

1. Malware from Odessa

While my eyes were on content, some Russian hackers managed to scan my server & due to the older version of WordPress, found a way to install some malware onto the box. This would be invisible to most users, but was nevertheless dangerous. As a domain name with a fifteen-year life, mine has some credibility among the algorithms & search engines. There’s some trust there.

Google identified the malware, and emailed me about it. That was the first alert I received, in mid-August. It was a few days before I left for vacation, but given the severity of it, I jumped on the problem right away.

Also: Why I say Always be publishing

2. Heading off a lockout

I ordered up a new server from 1and1.com to rebuild. I then set to work moving over content, and completely reinstalled the latest version of WordPress.

Since it was within the old theme that the malware files had been hidden, I eliminated that whole directory & all files, and configured the blog with the newest WordPress theme.

Around that time I got some communication from 1and1. As it turns out they had been notified by Google as well. Makes sense.

Given the shortage of time, and my imminent vacation, I quickly called 1and1. As always their support team was there & easy to reach. This felt reassuring. I explained the issue, how it occurred and all the details of how the server & publishing system had been rebuilt from the ground up.

This was around August 24th. I had received emails about a potential lockout, but the support specialist reassured me that the problem had been resolved to their satisfaction.

Read: Do managers underestimate operational cost?

3. Vacation implosion

I happily left for vacation knowing that all my hard work had been well spent.

Meanwhile, around August 25th, 1and1.com sent me further emails asking me for “additional details”. Apparently the “I’m going on vacation” note had not made it to their security division. Another day went by, and since they had received no email from me, the server was locked!

Being locked means it is completely unreachable. Totally offline. No bueno! That’s certainly frustrating, but websites do go down. What happened next was worse.

Since I use Mailchimp to host my newsletter, I write that well in advance each month. Just like clockwork the emails went out to my 1100 subscribers on September 1st. Many of those were opened & hundreds of readers clicked on the link. And there they were faced with a blank screen & browser. Nothing. Zilch! Offline!

Also: Why I use Airbnb chat even when texting is easier

4. The aftermath

As I return to connectivity, I begin sifting through my emails. I receive quite a few from friends and colleagues explaining that they couldn’t view my newsletter. I immediately remember my conversation with 1and1, their assurances that the server won’t be locked out, and that all is well. I’m thinking “I bet that server got locked out anyway”. Damn it, I’m angry.

Taking a deep breath, I call up 1and1 and get on the line with a support tech. Being careful not to show my frustration, I explain the situation again. I also explain how my server was down for two weeks and how it was offline during a key moment when my newsletter goes out.

The tech is able to reach out to the security department & explain things again. Without any additional changes to my server or technical configuration they are then able to unlock the server. Sad proof of a bureaucratic mixup if there ever was one.

Also: Is Amazon too big to fail?

5. Reflections on complexity

For me this example illustrates the complexity in modern systems. As the internet gets more & more complex, some argue that we are building a sort of house of cards. So many moving parts, so many vendors, so many layers of software & so many pieces to patch & update.

As things get more complex, there are more cracks for the hackers to exploit. And patching those up becomes ever more daunting.

Related: Are we fast approaching cloud-mageddon?
