Reliability and Resiliency for Cloud Connected Applications

Building cloud connected applications that are reliable is hard. At the heart of building such a system is a solid architecture and focus on resiliency. We're going to explore what that means in this post.

When I first started development on FastBar, a cashless payment system for events, there were a few key criteria that drove my decisions for the overall architecture of the system.

Fundamentally, FastBar is a payment system designed specifically for event environments. Instead of using cash or drink tickets or clunky old credit card machines, events use FastBar instead.

There are 2 key characteristics of the environment in which FastBar operates, and of the service we provide, that drive almost all underlying aspects of the technical architecture: internet connectivity sucks, and we're dealing with people's money.

Internet connectivity at an event sucks

Prior to starting FastBar, I had a side business throwing events in Seattle. We'd throw summer parties, Halloween parties, New Year's Eve parties etc… In 10 years of throwing events, I cannot recall a single event where internet worked flawlessly. Most of the time it ranged from "entirely useless" to "ok some of the time".

At an event, there are typically 2 choices for getting internet connectivity:

  1. Rely on the venue's in-house WiFi

  2. Use the cellular network, for example a hotspot

Sometimes the venue's WiFi would work great in an initial walkthrough… and then 2000 people would arrive and connectivity would go to hell. Other times it would work great in certain areas of the venue, then we'd test it where we wanted to set up registration, or place a bar, only to get an all too familiar response from the venue's IT folks: "oh, we didn’t think anyone would want internet there".

Relying on hotspots was just as bad: at many indoor locations, connectivity is poor. Even if you're outdoors with great connectivity, add a couple of thousand people to that space, each of them with a smartphone hungry for bandwidth so they can post to Facebook/Instagram/Snapchat, or a phone that just decides now is a great time to download that latest 3GB iOS update in the background, and that connectivity quickly disappears.

No matter what, internet connectivity at event environments is fundamentally poor and unreliable. This is something that isn't true in a standard retail environment like a coffee shop or hairdresser where you'd deploy a traditional point of sale, and it would have generally reliable internet connectivity.

We're dealing with people's money

For the event, the point of sale system is one of the most critical aspects - it affects attendees' ability to buy stuff, and the event's ability to sell stuff. If the point of sale is down, attendees are pissed off and the event is losing money. Nobody wants that.

Food, beverage and merchandise sales are a huge source of revenue for events. For some events, it could be their only source of revenue.

In general, money is a very sensitive topic for people. Attendees have an expectation that they are charged accurately for the things they purchase, and events expect the sales numbers they see on their dashboard are correct, both of which are very reasonable expectations.

Reliability and Resiliency

Like any complicated, distributed software system, there are many non-functional requirements that are important to create something that works. A system needs to be:

  • Available

  • Secure

  • Maintainable

  • Performant

  • Scalable

  • And of course reliable

Ultimately, our customers (the events), and their customers (the attendees), want a system that is reliable and "just works". We achieve that kind of reliability by focusing on resiliency - we expect things will fail, and design a system that will handle those failures.

This means when thinking about our client side mobile apps, we expect the following:

  • Requests we make over the internet to our servers will fail, or will be slow. This could mean we have no internet connectivity at the time and can't even attempt to make a request to the server, or we have internet, but the request failed to get to the server, or the request made it to our server but the client didn't get the response

  • A device may run out of battery in the middle of an operation

  • A user may exit the app in the middle of an operation

  • A user may force close the app in the middle of an operation

  • The local SQLite database could become corrupted

  • Our server environment may be inaccessible

  • 3rd party services our apps communicate with might be inaccessible

On the server side, we run on Azure and also depend on a handful of 3rd party services. While generally reliable, we can expect:

  • Problems connecting to our Web App or API

  • Unexpected CPU spikes on our Azure Web Apps that impact client connectivity and dramatically increase response time for requests

  • Web Apps having problems connecting to our underlying SQL Azure database or storage accounts

  • Requests to our storage account resources being throttled

  • 3rd party services that normally respond in a couple of hundred milliseconds taking 120+ seconds to respond (that one caused a whole bunch of issues that still traumatize me to this day)

We've encountered every single one of these scenarios. Sometimes it seems like almost everything that can fail, has failed at some point, usually at the most inopportune time. That's not quite true, I can mentally create some nightmare scenarios that we could potentially encounter in the future, but these days we're in great shape to withstand multiple critical failures across different parts of our system and still retain the ability to take orders at the event, and have minimal impact to attendees and event staff.

We've done this by focusing on resiliency in all parts of the system - everything from the way we architect to the details of how we make network requests and interact with 3rd party services.

Processing an Order

To illustrate how we achieve resiliency, and therefore reliability, let's take a look at an example of processing an order. Conceptually, it looks like this:

FastBar - Order Processing - Conceptual.png

The order gets created on the POS and makes an API request to send it to the server. Seems pretty easy, right?

Not quite.

Below is a highly summarized version of what actually happens when an order is placed, and how it flows through the system:

There is a lot more to it than just a simple request-response. Instead, it's a complicated series of asynchronous operations and a whole bunch of queues in between which help us provide a system that is reliable and resilient.

On the POS App

  1. The underlying Order object and associated set of OrderItems are created and persisted to our local SQLite database
  2. We create a work item and place it on a queue. In our case, we implement our own queue as a table inside the same SQLite database. Both steps 1 and 2 happen transactionally, so either all inserts succeed, or none succeed. All of this happens within milliseconds, as it's all local on the device and doesn't rely on any network connectivity. The user experience is never impacted in case of network connectivity issues
  3. We call the synchronization engine and ask it to push our changes
    1. If we're online at the time, the synchronization engine will pick up items from the queue that are available for processing, which could be just the 1 order we just created, or there could be many orders that have been queued and are waiting to be sent to the server, for example if we were offline and have just come back online. Each item will be processed in the order that it was placed on the queue, and each item involves its own set of work. In this case, we're attempting to push this order to our server via our server-side API. If the request to the server succeeds, we'll delete the work item from the queue, and update the local Order and OrderItems with some data that the server returns to us in the response. This all happens transactionally.
    2. If there is a failure at any point, for example a network error, or a server error, we'll put that item back on the queue for future processing
    3. If we're not online, the synchronization engine can't do anything, so it returns immediately, and will re-try in the future. This happens either via a timer that is syncing periodically, or after another order is created and a push is requested
    4. Whenever we make any request to the server that could update or create any data, we send an IdempotentOperationKey which the server uses to determine if the request has been processed already or not
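
To make the client-side steps above concrete, here is a minimal sketch of a local queue living in the same SQLite database as the orders. It is written in Python purely for illustration (the real apps are built in C# with Xamarin), and the table names, column names and the send callback are all hypothetical, not FastBar's actual schema or API. Stopping at the first failure to preserve ordering is also an assumption of the sketch.

```python
import sqlite3
import uuid


def open_db(path=":memory:"):
    db = sqlite3.connect(path)
    db.executescript("""
        CREATE TABLE IF NOT EXISTS orders (
            id TEXT PRIMARY KEY, total_cents INTEGER NOT NULL, server_id TEXT);
        CREATE TABLE IF NOT EXISTS work_queue (
            seq INTEGER PRIMARY KEY AUTOINCREMENT,
            order_id TEXT NOT NULL,
            idempotency_key TEXT NOT NULL,
            attempts INTEGER NOT NULL DEFAULT 0);
    """)
    return db


def create_order(db, total_cents):
    """Steps 1 and 2: persist the order and its work item in one transaction."""
    order_id = str(uuid.uuid4())
    with db:  # commits on success, rolls back if either insert raises
        db.execute("INSERT INTO orders (id, total_cents) VALUES (?, ?)",
                   (order_id, total_cents))
        db.execute("INSERT INTO work_queue (order_id, idempotency_key) VALUES (?, ?)",
                   (order_id, str(uuid.uuid4())))
    return order_id


def push_changes(db, send, is_online):
    """Step 3: drain the queue in FIFO order; stop at the first failure."""
    if not is_online():
        return 0  # nothing to do; a timer or the next order will try again
    pushed = 0
    rows = db.execute(
        "SELECT seq, order_id, idempotency_key FROM work_queue ORDER BY seq").fetchall()
    for seq, order_id, key in rows:
        try:
            server_id = send(order_id, key)  # hypothetical call to the server API
        except OSError:
            with db:
                db.execute("UPDATE work_queue SET attempts = attempts + 1 WHERE seq = ?",
                           (seq,))
            break  # the item stays queued for a later attempt
        with db:  # delete the work item and update the order together
            db.execute("DELETE FROM work_queue WHERE seq = ?", (seq,))
            db.execute("UPDATE orders SET server_id = ? WHERE id = ?",
                       (server_id, order_id))
        pushed += 1
    return pushed
```

Because the idempotency key is stored with the work item, a retry after a lost response re-sends the same key, which is what lets the server recognize the duplicate.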

The Server API

  1. Our Web API receives the request and processes it
    1. We make sure the user has permissions to perform this operation, and verify that we have not already processed a request with the same IdempotentOperationKey the client has supplied
    2. The incoming request is validated, and if we can, we'll create an Order and set of OrderItems and insert them into the database. At this point, our goal is to do the minimal work possible and leave the bulk of the processing to later
    3. We'll queue a work item for processing in the background
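
The idempotency check in this step can be sketched roughly as follows. This is a Python illustration under stated assumptions, not the actual server code: the schema and function name are hypothetical, the permission check is omitted, and the background work item is written to a database table only so the example stays self-contained (the real system queues work via Azure Storage Queues).

```python
import sqlite3


def open_server_db():
    db = sqlite3.connect(":memory:")
    db.executescript("""
        CREATE TABLE orders (
            id INTEGER PRIMARY KEY AUTOINCREMENT,
            idempotency_key TEXT NOT NULL UNIQUE,
            total_cents INTEGER NOT NULL);
        CREATE TABLE background_work (
            id INTEGER PRIMARY KEY AUTOINCREMENT, order_id INTEGER NOT NULL);
    """)
    return db


def handle_create_order(db, idempotency_key, total_cents):
    """Return (order_id, created). A replayed key returns the original order."""
    if total_cents < 0:  # minimal validation; real validation would be richer
        raise ValueError("total_cents must be non-negative")
    try:
        with db:  # order row and work item are committed together
            cur = db.execute(
                "INSERT INTO orders (idempotency_key, total_cents) VALUES (?, ?)",
                (idempotency_key, total_cents))
            order_id = cur.lastrowid
            db.execute("INSERT INTO background_work (order_id) VALUES (?)",
                       (order_id,))
        return order_id, True
    except sqlite3.IntegrityError:
        # The UNIQUE constraint fired: this key was already processed.
        row = db.execute("SELECT id FROM orders WHERE idempotency_key = ?",
                         (idempotency_key,)).fetchone()
        return row[0], False
```

Relying on a unique constraint rather than a separate "have we seen this key?" query means two concurrent retries with the same key cannot both create an order.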

Order Processor WebJob

  1. Our Order Processor is implemented as an Azure WebJob and runs in the background, constantly looking at the queue for new work
  2. The Order Processor is responsible for the core logic when it comes to processing an order, for example, connecting the order to an attendee and their tab, applying any discounts or promotions that may be applicable for that attendee and re-calculating the attendee's tab
  3. Next, we want to notify the attendee of their purchase, typically by sending them an SMS. We queue up a work item to be handled by the Outbound SMS Processor

Outbound SMS Processor WebJob

  1. The Outbound SMS processor handles the composition and sending of SMS messages to our 3rd party service for delivery, in our case, Twilio
  2. We're done!

That's a lot of complexity for what seems like a simple thing. So why would we add all of these different components and queues? Basically, it’s necessary to have a reliable and resilient system that can handle a whole lot of failure and still keep going:

  • If our client has any kind of network issues connecting to the server

  • If our client app is killed in any way, for example, if the device runs out of battery, or if the OS decided to kill our app since we were moved to the background, or if the user force quits our app

  • If our server environment is totally unavailable

  • If our server environment is available but slow to respond, for example, due to cloud weirdness (a whole other topic), or our own inefficient database queries or any number of other reasons

  • If our server environment has transitory errors caused by problems connecting with dependent services, for example, Azure SQL or Azure storage queues returning connectivity errors

  • If our server environment has consistent errors, for example, if we pushed a new build to the server that had a bug in it

  • If 3rd party services we depend on are unavailable for any reason

  • If 3rd party services we depend on are running slow for any reason
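
As a rough illustration of why the queues between components help with failures like these, here is a small in-memory Python sketch of two chained stages. All names are hypothetical, the in-memory deques stand in for durable queues, and the message format is made up for the example.

```python
from collections import deque

order_queue = deque()
sms_queue = deque()
tabs = {}   # attendee -> running total in cents
sent = []   # messages handed to the (hypothetical) SMS provider


def process_order(item):
    """Order stage: update the tab, then queue a notification for the next stage."""
    attendee = item["attendee"]
    tabs[attendee] = tabs.get(attendee, 0) + item["total_cents"]
    sms_queue.append({"to": attendee,
                      "body": f"Your tab is now {tabs[attendee]} cents"})


def drain(queue, handler, max_items=100):
    """Process queued items; an item whose handler fails goes back on the queue."""
    failures = 0
    for _ in range(min(len(queue), max_items)):
        item = queue.popleft()
        try:
            handler(item)
        except Exception:
            queue.append(item)  # retry on a later pass
            failures += 1
    return failures
```

If the SMS stage fails, the order stage has already finished its work and the notification simply waits in its queue. Since items can be retried, real handlers also need to be safe to run more than once.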

Asynchronicity to the Max

You'll notice the above flow is highly asynchronous. Wherever we can queue something up and process it later, we will. This means we're never worried if whatever system we're talking to is operating normally or not. If it's alive and kicking, great, we'll get that work processed quickly. If not, no worries, it will process in the background at some point in the future. Under normal circumstances, you could expect an order to be created on the client and a text message received by the customer within seconds. But, it could take a lot longer if any part of the system is running slowly or down, and that's ok, since it doesn't dramatically affect the user experience, and the reliability of the system is rock solid.

It's also worth noting that all of these operations are both logically asynchronous, and asynchronous at the code level wherever possible.

Logically asynchronous meaning instead of the POS order creation UI directly calling the server, or, on the server side, the request thread directly calling a 3rd party service to send an SMS, these operations get stored in a queue for later processing in the background. Being logically asynchronous is what gives us reliability and resiliency.

Asynchronous at the code level is different. This means that wherever possible, when we are doing any kind of I/O we utilize C#’s async programming features. It's important to note that underlying code being asynchronous doesn’t actually have anything to do with resiliency. Rather, it helps our components achieve higher throughput since they're not tying up resources like threads, network sockets, database connections, file handles etc… waiting for responses. Asynchronicity at the code level is all about throughput and scalability.
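
The throughput point can be shown with a tiny sketch. It uses Python's asyncio as a stand-in for C#'s async/await (an assumption made for illustration; the function names are hypothetical), with a short sleep playing the role of a network or database call.

```python
import asyncio
import time


async def fake_io(delay):
    await asyncio.sleep(delay)  # stands in for a network or database call
    return delay


async def run_concurrently(n, delay):
    # All n "calls" are in flight at once; no thread sits blocked waiting on any of them.
    return await asyncio.gather(*(fake_io(delay) for _ in range(n)))


def timed(n=10, delay=0.05):
    start = time.perf_counter()
    results = asyncio.run(run_concurrently(n, delay))
    return len(results), time.perf_counter() - start
```

Ten waits of 50ms each complete in roughly the time of one, rather than the half second they would take back to back. Nothing here makes the calls more reliable, which is the distinction the paragraph above draws.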

Conclusion

When you're building mobile applications connected to the cloud, reliability is key. The way to achieve reliability is by focusing on resiliency. Expect that everything can and probably will fail, and design your system to handle these failures. Make sure your system is highly logically asynchronous and queue up work to be handled by background components wherever possible.

FastBar's Technical Architecture

Previously, I've discussed the start of FastBar, how the client and server technology stacks evolved and what it looks like today (read more in part 1, part 2 and part 3 of that series).

As a recap, here's what the high-level components of FastBar look like:

FastBar Components - High Level.png

Let's dive deeper.

Architecture

So, what does FastBar’s architecture look like under the hood? Glad you asked:

Client apps:

  • Registration: used to register attendees at an event, essentially connecting their credit card to a wristband and all of the associated features that are required at a live event

  • Point of Sale: used to sell stuff at events

These are both mobile apps built in Xamarin and running on iOS devices.

The server is more complicated from an architectural standpoint, and is divided into the following primary components:

  • getfastbar.com - the primary customer facing website. Built on Squarespace, this provides primarily marketing content for anyone wanting to learn about FastBar

  • app.getfastbar.com - our main web app which provides 4 key functions:

    • User section - as a user, there are a few functions you can perform on FastBar, such as creating an account, adding a credit card, updating your user profile information and if you've got the permissions, creating events. This section is pretty basic

    • Attendee section - as an attendee, you can do things like pre-register for an event, view your tab, change your credit card, email yourself a receipt and adjust your tip. This is the section of the site that receives the most traffic

    • Event Control Center section - this is by far the largest section of the web app, it's where events can be fully managed: configuring details, connecting payment accounts, configuring taxes, setting up pre-registration, managing products and menus, viewing reports and downloading data and a whole lot more. This is where event organizers and FastBar staff spend the majority of their time

    • Admin section - various admin related features used by FastBar support staff. When acting on behalf of an event organizer, they would do the bulk of management related to a specific event from the Event Control Center

  • api.getfastbar.com - our API, primarily used by our own internal apps. We also open up various endpoints to some partners. We don’t make this broadly accessible publicly yet because it doesn't need to be. However, it’s something we may decide to open up more broadly in the future

The main web app and API share the same underlying core business logic, and are backed by a variety of other components, including:

  • WebJobs:

    • Bulk Message Processor - whenever we're sending a bulk message, like an email or SMS that is intended to go to many attendees, the Bulk Message Processor will be responsible for enumerating and queuing up the work. For example, if we were to send out a bulk SMS to 10,000 attendees of the event, whatever initiates this process (the web app or the API) will queue up a work item for the Bulk Message Processor that essentially says "I want to send a message to a whole bunch of people". The Bulk Message Processor will pick up the message and start enumerating 10,000 individual work items that it will queue up for processing by the Outbound SMS Processor, a downstream component. The Outbound SMS Processor will in turn pick up each work item and send out individual SMSs

    • Order Processor - whenever we ingest orders from the POS client via the API, we do the minimal amount of work possible so that we can respond quickly to the client. Essentially, we're doing some initial validation and persisting the order in the database, then queuing a work item so that the Order Processor can take care of the heavy lifting later, and requests from the client are not unnecessarily delayed. This component is very active during an event

    • Outbound Email Processor - responsible for sending an individual email, for example as the result of another component that queued up some work for it. We use Mailgun to send emails

    • Outbound Notification Processor - responsible for sending outbound push notifications. Under the covers this uses Azure Notification Hub

    • Outbound SMS Processor - responsible for sending individual SMS messages, for example a text message update to an attendee after they place an order. We send SMSs via Twilio

    • Sample Data Processor - when we need to create a sample event for demo or testing purposes, we send this work to the Sample Data Processor. This is essentially a job that an admin user may initiate from the web app, and since it could take a while, the web app will queue up a work item, then the Sample Data Processor picks it up and goes to work creating a whole bunch of test data in the background

    • Tab Authorization Processor - whenever we need to authorize someone's credit card that is connected to their tab, the Tab Authorization Processor takes care of it. For example, if attendees are pre-registering themselves for an event weeks beforehand, we vault their credit card details securely, and only authorize their card via the Tab Authorization Processor 24 hours before the event starts

    • Tab Payment Processor - when it comes time to execute payments against a tab, the Tab Payment Processor is responsible for doing the work

    • Tab Payment Sweeper - before we can process a tab's payment, that work needs to be queued. For example, after an event, all tabs get marked for processing. The Tab Payment Sweeper runs periodically, looking for any tabs that are marked for processing, and queues up work for the Tab Payment Processor. It's similar in concept to the Bulk Message Processor in that it's responsible for queuing up work items for another component

    • Tab Authorization Sweeper - just like the Tab Payment Sweeper, the Tab Authorization Sweeper looks for tabs that need to be authorized and queues up work for the Tab Authorization Processor

  • Functions

    • Client Logs Dispatcher - our client devices are responsible for pushing up their own zipped-up, JSON formatted log files to Azure Blob Storage. The Client Logs Dispatcher then takes the logs and dispatches them to our logging system, which is Log Analytics, part of Azure Monitor

    • Server Logs Dispatcher - similar in concept to the Client Logs Dispatcher, the Server Logs Dispatcher is responsible for taking server-side logs, which initially get placed into Azure Table Storage, and pushing them to Log Analytics so we have both client and server logs in the same place. This allows us to do end to end queries and analysis

    • Data Exporter - whenever a user requests an export of data, we handle this via the Data Exporter. For large events, an export could take some time. We don’t want to tie up request threads or hammer the database, so we built the Data Exporter to take care of this in the background

    • Tab Recalculator - we maintain a tab for each attendee at an event, it's essentially the summary of all of their purchases they've made at the event. From time to time, changes happen that require us to recalculate some or all tabs for an event. For example, let's say the event organizer realized that beer was supposed to be $6, but was accidentally set for $7 and wanted to fix this going forward, and for all previous orders. This means we need to recalculate all tabs that have orders involving the affected products. For a large event there could be many thousands of tabs affected by this change, and since each tab has unique characteristics, including the rules around how totals should be calculated, this has to be done individually for each tab. The Tab Recalculator takes care of this work in the background

    • Tags Deduplicator - FastBar is a complicated distributed system that supports offline operation of client devices like the Registration and POS apps. On the server side, we also process things in parallel in the background. Long story short, these two characteristics mean that sometimes data can get out of sync. The Tags Deduplicator helps put some things back in sync so we eventually arrive at a consistent state
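
The fan-out pattern used by the Bulk Message Processor (and, in a similar way, by the two Sweepers) can be sketched as follows. This is a simplified Python illustration with in-memory queues and hypothetical names; the real components are WebJobs communicating over Azure Storage Queues.

```python
from collections import deque

bulk_queue = deque()   # one item per "send this to everyone" request
sms_queue = deque()    # one item per individual recipient


def request_bulk_sms(event_id, body):
    """What the web app or API does: queue a single, cheap work item."""
    bulk_queue.append({"event_id": event_id, "body": body})


def run_bulk_message_processor(attendees_by_event):
    """Expand each bulk request into individual items for the downstream sender."""
    queued = 0
    while bulk_queue:
        job = bulk_queue.popleft()
        for phone in attendees_by_event.get(job["event_id"], []):
            sms_queue.append({"to": phone, "body": job["body"]})
            queued += 1
    return queued
```

The request that starts a bulk send stays fast no matter how many attendees there are, and each individual message can succeed, fail or be retried on its own.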

Azure Functions vs WebJobs

So, how come some things are implemented as Functions and some as WebJobs? Quite simply, the WebJobs were built before Azure Functions existed and/or before Azure Functions really became a thing.

Nowadays, it seems as though Azure Functions are the preferred technology to use, so we made the decision a while ago to create any new background components using Functions, and, if any significant refactoring is required to a WebJob, we'll take the opportunity to move it over to a Function as well.

Over time, we plan on phasing out WebJobs in favor of Functions.

Communication Between the Web App / API and Background Components

This is almost exclusively done via Azure Storage Queues. The only exception is the Client Logs Dispatcher, which can also be triggered by a file showing up in Blob Storage.

Azure has a number of queuing solutions that could be used here. Storage queues is a simple solution that does what we need, so we use it.

Communication with 3rd Party Services

Wherever we can, we’ll push interaction with 3rd party services to background components. This way, if 3rd party services are running slowly or down completely for a period of time, we minimize impact on our system.
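
As a minimal sketch of this idea, consider logging: the request path only appends to a local buffer, and a background job is the only thing that ever talks to the 3rd party. The Python below is an illustration with hypothetical names; the in-memory buffer stands in for durable local storage, and the ship callback stands in for a 3rd party API call.

```python
from collections import deque

pending_logs = deque()


def log(message):
    """Called on the request path: a local append, never a network call."""
    pending_logs.append(message)


def flush(ship, max_attempts=3):
    """Called from a background job. `ship` is a hypothetical 3rd party call."""
    batch = list(pending_logs)
    if not batch:
        return 0
    for _ in range(max_attempts):
        try:
            ship(batch)
        except OSError:
            continue  # transient failure or timeout; try again
        for _ in range(len(batch)):
            pending_logs.popleft()  # remove only what was actually shipped
        return len(batch)
    return 0  # still failing; the logs stay buffered for the next run
```

A slow or unavailable service delays the flush, but it never holds up a request thread.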

Blob and Table Storage

We utilize Blob storage in a couple of different ways:

  • Client apps upload their logs directly to blob storage for processing by the Client Logs Dispatcher

  • Client apps have a feature that allows the user to create a bug report and attach local state. The bug is logged directly into our work item tracking system, Pivotal Tracker. We also package up the client side state and upload it to blob storage. This allows developers to re-create the state on a client device on their own device, or the simulator for debugging purposes

Table storage is used for the initial step in our server-side logging. We log to Table storage, and then push that data up to Log Analytics via the Server Logs Dispatcher.

Azure SQL

Even though there are a lot of different technologies to store data these days, we use Azure SQL for a few key reasons: it’s familiar, it works, it’s a good choice for a financial system like FastBar where data is relational and we require ACID semantics.

Conclusion

That’s a brief overview of FastBar’s technical architecture. In future posts, I’ll go over more of the why behind the architectural choices and the key benefits that it has.

Choosing a Tech Stack for Your Startup - Part 3: Cloud Stacks, Evolving the System and Lessons Learnt

This is the final part in a 3 part series around choosing a tech stack for your startup:

  • In part 1, we explored the choices we made and the evolution of FastBar’s client apps

  • In part 2, we started the exploration of the server side, including our technology choices and philosophy when it came to building out our MVP

  • In part 3, this post, we’ll wrap-up our discussion of the server side choices and summarize our key lessons learnt.

As a recap, here are the areas of the system we’re focused on:

FastBar Components - Server Hilighted.png

And in part 2, we left off discussing self-hosting vs utilizing the cloud. TL;DR - forget about self hosting and leverage the cloud.

Next step - let’s pick a cloud provider…

AWS vs Azure

In 2014, AWS was the undisputed leader in the cloud, but Azure was quickly catching up in capabilities and feature set.

Through the accelerator we got into, 9 Mile Labs, we had access to some free credits with both AWS and Azure.

I decided to go with Azure, in part because they offered more free credits via their BizSpark Plus program than what AWS was offering us, in part because I was more familiar with their technology than that of AWS, in part because I'm generally a fan of Microsoft technology, and in part because I wanted to take advantage of their Platform as a Service (PaaS) offerings. Specifically, Azure App Service Web Apps and Azure SQL - AWS didn't have any direct equivalents for those at the time. I could certainly spin up VMs and install my own versions of IIS and SQL on AWS, but that was more work for me, and I had enough things to do.

PaaS vs IaaS

After doing some investigation into Azure's PaaS offerings, namely App Service Web Apps and Azure SQL, I decided to give them a go.

With PaaS offerings, you're trading some flexibility for convenience.

For example, with Web Apps, you don’t deploy your app to a machine or a specific VM - you deploy it to Azure's Web app service, and it deploys it to one or more VMs on your behalf. You don’t remote desktop into the VM to poke around - you use the web-based tools or APIs that Microsoft provides for you. Azure SQL doesn't support all of the features that regular SQL does, but it supports most of them. You don’t have the ability to configure where your database and log files will be placed, Azure manages that for you. In most cases, this is a good thing, as you've got better things to do.

With Web Apps, you can easily set up auto-scaling, like I described in part 2, and Azure will magically create or destroy more VMs according to the rules you set up, and route traffic between them. With SQL Azure, you can do cool things like create read-only replicas and geo-redundant failover databases within minutes.

If there is a PaaS offering of a particular piece of infrastructure that you require on whatever cloud you're using, try it out. You'll be giving up some flexibility, but you'll get a whole lot in return. For most scenarios, it will be totally worth it.

3rd Party Technologies

Stripe

The first 3rd party technology service we integrated was Stripe - FastBar is a payment system after all, so we needed a way to vault credit cards and do payment processing. At the time Stripe was the gold standard in terms of developer friendly payment APIs, so we went with it and still use it to this day. We've had our fair share of issues with Stripe, but overall it's worked well for us.

Loggly: A Cautionary Tale

Another piece of 3rd party tech we used early on was Loggly. This is essentially “logging as a service” and instead of you having to figure out how to ingest, process and search large volumes of log data, Loggly provides a cloud-based service for you.

We used this for a couple of years and eventually moved off it because we found the performance was not reliable.

We ran into an incident one time where Loggly, which typically would respond to our requests in 200-300ms, was taking 90-120 seconds to respond (ouch!). Some of our server-side web and API code called Loggly directly as part of the request execution path (a big no-no, that was our bad) and needless to say, when your request threads are tied up waiting for a network call that is going to take 90-120 seconds, everything goes to hell.

During the incident, it was tough for us to figure out what was going on, since our logging was impacted. After the incident, we analyzed and eventually tracked down 90-120 second response times from Loggly as the cause. We made changes to our system so that we would never again call Loggly directly as part of a request's execution path, rather we'd log everything "locally" within the Azure environment and have a background process that would push it up to Loggly. This is really what we should have been doing from the beginning. At the same time, Loggly should have been more robust.

This made us immune to any future slowdowns on the Loggly side, but over time we still found that our background process was often having trouble sending data to Loggly. We had an auto-retry mechanism set up so we’d keep retrying to send to Loggly until we succeeded. Eventually this would work, but we found this retry mechanism was being triggered way too often for our liking. We also found similar issues on our client apps, where we'd have our client apps send logs directly to Loggly in the background to avoid us having to send to our server, then to Loggly. This was more of an issue, since clients operate in constrained bandwidth environments.

Overall, we experienced lots of flakiness with Loggly regardless of if we were communicating with it from the client or server.

In addition, the cheaper tiers of Loggly are quite limited in the amount of data you can send to them. For a large event, we'd quickly hit the data cap, and the rest of our logs would be dropped. This made the Loggly search features (which were awesome by the way, and one of the key things that attracted us to Loggly) pretty much useless for us, since we'd only have a fraction of our data available unless we moved up to a significantly more expensive tier.

We removed Loggly from the equation in favor of Azure's Log Analytics (now part of Azure Monitor). It's inside Azure with the rest of our stuff, has awesome query capabilities (on par with Loggly) and it’s much cheaper for us due to its “cloud-based pricing model” that scales based on the amount you use it, as opposed to a handful of main pricing buckets with Loggly.

Twilio

We use Twilio for sending SMS messages. Twilio has worked great for us from the early days, and we don’t have any plans to change it anytime soon.

Cloudinary

On a previous project, I got deep into the complexities of image processing: uploading, cropping, resizing and hosting, distributing to a CDN etc…

TL;DR it's something that seems really simple on the surface, but quickly spirals out of control - it’s a hard problem to solve properly. 

I learnt my lesson on a previous project, and on FastBar, I did not pass Go and did not collect $200, rather I went straight to Cloudinary. It's a great product, easy to use, and it removes all of our image processing and hosting hassles.

Mailgun

Turns out sending email is hard. That’s why companies like Mailgun and Sendgrid exist.

We decided to go with Mailgun since it had a better pricing model for our purposes compared to Sendgrid. But fundamentally, they’re both pretty similar. They help you take care of the complexities of sending reliable email so you don’t have to deal with it.

Building out the Event Control Center

As our client apps and their underlying APIs started to mature, we started turning our development focus to building out the Event Control Center on the server - the place where event organizers and FastBar staff could fully configure all aspects of the event, manage settings, configure products and menus, view reports etc…

This was essentially a traditional web app. We looked at using tech like React or Angular. As we specced out our screens, we realized that our requirements were pretty straightforward. We didn't have lots of pages that needed a rich UI, we didn’t have a need for a Single Page App (SPA), and overall, our pages were pretty simple. We decided to go with a more "traditional" request/response ASP.NET web app, using HTML 5, JS, CSS, Bootstrap, jQuery etc…

The first features we deployed were around basic reporting, followed by the ability to create events, edit settings for the event, view and manage attendee tabs, manage refunds, create and configure products and menu items.

Nowadays, we've added multi user support, tax support, comprehensive reporting and export, direct and bulk SMS capabilities, configuration of promotions (i.e. discounts), device management, attendee surveys and much more.

The days of managing via SQL Management Studio are well in the past (thankfully!).

Re-building the public facing website

For a long time, the public facing section of the website was a simple 1-pager explanation of FastBar. It was long overdue for a refresh, so in 2018 I set out to rebuild it, improve the visuals, and most importantly, update the content to better reflect what we had to offer customers.

For this, we considered a variety of options, including: custom building an ASP.NET site, Wordpress, Wix, Squarespace etc...

Building a custom ASP.NET website was kind of a hassle. Our pages were simple, but it would be highly beneficial if non-developers could easily edit content, so we really needed a basic CMS. This meant we needed to go for a self-hosted CMS, like Wordpress, or a hosted CMS, like Wordpress.com, Wix or Squarespace.

I had built and deployed enough basic Wordpress sites to know that I didn't want to spend our time, effort and money on self-hosting it. Self-hosting means having to deal with constant updates to the platform and the plugins (Wordpress is a ripe target for hackers, so keeping everything up to date is critical), managing backups and the like.

We were busy enough building features for the FastBar system, and I didn’t want to allocate precious dev resources to the public facing website when a hosted solution at $12/mo (or thereabouts) would be sufficient.

For Wordpress in general, I found it tough to find a good quality template that matched the visuals I was looking for. To be clear, there are a ton of templates available, I'd say way too many. I found it really hard to hunt through the sea of mediocrity to find something I really liked.

When evaluating the hosted offerings like Squarespace and Wix, my first concern was that, as a technology company, potential engineering hires might judge us for using something like that. I don’t know about you, but I'll often F12 or Ctrl-U a website to see what's going on under the hood :) Also, while quick to spin up, hosted offerings like Squarespace lacked what I consider basic features, like version control, so that was a big red flag.

Eventually I determined that the pros and simplicity of a hosted offering outweighed the cons and we went with Squarespace. Within about a week, we had the site re-designed and live - the vast majority of that time was spent on the marketing and messaging, the implementation part was really easy.

Where we're at today

Today, our backend is comprised of 3 main components: the Core Web App and API, our public facing website and 3rd party services that we depend on.

Our core Web App and API is built in ASP.NET and WebAPI and runs on Azure. We leverage Azure App Services, Azure SQL, Azure Storage service (Blob, Table and Queue), Azure Monitor (Application Insights and Log Analytics), Azure Functions, WebJobs, Redis and a few other bits and pieces.

The public facing website runs on Squarespace. 

The 3rd party services we utilize are Stripe, Cloudinary, Twilio and Mailgun.

Lessons Learnt

Looking back at our previous lessons learnt from client side development:

  1. Optimize for Productivity

  2. Choose Something Popular

  3. Choose the Simplest Thing That Works

  4. Favor Cross Platform Tech

The first 3 are highly applicable to the server side development. The 4th is more client specific. You could make the argument that it’s valuable server-side as well, it depends on how many server environments you’re planning on deploying to. In most cases, you’re going to pick a stack and stick with it, so it’s less relevant.

Here are some additional lessons we learnt on the server side.

Ruthlessly Prioritize

This one applies to both client and server side development. As a startup, it's important to ruthlessly prioritize your development work and tackle the most important items first. How far can you go without building out an admin UI and instead relying on SQL scripts? What's the most important thing that your customers need right now? What is the #1 feature that will help you move the product and business forward?

Prioritization is hard, especially because it's not just about the needs of the customer. You also have to balance the underlying health of the code base, the design and the architecture of the system. You need to be aware of any technical debt you're creating, and you need to be careful not to paint yourself into a corner that you might not be able to get out of later. You need to think about the future, but not get hung up on it too much that you adopt unnecessary work now that never ends up being needed later. Prioritization requires tough tradeoffs. 

Prioritization is more art than science and I think it's something that continues to evolve with experience. Prioritize the most important things you need right now, and do your best to balance that with future needs.

Just go Cloud

Whether you're choosing AWS, Azure, Google or something else, just go for the cloud. It's pretty much a given these days, so hopefully you don't have the urge to go to the dark side and buy and host your own servers. 

Save yourself the hassle, the time and the money and utilize the cloud. Take advantage of the thousands upon thousands of developers working at Amazon, Microsoft and Google who are working to make your life easier and use the cloud.

Speaking of using the cloud…

Favor PaaS over IaaS

If there is a PaaS solution available that meets your needs, favor it over IaaS. Sure, you'll lose some control, but you'll gain so much in terms of ease of use and advanced capabilities that would be complicated and time consuming for you to build yourself.

It means less work for you, and more time available to dedicate to more important things, so favor PaaS over IaaS.

Favor Pre-Built Solutions

Better still, if there is an entire solution available to you that someone else hosts and manages, favor it.

Again, less work for you, and allows you to focus your time, energy and resources on more important problems that will provide value to your customers, so favor pre-built solutions.


Conclusion

In part 1 we discussed client side technology choices we went through when building FastBar, including our thinking around Android vs iOS, which client technology stack to use, how our various apps evolved, where we’re at today, and key lessons learnt.

In part 2 and part 3, this post, we discussed the server side, including choosing a server side stack, building out an MVP, deciding to self-host or utilize the cloud, AWS vs Azure, various other 3rd party technologies we adopted, where we’re at today and more lessons learnt, primarily related to server side development.

Hopefully you can leverage some of these lessons in building your own startup. Good luck, and go change the world!

Choosing a Tech Stack for Your Startup - Part 2: Server Side Choices and Building Your MVP

In part 1 of this series, we covered a detailed overview of how FastBar chose its client side technology stack, how it evolved over the years, where it is today, and key lessons we learnt along the way.

In this post, part 2, we'll start exploring the server side technology choices and conclude in part 3.

[Image: FastBar components, with the server side highlighted]

Selecting a Stack

The very first prototype version of FastBar that was built at Startup Weekend didn't have much of a server side at all. I think we had a couple of basic API endpoints built in Ruby on Rails and deployed to Heroku.

After Startup Weekend when we became serious about moving FastBar forward, building out the server side became a priority and we needed to select a tech stack.

We discussed various options: Ruby on Rails, Go, Java, PHP, Node.js and ASP.NET. I decided to go with ASP.NET MVC and C# for a few reasons:

  1. Familiarity

  2. Suitability for the job

  3. How the platform was evolving

Familiarity

.NET and C# were the platform and language that I was most familiar with. I spent 7.5 years working at Microsoft, the first 2 of which were in the C# compiler team, the next 5.5 helping customers architect and build large-scale systems on Microsoft technology. Since leaving Microsoft, I spent a lot of time using .NET for a startup, along with some consulting work. For me, .NET technology was going to be the most productive option.

In tech, there is a ton of religion, and often times (perhaps most of the time) people make decisions on technologies based on their particular flavor of religion. They believe that X is faster than Y, or A is better than B for [insert unfounded reason here].  The reality is there are many technology choices, all of which have pros and cons. So long as you're choosing a technology that is mainstream and well suited for the task at hand, you can probably be successful building your system using (almost) whatever technology stack you prefer.

Suitability for the job

It's important to select a tool that's suitable for the job you're trying to achieve. For us, our backend required a website and an API - pretty standard stuff. ASP.NET is a great platform for doing just that. Likewise, there are many other fine choices including Ruby on Rails or Node.js.

How the platform was evolving

Back in 2014, Microsoft technologies were often shunned by developers who were not familiar with .NET, because they felt that Microsoft was the evil empire, all they produced was proprietary tech and closed source code, and nothing good could possibly come from the walled garden that was Redmond.

The reality was quite different. As early as 2006, Microsoft started a project called Codeplex as a way to share source code, and used it to publish technologies like the AJAX Control Toolkit. In October 2007, Microsoft published the source code for the entire .NET Framework. It wasn't an "open source" project per se, but rather "reference source": it allowed developers to step into the code and see what was going on under the hood, addressing a primary complaint about proprietary software or closed source systems. Also in October 2007, Microsoft announced that the upcoming ASP.NET MVC project would be fully open source. In 2008, the rest of the ASP.NET technologies were also open sourced.

That trend continued, with Microsoft open sourcing more and more stuff. Fast forward to April 2014 and Microsoft made a big announcement regarding open source: the creation of the .NET foundation, and the open-sourcing of large chunks of .NET. Later that year, they open sourced even more stuff.

Fast forward again to today: Microsoft now owns GitHub and has made most, if not all, of .NET open source. It's pretty clear that Microsoft is "all in" on open source. Here's an interesting article on the state of Microsoft and open source as of December 2018. And if you're interested in some more background, Scott Hunter and Beth Massi have some great Medium posts that chronicle some of Microsoft's journey into open source.

Back in 2014, I was a fan of .NET technology, I liked the direction it was moving in, and felt that the trend towards open sourcing more stuff would only strengthen the technology and the ecosystem. Looking back, this has proved correct.

Building the Basics

With our tech stack chosen, the first 2 things we needed to build were (a) a basic customer facing website and (b) an API with underlying business logic and data schema to support the POS. In startup land, this is often called an MVP or Minimum Viable Product.

For the customer facing website, I built a 1-pager ASP.NET website using Bootstrap. It was simple, but looked decent enough and was mobile friendly. It really just needed to serve as a landing page with a brief explanation of FastBar and a "join our email list" form. That site actually lasted way longer than it should have :)

The more important thing we needed was an API that the client apps could talk to: first to push up order details and next to display tabs to attendees so they could keep track of their spending. Although it would have been nice to have an administrative UI so we could view and manage attendees and orders, configure products and menus, view reports etc… there was a lot of effort required to build it, and it wasn't the highest priority thing we needed to implement.

Our First UI: SQL Management Studio and Excel

For a long time, the primary interface we used to set up and manage events was SQL Management Studio. I created an Excel spreadsheet that served as a helper tool to generate SQL statements which would in turn be run in SQL Management Studio. This was definitely a rough and ready approach, and not my preferred path, but hey, as a startup with limited resources, you need to pick your battles.

Reporting was done via a somewhat complicated SQL query that would spit out tabular sales results, which I'd then copy/paste into a fancy Excel spreadsheet I'd created. The results of the copy/paste would drive a "dashboard" tab in the spreadsheet that summarized key metrics, as well as a series of other pages that would show fancy graphs of sales over time and product breakdowns.

This was all rather crude, but like I said, as a startup with limited resources, you need to pick your battles and focus on highest priority tasks first.

You see, our attendees didn’t care about how the system was configured or how reports were generated. They simply wanted to get their wristband, tap to pay for their drinks and get back to enjoying the event.

Our event organizers didn’t much care how the event was configured either, so long as the point of sale displayed the right products at the right prices, customers were charged correctly, money appeared in their bank account and they could get some kind of reporting. We took care of all of the configuration of the system on their behalf, and Excel-based reports were fine for them, in the early days.

Self-hosted vs Cloud

In 2011 some friends of mine left Microsoft to start a company. At the time, Amazon Web Services (AWS) was coming along nicely, and most forward-thinking companies and startups were looking to the cloud.

The CTO for my friend's company, let's call him "Bob" (not his real name), decided that it would be cheaper to buy the hardware himself instead of going with something cloud-based. Bob created spreadsheets that he used to justify his desire to buy servers and show how over time it would be cheaper. In reality, Bob was a "build his own metal" kinda guy. Bob wanted to spend money on cool hardware and build his own servers and that's what he was comfortable with, so he found a way to justify it.

Bob spent a couple of hundred thousand dollars on servers. A few years later, all of those servers were sitting in a spare room in their office collecting dust.

Don't be like Bob.

In 2011, it didn’t make sense to buy your own servers. AWS was a great choice. Azure was early, and quite frankly pretty crap at the time. Google's App Engine existed, but I don't think anyone actually used it.

In 2014 when FastBar started, it didn’t make sense to buy your own servers. AWS was cranking along and adding new services at a furious pace, and Azure was busy catching up. Azure had moved from crap a couple of years earlier to a really solid offering by 2014.

Today, it definitely doesn't make sense to buy your own servers. Unless you're Google, or Microsoft, or Amazon, then sure, buy as many servers as you need. But for the rest of us, cloud computing is so much simpler and easier. For example, at FastBar we have a script that will deploy a fresh version of our entire FastBar environment, including:

  • Web applications, APIs and background workers across multiple servers

  • Geo-redundant SQL databases

  • Geo-redundant storage services

  • Monitoring and logging resources

  • Redis caches

There's like 20 odd components all together, and this all happens within minutes. This is something that would take days to deploy to our own servers by hand, or we would have spent weeks or months automating the deployment process.

Not only that, but if we decide we need to scale up our front end web servers, all it takes is a couple of clicks, and within minutes, we’ll be running on more powerful hardware.

Better still, we have automatic scaling set up, so if our webservers start getting overloaded, Azure makes more of them magically appear, and when things go back to normal, the extra servers simply go away.
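To give a rough feel for what an autoscale rule boils down to, here's a minimal sketch in Python. It is not Azure's actual algorithm or our real configuration: the AutoscaleRule class, the desired_instances function and every threshold below are hypothetical values picked purely for illustration. The idea is simply "add a server when average CPU is high, remove one when it's low, and stay within a minimum and maximum".

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AutoscaleRule:
    """Hypothetical autoscale settings; all numbers are illustrative only."""
    min_instances: int = 2       # never run fewer servers than this
    max_instances: int = 10      # never run more servers than this
    scale_out_cpu: float = 70.0  # add a server above this average CPU %
    scale_in_cpu: float = 30.0   # remove a server below this average CPU %


def desired_instances(current: int, avg_cpu_percent: float, rule: AutoscaleRule) -> int:
    """Return how many servers should be running after one evaluation."""
    if avg_cpu_percent > rule.scale_out_cpu:
        target = current + 1   # overloaded: scale out by one
    elif avg_cpu_percent < rule.scale_in_cpu:
        target = current - 1   # mostly idle: scale in by one
    else:
        target = current       # within the comfortable band: do nothing
    # Clamp the result to the configured minimum and maximum.
    return max(rule.min_instances, min(rule.max_instances, target))


if __name__ == "__main__":
    rule = AutoscaleRule()
    print(desired_instances(current=2, avg_cpu_percent=85.0, rule=rule))
    print(desired_instances(current=4, avg_cpu_percent=10.0, rule=rule))
```

The gap between the scale-out and scale-in thresholds is deliberate: without it, a load hovering around a single threshold would cause servers to be added and removed over and over.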

It's a beautiful thing, and it makes me very happy. I'm pretty sure I still have a bit of PTSD when I think about how much effort it would take to set this stuff up manually before the cloud came along.

Another argument I used to hear against the cloud from Bob was that the cloud has lots of outages (here's some recent outages from 2018), but Bob claimed his "servers have never gone down". Maybe they haven't. But they will eventually, and usually at the worst possible time.

It's true that the cloud has outages. And these days when something fails at one of the big cloud providers, it's got the potential to take out a huge portion of the internet. But the cloud providers are getting better - their systems get stronger, and they learn from their mistakes.

Personally, I'd much rather be relying on something like Azure or AWS or Google Cloud, so that when an outage occurs (note I said when, not if - all systems go down at some point), there are thousands upon thousands of people tweeting/writing/blogging about it, and hundreds or maybe thousands of engineers working on fixing the problem.

Forget about buying servers, and deploy your system to the cloud. There are so many benefits - from zero up front capital expenditure, to spinning up and down infrastructure and building out globally scalable and redundant systems within minutes.

Stay tuned for part 3 where we'll explore the different cloud stacks, Platform as a Service (PaaS) vs Infrastructure as a Service (IaaS), 3rd party technologies, where FastBar is at today and key lessons learnt.

Why the cloud can't be trusted

Late on the night of June 19th I was chatting with one of my developers and he mentioned that our hosted source control provider, Codespaces, was down. I checked the website, and sure enough, nothing. At first, it didn’t seem like anything new: from time to time Codespaces would go down (more often than I'd like) but it would eventually come back up and everything would be back to normal.

I figured I'd check their Twitter feed to see if there were any details on what was happening and an ETA for it to be back online.

The Codespaces website they were pointing to was down so I couldn't get any more information there, but some quick googling revealed that the nightmare cloud security scenario had just occurred to my hosted source code provider. From reading various articles online it became clear that all or most of Codespaces data had been deleted. Ouch.

An attacker orchestrated a DDOS attack, gained control of Codespaces' AWS account and tried to extort money. When Codespaces didn't pay, the attacker started deleting assets in the AWS account. That includes all the machine images, EBS volumes, backups, customer source code repositories and snapshots. Almost everything was deleted, essentially wiping out the business along with all of their customers' precious data.

Fortunately I was one of the lucky ones. By chance, my repository was hosted on one of their older nodes that survived the attack, and I was able to get them to send me a link to download a dump of the repository. It was an intense 48 hours while I contemplated next steps in case my source code repository containing 3.5 years' worth of development history was lost.

As a side note, I recall an email from Codespaces some time ago with an offer to upgrade to their new SVN hosting. It didn’t seem super important and I never got around to it. I'm normally all about the upgrades, but that's one upgrade I'm very glad I didn’t do :)

During the 48 hours when I was contemplating the worst case scenario for Shindigg, I realized that it wasn't actually that bad. The latest version of the code is available locally on our development and build machines so at worst we'd lose our history, but still have the latest version. Although being able to look back in time at source code history isn't something that you need every day, it's important to have when you need it. Losing history is a hassle, but not crippling.

So, what can we learn from this?

Lesson 1 - Do Not Trust Your Data To Any Single Cloud Provider

The first and biggest lesson is that you should never trust your data to any one provider. No matter who the provider is or what kind of redundancy they have in place, never trust your data to a single provider.

Whether you're an individual, a business using hosted services, or a business providing hosted services to others (and probably using someone else's hosted services in the process), you've got to take responsibility for your own backups.

I made a mistake trusting our source code repository data to Codespaces and relying on them to properly operate their business and back-up their data, which they failed to do. Our most important data, the latest version of the source, was replicated in several places, so it was safe, but the repository (and therefore history) was vulnerable.

I fixed that by choosing a new hosted source code control provider that has a feature to automatically create backups and send them to my S3 account, thus giving me redundancy across 2 different providers. If my source code control provider gets wiped out, I've got a backup on a completely different system.

At Shindigg, we use a combination of self-hosting and a variety of other services including Azure, Cloudinary, Stripe, Mandrill etc… The most important data for us is our core database, and it gets backed up every 10 minutes and those backups are immediately uploaded offsite to a completely different provider. Which brings me to my next point…

Lesson 2 - Offsite Means At a Different Cloud Provider

Codespaces claimed to have offsite backups. Technically, the data was probably replicated to different AWS data centers, but it was still all in AWS, and it was all accessible through a single AWS account. That's not really "offsite" when it comes to the cloud.

In the cloud, offsite doesn't just mean relying on the cloud provider to have the data in a different physical location - you need to have critical data backing up at least to a separate account but really to a completely different provider.
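As a rough illustration of the principle, here's a minimal Python sketch that copies a backup file to at least two independent destinations and verifies each copy with a checksum. It is not our actual backup job: the replicate_backup and sha256_of functions are hypothetical, and plain local directories stand in for two different providers. A real version would swap the file copy for each provider's own upload mechanism and, just as importantly, use separate credentials for each one.

```python
import hashlib
import shutil
from pathlib import Path
from typing import Dict, List


def sha256_of(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, read in 1 MB chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def replicate_backup(backup: Path, destinations: List[Path]) -> Dict[str, str]:
    """Copy one backup file to every destination and verify each copy.

    Each destination directory is a stand-in for a separate provider.
    Refuses to run with fewer than two destinations, since a single
    location is exactly the failure mode we are trying to avoid.
    """
    if len(destinations) < 2:
        raise ValueError("need at least two independent destinations")

    expected = sha256_of(backup)
    verified: Dict[str, str] = {}
    for destination in destinations:
        destination.mkdir(parents=True, exist_ok=True)
        copy = destination / backup.name
        shutil.copy2(backup, copy)
        # Re-read the copy so a truncated or corrupted transfer is caught now,
        # not on the day the backup is needed.
        if sha256_of(copy) != expected:
            raise IOError(f"checksum mismatch for {copy}")
        verified[str(destination)] = expected
    return verified
```

Calling replicate_backup(Path("db-backup.bak"), [Path("provider_a"), Path("provider_b")]) returns a mapping of each destination to the verified checksum. The point of the two-destination check is that compromising one account should never be enough to destroy every copy.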

At a minimum what Codespaces should have done is have their backups pushing to a separate AWS account, but ideally send backups to Azure, Rackspace or any other cloud storage provider. That way when their main AWS account was compromised, they wouldn't have lost everything. It may have taken them a couple of days to recover from back-ups which would have been bad, but having almost all of your customer data and therefore your business irrevocably wiped out is significantly worse. Which brings me to lesson 3...

Lesson 3 - Think About The Worst Case Scenario

Spend some time thinking about the worst case scenario and taking steps to mitigate the risk - what happens if a part of your infrastructure gets compromised? Security is all about layers.

For most small-mid size websites, the reality is a determined attacker can probably get in if they want to. Nothing is 100% secure, but you need to be taking reasonable steps to protect yourself and harden your site against attack. Even large websites or companies with huge teams dedicated to security get compromised from time to time (Google, Adobe, Target etc...).

Ask yourself what happens if someone does get in to part of your system? What's the worst that can happen? If for any one component of your infrastructure the answer is "we're completely screwed" then take steps to fix it and make it harder for your business to get wiped out. The answer needs to be more like "if A happened and B happened and C happened and D happened and E happened and F happened, then we'd be totally screwed".

Different applications require different degrees of redundancy and security, and building all of this costs time and money. You need to figure out what makes sense for your business.

If Codespaces had asked themselves this simple question, they could have taken steps to mitigate the risk by having a backup at a different provider.

Incidentally, Codespaces could have suffered the same loss of data through operator error (i.e. accidentally deleting something from the AWS account), massive AWS failure or a disgruntled employee. That's far too many ways to completely wipe out their business.

Years ago at university, I remember a lecturer talking about how the digitization of information makes it not only more accessible, but easier to destroy. He used the analogy of a pallet full of papers vs a CD-ROM (back before the days of flash storage :)). Destroying a stack of papers takes effort. Sure, you could rip them, burn them, shred them, blow them up. But any of those options requires a decent amount of time, effort or equipment to execute. Compare that with destroying the same amount of data on a CD, which you could just snap with your hands in less than 1 second. The cloud exacerbates this problem. Now it's possible to destroy a pallet load of CDs' worth of data in seconds.

Lesson 4 - Follow Best Practices

There are plenty of resources on best practices for security. It's hard to know where to begin when analyzing Codespaces' security failures, but one thing that likely wasn't in place and should have been was multi-factor authentication. AWS has it, Azure has it, Google has it, Facebook has it, Twitter has it. Pretty much all major sites have it these days and encourage people to use it even for personal accounts.

If thousands of customers are trusting you with their data, do yourself a favor and enable it. It probably would have saved Codespaces' ass.
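For the curious, here's a minimal Python sketch of how one common second factor works: the 6-digit codes generated by authenticator apps. It follows the published HOTP (RFC 4226) and TOTP (RFC 6238) algorithms using only the standard library. I'm not claiming this is how any particular provider implements it, and the verify helper and its window parameter are my own illustrative additions. In practice you'd simply turn on the provider's built-in MFA rather than roll your own.

```python
import hashlib
import hmac
import struct
import time
from typing import Optional


def hotp(secret: bytes, counter: int, digits: int = 6) -> str:
    """HMAC-based one-time password (RFC 4226) using HMAC-SHA1."""
    digest = hmac.new(secret, struct.pack(">Q", counter), hashlib.sha1).digest()
    # Dynamic truncation: the low 4 bits of the last byte pick a 4-byte window.
    offset = digest[-1] & 0x0F
    code = struct.unpack(">I", digest[offset:offset + 4])[0] & 0x7FFFFFFF
    return str(code % 10 ** digits).zfill(digits)


def totp(secret: bytes, at: Optional[float] = None, step: int = 30, digits: int = 6) -> str:
    """Time-based one-time password (RFC 6238): HOTP over a time counter."""
    now = time.time() if at is None else at
    return hotp(secret, int(now // step), digits)


def verify(secret: bytes, submitted: str, at: Optional[float] = None,
           step: int = 30, window: int = 1) -> bool:
    """Accept a code from the current time step or `window` steps either side.

    The window (an illustrative choice) tolerates small clock drift between
    the server and the user's phone.
    """
    now = time.time() if at is None else at
    counter = int(now // step)
    for drift in range(-window, window + 1):
        candidate = counter + drift
        if candidate < 0:
            continue
        # Constant-time comparison avoids leaking how many digits matched.
        if hmac.compare_digest(hotp(secret, candidate), submitted):
            return True
    return False
```

The server and the user's device share the secret once at enrollment. After that, an attacker who steals only the password still can't log in, because the code changes every 30 seconds and can't be derived without the secret.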

Conclusion

Just because your data is in the cloud doesn't mean it's safe. As an individual or a business, never trust your data to a single provider, make sure you've got backups in multiple places. As a business, think about the worst case scenario and make sure it's not too easy to have your data (and consequently your business) wiped out. Security can be complicated and expensive, but there are a lot of simple, cheap things you can do to make your app more secure. If Codespaces had followed these, they'd still be in business.