The end of the cloud: A truly serverless web

Now, 'end of the cloud' is a bold statement, I know - maybe it even sounds a little mad. And isn't serverless a cloud thing anyway? Bear with me, I'll explain.

The conventional wisdom about running server applications, be it web apps or mobile app backends, is that the future is in the cloud. As Amazon and Google add layers of tools and make running server software ever easier and more convenient, it would seem that hosting your code in AWS, GCP or Azure is the best you can do - most convenient, cheapest, easiest to fully automate, to scale elastically... I could keep going. So why on Earth am I talking about the end of it? If it ain't broke, don't fix it, right? Well...

For one, building a scalable, reliable, highly available web application, even in the cloud, is pretty difficult. And if you do it right and make your app a huge success, the scale will cost you both money and effort. Even if your business is really successful, you eventually hit the limits of what the cloud, and the web itself, can do: the compute speed and storage capacity of computers are growing faster than the bandwidth of the networks. Ignoring the net neutrality debate, this may not be a problem for most (apart from Netflix and Amazon) at the moment, but it will be soon - the volumes of data we're pushing through the network are growing massively as we move from HD to 4K to 8K, and soon there will be VR datasets to move around.

This is a problem mostly because of the way we've organised the web: there are many clients that want to get content and use programs, and only relatively few servers that have the programs and content. The original idea of the web was that everyone and their dog could have a server and host content, but over time we realised it's quite a bit of work to manage servers and make services reliable, so almost everyone migrated to the cloud, where a lot of these things have been figured out for them. This has fairly serious consequences: when someone posts a funny picture of a cat on Slack, even though I'm sitting next to 20 other people who want to look at that same picture, we all have to download it from the server where it's hosted and the server needs to send it 20 times. As the servers move to the cloud, i.e. onto Amazon's or Google's computers in Amazon's or Google's data centres, the networks close to these places need incredible throughput to handle all of this data. There also have to be huge numbers of hard drives storing the data for everyone, and CPUs pushing it through the network to every single person who needs it. This gets worse with the rise of streaming services. All of that requires a lot of energy and cooling, and makes the whole thing fairly inefficient, expensive and bad for the environment.

The other issue with centrally storing our data and programs is availability and permanence. What if Amazon's data centre gets hit by an asteroid, flooded or destroyed by a tornado? Or, less drastically, what if it loses power for a while? The data stored on its machines becomes temporarily inaccessible, or may even be lost permanently. We generally mitigate this problem by storing data in multiple locations, but that only means more data centres. That may make the risk of accidental loss negligible, but what about the data you really, really care about? Your wedding videos, pictures of your kids growing up, or important public information sources like Wikipedia. All of that is now stored in the cloud - on Facebook, in Google Drive, iCloud, Dropbox and others. What happens to the data when any of these services goes out of business or loses funding? And even if they don't, it is pretty restricting that to access your data you have to go through their service, and to share it with friends, they have to go through that service too.

On top of that, the only way for your friends to trust that the data they get is the data you sent is to trust the middleman and their honesty. This is ok in most cases, but the websites and networks we use are operated by legal entities registered in nation states, and the governments of those nations have the power to force them to do a lot of things. Most of the time this is a good thing, used to help solve crimes or remove illegal content from the web, but there are also many cases where this power has been abused. Just a few weeks ago, the Spanish government did everything in its power to stop an independence referendum in the Catalonia region - including blocking information websites telling people where to vote. Blocking inconvenient websites, or secretly modifying their content on its way to the reader, has long been standard practice in places like China. While free speech is probably not a high-priority issue for most westerners, it would be nice to keep the Internet as free and open as it was intended to be, and to have a built-in way of verifying that the content you are reading is the content the authors published.

The really scary side of the highly centralised internet is the accumulation of personal data. Large companies that provide services we all need to use in one way or another are sitting on monumental caches of people's data - data that gives them enough information about you to predict what you're going to buy, who you're going to vote for, when you are likely to buy a house, even how many children you're likely to have. Information that is more than enough to get a credit card, a loan or even buy a house in your name. You may be ok with that. After all, they were trustworthy enough for you to give them your information in the first place, but it's not them you need to worry about. It's everyone else. Earlier this year the credit reporting agency Equifax lost data about 140 million of its customers in one of the biggest data breaches in history. That data is now public. We can dismiss this as a once-in-a-decade event that could have been prevented if we had just been more careful, but it is becoming increasingly clear that data breaches like this are very hard to prevent entirely and too dangerous to tolerate. The only way to really prevent them is not to gather data on that scale in the first place.

The current model of the web, powered largely by client-server protocols (like HTTP) and security based on trust in a central authority (like TLS), is flawed and causes problems that are fundamentally either really hard or impossible to solve. It's time to look for something better: a model where nobody else is storing your personal data, where large media files are spread across the entire network, and where the whole system is entirely peer-to-peer and serverless (and I don't mean cloud-hosted functions here, I mean actually no servers). I've been reading about emerging technologies in this space for the past couple of months and I've become pretty convinced that this is where we're inevitably going. These peer to peer web technologies aim to replace the building blocks of the web we know with protocols and strategies that solve most of the problems outlined above. Their goal is completely distributed, permanent, redundant data storage, where each participating client in the network stores copies of some of the data available in it.

A content-addressable distributed web

If you've heard about BitTorrent, the following should all sound familiar. In BitTorrent, users of the network share large data files split into smaller blocks (each with a unique id) without a need for any central authority. In order to download a file, all you need is a 'magic' number - a hash - a fingerprint of the content. The BitTorrent client will then find peers that have pieces of the file and download them, until you have all the pieces.
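
To make the 'magic number' idea concrete, here is a minimal sketch in Python (standard library only), where the hex digest of a hash plays the role of the content fingerprint. Real BitTorrent uses a more elaborate scheme than a single hash of the whole file, so treat this purely as an illustration:

```python
import hashlib

def block_id(content: bytes) -> str:
    """Return a fingerprint of the content, usable as its address."""
    return hashlib.sha256(content).hexdigest()

picture = b"... bytes of a funny cat picture ..."
print(block_id(picture))  # the same bytes always hash to the same id
# Anyone who downloads the block can recompute the hash and check it
# matches the id they asked for, so the data can't be swapped out in transit.
```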

The interesting part is in how the peers are found. BitTorrent uses a protocol called Kademlia for this. In Kademlia, each peer on the network has a unique id number of the same length as the unique block ids. Blocks with a particular id are stored on a node whose id is "closest" to the id of the block. For random ids of both blocks and network peers, the distribution of storage should be pretty uniform across the network. There is a benefit, however, to not choosing the block id randomly and instead using a cryptographic hash - a unique fingerprint of the content of the block itself. The blocks are content-addressable. This also makes it easy to verify the content of a block (by re-calculating and comparing the fingerprint) and provides the guarantee that, given a block id, it is impossible to download any data other than the original.
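
Here's a rough sketch of the "closeness" idea, under the simplifying assumption that node ids and block ids are raw hash digests of the same length; the real protocol layers routing tables, buckets and replication on top of this:

```python
import hashlib

def distance(id_a: bytes, id_b: bytes) -> int:
    """Kademlia-style distance: XOR the two ids and read the result as a number."""
    return int.from_bytes(bytes(a ^ b for a, b in zip(id_a, id_b)), "big")

def closest_node(block_id: bytes, node_ids: list[bytes]) -> bytes:
    """The node whose id is 'closest' to the block id is where the block lives."""
    return min(node_ids, key=lambda node_id: distance(block_id, node_id))

# A content-addressed block id is simply the hash of the block's own content.
block = hashlib.sha256(b"some block of data").digest()
nodes = [hashlib.sha256(name).digest() for name in (b"node-a", b"node-b", b"node-c")]
print(closest_node(block, nodes).hex())
```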

The other interesting property of using a content hash for addressing is that by embedding the id of one block in the content of another, you link the two together in a way that can't be tampered with. If the content of the linked block changed, its id would change and the link would be broken. If the embedded link changed, the id of the containing block would change as well. This makes it possible to create chains of such blocks (like the blockchain powering Bitcoin and other cryptocurrencies) or even more complicated structures, generally known as Directed Acyclic Graphs (DAGs for short). This kind of link also has a name - a Merkle link, after its inventor Ralph Merkle. So now, if you hear someone talking about Merkle DAGs, you know roughly what they are. One commonly known example of a Merkle DAG is a git repository. Git stores the commit history and all directories and files as blocks in a giant Merkle DAG, which leads us to another interesting property of distributed storage based on content-addressing: it's immutable, i.e. the content cannot change in place. Instead, new revisions are stored next to existing ones. Blocks that have not changed between revisions get reused, because they have, by definition, the same id. This also means identical files cannot be duplicated in such a storage system, translating into efficient storage. On this new web, every unique cat picture will only exist once (although in multiple redundant copies across the swarm). They will still be half the content of the internet though.
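
A toy sketch of Merkle links and the deduplication they buy you; this is my own minimal model, not git's or IPFS's actual object format:

```python
import hashlib
import json

store = {}  # toy content-addressed block store: id -> serialised block

def put(block: dict) -> str:
    """Store a block and return its content-addressed id (a hash of its content)."""
    data = json.dumps(block, sort_keys=True).encode()
    block_id = hashlib.sha256(data).hexdigest()
    store[block_id] = data
    return block_id

# Leaf blocks: file contents.
readme_v1 = put({"file": "Hello, distributed web!"})
cat = put({"file": "<cat picture bytes>"})

# A directory is a block whose content embeds its children's ids - Merkle links.
tree_v1 = put({"readme.md": readme_v1, "cat.jpg": cat})

# Changing a file gives it a new id, which gives the directory a new id too...
readme_v2 = put({"file": "Hello again!"})
tree_v2 = put({"readme.md": readme_v2, "cat.jpg": cat})

# ...while the unchanged cat picture exists in the store only once,
# shared by both revisions of the directory.
print(tree_v1 != tree_v2, len(store))  # True 5
```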


Protocols like Kademlia, together with Merkle chains and Merkle DAGs, give us the tools to model file hierarchies and revision timelines, and to share them in a large-scale peer to peer network. There are already protocols that use these technologies to build distributed storage that fits our needs. One that looks very promising is IPFS, which we might dig a bit deeper into in the future - watch this space!

The problem with names and shared things

Ok, so with the above techniques we can solve quite a few of the problems outlined at the beginning: we get distributed, highly redundant storage on the devices connected to the web, which can keep track of the history of files and keep all the versions around for as long as they are needed. This (almost) removes the availability, capacity, permanence and content verification problems, and also the bandwidth problem - peers send data to each other, and there are no major hotspots or bottlenecks. We will also need a scalable compute resource, but that shouldn't be too difficult: everyone's laptops and phones are now orders of magnitude more powerful than most apps need (including fairly complex machine learning computations), and compute is generally pretty horizontally scalable, so as long as we can make every device do the work necessary for its own user, there shouldn't be a major problem.

Ok then, the image on Slack can now come from one of my co-workers sitting next to me instead of the Slack servers (and without crossing the ocean in the process). In order to post a cat picture, on the other hand, I need to update a channel in place (i.e. the channel is no longer what it was before my message - it changed). This fairly innocuous-sounding thing turns out to be the hard part. It will, unfortunately, get a bit technical, so feel free to skip to the next section if you're not interested in techno-philosophical nonsense.

The concept of an entity that changes over time is really just a human idea, there to give the world some order and stability in our heads. We can also think of such an entity as just an identity - a name - that takes on a series of different values (which are static and immutable) as time progresses (Rich Hickey explains this really well in his talks Are we there yet? and The value of values). This is a much more natural way of modelling information in a computer, with more natural consequences. If I tell you something, I can no longer change what I told you, or make you un-learn it. Facts, e.g. who the President of the United States is, don't change over time; they just get superseded by other facts referred to by the same name, the same identity. In the git example, a ref (a branch or tag) can point to (hold the id, and thus the value of) a different commit at different times, and making a commit replaces the value it currently holds. The Slack channel would likewise be an identity whose values over time are ever-growing lists of messages.
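
A toy sketch of that identity/value split, loosely modelled on a git ref pointing at successive commits; the names and the in-memory dictionaries are mine, purely for illustration:

```python
# A mutable 'identity' (a name) taking on a series of immutable values over time.
refs = {}      # name -> the value it currently refers to
history = {}   # name -> every value it has ever referred to

def update(name: str, value: str) -> None:
    history.setdefault(name, []).append(value)
    refs[name] = value  # the previous value is superseded, not destroyed

update("slack-channel/cats", "id-of-message-list-v1")
update("slack-channel/cats", "id-of-message-list-v2")

print(refs["slack-channel/cats"])     # the current value of the identity
print(history["slack-channel/cats"])  # all the values it has taken over time
```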

The real trouble is we're not alone in the channel. Multiple people try to post messages and change the channel, sometimes simultaneously, and someone needs to decide what the result should be.

Allow me a brief aside: 'simultaneously' is not a real thing either. There is no such thing as universal global time - that would break the laws of physics, since information, like everything else, travels at most at the speed of light. In computer networks, information travels from user A to user B at a finite speed too, and one significantly slower than the speed of light. Therefore we can't decide which update happened first, because 'first' is relative to where in space (or the network) you are - 'happening first' is simply not a thing. If A and B both send a message 'at the same time', each will see their own update first and the other's second. And both would be right.

In centralised systems, like basically all the current web apps, there is a single central entity deciding this 'update race' and serialising the events - whichever event reaches it first wins. In a distributed system, however, everyone is an equal, so there needs to be a mechanism that ensures the network reaches a consensus about the 'history of the world'.

Consensus is the most difficult problem to solve for a truly distributed web supporting the whole range of applications we are used to today. It doesn't only affect concurrent updates, but any other updates that need to happen 'in place' - updates where the 'one source of truth' changes over time. This is particularly difficult for databases, but it also affects other key services, like DNS: registering a human-readable name for a particular block id (or series of ids) in a decentralised way means everyone involved needs to agree that the name exists and has a particular meaning, otherwise two different users could see two different files under the same name. Content-based addressing solves this for machines (remember, a name can only ever point to one particular piece of matching content), but not for humans.

There are a few major strategies for dealing with distributed consensus. One of them involves selecting a relatively small 'quorum' of managers with a mechanism for electing a 'leader' who decides the truth (if you're interested, look at the Paxos and Raft protocols). All changes then go through the leader. This is essentially a centralised system, but one that can tolerate the loss of the central deciding entity or an interruption (a 'partition') in the network. Another approach is a proof-of-work based system like a blockchain, where consensus is ensured by making peers solve a puzzle in order to write an update (i.e. add a valid block to a Merkle chain). The puzzle is hard to solve but easy to check, and additional rules exist to resolve a conflict if one still happens. Several other distributed blockchains instead use a proof-of-stake based consensus, which avoids Bitcoin's problem with the energy required to solve the puzzle - the mining. If you're interested, you can read about proof of stake in this whitepaper by BitFury. Yet another approach, for specific problems, revolves around CRDTs - conflict-free replicated data types - which, in those specific cases, don't suffer from the consensus problem at all. The simplest example is an incrementing counter: if all the updates are just 'add one', then as long as we can make sure each update is applied exactly once, the order doesn't matter and the result will be the same.
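
The counter mentioned above is the classic grow-only counter (G-Counter) CRDT; here is a minimal sketch of the textbook construction, where each peer tracks only its own increments so replicas can be merged in any order:

```python
class GCounter:
    """Grow-only counter CRDT: each peer only ever increments its own slot."""

    def __init__(self, peer_id: str):
        self.peer_id = peer_id
        self.counts = {}

    def increment(self) -> None:
        self.counts[self.peer_id] = self.counts.get(self.peer_id, 0) + 1

    def merge(self, other: "GCounter") -> None:
        # Taking the per-peer maximum makes merging commutative, associative
        # and idempotent, so replicas can be combined in any order, any number
        # of times, and still converge.
        for peer, count in other.counts.items():
            self.counts[peer] = max(self.counts.get(peer, 0), count)

    def value(self) -> int:
        return sum(self.counts.values())

a, b = GCounter("alice"), GCounter("bob")
a.increment(); a.increment()
b.increment()
a.merge(b); b.merge(a)
print(a.value(), b.value())  # 3 3 - both replicas agree without any coordination
```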

There doesn't seem to be a clear answer to this problem just yet, and there may never be only one, but a whole lot of clever people are working on it, so there are enough interesting solutions out there to pick from, depending on the particular trade-off you can afford. The trade-off generally lies in the scale of swarm you're aiming for and in which property of the consensus you're willing to let go of, at least a little - availability or consistency (or, technically, partition tolerance, but giving that up seems impossible in a highly distributed system like the ones we're talking about). Most applications seem to be able to favour availability over immediate consistency - as long as the state ends up consistent in a reasonable time, it's ok.

Privacy in the web of public files

One obvious problem that needs addressing is privacy. How do we store content in a distributed swarm of peers without making everything public? If it's enough to hide things, content-addressed storage is a good choice, since in order to find something you need to know the hash of its content (somewhat like private Gists on GitHub). This essentially gives us three levels of privacy: public, hidden and private. The answer for the third level, it seems, lies in cryptography - encrypting the stored content with strong encryption and sharing the key “out of band” (e.g. physically on paper, by touching two NFC devices, by scanning a QR code...).
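
As a sketch of that third, private level: encrypt before you publish, so only the ciphertext is content-addressed and stored publicly. This uses the Fernet recipe from the widely used Python cryptography package, with a plain dict standing in for the swarm:

```python
import hashlib
from cryptography.fernet import Fernet  # pip install cryptography

key = Fernet.generate_key()  # shared out of band: on paper, over NFC, as a QR code...
ciphertext = Fernet(key).encrypt(b"<bytes of a private wedding video>")

# Only the ciphertext ever enters the public, content-addressed swarm.
block_id = hashlib.sha256(ciphertext).hexdigest()
public_store = {block_id: ciphertext}

# Anyone can fetch the block by its id, but only holders of the key can read it.
print(Fernet(key).decrypt(public_store[block_id]))
```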

Relying on cryptography may sound risky at first (after all, vulnerabilities are found all the time), but it's actually not much worse than what we do today. In fact, it's most likely better in practice: sensitive data is currently stored in ways that are generally not disclosed to the public (or even to the people the data is about), it's exposed to an undisclosed number of people employed by the organisations holding it, and it's protected, at best, by cryptography-based methods anyway. More often than not, if you can gain access to the systems storing this data, you can have all of it. By storing private data in a way that's essentially public, you are forced to protect it (with strong encryption) such that it is of no use to anyone who gains access to it. This is the same idea as making security-related software open source, so that anyone can look at it and find problems: knowing how the security works shouldn't help you break it.

An interesting property of this kind of access control is that once you've granted somebody access to some data, they will have access to that particular revision of the data forever. You can, of course, always change the encryption key for future revisions. This is also no worse than what we have today, even though it may not be obvious: given access to some data, anyone can always make a private copy of what they were given access to. It's impossible to unlearn things.

The interesting challenge in this area is coming up with a good system for establishing and verifying identities, and for sharing private data within a group of people whose membership changes over time - e.g. a group of collaborators on a private git repository. It can definitely be done with some combination of private-key cryptography and rotating keys, but making the user experience smooth is likely going to be a challenge.

From the cloud to a... fog... an information aether

Hard problems to solve notwithstanding, I think this is quite an exciting future. First, on the technical front, we should get a fair number of improvements out of a peer to peer web. Content-addressable storage provides cryptographic verification of the content itself without a trusted authority, hosted content is permanent (for as long as any humans are interested in it), it can be cached forever, and in most parts of the network there should be a fairly significant speed improvement - even at the edges in the developing world, far away from the data centres (or even on another planet!). At some point even data centres may become a thing of the past - consumer devices are getting so powerful and ubiquitous that computing power and storage (a computing “substrate”) are almost literally lying in the streets.

For businesses running web applications, this should translate into significant cost savings and far fewer headaches building reliable digital products. That, in turn, means less focus on downtime risk mitigation and more focus on adding customer value, benefitting everyone. There will still be a need for cloud-hosted servers, but they will only be one of many similar peers. This doesn't exclude the possibility of heterogeneous applications, where not all the peers are the same - there could easily be consumer-facing peers and back-office peers as part of the same application 'swarm', with the difference being only in access level, enforced by cryptography.

The other large benefit, for both organisations and customers, is in the treatment of customer data. When a fully distributed application architecture is technically available, there's no longer any need to centrally store huge amounts of customer information, which reduces the risk of losing such data in bulk to a minimum. Leaders in the software engineering community (like Joe Armstrong, the creator of Erlang, whose talk from Strange Loop 2014 is worth a watch) have long argued that the design of the internet, where customers send data to programs owned by businesses, is backwards, and that we should instead send programs to customers to execute on their privately held data, which is never directly shared. Such a model seems much safer and doesn't in any way prevent businesses from collecting the useful customer metrics they need. In some cases, businesses will need to invest in the combination of transparency and real, cryptography-based security that distributed storage of data requires, but it's likely worth the investment long-term. And if not, nothing prevents a hybrid approach, with some services being opaque and holding on to private data.

There are most likely applications this kind of technology enables that we can't really even see today, but one obvious one to me is in online publishing. Publishers currently make large investments in content delivery through CDNs, which move content closer to the consumer, while still leaving content (mostly) inaccessible when the consuming device has no Internet connection. This generally requires a significant engineering effort to produce the final content as fast as possible and to invalidate caches when it changes. Coupled with mesh networks, like the kind we're seeing in Internet of Things devices (e.g. the Philips Hue smart lightbulbs), the peer to peer web could mean fast, reliable access to content anywhere at any time, as long as another device on the network is nearby. With connectivity being peer to peer, entirely new business models could appear, centred around how much storage and bandwidth a network node is willing to share with the rest of the network - e.g. instead of paying your trusted news outlet money for content, you could pay by helping with the content distribution.

In general, it feels like this type of application architecture is a much more natural way to do large-scale computing and software services (it almost turns them into a commonly accessible utility). One where your data is actually yours, where services work without a huge effort going into making them available, reliable and scalable, and where information is not overwritten in place as changes happen, nor lost when businesses lose servers or go out of business. An Internet closer to the original idea of open information exchange, where anyone can easily publish content for everyone else, and where control over what can be published and accessed is exercised by consensus of the network's users, not by private entities owning servers. This, to me, is hugely exciting. (If you want to play with an embryonic version of this new kind of web, try the Beaker web browser, which supports a few popular protocols of the peer to peer kind. It's just a web browser, but it also lets you publish your own websites with a single click of a button.)

All of this is why I'd love to get a small team together at Red Badger and, within a few weeks, build a small, simple proof-of-concept mobile application using some of the technologies mentioned above, to show what can be done with the peer to peer web. The only current idea I have that is small enough to build relatively quickly and interesting enough to demonstrate the properties of such an approach is a peer to peer, truly serverless Twitter clone, which isn't particularly exciting.

If you've got a better one (which isn't that hard, honestly), or if you have anything else related to the peer to peer distributed web to talk about, please tweet at me at @charypar - I would love to hear about it!

An edited version of this article was first published on Venturebeat as a guest post. http://bit.ly/2hhUMlY

Viktor Charypar

Tech Lead
