Architecting Trump’s Bot Clone in AWS without Losing my House

Ioannis Tsiokos
7 min read · Dec 11, 2020

The Story, i.e. the WHY

One month before the US elections, while working on an NLP project, I thought it would be cool to create a bot clone of Donald Trump — dialogue style, wit, and voice included.

Why on Earth would I Do That?

The answer is to learn, to be entertained, and to promote MasterBot. It turned out I was successful at the first two, and not quite as successful at the third, for two reasons.

First, TrumpBot turned out to be kind of dumb, even after training for a few hours on AWS's 256GB GPU monster. Second, TrumpBot's voice turned out awful (partially intended, but also due to time constraints).

Perhaps I am being over-dramatic. After all, the press release was picked up by PR Newswire for complimentary distribution, and more than 100 journalists visited the website. TrumpBot has, thus far, answered more than 10,000 questions in its robotic, Donald-like voice.

Ultimately, though, TrumpBot did not live up to expectations and never gained momentum. Perhaps I should have written this article sooner. But hey, this is meant to be a technical article, so who cares.

The HOW

The Objective

My goal was to create a web page that would work in (most) mobile and desktop browsers. The user should be able to visit the page, click a button or two, and start chatting (voice or text) with TrumpBot, who would always reply with both text and voice.

Services Required

This rather simple application (just one button) requires:

* Voice-to-text service for the user

* Text-generation service

* Text-to-voice service for the replies

The rather welcome problem I had to face from the start was this: what if Fox News decided to feature TrumpBot and 10,000,000 people came knocking on its door within minutes? Not that this ever happened, but why risk having my 1080 Ti freak out, explode, and burn down my house? I mean, would you risk your house for an imitation of Trump? Right.

The only way to be able to scale up so quickly, if you are not Amazon, Google, or Microsoft, is to use Amazon’s, Google’s, or Microsoft’s cloud services. Wait, is that even ethical? Whatever, moving on with AWS.

My initial hunch was to set up three scaling groups: one for user voice transcription (voice-to-text), one for voice generation (text-to-voice), and one for text generation (chatbot). Each group would scale up and down based on the number of requests. Ideally, the groups would scale based on GPU load, but there was no off-the-shelf way to do that, so the number of requests would have to suffice for a hit-and-run project.
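For what it's worth, request-based scaling is the off-the-shelf part: a target-tracking policy on the load balancer's request count per instance. A boto3 sketch, where every name and number below is made up for illustration:

```python
import boto3

autoscaling = boto3.client("autoscaling")

# Target-tracking policy: scale the group so each instance handles
# roughly TargetValue requests. Group, ALB, and target-group names
# here are hypothetical placeholders.
autoscaling.put_scaling_policy(
    AutoScalingGroupName="trumpbot-asg",
    PolicyName="requests-per-instance",
    PolicyType="TargetTrackingScaling",
    TargetTrackingConfiguration={
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "ALBRequestCountPerTarget",
            "ResourceLabel": (
                "app/trumpbot-alb/0123456789abcdef/"
                "targetgroup/trumpbot-tg/0123456789abcdef"
            ),
        },
        "TargetValue": 100.0,  # illustrative requests-per-instance target
    },
)
```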

Spoiler Alert — I Created Only One Scaling Group.

Here is what’s wrong with the three-group architecture.

First, it has high fixed costs. What if TrumpBot flopped? Do I really want to pay for three elastic load balancers every month so a few users can have a laugh asking TrumpBot you-know-whats? Certainly not.

Second, it would have high variable costs in case the project succeeded. One of the most fun (and scariest) exercises is calculating your cloud bill should your project take off (come on, I know you do this). Based on my rough assumptions, AWS would kindly ask me for $20K just for trying to be funny. Granted, the publicity would make it worth it, but my printer does not print that sort of number on invoices. Improvise I must.

Third, scaling was unnecessarily granular. When you design a system, ask yourself if one task scales proportionately with another task. If that is the case, isolating the two tasks should only be done to isolate their environments, not their scaling policies. With TrumpBot, it was clear that text-generation requests would grow in tandem with voice-generation requests, so why scale both independently?

Loss Aversion

It turns out, I could save my house from a burning 1080 Ti only to lose it to the bank for not paying my AWS bill. We will never know for sure because TrumpBot flopped. Here is what I did to save myself from a bill that never came.

Idea 1 — Turn Text and Voice into Husband and Wife

I recall reading an article (though I do not recall where) about Amazon putting two otherwise independent services in one box to serve Alexa voice generation. Since both services scaled together, it made sense to house them together, as this would also reduce the latency between them. The point is, don't use scalability as an argument for keeping services separate when they do not actually scale independently.

Following that realization, I decided to house both voice and text generation on the same server. If the project dependencies allowed it, I would try to put them in the same service to simplify deployment. Since this was a quick project that would not get any updates, I could live with the idea of the services living with each other. I do have nightmares sometimes, but I am getting help.

It turned out that the Python version dependencies for both services were compatible. I ended up using FastAPI and including both services in the same server program. Each program is housed in a Docker container, and each instance runs four containers behind an Nginx proxy.
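In case a picture helps, here is a minimal sketch of the combined service. The two model functions are stubs standing in for the real chatbot and TTS models, which I will not pretend to reproduce here; everything else is ordinary FastAPI:

```python
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

def generate_reply(question: str) -> str:
    # Stub standing in for the GPT-style chatbot, loaded once per process.
    return "Nobody asks better questions than you. Nobody."

def synthesize(text: str) -> bytes:
    # Stub standing in for the TTS model living in the same process.
    return b"RIFF..."  # WAV bytes in the real service

class ChatRequest(BaseModel):
    question: str

@app.post("/chat")
def chat(req: ChatRequest):
    text = generate_reply(req.question)  # text generation
    audio = synthesize(text)             # voice generation: same box, no network hop
    return {"text": text, "audio": audio.hex()}
```

Housing both models in one process is exactly the husband-and-wife idea: one scaling group, one load balancer, and zero network latency between the two steps.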

Idea 2 — Get the Client to Do Some Work

So much computing power (in GPU and CPU cycles) is wasted every second, it makes me want to cry. Why not, then, offload some of the computing burden to the client itself and save some tears and a few thousand dollars?

So, which one of the three services can we move to the edge, if any?

The text-generation model was too large to even think about. I also did not feel ready to learn how to run GPT in the browser. So, skip.

Voice generation was a potential candidate. However, I would rather not have to explain to the FBI that anybody could create a model of the President’s voice. Usually, the problem is collecting the data, but when it comes to the President, there is enough audio of him on YouTube alone to make the perfect voice model. Nevertheless, it would be safer to keep the model on the server.

That left the voice-to-text service as the only candidate. It turns out I could avoid deploying a service altogether by outsourcing the work to the client, taking advantage of Google's Web Speech API. To make this work, I added some JavaScript code to the client that transcribes the audio using Google's public servers.

Why on earth would Google offer free audio transcription? Well, there is a catch: audio-to-text only works in Chrome. If you are building a commercial app, you can still use the Web Speech API to lower your bill, but you should also run a transcription service of your own as a fallback. I decided to simply ask visitors to use Chrome and automatically switched to text mode for non-Chrome users.

Idea 3 — Say it Once

I do not know why, but I had a feeling that most people would ask Donald Trump roughly the same things. It follows that the bot would return the same or similar text answers. Caching to the rescue.

The idea was to avoid regenerating audio for the same sentence. To avoid regeneration 100% of the time, we would need a database recording every sentence, plus strongly consistent reads. However, that would be overkill; regenerating the occasional sentence twice is harmless.

In the end, I split all text responses into sentences, hashed each sentence, generated the audio, and stored that audio in S3 with the hash as the key. The text/voice generation service re-lists all S3 keys every hour and keeps the set of hashes in memory.
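The whole cache fits in a few lines. A sketch, assuming a hypothetical bucket name and SHA-1 as the (arbitrary) hash; `synthesize` is the TTS stub from the earlier sketch:

```python
import hashlib

import boto3

s3 = boto3.client("s3")
BUCKET = "trumpbot-audio"      # hypothetical bucket name
cached_keys: set[str] = set()  # refreshed from S3 every hour

def refresh_cache() -> None:
    # Re-list all S3 keys; runs hourly. A slightly stale set only means
    # an occasional sentence gets synthesized twice, which is harmless.
    keys: set[str] = set()
    for page in s3.get_paginator("list_objects_v2").paginate(Bucket=BUCKET):
        keys.update(obj["Key"] for obj in page.get("Contents", []))
    cached_keys.clear()
    cached_keys.update(keys)

def audio_key(sentence: str) -> str:
    # The hash of the sentence is the S3 key.
    return hashlib.sha1(sentence.encode("utf-8")).hexdigest() + ".wav"

def get_or_generate(sentence: str) -> str:
    key = audio_key(sentence)
    if key not in cached_keys:  # cache miss: synthesize and upload
        s3.put_object(Bucket=BUCKET, Key=key, Body=synthesize(sentence))
        cached_keys.add(key)
    return key                  # the client fetches the file straight from S3
```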

In addition to caching, the design brought two more advantages.

First, the service is no longer burdened with sending audio files to each client. Instead, it saves the file via a fast S3 endpoint and sends the client only the hash and the text response. S3 is responsible for serving the file, and the server's WebSockets thanked me for taking that weight off their shoulders.

Second, since we split audio generation into sentences, we can start serving the client as soon as the audio for the first sentence (from a potentially multi-sentence response) is ready, or as soon as the text-generation model returns, if the audio was already cached.
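Putting it all together, the serving loop looks roughly like this (reusing the stubs from the sketches above; the sentence splitter is deliberately naive):

```python
import re

from fastapi import FastAPI, WebSocket

app = FastAPI()

@app.websocket("/ws/chat")
async def chat_ws(ws: WebSocket) -> None:
    await ws.accept()
    question = await ws.receive_text()
    text = generate_reply(question)                    # full text response first
    for sentence in re.split(r"(?<=[.!?])\s+", text):  # naive sentence split
        key = get_or_generate(sentence)                # instant if already cached
        # Only the text and the hash travel over the socket;
        # the client fetches the audio for `key` directly from S3.
        await ws.send_json({"sentence": sentence, "audio_key": key})
```

The client can start playing the first sentence while the rest are still being synthesized, which is where the perceived latency win comes from.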

In the End

I must admit, I rent my house, so no bank is going to take it anytime soon.

Nevertheless, there is beauty in architectures that are no more and no less complicated than they need to be, if only because we are such architectures ourselves, and we sure like to think of ourselves as beautiful. I do not know how all this ended up sounding philosophical, but if you are going to remember anything beyond my bad jokes, remember this.

Design for success. And design for failure.

TrumpBot is to spend the remainder of his zombie days off-cloud, in 2.5GB of my home GPU. Sometimes, I hear the Ti's fan spinning. Sometimes, I'd rather not.
