“4: Bring me tweets.”
― Matt’s law of robotics
As part of my plan to spend less time online in 2021, I decided to automate my Twitter scrolling. Being productive on Twitter means catching announcements, new papers, etc. from a small group of physicists and academics talking about esoteric concepts in science and technology. To help with this, I created a bot that streams tweets to a Slack channel where I can discuss them with ‘the bros’. I’m far from the first person to do this, but hopefully there’s some useful information here. This seems to have been the thing to do in 2012, but no one has bothered to update the syntax or authentication since. The rest of this post covers the highlights and how-tos from building the bot. If you’d like to skip to the punch line, the code is on GitHub at matthewware/slack-twitter-bot. Things I will try to cover:
- Authentication with Slack and Twitter
- Streaming and filtering Twitter streams – and catching errors
- Slack message Block formatting and sending HTTP
- Controlling the bot from Slack
- Notes on deploying the bot in the cloud
UPDATE 03/12/21: In practice, this bot had several issues with intermittent crashes. After making several changes, I believe the major cause was the handling of empty byte strings in Python urllib3’s response.py. See more here. After making the line = (len(line)>0 and line or "0") change I haven’t had any issues.
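To see why such a guard can help, here is a minimal sketch. It assumes the line in question is a chunk-size line from HTTP chunked transfer encoding, which is parsed as a hexadecimal integer. The function parse_chunk_length is hypothetical and only mimics that parsing step; it is not urllib3's code.

```python
def parse_chunk_length(line, guard=True):
    """Parse a chunk-size line (hex digits, optional ';extension').

    Hypothetical stand-in for the parsing step; not urllib3's actual code.
    """
    line = line.split(b";", 1)[0]
    if guard:
        # Same idea as the one-line fix: treat an empty line as length 0.
        line = (len(line) > 0 and line or b"0")
    # int() raises ValueError on an empty byte string.
    return int(line, 16)
```

Without the guard, an empty byte string makes int() raise a ValueError; with it, the empty line is read as a zero-length chunk.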
In total, we need six tokens for this bot to work. Two are for Slack and four are for Twitter.
Twitter handles its authentication on a developer basis. This allows them to attach API calls to the tokens of individual developers. On top of these, each bot/app you build will also have its own set of access tokens:
TWITTER_CONSUMER_KEY - your individual developer key
TWITTER_CONSUMER_SECRET - your individual developer secret
TWITTER_ACCESS_TOKEN - an access token for your app itself
TWITTER_ACCESS_TOKEN_SECRET - a secret token for your app
To get started, go to the Twitter developer portal and request a developer account. This isn’t an automatic process. You’ll need to ask for an account and tell Twitter roughly what you plan to do with it. In my case, I only wanted to get data off Twitter and have no plans to post any of it publicly. That meant the process was pretty quick for me (~5 minutes).
For Slack, all authentication is done at the application/bot level. This means each bot will have a unique set of keys. For our purposes we only need a SLACK_TOKEN to talk to our app and a SLACK_SIGNING_SECRET so we can verify requests from Slack to our control server. The process for generating these is very simple, and it is the first thing you do when you create your app/bot. The Slack API documentation is a great place to start learning.
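One way to keep the six tokens out of the code is to read them from environment variables and fail early if any are missing. This is a minimal sketch, assuming all six live in the environment under the names used above; load_tokens and REQUIRED_TOKENS are hypothetical names, not part of the bot.

```python
import os

# The six names used in this post; storing them all as environment
# variables is an assumption of this sketch.
REQUIRED_TOKENS = (
    "TWITTER_CONSUMER_KEY",
    "TWITTER_CONSUMER_SECRET",
    "TWITTER_ACCESS_TOKEN",
    "TWITTER_ACCESS_TOKEN_SECRET",
    "SLACK_TOKEN",
    "SLACK_SIGNING_SECRET",
)


def load_tokens(env=None):
    """Return the six tokens as a dict, or raise if any are missing or empty."""
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED_TOKENS if not env.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return {name: env[name] for name in REQUIRED_TOKENS}
```

Failing at startup with the list of missing names is easier to debug than a KeyError halfway through connecting to Twitter.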
Every request to the Slack API will use the SLACK_TOKEN. A very basic usage of the slack_sdk would look something like this:
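    import logging
    import os

    from slack_sdk import WebClient
    from slack_sdk.errors import SlackApiError

    new_token = os.environ['SLACK_TOKEN']
    client = WebClient(token=new_token)

    def write_text(message, channel='bot-dev'):
        try:
            response = client.chat_postMessage(channel=channel, text=message)
        except SlackApiError as e:
            # assumed handling: log the error Slack reports and carry on
            logging.error("Slack API error: %s", e.response["error"])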
In the other direction, we want to verify that requests to the bot control server are legitimate. This Flask server lets users update the users and keywords to follow on Twitter. The documentation for checking these HTTP requests is pretty thorough. A slack_sdk.verify.verify_request function in the SDK will do the authentication for you. But I ended up implementing it myself because I wanted to add a very simple timeout for request authentication:
    if abs(time.time() - int(timestamp)) > 60 * 5:
        # The request timestamp is more than five minutes from local time.
        # It could be a replay attack, so let's ignore it.
        return False
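For context, the timestamp check is one half of Slack's signed-request scheme. The other half is an HMAC-SHA256 over the string v0:timestamp:body, compared against the X-Slack-Signature header. A minimal sketch of the whole check follows. The function name and argument layout are assumptions for illustration, not the bot's actual code.

```python
import hashlib
import hmac
import time


def verify_slack_request(signing_secret, timestamp, body, signature, now=None):
    """Return True if the request looks like it came from Slack.

    timestamp: value of the X-Slack-Request-Timestamp header (str)
    body:      raw request body (str)
    signature: value of the X-Slack-Signature header (str)
    """
    now = time.time() if now is None else now
    try:
        request_time = int(timestamp)
    except (TypeError, ValueError):
        return False
    # Reject requests more than five minutes from local time (replay guard).
    if abs(now - request_time) > 60 * 5:
        return False
    if not signature:
        return False
    base = "v0:{}:{}".format(timestamp, body).encode("utf-8")
    digest = hmac.new(signing_secret.encode("utf-8"), base, hashlib.sha256).hexdigest()
    expected = "v0=" + digest
    # Constant-time comparison avoids leaking how many characters matched.
    return hmac.compare_digest(expected.encode("utf-8"), signature.encode("utf-8"))
```

In a Flask view, the three inputs would come from the request headers and the raw request body; the now parameter only exists to make the function easy to test.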
The heart of the app is the Tweepy Stream object. It streams new tweets to your bot as they appear on Twitter. This is exactly what we want. The only issue is that the Stream only lets you filter by user or by keyword and not by both[1]. It was far easier for me to filter first by users, given I’m only planning to follow fewer than 100 people. Given this and the volume these people tweet, the rest of the filtering is done by brute force. Your mileage may vary depending on who you follow and what keywords you want to track! Lastly, I filter out reply tweets so I can focus on what people tweet to their accounts and not the arguments that follow.
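The brute-force step could look something like the sketch below. It assumes each tweet has already been parsed into a dict shaped like the Twitter v1.1 JSON (in_reply_to_status_id, text, and extended_tweet.full_text for long tweets), that keywords match as case-insensitive substrings, and that an empty keyword list means "post everything". The function names are hypothetical.

```python
def tweet_text(tweet):
    """Return the longest available text for a tweet dict."""
    extended = tweet.get("extended_tweet") or {}
    return extended.get("full_text") or tweet.get("full_text") or tweet.get("text") or ""


def should_post(tweet, keywords):
    """Drop replies, then keep tweets that mention at least one keyword."""
    if tweet.get("in_reply_to_status_id") is not None:
        return False
    if not keywords:
        # Assumption: no keywords configured means no keyword filtering.
        return True
    text = tweet_text(tweet).lower()
    return any(keyword.lower() in text for keyword in keywords)
```

A linear scan over the keywords is plenty for a few dozen accounts; a much larger follow list might need something smarter.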
The really interesting part of Tweepy streaming is setting up the stream in a way that makes it easy to stop and restart when users and keywords are changed by the control server. To do that, I have the following two functions:
    def launch_bot(channel=POST_CHANNEL):
        """
        Start the stream and filter for users in the db list.
        All other filtering is done by the Listener.
        """
        logging.info("Creating listener...")
        myStreamListener = MyStreamListener(channel=channel)
        myStream = CustTweepyStream(auth=api.auth,
                                    listener=myStreamListener,
                                    include_entities=True,
                                    tweet_mode='extended')

        # start filtering
        logging.info("Starting bot...")
        # async needs to be true so we don't block the file watcher
        myStream.filter(follow=get_ids(), is_async=True, stall_warnings=False)

        return myStream


    def restart_bot(stream):
        # try to kill previous stream
        stream.disconnect()
        # return a new stream but wait some time to avoid rate limiting
        time.sleep(60)
        return launch_bot()
This may look standard, but you need to extend the Stream class by adding __enter__ and __exit__ methods so the class can be used as a context manager. This lets Python know what to do when disconnect is called:
    class CustTweepyStream(Stream):
        """Custom Stream usable in contexts"""

        def finalize(self):
            pass

        def __exit__(self, exc_type, exc_val, exc_tb):
            self.finalize()

        def __enter__(self):
            return self
Here we don’t need these methods to do much, but we do need them to exist. For more on the reasons why, check out this blog post.
Tweepy’s connection to the Twitter API seems to suffer periodically from an IncompleteRead error. This tends to happen roughly once a day. There’s some speculation on StackOverflow that this issue comes from getting too many tweets in a short amount of time. So that’s pretty annoying. Current plans are to put the tweet data into a queue and process the queue, but that still needs more testing[2]. For now I’m just catching the errors (and possibly missing a tweet here and there). Luckily, the http libraries make it easy to catch this specific error. I found a discussion here and added the following to the tweet processing code:
    from http.client import IncompleteRead as http_incompleteRead
    from urllib3.exceptions import IncompleteRead as urllib3_incompleteRead

    # ... in the tweet processing method, following its try block ...
    # returning False closes the stream
    # the specific handlers must come before the catch-all, or they never run
    except urllib3_incompleteRead as e:
        logging.error("urllib3 Incomplete Read error: %s" % str(e))
        logging.error("~~~ Restarting stream search in 5 seconds... ~~~")
        time.sleep(5)
        return True
    except http_incompleteRead as e:
        logging.error("http.client Incomplete Read error: %s" % str(e))
        logging.error("~~~ Restarting stream search in 5 seconds... ~~~")
        time.sleep(5)
        # restart stream - simple as return True, just like the other handlers?
        return True
    except BaseException as e:
        logging.error("Error on_data: %s, Pausing..." % str(e))
        time.sleep(5)
        return True
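The queue idea mentioned above could be sketched with the standard library alone. The stream callback would only enqueue raw tweet data, and a worker thread would do the slower parsing and posting. This is an untested-in-production sketch; start_worker and its arguments are hypothetical names.

```python
import logging
import queue
import threading


def start_worker(handle, maxsize=1000):
    """Start a daemon thread that calls handle(item) for each queued item.

    Putting None on the queue tells the worker to stop.
    """
    tweets = queue.Queue(maxsize=maxsize)

    def worker():
        while True:
            item = tweets.get()
            try:
                if item is None:  # sentinel: shut down
                    return
                handle(item)
            except Exception:
                # Keep the worker alive if one tweet fails to process.
                logging.exception("Failed to process queued tweet")
            finally:
                tweets.task_done()

    thread = threading.Thread(target=worker, daemon=True)
    thread.start()
    return tweets, thread
```

With this in place, the listener would call tweets.put_nowait(raw_data) and return immediately, so the connection is read as fast as Twitter sends data. Whether that actually cures the IncompleteRead errors is still an open question.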
The Slack documentation discourages the use of attachments given they are now deprecated. Blocks are now the blessed way to send rich message formats to a Slack channel. These are basically JSON objects that need certain fields. Getting the right message formatting was one of the more tedious parts of bot building. Luckily there’s now a Block Kit Builder playground for getting things just right. For the Twitter bot, a build_message function parses the tweet object and fills in data for the block. Unfortunately, there didn’t seem to be a way to avoid handling tweet, retweet, and quote tweet formatting all with separate block code.
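To give a feel for what a block payload looks like, here is a minimal sketch for a plain tweet. It is not the bot's build_message function; build_tweet_blocks and its arguments are hypothetical, and the layout (a section, a context line with a link, and a divider) is just one reasonable choice.

```python
def build_tweet_blocks(author, handle, text, url):
    """Return a list of Block Kit blocks describing one plain tweet."""
    return [
        {
            "type": "section",
            "text": {
                "type": "mrkdwn",
                "text": "*{}* (@{})\n{}".format(author, handle, text),
            },
        },
        {
            "type": "context",
            "elements": [
                # <url|label> is Slack's mrkdwn link syntax.
                {"type": "mrkdwn", "text": "<{}|View on Twitter>".format(url)},
            ],
        },
        {"type": "divider"},
    ]
```

The list can be passed to client.chat_postMessage through its blocks argument, with a short text argument as the notification fallback. Retweets and quote tweets would need their own variants, as noted above.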
One feature I really wanted this bot to have is the ability to update Twitter users and keywords directly from the Slack channel. A simple control server allows the csv files to be updated from Slack via ‘slash commands’. The server listens on a port and writes updated files to the local file system based on the verified commands from Slack[3].
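The file-updating half of such a server can be sketched without any web framework. The command syntax below ("add name" / "remove name") and the one-entry-per-row csv layout are assumptions made for illustration; the real bot's commands and file format may differ.

```python
import csv
import os


def read_entries(path):
    """Read a one-column csv file into a list; a missing file means no entries."""
    if not os.path.exists(path):
        return []
    with open(path, newline="") as f:
        return [row[0] for row in csv.reader(f) if row]


def apply_command(text, path):
    """Apply the text of a slash command to the csv file and return a reply."""
    parts = text.strip().split(None, 1)
    if len(parts) != 2 or parts[0] not in ("add", "remove"):
        return "Usage: add <name> | remove <name>"
    action, name = parts[0], parts[1].strip()
    entries = read_entries(path)
    if action == "add" and name not in entries:
        entries.append(name)
        reply = "Added {}".format(name)
    elif action == "remove" and name in entries:
        entries.remove(name)
        reply = "Removed {}".format(name)
    else:
        return "No change for {}".format(name)
    # Write to a temporary file, then swap it in, so a reader never
    # sees a half-written csv.
    tmp = path + ".tmp"
    with open(tmp, "w", newline="") as f:
        csv.writer(f).writerows([entry] for entry in entries)
    os.replace(tmp, path)
    return reply
```

A Flask view would verify the request first, pass the text field of the slash command payload to apply_command, and send the returned string back to Slack as the response.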
The tricky part of allowing slash commands is that you’ll need to stop and restart the Tweepy stream every time the users or keywords are updated. We covered this disconnect / clean up / wait / reconnect process a bit in the Streaming section. To trigger the process, there’s a pretty nice file watching package called watchgod, inspired by the watchdog package. It was pretty simple to extend the directory watcher to watch the keyword and user csv files:
    from watchgod import DefaultDirWatcher

    class CSVWatcher(DefaultDirWatcher):
        def should_watch_file(self, entry):
            # only react to changes in csv files
            return entry.name.endswith(('.csv',))
When the file watcher sees a file change, it halts the current Tweepy stream and waits ~60 seconds before starting a new one to avoid the Twitter API rate limiting. In the main program, I have something like:
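    import os

    from watchgod import watch

    bot_stream = launch_bot()
    # watch() blocks and yields a set of changes each time a csv file changes
    for changes in watch(os.path.abspath('.'), watcher_cls=CSVWatcher):
        bot_stream = restart_bot(bot_stream)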
For large loads, take a look at the async Slack functions. These are nice if you expect a large volume of tweets and don’t want your code to block. Then again, if all this is going to a Slack channel, do you really want multiple messages per second arriving there? Here are some straightforward updates you might consider making:
- maintain keyword and user files for multiple channels and have a context for each. This would allow the bot to join multiple channels and route certain tweets to certain channels.
- write directly to multiple channels
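The routing idea in the first bullet could start from something as small as the sketch below. It assumes a dict mapping channel names to keyword lists and case-insensitive substring matching; route_tweet and the channel names in the usage note are hypothetical.

```python
def route_tweet(text, channel_keywords):
    """Return the channels whose keyword list matches the tweet text."""
    lowered = text.lower()
    return [
        channel
        for channel, keywords in channel_keywords.items()
        if any(keyword.lower() in lowered for keyword in keywords)
    ]
```

For example, with {"physics": ["qubit"], "ml": ["transformer"]}, a tweet mentioning qubits would be routed to the physics channel only, and the bot would post it once per returned channel.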
If you’re planning to use this bot in ‘production’, I’d recommend the usual security procedures – don’t run as root, etc. I have an instance running on DigitalOcean, and setup should be relatively painless. You’ll need to open the server control port in your firewall if you’re using one (as you should be). If you expect lots of commands or run the Twitter stream in debug mode for a long time, you might consider setting up a WSGI server like waitress or uWSGI, using async functions, and doing log rotation.
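Log rotation, at least, needs nothing beyond the standard library. A minimal sketch follows; the file name, size limit, and backup count are placeholder assumptions, and configure_logging is a hypothetical helper.

```python
import logging
from logging.handlers import RotatingFileHandler


def configure_logging(path="bot.log", max_bytes=1_000_000, backups=5):
    """Send root logger output to a size-limited, rotating log file."""
    handler = RotatingFileHandler(path, maxBytes=max_bytes, backupCount=backups)
    handler.setFormatter(logging.Formatter("%(asctime)s %(levelname)s %(message)s"))
    root = logging.getLogger()
    root.setLevel(logging.INFO)
    root.addHandler(handler)
    return handler
```

Once the file reaches max_bytes it is renamed to bot.log.1 (older backups shift up, and anything past the backup count is deleted), so a chatty debug stream cannot fill the disk.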
[1] It’s interesting to note the Twitter API allows this but the Tweepy implementation does not.