To work at Ars is to interact constantly with Twitter, both as a source for developing news and as a way to goof off with coworkers and other tech journalists (folks who follow the Ars staff on Twitter should be more than familiar with our long-winded, late-night, multi-Tweet antics). But as with any electronic medium, spam on Twitter is a nagging problem: Twitter’s real-time messaging means crafty spammers can blast their messages out to large numbers of people before getting hammered by spam reports.
However, several months back, Twitter went on the offensive against spammers, rolling out a set of anti-spam features collectively referred to as "BotMaker." In a blog post today, Twitter explained that the various components of BotMaker have been operational for about six months, and in that time Twitter has recorded a significant drop in tweetspam—up to 40 percent by its internal metrics.
Twitter’s real-time nature poses trouble for a traditional monolithic spam-checking system that might add many seconds onto the delivery of a tweet to followers. Rather than maintaining such a system (something akin to SpamAssassin, a widely deployed e-mail anti-spam application), BotMaker lets Twitter engineers quickly define simple sets of conditional, rule-based actions (which they call "bots," hence "BotMaker") and apply them to tweets both during and after the posting process.
It’s the selective application of rules, rather than a big all-in-one solution, that lets BotMaker function in a way that’s transparent to Twitter users.
"Real time" tweet checking uses a BotMaker component nicknamed "Scarecrow." Scarecrow is a low latency synchronous component of the Twitter posting process, meaning that a tweet can’t proceed down the posting path until Scarecrow finishes processing it. When write events come in from clients, Scarecrow parses the contents against its current set of rules and can either pass the tweet on, challenge the posting client with a CAPTCHA, or deny the tweet.
But Scarecrow, being synchronous, only has milliseconds to do its job before it starts to impact Twitter’s real-time nature. An asynchronous tool named "Sniper" fills in when Scarecrow can’t get the job done in time, applying more tests to tweets after they’ve been posted. Tweetspam that sneaks past Scarecrow for one reason or another has a second chance at detection by Sniper, which uses more complex "machine learning models [which] cannot be evaluated in real time."
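The split between a fast inline check and a slower after-the-fact check can be sketched in a few lines of Python. This is a toy model under stated assumptions, not Twitter's implementation: the function names are made up, the "slow" check is a trivial repeated-word heuristic standing in for a real machine learning model, and the second pass is called by hand rather than running in a separate worker.

```python
from collections import deque

posted = {}             # tweet_id -> text; stands in for delivered tweets
review_queue = deque()  # tweet IDs waiting for the slower second pass


def fast_check(text):
    """Cheap synchronous test (hypothetical): True if the tweet may post."""
    return "spam.example" not in text


def slow_check(text):
    """Stand-in for an expensive model: True if the tweet looks like spam.

    Flags tweets where fewer than half of the words are unique.
    """
    words = text.lower().split()
    return bool(words) and len(set(words)) / len(words) < 0.5


def post_tweet(tweet_id, text):
    """Synchronous path: only the fast check blocks delivery."""
    if not fast_check(text):
        return False
    posted[tweet_id] = text
    review_queue.append(tweet_id)  # queue for later, slower inspection
    return True


def run_second_pass():
    """Asynchronous path: re-examine posted tweets and pull any spam."""
    removed = []
    while review_queue:
        tweet_id = review_queue.popleft()
        if tweet_id in posted and slow_check(posted[tweet_id]):
            del posted[tweet_id]
            removed.append(tweet_id)
    return removed
```

The point of the design is that the expensive test never sits in the posting path: a tweet is delivered as soon as the fast check clears it, and the slower pass cleans up afterward.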