sudofox's journal

Austin Burk's journal, where I share little snippets of my writing, code, and dreams.

Tomorrow marks the start of the Sudomemo Daily Draw!

Sunday will be the first Sudomemo Daily Draw. This daily event, open to anyone from Hatena or Sudomemo, is meant to be a fun way to encourage artists to try new ideas, and see other people's interpretations of the Daily Topic!

 

I'm looking forward to seeing this grow :)

 

If you're reading this, you are absolutely welcome to join in!

 

dailydraw.sudomemo.net

Hatena Haiku Anti-Spam

f:id:austinburk:20181018224616p:plain

I've developed a service called Hatena Haiku Anti-Spam. I've been testing it extensively over the past six months or so, and it's now at the point of extremely high accuracy.

 

History

Hatena has a spam issue. Not a small one, either. It affects most of the services that they provide, especially Haiku, Bookmark, Blog, Anond, Question, and a few more. Fotolife is included, although that is a different kind of spam which I am investigating.

While there have been some attempts at stopping it, they have come in the form of small mitigations, e.g. the Humanity Quiz, which is used as a barrier to initial entry to some services, like Haiku, but with old accounts grandfathered in. I've found that it can be inconsistent in its behavior, and sometimes does not verify a user who has submitted three or more correct answers.

 

The spam posted to Haiku and other services is extremely consistent in nature. I presently work at a managed web hosting company and have had some experience with Apache SpamAssassin, a filtering engine that combines Bayesian and regex-based checks. I decided to retool SpamAssassin to filter Hatena Haiku posts, so I could use the service with less interruption!

From Emails to Haikus

One of the main issues is that SpamAssassin is designed specifically for email filtering; as such, it expects input formatted like an email message.

 

For example (original post):

Delivered-To: site@h.hatena.ne.jp
Received: by sudofox.spam.filter.lightni.ng (Sudofix)
        id 81807923680507219; Fri, 08 Jun 2018 09:12:22 -0400 (EDT)
From: austinburk@h.hatena.ne.jp (Sudofox)
To: site@h.hatena.ne.jp
Subject: 食べた
Content-Type: text/plain; charset=UTF-8
Message-Id: <20180908091222.81807923680507219@sudofox.spam.filter.lightni.ng>
X-Hatena-Fan-Count: 212
Date: Fri, 08 Jun 2018 09:12:22 -0400 (EDT)

食べた=That looks AMAZING QoQ 

 

We have the bare minimum headers for spamd (the service running behind SpamAssassin) to parse it, along with one extra, X-Hatena-Fan-Count.
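To make the wrapping step concrete, here is a minimal Python sketch of turning a Haiku post into an email-shaped message like the one above. It is not the code from the repository: the function name, its parameters, and the `filter.example` domain are assumptions for illustration, the `Received` header is left out, and the `From` header uses the standard `Name <address>` form instead of the `address (Name)` form shown in the example.

```python
from datetime import datetime, timezone
from email.message import EmailMessage
from email.utils import format_datetime, formataddr, make_msgid

SITE_ADDRESS = "site@h.hatena.ne.jp"


def haiku_post_to_message(hatena_id, display_name, keyword, body, fan_count, posted_at):
    """Wrap one Haiku post in email headers so an email filter can parse it.

    All parameter names are hypothetical; only the header names come from
    the example message in the post.
    """
    msg = EmailMessage()
    msg["Delivered-To"] = SITE_ADDRESS
    # formataddr() takes care of quoting and non-ASCII display names.
    msg["From"] = formataddr((display_name, f"{hatena_id}@h.hatena.ne.jp"))
    msg["To"] = SITE_ADDRESS
    msg["Subject"] = keyword  # the Haiku keyword plays the role of the subject
    msg["Message-Id"] = make_msgid(domain="filter.example")  # placeholder domain
    msg["X-Hatena-Fan-Count"] = str(fan_count)
    msg["Date"] = format_datetime(posted_at)
    msg.set_content(body)  # text/plain, UTF-8
    return msg


example = haiku_post_to_message(
    "austinburk", "Sudofox", "食べた", "That looks AMAZING QoQ",
    212, datetime(2018, 6, 8, 13, 12, 22, tzinfo=timezone.utc),
)
```

The resulting message can be serialized and handed to whatever client talks to spamd; that part is deliberately not shown here.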

SpamAssassin user_prefs and Custom Rules 

Here is my user_prefs file. If you start reading after "Begin Sudofox config", you will see some of the rule changes I made to make it more suitable for filtering messages from Haiku, including disabling all email-related checks (e.g. SPF, RBLs, DKIM). There are also some changes to better suit the primary user base of Haiku; these appear above "Begin Sudofox config", but are noted.

You'll also see a number of rules that I developed for SEO spam that is specific to Haiku - usually related to "view sports/television without paying" spam. 

Spam users rarely have any fans at all, which is a metric we can use for filtering. That's what X-Hatena-Fan-Count is for: 

header		SUDO_HATENA_ZERO_FANS X-Hatena-Fan-Count =~ /^0$/
score		SUDO_HATENA_ZERO_FANS 1.0
describe	SUDO_HATENA_ZERO_FANS User has no fans

header		SUDO_HATENA_FEW_FANS X-Hatena-Fan-Count =~ /^([1-9]{1})$/
score		SUDO_HATENA_FEW_FANS -1.0
describe	SUDO_HATENA_FEW_FANS User has 1-9 fans - spam less likely

header		SUDO_HATENA_10_PLUS_FANS X-Hatena-Fan-Count =~ /^([0-9]{2})$/
score		SUDO_HATENA_10_PLUS_FANS -2.0
describe	SUDO_HATENA_10_PLUS_FANS User has 10-99 fans - spam unlikely

header		SUDO_HATENA_100_PLUS_FANS X-Hatena-Fan-Count =~ /^([0-9]{3,})$/
score		SUDO_HATENA_100_PLUS_FANS -10.0
describe	SUDO_HATENA_100_PLUS_FANS User has 100+ fans - prob legitimate

I suppose if they read my blog post, they could adapt, but the actual spam content usually scores much higher than the -2 points they could gain by adding each other as fans.
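The four fan-count rules above can be read as a simple bucketed adjustment. The Python sketch below mirrors the same regular expressions and scores so the effect is easy to try out; the function name and the `content_score` value in the usage note are made up for illustration, and SpamAssassin itself evaluates the rules, not this code.

```python
import re

# (rule name, pattern, score), copied from the rules above.
FAN_RULES = [
    ("SUDO_HATENA_ZERO_FANS", re.compile(r"^0$"), 1.0),
    ("SUDO_HATENA_FEW_FANS", re.compile(r"^([1-9]{1})$"), -1.0),
    ("SUDO_HATENA_10_PLUS_FANS", re.compile(r"^([0-9]{2})$"), -2.0),
    ("SUDO_HATENA_100_PLUS_FANS", re.compile(r"^([0-9]{3,})$"), -10.0),
]


def fan_count_adjustment(header_value):
    """Return (rule name, score) for an X-Hatena-Fan-Count header value.

    The patterns are mutually exclusive, so the first match is the only match.
    A value that matches nothing (e.g. an empty header) adds no score.
    """
    for name, pattern, score in FAN_RULES:
        if pattern.search(header_value):
            return name, score
    return None, 0.0


# Hypothetical example: spam content scoring 8.0 from an account with 42 fans.
content_score = 8.0
_, adjustment = fan_count_adjustment("42")
total = content_score + adjustment  # 8.0 - 2.0 = 6.0
```

With these numbers the total is still 6.0, so a spammer who collects a few dozen fans gains only a 2-point discount against content rules that usually score much higher.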

The source code is here. As it started as an experiment, it is a bit messy; however, it has grown to a service that Hatena users actually use (via the API and UserScript), so I plan to rewrite the entire thing soon.

 

This repository contains only the spam-classification code and spamd-related things. 

 

Website

Hatena Haiku Anti-Spam has two sections which you can browse: The main summary page, and the user information page.

 

f:id:austinburk:20181018104837p:plain

 

The main page has two pie charts, with the one on the left showing recent spam-users, and the one on the right showing the recent legitimate posts. The time period displayed is twenty-four hours, and you can click on any slice of the pie to be taken to the user information page.

 

The user information page has a graph of overall scores for the user, shown below:

f:id:austinburk:20181018105115p:plain

It also has a sample of ten posts and how they were classified. To avoid republishing the actual spam content, I take a sample of the text, store it as a base64-encoded snippet, and render it onto an HTML canvas so that you can view it.
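As a rough illustration of the storage half of that idea, the Python sketch below truncates a post and keeps it as base64 text. The 80-character limit and both function names are assumptions, and the canvas rendering on the page itself is not shown.

```python
import base64


def make_snippet(text, max_chars=80):
    """Keep only the start of a post and store it as base64-encoded UTF-8.

    max_chars is an arbitrary illustrative limit, not the site's real value.
    """
    sample = text[:max_chars]
    return base64.b64encode(sample.encode("utf-8")).decode("ascii")


def read_snippet(encoded):
    """Decode a stored snippet back into text, ready to be drawn for the viewer."""
    return base64.b64decode(encoded).decode("utf-8")
```

Because the stored value is base64 and the page draws it as pixels, the spam text never appears as plain text in the page.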

Here is a recent example (there's been a lot of drug-related spam recently, which is even easier to filter):

f:id:austinburk:20181018105225p:plain

Here is an example of a legitimate message:

f:id:austinburk:20181018105623p:plain

API

Hatena Haiku Anti-Spam has an API that you can use.

The main one in use for spam filtering is the Recent Scores API:

https://haikuantispam.lightni.ng/api/recent_scores.json

This gives the recently-posting Hatena IDs and their scores. Any score above 5 should be considered spam.
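Here is a minimal Python sketch of applying that threshold on the client side. The response shape used here, a flat JSON object mapping each Hatena ID to a numeric score, is an assumption; check the real response before relying on it. The sample IDs are invented.

```python
import json

SPAM_THRESHOLD = 5.0  # "Any score above 5 should be considered spam."


def spam_ids(payload_text, threshold=SPAM_THRESHOLD):
    """Return the Hatena IDs whose score is above the threshold.

    Assumes payload_text is a JSON object of {hatena_id: score}; the real
    API response may be structured differently.
    """
    scores = json.loads(payload_text)
    return sorted(uid for uid, score in scores.items() if float(score) > threshold)


# Invented sample data in the assumed shape.
sample = '{"friendly_user": -9.0, "spam_account": 12.4, "borderline": 5.0}'
blocked = spam_ids(sample)
```

A score of exactly 5 is not treated as spam here, since the rule is "above 5".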

And this one is the Recent Keywords API:

https://haikuantispam.lightni.ng/api/recent_keywords.json

Due to the nature of keywords on Haiku, I can't as easily label a particular keyword as bad or good. The scores here are obtained via averaging the recent scores for posts made under that keyword. Still, it can be useful for updating the keyword list on the right-hand side of the page.
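The averaging described above can be sketched in a few lines of Python. The input format, a list of `(keyword, score)` pairs for recent posts, is an assumption made for the example.

```python
from collections import defaultdict


def keyword_scores(posts):
    """Average the scores of recent posts per keyword.

    posts is an iterable of (keyword, score) pairs; this shape is hypothetical.
    """
    totals = defaultdict(lambda: [0.0, 0])  # keyword -> [sum of scores, post count]
    for keyword, score in posts:
        totals[keyword][0] += score
        totals[keyword][1] += 1
    return {keyword: total / count for keyword, (total, count) in totals.items()}
```

A keyword used by both legitimate users and spammers ends up with a middling average, which is why a keyword score is a weaker signal than a user score.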

UserScript

id:noromanba has developed an excellent UserScript that makes use of the Recent Scores API and filters spam when browsing on Hatena Haiku. I recommend installing it!

(You can use it with Tampermonkey, Violentmonkey, or Greasemonkey. I personally use Tampermonkey, which is compatible with all modern browsers.)

The script is linked on the top of the page:

f:id:austinburk:20181018111344p:plain

Here is a direct link:

https://gist.github.com/noromanba/e485c35ffba606ae8ecacac2c9a8da3c/raw/hatenahaiku-spam-filter.user.js

Rich Content Tags for Slack

This is a limited-use feature, but I noticed that whenever I reported a user to Hatena with a link to Haiku Anti-Spam, I got a little hit from Slackbot (they are using Slack). As such, I added tags to give a summary of the user as classified by Haiku Anti-Spam:
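The post does not say which tags were added. One common way to give Slack a link summary is Open Graph `<meta>` tags, so the Python sketch below assumes that approach. The function name, parameters, and wording of the summary are all invented for illustration.

```python
from html import escape


def summary_meta_tags(hatena_id, average_score, verdict):
    """Build Open Graph meta tags summarizing one user's classification.

    Assumes Open Graph is the mechanism; the site's real tags may differ.
    """
    tags = {
        "og:title": f"{hatena_id} on Haiku Anti-Spam",
        "og:description": f"Classified as {verdict} (average score {average_score:.1f})",
    }
    # escape() also escapes quotes, so values are safe inside attributes.
    return "\n".join(
        f'<meta property="{escape(name)}" content="{escape(value)}">'
        for name, value in tags.items()
    )
```

These tags would go in the `<head>` of the user information page, where a link preview crawler can read them without rendering the charts.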

f:id:austinburk:20181018104347p:plain

 

Going Forward

I've been really busy managing my many different projects, but I will certainly continue to maintain and improve Haiku Anti-Spam. What I'm really hoping for is to work with Hatena to develop an accurate and scalable spam detection system for their services, or at the very least provide an inspiration and an example for how it can be done.

Some of the things I'll be updating with Hatena Haiku Anti-Spam soon are:

- How I store posts used for classification

- Database: Right now I'm using SQLite, as it is portable, but it does not always scale well over time. I can already see Haiku Anti-Spam lagging from time to time (the API caches results and is updated independently of the user's request, and as such isn't subject to the lag). I plan to move the application to MySQL.

- Haiku Anti-Spam backend and site merge: Right now, the only code on GitHub is that which powers the backend, but not the site frontend. Once I complete the move to MySQL, eliminating the direct file-access/paths involved in the service, I plan to put the website portion on GitHub as well.
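The point in the database item about the API not being subject to the lag can be illustrated with a small Python sketch: a scheduled job rebuilds a cache file from the database, and the request handler only ever reads that file. The table name `post_scores`, its columns, and both function names are hypothetical, and this is not the service's actual code.

```python
import json
import os
import sqlite3
import tempfile


def refresh_cache(db_path, cache_path):
    """Rebuild the cached API response. Run by a scheduled job, not a web request."""
    conn = sqlite3.connect(db_path)
    try:
        rows = conn.execute(
            # Hypothetical schema: one row per classified post.
            "SELECT hatena_id, AVG(score) FROM post_scores GROUP BY hatena_id"
        ).fetchall()
    finally:
        conn.close()
    # Write to a temporary file, then swap it in, so readers never see a partial file.
    cache_dir = os.path.dirname(os.path.abspath(cache_path))
    fd, tmp_path = tempfile.mkstemp(dir=cache_dir, suffix=".tmp")
    with os.fdopen(fd, "w", encoding="utf-8") as handle:
        json.dump(dict(rows), handle)
    os.replace(tmp_path, cache_path)


def serve_cached(cache_path):
    """What a request handler would do: return the cached text without touching the database."""
    with open(cache_path, encoding="utf-8") as handle:
        return handle.read()
```

The same split keeps working after a move to MySQL; only the query inside `refresh_cache` would change.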

Can you help?

I'm still learning Japanese, but it is a difficult process. I would appreciate it if someone who is fluent in both languages would translate this article into Japanese, so that I can publish it in both languages.

Please bookmark this blog post!

I love Hatena, and I treasure the people that I've met here. Hatena builds services that promote the community, and that's always been my focus. I hope this contribution will serve that purpose!

Is this my new home..?

I'm moving into a new place (with less expensive rent) and it's already proving nicer than my old place. It has a dishwasher, too! I took over the lease from an ex-colleague who moved out to start his own business; the lease lasts through May of 2019.

 

I'm still working on moving, so there's a lot of boxes..

 

f:id:austinburk:20181002121816j:image

f:id:austinburk:20181002121920j:image

 

Technically, this is my new home, though it doesn't feel like it yet. The strange thing is, neither did my old apartment.

 

I have a sense of wanting to go someplace new, someplace beautiful. I am thankful for my current job, but I feel ready to move on to something different from a customer-facing system administrator (aka support technician).

 

I want to build and design new applications, and develop my skills. I want to do security research, and a lot of other things. I want to support the growth of artists and animators, as well. I can do most anything when I put my heart into it.

 

But right now I don't feel like where I am is the place I need to be. That, I suppose, is why I don't feel at home yet.

 

The important thing is that I treat my living space as my home. It needs to be a place I can look forward to going to at the end of the day; a place of rest and refuge.

 

At the end of the day, wherever God wants me to be, I'll do my best to honor Him by serving my employer to my absolute utmost ability, and do so with humility.

.kwz Flipnote sound format progress

We're one step closer to bringing back the Flipnotes from Flipnote Hatena and Flipnote Gallery: World. Right now the hard part is figuring out the sound format. We had a breakthrough today: originally we thought it was VOX ADPCM, nibbles reversed, at 16 kHz, but there was always a lot of loud background noise. As it turns out, it may actually be bitpacked, where the samples are of a variable bit length to save space. If this is the case, then we should be able to work out how those bit lengths are stored and decode the audio cleanly!
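For readers unfamiliar with the "nibbles reversed" part of the original guess: 4-bit ADPCM packs two codes into each byte, and the question is which half of the byte comes first. The Python sketch below shows only that unpacking step, reading "reversed" as "low nibble first", which is an assumption. It is not a decoder for the .kwz format, whose layout was still being worked out.

```python
def split_nibbles(data, low_first=True):
    """Split each byte into two 4-bit codes.

    low_first=True yields the low nibble before the high nibble, which is
    how "nibbles reversed" is interpreted here (an assumption).
    """
    nibbles = []
    for byte in data:
        high, low = byte >> 4, byte & 0x0F
        nibbles.extend((low, high) if low_first else (high, low))
    return nibbles
```

Getting the order wrong swaps every pair of samples, which typically produces audible noise, so it is one of the first things to rule out when audio decodes badly.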

James, id:rakujira, is heading the work to reverse-engineer the format. (as in, he's doing the real work!)

You can find our documentation of the format here: 

As well as a parser here (Go follow his GitHub!):

Finally, we have a JavaScript .ppm and .kwz player. James reimplemented the audio decoding (ADPCM->WAV) in pure JavaScript, what an absolute legend:

Stuck At the End of Time

"Are you ready?"

They stood upon the cracked and mossy pavement, the ceiling miles above them rumbling as chunks of bright blue rock and cement began crumbling down around the cavern. Thunderous crashes from miles around came as the world began to come to an end.

"Yes," she spoke, tears glistening on her face. They turned their backs against each other, each reaching one hand, intertwining their fingers as a familiar glow began to envelop Grace, and then him. The buildings around them began to crumble as time finally ceased to flow.

There they stood, Grace to the right of Rubigo, their shoulders against each other's as they stared to the North and to the West. Off in the distance behind them, a large chunk of sky was floating in midair, the chunks of rock and cement that had flaked off similarly suspended in the air thousands of feet up. Grace's watches burned brightly with a dazzling light.

f:id:austinburk:20180914121129p:plain

Time had no meaning. In an infinity's worth of other universes, it had rewound itself, but in this one keystone thread of time, it had frozen. And for as long as it had frozen, the damage across the multiverse continued to undo itself.

In this place, they would not stir, they would not wake. But it did not matter so much to them, for they would be holding each other's hand... For not an eternity, but to the end of this time.