sudofox's journal

Austin Burk's journal, where I share little snippets of my writing, code, and dreams.


The Daily Draw is struggling...

I'm a bit disappointed that the Sudomemo Daily Draw has faded somewhat so far.

I was hoping to bank on the nostalgia factor to attract old Hatena users from the days of Flipnote (うごメモ), but that doesn't seem to have worked too well.

I do seem to be receiving a number of new followers from Japan each day, though I have received no entries from the new subscribers. (Currently: 172 subscribers)

The concept seems pretty simple: a daily drawing prompt, meant to be done as a quick doodle. The winner gets a number of Hatena Stars, gets featured, and gets a Bookmark!

I think it could be popular if Hatena users took an interest. I may need to change how I write the articles to make clear that it is open to anyone at all with a Hatena account.

I'm going to keep going. I don't give up on things easily, since good things take hard work to accomplish.

Google is Scanning for (and Crawling) URLs in Your Private YouTube Videos

I was recently uploading an unlisted video to YouTube to demonstrate an XSS vulnerability I had stumbled across and was responsibly disclosing. Part of this involved showing the URL of the script that had been run. After uploading it to YouTube and submitting the vulnerability disclosure, I decided to double-check that nobody had visited the page I was testing on before I had removed the link. As it turns out, somebody had: YouTube.

- - [12/Dec/2018:14:23:40 -0500] "GET /js/redacted_1.js HTTP/1.1" 200 125 "-" "Mozilla/5.0 (compatible; Google-Youtube-Links)"
- - [12/Dec/2018:14:23:42 -0500] "GET /js/redacted_1.js HTTP/1.1" 200 125 "-" "Mozilla/5.0 (compatible; Google-Youtube-Links)"
- - [12/Dec/2018:15:24:21 -0500] "GET /redacted_subfolder/redacted_2.png HTTP/1.1" 200 4605 "-" "Mozilla/5.0 (compatible; Google-Youtube-Links)"
- - [12/Dec/2018:15:24:22 -0500] "GET /redacted_subfolder/redacted_3.png HTTP/1.1" 200 5102 "-" "Mozilla/5.0 (compatible; Google-Youtube-Links)"
- - [12/Dec/2018:15:24:23 -0500] "GET /js/redacted_4.js HTTP/1.1" 200 137 "-" "Mozilla/5.0 (compatible; Google-Youtube-Links)"
- - [12/Dec/2018:15:24:24 -0500] "GET /redacted_subfolder/redacted_2.png HTTP/1.1" 200 4605 "-" "Mozilla/5.0 (compatible; Google-Youtube-Links)"
- - [12/Dec/2018:15:24:26 -0500] "GET /redacted_subfolder/redacted_3.png HTTP/1.1" 200 5102 "-" "Mozilla/5.0 (compatible; Google-Youtube-Links)"
- - [12/Dec/2018:15:24:26 -0500] "GET /js/redacted_4.js HTTP/1.1" 200 137 "-" "Mozilla/5.0 (compatible; Google-Youtube-Links)"

I was rather alarmed to see this, as I didn't imagine the links were up long enough to be crawled by Google. It was then that I realized that during the video, those URLs were visible in the address bar. It seemed that YouTube had run OCR (optical character recognition) across my entire video and decided to crawl the links within. But how could I be sure that this was not just a mistake on my part?

Time for an Experiment

I recorded a new video of myself accessing, for the very first time, a URL that does not exist.

Here is the video that I uploaded:

I started another screen recording of me uploading the video and watching the access logs. A few minutes later, Google took the bait and sent two requests to the URL:

- - [12/Dec/2018:18:42:02 -0500] "GET /nonexistent/url.js HTTP/1.1" 404 - "-" "Mozilla/5.0 (compatible; Google-Youtube-Links)"
- - [12/Dec/2018:18:42:04 -0500] "GET /nonexistent/url.js HTTP/1.1" 404 - "-" "Mozilla/5.0 (compatible; Google-Youtube-Links)"

Hook, line, and sinker! I recorded me uploading the video and watching my access logs live (the accesses are around the 5:50 mark):
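If you want to check your own server for visits from this crawler, grepping the access log for the user-agent string is enough. A minimal sketch using a synthetic log file (the path and entries below are made up for illustration, not my real logs):

```shell
# Write a small synthetic Apache-style access log to filter.
cat > /tmp/access_sample.log <<'EOF'
- - [12/Dec/2018:18:42:02 -0500] "GET /nonexistent/url.js HTTP/1.1" 404 - "-" "Mozilla/5.0 (compatible; Google-Youtube-Links)"
- - [12/Dec/2018:18:42:05 -0500] "GET / HTTP/1.1" 200 512 "-" "Mozilla/5.0"
EOF
# Count hits from the YouTube link crawler by its user-agent string:
grep -c 'Google-Youtube-Links' /tmp/access_sample.log
```

On a live server you would point the same grep (or `tail -f` piped into it) at your real log file instead.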

Why is this concerning?


The purpose for which I uploaded the video was to report a vulnerability. I uploaded it unlisted, so for all intents and purposes, it was meant to remain private. However, our friend Google-Youtube-Links scanned it for an unknown purpose and sent several requests to that URL. A second test with a fully private (not just unlisted) video produced the same result.

By uploading the videos as unlisted or private, I have the expectation that nobody will see the video or the links contained within except for me and the people I explicitly share the links with.

Let's propose a scenario which is in a similar realm to what I was doing:

A security researcher has found a critical vulnerability in a site and has crafted a URL that will trigger it, causing harmful effects to the website (e.g. a SQL injection that drops database tables).

During the video, they mention that they will not visit the URL themselves, as it would cause damage, but the URL is displayed so that the company they are responsibly disclosing to can remedy the issue. They upload the video as unlisted to YouTube and submit their report. Five minutes later, Google-Youtube-Links comes along and sends two requests to the URL, triggering the SQL injection and breaking the site.

The Illusion of Privacy 

Here is Google's explanation of privacy settings:

What this does not mention, however, is that your video will be scanned for anything resembling URLs, and that those URLs will then be crawled.

What Google has to say about it

So what does Google have to say about this practice? Nothing at all, actually. Searching for the user-agent gives no relevant results save for one: a locked thread from a curious webmaster, dated March 27th, 2018, which went unanswered and unresolved (!topic/webmasters/Ov_ODO8l2cU).

This means we are left with no explanation of why this is occurring, and no disclosure that content uploaded as private to YouTube will be scanned with OCR and have any links within crawled by Google.

Honestly, I find this rather unsettling, especially since I was using private and unlisted YouTube videos as a quick way to upload a vulnerability disclosure. I'm sure you can think of other scenarios in which this would be undesirable, especially as we don't know why it's taking place or where those URLs will end up.

Let me know what you think of this development.

Questions or concerns?

If you have any questions or concerns, feel free to leave them below, or reach out to me. If you are a Hatena user, feel free to leave a star!


In English: Hatena Haiku Anti-Spam - austinburk's blog







The spam posted to Haiku and other services is very consistent in nature. I currently work for a web hosting management company, and I have experience with combining Bayesian filters with Apache SpamAssassin, a regular-expression-based filtering engine. So I decided to add filtering of Hatena Haiku posts to SpamAssassin, to make the service more pleasant to use.





Received: by (Sudofix)
        id 81807923680507219; Fri, 08 Jun 2018 09:12:22 -0400 (EDT)
From: (Sudofox)
Subject: 食べた
Content-Type: text/plain; charset=UTF-8
Message-Id: <>
Date: Fri, 08 Jun 2018 09:12:22 -0400 (EDT)

食べた=That looks AMAZING QoQ 


The headers contain the minimal content needed for analysis by spamd (the daemon that SpamAssassin runs in the background), plus X-Hatena-Fan-Count.


Here is the user_prefs file. As you can see after "Begin Sudofox config", I have modified the rules, for example disabling all email-specific checks (such as SPF, RBL, and DKIM), to make them better suited to filtering messages from Haiku. I have also made other changes (above "Begin Sudofox config", though not annotated) to better fit Haiku's primary user base.



header		SUDO_HATENA_ZERO_FANS X-Hatena-Fan-Count =~ /^0$/
describe	SUDO_HATENA_ZERO_FANS User has no fans

header		SUDO_HATENA_FEW_FANS X-Hatena-Fan-Count =~ /^([1-9]{1})$/
describe	SUDO_HATENA_FEW_FANS User has 1-9 fans - spam less likely

header		SUDO_HATENA_10_PLUS_FANS X-Hatena-Fan-Count =~ /^([0-9]{2})$/
score		SUDO_HATENA_10_PLUS_FANS -2.0
describe	SUDO_HATENA_10_PLUS_FANS User has 10-99 fans - spam unlikely

header		SUDO_HATENA_100_PLUS_FANS X-Hatena-Fan-Count =~ /^([0-9]{3,})$/
score		SUDO_HATENA_100_PLUS_FANS -10.0
describe	SUDO_HATENA_100_PLUS_FANS User has 100+ fans - prob legitimate
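For illustration, the fan-count bands that those four header regexes carve out can be exercised in plain shell. The loop below is just a sketch of the matching logic and is not part of the SpamAssassin config itself:

```shell
# Show which band each X-Hatena-Fan-Count value falls into,
# using the same regexes as the rules above (illustrative only).
for count in 0 5 42 250; do
  if   echo "$count" | grep -qE '^0$';         then band="ZERO_FANS"
  elif echo "$count" | grep -qE '^[1-9]$';     then band="FEW_FANS"
  elif echo "$count" | grep -qE '^[0-9]{2}$';  then band="10_PLUS_FANS"
  elif echo "$count" | grep -qE '^[0-9]{3,}$'; then band="100_PLUS_FANS"
  fi
  echo "$count -> $band"
done
```

Note that the bands are mutually exclusive: a two-digit count matches only the 10_PLUS rule, never the 100_PLUS one, because each regex is anchored on both ends.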




































- How posts used for classification are stored

- Database: I currently use SQLite because it is portable, but it may not scale well over time. The Haiku anti-spam system already shows occasional lag (the API caches results and runs updates independently of user requests, so no lag occurs there). I plan to migrate the application to MySQL.

- Merging the Haiku anti-spam backend with the site: currently, the code on GitHub is only for the backend, not for the site's frontend. Once the migration to MySQL is complete and direct access to (and direct paths of) the files involved in the service have been removed, I plan to put the website portion on GitHub as well.


Thank you to id:PlumAdmin for the translation of the original article. I am very grateful.



Introducing sudofox/blockdevdiff

I made something new today!

This tool is meant to help with resuming an interrupted disk clone (e.g. dd). It is extremely efficient for volumes of any size; it was written to resume an operation on a 24 TB volume after a crash.


Usage: ./ </dev/source_device> </dev/target_device> <starting offset> <jump size> <sample size> [email address to notify]
[INFO]        ===== Block Device Differ =====
[INFO]        sudofox/blockdevdiff
[INFO]        This tool is read-only and makes no modifications.
[INFO]        When the rough point of difference is found, reduce the jump size,
[INFO]        raise the starting offset, and retest until you have an accurate
[INFO]        offset (measured in bytes).
[INFO]        Recommended sample size: 1024 (bytes)
[INFO]        Starting time:        Fri Nov 30 21:34:10 EST 2018
[INFO]        Source device:        testfile1.bin
[INFO]        Target device:        testfile2.bin
[INFO]        Starting at offset:   0
[INFO]        Jump size:        100
[INFO]        Sample size:      10
[INFO]        testfile1.bin is not a block device
[INFO]        testfile2.bin is not a block device
[INFO]        Starting...
[PROGRESS]    Offset 25000 | Source 6fbb8d...| Target 858a9a...
[INFO]        Sample differed at position 25000, sample size 10 bytes
======== FOUND DIFFERENCE ========
Ending time: Fri Nov 30 21:34:12 EST 2018
Found difference at offset 25000
SOURCE_SAMPLE_HASH = 6fbb8d9e8669ba6ea174b5011c97fe80
TARGET_SAMPLE_HASH = 858a9a2907c7586ef27951799e55d0e8

Translating to a dd

I am not responsible if you destroy your data doing this.

Let's say we started with the following command:

dd if=/dev/source_device_here of=/dev/target_device_here bs=128k status=progress

Somewhere between 16 and 19 terabytes into the process, your server crashes. Perhaps your RAID card overheated. Now what?

Well, we can use our handy blockdevdiff tool to find out roughly where the data starts to differ. Choose a starting offset and jump size proportional to how big your volume is; all arguments to blockdevdiff are in bytes.

Start big, using a jump size of ~50 GB or so. When you start getting different data, set your starting offset to the point where the difference appeared minus the jump size, reduce the jump size, and run it again.
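Each probe in that loop boils down to reading a small sample from both devices at the same offset and comparing hashes. Here is a toy sketch of a single probe, with two temp files standing in for the devices (the file names and contents are made up, and I'm assuming md5-style sample hashes like the SOURCE_SAMPLE_HASH/TARGET_SAMPLE_HASH values shown in the output):

```shell
# Two small files stand in for the source and target block devices.
printf 'AAAAAAAAAA' > /tmp/src.bin
printf 'AAAAABAAAA' > /tmp/tgt.bin   # differs at byte 5

OFFSET=0    # where to sample, in bytes
SAMPLE=10   # sample size, in bytes

# Read SAMPLE bytes at OFFSET from each "device" and hash them.
s=$(dd if=/tmp/src.bin skip="$OFFSET" count="$SAMPLE" bs=1 2>/dev/null | md5sum | cut -d' ' -f1)
t=$(dd if=/tmp/tgt.bin skip="$OFFSET" count="$SAMPLE" bs=1 2>/dev/null | md5sum | cut -d' ' -f1)

if [ "$s" = "$t" ]; then
  echo "samples match at offset $OFFSET"
else
  echo "samples differ at offset $OFFSET"
fi
```

Because only a small sample is read at each offset, each probe is cheap even on a multi-terabyte volume; the cost is proportional to the number of probes, not the volume size.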

[INFO]        ===== Block Device Differ =====
[INFO]        sudofox/blockdevdiff
[INFO]        This tool is read-only and makes no modifications.
[INFO]        When the rough point of difference is found, reduce the jump size,
[INFO]        raise the starting offset, and retest until you have an accurate
[INFO]        offset (measured in bytes).
[INFO]        Recommended sample size: 1024 (bytes)
[INFO]        Starting time:        Fri Nov 30 21:43:08 EST 2018
[INFO]        Source device:        /dev/sda
[INFO]        Target device:        /dev/sdb
[INFO]        Starting at offset:   17003360000000
[INFO]        Jump size:        1000000000
[INFO]        Sample size:      100
[INFO]        Starting...
[PROGRESS]    Offset 17074360000000 | Source 684146...| Target 6d0bb0...
[INFO]        Sample differed at position 17074360000000, sample size 100
======== FOUND DIFFERENCE ========
Ending time: Fri Nov 30 21:43:28 EST 2018
Found difference at offset 17074360000000
SOURCE_SAMPLE_HASH = 68414605a320573a0f9ad1c8e71ab013
TARGET_SAMPLE_HASH = 6d0bb00954ceb7fbee436bb55a8397a9

Keep going until you get close enough to a starting point which is reasonable for your volume's size.

Once you have your number, round it down generously. I rounded mine down a few hundred gigabytes just to be sure: it's better to start too early than too late.

Here is your new command (DO NOT COPY AND PASTE)

dd if=/dev/source_device_here of=/dev/target_device_here bs=128K conv=notrunc seek=XXXXXXXXX skip=XXXXXXXXXXX iflag=skip_bytes oflag=seek_bytes status=progress

if: input file (e.g. a device file like /dev/sda)

of: output file

Apparently conv=notrunc doesn't make any real difference for actual block devices, but leave it in anyway.

If you are using this on VM images stored on another filesystem, then you DEFINITELY want it.

Pass iflag=skip_bytes and oflag=seek_bytes so that the offsets are interpreted in bytes instead of blocks, which makes things less confusing overall.

seek: dictates the position in the target (output) device at which to start writing

skip: dictates the position in the source (input) device at which to start reading

Since both devices are identical up to the crash point, seek and skip should be the same!

status=progress: so you can actually see what dd is doing
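Before pointing this at real disks, you can convince yourself that the byte-offset flags behave as described by dry-running the resume on two small files (everything below is synthetic; the file names are made up):

```shell
# Simulate an interrupted clone: the first 5 bytes were copied, the rest were not.
printf '0123456789' > /tmp/resume_src.bin
printf '01234XXXXX' > /tmp/resume_dst.bin

# Resume at byte 5 on both sides. skip_bytes/seek_bytes make skip= and seek=
# count bytes rather than bs-sized blocks; notrunc keeps the existing
# output contents intact instead of truncating the file.
dd if=/tmp/resume_src.bin of=/tmp/resume_dst.bin bs=2 conv=notrunc \
   seek=5 skip=5 iflag=skip_bytes oflag=seek_bytes 2>/dev/null

cat /tmp/resume_dst.bin   # now identical to the source
```

The same shape of command, with your device paths and your rounded-down offset in place of 5, is the real resume.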

"Email when done" functionality

This requires a mail server (e.g. Exim, Postfix) to be installed so that the "mail" binary will function. In cases where you need to find a very specific offset on a very large volume, you can pass one final argument containing an email address.

You will be emailed when blockdevdiff has finished.

I got the last 8TB drives that I needed for the Flipnote Hatena archive!



Happy Thanksgiving! I had a lovely time with my family.


I don't really do Black Friday shopping, but I did make one stop: I now have the final three 8 TB drives I need to build a second, actually redundant RAID array to copy the Flipnotes onto from the RAID-0 array! This is every Flipnote from Flipnote Hatena and Flipnote Gallery: World. The three drives (and a $5 St. Jude donation) cost me $640.97; with parts, bandwidth, and hardware, the archive project has cost me almost $2,000 so far.