The website thispersondoesnotexist.com generates a photorealistic image of a face every time you refresh the page. I won't pretend to know how it generates them (machine learning is dark magic to me), but my understanding is that a new face is generated per page visit. A quick test shows this isn't quite true - refresh fast enough, and you'll get the same image twice in a row. My best guess is that a new face is generated after each request, to be served as the next one. Refresh too fast, and it serves the most recent available image. Anyway, I started to wonder if you would ever see the same face twice. TL;DR: You probably won't. But here's how I figured it out.
My criteria for "the same face" would be better phrased as "the same image". A simple test is to hash each image, and keep a running list of these hashes. Same hash already in the list? It's a duplicate.
So, I needed a way to download images, and a way to hash them. There are easy ways to do both in Bash.
wget is a simple command line tool to download anything available on the public internet, up to mirroring entire websites for offline browsing.
Luckily, thispersondoesnotexist just returns a single .png image, rather than an HTML page, so wget https://thispersondoesnotexist.com is all that's needed to download an image. Run it again, and you'll get a new image.
However, by default wget names the file according to the filename the server provides, so I needed a way to give every image a unique name. I knew I would be implementing this whole thing in a loop, so it was trivial to add a counter and number the files sequentially.
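#!/bin/bash
x=0
while :
do
    # -O (capital) sets the output filename; lowercase -o would
    # write wget's log messages to that file instead of the image
    wget https://thispersondoesnotexist.com/image -O "$x.png"
    ((x++))
done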
Running this as-is will work, but you'll get a lot of duplicate images, since wget will make several requests per second on a decent connection. Add sleep 2 to the loop to slow it down.
wget has an option, --random-wait, but this only works when wget is making more than one request in the same session. As we're calling wget afresh on every iteration of the loop, it doesn't help here.
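Next, we need to hash our file, and save that hash to a text file. We can assign the hash of the file to a variable with md5=($(md5sum $x.png)). md5sum prints the hash followed by the filename, so the outer parentheses store the output as an array, and a plain $md5 expands to its first element: the hash on its own. Holding the hash in a variable lets us check it against the existing ones before deciding whether to write it to the file, which prevents duplicates being recorded.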
Then we need to check whether we've already saved this hash. As mentioned, we're keeping a list of hashes, so we can just search that file for the hash of the most recent download with grep. I did this like so:
if grep -Fxq "$md5" hashes.txt
then
    echo DUPLICATE FOUND
    rm "$x.png"
else
    echo "$md5" >> hashes.txt
    ((x++))
fi
Now, we only record the hash and increment the counter if the image is not a duplicate.
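The three grep flags are doing most of the work in that test, so here is a minimal sketch of their effect on a throwaway hash list. The is_known function and the temporary file are hypothetical names used only for this example.

```bash
#!/bin/bash
# A throwaway hash list containing one known hash.
hashes=$(mktemp)
printf '%s\n' "b1946ac92492d2347c6235b8d2611184" > "$hashes"

# -F: treat the pattern as a fixed string, not a regular expression
# -x: match only if the whole line equals the pattern
# -q: print nothing, just report the result through the exit status
is_known() {
    grep -Fxq "$1" "$hashes"
}

# A full hash that is in the list counts as a duplicate.
if is_known "b1946ac92492d2347c6235b8d2611184"; then full="duplicate"; else full="new"; fi

# A prefix of that hash does not, because -x demands a whole-line match.
if is_known "b1946ac9"; then partial="duplicate"; else partial="new"; fi

echo "full hash: $full"
echo "prefix:    $partial"

rm -f "$hashes"
```

Without -x, the prefix would also match, which wouldn't matter for fixed-length MD5 hashes but is a good habit when a file is being used as a lookup list.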
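#!/bin/bash
x=0
touch hashes.txt    # make sure the hash list exists before the first grep
while :
do
    wget https://thispersondoesnotexist.com/image -O "$x.png"
    # md5sum prints "<hash>  <filename>"; as an array, $md5 is just the hash
    md5=($(md5sum "$x.png"))
    if grep -Fxq "$md5" hashes.txt
    then
        echo DUPLICATE FOUND
        rm "$x.png"
    else
        echo "$md5" >> hashes.txt
        ((x++))
    fi
    sleep 2
done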
So, we're pretty much done. I actually made a few QoL improvements, such as counting existing files and setting the counter before the loop, so it could resume without overwriting files. I also moved duplicates to their own directory rather than deleting them.
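Those improvements aren't shown in the post, so the following is only a sketch of how they might look, not the script that was actually run. The function names next_index and handle_image, the duplicates directory, and the choice to name quarantined files after their hash are all assumptions. It uses local throwaway files instead of downloads so it can be run safely.

```bash
#!/bin/bash
# Count the existing N.png files in a directory. Because the script numbers
# files 0, 1, 2, ... without gaps, that count is also the next free index.
next_index() {
    local dir=$1 files
    shopt -s nullglob               # an empty directory yields an empty array
    files=("$dir"/*.png)
    echo "${#files[@]}"
}

# Record a new image's hash, or move a duplicate into <dir>/duplicates.
# Returns 0 for a new image and 1 for a duplicate.
handle_image() {
    local file=$1 dir=$2 hash
    hash=$(md5sum "$file" | cut -d ' ' -f 1)
    if grep -Fxq "$hash" "$dir/hashes.txt" 2>/dev/null; then
        mkdir -p "$dir/duplicates"
        mv "$file" "$dir/duplicates/$hash.png"
        return 1
    fi
    echo "$hash" >> "$dir/hashes.txt"
}

# Demo: pretend a previous run already saved two distinct images.
work=$(mktemp -d)
printf 'a' > "$work/0.png"; handle_image "$work/0.png" "$work"
printf 'b' > "$work/1.png"; handle_image "$work/1.png" "$work"

# Resume: pick up the counter where the previous run stopped.
x=$(next_index "$work")
echo "resuming at $x"

# The next "download" repeats the first image, so it gets quarantined
# and the counter stays put.
printf 'a' > "$work/$x.png"
if handle_image "$work/$x.png" "$work"; then
    ((x++))
else
    echo "DUPLICATE FOUND"
fi
echo "counter is now $x"
```

Naming quarantined files after their hash means repeated duplicates of the same image overwrite each other, which keeps one copy per distinct duplicate. The temporary directory is left in place so the result can be inspected.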
Over the next few days, I downloaded 50,029 images, and didn't find a single duplicate. During testing, I found that too short a sleep did result in the script detecting and rejecting duplicates, so I'm pretty sure the detection works and it didn't simply miss any.
The total size of the image set is 57.4 GB, with the smallest image being 423 KB, and the largest being 1720 KB.
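Figures like these can be gathered with a short loop. The sketch below is one way to do it, not necessarily how the numbers above were produced. The size_stats function is a hypothetical name, and the demo runs on three tiny stand-in files rather than real images.

```bash
#!/bin/bash
# Print "<total> <smallest> <largest>" in bytes for the .png files in a directory.
size_stats() {
    local dir=$1 total=0 min="" max=0 size f
    for f in "$dir"/*.png; do
        [ -f "$f" ] || continue         # skip the literal pattern if nothing matches
        size=$(wc -c < "$f")            # file size in bytes
        total=$((total + size))
        if [[ -z $min ]] || (( size < min )); then min=$((size)); fi
        if (( size > max )); then max=$((size)); fi
    done
    echo "$total ${min:-0} $max"
}

# Demo on three stand-in files of 3, 10 and 5 bytes.
demo=$(mktemp -d)
printf 'abc'        > "$demo/0.png"
printf 'abcdefghij' > "$demo/1.png"
printf 'abcde'      > "$demo/2.png"

read -r total smallest largest < <(size_stats "$demo")
echo "total: $total bytes, smallest: $smallest, largest: $largest"
# prints "total: 18 bytes, smallest: 3, largest: 10"

rm -rf "$demo"
```

Spawning wc once per file is slow across tens of thousands of images, but it only has to run once and it avoids relying on options that differ between systems.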
I also used ffmpeg to stitch 43,000 of the images into a 60 FPS video, which you can watch here. The video is quite strobe-like, so maybe give it a miss if you're sensitive to flashing images.
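The exact ffmpeg command isn't given here, so the following is only a guess at what such an invocation could look like. It assumes the frames are named 0.png, 1.png, and so on in the current directory, that ffmpeg was built with the libx264 encoder, and that the output is called faces.mp4, which is a made-up name. The command only runs if ffmpeg and the first frame are actually present; otherwise it is just printed.

```bash
#!/bin/bash
# -framerate 60     read the image sequence at 60 frames per second
# -start_number 0   the first frame is 0.png
# -i '%d.png'       frames are numbered 0.png, 1.png, 2.png, ...
# -frames:v 43000   stop after writing 43000 video frames
# -c:v libx264      encode as H.264 (assumes ffmpeg was built with libx264)
# -pix_fmt yuv420p  a pixel format most players can decode
cmd=(
    ffmpeg
    -framerate 60
    -start_number 0
    -i '%d.png'
    -frames:v 43000
    -c:v libx264
    -pix_fmt yuv420p
    faces.mp4
)

# Show the command that would be run.
printf '%q ' "${cmd[@]}"
echo

# Only run it when ffmpeg is installed and the first frame exists.
if command -v ffmpeg >/dev/null 2>&1 && [ -f 0.png ]; then
    "${cmd[@]}"
fi
```

Building the command as an array keeps each argument intact when it is executed, and makes it easy to tweak one option at a time.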