Analysis of images on the site

Last posted Dec 28, 2015 at 10:27PM EST. Added Dec 27, 2015 at 05:50PM EST
11 posts from 4 users

(Note: Nerdy stuff)

So, yesterday, I decided I would use the ability to get random images to attempt to determine the composition of the images on this site – as in, how many of each kind of image are there? This came to mind while tagging images, as I found a few that were untagged, improperly tagged, and/or out of their proper gallery, meaning that attempting to find out how many of a kind there are through the search wouldn't be 100% accurate. I started out small, with a sample size of 100 images, and only 18 categories for images (if I really get going I'd expect to have at least 25 categories and 500 images).

My methods:
I used the ability to get a random image on the site to record 100 images under 18 categories, being Demotivational Posters, RWBY, Pokemon, Super Smash Bros., Undertale, JoJo's Bizarre Adventure, Steven Universe, MLP, Reaction Images, Tumblr, Hentai Quotes, Rage Comics, Fanart, Twitter, Image Macro, and an Other category for images that don't fit into any of the others. I also recorded the number assigned to each image in the URL.

A few people in the IRC expressed concerns about a random sample of such a small size, so I included two methods to check the reliability. The first was averaging the numbers assigned to the images. If the average of all the image numbers wasn't close to 500k, then we'd know the sample was skewed toward either older or newer images. The second was taking some of the larger galleries and comparing their actual share of the site's images with the percentages I got from the sample.

Numbers:
(Note that the numbers add up to above 100% because images could be added to multiple categories.)

  • 41% of images found did not fit into any of the categories of interest.
  • 31% of the images were fanart.
  • 25% of the images were MLP related.
  • 14% were image macros.
  • 5% were Pokemon related.
  • Super Smash Bros., JoJo's Bizarre Adventure, Steven Universe, and Tumblr each were related to 2% of the images found.
  • Reaction Images, Hentai Quotes, and Rage Comics each were related to 1% of the images found.
  • The rest of the categories had 0%.
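
To put rough error bars on the percentages above, here is a minimal sketch of a 95% Wilson score interval for a sample of 100. The function name `wilson_interval` is just an illustrative choice, and the whole thing assumes every image was equally likely to be drawn, which is exactly what the accuracy tests are trying to check.

```python
import math

def wilson_interval(hits, n, z=1.96):
    """Return (low, high) Wilson score bounds for the proportion hits / n."""
    p = hits / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half

# Counts taken from the percentages listed above (sample size 100).
for label, hits in [("Fanart", 31), ("MLP", 25), ("Image macro", 14), ("Pokemon", 5)]:
    low, high = wilson_interval(hits, 100)
    print(f"{label}: {hits}% (roughly {low:.1%} to {high:.1%})")
```

With only 100 images, the 25% MLP figure comes with an interval of roughly 18% to 34%, so the smaller categories are better read as ballpark figures.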

Accuracy tests:

  • MLP, Hentai Quotes, Steven Universe, Tumblr, Super Smash Bros., Pokemon, RWBY, and Undertale got about the percentage expected. All is good there.
  • The average of the numbers came out to 578479, meaning it had a slight bias toward newer images. This means older ones, specifically rage comics, demotivational posters, and image macros, were likely slightly underrepresented.
  • Reaction images were 2% lower than the minimum expected, which I'm guessing was caused by the newer image bias.
  • I'm only human, and this is my first time doing something like this. It's unlikely but possible I mislabeled something, and each mislabel would shift a result by a whole percentage point.
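
One way to judge whether an average of 578479 is a "slight" bias is to compare it with what pure chance would allow. The sketch below is only an illustration: it assumes image numbers are spread uniformly from 0 to 1,000,000 (an assumption implied by the "close to 500k" expectation, not a confirmed figure), and `mean_id_z_score` is a made-up helper name.

```python
import math

def mean_id_z_score(sample_mean, n, max_id):
    """How many standard errors the sample's mean ID sits from the midpoint.

    Assumes IDs are drawn uniformly from 0..max_id, so the expected mean is
    max_id / 2 and the standard deviation of one draw is max_id / sqrt(12).
    """
    expected = max_id / 2
    std_error = max_id / math.sqrt(12 * n)
    return (sample_mean - expected) / std_error

# 578479 and 100 come from the post; max_id = 1,000,000 is an assumption
# implied by the "close to 500k" expectation.
z = mean_id_z_score(578479, 100, 1_000_000)
print(f"z = {z:.2f}")
```

Under those assumptions the result comes out around 2.7 standard errors, which is more than chance alone would usually produce, so the lean toward newer images is worth watching in the next run.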

So those are my findings thus far. This is just the preliminary test, which I think went pretty well, so no hard numbers about predicting the number of a certain type of image quite yet.

If you have any suggestions for what I should do next time, I'd be glad to hear them. As I said I'm new to this, so I might not be getting something right.

Mod edit: Removed personal info

Last edited Aug 20, 2016 at 02:03AM EDT

Great job. Glad to see MLP still on top, sad to see macros so low

Next could you do one on properly tagged NSFW and Spoilers? I see too many images that aren't properly tagged, i.e. tagged when they shouldn't be.

Also, just wondering: when going through images in random mode, how many NSFW and Spoiler warnings did you see? My guess is 0 if you just go through images without looking at thumbnails.

Last edited Dec 27, 2015 at 06:08PM EST

Ayy, we can be KYM statistics bros /)/)/)
This is pretty interesting, but like the IRC suggested, a larger sample size would've been nice.
When it comes to certain fandoms, you can just divide the number of images in their respective galleries by the total number of images on the site to get a percentage. However, stuff like fanart and image macro percentages is really nice to see.

I don't see them normally, but I can check each one at the cost of some time. Moreover, "properly tagged NSFW and Spoilers" is subjective, so you'll have to take that into account. I can do it though if you want.

Ayy /)/)/)
Yeah, I plan to get a larger sample size next time.
While I could do that, it would slightly underestimate it, as crossovers would only get counted as one or the other. This is basically what I did, though, when comparing my results to what I expected. The MLP gallery makes up about 22ish percent of all the images on the site, for example, so I expected somewhere around that.
Asdfghjkl suggested fanart and image macros, and I plan to expand and refine the list next time. If anyone has any other suggestions for things like that, feel free to tell me.

Last edited Dec 27, 2015 at 06:29PM EST

Yeah, I know NSFW and Spoilers are subjective, but sometimes I'm just like "WTF???" at what some people are tagging. Also, I know that in "random mode viewing" and "left and right viewing" you never see the warnings; they are only visible in the thumbnails. I would just love to see some stats on how well these warnings actually work (or don't) to show mods that a filter system similar to the "pony filter" would work better. But that's just wishful thinking to try and get other users and mods aboard to make this a better site.

Also what about stats on user uploads? Like in random mode of 500 images who uploaded the most images?

Yeah, me too about how sometimes they tag it stupidly. What exactly do you mean by the filter system? That doesn't seem like the easiest to implement either, but it'd certainly be cool to have it.

Given the sheer number of users who uploaded images, and that the random image thing seems to be fairly evenly distributed (with a possible lean towards newer images), I doubt even a selection of a thousand images would give meaningful information on who uploads the most images, so I don't think I'll add that. I will add NSFW and Spoiler though to the next attempt (250 images).

1) Cool project.

2) Accurate data gathering patterns are CRITICAL! No matter how many trials you do, you'll never get accurate results if your data collection changes the accurate distribution in any way.
I wouldn't suggest the random image button; it's probably fine, but there COULD be algorithms in place to show more popular images slightly more often. It wouldn't really be scandalous either, just a good business decision. I'd suggest using a "random number generator" online (technically it's pseudorandom, but that's at least as good as the kym image generator which would also be pseudorandom at best and a skewed pseudorandom distribution at worst) and get numbers from 0 to 1100000 (just ignore any numbers that are too high) and then get the image that corresponds to it at https://knowyourmeme.com/photos/[number goes here] .
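
The random-number approach described above could be sketched like this. It only builds URLs and does not fetch anything; the 1,100,000 upper bound and the URL pattern are the ones quoted in the post, while `random_photo_urls` and the seed are illustrative choices.

```python
import random

BASE_URL = "https://knowyourmeme.com/photos/"  # URL pattern quoted in the post
MAX_ID = 1_100_000  # upper bound suggested in the post

def random_photo_urls(count, seed=None):
    """Pick `count` distinct image numbers uniformly and build their URLs."""
    rng = random.Random(seed)
    ids = rng.sample(range(MAX_ID + 1), count)
    return [f"{BASE_URL}{image_id}" for image_id in ids]

for url in random_photo_urls(5, seed=2015):
    print(url)
```

Any number that turns out not to have an image behind it (too high, or deleted) would simply be skipped and replaced with a fresh draw.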

Well, that's it. Have fun!

Last edited Dec 27, 2015 at 11:16PM EST

Roy G. Biv wrote:

1) Cool project.

2) Accurate data gathering patterns are CRITICAL! No matter how many trials you do, you'll never get accurate results if your data collection changes the accurate distribution in any way.
I wouldn't suggest the random image button; it's probably fine, but there COULD be algorithms in place to show more popular images slightly more often. It wouldn't really be scandalous either, just a good business decision. I'd suggest using a "random number generator" online (technically it's pseudorandom, but that's at least as good as the kym image generator which would also be pseudorandom at best and a skewed pseudorandom distribution at worst) and get numbers from 0 to 1100000 (just ignore any numbers that are too high) and then get the image that corresponds to it at https://knowyourmeme.com/photos/[number goes here] .

Well, that's it. Have fun!

1) Thanks!
2) I'll consider it. The next run is also supposed to be a test run, with 250 images. I'll do the random image button during that one, and if it continues to have a notable bias towards newer images (it doesn't seem to have a bias to more popular ones) I'll switch to the random number generator. Assuming I end up wanting to do 500 images, of course. It'd take about 4 days to gather all that up, so it wouldn't be a tiny project.

Also, I saw the large thing of math in there before you edited it out. I can see why you removed it, but I'll ask anyways – so what you meant was that increasing numbers of images, even when multiplying by ten, will start giving less and less marginal accuracy (is that even a term?), correct? So, the jump from 250 to 500 might not even increase accuracy as much as 100 to 150, or if it does, not by a whole lot. I'll keep that in mind; probably won't wanna go any higher than 1000, if for some reason I decide to go that far.
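
The diminishing-returns question can be checked with the usual standard error of a sample proportion, sqrt(p(1 - p) / n). This is a rough sketch, not the math that was edited out of the earlier post: the 25% share is a hypothetical value picked to be close to the MLP result, and the 1.96 factor gives an approximate 95% margin.

```python
import math

def standard_error(p, n):
    """Standard error of a sample proportion p estimated from n images."""
    return math.sqrt(p * (1 - p) / n)

p = 0.25  # hypothetical true share, close to the MLP figure
for n in (100, 150, 250, 500, 1000):
    print(f"n = {n:4d}: about +/- {1.96 * standard_error(p, n):.1%}")
```

The margin shrinks with the square root of the sample size, so ten times as many images only cuts it to about a third (1/sqrt(10) is roughly 0.32). By this formula, going from 250 to 500 buys about the same improvement as going from 100 to 150.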

Yeah, sorry for the confusion. I realized after I posted that your experiment would result in more of a binomial distribution (only yes or no possible in every trial) rather than a distribution you'd get from rolling some dice. So basically, my numbers were off because of that.

I believe the same reasoning still applies, but I can't prove it off the top of my head and it was getting late. I feel like my reasoning is sound, and my stats teachers have consistently said that an accurate test is more important than tons of trials. But, yeah, I'd have to do a bit more research to be 100% sure. That said, it may well be that the standard deviation is still logarithmic, but it may be "less slow" e.g. 10x the trials means 1/500 average standard deviation, instead of roughly 1/3 like I calculated it would if you somehow managed to sort every image from most to least in each quality and then, after picking randomly, recorded the position of the image you picked in that ranking, instead of just picking a random image and saying that a quality did or didn't exist.

Lastly, there's a chance that the number could be skewed towards newer numbers if deleted images never have their number "taken again" by a new image. If that's the case, then if it's true that older images are more likely to be removed (since they have more time for a creator to find it, or for someone to realize something objectionable was hidden in the image, or a rule change suddenly makes some images no longer appropriate, etc.) then the skewing isn't really "intentional", and would show up if every non-deleted image had an equal chance of being chosen, unless you did like only 2 trials or got astronomically unlikely results or something.
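
The deleted-image effect can be illustrated with a toy calculation. Everything numeric here is made up for the sake of the example: the sketch assumes numbers run from 0 to 1,000,000, that deleted numbers are never reused, and that the chance an image is still up rises in a straight line from 60% for the oldest to 100% for the newest.

```python
def surviving_mean_id(max_id, keep_oldest, keep_newest):
    """Mean ID of surviving images when survival odds rise linearly with ID.

    keep_oldest / keep_newest are the hypothetical chances that the oldest
    and newest images are still on the site; deleted IDs are never reused.
    """
    total_weight = 0.0
    weighted_ids = 0.0
    for image_id in range(max_id + 1):
        keep = keep_oldest + (keep_newest - keep_oldest) * image_id / max_id
        total_weight += keep
        weighted_ids += keep * image_id
    return weighted_ids / total_weight

# Made-up survival rates: 60% of the oldest images remain, 100% of the newest.
mean_id = surviving_mean_id(1_000_000, 0.6, 1.0)
print(f"mean surviving ID: {mean_id:,.0f} (midpoint would be 500,000)")
```

With these invented rates the mean lands around 542,000, so a perfectly fair draw from the images that still exist can average above the midpoint without the random button favoring anything.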

Good luck!

Last edited Dec 28, 2015 at 01:18AM EST

Mom Rivers wrote:

Yeah, me too about how sometimes they tag it stupidly. What exactly do you mean by the filter system? That doesn't seem like the easiest to implement either, but it'd certainly be cool to have it.

Given the sheer number of users who uploaded images, and that the random image thing seems to be fairly evenly distributed (with a possible lean towards newer images), I doubt even a selection of a thousand images would give meaningful information on who uploads the most images, so I don't think I'll add that. I will add NSFW and Spoiler though to the next attempt (250 images).

you can filter MLP images but for some reason they say they can't do it for other tags

I got a new set of 100 images analyzed with new categories today. I'm planning to finish it with 150 more tomorrow so as to get better results. There are some concerns about whether it's really reliable anyway, so I might have to start over if it turns out the random image function isn't quite random or just decided to give me a bad batch.

Mod edit: Removed personal info

Last edited Aug 20, 2016 at 02:03AM EDT