Comparison of image assessment accuracy of various solutions
The test set contained only images that left no doubt as to their classification as appropriate or inappropriate. Each image was pre-classified by five human moderators, and we used only images that all five classified identically. No ‘grey zone’ images were included in the set.
The test involved twenty thousand random images uploaded to the Internet by users. Half of the set was appropriate and the other half inappropriate.
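For clarity, the sketch below shows how such an error rate can be computed, assuming ‘error rate’ means the fraction of the 20,000 test images whose automatic classification differs from the unanimous human label. The function name and data layout are illustrative only and are not part of any service’s API.

```python
# Illustrative only: error rate of a moderation service on a balanced,
# human-labelled test set. `predictions` and `labels` are parallel lists
# of booleans (True = inappropriate), one entry per test image.
def error_rate(predictions, labels):
    assert len(predictions) == len(labels)
    errors = sum(1 for p, y in zip(predictions, labels) if p != y)
    return errors / len(labels)

# Example: with 20,000 images and 100 misclassifications this returns 0.005,
# i.e. a 0.5% error rate:
# print(f"{error_rate(preds, human_labels):.1%}")
```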
The test was designed to evaluate our system as rigorously as possible:
- No test image had been presented to our system before, and none was used in the training process. Some of them, however, might have been seen by our competitors, which puts us at a disadvantage.
- For some competing solutions, we adjusted the detection threshold so that they achieved the best possible results; in all such cases, this is noted in the ‘Notes’ column of the table (see the sketch after this list).
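The sketch below illustrates how such a threshold could be tuned. It assumes the competing service returns a confidence score in [0, 1] per image (higher meaning more likely inappropriate), which is then compared against a cut-off; the scoring function and data are hypothetical.

```python
# Illustrative threshold sweep: pick the cut-off that minimises the error
# rate on the labelled test set, giving the competing service its best case.
def best_threshold(scores, labels, steps=100):
    """Return (threshold, error_rate) with the lowest error rate."""
    def err(t):
        preds = [s >= t for s in scores]
        return sum(p != y for p, y in zip(preds, labels)) / len(labels)

    best = min((i / steps for i in range(steps + 1)), key=err)
    return best, err(best)

# Example:
# threshold, rate = best_threshold(service_scores, human_labels)
```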
Note: results for other sets of images depend on their source; they will be affected by the subject matter of the website, the average age of its users, and so on.
The test set we used can be considered a difficult one. You should get similar or better results for other
sets.
| Service name | Error rate | Notes |
|---|---|---|
| xModerator | 0.5% | |
| Moderatecontent | 3.5% | |
| Google Cloud | 4.8% | The "possible" result was treated as unsafe. Otherwise, the error rate was 8.6%. |
| Yandex | 5.8% | |
| Clarifai | 7.1% | |
| Sightengine | 11.1% | The threshold was set to 0.2. The error rate with the original value of 0.5 was 19.3%. |
| Microsoft Azure | 11.8% | The threshold was set to 0.15. The error rate with the original value of 0.5 was 33.2%. |
| Picpurify | 16.2% | |
| Amazon AWS | 20.0% | |
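As the Google Cloud note indicates, a service that returns a categorical likelihood rather than a numeric score has to be mapped to a binary safe/unsafe decision. A minimal sketch of such a mapping is shown below, assuming SafeSearch-style likelihood labels; the exact label names are assumptions for illustration.

```python
# Sketch: map a categorical likelihood label (SafeSearch-style) to a binary
# decision. Treating "POSSIBLE" as unsafe is the stricter variant reflected
# in the Google Cloud row above.
UNSAFE_LIKELIHOODS = {"POSSIBLE", "LIKELY", "VERY_LIKELY"}

def is_unsafe(likelihood, treat_possible_as_unsafe=True):
    if not treat_possible_as_unsafe and likelihood == "POSSIBLE":
        return False
    return likelihood in UNSAFE_LIKELIHOODS
```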
Selected examples from the test set: