Commit f0bcaca5 authored by Martin Mareš

Avoid mis-identifying certain PDF files as UTF-8 text

Closes #110.
parent 4350dace
@@ -96,12 +96,17 @@ def is_utf8(f) -> bool:
     f.seek(0)
     x = io.TextIOWrapper(f, encoding='utf-8', errors='strict')
     try:
-        x.read(4096)
+        head = x.read(4096)
+        if head.startswith('%PDF-'):
+            # PDF files are expected to contain non-ASCII bytes at their beginning,
+            # but surprisingly enough, these bytes sometimes decode as UTF-8.
+            verdict = False
+        else:
+            verdict = True
     except UnicodeDecodeError:
-        x.detach()
-        return False
+        verdict = False
     x.detach()
-    return True
+    return verdict
 
 
 # Presentation of score
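
To see why the extra check is needed, here is a minimal, self-contained sketch of the patched `is_utf8` together with three synthetic inputs. The sample byte strings (`pdf_like`, `plain_text`, `binary`) are made up for illustration and do not come from issue #110. The `if`/`else` from the diff is condensed into a single expression, which should behave the same way.

```python
import io


def is_utf8(f) -> bool:
    """Return True if the first 4096 characters of f decode as UTF-8
    and the content does not look like a PDF file."""
    f.seek(0)
    x = io.TextIOWrapper(f, encoding='utf-8', errors='strict')
    try:
        head = x.read(4096)
        # A PDF header may be followed by high bytes that happen to
        # form valid UTF-8, so the magic string is checked explicitly.
        verdict = not head.startswith('%PDF-')
    except UnicodeDecodeError:
        verdict = False
    # Detach so that the wrapper does not close f when it is discarded.
    x.detach()
    return verdict


# Hypothetical sample inputs (assumptions, not taken from the issue):
# a PDF-like header whose binary comment is valid UTF-8,
pdf_like = b'%PDF-1.7\n%\xc3\xa4\xc3\xb6\xc3\xbc\xc3\x9f\n1 0 obj\n<< >>\nendobj\n'
# ordinary UTF-8 text with non-ASCII characters,
plain_text = 'Příliš žluťoučký kůň\n'.encode('utf-8')
# and bytes that are not valid UTF-8 at all.
binary = b'\xff\xfe\x00\x01'

# The PDF-like sample decodes cleanly, which is what fooled the old code.
pdf_like.decode('utf-8')

print(is_utf8(io.BytesIO(pdf_like)))
print(is_utf8(io.BytesIO(plain_text)))
print(is_utf8(io.BytesIO(binary)))
```

With the old code, the first sample would have been reported as UTF-8 text, because nothing in it raises `UnicodeDecodeError`. The single exit path through `verdict` also means `x.detach()` is called exactly once on every branch.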