diff --git a/09-find_duplicates/task.md b/09-find_duplicates/task.md
index 49536435ff16b4be23a85913025f7526839e2c83..fa21e0dd839b9931bd41a71fad3020e266900c98 100644
--- a/09-find_duplicates/task.md
+++ b/09-find_duplicates/task.md
@@ -15,7 +15,7 @@ tests works only on Linux (and not on Windows), and of course also in ReCodEx.
 You can use full standard library of Python and C++ in this assignment,
 including data structure implementations (also, `bytearray` might come handy).
 Your solution must also work on other input data of the same size with similar
-number of duplicates. Hence solutions depending on the fact that each string is
+number of duplicates. Hence, solutions depending on the fact that each string is
 uniquely determined by some its substring or similar properties of the input
 will not be accepted.
 
@@ -25,3 +25,11 @@ Note that due to the space constraints of the Python solutions, tests `10M` and
 not used and are always considered successful by ReCodEx.
 
 Source code templates can be found in [git](https://gitlab.kam.mff.cuni.cz/datovky/assignments/-/tree/master).
+
+Hints:
+* The array `[False] * 2**20` requires approximately 8 MB, since Python stores it as an array of pointers to the single value `False`. Use `bytearray` instead.
+* Read the documentation of `bytearray` carefully and distinguish between the terms bit and byte.
+* In Python, do not import numpy or other libraries whose import alone consumes more memory than is available.
+* The memory limit prevents storing all keys, so trivial solutions that store all keys in a dictionary are a waste of time.
+* Count the number of duplicates and the number of duplicate candidates. For properly implemented hashing, these two numbers should be very close.
+* Use profilers to trace memory usage; see e.g. https://docs.python.org/3/library/tracemalloc.html or https://valgrind.org/.
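
For illustration, the bit-array hint added above could look roughly like the sketch below. It is not the reference solution: the choice of hash function (`blake2b`), the bitmap size, and reading keys line by line from standard input are assumptions made only for this example, and candidates flagged by the bitmap still need a second pass to separate real duplicates from hash or bit collisions.

```python
# A minimal sketch of the bytearray-as-bit-array idea from the hints above.
# NOT the reference solution: hash choice, bitmap size and stdin input are
# illustrative assumptions only.
import sys
import hashlib

BITS = 1 << 26                  # 2**26 bits -> bytearray of 2**23 bytes (8 MiB)
bitmap = bytearray(BITS >> 3)   # one byte stores eight bits

def test_and_set(h):
    """Return True if bit h was already set; otherwise set it."""
    byte, mask = h >> 3, 1 << (h & 7)
    if bitmap[byte] & mask:
        return True
    bitmap[byte] |= mask
    return False

candidates = 0
for line in sys.stdin.buffer:
    key = line.rstrip(b"\n")
    # Any reasonably uniform hash works; blake2b is just one readily available choice.
    h = int.from_bytes(hashlib.blake2b(key, digest_size=8).digest(), "big") % BITS
    if test_and_set(h):
        candidates += 1         # a duplicate key, or a hash/bit collision

# With a well-sized bitmap the candidate count stays close to the true number
# of duplicates; the remaining false positives must still be verified separately.
print("duplicate candidates:", candidates, file=sys.stderr)
```

To check the memory footprint of such a sketch, the standard-library profiler mentioned in the last hint can be used directly, e.g. `tracemalloc.start()` before the loop and `tracemalloc.get_traced_memory()` afterwards.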