Is the V.I. Lenin on-line archive (HTML = actual) a character-by-character copy of all forty-five volumes of Lenin Collected Works (books = expected) printed in Moscow during the 1960s and early 1970s?
A database is created by typing the first line of every paragraph from every text in every volume of the Collected Works into a text file (the expected structure). The text file is then read into a data dictionary by a Python script (see below).
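As a rough illustration of this step, the sketch below reads such a text file into a dictionary. The record layout (volume, title, and first line separated by tabs), the function name load_expected, and the sample entries are all assumptions made for the example; the real file format may differ.

```python
from collections import defaultdict


def load_expected(lines):
    """Build {(volume, title): [first_line, ...]} from an iterable of lines.

    Assumed (hypothetical) record format: volume<TAB>title<TAB>first line.
    Blank lines and lines starting with '#' are ignored.
    """
    expected = defaultdict(list)
    for raw in lines:
        line = raw.rstrip("\n")
        if not line.strip() or line.startswith("#"):
            continue
        volume, title, first_line = line.split("\t", 2)
        expected[(int(volume), title)].append(first_line)
    return dict(expected)


# Placeholder records, not real text from the Collected Works.
sample = [
    "# volume<TAB>title<TAB>first line\n",
    "1\tSample Text A\tOpening words of paragraph one\n",
    "1\tSample Text A\tOpening words of paragraph two\n",
    "\n",
    "2\tSample Text B\tOpening words of another text\n",
]
expected = load_expected(sample)
```

Because load_expected accepts any iterable of lines, an open file object can be passed in directly, for example: with open(path, encoding="utf-8") as f: expected = load_expected(f).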
Next, the HTML files referenced by lenin/works/cw/index.htm are read into a Python data dictionary (the actual structure).
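One way this step might look, using only the standard library's html.parser, is sketched below. The class name ParagraphCollector, the helper functions, and the assumption that each paragraph lives in a <p> element are illustrative guesses, not a description of the actual script or of the archive's markup.

```python
from html.parser import HTMLParser


class ParagraphCollector(HTMLParser):
    """Collect the whitespace-normalized text of every <p> element."""

    def __init__(self):
        super().__init__()
        self.paragraphs = []
        self._buffer = None  # None while outside a <p> element

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self.flush_paragraph()  # a new <p> implicitly closes an open one
            self._buffer = []

    def handle_endtag(self, tag):
        if tag == "p":
            self.flush_paragraph()

    def handle_data(self, data):
        if self._buffer is not None:
            self._buffer.append(data)

    def flush_paragraph(self):
        if self._buffer is not None:
            text = " ".join("".join(self._buffer).split())
            if text:
                self.paragraphs.append(text)
            self._buffer = None


def extract_paragraphs(html):
    """Return the list of paragraph texts found in one HTML document."""
    parser = ParagraphCollector()
    parser.feed(html)
    parser.close()
    parser.flush_paragraph()  # catch a final <p> that was never closed
    return parser.paragraphs


def build_actual(pages):
    """Map each page name to its paragraphs; pages is {name: html_text}."""
    return {name: extract_paragraphs(html) for name, html in pages.items()}
```

Inline markup such as <em> inside a paragraph is ignored, so only the running text is kept, which is what a comparison against typed first lines needs.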
Then the two data structures (expected and actual) are compared.
Finally, anything that is missing or does not match is reported.
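The comparison and reporting steps could be sketched roughly as follows. This assumes both dictionaries share the same keys and that a paragraph "matches" when its normalized text begins with the typed first line; the function names, the problem labels, and the sample data are hypothetical.

```python
def normalize(text):
    """Collapse runs of whitespace so line wrapping does not affect matching."""
    return " ".join(text.split())


def compare(expected, actual):
    """Return a list of (problem, key, paragraph_index) tuples.

    expected: {key: [typed first line of each paragraph, ...]}
    actual:   {key: [full paragraph text from the HTML, ...]}
    """
    problems = []
    for key, first_lines in expected.items():
        if key not in actual:
            problems.append(("missing text", key, None))
            continue
        paragraphs = actual[key]
        for i, first_line in enumerate(first_lines):
            if i >= len(paragraphs):
                problems.append(("missing paragraph", key, i))
            elif not normalize(paragraphs[i]).startswith(normalize(first_line)):
                problems.append(("mismatch", key, i))
    for key in actual:
        if key not in expected:
            problems.append(("unexpected text", key, None))
    return problems


# Placeholder data, not real text from the Collected Works.
expected = {"text-a": ["Opening words"], "text-b": ["Another opening"]}
actual = {"text-a": ["Opening   words of the paragraph, continued."]}

for problem, key, index in compare(expected, actual):
    print(problem, key, index)
```

Paragraphs are compared by position, so a single paragraph dropped from the HTML makes every later paragraph in that text report as a mismatch. A more forgiving version could realign with difflib.SequenceMatcher from the standard library.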