OCR re-releases of leaked data

M R

May 8, 2022 • 2 min read

In the months since Putin's aggression against Ukraine began, several groups have released leaked data from Russian companies and from companies working for or with Russia. Most of these leaks are distributed as torrents, the main source being https://ddosecrets.com/

The problem with large heterogeneous data sets

Most of these leaks have one thing in common: they can be hard to work with. One reason is their size, which sometimes reaches terabytes. Another is that they mix many different file formats, which makes them hard to search.

The OCR re-releases address these shortcomings by converting all of the leaked documents to plain text. In the original leak, a scanned PDF or TIFF document might sit inside a RAR attachment of an e-mail, within a PST archive, which is itself inside a ZIP archive. In the OCR re-release, it is a plain text file in a directory tree that mirrors the original structure, e.g. part-1.zip/[email protected]/3210.eml/пропаганда.rar/путин=хуйло.txt

The OCR was done using tesseract.
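
As a rough illustration of that step, the sketch below calls the Tesseract command line tool on a single image. The language setting rus+eng, the helper names and the output layout are assumptions made for this example, not details of the actual re-release. It only covers image inputs and needs Tesseract with the matching language data installed.

```python
import subprocess
from pathlib import Path


def tesseract_command(image: Path, out_dir: Path, languages: str = "rus+eng") -> list[str]:
    """Build the CLI call `tesseract IMAGE OUTPUTBASE -l LANGS`.

    The default language pair is an assumption for this sketch.
    """
    # Tesseract appends ".txt" to the output base on its own.
    output_base = out_dir / image.stem
    return ["tesseract", str(image), str(output_base), "-l", languages]


def ocr_image(image: Path, out_dir: Path) -> Path:
    """Run Tesseract on one image and return the path of the text file."""
    out_dir.mkdir(parents=True, exist_ok=True)
    subprocess.run(tesseract_command(image, out_dir), check=True)
    return out_dir / (image.stem + ".txt")
```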

This way, simple tools like grep are enough to search the leaks for information.
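
For anyone who prefers a script over grep, here is a minimal sketch of the same idea: walk the extracted tree and print every line containing a search term, ignoring case. The function name and the assumption that all text files end in .txt and are UTF-8 encoded are mine.

```python
from pathlib import Path
from typing import Iterator


def search_tree(root: Path, needle: str) -> Iterator[tuple[Path, int, str]]:
    """Yield (file, line number, line) for every line containing `needle`."""
    needle_folded = needle.casefold()
    for path in sorted(root.rglob("*.txt")):
        if not path.is_file():
            continue
        # errors="replace" keeps the scan going if a file is not valid UTF-8.
        with path.open(encoding="utf-8", errors="replace") as handle:
            for number, line in enumerate(handle, start=1):
                if needle_folded in line.casefold():
                    yield path, number, line.rstrip("\n")


if __name__ == "__main__":
    import sys

    # Usage: python search.py <release directory> <search term>
    for path, number, line in search_tree(Path(sys.argv[1]), sys.argv[2]):
        print(f"{path}:{number}: {line}")
```

This is roughly what a recursive, case-insensitive grep with line numbers does, just easier to extend later.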

How were the leaks processed?

Archives (zip, rar, pst, eml) were extracted and became directories with their content unpacked.
PDF, TIFF, JPG and PNG images were processed with OCR and converted to plain text.
MS Office documents (doc, docx, xls, xlsx) were converted to plain text.
Plain text files and SQL dumps were copied without any modification.
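
The four rules above boil down to a dispatch on the file extension. The sketch below only shows that decision, not the extraction or conversion itself. The extra spellings .tif and .jpeg, the use of .txt and .sql for plain text and SQL dumps, and the "unknown" fallback are assumptions for illustration.

```python
from pathlib import Path

# Extension groups taken from the list above; ".tif", ".jpeg", ".txt" and
# ".sql" are assumed spellings, not confirmed details of the real pipeline.
ARCHIVES = {".zip", ".rar", ".pst", ".eml"}
OCR_INPUTS = {".pdf", ".tiff", ".tif", ".jpg", ".jpeg", ".png"}
OFFICE_DOCS = {".doc", ".docx", ".xls", ".xlsx"}
PLAIN_TEXT = {".txt", ".sql"}


def classify(path: Path) -> str:
    """Return the processing step that applies to `path`."""
    suffix = path.suffix.lower()
    if suffix in ARCHIVES:
        return "extract"
    if suffix in OCR_INPUTS:
        return "ocr"
    if suffix in OFFICE_DOCS:
        return "convert"
    if suffix in PLAIN_TEXT:
        return "copy"
    # What happened to other file types is not stated, so flag them.
    return "unknown"
```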

Next steps?

To take this further, one could transliterate the Cyrillic documents to the Latin alphabet, translate them from Russian to English, and load all of them into a database like Elasticsearch for more comfortable searching.
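
Transliteration is the easiest of these to try out. The sketch below uses a simple, ad hoc letter table for Russian that I picked for readability; it is not a standard scheme such as ISO 9, and it drops the hard and soft signs entirely.

```python
# Ad hoc Russian-to-Latin table (an assumption, not a standard scheme).
_LOWER = {
    "а": "a", "б": "b", "в": "v", "г": "g", "д": "d", "е": "e", "ё": "yo",
    "ж": "zh", "з": "z", "и": "i", "й": "y", "к": "k", "л": "l", "м": "m",
    "н": "n", "о": "o", "п": "p", "р": "r", "с": "s", "т": "t", "у": "u",
    "ф": "f", "х": "kh", "ц": "ts", "ч": "ch", "ш": "sh", "щ": "shch",
    "ъ": "", "ы": "y", "ь": "", "э": "e", "ю": "yu", "я": "ya",
}

# str.translate expects code points as keys; add capitalised variants too.
_TABLE = {ord(cyr): lat for cyr, lat in _LOWER.items()}
_TABLE.update({ord(cyr.upper()): lat.capitalize() for cyr, lat in _LOWER.items()})


def transliterate(text: str) -> str:
    """Replace Russian Cyrillic letters, leaving all other characters as-is."""
    return text.translate(_TABLE)
```

With this table, transliterate("пропаганда") gives "propaganda", so a search for the Latin spelling would also hit the Cyrillic original.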

Merging the original leaks, with their binary files, into the OCR releases is entirely doable. One could cross-reference the data indexed in Elasticsearch with the original data. Say there is passport data in a plain text file and you want to see the passport itself. For OSINT, this kind of traceability looks like an absolute must-have: it lets us navigate the data set properly, verify the information and possibly dig deeper.
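
Because the directory tree mirrors the original nesting, a path in the OCR release already tells you which containers to open in the original leak. The sketch below splits such a path into those steps. It assumes that directory names keep the original archive names including their extensions, and it does not guess the original name of the final file. The sample path in the usage note is hypothetical.

```python
from pathlib import PurePosixPath
from typing import NamedTuple

CONTAINER_SUFFIXES = {".zip", ".rar", ".pst", ".eml"}


class Step(NamedTuple):
    name: str
    kind: str  # "archive" or "directory"


def trace(ocr_path: str) -> tuple[list[Step], str]:
    """Split an OCR release path into container steps and the text file name."""
    *directories, leaf = PurePosixPath(ocr_path).parts
    steps = []
    for name in directories:
        is_archive = PurePosixPath(name).suffix.lower() in CONTAINER_SUFFIXES
        steps.append(Step(name, "archive" if is_archive else "directory"))
    return steps, leaf
```

For a hypothetical path such as part-1.zip/mail/3210.eml/scan.rar/page.txt, this yields the archives part-1.zip, 3210.eml and scan.rar in the order they have to be opened, with mail marked as a plain directory and page.txt as the text file to look up.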

Links to the releases made so far