
File Integrity

Let's start by answering the question: what is file integrity, and why does it matter? Is this something you should be concerned about?

What is file integrity?

File integrity is a confirmed assertion that a file has not been altered without your knowledge.

Most people “trust” that their files on the computer or in the cloud are safe from modification. But can you prove it? Could your OneDrive file be changed without you knowing? When was the last time you checked? After all, you're storing your data on a third-party system out in the cloud; what control do you really have over who accesses those files and what changes are made?

Why does file integrity matter?

Without verifiable file integrity you cannot trust that the data in the file is true and accurate. So in a nutshell, file integrity is critical.

If you make decisions based on the data you have digitally stored, then knowing that data is true and accurate is of paramount importance.

In our example, suppose you had a file that you created last month. It contains a spreadsheet of your bank account balances. You’re interested in tracking this information more closely. Each month you will add your balances to the file so you can see how your accounts are growing over time.

Imagine your computer was infected with a virus, such as a Trojan macro virus, that embedded itself into that spreadsheet. Unless your antivirus software caught the malware (in which case the file would not have been infected), it would not be apparent just by looking at the file that it had been changed.

This is an example of a JPG image that has been corrupted but still loads.

The second scenario is not nefarious at all. You have been diligently performing regular backups. Unbeknownst to you, sometime during the month something happened that corrupted the file (a power failure, bad storage media). A period of time goes by before you access the file and realize it's corrupt. No problem, right? You'll just restore from backup. The restore goes smoothly, but the file you just restored is a copy of the corrupt file.

Wouldn’t it be fantastic to know a file has been corrupted long before the good copy ages out of your backup?

This situation happens fairly commonly with accountants, who often access their data only once a year. Consider old digital family photos, home videos, or documents. When corruption happens on archival media, we call it bit rot: a degradation of the storage media that causes file corruption.

What can I do to determine if my files have been changed?

We need a way to validate the contents of the file against a known good copy. If A is the file to check and B is the copy, and A = B, then A has integrity. It's easy when they match. Now imagine they do not. If A doesn't match B, then A could be bad, or is it B that is bad? I know what you're saying: B is my known good copy. But how do you know your known good copy isn't corrupt? Did you verify B? As you can see, this can get a bit out of hand.
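The A-versus-B comparison above can be sketched with a standard byte-for-byte comparison. This is only a sketch: the paths and file contents below are hypothetical, and `cmp` is the POSIX tool for this job.

```shell
#!/bin/sh
# Sketch: compare file A against its copy B byte-for-byte.
# The files and their contents are hypothetical examples.
printf 'Jan balance: 1000\n' > /tmp/fileA.txt
cp /tmp/fileA.txt /tmp/fileB.txt   # B starts as an exact copy of A

# cmp -s exits with status 0 when the files are identical.
if cmp -s /tmp/fileA.txt /tmp/fileB.txt; then
    echo "A = B: the file has integrity"
else
    echo "A != B: one of the files has changed"
fi
```

Of course, this only tells you that the two files differ, not which one is the bad copy, which is exactly the problem described above.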

File A: Original, without corruption

We can determine whether A or B is the good file by using a file “fingerprint.” We refer to these fingerprints as file hashes. By keeping a list of file hashes, you can compare a historical hash to the current file's hash months later to determine whether the file has changed. Creating a list of file hashes has several benefits.

First, a hash is a digital fingerprint of a file. If a single piece of data in the file has changed (through corruption or malice), the calculated hash will be different from the stored one, regardless of the file's metadata.
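As a sketch of that fingerprint property, here is what happens when a single character of a file changes. The paths and contents are made up; `sha256sum` ships with GNU coreutils on most Linux systems.

```shell
#!/bin/sh
# Two files that differ by a single character produce entirely
# different SHA-256 hashes. Contents are illustrative.
printf 'Account balance: 1000\n' > /tmp/original.txt
printf 'Account balance: 1001\n' > /tmp/altered.txt

# Print "hash  filename" for each file; compare the hashes by eye.
sha256sum /tmp/original.txt /tmp/altered.txt
```

Even though only one digit changed, the two hashes share no obvious resemblance, so a stored hash makes any later change to the file detectable.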

File B: Corrupted. Corruption can happen to any file (document, image, video, etc.) and render it unreadable. What would you do if this were your son's hospital delivery room photo?

Second, keeping a B copy of the file (in backups or otherwise) is a good idea. The file's hash lets you validate the backed-up copy as the correct, unaltered, uncorrupted original. Note: hashes cannot be used to recover files, so good backups are still required.

Third, for long-term archival we store two copies of the data (on different media) along with a file hash catalogue. The catalogue is simply a file containing a list of hashes for all the data stored on the media. This lets us identify which files on which media have suffered corruption and restore them from the other media.
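A minimal sketch of such a catalogue, again using `sha256sum` (the directory layout and filenames here are hypothetical stand-ins for archival media):

```shell
#!/bin/sh
# Build a hash catalogue for an archive directory, then verify it later.
# /tmp/archive and its contents are hypothetical examples.
mkdir -p /tmp/archive
printf 'photo data\n'  > /tmp/archive/photo.jpg
printf 'ledger data\n' > /tmp/archive/ledger.xlsx

# Generate the catalogue: one "hash  filename" line per file.
( cd /tmp/archive && sha256sum photo.jpg ledger.xlsx > catalogue.sha256 )

# Months later: verify every file against the stored catalogue.
# Any corrupted file is reported as FAILED, and the command exits nonzero.
( cd /tmp/archive && sha256sum -c catalogue.sha256 )
```

In practice you would keep the catalogue alongside (or separate from) both copies of the archive, so a failed check on one medium tells you to restore that file from the other.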

File hashes are fast to generate, easy to automate, and a fantastic tool in the toolbox for ensuring your files maintain their integrity.

For a quick and dirty PowerShell script for Windows that you can use to generate, store, and verify your file “fingerprints” (hashes), check out our post here. The equivalent Linux Bash shell script is here.

Check out more of our How-To’s for additional great tips like this one.
