Aim: To remove sensitive strings, such as hard-coded credentials or IDs, from the history of a git repo.
This blog post has come about as a result of yet another lesson learned from working with Azure Data Factory. I had a private repo in GitHub for a personal Data Factory project that I wanted to make public. Being diligent, I decided to review the code that I (or rather the Data Factory platform) had committed before changing the visibility of the repo, and I found that the tenantId and principalId from my Azure subscription were stored in code. I’d prefer that not to be the case, and I certainly don’t want them to be public. Using BFG Repo-Cleaner to strip the data from the git history isn’t a terribly difficult task, but the instructions on the tool’s website are a little sparse, so I had to remind myself of the process below.
Note: I am performing all these steps on Windows 10.
Setup
Install git if you haven’t done so already. This will give you access to git Bash, which is not necessarily available to you if you’re using a GUI for git. Additionally, clone the affected git repo to your local machine.
Since BFG Repo-Cleaner relies on Java, download and install that if you haven’t already got it (paying attention to the licensing terms).
Create a folder for your work and then download BFG Repo-Cleaner to that folder. I named my folder HistoryFix. Note: Don’t store the affected git repo in this folder. We will mirror the repo as part of modifying the history in a later step.
Clean up the latest commit
The first step is to remove the Azure tenantId and principalId GUIDs from the Data Factory files and commit that change to git. At minimum, they’ll be stored in the .json file in the factory folder, but you may want to search all files. The change can be made by editing the file directly (e.g. in a text editor). I’ve replaced the GUIDs with:
***REMOVED***
This only removes the GUIDs from the most recent commit. However, BFG Repo-Cleaner protects the most recent commit (HEAD) by default and won’t modify it, so we have to take this step to manually strip the sensitive data from it.
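To check that no GUID was missed before committing, the working tree can be searched for anything GUID-shaped. The following is a minimal sketch for Git Bash, assuming GNU grep; the function name find_guids is hypothetical, and the pattern matches any GUID, not just tenantId and principalId, so expect some harmless hits.

```shell
# Hypothetical helper: list every line under a folder that contains a GUID-shaped string.
find_guids() {
  # -r: recurse, -n: show line numbers, -I: skip binary files, -E: extended regex.
  # --exclude-dir=.git keeps git's own object store out of the results.
  grep -rnIE --exclude-dir=.git \
    '[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}' "$1"
}
```

Running find_guids . from the root of the local repo prints each match as path:line:content. No output means nothing GUID-shaped is left in the working tree.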
Clean up the git history
Create a .txt file and save it to the HistoryFix folder. I named my file replacements.txt and populated it with the tenantId and principalId GUIDs. As BFG Repo-Cleaner will read the file to determine how to replace the sensitive strings, the file should be structured in the following way:
STRING1 #Replace "STRING1" with "***REMOVED***" (default)
STRING2==>newSTRING #Replace with "newSTRING" instead
STRING3==> #Replace with an empty string
For both tenantId and principalId, I opted for the default option to correspond to my clean-up of the most recent commit.
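For the default option, the file can also be created from Git Bash rather than a text editor. This is a minimal sketch: both GUIDs are placeholders rather than real values, and the variable names are hypothetical. Each line holds only the bare string, on the assumption that this is the safest form for BFG to parse.

```shell
# Placeholder GUIDs (not real values); substitute your own tenantId and principalId.
TENANT_ID='00000000-0000-0000-0000-000000000001'
PRINCIPAL_ID='00000000-0000-0000-0000-000000000002'

# One string per line with no "==>" suffix, so the default replacement applies.
printf '%s\n' "$TENANT_ID" "$PRINCIPAL_ID" > replacements.txt
```

Run this from the HistoryFix folder so that replacements.txt lands next to the BFG .jar file.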
Launch git Bash and cd to the HistoryFix folder. Run the following command to mirror clone the affected git repo to this folder, replacing <my account> and <my repo> with your own account and repo info.
git clone --mirror https://github.com/<my account>/<my repo>.git
The HistoryFix folder should now contain the following:
- The BFG .jar file
- The replacements.txt file
- A mirror of the git repo
Open Command Prompt and cd to the HistoryFix folder. Run the following command, replacing <bfg file> with the relevant BFG file name (mine was bfg-1.14.0) and <my repo> with the name of the mirror repo.
java -jar <bfg file>.jar --replace-text replacements.txt <my repo>.git
The output of the BFG replace command will detail the changes made. If you have not taken the steps to remove the sensitive strings from the most recent commit, it will also warn you about that. In that case, I suggest deleting the mirror repo and starting again from the “Clean up the latest commit” section above.
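Before pushing anything, it may be worth confirming locally that the strings are gone from every commit that a ref still points to. The sketch below is one way to do that from Git Bash; the function name commits_containing is hypothetical, and because it greps the full tree of every commit it could be slow on a large repo.

```shell
# Hypothetical helper: print every commit, reachable from any ref, whose tree
# still contains the given literal string. Prints nothing when history is clean.
commits_containing() {
  repo=$1
  needle=$2
  git -C "$repo" rev-list --all | while read -r rev; do
    # -q: quiet, exit 0 on the first match; -F: treat the needle as a literal string.
    if git -C "$repo" grep -qF -e "$needle" "$rev"; then
      echo "$rev"
    fi
  done
}
```

For example, commits_containing "<my repo>.git" "<tenantId GUID>" run from the HistoryFix folder should print nothing once the replacement has worked. Any commit hashes it prints still contain the string.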
Note: I’m not sure if this next garbage collection step is needed for replacing strings (as opposed to deleting files or blobs, which BFG Repo-Cleaner can also do). However, running it doesn’t appear to cause any harm to the repo. Please feel free to add a comment if you have knowledge on this!
Return to git Bash and run the following to force git to garbage-collect inside the mirror repo. The reason behind this is that BFG Repo-Cleaner has updated the commits and all branches and tags so that they are clean, but it hasn’t physically deleted the unwanted data.
cd <my repo>.git
git reflog expire --expire=now --all && git gc --prune=now --aggressive
Run the following to force push the repo back to GitHub.
git push -f
The push output included some errors because git failed to update hidden refs. On investigation, this appears to be related to closed Pull Requests on my repo: GitHub exposes each PR under a read-only refs/pull/ ref, which a push cannot overwrite. As noted in GitHub’s documentation, there are certain scenarios where commits will not be updated, and PRs are one of them. I manually reviewed the 5 PRs in question and confirmed that they do not reference the sensitive data I removed.
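To see which Pull Requests are involved, the pull refs that came down with the mirror clone can be listed. This is a minimal sketch, assuming the refs sit under refs/pull/ as they did for my GitHub repo; the function name list_pull_refs is hypothetical.

```shell
# Hypothetical helper: list the pull-request refs held in a (mirror) repo.
list_pull_refs() {
  # for-each-ref matches the prefix up to a slash, so this covers refs/pull/<n>/head etc.
  git -C "$1" for-each-ref --format='%(refname)' refs/pull
}
```

Running list_pull_refs "<my repo>.git" from the HistoryFix folder prints one ref per line, and the number in each ref is the PR to review on GitHub. Note that the local copies of these refs have been rewritten by BFG, so it is the versions on GitHub that need checking.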
Checking the affected file in GitHub for all branches (in my case, main and develop), I can see that the history has been overwritten so that older commits no longer show the GUIDs for tenantId and principalId. Instead, they now show the replacement string.
As a final step, I deleted the old copy of the repo from my local machine and cloned it again from GitHub to get the most up-to-date version.
Thanks to Vinay Sharma’s blog for expanding on the BFG Repo-Cleaner commands.
Final thoughts on Azure Data Factory’s integration with git:
I don’t think that this is a full or long-term solution for using Data Factory with a public repo. Every commit that changes the .json file in the factory folder has the potential to reintroduce the GUIDs to the repo. Likewise, I’m not sure that Data Factory will run correctly if the GUIDs are removed from the associated repo. In this case, I have no plans to continue working on this Data Factory project and had already removed the git integration, so making the described changes to the repo had no impact. That may not be the case in future.
While the argument can be made that tenantId can be determined from the domain name (see WhatIsMyTenantId), I don’t think it’s unreasonable for Azure customers to view those GUIDs as sensitive data that they don’t want to be made public.