How to automate removal of 4-Byte UTF-8 encodings from filenames?

Wasabi currently does not support 4-byte UTF-8 characters. For more information, refer to our KB article "Does Wasabi support 4 byte UTF8 characters?".
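For readers unsure which characters are affected: UTF-8 uses 4 bytes for every code point above U+FFFF, which covers most emoji. The following is a minimal sketch, not part of Wasabi's tooling, of how such characters could be spotted in a name. The function name four_byte_chars is hypothetical and used only for illustration.

```python
def four_byte_chars(name):
    """Return the characters in name that UTF-8 encodes in 4 bytes.

    Code points above U+FFFF are exactly the ones that need 4 bytes.
    """
    return [ch for ch in name if ord(ch) > 0xFFFF]


# Hypothetical file names, written with escapes so the source stays ASCII
emoji_name = "party_\U0001F389.txt"   # contains one emoji (4 bytes in UTF-8)
accent_name = "caf\u00e9.txt"         # e-acute needs only 2 bytes in UTF-8

found_in_emoji_name = four_byte_chars(emoji_name)
found_in_accent_name = four_byte_chars(accent_name)
```

Here found_in_emoji_name holds the single emoji character, while found_in_accent_name is empty, because an accented letter is non-ASCII but not 4-byte encoded.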
 

In some cases, users may encounter problems uploading files to Wasabi S3 because some of their file names contain characters that are encoded as 4 bytes in UTF-8 (emoji, for example). To solve this problem, the affected files can be renamed using the Python script shown below. Note that the script strips every non-ASCII character from the file name, not only the 4-byte ones. This script is also attached at the end of this document.

 

import os

# Get current working directory
current_dir = os.getcwd()

# List of all files in the current directory, including subdirectories
listOfFiles = list()
for (dirpath, dirnames, filenames) in os.walk(current_dir):
    listOfFiles += [os.path.join(dirpath, file) for file in filenames]


# Function that strips non-ASCII characters (including 4-byte UTF-8 ones) from file names
def emojiRemovalTool():
    for file in listOfFiles:
        # Split off the folder so only the file name itself is checked and changed
        folder, name = os.path.split(file)
        # str.isascii() requires Python 3.7 or later
        if not name.isascii():
            try:
                print(f"Found the following file with non-ASCII characters: {file}")
                # Encoding to ASCII with "ignore" drops every non-ASCII character
                new_name = name.encode("ascii", "ignore").decode()
                new_path = os.path.join(folder, new_name)
                os.rename(file, new_path)
                print(f" {file} has been successfully renamed to : {new_path}")
            except OSError as err:
                # Report the actual error instance, not the OSError class
                print(f"The following error occurred: {err}")


if __name__ == "__main__":
    emojiRemovalTool()

 

This script crawls the directory from which it is run (the current working directory), finds all files in that directory and its sub-directories, and renames any file whose name contains non-ASCII characters, which includes all 4-byte UTF-8 encoded characters.
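Because the ASCII-based approach also drops accented letters and other 2-byte or 3-byte characters, some users may prefer a narrower filter. The following is a sketch of one possible alternative, not the method used by the script above: it removes only code points above U+FFFF, which are the ones UTF-8 encodes in 4 bytes. The function name strip_four_byte is hypothetical.

```python
def strip_four_byte(name):
    """Remove only the characters that UTF-8 encodes in 4 bytes.

    Characters up to U+FFFF (ASCII, accented letters, most CJK) are kept.
    """
    return "".join(ch for ch in name if ord(ch) <= 0xFFFF)


# Hypothetical file name: an accented letter plus one emoji
original = "caf\u00e9_\U0001F389.txt"

# Narrow filter: keeps the accented letter, drops the emoji
narrow = strip_four_byte(original)

# ASCII-based filter, as in the script above: drops both
ascii_only = original.encode("ascii", "ignore").decode()
```

With this input, narrow keeps the accented letter and loses only the emoji, while ascii_only becomes caf_.txt. To use the narrower filter, the encode/decode step in the script could be swapped for a call to a function like this one.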

 

HOW TO RUN THE SCRIPT

1. Make sure you have downloaded and installed Python on your system. You can do so by visiting the official Python website. The script uses str.isascii(), which requires Python 3.7 or later.

2. Copy the script (rename_tool.py) to the parent directory where you have your files, as shown below:

mceclip0.png

3. Open a command prompt (CMD) in the parent directory where you have your files, and run the command below:

Command:

$ python rename_tool.py

 

Results Before:

before.png

 

Results After:

Screenshot_2022-07-27_184025.png

 

Once the script has completed successfully, all non-ASCII characters, including those that are 4-byte UTF-8 encoded, will have been removed from file names in the parent directory and its sub-directories.
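To confirm the result, or to preview which files would be affected before renaming anything, a read-only check can be useful. The following is a minimal sketch under the same assumption that 4-byte characters are the code points above U+FFFF. The function names has_four_byte_char and find_four_byte_names are hypothetical, and nothing in this snippet renames or modifies files.

```python
import os


def has_four_byte_char(name):
    """Return True if name contains a character that UTF-8 encodes in 4 bytes."""
    return any(ord(ch) > 0xFFFF for ch in name)


def find_four_byte_names(root):
    """Return the paths under root whose file name contains a 4-byte character."""
    matches = []
    for dirpath, dirnames, filenames in os.walk(root):
        for name in filenames:
            if has_four_byte_char(name):
                matches.append(os.path.join(dirpath, name))
    return matches


if __name__ == "__main__":
    # Read-only scan of the current working directory and its sub-directories
    remaining = find_four_byte_names(os.getcwd())
    print(f"{len(remaining)} file name(s) contain 4-byte UTF-8 characters")
```

Run from the parent directory, this should report 0 after the rename script has completed. Running it beforehand gives a count of the files that will be renamed.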
