WordPress Is Not Importing Some CSV Files
Much of the past week was spent chasing down an issue where a Store Locator Plus® customer could not get their CSV file imported into WordPress. A review of the file both with automated validation apps as well as a manual inspection shows that while there were some warnings about the structure due to extended UTF-8 characters within text fields, the structure itself was valid. The interesting part about some of the flagged CSV content is that WordPress itself changes many characters like dashes and quotes to their own internal “visually appealing” versions known as long dashes or back quotes — which are NOT standard characters in the US_ASCII set. The big problem for the imports, however, are the newline characters.
Newline Characters in CSV Files
Newline characters in CSV files are perfectly valid as long as they are contained inside a quoted string. The CSV specification allows for UTF-8 characters, common in WordPress exports, as well as reserved characters such as newlines are also supported. Newline characters only need to be within a double-quote pair. You can read about the details, and inconsistency of processing and adherence to RFC 4180 and subsequent related RFCs at Wikiepedia.
WordPress Is Not Importing Some CSV Files
Unfortunately WordPress is not importing some CSV files containing newlines despite following the standard. Here we have a simple 4-line CSV file that follows the standard, yet WordPress rejects the upload.
If you validate this CSV file some processors will suffer from lack of supporting the standard and flag the line with a newline as an error. Robust validators have no issue with the file. You will find better CSV handling in tools like Google Sheets (my recommendation for solid CSV support) and even the half-customized Microsoft Excel CSV handlers and MacOS Numbers handlers will import and export this file with no problem, managing the newlines as expected.
Sadly WordPress is not importing these types of CSV files thanks to their reliance on PHP’s built-in file info.
WordPress Insecurity About Security
The problem lies within a new “security feature” added to WordPress around what appears to be version 5.7.0. For a few years now, WordPress has had an often over-zealous focus on security. While improved security is a great idea and something I promote, the WordPress ecosystem does not fully validate many decisions made by core contributors that are introduced under the label “security improvement”. Over the years WordPress core has had many failures in maintaining backward compatibility (something they promote practically “above all else”) and fundamentally breaking hundreds of themes and plugins thanks to “security improvements” that were not fully vetted and quality tested prior to launching as a public release.
File Insecurity
This is another such instance of this type of development. The problem lies within the async-upload.php functionality used by the WordPress file upload system including the media file interface. Sometime around the 5.7 release someone decided that a bad actor could manipulate the file extensions and/or header information and upload files containing “something bad” under the guise of a standard file such as a PDF, image, video, or CSV file (maybe).
PHP Fileinfo Shortcomings
Their solution to get around this? Utilize PHP’s Fileinfo and mime type reporting functions. The basic premise is that during a file upload WordPress will store the file in a temporary location, then open it with PHP Fileinfo functions and get back a report on what mime type PFP thinks the file is. If that mime type does NOT match what the web server told PHP the mime type would be then it is flagged as a “bad actor” and WordPress then bails on further processing.
The problem is, PHP’s Fileinfo functionality is notoriously HORRID at properly identifying text file types properly. For binary files there is often a header on the file often contains a binary signature that is well defined. It is often easy to determine if the contents match the often-proprietary header and content specification defined in these binary file standards. Text files, however — they can contain just about anything as long as they are composed of supported characters sets (US-ASCII, UTF-8, etc.) and follow a much more generic pattern. As such PHP’s fileinfo functions often “guess incorrectly” as to what TYPE of file a text file is. In this case, our CSV file is often reported back as “text/plain” or even “text/HTML” (depending which UTF-8 characters appear next to a newline in various samples we’ve tried) instead of “text/CSV”.
WordPress Exacerbates The Fileinfo Problem
In addition to WordPress relying too heavily on the PHP fileinfo function to report valid mime types, they push the problem further. In the logic of the _wp_hande_upload() and wp_check_filetype_and_ext() functions (the later being the real culprit here), they decide that if the browser reported your valid .csv file as mime type “text/csv” but PHP said it is “text/HTML”, they you must be trying to “trick the system”. It then flags the file as invalid and returns a stack of data back to the calling function where the upload bails out.
Unfortunately when paired with the long-present WordPress proclivity of turning a regular-old dash (-) into a long-dash or quote (‘) into a special UTF-8 charter, it is far too frequent that we find a text “description of a place” field where these special UTF-8 characters appear near a newline character that are both contained inside the CSV double-quote when exported. This ALMOST ALWAYS gets flagged by PHP’s fileinfo as text/HTML. In turn the WordPress upload of that file, which is exported from WordPress in the first place, is flagged as “invalid”.
The problem is the buried within the wp_check_filetype_and_ext() function which DOES NOT have a special exception handler like “text/plain” does to make it a valid “alias” for text/csv files.
A WordPress Workaround
Thankfully the WordPress wp_check_filetype_and_ext() function provides a possible escape hatch from this overzealous file processing hell. They used their ever-present filter system to pass the results of this function through a hook that you can tap into and completely override these checks. Funny that a security-centric feature allows any plugin or theme developer to completely bypass this — not there has EVER been a malicious actor publishing themes and plugins while posing as “just another” plugin or theme developer — but I’m glad it is there.
Thanks to the “security improvements” around file uploads back in WordPress 5.7, we will need to leverage this to add exceptions for when PHP thinks a valid CSV file is an HTML file — something this core function should have already taken care of. Sadly the filter does NOT provide the critical decision tree data of the “real mime type” variable to the filter — crazy since that is a primary piece of data that drives the entire decision tree before this filter is called. That means to patch around the WordPress shortcoming we will need to re-run the PHP fileinfo functions, reducing performance, so we can make intelligent decisions about CSV files without completely bypassing the file upload security feature.
A valiant attempt, but a major shortcoming for anyone that is export data from a WordPress site via CSV file and importing that data back into a WordPress site. Store Locator Plus® is not the only victim of this failure as is evident from multiple reports of Jetpack CRM and WooCommerce CSV export/import failures that were never resolved.
Maybe those authors, also part of Automattic, will see this article and figure it out. I may even submit a ticket against core, but the last few major tickets (related to data structures and dbDelta) took SEVEN YEARS to get deemed “worthy” and get patched — so I don’t hold much hope of a speed core patch from WordPress on this one.