Markdown and Upload
I’m trying out Markdown on my blog, and as an experiment I have enumerated the steps needed to upload a picture to an image site I used to run. I have listed the details here because the procedure is full of corner cases and non-intuitive setups found in shared hosting environments.
This document describes the procedure for uploading a file to the server.
1. Validate User Session
    - If the session key and id are present, use them for the lookup
    - Look up the session by key and id
    - If the session is not found, or is not valid, look up the session by IP address
    - If looking up by IP address fails, create a new session
    - If creating a new session fails, explode.
2. Extract and Clean Tags
    - Take each tag, and transliterate from UTF-8 to ASCII
    - Take each tag, and trim leading and trailing white space
    - Take each tag, and lowercase
    - Remove all empty tags
    - Remove all duplicate tags
    - Check that there are no more than the maximum number of tags, to avoid abuse.
    - Check that there are at least the minimum number of tags.
3. Extract and validate the Category Id
4. Check that the IP hasn’t exceeded the max uploads per unit time (not the session id!)
5. Get a local copy of the uploaded picture data
    - If the picture came from a file upload
        - Make sure the upload didn’t have an error
        - Get the file name (as provided by the client) and the file path (location in /tmp)
    - If a file URL was provided
        - Record the referrer, and other metadata about where the picture came from.
        - Attempt to download the file. On success, get the file path
        - If the file download had a Content-Disposition file name, use it.
        - If the file did not have a filename, or didn’t come from HTTP, use the URL basename.
    - If the picture came from a file upload
        - Check if the file size is too small.
6. Check that the file name doesn’t have php in it.
7. Check that the file is of a valid type (GIF, PNG, or JPEG)
8. Calculate the file hash, and look up whether the image already exists
9. Check (using the hash) whether the file has been previously deleted.
10. Insert the image post data into the database, and get a post ID
11. Rename the temp image file to the post ID and file extension
12. Update the bump ordering using the post ID
13. Record that the IP address has uploaded a picture, for use in step #4
14. Store references from step #5 if present
15. Insert all tags from step #2
16. Add “Goats” (the currency of the site) to the session based on how many tags were provided
17. Record in the session that it was last used at this time (for the garbage collection policy).
18. Create a thumbnail image of the picture
19. Invalidate the index HTML cache
20. If requested, redirect the user back to the index.
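
As an illustration, the tag clean-up steps in the list above could be sketched like this. The site itself was written in PHP, so this Python version is only my approximation of the logic; `clean_tags`, `MIN_TAGS`, and `MAX_TAGS` are made-up names and limits, and the NFKD trick is a rough stand-in for real transliteration (it strips accents, but simply drops characters that have no ASCII equivalent).

```python
import unicodedata

# Hypothetical limits; the real values used by the site are not given.
MIN_TAGS = 1
MAX_TAGS = 10


def clean_tags(raw_tags):
    """Return cleaned tags in first-seen order, or raise ValueError."""
    cleaned = []
    seen = set()
    for tag in raw_tags:
        # Rough UTF-8 -> ASCII transliteration: decompose accented
        # characters, then drop whatever is still not ASCII.
        ascii_tag = (
            unicodedata.normalize("NFKD", tag)
            .encode("ascii", "ignore")
            .decode("ascii")
        )
        # Trim surrounding white space and lowercase.
        ascii_tag = ascii_tag.strip().lower()
        # Skip empty tags and duplicates.
        if not ascii_tag or ascii_tag in seen:
            continue
        seen.add(ascii_tag)
        cleaned.append(ascii_tag)
    # Enforce the limits only after cleaning, so that empty or
    # duplicate tags do not count toward either bound.
    if len(cleaned) > MAX_TAGS:
        raise ValueError("too many tags")
    if len(cleaned) < MIN_TAGS:
        raise ValueError("not enough tags")
    return cleaned
```

Doing the limit checks last matters: a request padded with blank or repeated tags should not satisfy the minimum, and it should not trip the maximum either.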
Some points that you might have noticed:
- Only ASCII text is supported instead of UTF-8. This is because PHP (at the time I created the site) had very poor UTF-8 support, and the default collation of my MySQL 4 database was (unbeknownst to me) Swedish. Additionally, the original audience of my site was a little rougher than typical people and so would try to abuse text input. In order to minimize admin overhead, I decided to keep the site ASCII only.
- Uploaded pictures have to be copied to a directory local to the server software. One day I found that file uploads were failing because I had hit the max number of files allowed by my quota. Uploads first landed on a different partition (/tmp/) from the one the server software ran on, so from PHP’s perspective the file was uploaded “successfully”, despite being useless.
- Session ID (basically an anonymous, semi-persistent login cookie) is not used for quota enforcement, because a small number of users could get a new cookie pretty easily. (Of course, they could also get a new IP address pretty easily, but it’s much simpler to handle such users on a case-by-case basis than to make the software handle it. The goal is to limit moderator overhead, not to be perfect.)
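
To make the quota point concrete, here is a rough sketch of a per-IP upload limit, keyed on the IP address rather than the session. It is not the site's code: the `UploadLimiter` class, its in-memory storage, and the sliding-window policy are all assumptions of mine (the real site presumably kept this in MySQL). The check and the record are separate calls because, in the procedure above, the check happens early and the upload is only recorded once the post has been stored.

```python
import time
from collections import defaultdict, deque


class UploadLimiter:
    """Sliding-window upload counter keyed by IP address (hypothetical)."""

    def __init__(self, max_uploads, window_seconds):
        self.max_uploads = max_uploads
        self.window_seconds = window_seconds
        self._uploads = defaultdict(deque)  # ip -> upload timestamps

    def _recent(self, ip, now):
        # Drop timestamps that have fallen out of the window.
        stamps = self._uploads[ip]
        while stamps and now - stamps[0] >= self.window_seconds:
            stamps.popleft()
        return stamps

    def allowed(self, ip, now=None):
        """Early check: has this IP stayed under the limit?"""
        now = time.time() if now is None else now
        return len(self._recent(ip, now)) < self.max_uploads

    def record(self, ip, now=None):
        """Late step: note that this IP completed an upload."""
        now = time.time() if now is None else now
        self._recent(ip, now).append(now)


limiter = UploadLimiter(max_uploads=2, window_seconds=60)
if limiter.allowed("203.0.113.7"):
    # ... store the picture, then:
    limiter.record("203.0.113.7")
```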
And, some points that you may not have noticed:
- Naming files after their database primary key (the post ID) is a chicken-and-egg problem. If anything goes wrong between allocating the ID (in MySQL, this is a side effect of a successful insert) and the post-insert steps, the server can be left in an inconsistent state: either a row is inserted into the DB without a corresponding picture, or the picture is in the serving directory without a row. The latter is much safer, and if I had not been a high-schooler at the time, I would have picked a transactional database and sidestepped these problems. Alas, I was not experienced and wrote the server assuming everything would work.
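
For what it is worth, here is a minimal sketch of what the transactional version might look like, using Python and SQLite purely for illustration (the site used PHP and MySQL 4). The `posts` table, its columns, and the `store_post` helper are invented for this example. The idea is to allocate the ID inside a transaction, move the file, and commit only if the move succeeded.

```python
import os
import shutil
import sqlite3


def store_post(conn, tmp_path, serve_dir, extension):
    """Insert a post row and move the picture to <post_id>.<extension>.

    Hypothetical helper: assumes a table
    posts(id INTEGER PRIMARY KEY, extension TEXT).
    """
    cur = conn.cursor()
    try:
        # sqlite3 opens a transaction implicitly before the INSERT.
        cur.execute("INSERT INTO posts (extension) VALUES (?)", (extension,))
        post_id = cur.lastrowid
        final_path = os.path.join(serve_dir, f"{post_id}.{extension}")
        # shutil.move falls back to copy-and-delete when the temp file
        # lives on a different partition (such as /tmp).
        shutil.move(tmp_path, final_path)
        conn.commit()
    except Exception:
        # The move failed, so make sure no row is left behind.
        conn.rollback()
        raise
    return post_id, final_path
```

If the commit itself fails after the move, the picture is left in the serving directory without a row, which is the safer of the two inconsistent states described above.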