Cleaning up my Markdown table cleanup script

Way back in 2008, I wrote a script that made it easier to have nicely formatted tables in a Markdown[1] document. The idea was to take a hastily written table like this,

| Column 1 | Column 2 | Column 3 |
|--|:--:|--:|
| first | second | third |
| column | column | column |
| left | center | right |

and turn it into this,

| Column 1 | Column 2 | Column 3 |
|:---------|:--------:|---------:|
| first    |  second  |    third |
| column   |  column  |   column |
| left     |  center  |    right |

which is much easier to read. Note that both result in the same generated HTML and will look the same in the output document:

Column 1   Column 2   Column 3
first       second       third
column      column      column
left        center       right

The script was all about making the Markdown source look better. It’s also easier to see if you have mistakes when the tables look like tables. I called the script Normalize Table.
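The padding behind that tidy look can be sketched with Python's built-in string methods. This is a minimal illustration in Python 3, not the script itself; the helper name `pad_cell` and the alignment labels are made up for the example.

```python
def pad_cell(cell, align, width):
    """Pad one cell to `width`, keeping one "bumper" space on each side."""
    text = " " + cell.strip() + " "
    if align == "center":
        return text.center(width)
    if align == "right":
        return text.rjust(width)
    return text.ljust(width)


# Hypothetical sample data: one body row and the alignment of each column.
row = ["first", "second", "third"]
aligns = ["left", "center", "right"]

line = "|" + "|".join(pad_cell(c, a, 10) for c, a in zip(row, aligns)) + "|"
print(line)  # | first    |  second  |    third |
```

The width of 10 is hard-coded here only to match the example table; in practice it would come from the widest cell in each column.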

The original script was incorporated into a TextMate command, which was the editor I used at the time. Later, when I switched to BBEdit, I took the script and saved it as a BBEdit Text Filter.

One of the problems with that script was that it didn’t handle Unicode characters correctly. Unicode characters use multiple bytes, which messed up the vertical alignment of the pipes. A table like this,

| Cølumn 1 | Côlumn 2 | Colümn 3 |
|--|:--:|--:|
| fírst | sèçond | thîrd |
| column | column | column |
| left | center | right |

would end up like this,

| Cølumn 1 | Côlumn 2 | Colümn 3 |
|:----------|:---------:|----------:|
| fírst    |  sèçond |    thîrd |
| column    |   column  |    column |
| left      |   center  |     right |
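The misalignment comes down to measuring width in bytes instead of characters. Here is a small Python 3 sketch of the difference, assuming the text is UTF-8 encoded; the function names are illustrative only.

```python
def byte_width(cell):
    """Length as a byte-oriented script sees it (UTF-8 encoded)."""
    return len(cell.encode("utf-8"))


def char_width(cell):
    """Length in Unicode code points, which is what the padding should use."""
    return len(cell)


for cell in ["Cølumn 1", "fírst", "sèçond", "column"]:
    print(cell, byte_width(cell), char_width(cell))
```

Each accented letter in these samples takes two bytes in UTF-8, so a byte count overstates the width by one for every such letter, and the padding comes up short by the same amount. Counting code points is enough for these examples, though it would still be off for combining marks or double-width characters.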

Earlier this month, Nils Schulte am Hülse sent me a fix for Unicode, which I thanked him for and incorporated in my BBEdit Text Filter. Now tables with Unicode characters get properly formatted:

| Cølumn 1 | Côlumn 2 | Colümn 3 |
|:---------|:--------:|---------:|
| fírst    |  sèçond  |    thîrd |
| column   |  column  |   column |
| left     |  center  |    right |

Nils also pointed out that my script assumed the separator line (the one with the hyphens between the column headings and the body of the table) couldn’t include spaces. This is incorrect for both MultiMarkdown and PHP Markdown Extra. Nils included a simple fix for this, too, so now tables like this

| Cølumn 1 | Côlumn 2 | Colümn 3 |
| -- | :--: | --: |
| fírst | sèçond | thîrd |
| column | column | column |
| left | center | right |

won’t generate errors.
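As a rough sketch of how a separator line with spaces might be recognized and read, here is a Python 3 example. The character set is the one the script tests against; the hyphen check and the stripping of each cell before looking at its colons are choices made for this sketch, and the function names are hypothetical.

```python
SEPARATOR_CHARS = set("|:.- ")  # same characters the script's test allows


def is_separator(line):
    """True if the line holds only separator characters and at least one hyphen."""
    # The hyphen check is an extra guard for this sketch: without it,
    # an empty line would also pass the subset test.
    return "-" in line and set(line) <= SEPARATOR_CHARS


def alignments(separator):
    """Return 'left', 'center', or 'right' for each column of a separator line."""
    cells = separator.strip().strip("|").split("|")
    result = []
    for cell in cells:
        cell = cell.strip()  # ignore padding spaces around the hyphens
        if cell.startswith(":") and cell.endswith(":"):
            result.append("center")
        elif cell.endswith(":"):
            result.append("right")
        else:
            result.append("left")
    return result


print(is_separator("| -- | :--: | --: |"))
print(alignments("| -- | :--: | --: |"))
```

Stripping each cell matters: if the colons are looked for at the very ends of an unstripped cell like ` :--: `, only spaces are found and every column reads as left-aligned.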

As I said, I thanked Nils for his improvements, but something was nagging at me. The separator issue was new to me—I had never written a table with spaces in the separator line and hadn’t even considered whether it was legal—but I thought I’d fixed the Unicode problem. And yet, when I applied the Normalize Table text filter in BBEdit, it didn’t work right until I incorporated Nils’s changes.

So I started to write a post to explain the changes and went searching through the vast and dusty ANIAT archives for links to the old table formatting entries. Which is when I found this one, in which I showed a solution to the Unicode problem as sent to me by reader Christoph Kepper. Apparently, I

  1. didn’t copy the right Normalize Table script when I switched from TextMate to BBEdit;
  2. forgot that I actually had a script that handled Unicode correctly; and
  3. didn’t realize any of this when I got Nils’s email.

I like to think that if it were my actual job to remember what I write I’d be better at it, but after episodes like this I’m not so sure. My grandfather used to tell me it’s hell to get old but there’s no good alternative.

Anyway, there is one silver lining. Nils’s Unicode solution is slightly shorter than Christoph’s in that it affects only one line of my original script. Here’s the current version:

python:
 1:  #!/usr/bin/python
 2:  
 3:  import sys
 4:  
 5:  def just(string, type, n):
 6:      "Justify a string to length n according to type."
 7:      
 8:      if type == '::':
 9:          return string.center(n)
10:      elif type == '-:':
11:          return string.rjust(n)
12:      elif type == ':-':
13:          return string.ljust(n)
14:      else:
15:          return string
16:  
17:  
18:  def normtable(text):
19:      "Aligns the vertical bars in a text table."
20:      
21:      # Start by turning the text into a list of lines.
22:      lines = text.splitlines()
23:      rows = len(lines)
24:      
25:      # Figure out the cell formatting.
26:      # First, find the separator line.
27:      for i in range(rows):
28:          if set(lines[i]).issubset('|:.- '):
29:              formatline = lines[i]
30:              formatrow = i
31:              break
32:      
33:      # Delete the separator line from the content.
34:      del lines[formatrow]
35:      
36:      # Determine how each column is to be justified.
37:      formatline = formatline.strip(' ')
38:      if formatline[0] == '|': formatline = formatline[1:]
39:      if formatline[-1] == '|': formatline = formatline[:-1]
40:      fstrings = formatline.split('|')
41:      justify = []
42:      for cell in [ x.strip() for x in fstrings ]:  # ignore padding spaces
43:          ends = cell[0] + cell[-1]
44:          if ends == '::':
45:              justify.append('::')
46:          elif ends == '-:':
47:              justify.append('-:')
48:          else:
49:              justify.append(':-')
50:      
51:      # Assume the number of columns in the separator line is the number
52:      # for the entire table.
53:      columns = len(justify)
54:      
55:      # Extract the content into a matrix.
56:      content = []
57:      for line in lines:
58:          line = line.strip(' ')
59:          if line[0] == '|': line = line[1:]
60:          if line[-1] == '|': line = line[:-1]
61:          cells = line.split('|')
62:          # Put exactly one space at each end as "bumpers."
63:          linecontent = [ ' ' + x.strip() + ' ' for x in cells ]
64:          content.append(linecontent)
65:      
66:      # Append cells to rows that don't have enough.
67:      rows = len(content)
68:      for i in range(rows):
69:          while len(content[i]) < columns:
70:              content[i].append('')
71:      
72:      # Get the width of the content in each column. The minimum width will
73:      # be 2, because that's the shortest length of a formatting string and
74:      # because that matches an empty column with "bumper" spaces.
75:      widths = [2] * columns
76:      for row in content:
77:          for i in range(columns):
78:              widths[i] = max(len(row[i]), widths[i])
79:      
80:      # Add whitespace to make all the columns the same width.
81:      formatted = []
82:      for row in content:
83:          formatted.append('|' + '|'.join([ just(s, t, n) for (s, t, n) in zip(row, justify, widths) ]) + '|')
84:      
85:      # Recreate the format line with the appropriate column widths.
86:      formatline = '|' + '|'.join([ s[0] + '-'*(n-2) + s[-1] for (s, n) in zip(justify, widths) ]) + '|'
87:      
88:      # Insert the formatline back into the table.
89:      formatted.insert(formatrow, formatline)
90:      
91:      # Return the formatted table.
92:      return '\n'.join(formatted)
93:  
94:          
95:  # Read the input, process, and print.
96:  unformatted = unicode(sys.stdin.read(), "utf-8")
97:  print normtable(unformatted)

Nils’s fix for the separator problem is in Line 28. His fix for the Unicode problem is in Line 96.
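Line 96 is Python 2 code: it turns the bytes read from standard input into a `unicode` string so that lengths are counted in characters. A rough Python 3 equivalent, offered as a sketch and not as part of the script, would decode the bytes explicitly. The helper name `read_utf8` is made up, and an in-memory stream stands in for `sys.stdin.buffer` so the example runs on its own.

```python
import io


def read_utf8(stream):
    """Read all bytes from a binary stream and decode them as UTF-8."""
    return stream.read().decode("utf-8")


# Stand-in for sys.stdin.buffer, which is the binary view of stdin in Python 3.
raw = "| fírst |".encode("utf-8")
text = read_utf8(io.BytesIO(raw))

print(len(raw), len(text))  # 10 9
```

In a real filter, `read_utf8(sys.stdin.buffer)` would take the place of the in-memory stream.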


  1. Strictly speaking, there is no table format in Markdown. Tables like you see in this post are available only in Markdown variants like MultiMarkdown or PHP Markdown Extra. Because I suspect that these variants, taken in aggregate, are more popular than Gruber’s One True Markdown, I use the name Markdown to refer to all of them.
