
Tralics driver

Driver for tralics to convert LaTeX snippets to MathML elements


Overview

Driver for Tralics: Convert LaTeX math snippets to MathML elements

Note: Tralics is a LaTeX-to-XML translator, available from http://www-sop.inria.fr/marelle/tralics
This project is only a driver for it.

Requirements

Python 2.6 or greater
Pexpect package
Tralics installation
lxml package

What does it do?

This project is for the case where you have LaTeX math snippets (inline or block math) and want to convert them to MathML.

For example, you can write code like the following. You supply the function get_mathstring() yourself; in this example it yields (file name, LaTeX string) pairs. See the file runner.py for details.


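      from tralics_driver import driver

      math_elems = list()
      # The argument is tralics_dir, used to locate the Tralics binary
      t = driver.TralicsDriver('/usr/local')
      # get_mathstring() is your own function; it yields (fname, mathstring) pairs
      for fname, mathstring in get_mathstring():
        elemstring = t.convert(fname, mathstring)
        math_elems.append(elemstring)

      # Failed conversions are recorded on the driver instance
      if t.errors:
        print('ERRORS:')
        print(t.errors)
      # Close the Tralics process and write the cache to disk
      t.stop()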
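
A minimal sketch of what get_mathstring() might look like, assuming the LaTeX snippets sit between single dollar signs in some text. The function name matches the loop above, but the `sources` parameter, the regular expression, and the in-memory input are illustrative assumptions and not part of the driver; runner.py shows the approach the project itself takes.

```python
import re

# Hypothetical pattern: inline math delimited by single dollar signs.
INLINE_MATH = re.compile(r'\$([^$]+)\$')

def get_mathstring(sources=None):
    """Yield (fname, mathstring) pairs, the shape the loop above expects.

    `sources` maps a file name to its text. It is an assumption made so
    that the sketch runs without touching the file system.
    """
    sources = sources or {}
    for fname in sorted(sources):
        for match in INLINE_MATH.finditer(sources[fname]):
            yield fname, match.group(1).strip()

pairs = list(get_mathstring({'intro.html': 'Area is $\\pi r^2$ and $E = mc^2$.'}))
```

Writing it as a generator keeps the calling loop unchanged whether the snippets come from a handful of strings or from a large tree of files.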

Assumptions

You have installed Tralics
If you have custom newcommands, they are in a file called newcommands.tex placed in the Tralics conf directory
You want to cache your conversions on disk
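
The disk cache assumed above can be pictured with a small sketch. The driver's real cache format and file location are not described here, so the JSON file, the `ConversionCache` class, and the `fake_convert` callback below are all hypothetical stand-ins.

```python
import json
import os
import tempfile

class ConversionCache(object):
    """Hypothetical LaTeX -> MathML cache persisted as a JSON file."""

    def __init__(self, path):
        self.path = path
        self.data = {}
        # Reuse results from an earlier run if the cache file exists.
        if os.path.exists(path):
            with open(path) as f:
                self.data = json.load(f)

    def get_or_convert(self, mathstring, convert_fn):
        # Only call the (slow) converter for strings not seen before.
        if mathstring not in self.data:
            self.data[mathstring] = convert_fn(mathstring)
        return self.data[mathstring]

    def save(self):
        with open(self.path, 'w') as f:
            json.dump(self.data, f)

# Stand-in converter that records how often it is called.
calls = []
def fake_convert(mathstring):
    calls.append(mathstring)
    return '<math><mi>%s</mi></math>' % mathstring

cache_path = os.path.join(tempfile.mkdtemp(), 'cache.json')
cache = ConversionCache(cache_path)
first = cache.get_or_convert('x', fake_convert)
second = cache.get_or_convert('x', fake_convert)   # served from memory
cache.save()

reloaded = ConversionCache(cache_path)             # loaded from disk
third = reloaded.get_or_convert('x', fake_convert)
```

The converter runs once even though the string is requested three times, which is the point of caching when each conversion means a round trip to an external process.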

Project Layout

The code is structured as follows:

__init__.py Contains the helper functions unescape and escape, which handle HTML/XML markup characters, and a string template used to report the contents of an error.
driver.py Contains only the driver class, TralicsDriver. The class methods are as follows:
- __init__(tralics_dir, options). Sets up locations for the tralics binary, the customized 'newcommands' file if it exists, and the cache file if it exists.
- start Spawns a pexpect process and reads in the customized 'newcommands' file if it exists.
- stop Closes the process and writes the cache to disk.
- convert(fname, mathstring) Returns the MathML string for the LaTeX mathstring that was found in the file named fname. The MathML string is taken from the cache when possible; otherwise, the LaTeX string is passed to the Tralics process. Prints a dot ('.') for each string found in the cache, or a plus ('+') for each string processed by Tralics.
- getmath(expr) Passes the LaTeX string expr to the Tralics process and returns the MathML string or an element string containing the filename, error message, and LaTeX string that caused the error.
- clean_formula(expr, result) Modifies the element string to conform to the current MathML spec (changes attributes) or returns an element string containing the filename, error message, and LaTeX string that caused the error.
- handle_error(data) Records the error data in the instance's list of errors and returns an element string containing the filename, error message, and LaTeX string that caused the error.
runner.py Contains an example of how to use the driver on a tree of HTML files.
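
The error path described above (record the error, then return an element string in place of the MathML) might look roughly like the following. This is a sketch under stated assumptions: the template text, the `ErrorCollector` class, and the three-argument `handle_error` signature are hypothetical, and the project's own template in __init__.py may differ. The escaping uses the standard library's xml.sax.saxutils.escape rather than the project's helper.

```python
from xml.sax.saxutils import escape

# Hypothetical template; the real one lives in __init__.py and may differ.
ERROR_TEMPLATE = (
    '<span class="math-error" data-file="%(fname)s">'
    '%(message)s: %(latex)s</span>'
)

class ErrorCollector(object):
    def __init__(self):
        self.errors = []

    def handle_error(self, fname, message, latex):
        # Keep the raw data so it can be reported after the run.
        self.errors.append((fname, message, latex))
        # Escape every field so the result is still well-formed markup.
        return ERROR_TEMPLATE % {
            'fname': escape(fname, {'"': '&quot;'}),
            'message': escape(message),
            'latex': escape(latex),
        }

collector = ErrorCollector()
elem = collector.handle_error('a.html', 'Undefined command', r'\foo{x < y}')
```

Returning an element string instead of raising lets a long batch run finish, with the failures inspected afterwards through the collected list.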

Example

This example parses each HTML file in a target directory. It assumes that some images (those with class="math") have alt text containing the LaTeX markup used to create the image.

For each image, it

passes the LaTeX markup string to the Tralics driver to get the MathML version
creates an element from that MathML string
replaces the original image element with the MathML element

Once every image in a file has been processed, it writes the new tree (a modified copy of the original) to a new directory.

At the start there is one tree of HTML files that uses images for math. At the end, the original directory is untouched, and a second directory holds the same HTML files with MathML elements in place of the original images.

      import os

      from lxml import etree
      from tralics_driver import driver

      target_dir = '/some/dir/with/html_files'
      mathml_root = '/another/place/to/write/new/html_files'
      # Attributes for the MathJax <script> element added to each page
      d = {
        'type': 'text/javascript',
        'src':'http://cdn.mathjax.org/mathjax/latest/MathJax.js?config=MML_HTMLorMML'
        }

      t = driver.TralicsDriver()
      for fname in [x for x in os.listdir(target_dir) if x.endswith('.html')]:
        parser = etree.HTMLParser()
        tree = etree.parse(os.path.join(target_dir, fname), parser)
        head = tree.find('head')
        # Build a fresh element per file: lxml moves an element that is
        # appended to a second tree instead of copying it
        head.append(etree.Element('script', attrib=d))

        for image in tree.xpath('//img[@class="math"]'):
          mtext = image.get('alt')
          if mtext:
            elemstring = t.convert(fname, mtext)
            try:
              math_elem = etree.fromstring(elemstring)
            except etree.XMLSyntaxError as e:
              print(e)
            else:
              # Keep any text that followed the image in the source
              math_elem.tail = image.tail
              image.getparent().replace(image, math_elem)
          else:
            print('No alt text %s:%s' % (fname, image.get('src')))

        # tostring() returns bytes here, so write bytes to the binary file
        with open(os.path.join(mathml_root, fname), 'wb') as f:
          f.write(b'\n' + etree.tostring(tree.getroot(),
                                         pretty_print=True,
                                         encoding='UTF-8',
                                         method='html'))

      # Close the Tralics process and write the cache to disk
      t.stop()
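
The replacement step in the example above can be tried in isolation, without Tralics or lxml installed. The sketch below swaps in the standard library's xml.etree.ElementTree for lxml and a hypothetical `fake_convert` function for `t.convert`, so it only illustrates the tree manipulation, not the conversion itself.

```python
import xml.etree.ElementTree as ET

page = ET.fromstring(
    '<html><head/><body><p>Area: '
    '<img class="math" alt="r^2" src="r2.png"/> units.</p></body></html>'
)

def fake_convert(latex):
    # Stand-in for t.convert(); returns a fixed, well-formed MathML string.
    return '<math><msup><mi>r</mi><mn>2</mn></msup></math>'

# ElementTree has no getparent(), so build a child -> parent map first.
parent_of = dict((child, parent) for parent in page.iter() for child in parent)

for image in page.findall('.//img[@class="math"]'):
    math_elem = ET.fromstring(fake_convert(image.get('alt')))
    # The text after an element is stored on the element itself (its tail),
    # so copy it across before the image is removed.
    math_elem.tail = image.tail
    parent = parent_of[image]
    index = list(parent).index(image)
    parent.remove(image)
    parent.insert(index, math_elem)

result = ET.tostring(page).decode('ascii')
```

Copying the tail matters for inline math: without it, the words that followed the image in the paragraph would disappear along with the image.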