Tuesday, May 18th, 2010

Scribd: Font face trickery and more

Category: Articles, Font

Scribd is my “favourite company of the month”. First they showed off their move from Flash to HTML5, and now they are generously taking the time to share details of their implementation in a three-part series.

The first part delves into the bowels of @font-face, starting with the simple:

  @font-face {
    font-family: 'Scrivano';
    src: url('scrivano.eot');
    src: url("scrivano.svg") format('svg');
    src: local('\u263a'), url('scrivano.otf') format('truetype');
  }

and moving to how they support angled text such as this:

How do you encode the diagonal text in this document in an HTML page?

Short of using element transformations (-moz-transform, DXImageTransform etc.) which we found to be rather impractical, we encode the above HTML with a custom font created by transforming the original font. Here’s how our generated font looks in FontForge:

From the above font screenshot you can also see that we reduce fonts to only the characters that are actually used in the document; that helps save space and network bandwidth. Usually, fonts in PDFs are already reduced, so this is not always necessary.
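To make the font-reduction idea concrete, here is a minimal sketch of its first step: working out which characters a document actually uses. It is only an illustration, not Scribd's implementation; the function names `used_codepoints` and `format_codepoints` are made up for this example, and the actual subsetting of the font file is left to whatever font tool you use.

```python
def used_codepoints(text: str) -> list[int]:
    """Return the sorted, de-duplicated code points of the printable characters in text."""
    # Control characters such as newlines have no glyph, so they are skipped.
    return sorted({ord(ch) for ch in text if ch.isprintable()})


def format_codepoints(codepoints: list[int]) -> str:
    """Format code points as a comma-separated list in U+XXXX notation."""
    return ", ".join(f"U+{cp:04X}" for cp in codepoints)


# The caption "iterations" has 10 characters but needs only 8 distinct glyphs.
sample = "iterations"
keep = used_codepoints(sample)
print(len(keep), "glyphs needed:", format_codepoints(keep))
```

Every glyph outside that list could then be dropped from the font before it is served.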

Naturally, for fonts with diagonal characters every character needs to be offset to a different vertical position (we encode fonts as left-to-right). In fact, this is how other HTML converters basically work: they place every single character on the page using a div with position:absolute:

  <!-- crude pdf to html conversion -->
  <div style="position:absolute;top:237px;left:250px;">H</div>
  <div style="position:absolute;top:237px;left:264px;">e</div>
  <div style="position:absolute;top:237px;left:271px;">l</div>
  <!-- etc. -->
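As a rough illustration of that crude approach, and of why diagonal text needs a different vertical position for every character, the sketch below walks along a rotated baseline and emits one absolutely positioned div per character. The function names, the advance widths and the 45 degree angle are assumptions chosen for the example; this is not how Scribd's converter works.

```python
import math
from html import escape


def place_along_baseline(text, advances, origin, angle_deg):
    """Yield (char, left, top) for each character on a baseline rotated by angle_deg.

    advances: per-character advance widths in px (hypothetical input; a real
    converter would take these from the font metrics).
    Screen y grows downward, so a positive angle climbs up the page.
    """
    x0, y0 = origin
    theta = math.radians(angle_deg)
    travelled = 0.0
    for ch, advance in zip(text, advances):
        left = x0 + travelled * math.cos(theta)
        top = y0 - travelled * math.sin(theta)
        yield ch, round(left), round(top)
        travelled += advance


def crude_html(placed):
    """Emit one absolutely positioned div per character."""
    return "\n".join(
        f'<div style="position:absolute;top:{top}px;left:{left}px;">{escape(ch)}</div>'
        for ch, left, top in placed
    )


# Horizontal text reproduces the kind of markup shown above.
print(crude_html(place_along_baseline("Hel", [14, 7, 7], (250, 237), 0)))
# On a 45 degree baseline every character ends up with its own top value.
print(crude_html(place_along_baseline("Hel", [14, 7, 7], (250, 237), 45)))
```

With a transformed font, those per-character offsets are baked into the glyphs instead, so the text can stay in ordinary left-to-right markup.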

At Scribd, we invested a lot of time in optimizing this, to the degree that we can now convert almost all documents to “nice” HTML markup. We detect character spacing, line heights, paragraphs, justification and a lot of other attributes of the input document that can be encoded natively in the HTML. So a PDF document uploaded to Scribd may, in its HTML version, look like this (style attributes omitted for legibility):

HTML version:

  <p>
    <span>domain block is in a different image than the range block), as</span>
    <span>opposed to mappings which stay in the image (domain block</span>
    <span>and range block are in the same image) - also see Fig. 5.</span>
    <span>It's also possible to, instead of counting the number of</span>
  </p>
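One way to get from individually positioned characters to markup like this is to group characters that share a baseline and insert a space wherever the horizontal gap is large enough. The sketch below does only that; the input tuple format, the `gap` threshold and the function name are assumptions for illustration, and Scribd's actual detection of spacing, paragraphs and justification is certainly more involved.

```python
from html import escape
from itertools import groupby


def lines_to_paragraph(chars, gap=3):
    """Merge positioned characters into one <span> per line inside a <p>.

    chars: iterable of (char, left, top, width) tuples in px (hypothetical format).
    gap: horizontal distance in px above which a space is inserted (assumed threshold).
    """
    # Sort by line (top) first, then left to right within the line.
    ordered = sorted(chars, key=lambda c: (c[2], c[1]))
    spans = []
    for _, line in groupby(ordered, key=lambda c: c[2]):
        text, prev_right = "", None
        for ch, left, _top, width in line:
            if prev_right is not None and left - prev_right > gap:
                text += " "
            text += ch
            prev_right = left + width
        spans.append(f"  <span>{escape(text)}</span>")
    return "<p>\n" + "\n".join(spans) + "\n</p>"


chars = [
    ("I", 10, 100, 4), ("t", 14, 100, 5), ("i", 25, 100, 3), ("s", 28, 100, 6),
    ("o", 10, 118, 7), ("k", 17, 118, 7),
]
print(lines_to_paragraph(chars))
```

As in the listing above, positioning styles are left out; a real converter would still have to place each span.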

Together with tags for graphic elements on pages, we can now represent every PDF document in HTML while preserving fonts, layout and style, with text selectability, searchability, and making full use of the optimized rendering engines built into browsers.

I am looking forward to parts 2 and 3!

Posted by Dion Almaer at 11:10 am
8 Comments


There was some good discussion on Hacker News about rotated glyphs vs. using CSS transforms (and the IE Matrix filter): http://news.ycombinator.com/item?id=1355865

Comment by PaulIrish — May 18, 2010

They definitely have more work to do. The example document they link from that post (http://www.scribd.com/documents/5/Image-Cluster-Compression) is horribly broken in the latest Safari. It works in Chrome, so it’s not really clear why it wouldn’t work in Safari, but there’s a bunch of stuff that’s just all over the page in wrong places.

Comment by eyelidlessness — May 18, 2010

@eyelidlessness: it works in a recent WebKit nightly, so I expect the next version of Safari will work OK.

Comment by Jaaap — May 18, 2010

I wish they had been more specific than “impractical” so we knew the issues they ran into. I haven’t found that to be the case, and (re hacker news thread) the image-based transform of IE’s matrix transform means that things like a transformed textarea have little impact on performance, which can’t be said for some more modern browsers that are forced into a slower rendering path (and selection works just fine in the textarea).

Main annoyances I’ve had to work around:
1) major performance drops in (Windows-based, non-hardware accelerated) WebKit and Opera when used with features such as textareas or border-radius.
2) IE’s image based transforms mean that scaling uses bilinear sampling and can look pretty bad for even modest values.
3) IE doesn’t anti-alias text at all if the element that contains it and is being transformed has a transparent background color.
4) Cairo (on Windows, at least) renders even transformed glyphs at integer coordinates. This can be very obvious with small text as the rounding used can make each character have essentially its own little additional rotation.
5) Also on Windows, and I think just Windows, #4 combines with some layout issue where a changing transform on one element can cause transformed (but unchanging) text elsewhere on the page to jitter back and forth from frame to frame.

#4 and #5 are the most annoying because they can’t really be worked around and can be pretty distracting. There are bugs, but they haven’t gotten much attention. However, Bas’s Direct2D Cairo backend renders text so beautifully (even at extreme skews) that I’m willing to hold out.

Comment by bckenny — May 18, 2010

I’ve had good luck using SVG to rotate text. The SVG rendering path is pretty well optimized for rotated text. The advantage is that it’s well supported, even on Opera, and even on older browser versions. Except for IE, of course. The disadvantage is that you have to use JavaScript to embed SVG nodes into HTML, but I suppose you could use a script to replace HTML elements with SVG elements.

I made an ExtJS plugin that rotates panel header text when collapsing a west or east panel:
http://www.extjs.com/forum/showthread.php?89395-Ext.ux.PanelCollapsedTitle-Cross-browser-vertical-text&p=437553

Comment by Joeri — May 19, 2010

The clear advantage of the glyph approach over SVG et al. is that you can use simpler, semantically understandable markup. In short: you can Ctrl+F search even for distorted text like the caption “iterations” in the screenshot.

As we are used to this feature from PDF readers, it’s nice to have it in the HTML version too.

Furthermore, this will improve the “indexability” of the converted PDF documents. Until now, PDF-to-HTML converters often ignored captions in diagrams and drawings.

Comment by znarfdwarf — May 19, 2010

Good point about Ctrl+F and text selection. I tested it on a simple rotated SVG text demo. It works in WebKit, Opera and the IE9 preview. It doesn’t work in Firefox (five-year-old bug 292498). The bug notes say that Firefox fails the SVG test suite because of this.

Comment by Joeri — May 19, 2010

Reducing fonts to only the characters that are actually used in the document to save network bandwidth is especially helpful for mobile apps, which, even with 4G, are challenged for bandwidth. What other techniques can be used to save network bandwidth?

Comment by softwarequalityman — May 6, 2012
