Metadata-Version: 2.4
Name: mammoth
Version: 1.11.0
Summary: Convert Word documents from docx to simple and clean HTML and Markdown
Home-page: https://github.com/mwilliamson/python-mammoth
Author: Michael Williamson
Author-email: mike@zwobble.org
License: BSD-2-Clause
Keywords: docx word office clean html markdown md
Classifier: Development Status :: 5 - Production/Stable
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: BSD License
Classifier: Programming Language :: Python
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.7
Classifier: Programming Language :: Python :: 3.8
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Requires-Python: >=3.7
License-File: LICENSE
Requires-Dist: cobble<0.2,>=0.1.3
Dynamic: author
Dynamic: author-email
Dynamic: classifier
Dynamic: description
Dynamic: home-page
Dynamic: keywords
Dynamic: license
Dynamic: license-file
Dynamic: requires-dist
Dynamic: requires-python
Dynamic: summary

Mammoth .docx to HTML converter
===============================

Mammoth is designed to convert .docx documents, such as those created by
Microsoft Word, Google Docs and LibreOffice, to HTML.
Mammoth aims to produce simple and clean HTML by using semantic
information in the document, and ignoring other details. For instance,
Mammoth converts any paragraph with the style ``Heading 1`` to ``h1``
elements, rather than attempting to exactly copy the styling (font, text
size, colour, etc.) of the heading.

There’s a large mismatch between the structure used by .docx and the
structure of HTML, meaning that the conversion is unlikely to be perfect
for more complicated documents. Mammoth works best if you only use
styles to semantically mark up your document.

The following features are currently supported:

- Headings.

- Lists.

- Customisable mapping from your own docx styles to HTML. For instance,
  you could convert ``WarningHeading`` to ``h1.warning`` by providing
  an appropriate style mapping.

- Tables. The formatting of the table itself, such as borders, is
  currently ignored, but the formatting of the text is treated the same
  as in the rest of the document.

- Footnotes and endnotes.

- Images.

- Bold, italics, underlines, strikethrough, superscript and subscript.

- Links.

- Line breaks.

- Text boxes. The contents of the text box are treated as a separate
  paragraph that appears after the paragraph containing the text box.

- Comments.

Installation
------------

::

    pip install mammoth

Other supported platforms
-------------------------

- `JavaScript <https://github.com/mwilliamson/mammoth.js>`__, both the
  browser and node.js. Available `on
  npm <https://www.npmjs.com/package/mammoth>`__.

- `WordPress <https://wordpress.org/plugins/mammoth-docx-converter/>`__.

- `Java/JVM <https://github.com/mwilliamson/java-mammoth>`__. Available
  `on Maven
  Central <http://search.maven.org/#search%7Cga%7C1%7Cg%3A%22org.zwobble.mammoth%22%20AND%20a%3A%22mammoth%22>`__.

- `.NET <https://github.com/mwilliamson/dotnet-mammoth>`__. Available
  `on NuGet <https://www.nuget.org/packages/Mammoth/>`__.

Usage
-----

CLI
~~~

You can convert docx files by passing the path to the docx file and the
output file. For instance:

::

    mammoth document.docx output.html

If no output file is specified, output is written to stdout instead.

The output is an HTML fragment, rather than a full HTML document,
encoded with UTF-8. Since the encoding is not explicitly set in the
fragment, opening the output file in a web browser may cause Unicode
characters to be rendered incorrectly if the browser doesn’t default to
UTF-8.

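Because the output is a fragment with no declared encoding, one way to view it reliably in a browser is to wrap it in a minimal HTML document that declares UTF-8. The following is a minimal sketch and not part of Mammoth: the helper names ``wrap_html_fragment`` and ``write_html_document`` and the default title are illustrative assumptions.

```python
import html


def wrap_html_fragment(fragment, title="Converted document"):
    """Wrap an HTML fragment in a minimal document that declares UTF-8.

    Hypothetical helper for illustration; not part of Mammoth.
    """
    return (
        "<!DOCTYPE html>\n"
        "<html>\n"
        "<head>\n"
        '<meta charset="utf-8">\n'
        "<title>{0}</title>\n"
        "</head>\n"
        "<body>\n"
        "{1}\n"
        "</body>\n"
        "</html>\n"
    ).format(html.escape(title), fragment)


def write_html_document(fragment, path):
    # Encode explicitly as UTF-8 so the file matches the declared charset.
    with open(path, "w", encoding="utf-8") as output_file:
        output_file.write(wrap_html_fragment(fragment))
```

The fragment could be the contents of the file written by the CLI, or the ``value`` of a result returned by ``mammoth.convert_to_html``. The fragment is inserted as-is, so the security warning below still applies.
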
**Mammoth performs no sanitisation of the source document, and should
therefore be used extremely carefully with untrusted user input.** See
the `Security <#security>`__ section for more information.

Images
^^^^^^

By default, images are included inline in the output HTML. If an output
directory is specified by ``--output-dir``, the images are written to
separate files instead. For instance:

::

    mammoth document.docx --output-dir=output-dir

Existing files will be overwritten if present.

Styles
^^^^^^

A custom style map can be read from a file using ``--style-map``. For
instance:

::

    mammoth document.docx output.html --style-map=custom-style-map

Where ``custom-style-map`` looks something like:

::

    p[style-name='Aside Heading'] => div.aside > h2:fresh
    p[style-name='Aside Text'] => div.aside > p:fresh

A description of the syntax for style maps can be found in the section
`“Writing style maps” <#writing-style-maps>`__.

Markdown
^^^^^^^^

Markdown support is deprecated. Generating HTML and using a separate
library to convert the HTML to Markdown is recommended, and is likely to
produce better results.

Using ``--output-format=markdown`` will cause Markdown to be generated.
For instance:

::

    mammoth document.docx --output-format=markdown

Library
~~~~~~~

**Mammoth performs no sanitisation of the source document, and should
therefore be used extremely carefully with untrusted user input.** See
the `Security <#security>`__ section for more information.

Basic conversion
^^^^^^^^^^^^^^^^

To convert an existing .docx file to HTML, pass a file-like object to
``mammoth.convert_to_html``. The file should be opened in binary mode.
For instance:

.. code:: python

    import mammoth

    with open("document.docx", "rb") as docx_file:
        result = mammoth.convert_to_html(docx_file)
        html = result.value # The generated HTML
        messages = result.messages # Any messages, such as warnings during conversion

You can also extract the raw text of the document by using
``mammoth.extract_raw_text``. This will ignore all formatting in the
document. Each paragraph is followed by two newlines.

.. code:: python

    with open("document.docx", "rb") as docx_file:
        result = mammoth.extract_raw_text(docx_file)
        text = result.value # The raw text
        messages = result.messages # Any messages

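Since each paragraph in the raw text is followed by two newlines, the text can be split back into paragraphs with plain string handling. This is a small sketch under that assumption: ``split_paragraphs`` is a hypothetical helper, not part of Mammoth, and it drops empty paragraphs.

```python
def split_paragraphs(raw_text):
    """Split raw text into a list of paragraph strings.

    Assumes each paragraph is followed by two newlines, as described above.
    Hypothetical helper for illustration; not part of Mammoth.
    """
    # Empty paragraphs produce empty strings, which are dropped here.
    return [paragraph for paragraph in raw_text.split("\n\n") if paragraph]


sample = "First paragraph\n\nSecond paragraph\n\n"
paragraphs = split_paragraphs(sample)
print(len(paragraphs))
```

In practice, ``sample`` would be replaced by the ``value`` of the result returned by ``mammoth.extract_raw_text``.
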
Custom style map
^^^^^^^^^^^^^^^^

By default, Mammoth maps some common .docx styles to HTML elements. For
instance, a paragraph with the style name ``Heading 1`` is converted to
an ``h1`` element. You can pass in a custom map for styles by passing
the ``style_map`` keyword argument to ``convert_to_html``. A description
of the syntax for style maps can be
found in the section `“Writing style maps” <#writing-style-maps>`__. For
instance, if paragraphs with the style name ``Section Title`` should be
converted to ``h1`` elements, and paragraphs with the style name
``Subsection Title`` should be converted to ``h2`` elements:

.. code:: python

    import mammoth

    style_map = """
    p[style-name='Section Title'] => h1:fresh
    p[style-name='Subsection Title'] => h2:fresh
    """

    with open("document.docx", "rb") as docx_file:
        result = mammoth.convert_to_html(docx_file, style_map=style_map)

User-defined style mappings are used in preference to the default style
mappings. To stop using the default style mappings altogether, pass
``include_default_style_map=False``:

.. code:: python

    result = mammoth.convert_to_html(docx_file, style_map=style_map, include_default_style_map=False)

Custom image handlers
^^^^^^^^^^^^^^^^^^^^^

By default, images are converted to ``<img>`` elements with the source
included inline in the ``src`` attribute. This behaviour can be changed
by setting the ``convert_image`` argument to an `image
converter <#image-converters>`__.

For instance, the following would replicate the default behaviour:

.. code:: python

    import base64

    import mammoth

    def convert_image(image):
        # Read the image bytes and encode them as a base64 data URI.
        with image.open() as image_bytes:
            encoded_src = base64.b64encode(image_bytes.read()).decode("ascii")

        return {
            "src": "data:{0};base64,{1}".format(image.content_type, encoded_src)
        }

    mammoth.convert_to_html(docx_file, convert_image=mammoth.images.img_element(convert_image))

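As a further illustration, an image handler could write each image to a separate file and reference it by name instead of inlining it. The sketch below relies only on the ``open()`` and ``content_type`` properties used above. The helper name ``make_file_image_converter``, the numbered file names and the way the extension is derived from the content type are assumptions made for this example.

```python
import os


def make_file_image_converter(output_dir):
    """Return a function that could be passed to mammoth.images.img_element.

    Hypothetical helper: writes each image to output_dir and returns a
    relative src attribute instead of an inline data URI.
    """
    os.makedirs(output_dir, exist_ok=True)
    counter = {"next": 1}

    def convert_image(image):
        # Naive extension guess from the content type, e.g. image/png -> png.
        extension = image.content_type.partition("/")[2] or "bin"
        filename = "{0}.{1}".format(counter["next"], extension)
        counter["next"] += 1

        with image.open() as image_bytes:
            data = image_bytes.read()
        with open(os.path.join(output_dir, filename), "wb") as image_file:
            image_file.write(data)

        return {"src": filename}

    return convert_image
```

The returned function would then be wrapped in the same way as above, for example ``convert_image=mammoth.images.img_element(make_file_image_converter("images"))``.
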
Bold
^^^^

By default, bold text is wrapped in ``<strong>`` tags. This behaviour
can be changed by adding a style mapping for ``b``. For instance, to
wrap bold text in ``<em>`` tags:

.. code:: python

    style_map = "b => em"

    with open("document.docx", "rb") as docx_file:
        result = mammoth.convert_to_html(docx_file, style_map=style_map)

Italic
^^^^^^

By default, italic text is wrapped in ``<em>`` tags. This behaviour can
be changed by adding a style mapping for ``i``. For instance, to wrap
italic text in ``<strong>`` tags:

.. code:: python

    style_map = "i => strong"

    with open("document.docx", "rb") as docx_file:
        result = mammoth.convert_to_html(docx_file, style_map=style_map)

Underline
^^^^^^^^^

By default, the underlining of any text is ignored since underlining can
be confused with links in HTML documents. This behaviour can be changed
by adding a style mapping for ``u``. For instance, suppose that a source
document uses underlining for emphasis. The following will wrap any
explicitly underlined source text in ``<em>`` tags:

.. code:: python

    import mammoth

    style_map = "u => em"

    with open("document.docx", "rb") as docx_file:
        result = mammoth.convert_to_html(docx_file, style_map=style_map)

Strikethrough
^^^^^^^^^^^^^

By default, strikethrough text is wrapped in ``<s>`` tags. This
behaviour can be changed by adding a style mapping for ``strike``. For
instance, to wrap strikethrough text in ``<del>`` tags:

.. code:: python

    style_map = "strike => del"

    with open("document.docx", "rb") as docx_file:
        result = mammoth.convert_to_html(docx_file, style_map=style_map)

Comments
^^^^^^^^

By default, comments are ignored. To include comments in the generated
HTML, add a style mapping for ``comment-reference``. For instance:

.. code:: python

    style_map = "comment-reference => sup"

    with open("document.docx", "rb") as docx_file:
        result = mammoth.convert_to_html(docx_file, style_map=style_map)

Comments will be appended to the end of the document, with links to the
comments wrapped using the specified style mapping.

API
~~~

``mammoth.convert_to_html(fileobj, **kwargs)``
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Converts the source document to HTML.

- ``fileobj``: a file-like object containing the source document. Files
  should be opened in binary mode.

- ``style_map``: a string to specify the mapping of Word styles to
  HTML. See the section `“Writing style maps” <#writing-style-maps>`__
  for a description of the syntax.

- ``include_embedded_style_map``: by default, if the document contains
  an embedded style map, then it is combined with the default style
  map. To ignore any embedded style maps, pass
  ``include_embedded_style_map=False``.

- ``include_default_style_map``: by default, the style map passed in
  ``style_map`` is combined with the default style map. To stop using
  the default style map altogether, pass
  ``include_default_style_map=False``.

- ``external_file_access``: source documents may reference files
  outside of the source document. Access to any such external files is
  disabled by default. To enable access when converting trusted source
  documents, pass ``external_file_access=True``.

- ``convert_image``: by default, images are converted to ``<img>``
  elements with the source included inline in the ``src`` attribute.
  Set this argument to an `image converter <#image-converters>`__ to
  override the default behaviour.

- ``ignore_empty_paragraphs``: by default, empty paragraphs are
  ignored. Set this option to ``False`` to preserve empty paragraphs in
  the output.

- ``id_prefix``: a string to prepend to any generated IDs, such as
  those used by bookmarks, footnotes and endnotes. Defaults to an empty
  string.

- ``transform_document``: if set, this function is applied to the
  document read from the docx file before the conversion to HTML. The
  API for document transforms should be considered unstable. See
  `document transforms <#document-transforms>`__.

- Returns a result with the following properties:

  - ``value``: the generated HTML

  - ``messages``: any messages, such as errors and warnings, generated
    during the conversion

``mammoth.convert_to_markdown(fileobj, **kwargs)``
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Markdown support is deprecated. Generating HTML and using a separate
library to convert the HTML to Markdown is recommended, and is likely to
produce better results.

Converts the source document to Markdown. This behaves the same as
``convert_to_html``, except that the ``value`` property of the result
contains Markdown rather than HTML.

``mammoth.extract_raw_text(fileobj)``
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Extracts the raw text of the document. This will ignore all formatting in
the document. Each paragraph is followed by two newlines.

- ``fileobj``: a file-like object containing the source document. Files
  should be opened in binary mode.

- Returns a result with the following properties:

  - ``value``: the raw text

  - ``messages``: any messages, such as errors and warnings

``mammoth.embed_style_map(fileobj, style_map)``
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Embeds the style map ``style_map`` into ``fileobj``. When Mammoth reads
a file object, it will use the embedded style map.

- ``fileobj``: a file-like object containing the source document. Files
  should be opened for reading and writing in binary mode.

- ``style_map``: the style map to embed.

- Returns ``None``.

Messages
^^^^^^^^

Each message has the following properties:

- ``type``: a string representing the type of the message, such as
  ``"warning"``

- ``message``: a string containing the actual message

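The messages from a conversion can be inspected using just these two properties. The following is a minimal sketch: ``format_messages`` and ``has_warnings`` are hypothetical helpers, not part of Mammoth, and they assume only that each message has ``type`` and ``message`` attributes.

```python
def format_messages(messages):
    """Return one line per message, in the form "type: message".

    Relies only on the ``type`` and ``message`` properties described above.
    Hypothetical helper for illustration; not part of Mammoth.
    """
    return ["{0}: {1}".format(m.type, m.message) for m in messages]


def has_warnings(messages):
    # "warning" is the example type given above; other types may exist.
    return any(m.type == "warning" for m in messages)
```

For instance, ``format_messages(result.messages)`` could be logged after calling ``mammoth.convert_to_html``, so that unrecognised styles or similar problems are not silently missed.
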
Image converters
^^^^^^^^^^^^^^^^

An image converter can be created by calling
``mammoth.images.img_element(func)``. This creates an ``<img>`` element
for each image in the original docx. ``func`` should be a function that
has one argument ``image``. This argument is the image element being
converted, and has the following properties:

- ``open()``: open the image file. Returns a file-like object.

- ``content_type``: the content type of the image, such as
  ``image/png``.

``func`` should return a ``dict`` of attributes for the ``<img>``
element. At a minimum, this should include the ``src`` attribute. If any
alt text is found for the image, this will be automatically added to the
element’s attributes.

For instance, the following replicates the default image conversion:

.. code:: python

    import base64

    import mammoth

    def convert_image(image):
        # Read the image bytes and encode them as a base64 data URI.
        with image.open() as image_bytes:
            encoded_src = base64.b64encode(image_bytes.read()).decode("ascii")

        return {
            "src": "data:{0};base64,{1}".format(image.content_type, encoded_src)
        }

    image_converter = mammoth.images.img_element(convert_image)

``mammoth.images.data_uri`` is the default image converter.

WMF images are not handled by default by Mammoth. The recipes directory
contains `an example of how they can be converted using
LibreOffice <https://github.com/mwilliamson/python-mammoth/blob/master/recipes/wmf_images.py>`__,
although the fidelity of the conversion depends entirely on LibreOffice.

Security
~~~~~~~~

Mammoth performs no sanitisation of the source document, and should
therefore be used extremely carefully with untrusted user input. For
instance:

- Source documents can contain links with ``javascript:`` targets. If,
  for instance, you allow users to upload source documents,
  automatically convert the document into HTML, and embed the HTML into
  your website without sanitisation, this may create links that can
  execute arbitrary JavaScript when clicked.

- Source documents may reference files outside of the source document.
  If, for instance, you allow users to upload source documents to a
  server, automatically convert the document into HTML on the server,
  and embed the HTML into your website, this may allow arbitrary files
  on the server to be read and exfiltrated.

  To avoid this issue, access to any such external files is disabled by
  default. To enable access when converting trusted source documents,
  pass ``external_file_access=True``.

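To make the first risk concrete, the sketch below scans generated HTML for link targets whose URL scheme is not on an allow-list. It uses only the Python standard library and is not part of Mammoth. The allow-list and the names ``ALLOWED_SCHEMES`` and ``find_disallowed_links`` are assumptions for this example. It only detects suspicious links and is not a substitute for a dedicated HTML sanitiser.

```python
from html.parser import HTMLParser
from urllib.parse import urlsplit

# Assumption: only these schemes are acceptable; adjust for your application.
# The empty scheme covers relative links and fragment links such as "#note".
ALLOWED_SCHEMES = {"", "http", "https", "mailto"}


class _LinkCollector(HTMLParser):
    """Collect the href attribute of every <a> element."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.hrefs.append(value)


def _scheme_of(href):
    try:
        return urlsplit(href.strip()).scheme.lower()
    except ValueError:
        # Unparseable URLs are treated as disallowed.
        return None


def find_disallowed_links(html_fragment):
    """Return href values whose URL scheme is not in ALLOWED_SCHEMES."""
    collector = _LinkCollector()
    collector.feed(html_fragment)
    collector.close()
    return [
        href for href in collector.hrefs
        if _scheme_of(href) not in ALLOWED_SCHEMES
    ]
```

A non-empty result for the ``value`` of a conversion result would suggest that the output should be rejected or sanitised before being embedded in a page.
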
Document transforms
~~~~~~~~~~~~~~~~~~~

**The API for document transforms should be considered unstable, and may
change between any versions. If you rely on this behaviour, you should
pin to a specific version of Mammoth, and test carefully before
updating.**

Mammoth allows a document to be transformed before it is converted. For
instance, suppose that a document has not been semantically marked up, but
you know that any centre-aligned paragraph should be a heading. You can
use the ``transform_document`` argument to modify the document
appropriately:

.. code:: python

    import mammoth.transforms

    def transform_paragraph(element):
        if element.alignment == "center" and not element.style_id:
            return element.copy(style_id="Heading2")
        else:
            return element

    transform_document = mammoth.transforms.paragraph(transform_paragraph)

    mammoth.convert_to_html(fileobj, transform_document=transform_document)

Or if you want paragraphs that have been explicitly set to use monospace
fonts to represent code:

.. code:: python

    import mammoth.documents
    import mammoth.transforms

    _monospace_fonts = set(["consolas", "courier", "courier new"])

    def transform_paragraph(paragraph):
        runs = mammoth.transforms.get_descendants_of_type(paragraph, mammoth.documents.Run)
        if runs and all(run.font and run.font.lower() in _monospace_fonts for run in runs):
            return paragraph.copy(style_id="code", style_name="Code")
        else:
            return paragraph

    mammoth.convert_to_html(
        fileobj,
        transform_document=mammoth.transforms.paragraph(transform_paragraph),
        # Raw string, so that the style map contains a backslash followed by
        # "n" rather than a literal newline, which would split the mapping.
        style_map=r"p[style-name='Code'] => pre:separator('\n')",
    )

``mammoth.transforms.paragraph(transform_paragraph)``
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Returns a function that can be used as the ``transform_document``
argument. This will apply the function ``transform_paragraph`` to each
paragraph element. ``transform_paragraph`` should return the new
paragraph.

``mammoth.transforms.run(transform_run)``
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Returns a function that can be used as the ``transform_document``
argument. This will apply the function ``transform_run`` to each run
element. ``transform_run`` should return the new run.

``mammoth.transforms.get_descendants(element)``
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Gets all descendants of an element.

``mammoth.transforms.get_descendants_of_type(element, type)``
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Gets all descendants of a particular type of an element. For instance,
to get all runs within an element ``paragraph``:

.. code:: python

    import mammoth.documents
    import mammoth.transforms

    runs = mammoth.transforms.get_descendants_of_type(paragraph, mammoth.documents.Run)

Writing style maps
------------------

A style map is made up of a number of style mappings separated by new
lines. Blank lines and lines starting with ``#`` are ignored.

A style mapping has two parts:

- On the left, before the arrow, is the document element matcher.
- On the right, after the arrow, is the HTML path.

When converting each paragraph, Mammoth finds the first style mapping
where the document element matcher matches the current paragraph.
Mammoth then ensures the HTML path is satisfied.

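As an illustration of this layout, the following sketch defines a commented style map as a Python string and lists the lines that would be treated as style mappings. The helper ``effective_mappings`` is hypothetical: it only mirrors the rule stated above (blank lines and lines starting with ``#`` are ignored) and does not call Mammoth's own parser. The style names are taken from examples elsewhere in this document.

```python
STYLE_MAP = """
# Headings used by a hypothetical document template
p[style-name='Section Title'] => h1:fresh
p[style-name='Subsection Title'] => h2:fresh

# Asides
p[style-name='Aside Heading'] => div.aside > h2:fresh
p[style-name='Aside Text'] => div.aside > p:fresh
"""


def effective_mappings(style_map):
    """List the lines that would be treated as style mappings.

    Illustrative only: mirrors the rule above rather than calling
    Mammoth's parser.
    """
    lines = (line.strip() for line in style_map.split("\n"))
    return [line for line in lines if line and not line.startswith("#")]


for mapping in effective_mappings(STYLE_MAP):
    print(mapping)
```

The string ``STYLE_MAP`` itself, comments included, is what would be passed as the ``style_map`` argument.
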
Freshness
~~~~~~~~~

When writing style mappings, it’s helpful to understand Mammoth’s notion
of freshness. When generating HTML, Mammoth will only close an HTML element
when necessary. Otherwise, elements are reused.

For instance, suppose one of the specified style mappings is
``p[style-name='Heading 1'] => h1``. If Mammoth encounters a .docx
paragraph with the style name ``Heading 1``, the .docx paragraph is
converted to an ``h1`` element with the same text. If the next .docx
paragraph also has the style name ``Heading 1``, then the text of that
paragraph will be appended to the *existing* ``h1`` element, rather than
creating a new ``h1`` element.

In most cases, you’ll probably want to generate a new ``h1`` element
instead. You can specify this by using the ``:fresh`` modifier:

``p[style-name='Heading 1'] => h1:fresh``

The two consecutive ``Heading 1`` .docx paragraphs will then be
converted to two separate ``h1`` elements.

Reusing elements is useful in generating more complicated HTML
structures. For instance, suppose your .docx contains asides. Each aside
might have a heading and some body text, which should be contained
within a single ``div.aside`` element. In this case, style mappings
similar to ``p[style-name='Aside Heading'] => div.aside > h2:fresh`` and
``p[style-name='Aside Text'] => div.aside > p:fresh`` might be helpful.

Document element matchers
~~~~~~~~~~~~~~~~~~~~~~~~~

Paragraphs, runs and tables
^^^^^^^^^^^^^^^^^^^^^^^^^^^

Match any paragraph:

::

    p

Match any run:

::

    r

Match any table:

::

    table

To match a paragraph, run or table with a specific style, you can
reference the style by name. This is the style name that is displayed in
Microsoft Word or LibreOffice. For instance, to match a paragraph with
the style name ``Heading 1``:

::

    p[style-name='Heading 1']

You can also match a style name by prefix. For instance, to match a
paragraph where the style name starts with ``Heading``:

::

    p[style-name^='Heading']

Styles can also be referenced by style ID. This is the ID used
internally in the .docx file. To match a paragraph or run with a
specific style ID, append a dot followed by the style ID. For instance,
to match a paragraph with the style ID ``Heading1``:

::

    p.Heading1

.. _bold-1:

Bold
^^^^

Match explicitly bold text:

::

    b

Note that this matches text that has had bold explicitly applied to it.
It will not match any text that is bold because of its paragraph or run
style.

.. _italic-1:

Italic
^^^^^^

Match explicitly italic text:

::

    i

Note that this matches text that has had italic explicitly applied to
it. It will not match any text that is italic because of its paragraph
or run style.

.. _underline-1:

Underline
^^^^^^^^^

Match explicitly underlined text:

::

    u

Note that this matches text that has had underline explicitly applied to
it. It will not match any text that is underlined because of its
paragraph or run style.

Strikethrough
^^^^^^^^^^^^^

Match explicitly struckthrough text:

::

    strike

Note that this matches text that has had strikethrough explicitly
applied to it. It will not match any text that is struckthrough because
of its paragraph or run style.

All caps
^^^^^^^^

Match explicitly all caps text:

::

    all-caps

Note that this matches text that has had all caps explicitly applied to
it. It will not match any text that is all caps because of its paragraph
or run style.

Small caps
^^^^^^^^^^

Match explicitly small caps text:

::

    small-caps

Note that this matches text that has had small caps explicitly applied
to it. It will not match any text that is small caps because of its
paragraph or run style.

Highlight
^^^^^^^^^

Match explicitly highlighted text:

::

    highlight

Note that this matches text that has had a highlight explicitly applied
to it. It will not match any text that is highlighted because of its
paragraph or run style.

It’s also possible to match specific colours. For instance, to match
yellow highlights:

::

    highlight[color='yellow']

The colours typically used are:

- ``black``
- ``blue``
- ``cyan``
- ``green``
- ``magenta``
- ``red``
- ``yellow``
- ``white``
- ``darkBlue``
- ``darkCyan``
- ``darkGreen``
- ``darkMagenta``
- ``darkRed``
- ``darkYellow``
- ``darkGray``
- ``lightGray``

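When several highlight colours need their own mapping, the style map can be generated rather than written by hand. The sketch below builds one mapping per colour. The choice of the ``mark`` element, the ``highlight-...`` class names and the helper name ``highlight_style_map`` are assumptions made for this example, not something Mammoth prescribes.

```python
HIGHLIGHT_COLOURS = ["yellow", "green", "cyan"]


def highlight_style_map(colours):
    """Build one style mapping per highlight colour.

    The ``mark`` element and ``highlight-...`` class names are assumptions
    chosen for this example; any HTML path could be used instead.
    """
    return "\n".join(
        "highlight[color='{0}'] => mark.highlight-{0}".format(colour)
        for colour in colours
    )


style_map = highlight_style_map(HIGHLIGHT_COLOURS)
print(style_map)
```

The resulting string could then be passed as the ``style_map`` argument to ``mammoth.convert_to_html``.
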
Ignoring document elements
^^^^^^^^^^^^^^^^^^^^^^^^^^

Use ``!`` to ignore a document element. For instance, to ignore any
paragraph with the style ``Comment``:

::

    p[style-name='Comment'] => !

HTML paths
~~~~~~~~~~

Single elements
^^^^^^^^^^^^^^^

The simplest HTML path is to specify a single element. For instance, to
specify an ``h1`` element:

::

    h1

To give an element a CSS class, append a dot followed by the name of the
class:

::

    h1.section-title

To add an attribute, use square brackets similarly to a CSS attribute
selector:

::

    p[lang='fr']

To require that an element is fresh, use ``:fresh``:

::

    h1:fresh

Modifiers must be used in the correct order:

::

    h1.section-title:fresh

Separators
^^^^^^^^^^

To specify a separator to place between the contents of paragraphs that
are collapsed together, use ``:separator('SEPARATOR STRING')``.

For instance, suppose a document contains a block of code where each
line of code is a paragraph with the style ``Code Block``. We can write
a style mapping to map such paragraphs to ``<pre>`` elements:

::

    p[style-name='Code Block'] => pre

Since ``pre`` isn’t marked as ``:fresh``, consecutive ``pre`` elements
will be collapsed together. However, this results in the code all being
on one line. We can use ``:separator`` to insert a newline between each
line of code:

::

    p[style-name='Code Block'] => pre:separator('\n')

Nested elements
^^^^^^^^^^^^^^^

Use ``>`` to specify nested elements. For instance, to specify ``h2``
within ``div.aside``:

::

    div.aside > h2

You can nest elements to any depth.

Donations
---------

If you’d like to say thanks, feel free to `make a donation through
Ko-fi <https://ko-fi.com/S6S01MG20>`__.

If you use Mammoth as part of your business, please consider supporting
the ongoing maintenance of Mammoth by `making a weekly donation through
Liberapay <https://liberapay.com/mwilliamson/donate>`__.