Working with XML and HTML

The kpointer-ksoup module exposes HTML and XML documents — parsed by KSoup — as JSON-Pointer-addressable trees. It adapts a KSoup Element (including a Document) to the kPointer adapter model.

Read-only

This module is read-only — there is no mutate DSL, and calling .mutate {} on a ksoup adapter is a compile error.

The mapping

An element becomes a KpaStruct whose keys are, in order:

  1. the element's attribute names (excluding KSoup-internal attributes), then
  2. the distinct tag names of its child elements.

The rules:

  • When an attribute and a child element share a name, the attribute wins.
  • An attribute value is a string primitive (KpaPrimitive).
  • A child tag that occurs once is a nested struct; a child tag that occurs two or more times is a KpaList of those elements, in document order.

<foo>
    <bar goo="baz" />
</foo>

import com.commonsware.kpointer.ksoup.*
import com.fleeksoft.ksoup.Ksoup
import com.fleeksoft.ksoup.Ksoup

val doc = Ksoup.parseXml("<foo><bar goo=\"baz\"/></foo>")

doc.structAt("/foo/bar")          // KpaStruct with a single "goo" key
doc.primitiveAt("/foo/bar/goo")   // KpaPrimitive
doc.attributeAt("/foo/bar/goo")   // "baz"
doc.elementNodeAt("/foo/bar")     // the native KSoup Element <bar>
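
The ordering and shape rules above can be modelled in a few lines of plain Kotlin. This is only an illustrative sketch, not the module's implementation: El, keysOf, shapeOf, and sample are hypothetical names invented here, and none of them are KSoup or kPointer types.

```kotlin
// Hypothetical stand-in for a parsed element; not a KSoup or kPointer type.
data class El(
    val tag: String,
    val attrs: Map<String, String> = emptyMap(),
    val children: List<El> = emptyList(),
)

// Keys in the documented order: attribute names first, then distinct child tag names.
// distinct() keeps the first occurrence, so an attribute shadows a like-named child.
fun keysOf(el: El): List<String> =
    (el.attrs.keys.toList() + el.children.map { it.tag }).distinct()

// Shape that a plain key resolves to under the documented rules, or null if nothing matches.
fun shapeOf(el: El, key: String): String? {
    if (key in el.attrs) return "primitive" // the attribute wins over a like-named child
    return when (el.children.count { it.tag == key }) {
        0 -> null
        1 -> "struct"
        else -> "list"
    }
}

// Models <foo bar="attrval" id="1"><bar/><item/><item/><note/></foo>
val sample = El(
    tag = "foo",
    attrs = linkedMapOf("bar" to "attrval", "id" to "1"),
    children = listOf(El("bar"), El("item"), El("item"), El("note")),
)
```

Under this model, keysOf(sample) yields [bar, id, item, note]. shapeOf(sample, "bar") is "primitive" because the attribute shadows the child, while "item" maps to "list" and "note" maps to "struct".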

HTML wrapping

Ksoup.parse(html) wraps the content in html/head/body elements, so start navigation from the element you want, e.g. doc.body()?.structAt("/foo"). Ksoup.parseXml(xml) adds no wrapper.
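
As a rough illustration of the difference, the sketch below parses the same fragment both ways. It uses only calls that appear elsewhere on this page (Ksoup.parse, Ksoup.parseXml, body(), attributeAt); it assumes attributeAt is available on the Element returned by body() in the same way structAt is, and it states no expected output because it has not been run against the library.

```kotlin
import com.commonsware.kpointer.ksoup.*
import com.fleeksoft.ksoup.Ksoup

fun main() {
    val markup = "<foo><bar goo=\"baz\"/></foo>"

    // HTML parser: the fragment ends up under html/body, so start from body().
    val html = Ksoup.parse(markup)
    println(html.body()?.attributeAt("/foo/bar/goo"))

    // XML parser: no wrapper is added, so the same pointer applies at the document root.
    val xml = Ksoup.parseXml(markup)
    println(xml.attributeAt("/foo/bar/goo"))
}
```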

A note on shape

Because a list arises only from repeated sibling tags, the shape at a key is data-dependent: /list/item is a struct when there is a single item child but a list when there are several. When you want a list regardless of the child count, use the =children: accessor.

<!-- one item child → struct -->
<list><item id="a"/></list>

<!-- two item children → list -->
<list><item id="a"/><item id="b"/></list>

val one = Ksoup.parseXml("<list><item id=\"a\"/></list>")
val two = Ksoup.parseXml("<list><item id=\"a\"/><item id=\"b\"/></list>")

one.structAt("/list/item")         // KpaStruct — the single <item>
one.listAt("/list/item")           // null — it's a struct, not a list

two.listAt("/list/item")           // KpaList of size 2
two.structAt("/list/item")         // null — it's a list, not a struct

When the cardinality is not fixed by your schema, =children: gives a stable KpaList regardless:

one.listAt("/list/=children:item")   // KpaList of size 1
two.listAt("/list/=children:item")   // KpaList of size 2

=-prefixed synthetic accessors

A path segment beginning with = is a reserved synthetic accessor, not an attribute or child tag name. Neither an attribute name nor a tag name can contain = in well-formed XML or valid HTML, so these accessors never collide with real content in valid markup.
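
To make the reservation concrete, here is a minimal sketch of how a single path segment could be classified using the accessor names listed on this page. Every type and function name in it (Segment, Plain, Child, Children, TextAccessor, UnknownSynthetic, classify) is hypothetical, and how the module itself treats an unrecognized =-prefixed segment is not stated here, so that case is only labelled, not resolved.

```kotlin
// Hypothetical classification of one path segment; these names are illustrative only.
sealed interface Segment
data class Plain(val name: String) : Segment          // attribute or child tag name
data class Child(val name: String) : Segment          // =child:NAME
data class Children(val name: String) : Segment       // =children:NAME
data class TextAccessor(val name: String) : Segment   // =text, =html, and so on
data class UnknownSynthetic(val raw: String) : Segment

// The text and markup accessors documented on this page, without the leading "=".
private val textAccessors =
    setOf("ownText", "text", "wholeText", "wholeOwnText", "html", "outerHtml")

fun classify(segment: String): Segment = when {
    // No "=" prefix: an ordinary attribute or child tag name.
    !segment.startsWith("=") -> Plain(segment)
    segment.startsWith("=children:") -> Children(segment.removePrefix("=children:"))
    segment.startsWith("=child:") -> Child(segment.removePrefix("=child:"))
    segment.drop(1) in textAccessors -> TextAccessor(segment.drop(1))
    // Reserved prefix but not a documented accessor; behaviour is an open question here.
    else -> UnknownSynthetic(segment)
}
```

With this sketch, classify("item") is Plain("item"), classify("=children:item") is Children("item"), and classify("=text") is TextAccessor("text").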

Text and markup

Text nodes are not addressable as structure — only element structure and attributes participate in plain pointer navigation. The accessors below bridge that gap: each resolves to a string primitive derived from the element's text content or serialized markup.

The available trailing segments are:

Segment         KSoup method     Meaning
=ownText        ownText()        Direct text nodes only, normalized
=text           text()           Combined text of the element and descendants, normalized
=wholeText      wholeText()      Combined text, original whitespace preserved
=wholeOwnText   wholeOwnText()   Direct text nodes only, original whitespace preserved
=html           html()           Serialized inner markup (an opaque string, not a re-parsed tree)
=outerHtml      outerHtml()      Serialized markup including the element's own tags

val doc = Ksoup.parseXml("<p>Hello <b>there</b> now!</p>")

doc.primitiveAt("/p/=ownText")?.renderedString()   // "Hello now!"
doc.primitiveAt("/p/=text")?.renderedString()      // "Hello there now!"

// Or via the dedicated extensions, which take the path to the element itself:
doc.ownTextAt("/p")   // "Hello now!"
doc.textAt("/p")      // "Hello there now!"
doc.htmlAt("/p")      // "Hello <b>there</b> now!"
doc.outerHtmlAt("/p") // "<p>Hello <b>there</b> now!</p>"

The dedicated extensions are ownTextAt, textAt, wholeTextAt, wholeOwnTextAt, htmlAt, and outerHtmlAt (each with String and KPointer overloads). They return null when the path does not resolve to a single element.
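
Since these extensions can return null, callers will usually want a fallback. The sketch below is a hedged usage example: it assumes textAt returns a nullable String, and it shows no expected output because it has not been run against the library.

```kotlin
import com.commonsware.kpointer.ksoup.*
import com.fleeksoft.ksoup.Ksoup

fun main() {
    val doc = Ksoup.parseXml("<list><item>one</item><item>two</item></list>")

    // /list resolves to a single element, so its combined text is returned.
    println(doc.textAt("/list") ?: "(not a single element)")

    // /list/item matches two siblings, i.e. a list rather than one element,
    // so by the rule above the extension is expected to return null here.
    println(doc.textAt("/list/item") ?: "(not a single element)")

    // A path that matches nothing should also fall through to the default.
    println(doc.textAt("/list/missing") ?: "(not a single element)")
}
```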

Forcing a child element with =child:

Since an attribute wins over a like-named child by default, =child:NAME resolves to the child element(s) named NAME instead, ignoring any attribute of that name. It follows the same single-vs-repeat rule as a plain name:

// <foo bar="attrval"><bar baz="x"/></foo>
doc.attributeAt("/foo/bar")              // "attrval"  (attribute wins by default)
doc.structAt("/foo/=child:bar")          // the <bar> child element
doc.attributeAt("/foo/=child:bar/baz")   // "x"

Shape-stable child lists

=children:NAME always resolves to a list of the child elements named NAME, in document order — never collapsing a single match to a struct. Like =child:, it ignores a like-named attribute, and it resolves to null when there is no match:

// <root><item/><item/></root>  -> list of size 2
// <root><item/></root>         -> list of size 1
// <root/>                      -> null (no item children)
doc.listAt("/root/=children:item")

Because it resolves to null when absent, "=children:NAME" in element is true exactly when at least one child is named NAME.
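
The difference between a plain name, =child:NAME, and =children:NAME can be summarized with a small plain-Kotlin model. It is a sketch of the documented rules only: El, Resolved, Prim, Struct, Lst, resolve, collapse, and foo are hypothetical names, not kPointer types, and the model assumes that =child:NAME resolves to null when no child matches.

```kotlin
// Hypothetical stand-in for a parsed element; not a KSoup or kPointer type.
data class El(
    val tag: String,
    val attrs: Map<String, String> = emptyMap(),
    val children: List<El> = emptyList(),
)

// Hypothetical result shapes, loosely mirroring primitive / struct / list.
sealed interface Resolved
data class Prim(val value: String) : Resolved
data class Struct(val el: El) : Resolved
data class Lst(val els: List<El>) : Resolved

// Single-vs-repeat rule: one match is a struct, several are a list, none is null.
fun collapse(matches: List<El>): Resolved? = when (matches.size) {
    0 -> null
    1 -> Struct(matches.single())
    else -> Lst(matches)
}

fun resolve(el: El, segment: String): Resolved? {
    fun named(name: String) = el.children.filter { it.tag == name }
    return when {
        // =children:NAME never collapses: a non-empty match is always a list.
        segment.startsWith("=children:") ->
            named(segment.removePrefix("=children:")).takeIf { it.isNotEmpty() }?.let { Lst(it) }
        // =child:NAME skips the attribute check but keeps the single-vs-repeat rule.
        segment.startsWith("=child:") -> collapse(named(segment.removePrefix("=child:")))
        // Plain name: the attribute wins over a like-named child.
        segment in el.attrs -> Prim(el.attrs.getValue(segment))
        else -> collapse(named(segment))
    }
}

// Models <foo bar="attrval"><bar baz="x"/></foo>
val foo = El("foo", mapOf("bar" to "attrval"), listOf(El("bar", mapOf("baz" to "x"))))
```

In this model, resolve(foo, "bar") gives the attribute value, resolve(foo, "=child:bar") gives the single child as a struct, resolve(foo, "=children:bar") gives a one-element list, and resolve(foo, "=children:item") is null.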

Synthetic keys are not iterated

All =-prefixed accessors are resolvable via get() / contains() and path navigation, but they are not listed in keys or toMap() — iteration reflects only real attributes and child tags.