Working with XML and HTML
The kpointer-ksoup module exposes HTML and XML documents — parsed by
KSoup — as JSON-Pointer-addressable trees. It adapts a KSoup
Element (including a Document) to the kPointer adapter model.
Read-only
This module is read-only — there is no mutate DSL, and calling .mutate {} on a ksoup
adapter is a compile error.
The mapping
An element becomes a KpaStruct whose keys are, in order:
- the element's attribute names (excluding KSoup-internal attributes), then
- the distinct tag names of its child elements.
The rules:
- When an attribute and a child element share a name, the attribute wins.
- An attribute value is a string primitive (KpaPrimitive).
- A child tag that occurs once is a nested struct; a child tag that occurs two or more times is a KpaList of those elements, in document order.
import com.commonsware.kpointer.ksoup.*
import com.fleeksoft.ksoup.Ksoup
val doc = Ksoup.parseXml("<foo><bar goo=\"baz\"/></foo>")
doc.structAt("/foo/bar") // KpaStruct with a single "goo" key
doc.primitiveAt("/foo/bar/goo") // KpaPrimitive
doc.attributeAt("/foo/bar/goo") // "baz"
doc.elementNodeAt("/foo/bar") // the native KSoup Element <bar>
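The mapping rules above can be modelled in plain Kotlin without the library. The following is only a sketch: Elem, keysOf, and resolve are hypothetical names invented for illustration, not kPointer or KSoup types, and it assumes that a name shared by an attribute and a child tag appears once in the key list, in the attribute's position.

```kotlin
// Hypothetical stand-in for a parsed element; not a KSoup or kPointer type.
data class Elem(
    val tag: String,
    val attrs: Map<String, String> = emptyMap(),   // mapOf() keeps insertion order
    val children: List<Elem> = emptyList(),
)

// Keys in order: attribute names first, then distinct child tag names.
// Assumption: a name shared by an attribute and a child is listed once.
fun keysOf(e: Elem): List<String> {
    val keys = LinkedHashSet<String>()
    keys += e.attrs.keys
    keys += e.children.map { it.tag }
    return keys.toList()
}

// Plain-name resolution: a String for an attribute, an Elem for a single child,
// a List<Elem> for a repeated child tag, null when nothing matches.
fun resolve(e: Elem, name: String): Any? {
    e.attrs[name]?.let { return it }               // the attribute wins
    val matches = e.children.filter { it.tag == name }
    return when (matches.size) {
        0 -> null
        1 -> matches.single()
        else -> matches
    }
}

fun main() {
    val foo = Elem(
        tag = "foo",
        attrs = mapOf("id" to "1", "bar" to "attrval"),
        children = listOf(Elem("bar"), Elem("item"), Elem("item")),
    )
    println(keysOf(foo))                             // [id, bar, item]
    println(resolve(foo, "bar"))                     // attrval
    println((resolve(foo, "item") as List<*>).size)  // 2
}
```

In this model the return type of resolve varies with the data, which is the same data-dependent shape the real adapter exposes through structAt and listAt.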
HTML wrapping
Ksoup.parse(html) wraps content in html/head/body, so address from the element you want —
e.g. doc.body()?.structAt("/foo"). Ksoup.parseXml(xml) adds no wrapper.
A note on shape
Because a list arises only from repeated sibling tags, the shape at a key is data-dependent:
/list/item is a struct when there is a single item child but a list when there are
several. When you want a list regardless of the child count, use the
=children: accessor.
<!-- one item child → struct -->
<list><item id="a"/></list>
<!-- two item children → list -->
<list><item id="a"/><item id="b"/></list>
val one = Ksoup.parseXml("<list><item id=\"a\"/></list>")
val two = Ksoup.parseXml("<list><item id=\"a\"/><item id=\"b\"/></list>")
one.structAt("/list/item") // KpaStruct — the single <item>
one.listAt("/list/item") // null — it's a struct, not a list
two.listAt("/list/item") // KpaList of size 2
two.structAt("/list/item") // null — it's a list, not a struct
When the cardinality is not fixed by your schema, =children: gives a stable KpaList regardless:
one.listAt("/list/=children:item") // KpaList of size 1
two.listAt("/list/=children:item") // KpaList of size 2
=-prefixed synthetic accessors
A path segment beginning with = is a reserved synthetic accessor, not an attribute or child tag
name. A well-formed XML/HTML attribute name can never contain =, so these never collide with real
content in valid markup.
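As a rough illustration of how the reserved = prefix separates accessors from real names, the sketch below classifies a single path segment. The Segment types and the classify function are hypothetical and exist only for this example; the accessor names are the ones documented on this page, and how the module actually parses segments is not stated here.

```kotlin
// Hypothetical classification of one path segment; not part of kPointer.
sealed interface Segment
data class Plain(val name: String) : Segment          // attribute or child tag name
data class TextAccessor(val kind: String) : Segment   // =text, =ownText, =html, ...
data class Child(val name: String) : Segment          // =child:NAME
data class Children(val name: String) : Segment       // =children:NAME

private val textKinds =
    setOf("ownText", "text", "wholeText", "wholeOwnText", "html", "outerHtml")

fun classify(segment: String): Segment? {
    // No leading '=' means a real attribute or child tag name.
    if (!segment.startsWith("=")) return Plain(segment)
    val body = segment.removePrefix("=")
    return when {
        body.startsWith("children:") -> Children(body.removePrefix("children:"))
        body.startsWith("child:") -> Child(body.removePrefix("child:"))
        body in textKinds -> TextAccessor(body)
        else -> null   // reserved prefix, but not an accessor listed on this page
    }
}

fun main() {
    println(classify("item"))            // Plain(name=item)
    println(classify("=children:item"))  // Children(name=item)
    println(classify("=ownText"))        // TextAccessor(kind=ownText)
    println(classify("=bogus"))          // null
}
```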
Text and markup
Text nodes are not addressable as structure: only element structure and attributes participate in plain pointer navigation. The accessors below bridge that gap. Used as the trailing segment of a path, each resolves to a string primitive derived from the element's text content or serialized markup:
| Segment | KSoup method | Meaning |
|---|---|---|
| =ownText | ownText() | Direct text nodes only, normalized |
| =text | text() | Combined text of the element and descendants, normalized |
| =wholeText | wholeText() | Combined text, original whitespace preserved |
| =wholeOwnText | wholeOwnText() | Direct text nodes only, original whitespace preserved |
| =html | html() | Serialized inner markup (an opaque string, not a re-parsed tree) |
| =outerHtml | outerHtml() | Serialized markup including the element's own tags |
val doc = Ksoup.parseXml("<p>Hello <b>there</b> now!</p>")
doc.primitiveAt("/p/=ownText")?.renderedString() // "Hello now!"
doc.primitiveAt("/p/=text")?.renderedString() // "Hello there now!"
// Or via the dedicated extensions, which take the path to the element itself:
doc.ownTextAt("/p") // "Hello now!"
doc.textAt("/p") // "Hello there now!"
doc.htmlAt("/p") // "Hello <b>there</b> now!"
doc.outerHtmlAt("/p") // "<p>Hello <b>there</b> now!</p>"
The dedicated extensions are ownTextAt, textAt, wholeTextAt, wholeOwnTextAt, htmlAt, and
outerHtmlAt (each with String and KPointer overloads). They return null when the path does
not resolve to a single element.
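To make the difference between "direct text nodes only" and "combined text", and between normalized and whole text, more concrete, here is a simplified model in plain Kotlin. Node, Text, and Elem are hypothetical types for this sketch only, and normalize assumes that normalization means collapsing whitespace runs to a single space and trimming; the real KSoup methods handle more cases than this.

```kotlin
// Hypothetical node model; not KSoup's own node classes.
sealed interface Node
data class Text(val value: String) : Node
data class Elem(val tag: String, val children: List<Node>) : Node

// Simplified normalization (an assumption): collapse whitespace runs, then trim.
fun normalize(s: String): String = s.trim().replace(Regex("\\s+"), " ")

// Direct text nodes only, whitespace untouched.
fun wholeOwnText(e: Elem): String =
    e.children.filterIsInstance<Text>().joinToString("") { it.value }

// Text of the element and all descendants, whitespace untouched.
fun wholeText(e: Elem): String =
    e.children.joinToString("") { child ->
        when (child) {
            is Text -> child.value
            is Elem -> wholeText(child)
        }
    }

fun ownText(e: Elem): String = normalize(wholeOwnText(e))
fun text(e: Elem): String = normalize(wholeText(e))

fun main() {
    // Models <p>Hello <b>there</b> now!</p>
    val p = Elem("p", listOf(Text("Hello "), Elem("b", listOf(Text("there"))), Text(" now!")))
    println(ownText(p))   // Hello now!
    println(text(p))      // Hello there now!
}
```

In this model the un-normalized own text of the paragraph keeps both spaces around the removed child ("Hello" and "now!" separated by two spaces), which is why the normalized variants are usually the more convenient ones for display.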
Forcing a child element with =child:
By default, an attribute wins over a like-named child. To reach the child instead, use
=child:NAME, which resolves to the child element(s) named NAME and ignores any attribute of
that name. It follows the same single-vs-repeat rule as a plain name: one match is a struct,
two or more are a list.
val doc = Ksoup.parseXml("<foo bar=\"attrval\"><bar baz=\"x\"/></foo>")
doc.attributeAt("/foo/bar") // "attrval" (attribute wins by default)
doc.structAt("/foo/=child:bar") // the <bar> child element
doc.attributeAt("/foo/=child:bar/baz") // "x"
Shape-stable child lists
=children:NAME always resolves to a list of the child elements named NAME, in document
order — never collapsing a single match to a struct. Like =child:, it ignores a like-named
attribute, and it resolves to null when there is no match:
// <root><item/><item/></root> -> list of size 2
// <root><item/></root> -> list of size 1
// <root/> -> null (no item children)
doc.listAt("/root/=children:item")
Because it resolves to null when absent, "=children:NAME" in element is true exactly when at
least one child is named NAME.
Synthetic keys are not iterated
All =-prefixed accessors are resolvable via get() / contains() and path navigation, but
they are not listed in keys or toMap() — iteration reflects only real attributes and
child tags.
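The split between what can be resolved and what is iterated can be sketched with a small map-like view. Elem and ElemView below are hypothetical stand-ins, not the kPointer KpaStruct, and the view models only the =child: and =children: accessors described above.

```kotlin
// Hypothetical stand-ins; not the kPointer KpaStruct or a KSoup Element.
data class Elem(
    val tag: String,
    val attrs: Map<String, String> = emptyMap(),
    val children: List<Elem> = emptyList(),
)

class ElemView(private val e: Elem) {
    // Iteration sees only real names: attributes first, then distinct child tags.
    val keys: List<String>
        get() = (e.attrs.keys + e.children.map { it.tag }).toList()

    operator fun get(key: String): Any? = when {
        // Always a list, or null when there is no match.
        key.startsWith("=children:") -> named(key.removePrefix("=children:")).ifEmpty { null }
        // Child element(s) only, ignoring a like-named attribute.
        key.startsWith("=child:") -> collapse(named(key.removePrefix("=child:")))
        // Plain name: the attribute wins, otherwise the child element(s).
        else -> e.attrs[key] ?: collapse(named(key))
    }

    operator fun contains(key: String): Boolean = get(key) != null

    private fun named(name: String): List<Elem> = e.children.filter { it.tag == name }

    // One match is the element itself; two or more are a list.
    private fun collapse(matches: List<Elem>): Any? = when (matches.size) {
        0 -> null
        1 -> matches.single()
        else -> matches
    }
}

fun main() {
    val root = ElemView(Elem("root", mapOf("id" to "r"), listOf(Elem("item"))))
    println(root.keys)                                 // [id, item]
    println("=children:item" in root)                  // true
    println("=children:other" in root)                 // false
    println((root["=children:item"] as List<*>).size)  // 1
}
```

The synthetic keys answer get and contains but never show up in keys, so code that iterates a struct sees only names that exist in the markup.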