
How to Parse XML with Nokogiri Without Losing HTML Entities?

In the AFTER section of the output, Ruby is removing all the HTML entities. How can I parse XML with Nokogiri without losing HTML entities?

Solution 1:

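Your test file might contain invalid entities, for example a bare & that is not escaped as &amp;. By default Nokogiri tries to recover from such errors, and the recovery can drop the entities that follow.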

nokogiri.rb:

require 'nokogiri'

puts "--- INVALID ---"
invalid_xml = <<-XML
<blog:entryFull>invalid M&Ms</blog:entryFull><!-- invalid M and M's --><blog:entryFull>&lt;p&gt;&lt;iframe src="http://w.soundcloud.com/player/?url=http%3A%2F%2Fapi.soundcloud.com%2Ftracks%2F39858946&amp;amp;show_artwork=true" width="100%" height="166" frameborder="no" scrolling="no"&gt;&lt;/iframe&gt;&lt;/p&gt;</blog:entryFull>
XML

doc = Nokogiri::XML::DocumentFragment.parse(invalid_xml)
puts doc

puts "--- VALID ---"
valid_xml = <<-XML
<blog:entryFull>valid M&amp;Ms</blog:entryFull><!-- valid M and M's --><blog:entryFull>&lt;p&gt;&lt;iframe src="http://w.soundcloud.com/player/?url=http%3A%2F%2Fapi.soundcloud.com%2Ftracks%2F39858946&amp;amp;show_artwork=true" width="100%" height="166" frameborder="no" scrolling="no"&gt;&lt;/iframe&gt;&lt;/p&gt;</blog:entryFull>
XML

doc = Nokogiri::XML::DocumentFragment.parse(valid_xml)
puts doc

result:

$ ruby nokogiri.rb
--- INVALID ---
<blog:entryFull>invalid M</blog:entryFull><!-- invalid M and M's --><blog:entryFull>
piframe src="http://w.soundcloud.com/player/?url=http%3A%2F%2Fapi.soundcloud.com%2Ftracks%2F39858946amp;show_artwork=true" width="100%" height="166" frameborder="no" scrolling="no"/iframe/p</blog:entryFull>
--- VALID ---
<blog:entryFull>valid M&amp;Ms</blog:entryFull><!-- valid M and M's --><blog:entryFull>&lt;p&gt;&lt;iframe src="http://w.soundcloud.com/player/?url=http%3A%2F%2Fapi.soundcloud.com%2Ftracks%2F39858946&amp;amp;show_artwork=true" width="100%" height="166" frameborder="no" scrolling="no"&gt;&lt;/iframe&gt;&lt;/p&gt;</blog:entryFull>

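So, you can either:

  1. Fix the input XML so that every bare & is escaped as &amp;
  2. Use the STRICT parse option, so that invalid input raises an error instead of being silently repaired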
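
For step 1, one possible way to repair the input is to escape bare ampersands before handing the string to the parser. The sketch below is only an illustration, not a general XML sanitizer: the helper name escape_bare_ampersands and the regex are assumptions introduced here. It leaves anything that already looks like an entity or character reference untouched, so a named HTML entity such as &nbsp; would still need separate handling.

```ruby
# Hypothetical helper (not part of Nokogiri): escape every "&" that does not
# already begin a named entity (&amp;), a decimal reference (&#38;), or a
# hexadecimal reference (&#x26;).
BARE_AMPERSAND = /&(?!(?:[A-Za-z][A-Za-z0-9]*|#[0-9]+|#x[0-9A-Fa-f]+);)/

def escape_bare_ampersands(xml)
  xml.gsub(BARE_AMPERSAND, '&amp;')
end

puts escape_bare_ampersands('invalid M&Ms')            # bare "&" gets escaped
puts escape_bare_ampersands('valid M&amp;Ms &lt;p&gt;') # existing entities are left alone
```

The repaired string can then be passed to Nokogiri::XML::DocumentFragment.parse as in the VALID case above.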

strict parsing example:

invalid_xml = <<-XML
<?xml version="1.0" encoding="UTF-8"?><root><blog:entryFull>invalid M&Ms</blog:entryFull><blog:entryFull>&lt;p&gt;&lt;iframe src="http://w.soundcloud.com/player/?url=http%3A%2F%2Fapi.soundcloud.com%2Ftracks%2F39858946&amp;amp;show_artwork=true" width="100%" height="166" frameborder="no" scrolling="no"&gt;&lt;/iframe&gt;&lt;/p&gt;</blog:entryFull></root>
XML

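begin
  # In strict mode the parser does not try to recover from syntax errors
  doc = Nokogiri::XML(invalid_xml) do |config|
    config.strict # strict parsing
  end
  puts doc
rescue Nokogiri::XML::SyntaxError => e
  # Report why the input was rejected instead of silently losing content
  puts "INVALID XML: #{e.message}"
end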

Solution 2:

Qambar, I am unable to recreate your issue. However, I am able to produce your desired output given these files/input:

test.xml

<blog:entryFull> &lt;p&gt;&lt;iframe src="https://w.soundcloud.com/player/?url=http%3A%2F%2Fapi.soundcloud.com%2Ftracks%2F39858946&amp;amp;show_artwork=true%22" width="100%" height="166" frameborder="no" scrolling="no"&gt;&lt;/iframe&gt;&lt;/p&gt;</blog:entryFull>

nokogiri.rb

require 'nokogiri'

# Read the whole file at once; File.read closes the handle automatically
contents = File.read("./test.xml")

puts "--- BEFORE ---"
puts contents
puts "--- AFTER ---"

doc = Nokogiri::XML::DocumentFragment.parse(contents)
puts doc.inner_html

Console

Development/Code » ruby nokogiri.rb
--- BEFORE ---
<blog:entryFull> &lt;p&gt;&lt;iframe src="https://w.soundcloud.com/player/?url=http%3A%2F%2Fapi.soundcloud.com%2Ftracks%2F39858946&amp;amp;show_artwork=true%22" width="100%" height="166" frameborder="no" scrolling="no"&gt;&lt;/iframe&gt;&lt;/p&gt;</blog:entryFull>
--- AFTER ---
<blog:entryFull> &lt;p&gt;&lt;iframe src="https://w.soundcloud.com/player/?url=http%3A%2F%2Fapi.soundcloud.com%2Ftracks%2F39858946&amp;amp;show_artwork=true%22" width="100%" height="166" frameborder="no" scrolling="no"&gt;&lt;/iframe&gt;&lt;/p&gt;</blog:entryFull>

Solution 3:

The workaround I used was to extract the XML tag with a regex, decode its HTML entities with htmlentities, and then parse the result with Nokogiri's HTML parser.
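
A rough sketch of the first two steps of that workaround, using only Ruby's standard library. The tag name, the regex, and the helper name extract_embedded_html are assumptions for illustration, and CGI.unescapeHTML stands in for whichever entity decoder the answer actually used; it decodes only the basic entities (&lt;, &gt;, &amp;, &quot;, &apos;) and numeric references.

```ruby
require 'cgi'

# Hypothetical helper: pull the text of the first <blog:entryFull> element out
# with a regex and decode one level of entity escaping.
def extract_embedded_html(xml)
  match = xml.match(%r{<blog:entryFull>(.*?)</blog:entryFull>}m)
  return nil unless match

  CGI.unescapeHTML(match[1])
end

xml = '<blog:entryFull>&lt;p&gt;&lt;iframe src="http://example.com/?a=1&amp;amp;b=2"&gt;&lt;/iframe&gt;&lt;/p&gt;</blog:entryFull>'
puts extract_embedded_html(xml)
```

The decoded string can then be handed to Nokogiri's HTML parser, for example with Nokogiri::HTML::DocumentFragment.parse. A regex is fragile for nested or attribute-bearing tags, so this only suits input whose shape is known in advance.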
