Translate

Sunday, 27 March 2022

How to remove invalid characters from XML (Msg 9420 illegal xml character)

There are many businesses where You got data from someone outside of Your company, change it and send it somewehere else. You cannot trust to Your "data supplier", but You are responsible for data You provide to. One of complicated cases I spotted is building XML from strings delivered from outside. As You maybe know, there are many illegal XMl characters. Some of them could be visually detected - examples are: &, >, < symbols. It can be removed with minimum effort. Unfortunately, there are more of problematic characters.Moreover, it can not be detected visually. e.g. control chars "Start of Header" CHAR(1) OR "Backspace" CHAR(8). Let us try to analyze such invalid source.
DECLARE @Definition NVARCHAR(MAX)='some '+CHAR(1)+'t'+CHAR(8)+'ext';
If we try to build this value from original we would get an error:
SELECT CASt(@Definition as xml)
'Msg 9420, Level 16, State 1, Line 3
XML parsing: line 1, character 6, illegal xml character'
You could say we have a hint here: the char #6 is illegal. But we have more than one there.How we can detect all the symbols? The easiest way to do this is the split source string into separate chars and trying to parse every char as XML. If You got an exception - here is Your illegal character. Let us create the table for storing the result:
DECLARE @T Table (
  [Char]   CHAR(1) NOT NULL,
  Code     int,
  Position int,
  N        NVARCHAR(4000),
  IsError  BIT )
For string division let us use recursivŠµ CTE.
;WITH H
AS
(
 SELECT
    SUBSTRING(@Definition, 1, 1) AS [Char]
  , ASCII(SUBSTRING(@Definition, 1, 1)) AS Code
  , 1 AS Position
 UNION ALL
 SELECT
    SUBSTRING(@Definition, H2.Position + 1, 1) AS [Char]
  , ASCII(SUBSTRING(@Definition, H2.Position + 1, 1)) AS Code
  , H2.Position + 1 AS Position
 FROM H H2
  WHERE
   H2.Position < LEN(@Definition)
)
,    H2
AS
(
 SELECT
    H.[Char]
  , H.Code
  , H.Position
  , ''+H.Char+'' N
 FROM H
)
INSERT INTO @T ( [Char], Code, Position, N )
SELECT
   [Char]
 , Code
 , Position
 , N
FROM H2 H
OPTION ( MAXRECURSION 0 )
Now our sub-XMls are ready for being parsed. Let us use cursor for such purpose.
WHILE (1=1)
BEGIN
 SET @I=(SELECT top(1)
  t.Position
 FROM @T T
  WHERE
   T.Position>@I
 ORDER BY T.Position ASC)
 IF (@I IS NULL)
  BREAK;
 BEGIN TRY
 SELECT
  @curStr=t.N
 FROM @T T
  WHERE
   T.Position=@I
 SET @curXML=CAST(@curStr as xml)
 END TRY
 BEGIN CATCH
 UPDATE T
 SET T.IsError = 1
 FROM @T T
  WHERE
   T.Position=@I
 END CATCH
END