Friday 29 May 2009

Upgrading a major project to Delphi 2009

Having finished converting a major project, one that has occupied a fairly large programming team for several years, to Delphi 2009, I'm now ready to blog about the experience.

If you want to estimate the amount of work needed to convert a project, note that the number of lines of code is not what matters. What matters more is what kind of code it is, how well segmented it is, and how consistently each segment is written. Recent user interface stuff, business logic etc. is very easy to convert. I'd almost say compile & run. Some other parts are definitely not.

Our code was clearly segmented, in that each unit belongs to one of these groups:

* Very old code. Some parts really had some ancient stuff in them, especially external components, but this was solved simply by renaming string=>ansistring, char=>ansichar, pchar=>pansichar, and fixing Windows API calls to use the Ansi versions. After that, everything runs as before, but it is not unicode enabled.

* 3rd party components. We upgraded most of them, and some old free source code versions were upgraded by ourselves.

* User interface stuff. Apart from upgrading some component packages, we did not have to change anything here.

* Special home-made components. These sometimes contained optimizations that made it necessary to adapt them to Delphi 2009, but generally they just worked.

* Business logic. Some general things had to be changed, but there were few things and they were easy to spot and fix. It was almost like search & replace.

* Bit-manipulating units. These units need to treat each byte by itself, and the usual remedy was to convert them as if they were really old code, after which it was sometimes fairly easy to unicode-enable them.

* I/O routines. They basically had to be rewritten. We have switched some of our output text files to utf-8, in order to unicode-enable them, but others we left as they were. The main problem was with code that tried to do unicode in Delphi 2006/2007, because it stored utf-8 in ansistring. The solution is to remove all utf-8 conversion inside the algorithm, and apply it only at the I/O point.
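
The idea in the last bullet, converting only at the I/O point, is not specific to Delphi. Here is a minimal sketch in Python rather than Delphi; the function names (write_report, read_report, shout) are made up for the illustration and are not from our code base. Inside the program everything is ordinary Unicode text, and utf-8 appears only where bytes enter or leave.

```python
import os
import tempfile


def write_report(path, lines):
    # Encoding to utf-8 happens here, at the output point, and nowhere else.
    with open(path, "w", encoding="utf-8", newline="\n") as f:
        for line in lines:
            f.write(line + "\n")


def read_report(path):
    # Decoding from utf-8 happens here, at the input point.
    with open(path, "r", encoding="utf-8") as f:
        return [line.rstrip("\n") for line in f]


def shout(lines):
    # "Algorithm" code: works on text and knows nothing about utf-8.
    return [line.upper() for line in lines]


def demo():
    # Round trip through a temporary file (hypothetical sample data).
    with tempfile.TemporaryDirectory() as folder:
        path = os.path.join(folder, "report.txt")
        write_report(path, shout(["grød", "Привет"]))
        return read_report(path)
```

Because shout never sees bytes, it cannot accidentally split or double-encode a multi-byte sequence, which is the kind of bug that utf-8 stored in ansistring tends to invite.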

The hardest part was blobs. Sometimes they contain binary stuff, and sometimes they contain text. Unfortunately, .AsString was suitable for both in Delphi 2007, but now they need to be treated completely separately. The solution was to duplicate a lot of procedures, one for RawByteString and one for String, and then use the appropriate procedure on the appropriate fields.
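
As a rough illustration of that split (in Python, not the Delphi code we actually wrote), the two accessors below are hypothetical stand-ins for the duplicated procedures. The utf-8 default for text blobs is an assumption made for the example only.

```python
def blob_as_bytes(raw):
    # Binary blob: hand the bytes back untouched.
    return bytes(raw)


def blob_as_text(raw, encoding="utf-8"):
    # Text blob: decode exactly once, with an explicitly stated encoding.
    # (utf-8 is an assumption for this sketch.)
    return bytes(raw).decode(encoding)


# Hypothetical sample data: the start of a PNG file and a short note.
image_blob = b"\x89PNG\r\n\x1a\n"
note_blob = "Smørrebrød".encode("utf-8")

picture = blob_as_bytes(image_blob)   # still bytes
note = blob_as_text(note_blob)        # now text
```

Calling the text accessor on the binary blob fails loudly here (the first byte is not valid utf-8), which is preferable to silently mangling the data.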

It was not hard to do the conversion, and now we have unicode enabled a significant part of the system, with more to come, soon. However, it surely takes some refactoring experience to convert source code efficiently - if you are not good at refactoring, it will take significantly more time.

Finally, I have a few tips for those who are about to go through the same process:

* Gradually converting the code is easier than doing it all in one attempt. Make development projects that include more and more of your units, until you have converted everything.

* It is not difficult to make your code able to compile with Delphi 2006, 2007 and 2009 at the same time. Do that, so that your team can still be productive until everything works.

* Even when you convert a full unit by making it use ansistring internally, consider using string for the interface section, so that the conversion is hidden inside the unit. This keeps the number of special cases lower.

* Get rid of your string warnings by fixing them, not by ignoring them. Most of them are seriously easy to fix.

* Always review your changes before committing them to your source code repository, so that you're sure that you only changed what you meant to change :-)


Anonymous said...

Thanks for the report and the tips! Good stuff.

Javier Santo Domingo said...

I totally agree with you. Gradually converting the code is the best technique for sure. And the dev projects let you keep the conversion ordered, and the progress on it is easy to measure.
Also the "check before check-in" is a MUST (you can check if the changes are saved or if you forgot to save them, if the changes are only what you did or if the IDE/experts/etc changed some stuff in some place you don't know about -Delphi does it with DFMs all the time-, etc)... but seriously, most coders don't do it. It seems to be a bad habit that code repositories have introduced into our work (the rollback thingie can make you forget about precision, and that leads to a quality loss in the way you work).

Jolyon Smith said...

Some useful observations, but how did you handle database issues?

I'm thinking specifically of situations where you've now got a Unicode GUI and Unicode support in your data access code (TField.AsString = Unicode) but your existing, underlying database schema is not Unicode enabled.

How is this managed in an incremental approach (where some code is Unicodified, and other code isn't)? Did you also incrementally migrate your DB schema?

Or did you simply introduce additional validation to the GUI to prevent users from entering characters outside of ranges supported by an existing non-Unicode schema?

How did users of your major project respond to the need to migrate their databases to a Unicode schema? (assuming you did not have a Unicode schema already in place).

What benefit did they derive (if they did not previously, nor currently, need Unicode)?

What penalties did they suffer (database size/performance), if any?

If SQL Server was involved, how much impact, if any, did the need to escape Unicode string literals have?

What impact, if any, did the resulting changes in your database schema have on any 3rd party or end-user scripts or applets etc that were interacting directly with the database?

I'd be interested in the answers to these questions, because it seems to me that a migration to Unicode is about more than just source code when databases are involved (where those databases used schemas which were not themselves already "Unicodified" - which is surely likely if the application itself is/was not Unicode).

Lars D said...

We have not made a release based on our Delphi 2009 version yet, and we didn't do the upgrade in order to provide Unicode features to the customers. Even with Delphi 2006, our application was available in many locales, even Russian, and had no problems with that. The main reason to upgrade is to increase programmer productivity and improve source code readability.

We have not migrated the database to Unicode yet, and several parts of our application don't do Unicode, either. Currently, this means that if you save special characters, these may not be shown after pressing the save button. We only use TIBSQL components, and they return data from the database as "string", no matter what character set was used in the database, which is very backwards compatible with Delphi 2007. We did have to replace TIBSQL.AsString with TIBSQL.AsBytes for blobs, but we encapsulated that in a function Blob2Str(field:TIBXSQLVAR):RawByteString, which is written in a way that is Delphi 2006 and Delphi 2009 compatible.
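
To show the general effect of saving Unicode text into a non-Unicode character set, here is a small Python sketch. It is not how Firebird or IBX behave in detail; cp1252 and the '?' substitution are assumptions chosen for the example, and both helper names are hypothetical.

```python
def survives_round_trip(text, charset):
    """Return True if text can be stored in charset without loss."""
    try:
        return text.encode(charset).decode(charset) == text
    except UnicodeEncodeError:
        return False


def what_comes_back(text, charset):
    # Mimics a lossy save: characters the charset lacks become '?'.
    return text.encode(charset, errors="replace").decode(charset)


fits = survives_round_trip("Søren", "cp1252")       # True
does_not_fit = survives_round_trip("Łódź", "cp1252")  # False
shown_after_save = what_comes_back("Łódź", "cp1252")  # "?ód?"
```

A check like survives_round_trip is also the simplest form of the input validation asked about above: it tells you, before saving, whether the field's character set can hold what the user typed.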

Currently we have no plan for input validation. Unicode has a general problem about that, because different fonts and different computers have different abilities to show some characters. Even if you can enter and save a character, it may not be shown on other computers. But the topic is interesting, and maybe we should look into it.

I have examined the possibilities for migrating Firebird 2.1 to Unicode, and it seems that we just have to ensure that all developers have database management tools that are recent enough to support unicode well. Then we can switch the database character set to utf8, and remove some utf-8 conversion stuff that we implemented ourselves for some special Unicode cases in Delphi 2006. The main reason why this app does not already use utf8 is that it's an old Firebird 1.0 database structure, which has been continuously upgraded. Back then, utf8 did not exist; we only had unicode_fss, which was not an optimal solution. The actual switch to the utf8 character set can be done as part of our automatic database upgrade process, so that no database technician has to be involved at the customer site when doing the switch.

Jolyon Smith said...

Thanks for the additional info Lars. It's ironic that the smaller/lightweight DB's perhaps present less of a challenge in this respect (no disrespect to Firebird, if anything a compliment!).

In the case of SQL Server and Oracle in particular migration seems to present a far greater challenge (if it is to be done right).

That is, if some of the technical articles on the subject on the MS and Oracle sites are to be believed.

The hard part for us is that our application is a "vanilla" ANSI app, so by definition none of our customers have need for Unicode (if they did, they wouldn't be customers!) but it's actually harder to move to D2009 *without* supporting Unicode (i.e. so-called "ANSIfying" the app).

But if we implement Unicode support in the DB (schema changes) to make the source code migration "worthwhile", our existing customers will have to endure downtime and serious migration efforts for zero benefit (for them) and some cost (increased DB size and some reductions in performance).

Lars D said...

Our app was basically a vanilla ansi app, too - but that always worked well with different locales, as long as you don't mix them. We had some unicode-stuff inside, but it was a bad attempt at preparing the application for the future. If we had kept the application 100% non-unicode, the transition to Delphi 2009 would have been easier.
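
What "mixing" locales does to an ansi app can be shown in a few lines of Python. The code pages (cp1251 for Russian, cp1252 for Western European) are picked for the illustration only.

```python
# Text written by an ansi app running under a Russian code page.
stored = "Привет".encode("cp1251")

# Read back under the same code page: the text is intact.
same_locale = stored.decode("cp1251")

# Read back under a Western code page: same bytes, different characters.
mixed_locale = stored.decode("cp1252")
```

Here same_locale is "Привет" again, while mixed_locale comes out as "Ïðèâåò". Nothing is lost at the byte level; the bytes are simply interpreted with the wrong table.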

The upgrade process is optimized towards converting plain-ansi apps to unicode-internally apps. This means that the upgraded application uses unicode everywhere, but converts to ansistring during I/O to files and databases. I guess that most database components will handle your non-unicode database nicely in Delphi 2009.

The upgrade process for a database is always a big discussion. How do you upgrade your customers' databases? We planned for 100-10,000 customer production database upgrades per year, which is too many to do manually, so we introduced database structure versioning and automatic database upgrade tools, written using Delphi. You can do that for any normal SQL database, including Oracle and MS SQL Server.
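
A minimal sketch of database structure versioning, written in Python against SQLite only because both are easy to run anywhere; our real tool is written in Delphi and targets Firebird. The table, the migration list and the use of SQLite's PRAGMA user_version as the version store are all assumptions for the example.

```python
import sqlite3

# Hypothetical upgrade steps: (version reached, SQL that gets there).
MIGRATIONS = [
    (1, "CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT)"),
    (2, "ALTER TABLE customer ADD COLUMN email TEXT"),
    (3, "CREATE INDEX customer_name_idx ON customer (name)"),
]


def schema_version(conn):
    # SQLite keeps one integer per database file for exactly this purpose.
    return conn.execute("PRAGMA user_version").fetchone()[0]


def upgrade(conn):
    """Apply every migration newer than the version stored in the database."""
    applied = []
    for version, sql in MIGRATIONS:
        if version > schema_version(conn):
            conn.execute(sql)
            # PRAGMA takes no bound parameters; version is a trusted int.
            conn.execute(f"PRAGMA user_version = {int(version)}")
            conn.commit()
            applied.append(version)
    return applied
```

Running upgrade on a fresh database applies steps 1 to 3; running it again applies nothing. That is what makes it safe to ship the same tool to every customer regardless of how old their database is.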

Our database upgrade tool introduces downtime, but data conversions can generally be improved a lot in speed, if you do it like this:

* Shut down the app
* Make backup of database
* Switch from write-through to cached writes
* Run the automated database upgrade process
* Switch from cached writes to write-through
* Run test
* Enable the app

In our case, we're usually talking about minutes for this, not hours, so you can still get 99.99% uptime even though you upgrade several times a year, without using fail-over and stuff.
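
The arithmetic behind that claim, as a small Python sketch. The example figures (four or six upgrades of ten minutes each) are hypothetical and are not measurements from our system.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525600


def uptime_percent(upgrades_per_year, minutes_per_upgrade):
    # Uptime if upgrades are the only source of downtime.
    downtime = upgrades_per_year * minutes_per_upgrade
    return 100.0 * (1 - downtime / MINUTES_PER_YEAR)


def downtime_budget_minutes(target_percent):
    # Minutes per year you may be down and still meet the target.
    return MINUTES_PER_YEAR * (1 - target_percent / 100.0)


budget = downtime_budget_minutes(99.99)   # about 52.56 minutes
four_upgrades = uptime_percent(4, 10)     # 40 minutes down
six_upgrades = uptime_percent(6, 10)      # 60 minutes down
```

A 99.99% target leaves roughly 52.6 minutes a year. Four ten-minute upgrades fit inside that; six do not.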

Jolyon Smith said...

Thanks for even MORE info! :)

We too have an automated DB update tool which takes care of DB structure changes and preserving data etc.

For SQL Server things get very complicated though, and can have an additional impact on source code and server configuration (if optimal performance is to be preserved). The source code impact (which may also affect stored procedures) arises if you use string literals and require them to be Unicode. They will otherwise be converted to ANSI in the codepage of the connection's workstation. So if you need/wish them to be Unicode then you have to escape them, so at the very least there is an additional migration task to review/evaluate string literals in SQL in source code.

As I said, all of this is based on reading the materials on MS/Oracle sites. Some real world experiences from ISVs using Delphi that have taken customers through this process would be useful, to say the least.

Lars D said...

Actually, I have no idea how IBX handles unicode SQL string literals.

You can implement automatic unicode escaping rather easily. Create a new class, which works exactly like your existing database access classes (inherit from them...), but which automatically escapes unicode inside. Then, search & replace, so that only the new class is used.
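
A sketch of the escaping step itself, in Python, assuming that the "escaping" in question is the N'...' prefix that SQL Server uses for Unicode string literals. The helper name tsql_literal is hypothetical, and the sketch only renders one value as a literal; finding literals inside existing SQL text, which the wrapper class would also have to do, is not covered.

```python
def tsql_literal(text):
    """Render text as a T-SQL string literal."""
    escaped = text.replace("'", "''")   # double any embedded single quotes
    if text.isascii():
        return "'" + escaped + "'"      # a plain literal is enough
    return "N'" + escaped + "'"         # N prefix marks a Unicode literal


plain = tsql_literal("O'Brien")     # 'O''Brien'
unicode_one = tsql_literal("Привет")  # N'Привет'
```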

Another solution is to replace the literals with parameters. It may reduce performance a bit in some cases, because extra network traffic is required, but for most applications, that extra network traffic is insignificant, because repeated use of the same query will become faster, and not slower.
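
What the parameter approach looks like, sketched in Python with SQLite as a stand-in database (the table and the sample names are made up). The SQL text stays constant and plain ASCII, and the Unicode values travel separately, so there is nothing to escape.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customer (name TEXT)")

# One constant statement, reused for every value.
insert_sql = "INSERT INTO customer (name) VALUES (?)"
for name in ["Jørgensen", "O'Brien", "Привет"]:
    conn.execute(insert_sql, (name,))

# The search value is a parameter too, not a literal in the SQL text.
found = conn.execute(
    "SELECT name FROM customer WHERE name = ?", ("Привет",)
).fetchall()
total = conn.execute("SELECT COUNT(*) FROM customer").fetchone()[0]
```

Note that the embedded quote in O'Brien needs no special handling either, which is a side benefit of the same change.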

Anyway, by using backwards source code compatibility and test-projects, you can start to do the conversion today, and then measure progress. When you have achieved 10% conversion of your source code, multiply the amount of time spent by 10, and you have an estimate for the total work that is needed.
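
The same rule of thumb for any measuring point, as a tiny Python sketch. The 40 hours in the example is a hypothetical figure, and the estimate assumes the remaining code is about as hard to convert as the part already done.

```python
def estimate_total_hours(hours_spent, fraction_converted):
    # Linear extrapolation: total = time so far / share of code converted.
    if not 0 < fraction_converted <= 1:
        raise ValueError("fraction_converted must be in (0, 1]")
    return hours_spent / fraction_converted


total = estimate_total_hours(40, 0.10)   # about 400 hours
remaining = total - 40                   # about 360 hours still to go
```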

Anonymous said...

Wish Embarcadero had a "non-unicode" Delphi. They're motivating me to at least think about changing to Visual Studio.

Lars D said...

If you want to be strictly non-unicode, transition to Delphi 2009 is much easier, because you can use ansistring a lot more, and you can ignore all the conversion warnings.

A lot of the problems with Unicode are actually not related to Unicode itself, they are related to the choice of UTF-16 for storing texts. Java, Win32 and .net all made this fatal choice, based on the belief that they could make 1 character = 2 bytes. However, with Unicode, that's not the case. For instance, a musical note occupies 4 bytes, or 2 positions in a Unicodestring.
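
This is easy to check. The Python sketch below uses U+1D11E, the G clef from the Musical Symbols block, as the example character (common signs such as U+266A sit in the 2-byte range and would not show the effect).

```python
symbol = "\U0001D11E"  # MUSICAL SYMBOL G CLEF

utf16 = symbol.encode("utf-16-le")

code_points = len(symbol)        # 1 character
utf16_bytes = len(utf16)         # 4 bytes
utf16_units = len(utf16) // 2    # 2 positions in a UTF-16 string

# The two 16-bit units form a surrogate pair.
surrogates = [hex(int.from_bytes(utf16[i:i + 2], "little")) for i in (0, 2)]
```

The two units are 0xd834 and 0xdd1e, so a length function that counts 16-bit positions reports 2 for this single character.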

Even the python language struggles with unicode these days. On Linux, you can use normal 8-bit strings for unicode, whereas cross-platform Unicode abilities require special Unicode strings.

I really like Delphi 2009, because the Unicode support was made in a really nice way. It still supports ansistring, it converts automatically, standard I/O converts automatically, and it makes your code look better. If you choose not to unicode-enable a unit, and keep everything in ansistring, that unit looks exactly as good as before.

Jolyon Smith said...

I thought the choice of UTF-16 (in Windows at least) stemmed from the EARLIER choice to use UCS-2 in that mistaken 2-bytes per char assumption, rather than any specific quality of UTF-16 per se?

I agree that sticking with ANSIString/ANSIChar is a good way to go to preserve "ANSI-ness".

I just wish CodeGear would help us do that.

All it would need is a compiler directive to allow us to control whether String=ANSIString or String=UnicodeString in any given unit.

If this directive were incorporated in VCL units (to =UnicodeString) as a matter of course, then a PROJECT wide directive would enable an app to be compiled for ANSI much more easily than is possible today.

Today we would have to go through all our units changing our string declarations to ANSI types.

Then when the day DOES come for us to make the transition to Unicode we have to go through it all again.

Running to stand still is bad enough.

Running to go backwards is just plain stupid.

The really annoying thing is that whilst such a directive/switch would BY NO MEANS be a perfect solution, it would be another tool in the Delphi tool-chest that we could use at our own discretion.

And it would not be difficult to implement, I am sure.

Lars D said...

After switching to Delphi 2009, we had a performance problem on 64-bit Windows Vista. As usual, it was a combination of multiple things, but after fixing the problems, we realized that the main time was spent in Ansistring<->Unicodestring conversions.

I guess this is why CodeGear did not want to enable string=ansistring in units: it could have a serious impact on your application's performance. Converting between string and ansistring is an expensive operation.

Therefore, I also need to revise my previous advice: Be careful about using ansistring in units that receive a LOT of strings from unicodestring units. Keep strings as unicodestring=string wherever possible.

LDS said...

"rather than any specific quality of UTF-16 per se?"

No. UTF-16 is a good balance between simplicity (most used codepoints - BMP - are always two bytes, while UTF-8 requires a multibyte sequence for anything beyond 7 bit ASCII characters) and memory (compared to the larger UTF-32, which may use mbcs anyway). It is a tradeoff, but the others could have been even worse, especially outside western european languages.
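
The size side of that tradeoff can be measured directly. A small Python sketch with three arbitrary sample words; the samples are chosen for the illustration and prove nothing about real documents, which usually mix in ASCII markup, digits and spaces.

```python
samples = {"English": "hello", "Russian": "привет", "Chinese": "你好"}


def encoded_sizes(text):
    # Byte count of the same text in each encoding (no BOM).
    return {enc: len(text.encode(enc))
            for enc in ("utf-8", "utf-16-le", "utf-32-le")}


sizes = {language: encoded_sizes(word) for language, word in samples.items()}
```

For these words the byte counts (utf-8 / utf-16 / utf-32) are 5 / 10 / 20 for English, 12 / 12 / 24 for Russian and 6 / 4 / 8 for Chinese.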

"All it would need is a compiler directive to allow us to control"

If it were so simple, they would have done it. It is much more complex - they would have to maintain two different sets of units for the VCL - and packages.
The compiler would have to keep track of calls going in and out of units compiled with different string types, performing implicit conversions (and maybe losing data, depending on the locale) each time.
The same unit could be added *twice* to the executable, if one piece of code called the Ansi version and another the unicode version. What about controls? A form with one TEdit from a Unicode unit, and another one from an Ansi one???
Debugging could become quite a nightmare.

I suggest everyone complaining about the missing compiler switch try to understand what using Unicode really means - and how it works.

Ah, and as Anonymous said - feel free to switch to VS - .NET is fully Unicode from the ground up... if you need a non Unicode Delphi any version before 2009 is.

Thanks to Lars for a hands-on case about migrating - Unicode is here to stay.

Lars D said...

.NET being fully unicode means that it is not Windows compatible, because Windows is a mix of character sets ;-) I don't believe that is different than Delphi 2009.

Delphi 2009 handles unicode and other character sets nicely.

The problems with UTF-16 are:

* A character is 2 or 4 bytes. If your string-length function returns the number of bytes divided by two, then it does not count characters.
* Windows is not based on it. Many files use ANSI.
* Communication usually uses utf-8 and not utf-16.
* UTF-16 exists in two flavors, UTF-16BE and UTF-16LE
* Because Windows uses a mix of character sets, it introduces a lot of nasty fixes, like BOM, which prevent normal operations like file concatenation and other stuff.
* It is often not possible to write one algorithm to handle UTF-16 and binary data, whereas the same algorithm can do binary data and UTF-8 at the same time.
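
Two of the points above, the BOM getting in the way of file concatenation and the two byte orders, can be reproduced in a few lines of Python. The "files" are just byte strings built for the example.

```python
import codecs

bom = codecs.BOM_UTF16_LE

# Two small UTF-16LE "files", each starting with a BOM.
file_a = bom + "ab".encode("utf-16-le")
file_b = bom + "cd".encode("utf-16-le")

# Plain byte concatenation: the second BOM ends up inside the text.
joined = (file_a + file_b).decode("utf-16")

# The same text in the other flavor is a different byte sequence.
little_endian = "ab".encode("utf-16-le")
big_endian = "ab".encode("utf-16-be")

# BOM-less utf-8 concatenates without any special handling.
joined_utf8 = ("ab".encode("utf-8") + "cd".encode("utf-8")).decode("utf-8")
```

Here joined is "ab\ufeffcd", five characters with a stray U+FEFF in the middle, while joined_utf8 is simply "abcd".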

LDS said...

.NET strings are Unicode. Of course VC++ (which can avoid using .NET) supports other encodings, but because the direct competitor of Delphi is C#, moving to VS would mean moving to Unicode strings (and immutable ones!) anyway - making the assertion "if Delphi is only Unicode I'll switch to VS" a bit silly.

"because Windows is a mix of character sets"
AFAIK currently supported versions of Windows are UTF-16 internally - other encodings are supported, of course, but the "native" encoding is UTF-16.

"A character is 2 or 4 bytes"
Even when using ANSI you can encounter MBCS encodings. With UTF-8 it is even worse, you often have one, two, three and four byte sequences - unless you use plain English only.

"Many files use ANSI."
What is used to store data has nothing to do with how those data are represented once read from file. Many files are binary too...

"Communication usually uses utf-8"
That's up to the protocol used. Many RFC protocols use plain old ASCII, not UTF-8. DCOM uses UTF-16 strings.

"UTF-16 exists in two flavors"
That happens with any data structure larger than one byte. For that matter, you can't reliably exchange any binary file/buffer among platforms with different endianness, unless the application reading them is aware of it.

"introduces a lot of nasty fixes, like BOM"
The BOM is a *standard* unicode feature, not a Windows one. Of course any Unicode-aware application must handle it properly. DCE/RPC has even more complex ways to address the endianness issues.
Of course, if you have to feed data to a non-Unicode aware application you have to translate them into the proper encoding and codepage, even if the source is UTF-8, because UTF-8 <> ANSI(anycodepage)

"It is often not possible to write one algorithm"
Almost any algorithm that implies a 1:1 relationship between bytes and characters is doomed to fail using any encoding but ASCII - ANSI and UTF-8 included. The world is far larger than Europe and USA.

I see that many complaints about Unicode stem from the fact that one byte is no longer one character, and strings become more complex data structures than arrays.
Let's face it, handling strings will be a little more complex for programmers from now on - but that is what modern applications need and we have to change our mindset.

Lars D said...

LDS, you should have a look at how Unicode programming is done on Linux these days.

LDS said...

Does Linux implement Unicode in the optimal way? Is there a standard that says that?
I don't think so. *nix relies too much on text files and their format (the silly shebang line, to name one, because of lack of file metadata), and that made using Unicode in Linux much more complex than in OSes and frameworks with more modern architectures.
That is their problem; frankly, as long as Delphi is a Windows development tool I don't care about Linux Unicode implementation issues.
Anyway, again, I have to stress the difference between internal UTF-16 representations (wchar_t - even in gcc, Delphi Unicodestring), and files or buffers to be transmitted, where UTF-8 can be easily used wherever you need it - does Delphi force you to use only UTF-16? It can read and write data in any encoding you like.
Everybody has been using UTF-16 Windows since NT4: did somebody ever find any issue? Was anybody unable to read or process plain old text files?
Frankly, my worst nightmare would be an application using UTF-8 *internally*, having to handle one, two and three byte codepoints all the time - unless you use only plain English text.
Backward compatibility is important, but it is better to look to the future, not the past.

Lars D said...

It is very common to fear the unknown ;-)

Warren said...

I over-reacted to the potential complexity of a Delphi 2009 upgrade and found after I had attempted it, that 90% of the work was learning to understand my code more than I had previously understood it.

Once I got good at it, I realized that this was really a journey into making my code much better. It was not just a "make work" project, where the time required for the upgrade was wasted, but rather an effort where I not only gained a giant codebase that runs on all delphi versions from 7 to present, but also, I understood deep internal issues within my architectures, that I had previously glossed over.

Also, my source code is now better organized, more modular, and my applications and components are clearly documented, and have build instructions, and little notes everywhere, explaining how everything works.

In other words, it is easy to forget how much "cruft" is in your code, and when you do a major upgrade, it seems silly to me to merely stop at "fixing" the code so it compiles, rather than taking the opportunity to use the "breakage" you encountered to discover things you had forgotten, or never knew, about what lurks inside your code.

I found and rooted out some archaisms that date back to the TurboPascal era, because porting them to delphi 2009, untyped parameters and all, was holding me back.

The most complex area for porting, for me, was my byte-wide serial communications libraries, com port componentry, and serial communications protocol code. Once I had that stuff sorted out, the rest of the porting was easy.

I came away with immense respect for Delphi 2009. It really kicks some serious hiney.