> In the above example, on every loop, the += operator causes a new string to be allocated, and the content to be copied, which gets exponentially more expensive as the string grows.
But that’s only the theoretical behaviour. In practice, languages tend to end up optimising it in various ways. As noted a paragraph later, the Java compiler is able to detect and “fix” this, by rewriting the code to use a mutable string.
Another solution is to put concatenation into your string type as another possible representation. I believe at least some (no idea if it’s all) JavaScript engines do this. You end up with something like this (expressed in Rust syntax, and much simpler than the real ones are):
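A rough sketch of that idea, not taken from any real engine. The type name `Str`, the two flat variants, and the `concat` and `flatten` helpers are assumptions made for illustration; only the `Concatenated` variant is named in the comment itself.

```rust
use std::rc::Rc;

/// Hypothetical string type in which concatenation is just another representation.
#[derive(Debug, Clone, PartialEq)]
pub enum Str {
    /// Flat buffer where every character fits in one byte.
    Latin1(Vec<u8>),
    /// Flat buffer of UTF-16 code units.
    Utf16(Vec<u16>),
    /// Two halves that have been joined logically but not copied yet.
    Concatenated(Rc<Str>, Rc<Str>),
}

impl Str {
    /// O(1) concatenation: remembers both halves, copies no characters.
    pub fn concat(left: Rc<Str>, right: Rc<Str>) -> Str {
        Str::Concatenated(left, right)
    }

    /// Appends this string's UTF-16 code units to `out`.
    /// Recursive for brevity; a production engine would avoid deep recursion.
    fn write_units(&self, out: &mut Vec<u16>) {
        match self {
            Str::Latin1(bytes) => out.extend(bytes.iter().map(|&b| u16::from(b))),
            Str::Utf16(units) => out.extend_from_slice(units),
            Str::Concatenated(left, right) => {
                left.write_units(out);
                right.write_units(out);
            }
        }
    }

    /// Returns a flat copy, paying the copying cost once for the whole tree.
    pub fn flatten(&self) -> Str {
        match self {
            Str::Concatenated(..) => {
                let mut units = Vec::new();
                self.write_units(&mut units);
                if units.iter().all(|&u| u <= 0xFF) {
                    // Everything fits in one byte, so use the compact form.
                    Str::Latin1(units.iter().map(|&u| u as u8).collect())
                } else {
                    Str::Utf16(units)
                }
            }
            flat => flat.clone(),
        }
    }
}
```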
Then, when you try to access the string, if it’s Concatenated it’ll flatten it into one of the other representations.
Thus, the += itself becomes cheap, and in typical patterns you only incur the cost of allocating a new string once, when you next try to read from it (including things like JSON.stringify(object_containing_this_string) or element.setAttribute(name, this_string)).
> As noted a paragraph later, the Java compiler is able to detect and “fix” this, by rewriting the code to use a mutable string.
Does it actually do that nowadays? Back in my day it was incapable of lifting the builder out of loops, so for each iteration it would instantiate a builder with the accumulator string, append the concatenation, then stringify and reset the accumulator.
The linked docs don’t say anything about loops.
I agree. If you append to a string in a loop in Java, you will see quadratic behavior.
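To put numbers on "quadratic", here is a small model written in Rust rather than Java. It is only a sketch: `concat_by_copy` and `concat_in_place` are hypothetical helpers that count copied bytes, on the assumption that each `+=` on an immutable string allocates a new buffer and copies the old contents into it.

```rust
/// Models `+=` on an immutable string: every step allocates a new buffer
/// and copies the whole accumulator plus the new piece into it.
/// Returns the final string and the total number of bytes copied.
pub fn concat_by_copy(piece: &str, n: usize) -> (String, usize) {
    let mut acc = String::new();
    let mut copied = 0;
    for _ in 0..n {
        let mut next = String::with_capacity(acc.len() + piece.len());
        next.push_str(&acc);
        next.push_str(piece);
        copied += next.len();
        acc = next; // the old buffer is dropped here
    }
    (acc, copied)
}

/// Appends into a single preallocated buffer, the way a builder hoisted
/// out of the loop would. Only the new piece is copied on each step.
pub fn concat_in_place(piece: &str, n: usize) -> (String, usize) {
    let mut acc = String::with_capacity(piece.len() * n);
    let mut copied = 0;
    for _ in 0..n {
        acc.push_str(piece);
        copied += piece.len();
    }
    (acc, copied)
}
```

With a 2-byte piece and 1000 iterations, the copying version moves 2 + 4 + ... + 2000 = 1,001,000 bytes, while the in-place version moves 2,000.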
> Another solution is to put concatenation into your string type
Aah, the Erlang way of handling strings.
On Beam (Erlang's VM), that goes as deep as IO. It's perfectly fine to pass a (possibly nested) list of strings (whether charlists or binaries) to an IO function, and the system just knows how to deal with that.
An iolist isn't a string; you can't pass it to the uppercase function, for instance. It's really meant for I/O, as the name implies. Regular string concatenation is optimized to avoid copying when possible: https://www.erlang.org/doc/system/binaryhandling.html#constr...
Article claims Python 3 uses UTF-8.
https://stackoverflow.com/questions/1838170/ "In Python 3.3 and above, the internal representation of the string will depend on the string, and can be any of latin-1, UCS-2 or UCS-4, as described in PEP 393."
Article also says PHP has immutable strings. They are mutable, although often copied.
Article also claims that the majority of popular languages have immutable strings. As well as the ones listed as exceptions, there are also PHP and Rust (and C, though they did say C++, and obviously Ruby, since that's the subject of the article).
I'm also a bit surprised by the last sentence. "However, if you do measure a negative performance impact, there is no doubt you are measuring incorrectly." There must surely be programs doing a lot of string building or in-place modification that would benefit from non-frozen.
> can be any of latin-1, UCS-2 or UCS-4, as described in PEP 393
My bad, I haven't seriously used Python for over 15 years now, so I stand corrected (and will clarify the post).
My main point stands though, Python strings have an internal representation, but it's not exposed to the user like Ruby strings.
> Article also says PHP has immutable strings. They are mutable, although often copied.
Same. Thank you for the correction, I'll update the post.
> There must surely be programs doing a lot of string building or in-place modification that would benefit from non-frozen.
The point is that the magic comment (or the --enable-frozen-string-literal option) only applies to literals. If you have some code that uses a mutable string and iteratively appends to it, flipping that switch doesn't change that. It just means you'll have to explicitly create a mutable string. So it doesn't change the performance profile.
In C, C++, and Rust, the question of "are strings in this language mutable or immutable?" isn't applicable, because those languages have transitive mutability qualifiers. So they only need a single string type, and whether you can mutate it or not depends on context. (C++ and Rust have multiple string types, but the differences among them aren't about mutability.) In languages without this feature, a given value is either always mutable or never mutable, and so it's necessary to pick one or the other for string literals.
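A minimal Rust sketch of what "depends on context" means here. The `User` struct and the three functions are made up for illustration. The same `String` field is read-only behind a shared reference and mutable behind an exclusive one.

```rust
/// Hypothetical struct that owns a `String`.
pub struct User {
    pub name: String,
}

/// Shared reference: read-only access to the struct and, transitively,
/// to the `String` inside it.
pub fn greeting(user: &User) -> String {
    // user.name.push('!'); // would not compile: `user` is a `&` reference
    format!("Hello, {}", user.name)
}

/// Exclusive reference: the very same `String` field can now be mutated.
pub fn add_suffix(user: &mut User, suffix: &str) {
    user.name.push_str(suffix);
}

/// Builds one value and uses it through both kinds of reference.
pub fn demo() -> (String, String) {
    let mut user = User { name: String::from("Ada") };
    let before = greeting(&user);
    add_suffix(&mut user, " L.");
    (before, greeting(&user))
}
```

There is one string type throughout; what changes is the kind of access the caller hands out.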
Sure, that doesn't change the point that mutable strings are a thing in those languages. And I don't think C's const is really a "mutability qualifier" - certainly not a very effective one at any rate.
For the record, mutable strings, eh, bytearray objects, are also a thing in Python: https://docs.python.org/3/library/stdtypes.html#bytearray-ob...
Python strings aren’t even proper Unicode strings. They’re sequences of code points rather than scalar values, meaning they can contain surrogates. This is incompatible with basically everything: UTF-* as used by sensible things, and unvalidated UTF-16 as used in the likes of JavaScript, Windows wide strings and Qt.
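For contrast, a small Rust sketch of the code point versus scalar value distinction. Rust's `char` is defined as a Unicode scalar value, so a lone surrogate can't be represented. The helper names below are hypothetical; the standard-library calls they wrap (`char::from_u32`, `String::from_utf16`, `String::from_utf16_lossy`) are real.

```rust
/// True if `cp` is a valid code point that is not a scalar value,
/// which for in-range inputs means a surrogate (U+D800 to U+DFFF).
pub fn is_surrogate_code_point(cp: u32) -> bool {
    cp <= 0x10FFFF && char::from_u32(cp).is_none()
}

/// Strict decoding: unpaired surrogates make the whole input invalid.
pub fn decode_utf16_strict(units: &[u16]) -> Option<String> {
    String::from_utf16(units).ok()
}

/// Lossy decoding: each unpaired surrogate becomes U+FFFD.
pub fn decode_utf16_lossy(units: &[u16]) -> String {
    String::from_utf16_lossy(units)
}
```

A properly paired surrogate sequence decodes fine; an unpaired one is either rejected or replaced, but never stored as-is.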
But isn't 'surrogateescape' supposed to address this? (no expert)
https://vstinner.github.io/pep-383.html
Important information omitted from title: this is for the Ruby language.
It's perfectly fine to have mutable strings in a hash table; just document that the behavior becomes unspecified if keys are mutated while they are in the table.
Make sure the behavior is safe: it won't crash or be exploitable by a remote attacker.
It works especially well in a language that doesn't emphasize mutation; i.e. you don't reach for string mutation as your go-to tool for manipulation.
Explicit "freeze" stuff is an awful thing to foist onto the programmer.
> just document that the behavior becomes unspecified if keys are mutated while they are in the table.
> Make sure the behavior is safe: it won't crash or be exploitable by a remote attacker.
There is no such thing as unspecified but safe behaviour. Developers who can't predict what will happen will make invalid assumptions which will lead to security vulnerabilities when they are violated.
In general, Ruby does allow mutable values in hash tables, with basically those semantics: https://docs.ruby-lang.org/en/3.4/Hash.html#class-Hash-label...
The copy-and-freeze behavior is a special case that applies only to strings, presumably because the alternative was too much of a footgun since programmers usually think of strings in terms of value semantics.
I don't think anyone likes the explicit .freeze calls everywhere; I think the case for frozen strings in Ruby is primarily based on performance rather than correctness (which is why it wasn't obvious earlier in the language's history that it was the right call), and the reason it's hard to make the default is because of compatibility.
> since programmers usually think of strings in terms of value semantics.
Can you blame them, when you go out of your way to immerse strings in the stateful OOP paradigm, with idioms like "foo".upcase!
If you give programmers mainly a functional library for string manipulations that returns new values, then that's what they will use.