Friday 27 October 2017

regex - How to validate an email address using a regular expression?


Over the years I have slowly developed a regular expression that validates
MOST email addresses correctly, assuming they don't use an IP address as the
server part.



I use it in several PHP programs, and it works most of the time. However, from
time to time I get contacted by someone who is having trouble with a site that
uses it, and I end up having to make some adjustment (most recently I realized
that I wasn't allowing 4-character TLDs).



What is the best regular
expression you have or have seen for validating
emails?




I've seen several solutions that use functions built from several shorter
expressions, but I'd rather have one long complex expression in a simple
function than several short expressions in a more complex function.


Answer



The fully RFC 822 compliant regex
(http://ex-parrot.com/~pdw/Mail-RFC822-Address.html) is inefficient and obscure
because of its length. Fortunately, RFC 822 was superseded twice and the
current specification for email addresses is RFC 5322. RFC 5322 leads to a
regex that can be understood if studied for a few minutes and is efficient
enough for actual use.



One RFC 5322 compliant regex can be found at the top of the page at
http://emailregex.com/, but it uses the IP address pattern that is floating
around the internet with a bug that allows 00 for any of the unsigned byte
decimal values in a dot-delimited address, which is illegal. The rest of it
appears to be consistent with the RFC 5322 grammar and passes several tests
using grep -Po, including cases with domain names, IP addresses, bad ones, and
account names with and without quotes.
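
To make the 00 problem concrete, here is a minimal Python sketch. The "buggy" octet pattern is an assumption: it is a widely copied form that accepts 00, not a pattern quoted from the answer. The "fixed" pattern is the octet alternation from the regex in this answer, with its capturing groups made non-capturing. The names BUGGY_OCTET, FIXED_OCTET and accepts are hypothetical.

```python
import re

# Assumption: a commonly copied octet pattern that shows the bug described above.
BUGGY_OCTET = re.compile(r"25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?")

# Octet alternation taken from the corrected regex in this answer.
FIXED_OCTET = re.compile(r"2(?:5[0-5]|[0-4][0-9])|1[0-9][0-9]|[1-9]?[0-9]")


def accepts(pattern, text):
    """Return True if the whole string is a match for the octet pattern."""
    return pattern.fullmatch(text) is not None


if __name__ == "__main__":
    for octet in ("0", "00", "9", "99", "199", "255", "256"):
        print(octet, accepts(BUGGY_OCTET, octet), accepts(FIXED_OCTET, octet))
```

Both patterns reject 256, but only the assumed buggy one accepts 00.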



Correcting the 00 bug in the IP pattern, we obtain a working and fairly fast
regex. (Scrape the rendered version, not the markdown, for actual code.)






(?:[a-z0-9!#$%&'*+/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*|"(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21\x23-\x5b\x5d-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])*")@(?:(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+[a-z0-9](?:[a-z0-9-]*[a-z0-9])?|\[(?:(?:(2(5[0-5]|[0-4][0-9])|1[0-9][0-9]|[1-9]?[0-9]))\.){3}(?:(2(5[0-5]|[0-4][0-9])|1[0-9][0-9]|[1-9]?[0-9])|[a-z0-9-]*[a-z0-9]:(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21-\x5a\x53-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])+)\])
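
As a rough illustration of how the regex above might be used, here is a Python sketch. The pattern is the one shown above, split across adjacent string literals only for readability. Two choices are assumptions of this sketch, not part of the answer: re.IGNORECASE is used because the character classes are lowercase only, and fullmatch is used so the whole string must be an address. The function name is_rfc5322_like is hypothetical.

```python
import re

# The regex from the answer, split into adjacent raw strings for readability.
EMAIL_PATTERN = re.compile(
    r"(?:[a-z0-9!#$%&'*+/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*"
    r'|"(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21\x23-\x5b\x5d-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])*")'
    r"@(?:(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+[a-z0-9](?:[a-z0-9-]*[a-z0-9])?"
    r"|\[(?:(?:(2(5[0-5]|[0-4][0-9])|1[0-9][0-9]|[1-9]?[0-9]))\.){3}"
    r"(?:(2(5[0-5]|[0-4][0-9])|1[0-9][0-9]|[1-9]?[0-9])|[a-z0-9-]*[a-z0-9]:"
    r"(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21-\x5a\x53-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])+)\])",
    re.IGNORECASE,  # assumption: accept uppercase letters as well
)


def is_rfc5322_like(address):
    """Return True if the entire string matches the pattern above."""
    return EMAIL_PATTERN.fullmatch(address) is not None


if __name__ == "__main__":
    samples = [
        "user@example.com",
        '"quoted"@example.com',
        "user@[192.168.0.1]",
        "user@[192.168.00.1]",
        "myemail@address,com",
    ]
    for sample in samples:
        print(sample, is_rfc5322_like(sample))
```

A match only says the string is syntactically plausible; it says nothing about whether the mailbox exists.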






Here is a diagram of the finite-state machine
(https://en.wikipedia.org/wiki/Finite-state_machine) for the above regex, which
is clearer than the regex itself:
https://regexper.com/#(%3F%3A%5Ba-z0-9!%23%24%25%26'*%2B%2F%3D%3F%5E_%60%7B%7C%7D~-%5D%2B(%3F%3A%5C.%5Ba-z0-9!%23%24%25%26'*%2B%2F%3D%3F%5E_%60%7B%7C%7D~-%5D%2B)*%7C%22(%3F%3A%5B%5Cx01-%5Cx08%5Cx0b%5Cx0c%5Cx0e-%5Cx1f%5Cx21%5Cx23-%5Cx5b%5Cx5d-%5Cx7f%5D%7C%5C%5C%5B%5Cx01-%5Cx09%5Cx0b%5Cx0c%5Cx0e-%5Cx7f%5D)*%22)%40(%3F%3A(%3F%3A%5Ba-z0-9%5D(%3F%3A%5Ba-z0-9-%5D*%5Ba-z0-9%5D)%3F%5C.)%2B%5Ba-z0-9%5D(%3F%3A%5Ba-z0-9-%5D*%5Ba-z0-9%5D)%3F%7C%5C%5B(%3F%3A(%3F%3A(2(5%5B0-5%5D%7C%5B0-4%5D%5B0-9%5D)%7C1%5B0-9%5D%5B0-9%5D%7C%5B1-9%5D%3F%5B0-9%5D))%5C.)%7B3%7D(%3F%3A(2(5%5B0-5%5D%7C%5B0-4%5D%5B0-9%5D)%7C1%5B0-9%5D%5B0-9%5D%7C%5B1-9%5D%3F%5B0-9%5D)%7C%5Ba-z0-9-%5D*%5Ba-z0-9%5D%3A(%3F%3A%5B%5Cx01-%5Cx08%5Cx0b%5Cx0c%5Cx0e-%5Cx1f%5Cx21-%5Cx5a%5Cx53-%5Cx7f%5D%7C%5C%5C%5B%5Cx01-%5Cx09%5Cx0b%5Cx0c%5Cx0e-%5Cx7f%5D)%2B)%5C%5D)




The more sophisticated patterns in Perl and PCRE (the regex library used, e.g.,
in PHP) can correctly parse RFC 5322 without a hitch
(https://stackoverflow.com/questions/201323/what-is-the-best-regular-expression-for-validating-email-addresses/1917982#1917982).
Python and C# can do that too, but they use a different syntax from those first
two. However, if you are forced to use one of the many less powerful
pattern-matching languages, then it's best to use a real parser.



It's also important to understand that validating an address per the RFC tells
you absolutely nothing about whether that address actually exists at the
supplied domain, or whether the person entering the address is its true owner.
People sign others up to mailing lists this way all the time. Fixing that
requires a fancier kind of validation that involves sending that address a
message that includes a confirmation token meant to be entered on the same web
page where the address was entered.



Confirmation tokens are the only way
to know you got the address of the person entering it. This is why most mailing lists
now use that mechanism to confirm sign-ups. After all, anybody can put down
president@whitehouse.gov, and that will even parse as legal,
but it isn't likely to be the person at the other
end.
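
The confirmation-token flow described above can be sketched roughly as follows. This is an illustration only: the in-memory dictionary, the function names issue_token and confirm, and the single-use policy are all assumptions, and a real site would persist tokens with an expiry and actually send the message.

```python
import hmac
import secrets

# Hypothetical in-memory store mapping an address to its pending token.
_pending = {}


def issue_token(address):
    """Create a random token to be mailed to the address being confirmed."""
    token = secrets.token_urlsafe(32)
    _pending[address] = token
    return token


def confirm(address, token):
    """Return True once if the token matches the one issued for the address."""
    expected = _pending.get(address)
    if expected is None:
        return False
    # Constant-time comparison on bytes, so arbitrary user input is safe to pass.
    if not hmac.compare_digest(expected.encode("utf-8"), token.encode("utf-8")):
        return False
    del _pending[address]  # assumption: tokens are single-use
    return True
```

Only someone who can read mail sent to the address can supply the token, which is the property a regex can never give you.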



For PHP, you should not use the pattern given in Validate an E-Mail Address
with PHP, the Right Way (http://www.linuxjournal.com/article/9585), from which
I quote:






There is some danger that common usage and widespread sloppy coding will
establish a de facto standard for e-mail addresses that is more restrictive than the
recorded formal
standard.




That is no better than all the other non-RFC patterns. It isn't smart enough to
handle even RFC 822, let alone RFC 5322. This one
(https://stackoverflow.com/questions/201323/what-is-the-best-regular-expression-for-validating-email-addresses/1917982#1917982),
however, is.



If you want to get fancy and pedantic, implement a complete state engine
(http://cubicspot.blogspot.com/2012/06/correct-way-to-validate-e-mail-address.html).
A regular expression can only act as a rudimentary filter. The problem with
regular expressions is that telling someone that their perfectly valid e-mail
address is invalid (a false positive, that is, a wrongly reported error)
because your regular expression can't handle it is just rude from the user's
perspective. A state engine for the purpose can both validate and even correct
e-mail addresses that would otherwise be considered invalid, as it disassembles
the e-mail address according to each RFC. This allows for a potentially more
pleasing experience, like





The specified e-mail address 'myemail@address,com' is invalid. Did you mean
'myemail@address.com'?
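
A message like the one above could be produced by something far simpler than a full state engine. The Python sketch below is only a single-typo heuristic (a comma typed in the domain), not the state engine the linked article describes; the function names suggest_correction and invalid_message are hypothetical.

```python
def suggest_correction(address):
    """Suggest a fix when a comma was typed in the domain, else return None."""
    local, at, domain = address.rpartition("@")
    if not at or "," not in domain:
        return None
    return f"{local}@{domain.replace(',', '.')}"


def invalid_message(address):
    """Build the error text for an address already judged invalid."""
    message = f"The specified e-mail address '{address}' is invalid."
    suggestion = suggest_correction(address)
    if suggestion is not None:
        message += f" Did you mean '{suggestion}'?"
    return message


if __name__ == "__main__":
    print(invalid_message("myemail@address,com"))
```

The suggestion should still be validated and confirmed by the user before it is accepted.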





See also Validating Email Addresses, including the comments, or Comparing
E-mail Address Validating Regular Expressions.



Regular expression visualization: https://i.stack.imgur.com/SrUwP.png

Debuggex Demo: https://www.debuggex.com/r/aH_x42NflV8G-GS7

